Managed ML Platforms: SageMaker (AWS) & Vertex AI (GCP)
Multi-paradigm portfolio: custom FastAPI on Kubernetes is the primary serving architecture, and managed endpoints (SageMaker and Vertex AI) complement it to demonstrate managed-platform skills. See ADR-017 for the full comparison.
Architecture: Three Paradigms, One Portfolio
┌───────────────────────────────────────────────────────────────────────────────────┐
│ ML-MLOps-Portfolio │
│ │
│ ┌──────────────────────┐ ┌───────────────────────┐ ┌────────────────────────┐ │
│ │ 1. Custom FastAPI │ │ 2. AWS SageMaker │ │ 3. GCP Vertex AI │ │
│ │ on K8s (GKE + EKS) │ │ Managed Endpoint │ │ Managed Endpoint │ │
│ │ ── PRIMARY ── │ │ ── COMPLEMENT ── │ │ ── COMPLEMENT ── │ │
│ │ │ │ │ │ │ │
│ │ ✅ SHAP middleware │ │ ✅ Built-in scaling │ │ ✅ Built-in scaling │ │
│ │ ✅ Prometheus │ │ ✅ CloudWatch │ │ ✅ Cloud Monitoring │ │
│ │ ✅ Multi-cloud │ │ ✅ Model Monitor │ │ ✅ Model Monitoring │ │
│ │ ✅ Full control │ │ ✅ 5-min deploy │ │ ✅ 5-min deploy │ │
│ │ ❌ More maintenance │ │ ❌ AWS-only │ │ ❌ GCP-only │ │
│ └──────────────────────┘ └───────────────────────┘ └────────────────────────┘ │
│ │
│ "I can build custom infrastructure AND use managed platforms — │
│ the right choice depends on team context, not personal preference." │
└───────────────────────────────────────────────────────────────────────────────────┘
Part 1: AWS SageMaker Endpoint
Prerequisites (AWS)
# AWS CLI configured
export AWS_PROFILE=ml-portfolio
aws sts get-caller-identity # Verify access
# Python packages
pip install sagemaker boto3
# SageMaker execution role (one-time setup)
bash scripts/sagemaker/setup-role.sh
Quick Start (SageMaker)
# 1. Deploy endpoint (~5 minutes)
python scripts/sagemaker/deploy_endpoint.py
# 2. Test endpoint
python scripts/sagemaker/deploy_endpoint.py test
# 3. Check status
python scripts/sagemaker/deploy_endpoint.py status
# 4. DELETE when done (stops charges!)
python scripts/sagemaker/deploy_endpoint.py delete
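The internals of the `test` subcommand are not shown in this guide. As a rough sketch of what invoking a SageMaker endpoint involves, the snippet below uses the `invoke_endpoint` call of the boto3 `sagemaker-runtime` client. The helper name `invoke_churn_endpoint` and the JSON payload shape are assumptions for illustration; the payload has to match whatever `input_fn` in `inference.py` expects.

```python
import json


def invoke_churn_endpoint(runtime_client, endpoint_name, record):
    """Send one JSON record to a SageMaker endpoint and return the parsed reply.

    runtime_client is expected to behave like boto3.client("sagemaker-runtime").
    """
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps(record),
    )
    # boto3 returns the payload as a stream-like object under "Body".
    return json.loads(response["Body"].read())
```

In practice the client would come from `boto3.client("sagemaker-runtime")` and the endpoint name would be `bankchurn-endpoint`. Passing the client in as an argument keeps the helper easy to exercise with a stub when no endpoint is running.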
How SageMaker Works
Model Packaging
SageMaker requires a model.tar.gz containing:
model.tar.gz
├── model.joblib ← trained BankChurn StackingClassifier
└── inference.py ← SageMaker inference handler
The inference.py implements 4 functions that SageMaker calls automatically:
| Function | Purpose | Called When |
|---|---|---|
| `model_fn(model_dir)` | Load model from disk | Container startup |
| `input_fn(body, content_type)` | Deserialize JSON → DataFrame | Each request |
| `predict_fn(data, model)` | Run `predict_proba()` | Each request |
| `output_fn(prediction, accept)` | Serialize dict → JSON | Each request |
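To make the four-function contract concrete, here is a minimal sketch of a handler with that shape. It is not the portfolio's actual `inference.py`: the response key `churn_probability`, the acceptance of either a single record or a list, and the lazy imports (used so the module loads without `joblib` or `pandas` installed) are all assumptions of this example.

```python
import json
import os


def model_fn(model_dir):
    """Load the model once at container startup."""
    import joblib  # assumed to be available in the serving container

    return joblib.load(os.path.join(model_dir, "model.joblib"))


def input_fn(body, content_type):
    """Deserialize a JSON request body into a DataFrame."""
    if content_type != "application/json":
        raise ValueError(f"Unsupported content type: {content_type}")
    import pandas as pd

    payload = json.loads(body)
    # Accept a single record (dict) or a batch (list of dicts).
    records = payload if isinstance(payload, list) else [payload]
    return pd.DataFrame(records)


def predict_fn(data, model):
    """Return the positive-class probability for each row."""
    return [float(row[1]) for row in model.predict_proba(data)]


def output_fn(prediction, accept):
    """Serialize predictions to JSON (the only format this sketch produces)."""
    return json.dumps({"churn_probability": [round(p, 4) for p in prediction]})
```

Each function can be called directly in a unit test with a stub model, which is a cheap way to catch handler errors before they surface as a `ModelError` on a live endpoint.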
SageMaker Deployment Flow
model.joblib + inference.py
│
▼
model.tar.gz ──▶ S3 bucket
│
▼
SageMaker Model (references S3 + sklearn container)
│
▼
Endpoint Config (instance type, count)
│
▼
Endpoint (InService) ──▶ HTTPS endpoint ready for invocations
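The first step of this flow, building `model.tar.gz`, can be sketched with the standard library alone. The function name `package_model` is hypothetical, and the example assumes both files should sit at the archive root as in the packaging layout described earlier.

```python
import tarfile
from pathlib import Path


def package_model(model_path, handler_path, output_path="model.tar.gz"):
    """Bundle the model and the inference handler at the archive root."""
    with tarfile.open(output_path, "w:gz") as archive:
        for path in (Path(model_path), Path(handler_path)):
            # arcname drops parent directories so files sit at the top level.
            archive.add(path, arcname=path.name)
    return Path(output_path)
```

The resulting archive would then be uploaded to S3, for example with the boto3 S3 client's `upload_file`, before the SageMaker Model is created from it.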
SageMaker Auto-Scaling
# Example: Configure SageMaker auto-scaling (not in portfolio scripts; for reference)
import boto3

client = boto3.client("application-autoscaling")

# Register scalable target.
# The variant name in ResourceId ("primary" here) must match the endpoint's production variant.
client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId="endpoint/bankchurn-endpoint/variant/primary",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=5,
)

# Target tracking policy: scale on invocations per instance
client.put_scaling_policy(
    PolicyName="bankchurn-scaling",
    ServiceNamespace="sagemaker",
    ResourceId="endpoint/bankchurn-endpoint/variant/primary",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,  # average invocations per instance per minute
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,   # seconds
        "ScaleOutCooldown": 60,   # seconds
    },
)
Compare to the custom K8s HPA approach:
# k8s/bankchurn-hpa.yaml (portfolio's custom approach)
# Excerpt: metadata and scaleTargetRef are omitted here.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  minReplicas: 1
  maxReplicas: 3
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
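Key difference: SageMaker scales on invocations per instance (a request-load metric), while this K8s HPA scales on CPU utilization (an infrastructure metric). Scaling on request volume tracks inference traffic directly, which makes the SageMaker approach more ML-native. K8s gives more infrastructure control: the HPA can also target custom or external metrics, at the cost of running a metrics adapter.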
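The two scaling rules can be compared with a little arithmetic. The HPA function below follows the documented Kubernetes formula, ceil(currentReplicas × currentMetric / targetMetric). The target-tracking function is only a simplified approximation assumed for this example: the real Application Auto Scaling service works through CloudWatch alarms and cooldowns, and the HPA additionally applies a tolerance band that is ignored here.

```python
import math


def clamp(value, low, high):
    """Keep value within [low, high]."""
    return max(low, min(high, value))


def hpa_desired_replicas(current_replicas, current_cpu_pct, target_cpu_pct=70,
                         min_replicas=1, max_replicas=3):
    """Kubernetes HPA rule: ceil(currentReplicas * currentMetric / targetMetric)."""
    desired = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    return clamp(desired, min_replicas, max_replicas)


def target_tracking_instances(total_invocations_per_min, target_per_instance=100.0,
                              min_capacity=1, max_capacity=5):
    """Simplified target tracking: enough instances to hold load per instance at target."""
    desired = math.ceil(total_invocations_per_min / target_per_instance)
    return clamp(desired, min_capacity, max_capacity)
```

With the defaults taken from the two configurations, 2 pods at 105% average CPU scale to 3 pods, and 350 invocations per minute call for 4 SageMaker instances.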
SageMaker Model Monitor
# Example: Set up data quality monitoring (reference; not in portfolio scripts)
# role_arn and endpoint_name are placeholders for the execution role ARN and endpoint name.
# Model Monitor requires data capture to be enabled on the endpoint.
from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

monitor = DefaultModelMonitor(
    role=role_arn,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size_in_gb=20,
)

# Create baseline from training data
monitor.suggest_baseline(
    baseline_dataset="s3://bucket/training-data.csv",
    dataset_format=DatasetFormat.csv(header=True),
)

# Schedule hourly monitoring against the baseline
monitor.create_monitoring_schedule(
    monitor_schedule_name="bankchurn-monitor",
    endpoint_input=endpoint_name,
    output_s3_uri="s3://bucket/monitoring-output",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression="cron(0 * ? * * *)",
)
Compare to the portfolio's custom drift detection:
# k8s/drift-detection-cronjob.yaml (portfolio's custom approach)
# Runs daily at 06:00 UTC — KS test + PSI + Evidently report
kubectl get cronjobs -n ml-portfolio
Key difference: SageMaker Model Monitor gives you out-of-the-box data quality checks with CloudWatch alerts. Custom drift detection (KS + PSI + Evidently) gives full control over statistical tests and thresholds.
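To show what "full control over statistical tests" can look like, here is a minimal pure-Python sketch of the Population Stability Index (PSI) for one numeric feature. It is not the portfolio's drift job: the equal-width binning over the baseline range, the bin count, and the `eps` floor for empty bins are assumptions of this example.

```python
import math


def population_stability_index(expected, actual, bins=10, eps=1e-6):
    """PSI between a baseline sample and a serving sample of one numeric feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant features

    def bucket_shares(values):
        counts = [0] * bins
        for v in values:
            # Values outside the baseline range fall into the first or last bin.
            idx = int((v - lo) / width)
            counts[max(0, min(bins - 1, idx))] += 1
        # Floor empty bins at eps so the logarithm below is defined.
        return [max(c / len(values), eps) for c in counts]

    e_shares, a_shares = bucket_shares(expected), bucket_shares(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(e_shares, a_shares))
```

Identical samples give a PSI of 0. A commonly cited rule of thumb treats values above roughly 0.2 as a significant shift, but the alerting threshold is exactly the kind of choice a custom job leaves open.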
SageMaker Infrastructure Created
| Resource | Name | Cost |
|---|---|---|
| SageMaker Model | `bankchurn-model` | Free |
| Endpoint Config | `bankchurn-endpoint-config` | Free |
| Endpoint | `bankchurn-endpoint` | ~$0.065/hr (~$47/mo if 24/7) |
| S3 artifact | `s3://ml-portfolio-ml-models-production/sagemaker/bankchurn/model.tar.gz` | ~$0.001/mo |
⚠️ Cost control: Only keep the endpoint alive during demos. Deploy → test → delete. Recreating takes ~5 minutes.
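The cost figures follow from simple arithmetic, sketched here under the assumption of a 720-hour (30-day) month and a single instance. The function name and the demo schedule of four one-hour sessions are illustrative only.

```python
def endpoint_cost(hourly_rate_usd, hours):
    """Cost in USD of keeping one endpoint instance running for a number of hours."""
    return round(hourly_rate_usd * hours, 2)


HOURS_PER_MONTH = 24 * 30  # 720, the month length assumed in this sketch

always_on = endpoint_cost(0.065, HOURS_PER_MONTH)  # endpoint left running 24/7
demo_only = endpoint_cost(0.065, 4 * 1)            # four one-hour demos in a month
```

At the quoted rate, `always_on` comes to 46.8 (the ~$47/mo in the table) and `demo_only` to 0.26, which is the case for deleting the endpoint between demos.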
SageMaker Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| `NoSuchEntityException` on role | IAM role not created | Run `bash scripts/sagemaker/setup-role.sh` |
| Endpoint stuck in `Creating` | Large model or cold region | Wait up to 10 minutes; check CloudWatch logs |
| `ModelError` on invoke | `inference.py` error | Check `aws logs` for `/aws/sagemaker/Endpoints/bankchurn-endpoint` |
| `AccessDeniedException` | Role missing S3 or SageMaker perms | Re-run `setup-role.sh` |
| High latency on first call | Cold start | Second call will be ~150ms |
| `ValidationException` | Wrong sklearn framework version | Verify `1.2-1` matches your model's sklearn version |
Part 2: GCP Vertex AI Endpoint
Prerequisites (GCP)
# GCP CLI configured
export GCP_PROJECT=ml-portfolio-duque-om-202602
export GCP_REGION=us-central1
gcloud auth application-default login
# Python packages
pip install google-cloud-aiplatform google-cloud-storage
# Vertex AI service account (one-time setup)
bash scripts/vertex_ai/setup-service-account.sh
# Enable Vertex AI API (if not already)
gcloud services enable aiplatform.googleapis.com --project=$GCP_PROJECT
Quick Start (Vertex AI)
# 1. Deploy endpoint (~5-10 minutes)
python scripts/vertex_ai/deploy_endpoint.py
# 2. Test endpoint
python scripts/vertex_ai/deploy_endpoint.py test
# 3. Check status
python scripts/vertex_ai/deploy_endpoint.py status
# 4. DELETE when done (stops charges!)
python scripts/vertex_ai/deploy_endpoint.py delete
How Vertex AI Works
Model Upload
Vertex AI does not require a model.tar.gz archive. It reads model.joblib directly from a GCS directory (gs://bucket/vertex-ai/bankchurn/ in the deployment flow).
For custom preprocessing, Vertex AI uses a Predictor class (vs SageMaker's 4 functions):
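# scripts/vertex_ai/predictor.py: Vertex AI Custom Prediction Routine
import os
from typing import Dict, List

import joblib
import pandas as pd
from google.cloud import storage


class BankChurnPredictor:
    def __init__(self):
        """Load model (called once at container startup)."""
        # AIP_STORAGE_URI is a gs:// URI, which joblib cannot open directly,
        # so the artifact is downloaded to local disk first.
        model_uri = os.environ["AIP_STORAGE_URI"].rstrip("/") + "/model.joblib"
        blob = storage.Blob.from_string(model_uri, client=storage.Client())
        blob.download_to_filename("model.joblib")
        self._model = joblib.load("model.joblib")

    def predict(self, instances: List[Dict]) -> List[Dict]:
        """Run prediction (called for each request)."""
        df = pd.DataFrame(instances)
        proba = self._model.predict_proba(df)
        return [{"churn_probability": round(float(p[1]), 4)} for p in proba]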
Compare the inference contracts:
| Aspect | SageMaker (`inference.py`) | Vertex AI (`predictor.py`) |
|---|---|---|
| Pattern | 4 standalone functions | 1 class with methods |
| Model loading | `model_fn(model_dir)` | `__init__(self)` |
| Input parsing | `input_fn(body, content_type)` | SDK handles JSON automatically |
| Prediction | `predict_fn(data, model)` | `predict(self, instances)` |
| Output | `output_fn(prediction, accept)` | `postprocess(self, prediction)` (optional) |
| Batch support | Must handle in `predict_fn` | `instances` is always a list |
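One way to keep the two contracts from drifting apart is to put the scoring logic in a shared function and wrap it once per platform. The sketch below illustrates that idea only. `score_records` and `ChurnPredictorSketch` are hypothetical names, and the model and frame converter are injected so the example runs without a trained model or pandas.

```python
def score_records(model, frame):
    """Shared logic: positive-class probability per row, rounded to 4 places."""
    return [{"churn_probability": round(float(p[1]), 4)}
            for p in model.predict_proba(frame)]


# SageMaker-style contract: a standalone function that receives the loaded model.
def predict_fn(data, model):
    return score_records(model, data)


# Vertex-AI-style contract: a method on a predictor class.
class ChurnPredictorSketch:
    def __init__(self, model, to_frame=lambda rows: rows):
        # model and to_frame are injected to keep the sketch self-contained;
        # a real predictor would load the model at startup and pass pd.DataFrame.
        self._model = model
        self._to_frame = to_frame

    def predict(self, instances):
        return score_records(self._model, self._to_frame(instances))
```

Both wrappers return the same structure for the same input, so a single set of unit tests on `score_records` covers the behavior that matters on either platform.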
Vertex AI Deployment Flow
model.joblib
│
▼
GCS bucket (gs://bucket/vertex-ai/bankchurn/)
│
▼
Vertex AI Model (references GCS + sklearn container)
│
▼
Vertex AI Endpoint (display name, region)
│
▼
Deploy Model to Endpoint (machine type, replicas)
│
▼
Endpoint (ready) ──▶ HTTPS endpoint for predictions
Vertex AI Auto-Scaling
# Vertex AI auto-scaling is configured during deploy
model.deploy(
    endpoint=endpoint,
    machine_type="n1-standard-2",
    min_replica_count=1,  # Minimum instances
    max_replica_count=5,  # Maximum instances
    traffic_percentage=100,
    # Auto-scaling is automatic based on CPU utilization
    # No separate scaling policy needed (unlike SageMaker)
)
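Key difference: Vertex AI auto-scaling is simpler because it is built into the deploy() call, whereas SageMaker requires a separate Application Auto Scaling policy. Vertex AI scales on CPU utilization automatically (and on GPU utilization for accelerator-backed deployments). The SDK defaults to a 60% CPU target, which deploy() lets you override with autoscaling_target_cpu_utilization.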
Vertex AI Model Monitoring
# Example: Set up model monitoring on Vertex AI (reference)
# endpoint and deployed_model_id are placeholders for the deployed endpoint and model ID.
# Keyword names follow the SDK's model_monitoring helpers; verify against your installed version.
from google.cloud import aiplatform
from google.cloud.aiplatform import model_monitoring

# Skew detection (training vs serving data)
skew_config = model_monitoring.SkewDetectionConfig(
    data_source="gs://bucket/training-data.csv",
    data_format="csv",
    target_field="Exited",
    skew_thresholds={"Age": 0.3, "Balance": 0.3},
)
objective_config = model_monitoring.ObjectiveConfig(skew_detection_config=skew_config)

# Create monitoring job
job = aiplatform.ModelDeploymentMonitoringJob.create(
    display_name="bankchurn-monitoring",
    endpoint=endpoint,
    logging_sampling_strategy=model_monitoring.RandomSampleConfig(sample_rate=0.8),
    bigquery_tables_log_ttl=90,  # days
    objective_configs={deployed_model_id: objective_config},
    schedule_config=model_monitoring.ScheduleConfig(monitor_interval=1),  # Hourly
)
Vertex AI Infrastructure Created
| Resource | Name | Cost |
|---|---|---|
| Vertex AI Model | `bankchurn-model` | Free |
| Vertex AI Endpoint | `bankchurn-endpoint` | ~$0.095/hr (~$68/mo if 24/7) |
| GCS artifact | `gs://...ml-models-production/vertex-ai/bankchurn/model.joblib` | ~$0.001/mo |
⚠️ Cost control: Same as SageMaker — deploy only during demos. Delete immediately after.
Vertex AI Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| `PermissionDenied` | Missing Vertex AI roles | Run `bash scripts/vertex_ai/setup-service-account.sh` |
| `RESOURCE_EXHAUSTED` | Region quota exceeded | Try `us-central1` or request quota increase |
| Model upload fails | GCS permissions | Verify `storage.objectViewer` role on service account |
| Endpoint stuck deploying | Container pull or health check failure | Check Vertex AI console logs |
| `InvalidArgument` on predict | Wrong input format | Vertex AI expects `{"instances": [...]}` |
| Slow first request | Container cold start | ~5-10s on first request, then ~150ms |
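The `InvalidArgument` row is the easiest one to hit, so a small sketch of the request envelope may help. The two helper names are hypothetical. The `instances` key on the request and the `predictions` key on the response are the standard Vertex AI online prediction JSON fields.

```python
import json


def build_predict_request(records):
    """Wrap one record or a list of records in the {"instances": [...]} envelope."""
    instances = records if isinstance(records, list) else [records]
    return json.dumps({"instances": instances})


def parse_predict_response(body):
    """Extract the predictions list from a Vertex AI JSON response body."""
    return json.loads(body)["predictions"]
```

When calling through the Python SDK, `endpoint.predict(instances=[...])` builds this envelope for you. The helpers matter mainly for raw HTTP clients and tests.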
Part 3: Full Comparison (Custom vs SageMaker vs Vertex AI)
Latency Comparison
| Metric | Custom FastAPI (K8s) | SageMaker (AWS) | Vertex AI (GCP) |
|---|---|---|---|
| p50 latency | ~103ms | ~150-200ms | ~150-200ms |
| p95 latency | ~150ms | ~300ms | ~300ms |
| Cold start | ~3s (pod) | ~5s (container) | ~5-10s (container) |
| Deploy time | ~15 min (CI/CD) | ~5 min | ~5-10 min |
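This guide does not say how the latency figures were collected. As one possible approach, the sketch below times repeated calls and derives p50 and p95 with the standard library. The function names are hypothetical, and `call` stands in for any zero-argument function that performs one prediction request.

```python
import statistics
import time


def measure_latencies_ms(call, n=100):
    """Time n invocations of a zero-argument callable, in milliseconds."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def latency_summary(samples_ms):
    """Return p50 and p95 from a list of latency samples."""
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94]}
```

For a fair comparison across the three paradigms, the same client, region, and payload should be used, and cold-start requests are best reported separately from the steady-state percentiles.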
Feature Matrix
| Feature | Custom FastAPI | SageMaker | Vertex AI |
|---|---|---|---|
| SHAP explainability (`?explain=true`) | ✅ | ❌ | ❌ |
| Custom Prometheus metrics | ✅ | ❌ | ❌ |
| Multi-cloud portable | ✅ (GKE + EKS) | ❌ (AWS) | ❌ (GCP) |
| Auto-scaling | HPA (CPU-based) | Built-in (invocations) | Built-in (CPU) |
| A/B testing | Argo Rollouts | Built-in variants | Built-in traffic split |
| Model monitoring | Custom (Prometheus + Evidently) | Model Monitor | Model Monitoring |
| Data quality checks | Custom (Pandera + CronJob) | Baseline + scheduled | Skew detection |
| Canary deployments | Argo Rollouts manifests | Endpoint variants | Traffic split % |
| Cost (idle, portfolio) | Included in K8s cluster | ~$47/mo per endpoint | ~$68/mo per endpoint |
| Inference script pattern | FastAPI route + Pydantic | 4 functions (model/input/predict/output) | Predictor class |
| SDK lock-in | None (HTTP + OpenAPI) | `sagemaker` SDK | `google-cloud-aiplatform` SDK |
When to Use Each (Decision Guide)
| Scenario | Best Choice | Why |
|---|---|---|
| Small team, no K8s | SageMaker / Vertex AI | Zero infra management |
| Existing K8s platform | Custom FastAPI | Leverage existing infra |
| Need SHAP/custom middleware | Custom FastAPI | Full control over request pipeline |
| Rapid prototyping | SageMaker / Vertex AI | 5-min deploys |
| Regulated industry | Custom FastAPI | SHAP middleware + audit trail |
| Multi-cloud required | Custom + K8s | Kustomize overlays portable |
| AWS-only shop | SageMaker | Native integration + Model Monitor |
| GCP-only shop | Vertex AI | Native integration + BigQuery ML |
| Portfolio / interviews | All three | Demonstrates versatility |
| Cost-sensitive startup | Custom on K8s | No per-endpoint charges |
| Larger team with ML ownership | SageMaker / Vertex AI | Focus on models, not infra |
Files
SageMaker (AWS)
| File | Purpose |
|---|---|
| `scripts/sagemaker/inference.py` | SageMaker inference handler (4 functions) |
| `scripts/sagemaker/deploy_endpoint.py` | Deploy, test, delete, status, package |
| `scripts/sagemaker/setup-role.sh` | Create IAM execution role |
Vertex AI (GCP)
| File | Purpose |
|---|---|
| `scripts/vertex_ai/predictor.py` | Vertex AI Custom Prediction Routine (Predictor class) |
| `scripts/vertex_ai/deploy_endpoint.py` | Deploy, test, delete, upload, status |
| `scripts/vertex_ai/setup-service-account.sh` | Create GCP service account + IAM roles |
Documentation
| File | Purpose |
|---|---|
| `docs/decisions/017-custom-vs-managed-ml-platforms.md` | ADR with full comparison |
| `docs/MANAGED_ML_GUIDE.md` | This guide |
| `docs/MULTI_CLOUD_COMPARISON.md` | Cross-cloud comparison including managed platforms |