
Multi-Cloud Deployment Comparison: GCP (GKE) vs AWS (EKS)

Production metrics captured March 13, 2026. Both clouds run identical Kubernetes manifests via Kustomize overlays, and both expose services through an nginx Ingress behind a real LoadBalancer (GCP: static IP, AWS: NLB).

Architecture Overview

| Component | GCP | AWS |
|---|---|---|
| Kubernetes | GKE: 1 node baseline, auto-scales to 5 (e2-medium) | EKS: 1 node baseline, auto-scales to 5 (t3.small) |
| Container Registry | Artifact Registry | ECR |
| Object Storage | GCS | S3 |
| IAM → Pods | Workload Identity | IRSA |
| Load Balancer | nginx-ingress + GCE LB (static IP) | nginx-ingress + NLB (AWS Load Balancer Controller) |
| Ingress Controller | nginx-ingress | nginx-ingress (portable) |
| Monitoring | Prometheus + Grafana | Prometheus + Grafana |
| ML Tracking | MLflow | MLflow |
| HPA | CPU-based (3 services) | CPU-based (3 services) |
| Drift Detection | CronJob (daily, completing) | CronJob (daily, completing) |
| Network Policies | Applied | Applied |
| PDB | Applied | Applied |

Both clouds: real LoadBalancer with nginx Ingress path routing. AWS NLB provisioned 2026-03-18 (IAM permission fix applied).
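A sketch of how the path routing stays identical across clouds; the path, service name, and rewrite rule below are illustrative assumptions, not copied from the repo's manifests:

```yaml
# Illustrative sketch only: path, service name, and rewrite rule are
# assumptions, not the repo's actual base Ingress manifest.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ml-portfolio
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /$2
spec:
  ingressClassName: nginx
  rules:
    - http:
        paths:
          - path: /bankchurn(/|$)(.*)
            pathType: ImplementationSpecific
            backend:
              service:
                name: bankchurn-predictor
                port:
                  number: 80
```

Because routing lives in the Ingress controller rather than in GCE URL maps or ALB listener rules, a manifest like this applies unchanged on both clusters; only the LoadBalancer Service fronting nginx differs per cloud.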

Workload Summary

| Pod | GCP Status | AWS Status |
|---|---|---|
| bankchurn-predictor | ✅ Running | ✅ Running |
| nlpinsight-analyzer | ✅ Running | ✅ Running |
| chicagotaxi-pipeline | ✅ Running | ✅ Running |
| prometheus | ✅ Running | ✅ Running |
| grafana | ✅ Running | ✅ Running |
| mlflow-server | ✅ Running | ✅ Running |

Performance Comparison

Smoke Test — Idle Latencies (Locust, 6 users, 30s — 2026-03-18)

Both tests run against real LoadBalancer IPs — GCP: 136.111.152.72, AWS: NLB DNS (k8s-ingressn-ingressn-6775b5d876-17e8cdb571a0f652.elb.us-east-1.amazonaws.com). Same locustfile, same parameters. Results are directly comparable.

| Service | GCP p50 | GCP p95 | AWS p50 | AWS p95 | Delta p50 |
|---|---|---|---|---|---|
| BankChurn /predict | 200ms | 410ms | 110ms | 140ms | -45% |
| NLPInsight /predict | 78ms | 140ms | 100ms | 120ms | +28% |
| ChicagoTaxi /demand | 100ms | 400ms | 120ms | 230ms | +20% |
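The Delta p50 column is the relative change of the AWS median against the GCP median; a quick sketch of that arithmetic, using the numbers from the table above:

```python
# Relative p50 change, AWS vs GCP, from the smoke-test table above.
def delta_p50(gcp_ms: int, aws_ms: int) -> int:
    """Percent change of the AWS median latency relative to GCP."""
    return round((aws_ms - gcp_ms) / gcp_ms * 100)

print(delta_p50(200, 110))  # BankChurn:   -45
print(delta_p50(78, 100))   # NLPInsight:  +28
print(delta_p50(100, 120))  # ChicagoTaxi: +20
```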

Load Test (Locust, 50 users, 2 min — 2026-03-18)

| Service | GCP p50 | GCP p95 | GCP Errors | AWS p50 | AWS p95 | AWS Errors |
|---|---|---|---|---|---|---|
| BankChurn /predict | 3100ms | 6500ms | 0% | 120ms | 200ms | 0% |
| NLPInsight /predict | 84ms | 570ms | 0% | 100ms | 180ms | 0% |
| ChicagoTaxi /demand | 110ms | 4900ms | 0% | 130ms | 210ms | 0% |

Key finding: AWS EKS significantly outperforms GCP GKE under concurrent load. BankChurn at 3100ms p50 (GCP) vs 120ms (AWS) under 50 users suggests the Intel-based, burstable t3 nodes handle the CPU-bound StackingClassifier better than GKE's shared-core e2-medium instances. NLPInsight and ChicagoTaxi are comparable on both clouds.

Stress Test (Locust, 100 users, 2 min — 2026-03-18)

| Service | GCP p50 | GCP Errors | AWS p50 | AWS Errors |
|---|---|---|---|---|
| BankChurn /predict | 8200ms | 0.02% | 130ms | 0% |
| NLPInsight /predict | 79ms | 0% | 100ms | 0% |
| ChicagoTaxi /demand | 100ms | 0% | 130ms | 0% |

Production readiness: near-zero failure rates on both clouds under 100 concurrent users (0% on AWS, 0.02% on GCP), after the async inference fix (ADR-015). Pre-fix, BankChurn had an 81% failure rate under the same load. See the full load test results.
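ADR-015 itself is not reproduced here; a common shape for this kind of fix, offloading the CPU-bound model call to a thread pool so the event loop keeps accepting requests, looks roughly like this (the handler and model stand-ins are hypothetical):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sketch of the async-inference pattern, not ADR-015 verbatim.
# CPU-bound predictions run in a thread pool so the event loop stays free
# to accept new connections instead of timing out under load.
_executor = ThreadPoolExecutor(max_workers=2)  # match the node's 2 vCPUs

def cpu_bound_predict(features: list[float]) -> float:
    # Stand-in for the ~100ms StackingClassifier call.
    return sum(features) / len(features)

async def handle_predict(features: list[float]) -> float:
    loop = asyncio.get_running_loop()
    # The event loop is not blocked while the model runs.
    return await loop.run_in_executor(_executor, cpu_bound_predict, features)

if __name__ == "__main__":
    print(asyncio.run(handle_predict([1.0, 2.0, 3.0])))  # 2.0
```

Without this offloading, each synchronous model call blocks the server's event loop, which is consistent with the pre-fix 81% failure rate under 100 users.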

Why BankChurn is Faster on AWS

Root cause: BankChurn's StackingClassifier is CPU-bound (~100ms pure CPU per prediction). AWS t3.medium has better single-thread performance than GCP e2-medium:

| Factor | GCP e2-medium | AWS t3.medium | Impact |
|---|---|---|---|
| CPU | AMD EPYC Rome 2.2 GHz, shared | Intel Xeon Platinum 2.5-3.1 GHz, burstable | Critical |
| vCPU allocation | 2 shared (multi-tenant) | 2 burstable (CPU credits) | High |
| Cost | $24/mo | $30/mo | Minimal |

Why NLPInsight/ChicagoTaxi don't improve: They're I/O-bound or use lightweight models (~5ms CPU) — not CPU-saturated.
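A back-of-the-envelope check, using the ~100ms-per-prediction and 2-vCPU figures already stated in this section, shows why a CPU-saturated node produces multi-second medians:

```python
# Rough capacity model for the CPU-bound BankChurn service.
# Figures come from this section; the arithmetic is only an estimate.
cpu_ms_per_prediction = 100   # ~100ms pure CPU per prediction
vcpus = 2                     # e2-medium / t3 node size

capacity_rps = vcpus * 1000 // cpu_ms_per_prediction
print(capacity_rps)           # 20 requests/second sustainable at best

# Little's law (W = L / lambda): with 50 users queued on a saturated node,
# expected time in system is users / capacity.
users = 50
expected_latency_ms = users * cpu_ms_per_prediction // vcpus
print(expected_latency_ms)    # 2500 ms, same order as the observed 3100ms p50
```

The estimate ignores autoscaling and network overhead, but it puts the observed GCP p50 of 3100ms in the expected range for a saturated 2-vCPU node, and explains why the faster per-core t3 CPUs avoid the queue entirely.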

Trade-off decision: Accepted the performance difference (ADR-016). Upgrading GCP to c2-standard-4 (4 vCPU @ 3.8 GHz) would cost $145/mo (6x increase) for marginal portfolio value. Both clouds meet SLAs (<500ms idle, 0% errors under load).

Resource Usage (AWS EKS, post stress test)

| Pod | CPU | Memory |
|---|---|---|
| bankchurn-predictor | 5m | 332Mi |
| chicagotaxi-pipeline | 500m | 149Mi |
| nlpinsight-analyzer | 5m | 311Mi |
| grafana | 2m | 86Mi |
| mlflow-server | 24m | 416Mi |
| prometheus | 2m | 33Mi |

Cloud-Specific Configurations

What Changes Between Clouds (Kustomize Overlays)

| File | Purpose |
|---|---|
| serviceaccount-aws.yaml | IRSA annotation (vs Workload Identity on GCP) |
| model-configmaps-aws.yaml | S3 bucket paths (vs GCS paths) |
| dataset-configmaps-aws.yaml | S3 dataset paths (vs GCS paths) |
| download-script-aws.yaml | boto3 S3 download (vs google-cloud-storage) |
| *-deployment-aws.yaml | ECR image refs (vs Artifact Registry) |
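A minimal sketch of how the AWS overlay could tie these files together, assuming a conventional Kustomize layout; the base path and patch wiring are illustrative, not the repo's actual file:

```yaml
# k8s/overlays/aws/kustomization.yaml (illustrative sketch, not the repo's file)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  - path: serviceaccount-aws.yaml      # IRSA annotation
  - path: model-configmaps-aws.yaml    # S3 model paths
  - path: dataset-configmaps-aws.yaml  # S3 dataset paths
  - path: download-script-aws.yaml     # boto3 download script
```

Everything not listed as a patch comes from the shared base, which is what keeps the two clouds in lockstep.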

What Stays Identical (Base Manifests)

  • Kubernetes Deployments (resource limits, health checks, env vars)
  • Services (ClusterIP, port mappings)
  • Ingress (nginx rewrite-target rules)
  • Prometheus configuration
  • Grafana dashboards
  • MLflow server
  • HPAs (CPU thresholds)
  • Network Policies
  • Pod Disruption Budgets
  • CronJobs (drift detection, retraining triggers)
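As one example of a manifest shared verbatim, a CPU-based HPA for one of the three services might look like the following; the replica bounds and threshold are assumptions, not the repo's actual values:

```yaml
# Illustrative base HPA shared by both overlays; bounds and the 70%
# threshold are assumptions, not the repo's actual values.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: bankchurn-predictor
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: bankchurn-predictor
  minReplicas: 1
  maxReplicas: 3
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```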

Key Portability Evidence

  1. Same Ingress Controller: nginx-ingress on both clouds — identical path routing rules
  2. Same Monitoring Stack: Prometheus + Grafana deployed from base manifests
  3. Same HPA Behavior: CPU-based autoscaling triggered correctly on both clouds
  4. Same Drift Detection: Daily CronJob checking health + prediction stability on both clouds
  5. Init Container Pattern: Same architecture, different storage SDK (boto3 vs google-cloud-storage)
  6. IRSA ↔ Workload Identity: Cloud-native pod identity, same ServiceAccount pattern
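The init-container pattern from point 5 can be sketched as a pod-spec fragment; the names, image, and script path are illustrative assumptions:

```yaml
# Illustrative pod-spec fragment; names, image, and script path are
# assumptions. The spec is identical on both clouds; only the download
# script changes (boto3 on AWS, google-cloud-storage on GCP).
initContainers:
  - name: fetch-model
    image: python:3.11-slim
    command: ["python", "/scripts/download_model.py"]
    volumeMounts:
      - name: model-store
        mountPath: /models
containers:
  - name: predictor
    image: <registry>/bankchurn-predictor:latest  # ECR or Artifact Registry
    volumeMounts:
      - name: model-store
        mountPath: /models
        readOnly: true
```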

See ADR-013: Multi-Cloud Parity Policy for the full parity-by-layer policy.

Visual Evidence

Screenshots: multi-cloud hero · side-by-side comparison · GKE workloads · EKS cluster · kubectl pods (EKS) · SHAP on EKS · ECR repositories · S3 buckets · Artifact Registry.

Infrastructure Details

GCP

  • Project: ml-portfolio-duque-om-202602
  • Region: us-central1
  • Cluster: ml-portfolio-gke-production
  • Nodes: 4 × e2-medium (2 vCPU, 4GB RAM) — autoscaler, min=1/max=5
  • Ingress IP: 136.111.152.72

AWS

  • Account: 531948420830
  • Region: us-east-1
  • Cluster: ml-portfolio-eks
  • Nodes: 3 × t3.small (2 vCPU, 2GB RAM)
  • External Access: Classic ELB — a6ed6b93fdbf14be2853d91bd2086d6b-1565798194.us-east-1.elb.amazonaws.com
  • OIDC Provider: oidc.eks.us-east-1.amazonaws.com/id/8BC2F3AD51513C1D272D463D49B28335
  • ECR: 531948420830.dkr.ecr.us-east-1.amazonaws.com/ml-portfolio/*
  • S3 Models: ml-portfolio-ml-models-production
  • S3 Datasets: ml-portfolio-datasets-production

Deployment Commands

```shell
# GCP
kubectl config use-context gke_ml-portfolio-duque-om-202602_us-central1_ml-portfolio-gke-production
kubectl apply -k k8s/overlays/gcp/

# AWS (context from: aws eks update-kubeconfig --region us-east-1 --name ml-portfolio-eks)
AWS_PROFILE=ml-portfolio kubectl apply -k k8s/overlays/aws/
```