# Multi-Cloud Deployment Comparison: GCP (GKE) vs AWS (EKS)

Production metrics captured March 13, 2026. Both clouds run identical Kubernetes manifests via Kustomize overlays, and both expose services through nginx Ingress backed by a real LoadBalancer (GCP: static IP, AWS: NLB).
## Architecture Overview

| Component | GCP | AWS |
|---|---|---|
| Kubernetes | GKE, 1 node baseline, auto-scales to 5 (e2-medium) | EKS, 1 node baseline, auto-scales to 5 (t3.small) |
| Container Registry | Artifact Registry | ECR |
| Object Storage | GCS | S3 |
| IAM → Pods | Workload Identity | IRSA |
| Load Balancer | nginx-ingress + GCE LB (static IP) | nginx-ingress + NLB (AWS Load Balancer Controller) |
| Ingress Controller | nginx-ingress | nginx-ingress (portable) |
| Monitoring | Prometheus + Grafana | Prometheus + Grafana |
| ML Tracking | MLflow | MLflow |
| HPA | CPU-based (3 services) | CPU-based (3 services) |
| Drift Detection | CronJob (daily, completing) | CronJob (daily, completing) |
| Network Policies | Applied | Applied |
| PDB | Applied | Applied |
Both clouds: real LoadBalancer with nginx Ingress path routing. AWS NLB provisioned 2026-03-18 (IAM permission fix applied).
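The shared path routing can be sketched as a single base Ingress consumed by both overlays. This is a minimal illustration, not the repo's actual manifest: the paths, service names, and ports are assumptions inferred from the pod names above.

```yaml
# Hypothetical base Ingress sketch (e.g. k8s/base/ingress.yaml).
# Path patterns, service names, and ports are illustrative assumptions.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ml-portfolio
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /$2
spec:
  ingressClassName: nginx
  rules:
    - http:
        paths:
          - path: /bankchurn(/|$)(.*)
            pathType: ImplementationSpecific
            backend:
              service:
                name: bankchurn-predictor
                port:
                  number: 80
          - path: /nlpinsight(/|$)(.*)
            pathType: ImplementationSpecific
            backend:
              service:
                name: nlpinsight-analyzer
                port:
                  number: 80
```

Because the controller is nginx-ingress on both clouds, this one manifest lives in the base layer; only the LoadBalancer fronting it (GCE LB vs NLB) differs per cloud.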
## Workload Summary

| Pod | GCP Status | AWS Status |
|---|---|---|
| bankchurn-predictor | ✅ Running | ✅ Running |
| nlpinsight-analyzer | ✅ Running | ✅ Running |
| chicagotaxi-pipeline | ✅ Running | ✅ Running |
| prometheus | ✅ Running | ✅ Running |
| grafana | ✅ Running | ✅ Running |
| mlflow-server | ✅ Running | ✅ Running |
## Smoke Test — Idle Latencies (Locust, 6 users, 30s — 2026-03-18)

Both tests were run against the real LoadBalancer endpoints — GCP: 136.111.152.72; AWS: NLB DNS (k8s-ingressn-ingressn-6775b5d876-17e8cdb571a0f652.elb.us-east-1.amazonaws.com). Same locustfile, same parameters, so the results are directly comparable.

| Service | GCP p50 | GCP p95 | AWS p50 | AWS p95 | Delta p50 |
|---|---|---|---|---|---|
| BankChurn /predict | 200ms | 410ms | 110ms | 140ms | -45% |
| NLPInsight /predict | 78ms | 140ms | 100ms | 120ms | +28% |
| ChicagoTaxi /demand | 100ms | 400ms | 120ms | 230ms | +20% |
## Load Test (Locust, 50 users, 2 min — 2026-03-18)

| Service | GCP p50 | GCP p95 | GCP Errors | AWS p50 | AWS p95 | AWS Errors |
|---|---|---|---|---|---|---|
| BankChurn /predict | 3100ms | 6500ms | 0% | 120ms | 200ms | 0% |
| NLPInsight /predict | 84ms | 570ms | 0% | 100ms | 180ms | 0% |
| ChicagoTaxi /demand | 110ms | 4900ms | 0% | 130ms | 210ms | 0% |
Key finding: AWS EKS significantly outperforms GCP GKE under concurrent load. BankChurn at 3100ms p50 (GCP) vs 120ms (AWS) under 50 users suggests that the higher-clocked, burstable Intel CPUs on the EKS t3 nodes handle the CPU-bound StackingClassifier far better than GKE's shared-core e2-medium instances (root cause analyzed below). NLPInsight and ChicagoTaxi are comparable on both clouds.
## Stress Test (Locust, 100 users, 2 min — 2026-03-18)

| Service | GCP p50 | GCP Errors | AWS p50 | AWS Errors |
|---|---|---|---|---|
| BankChurn /predict | 8200ms | 0.02% | 130ms | 0% |
| NLPInsight /predict | 79ms | 0% | 100ms | 0% |
| ChicagoTaxi /demand | 100ms | 0% | 130ms | 0% |
Production readiness: near-zero failure rate on both clouds under 100 concurrent users (worst case 0.02%, BankChurn on GCP) after the async inference fix (ADR-015). Pre-fix, BankChurn had an 81% failure rate under the same load. See the full load test results.
## Why BankChurn is Faster on AWS

Root cause: BankChurn's StackingClassifier is CPU-bound (~100ms pure CPU per prediction), and AWS t3.medium has better single-thread performance than GCP e2-medium:

| Factor | GCP e2-medium | AWS t3.medium | Impact |
|---|---|---|---|
| CPU | AMD EPYC Rome 2.2 GHz, shared | Intel Xeon Platinum 2.5-3.1 GHz, burstable | Critical |
| vCPU allocation | 2 shared (multi-tenant) | 2 burstable (better credits) | High |
| Cost | $24/mo | $30/mo | Minimal |
Why NLPInsight/ChicagoTaxi don't improve: They're I/O-bound or use lightweight models (~5ms CPU) — not CPU-saturated.
Trade-off decision: Accepted the performance difference (ADR-016). Upgrading GCP to c2-standard-4 (4 vCPU @ 3.8 GHz) would cost $145/mo (6x increase) for marginal portfolio value. Both clouds meet SLAs (<500ms idle, 0% errors under load).
## Resource Usage (AWS EKS, post stress test)

| Pod | CPU | Memory |
|---|---|---|
| bankchurn-predictor | 5m | 332Mi |
| chicagotaxi-pipeline | 500m | 149Mi |
| nlpinsight-analyzer | 5m | 311Mi |
| grafana | 2m | 86Mi |
| mlflow-server | 24m | 416Mi |
| prometheus | 2m | 33Mi |
## Cloud-Specific Configurations

### What Changes Between Clouds (Kustomize Overlays)

| File | Purpose |
|---|---|
| serviceaccount-aws.yaml | IRSA annotation (vs Workload Identity on GCP) |
| model-configmaps-aws.yaml | S3 bucket paths (vs GCS paths) |
| dataset-configmaps-aws.yaml | S3 dataset paths (vs GCS paths) |
| download-script-aws.yaml | boto3 S3 download (vs google-cloud-storage) |
| *-deployment-aws.yaml | ECR image refs (vs Artifact Registry) |
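An overlay kustomization tying these files together might look like the sketch below. This is an assumed layout, not the repo's actual file: the patch list mirrors the table above, and the image name under `images` is a hypothetical example built from the pod names and the real ECR registry.

```yaml
# Hypothetical k8s/overlays/aws/kustomization.yaml sketch.
# File names come from the table above; the image entry is an assumption.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  - path: serviceaccount-aws.yaml      # IRSA role annotation
  - path: model-configmaps-aws.yaml    # S3 model bucket paths
  - path: dataset-configmaps-aws.yaml  # S3 dataset paths
  - path: download-script-aws.yaml     # boto3 download script
images:
  - name: bankchurn-predictor
    newName: 531948420830.dkr.ecr.us-east-1.amazonaws.com/ml-portfolio/bankchurn-predictor
```

The GCP overlay would mirror this structure, swapping in Workload Identity annotations, GCS paths, and Artifact Registry image refs.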
### What Stays Identical (Base Manifests)
- Kubernetes Deployments (resource limits, health checks, env vars)
- Services (ClusterIP, port mappings)
- Ingress (nginx rewrite-target rules)
- Prometheus configuration
- Grafana dashboards
- MLflow server
- HPAs (CPU thresholds)
- Network Policies
- Pod Disruption Budgets
- CronJobs (drift detection, retraining triggers)
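As an example of a base manifest shared verbatim, the CPU-based HPA pattern can be sketched as follows. The min/max replicas match the cluster autoscaler bounds stated above; the 70% utilization target is an illustrative assumption, not the repo's actual threshold.

```yaml
# Hypothetical base HPA sketch; averageUtilization value is an assumption.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: bankchurn-predictor
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: bankchurn-predictor
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Because the HPA keys off relative CPU utilization rather than node type, the same manifest scales correctly on both e2 and t3 nodes.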
## Key Portability Evidence
- Same Ingress Controller: nginx-ingress on both clouds — identical path routing rules
- Same Monitoring Stack: Prometheus + Grafana deployed from base manifests
- Same HPA Behavior: CPU-based autoscaling triggered correctly on both clouds
- Same Drift Detection: Daily CronJob checking health + prediction stability on both clouds
- Init Container Pattern: Same architecture, different storage SDK (boto3 vs google-cloud-storage)
- IRSA ↔ Workload Identity: Cloud-native pod identity, same ServiceAccount pattern
See ADR-013: Multi-Cloud Parity Policy for the full parity-by-layer policy.
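The IRSA ↔ Workload Identity swap noted above reduces to a single annotation on the same ServiceAccount. The sketch below uses the real account ID and project from this document, but the ServiceAccount, IAM role, and GSA names are hypothetical examples.

```yaml
# AWS overlay (serviceaccount-aws.yaml style): role ARN is a hypothetical example.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ml-portfolio-sa
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::531948420830:role/ml-portfolio-pod-role
---
# GCP overlay equivalent: GSA name is a hypothetical example.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ml-portfolio-sa
  annotations:
    iam.gke.io/gcp-service-account: ml-portfolio-sa@ml-portfolio-duque-om-202602.iam.gserviceaccount.com
```

Pods reference the same ServiceAccount name on both clouds; only the overlay-supplied annotation changes which cloud IAM identity they assume.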
## Visual Evidence

Screenshots (images omitted from this extract): Multi-Cloud HERO · EKS Pods · SHAP on EKS · GKE Workloads · EKS Cluster · kubectl Pods (EKS) · ECR Repositories · S3 Buckets · Artifact Registry
## Infrastructure Details

### GCP

- Project: ml-portfolio-duque-om-202602
- Region: us-central1
- Cluster: ml-portfolio-gke-production
- Nodes: 4 × e2-medium (2 vCPU, 4GB RAM) — autoscaler, min=1/max=5
- Ingress IP: 136.111.152.72

### AWS

- Account: 531948420830
- Region: us-east-1
- Cluster: ml-portfolio-eks
- Nodes: 3 × t3.small (2 vCPU, 2GB RAM)
- External Access: Classic ELB — a6ed6b93fdbf14be2853d91bd2086d6b-1565798194.us-east-1.elb.amazonaws.com
- OIDC Provider: oidc.eks.us-east-1.amazonaws.com/id/8BC2F3AD51513C1D272D463D49B28335
- ECR: 531948420830.dkr.ecr.us-east-1.amazonaws.com/ml-portfolio/*
- S3 Models: ml-portfolio-ml-models-production
- S3 Datasets: ml-portfolio-datasets-production
## Deployment Commands

```shell
# GCP
kubectl config use-context gke_ml-portfolio-duque-om-202602_us-central1_ml-portfolio-gke-production
kubectl apply -k k8s/overlays/gcp/

# AWS
AWS_PROFILE=ml-portfolio kubectl apply -k k8s/overlays/aws/
```