ADR-013: Multi-Cloud Parity Policy (GKE vs EKS)

  • Status: Accepted
  • Date: 2026-03-13
  • Authors: Duque Ortega Mutis
  • Related: ADR-016 (performance trade-off)

TL;DR: Maintain functional parity (all ML services, monitoring, CI/CD, IaC) across GCP and AWS while accepting intentional infrastructure asymmetries (instance types, node counts, ingress controllers). Node counts are autoscaler-managed, not hard-coded — the right answer differs per cloud.

Decision

We accept intentional asymmetries between cloud environments while maintaining functional parity on all ML-critical workloads.

Functional Parity (must match)

| Component | GCP (GKE) | AWS (EKS) | Status |
| --- | --- | --- | --- |
| ML services (3 APIs) | ✅ Running | ✅ Running | Parity |
| SHAP explainability | ✅ Working | ✅ Working | Parity |
| Prometheus + Grafana | ✅ Running | ✅ Running | Parity |
| MLflow tracking | ✅ Running | ✅ Running | Parity |
| Drift detection CronJob | ✅ Completing | ✅ Completing | Parity (2026-03-13) |
| CI/CD (GitHub Actions) | ✅ deploy-gcp.yml | ✅ deploy-aws.yml | Parity |
| IaC (Terraform) | ✅ infra/terraform/gcp | ✅ infra/terraform/aws | Parity |

Accepted Asymmetries

| Dimension | GCP | AWS | Rationale |
| --- | --- | --- | --- |
| Node count | 4 | 3 | Different instance types (e2-medium vs t3.medium); autoscaler decides independently per cloud |
| Instance memory | 4 GB/node | 4 GB/node | Both use 4 GB instances; node count differs based on bin-packing |
| Ingress | GCE Ingress (L7 LB) | nginx-ingress (Classic ELB) | Both use a real LoadBalancer; cloud-specific LB type |
| Container registry | Artifact Registry | ECR (private) | Cloud-native equivalents |
| Object storage | GCS | S3 | Cloud-native equivalents |
| Drift detection | Jobs complete | Jobs complete | Parity achieved (2026-03-13) |

Node Count Philosophy

The autoscaler operates independently per cloud based on:

  • Available memory per node (GKE e2-medium 4 GB vs EKS t3.medium 4 GB)
  • Pod resource requests and bin-packing efficiency
  • Cloud-specific overhead (kube-system, CNI, etc.)

Enterprise pattern: Node counts should never be hard-coded equal across clouds. The autoscaler's job is to right-size per environment.
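To make the bin-packing point concrete, here is a deliberately simplified estimate in Python. All figures (per-node system overhead, total pod memory requests) are hypothetical placeholders, not measured values from either cluster; a real autoscaler also weighs CPU, pod-count limits, DaemonSets, and scheduling constraints.

```python
import math

def estimate_node_count(instance_mem_gb: float,
                        system_overhead_gb: float,
                        total_pod_requests_gb: float) -> int:
    """Rough node estimate: total requests divided by per-node
    allocatable memory (instance memory minus system overhead)."""
    allocatable = instance_mem_gb - system_overhead_gb
    return math.ceil(total_pod_requests_gb / allocatable)

# Hypothetical figures: both clouds use 4 GB nodes, but per-node
# overhead (kube-system, CNI) differs, so the same workload can
# bin-pack to different node counts per cloud.
gke_nodes = estimate_node_count(4.0, 1.5, 9.0)  # e2-medium-class nodes
eks_nodes = estimate_node_count(4.0, 1.0, 9.0)  # t3.medium-class nodes
print(gke_nodes, eks_nodes)  # → 4 3
```

Under these (invented) overhead numbers the model reproduces the 4-vs-3 split above, which is why pinning node counts equal across clouds would fight the autoscaler rather than help it.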

Consequences

  • Drift detection on AWS required a follow-up fix (resource constraints or endpoint accessibility) before parity was achieved on 2026-03-13
  • Node count differences are expected and not a parity violation
  • Ingress type difference is documented as a temporary AWS account limitation
  • All ML-critical paths (predict, explain, health, metrics) work identically on both clouds
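The "work identically" claim lends itself to an automated parity check. A minimal sketch, operating on hypothetical probe results rather than live cluster traffic; the path list mirrors the ML-critical paths named in this ADR, and the function name is illustrative, not an existing tool.

```python
ML_CRITICAL_PATHS = ["/predict", "/explain", "/health", "/metrics"]

def parity_violations(gcp_status: dict, aws_status: dict) -> list:
    """Return the ML-critical paths that are not healthy on BOTH
    clouds. An empty list means functional parity holds."""
    return [
        path for path in ML_CRITICAL_PATHS
        if not (gcp_status.get(path) and aws_status.get(path))
    ]

# Hypothetical probe results; in practice these would come from
# hitting each service's endpoints on both clusters.
gcp = {path: True for path in ML_CRITICAL_PATHS}
aws = {path: True for path in ML_CRITICAL_PATHS}
print(parity_violations(gcp, aws))  # → []
```

A check like this could run as a post-deploy step in both deploy-gcp.yml and deploy-aws.yml, failing the pipeline whenever the list is non-empty.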