# ADR-013: Multi-Cloud Parity Policy (GKE vs EKS)
- Status: Accepted
- Date: 2026-03-13
- Authors: Duque Ortega Mutis
- Related: ADR-016 (performance trade-off)
TL;DR: Maintain functional parity (all ML services, monitoring, CI/CD, IaC) across GCP and AWS while accepting intentional infrastructure asymmetries (instance types, node counts, ingress controllers). Node counts are autoscaler-managed, not hard-coded — the right answer differs per cloud.
## Decision
We accept intentional asymmetries between cloud environments while maintaining functional parity on all ML-critical workloads.
### Functional Parity (must match)
| Component | GCP (GKE) | AWS (EKS) | Status |
|---|---|---|---|
| ML services (3 APIs) | ✅ Running | ✅ Running | Parity |
| SHAP explainability | ✅ Working | ✅ Working | Parity |
| Prometheus + Grafana | ✅ Running | ✅ Running | Parity |
| MLflow tracking | ✅ Running | ✅ Running | Parity |
| Drift detection CronJob | ✅ Completing | ✅ Completing | Parity (2026-03-13) |
| CI/CD (GitHub Actions) | ✅ deploy-gcp.yml | ✅ deploy-aws.yml | Parity |
| IaC (Terraform) | ✅ infra/terraform/gcp | ✅ infra/terraform/aws | Parity |
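The CI/CD parity row is maintained by keeping the two workflows structurally mirrored. A hedged sketch of what the shared shape could look like (job names, secrets, region, and the `make` target are illustrative assumptions, not the actual workflow contents):

```yaml
# deploy-aws.yml -- assumed shape; deploy-gcp.yml mirrors it with
# gcloud/GKE equivalents so the two pipelines stay functionally identical.
name: deploy-aws
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_DEPLOY_ROLE }}  # assumed secret name
          aws-region: us-east-1                           # assumed region
      - name: Build, push to ECR, apply manifests
        run: make deploy-aws                              # hypothetical target
```

Keeping the step structure identical across both files makes drift between the pipelines easy to spot in review.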
### Accepted Asymmetries
| Dimension | GCP | AWS | Rationale |
|---|---|---|---|
| Node count | 4 | 3 | Different instance types (e2-medium vs t3.medium); autoscaler decides independently per cloud |
| Instance memory | 4 GB/node | 4 GB/node | Both use 4GB instances; node count differs based on bin-packing |
| Ingress | GCE Ingress (L7 LB) | nginx-ingress (Classic ELB) | Both use real LoadBalancer; cloud-specific LB type |
| Container registry | Artifact Registry | ECR (private) | Cloud-native equivalents |
| Object storage | GCS | S3 | Cloud-native equivalents |
| Drift detection | Jobs complete | Jobs complete | Parity achieved (2026-03-13) |
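The ingress asymmetry above amounts to two otherwise-identical manifests that differ only in ingress class. The manifests below are an illustrative sketch, not the deployed resources (service name, port, and paths are assumptions):

```yaml
# GKE: GCE Ingress provisions an L7 load balancer -- class "gce"
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ml-api                  # hypothetical name
spec:
  ingressClassName: gce
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ml-api    # hypothetical service
                port:
                  number: 80
---
# EKS: nginx-ingress fronted by a Classic ELB -- class "nginx";
# spec is otherwise identical to the GKE manifest above.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ml-api
spec:
  ingressClassName: nginx
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ml-api
                port:
                  number: 80
```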
### Node Count Philosophy
The autoscaler operates independently per cloud based on:

- Available memory per node (GKE e2-medium 4 GB vs EKS t3.medium 4 GB)
- Pod resource requests and bin-packing efficiency
- Cloud-specific overhead (kube-system, CNI, etc.)
Enterprise pattern: Node counts should never be hard-coded equal across clouds. The autoscaler's job is to right-size per environment.
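The bin-packing argument can be made concrete with a toy calculation. All numbers below (pod memory requests and per-node allocatable memory after system overhead) are illustrative assumptions, not measured cluster values; the point is only that identical workloads on same-memory instances can still pack to different node counts when allocatable memory differs:

```python
def nodes_needed(pod_requests_mib, allocatable_mib_per_node):
    """Naive first-fit-decreasing bin packing: count how many nodes of a
    given allocatable size are needed to place every pod memory request."""
    bins = []  # remaining free capacity on each node
    for req in sorted(pod_requests_mib, reverse=True):
        for i, free in enumerate(bins):
            if free >= req:
                bins[i] -= req
                break
        else:
            bins.append(allocatable_mib_per_node - req)  # open a new node
    return len(bins)

# Hypothetical pod set (MiB requests) deployed identically to both clouds.
pods = [1200, 1200, 1000, 800, 800, 600, 500, 400]

# Assumed allocatable memory after kube-system/CNI overhead (illustrative):
# the same 4 GB instance class can expose different usable memory per cloud.
gke_nodes = nodes_needed(pods, allocatable_mib_per_node=2000)
eks_nodes = nodes_needed(pods, allocatable_mib_per_node=2400)
print(gke_nodes, eks_nodes)  # 4 3
```

With these assumed numbers the same workload packs onto 4 nodes on one cloud and 3 on the other, which is exactly why equal node counts should never be a parity requirement.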
## Consequences
- Drift detection on AWS initially required a follow-up fix (resource constraints or endpoint accessibility); parity was achieved on 2026-03-13
- Node count differences are expected and not a parity violation
- Ingress type difference is documented as a temporary AWS account limitation
- All ML-critical paths (predict, explain, health, metrics) work identically on both clouds
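A minimal sketch of how the last consequence could be checked mechanically: given per-cloud maps of ML-critical paths to observed HTTP status codes (how you probe the clusters is out of scope here), report any path that does not behave identically. The function and variable names are assumptions for illustration, not an existing tool in the repo:

```python
# ML-critical paths named by this ADR.
ML_CRITICAL_PATHS = ["/predict", "/explain", "/health", "/metrics"]

def parity_violations(gcp: dict, aws: dict, paths=ML_CRITICAL_PATHS):
    """Return the paths whose observed status differs between clouds."""
    return [p for p in paths if gcp.get(p) != aws.get(p)]

# Example: identical observed responses on both clouds -> no violations.
gcp = {"/predict": 200, "/explain": 200, "/health": 200, "/metrics": 200}
aws = {"/predict": 200, "/explain": 200, "/health": 200, "/metrics": 200}
print(parity_violations(gcp, aws))  # []
```

A check like this belongs in CI so that functional parity stays enforced rather than merely documented.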