
Architectural Decision Records

Key technical decisions with rationale. Written for technical reviewers and hiring managers.

Last Updated: March 2026 | Portfolio Version: 3.5.3

Granular ADRs: See docs/decisions/ for detailed per-decision records (ADR-001 through ADR-013).


Summary

| # | Decision | Choice | Key Rationale | Revisit When |
|---|----------|--------|---------------|--------------|
| 001 | K8s Storage | emptyDir ephemeral | Models ~4MB, GCS download 2–5s, $0.00005/startup | Models >500MB |
| 002 | Init Container | python:3.11-alpine + pinned deps | 50MB image, no gcloud SDK bloat, pip at runtime | Pods recreate >1/hour |
| 003 | Model Versioning | ConfigMaps separate from Deployments | Model updates = config change, not infra change | Adopt MLflow Registry |
| 004 | Download Resilience | 3 retries, 10s backoff | Handles GCS transient 503s, clean CrashLoopBackOff | N/A |
| 005 | GKE Cluster | e2-medium × 3 nodes | ~$25/node, 4GB RAM fits all pods, 91% savings vs e2-standard-4 | Need dedicated vCPUs |
| 006 | Networking | Custom VPC, private subnets | Single-region (us-central1), VPC peering for Cloud SQL | Multi-region needed |
| 007 | Ingress | GCE-native Load Balancer | Managed, path-based routing, single IP, $18/mo | Need NGINX/Istio features |
| 008 | Docker Images | Multi-stage builds | 1.2GB → 400–500MB, no build tools in prod, non-root | N/A |
| 009 | Serialization | Joblib over Pickle | 60–80% compression, numpy-optimized, safer | N/A |
| 010 | Model Selection | Auto-selection pipeline | Compare RF/XGB/LGB, select best by primary metric | Add neural network |
| 011 | Experiment Tracking | Self-hosted MLflow on GKE | Full control, 9 runs tracked, Cloud SQL backend | SaaS MLflow alternative |
| 012 | Monitoring | Prometheus + Grafana on GKE | 15s scrape, 10-panel dashboard, auto-provisioned | Managed monitoring |
| 013 | CI/CD | GitHub Actions matrix | 3 projects × Python versions, security + docker + integration | N/A |
| 014 | Container Registry | GCP Artifact Registry | Regional, cleanup policies, integrated with GKE | Multi-cloud registry |
| 015 | Storage | GCS with lifecycle policies | Versioned, Nearline after 90d, public access prevention | N/A |
| 016 | IaC | Terraform with remote state | GCP + AWS modules, terraform plan = no drift | N/A |
| 017 | Security | Defense in depth | Trivy + Bandit + Gitleaks + non-root + Workload Identity | N/A |
| ADR-006 | Drift-Triggered Retraining | K8s CronJob → GitHub Actions dispatch | No new infra; reuses CI pipeline; auditable retrain history | >5 models or daily retraining frequency |
| ADR-007 | Feature Store | Deferred (not needed) | All features in request payload; skew prevented by serialized model.joblib | Time-window aggregation features required |
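The ADR-004 retry policy (3 attempts, 10s backoff) can be sketched as follows. This is a minimal illustration, not the repository's actual init-container code; `download_fn` stands in for a hypothetical GCS download call:

```python
import time


def download_with_retry(download_fn, retries=3, backoff_s=10):
    """Run download_fn, retrying on transient errors such as GCS 503s.

    If the final attempt fails, the exception propagates, the init
    container exits non-zero, and Kubernetes surfaces a clean
    CrashLoopBackOff instead of a silently broken pod.
    """
    for attempt in range(1, retries + 1):
        try:
            return download_fn()
        except Exception:
            if attempt == retries:
                raise  # let the init container fail visibly
            time.sleep(backoff_s)
```

Keeping the backoff fixed (rather than exponential) is reasonable here because the window is short and GCS 503s typically clear within seconds.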

HPA Design Decision

CPU-only autoscaling is used for all ML services. Memory-based HPA was removed because the models have a fixed RAM footprint: each model is loaded once at startup, so memory usage never decreases and a memory-based HPA would never scale down. CPU usage, by contrast, correlates with request traffic, making it the correct scaling signal.
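The standard HPA control loop computes desired replicas as ceil(currentReplicas × observed/target), clamped to the min/max bounds; a small sketch of that arithmetic (function and parameter names are illustrative, not part of the Kubernetes API):

```python
import math


def desired_replicas(current: int, cpu_util_pct: float, target_pct: float,
                     min_r: int = 1, max_r: int = 3) -> int:
    # HPA formula: ceil(current * observed / target), clamped to bounds.
    raw = math.ceil(current * cpu_util_pct / target_pct)
    return max(min_r, min(max_r, raw))
```

For example, at a 70% target with 2 pods averaging 105% CPU, ceil(2 × 105/70) = 3 pods. This also shows why memory is a poor signal: if the metric never drops below target, `raw` never falls and the HPA never scales down.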

| Service | CPU Target | Pods | Scale-down | Scale-up |
|---------|-----------|------|------------|----------|
| BankChurn | 70% | 1–3 | 300s / 50% | 60s / max(100%, +2) |
| NLPInsight | 75% | 1–3 | 300s / 50% | 60s / max(100%, +2) |
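The `max(100%, +2)` scale-up entry means that within one 60s window the HPA may add whichever is larger: 100% of the current pod count or 2 pods, capped at the configured maximum. A small helper illustrating that policy selection (names are assumptions for this sketch):

```python
import math


def scale_up_limit(current: int, max_replicas: int = 3) -> int:
    # HPA "Max" policy selection: allow the larger of +100% of
    # current pods or +2 pods per 60s window, capped at maxReplicas.
    by_percent = math.ceil(current * 1.00)  # 100% of current pods
    by_pods = 2
    return min(current + max(by_percent, by_pods), max_replicas)
```

With a single pod this allows a jump straight to 3 replicas (1 + max(1, 2) = 3, within the cap), so a cold service can absorb a traffic spike in one scaling step.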
