Architectural Decision Records

Key technical decisions with rationale. Written for technical reviewers and hiring managers.

Last Updated: March 2026 | Portfolio Version: 3.5.3

Granular ADRs: See docs/decisions/ for detailed per-decision records (ADR-001 through ADR-013).

Summary

| # | Decision | Choice | Key Rationale | Revisit When |
|---|----------|--------|---------------|--------------|
| 001 | K8s Storage | emptyDir ephemeral | Models ~4MB, GCS download 2-5s, $0.00005/startup | Models >500MB |
| 002 | Init Container | python:3.11-alpine + pinned deps | 50MB image, no gcloud SDK bloat, pip at runtime | Pods recreate >1/hour |
| 003 | Model Versioning | ConfigMaps separate from Deployments | Model updates = config change, not infra change | Adopt MLflow Registry |
| 004 | Download Resilience | 3 retries, 10s backoff | Handles GCS transient 503s, clean CrashLoopBackOff | N/A |
| 005 | GKE Cluster | e2-medium × 3 nodes | ~$25/node, 4GB RAM fits all pods, 91% savings vs e2-standard-4 | Need dedicated vCPUs |
| 006 | Networking | Custom VPC, private subnets | Single-region (us-central1), VPC peering for Cloud SQL | Multi-region needed |
| 007 | Ingress | GCE-native Load Balancer | Managed, path-based routing, single IP, $18/mo | Need NGINX/Istio features |
| 008 | Docker Images | Multi-stage builds | 1.2GB→400-500MB, no build tools in prod, non-root | N/A |
| 009 | Serialization | Joblib over Pickle | 60-80% compression, numpy-optimized, safer | N/A |
| 010 | Model Selection | Auto-selection pipeline | Compare RF/XGB/LGB, select best by primary metric | Add neural network |
| 011 | Experiment Tracking | Self-hosted MLflow on GKE | Full control, 9 runs tracked, Cloud SQL backend | SaaS MLflow alternative |
| 012 | Monitoring | Prometheus + Grafana on GKE | 15s scrape, 10-panel dashboard, auto-provisioned | Managed monitoring |
| 013 | CI/CD | GitHub Actions matrix | 3 projects × Python versions, security + docker + integration | N/A |
| 014 | Container Registry | GCP Artifact Registry | Regional, cleanup policies, integrated with GKE | Multi-cloud registry |
| 015 | Storage | GCS with lifecycle policies | Versioned, Nearline after 90d, public access prevention | N/A |
| 016 | IaC | Terraform with remote state | GCP + AWS modules, `terraform plan` = no drift | N/A |
| 017 | Security | Defense in depth | Trivy + Bandit + Gitleaks + non-root + Workload Identity | N/A |
| ADR-006 | Drift-Triggered Retraining | K8s CronJob → GitHub Actions dispatch | No new infra; reuses CI pipeline; auditable retrain history | >5 models or daily retraining frequency |
| ADR-007 | Feature Store | Deferred, not needed | All features in request payload; skew prevented by serialized `model.joblib` | Time-window aggregation features required |
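
The retry policy in decision 004 (3 retries, 10s backoff) can be sketched roughly as follows. This is an illustrative sketch, not the init container's actual script: `download_with_retries`, `TransientError`, and the `fetch` callable are hypothetical names, and reading "3 retries" as one initial attempt plus three retries with a fixed delay is an assumption.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for a retryable failure, e.g. a GCS 503."""


def download_with_retries(
    fetch: Callable[[], T],
    retries: int = 3,
    backoff_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(); on TransientError wait backoff_s and try again.

    Assumption: 1 initial attempt + `retries` retries, fixed delay between them.
    """
    if retries < 0:
        raise ValueError("retries must be >= 0")
    attempts = retries + 1
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except TransientError:
            if attempt == attempts:
                # Out of retries: re-raise so the process exits non-zero.
                raise
            sleep(backoff_s)
    raise AssertionError("unreachable")
```

If every attempt fails, the exception propagates and the process exits non-zero. In an init container that leaves the restart to Kubernetes, which would match the "clean CrashLoopBackOff" behaviour noted in the table.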

HPA Design Decision

CPU-only autoscaling for all ML services. Memory-based HPA was removed because an ML service has a fixed RAM footprint: the model is loaded at startup and stays resident. Memory utilization therefore never drops when traffic falls, so a memory-based HPA never scales down. CPU utilization tracks request traffic, which makes it the correct scaling signal.
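
A rough sketch of why the signal matters, using the replica formula from the Kubernetes HPA documentation (`desired = ceil(current × currentMetric / targetMetric)`, with a default 10% tolerance). The function name and the 85% memory figure are hypothetical, chosen only to illustrate the reasoning; the 1-3 pod bounds and the 70% target come from the table below.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_utilization: float,
    target_utilization: float,
    min_replicas: int = 1,
    max_replicas: int = 3,
    tolerance: float = 0.1,
) -> int:
    """Simplified HPA replica calculation (ignores readiness and missing metrics)."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to target: keep the current replica count.
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# CPU follows traffic, so the replica count follows traffic too.
busy = desired_replicas(1, current_utilization=140, target_utilization=70)
idle = desired_replicas(3, current_utilization=20, target_utilization=70)

# Hypothetical memory case: each pod sits at 85% of its request no matter
# how much traffic it serves, so the result stays pinned at max_replicas.
memory_pinned = desired_replicas(3, current_utilization=85, target_utilization=70)
```

Under these assumptions `busy` is 2, `idle` is 1, and `memory_pinned` is 3. The memory case shows that the computed count reflects the model's footprint rather than load.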

| Service | CPU Target | Pods | Scale-down | Scale-up |
|---------|------------|------|------------|----------|
| BankChurn | 70% | 1–3 | 300s / 50% | 60s / max(100%, +2) |
| NLPInsight | 75% | 1–3 | 300s / 50% | 60s / max(100%, +2) |
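
The step limits in the table can be sketched as below. This reads "max(100%, +2)" as "the larger of a 100% increase or 2 extra pods per period" and "50%" as "remove at most half the pods per period", which is an assumed interpretation. The function names are hypothetical, the rounding is a simplification, and the 300s and 60s windows are not modelled.

```python
import math


def scale_up_limit(current: int, percent: int = 100, pods: int = 2,
                   max_replicas: int = 3) -> int:
    """Highest replica count reachable in one scale-up period."""
    by_percent = math.ceil(current * (1 + percent / 100))
    by_pods = current + pods
    # Take the more permissive of the two policies, then cap at max_replicas.
    return min(max_replicas, max(by_percent, by_pods))


def scale_down_limit(current: int, percent: int = 50,
                     min_replicas: int = 1) -> int:
    """Lowest replica count reachable in one scale-down period."""
    removable = math.floor(current * percent / 100)
    return max(min_replicas, current - removable)


up_from_one = scale_up_limit(1)      # +2 pods beats +100% when starting at 1
down_from_three = scale_down_limit(3)
down_from_two = scale_down_limit(2)
```

With these assumptions a single pod can jump straight to 3 in one period, while going from 3 back to 1 takes two scale-down periods (3 to 2, then 2 to 1). Scale-up is fast and scale-down is gradual.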
