Architectural Decision Records (ADRs)

Structured documentation of every significant technical decision in the ML-MLOps Portfolio, including context, alternatives evaluated, trade-offs accepted, and conditions for revisiting.

Why ADRs?

ADRs capture the reasoning behind decisions, not just the outcome. Six months from now, when someone asks "why didn't you use Airflow?" or "why is memory excluded from HPA?", the answer is documented with data, alternatives, and trade-offs — not buried in a Slack thread.

Index

ADR	Decision	Category	Status
001	CPU-Only HPA for ML Inference Services	Infrastructure	Accepted
002	emptyDir + Init Container for Model Storage	Infrastructure	Accepted
003	StackingClassifier for BankChurn	ML Modeling	Accepted
004	OpenTelemetry with Graceful No-Op Fallback	Observability	Accepted
005	Compatible Release Pinning (~=) for Dependencies	DevOps	Accepted
006	Drift-Triggered Retraining Architecture	MLOps	Accepted (stub)
007	Feature Store — Deferred with Design Document	MLOps	Deferred
008	Canary Deployments with Argo Rollouts	Infrastructure	Accepted
009	Simplification — Knowing When Not to Build	Architecture	Accepted
010	SHAP KernelExplainer for StackingClassifier	ML Explainability	Accepted
011	Gradio Demo — Not Deployed to Production	Architecture	Accepted
012	Security Scanner Staging vs Production Policy	Security	Accepted
013	Multi-Cloud Parity Policy (GKE vs EKS)	Infrastructure	Accepted
014	Single-Worker Pod Pattern for ML Inference	Infrastructure	Accepted
015	Async Inference via ThreadPoolExecutor	Performance	Accepted
016	GCP vs AWS Performance — Cost vs Latency Trade-off	Infrastructure	Accepted
017	Custom FastAPI + K8s vs Managed ML Platforms (SageMaker/Vertex AI)	Architecture	Accepted

Decision Flow

Many ADRs form a decision chain where one decision creates the context for the next:

ADR-003 (StackingClassifier)
  ├── ADR-010 (SHAP KernelExplainer — TreeExplainer incompatible)
  ├── ADR-015 (Async inference — CPU-bound predict blocks event loop)
  │     └── ADR-016 (GCP vs AWS latency — CPU-bound = cloud-sensitive)
  └── ADR-009 (Simplification — justified keeping despite complexity)

ADR-001 (CPU-only HPA)
  └── ADR-014 (Single-worker pod — refined HPA thresholds)
        └── ADR-015 (Async inference — eliminated 2-worker exception)

ADR-006 (Drift-triggered retraining)
  └── ADR-008 (Canary deployments — safe model promotion after retraining)

ADR-002 (emptyDir model storage)
  └── ADR-006 (Retraining writes new model → ConfigMap update → rollout)

ADR-012 (Security scanner policy)
  └── ADR-013 (Multi-cloud parity — security posture matches per cloud)

ADR-017 (Custom vs Managed ML Platforms)
  ├── ADR-003 (StackingClassifier — SHAP middleware justifies custom serving)
  ├── ADR-013 (Multi-cloud parity — SageMaker is AWS-only, custom is portable)
  └── ADR-009 (Simplification — SageMaker complement, not replacement)
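The chains above amount to a directed graph. As a rough sketch, the edges can be written down as an adjacency map and sorted topologically, so that each ADR is listed after the decisions that set its context. The names `LEADS_TO` and `reading_order` are hypothetical and not part of the portfolio. The edges are taken exactly as drawn, parent to child.

```python
from graphlib import TopologicalSorter

# Edges transcribed from the decision flow above:
# parent ADR -> ADRs it creates context for (direction taken as drawn).
LEADS_TO = {
    "ADR-003": ["ADR-010", "ADR-015", "ADR-009"],
    "ADR-015": ["ADR-016"],
    "ADR-001": ["ADR-014"],
    "ADR-014": ["ADR-015"],
    "ADR-006": ["ADR-008"],
    "ADR-002": ["ADR-006"],
    "ADR-012": ["ADR-013"],
    "ADR-017": ["ADR-003", "ADR-013", "ADR-009"],
}


def reading_order(leads_to):
    """Return ADR ids so that every ADR follows the decisions that precede it."""
    sorter = TopologicalSorter()
    for parent, children in leads_to.items():
        sorter.add(parent)
        for child in children:
            # TopologicalSorter.add(node, *predecessors): child comes after parent.
            sorter.add(child, parent)
    # static_order() raises graphlib.CycleError if the chains become circular.
    return list(sorter.static_order())


if __name__ == "__main__":
    print(reading_order(LEADS_TO))
```

Because `static_order()` fails on cycles, the same helper doubles as a consistency check whenever a new ADR is added to the flow.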

Format

Every ADR follows a consistent structure:

TL;DR — One-sentence impact summary
Context — Problem statement with evidence/data
Decision — What was decided
Alternatives Considered — What else was evaluated and why rejected
Consequences — Positive and negative outcomes
Revisit When — Conditions that would change the decision
References — Related ADRs and external documentation
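The structure above maps naturally onto a small record type. The following is a minimal sketch of how a draft could be checked for empty sections before review. It assumes a hypothetical `ADR` dataclass and `missing_sections` helper, neither of which is taken from the portfolio.

```python
from dataclasses import dataclass, field, fields


@dataclass
class ADR:
    """One record with a field per section listed above (field names are illustrative)."""
    tldr: str = ""
    context: str = ""
    decision: str = ""
    alternatives_considered: list[str] = field(default_factory=list)
    consequences: list[str] = field(default_factory=list)
    revisit_when: list[str] = field(default_factory=list)
    references: list[str] = field(default_factory=list)


def missing_sections(adr: ADR) -> list[str]:
    """Names of sections that are still empty, in template order."""
    return [f.name for f in fields(adr) if not getattr(adr, f.name)]


if __name__ == "__main__":
    # Placeholder text only; not the content of any real ADR.
    draft = ADR(tldr="placeholder summary", decision="placeholder decision")
    print(missing_sections(draft))
```

A check like this could run in CI to flag ADRs that skip "Revisit When" or "Alternatives Considered", the two sections that are easiest to leave out.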

Reading Guide

Recruiters/Hiring Managers: Read the TL;DR of each ADR for a quick overview of technical depth. ADR-009 (Simplification), ADR-015 (Async Inference), and ADR-016 (Cost vs Latency) best show engineering judgment.
ML Engineers: Start with ADR-003 (model choice), ADR-010 (explainability), ADR-006 (drift retraining).
Platform/DevOps Engineers: Start with ADR-001 (HPA), ADR-014 (single-worker), ADR-002 (model storage), ADR-008 (canary).
Security Engineers: Start with ADR-012 (scanner policy) and ADR-005 (dependency pinning).
Hiring Managers (platform focus): Start with ADR-017 (custom vs managed platforms), which lays out the build-vs-buy trade-offs.