# ADR-001: CPU-Only HPA for ML Inference Services
- Status: Accepted
- Date: 2026-02-20
- Authors: Duque Ortega Mutis
- Updated: 2026-03-18 (thresholds refined by ADR-014)
TL;DR: Removed memory from HPA metrics because ML models hold a fixed RAM footprint that never decreases — making memory-based autoscaling mathematically unable to scale down. CPU-only HPA reduced idle cost by scaling BankChurn 3 → 1 pods in 8 minutes.
## Context
All three ML services (BankChurn, NLPInsight, ChicagoTaxi) load models into RAM at startup and hold them for the pod's lifetime. Memory usage is effectively constant regardless of traffic:
| Service | Measured Idle Memory | Under Load Memory | Delta |
|---|---|---|---|
| BankChurn (StackingClassifier + SHAP) | 332 Mi | ~335 Mi | < 1% |
| NLPInsight (TF-IDF + LogReg) | 311 Mi | ~314 Mi | < 1% |
| ChicagoTaxi (RandomForest + cache) | 149 Mi | ~152 Mi | < 2% |
## Why Memory-Based HPA Fails for ML Inference
The HPA formula is `desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization)`. With constant memory:

```
3 replicas × 332Mi each, limit 512Mi, target 80%
→ utilization = 332/512 = 64.8%
→ desired = ceil(3 × 64.8/80) = ceil(2.43) = 3   ← never decreases
```
In a multi-metric HPA, Kubernetes takes the maximum recommendation across metrics. Even when CPU says "scale to 1," memory says "keep 3" — so 3 pods run indefinitely.
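The interaction can be sketched numerically with the values from the tables above (the ~5% idle CPU figure is an illustrative assumption, not a measurement from this document):

```python
import math

def desired_replicas(current: int, utilization: float, target: float) -> int:
    """HPA formula: ceil(currentReplicas * currentUtilization / targetUtilization)."""
    return math.ceil(current * utilization / target)

# BankChurn idle at 3 replicas: memory sits at a constant 64.8% of the
# 512Mi limit (target 80%), while CPU has dropped to ~5% (target 50%).
mem_rec = desired_replicas(3, 64.8, 80)  # memory recommendation: 3
cpu_rec = desired_replicas(3, 5, 50)     # CPU recommendation: 1

# A multi-metric HPA acts on the maximum recommendation across metrics,
# so the flat memory signal pins the deployment at 3 replicas forever.
print(mem_rec, cpu_rec, max(mem_rec, cpu_rec))  # → 3 1 3
```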
Observed: After a load spike scaled BankChurn to 3 replicas, it never scaled down. Pods sat idle for hours.
## Decision
Remove memory from all HPAs. Scale exclusively on CPU utilization.
| Service | CPU Target | Min / Max Replicas | scaleUp Stabilization |
|---|---|---|---|
| BankChurn | 50% | 1 / 5 | 30s |
| NLPInsight | 60% | 1 / 3 | 30s |
| ChicagoTaxi | 60% | 1 / 3 | 30s |
Thresholds refined from 70-75% to 50-60% by ADR-014 for faster scale-out under the single-worker pattern.
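A minimal manifest for the BankChurn row might look like the following sketch (the resource names are illustrative; the 30s scale-up stabilization maps to `behavior.scaleUp.stabilizationWindowSeconds` in the `autoscaling/v2` API):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: bankchurn          # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: bankchurn        # illustrative name
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu          # CPU only — no memory metric
        target:
          type: Utilization
          averageUtilization: 50
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
```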
## Alternatives Considered
| Option | Verdict | Rationale |
|---|---|---|
| CPU + Memory HPA (original) | Rejected | Never scales down for fixed-footprint models |
| CPU-only HPA | Selected ✅ | Accurate signal for inference traffic |
| Custom metrics (p95 latency via Prometheus Adapter) | Deferred | Best signal, but requires Prometheus Adapter CRD; planned for future |
| KEDA (event-driven) | Deferred | Scale-to-zero capability; overkill for HTTP services at current scale |
## Verification
After removing the memory metric:
- BankChurn correctly scaled 3 → 2 → 1 in ~8 minutes during low traffic
- Scale-up still triggered correctly under load (CPU-driven)
- No OOM kills observed — `resources.limits.memory` provides hard protection independently of HPA
## Consequences
- Positive: Correct scale-down behavior — idle pods are reclaimed within minutes
- Positive: Cost savings during off-peak hours (fewer pods running)
- Positive: HPA signal is clean and predictable (CPU correlates with request volume)
- Negative: The HPA no longer reacts to memory pressure (mitigated by `resources.limits.memory`, which enforces a hard per-pod cap independently of the HPA)
## Revisit When
- Memory usage becomes variable (e.g., dynamic model loading, batch inference with variable-size inputs)
- Custom metrics HPA (request latency) is implemented — would replace CPU as the scaling signal
- KEDA adoption for scale-to-zero during extended idle periods
## References
- ADR-014: Single-Worker Pod Pattern — refined HPA thresholds
- Kubernetes HPA Algorithm
- Multi-Cloud Comparison — Resource Usage