# ADR-014: Single-Worker Pod Pattern for ML Inference Under Kubernetes
- Status: Accepted (updated by ADR-015 — BankChurn exception removed)
- Date: 2026-03-18
- Authors: Duque Ortega Mutis
- Related: ADR-015 (async fix), ADR-001 (HPA thresholds)
- Discovered via: Locust load test — GCP production (50 users, 1 min)
**TL;DR:** `uvicorn --workers N` is an anti-pattern under Kubernetes. Multiple workers share a single CPU budget, causing thrashing instead of parallelism. All services were switched to 1 worker per pod + HPA horizontal scaling. BankChurn p50 dropped from 1700ms to 200ms. ADR-015 later eliminated the last multi-worker exception via async inference.
## Context
During load testing against GCP production (GKE), BankChurn showed severely degraded latency under concurrent load:
| Metric | Expected (idle) | Observed (50 users) |
|---|---|---|
| BankChurn p50 | ~103ms | 1700ms |
| BankChurn p95 | ~300ms | 3200ms |
| NLPInsight p50 | ~5ms | 170ms ✅ |
| ChicagoTaxi p50 | ~75ms | 220ms ✅ |
Root cause: all 3 services used `uvicorn --workers 2` with CPU limits insufficient for 2 concurrent inference processes.
## Root Cause: Multi-Worker Anti-Pattern Under Kubernetes
`uvicorn --workers N` creates N OS processes via `fork`. Each process:

- Loads the full ML model independently → 2× memory footprint with no benefit
- Competes for the same CPU budget enforced by `resources.limits.cpu`
- Does not give Kubernetes per-worker CPU visibility → HPA cannot respond accurately
For CPU-bound ML inference — BankChurn uses StackingClassifier (5 base estimators + LR meta-learner) — multiple workers under a shared CPU limit cause CPU thrashing, not parallelism. Under 50 concurrent users at ~15 req/s, BankChurn needed ~1.05 CPU for inference alone. With 1 CPU limit split across 2 workers, requests queued and p50 jumped to 1700ms.
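The queueing described above can be checked with a back-of-envelope calculation using the figures from the load test (per-request CPU time is derived here, not measured directly):

```python
# Back-of-envelope check of the BankChurn CPU budget under load.
req_per_s = 15                 # observed throughput at 50 concurrent users
cpu_needed = 1.05              # CPU cores needed for inference alone
cpu_limit = 1.0                # resources.limits.cpu = 1000m
workers = 2

cpu_per_req = cpu_needed / req_per_s     # ~0.07 CPU-seconds per request
per_worker_budget = cpu_limit / workers  # 0.5 cores each under the shared limit

# Demand exceeds supply, so requests queue and latency balloons.
print(cpu_needed > cpu_limit)  # True
```

With demand above the limit, the kernel throttles both workers and queue depth grows without bound, which matches the observed p50 jump.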
ChicagoTaxi was the most severely underprovisioned: 2 workers sharing only 500m CPU (effectively 250m each) — below the minimum viable inference budget.
Additionally, uvicorn --workers with fork can cause issues with objects loaded at startup (SHAP KernelExplainer background data, FinBERT tokenizer), where forked child processes inherit inconsistent state.
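For illustration, the problematic deployment shape looked roughly like this (a minimal sketch — container name, image tag, module path, and port are placeholders; the worker count and CPU limit match the table below):

```yaml
# Illustrative only — the anti-pattern: 2 forked workers share one 1000m budget.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: bankchurn-predictor
spec:
  template:
    spec:
      containers:
        - name: api
          image: bankchurn-predictor:latest
          command: ["uvicorn", "app.main:app", "--host", "0.0.0.0",
                    "--port", "8000", "--workers", "2"]
          resources:
            limits:
              cpu: "1000m"
```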
## Previous Configuration (All Services — GCP and AWS)
| Service | Workers | CPU Limit | HPA Threshold | scaleUp Stabilization | Issue |
|---|---|---|---|---|---|
| BankChurn | 2 | 1000m | 70% | 60s | p50 1700ms under load |
| NLPInsight | 2 | 1000m | 75% | 60s | Redundant workers |
| ChicagoTaxi | 2 | 500m | 70% | missing | CPU critically underprovisioned |
## Decision

Remove `uvicorn --workers` from all service deployments. Use 1 process per pod and scale horizontally via the Kubernetes HPA.

Scale-out strategy: Kubernetes pod replication (horizontal), not OS process replication (vertical).
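The decided single-worker shape is sketched below (illustrative; only the fields relevant to this decision are shown, and module path and port are placeholders — the Kubernetes `command:` takes precedence over the image's `CMD`):

```yaml
# Sketch: one uvicorn process, no --workers flag; concurrency comes from
# HPA replicas instead of forked workers.
containers:
  - name: api
    image: bankchurn-predictor:latest
    command: ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
    resources:
      requests:
        cpu: "400m"      # post-change request (see trade-offs)
      limits:
        cpu: "1000m"     # the full budget goes to the single process
```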
## New Configuration (GCP and AWS)
| Service | Workers | CPU Limit | HPA Threshold | scaleUp Stabilization | Reason |
|---|---|---|---|---|---|
| BankChurn | 1 | 1000m | 50% | 30s | Async inference via run_in_executor (ADR-015) |
| NLPInsight | 1 | 1000m | 60% | 30s | I/O-bound FinBERT — single worker sufficient |
| ChicagoTaxi | 1 | 750m | 60% | 30s + added `behavior` block | Lightweight LightGBM — single worker sufficient |
## BankChurn: Async Inference (ADR-015 Implemented)
Initial load testing revealed that BankChurn required 2 workers because synchronous inference blocked the event loop. ADR-015 resolved this by implementing `asyncio.run_in_executor(ThreadPoolExecutor)` for all prediction endpoints:

- `StackingClassifier.predict()` now runs in a 4-thread pool, freeing the uvicorn event loop
- sklearn/XGBoost/LightGBM release the GIL inside their C extensions → real parallelism
- A single worker handles 100 concurrent users with 0% errors (was 81% pre-fix)
- CPU reduced from 2000m to 1000m (single process, no contention)
See ADR-015 for full details on the async inference pattern.
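The pattern above can be sketched as a minimal, self-contained illustration (the endpoint and the stand-in model are hypothetical; see ADR-015 for the real implementation):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# 4-thread pool, mirroring the configuration described above.
executor = ThreadPoolExecutor(max_workers=4)

def blocking_predict(features):
    # Stand-in for StackingClassifier.predict(); sklearn releases the GIL
    # inside its C extensions, so worker threads achieve real parallelism.
    return [sum(features)]

async def predict_endpoint(features):
    loop = asyncio.get_running_loop()
    # Off-load the CPU-bound call so the uvicorn event loop stays free
    # to accept and dispatch other requests.
    return await loop.run_in_executor(executor, blocking_predict, features)

result = asyncio.run(predict_endpoint([1, 2, 3]))
print(result)  # [6]
```

In a FastAPI handler the same `await loop.run_in_executor(...)` call sits inside an `async def` route, which is what keeps a single worker responsive under concurrent load.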
## Files Changed

- `k8s/overlays/gcp/bankchurn-deployment.yaml`
- `k8s/overlays/aws/bankchurn-deployment-aws.yaml`
- `k8s/overlays/gcp/nlpinsight-deployment.yaml`
- `k8s/overlays/aws/nlpinsight-deployment-aws.yaml`
- `k8s/overlays/gcp/chicagotaxi-deployment.yaml`
- `k8s/overlays/aws/chicagotaxi-deployment-aws.yaml`
- `BankChurn-Predictor/Dockerfile`
## Deployment Strategy

No image rebuild is required: the Kubernetes `command:` field overrides the Dockerfile `CMD`. Applying the updated manifests via `kubectl apply` triggers a rolling update:
```shell
kubectl apply -f k8s/overlays/gcp/bankchurn-deployment.yaml -n ml-portfolio
kubectl apply -f k8s/overlays/gcp/nlpinsight-deployment.yaml -n ml-portfolio
kubectl apply -f k8s/overlays/gcp/chicagotaxi-deployment.yaml -n ml-portfolio
kubectl rollout status deployment/bankchurn-predictor -n ml-portfolio
kubectl rollout status deployment/nlpinsight-analyzer -n ml-portfolio
kubectl rollout status deployment/chicagotaxi-pipeline -n ml-portfolio
```
## Rationale

### Why single worker per pod is K8s-native
| Concern | `--workers N` | HPA pod scaling |
|---|---|---|
| CPU isolation | Shared pool, thrashing | Each pod has its own budget |
| Memory | N × model size | N × model size (same, but explicit) |
| HPA signal | Diluted (N processes → averaged) | Clean (1 process = 1 pod metric) |
| Debugging | N log streams mixed | 1 log stream per pod |
| Startup safety | `fork` after model load (risky) | Fresh process per pod (safe) |
| Graceful shutdown | All workers die together | Individual pod drain |
### Why not gunicorn + uvicorn workers?
Gunicorn pre-fork workers solve the --workers CPU problem (each worker gets scheduled independently), but add an extra process manager layer with no benefit in K8s. HPA already provides the same capability at the infrastructure level.
### CPU Limit Rationale
- BankChurn 1000m: StackingClassifier inference via thread pool. Single process, no multi-worker contention.
- ChicagoTaxi 750m: RandomForest regression + pandas lookup; 500m was insufficient even for 1 worker.
- NLPInsight 1000m: FinBERT inference would ideally run on a GPU; on these CPU-only nodes, 1000m is sufficient for the observed load profile.
### HPA Threshold Rationale
Lower thresholds trigger scale-out earlier, preventing latency spikes at the cost of slightly more pods. For ML inference services where latency SLAs are strict (<500ms p95), this is the correct trade-off.
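As a sketch, the BankChurn HPA with the new threshold and a controlled scale-up window might look like the following (illustrative; the replica bounds are assumptions, not values from this ADR):

```yaml
# Sketch of an autoscaling/v2 HPA matching the new-configuration table.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: bankchurn-predictor
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: bankchurn-predictor
  minReplicas: 1        # assumed
  maxReplicas: 5        # assumed ceiling
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50   # lowered threshold per the table above
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
```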
## Consequences

### Positive
- BankChurn p50 dropped from 1700ms → 200ms idle / 120ms (AWS) under 50-user load
- ChicagoTaxi is no longer CPU-starved; the HPA now has a `behavior` block for controlled scale-out
- HPA CPU signals are accurate: 1 process = clean utilization metric
- Safer startup: no `fork` after loading the StackingClassifier or FinBERT
### Negative / Trade-offs
- Slightly higher cost at rest: BankChurn CPU request increased 300m → 400m
- Under sudden large spikes, a single pod absorbs more load before HPA reacts (30s stabilization window)
## Monitoring
Verify the fix with load testing and compare p95 before/after:
```shell
# Re-run load test after rolling update completes
INGRESS_HOST=http://136.111.152.72 locust -f tests/load/locustfile.py \
  --headless -u 50 -r 10 --run-time 1m --html=reports/load_test_gcp_post_fix.html
```
Verified (2026-03-18): BankChurn p50 200ms idle, 0% errors under 100 users on both GCP and AWS.
## Alternatives Considered

### 1. Gunicorn + UvicornWorker
Each worker is a separate process scheduled by the OS, avoiding shared CPU. Rejected: adds process manager overhead; HPA achieves the same horizontal scaling at the K8s layer without complexity.
### 2. Keep `--workers 2` + double the CPU limit to 2000m
Would solve the thrashing. Rejected: 2× cost per pod, defeats K8s HPA-based elasticity. Better to have 2 × 1 CPU pods than 1 × 2 CPU pod.
### 3. Async workers (default uvicorn)
FastAPI is already async; uvicorn's single-process event loop handles concurrent I/O requests. CPU-bound inference (predict()) still blocks the event loop — addressed by HPA scaling, not more workers.
### 4. Background thread pools for inference
Run model inference in `asyncio.run_in_executor()` with a `ThreadPoolExecutor`. Implemented in ADR-015: it unblocks the event loop and enables full concurrency within a single pod, removing the need for the BankChurn 2-worker exception.