ADR-015: Async Inference via ThreadPoolExecutor for CPU-Bound ML Models¶
- Status: Accepted
- Date: 2026-03-18
- Authors: Duque Ortega Mutis
- Related: ADR-014 (single-worker pattern), ADR-016 (GCP vs AWS latency)
- Discovered via: Locust stress test — 81% failure rate under 100 concurrent users
TL;DR: Offloaded CPU-bound
StackingClassifier.predict()to aThreadPoolExecutor(max_workers=4)viaasyncio.run_in_executor(). sklearn/XGBoost/LightGBM release the GIL during C extensions, enabling real threading parallelism. Result: 81% error rate → 0%, CPU cost halved (2000m → 1000m), all services now follow the single-worker pod pattern.
Context¶
ADR-014 established the single-worker pod pattern for ML inference under Kubernetes. However, BankChurn required an exception (2 workers + 2000m CPU) because its StackingClassifier.predict() is a synchronous, CPU-bound operation that blocks the uvicorn async event loop.
Problem: Event Loop Blocking¶
Request A arrives → predict() starts (blocks event loop ~100ms)
Request B arrives → queued (event loop busy)
Request C arrives → queued
...
Request N arrives → nginx timeout → 503
Under 100 concurrent users, the single-worker BankChurn pod showed:
- 81% failure rate (1384 of 1707 requests)
- Error types: 503 Service Unavailable, 502 Bad Gateway
- Root cause: uvicorn event loop blocked by synchronous predictor.predict() → cannot accept new connections → nginx upstream timeout
The 2-worker workaround (ADR-014) reduced failures but doubled CPU cost (2000m) and still limited concurrency to 2 simultaneous requests per pod.
Decision¶
Offload CPU-bound inference to a ThreadPoolExecutor via asyncio.run_in_executor(), keeping uvicorn's event loop free to accept connections and health checks.
Implementation¶
from concurrent.futures import ThreadPoolExecutor
from functools import partial
# Thread pool for CPU-bound inference — unblocks uvicorn event loop
# sklearn/XGBoost/LightGBM release GIL during C extensions → real parallelism
_inference_executor = ThreadPoolExecutor(max_workers=4, thread_name_prefix="ml-infer")
def _sync_predict(customer_dict: dict, explain: bool) -> PredictionResponse:
"""CPU-bound prediction logic — runs in thread pool."""
df = pd.DataFrame([customer_dict])
results = predictor.predict(df, include_proba=True)
# ... build response ...
@app.post("/predict")
async def predict_churn(customer: CustomerData, explain: bool = False):
customer_dict = customer.model_dump()
loop = asyncio.get_running_loop()
return await loop.run_in_executor(
_inference_executor, partial(_sync_predict, customer_dict, explain)
)
Why This Works for sklearn¶
Python's GIL normally prevents true threading parallelism. However, sklearn, XGBoost, and LightGBM are implemented in C/C++ and release the GIL during computation:
RandomForestClassifier.predict()→ C extension, GIL releasedGradientBoostingClassifier.predict()→ C extension, GIL releasedXGBClassifier.predict()→ libxgboost C++, GIL releasedLGBMClassifier.predict()→ lib_lightgbm C++, GIL releasedLogisticRegression.predict()→ BLAS/LAPACK, GIL released
The StackingClassifier chains all of these, so the thread pool achieves near-true parallelism for inference.
Configuration Changes¶
| Service | Before (ADR-014) | After (ADR-015) |
|---|---|---|
| BankChurn workers | 2 | 1 |
| BankChurn CPU limit | 2000m | 1000m |
| BankChurn thread pool | N/A | 4 threads |
| Dockerfile CMD | --workers 2 |
default (1 worker) |
NLPInsight and ChicagoTaxi remain unchanged (1 worker, no thread pool needed — their inference is lightweight).
Files Changed¶
BankChurn-Predictor/app/fastapi_app.py—_sync_predict(),_sync_predict_batch(),_inference_executorBankChurn-Predictor/Dockerfile— removed--workers 2k8s/overlays/gcp/bankchurn-deployment.yaml— 1 worker, 1000m CPUk8s/overlays/aws/bankchurn-deployment-aws.yaml— 1 worker, 1000m CPU
Verification: Load Test Results¶
GCP (GKE) — INGRESS_HOST=http://136.111.152.72¶
| Test | Users | Duration | BankChurn p50 | BankChurn Errors | NLPInsight p50 | ChicagoTaxi p50 |
|---|---|---|---|---|---|---|
| Smoke (pre-fix) | 6 | 30s | 390ms | 0% | 82ms | 100ms |
| Load (pre-fix) | 50 | 2m | 2300ms | 0.58% | 73ms | 92ms |
| Stress (pre-fix) | 100 | 2m | 95ms* | 81.08% | 88ms | 120ms |
| Smoke (post-fix) | 6 | 30s | 200ms | 0% | 78ms | 100ms |
| Load (post-fix) | 50 | 2m | 3100ms | 0% | 84ms | 110ms |
| Stress (post-fix) | 100 | 2m | 8200ms | 0.02% | 79ms | 100ms |
*Pre-fix stress p50 is misleadingly low because 81% of requests failed immediately (503) — only fast successful requests counted.
AWS (EKS) — NLB ingress¶
| Test | Users | Duration | BankChurn p50 | BankChurn Errors | NLPInsight p50 | ChicagoTaxi p50 |
|---|---|---|---|---|---|---|
| Smoke | 6 | 30s | 110ms | 0% | 100ms | 120ms |
| Load | 50 | 2m | 120ms | 0% | 100ms | 130ms |
| Stress | 100 | 2m | 130ms | 0% | 100ms | 130ms |
Key Improvements¶
| Metric | Before | After | Improvement |
|---|---|---|---|
| BankChurn stress errors | 81.08% | 0.02% (GCP) / 0% (AWS) | ~100% reduction |
| BankChurn CPU limit | 2000m | 1000m | 50% cost reduction |
| BankChurn workers | 2 processes | 1 process + 4 threads | 50% memory reduction |
| All services idle p50 | 82-390ms | 78-200ms (GCP) / 100-120ms (AWS) | Improved |
Rationale¶
Why ThreadPoolExecutor (not ProcessPoolExecutor)?¶
| Factor | ThreadPoolExecutor | ProcessPoolExecutor |
|---|---|---|
| GIL release | sklearn C extensions release GIL → real parallelism | Full parallelism (separate processes) |
| Memory | Shared model memory (~300Mi) | N × model size (each process loads model) |
| Startup | Instant (threads share address space) | Slow (fork + model reload) |
| K8s compatibility | Single process → clean HPA metrics | Multiple processes → diluted HPA |
| Complexity | Minimal code change | Requires pickling/IPC for results |
Why 4 threads?¶
- BankChurn StackingClassifier inference: ~100ms CPU time
- At 4 threads: supports ~40 req/s per pod before queuing
- Matches the CPU limit (1000m) — 4 threads × ~250m each under C extensions
- Beyond 4 threads: diminishing returns on a 1-CPU pod
Why not apply to NLPInsight / ChicagoTaxi?¶
- NLPInsight (FinBERT): inference is I/O-bound (tokenizer) + lightweight CPU. Single-threaded async handles it. p50 = 78-100ms even under 100 users.
- ChicagoTaxi (LightGBM): inference is ~5ms CPU. No event loop blocking at any tested concurrency.
Consequences¶
Positive¶
- Eliminated BankChurn 2-worker exception — all services now follow the single-worker pod pattern
- 50% CPU cost reduction for BankChurn (2000m → 1000m)
- 0% error rate under 100 concurrent users on both GCP and AWS
- Event loop stays responsive for health checks and metrics during heavy inference load
- Pattern is reusable for any future CPU-bound service
Negative / Trade-offs¶
- Thread pool adds slight overhead (~1-2ms per request for scheduling)
- Thread safety:
predictorobject must be thread-safe (sklearn estimators are safe forpredict(), not forfit()) - BankChurn load p50 on GCP (3100ms) remains higher than AWS (120ms) due to GKE node CPU characteristics
Future Considerations¶
- If BankChurn latency under GCP load needs further improvement, consider increasing
max_workersto 6-8 or upgrading to compute-optimized nodes (c2/c2d) - Monitor thread pool saturation via Prometheus metrics if available