ADR-015: Async Inference via ThreadPoolExecutor for CPU-Bound ML Models¶

Status: Accepted
Date: 2026-03-18
Authors: Duque Ortega Mutis
Related: ADR-014 (single-worker pattern), ADR-016 (GCP vs AWS latency)
Discovered via: Locust stress test — 81% failure rate under 100 concurrent users

TL;DR: Offloaded CPU-bound StackingClassifier.predict() to a ThreadPoolExecutor(max_workers=4) via asyncio.run_in_executor(). sklearn/XGBoost/LightGBM release the GIL during C extensions, enabling real threading parallelism. Result: 81% error rate → 0%, CPU cost halved (2000m → 1000m), all services now follow the single-worker pod pattern.

Context¶

ADR-014 established the single-worker pod pattern for ML inference under Kubernetes. However, BankChurn required an exception (2 workers + 2000m CPU) because its StackingClassifier.predict() is a synchronous, CPU-bound operation that blocks the uvicorn async event loop.

Problem: Event Loop Blocking¶

Request A arrives → predict() starts (blocks event loop ~100ms)
Request B arrives → queued (event loop busy)
Request C arrives → queued
...
Request N arrives → nginx timeout → 503

Under 100 concurrent users, the single-worker BankChurn pod showed: - 81% failure rate (1384 of 1707 requests) - Error types: 503 Service Unavailable, 502 Bad Gateway - Root cause: uvicorn event loop blocked by synchronous predictor.predict() → cannot accept new connections → nginx upstream timeout

The 2-worker workaround (ADR-014) reduced failures but doubled CPU cost (2000m) and still limited concurrency to 2 simultaneous requests per pod.

Decision¶

Offload CPU-bound inference to a ThreadPoolExecutor via asyncio.run_in_executor(), keeping uvicorn's event loop free to accept connections and health checks.

Implementation¶

from concurrent.futures import ThreadPoolExecutor
from functools import partial

# Thread pool for CPU-bound inference — unblocks uvicorn event loop
# sklearn/XGBoost/LightGBM release GIL during C extensions → real parallelism
_inference_executor = ThreadPoolExecutor(max_workers=4, thread_name_prefix="ml-infer")

def _sync_predict(customer_dict: dict, explain: bool) -> PredictionResponse:
    """CPU-bound prediction logic — runs in thread pool."""
    df = pd.DataFrame([customer_dict])
    results = predictor.predict(df, include_proba=True)
    # ... build response ...

@app.post("/predict")
async def predict_churn(customer: CustomerData, explain: bool = False):
    customer_dict = customer.model_dump()
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(
        _inference_executor, partial(_sync_predict, customer_dict, explain)
    )

Why This Works for sklearn¶

Python's GIL normally prevents true threading parallelism. However, sklearn, XGBoost, and LightGBM are implemented in C/C++ and release the GIL during computation:

RandomForestClassifier.predict() → C extension, GIL released
GradientBoostingClassifier.predict() → C extension, GIL released
XGBClassifier.predict() → libxgboost C++, GIL released
LGBMClassifier.predict() → lib_lightgbm C++, GIL released
LogisticRegression.predict() → BLAS/LAPACK, GIL released

The StackingClassifier chains all of these, so the thread pool achieves near-true parallelism for inference.

Configuration Changes¶

Service	Before (ADR-014)	After (ADR-015)
BankChurn workers	2	1
BankChurn CPU limit	2000m	1000m
BankChurn thread pool	N/A	4 threads
Dockerfile CMD	`--workers 2`	default (1 worker)

NLPInsight and ChicagoTaxi remain unchanged (1 worker, no thread pool needed — their inference is lightweight).

Files Changed¶

BankChurn-Predictor/app/fastapi_app.py — _sync_predict(), _sync_predict_batch(), _inference_executor
BankChurn-Predictor/Dockerfile — removed --workers 2
k8s/overlays/gcp/bankchurn-deployment.yaml — 1 worker, 1000m CPU
k8s/overlays/aws/bankchurn-deployment-aws.yaml — 1 worker, 1000m CPU

Verification: Load Test Results¶

GCP (GKE) — `INGRESS_HOST=http://136.111.152.72`¶

Test	Users	Duration	BankChurn p50	BankChurn Errors	NLPInsight p50	ChicagoTaxi p50
Smoke (pre-fix)	6	30s	390ms	0%	82ms	100ms
Load (pre-fix)	50	2m	2300ms	0.58%	73ms	92ms
Stress (pre-fix)	100	2m	95ms*	81.08%	88ms	120ms
Smoke (post-fix)	6	30s	200ms	0%	78ms	100ms
Load (post-fix)	50	2m	3100ms	0%	84ms	110ms
Stress (post-fix)	100	2m	8200ms	0.02%	79ms	100ms

*Pre-fix stress p50 is misleadingly low because 81% of requests failed immediately (503) — only fast successful requests counted.

AWS (EKS) — NLB ingress¶

Test	Users	Duration	BankChurn p50	BankChurn Errors	NLPInsight p50	ChicagoTaxi p50
Smoke	6	30s	110ms	0%	100ms	120ms
Load	50	2m	120ms	0%	100ms	130ms
Stress	100	2m	130ms	0%	100ms	130ms

Key Improvements¶

Metric	Before	After	Improvement
BankChurn stress errors	81.08%	0.02% (GCP) / 0% (AWS)	~100% reduction
BankChurn CPU limit	2000m	1000m	50% cost reduction
BankChurn workers	2 processes	1 process + 4 threads	50% memory reduction
All services idle p50	82-390ms	78-200ms (GCP) / 100-120ms (AWS)	Improved

Rationale¶

Why ThreadPoolExecutor (not ProcessPoolExecutor)?¶

Factor	ThreadPoolExecutor	ProcessPoolExecutor
GIL release	sklearn C extensions release GIL → real parallelism	Full parallelism (separate processes)
Memory	Shared model memory (~300Mi)	N × model size (each process loads model)
Startup	Instant (threads share address space)	Slow (fork + model reload)
K8s compatibility	Single process → clean HPA metrics	Multiple processes → diluted HPA
Complexity	Minimal code change	Requires pickling/IPC for results

Why 4 threads?¶

BankChurn StackingClassifier inference: ~100ms CPU time
At 4 threads: supports ~40 req/s per pod before queuing
Matches the CPU limit (1000m) — 4 threads × ~250m each under C extensions
Beyond 4 threads: diminishing returns on a 1-CPU pod

Why not apply to NLPInsight / ChicagoTaxi?¶

NLPInsight (FinBERT): inference is I/O-bound (tokenizer) + lightweight CPU. Single-threaded async handles it. p50 = 78-100ms even under 100 users.
ChicagoTaxi (LightGBM): inference is ~5ms CPU. No event loop blocking at any tested concurrency.

Consequences¶

Positive¶

Eliminated BankChurn 2-worker exception — all services now follow the single-worker pod pattern
50% CPU cost reduction for BankChurn (2000m → 1000m)
0% error rate under 100 concurrent users on both GCP and AWS
Event loop stays responsive for health checks and metrics during heavy inference load
Pattern is reusable for any future CPU-bound service

Negative / Trade-offs¶

Thread pool adds slight overhead (~1-2ms per request for scheduling)
Thread safety: predictor object must be thread-safe (sklearn estimators are safe for predict(), not for fit())
BankChurn load p50 on GCP (3100ms) remains higher than AWS (120ms) due to GKE node CPU characteristics

Future Considerations¶

If BankChurn latency under GCP load needs further improvement, consider increasing max_workers to 6-8 or upgrading to compute-optimized nodes (c2/c2d)
Monitor thread pool saturation via Prometheus metrics if available