Skip to content

ADR-015: Async Inference via ThreadPoolExecutor for CPU-Bound ML Models

  • Status: Accepted
  • Date: 2026-03-18
  • Authors: Duque Ortega Mutis
  • Related: ADR-014 (single-worker pattern), ADR-016 (GCP vs AWS latency)
  • Discovered via: Locust stress test — 81% failure rate under 100 concurrent users

TL;DR: Offloaded CPU-bound StackingClassifier.predict() to a ThreadPoolExecutor(max_workers=4) via asyncio.run_in_executor(). sklearn/XGBoost/LightGBM release the GIL during C extensions, enabling real threading parallelism. Result: 81% error rate → 0%, CPU cost halved (2000m → 1000m), all services now follow the single-worker pod pattern.


Context

ADR-014 established the single-worker pod pattern for ML inference under Kubernetes. However, BankChurn required an exception (2 workers + 2000m CPU) because its StackingClassifier.predict() is a synchronous, CPU-bound operation that blocks the uvicorn async event loop.

Problem: Event Loop Blocking

Request A arrives → predict() starts (blocks event loop ~100ms)
Request B arrives → queued (event loop busy)
Request C arrives → queued
...
Request N arrives → nginx timeout → 503

Under 100 concurrent users, the single-worker BankChurn pod showed: - 81% failure rate (1384 of 1707 requests) - Error types: 503 Service Unavailable, 502 Bad Gateway - Root cause: uvicorn event loop blocked by synchronous predictor.predict() → cannot accept new connections → nginx upstream timeout

The 2-worker workaround (ADR-014) reduced failures but doubled CPU cost (2000m) and still limited concurrency to 2 simultaneous requests per pod.


Decision

Offload CPU-bound inference to a ThreadPoolExecutor via asyncio.run_in_executor(), keeping uvicorn's event loop free to accept connections and health checks.

Implementation

from concurrent.futures import ThreadPoolExecutor
from functools import partial

# Thread pool for CPU-bound inference — unblocks uvicorn event loop
# sklearn/XGBoost/LightGBM release GIL during C extensions → real parallelism
_inference_executor = ThreadPoolExecutor(max_workers=4, thread_name_prefix="ml-infer")

def _sync_predict(customer_dict: dict, explain: bool) -> PredictionResponse:
    """CPU-bound prediction logic — runs in thread pool."""
    df = pd.DataFrame([customer_dict])
    results = predictor.predict(df, include_proba=True)
    # ... build response ...

@app.post("/predict")
async def predict_churn(customer: CustomerData, explain: bool = False):
    customer_dict = customer.model_dump()
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(
        _inference_executor, partial(_sync_predict, customer_dict, explain)
    )

Why This Works for sklearn

Python's GIL normally prevents true threading parallelism. However, sklearn, XGBoost, and LightGBM are implemented in C/C++ and release the GIL during computation:

  • RandomForestClassifier.predict() → C extension, GIL released
  • GradientBoostingClassifier.predict() → C extension, GIL released
  • XGBClassifier.predict() → libxgboost C++, GIL released
  • LGBMClassifier.predict() → lib_lightgbm C++, GIL released
  • LogisticRegression.predict() → BLAS/LAPACK, GIL released

The StackingClassifier chains all of these, so the thread pool achieves near-true parallelism for inference.

Configuration Changes

Service Before (ADR-014) After (ADR-015)
BankChurn workers 2 1
BankChurn CPU limit 2000m 1000m
BankChurn thread pool N/A 4 threads
Dockerfile CMD --workers 2 default (1 worker)

NLPInsight and ChicagoTaxi remain unchanged (1 worker, no thread pool needed — their inference is lightweight).

Files Changed

  • BankChurn-Predictor/app/fastapi_app.py_sync_predict(), _sync_predict_batch(), _inference_executor
  • BankChurn-Predictor/Dockerfile — removed --workers 2
  • k8s/overlays/gcp/bankchurn-deployment.yaml — 1 worker, 1000m CPU
  • k8s/overlays/aws/bankchurn-deployment-aws.yaml — 1 worker, 1000m CPU

Verification: Load Test Results

GCP (GKE) — INGRESS_HOST=http://136.111.152.72

Test Users Duration BankChurn p50 BankChurn Errors NLPInsight p50 ChicagoTaxi p50
Smoke (pre-fix) 6 30s 390ms 0% 82ms 100ms
Load (pre-fix) 50 2m 2300ms 0.58% 73ms 92ms
Stress (pre-fix) 100 2m 95ms* 81.08% 88ms 120ms
Smoke (post-fix) 6 30s 200ms 0% 78ms 100ms
Load (post-fix) 50 2m 3100ms 0% 84ms 110ms
Stress (post-fix) 100 2m 8200ms 0.02% 79ms 100ms

*Pre-fix stress p50 is misleadingly low because 81% of requests failed immediately (503) — only fast successful requests counted.

AWS (EKS) — NLB ingress

Test Users Duration BankChurn p50 BankChurn Errors NLPInsight p50 ChicagoTaxi p50
Smoke 6 30s 110ms 0% 100ms 120ms
Load 50 2m 120ms 0% 100ms 130ms
Stress 100 2m 130ms 0% 100ms 130ms

Key Improvements

Metric Before After Improvement
BankChurn stress errors 81.08% 0.02% (GCP) / 0% (AWS) ~100% reduction
BankChurn CPU limit 2000m 1000m 50% cost reduction
BankChurn workers 2 processes 1 process + 4 threads 50% memory reduction
All services idle p50 82-390ms 78-200ms (GCP) / 100-120ms (AWS) Improved

Rationale

Why ThreadPoolExecutor (not ProcessPoolExecutor)?

Factor ThreadPoolExecutor ProcessPoolExecutor
GIL release sklearn C extensions release GIL → real parallelism Full parallelism (separate processes)
Memory Shared model memory (~300Mi) N × model size (each process loads model)
Startup Instant (threads share address space) Slow (fork + model reload)
K8s compatibility Single process → clean HPA metrics Multiple processes → diluted HPA
Complexity Minimal code change Requires pickling/IPC for results

Why 4 threads?

  • BankChurn StackingClassifier inference: ~100ms CPU time
  • At 4 threads: supports ~40 req/s per pod before queuing
  • Matches the CPU limit (1000m) — 4 threads × ~250m each under C extensions
  • Beyond 4 threads: diminishing returns on a 1-CPU pod

Why not apply to NLPInsight / ChicagoTaxi?

  • NLPInsight (FinBERT): inference is I/O-bound (tokenizer) + lightweight CPU. Single-threaded async handles it. p50 = 78-100ms even under 100 users.
  • ChicagoTaxi (LightGBM): inference is ~5ms CPU. No event loop blocking at any tested concurrency.

Consequences

Positive

  • Eliminated BankChurn 2-worker exception — all services now follow the single-worker pod pattern
  • 50% CPU cost reduction for BankChurn (2000m → 1000m)
  • 0% error rate under 100 concurrent users on both GCP and AWS
  • Event loop stays responsive for health checks and metrics during heavy inference load
  • Pattern is reusable for any future CPU-bound service

Negative / Trade-offs

  • Thread pool adds slight overhead (~1-2ms per request for scheduling)
  • Thread safety: predictor object must be thread-safe (sklearn estimators are safe for predict(), not for fit())
  • BankChurn load p50 on GCP (3100ms) remains higher than AWS (120ms) due to GKE node CPU characteristics

Future Considerations

  • If BankChurn latency under GCP load needs further improvement, consider increasing max_workers to 6-8 or upgrading to compute-optimized nodes (c2/c2d)
  • Monitor thread pool saturation via Prometheus metrics if available