
BankChurn Debugging Deep Dive

Failure story and engineering judgment

From 81% API errors to a reliable inference path

This is the strongest debugging story in the portfolio because it shows the habit I want to bring to a team as an entry-level / junior MLOps engineer: measure the failure, isolate the cause, make the smallest meaningful fix, and turn the lesson into reusable engineering guidance.

Incident Summary

Symptom

81% failures under load

A Locust stress test exposed a high API error rate. From the outside, it looked like a simple scaling or CPU allocation problem.

Root cause

Blocked event loop + worker contention

uvicorn --workers N inside one Kubernetes pod shared a single CPU budget, while synchronous ML inference blocked FastAPI's async serving path under concurrency.

Fix

ThreadPoolExecutor

The API moved to one worker per pod, and the CPU-bound prediction work was placed behind loop.run_in_executor() with a ThreadPoolExecutor.

Outcome

Errors eliminated in validation

The revised serving pattern was validated with load testing and became a documented rule for future services.

What I Saw First

The first signal was not a model metric. It was an operating symptom: the API failed when concurrent users hit the prediction endpoint. That matters because production ML failures often appear outside the model itself. A model can have a good AUC and still fail as a service if the serving path is wrong.

The initial question was: is this a resource problem, a Kubernetes scaling problem, or an application execution problem?

Hypotheses I Had To Separate

Hypothesis 1

Add more workers

This looked tempting, but multiple Uvicorn workers inside one Kubernetes pod share the same pod CPU budget. That can create contention instead of useful parallelism.

Hypothesis 2

Scale with memory

ML pods have a fixed model memory footprint. Memory stayed high even when traffic dropped, so memory-based HPA would not scale down cleanly.

Hypothesis 3

Unblock the event loop

The evidence pointed to the synchronous prediction call blocking the async server. That explained why adding resources alone was the wrong first fix.

The Root Cause

The BankChurn model uses a scikit-learn style pipeline and ensemble inference path. The prediction call is CPU-bound and synchronous. When that call runs directly inside an async FastAPI endpoint, it blocks the event loop. Under load, the service spends too much time waiting on inference work and cannot keep serving new connections reliably. Using uvicorn --workers N inside the same Kubernetes pod did not solve the issue because the workers still shared one pod CPU budget and made the HPA signal harder to reason about.

The key lesson was that async API code does not automatically make CPU-bound ML inference concurrent. The serving pattern must intentionally separate request handling from model computation.
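For contrast, here is the shape of the anti-pattern reduced to a minimal sketch. The stub predictor and endpoint are illustrative, not the actual BankChurn code; the sleep stands in for CPU-bound ensemble inference:

import time

from fastapi import FastAPI

app = FastAPI()

class StubPredictor:
    """Stand-in for the real pipeline; blocks just like synchronous CPU-bound inference."""
    def predict(self, payload: dict) -> float:
        time.sleep(0.5)  # simulate heavy synchronous work
        return 0.42

predictor = StubPredictor()

@app.post("/predict")
async def predict(payload: dict) -> dict:
    # Anti-pattern: the synchronous call runs directly on the event loop,
    # so this worker cannot progress any other request until it returns.
    return {"prediction": predictor.predict(payload)}

Because every request holds the event loop for the full inference time, concurrency collapses instead of scaling, which is exactly what the stress test surfaced.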

The Fix

The fix was to keep a single Uvicorn worker per pod and move prediction work into a thread pool:

# Inside the async prediction endpoint: hand the CPU-bound call to the
# executor created at startup, so the event loop stays free to serve requests.
loop = asyncio.get_running_loop()
prediction = await loop.run_in_executor(
    app.state.inference_executor,
    predictor.predict,
    request_payload,
)

This works for this stack because scikit-learn, XGBoost and LightGBM execute heavy numerical work in compiled extensions that can release the GIL. The thread pool lets the event loop keep accepting and coordinating requests while the model computation runs off the main async path.
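Put together, the serving pattern looks roughly like this. It is a sketch under assumed names (StubPredictor, max_workers=2); the real service loads the trained BankChurn pipeline at startup rather than a stub:

import asyncio
from concurrent.futures import ThreadPoolExecutor
from contextlib import asynccontextmanager

from fastapi import FastAPI

class StubPredictor:
    """Placeholder for the real scikit-learn / XGBoost / LightGBM pipeline."""
    def predict(self, payload: dict) -> float:
        return 0.42  # the real call is CPU-bound ensemble inference

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load the model once and create one small executor per process.
    # The pod runs a single Uvicorn worker, so this is also one executor per pod.
    app.state.predictor = StubPredictor()
    app.state.inference_executor = ThreadPoolExecutor(max_workers=2)
    yield
    app.state.inference_executor.shutdown(wait=True)

app = FastAPI(lifespan=lifespan)

@app.post("/predict")
async def predict(request_payload: dict) -> dict:
    loop = asyncio.get_running_loop()
    # Offload the CPU-bound call so the event loop keeps accepting connections.
    prediction = await loop.run_in_executor(
        app.state.inference_executor,
        app.state.predictor.predict,
        request_payload,
    )
    return {"prediction": prediction}

Running it with a single Uvicorn worker keeps the Kubernetes scaling story simple: one pod, one worker, one executor, with HPA handling horizontal scale-out.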

How I Verified It

Before

Stress test failure

The API reached an 81% failure rate under the target load scenario.

After

Load test recovery

The revised serving path removed the observed failure pattern in validation and preserved a simpler Kubernetes scaling model.

Documentation

ADR-backed lesson

The result became part of the portfolio's architecture decisions and later informed the reusable MLOps template.

Before: API error rate

81%

After: API error rate

0%

CPU request after fix

~50% lower
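The load itself came from a short Locust scenario along these lines. The endpoint path, payload fields, and wait times here are illustrative rather than the exact test definition:

from locust import HttpUser, task, between

class ChurnPredictionUser(HttpUser):
    """Simulated client repeatedly hitting the prediction endpoint."""
    wait_time = between(0.5, 2)

    @task
    def predict(self):
        # Illustrative payload; the real test used realistic customer features.
        self.client.post("/predict", json={"credit_score": 650, "age": 42})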

What This Became In The Template

The important outcome was not only that BankChurn worked. The lesson became a reusable rule: avoid uvicorn --workers N as the default Kubernetes answer for ML inference, keep one worker per pod, use HPA for horizontal scaling, and move CPU-bound prediction work away from the async event loop.

That is the difference between a one-time fix and an operating habit. The portfolio bug became a template guardrail.

What I Would Improve Next

If I were evolving this service on a real team, I would add distributed tracing around the prediction path, capture request-level timing by stage, and run a short scheduled traffic window to keep the Grafana/Prometheus evidence fresh. I would also compare thread pool sizing across different model types instead of treating one executor configuration as universal.