Banking churn service

BankChurn Predictor¶

Predict which bank customers are likely to leave — and quantify the cost of getting it wrong.

Model quality AUC 0.87 Cost-aware churn ranking with a production threshold.

Coverage 90% 199 tests with CI threshold discipline.

Incident 81% -> 0% Error-rate fix documented as a serving-pattern lesson.

Explainability Kernel SHAP StackingClassifier explanations in original feature space.

The Problem¶

A bank with 100K customers and a 20% annual churn rate loses ~$30M/year in lifetime value. The question isn't "can we predict churn?" — it's "at what threshold do we act, and what does each error cost?"

Business Translation¶

Problem

Retention budget is limited¶

The useful question is which customers deserve intervention, not whether the model can produce a score.

Decision

Favor recall with a lower threshold¶

A missed churner is much more expensive than an unnecessary retention offer, so the threshold is tuned around cost of error.

Impact

Predictions become operational¶

The output supports action: probability, risk category, threshold rationale and explanation path.

Trade-off

More complexity for explainability¶

The ensemble improves performance, but it requires careful SHAP handling and serving-path discipline.

Why AUC-ROC, Not Accuracy¶

The dataset is 80/20 retained/churned. A model predicting "no churn" for everyone scores 79.6% accuracy — and catches zero churners. AUC-ROC measures rank-ordering quality across all thresholds, independently of class imbalance.

Metric	Value	Why It Matters
AUC-ROC	0.87	Rank-ordering: 87% of the time, a churner scores higher than a non-churner
F1	0.62	Harmonic mean at default threshold (0.50)
Precision	0.73	73% of flagged customers actually churn
Recall	0.54	54% of actual churners caught (at 0.50); 78% at production threshold 0.35

Production threshold: 0.35 — a missed churner costs ~$1,500 LTV; an unnecessary retention offer costs ~$50. At 30:1 cost ratio, we favor recall over precision.

Architecture¶

flowchart LR
    A[API Request] --> B[Pydantic\nValidation]
    B --> C[ColumnTransformer]
    C --> D{StackingClassifier}
    D --> E[RandomForest]
    D --> F[GradientBoosting]
    D --> G[XGBoost]
    D --> H[LightGBM]
    E & F & G & H --> I[LogisticRegression\nMeta-Learner]
    I --> J[Prediction +\nRisk Level]
    J --> K{explain=true?}
    K -->|Yes| L[SHAP Values\n+93ms]
    K -->|No| M[JSON Response\n~103ms]
    L --> M

Why StackingClassifier: 4 diverse base learners capture complementary patterns (bagging + boosting + tree + gradient). AUC improved from 0.84 (best single model) to 0.87. CV variance is tight (±0.006), confirming generalization over memorization. See ADR-003.

Engineering Trade-Off¶

Chosen: StackingClassifier + FastAPI serving + async executor pattern. Rejected: a simpler single-model API that would be easier to serve but less useful as a portfolio signal for ensemble modeling, explainability and inference-path debugging.

The StackingClassifier also forced a concrete explainability fix: TreeExplainer produced unusable all-zero outputs for this ensemble/pipeline shape, so the service moved to KernelExplainer through a predict-proba wrapper in the original feature space.

Read the debugging deep dive

Code Review Shortcuts¶

FastAPI app Dockerfile Tests K8s manifest CI pipeline

Operational¶

Metric	Value	Context
Test Coverage	90% (199 tests)	CI threshold: 85%
Docker Image	342 MB	`bankchurn:v3.6.0` on Artifact Registry (python:3.11-slim-bookworm)
Model Size	4.1 MB	Joblib compress=3; includes preprocessor + 4 base learners + meta-learner
P50 / P95 Latency	200ms / 410ms (GCP), 110ms / 140ms (AWS)	Through ingress, Locust smoke test (6 users)
SHAP	Lazy, CPU-only	`?explain=true` adds ~4.5s (KernelExplainer); skipped by default
HPA correction	Memory -> CPU-only	Replicas reduced from 3 to 1 in 8 minutes after rejecting memory as the scaling signal

Responsible AI¶

Fairness: Disparate impact ratio and equal opportunity difference audited by Gender and Geography
Drift: Evidently AI monitors PSI/KS per feature; alert fires if >30% features drift
Validation: Pandera schemas reject invalid inputs (CreditScore ∈ [300, 850], Age > 0)

Live Prediction¶

BankChurn Prediction	SHAP Explanation

Try It¶

Basic PredictionWith SHAP ExplanationHealth Check

curl -s -X POST http://localhost:8001/predict \
  -H "Content-Type: application/json" \
  -d '{"CreditScore":650,"Geography":"France","Gender":"Male","Age":40,"Tenure":5,"Balance":60000,"NumOfProducts":2,"HasCrCard":1,"IsActiveMember":1,"EstimatedSalary":50000}' | python3 -m json.tool

Expected: churn_probability, risk_category (low/medium/high), churn_prediction (0/1)

curl -s -X POST "http://localhost:8001/predict?explain=true" \
  -H "Content-Type: application/json" \
  -d '{"CreditScore":450,"Geography":"Germany","Gender":"Female","Age":55,"Tenure":1,"Balance":0,"NumOfProducts":1,"HasCrCard":0,"IsActiveMember":0,"EstimatedSalary":30000}' | python3 -m json.tool

Expected: Same fields + feature_contributions (SHAP values per feature). This high-risk customer should show Age and NumOfProducts as top churn drivers.

curl -s http://localhost:8001/health | python3 -m json.tool

📄 Full Model Card — includes metric rationale, performance benchmarks, and production decision narrative.

Last Updated: April 2026 — v3.6.0