BankChurn Predictor¶
Predict which bank customers are likely to leave — and quantify the cost of getting it wrong.

The Problem¶
A bank with 100K customers and a 20% annual churn rate loses ~$30M/year in lifetime value. The question isn't "can we predict churn?" — it's "at what threshold do we act, and what does each error cost?"
Why AUC-ROC, Not Accuracy¶
The dataset is 80/20 retained/churned. A model predicting "no churn" for everyone scores 79.6% accuracy — and catches zero churners. AUC-ROC measures rank-ordering quality across all thresholds, independently of class imbalance.
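The accuracy trap above is easy to demonstrate; a minimal sketch with scikit-learn, using a toy exact 80/20 split rather than the dataset's real 79.6/20.4 balance:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Toy labels mirroring the 80/20 retained/churned split
y = np.array([0] * 800 + [1] * 200)

# Degenerate model: predict "no churn" (score 0) for everyone
p_const = np.zeros(1000)

print(accuracy_score(y, p_const))  # 0.8 -- looks respectable, catches zero churners
print(roc_auc_score(y, p_const))   # 0.5 -- no rank-ordering ability at all
```

Accuracy rewards the majority-class shortcut; AUC-ROC exposes it as a coin flip.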
| Metric | Value | Why It Matters |
|---|---|---|
| AUC-ROC | 0.87 | Rank-ordering: 87% of the time, a churner scores higher than a non-churner |
| F1 | 0.62 | Harmonic mean at default threshold (0.50) |
| Precision | 0.73 | 73% of flagged customers actually churn |
| Recall | 0.54 | 54% of actual churners caught (at 0.50); 78% at production threshold 0.35 |
Production threshold: 0.35. A missed churner costs ~$1,500 in LTV; an unnecessary retention offer costs ~$50. At that 30:1 cost ratio, we favor recall over precision.
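The threshold choice falls out of a simple expected-cost sweep. A sketch with synthetic scores (the $1,500/$50 costs come from this page; the score distributions are made up for illustration):

```python
import numpy as np

FN_COST, FP_COST = 1500, 50  # missed churner vs. wasted retention offer

def expected_cost(y_true, scores, threshold):
    """Total dollar cost of acting on everyone scoring at or above threshold."""
    flagged = scores >= threshold
    fp = np.sum(flagged & (y_true == 0))   # offers sent to customers who would stay
    fn = np.sum(~flagged & (y_true == 1))  # churners we failed to flag
    return fp * FP_COST + fn * FN_COST

# Toy scores: churners tend to score higher, but the distributions overlap
rng = np.random.default_rng(42)
y = np.array([0] * 800 + [1] * 200)
scores = np.clip(np.where(y == 1,
                          rng.normal(0.6, 0.2, 1000),
                          rng.normal(0.3, 0.2, 1000)), 0, 1)

costs = {t: expected_cost(y, scores, t) for t in (0.35, 0.50)}
# At a 30:1 cost ratio, false negatives dominate, so the lower
# threshold comes out cheaper despite flagging more non-churners.
```

The asymmetry is the whole argument: each false negative costs as much as 30 false positives, so the optimum sits well below the default 0.50.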
Architecture¶
```mermaid
flowchart LR
    A[API Request] --> B[Pydantic\nValidation]
    B --> C[ColumnTransformer]
    C --> D{StackingClassifier}
    D --> E[RandomForest]
    D --> F[GradientBoosting]
    D --> G[XGBoost]
    D --> H[LightGBM]
    E & F & G & H --> I[LogisticRegression\nMeta-Learner]
    I --> J[Prediction +\nRisk Level]
    J --> K{explain=true?}
    K -->|Yes| L[SHAP Values\n+93ms]
    K -->|No| M[JSON Response\n~103ms]
    L --> M
```
Why StackingClassifier: the four base learners capture complementary patterns, combining one bagging model (RandomForest) with three distinct boosting implementations (GradientBoosting, XGBoost, LightGBM). AUC improved from 0.84 (best single model) to 0.87. CV variance is tight (±0.006), confirming generalization over memorization. See ADR-003.
Operational¶
| Metric | Value | Context |
|---|---|---|
| Test Coverage | 90% (199 tests) | CI threshold: 85% |
| Docker Image | 342 MB | bankchurn:v3.5.0 on Artifact Registry (python:3.11-slim-bookworm) |
| Model Size | 4.1 MB | Joblib compress=3; includes preprocessor + 4 base learners + meta-learner |
| P50 / P95 Latency | 200ms / 410ms (GCP), 110ms / 140ms (AWS) | Through ingress, Locust smoke test (6 users) |
| SHAP | Lazy, CPU-only | ?explain=true adds ~4.5s (KernelExplainer); skipped by default |
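The "lazy" SHAP row boils down to a build-once pattern: the expensive explainer is constructed on the first `?explain=true` request and cached. A minimal dependency-free sketch (the `shap` call in the comment is illustrative, not the service's exact code):

```python
import threading

_cache = {}
_lock = threading.Lock()

def lazy_singleton(key, builder):
    """Build an expensive object once, on first use (double-checked locking)."""
    if key not in _cache:
        with _lock:                # guard against concurrent first requests
            if key not in _cache:
                _cache[key] = builder()
    return _cache[key]

# Sketch of the API layer's usage -- the shap import is deferred until the
# first explain=true request, keeping cold start and the default path fast:
# explainer = lazy_singleton(
#     "shap", lambda: shap.KernelExplainer(model.predict_proba, background))
```

Requests without `explain=true` never touch the builder, which is why the default path avoids the KernelExplainer cost entirely.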
Responsible AI¶
- Fairness: Disparate impact ratio and equal opportunity difference audited by Gender and Geography
- Drift: Evidently AI monitors PSI/KS per feature; alert fires if >30% features drift
- Validation: Pandera schemas reject invalid inputs (CreditScore ∈ [300, 850], Age > 0)
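The PSI check in the drift bullet can be sketched without Evidently. A minimal NumPy implementation; the decile binning and the 0.1/0.25 bands are common conventions, not necessarily this project's configuration:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference and a current sample.

    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 drifted.
    """
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))  # decile bins
    edges[0] = min(edges[0], actual.min())   # widen ends to cover both samples
    edges[-1] = max(edges[-1], actual.max())
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # avoid log(0)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
ref = rng.normal(650, 50, 5000)      # e.g. CreditScore at training time
same = rng.normal(650, 50, 5000)     # fresh sample, no drift
shifted = rng.normal(600, 50, 5000)  # one-sigma mean shift -> clear drift
```

A per-feature loop over such a score is what feeds the ">30% of features drifted" alert rule.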
Live Prediction¶
*Screenshots: a live BankChurn prediction (left) and its SHAP explanation (right).*
Try It¶
```bash
curl -s -X POST http://localhost:8001/predict \
  -H "Content-Type: application/json" \
  -d '{"CreditScore":650,"Geography":"France","Gender":"Male","Age":40,"Tenure":5,"Balance":60000,"NumOfProducts":2,"HasCrCard":1,"IsActiveMember":1,"EstimatedSalary":50000}' | python3 -m json.tool
```
Expected: churn_probability, risk_category (low/medium/high), churn_prediction (0/1)
```bash
curl -s -X POST "http://localhost:8001/predict?explain=true" \
  -H "Content-Type: application/json" \
  -d '{"CreditScore":450,"Geography":"Germany","Gender":"Female","Age":55,"Tenure":1,"Balance":0,"NumOfProducts":1,"HasCrCard":0,"IsActiveMember":0,"EstimatedSalary":30000}' | python3 -m json.tool
```
Expected: Same fields + feature_contributions (SHAP values per feature). This high-risk customer should show Age and NumOfProducts as top churn drivers.
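The risk_category field in both responses amounts to a banding over churn_probability. A sketch; only the 0.35 action threshold comes from this page, and the "high" cut-point is illustrative, not the service's actual value:

```python
def risk_category(p, act_threshold=0.35):
    """Map a churn probability to the risk label returned by /predict."""
    if p >= 0.70:              # illustrative cut-point for "high"
        return "high"
    if p >= act_threshold:     # production action threshold from the docs
        return "medium"
    return "low"
```

Banding keeps the API contract stable even if the underlying model (and its calibration) changes between versions.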
📄 Full Model Card — includes metric rationale, performance benchmarks, and production decision narrative.
Last Updated: March 2026 — v3.5.3

