BankChurn Predictor¶
Predict which bank customers are likely to leave — and quantify the cost of getting it wrong.

The Problem¶
A bank with 100K customers and a 20% annual churn rate loses ~$30M/year in lifetime value. The question isn't "can we predict churn?" — it's "at what threshold do we act, and what does each error cost?"
Why AUC-ROC, Not Accuracy¶
The dataset is 80/20 retained/churned. A model predicting "no churn" for everyone scores 79.6% accuracy — and catches zero churners. AUC-ROC measures rank-ordering quality across all thresholds, independently of class imbalance.
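The accuracy trap above is easy to demonstrate; a minimal sketch with scikit-learn, using a toy exact 80/20 split rather than the dataset's real 79.6/20.4 balance:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Toy labels mirroring the 80/20 retained/churned split
y = np.array([0] * 800 + [1] * 200)

# Degenerate model: predict "no churn" (score 0) for everyone
p_const = np.zeros(1000)

print(accuracy_score(y, p_const))  # 0.8 -- looks respectable, catches zero churners
print(roc_auc_score(y, p_const))   # 0.5 -- no rank-ordering ability at all
```

Accuracy rewards the majority-class shortcut; AUC-ROC exposes it as a coin flip.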
| Metric | Value | Why It Matters |
|---|---|---|
| AUC-ROC | 0.87 | Rank-ordering: 87% of the time, a churner scores higher than a non-churner |
| F1 | 0.62 | Harmonic mean at default threshold (0.50) |
| Precision | 0.73 | 73% of flagged customers actually churn |
| Recall | 0.54 | 54% of actual churners caught (at 0.50); 78% at production threshold 0.35 |
Production threshold: 0.35. A missed churner costs ~$1,500 in LTV; an unnecessary retention offer costs ~$50. At that 30:1 cost ratio, we favor recall over precision.
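The threshold choice falls out of a simple expected-cost sweep. A sketch with synthetic scores (the $1,500/$50 costs come from this page; the score distributions are made up for illustration):

```python
import numpy as np

FN_COST, FP_COST = 1500, 50  # missed churner vs. wasted retention offer

def expected_cost(y_true, scores, threshold):
    """Total dollar cost of acting on everyone scoring at or above threshold."""
    flagged = scores >= threshold
    fp = np.sum(flagged & (y_true == 0))   # offers sent to customers who would stay
    fn = np.sum(~flagged & (y_true == 1))  # churners we failed to flag
    return fp * FP_COST + fn * FN_COST

# Toy scores: churners tend to score higher, but the distributions overlap
rng = np.random.default_rng(42)
y = np.array([0] * 800 + [1] * 200)
scores = np.clip(np.where(y == 1,
                          rng.normal(0.6, 0.2, 1000),
                          rng.normal(0.3, 0.2, 1000)), 0, 1)

costs = {t: expected_cost(y, scores, t) for t in (0.35, 0.50)}
# At a 30:1 cost ratio, false negatives dominate, so the lower
# threshold comes out cheaper despite flagging more non-churners.
```

The asymmetry is the whole argument: each false negative costs as much as 30 false positives, so the optimum sits well below the default 0.50.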
Architecture¶
```mermaid
flowchart LR
    A[API Request] --> B[Pydantic\nValidation]
    B --> C[ColumnTransformer]
    C --> D{StackingClassifier}
    D --> E[RandomForest]
    D --> F[GradientBoosting]
    D --> G[XGBoost]
    D --> H[LightGBM]
    E & F & G & H --> I[LogisticRegression\nMeta-Learner]
    I --> J[Prediction +\nRisk Level]
    J --> K{explain=true?}
    K -->|Yes| L[SHAP Values\n+93ms]
    K -->|No| M[JSON Response\n~103ms]
    L --> M
```
Why StackingClassifier: the four base learners capture complementary patterns, combining one bagging model (RandomForest) with three distinct boosting implementations (GradientBoosting, XGBoost, LightGBM). AUC improved from 0.84 (best single model) to 0.87. CV variance is tight (±0.006), confirming generalization over memorization. See ADR-003.
Operational¶
| Metric | Value | Context |
|---|---|---|
| Test Coverage | 90% (199 tests) | CI threshold: 85% |
| Docker Image | 342 MB | bankchurn:v3.5.0 on Artifact Registry (python:3.11-slim-bookworm) |
| Model Size | 4.1 MB | Joblib compress=3; includes preprocessor + 4 base learners + meta-learner |
| P50 / P95 Latency | 200ms / 410ms (GCP), 110ms / 140ms (AWS) | Through ingress, Locust smoke test (6 users) |
| SHAP | Lazy, CPU-only | ?explain=true adds ~4.5s (KernelExplainer); skipped by default |
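The "lazy" SHAP row boils down to a build-once pattern: the expensive explainer is constructed on the first `?explain=true` request and cached. A minimal dependency-free sketch (the `shap` call in the comment is illustrative, not the service's exact code):

```python
import threading

_cache = {}
_lock = threading.Lock()

def lazy_singleton(key, builder):
    """Build an expensive object once, on first use (double-checked locking)."""
    if key not in _cache:
        with _lock:                # guard against concurrent first requests
            if key not in _cache:
                _cache[key] = builder()
    return _cache[key]

# Sketch of the API layer's usage -- the shap import is deferred until the
# first explain=true request, keeping cold start and the default path fast:
# explainer = lazy_singleton(
#     "shap", lambda: shap.KernelExplainer(model.predict_proba, background))
```

Requests without `explain=true` never touch the builder, which is why the default path avoids the KernelExplainer cost entirely.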
Responsible AI¶
- Fairness: Disparate impact ratio and equal opportunity difference audited by Gender and Geography
- Drift: Evidently AI monitors PSI/KS per feature; alert fires if >30% features drift
- Validation: Pandera schemas reject invalid inputs (CreditScore ∈ [300, 850], Age > 0)
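The PSI check in the drift bullet can be sketched without Evidently. A minimal NumPy implementation; the decile binning and the 0.1/0.25 bands are common conventions, not necessarily this project's configuration:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference and a current sample.

    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 drifted.
    """
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))  # decile bins
    edges[0] = min(edges[0], actual.min())   # widen ends to cover both samples
    edges[-1] = max(edges[-1], actual.max())
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # avoid log(0)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
ref = rng.normal(650, 50, 5000)      # e.g. CreditScore at training time
same = rng.normal(650, 50, 5000)     # fresh sample, no drift
shifted = rng.normal(600, 50, 5000)  # one-sigma mean shift -> clear drift
```

A per-feature loop over such a score is what feeds the ">30% of features drifted" alert rule.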
Live Prediction¶
*Screenshots: a live BankChurn prediction (left) and its SHAP explanation (right).*
Try It¶
```bash
curl -s -X POST http://localhost:8001/predict \
  -H "Content-Type: application/json" \
  -d '{"CreditScore":650,"Geography":"France","Gender":"Male","Age":40,"Tenure":5,"Balance":60000,"NumOfProducts":2,"HasCrCard":1,"IsActiveMember":1,"EstimatedSalary":50000}' | python3 -m json.tool
```
Expected: churn_probability, risk_category (low/medium/high), churn_prediction (0/1)
```bash
curl -s -X POST "http://localhost:8001/predict?explain=true" \
  -H "Content-Type: application/json" \
  -d '{"CreditScore":450,"Geography":"Germany","Gender":"Female","Age":55,"Tenure":1,"Balance":0,"NumOfProducts":1,"HasCrCard":0,"IsActiveMember":0,"EstimatedSalary":30000}' | python3 -m json.tool
```
Expected: Same fields + feature_contributions (SHAP values per feature). This high-risk customer should show Age and NumOfProducts as top churn drivers.
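The risk_category field in both responses amounts to a banding over churn_probability. A sketch; only the 0.35 action threshold comes from this page, and the "high" cut-point is illustrative, not the service's actual value:

```python
def risk_category(p, act_threshold=0.35):
    """Map a churn probability to the risk label returned by /predict."""
    if p >= 0.70:              # illustrative cut-point for "high"
        return "high"
    if p >= act_threshold:     # production action threshold from the docs
        return "medium"
    return "low"
```

Banding keeps the API contract stable even if the underlying model (and its calibration) changes between versions.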
📄 Full Model Card — includes metric rationale, performance benchmarks, and production decision narrative.
Last Updated: March 2026 — v3.5.3

