
Monitoring Guide

A Prometheus + Grafana + MLflow + Evidently monitoring stack, deployed to both clouds (GKE and EKS).

Grafana Dashboard

Stack

| Component | Purpose | Access |
|---|---|---|
| Prometheus | Metrics collection (15s scrape) | `:9090` |
| Grafana | Auto-provisioned 26-panel dashboard (6 rows) | `:3000` |
| MLflow | Experiment tracking (9 runs, 3 projects) | `:5000` |
| Evidently | Data drift detection (PSI/KS) | All 3 projects |
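The PSI half of Evidently's drift check boils down to one formula over binned feature distributions. A minimal stdlib sketch of the Population Stability Index itself — not Evidently's API; the bin fractions below are illustrative:

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two binned distributions.

    expected/actual: lists of bin fractions (each summing to ~1.0).
    Zero bins are clamped to eps to keep the log finite.
    """
    e = [max(x, eps) for x in expected]
    a = [max(x, eps) for x in actual]
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Identical distributions -> PSI ~ 0; a shifted distribution -> PSI > 0.
baseline = [0.25, 0.25, 0.25, 0.25]
shifted  = [0.55, 0.15, 0.15, 0.15]
```

A common rule of thumb (an assumption here, not stated in this guide) treats PSI > 0.2 as significant drift.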

Prometheus Metrics

All APIs expose /metrics. Key metrics per service:

| Metric | BankChurn | NLPInsight | ChicagoTaxi |
|---|---|---|---|
| `{svc}_requests_total` | `bankchurn_requests_total` | `nlpinsight_requests_total` | `chicagotaxi_requests_total` |
| `{svc}_predictions_total` | `{risk_level}` label | `{sentiment}` label | `{demand_category}` label |
| `{svc}_request_*_seconds` | `_duration_seconds` (histogram) | `_duration_seconds` (histogram) | `_latency_seconds` (histogram) |
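The naming scheme above maps directly onto Prometheus' text exposition format. In production the `prometheus_client` library renders `/metrics`; the stdlib sketch below just illustrates what a labeled counter looks like on the wire (the function and sample counts are illustrative):

```python
def render_counter(name, help_text, samples):
    """Render a labeled counter in Prometheus text exposition format.

    samples: dict mapping a (label_name, label_value) pair to a count,
    e.g. {("risk_level", "HIGH"): 211}.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for (label, value), count in samples.items():
        lines.append(f'{name}{{{label}="{value}"}} {count}')
    return "\n".join(lines)

text = render_counter(
    "bankchurn_predictions_total",
    "Predictions served, by risk level",
    {("risk_level", "HIGH"): 211,
     ("risk_level", "MEDIUM"): 100,
     ("risk_level", "LOW"): 105},
)
```

Prometheus scrapes this plain-text output from each service every 15s.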

Real Production Counters (measured 2026-03-11, post load-test)

| Service | Total Requests | Predictions (breakdown) |
|---|---|---|
| BankChurn | 416 | HIGH=211 · MEDIUM=100 · LOW=105 |
| NLPInsight | 919 | neutral=1110 · negative=449 · positive=212 |
| ChicagoTaxi | 225 | 1709 demand predictions across all categories |

Prometheus Targets

MLflow Experiments (v3.5.3)

| Experiment | Best Run | Key Metric |
|---|---|---|
| BankChurn | StackingClassifier (RF+GB+XGB+LGB→LR) | AUC 0.87 |
| NLPInsight | TF-IDF + LogReg (prod) / FinBERT (GPU) | Acc 80.6% |

MLflow Experiments

MLflow Run Comparison — BankChurn Model Selection

Parallel-coordinates plot comparing five model runs (including StackingClassifier, RandomForest, LogisticRegression, and GradientBoosting) across hyperparameters and metrics. This visualization drove the selection of StackingClassifier as the production model (highest roc_auc at 0.86).
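The production stack can be sketched with scikit-learn alone. Note this is a dependency-light approximation, not the tuned run: the XGBoost and LightGBM base learners are omitted, and the estimator settings and synthetic data below are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn dataset.
X, y = make_classification(n_samples=600, n_features=12, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

# Base learners feed out-of-fold predictions into a LogisticRegression meta-model.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("gb", GradientBoostingClassifier(random_state=42)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, stack.predict_proba(X_te)[:, 1])
```

MLflow would log each base learner's hyperparameters plus the stack's `roc_auc` as one run, which is what the parallel-coordinates plot compares.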

MLflow Model Comparison

Screenshots: ML dashboard panels · load test results · P95 latency

SLOs

| Service | Availability | P95 Latency (in-pod) | P95 Latency (via ingress) | Error Rate |
|---|---|---|---|---|
| BankChurn | 99.9% | 111ms | 481ms | <1% |
| NLPInsight | 99.9% | 15ms | 9.6ms | <1% |
| ChicagoTaxi | 99.9% | 460ms | 35.5ms | <1% |

Load test baseline (2026-03-11): 2673 requests · 30 users · 120s · 0.07% error rate · 22.35 req/s aggregate throughput. Locust per-endpoint P50/P95: BankChurn predict 770ms/1700ms · NLPInsight predict 80ms/160ms · ChicagoTaxi demand 90ms/170ms.
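P50/P95 figures like the ones above can be recomputed from raw latency samples with the stdlib; a small sketch (the helper name and sample data are illustrative):

```python
import statistics

def p50_p95(latencies_ms):
    """Return (P50, P95) in milliseconds from raw latency samples."""
    # quantiles(n=100) returns 99 cut points; cuts[i] is the (i+1)th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return statistics.median(latencies_ms), cuts[94]

samples = list(range(1, 101))  # stand-in for per-request latencies
p50, p95 = p50_p95(samples)
```

Locust reports these per endpoint; tail percentiles (P95/P99) matter more than the mean for SLO tracking because a few slow requests dominate user experience.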

HPA Autoscaling

CPU-only scaling (memory is fixed for ML models loaded at startup):

| Service | CPU Target | Min/Max Pods | Memory (real) |
|---|---|---|---|
| BankChurn | 70% | 1–3 | ~344Mi |
| NLPInsight | 70% | 1–3 | ~283Mi |
| ChicagoTaxi | 70% | 1–3 | ~431Mi |
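The table above corresponds to a standard `autoscaling/v2` HorizontalPodAutoscaler. A sketch for BankChurn — the `metadata.name` and target Deployment name are assumptions, adjust to the actual manifests:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: bankchurn-hpa          # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: bankchurn-api        # hypothetical Deployment name
  minReplicas: 1
  maxReplicas: 3
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% average CPU
```

No memory metric is configured, matching the note above: model memory is allocated once at startup, so it never signals load.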

Smoke Tests (Post-Deploy Validation)

Automated smoke tests run on every push to main to validate production services:

```bash
# Manual execution (local)
BANKCHURN_URL=http://136.111.152.72/bankchurn \
NLPINSIGHT_URL=http://136.111.152.72/nlpinsight \
CHICAGOTAXI_URL=http://136.111.152.72/chicagotaxi \
pytest tests/infra/smoke/test_smoke_services.py -v

# CI/CD (automatic)
# Runs on every push to main via .github/workflows/ci-infra.yml
# Tests both GCP (136.111.152.72) and AWS (NLB DNS) endpoints
```

Test Coverage:

- ✅ Health endpoints (`/health`)
- ✅ Prediction endpoints (`/predict`, `/demand`)
- ✅ Prometheus metrics (`/metrics`)
- ✅ OpenAPI docs (`/docs`, `/openapi.json`)
- ✅ Response time validation (<2s)
- ✅ Error handling (4xx for invalid input)
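The health and latency checks above reduce to a few assertions per response. A hedged sketch of such a check, decoupled from HTTP so it runs offline — the helper name and the expected `{"status": "ok"}` payload shape are assumptions, not the actual suite's contract:

```python
import json

def validate_health(status_code, body_text, elapsed_s):
    """Smoke-test assertions for a /health response (hypothetical helper).

    status_code: HTTP status; body_text: raw response body;
    elapsed_s: round-trip time in seconds.
    """
    assert status_code == 200, f"unexpected status {status_code}"
    payload = json.loads(body_text)
    # Assumed payload shape; the real services may use a different key/value.
    assert payload.get("status") in ("ok", "healthy"), payload
    assert elapsed_s < 2.0, f"response too slow: {elapsed_s:.2f}s"
    return True
```

In the real suite, pytest would call each `*_URL` endpoint and feed the response into checks like these.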

Load Testing

```bash
pytest tests/integration/test_smoke_k8s.py -v    # Smoke tests (deprecated, use above)
locust -f tests/load/locustfile.py --headless    # Load tests
```

Runbook

  • High errors: Check /health, review logs (kubectl logs), verify model loaded
  • High latency: Scale up (kubectl scale), check concurrent requests
  • Service down: Check pod status, resource limits (OOM), restart count

Last Updated: March 2026 — v3.5.3