
Monitoring Guide

A Prometheus + Grafana + MLflow + Evidently monitoring stack, deployed to both clouds (GKE and EKS).

Grafana Dashboard

Stack

| Component | Purpose | Access |
|---|---|---|
| Prometheus | Metrics collection (15s scrape) | `:9090` |
| Grafana | Auto-provisioned 26-panel dashboard (6 rows) | `:3000` |
| MLflow | Experiment tracking (9 runs, 3 projects) | `:5000` |
| Evidently | Data drift detection (PSI/KS) | All 3 projects |
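The PSI half of Evidently's drift check boils down to one formula over binned feature distributions. A minimal stdlib sketch of the Population Stability Index itself — not Evidently's API; the bin fractions below are illustrative:

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two binned distributions.

    expected/actual: lists of bin fractions (each summing to ~1.0).
    Zero bins are clamped to eps to keep the log finite.
    """
    e = [max(x, eps) for x in expected]
    a = [max(x, eps) for x in actual]
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Identical distributions -> PSI ~ 0; a shifted distribution -> PSI > 0.
baseline = [0.25, 0.25, 0.25, 0.25]
shifted  = [0.55, 0.15, 0.15, 0.15]
```

A common rule of thumb (an assumption here, not stated in this guide) treats PSI > 0.2 as significant drift.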

Prometheus Metrics

All APIs expose /metrics. Key metrics per service:

| Metric | BankChurn | NLPInsight | ChicagoTaxi |
|---|---|---|---|
| `{svc}_requests_total` | `bankchurn_requests_total` | `nlpinsight_requests_total` | `chicagotaxi_requests_total` |
| `{svc}_predictions_total` | `{risk_level}` label | `{sentiment}` label | `{demand_category}` label |
| `{svc}_request_*_seconds` | `_duration_seconds` (histogram) | `_duration_seconds` (histogram) | `_latency_seconds` (histogram) |
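The naming scheme above maps directly onto Prometheus' text exposition format. In production the `prometheus_client` library renders `/metrics`; the stdlib sketch below just illustrates what a labeled counter looks like on the wire (the function and sample counts are illustrative):

```python
def render_counter(name, help_text, samples):
    """Render a labeled counter in Prometheus text exposition format.

    samples: dict mapping a (label_name, label_value) pair to a count,
    e.g. {("risk_level", "HIGH"): 211}.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for (label, value), count in samples.items():
        lines.append(f'{name}{{{label}="{value}"}} {count}')
    return "\n".join(lines)

text = render_counter(
    "bankchurn_predictions_total",
    "Predictions served, by risk level",
    {("risk_level", "HIGH"): 211,
     ("risk_level", "MEDIUM"): 100,
     ("risk_level", "LOW"): 105},
)
```

Prometheus scrapes this plain-text output from each service every 15s.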

Real Production Counters (measured 2026-03-11, post load-test)

| Service | Total Requests | Predictions (breakdown) |
|---|---|---|
| BankChurn | 416 | HIGH=211 · MEDIUM=100 · LOW=105 |
| NLPInsight | 919 | neutral=1110 · negative=449 · positive=212 |
| ChicagoTaxi | 225 | 1709 demand predictions across all categories |

Prometheus Targets

MLflow Experiments (v3.5.3)

| Experiment | Best Run | Key Metric |
|---|---|---|
| BankChurn | StackingClassifier (RF+GB+XGB+LGB→LR) | AUC 0.87 |
| NLPInsight | TF-IDF + LogReg (prod) / FinBERT (GPU) | Acc 80.6% |

MLflow Experiments

MLflow Run Comparison — BankChurn Model Selection

Parallel-coordinates plot comparing five model runs (including StackingClassifier, RandomForest, LogisticRegression, and GradientBoosting) across hyperparameters and metrics. This visualization drove the selection of StackingClassifier as the production model (highest roc_auc at 0.86).
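The production stack can be sketched with scikit-learn alone. Note this is a dependency-light approximation, not the tuned run: the XGBoost and LightGBM base learners are omitted, and the estimator settings and synthetic data below are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn dataset.
X, y = make_classification(n_samples=600, n_features=12, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

# Base learners feed out-of-fold predictions into a LogisticRegression meta-model.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("gb", GradientBoostingClassifier(random_state=42)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, stack.predict_proba(X_te)[:, 1])
```

MLflow would log each base learner's hyperparameters plus the stack's `roc_auc` as one run, which is what the parallel-coordinates plot compares.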

MLflow Model Comparison

Screenshots: ML dashboard panels · load test results · P95 latency

SLOs

| Service | Availability | P95 Latency (in-pod) | P95 Latency (via ingress) | Error Rate |
|---|---|---|---|---|
| BankChurn | 99.9% | 111ms | 481ms | <1% |
| NLPInsight | 99.9% | 15ms | 9.6ms | <1% |
| ChicagoTaxi | 99.9% | 460ms | 35.5ms | <1% |

Load test baseline (2026-03-11): 2673 requests · 30 users · 120s · 0.07% error rate · 22.35 req/s aggregate throughput. Locust per-endpoint P50/P95: BankChurn predict 770ms/1700ms · NLPInsight predict 80ms/160ms · ChicagoTaxi demand 90ms/170ms.
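P50/P95 figures like the ones above can be recomputed from raw latency samples with the stdlib; a small sketch (the helper name and sample data are illustrative):

```python
import statistics

def p50_p95(latencies_ms):
    """Return (P50, P95) in milliseconds from raw latency samples."""
    # quantiles(n=100) returns 99 cut points; cuts[i] is the (i+1)th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return statistics.median(latencies_ms), cuts[94]

samples = list(range(1, 101))  # stand-in for per-request latencies
p50, p95 = p50_p95(samples)
```

Locust reports these per endpoint; tail percentiles (P95/P99) matter more than the mean for SLO tracking because a few slow requests dominate user experience.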

HPA Autoscaling

CPU-only scaling (memory is fixed for ML models loaded at startup):

| Service | CPU Target | Min/Max Pods | Memory (real) |
|---|---|---|---|
| BankChurn | 70% | 1–3 | ~344Mi |
| NLPInsight | 70% | 1–3 | ~283Mi |
| ChicagoTaxi | 70% | 1–3 | ~431Mi |
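The table above corresponds to a standard `autoscaling/v2` HorizontalPodAutoscaler. A sketch for BankChurn — the `metadata.name` and target Deployment name are assumptions, adjust to the actual manifests:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: bankchurn-hpa          # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: bankchurn-api        # hypothetical Deployment name
  minReplicas: 1
  maxReplicas: 3
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% average CPU
```

No memory metric is configured, matching the note above: model memory is allocated once at startup, so it never signals load.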

Smoke Tests (Post-Deploy Validation)

Automated smoke tests run on every push to main to validate production services:

```bash
# Manual execution (local)
BANKCHURN_URL=http://136.111.152.72/bankchurn \
NLPINSIGHT_URL=http://136.111.152.72/nlpinsight \
CHICAGOTAXI_URL=http://136.111.152.72/chicagotaxi \
pytest tests/infra/smoke/test_smoke_services.py -v

# CI/CD (automatic)
# Runs on every push to main via .github/workflows/ci-infra.yml
# Tests both GCP (136.111.152.72) and AWS (NLB DNS) endpoints
```

Test Coverage:

- ✅ Health endpoints (`/health`)
- ✅ Prediction endpoints (`/predict`, `/demand`)
- ✅ Prometheus metrics (`/metrics`)
- ✅ OpenAPI docs (`/docs`, `/openapi.json`)
- ✅ Response time validation (<2s)
- ✅ Error handling (4xx for invalid input)
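The health and latency checks above reduce to a few assertions per response. A hedged sketch of such a check, decoupled from HTTP so it runs offline — the helper name and the expected `{"status": "ok"}` payload shape are assumptions, not the actual suite's contract:

```python
import json

def validate_health(status_code, body_text, elapsed_s):
    """Smoke-test assertions for a /health response (hypothetical helper).

    status_code: HTTP status; body_text: raw response body;
    elapsed_s: round-trip time in seconds.
    """
    assert status_code == 200, f"unexpected status {status_code}"
    payload = json.loads(body_text)
    # Assumed payload shape; the real services may use a different key/value.
    assert payload.get("status") in ("ok", "healthy"), payload
    assert elapsed_s < 2.0, f"response too slow: {elapsed_s:.2f}s"
    return True
```

In the real suite, pytest would call each `*_URL` endpoint and feed the response into checks like these.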

Load Testing

```bash
pytest tests/integration/test_smoke_k8s.py -v    # Smoke tests (deprecated, use above)
locust -f tests/load/locustfile.py --headless    # Load tests
```

Runbook

  • High errors: Check /health, review logs (kubectl logs), verify model loaded
  • High latency: Scale up (kubectl scale), check concurrent requests
  • Service down: Check pod status, resource limits (OOM), restart count

Last Updated: March 2026 — v3.5.3