Skip to content

Load Test Results — Production Performance Benchmarks

Date: 2026-03-18 | Tool: Locust | Infrastructure: GKE (GCP) + EKS (AWS)

Test Configuration

Parameter Smoke Load Stress
Users 6 50 100
Ramp rate 6/s 10/s 20/s
Duration 30s 2m 2m
SLA: error rate < 1% < 1% < 5%
SLA: p95 latency < 500ms < 800ms best-effort

All tests run via NGINX ingress with randomized payloads simulating production traffic patterns.


GCP (GKE) — us-central1

Ingress: http://136.111.152.72

Smoke Test (6 users, 30s) — Idle Latencies

Service Endpoint p50 p95 Errors
BankChurn POST /predict 200ms 410ms 0%
NLPInsight POST /predict 78ms 140ms 0%
ChicagoTaxi GET /demand 100ms 400ms 0%

Load Test (50 users, 2 min) — Production Baseline

Service Endpoint p50 p95 Errors Throughput
BankChurn POST /predict 3100ms 6500ms 0% 4.9 req/s
NLPInsight POST /predict 84ms 570ms 0% 9.0 req/s
ChicagoTaxi GET /demand 110ms 4900ms 0% 4.1 req/s

Stress Test (100 users, 2 min) — Breaking Point

Service Endpoint p50 p95 Errors Throughput
BankChurn POST /predict 8200ms 19000ms 0.02% 3.6 req/s
NLPInsight POST /predict 79ms 170ms 0% 19.6 req/s
ChicagoTaxi GET /demand 100ms 230ms 0% 10.2 req/s

AWS (EKS) — us-east-1

Ingress: NLB + NGINX Ingress Controller

Smoke Test (6 users, 30s) — Idle Latencies

Service Endpoint p50 p95 Errors
BankChurn POST /predict 110ms 140ms 0%
NLPInsight POST /predict 100ms 120ms 0%
ChicagoTaxi GET /demand 120ms 230ms 0%

Load Test (50 users, 2 min) — Production Baseline

Service Endpoint p50 p95 Errors Throughput
BankChurn POST /predict 120ms 200ms 0% 14.1 req/s
NLPInsight POST /predict 100ms 180ms 0% 10.5 req/s
ChicagoTaxi GET /demand 130ms 210ms 0% 5.2 req/s

Stress Test (100 users, 2 min) — Breaking Point

Service Endpoint p50 p95 Errors Throughput
BankChurn POST /predict 130ms 5100ms 0% 24.1 req/s
NLPInsight POST /predict 100ms 290ms 0% 18.1 req/s
ChicagoTaxi GET /demand 130ms 320ms 0% 8.7 req/s

Key Findings

1. Async Inference Eliminates BankChurn Failures

Metric Before (sync, 2 workers) After (async, 1 worker)
Stress errors (100 users) 81% 0.02% (GCP) / 0% (AWS)
CPU limit 2000m 1000m
Memory per pod ~600Mi (2 processes) ~300Mi (1 process)

See ADR-015 for implementation details.

2. AWS Outperforms GCP Under Load

AWS EKS shows significantly lower latencies under concurrent load, likely due to: - NLB (Layer 4) vs GKE's NGINX ingress (Layer 7) — less overhead - EC2 instance types with higher single-thread performance - Better CPU burst capacity on EKS managed nodes

3. NLPInsight and ChicagoTaxi Are Stable

Both services maintain sub-200ms p50 latencies even under 100 concurrent users on both clouds. No optimization needed.

4. BankChurn Latency Is Expected

BankChurn's StackingClassifier (5 base learners + meta-learner) is inherently more computationally expensive than single-model services. The 200ms idle latency is appropriate for the model complexity and the bank churn prediction use case (analyst tool, not real-time). See ADR-003.


Infrastructure Configuration

Service Workers CPU Limit HPA Target Thread Pool
BankChurn 1 1000m 50% 4 threads (async)
NLPInsight 1 1000m 60% N/A
ChicagoTaxi 1 750m 60% N/A

See ADR-014 for the single-worker pod pattern rationale.