
Troubleshooting Guide

Quick Diagnostics

docker compose ps                    # Container status
kubectl get pods -n ml-portfolio     # K8s pod status
curl localhost:8001/health           # API health
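The three checks above can be bundled into one function, so a missing tool (for example, no kubectl on a laptop that only runs Compose) does not abort the run. This is a minimal sketch. The function names and the NAMESPACE/API_URL variables are hypothetical, and the defaults assume the ml-portfolio namespace and port 8001 used above.

```shell
# Sketch: run the quick diagnostics, skipping any tool that is not installed.
# NAMESPACE and API_URL are hypothetical overrides; defaults match the commands above.
NAMESPACE="${NAMESPACE:-ml-portfolio}"
API_URL="${API_URL:-http://localhost:8001/health}"

# Run a command only if its executable is on PATH; otherwise print a SKIP line.
run_if_available() {
  if command -v "$1" >/dev/null 2>&1; then
    "$@"
  else
    echo "SKIP: $1 not installed"
  fi
}

quick_diagnostics() {
  echo "== Containers =="
  run_if_available docker compose ps
  echo "== Pods =="
  run_if_available kubectl get pods -n "$NAMESPACE"
  echo "== API health =="
  # -f: non-zero exit on HTTP errors; --max-time: don't hang on a dead port.
  run_if_available curl -fsS --max-time 5 "$API_URL" || echo "API health check failed"
}
```

Source the file and call quick_diagnostics. Each section prints either the tool's output or a SKIP line, so the same command works on a Compose-only machine and on one with cluster access.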

Common Issues

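Problem | Cause | Fix
--- | --- | ---
Container won't start | Model files missing | Run scripts/setup_demo_models.sh
Port already in use | Another process is bound to the port | lsof -i :8001 to find the PID, then kill <PID>
OOMKilled | Container exceeded its memory limit | Raise the memory limit in the K8s resources or the compose service
Model fails to load | scikit-learn version differs from the one used for training | Retrain with scripts/train_production_models.py
422 on predict | Request failed input validation | Check the Pydantic schema and required fields
500 on predict | Feature mismatch between training and inference | Verify that inference features match training (names and column order)
Slow predictions | SHAP explanation overhead | Pass ?explain=true only when explanations are needed
Tests fail in CI but pass locally | Python version differs | Use Python 3.11 and compare pip freeze output with CI
Coverage below threshold | Untested code | pytest --cov --cov-report=html, then open htmlcov/index.html
Gitleaks false positive | Non-secret string flagged | Add the finding's fingerprint to .gitleaksignore
MLflow connection refused | MLflow not running or not reachable | kubectl port-forward svc/mlflow-service 5000:5000
Docker build stuck | Stale cache or network issue | docker builder prune -f && docker compose build --no-cache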
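For the port-in-use row, the lookup and the kill can be combined into one step. This is a minimal sketch assuming lsof is installed; free_port is a hypothetical helper name. It sends SIGTERM rather than SIGKILL so the process gets a chance to shut down cleanly.

```shell
# Sketch: stop whatever is listening on a TCP port (default 8001, the API port).
# free_port is a hypothetical helper; requires lsof.
free_port() {
  port="${1:-8001}"
  if ! command -v lsof >/dev/null 2>&1; then
    echo "lsof not installed" >&2
    return 2
  fi
  # -t prints PIDs only; -sTCP:LISTEN restricts matches to listening sockets.
  pids=$(lsof -t -iTCP:"$port" -sTCP:LISTEN 2>/dev/null)
  if [ -z "$pids" ]; then
    echo "No process is listening on port $port"
    return 0
  fi
  # Unquoted $pids is intentional: one argument per PID.
  echo "Sending SIGTERM to PID(s) on port $port:" $pids
  kill $pids
}
```

Run free_port 8001, then retry docker compose up. If the process ignores SIGTERM, fall back to kill -9 <PID> manually.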

Kubernetes Debugging

kubectl describe pod <pod> -n ml-portfolio                     # Pod events
kubectl logs <pod> -n ml-portfolio                             # App logs
kubectl logs <pod> -c init-download-model -n ml-portfolio      # Init container logs
kubectl top pods -n ml-portfolio                               # Resource usage (needs metrics-server)
kubectl rollout undo deployment/<deployment> -n ml-portfolio   # Roll back to the previous revision
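When several pods are misbehaving, the describe and logs steps above can be looped over only the unhealthy ones. This is a sketch under assumptions: unhealthy_pods and debug_unhealthy are hypothetical names, health is inferred from the READY and STATUS columns of kubectl get pods --no-headers, and the last 20 lines of describe output are assumed to cover the events section.

```shell
# Sketch: find pods that need attention and dump their events and logs.
# Helper names are hypothetical; NAMESPACE defaults to the one used above.
NAMESPACE="${NAMESPACE:-ml-portfolio}"

# Reads `kubectl get pods --no-headers` lines (NAME READY STATUS ...) on stdin and
# prints pod names whose STATUS is not Running/Completed, or which are Running
# with fewer ready containers than total (e.g. READY 1/2).
unhealthy_pods() {
  awk '{
    split($2, ready, "/")
    if ($3 == "Completed") next
    if ($3 != "Running" || ready[1] != ready[2]) print $1
  }'
}

debug_unhealthy() {
  for pod in $(kubectl get pods -n "$NAMESPACE" --no-headers | unhealthy_pods); do
    echo "===== $pod ====="
    # Events are listed at the end of `describe` output.
    kubectl describe pod "$pod" -n "$NAMESPACE" | tail -n 20
    # Fetch recent logs from every container in the pod.
    kubectl logs "$pod" -n "$NAMESPACE" --all-containers --tail=50
  done
}
```

Filtering on the text columns keeps the helper free of jq. A JSONPath query over .status.containerStatuses would be more robust if the column layout ever changes.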

Dependency Issues

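  • scikit-learn mismatch: the model was saved with a different scikit-learn version than the one installed; retrain, or install the version it was trained with
  • NumPy/pandas: pin exact versions in requirements.txt to avoid incompatible upgrades
  • pip conflicts: run pip check to list broken requirements, and pipdeptree to see which packages pull them in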
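To confirm whether an installed package has drifted from its pin (the usual cause of the scikit-learn mismatch above), a sketch like the following compares requirements.txt with the active environment. It assumes exact pins of the form name==version and uses importlib.metadata (Python 3.8+) through python3 to read installed versions. The helper names are hypothetical.

```shell
# Sketch: compare exact pins in requirements.txt with installed versions.
# Assumes pins look like "name==X.Y.Z"; helper names are hypothetical.

# Print the pinned version of a package, or nothing if it is not pinned with ==.
pinned_version() {
  pkg="$1"; req="${2:-requirements.txt}"
  # Field 3 after splitting on '=' is the version; strip comments/markers after it.
  grep -iE "^${pkg}==" "$req" | head -n 1 | cut -d '=' -f 3 | sed 's/[[:space:];#].*//'
}

# Print the installed distribution version (empty if not installed).
installed_version() {
  python3 -c 'import sys; from importlib.metadata import version; print(version(sys.argv[1]))' "$1" 2>/dev/null
}

check_pin() {
  pkg="$1"; req="${2:-requirements.txt}"
  want=$(pinned_version "$pkg" "$req")
  have=$(installed_version "$pkg")
  if [ -z "$want" ]; then
    echo "$pkg: not pinned in $req"
    return 1
  fi
  if [ "$want" = "$have" ]; then
    echo "$pkg: OK ($have)"
  else
    echo "$pkg: pinned $want, installed ${have:-none}"
    return 1
  fi
}
```

Run it inside the same virtual environment the API uses, for example check_pin scikit-learn; check_pin numpy; check_pin pandas.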

Last Updated: March 2026