Reproducibility Guide

Environment

Component      Version
Python         3.11.14
scikit-learn   1.8.0
LightGBM       4.6+
Transformers   4.48+
SHAP           0.50.0
MLflow         3.10

Reproduce Training

# 1. Create environment
conda create -n ml-py311 python=3.11.14 -y
conda activate ml-py311

# 2. Install dependencies (per project)
pip install -r BankChurn-Predictor/requirements.txt
pip install -r requirements.txt
pip install -r NLPInsight-Analyzer/requirements.txt

# 3. Train all models
python scripts/train_production_models.py

# 4. Verify metrics
# BankChurn: AUC 0.87, F1 0.62 (StackingClassifier)
# NLPInsight: Acc 80.6%, F1-macro 0.748 (TF-IDF + LogReg)
# ChicagoTaxi: R² 0.96, RMSE 7.87 (RandomForest)
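Step 4 can be automated with a small check. The sketch below compares metrics from a fresh training run against the reference values listed in the comments above. The `EXPECTED` layout, the `check_metrics` helper, and the 2% relative tolerance are assumptions made for illustration; this guide does not state how much drift is acceptable or how the training script reports its metrics.

```python
# Reference metrics copied from the "Verify metrics" comments in this guide.
EXPECTED = {
    "BankChurn": {"auc": 0.87, "f1": 0.62},
    "NLPInsight": {"accuracy": 0.806, "f1_macro": 0.748},
    "ChicagoTaxi": {"r2": 0.96, "rmse": 7.87},
}


def check_metrics(project, observed, rel_tol=0.02):
    """Return a list of mismatch messages (empty if every metric is in range).

    rel_tol is a hypothetical tolerance: the reference values are rounded,
    so an exact comparison would fail on a correct run.
    """
    problems = []
    for name, expected in EXPECTED[project].items():
        if name not in observed:
            problems.append(f"{project}: missing metric '{name}'")
            continue
        got = observed[name]
        if abs(got - expected) > rel_tol * abs(expected):
            problems.append(f"{project}: {name}={got:.4f}, expected ~{expected}")
    return problems


if __name__ == "__main__":
    # Example values only; substitute the numbers your own run reports.
    print(check_metrics("BankChurn", {"auc": 0.871, "f1": 0.618}))
```

An empty list means the run matches the reference within the chosen tolerance. A tighter tolerance is reasonable once you know how stable each metric is across machines.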

Random Seeds

All models use random_state=42 for reproducibility.
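Note that `random_state=42` only controls the estimators and splitters it is passed to. Python's `random` module and NumPy's global generator are seeded separately. The sketch below shows one way to fix those as well. The `seed_everything` helper is hypothetical, and this guide does not say whether the training script does anything similar.

```python
import os
import random

SEED = 42  # same value this guide uses for random_state


def seed_everything(seed=SEED):
    """Seed Python's RNG and, if it is installed, NumPy's global RNG."""
    # Affects only subprocesses started after this call; hash randomization
    # for the current interpreter is fixed at startup.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    try:
        import numpy as np  # optional: skipped when NumPy is not available
        np.random.seed(seed)
    except ImportError:
        pass
    return seed


if __name__ == "__main__":
    seed_everything()
    first = [random.random() for _ in range(3)]
    seed_everything()
    second = [random.random() for _ in range(3)]
    print(first == second)  # identical draws after re-seeding
```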

Model Artifacts

Project       Path                  Size     Format
BankChurn     models/model.joblib   4.1 MB   Joblib (compress=3)
NLPInsight    models/model.joblib   ~5 MB    Joblib (compress=3)
ChicagoTaxi   models/model.joblib   ~2 MB    Joblib (compress=3)
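After training, a quick sanity check is to confirm that each artifact exists and is roughly the size listed above. The sketch below assumes that `models/model.joblib` is relative to each project directory and that "MB" means 10^6 bytes. Neither is stated explicitly in this guide, and `artifact_size_mb` is a hypothetical helper.

```python
from pathlib import Path

# Relative artifact path from the table above (assumed to be per project).
ARTIFACT_RELPATH = Path("models") / "model.joblib"


def artifact_size_mb(project_dir):
    """Return the artifact size in MB (10^6 bytes), or None if it is missing."""
    artifact = Path(project_dir) / ARTIFACT_RELPATH
    if not artifact.is_file():
        return None
    return artifact.stat().st_size / 1_000_000


if __name__ == "__main__":
    # BankChurn-Predictor is the directory named in the install step;
    # adjust the name for the other projects.
    size = artifact_size_mb("BankChurn-Predictor")
    print("missing" if size is None else f"{size:.1f} MB")
```

Loading the artifact itself is typically done with `joblib.load(path)`. It is safest to do this inside the environment described above, since pickled estimators are tied to the library versions that produced them.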

Data Sources

Project       Dataset                  Rows        Source
BankChurn     Churn_Modelling.csv      10,000      Kaggle
NLPInsight    Twitter Financial News   11,931      HuggingFace (zeroshot)
ChicagoTaxi   Taxi Trips (2013-2023)   6,364,313   Chicago Open Data Portal
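Row counts are a cheap way to confirm that a downloaded dataset matches the one used here. The sketch below counts records in a CSV file using only the standard library. It assumes the "Rows" column counts data records and excludes the header line, and it covers only the dataset this guide names as a CSV file. `count_data_rows` and the `EXPECTED_ROWS` mapping are illustrative names, and the file's location on disk is left to the caller because this guide does not give it.

```python
import csv

# Expected record count from the table above (header row assumed excluded).
EXPECTED_ROWS = {"Churn_Modelling.csv": 10_000}


def count_data_rows(csv_path):
    """Count records in a CSV file, not counting the header row."""
    with open(csv_path, newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh)
        next(reader, None)  # skip the header if the file is non-empty
        return sum(1 for _ in reader)


def matches_expected(csv_path, filename):
    """Compare a file's record count with the documented value."""
    return count_data_rows(csv_path) == EXPECTED_ROWS[filename]
```

Using `csv.reader` rather than counting newlines keeps the count correct when quoted fields contain line breaks.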

MLflow Tracking

# Local
export MLFLOW_TRACKING_URI=file:./mlruns

# GKE
kubectl port-forward svc/mlflow-service 5000:5000 -n ml-portfolio
export MLFLOW_TRACKING_URI=http://localhost:5000
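MLflow reads `MLFLOW_TRACKING_URI` from the environment, so it is easy to log to the wrong store if the variable was exported in a different shell. The sketch below reports which target a script would use before training starts. The `tracking_target` helper is hypothetical, and falling back to `file:./mlruns` when the variable is unset is a choice made in this sketch to mirror the local setting above, not a statement about MLflow's own default.

```python
import os
from urllib.parse import urlparse

# Fallback used by this sketch only: the local value shown above.
DEFAULT_URI = "file:./mlruns"


def tracking_target(env=None):
    """Return (uri, kind) for the tracking store a run would log to."""
    env = os.environ if env is None else env
    uri = env.get("MLFLOW_TRACKING_URI", DEFAULT_URI)
    scheme = urlparse(uri).scheme
    kind = "remote server" if scheme in ("http", "https") else "local file store"
    return uri, kind


if __name__ == "__main__":
    uri, kind = tracking_target()
    print(f"Logging to {uri} ({kind})")
```

With the GKE port-forward running and the second `export` applied, this should report the `http://localhost:5000` URI as a remote server.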

Last Updated: March 2026 — v3.5.3