ADR-002: emptyDir + Init Container for Model Storage in Kubernetes
- Status: Accepted
- Date: 2026-02-18
- Authors: Duque Ortega Mutis
TL;DR: Models are downloaded from cloud object storage (GCS/S3) into ephemeral emptyDir volumes at pod startup via Init Containers. This decouples model versioning from Docker images: a model update is a ConfigMap change, not an image rebuild.
Context
Each ML service requires its trained model artifact at startup. The artifacts are small:
| Service | Artifact | Size | Format |
|---|---|---|---|
| BankChurn | StackingClassifier pipeline | 4.1 MB | model.joblib |
| NLPInsight | TF-IDF + LogReg (production) / FinBERT (GPU backend) | ~5 MB / ~440 MB | model.joblib / model.tar.gz |
| ChicagoTaxi | RandomForest + predictions | ~2 MB | model.joblib |
The key design question: where should the model live relative to the container lifecycle?
Decision
Use emptyDir volumes with Init Containers that download models from cloud object storage (GCS on GKE, S3 on EKS) before the main container starts.
Architecture
Pod startup
│
├─ Init Container (python:3.11-alpine, 50MB)
│   ├─ Reads GCS_BUCKET, GCS_MODEL_PATH from ConfigMap
│   ├─ Downloads model.joblib → /models/model.joblib
│   └─ Exits (container destroyed, volume persists)
│
└─ Main Container (FastAPI app)
    └─ Reads /models/model.joblib from shared emptyDir volume
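As a rough illustration of the layout above, the pod spec can be sketched as a plain Python dict and serialized to JSON for embedding in a Deployment template. The names `build_pod_spec`, `model-store`, and `download-model`, the script location `/scripts/download-model.py`, and the choice to pass `MODEL_PATH` to the main container are assumptions made for this sketch, not values taken from the project's manifests.

```python
import json


def build_pod_spec(service: str, app_image: str, configmap_name: str) -> dict:
    """Sketch of a pod spec: init container and app share one emptyDir volume."""
    mount = {"name": "model-store", "mountPath": "/models"}
    return {
        # emptyDir lives as long as the pod, so it outlives the init container.
        "volumes": [{"name": "model-store", "emptyDir": {}}],
        "initContainers": [
            {
                "name": "download-model",
                "image": "python:3.11-alpine",
                # Hypothetical command: assumes the script and its client
                # library are already available inside the image.
                "command": ["python", "/scripts/download-model.py"],
                # Bucket and path settings come from the service's ConfigMap.
                "envFrom": [{"configMapRef": {"name": configmap_name}}],
                "volumeMounts": [mount],
            }
        ],
        "containers": [
            {
                "name": service,
                "image": app_image,
                "env": [{"name": "MODEL_PATH", "value": "/models/model.joblib"}],
                # The app only reads the model, so mount the volume read-only.
                "volumeMounts": [{**mount, "readOnly": True}],
            }
        ],
    }


def to_json(spec: dict) -> str:
    """Serialize the spec for inspection or templating."""
    return json.dumps(spec, indent=2)
```

Because Kubernetes runs init containers to completion before starting the main container, the app in this sketch would only start once the download step has exited successfully.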
Model Path Configuration
Each service has a dedicated ConfigMap (k8s/model-configmaps.yaml) specifying:
- GCS_BUCKET / S3_BUCKET — cloud storage bucket name
- GCS_MODEL_PATH / S3_MODEL_PATH — path within bucket (e.g., bankchurn/model.joblib)
- LOCAL_MODEL_PATH — mount path inside pod (/models/model.joblib)
Updating a model takes two steps: upload the new artifact to GCS/S3, then run kubectl rollout restart deployment/<service>. No Docker rebuild is needed.
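The ConfigMap keys listed above reach the Init Container as environment variables. A minimal sketch of how a script might read them follows; the `ModelSource` dataclass, the `load_model_source` function, and the rule of preferring GCS when both key sets are present are assumptions for illustration, not the behavior of scripts/download-model.py.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ModelSource:
    provider: str      # "gcs" or "s3"
    bucket: str        # bucket name
    object_path: str   # path within the bucket
    local_path: str    # destination inside the pod


def load_model_source(env: Mapping[str, str] = os.environ) -> ModelSource:
    """Build a ModelSource from the ConfigMap-provided environment variables."""
    local_path = env.get("LOCAL_MODEL_PATH", "/models/model.joblib")
    # Assumption: check GCS first, then S3; the first complete pair wins.
    for provider, prefix in (("gcs", "GCS"), ("s3", "S3")):
        bucket = env.get(f"{prefix}_BUCKET")
        object_path = env.get(f"{prefix}_MODEL_PATH")
        if bucket and object_path:
            return ModelSource(provider, bucket, object_path, local_path)
    raise RuntimeError(
        "Set GCS_BUCKET/GCS_MODEL_PATH or S3_BUCKET/S3_MODEL_PATH"
    )
```

Passing the environment in as a mapping keeps the function easy to exercise without a cluster, for example `load_model_source({"S3_BUCKET": "example-bucket", "S3_MODEL_PATH": "bankchurn/model.joblib"})`, where the bucket name is a placeholder.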
Alternatives Considered
| Option | Cost | Startup Overhead | Model Update Strategy | Verdict |
|---|---|---|---|---|
| PersistentVolumeClaim (PVC) | ~$10/mo per PV | None (already mounted) | Upload to PV (requires write access) | Rejected — persistent cost for <10MB artifacts; complicates multi-cloud (PV provisioners differ) |
| Bake into Docker image | $0 | None | Full Docker rebuild + push + rolling update | Rejected — couples model version to image version; 10-min rebuild cycle for a 4MB file change |
| emptyDir + Init Container ✅ | ~$0.00005/startup | 2-5s (GCS/S3 download) | ConfigMap change + rollout restart | Selected — zero persistent cost, decoupled versioning |
| CSI ephemeral volume (GCS FUSE / Mountpoint for S3) | $0 | 1-2s (FUSE mount) | Automatic (reads latest from bucket) | Deferred — requires CSI driver installation; adds cluster dependency |
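To put the Cost column in perspective, the two figures from the table can be combined into a break-even restart count. This is a back-of-the-envelope sketch that counts only the approximate per-startup and per-PV dollar figures quoted above; startup latency and storage-availability risk are not priced in, and the function name `monthly_download_cost` is introduced here for illustration.

```python
from decimal import Decimal

PVC_MONTHLY_COST = Decimal("10")       # "~$10/mo per PV" from the table
COST_PER_STARTUP = Decimal("0.00005")  # "~$0.00005/startup" from the table


def monthly_download_cost(restarts_per_month: int) -> Decimal:
    """Approximate monthly download cost for one pod at a given restart rate."""
    return COST_PER_STARTUP * restarts_per_month


# Number of pod startups per month at which downloads would cost as much
# as a single PV under these two figures.
break_even_restarts = PVC_MONTHLY_COST / COST_PER_STARTUP
```

Under these assumptions the break-even point is 200,000 startups per month per PV.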
Why Not Bake Models into Docker Images?
In production ML, model release cadence ≠ code release cadence. Models may be retrained weekly (via drift detection — see ADR-006), while application code changes monthly. Coupling them forces unnecessary image rebuilds and increases deployment risk.
Implementation Details
Init Container (scripts/download-model.py):
- Uses google-cloud-storage (GCS) or boto3 (S3) — no gcloud/awscli SDK bloat
- 3 retries with 10s exponential backoff
- Validates downloaded file size > 0 before exiting
- NLPInsight handles model.tar.gz extraction (transformer model directory)
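The retry and validation behavior in the list above can be sketched with the standard library alone. This is not the contents of scripts/download-model.py: `download_with_retries` and its `fetch` and `sleep` parameters are hypothetical, and the sketch reads "3 retries with 10s exponential backoff" as one initial attempt plus three retries with delays of 10s, 20s, and 40s.

```python
import time
from pathlib import Path
from typing import Callable, Optional


def download_with_retries(
    fetch: Callable[[Path], None],
    dest: Path,
    retries: int = 3,
    base_delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Path:
    """Call fetch(dest), retrying with exponential backoff until dest is non-empty.

    fetch stands in for the cloud client call, e.g. a wrapper around
    google-cloud-storage's Blob.download_to_filename or boto3's
    S3 client download_file.
    """
    dest.parent.mkdir(parents=True, exist_ok=True)
    last_error: Optional[Exception] = None
    for attempt in range(retries + 1):
        try:
            fetch(dest)
            # Treat a zero-byte file as a failed download.
            if dest.stat().st_size > 0:
                return dest
            raise ValueError(f"{dest} is empty")
        except Exception as exc:
            last_error = exc
            if attempt < retries:
                # Delay doubles each time: base_delay, 2x, 4x, ...
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"download failed after {retries} retries") from last_error
```

Raising after the final attempt makes the Init Container exit non-zero, so Kubernetes would not start the main container without a model file. Injecting `sleep` is only a convenience for exercising the backoff without waiting.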
Standardized model path: All 3 services use models/model.joblib as the canonical path, configurable via MODEL_PATH environment variable.
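On the application side, resolving the canonical path might look like the sketch below. The helper names `resolve_model_path` and `require_model_file` are hypothetical; the default mirrors the models/model.joblib path stated above, and inside the pod MODEL_PATH would presumably be set to the mounted /models/model.joblib.

```python
import os
from pathlib import Path
from typing import Mapping

DEFAULT_MODEL_PATH = "models/model.joblib"  # canonical path from this ADR


def resolve_model_path(env: Mapping[str, str] = os.environ) -> Path:
    """Return MODEL_PATH if set and non-empty, else the canonical default."""
    return Path(env.get("MODEL_PATH") or DEFAULT_MODEL_PATH)


def require_model_file(path: Path) -> Path:
    """Fail fast at startup if the model file is missing or empty."""
    if not path.is_file() or path.stat().st_size == 0:
        raise FileNotFoundError(f"model artifact not found or empty: {path}")
    return path
```

A service could then pass `require_model_file(resolve_model_path())` to `joblib.load` during startup, so a missing artifact surfaces as a startup error rather than at the first prediction request.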
Consequences
- Positive: Model versioning fully decoupled from Docker images — deploy new model in <30s
- Positive: Zero persistent storage cost ($0 vs $10+/mo for PVC)
- Positive: Same pattern works on GKE (GCS) and EKS (S3) — only ConfigMap values differ
- Positive: Init Container is disposable; python:3.11-alpine (50MB) is destroyed after download
- Negative: Models re-downloaded on every pod restart (acceptable for <10MB; GCS/S3 latency is 2-5s)
- Negative: Pod startup depends on cloud storage availability (mitigated by retries)
Revisit When
- Model artifacts exceed 500MB (consider CSI FUSE mount or PVC)
- Pod restart frequency exceeds 1/hour (download cost becomes significant)
- Model registry (MLflow) supports direct K8s integration for model serving
References
- ADR-006: Drift-Triggered Retraining — model update trigger
- Kubernetes Init Containers
- GCS FUSE CSI Driver