
Infrastructure

Terraform-managed, multi-cloud (GCP + AWS) infrastructure for the ML-MLOps Portfolio.

Multi-Cloud Architecture

```mermaid
flowchart TB
    subgraph GH ["GitHub"]
        Code[Source Code] --> CI[GitHub Actions\n10 jobs]
        CI --> |push images| AR[Artifact Registry]
        CI --> |push images| ECR[ECR]
    end

    subgraph GCP ["GCP — us-central1"]
        AR --> GKE[GKE Cluster\n1-5 nodes auto-scaling\ne2-medium]
        GCS[(Cloud Storage\nModels + Datasets)]
        CSQL[(Cloud SQL\nMLflow Backend)]
        GKE --> |init containers| GCS
        GKE --> CSQL
    end

    subgraph AWS ["AWS — us-east-1"]
        ECR --> EKS[EKS Cluster\n1-5 nodes auto-scaling\nt3.small]
        S3[(S3\nArtifacts + Datasets)]
        RDS[(RDS PostgreSQL\nMLflow Backend)]
        EKS --> S3
        EKS --> RDS
    end

    subgraph K8s ["Kubernetes — Same Manifests"]
        direction LR
        BC[BankChurn API] ~~~ NLP[NLPInsight API] ~~~ CT[ChicagoTaxi API]
        PROM[Prometheus] ~~~ GRAF[Grafana] ~~~ MLF[MLflow]
    end

    GKE --> K8s
    EKS --> K8s

    TF[Terraform IaC] --> GCP
    TF --> AWS
```

Side-by-Side: GCP vs AWS

| Component | GCP (Live Production) | AWS (Live Production) |
|---|---|---|
| Cluster | GKE ml-portfolio-gke-production (us-central1) | EKS ml-portfolio-eks (us-east-1) |
| Nodes | 1 baseline, auto-scales to 5 (e2-medium, 2 vCPU / 4 GB) | 1 baseline, auto-scales to 5 (t3.small, 2 vCPU / 2 GB) |
| Container Registry | Artifact Registry | ECR |
| Object Storage | Cloud Storage (versioned, lifecycle) | S3 (versioned) |
| Database | Cloud SQL PostgreSQL | SQLite (in-pod) |
| Networking | VPC + private subnets + VPC peering | VPC (eksctl-managed) |
| Ingress | nginx + GCE LB (IP: 136.111.152.72) | nginx + NLB (AWS Load Balancer Controller) |
| IaC | Terraform (GCP modules) | Terraform + eksctl + Kustomize |
| K8s Manifests | Shared base + GCP overlays | Shared base + AWS Kustomize overlays |
| Cost | ~$51/month | ~$124/month |
| Status | ✅ Running (6 pods) | ✅ Running (6 pods) |

Cloud-agnostic design: The same K8s base manifests deploy to both clouds. Only image registry URLs and storage class annotations differ (via Kustomize overlays).

Cloud Resources

GCP (Live Production)

| Resource | Configuration |
|---|---|
| GKE Cluster | ml-portfolio-gke-production, us-central1, 1-5 nodes (auto-scaling, e2-medium) |
| Artifact Registry | 3 Docker images (bankchurn, nlpinsight, chicagotaxi) |
| Cloud Storage | Models bucket + Datasets bucket (versioned, lifecycle policies) |
| Cloud SQL | PostgreSQL for MLflow backend |
| VPC | Custom network with private subnets, VPC peering for Cloud SQL |
| Cost | ~$51/month (covered by Free Tier credits) |

AWS (Live Production)

| Resource | Configuration |
|---|---|
| EKS Cluster | ml-portfolio-eks, us-east-1, 1 node baseline, auto-scales to 5 (t3.small) |
| ECR | 3 Docker images (bankchurn, nlpinsight, chicagotaxi) |
| S3 | Models + datasets (versioned, lifecycle policies) |
| NLB | nginx-ingress LoadBalancer via AWS Load Balancer Controller (path routing) |
| Cost | ~$124/month |
Screenshots: GKE Workloads · EKS Workloads · Container Registries · Artifact Registry (GCP) · S3 Buckets (AWS) · GCS Models (GCP)

Kubernetes

| Manifest | Purpose |
|---|---|
| k8s/base/ | Cloud-agnostic: namespace, storage, monitoring, network policies, PDBs, drift cronjobs |
| k8s/overlays/gcp/ | GCP overlay: GKE deployments, GCS configmaps, Workload Identity SA, ingress |
| k8s/overlays/aws/ | AWS overlay: EKS deployments, S3 configmaps, IRSA SA, ingress |
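To make the base/overlay split concrete, a GCP overlay's kustomization.yaml might look like the sketch below. The file contents, image names, and patch file are assumptions for illustration, not the repo's actual overlay:

```yaml
# k8s/overlays/gcp/kustomization.yaml — illustrative sketch
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

# Pull in the shared, cloud-agnostic manifests
resources:
  - ../../base

# Swap image references to the GCP registry; the AWS overlay
# would point at ECR instead (PROJECT_ID is a placeholder)
images:
  - name: bankchurn
    newName: us-central1-docker.pkg.dev/PROJECT_ID/ml-portfolio/bankchurn

# Cloud-specific tweaks (ingress class, storage class annotations)
patches:
  - path: ingress-patch.yaml
```

The AWS overlay would differ only in the registry URLs and cloud-specific patches, which is what keeps the base manifests identical across clusters.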

Resource Calibration (2 uvicorn workers)

| Service | Memory (real / limit) | CPU Target | HPA |
|---|---|---|---|
| BankChurn | ~344Mi / 1Gi | 70% | 1–3 pods |
| NLPInsight | ~283Mi / 1Gi | 70% | 1–3 pods |
| ChicagoTaxi | ~431Mi / 512Mi | 70% | 1–3 pods |

CPU-only HPA: once loaded, the ML models have a fixed memory footprint, so memory-based scaling would never trigger a scale-down.
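A CPU-only HPA matching the calibration above could be sketched as follows (resource names and the target Deployment are assumptions; the 70% target and 1–3 replica range come from the table):

```yaml
# Illustrative HPA: scales on CPU utilization only, never on memory
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: bankchurn-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: bankchurn
  minReplicas: 1
  maxReplicas: 3
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```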

Terraform Commands

```bash
cd infra/terraform/gcp    # or aws/
terraform init
terraform plan -var-file=terraform.tfvars
terraform apply -var-file=terraform.tfvars
```
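The tfvars file referenced above would hold the per-environment inputs. The variable names below are assumptions for illustration and may not match the repo's actual variables:

```hcl
# terraform.tfvars — illustrative sketch; variable names are assumptions
project_id     = "my-gcp-project"
region         = "us-central1"
cluster_name   = "ml-portfolio-gke-production"
machine_type   = "e2-medium"
min_node_count = 1
max_node_count = 5
```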

Infrastructure Testing

Automated validation suite in tests/infra/:

| Test | Type | GCP | AWS |
|---|---|---|---|
| terraform fmt | Hard gate | | |
| terraform validate | Hard gate | | |
| tfsec | Advisory | ✅ (51/71) | ✅ (84/116) |
| checkov | Advisory | ✅ (51/71) | ✅ (84/116) |
| YAML syntax | Hard gate | ✅ 24/24 | ✅ 24/24 |
| kube-linter | Advisory | ✅ 17 findings | ✅ 17 findings |
| conftest (OPA) | Hard gate | ✅ 0 violations | ✅ 0 violations |
```bash
bash tests/infra/run_all_tests.sh
```
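For context on the conftest hard gate, an OPA policy enforcing the non-root requirement might look like this Rego sketch (package name and rule are illustrative, not the repo's actual policy):

```rego
package main

# Deny any Deployment whose pod spec does not enforce a non-root user
deny[msg] {
  input.kind == "Deployment"
  not input.spec.template.spec.securityContext.runAsNonRoot
  msg := sprintf("%s: pod securityContext must set runAsNonRoot", [input.metadata.name])
}
```

conftest evaluates each rendered manifest against rules like this; any `deny` match fails the gate, which is why the table reports violations rather than a pass percentage.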
Screenshots: Terraform Multi-Cloud · K8s Overlays · Infra Tests

Security

  • Encryption at rest (GCS, Cloud SQL, S3, RDS)
  • Workload Identity (GCP) / IRSA (AWS) for pod-level IAM
  • Non-root containers (UID 1000)
  • Private database networking
  • CI/CD scanning: Trivy, Bandit, Gitleaks, pip-audit
  • Least-privilege: storage.objectViewer for GKE pods
  • IaC scanning: tfsec, checkov (advisory findings documented in .tfsec.yml)
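The non-root point above translates into a container-level securityContext roughly like the following sketch; only UID 1000 comes from the text, the other fields are common hardening assumptions:

```yaml
# Illustrative container securityContext (UID 1000 per the list above;
# remaining fields are assumed hardening defaults, not confirmed repo config)
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop: ["ALL"]
```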

Security Hardening (Terraform — Production-Grade)

The Terraform configuration includes security hardening that goes beyond what is applied to the running demo cluster:

| Feature | GCP (main.tf) | AWS (main.tf) |
|---|---|---|
| Private cluster | private_cluster_config (private nodes, public endpoint) | endpoint_private_access = true |
| Authorized networks | master_authorized_networks_config (VPC CIDR only) | endpoint_public_access_cidrs |
| Network policy | Calico CNI | Calico CNI |
| VPC-native | ip_allocation_policy (secondary pod/service ranges) | VPC CNI (native) |
| Flow logs | VPC Flow Logs enabled | VPC Flow Logs enabled |
| Encryption | GCS/Cloud SQL at-rest | S3 KMS + public access blocks |
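In GCP provider HCL, the private-cluster and authorized-networks rows correspond to blocks like the sketch below (inside a google_container_cluster resource, omitted here; the master CIDR value is an assumption, while 10.10.0.0/24 follows the text):

```hcl
# Sketch of the hardening blocks — not the repo's verbatim main.tf
private_cluster_config {
  enable_private_nodes    = true
  enable_private_endpoint = false   # public endpoint kept, per the table
  master_ipv4_cidr_block  = "172.16.0.0/28"  # assumed value
}

master_authorized_networks_config {
  cidr_blocks {
    cidr_block   = "10.10.0.0/24"   # VPC subnet CIDR from the text below
    display_name = "vpc-subnet"
  }
}
```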

Architecture Decision: The running GKE demo cluster was provisioned before the security hardening was added to the Terraform code. Applying these changes would force cluster recreation (private_cluster_config and ip_allocation_policy are ForceNew attributes in the GCP provider), destroying all 6 running pods and requiring full redeployment.

Additionally, master_authorized_networks_config restricts API access to the VPC subnet (10.10.0.0/24), which would require a bastion host or Cloud Shell for kubectl access — appropriate for production but impractical for a portfolio demo that requires frequent local interaction.

The Terraform code represents the production-ready target state. The running cluster demonstrates deployment capabilities (APIs, monitoring, autoscaling, CI/CD). Both are valid portfolio artifacts — the code shows security engineering, the cluster shows operational execution. A real production deployment would apply the hardened configuration from initial provisioning.

Monitoring Stack

| Grafana — ML Production Dashboard | Prometheus — 16/16 Targets UP |
|---|---|
| Request rate, P95 latency, predictions/hr, error rate, CPU, memory | All ML services + K8s auto-discovered pods |

Last Updated: March 2026 — v3.5.3 (both clouds live with LoadBalancer Ingress)