
Portfolio Status

Operating status: production-oriented evidence, currently in showcase mode

This page separates what is active today from what was proven during the live cloud deployment period. It is designed for recruiters, hiring managers, and technical reviewers who need the status in minutes, not a wall of operational detail.

| Item | Value | Notes |
| --- | --- | --- |
| Mode | Showcase | Reference implementation; live clusters are intentionally paused. |
| Last live deploy | v3.6.0 | Validated during March 2026 active development. |
| Reactivation | 1-2 h | Plus temporary GCP/AWS budget. |
| Current proof | CI + docs | Code, tests, IaC, runbooks, ADRs and screenshots remain available. |

Executive Readout

What this is

Reference MLOps portfolio

Three ML services, multi-cloud Kubernetes artifacts, Terraform, CI/CD, observability, drift detection and ADR-backed design decisions.

What is real

Implementation, not slideware

The code, manifests, Terraform and workflows were used against live clusters during development. Evidence from that period is preserved in the docs.

What is off

Cost-controlled infrastructure

GKE, EKS, MLflow, Prometheus and Grafana are not running continuously because the cloud control-plane cost is not justified for a permanent showcase.

Active vs Paused Surfaces

The fastest way to review the portfolio is to separate active engineering assets from intentionally paused cloud runtime surfaces.

Active Now

- **Source code**: all three services remain reviewable and tested on every push.
- **Unit and integration CI**: `ci-mlops.yml` runs on push and PR with 395+ tests and 90-96% coverage (see the trigger sketch after this list).
- **Terraform validation**: `ci-infra.yml` validates infrastructure changes without requiring live clusters.
- **GitHub Pages docs**: this site is the current public review surface for architecture, evidence and operations.
- **Docker build path**: images are built as CI artifacts and the Dockerfiles remain production-oriented.
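
For orientation, here is a minimal sketch of the trigger and test surface described above for `ci-mlops.yml`. The branch filter, Python version, requirements path and coverage gate are assumptions; the workflow file in the repository is authoritative, and `ci-infra.yml` presumably follows the same pattern with Terraform format/validate steps in place of pytest.

```yaml
# Sketch only: names, versions and paths below are assumptions, not the exact workflow.
name: ci-mlops
on:
  push:
    branches: [main]
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements-dev.txt   # assumed dev dependency file
      - run: pytest --cov --cov-fail-under=90      # gate aligned with the 90-96% figure
```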

Paused or Inactive

- **GKE cluster** (inactive): torn down after the live development and load-testing period.
- **EKS cluster** (inactive): torn down for the same cost-control reason as GKE.
- **MLflow and observability stack** (inactive): MLflow, Prometheus and Grafana were deployed on the clusters and were removed along with them.
- **Promotion workflows** (paused): Artifact Registry and ECR promotion are disabled until a live demo is requested.
- **Daily drift detection** (paused): the scheduled trigger is disabled; `workflow_dispatch` is still available.
- **Daily retrain checks** (paused): follows the same maintenance-mode logic as drift detection.

Why Infrastructure Is Off

Running GKE + EKS + managed Postgres + container registries continuously costs roughly $180-$220/month combined. That spend was justified during active development, load testing and incident-style validation; it is not economical as an always-on showcase.
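
As a rough breakdown (list pricing at the time of writing, which may change): the GKE and EKS control planes each bill on the order of $0.10 per cluster-hour, roughly $70-75 per month per cluster before worker nodes and before any free-tier credits, with managed Postgres, container registries and load balancers accounting for the rest of the range.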

This is the same operating logic a team would use with any paid infrastructure: keep the evidence, automation and reactivation path available, but do not pay for idle runtime when nobody is reviewing it live.

The important distinction is that the portfolio is not claiming a fictional live environment. It keeps the evidence that matters: manifests, Terraform, runbooks, ADRs, screenshots, load-test results and incident notes from the real deployment period.

Decision record

The full rationale lives in ADR-018: Portfolio Maintenance Mode.

How Repository Noise Is Controlled

Drift issues

Workflow dispatch only

The daily schedule was disabled. The previous condition treated script failures as drift events; the workflow now checks for successful drift detection and an explicit drift flag.
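
A minimal sketch of what that gating can look like inside the GitHub Actions job; the step id, script path and output name below are illustrative, not the repository's exact identifiers.

```yaml
- name: Run drift detection
  id: drift
  # assumed to append "drift_detected=true" or "drift_detected=false" to $GITHUB_OUTPUT
  run: python scripts/detect_drift.py >> "$GITHUB_OUTPUT"
  continue-on-error: true

- name: Open drift issue
  # only when detection itself succeeded AND it explicitly flagged drift;
  # a failed script run no longer opens an issue
  if: steps.drift.outcome == 'success' && steps.drift.outputs.drift_detected == 'true'
  run: gh issue create --title "Data drift detected" --body "PSI above configured threshold"
  env:
    GH_TOKEN: ${{ github.token }}
```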

Security alerts

Trivy signal cleanup

`ignore-unfixed: true` keeps unfixable base-image CVEs from becoming permanent noise while preserving actionable scanner findings.
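
A hedged sketch of how this typically appears with the Trivy GitHub Action; the image reference and severity filter are placeholders rather than the repository's exact values.

```yaml
- name: Scan image
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: ghcr.io/example/service:latest   # placeholder image reference
    ignore-unfixed: true      # drop CVEs that have no upstream fix yet
    severity: CRITICAL,HIGH   # assumed severity filter
    exit-code: "1"            # fail the job on actionable findings
```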

Dependencies

Dependabot with limits

GitHub Actions updates run weekly and are capped at three open PRs, keeping maintenance visible without drowning the repo.
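
The policy described above maps to standard Dependabot configuration along these lines (a sketch of `.github/dependabot.yml`; the file in the repository is authoritative):

```yaml
version: 2
updates:
  - package-ecosystem: "github-actions"
    directory: "/"
    schedule:
      interval: "weekly"          # weekly Actions bumps
    open-pull-requests-limit: 3   # cap open Dependabot PRs at three
```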

Reactivation Playbook

A live end-to-end demo can be restored from the existing Terraform, workflows and runbooks. Budget approval is the main prerequisite.

1. **Provision infrastructure** (about 30 min)

   ```bash
   cd infra/terraform/gcp && terraform apply -var-file=terraform.tfvars
   cd ../aws && terraform apply -var-file=terraform.tfvars
   ```

2. **Push images to cloud registries** (about 15 min)

   ```bash
   gh workflow run promote-images.yml
   ```

3. **Deploy to clusters** (about 20 min)

   ```bash
   gh workflow run deploy-gcp.yml --ref v3.6.0
   gh workflow run deploy-aws.yml --ref v3.6.0
   ```

4. **Re-enable scheduled drift detection**

   Uncomment the `schedule:` block in `.github/workflows/drift-detection.yml` (see the sketch after this list).

5. **Run smoke tests**

   ```bash
   ./scripts/smoke_test.sh
   ```

6. **Teardown after the demo**

   Run `terraform destroy` in both cloud directories to avoid turning a demo into recurring cost.
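
After step 4, the trigger section of `.github/workflows/drift-detection.yml` would look roughly like the sketch below; the cron expression is illustrative, and the schedule already present in the file is what should be restored.

```yaml
on:
  workflow_dispatch:        # manual runs remain available either way
  schedule:
    - cron: "0 6 * * *"     # illustrative daily time; restoring this block re-enables the schedule
```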

Maintenance Pass Summary

- **Issue hygiene**: 168 stale drift alerts closed; each closure points reviewers back to this status page and the maintenance-mode decision.
- **Dependency hygiene**: 3 Dependabot PRs merged; the GitHub Actions bumps landed while heavier Docker-image changes were deferred to the next active sprint.
- **Security hygiene**: 210 legacy Trivy alerts handled; unfixable legacy alerts were dismissed with documented "won't fix" reasoning.

Reviewer FAQ

**Can this actually be redeployed?** Yes. Terraform is current and validated on infrastructure changes. A full redeploy is roughly one hour from infrastructure apply to green smoke tests, assuming credentials and budget are ready.

**How were the latency claims verified?** With Locust load tests against live clusters during the v3.6.0 period. Raw results live in [Load Test Results](load-test-results.md), with visual evidence under `docs/media/`.

**Why not keep a tiny cluster running?** The control-plane floor alone is material: GKE and EKS each carry monthly control-plane cost even when workloads are near zero. ADR-018 documents the alternatives and the final maintenance-mode decision.

**How would this noise problem be handled in production?** The drift workflow bug was fixed so script failure is no longer treated as a drift event. With live data, drift jobs should complete successfully and open issues only when PSI exceeds the configured threshold.
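
For context on that threshold: PSI (population stability index) is commonly computed per feature as the sum over bins of (actual share - expected share) × ln(actual share / expected share), with alert thresholds around 0.1-0.2 as typical defaults; the value configured in the repository is what governs alerting here.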