
Portfolio Status

Operating status: production-oriented evidence, currently in showcase mode

This page separates what is active today from what was proven during the live cloud deployment period. It is designed for recruiters, hiring managers, and technical reviewers who need the status in minutes, not a wall of operational detail.

| Item | Value | Notes |
| --- | --- | --- |
| Mode | Showcase | Reference implementation; live clusters are intentionally paused. |
| Last live deploy | v3.6.0 | Validated during March 2026 active development. |
| Reactivation | 1-2 h | Plus temporary GCP/AWS budget. |
| Current proof | CI + docs | Code, tests, IaC, runbooks, ADRs and screenshots remain available. |

Executive Readout

What this is

Reference MLOps portfolio

Three ML services, multi-cloud Kubernetes artifacts, Terraform, CI/CD, observability, drift detection and ADR-backed design decisions.

What is real

Implementation, not slideware

The code, manifests, Terraform and workflows were used against live clusters during development. Evidence from that period is preserved in the docs.

What is off

Cost-controlled infrastructure

GKE, EKS, MLflow, Prometheus and Grafana are not running continuously because the cloud control-plane cost is not justified for a permanent showcase.

Active vs Paused Surfaces

The fastest way to review the portfolio is to separate active engineering assets from intentionally paused cloud runtime surfaces.

Active Now

- **Source code**: all three services remain reviewable and tested on every push.
- **Unit and integration CI**: `ci-mlops.yml` runs on push and PR with 395+ tests and 90-96% coverage (see the trigger sketch after this list).
- **Terraform validation**: `ci-infra.yml` validates infrastructure changes without requiring live clusters.
- **GitHub Pages docs**: this site is the current public review surface for architecture, evidence and operations.
- **Docker build path**: images are built as CI artifacts and the Dockerfiles remain production-oriented.
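
For orientation, here is a minimal sketch of the trigger and test surface described above for `ci-mlops.yml`. The branch filter, Python version, requirements path and coverage gate are assumptions; the workflow file in the repository is authoritative, and `ci-infra.yml` presumably follows the same pattern with Terraform format/validate steps in place of pytest.

```yaml
# Sketch only: names, versions and paths below are assumptions, not the exact workflow.
name: ci-mlops
on:
  push:
    branches: [main]
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements-dev.txt   # assumed dev dependency file
      - run: pytest --cov --cov-fail-under=90      # gate aligned with the 90-96% figure
```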

Paused or Inactive

- **GKE cluster** (inactive): torn down after the live development and load-testing period.
- **EKS cluster** (inactive): torn down for the same cost-control reason as GKE.
- **MLflow and observability stack** (inactive): MLflow, Prometheus and Grafana were deployed on the clusters and were removed along with them.
- **Promotion workflows** (paused): Artifact Registry and ECR promotion are disabled until a live demo is requested.
- **Daily drift detection** (paused): the scheduled trigger is disabled; `workflow_dispatch` is still available.
- **Daily retrain checks** (paused): follows the same maintenance-mode logic as drift detection.

Why Infrastructure Is Off

Running GKE + EKS + managed Postgres + container registries continuously costs roughly $180-$220/month combined. That spend was justified during active development, load testing and incident-style validation; it is not economical as an always-on showcase.
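
As a rough breakdown (list pricing at the time of writing, which may change): the GKE and EKS control planes each bill on the order of $0.10 per cluster-hour, roughly $70-75 per month per cluster before worker nodes and before any free-tier credits, with managed Postgres, container registries and load balancers accounting for the rest of the range.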

This is the same operating logic a team would use with any paid infrastructure: keep the evidence, automation and reactivation path available, but do not pay for idle runtime when nobody is reviewing it live.

The important distinction is that the portfolio is not claiming a fictional live environment. It keeps the evidence that matters: manifests, Terraform, runbooks, ADRs, screenshots, load-test results and incident notes from the real deployment period.

Decision record

The full rationale lives in ADR-018: Portfolio Maintenance Mode.

How Repository Noise Is Controlled

Drift issues

Workflow dispatch only

The daily schedule was disabled. The previous condition treated script failures as drift events; the workflow now checks for successful drift detection and an explicit drift flag.
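
A minimal sketch of what that gating can look like inside the GitHub Actions job; the step id, script path and output name below are illustrative, not the repository's exact identifiers.

```yaml
- name: Run drift detection
  id: drift
  # assumed to append "drift_detected=true" or "drift_detected=false" to $GITHUB_OUTPUT
  run: python scripts/detect_drift.py >> "$GITHUB_OUTPUT"
  continue-on-error: true

- name: Open drift issue
  # only when detection itself succeeded AND it explicitly flagged drift;
  # a failed script run no longer opens an issue
  if: steps.drift.outcome == 'success' && steps.drift.outputs.drift_detected == 'true'
  run: gh issue create --title "Data drift detected" --body "PSI above configured threshold"
  env:
    GH_TOKEN: ${{ github.token }}
```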

Security alerts

Trivy signal cleanup

`ignore-unfixed: true` keeps unfixable base-image CVEs from becoming permanent noise while preserving actionable scanner findings.
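
A hedged sketch of how this typically appears with the Trivy GitHub Action; the image reference and severity filter are placeholders rather than the repository's exact values.

```yaml
- name: Scan image
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: ghcr.io/example/service:latest   # placeholder image reference
    ignore-unfixed: true      # drop CVEs that have no upstream fix yet
    severity: CRITICAL,HIGH   # assumed severity filter
    exit-code: "1"            # fail the job on actionable findings
```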

Dependencies

Dependabot with limits

GitHub Actions updates run weekly and are capped at three open PRs, keeping maintenance visible without drowning the repo.
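
The policy described above maps to standard Dependabot configuration along these lines (a sketch of `.github/dependabot.yml`; the file in the repository is authoritative):

```yaml
version: 2
updates:
  - package-ecosystem: "github-actions"
    directory: "/"
    schedule:
      interval: "weekly"          # weekly Actions bumps
    open-pull-requests-limit: 3   # cap open Dependabot PRs at three
```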

Reactivation Playbook

A live end-to-end demo can be restored from the existing Terraform, workflows and runbooks. Budget approval is the main prerequisite.

1. **Provision infrastructure** (about 30 min)

   ```bash
   cd infra/terraform/gcp && terraform apply -var-file=terraform.tfvars
   cd ../aws && terraform apply -var-file=terraform.tfvars
   ```

2. **Push images to cloud registries** (about 15 min)

   ```bash
   gh workflow run promote-images.yml
   ```

3. **Deploy to clusters** (about 20 min)

   ```bash
   gh workflow run deploy-gcp.yml --ref v3.6.0
   gh workflow run deploy-aws.yml --ref v3.6.0
   ```

4. **Re-enable scheduled drift detection**

   Uncomment the `schedule:` block in `.github/workflows/drift-detection.yml` (see the sketch after this list).

5. **Run smoke tests**

   ```bash
   ./scripts/smoke_test.sh
   ```

6. **Teardown after the demo**

   Run `terraform destroy` in both cloud directories to avoid turning a demo into recurring cost.
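
After step 4, the trigger section of `.github/workflows/drift-detection.yml` would look roughly like the sketch below; the cron expression is illustrative, and the schedule already present in the file is what should be restored.

```yaml
on:
  workflow_dispatch:        # manual runs remain available either way
  schedule:
    - cron: "0 6 * * *"     # illustrative daily time; restoring this block re-enables the schedule
```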

Maintenance Pass Summary

- **Issue hygiene**: 168 stale drift alerts closed; each closure points reviewers back to this status page and the maintenance-mode decision.
- **Dependency hygiene**: 3 Dependabot PRs merged; the GitHub Actions bumps landed while heavier Docker-image changes were deferred to the next active sprint.
- **Security hygiene**: 210 legacy Trivy alerts handled; unfixable legacy alerts were dismissed with documented "won't fix" reasoning.

Reviewer FAQ

**Can this actually be redeployed?** Yes. Terraform is current and validated on infrastructure changes. A full redeploy is roughly one hour from infrastructure apply to green smoke tests, assuming credentials and budget are ready.

**How were the latency claims verified?** With Locust load tests against live clusters during the v3.6.0 period. Raw results live in [Load Test Results](load-test-results.md), with visual evidence under `docs/media/`.

**Why not keep a tiny cluster running?** The control-plane floor alone is material: GKE and EKS each carry monthly control-plane cost even when workloads are near zero. ADR-018 documents the alternatives and the final maintenance-mode decision.

**How would this noise problem be handled in production?** The drift workflow bug was fixed so script failure is no longer treated as a drift event. With live data, drift jobs should complete successfully and open issues only when PSI exceeds the configured threshold.
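
For context on that threshold: PSI (population stability index) is commonly computed per feature as the sum over bins of (actual share - expected share) × ln(actual share / expected share), with alert thresholds around 0.1-0.2 as typical defaults; the value configured in the repository is what governs alerting here.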