ADR-018: Portfolio Maintenance Mode
Status
Accepted — April 2026
Context
After reaching v3.6.0 (full multi-cloud deployment verified, managed ML platforms added, load-tested on both clouds), the portfolio entered a phase where keeping the infrastructure running continuously was economically unjustified:
- GKE control plane: ~$72/month (even at 0 workload)
- EKS control plane: ~$72/month (even at 0 workload)
- Managed Postgres (RDS + Cloud SQL): ~$30–50/month combined
- Artifact Registry / ECR storage: minor but non-zero
- NAT gateways, load balancers, disks: ~$40/month combined
- Total idle cost: ~$180–220/month for a portfolio nobody is actively using
At the same time, three symptoms appeared on the public repository:
- 168 stale "drift-alert" issues auto-created by the daily drift-detection workflow. The workflow ran on schedule against stale data, failed for unrelated reasons (missing data path, hash-pinned requirements conflict), and the condition `if: steps.drift.outcome == 'failure'` misinterpreted those script errors as drift events, creating empty-body issues every day.
- ~210 Trivy code-scanning alerts for base-image (Debian/Python) CVEs, most of them `severity=note` and most without an upstream fix available.
- Multiple unmerged Dependabot PRs for GitHub Actions version bumps.
A visitor (recruiter, reviewer) arriving at the repo would see these as signals of an abandoned project, which contradicts the actual state (technically sound code, 395+ tests passing, fully documented).
Decision
Transition the portfolio to a formal "Reference / Showcase" mode and encode this state in visible artifacts:
- Declare the mode publicly: add a status badge and banner to README.md linking to `PORTFOLIO_STATUS.md`, which explains what is live, what is paused, and how to reactivate.
- Convert scheduled workflows to manual-only: drop `schedule:` triggers on `drift-detection.yml` and related workflows while infrastructure is offline; keep `workflow_dispatch` so everything can still be demonstrated on demand.
- Fix the drift-alert bug: change the issue-creation condition from `steps.drift.outcome == 'failure'` to `steps.drift.outcome == 'success' && steps.metrics.outputs.drift_detected == 'true'`. This is a correctness fix that would apply even in active mode: script failure and drift event are distinct things and must not be conflated.
- Harden Trivy scanning: add `ignore-unfixed: true` to the scanner so that base-image CVEs without upstream fixes don't re-accumulate; bulk-dismiss the existing ~210 alerts with reason `won't fix` and a documented comment linking to this ADR.
- Clear existing noise: close the 168 stale drift-alert issues with a referenced explanation; merge the pending Dependabot PRs.
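The trigger change and the corrected alert condition can be sketched together in one workflow file. This is a minimal sketch, not the repository's actual `drift-detection.yml`: the script path, the report format, the `jq` extraction, and the issue text are all assumptions made for illustration. Only the `workflow_dispatch` trigger and the `if:` condition come from the decision above.

```yaml
# Sketch only: script path, report format, and issue text are hypothetical.
name: drift-detection

on:
  workflow_dispatch: {}   # manual trigger only; the schedule: block is removed

permissions:
  contents: read
  issues: write

jobs:
  drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run drift detection
        id: drift
        # Hypothetical script. If it crashes, the run fails visibly
        # and the later steps are skipped, so no issue is opened.
        run: python scripts/detect_drift.py --output drift_report.json

      - name: Extract drift metrics
        id: metrics
        run: |
          # Assumed report shape: {"drift_detected": true}
          detected=$(jq -r '.drift_detected' drift_report.json)
          echo "drift_detected=${detected}" >> "$GITHUB_OUTPUT"

      - name: Open drift-alert issue
        # Issue creation depends on the metric, not on the script's exit status.
        if: steps.drift.outcome == 'success' && steps.metrics.outputs.drift_detected == 'true'
        env:
          GH_TOKEN: ${{ github.token }}
        run: |
          gh issue create \
            --title "drift-alert: drift detected" \
            --body "Drift detected in workflow run ${{ github.run_id }}." \
            --label drift-alert
```

In this layout a script error shows up as a failed run in the Actions tab, while an issue is opened only when the detection step succeeded and reported drift. The `drift-alert` label is assumed to exist in the repository already.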
Alternatives considered
Alternative 1: Keep running with smaller clusters
Evaluated on paper only. GKE and EKS both charge a fixed control-plane fee regardless of workload size. Scaling node pools to zero saves node cost but not the ~$144/month control-plane floor (2 × ~$72). Not economical for a permanent showcase.
Alternative 2: Archive the repository
Would stop all CI activity and thus all noise, but would also remove the ability to demonstrate the CI/CD pipeline (unit tests, Terraform validate, Docker build), which is itself a key part of the portfolio. Rejected.
Alternative 3: Disable all workflows
Would make the repo look "dead" and remove the ability to demonstrate that the CI actually runs and passes. Rejected: unit tests, Terraform validate, Docker build, Trivy scan, and docs deployment stay enabled on push/PR.
Alternative 4: Just close the issues without fixing the workflow
Leaves the underlying bug that will re-create the same noise next time the workflow runs. Rejected — the fix is a one-line correctness improvement that applies in both active and showcase modes.
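Once the workflow condition is fixed, the one-off cleanup of the stale issues could itself be run as a manually triggered workflow. The following is a sketch under assumptions, not a file from the repository: the workflow name, the `drift-alert` label, the `--limit` value, and the closing comment are hypothetical.

```yaml
# Sketch only: a one-off, manually triggered cleanup workflow (hypothetical file).
name: close-stale-drift-alerts

on:
  workflow_dispatch: {}

permissions:
  issues: write

jobs:
  close:
    runs-on: ubuntu-latest
    env:
      GH_TOKEN: ${{ github.token }}
      GH_REPO: ${{ github.repository }}
    steps:
      - name: Close open drift-alert issues with an explanation
        run: |
          # Label name and comment text are assumptions.
          gh issue list --label drift-alert --state open --limit 500 \
            --json number --jq '.[].number' |
          while read -r number; do
            gh issue close "$number" \
              --reason "not planned" \
              --comment "Closed as noise from a workflow bug; see ADR-018."
            sleep 1   # stay well below API secondary rate limits
          done
```

Closing with a comment that points at this ADR leaves each issue with the referenced explanation the decision calls for.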
Consequences
Positive
- Repo presents as "controlled and documented", not "abandoned"
- The drift-workflow bug fix benefits any future active period (real bug, real fix)
- Trivy `ignore-unfixed: true` prevents permanent growth of unfixable alerts
- Clear reactivation path documented (~1 hour, $180–220/month ongoing)
- ADR + PORTFOLIO_STATUS.md turn a limitation into a demonstrated understanding of operational trade-offs — recruiters value documented decisions more than continuously-running infrastructure
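For reference, the Trivy setting listed above might appear in a CI workflow roughly as follows. This is a sketch assuming the scan uses the `aquasecurity/trivy-action` action and uploads SARIF to GitHub code scanning; the job layout, image name, and output file name are hypothetical and not taken from the actual `ci-mlops.yml`.

```yaml
# Sketch only: job layout, image name, and file names are hypothetical.
name: ci-mlops

on:
  push:
  pull_request:

permissions:
  contents: read
  security-events: write   # required to upload SARIF to code scanning

jobs:
  trivy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Build image
        run: docker build -t portfolio-api:${{ github.sha }} .

      - name: Scan image with Trivy
        # Pin to a release tag or commit SHA in practice.
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: portfolio-api:${{ github.sha }}
          format: sarif
          output: trivy-results.sarif
          ignore-unfixed: true   # skip CVEs that have no upstream fix yet

      - name: Upload results to code scanning
        uses: github/codeql-action/upload-sarif@v3
        with:
          sarif_file: trivy-results.sarif
```

With `ignore-unfixed: true`, findings that have no fixed version upstream are left out of the report, so only alerts that can be acted on reach the code-scanning tab.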
Negative
- The live demo links in older docs (e.g., `https://<elb-dns>/health`) no longer resolve. Mitigated by the README banner and status doc; live-verification evidence preserved in `DEPLOYMENT_EVIDENCE.md`.
- Any visitor running `terraform plan` against the state bucket will see full destroy plans. Mitigated by removing the remote state buckets after teardown; Terraform runs locally with empty state.
Reactivation criteria
The portfolio returns to "Active" mode when any of the following is true:
- Interview demonstration requested: reactivate for the interview window, tear down afterwards
- Feature development: implementing a new ADR that requires live cloud validation (e.g., adding a new managed service comparison)
- Budget allocated: the monthly cost becomes sustainable for a period
Reactivation checklist is in PORTFOLIO_STATUS.md §5.
References
- `PORTFOLIO_STATUS.md`: visitor-facing state document
- `.github/workflows/drift-detection.yml`: contains the corrected alert condition
- `.github/workflows/ci-mlops.yml`: contains `ignore-unfixed: true`
- CHANGELOG v3.6.1 entry (to be written on next release)