A production-hardened AI inference platform that solves configuration drift, the Backstage adoption paradox, and the gap between AI code generation and governed production deployment — with 21 verified checks across 6 milestones.
Executive Summary: NeuroScale is a self-service AI inference platform on Kubernetes. A developer fills in a Backstage form, the platform creates a pull request, ArgoCD deploys it, and a production-grade KServe inference endpoint is live — with cost attribution, drift control, and policy guardrails enforced automatically at every stage.
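To make the flow concrete, here is a minimal sketch of the kind of manifest the golden path produces. Field values, the runtime name, and the storage URI are illustrative assumptions, not copied from the repo; the real template lives at `backstage/templates/model-endpoint/template.yaml`:

```yaml
# Hypothetical scaffolder output -- values are placeholders.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: demo-iris-2
  labels:
    owner: team-ml          # required by Kyverno before merge (value is illustrative)
    cost-center: cc-1234    # feeds OpenCost showback (value is illustrative)
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      runtime: kserve-sklearnserver          # assumed name of the shared ServingRuntime
      storageUri: gs://example-bucket/iris   # placeholder model location
```

The labels are the load-bearing part: Kyverno rejects the PR without them, and OpenCost aggregates cost by them later.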
| Problem | Pain | Solution |
|---|---|---|
| No cost accountability | Unbounded resource consumption with no ownership trail | owner/cost-center labels enforced before merge + live OpenCost showback |
| Scale friction | Adding a new service requires editing GitOps boilerplate | ApplicationSet auto-discovers new model folders — zero registration overhead |

GitOps layout: ArgoCD is bootstrapped from a single root Application (`bootstrap/root-app.yaml`) that watches `infrastructure/apps/`. That directory contains both static Application manifests for platform infrastructure and one ApplicationSet that auto-discovers model endpoint folders. Previously, every new endpoint needed a hand-written `infrastructure/apps/<name>-app.yaml` file, which was purely mechanical boilerplate; now a new model is just a folder under `apps/`, and ArgoCD discovers and deploys it automatically.

In the root Application (`bootstrap/root-app.yaml`), `selfHeal: true` means ArgoCD continuously reconciles, and `prune: true` means resources removed from Git are removed from the cluster.

The `owner` and `cost-center` labels required by Kyverno feed directly into catalog cost attribution. The KServe model endpoint template at `backstage/templates/model-endpoint/template.yaml` generates the file required by the GitOps pipeline; the resulting demo InferenceService is named `demo-iris-2`.

The serving stack uses Kourier rather than Istio for ingress: `infrastructure/serving-stack/patches/inferenceservice-config-ingress.yaml` sets `disableIstioVirtualHost: true` to signal this choice to KServe. `infrastructure/kserve/sklearn-runtime.yaml` defines a reusable runtime that all sklearn-based InferenceService objects reference.
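The folder auto-discovery described above can be sketched with ArgoCD's git directory generator. This is a hedged sketch, not the repo's actual file (`infrastructure/apps/model-endpoints-appset.yaml`); the repo URL, namespace, and project are placeholder assumptions:

```yaml
# Sketch of a git-directory ApplicationSet; repoURL and destination are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: model-endpoints
spec:
  generators:
    - git:
        repoURL: https://github.com/example/neuroscale.git  # placeholder
        revision: main
        directories:
          - path: apps/*            # each folder becomes one child Application
  template:
    metadata:
      name: '{{path.basename}}'     # Application named after the folder
    spec:
      project: default
      source:
        repoURL: https://github.com/example/neuroscale.git  # placeholder
        targetRevision: main
        path: '{{path}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: default          # placeholder
      syncPolicy:
        automated:
          selfHeal: true            # manual kubectl changes are reverted
          prune: true               # resources removed from Git are removed from the cluster
```

With this in place, "registering" a model is just committing a folder under `apps/` — the generator does the rest.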
This separates *how to serve* from *what to serve*, allowing the runtime image to be patched in one place.

Verification results by milestone:

| Milestone | Problem addressed | Verified checks |
|---|---|---|
| A | Configuration drift — any manual `kubectl` change is auto-reverted; infrastructure is recoverable via `git revert` | [✓ PASS] Drift self-heal: nginx-test recreated in ~20s |
| B — AI serving baseline | Inference with no deployment path — KServe is GitOps-managed; a new model is a PR, not a `kubectl` command | [✓ PASS] InferenceServices: 2/2 Ready=True · [✓ PASS] Inference request: demo-iris-2 → `{"predictions":[1,1]}` |
| C — Golden Path | Backstage adoption paradox — developers interact with a form, not Kubernetes YAML; the platform handles the rest | [✓ PASS] demo-iris-2 InferenceService exists (scaffolder output) |
| D — Guardrails | Security theater — policies that appear to work but don't block anything; the kyverno-cli false-green was undetected for 2 weeks before being fixed | [✓ PASS] Non-compliant InferenceService correctly denied |
| E — Cost + portability | No cost accountability — unbounded resource consumption with no ownership trail; any engineer can reproduce the full platform on a laptop in one command | [~ SKIP] Resource-delta PR comment + bootstrap script validated in CI |
| F — Production hardening | Scale friction — adding a new model required manual GitOps boilerplate; ApplicationSet eliminates that entirely | [✓ PASS] ApplicationSet generates 3 child Applications · [✓ PASS] OpenCost deployment healthy |

Incidents and lessons:

| Milestone | Incident | Time | Impact |
|---|---|---|---|
| A | ArgoCD Unknown state caused by repo-server CrashLoopBackOff — looks like a manifest error but is a controller connectivity failure | 40 min | Zero drift correction during outage window |
| B — KServe serving | Default KServe config assumes Istio; `disableIstioVirtualHost` is not disclosed in getting-started docs | 3 hours | All InferenceService creation blocked cluster-wide |
| C — Backstage Golden Path | 9 distinct failures; CI false-green on Kyverno policy violations for 2 weeks | ~6 hours | PR-time enforcement was silently not enforcing |
| D — Guardrails | `kyverno-cli apply` exits 0 on violations; `${PIPESTATUS[0]}` required | 2 weeks undetected | Guardrails existed but did not enforce |
| E — Cost + portability | `$patch: delete` in a Kustomize file deleted the live CRD cluster-wide; all InferenceServices gone in seconds | 4 min recovery | SEV-1 equivalent: all inference endpoints deleted simultaneously |
| F — Production hardening | Backstage `dangerouslyDisableDefaultAuthPolicy` vs `dangerouslyAllowOutsideDevelopment` — these are different flags with different security implications | 1 hour diagnosis | Silent security exposure in production profile |

Prerequisites: `k3d`, `kubectl`, and `helm` installed.

Reality-check documents:

| Milestone | Document | Covers |
|---|---|---|
| A | | connection refused; Argo Unknown comparison state; app stuck not syncing |
| B — KServe Serving | `docs/REALITY_CHECK_MILESTONE_2_KSERVE_SERVING.md` | Istio vs Kourier ingress mismatch; kube-rbac-proxy ImagePullBackOff; Knative CRD rendering conflict |
| C — Golden Path | `docs/REALITY_CHECK_MILESTONE_3_GOLDEN_PATH.md` | Backstage CrashLoopBackOff (Helm values mis-nesting); blank /create/actions (401 on scaffolder); PR merged but app stayed OutOfSync |
| D — Guardrails | `docs/REALITY_CHECK_MILESTONE_4_GUARDRAILS.md` | InferenceService CRD removed by patch; Kyverno label name mismatch; CI policy simulation false-green (fixed in Milestone E) |
| E — Cost Proxy + Portability | `docs/REALITY_CHECK_MILESTONE_5_COST_PROXY.md` | CI false-green root cause + fix; cost proxy design trade-offs; bootstrap script decisions |
| F — Production Hardening | `docs/REALITY_CHECK_MILESTONE_6_PRODUCTION_HARDENING.md` | ApplicationSet design; namespace quota reasoning; OpenCost label strategy; multi-env Backstage values; guest auth vs disabled auth |

Kyverno policies:

| Policy | Blocks | Rationale |
|---|---|---|
| require-standard-labels-inferenceservice | InferenceService without `owner` + `cost-center` labels | Cost attribution is impossible without ownership metadata; OpenCost relies on these labels |
| require-standard-labels-deployment | Deployment without `owner` + `cost-center` labels | Same attribution requirement for non-inference workloads |
| require-resource-requests-limits | Deployment without CPU/memory requests and limits | Unbounded resources cause node contention and unpredictable cost |
| disallow-latest-image-tag | Deployment with `:latest` image | Non-reproducible rollouts; breaks rollback guarantees |
| disallow-root-containers | Deployment containers without `runAsNonRoot: true` | Container breakout from root containers can escalate to host access |

CI checks:

| Check | Tool | Catches |
|---|---|---|
| Manifest schema validation | kubeconform | Malformed YAML, wrong API versions, missing required fields |
| Policy simulation | `kyverno-cli apply` + stdout check | Policy violations against actual rendered manifests before merge (dual exit-code + stdout check guards against false-greens) |
| Helm rendering | `helm template` + `render_backstage.sh` | Helm values hierarchy bugs (the exact failure class from the Backstage RCA) |
| Resource delta | Python + PyYAML | CPU/memory requests in changed `apps/` manifests; posts as a PR comment and GitHub Actions job summary |

See `docs/archive/MILESTONE_C_POSTMORTEM.md` for the full runbook.

Entry points: `bootstrap/root-app.yaml` + `infrastructure/apps/model-endpoints-appset.yaml`.

| Question | Where to look |
|---|---|
| "How does Backstage trigger a deployment?" | Section 3.2 + `backstage/templates/model-endpoint/template.yaml` |
| "How does KServe handle a request?" | Section 3.3 + `infrastructure/kserve/sklearn-runtime.yaml` |
| "How do you prevent bad configs from reaching prod?" | Section 8 + `infrastructure/kyverno/policies/` |
| "How do you prevent containers running as root?" | `infrastructure/kyverno/policies/disallow-root-containers.yaml` |
| "How do you prevent runaway resource consumption?" | `infrastructure/namespaces/default/resource-quota.yaml` + `limit-range.yaml` |
| "How do you attribute costs to teams?" | owner/cost-center labels enforced by Kyverno + OpenCost dashboard at localhost:9090 |
| "Show me a real incident you've debugged" | `infrastructure/INCIDENT_BACKSTAGE_CRASHLOOP_RCA.md` |
| "What failed and what did you learn?" | `docs/REALITY_CHECK_MILESTONE_*.md` |
| "Why Kourier instead of Istio?" | Section 3.3 + `docs/REALITY_CHECK_MILESTONE_2_KSERVE_SERVING.md` |
| "How can I verify the platform works on my laptop?" | `scripts/bootstrap.sh` + `scripts/smoke-test.sh` + `scripts/port-forward-all.sh` |
| "How do you scale from 3 models to 300?" | ApplicationSet in `infrastructure/apps/model-endpoints-appset.yaml` — zero GitOps boilerplate per model |
| "What's different between dev and prod Backstage?" | `infrastructure/backstage/values.yaml` vs `values-prod.yaml` — replicas, auth provider, limits, probe thresholds |
| "How would you promote this to a real cloud cluster?" | `docs/CLOUD_PROMOTION_GUIDE.md` — phase-by-phase: EKS/GKE Terraform, ArgoCD bootstrap, ingress swap, wildcard DNS, TLS (cert-manager or ACM), production Backstage, OpenCost billing API |

Note: This repo runs on local k3d (zero-cost, fully reproducible). The cloud promotion path — EKS/GKE Terraform, ingress swap, DNS, TLS, production Backstage — is documented step by step in `docs/CLOUD_PROMOTION_GUIDE.md`. The application manifests require no changes to run on a cloud cluster; only the cluster and network layer changes.
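The dual exit-code + stdout guard from the CI table can be sketched as follows. `policy_sim` is a stub standing in for the real `kyverno apply ... | tee` step, so the pattern itself runs anywhere; the output format it emits is an assumption, not kyverno-cli's actual output:

```shell
#!/usr/bin/env bash
# Sketch of the dual guard against kyverno-cli-style false-greens:
# trust neither the exit code nor the output alone.
set -u

# Simulated false-green: violations reported on stdout, exit code 0.
policy_sim() { printf 'pass: 4, fail: 1\n'; return 0; }

policy_sim | tee sim.log >/dev/null
rc=${PIPESTATUS[0]}   # exit code of policy_sim, not of tee

verdict="pass"
# Guard 1: a non-zero exit code fails the check.
# Guard 2: even on exit 0, any reported violation on stdout fails the check.
if [ "$rc" -ne 0 ] || grep -Eq 'fail: [1-9]' sim.log; then
  verdict="fail"
fi
echo "rc=$rc verdict=$verdict"
```

Without the stdout grep, this run would have passed CI (the simulator exits 0); with it, the reported violation fails the check — the failure mode that went undetected for two weeks.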
Posted May 4, 2026
Built and documented NeuroScale, a production AI inference platform on Kubernetes, with 5+ technical articles; content reviewed by a Backstage core maintainer.