AI Engineering Workflow Toolkit: Governed Code Review

Sergiu Nicoara

A five-layer governed review pipeline in which every LLM verdict must trace back to deterministic tool output. The review agent suppresses any finding that lacks traceable evidence, so hallucinated findings are filtered out by the architecture rather than merely discouraged by the prompt.

Architecture

Five layers, each with a clear contract:
Versioned AGENTS.md skill library with lifecycle hooks defining what each agent can do and when it activates.
Claude Code PostToolUse hook auto-triggered on every file edit, so no change escapes review.
MCP tools (ruff, mypy, bandit) as evidence gates. Every finding must be backed by deterministic tool output. No tool evidence, no finding.
Parallel subagents for security, architecture, and style running independently against the same changeset.
Independent review agent validating traceability and producing ranked line-level annotations.
The key constraint: the review agent suppresses any finding that lacks a traceable link to deterministic tool output. Unless a ruff diagnostic, a mypy error, or a bandit issue backs it, the finding doesn't ship. This removes hallucinated code review comments at the architecture level.

Self-review validation

The pipeline validated itself in production when, during a self-review, it flagged a path-traversal vulnerability in its own code. The security subagent (backed by bandit) caught a file-path construction that could escape the sandbox. A system that reviews its own code and finds a real vulnerability offers concrete evidence that evidence-gated review works.

Evaluation harness

LLM-as-judge evaluation against a golden dataset, with regression logging. The pass threshold is 4.0 out of 5.0. A --compare flag surfaces per-dimension score deltas (up/down arrows) between runs, so you can see exactly which dimensions improved or regressed from one pipeline version to the next.
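The --compare report can be approximated as shown below. This sketch rests on several assumptions. The judge call itself is omitted, and each run is assumed to yield per-case scores for a set of named dimensions. The dimension names used in the usage example (such as `groundedness`) are hypothetical. The 4.0 threshold is applied to each dimension's mean.

```python
from statistics import mean

THRESHOLD = 4.0  # pass threshold on the judge's 5-point scale


def summarize(scores: dict[str, list[float]]) -> dict[str, float]:
    """Average the judge's per-case scores for each dimension."""
    return {dim: mean(vals) for dim, vals in scores.items()}


def passes(current: dict[str, float]) -> bool:
    """A run passes only if every dimension's mean meets the threshold."""
    return all(v >= THRESHOLD for v in current.values())


def compare(previous: dict[str, float], current: dict[str, float]) -> list[str]:
    """Render per-dimension deltas with up/down arrows plus a PASS/FAIL flag."""
    lines = []
    for dim in sorted(current):
        now = current[dim]
        before = previous.get(dim)
        if before is None:
            arrow, delta = "new", ""  # dimension absent from the previous run
        else:
            diff = now - before
            arrow = "↑" if diff > 0 else "↓" if diff < 0 else "="
            delta = f" ({diff:+.2f})"
        status = "PASS" if now >= THRESHOLD else "FAIL"
        lines.append(f"{dim:<16}{now:.2f} {arrow}{delta} {status}")
    return lines
```

For example, with hypothetical runs `prev = summarize({"groundedness": [4.0, 4.0]})` and `curr = summarize({"groundedness": [4.5, 4.5]})`, `compare(prev, curr)` yields a single line showing `↑ (+0.50) PASS`. Comparing per-dimension means rather than a single overall score keeps a regression in one dimension from being hidden by gains in another.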

Observability

All agent spans are instrumented with OpenTelemetry and exported via OTLP to Jaeger for live trace visibility. Every review cycle is traceable end to end, from the hook trigger through subagent execution to the final annotation output.

Deployment

A live React/WebSocket dashboard shows real-time review status. The pipeline is deployable as a hook, a CLI, or an MCP server, and is containerized with Docker, using a persistent volume and single-command deployment.
Stack: Python, FastAPI, TypeScript, React, OpenTelemetry, Jaeger, Docker, Claude Code, MCP.

Posted Jun 11, 2026

5-layer governed code review pipeline where every LLM verdict must trace to deterministic tool output. Parallel subagents, evidence gates via MCP tools, and an independent review agent that suppresses ungrounded findings.