A lot of people building

Brian Pyatt
A lot of people building with Claude Code-style agents are still focused on prompt engineering.
I get why. It’s the most visible lever.
But I think the bigger opportunity is usually somewhere else: the reusable skills, workflows, or slash commands an agent relies on over and over again.
Those are what shape behavior over time. And in my experience, improving them is less about piling on new instructions and more about tightening the loop around failure.
Watch where the agent breaks. Figure out why. Fix the workflow. Repeat.
Sometimes that means adding a rule. Just as often, it means removing one.
Over the last couple of days, I rebuilt the /close-loop cycle in my framework, Rebar, and it clarified something I’ve been feeling for a while:
A lot of agent systems have evaluation. Fewer have a feedback loop that actually makes them simpler, cheaper, and more reliable over time.
That difference matters.
In the old version of my loop, a feature could be marked complete because the evaluator returned a PASS. The orchestrator would close the issue, everything looked fine, and only later would I realize something important was still missing — like a Prisma migration file.
So the feature wasn’t really done. It just had the appearance of being done.
The evaluator had often already pointed at the problem in its follow-up notes. But the system wasn’t treating that kind of language as blocking. “PASS with follow-ups” was getting interpreted too generously.
That was the real failure: not bad evaluation, but a weak handoff between evaluation and release.
So I rebuilt the loop around four gates, and all four have to pass before “done” means anything:
1. Evaluator: checks code, scope, and completeness, and writes structured findings.
2. Release gate: scans those findings for blocking language like “must generate,” “cannot ship,” or “before any live DB.” If that language shows up, the work is blocked.
3. Cycle-scoped improve step: promotes only the current cycle’s validated observations into the expertise file, instead of dragging in stale backlog noise.
4. Meta-improve: looks across evaluator logs for repeated failure patterns and proposes changes to the templates themselves, with a human review step before anything sensitive gets updated.
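To make the second gate concrete, here is a minimal sketch of a release gate in TypeScript. The phrase list, function name, and plain string matching are my own illustration of the idea, not Rebar’s actual implementation:

```typescript
// Blocking phrases the gate scans for in evaluator findings.
// This list is illustrative; a real gate would maintain its own.
const BLOCKING_PHRASES = [
  "must generate",
  "cannot ship",
  "before any live db",
];

// A finding blocks release if it contains any blocking phrase,
// matched case-insensitively.
function isBlocked(findings: string[]): boolean {
  return findings.some((finding) =>
    BLOCKING_PHRASES.some((phrase) =>
      finding.toLowerCase().includes(phrase)
    )
  );
}

// "PASS with follow-ups" is no longer a free pass: the follow-up
// text itself is scanned for blocking language.
console.log(isBlocked(["PASS", "Must generate a Prisma migration first"])); // true
console.log(isBlocked(["PASS", "Consider renaming this variable"]));        // false
```

The point is that the handoff between evaluation and release becomes mechanical: if the evaluator wrote blocking language anywhere, the orchestrator cannot close the issue.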
That last piece is where the compounding effect starts to show up.
Because the default instinct in agent systems is usually to add.
Add another reminder. Add another caveat. Add another paragraph to the template so the model doesn’t make that mistake again.
Sometimes that’s right. But it’s also how workflows slowly turn into bloated instruction stacks that cost more and work worse.
Every extra line gets paid for on every future run. And long prompts full of overlapping rules are often harder for models to follow consistently than a smaller number of clear ones.
So the better question is not “what else should we add?”
It’s “what actually belongs in the workflow?”
In the first real cycle of the rebuilt loop, I saw four patterns:
- schema changes without Prisma migrations
- dirty working tree bleeding across features
- orphan Vue refs that were declared but never rendered
- Hono context typing debt across multiple routes
Only the first two justified workflow changes.
The orphan refs were already being caught by the evaluator, so there was no reason to duplicate that logic in the template. The Hono typing issue was real, but it was cleanup work, not a process problem.
That distinction matters more than it sounds.
If every bug becomes a workflow rule, the system gets heavier every week. If you’re disciplined about separating repeatable process failures from one-off implementation issues, the workflow stays lean.
And that’s really the bigger point here.
There are two things improving at the same time:
First, context gets better. Validated observations get promoted into structured expertise, so the next run starts with better knowledge of the codebase and less repeated discovery.
Second, workflow gets sharper. The system looks at repeated failures and changes the reusable commands around the agent — ideally by adding only what consistently matters and cutting what doesn’t.
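The first half of that, promoting only the current cycle’s validated observations, can be sketched like this. The types and field names are hypothetical, not Rebar’s real schema:

```typescript
// Hypothetical shape for an observation recorded during a cycle.
interface Observation {
  cycleId: string;    // which cycle produced it
  validated: boolean; // did it survive evaluation?
  note: string;       // the observation itself
}

// Only the current cycle's validated observations reach the expertise
// file; stale backlog items from earlier cycles are left behind.
function promoteForCycle(
  observations: Observation[],
  currentCycle: string
): string[] {
  return observations
    .filter((o) => o.cycleId === currentCycle && o.validated)
    .map((o) => o.note);
}

const observations: Observation[] = [
  { cycleId: "c2", validated: true, note: "schema changes need Prisma migrations" },
  { cycleId: "c1", validated: true, note: "stale backlog item" },
  { cycleId: "c2", validated: false, note: "unverified hunch" },
];

console.log(promoteForCycle(observations, "c2"));
// ["schema changes need Prisma migrations"]
```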
That combination is where the gains compound.
The agent starts with better context, but a lighter operating model.
That’s a much healthier direction than what a lot of systems drift toward, which is more and more prompt text, more accumulated edge-case handling, and rising cost without much improvement in reliability.
The artifact trail is what makes this workable.
Each cycle leaves behind evidence: evaluator logs, raw findings, expertise updates, queued template patches, wiki notes.
After enough cycles, you’re not just reacting to the last annoying failure. You can actually see what keeps recurring, what was already covered elsewhere, and which instructions are no longer doing useful work.
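Detecting “what keeps recurring” can be as simple as counting how many distinct cycles each failure category appears in. A sketch, with an illustrative threshold and invented category tags:

```typescript
// Given one array of failure-category tags per cycle, return the
// categories that appeared in at least `minCycles` distinct cycles.
// Only those justify a template change; one-offs stay cleanup work.
function recurringPatterns(
  findingsPerCycle: string[][],
  minCycles = 2
): string[] {
  const cycleCounts = new Map<string, number>();
  for (const cycle of findingsPerCycle) {
    // Dedupe within a cycle so a pattern counts once per cycle.
    for (const pattern of new Set(cycle)) {
      cycleCounts.set(pattern, (cycleCounts.get(pattern) ?? 0) + 1);
    }
  }
  return Array.from(cycleCounts.entries())
    .filter(([, count]) => count >= minCycles)
    .map(([pattern]) => pattern);
}

const logs = [
  ["missing-migration", "orphan-ref"],
  ["missing-migration", "dirty-tree"],
  ["dirty-tree"],
];

console.log(recurringPatterns(logs));
// ["missing-migration", "dirty-tree"]
```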
That makes subtraction much easier to justify.
And yes, there’s a token-cost argument here too.
A 2,000-token template invoked 50 times a day costs 100,000 tokens a day just to load. Trim 500 tokens of dead guardrails and the savings add up quickly.
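Spelled out, using the numbers above (the variable names are just for illustration):

```typescript
// Back-of-the-envelope cost of loading a template on every run.
const templateTokens = 2000;  // tokens loaded per invocation
const runsPerDay = 50;
const dailyLoadCost = templateTokens * runsPerDay;  // 100,000 tokens/day

const trimmedTokens = 500;    // dead guardrails removed
const dailySavings = trimmedTokens * runsPerDay;    // 25,000 tokens/day, 25% of the load cost
```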
But the bigger win is clarity.
In practice, models usually do better with fewer, more coherent rules than with long prompts full of defensive clutter. So shortening the workflow isn’t just cheaper. It often improves quality too.
To me, this is the more interesting layer of agent design: not just agentic coding, but skill engineering.
The reusable commands around an agent should themselves be under active improvement. Not based on vibes. Not based on one weird miss. Based on repeated observation and actual evidence.
If your setup doesn’t have:
- an evaluator producing structured findings
- a release gate that can interpret blockers
- a way to detect recurring failure patterns
- and a human review step for sensitive workflow changes
then there’s a good chance the system will get more expensive over time, not less.
Every miss turns into another sentence. Every edge case turns into another rule. Eventually you’re feeding the model more instructions and getting less leverage out of them.
The better path is a tighter loop:
less prompt where possible, more signal where it matters, and workflows that get sharper as the system learns.
That’s what I’m trying to build into Rebar.
Rebar is open-source. The close-loop command, the meta-improve queue, and the release gate are in the repo. Play with it, and if you see a dead instruction in my own templates, send me a pull request.

Posted Apr 20, 2026
