
OpsAgent. AI Workflow Gallery & Run Visualizer

Muzzammil Hussain


OpsAgent Gallery: drop-in AI workflows that actually survive production.

Most "AI for the back office" pitches end the moment the demo finishes and reality hits: hallucinations, no evals, no rollback plan, no idea what it's costing. OpsAgent ships nine pre-built workflows that solve real ops problems and come with evals, an explicit P50 budget, fallbacks, and one-click rollback. This gallery is the front door.

The Challenge

Every internal-tools team I spoke to had the same shape of pain. They had three or four "AI scripts" running on a laptop somewhere, owned by whoever wrote them, with no visibility, no shared cost ledger, and no way for the next person to take over. The scripts also broke quietly: a vendor changed an invoice format and three months of accruals were wrong before anyone noticed.
What people actually wanted was a small library of opinionated workflows, each one production-grade, that they could deploy, watch, and replace with their own version when ready.

What I Built

A gallery of nine production workflows. Each one ships with the same shape: a YAML manifest, an evals folder, observability hooks, a fallback model, and a rollback plan.
Contract Review v3. Extract terms and risk flags from PDFs in Drive, route high-risk to legal in Slack.
Invoice to QuickBooks. Pull line items and GL codes from PDF or EML invoices, post drafts to QBO.
Lead Qualifier (BANT). Score inbound leads, enrich from Apollo, route to AE or SDR with reasoning.
Email Triage v2. Categorize and draft responses for shared inboxes (support@, ops@).
Slack to Linear. Detect bug reports in #help, draft Linear tickets with priority and repro steps.
Meeting Action Items. Read Fireflies or Granola transcripts, write owner-tagged action items into Notion.
Policy Q&A (HR). RAG over 280 HR policy docs with paragraph-level citations.
Customer Sentiment. Score Zendesk tickets and Gong calls, post weekly digest to #cs.
DocClassify (KYC). Tag uploaded docs (W-9, EIN, articles of incorporation), route to KYC.
Plus a run visualizer (the hero panel in this cover) that shows the most recent run, the extracted fields, the eval result, the cost, and where it was posted.
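The common shape described above can be sketched as a typed manifest. This is a minimal stdlib sketch (the real system uses Pydantic and YAML files; the field names and the example values here are my assumptions, not the actual schema):

```python
from dataclasses import dataclass

@dataclass
class WorkflowManifest:
    """Hypothetical per-workflow manifest: the five things every workflow ships with."""
    name: str
    primary_model: str       # e.g. a GPT-4-class model
    fallback_model: str      # e.g. Claude 3.5
    p50_budget_usd: float    # explicit per-run P50 cost budget
    evals_dir: str           # golden dataset + checks for the evals harness
    rollback_target: str     # previous known-good image tag for one-click rollback

# What parsing one workflow's YAML manifest might yield:
raw = {
    "name": "invoice-to-quickbooks",
    "primary_model": "gpt-4-class",
    "fallback_model": "claude-3.5",
    "p50_budget_usd": 0.02,
    "evals_dir": "evals/invoice_to_quickbooks",
    "rollback_target": "invoice-to-quickbooks:v12",
}
manifest = WorkflowManifest(**raw)
assert manifest.p50_budget_usd <= 0.05  # stays under the gallery-wide hard cap
```

Loading every manifest through one schema like this is what makes the gallery uniform: the visualizer and the cost ledger can treat all nine workflows identically.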

Technical Foundation

Python with LangGraph for orchestration, Pydantic for typed I/O.
OpenAI GPT-4-class models as primary, Claude 3.5 as fallback; requests auto-route by task affinity.
Postgres with pgvector for vector retrieval, Tantivy for the keyword side of hybrid search.
A custom evals harness with golden datasets per workflow. CI fails on a regression beyond 1%.
Observability via OpenLLMetry into a self-hosted Grafana, plus a per-workflow cost ledger.
Deployment: Docker images, one workflow per container, Argo Workflows for the heavy ones.

Outcome

14,872 production runs in the first month across pilot teams.
Average eval pass rate 93.6% across the gallery, with a 1.2% regression budget.
Average cost per run $0.018, with a $0.05 cap that fails closed.
Three pilot teams have replaced 80+ Zapier workflows that previously ran on a personal account.
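The $0.05 fail-closed cap is the simplest of these mechanisms to show. A sketch, assuming per-run cost is known before any side effect is committed (the names `charge` and `CostCapExceeded` are mine, not from the codebase):

```python
class CostCapExceeded(RuntimeError):
    """Raised before the run posts anywhere: over-budget runs fail closed."""

def charge(ledger: dict[str, float], workflow: str,
           run_cost_usd: float, cap_usd: float = 0.05) -> float:
    """Record a run's cost in the per-workflow ledger, or abort if it exceeds the cap."""
    if run_cost_usd > cap_usd:
        raise CostCapExceeded(f"{workflow}: ${run_cost_usd:.3f} > ${cap_usd:.2f} cap")
    ledger[workflow] = ledger.get(workflow, 0.0) + run_cost_usd
    return ledger[workflow]

ledger: dict[str, float] = {}
charge(ledger, "invoice-to-quickbooks", 0.018)  # typical run, well under the cap
```

Failing closed means an over-budget run never writes to QuickBooks, Slack, or Linear; it surfaces in the visualizer as a failed run with a cost reason instead.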

Posted May 8, 2026

A gallery of nine production-tested AI workflows for back-office teams. Each ships with evals, a P50 budget, and live observability.