A production-grade multi-agent recruiting system that treats candidate evaluation as an engineering problem, not a prompt problem. Most AI recruiting tools are a single LLM call wrapped in a UI, with no structured evaluation and no reliability guarantees. This one is built differently.
Architecture
The pipeline runs as a deterministic state machine, not a free-running agent loop. It has four sequential stages: role extraction, criteria parsing, project ranking, CV Q&A. Each stage has a defined input/output contract, so output that violates the contract is rejected at the stage boundary instead of flowing into the next stage.
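A minimal sketch of what a fixed-order pipeline with per-stage contracts could look like. The stage names come from the description above; the `Contract` class, the state keys (`job_text`, `role`, `criteria`, and so on) and the handler signature are assumptions for illustration, not the project's actual schema.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Stage(Enum):
    ROLE_EXTRACTION = auto()
    CRITERIA_PARSING = auto()
    PROJECT_RANKING = auto()
    CV_QA = auto()


@dataclass(frozen=True)
class Contract:
    """Keys a stage must receive and keys it must produce (hypothetical schema)."""
    requires: frozenset
    produces: frozenset


class ContractError(ValueError):
    """Raised when a stage's input or output does not match its contract."""


# The stage order is plain data, so no model output can change the control flow.
PIPELINE = [
    (Stage.ROLE_EXTRACTION, Contract(frozenset({"job_text"}), frozenset({"role"}))),
    (Stage.CRITERIA_PARSING, Contract(frozenset({"role"}), frozenset({"criteria"}))),
    (Stage.PROJECT_RANKING, Contract(frozenset({"criteria", "projects"}), frozenset({"ranking"}))),
    (Stage.CV_QA, Contract(frozenset({"ranking", "question"}), frozenset({"answer"}))),
]


def run_pipeline(state: dict, handlers: dict) -> dict:
    """Run every stage in order, validating inputs before and outputs after each one."""
    state = dict(state)  # do not mutate the caller's dict
    for stage, contract in PIPELINE:
        missing = contract.requires - set(state)
        if missing:
            raise ContractError(f"{stage.name}: missing input {sorted(missing)}")
        output = handlers[stage](state)
        if set(output) != set(contract.produces):
            raise ContractError(f"{stage.name}: output keys {sorted(output)} violate contract")
        state.update(output)
    return state
```

In this sketch a handler that returns extra or missing keys stops the run with a `ContractError` at the boundary, which is the behavior the contracts are meant to guarantee.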
Key systems
MCP-based (Model Context Protocol) tool registry with an A2A (agent-to-agent) critic layer. The recruiter agent hands off its responses to an independent critic agent for structured automated review before results reach the user.
LLM-as-a-Judge evaluation suite with golden datasets and multi-metric scoring (faithfulness, relevancy, factuality).
Session-based memory and full trajectory logging across the entire pipeline.
Voice interface: Deepgram nova-2 STT and Google Neural2-D TTS over persistent WebSockets. ~600ms time-to-first-audio with sentence-level parallel synthesis and barge-in cancellation.
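Sentence-level parallel synthesis can be sketched roughly as follows: split the reply into sentences, start one synthesis task per sentence, and play results in sentence order so the first sentence is audible while later ones are still being generated. `fake_synthesize` is a hypothetical stand-in for the real TTS request, and the regex splitter is a deliberately naive assumption.

```python
import asyncio
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


async def fake_synthesize(sentence: str) -> bytes:
    """Hypothetical stand-in for a TTS call; returns the text as bytes."""
    await asyncio.sleep(0.01 * len(sentence.split()))
    return sentence.encode("utf-8")


async def stream_speech(text: str, synthesize=fake_synthesize):
    """Start synthesis for all sentences at once, then yield audio in sentence order."""
    tasks = [asyncio.create_task(synthesize(s)) for s in split_sentences(text)]
    try:
        for task in tasks:
            yield await task
    finally:
        # If the consumer stops early, drop any synthesis still in flight.
        for task in tasks:
            task.cancel()


async def collect(text: str) -> list[bytes]:
    """Usage example: gather every chunk the stream produces."""
    return [chunk async for chunk in stream_speech(text)]
```

Time-to-first-audio in this arrangement depends on the first sentence only, not on the length of the whole reply.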
TTS benchmarking: why Google Neural2-D won
I benchmarked three TTS providers and the results surprised me: ElevenLabs was the slowest option.
Browser Web Speech API: ~300ms E2E. Free. Sounds robotic, zero production reliability.
ElevenLabs Turbo v2.5: ~977ms first audio. The "low-latency" model. GCP Cloud Run → ElevenLabs is a cross-cloud HTTP round trip. Physics wins.
Google Cloud Neural2-D: ~398ms first audio. Co-located with the rest of the stack on GCP. 1M chars/month free.
The full E2E breakdown (transcript → first audio byte): agent routing 35ms (deterministic, no LLM in hot path) + TTS synthesis ~400ms account for ~435ms of the ~600ms total. The remaining ~165ms is not itemized here.
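For reference, first-audio latency of a streaming response can be measured with a few lines like the ones below. This is a sketch of one way to take such a measurement, not the benchmark harness used here; `fake_tts_stream` is a hypothetical stand-in for a provider's streaming response.

```python
import asyncio
import time


async def time_to_first_chunk(stream) -> float:
    """Return the milliseconds elapsed until the async iterator yields its first chunk."""
    start = time.perf_counter()
    async for _chunk in stream:
        return (time.perf_counter() - start) * 1000.0
    raise RuntimeError("stream ended without producing audio")


async def fake_tts_stream(first_byte_delay_s: float):
    """Hypothetical provider stream: waits, then yields two small audio chunks."""
    await asyncio.sleep(first_byte_delay_s)
    yield b"\x00" * 320
    await asyncio.sleep(first_byte_delay_s)
    yield b"\x00" * 320


if __name__ == "__main__":
    ms = asyncio.run(time_to_first_chunk(fake_tts_stream(0.05)))
    print(f"first audio after {ms:.0f} ms")
```

Starting the clock before the request is issued, rather than after the connection is open, is what makes cross-cloud round trips show up in the number.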
Three things that actually moved the needle:
Deepgram endpointing. Default 300ms silence before is_final. Cut to 150ms. Shaved ~150ms off perceived latency for short commands.
Silence keepalive. During TTS playback, send silent audio frames (zero-valued PCM bytes) to Deepgram so the connection stays warm. Without it, cold reconnect adds 200-400ms.
Barge-in as an asyncio.Event. When user interrupts, tts_cancel.set() aborts mid-synthesis. No queue backlog, no stale audio.
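The barge-in pattern can be sketched like this: each synthesis call races against `tts_cancel.wait()`, and whichever finishes first decides whether audio is played or dropped. The `asyncio.sleep` call is a hypothetical stand-in for the real TTS request, and the timings in `demo` are arbitrary.

```python
import asyncio


async def speak(sentences, tts_cancel: asyncio.Event, played: list) -> None:
    """Play sentences in order, stopping as soon as tts_cancel is set."""
    for sentence in sentences:
        if tts_cancel.is_set():
            break
        # Fake TTS call: takes 100 ms and returns the sentence as its "audio".
        synth = asyncio.create_task(asyncio.sleep(0.1, result=sentence))
        cancel = asyncio.create_task(tts_cancel.wait())
        done, _pending = await asyncio.wait(
            {synth, cancel}, return_when=asyncio.FIRST_COMPLETED
        )
        if cancel in done:
            synth.cancel()  # abort mid-synthesis, nothing is queued for playback
            break
        cancel.cancel()  # this sentence finished; stop waiting on the event
        played.append(synth.result())


async def demo() -> list:
    """Simulate a user interrupting while a later sentence is being synthesized."""
    tts_cancel = asyncio.Event()
    played: list = []
    speaker = asyncio.create_task(
        speak(["one", "two", "three", "four"], tts_cancel, played)
    )
    await asyncio.sleep(0.25)  # user starts talking
    tts_cancel.set()
    await speaker
    return played


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Because the loop checks the event before starting each sentence and cancels the in-flight task on interrupt, no audio generated after the barge-in is ever appended to `played`.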
The lesson: co-location beats reputation. ElevenLabs sounds better. But "Turbo v2.5" latency claims assume you're calling from their network. In this benchmark the cross-cloud path cost roughly 500-600ms (977ms vs 398ms to first audio) that you can't optimize away. For production voice AI, the fastest path is the shortest network path.
Observability
Every agent span instrumented with OpenTelemetry and exported to Langfuse for live trace visibility. Deployed to GCP Cloud Run.
Results
~600ms time-to-first-audio on the voice pipeline
LLM-as-a-Judge scoring across faithfulness, relevancy, and factuality with regression logging
Critic-gated delivery: the critic layer blocks unverified outputs before they reach the user, so no response is shown without passing review
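One way a critic gate over the three judge metrics could work is shown below. The metric names come from the evaluation suite described above; the threshold values, the `JudgeScores` and `Verdict` types, and the withheld-response message are all assumptions for illustration.

```python
from dataclasses import dataclass

METRICS = ("faithfulness", "relevancy", "factuality")


@dataclass(frozen=True)
class JudgeScores:
    faithfulness: float
    relevancy: float
    factuality: float


# Hypothetical per-metric minimums; real values would be tuned on the golden dataset.
THRESHOLDS = JudgeScores(faithfulness=0.8, relevancy=0.7, factuality=0.8)


@dataclass(frozen=True)
class Verdict:
    approved: bool
    failed_metrics: tuple


def critic_review(scores: JudgeScores, thresholds: JudgeScores = THRESHOLDS) -> Verdict:
    """Approve only if every metric meets its minimum; report which ones failed."""
    failed = tuple(
        name for name in METRICS
        if getattr(scores, name) < getattr(thresholds, name)
    )
    return Verdict(approved=not failed, failed_metrics=failed)


def deliver(response: str, scores: JudgeScores) -> str:
    """Return the response if the critic approves it, otherwise a withheld notice."""
    verdict = critic_review(scores)
    if not verdict.approved:
        return "Response withheld: failed " + ", ".join(verdict.failed_metrics)
    return response
```

Logging each `Verdict` alongside the scores would give the regression history mentioned above, since a drop in any single metric becomes visible per response.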