Building a 44.1kHz Ultra-Low Latency TTS Engine by Vladyslav Soliannikov

From arXiv Research to Production: Building a 44.1kHz Ultra-Low Latency TTS Engine
Existing TTS models are often too heavy for real-time applications or suffer from poor audio quality. I re-implemented and optimized the Supertonic v2 architecture from scratch, tailoring it for high-fidelity speech synthesis.
The Technical Breakthrough: Instead of relying on public repos, I built this engine based on the original arXiv research (arXiv:2509.11084), implementing a complex 3-stage pipeline:
Speech Autoencoder (~47M params): Optimized with a HiFi-GAN generator to eliminate the "metallic" artifacts common in WaveNeXt-style heads.
Text-to-Latent Module (~19M params): Utilizing conditional Flow Matching and LARoPE (Length-Aware RoPE) for robust text-speech alignment.
Utterance-Level Duration Predictor: Simplified logic for natural prosody transfer.
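To illustrate the sampling side of conditional Flow Matching in the Text-to-Latent stage, here is a minimal sketch of fixed-step Euler integration of a velocity field from noise (t=0) to a latent (t=1). It is not the engine's code: `euler_sample` and `make_toy_velocity` are hypothetical names, and the closed-form toy velocity stands in for the trained, text-conditioned network.

```python
import random


def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the toy field never divides by zero
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target):
    """Toy stand-in for a trained model (assumption, not the real network).

    For the straight path x_t = (1 - t) * x0 + t * target, the velocity that
    carries any point x at time t onto `target` is (target - x) / (1 - t).
    """
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


# Start from Gaussian noise and flow toward a made-up 3-dimensional "latent".
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(3)]
target_latent = [0.5, -1.0, 2.0]
latent = euler_sample(make_toy_velocity(target_latent), noise, num_steps=8)
print(latent)
```

In a real system the velocity would come from the network conditioned on text and the reference speaker, and the number of Euler steps trades quality against latency.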
Performance Benchmarks:
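Speed: Synthesis runs about 167x faster than real time on consumer GPUs (a Real-Time Factor of roughly 0.006 when RTF is defined as processing time divided by audio duration).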
Quality: Full 44.1kHz studio-quality output.
Optimization: Production-ready ONNX export (~260MB total model size).
Zero-Shot Capability: Reliable voice cloning from as little as 5 seconds of reference audio.
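The speed figure above can be read two ways, so here is a small sketch of the arithmetic. It assumes the common convention that RTF is processing time divided by audio duration, with the speedup being its inverse. The timing values are made up for illustration and are not measurements from this engine.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF as processing time over audio duration (lower is faster)."""
    return synthesis_seconds / audio_seconds


def speedup(synthesis_seconds, audio_seconds):
    """How many times faster than real time the synthesis ran."""
    return audio_seconds / synthesis_seconds


def num_samples(audio_seconds, sample_rate=44_100):
    """Number of PCM samples in a mono clip at the given sample rate."""
    return round(audio_seconds * sample_rate)


# Hypothetical timing: 10 s of audio synthesized in 0.06 s.
audio_s, synth_s = 10.0, 0.06
print(f"speedup: {speedup(synth_s, audio_s):.1f}x")
print(f"RTF:     {real_time_factor(synth_s, audio_s):.4f}")
print(f"samples: {num_samples(audio_s)}")
```

With these illustrative numbers the speedup comes out at about 167x, which corresponds to an RTF of about 0.006.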
I didn't just "train a model"; I re-engineered the audio pipeline to handle complex linguistics (character-level tokenization) and ensured the entire system is production-ready via Docker and optimized inference scripts.
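As an illustration of character-level tokenization, here is a minimal sketch that maps each character to an integer ID and reserves ID 0 for characters not seen when the vocabulary was built. `CharTokenizer` and its methods are hypothetical names, not the engine's actual tokenizer.

```python
class CharTokenizer:
    """Minimal character-level tokenizer (illustrative sketch)."""

    def __init__(self, corpus, unk_token="<unk>"):
        # Vocabulary = every distinct character in the corpus, in sorted order.
        chars = sorted(set("".join(corpus)))
        self.unk_token = unk_token
        self.id_to_char = [unk_token] + chars  # ID 0 is reserved for unknowns
        self.char_to_id = {c: i for i, c in enumerate(self.id_to_char)}

    def encode(self, text):
        """Map each character to its ID; unseen characters map to 0."""
        return [self.char_to_id.get(ch, 0) for ch in text]

    def decode(self, ids):
        """Map IDs back to characters (unknowns render as the unk token)."""
        return "".join(self.id_to_char[i] for i in ids)


tokenizer = CharTokenizer(["hello world"])
ids = tokenizer.encode("hello")
print(ids, tokenizer.decode(ids))
```

Character-level vocabularies stay small and avoid a separate phonemizer step, at the cost of longer input sequences than word or subword tokens.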
If you need a custom, high-performance voice solution that sounds human and runs lightning-fast, let's connect.

Posted Feb 7, 2026

Supertonic v2 implementation from arXiv research. About 167x faster than real time at 44.1kHz. Replaced WaveNeXt with HiFi-GAN to eliminate metallic artifacts. 260MB ONNX model.