Building a 44.1kHz Ultra-Low Latency TTS Engine by Vladyslav Soliannikov

Building a 44.1kHz Ultra-Low Latency TTS Engine

Vladyslav Soliannikov

ML Engineer

Researcher

AI Engineer

Python

PyTorch

Electronics

From arXiv Research to Production: Building a 44.1kHz Ultra-Low Latency TTS Engine

Existing TTS models are often too heavy for real-time applications or suffer from poor audio quality. I re-implemented and optimized the Supertonic v2 architecture from scratch, tailoring it for high-fidelity speech synthesis.

The Technical Breakthrough: Instead of relying on public repos, I built this engine directly from the original arXiv paper (arXiv:2509.11084), implementing a three-stage pipeline:

Speech Autoencoder (~47M params): Uses a HiFi-GAN generator in place of the WaveNeXt-style head to eliminate the "metallic" artifacts common in such heads.

Text-to-Latent Module (~19M params): Uses conditional flow matching and LARoPE (Length-Aware RoPE) for robust text-speech alignment.

Utterance-Level Duration Predictor: Simplified logic for natural prosody transfer.
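
The flow-matching stage in the list above can be illustrated with a minimal sketch. It assumes the common straight-line probability path between noise and target latent, which this post does not confirm as the path actually used. The names cfm_pair and euler_sample are hypothetical, and plain Python lists stand in for latent tensors.

```python
import random


def cfm_pair(x0, x1, t):
    """Build one flow-matching training example.

    Assumes the straight-line path x_t = (1 - t) * x0 + t * x1,
    whose regression target (the velocity) is x1 - x0.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]
    return x_t, target


def euler_sample(x0, velocity_fn, steps=8):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy values (hypothetical): x0 is a noise sample, x1 a "latent" to reach.
rng = random.Random(0)
x1 = [0.5, -1.0, 2.0]
x0 = [rng.gauss(0.0, 1.0) for _ in x1]

x_t, target = cfm_pair(x0, x1, t=0.25)

# A trained network would predict the velocity from (x_t, t, text, speaker).
# Here the exact target stands in for that prediction.
generated = euler_sample(x0, lambda x, t: target, steps=8)
```

With the exact velocity, a handful of Euler steps carries the noise sample onto the target latent, which is why few-step sampling is attractive for low-latency synthesis.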

Performance Benchmarks:

Speed: Roughly 167x faster than real time on consumer GPUs (an RTF of about 0.006 under the usual processing-time / audio-duration definition).

Quality: Full 44.1kHz studio-quality output.

Optimization: Production-ready ONNX export (~260MB total model size).

Zero-Shot Capability: Reliable voice cloning from as little as 5 seconds of reference audio.
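
To make the speed and sample-rate figures above concrete, here is a small sketch of the arithmetic. It assumes the usual definition of real-time factor (processing time divided by audio duration, so lower is faster). The timing values are hypothetical and chosen only to reproduce a 167x speedup; they are not measurements from this project.

```python
SAMPLE_RATE_HZ = 44_100  # output sample rate stated in the post


def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = processing time / generated audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup(synthesis_seconds, audio_seconds):
    """How many times faster than real time the synthesis ran."""
    if synthesis_seconds <= 0:
        raise ValueError("synthesis_seconds must be positive")
    return audio_seconds / synthesis_seconds


# Hypothetical timing: 10 s of audio generated 167x faster than real time.
audio_seconds = 10.0
synthesis_seconds = audio_seconds / 167.0

rtf = real_time_factor(synthesis_seconds, audio_seconds)
factor = speedup(synthesis_seconds, audio_seconds)

# Number of samples in a 5-second reference clip at 44.1 kHz.
reference_samples = 5 * SAMPLE_RATE_HZ
```

Under these assumptions, 10 seconds of speech takes about 60 ms to synthesize, and a 5-second reference clip contains 220,500 samples.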

I didn't just "train a model"; I re-engineered the audio pipeline to handle complex linguistics (character-level tokenization) and ensured the entire system is production-ready via Docker and optimized inference scripts.
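
Character-level tokenization, mentioned above, can be sketched in a few lines. This is an illustrative assumption, not the project's actual tokenizer: the special tokens, the vocabulary-building rule, and the function names build_vocab, encode, and decode are all hypothetical.

```python
def build_vocab(texts, specials=("<pad>", "<unk>")):
    """Map each distinct character to an integer id, after the special tokens."""
    chars = sorted({ch for text in texts for ch in text})
    return {tok: i for i, tok in enumerate(list(specials) + chars)}


def encode(text, vocab):
    """Convert text to ids, one per character; unseen characters map to <unk>."""
    unk = vocab["<unk>"]
    return [vocab.get(ch, unk) for ch in text]


def decode(ids, vocab):
    """Convert ids back to text using the inverse vocabulary."""
    inverse = {i: tok for tok, i in vocab.items()}
    return "".join(inverse[i] for i in ids)


vocab = build_vocab(["hello world"])
ids = encode("hello", vocab)
text = decode(ids, vocab)
```

Working at the character level avoids a separate grapheme-to-phoneme step, at the cost of leaving pronunciation for the model to learn from data.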

If you need a custom, high-performance voice solution that sounds human and runs lightning-fast, let's connect.

Posted Feb 7, 2026

Supertonic v2 implementation from arXiv research. 167x faster than real time at 44.1kHz. Replaced WaveNeXt with HiFi-GAN to eliminate metallic artifacts. 260MB ONNX model.
