## Architecture

The service is split into a single master process and a pool of TTS workers that communicate over ZeroMQ:

- **Master process (`main.py`)**: A single FastAPI application serves as the entry point for all HTTP requests. It handles API logic, authentication, and voice management. Instead of performing TTS tasks itself, it dispatches them to the worker pool via a ZeroMQ message queue.
- **Worker processes (`worker.py`)**: For each available GPU (or for the CPU if no GPUs are present), a dedicated pool of worker processes is spawned. Each worker initializes its own TTS engine, loads the models onto its assigned device, and listens for jobs from the master.
- **Job queue (PUSH/PULL)**: The master pushes TTS requests to a central queue, and idle workers pull jobs from it.
- **Result stream (PUSH/PULL)**: Workers push the resulting audio chunks back to the master, which then streams them to the client.
- **Control channel (PUB/SUB)**: The master sends control commands (e.g., to warm up a new voice's cache or cancel a request) to all workers simultaneously.

## Configuration

The service is configured through environment variables, typically defined in a `.env` file in the project root.

| Variable | Description | Default |
| --- | --- | --- |
| `API_KEY` | **(Required)** Your secret API key for securing the service. | None |
| `HOST` | The host address for the application server. | `0.0.0.0` |
| `PORT` | The port for the application server. | `8000` |
| `DEBUG` | Enable debug mode. | `False` |
| `LOG_LEVEL` | The logging level (e.g., `INFO`, `DEBUG`). | `INFO` |
| `WORKERS_PER_DEVICE` | Number of worker processes to spawn per detected GPU or CPU device. | `1` |
| `CONCURRENT_REQUESTS_PER_WORKER` | Maximum number of concurrent TTS requests to process per worker. | `1` |
| `VOICES_DIR` | Directory where custom voices are stored. | `voices/` |
| `PRELOADED_VOICES_DIR` | Directory for preloaded voices. | `preloaded-voices/` |
| `MODEL_PATH` | Path to the directory containing TTS models. | `models` |
| `CORS_ORIGINS` | A comma-separated list of allowed origins (e.g., `http://localhost:3000,https://your-frontend.com`). | `*` |

Default generation settings for the TTS engine are prefixed with `TTS_`:

| Variable | Description | Default |
| --- | --- | --- |
| `TTS_VOICE_EXAGGERATION_FACTOR` | Controls the expressiveness of the voice. | `0.5` |
| `TTS_CFG_GUIDANCE_WEIGHT` | Influences how strongly the model adheres to the text prompt. | `0.5` |
| `TTS_SYNTHESIS_TEMPERATURE` | Controls the randomness of the output. | `0.8` |
| `TTS_TEXT_PROCESSING_CHUNK_SIZE` | Maximum characters per text chunk. Smaller values can reduce latency. | `150` |
| `TTS_AUDIO_TOKENS_PER_SLICE` | Number of audio tokens per slice during streaming. Affects granularity. | `35` |
| `TTS_REMOVE_LEADING_MILLISECONDS` | Milliseconds to trim from the start of the audio. | `0` |
| `TTS_REMOVE_TRAILING_MILLISECONDS` | Milliseconds to trim from the end of the audio. | `0` |
| `TTS_CHUNK_OVERLAP_STRATEGY` | Strategy for overlapping audio chunks: `"full"` or `"zero"`. | `"full"` |
| `TTS_CROSSFADE_DURATION_MILLISECONDS` | Duration in milliseconds for crossfading between audio chunks. | `30` |
| `TTS_SPEECH_TOKEN_QUEUE_MAX_SIZE` | Buffer size between the T3 and S3Gen models. Smaller values (~2) reduce initial latency. | `2` |
| `TTS_PCM_CHUNK_QUEUE_MAX_SIZE` | Buffer size for outgoing audio. Smaller values (~3) reduce latency but increase stutter risk. | `3` |

## Deployment with Docker

Run the container by mapping port `8000`, providing the environment variables, and mounting a volume for persistent voice storage. You can use the image you build yourself (`your-docker-username/chatterbox-tts:latest`) or the pre-built image from Docker Hub (`akashdeep000/chatterbox-tts:latest`); custom voices live under `/app/voices` inside the container. For GPU support, add `--gpus all` to the `docker run` command. An example `.env` and run command are sketched below.
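As a starting point, a minimal `.env` might look like the sketch below. Only `API_KEY` is required and its value here is a placeholder; the other lines simply restate documented defaults you are most likely to override.

```env
# Required: the secret clients must send via X-API-Key or ?api_key=
API_KEY=change-me-to-a-long-random-string

# Server binding (these are the defaults)
HOST=0.0.0.0
PORT=8000

# Scaling: workers spawned per detected GPU (or CPU) device
WORKERS_PER_DEVICE=1
CONCURRENT_REQUESTS_PER_WORKER=1

# Storage locations
VOICES_DIR=voices/
MODEL_PATH=models

# Lock CORS down to your frontend instead of the default "*"
CORS_ORIGINS=http://localhost:3000
```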
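And a `docker run` invocation along those lines; this is a sketch rather than the project's documented command, so adjust the image tag, volume path, and flags to your environment:

```bash
# On a CPU-only host, drop the --gpus all flag.
# The mounted host directory persists custom voices across container restarts.
docker run -d \
  --name chatterbox-tts \
  --gpus all \
  --env-file .env \
  -p 8000:8000 \
  -v "$(pwd)/voices:/app/voices" \
  akashdeep000/chatterbox-tts:latest
```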
## CI/CD

A GitHub Actions workflow, defined in `.github/workflows/publish-docker.yml`, builds the Docker image and publishes it tagged with both `latest` and the commit SHA. It requires two pieces of repository configuration:

- `DOCKERHUB_USERNAME`: under **Settings > Secrets and variables > Actions > Variables**, add a new repository variable.
- `DOCKERHUB_TOKEN`: under **Settings > Secrets and variables > Actions > Secrets**, add a new repository secret.

## API Reference

### `GET /system-status`

Reports the current status of the service. An example request is included in the sketch after the parameter table below.

### `/tts/generate`

This endpoint accepts both `GET` and `POST` requests, providing flexibility for different use cases. Authenticate with either the `X-API-Key: <YOUR_API_KEY>` header or the `?api_key=<YOUR_API_KEY>` query parameter. The response is streamed audio: it can be saved to a file or used directly as the `src` of an `<audio>` tag, and the `fmp4` (Fragmented MP4) format is recommended for use with MSE (Media Source Extensions).

The `/tts/generate` endpoint accepts several parameters to customize the audio generation. These can be provided in the query string for `GET` requests or in the JSON body for `POST` requests. Example requests are shown after the table.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `text` | string | (required) | The text to be converted to speech. |
| `voice_id` | string | `None` | The ID of the custom voice to use (e.g., `your_voice.wav`). If not provided, a default voice is used. |
| `format` | string | `wav` | The desired audio format. Supported values: `wav`, `mp3`, `fmp4`, `raw_pcm`, `webm`. Overrides the `Accept` header. |
| `cfg_guidance_weight` | float | `TTS_CFG_GUIDANCE_WEIGHT` | Classifier-Free Guidance weight. Higher values make the speech follow the text more closely but can reduce naturalness. |
| `synthesis_temperature` | float | `TTS_SYNTHESIS_TEMPERATURE` | Controls the randomness of the output. Higher values produce more varied and creative speech; lower values are more deterministic. |
| `text_processing_chunk_size` | integer | `TTS_TEXT_PROCESSING_CHUNK_SIZE` | The number of characters to process in each text chunk. Smaller values can reduce latency but may affect prosody. |
| `audio_tokens_per_slice` | integer | `TTS_AUDIO_TOKENS_PER_SLICE` | The number of audio tokens to generate in each slice. Affects the granularity of the streaming output. |
| `remove_trailing_milliseconds` | integer | `TTS_REMOVE_TRAILING_MILLISECONDS` | Milliseconds of audio to trim from the end of each generated chunk. Useful for fine-tuning the merging between chunks. |
| `remove_leading_milliseconds` | integer | `TTS_REMOVE_LEADING_MILLISECONDS` | Milliseconds of audio to trim from the start of each generated chunk. Useful for fine-tuning the merging between chunks. |
| `chunk_overlap_method` | string | `TTS_CHUNK_OVERLAP_STRATEGY` | The method for handling overlapping audio chunks: `"full"` or `"zero"`. |
| `crossfade_duration_milliseconds` | integer | `TTS_CROSSFADE_DURATION_MILLISECONDS` | The duration of the crossfade between audio chunks, in milliseconds. |
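The original `curl` snippets did not survive formatting, so the following are reconstructed sketches. They assume the service runs at `http://localhost:8000` and that `$API_KEY` holds your key; both are placeholders.

```bash
# Check service status
curl -H "X-API-Key: $API_KEY" http://localhost:8000/system-status

# GET: stream speech to a file, authenticating via the query parameter
curl "http://localhost:8000/tts/generate?api_key=$API_KEY&text=Hello%20world&format=mp3" \
  -o hello.mp3

# POST: JSON body with a custom voice and a tuning override
curl -X POST http://localhost:8000/tts/generate \
  -H "X-API-Key: $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "voice_id": "your_voice.wav", "format": "wav", "synthesis_temperature": 0.8}' \
  -o hello.wav
```

Because the response is streamed, a `GET` URL of the same shape can also be set directly as the `src` of an `<audio>` element.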
## Performance Optimizations

- **`torch.compile`**: Both the T3 (text-to-token) and S3Gen (token-to-audio) models are compiled using `torch.compile` in "reduce-overhead" mode. This significantly speeds up inference by optimizing the model's execution graph.
- **Dedicated thread pools**: Blocking work is offloaded to `ThreadPoolExecutor`s. This prevents these tasks from blocking the main `asyncio` event loop in both the master and worker processes, leading to smoother operation.

## Streaming and Latency Tuning

The `text_processing_chunk_size`, `audio_tokens_per_slice`, and `chunk_overlap_method` parameters are crucial for balancing audio quality and streaming latency. Understanding how they work together allows you to fine-tune the TTS engine for your specific needs.

- **`text_processing_chunk_size`**: Determines how the input text is split into smaller pieces. The T3 model processes one chunk at a time.
- **`audio_tokens_per_slice`**: After the T3 model converts a text chunk into a sequence of speech tokens, this parameter determines how many of those tokens are sent to the S3Gen model at a time to be converted into audio.
- **`chunk_overlap_method`**: Defines how the audio from different text chunks is stitched together.
  - `"full"`: Creates a seamless overlap between audio chunks, which generally produces the highest quality audio by avoiding clicks or pauses. It is slightly more computationally intensive.
  - `"zero"`: Simply concatenates the audio chunks. It is faster but may occasionally produce audible artifacts at the seams between chunks.
- **`crossfade_duration_milliseconds`**: Determines the duration of the crossfade between audio chunks.

In practice, lowering `text_processing_chunk_size` below its default of 150 characters shortens the time to first audio at some cost to prosody, while keeping `chunk_overlap_method` at `"full"` with the default 30 ms crossfade favors seam quality over raw speed.

## Voice Management

Example requests for all three endpoints follow the descriptions below.

### `POST /voices`

Uploads a new custom voice as a `.wav` file; the uploaded filename becomes the `voice_id`. The request body must be `multipart/form-data` with a file field named `voice`. Requires the `X-API-Key: <YOUR_API_KEY>` header. Returns `201 Created` on success.

### `GET /voices`

Lists the available custom voices. Requires the `X-API-Key: <YOUR_API_KEY>` header. Returns `200 OK`.

### `DELETE /voices/{voice_id}`

Deletes a custom voice by its `voice_id`. Requires the `X-API-Key: <YOUR_API_KEY>` header. Returns `200 OK`.
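As with the generation endpoint, the original `curl` examples were lost; these sketches assume the same placeholder host and `$API_KEY` as above:

```bash
# Upload a voice; the filename becomes the voice_id (expects 201 Created)
curl -X POST http://localhost:8000/voices \
  -H "X-API-Key: $API_KEY" \
  -F "voice=@./my_voice.wav"

# List available voices (expects 200 OK)
curl -H "X-API-Key: $API_KEY" http://localhost:8000/voices

# Delete a voice by its voice_id (expects 200 OK)
curl -X DELETE -H "X-API-Key: $API_KEY" http://localhost:8000/voices/my_voice.wav
```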