A Guide to Building an Offline Voice Assistant Using Python

Philippe Mathew De Vera

This is a short guide on how to build a voice assistant that is capable of running offline.
When it comes to voice assistants, there are a variety of options to choose from. But the majority of them are owned by large companies, and there is very little voice assistant software out there that is free and open source.
So I've decided to write a short guide that can hopefully get you started on creating your own from scratch. Do keep in mind that this is a guide and not a tutorial, so you definitely need a bit of Python experience.
At the end of this guide, I link the source code of a personal voice assistant that I made a few months back. Keep in mind that the project is not open source, but I've made the source code public for learning purposes.

Parts of a Voice Assistant

There are three main components when it comes to creating a VA: Speech Recognition, Intent Recognition, and Text To Speech.

Speech Recognition

There are many routes you can take when choosing a speech recognition method. You could use a third-party API such as Google's Speech-to-Text, but for our use case we're going to use a combination of Google's free speech recognition (accessed through the Python library SpeechRecognition) and Vosk, a free offline speech recognition toolkit. The best part about Vosk is that it comes with a variety of pretrained models.
From Vosk's description itself:

Vosk models are small (50 Mb) but provide continuous large vocabulary transcription, zero-latency response with streaming API, reconfigurable vocabulary and speaker identification.

We're going to be combining the two recognition methods in order to have speech recognition that is accurate while online (Google's) and slightly less accurate when offline (Vosk).
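The fallback idea is simple enough to sketch in a few lines. Note that the names below are my own illustration, not an actual library API:

```python
# A minimal sketch of online-first speech recognition with an offline
# fallback. These names are illustrative, not a real library's API.
def transcribe(audio, online_recognizer, offline_recognizer):
    """Try the more accurate online recognizer first; if it cannot
    reach the network, fall back to the offline one."""
    try:
        return online_recognizer(audio)
    except ConnectionError:
        return offline_recognizer(audio)
```

With the SpeechRecognition library, the online recognizer could wrap `recognizer.recognize_google(audio)` and re-raise its `speech_recognition.RequestError` as a `ConnectionError`, while the offline recognizer could feed the audio through a Vosk `KaldiRecognizer`.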
Lucky for you, I've already created a Python library that does exactly that. It is called pyvrs, a voice-assistant-ready speech recognition library.
On top of being able to work both offline and online, it also supports preprocessing, so you can perform noise reduction or audio amplification before the audio gets converted into text. It also comes with a wakeword recognizer, and you're free to use any wakeword you like.
For those who don't know, the wakeword is what triggers a voice assistant to start transcribing your words. An example of this is "Alexa". One of the beauties of creating a VA on your own is that you can use any wakeword you want; you are not restricted to just a few.
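To make the idea concrete, a wakeword check over transcribed text can be as simple as a string test. This is a conceptual sketch only; pyvrs and dedicated wakeword engines typically match on the audio itself rather than on text:

```python
# Conceptual sketch only: real wakeword engines match on audio, not text.
WAKEWORDS = {"mantra", "hey mantra"}

def heard_wakeword(transcript: str) -> bool:
    """Return True if the transcript begins with one of our wakewords."""
    text = transcript.lower().strip()
    return any(text.startswith(w) for w in WAKEWORDS)
```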

Intent Recognition

Intent recognition is the part of the VA that is responsible for taking in transcribed text and trying to make sense of what the user wants. It is essentially the brain of a VA.
Intent recognition works by providing an NLU (natural language understanding) library with a plethora of sample sentences and their intents. The NLU library can then learn the text patterns that show up in the sample data and correlate them with the intents.
There are two major parts to intent recognition. The first is the Intent, which is basically what the user wants. If the user intends to ask what time it is, we can create an intent named timeCheck and teach the NLU library the text patterns for the timeCheck intent.
The second major part is Entities. Entities live within an intent, and they are the parts of a user's text that carry information relevant to the intent. For example, if the user asks what time it is in France, the intent would be timeCheck and the entity would be France. Entities can also have names; in our example's case, it could be named location.
In my opinion, the best NLU library for Python is Snips NLU. It offers simplicity but also customizability, hiding all the machine learning details that might be too complex under the hood.
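To make the intent/entity split concrete, here is roughly the shape of a Snips NLU parse result for our example. The field names follow Snips NLU's output format, but the probability and values here are made up for illustration:

```python
# Roughly the shape of engine.parse("what time is it in France").
# Field names follow Snips NLU's output format; values are illustrative.
parse_result = {
    "input": "what time is it in France",
    "intent": {"intentName": "timeCheck", "probability": 0.87},
    "slots": [
        {
            "rawValue": "France",
            "entity": "location",
            "slotName": "location",
        }
    ],
}

def extract(result):
    """Pull out the intent name and a {slotName: rawValue} mapping."""
    intent = result["intent"]["intentName"]
    slots = {s["slotName"]: s["rawValue"] for s in result["slots"]}
    return intent, slots
```

Your command-dispatch code would then route on the intent name and pass the slots along as arguments.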

Text To Speech

This component is pretty straightforward: it is responsible for turning text into audible speech. There's a Python library called pyttsx3 which interacts with the operating system's TTS software so that you can easily convert text into audio.
You can also use gTTS, which uses Google Translate's text-to-speech. This will obviously require an internet connection, since it leverages an undocumented API from Google.
These voices may sound limited, but you can also process the speech audio before it is played. This can be in the form of increasing the pitch in order to get a different-sounding voice, or adding distortion, reverb, and any other effect you may like.
My go-to audio effects library for simple things like these is Spotify's Pedalboard.

Combining all components

This is where the fun begins: combining all the components of a VA to create a fully working voice assistant.
I have a personal voice assistant that I've been working on called Mantra, and you can use it as a starting point for learning how to combine all the components into one fully functional voice assistant.
But to give you a general overview: Mantra uses a plugin architecture to create commands, intents, and responses. A plugin architecture is almost always necessary when creating software like this, because you want to be able to add new commands without touching the main source code.
It is also essential to create your own mini libraries to prevent tons of code duplication. In my own VA, I created a context system similar to discord.py's in order to handle commands cleanly; here's an example.
```python
@commands.command("m/time_check")
async def time_check(ctx: Context) -> None:
    time = get_time()
    # Pick a response from a list of responses inside a .txt file rather than
    # storing the response(s) in the source code directly
    await ctx.responder.respond_file("commands/time_check.txt", time=time)
```
Responses
As for responding to the user, I would highly recommend storing the responses in .txt files and having a list of responses the VA can choose from. That way, if you decide you want to change the response to a specific command, you can simply edit a text file and don't have to dive deep into the command's source code.
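A hypothetical helper along those lines might look like this (`pick_response` is my own name, not Mantra's actual API): one response per line in the file, with `{placeholders}` substituted via `str.format`:

```python
import random

# Hypothetical helper mirroring the respond_file idea: the .txt file holds
# one candidate response per line, e.g. "It is currently {time}."
def pick_response(path: str, **variables) -> str:
    """Pick a random response line from the file and fill in placeholders."""
    with open(path, encoding="utf-8") as f:
        responses = [line.strip() for line in f if line.strip()]
    return random.choice(responses).format(**variables)
```

The command handler then only needs to pass the file path and the variables; adding or rewording responses never touches Python code.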

Conclusion

In conclusion, creating a VA is no easy task, but if you're motivated to do it, it is definitely worth it: not only do you learn so much about programming in general, but you also get to create the perfect VA that fits your own needs.

Posted Oct 13, 2022
