First Steps: Making your own voice activated virtual assistant
For a long time I’ve thought it would be really cool to have my own completely local replacement for Siri or Alexa: one that doesn’t send my requests off to be monetized, is completely programmable when I want it to do custom things, and, with the ML progress available today, is hopefully a lot smarter about understanding requests. I always thought the most important part was understanding speech, which in the past was just unusable, but Whisper is incredible and really seems to work beyond expectations out of the box. This is the result of a couple of days of experimentation, by no means complete or even useful, but a sort of closed loop of understanding voice, generating a response, and speaking it back. Hopefully it will serve as a jumping-off point for somebody to create something cool.
Record audio input
Simple and old: recording audio from a microphone and saving it as an MP3 is straightforward. Nowhere do I claim that this code is *good*; all of this is a duct-tape prototype, not a polished gem or a recommendation on style, efficiency, or good sense.
import sounddevice as sd
from scipy.io.wavfile import write
from pydub import AudioSegment

fs = 44100     # Sample rate
seconds = 30   # Duration of recording

# List the available input devices (purely informational here;
# sd.rec uses the default input device unless you pass device=<index>)
devices = sd.query_devices()
recording_devices = [device for device in devices if device["max_input_channels"] > 0]
for i, device in enumerate(recording_devices):
    print(f"{i}. {device['name']}")

myrecording = sd.rec(int(seconds * fs), samplerate=fs, channels=2)
sd.wait()  # Wait until recording is finished
write('output.wav', fs, myrecording)  # Save as WAV file

# Convert to MP3 (requires ffmpeg; Whisper can also read the WAV directly)
sound = AudioSegment.from_wav("output.wav")
sound.export("output.mp3", format="mp3")
Voice to Text
OpenAI’s Whisper (https://github.com/openai/whisper) does an excellent job of transcribing voice to text and supports many languages. This is by far the easiest and best-performing step.
import whisper
# Load your model, a few options are available
# model = whisper.load_model("base")
model = whisper.load_model("large")
# load audio and pad/trim it to fit 30 seconds
audio = whisper.load_audio("output.mp3")
audio = whisper.pad_or_trim(audio)
# make log-Mel spectrogram and move to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(model.device)
# detect the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")
# decode the audio
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)
# print the recognized text
print(result.text)
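If you don’t need the explicit language-detection step, Whisper also exposes a higher-level transcribe() method that handles the loading, padding, and decoding in one call. It returns a dict, so the text lives under the "text" key rather than a .text attribute:
# Simpler alternative: transcribe() does the loading, padding, and decoding in one call
transcription = model.transcribe("output.mp3")
print(transcription["text"])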
Generate a Response
The transformers library supports many different pretrained models. GPT-2 isn’t exactly up to the task of being a voice-based virtual assistant out of the box like this; what it does here is predict what you might say next. That’s not supremely helpful, but the goal is not to show you a completed product, just a place to get started doing it yourself. The real work will be in selecting models, fine-tuning them, and building more complicated routing of inputs and outputs. Those next steps are left as an exercise for the reader for now.
You could probably make plenty of things without doing any ML at all here. Something as simple as running a function when the phrase “turn on the lights” shows up in the transcription would work just fine, as sketched below.
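For example, a minimal keyword router along these lines would get you surprisingly far; the trigger phrases and handler functions here are just made-up placeholders, not anything the rest of this post depends on:
# Hypothetical non-ML routing: map trigger phrases to handler functions
import datetime

def turn_on_lights():
    print("Lights on!")  # stand-in for a real smart-home call

def report_time():
    print(datetime.datetime.now().strftime("It is %H:%M"))

COMMANDS = {
    "turn on the lights": turn_on_lights,
    "what time is it": report_time,
}

def route(text):
    text = text.lower()
    for phrase, handler in COMMANDS.items():
        if phrase in text:
            handler()
            return True
    return False  # nothing matched; fall back to the language model

# route(result.text)
The rest of this section sticks with the ML route and just lets GPT-2 continue whatever was said.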
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
# from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and the model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Prepare the input text generated in the previous step
input_text = result.text
input_ids = tokenizer(input_text, return_tensors="pt").input_ids  # Batch size 1

# Generate continuations of the input text
with torch.no_grad():
    outputs = model.generate(
        input_ids,
        max_length=400,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        num_return_sequences=3,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token; reuse EOS to avoid a warning
    )

# Keep only the first of the three sampled sequences
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
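One thing to watch: generate() returns the prompt tokens followed by the continuation, so the decoded string starts with whatever you said. If you only want to speak the new part, a rough heuristic (not in the original code above) is to slice off the prompt:
# Drop the echoed prompt so only the generated continuation gets spoken
response_only = result[len(input_text):].strip()
print(response_only)
If you go that route, pass response_only to the text-to-speech step instead of result.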
Generating Voice from the Response Text
Unlike the speech-to-text step, which is excellent, the quality of this text-to-speech step is not great. There are many libraries and models to choose from; I picked one more or less at random, and it turned out to be functional at best. Again, the goal here is to have something that works, not a polished gem.
import soundfile
from espnet2.bin.tts_inference import Text2Speech

# https://huggingface.co/docs/hub/espnet
# https://huggingface.co/espnet/kan-bayashi_ljspeech_vits
model = Text2Speech.from_pretrained("espnet/kan-bayashi_ljspeech_vits")

# `result` here is the text generated in the previous step
speech = model(result)["wav"]
soundfile.write("out.wav", speech.numpy(), model.fs, "PCM_16")
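Writing out.wav is enough to check the output by hand, but to actually close the loop you can play the waveform straight from memory with the same sounddevice library used for recording. A minimal sketch, assuming everything is still running in one session:
import sounddevice as sd

# Play the synthesized reply through the default output device
sd.play(speech.numpy().astype("float32"), samplerate=model.fs)
sd.wait()  # block until playback finishes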
Dependencies
I got this all going on a modern NVIDIA RTX graphics card on Windows, using Miniconda (https://docs.conda.io/en/latest/miniconda.html) for many of the dependencies. Your setup may vary. This is not a complete guide to setting up your environment, but if people have questions I’ll try to respond from time to time.
conda install pytorch torchaudio transformers pytorch-cuda=11.7 -c pytorch -c nvidia
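The conda line covers PyTorch and transformers; the rest I pulled in with pip. Roughly something like the following, though the exact package names and versions may differ on your setup, and ffmpeg needs to be on your PATH for pydub and Whisper:
pip install -U openai-whisper sounddevice scipy pydub soundfile espnet espnet_model_zoo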
Next Steps
Try out different models for generating responses, tuning them, training them, and using outputs to do more interesting things than just speak back
Find better text to speech libraries or models
Optimize for inference on cheap hardware to see how far you can get with only a little
Build an actual device to listen and respond