Build an Audio-Driven Speaker Recognition System Using Open-Source Technologies — Resemblyzer and QdrantDB.
Karan Shingde
Posted on January 18, 2024
Introduction:
In this article, we are going to explore how to match the voice of a speaker against an existing set of voices. You can think of it as a biometric system, but one based on the human voice rather than physical traits such as a fingerprint or an iris. To achieve this, we will use the magic of vector embeddings and open-source technologies.
This type of technology is used in Google Assistant and Siri. When you set up a new device, like an Android phone, Google asks you to record your voice so it can capture its patterns, vocal characteristics, and so on, for security reasons. That way, only you can activate Google Assistant by saying “Ok Google”.
Before we get into the details, let’s understand first what vector embedding is and how it has been used for audio.
Vector Embeddings for Audio
Vector embeddings are a way to represent objects, such as words, sentences or, in our case, audio data, as vectors in a mathematical space. Audio data can be represented as vectors, where different aspects of the audio (features like frequency, amplitude, etc.) are mapped to specific positions in the vector.
In the context of audio data, machine learning models can be trained to learn these embeddings. The model analyzes the patterns and characteristics of the audio data to generate meaningful vector representations. Once the model is trained, it can encode audio data by transforming it into a vector representation. This vector now captures important information about the audio’s content and characteristics.
Similar audio content will have vectors that are close together in the embedding space. This allows for tasks like audio similarity comparison, where you can quickly determine how similar two audio clips are by measuring the distance between their embeddings. To generate vector embeddings, we will use an open-source tool called Resemblyzer and store those vectors in Qdrant DB.
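To make “measuring the distance between embeddings” concrete, here is a minimal sketch (not part of the original pipeline) that compares two toy vectors with cosine similarity, the same metric we will configure in Qdrant later:
import numpy as np

def cosine_similarity(a, b):
    # 1.0 means the vectors point in the same direction, lower values mean less similar
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for two audio embeddings
print(cosine_similarity([0.1, 0.9, 0.3], [0.2, 0.8, 0.25]))  # close to 1 -> similar
print(cosine_similarity([0.1, 0.9, 0.3], [0.9, 0.0, 0.1]))   # lower -> less similar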
We have a set of audio clips of some famous personalities: Cristiano Ronaldo, Donald Trump, and Homer Simpson (yes, he is famous).
Resemblyzer: An Overview
Resemblyzer lets us derive a high-level representation of a voice through a pretrained deep learning model. It simplifies developers’ lives by letting them convert audio clips into vectors with just a few lines of code, without having to build or train a neural network themselves. See the official GitHub repository.
Install Resemblyzer for Python (3.5+)
pip install resemblyzer
I’m using Google Colab with a free T4 GPU for this task. You can also use CPU, but it may take a long time. Click here to get audio data.
# import necessary libraries
from resemblyzer import preprocess_wav, VoiceEncoder
from pathlib import Path
from tqdm import tqdm
import numpy as np
from IPython.display import Audio
from itertools import groupby
import heapq
# run sample audio
audio_sample = Audio('path-to-audio-folder/train/Trump.mp3', autoplay=True)
display(audio_sample)
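The preprocessing step further down relies on two parallel lists that are not built in the snippets shown here: wav_fpaths (the paths of the training clips) and speakers (one label per clip, printed below). A minimal sketch, assuming each training file is named after its speaker (the exact ordering depends on how your files are enumerated):
# Hypothetical setup: adjust the folder and extension to your own data layout
data_dir = Path('path-to-audio-folder/train')
wav_fpaths = list(data_dir.glob('*.mp3'))
speakers = [fpath.stem for fpath in wav_fpaths]
print(speakers)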
['Ronaldo', 'Ronaldo2', 'Homer', 'Homer2', 'Trump']
Now for the important part: we will unleash the power of Resemblyzer and convert the audio clips into vector embeddings with just a few lines of code. First, preprocess the waveforms of all audio clips.
wavs = np.array(list(map(preprocess_wav, tqdm(wav_fpaths, "Preprocessing wavs", len(wav_fpaths)))), dtype=object)
speaker_wavs = {speaker: wavs[list(indices)] for speaker, indices in groupby(range(len(wavs)), lambda i: speakers[i])}
print(speaker_wavs)
{'Ronaldo': array([array([ 0.00045622, -0.00088888, 0.00016845, ..., -0.00079568,
-0.00718354, -0.01011641], dtype=float32) ],
dtype=object),
'Ronaldo2': array([array([ 0.0025312 , 0.00321749, 0.00460094, ..., -0.01093079,
-0.01293177, -0.01618683], dtype=float32) ],
dtype=object),
'Homer': array([array([0., 0., 0., ..., 0., 0., 0.], dtype=float32)], dtype=object),
'Homer2': array([array([ 1.33051715e-14, 3.98843861e-14, -3.70518893e-15, ...,
5.39025990e-04, -5.10490616e-04, -5.79551968e-04], dtype=float32)],
dtype=object),
'Trump': array([array([-0.0165875 , 0.03297266, -0.01565401, ..., -0.03698713,
-0.03372933, -0.02938525], dtype=float32) ],
dtype=object)}
In the above code, we converted these sound waves into numerical representations with a few lines of code and without using any neural network.
Now, convert these numerical representations into embeddings.
# compute the embeddings
encoder = VoiceEncoder("cuda")
utterance_embeds = np.array(list(map(encoder.embed_utterance, wavs)))
print(utterance_embeds)
[[0. 0. 0.0173962 ... 0. 0.04333723 0.00142971]
[0. 0.00967959 0.00503905 ... 0.04058945 0.09630667 0.0495304 ]
[0.15830468 0. 0.01373593 ... 0. 0. 0. ]
[0.18647183 0. 0.11558624 ... 0. 0. 0. ]
[0. 0.11265804 0. ... 0. 0. 0.14819394]]
This is a matrix of floats in which each row is the embedding of one audio clip.
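Each embedding produced by Resemblyzer’s VoiceEncoder has 256 dimensions, which is why we will configure the Qdrant collection with size=256 later. A quick sanity check:
print(utterance_embeds.shape)  # expected: (5, 256) -> five clips, 256 dimensions each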
For this task we are using Qdrant DB as our primary vector database, so we need to convert this representation into a suitable format: a list of dictionaries, each containing an id key and a vector key, where id is an incrementing integer.
To retrieve similar vectors later, each vector must be assigned a unique id.
# Create an empty list to hold the embeddings in the desired format
embeddings = []

# Iterate through each embedding in the array
for i, embedding in enumerate(utterance_embeds):
    # Create a dictionary with "id" and "vector" keys
    embedding_dict = {"id": i + 1, "vector": embedding.tolist()}  # Start IDs from 1
    # Append the dictionary to the embeddings list
    embeddings.append(embedding_dict)
QdrantDB: An Overview
Qdrant DB is one of the most popular vector databases out there. With Qdrant DB, developers can store embeddings and retrieve them seamlessly. Here is the official documentation.
To get started with Qdrant DB, sign up for their cloud service and use the free tier, which allows up to 1 GB per cluster. Get your API key and copy it somewhere safe locally; you won’t be able to see it again afterwards.
For Python, Qdrant DB has its own client library, qdrant_client, which is very easy to use and needs only a few lines of code. Let’s set up Qdrant DB.
Install qdrant_client via pip:
pip install qdrant_client
import qdrant_client
qdrant_uri = 'paste-your-db-uri' # Paste your URI
qdrant_api_key = 'paste-your-api-key' # Paste your API KEY
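The calls later in the article use a client object that is not explicitly created in the snippets shown; with the official qdrant_client library it can be constructed from the URI and API key like this:
# Connect to the Qdrant cloud cluster using the credentials above
client = qdrant_client.QdrantClient(url=qdrant_uri, api_key=qdrant_api_key)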
Let’s create a collection in the database; a collection here means the same thing as a collection in MongoDB.
# Create a collection
vectors_config = qdrant_client.http.models.VectorParams(
    size=256,  # required size for embeddings from Resemblyzer
    distance=qdrant_client.http.models.Distance.COSINE
)
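Note that defining VectorParams alone does not create the collection; the actual creation call is not shown above. With the client created earlier, it would look roughly like this (the name "my-collection" matches the upsert call below):
# Create (or recreate) the collection with the vector parameters defined above
client.recreate_collection(collection_name="my-collection", vectors_config=vectors_config)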
Now, after initializing QdrantDB, we will upsert (or add) embeddings from Resemblyzer.
# Upsert embeddings
client.upsert('my-collection', embeddings)
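As an optional sanity check, you can ask Qdrant how many points the collection now holds; with five training clips you would expect a count of five:
# Verify that all training embeddings were stored
print(client.count(collection_name="my-collection"))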
Up to this point, we have stored encoded versions of our audio samples in Qdrant DB. Now we will test the system with a new voice clip from a speaker who already has a record in the database.
Speaker Recognition:
Recognizing a speaker from a new voice clip comes down to finding the similarity between the new voice and the set of voices already stored. For example, let’s take a new clip of Cristiano Ronaldo and check whether it is recognized. We already have Ronaldo’s voice in the database.
I’m taking the iconic short speech by Ronaldo, which he gave after winning the UCL:
“Muchas gracias afición, esto para vosotros. Siuuuuuuuuu!” (“Thank you so much, fans, this is for you. Siuuu!”)
Convert the new voice into embeddings
test_wav = preprocess_wav("/content/drive/MyDrive/audio_data_colab/Siuu.mp3")
# Compute the embedding for the test clip with the same encoder
test_embeddings = encoder.embed_utterance(test_wav)
# Search related embeddings
results = client.search("my-collection", test_embeddings)
print(results)
[ScoredPoint(id=2, version=0, score=0.6956655, payload={}, vector=None, shard_key=None),
ScoredPoint(id=1, version=0, score=0.6705738, payload={}, vector=None, shard_key=None),
ScoredPoint(id=5, version=0, score=0.56731033, payload={}, vector=None, shard_key=None),
ScoredPoint(id=3, version=0, score=0.535391, payload={}, vector=None, shard_key=None),
ScoredPoint(id=4, version=0, score=0.42906034, payload={}, vector=None, shard_key=None)]
From the above results, you can see that IDs 1 and 2 are associated with the Ronaldo clips (we assigned those IDs in the embedding code). The highest score is about 70%, which is reasonable given that we have very little data and the clips are only 3-4 seconds long on average. You can add more data and try this out.
To get the top two most similar results, just run the following code (you could also take only the top one, or take the top three and decide by majority vote).
# Get the top two results based on scores, handling potential ties
top_two_results = heapq.nlargest(2, results, key=lambda result: result.score)

# Extract and align IDs, considering potential ties
top_two_ids = sorted({result.id - 1 for result in top_two_results})  # Remove duplicates

# Get corresponding names, checking for valid IDs
top_two_names = []
for aligned_id in top_two_ids:
    if 0 <= aligned_id < len(speakers):
        top_two_names.append(speakers[aligned_id])
    else:
        print(f"Invalid ID {aligned_id + 1} encountered.")

print("Top two speakers: ", top_two_names)
Top two speakers: ['Ronaldo', 'Ronaldo2']
Yes! It’s a match. We have successfully verified the new voice with the existing set of voices.
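In a real application you would not accept the best match blindly; you would also require the top score to clear a minimum similarity threshold before declaring a match. A minimal sketch of that decision step, assuming a hypothetical cutoff of 0.6 on the cosine score:
# Hypothetical acceptance rule: require the best score to exceed a threshold
SCORE_THRESHOLD = 0.6  # tune this on your own data

best = max(results, key=lambda result: result.score)
if best.score >= SCORE_THRESHOLD:
    print(f"Match: {speakers[best.id - 1]} (score={best.score:.2f})")
else:
    print("No confident match found")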
Conclusion
In this article, we implemented audio-driven speaker recognition with just a few lines of code using open-source technologies such as Resemblyzer and Qdrant DB. Resemblyzer is one of the easiest ways to work with audio data and encode it into embeddings; there is no need to build or train a neural network or transformer architecture yourself. Qdrant DB, on the other hand, provides an efficient way to store and retrieve embeddings.
Thanks for reading this article!