Let's Make Python Listen - Part 1.
Mahmoud Harmouch
Posted on April 14, 2022
Hello, fellow human being. In this series of articles, we are going to unravel the mysterious world of speech recognition systems and put Deepgram's services to work along the way. Many people are interested in this subject because voice assistants such as Amazon's Alexa, Google Assistant, and Apple's Siri are competing to become the dominant smart speaker, and they all rely on different types of deep neural networks (feedforward and feedback networks). The current wave of deep neural networks is usually dated to 2006 [0], when it was kicked off by the godfather himself: Geoffrey Hinton [1].
Table Of Contents (TOC).
- What is Speech?
- What is Speech Recognition?
- Speech Recognition History
- What is a Deepgram?
- Deepgram's Unique Features
- Speech Recognition from a Live Microphone
- Deepgram python sdk
- Connecting Pyaudio and Deepgram
- Handle Exceptions
- Wrapping Up
- Reference
What is Speech?
The human voice is a physical phenomenon that we cannot see. The brain initiates speech by triggering the muscles of your mouth and throat to produce sound [3], and the shape of the back of the throat and its vibration determine the speech sound that comes out [2]. For example, when someone speaks the word "Hello," they articulate it with their lips and tongue while their vocal cords vibrate and air passes between them. When a microphone picks up these sounds, it converts them into an electrical signal that can be transmitted over a wired or wireless connection to software on your computer, speakers, or a voice-recognition device.
What is Speech Recognition?
Speech recognition is the process of converting spoken words into text. In some cases, it can be used in conjunction with other technologies to provide computer input or replace the keyboard and mouse. The technology has been around since the 1950s, but it has seen significant advances in recent years. Speech recognition uses digital signal processing (DSP) techniques to process and analyze audio signals [4].
Speech recognition is often used as a stand-alone application or as part of a larger software package that includes other features, such as dictation. It allows the user to control a computer or other device by speaking. It is also known by a variety of other names, such as voice recognition, voice-to-text, and speech-to-text.
After this brief introduction, let's take a look at the exciting history of speech recognition, which, surprisingly enough, goes back around 70 years to the 1950s, as mentioned above.
Speech Recognition History
In the early 1950s, scientists at Bell Laboratories created the Audrey (Automatic Digit Recognizer) machine [5], which had three main components:
- A microphone that captures human speech.
- A piece of hardware that was programmed to do the actual transcription.
- A display that shows the number being spoken into the microphone.
As the name suggests, this machine could recognize the digits 0 through 9.
In 1962, IBM released Shoebox [7], the first device of its kind to recognize spoken words: it could recognize the ten digits and six arithmetic command words (e.g., plus, minus, etc.). For example, if someone said "2 plus 2" into the microphone, Shoebox would trigger an adding function to calculate and display the result.
These technologies worked by transforming voice signals into electrical impulses and then splitting each word into small phonetic units. For example, the word "hello" would be divided into 'he l oh' or something along those lines.
In the 1970s, the US Department of Defense stepped in to financially support research. DARPA (Defense Advanced Research Projects Agency, the same agency that was allegedly exposed for facilitating biological experiments related to SARS-CoV-2 [8]) funded one of the most significant speech recognition projects. The resulting system could recognize more than a thousand words.
In 1982, the SAM synthesizer [9] became the first commercial speech synthesis software, giving a voice to the Commodore 64 computer.
A significant milestone was reached in the late 1980s with the introduction of statistical models (e.g., the Hidden Markov Model), which could recognize approximately five thousand words.
It works by assigning each speech sound in a word to a node; each edge leading to a following node carries the probability that this sound comes next. For example, the word 'potato' can be pronounced in various ways, such as 'p oh t ah t oh', 'p ah t ay t oh', and others, and each pronunciation corresponds to a different path through the graph.
The downside of these algorithms is that they only recognize discrete speech, so you cannot speak naturally; unfortunately, you need to pause between words.
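To make the pronunciation graph described above more concrete, here is a minimal sketch in Python. It models only the Markov-chain part of a Hidden Markov Model (nodes and transition probabilities, with no acoustic observations), and the probabilities are made-up values chosen for illustration, not figures from any real recognizer. The node names with numeric suffixes (t1, ah2, ...) are also an assumption of this sketch, used to keep repeated sounds apart.

```python
from typing import Dict, Iterator, List

# Hypothetical transition probabilities between the sounds of "potato".
# Each key is a node; the inner dict holds its outgoing edges.
TRANSITIONS: Dict[str, Dict[str, float]] = {
    "start": {"p": 1.0},
    "p": {"oh": 0.6, "ah": 0.4},
    "oh": {"t1": 1.0},
    "ah": {"t1": 1.0},
    "t1": {"ay": 0.7, "ah2": 0.3},
    "ay": {"t2": 1.0},
    "ah2": {"t2": 1.0},
    "t2": {"oh2": 1.0},
    "oh2": {"end": 1.0},
}


def path_probability(path: List[str]) -> float:
    """Multiply the edge probabilities along a path (0.0 if an edge is missing)."""
    probability = 1.0
    for current, following in zip(path, path[1:]):
        probability *= TRANSITIONS.get(current, {}).get(following, 0.0)
    return probability


def pronunciations(state: str = "start") -> Iterator[List[str]]:
    """Yield every path that leads from `state` to the "end" node."""
    if state == "end":
        yield [state]
        return
    for following in TRANSITIONS.get(state, {}):
        for rest in pronunciations(following):
            yield [state] + rest


for path in pronunciations():
    sounds = " ".join(path[1:-1])  # drop the artificial start/end nodes
    print(f"{sounds}: {path_probability(path):.2f}")
```

Each printed line is one way of saying the word together with the probability of that path; a recognizer built on this idea would prefer the most probable path that matches the audio.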
In the 1990s, the first commercial product became available to the masses when Dragon launched Dragon Dictate, which was capable of recognizing approximately 60,000 words.
In the 2000s, Google released its voice search app for the iPhone [12]. The app processes voice requests in Google's cloud data centers, matching them against a large pool of human-speech recordings and learning from the queries collected from users (230 billion words). The models behind it are neural networks of the kind that arrived in 2006, as mentioned at the beginning of the article.
I think that is enough history for today; it will presumably be continued in future posts about speech recognition. Now let's move on to the next section and explore Deepgram's transcription services.
What is a Deepgram?
Deepgram is a promising new AI-powered transcription tool that uses deep learning to transcribe audio recordings by detecting the words and phrases that occur within them. In simple terms, it is a voice recognition service that takes recordings and converts them into text. But it is much more than that.
Deepgram has many use cases. For example, it can be used as a transcription service for meetings and phone calls, as a speech-to-text service for videos, or as an automated transcript generator for audio files. Detailed information is available on their website [13].
Deepgram's Unique Features
According to Deepgram [15], the service offers significantly higher accuracy (90%+) than other transcription systems out there. It also claims a much higher transcription speed (3 seconds to transcribe an hour-long recording) and lower costs ($0.78/hour), which makes it an attractive option for businesses that need to transcribe large quantities of content regularly.
At the time of writing, the service can identify and transcribe audio in 16 languages [16], with support for a large variety of accents and dialects.
The cool part about Deepgram is that it offers a free trial that anyone can use. Moreover, Deepgram provides open-source SDKs and free speech recognition tools that can be integrated into any application or system.
With the help of Deepgram, we don't have to reinvent the wheel and build a machine learning model from the ground up (although that would be a fantastic project to work on in the future). Instead, we will use the Python SDK, which lets us interact with the various Deepgram API endpoints that use state-of-the-art machine learning models to perform speech transcription.
In essence, Deepgram's transcription services are easy to use, accurate, and fast. They can help you save time, money, and resources while still providing high-quality results.
Now, let's jump into the technical stuff.
Speech Recognition from a Live Microphone
In this section, we will learn how to convert real-time speech into human-readable text. To accomplish this, we will use the deepgram-sdk package along with the PyAudio package.
Install deepgram-sdk, pyaudio
Python has a handy built-in module called wave, but it does not support recording; it only reads and writes WAV audio files. To record audio data, we can turn to a third-party package called PyAudio. The official website is a good starting point on how to install and use this library on various platforms.
However, PyAudio depends on another library called portaudio, which is not installed by default on most Linux distributions. To install it on a Debian-based system, you need to issue the following command in your terminal:
$ sudo apt-get install portaudio19-dev
If the above command runs successfully, you can download and install pyaudio on your system. Because we previously used poetry instead of pip for dependency management, we can run the following command to add PyAudio to our project:
$ poetry add pyaudio
If the installation was successful, you can look up the portaudio version by running:
$ python3 -c 'import pyaudio as p; print(p.get_portaudio_version())'
1246720
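The number printed above is a packed version number rather than a plain counter. As a small sketch, and assuming the one-byte-per-component layout used by PortAudio's paMakeVersionNumber macro in recent V19 releases, it can be decoded as follows (the function name is made up for this example):

```python
def decode_portaudio_version(packed: int) -> str:
    """Split a packed PortAudio version into major.minor.subminor."""
    major = (packed >> 16) & 0xFF
    minor = (packed >> 8) & 0xFF
    subminor = packed & 0xFF
    return f"{major}.{minor}.{subminor}"


print(decode_portaudio_version(1246720))  # 19.6.0
```

If you only need something human-readable, pyaudio.get_portaudio_version_text() returns the version as text directly.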
To install deepgram on your machine, you can follow along with their GitHub repo. Likewise, to import deepgram into our project with poetry, simply run:
$ poetry add deepgram-sdk
If the installation was successful, you can look up the deepgram version by running:
$ python3 -c 'import deepgram; print(deepgram._version.__version__)'
0.2.5
Now, it is time to play with these modules. Before you do, make sure your microphone is enabled and not muted.
Input and Output Devices
Now, let's open up a REPL and test things out.
We will begin by importing the pyaudio module and then instantiating the PyAudio class.
>>> import pyaudio
>>> py_audio = pyaudio.PyAudio()
If you are on Linux, you may run into the following warnings:
ALSA lib pcm_dmix.c:1089:(snd_pcm_dmix_open) unable to open slave
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.rear
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.center_lfe
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.side
ALSA lib pcm_route.c:869:(find_matching_chmap) Found no matching channel map
ALSA lib pcm_oss.c:377:(_snd_pcm_oss_open) Unknown field port
ALSA lib pcm_oss.c:377:(_snd_pcm_oss_open) Unknown field port
ALSA lib pcm_usb_stream.c:486:(_snd_pcm_usb_stream_open) Invalid type for card
ALSA lib pcm_usb_stream.c:486:(_snd_pcm_usb_stream_open) Invalid type for card
ALSA lib pcm_dmix.c:1089:(snd_pcm_dmix_open) unable to open slave
Let's ignore these warnings for now.
The py_audio object has a lot of valuable methods that you can use to get information about your input and output devices.
>>> for attr in dir(py_audio):
...     if not attr.startswith("_"):
...         print(attr)
...
close
get_default_host_api_info
get_default_input_device_info
get_default_output_device_info
get_device_count
get_device_info_by_host_api_device_index
get_device_info_by_index
get_format_from_width
get_host_api_count
get_host_api_info_by_index
get_host_api_info_by_type
get_sample_size
is_format_supported
open
terminate
For instance, to look up details about the default input device, you can call the following method:
>>> py_audio.get_default_input_device_info()
{
'index': 9,
'structVersion': 2,
'name': 'default',
'hostApi': 0,
'maxInputChannels': 32,
'maxOutputChannels': 32,
'defaultLowInputLatency': 0.008684807256235827,
'defaultLowOutputLatency': 0.008684807256235827,
'defaultHighInputLatency': 0.034807256235827665,
'defaultHighOutputLatency': 0.034807256235827665,
'defaultSampleRate': 44100.0
}
Keep in mind the value of the defaultSampleRate key. We are going to use it when recording audio from the microphone.
Similarly, to get information about your default output device, you can call the following method:
>>> py_audio.get_default_output_device_info()
{
'index': 9,
'structVersion': 2,
'name': 'default',
'hostApi': 0,
'maxInputChannels': 32,
'maxOutputChannels': 32,
'defaultLowInputLatency': 0.008684807256235827,
'defaultLowOutputLatency': 0.008684807256235827,
'defaultHighInputLatency': 0.034807256235827665,
'defaultHighOutputLatency': 0.034807256235827665,
'defaultSampleRate': 44100.0
}
If you want to check the details of every I/O device on your machine, you can execute the following code:
>>> for index in range(py_audio.get_device_count()):
...     device_info = py_audio.get_device_info_by_index(index)
...     for key, value in device_info.items():
...         print(key, value, sep=": ")
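On a machine with many audio devices, the full listing gets long. The sketch below filters it down to devices that can actually record, using the maxInputChannels key shown in the dictionaries above. The sample data is hypothetical and only mimics the shape of the real dictionaries, so the snippet runs without a sound card.

```python
from typing import Any, Dict, List


def input_devices(devices: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only the devices that expose at least one input channel."""
    return [device for device in devices if device.get("maxInputChannels", 0) > 0]


# Hypothetical sample shaped like get_device_info_by_index() results.
# With real hardware, you could build the list like this instead:
#   [py_audio.get_device_info_by_index(i) for i in range(py_audio.get_device_count())]
SAMPLE_DEVICES = [
    {"index": 0, "name": "HDMI 0", "maxInputChannels": 0, "maxOutputChannels": 8},
    {"index": 4, "name": "USB Microphone", "maxInputChannels": 1, "maxOutputChannels": 0},
    {"index": 9, "name": "default", "maxInputChannels": 32, "maxOutputChannels": 32},
]

for device in input_devices(SAMPLE_DEVICES):
    print(device["index"], device["name"], sep=": ")
```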
Audio Recording & Wave Files
Experimentations
To record audio data from the microphone, you need to call the open method:
>>> from rich import inspect
>>> inspect(py_audio.open)
╭─ <bound method PyAudio.open of <pyaudio.PyAudio object at 0x7f6c8bed5180>> ─╮
│ def PyAudio.open(*args, **kwargs):                                          │
│                                                                             │
│ Open a new stream. See constructor for                                      │
│ :py:func:`Stream.__init__` for parameter details.                           │
│                                                                             │
│ 27 attribute(s) not shown. Run inspect(inspect) for options.                │
╰─────────────────────────────────────────────────────────────────────────────╯
We are going to use rich for proper message display. Now, let's create a stream object for recording purposes:
>>> # open a stream object for input only
>>> audio_stream = py_audio.open(
...     rate=44100,  # sampling rate: samples per second
...     channels=1,  # mono, change to 2 if you want stereo
...     format=pyaudio.paInt16,  # sample format: 16-bit signed integers (2 bytes per sample)
...     input=True,  # input device flag
...     output=False,  # output device flag; if True, you can play back the audio
...     frames_per_buffer=1024,  # 1024 samples per buffer
... )
Now, you can take a look at the available attributes of this stream object.
>>> for attr in dir(audio_stream):
...     if not attr.startswith("_"):
...         print(attr)
...
close
get_cpu_load
get_input_latency
get_output_latency
get_read_available
get_time
get_write_available
is_active
is_stopped
read
start_stream
stop_stream
write
The read and write functions are the most useful functions for this tutorial. We can call the read function to record audio samples in terms of frames.
>>> inspect(audio_stream.read)
╭─ <bound method Stream.read of <pyaudio.Stream object at 0x7f8310a41180>> ─╮
│ def Stream.read(num_frames, exception_on_overflow=True):                  │
│                                                                           │
│ Read samples from the stream. Do not call when using                      │
│ *non-blocking* mode.                                                      │
│                                                                           │
│ 27 attribute(s) not shown. Run inspect(inspect) for options.              │
╰───────────────────────────────────────────────────────────────────────────╯
As its signature shows, the read method accepts a number of frames instead of a duration. Therefore, we need to convert a duration, the period of time we want to record, into a number of read calls. To do so, we need to find out how many buffers of samples (called frames in the rest of this tutorial) fit into the given duration. The following formula will do the trick:
num_frames = int(rate / samples_per_frame * duration)
We can make sure that the above formula is correct using dimensional analysis. The units of measurement are:
- rate: samples/second
- samples_per_frame: samples/frame
- duration: seconds
The value on the left-hand side of the equation, num_frames, should be expressed in frames, which is the case for our formula: (samples/second) / (samples/frame) * seconds = frames. Now we can iterate through all the frames and read 1024 samples per frame. The int function truncates the result, rounding it down to a whole number of frames.
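As a quick sanity check of the formula, the sketch below wraps it in a function (the name count_reads is made up for this example) and computes how much audio the reads actually cover. Because int truncates, the recording comes out slightly shorter than the requested duration.

```python
def count_reads(rate: int, samples_per_frame: int, duration: float) -> int:
    """Number of read() calls needed to cover `duration` seconds, rounded down."""
    return int(rate / samples_per_frame * duration)


reads = count_reads(rate=44100, samples_per_frame=1024, duration=3)
captured_seconds = reads * 1024 / 44100

print(reads)                       # 129
print(round(captured_seconds, 3))  # 2.995
```

If covering at least the full duration matters more than never exceeding it, math.ceil can be used in place of int, at the cost of recording a few extra milliseconds.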
>>> frames = []
>>> for _ in range(int(44100 / 1024 * 3)):
...     data = audio_stream.read(1024)
...     frames.append(data)
...
>>> len(frames)
129
Each frame being added is a stream of bytes:
>>> type(frames[0])
<class 'bytes'>
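To see what those bytes contain, here is a small sketch that turns a chunk back into numbers. With the paInt16 format, every sample occupies two bytes, so a mono chunk of 1024 samples is 2048 bytes long. The example assumes little-endian byte order (the usual case on x86 and ARM machines) and uses a hand-made chunk instead of real microphone data.

```python
import struct
from typing import Tuple


def decode_int16(chunk: bytes) -> Tuple[int, ...]:
    """Interpret raw bytes as little-endian 16-bit signed samples."""
    count = len(chunk) // 2  # two bytes per sample
    return struct.unpack(f"<{count}h", chunk[: count * 2])


# A hand-made chunk standing in for frames[0]: four samples, eight bytes.
fake_chunk = struct.pack("<4h", 0, 1000, -1000, 32767)

print(len(fake_chunk))           # 8
print(decode_int16(fake_chunk))  # (0, 1000, -1000, 32767)
```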
Now, let's store this data in a WAV file to confirm that it is indeed about three seconds' worth of recording. To do so, let's import the built-in wave module:
>>> import wave
Let's see what the available attributes for this object are:
>>> for attr in dir(wave):
...     if not attr.startswith("_"):
...         print(attr)
...
Chunk
Error
WAVE_FORMAT_PCM
Wave_read
Wave_write
audioop
builtins
namedtuple
open
struct
sys
As you may guess, we are going to use the open function to open a file in write mode.
>>> wave_file = wave.open("sound.wav", "wb")
Similarly, let's see all the attributes for this object:
>>> for attr in dir(wave_file):
...     if not attr.startswith("_"):
...         print(attr)
...
close
getcompname
getcomptype
getframerate
getmark
getmarkers
getnchannels
getnframes
getparams
getsampwidth
initfp
setcomptype
setframerate
setmark
setnchannels
setnframes
setparams
setsampwidth
tell
writeframes
writeframesraw
Since we are going to write into a file, we have to use either writeframes or writeframesraw. If you check the official documentation, you will see that writeframes does more work than writeframesraw: it also makes sure that the number of frames recorded in the file header is correct. Thus, we will use writeframes for this tutorial.
But first, we need to set some parameters on the wave_file object. The number of channels must match the stream we opened earlier, which was mono:
>>> wave_file.setnchannels(1)
>>> wave_file.setsampwidth(py_audio.get_sample_size(pyaudio.paInt16))
>>> wave_file.setframerate(44100)
Now that everything is set up, you can write the stream of data into the file:
>>> wave_file.writeframes(b"".join(frames))
>>> wave_file.close()
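To confirm the length of a recording without listening to it, you can reopen the file in read mode and divide the number of frames by the frame rate. The sketch below does this with silent, synthetic data written to a temporary directory (so no microphone is needed), and it also shows why the channel count in the header must match the stream: declaring mono data as stereo halves the reported duration. The helper names are made up for this example.

```python
import os
import tempfile
import wave


def write_wav(path: str, payload: bytes, channels: int, rate: int = 44100) -> None:
    """Write raw 16-bit samples to `path` with the given header settings."""
    with wave.open(path, "wb") as wave_file:
        wave_file.setnchannels(channels)
        wave_file.setsampwidth(2)  # 2 bytes per sample, as with paInt16
        wave_file.setframerate(rate)
        wave_file.writeframes(payload)


def wav_duration(path: str) -> float:
    """Duration in seconds, according to the file header."""
    with wave.open(path, "rb") as wave_file:
        return wave_file.getnframes() / wave_file.getframerate()


# 129 chunks of 1024 silent mono samples, two bytes each.
payload = bytes(129 * 1024 * 2)

with tempfile.TemporaryDirectory() as folder:
    mono_path = os.path.join(folder, "mono.wav")
    stereo_path = os.path.join(folder, "stereo.wav")
    write_wav(mono_path, payload, channels=1)
    write_wav(stereo_path, payload, channels=2)  # wrong header for mono data
    mono_seconds = wav_duration(mono_path)
    stereo_seconds = wav_duration(stereo_path)

print(round(mono_seconds, 3))    # 2.995
print(round(stereo_seconds, 3))  # 1.498
```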
Having experimented with the wave and pyaudio modules, let's put it all together.
Putting it All Together
There are two ways to bundle the previous code together: a functional programming approach or an object-oriented one.
Functional Programming
import wave
from typing import IO, Union

import pyaudio  # type: ignore


def init_recording(
    file_name: Union[str, IO[bytes]] = "sound.wav", mode: str = "wb"
) -> wave.Wave_write:
    # Open the output file and describe the audio that will be written to it.
    wave_file = wave.open(file_name, mode)
    wave_file.setnchannels(2)  # must match `channels` in record()
    wave_file.setsampwidth(2)  # 2 bytes per sample, matching pyaudio.paInt16
    wave_file.setframerate(44100)
    return wave_file


def record(wave_file: wave.Wave_write, duration: int = 3) -> None:
    py_audio = pyaudio.PyAudio()
    audio_stream = py_audio.open(
        rate=44100,  # sampling rate: samples per second
        channels=2,  # stereo, change to 1 (here and above) if you want mono
        format=pyaudio.paInt16,  # sample format: 16-bit signed integers
        input=True,  # input device flag
        frames_per_buffer=1024,  # 1024 samples per buffer
    )
    frames = []
    # Read as many buffers as fit into the requested duration.
    for _ in range(int(44100 / 1024 * duration)):
        data = audio_stream.read(1024)
        frames.append(data)
    wave_file.writeframes(b"".join(frames))
    audio_stream.close()
    py_audio.terminate()  # release the PortAudio resources


if __name__ == "__main__":
    wave_file = init_recording()
    record(wave_file)
    wave_file.close()
Object Oriented Approach
As described in the docstring below, I assumed that each field of the AudioRecorder class is private by default and only accessible through getters and setters. In Python, it is not mandatory to use getters and setters, but I like this approach because I used to code in statically typed languages, mainly C# and Java.
Notice the use of the special __attrs_post_init__ method, which sets the wave_file attribute at the moment of instantiation, right after __init__ has run. I also used type hinting, as you can tell. In Python, you are not required to do any of this, yet it is still an option. The __init__ method is automatically generated by the attrs package (notice that the class is decorated with define and each attribute is declared with field).
This snippet of code was adapted from the audio_record module of the deepwordle project.
import os
import wave
from os import PathLike
from typing import List, Optional, Union

import pyaudio  # type: ignore
from attrs import define, field

BASE_DIR = os.path.dirname(os.path.abspath(__file__))


@define
class AudioRecorder:
    """
    A brief encapsulation of an audio recorder object attributes and methods.
    All fields are assumed to be private by default, and only accessible through
    getters/setters, but someone still could hack his/her way around it!

    Attrs:
        frames_per_buffer: An integer indicating the number of frames per buffer;
            1024 frames/buffer by default.
        audio_format: An integer PyAudio constant that identifies the sample
            format; 16-bit signed int by default.
        channels: An integer indicating how many channels a microphone has.
        rate: An integer indicating how many samples per second: frequency.
        py_audio: pyaudio instance.
        data_stream: stream object to get data from microphone.
        wave_file: wave class instance.
        mode: file object mode.
        file_name: file name to store audio data in it.
    """

    _frames_per_buffer: int = field(init=True, default=1024)
    _audio_format: int = field(init=True, default=pyaudio.paInt16)
    _channels: int = field(init=True, default=1)
    _rate: int = field(init=True, default=44100)
    # Use a factory so every recorder gets its own PyAudio instance.
    _py_audio: pyaudio.PyAudio = field(init=False, factory=pyaudio.PyAudio)
    _data_stream: Optional[pyaudio.Stream] = field(init=False, default=None)
    _wave_file: Optional[wave.Wave_write] = field(init=False, default=None)
    _mode: str = field(init=True, default="wb")
    _file_name: Union[str, PathLike[str]] = field(init=True, default="sound.wav")

    @property
    def frames_per_buffer(self) -> int:
        """
        A getter method that returns the value of the `frames_per_buffer` attribute.
        :param self: Instance of the class.
        :return: An integer that represents the value of the `frames_per_buffer` attribute.
        """
        if not hasattr(self, "_frames_per_buffer"):
            raise AttributeError(
                f"Your {self.__class__.__name__!r} instance has no attribute named frames_per_buffer."
            )
        return self._frames_per_buffer

    @frames_per_buffer.setter
    def frames_per_buffer(self, value: int) -> None:
        """
        A setter method that changes the value of the `frames_per_buffer` attribute.
        :param value: An integer that represents the value of the `frames_per_buffer` attribute.
        :return: None.
        """
        setattr(self, "_frames_per_buffer", value)

    @property
    def audio_format(self) -> int:
        """
        A getter method that returns the value of the `audio_format` attribute.
        :param self: Instance of the class.
        :return: An integer that represents the value of the `audio_format` attribute.
        """
        if not hasattr(self, "_audio_format"):
            raise AttributeError(
                f"Your {self.__class__.__name__!r} instance has no attribute named audio_format."
            )
        return self._audio_format

    @audio_format.setter
    def audio_format(self, value: int) -> None:
        """
        A setter method that changes the value of the `audio_format` attribute.
        :param value: An integer that represents the value of the `audio_format` attribute.
        :return: None.
        """
        setattr(self, "_audio_format", value)

    @property
    def channels(self) -> int:
        """
        A getter method that returns the value of the `channels` attribute.
        :param self: Instance of the class.
        :return: An integer that represents the value of the `channels` attribute.
        """
        if not hasattr(self, "_channels"):
            raise AttributeError(
                f"Your {self.__class__.__name__!r} instance has no attribute named channels."
            )
        return self._channels

    @channels.setter
    def channels(self, value: int) -> None:
        """
        A setter method that changes the value of the `channels` attribute.
        :param value: An integer that represents the value of the `channels` attribute.
        :return: None.
        """
        setattr(self, "_channels", value)

    @property
    def rate(self) -> int:
        """
        A getter method that returns the value of the `rate` attribute.
        :param self: Instance of the class.
        :return: An integer that represents the value of the `rate` attribute.
        """
        if not hasattr(self, "_rate"):
            raise AttributeError(
                f"Your {self.__class__.__name__!r} instance has no attribute named rate."
            )
        return self._rate

    @rate.setter
    def rate(self, value: int) -> None:
        """
        A setter method that changes the value of the `rate` attribute.
        :param value: An integer that represents the value of the `rate` attribute.
        :return: None.
        """
        setattr(self, "_rate", value)

    @property
    def py_audio(self) -> pyaudio.PyAudio:
        """
        A getter method that returns the value of the `py_audio` attribute.
        :param self: Instance of the class.
        :return: A PyAudio object that represents the value of the `py_audio` attribute.
        """
        if not hasattr(self, "_py_audio"):
            raise AttributeError(
                f"Your {self.__class__.__name__!r} instance has no attribute named py_audio."
            )
        return self._py_audio

    @py_audio.setter
    def py_audio(self, value: pyaudio.PyAudio) -> None:
        """
        A setter method that changes the value of the `py_audio` attribute.
        :param value: A PyAudio object that represents the value of the `py_audio` attribute.
        :return: None.
        """
        setattr(self, "_py_audio", value)

    @property
    def data_stream(self) -> Optional[pyaudio.Stream]:
        """
        A getter method that returns the value of the `data_stream` attribute.
        :param self: Instance of the class.
        :return: A stream object that represents the value of the `data_stream` attribute.
        """
        if not hasattr(self, "_data_stream"):
            raise AttributeError(
                f"Your {self.__class__.__name__!r} instance has no attribute named data_stream."
            )
        return self._data_stream

    @data_stream.setter
    def data_stream(self, value: Optional[pyaudio.Stream]) -> None:
        """
        A setter method that changes the value of the `data_stream` attribute.
        :param value: A stream object that represents the value of the `data_stream` attribute.
        :return: None.
        """
        setattr(self, "_data_stream", value)

    @property
    def wave_file(self) -> Optional[wave.Wave_write]:
        """
        A getter method that returns the value of the `wave_file` attribute.
        :param self: Instance of the class.
        :return: A Wave_write object that represents the value of the `wave_file` attribute.
        """
        if not hasattr(self, "_wave_file"):
            raise AttributeError(
                f"Your {self.__class__.__name__!r} instance has no attribute named wave_file."
            )
        return self._wave_file

    @wave_file.setter
    def wave_file(self, value: wave.Wave_write) -> None:
        """
        A setter method that changes the value of the `wave_file` attribute.
        :param value: A Wave_write object that represents the value of the `wave_file` attribute.
        :return: None.
        """
        setattr(self, "_wave_file", value)

    @property
    def file_name(self) -> Union[str, PathLike[str]]:
        """
        A getter method that returns the value of the `file_name` attribute.
        :param self: Instance of the class.
        :return: A string that represents the value of the `file_name` attribute.
        """
        if not hasattr(self, "_file_name"):
            raise AttributeError(
                f"Your {self.__class__.__name__!r} instance has no attribute named file_name."
            )
        return self._file_name

    @file_name.setter
    def file_name(self, value: Union[str, PathLike[str]]) -> None:
        """
        A setter method that changes the value of the `file_name` attribute.
        :param value: A string that represents the value of the `file_name` attribute.
        :return: None.
        """
        setattr(self, "_file_name", value)

    @property
    def mode(self) -> str:
        """
        A getter method that returns the value of the `mode` attribute.
        :param self: Instance of the class.
        :return: A string that represents the value of the `mode` attribute.
        """
        if not hasattr(self, "_mode"):
            raise AttributeError(
                f"Your {self.__class__.__name__!r} instance has no attribute named mode."
            )
        return self._mode

    @mode.setter
    def mode(self, value: str) -> None:
        """
        A setter method that changes the value of the `mode` attribute.
        :param value: A string that represents the value of the `mode` attribute.
        :return: None.
        """
        setattr(self, "_mode", value)

    def __repr__(self) -> str:
        attrs: dict = {
            "frames_per_buffer": self.frames_per_buffer,
            "audio_format": self.audio_format,
            "channels": self.channels,
            "rate": self.rate,
            "py_audio": repr(self.py_audio),
            "data_stream": self.data_stream,
            "wave_file": repr(self.wave_file),
            "mode": self.mode,
            "file_name": self.file_name,
        }
        return f"{self.__class__.__name__}({attrs})"

    def __attrs_post_init__(self) -> None:
        # Runs right after the generated __init__: open and configure the file.
        wave_file = wave.open(os.path.join(BASE_DIR, self.file_name), self.mode)
        wave_file.setnchannels(self.channels)
        wave_file.setsampwidth(self.py_audio.get_sample_size(self.audio_format))
        wave_file.setframerate(self.rate)
        self.wave_file = wave_file
        del wave_file

    def record(self, duration: int = 3) -> None:
        self.data_stream = self.py_audio.open(
            format=self.audio_format,
            channels=self.channels,
            rate=self.rate,
            input=True,
            output=True,
            frames_per_buffer=self.frames_per_buffer,
        )
        frames: List[bytes] = []
        num_frames: int = int(self.rate / self.frames_per_buffer * duration)
        for _ in range(num_frames):
            data = self.data_stream.read(self.frames_per_buffer)
            frames.append(data)
        self.wave_file.writeframes(b"".join(frames))

    def stop_recording(self) -> None:
        if self.data_stream:
            self.data_stream.close()
        self.py_audio.terminate()
        self.wave_file.close()


if __name__ == "__main__":
    rec = AudioRecorder()
    print(rec)
    rec.record()
    rec.stop_recording()
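If the post-init hook used in the class above is new to you, the following sketch isolates the idea. To keep it free of third-party dependencies, it swaps attrs for the standard library's dataclasses module, whose __post_init__ method plays the same role as __attrs_post_init__: it runs right after the generated __init__ and fills in fields that are derived from the others. The class and field names are made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class RecorderSettings:
    rate: int = 44100
    channels: int = 1
    sample_width: int = 2  # bytes per sample for 16-bit audio
    # Derived value: excluded from __init__ and computed in __post_init__.
    bytes_per_second: int = field(init=False)

    def __post_init__(self) -> None:
        self.bytes_per_second = self.rate * self.channels * self.sample_width


print(RecorderSettings().bytes_per_second)            # 88200
print(RecorderSettings(channels=2).bytes_per_second)  # 176400
```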
Deepgram python sdk.
Let's go back to our REPL and start playing with the deepgram SDK.
We will start by importing the Deepgram class from the deepgram module and then listing its public attributes.
>>> from deepgram import Deepgram
>>> for attr in dir(Deepgram):
...     if not attr.startswith("_"):
...         print(attr)
...
keys
projects
transcription
usage
As you can see, there are four main attributes in the Deepgram class. Using Deepgram, you can transcribe pre-recorded audio or live audio streams such as BBC radio. You can follow the README file to set up a Deepgram account and get started. With a secret API key, you can interact with the API to do the transcription. Once you get the API key, you need to store it in an environment variable for the following code to run successfully:
$ export DEEPGRAM_API_KEY="XXXXXXXXX"
import asyncio
import os
from os import PathLike
from typing import Union

from deepgram import Deepgram  # type: ignore


async def transcribe(file_name: Union[str, PathLike[str]]):
    # Send the whole file to Deepgram and return the list of recognized words.
    with open(file_name, "rb") as audio:
        source = {"buffer": audio, "mimetype": "audio/wav"}
        response = await deepgram.transcription.prerecorded(source)
        return response["results"]["channels"][0]["alternatives"][0]["words"]


if __name__ == "__main__":
    try:
        # `deepgram` is a module-level name, so `transcribe` can see it.
        deepgram = Deepgram(os.environ.get("DEEPGRAM_API_KEY"))
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        words = loop.run_until_complete(transcribe("sound.wav"))
        string_words = " ".join(
            word_dict.get("word") for word_dict in words if "word" in word_dict
        )
        print(f"You said: {string_words}!")
        loop.close()
    except AttributeError:
        print("Please provide a valid `DEEPGRAM_API_KEY`.")
The above script will generate the following if the audio file contains only the words "hello" and "world":
You said: hello world!
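The long chain of indexes in the script above raises an exception as soon as one level is missing, for example when the recording is silent. Below is a more forgiving sketch of the same extraction. It relies only on the response shape used in the script (results, channels, alternatives, words, word); the sample response is hand-written for illustration and is not real API output, and the function name is made up.

```python
from typing import Any, Dict, List


def words_to_text(response: Dict[str, Any]) -> str:
    """Join the recognized words of the first alternative of the first channel."""
    channels: List[Dict[str, Any]] = response.get("results", {}).get("channels", [])
    if not channels:
        return ""
    alternatives = channels[0].get("alternatives", [])
    if not alternatives:
        return ""
    words = alternatives[0].get("words", [])
    return " ".join(word["word"] for word in words if "word" in word)


# Hand-written sample that mimics the structure indexed in the script above.
SAMPLE_RESPONSE = {
    "results": {
        "channels": [
            {"alternatives": [{"words": [{"word": "hello"}, {"word": "world"}, {}]}]}
        ]
    }
}

print(f"You said: {words_to_text(SAMPLE_RESPONSE)}!")  # You said: hello world!
```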
Connecting Pyaudio and Deepgram
import asyncio
import os
import wave
from os import PathLike
from typing import IO, Union

import pyaudio  # type: ignore
from deepgram import Deepgram  # type: ignore


def init_recording(
    file_name: Union[str, IO[bytes]] = "sound.wav", mode: str = "wb"
) -> wave.Wave_write:
    # Open the output file and describe the audio that will be written to it.
    wave_file = wave.open(file_name, mode)
    wave_file.setnchannels(2)  # must match `channels` in record()
    wave_file.setsampwidth(2)  # 2 bytes per sample, matching pyaudio.paInt16
    wave_file.setframerate(44100)
    return wave_file


def record(wave_file: wave.Wave_write, duration: int = 3) -> None:
    py_audio = pyaudio.PyAudio()
    audio_stream = py_audio.open(
        rate=44100,  # sampling rate: samples per second
        channels=2,  # stereo, change to 1 (here and above) if you want mono
        format=pyaudio.paInt16,  # sample format: 16-bit signed integers
        input=True,  # input device flag
        frames_per_buffer=1024,  # 1024 samples per buffer
    )
    frames = []
    # Read as many buffers as fit into the requested duration.
    for _ in range(int(44100 / 1024 * duration)):
        data = audio_stream.read(1024)
        frames.append(data)
    wave_file.writeframes(b"".join(frames))
    audio_stream.close()
    py_audio.terminate()  # release the PortAudio resources


async def transcribe(file_name: Union[str, PathLike[str]]):
    # Send the whole file to Deepgram and return the list of recognized words.
    with open(file_name, "rb") as audio:
        source = {"buffer": audio, "mimetype": "audio/wav"}
        response = await deepgram.transcription.prerecorded(source)
        return response["results"]["channels"][0]["alternatives"][0]["words"]


if __name__ == "__main__":
    # start recording
    print("Python is listening...")
    wave_file = init_recording()
    record(wave_file)
    wave_file.close()
    # start transcribing
    deepgram = Deepgram(os.environ.get("DEEPGRAM_API_KEY"))
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    words = loop.run_until_complete(transcribe("sound.wav"))
    string_words = " ".join(
        word_dict.get("word") for word_dict in words if "word" in word_dict
    )
    print(f"You said: {string_words}!")
    loop.close()
Handle Exceptions
Now, we need to handle errors to make our app more user-friendly, using a try-except block to deal with expected exceptions instead of letting our program crash. The first error happens when your DEEPGRAM_API_KEY is not correct, which causes the program to raise an Unauthorized exception.
try:
    # start recording
    print("Python is listening...")
    wave_file = init_recording()
    record(wave_file)
    wave_file.close()
    # start transcribing
    deepgram = Deepgram(os.environ.get("DEEPGRAM_API_KEY"))
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    words = loop.run_until_complete(transcribe("sound.wav"))
    string_words = " ".join(
        word_dict.get("word") for word_dict in words if "word" in word_dict
    )
    print(f"You said: {string_words}!")
    loop.close()
except Exception as error:
    # Every failure ends up here; an invalid key is the most likely cause.
    print(f"Something went wrong: {error}")
    print("Unauthorized user? Please provide a valid `DEEPGRAM_API_KEY` value")
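A blanket except clause cannot tell a missing key apart from, say, a microphone problem. One way to narrow things down is to validate the environment variable before recording anything. The sketch below is one possible approach, not part of the Deepgram SDK: the exception class and the function name are made up for this example, and it only detects a missing or empty key. A key that is present but rejected by the API will still surface as an exception at request time.

```python
import os
from typing import Mapping


class MissingApiKeyError(RuntimeError):
    """Raised when DEEPGRAM_API_KEY is not set or is empty."""


def read_api_key(environ: Mapping[str, str] = os.environ) -> str:
    """Return the API key from the environment or raise a descriptive error."""
    key = environ.get("DEEPGRAM_API_KEY", "").strip()
    if not key:
        raise MissingApiKeyError("Set the DEEPGRAM_API_KEY environment variable first.")
    return key


# Demonstration with plain dictionaries instead of the real environment.
try:
    read_api_key({})
except MissingApiKeyError as error:
    print(f"Configuration problem: {error}")

print(read_api_key({"DEEPGRAM_API_KEY": " XXXXXXXXX "}))  # XXXXXXXXX
```

With this in place, the main script could call read_api_key() first and pass the result to Deepgram(...), keeping the broader except clause for network and audio errors.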
We can build a loop to record speech indefinitely until a condition is satisfied.
try:
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    while True:
        wave_file = init_recording()
        print("Python is listening...")
        record(wave_file)
        wave_file.close()
        # start transcribing
        deepgram = Deepgram(os.environ.get("DEEPGRAM_API_KEY"))
        words = loop.run_until_complete(transcribe("sound.wav"))
        string_words = " ".join(
            word_dict.get("word") for word_dict in words if "word" in word_dict
        )
        print(f"You said: {string_words}!")
        if string_words == "stop":
            print("Goodbye!")
            break
    loop.close()
except Exception as error:
    # Every failure ends up here; an invalid key is the most likely cause.
    print(f"Something went wrong: {error}")
    print("Unauthorized user? Please provide a valid `DEEPGRAM_API_KEY` value")
This program is I/O-bound: most of its time is spent recording audio, writing it to disk, and waiting for the API response. We will improve this version in the upcoming articles on speech recognition.
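One practical detail about the loop above: the comparison with "stop" is exact, so a transcript such as "Stop." or " STOP " would not end the program. Whether the transcript comes back capitalized or punctuated depends on the options you send to the API, so the sketch below simply normalizes the text before comparing. The function name is made up for this example.

```python
import string


def is_stop_command(transcript: str, stop_word: str = "stop") -> bool:
    """Compare the transcript with the stop word, ignoring case and punctuation."""
    cleaned = transcript.lower().strip().strip(string.punctuation).strip()
    return cleaned == stop_word


print(is_stop_command("Stop."))           # True
print(is_stop_command("stop the music"))  # False
```

In the loop, the condition would then read if is_stop_command(string_words): instead of the exact comparison.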
Wrapping Up
In this article, we explored the history of speech recognition and learned how to use the Deepgram Python SDK for speech recognition and pyaudio for audio recording. There is a lot more you can do with these libraries, which is beyond the scope of this article. Keep in mind that we can improve our project by sending audio directly from the microphone, without writing it to a wave file, with the help of WebSockets; that is the work of future articles. We can also build a voice-controlled search engine based on this. I suggest playing around with the webbrowser module to find even more exciting implementation ideas. We will be working on these kinds of projects throughout the upcoming articles in this series.
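As a starting point for the voice-controlled search idea, here is a minimal sketch that turns a transcript into a search URL and, optionally, opens it with the standard webbrowser module. The search engine URL is an assumption chosen for illustration, and the function names are made up; the browser is only opened when you ask for it, so the snippet is safe to run as is.

```python
import webbrowser
from urllib.parse import quote_plus

# Assumed search endpoint; swap in the engine you prefer.
SEARCH_URL = "https://duckduckgo.com/?q="


def build_search_url(query: str, base: str = SEARCH_URL) -> str:
    """Encode the spoken query so it is safe to put in a URL."""
    return base + quote_plus(query)


def search(query: str, open_browser: bool = False) -> str:
    """Return the search URL and optionally open it in the default browser."""
    url = build_search_url(query)
    if open_browser:
        webbrowser.open(url)
    return url


print(search("python speech recognition"))
```

Feeding string_words from the recording loop into search(..., open_browser=True) would be enough for a first voice-driven prototype.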
As always, this article is a gift to you, and you can share it with whomever you like or use it in any way that would be beneficial to your personal and professional development. Thank you in advance for your ultimate support!
Happy Coding, folks; see you in the next one.
Reference
[0] wikipedia. Deep learning.
[1] wikipedia. Geoffrey Hinton.
[2] William F. Katz, 2016. What Produces Speech: Your Speech Anatomy, Phonetics For Dummies.
[3] Jacquelyn Cafasso, 2019. What Part of the Brain Controls Speech?, healthline.
[4] Steven W. Smith, in Digital Signal Processing: A Practical Guide for Engineers and Scientists, 2003.
[5] Sam Lawson, 2018, Bell-Laboratories-invented-Audrey, ClickZ.
[6] Pioneering Speech Recognition, IBM.
[7] IBM Cloud Education, 2020, What is Speech Recognition.
[8] Project Veritas, 2022, Military Documents about Gain of Function contradict Fauci testimony under oath, Youtube.
[9] Sebastian Macke, Software Automatic Mouth - Tiny Speech Synthesizer, Github.
[10] Dimitrakakis, Christos & Bengio, Samy. (2011). Phoneme and Sentence-Level Ensembles for Speech Recognition EURASIP J. Audio, Speech and Music Processing. 2011. 10.1155/2011/426792.
[11] Ed Grabianowski, How Speech Recognition Works.
[12] News from Google, 2008, New Version of Google Mobile App for iPhone, now with Voice Search.
[13] Deepgram, Different Environments Call for Different Speech Recognition Models.
[14] Deepgram, High Accuracy for Better Speech Analysis.
[15] Deepgram, WHY DEEPGRAM: Enterprise audio is complex. Your ASR doesn't have to be.
[16] Deepgram, Every customer. Heard and understood.