Speech Recognition with TensorFlow.js

bajcmartinez

Juan Cruz Martinez

Posted on June 23, 2020


When we talk about AI, deep learning, or machine learning, we usually think of Python, R, or C++, but what about JavaScript? Well, it turns out that one of the most popular machine learning libraries for Python is available for JavaScript as well. We are talking about TensorFlow, and today we will take a short look at the library and build a fun project together.


What is TensorFlow.js and what can it be used for?

TensorFlow.js is a JavaScript library developed by Google for training and deploying machine learning models in the browser and in Node.js. It's a companion library to TensorFlow, the popular ML library for Python.

TensorFlow.js is not just a toy library; it is serious business. The performance is surprising, especially when using hardware acceleration through WebGL. But should we train models with it? Perhaps not. Even though you can achieve great performance, its Python counterpart is even faster, and when working with Python you will find more libraries to support your code, like NumPy and pandas. The same goes for learning materials: there are not as many for TensorFlow.js as there are for TensorFlow.

Now, this doesn't mean you shouldn't use TensorFlow.js. On the contrary, I think it's a great library for deploying and running ML models, and that is what we are going to focus on for the rest of the article.


Deploying a sample model with TensorFlow.js

As we said, TensorFlow.js is a powerful library, and we can use it for a lot of different things, like image classification, video manipulation, and speech recognition, among others. For today, I decided to work on a basic speech recognition example.

Our code will be able to listen through the microphone and identify what the user is saying, at least for a few words, as the sample model I'm using has some limitations. But rather than explaining, I think it's cooler if we first see it in action:

Unfortunately, I can't run the code on Medium, but you can access the live demo here

Pretty cool? I know it can be a bit erratic, and it's limited to a few words, but if you use the right model, the possibilities are endless. Enough talking, let's start coding.

The first thing we need to do is install the library and get our model. There are a few options for installing TensorFlow.js that can be reviewed here; in our case, to keep it simple, we will import it from a CDN.

<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs@2.0.0/dist/tf.min.js"></script>
<script src="https://unpkg.com/@tensorflow-models/speech-commands"></script>

Then we will use some HTML to show the list of words:

<div class="demo">
    <div>
        <label class="form-switch">
            <input type="checkbox" id="audio-switch">
            Microphone
        </label>
        <div id="demo-loading" class="hidden">Loading...</div>
    </div>
    <div id="sp-cmd-wrapper" class="grid"></div>
</div>

So far nothing strange: we have our checkbox, a loading element, and a wrapper element that we will use to render the list of words. Let's do that next:

// wordList is the array of words to display; it is defined in the full listing at the end
const wrapperElement = document.getElementById('sp-cmd-wrapper');
for (const word of wordList) {
    wrapperElement.innerHTML += `<div id='word-${word}'>${word}</div>`;
}
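
Appending to `innerHTML` inside a loop makes the browser re-parse the wrapper's markup on every iteration. For a list this short that is harmless, but as a minimal sketch of an alternative, the markup can be built once and assigned in a single step. The helper name `buildWordListMarkup` is hypothetical and not part of the original demo.

```javascript
// Hypothetical helper (not in the original demo): builds the markup for
// every word in one pass so the DOM only has to be updated once.
function buildWordListMarkup(words) {
    return words
        .map((word) => `<div id='word-${word}'>${word}</div>`)
        .join('');
}

// In the browser, the result would be assigned in a single step:
// document.getElementById('sp-cmd-wrapper').innerHTML = buildWordListMarkup(wordList);
```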

For the demo to start working we need to click on the Microphone checkbox, so let's set an event listener there to trigger the loading and listening processes.

document.getElementById("audio-switch").addEventListener('change', (event) => {
    if(event.target.checked) {
        if(modelLoaded) {
            startListening();
        }else{
            loadModel();
        }
    } else {
        stopListening();
    }   
});

When the checkbox changes its value we have three possibilities. If the user enabled the checkbox and the model is not loaded yet, we call the loadModel() function. If the model was already loaded, we trigger the listening process directly. If the user disabled the checkbox, we stop accessing the microphone.
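
The three cases can be summarized as a small decision function. This is only a sketch to make the branching explicit; `nextAction` is a hypothetical name, and the demo itself calls the functions directly instead of returning their names.

```javascript
// Hypothetical helper: maps the checkbox state and the model state to
// the name of the function the event listener should call.
function nextAction(checked, modelLoaded) {
    if (!checked) {
        return 'stopListening';
    }
    return modelLoaded ? 'startListening' : 'loadModel';
}
```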

Let's review each function implementation:

loadModel()

loadModel() is responsible for creating the recognizer instance and loading the model. Once the model is loaded, we can get the list of labels the model was trained on with recognizer.wordLabels(). This will be helpful later when evaluating the model.

async function loadModel() { 
    // Show the loading element
    const loadingElement = document.getElementById('demo-loading');
    loadingElement.classList.remove('hidden');

    // When calling `create()`, you must provide the type of the audio input.
    // - BROWSER_FFT uses the browser's native Fourier transform.
    recognizer = speechCommands.create("BROWSER_FFT");
    await recognizer.ensureModelLoaded();

    words = recognizer.wordLabels();
    modelLoaded = true;

    // Hide the loading element
    loadingElement.classList.add('hidden');
    startListening();
}
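
One thing the function above does not handle is the checkbox being switched off and on again while the model is still downloading, which would start a second load. A minimal sketch of one way to guard against that, assuming it is acceptable for every later call to reuse the first load, is to cache the promise. The names `once` and `fakeLoad` are hypothetical, and `fakeLoad` only stands in for the real loading code so the snippet runs on its own.

```javascript
// Hypothetical wrapper: runs an async function at most once and hands the
// same promise back to every later caller.
function once(asyncFn) {
    let pending = null;
    return function () {
        if (pending === null) {
            pending = asyncFn();
        }
        return pending;
    };
}

// Stand-in for the real loading code, used here only to count calls.
let loadCalls = 0;
async function fakeLoad() {
    loadCalls += 1;
    return 'model ready';
}

const loadModelOnce = once(fakeLoad);
const first = loadModelOnce();
const second = loadModelOnce();
console.log(first === second, loadCalls); // prints: true 1
```

If the load fails, the cached promise stays rejected, so a real version would probably want to reset it and allow a retry.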

startListening()

startListening() is called after the model has loaded or when the user enables the microphone. It is responsible for accessing the microphone API and evaluating the model to see which word we were able to identify. This sounds complicated, but thanks to TensorFlow.js it is just a few lines of code.

function startListening() {
    recognizer.listen(({scores}) => {

        // Every time the model evaluates a result it returns the scores array.
        // From that data we build a new array with each word and its corresponding score
        scores = Array.from(scores).map((s, i) => ({score: s, word: words[i]}));

        // After that we sort the array by score, descending
        scores.sort((s1, s2) => s2.score - s1.score);

        // And we highlight the word with the highest score
        const elementId = `word-${scores[0].word}`;
        document.getElementById(elementId).classList.add('active');

        // This is just for removing the highlight after 2.5 seconds
        setTimeout(() => {
            document.getElementById(elementId).classList.remove('active');
        }, 2500);
    }, 
    {
        probabilityThreshold: 0.70
    });
}
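
The core of the callback, pairing each score with its label and picking the best one, can be isolated into a small function that is easy to try without a microphone. This is a sketch with made-up labels and scores, and `bestMatch` is a hypothetical name. The commented lines at the end show a null check that may be worth adding in the demo, in case the model reports a label that has no matching element on the page.

```javascript
// Hypothetical helper: pairs every score with its label and returns the
// entry with the highest score. `scores` may be a Float32Array, as in the
// demo, or a plain array.
function bestMatch(scores, labels) {
    const ranked = Array.from(scores)
        .map((score, i) => ({ score, word: labels[i] }))
        .sort((a, b) => b.score - a.score);
    return ranked[0];
}

// Made-up labels and scores, only for illustration.
const sampleLabels = ['yes', 'no', 'up'];
const sampleScores = new Float32Array([0.1, 0.75, 0.15]);
const match = bestMatch(sampleScores, sampleLabels);
console.log(match.word); // prints "no"

// In the browser, a guard avoids an error when no element matches the label:
// const element = document.getElementById(`word-${match.word}`);
// if (element) element.classList.add('active');
```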

Super easy! Now the last function.

stopListening()

stopListening() will stop accessing the microphone and stop the evaluation.

function stopListening(){
    recognizer.stopListening();
}
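
Calling `stopListening()` when the recognizer was never created, or is not currently listening (for example when the checkbox is switched off while the model is still loading), can fail. Here is a hedged sketch of a guard, assuming the recognizer exposes an `isListening()` method, which the speech-commands recognizer does as far as I know. The fake recognizer below is hypothetical and only exists so the snippet runs on its own.

```javascript
// Hypothetical guard: only stops a recognizer that exists and is listening.
// Returns true when stopListening() was actually called.
function safeStop(recognizer) {
    if (recognizer && recognizer.isListening()) {
        recognizer.stopListening();
        return true;
    }
    return false;
}

// Minimal stand-in for the real recognizer, for illustration only.
function createFakeRecognizer() {
    let listening = true;
    return {
        isListening: () => listening,
        stopListening: () => { listening = false; },
    };
}

const fake = createFakeRecognizer();
console.log(safeStop(undefined)); // prints false
console.log(safeStop(fake));      // prints true
console.log(safeStop(fake));      // prints false
```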

That's it, that's all that you need to build your first example of speech recognition on the web.


Putting it all together

<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs@2.0.0/dist/tf.min.js"></script>
<script src="https://unpkg.com/@tensorflow-models/speech-commands"></script>

<script type="text/javascript">
    let recognizer;
    let words;
    const wordList = ["zero","one","two","three","four","five","six","seven","eight","nine", "yes", "no", "up", "down", "left", "right", "stop", "go"];
    let modelLoaded = false;

    document.addEventListener('DOMContentLoaded', () => {
        const wrapperElement = document.getElementById('sp-cmd-wrapper');
        for (const word of wordList) {
            wrapperElement.innerHTML += `<div class='col-3 col-md-6'><div id='word-${word}' class='badge'>${word}</div></div>`;
        }

        document.getElementById("audio-switch").addEventListener('change', (event) => {
            if(event.target.checked) {
                if(modelLoaded) {
                    startListening();
                }else{
                    loadModel();
                }
            } else {
                stopListening();
            }   
        });
    });

    async function loadModel() { 
        // Show the loading element
        const loadingElement = document.getElementById('demo-loading');
        loadingElement.classList.remove('hidden');

        // When calling `create()`, you must provide the type of the audio input.
        // - BROWSER_FFT uses the browser's native Fourier transform.
        recognizer = speechCommands.create("BROWSER_FFT");
        await recognizer.ensureModelLoaded();

        words = recognizer.wordLabels();
        modelLoaded = true;

        // Hide the loading element
        loadingElement.classList.add('hidden');
        startListening();
    }

    function startListening() {
        recognizer.listen(({scores}) => {

            // Every time the model evaluates a result it returns the scores array.
            // From that data we build a new array with each word and its corresponding score
            scores = Array.from(scores).map((s, i) => ({score: s, word: words[i]}));

            // After that we sort the array by score, descending
            scores.sort((s1, s2) => s2.score - s1.score);

            // And we highlight the word with the highest score
            const elementId = `word-${scores[0].word}`;
            document.getElementById(elementId).classList.add('active');

            // This is just for removing the highlight after 2.5 seconds
            setTimeout(() => {
                document.getElementById(elementId).classList.remove('active');
            }, 2500);
        }, 
        {
            probabilityThreshold: 0.70
        });
    }

    function stopListening(){
        recognizer.stopListening();
    }
</script>

<div class="demo">
    Please enable the microphone checkbox and authorize this site to access the microphone.
    <br />
    Once loading has finished, say one of the words below and watch the magic happen.
    <br /><br />
    <div>
        <label class="form-switch">
            <input type="checkbox" id="audio-switch">
            Microphone
        </label>
        <div id="demo-loading" class="hidden">Loading...</div>
    </div>
    <div id="sp-cmd-wrapper" class="grid"></div>
</div>

Conclusion

TensorFlow.js is a powerful library that is ideal for deploying ML models. Today we learned that with just a few lines of code we can load a model and start generating results. As with most ML solutions, it is only as good as the model and the data.

Please let me know in the comments if you have good ideas for what TensorFlow.js can be used for, or if you know any good models I can use to build the next project and present it on the blog.

As always, thanks for reading!


If you like the story, please don't forget to subscribe to our free newsletter so we can stay connected: https://livecodestream.dev/subscribe
