Audio to Text Input via Google Speech to Text

In this article we will look into following topics

navigator.mediaDevices.getUserMedia browser Api
google speech to text Api

we will start by creating react hook which will be doing all the stuffs like startRecording, stopRecording, creating Audio Blob, error handling etc.

There are few other things to take care of before we get into the meat of the hook

Minimum decibel above which we would consider a dialogue as input eg -35db (just an random number)
How long should be the pause that would indicate that user has stopped the input e.g 2000ms

const VOICE_MIN_DECIBELS = -35
const DELAY_BETWEEN_DIALOGUE = 2000

Lets name our hook as useAudioInput.ts we would be using the browser apis like navigator.mediaDevices.getUserMedia, MediaRecorder and AudioContext. AudioContext will help us identify whether the input audio is higher than the minimum decibel that is required for it to be considered as input, so we would start with following variables and props

const defaultConfig = {
    audio: true
};

type Payload = Blob;

type Config = {
    audio: boolean;
    timeSlice?: number
    timeInMillisToStopRecording?: number
    onStop: () => void;
    onDataReceived: (payload: Payload) => void
};

export const useAudioInput = (config: Config = defaultConfig) => {
    const mediaChunks = useRef<Blob[]>([]);
    const [isRecording, setIsRecording] = useState(false);
    const mediaRecorder = useRef<MediaRecorder | null>(null);
    const [error, setError] = useState<Error| null>(null);
    let requestId: number;
    let timer: ReturnType<typeof setTimeout>;

    const createBlob = () => {
      const [chunk] = mediaChunks.current;
      const blobProperty = { type: chunk.type };
      return new Blob(mediaChunks.current, blobProperty)
    }
  ...
}

In the above code we would use mediaChunks as variable to hold input blob and mediaRecorder to have instance of new MediaRecorder which takes stream as input from navigator.mediaDevices.getUserMedia. Next lets take care of cases where getUserMedia is not available

...
useEffect(() => {
        if(!navigator.mediaDevices || !navigator.mediaDevices.getUserMedia) {
            const notAvailable = new Error('Your browser does not support Audio Input')
            setError(notAvailable)
        }

    },[]);
...

we will start writing the actual functionality of the hook which will consists of various functions like setupMediaRecorder, setupAudioContext, onRecordingStart, onRecordingActive, startRecording, stopRecording etc.

const onRecordingStart = () => mediaChunks.current = [];

const onRecordingActive = useCallback(({data}: BlobEvent) => {
        if(data) {
            mediaChunks.current.push(data);
            config?.onDataReceived?.(createBlob())
        }
    },[config]);

const startTimer = () => {
        timer = setTimeout(() => {
            stopRecording();
        }, config.timeInMillisToStopRecording)
    };

const setupMediaRecorder = ({stream}:{stream: MediaStream}) => {
        mediaRecorder.current = new MediaRecorder(stream)
        mediaRecorder.current.ondataavailable = onRecordingActive
        mediaRecorder.current.onstop = onRecordingStop
        mediaRecorder.current.onstart = onRecordingStart
        mediaRecorder.current.start(config.timeSlice)

    };

 const setupAudioContext = ({stream}:{stream: MediaStream}) => {
        const audioContext = new AudioContext();
        const audioStreamSource = audioContext.createMediaStreamSource(stream);
        const analyser = audioContext.createAnalyser();

        analyser.minDecibels = VOICE_MIN_DECIBELS;

        audioStreamSource.connect(analyser);
        const bufferLength = analyser.frequencyBinCount;
        const domainData = new Uint8Array(bufferLength)

        return {
            domainData,
            bufferLength,
            analyser
        }
    };

const startRecording = async () => {
        setIsRecording(true);

        await navigator.mediaDevices
            .getUserMedia({
                audio: config.audio
            })
            .then((stream) => {
                setupMediaRecorder({stream});
                if(config.timeSlice) {
                    const { domainData, analyser, bufferLength } = setupAudioContext({ stream });
                    startTimer()
                }
            })
            .catch(e => {
                setError(e);
                setIsRecording(false)
            })
    };



    const stopRecording = () => {
        mediaRecorder.current?.stop();

        clearTimeout(timer);
        window.cancelAnimationFrame(requestId);

        setIsRecording(false);
        onRecordingStop()
    };

    const createBlob = () => {
        const [chunk] = mediaChunks.current;
        const blobProperty = { type: chunk.type };
        return new Blob(mediaChunks.current, blobProperty)
    }

    const onRecordingStop = () => config?.onStop?.();

with the above code we are almost done with the hook, the only pending thing is to identify whether user has stopped speaking or not, we would use DELAY_BETWEEN_DIALOGUE as the time that we would wait for, if there is no input for 2 secs we will assume that user has stopped speaking and will hit the speech to text endpoint.

...
const detectSound = ({ 
        recording,
        analyser,
        bufferLength,
        domainData
    }: {
        recording: boolean
        analyser: AnalyserNode
        bufferLength: number
        domainData: Uint8Array
    }) => {
        let lastDetectedTime = performance.now();
        let anySoundDetected = false;

        const compute = () => {
            if (!recording) {
                return;
            }

            const currentTime = performance.now();

            const timeBetweenTwoDialog =
                anySoundDetected === true && currentTime - lastDetectedTime > DELAY_BETWEEN_DIALOGUE;

            if (timeBetweenTwoDialog) {
                stopRecording();

                return;
            }

            analyser.getByteFrequencyData(domainData);

            for (let i = 0; i < bufferLength; i += 1) {
                if (domainData[i] > 0) {
                    anySoundDetected = true;
                    lastDetectedTime = performance.now();
                }
            }

            requestId = window.requestAnimationFrame(compute);
        };

        compute();

    }
...

const startRecording = async () => {
 ... 
  detectSound()
 ... 
}

in the above code we are using requestAnimationFrame to detect the user audio input, with this we are done with the hook and now can start using the hook in various places.

e.g


  const onDataReceived = async (data: BodyInit) => {
    const rawResponse = await fetch('https://backend-endpoint', {
      method: 'POST',
      body: data
    });
    const response = await rawResponse.json();

    setText(response)
  };

  const { isRecording, startRecording, error } = useAudioInput({
    audio: true,
    timeInMillisToStopRecording: 2000,
    timeSlice: 400,
    onDataReceived
  })

The second part is to wire up a node server which can communicate with google speech to text api, i have attached the documentation that i referred while creating the node side of things.
https://codelabs.developers.google.com/codelabs/cloud-speech-text-node.

// demo node server which connects with google speech to text api endpoint

const express = require('express');
const cors = require('cors');

const speech = require('@google-cloud/speech');

const client = new speech.SpeechClient();

async function convert(audioBlob) {
  const request = {
    config: {
      encoding: 'WEBM_OPUS', // Ensure this matches the format of the audio being sent
      sampleRateHertz: 48000, // This should match the sample rate of your recording
      languageCode: 'en-US'
    },
    audio: {
      content: audioBlob
    }
  };

  const [response] = await client.recognize(request);

  const transcription = response.results
    .map(result => result.alternatives[0].transcript)
    .join('\n');
  return transcription;
}

const app = express();

app.use(cors())
app.use(express.json());

app.post('/upload', express.raw({ type: '*/*' }), async (req, res) => {
    const audioBlob = req.body;

    const response = await convert(audioBlob);

    res.json(response);
});

app.listen(4000,'0.0.0.0', () => {
  console.log('Example app listening on port 4000!');
});

in this article i have covered sending audio content or blob to the google speech to text endpoint, we can also send an blob uri instead of content the only change will be is the payload

// sending url as part of audio object to speech to text api 
...
audio: {url: audioUrl} or audio: {content: audioBlob}
...

The code related to the article is present in Github.

Blog

Audio to Text Input via Google Speech to Text

shubhadip

Join Our Newsletter. No Spam, Only the good stuff.

Related