Build a Telegram voice chatbot using ChatGPT API and Whisper

In this article, I will provide you with a step-by-step guide on how to create your own voice chatbot in Telegram. With this, you will be able to engage in conversations with your chatbot in a way that is natural and intuitive.

You can chat with a Telegram bot or send it a voice file, and it will send a response back along with a voice file as a reply. The conversation can continue until you choose to reset it.

You can check the source code here.

https://github.com/ngviethoang/telegram-voice-chatbot

This guide requires you to deploy on your own server with a public IP so you can set up webhooks for your Telegram bot.

I also have a personal bot built for both Messenger and Telegram. You can check it here.

https://github.com/ngviethoang/ai-chatbot

It also integrates DALL-E 2 and other models hosted on Replicate. If you are curious about this, I will cover it in another post.

Set up project

We will use Bottender, a framework for building Telegram bots faster. It also supports sessions, where we can store past conversation messages and other data, which makes it convenient to build a chatbot with conversational memory.

To create a project, run this command:

npx create-bottender-app telegram-bot

When prompted, select the Telegram platform.

You can check their documentation here for more details. https://bottender.js.org/docs/

We also need to add an Express server to serve static files, and TypeScript so the project is easier to develop and maintain later.

npm install body-parser express
npm install typescript ts-node nodemon @types/node @types/express @types/jest --save-dev
# or use yarn
yarn add body-parser express
yarn add typescript ts-node nodemon @types/node @types/express @types/jest --dev

Update scripts in package.json file

{
  "scripts": {
    "build": "tsc",
    "dev": "nodemon --exec ts-node src/server.ts",
    "lint": "eslint . --ext=js,ts",
    "start": "tsc && node dist/server.js",
    "test": "jest"
  }
}

Add tsconfig.json file

{
  "include": ["src/**/*"],
  "exclude": ["**/__tests__", "**/*.{spec,test}.ts"],
  "compilerOptions": {
    "target": "es2018",
    "lib": ["es2017", "es2018", "es2019", "es2020", "esnext.asynciterable"],
    "module": "commonjs",
    "skipLibCheck": true,
    "moduleResolution": "node",
    "esModuleInterop": true,
    "strict": true,
    "forceConsistentCasingInFileNames": true,
    "resolveJsonModule": true,
    "isolatedModules": true,
    "rootDir": "./src",
    "outDir": "./dist",
    "types": ["node", "jest"]
  }
}

Install NPM packages

In this project, we will use OpenAI APIs like model gpt-3.5-turbo for chat completion and Whisper to transcribe text from audio.

To generate voice from text, I will use the Azure Speech service, so we will install the microsoft-cognitiveservices-speech-sdk package. You can use any other service, such as Google or Amazon, for this purpose.

Also install a few helper packages: axios, util and uuid.

npm install openai gpt-3-encoder microsoft-cognitiveservices-speech-sdk axios util uuid
npm install @types/uuid --save-dev
# or use yarn
yarn add openai gpt-3-encoder microsoft-cognitiveservices-speech-sdk axios util uuid
yarn add @types/uuid --dev

Telegram setup

Edit file bottender.config.js with Telegram channel enabled

module.exports = {
  channels: {
    telegram: {
      enabled: true,
      path: '/webhooks/telegram',
      accessToken: process.env.TELEGRAM_ACCESS_TOKEN,
    },
  },
};

Make sure to set the channels.telegram.enabled field to true.

Create a bot and generate access token

You can get a Telegram bot account and a bot token by sending the /newbot command to @BotFather on Telegram.

After you get your Telegram Bot Token, paste the value into the TELEGRAM_ACCESS_TOKEN field in your .env file:

TELEGRAM_ACCESS_TOKEN=<Your Telegram Bot Token>

Set up commands for bot

Run /setcommands in @BotFather to create commands for our bot.

new - Clear old conversation and create a new one
voice - Set up voice for bot to speak
language - Set up whisper language

Set up Express server

Replace the contents of the index.js file in the root directory with this code.

/* eslint-disable import/no-unresolved */
module.exports = require('./dist').default;

Create a server.ts file in src directory and copy this code into it.

import bodyParser from 'body-parser';
import express from 'express';
import { bottender } from 'bottender';

const app = bottender({
  dev: process.env.NODE_ENV !== 'production',
});

const port = Number(process.env.PORT) || 5000;

const handle = app.getRequestHandler();

app.prepare().then(() => {
  const server = express();

  server.use(
    bodyParser.json({
      verify: (req, _, buf) => {
        (req as any).rawBody = buf.toString();
      },
    })
  );

  server.use('/static', express.static('static'));

  server.get('/api', (req, res) => {
    res.json({ ok: true });
  });

  server.all('*', (req, res) => {
    return handle(req, res);
  });

  server.listen(port, () => {
    console.log(`> Ready on http://localhost:${port}`);
  });
});

As you can see, I have a static directory to store all static files. We will use this directory to store the voice files that we generate in the next steps.

Let’s create a static directory in the root directory and a voices directory inside it.

mkdir static
mkdir static/voices

Handling Telegram events

There are 3 events that we need to handle from Telegram

Commands
Text messages
Voice messages

We will go over each of these events in detail.

Delete the old files index.js and index.test.js in the src directory, then create an index.ts file there.

Let’s use router to route different events to each handler.

import { Action, TelegramContext } from 'bottender';
import { router, text } from 'bottender/router';

export default async function App(
  context: TelegramContext
): Promise<Action<any> | void> {
  if (context.event.voice) {
    return HandleVoice;
  }
  return router([
    text(/^[/.](?<command>\w+)(?:\s(?<content>.+))?/i, HandleCommand),
    text('*', HandleText),
  ]);
}
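
To see what the command pattern hands to HandleCommand, here is a small standalone sketch. It uses the same pattern as the router above, but with numbered instead of named groups so that it compiles under any TypeScript target. The parseCommand helper is hypothetical and not part of Bottender.

```typescript
// Hypothetical helper mirroring the router's command pattern.
// Group 1 is the command name, group 2 the optional content after one whitespace character.
const COMMAND_PATTERN = /^[/.](\w+)(?:\s(.+))?/i;

interface ParsedCommand {
  command: string;
  content?: string;
}

function parseCommand(message: string): ParsedCommand | null {
  const match = COMMAND_PATTERN.exec(message);
  if (!match) {
    // Plain text: in the bot, the router falls through to HandleText.
    return null;
  }
  return { command: match[1].toLowerCase(), content: match[2] };
}

console.log(parseCommand('/voice en-US-JennyNeural'));
console.log(parseCommand('.new'));
console.log(parseCommand('hello there'));
```

Note that a command without arguments, such as /new, leaves content undefined, so handlers that expect a value should check for it.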

Handling text messages

First, we handle the user’s text messages: we send them to the ChatGPT API, send the response back to the user, and then save both messages so that the conversation can continue.

async function HandleText(context: TelegramContext) {
  await context.sendChatAction(ChatAction.Typing);
  const { text: messageText, replyToMessage } = context.event;
  let text = messageText || '';
  // If the user replied to an earlier message, append that message's text as extra context
  const replyText = replyToMessage?.text;
  if (replyText) {
    text += `\n${replyText}`;
  }

  await handleChat(context, text);
}

Next, let’s write a function to handle chat completion with ChatGPT API.

import { Configuration, OpenAIApi, ChatCompletionRequestMessage } from 'openai';

const configuration = new Configuration({
  apiKey: process.env.OPENAI_API_KEY,
});
const openai = new OpenAIApi(configuration);

export const createCompletion = async (
  messages: ChatCompletionRequestMessage[],
  max_tokens?: number,
  temperature?: number
) => {
  const response = await openai.createChatCompletion({
    model: 'gpt-3.5-turbo',
    messages,
    max_tokens,
    temperature,
  });
  return response.data.choices;
};

export const createCompletionFromConversation = async (
  context: TelegramContext,
  messages: ChatCompletionRequestMessage[]
) => {
  try {
    // Limit the reply length to avoid Telegram's message length limit; change this if you want
    const response_max_tokens = 500;
    // Context window of gpt-3.5-turbo, shared by the prompt and the reply
    const GPT3_MAX_TOKENS = 4096;
    // max_tokens applies to the reply only, so it must fit in what the prompt leaves free.
    // getTokens is a helper (not shown) that counts the tokens in the prompt messages.
    const max_tokens = Math.min(response_max_tokens, GPT3_MAX_TOKENS - getTokens(messages));

    const response = await createCompletion(messages, max_tokens);
    return response[0].message?.content;
  } catch (e) {
    console.error(e);
    return null;
  }
};
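
A long conversation will eventually exceed the model's context window, because every saved message is sent again with each request. One possible way to deal with this is to drop the oldest messages first. The sketch below is an assumption, not the article's code: trimHistory and approxTokens are hypothetical names, and the counter is only a rough estimate that you would replace with a real tokenizer such as the gpt-3-encoder package installed earlier.

```typescript
type Role = 'system' | 'user' | 'assistant';

interface ChatMessage {
  role: Role;
  content: string;
}

// Assumption: roughly 4 characters per token plus a small per-message overhead.
// Swap in a real tokenizer for accurate counts.
function approxTokens(messages: ChatMessage[]): number {
  let total = 0;
  for (const message of messages) {
    total += Math.ceil(message.content.length / 4) + 4;
  }
  return total;
}

// Drop the oldest non-system messages until the prompt leaves room for the reply.
// The input array is not modified.
function trimHistory(
  messages: ChatMessage[],
  contextWindow: number,
  replyBudget: number,
  countTokens: (messages: ChatMessage[]) => number = approxTokens
): ChatMessage[] {
  const trimmed = messages.slice();
  while (trimmed.length > 1 && countTokens(trimmed) > contextWindow - replyBudget) {
    // Keep a leading system message, if any, and remove the message after it.
    const dropIndex = trimmed[0].role === 'system' ? 1 : 0;
    trimmed.splice(dropIndex, 1);
  }
  return trimmed;
}
```

In this bot it could be called as trimHistory(messages, 4096, 500) just before the completion request.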

We need to add OPENAI_API_KEY variable to our .env file. You can get your API key from here.

https://platform.openai.com/account/api-keys

OPENAI_API_KEY=<Your OpenAI API key>

You can also set the system role message with your own prompt. In this way, your bot will have its own character and can serve your own purposes, such as a personal trainer, advisor, or any character from movies, novels...
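
As a rough illustration of the system role message, the sketch below puts a persona prompt in front of the saved history before each request. The buildMessages helper, the ChatMessage type and the prompt text are all hypothetical; the message shape simply mirrors the role and content fields used in this article.

```typescript
interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

// Hypothetical persona prompt; replace it with your own.
const SYSTEM_PROMPT = 'You are a friendly personal trainer. Keep answers short and practical.';

// Put the system message first, then the saved history, then the new user message.
function buildMessages(
  history: ChatMessage[],
  userText: string,
  systemPrompt: string = SYSTEM_PROMPT
): ChatMessage[] {
  const system: ChatMessage = { role: 'system', content: systemPrompt };
  const user: ChatMessage = { role: 'user', content: userText };
  return [system].concat(history, [user]);
}

console.log(buildMessages([], 'How do I start running?'));
```

Adding the system message at request time, rather than storing it in the session, means you can change the persona without clearing existing conversations.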

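Now we send the response back to the user as a message in Markdown format.

We also want the conversation to continue when the user sends a new message, so we save the messages in the session state, which Bottender provides. Because the code below spreads context.state.context, give it an empty array as its starting value through the initialState setting in bottender.config.js. You can read about session state here.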

https://bottender.js.org/docs/the-basics-session

export const handleChat = async (context: TelegramContext, text: string) => {
  const response = await createCompletionFromConversation(context, [
    ...context.state.context as any,
    { role: 'user', content: text },
  ]);
  if (!response) {
    await context.sendText(
      'Sorry! Please try again'
    );
    return;
  }
  let content = response.trim()

  await context.sendMessage(content, { parseMode: ParseMode.Markdown });
  await handleTextToSpeech(context, content, getAzureVoiceName(context))

  // save current conversation in session
  context.setState({
    ...context.state,
    context: [
      ...context.state.context as any,
      { role: 'user', content: text },
      { role: 'assistant', content },
    ],
  });
}

You can also use a different session driver by editing session.driver in the bottender.config.js file.

// bottender.config.js

module.exports = {
  session: {
    driver: 'memory',
    stores: {
      memory: {
        maxSize: 500,
      },
      file: {
        dirname: '.sessions',
      },
      redis: {
        port: 6379,
        host: '127.0.0.1',
        password: 'auth',
        db: 0,
      },
      mongo: {
        url: 'mongodb://localhost:27017',
        collectionName: 'sessions',
      },
    },
  },
};

Send bot’s response as voice

Next, we want to convert this response to a voice file and send it to the user in Telegram, as if the bot were talking to them. To do this, I will use the Azure Speech service.

You can set up Azure service by following the documentation here.

Text-to-speech quickstart - Speech service - Azure Cognitive Services | Microsoft Learn

After creating the Speech resource, set the environment variables in the .env file.

AZURE_SPEECH_KEY=
AZURE_SPEECH_REGION=

The code below converts the bot’s message to an audio file in ogg format and sends it to the user.

import { TelegramContext } from 'bottender';
import { v4 as uuidv4 } from 'uuid';
import { SpeechConfig, AudioConfig, SpeechSynthesizer, ResultReason } from 'microsoft-cognitiveservices-speech-sdk';

export const textToSpeech = async (text: string, outputFile: string, voiceName?: string) => {
  return new Promise((resolve, reject) => {
    // This example requires environment variables named "AZURE_SPEECH_KEY" and "AZURE_SPEECH_REGION"
    const speechConfig = SpeechConfig.fromSubscription(process.env.AZURE_SPEECH_KEY || '', process.env.AZURE_SPEECH_REGION || '');
    const audioConfig = AudioConfig.fromAudioFileOutput(outputFile);

    // The language of the voice that speaks.
    speechConfig.speechSynthesisVoiceName = voiceName || "en-US-JennyNeural";

    // Create the speech synthesizer.
    const synthesizer = new SpeechSynthesizer(speechConfig, audioConfig);

    synthesizer?.speakTextAsync(text,
      function (result) {
        if (result.reason === ResultReason.SynthesizingAudioCompleted) {
          // console.log("synthesis finished.");
        } else {
          console.error("Speech synthesis canceled, " + result.errorDetails +
            "\nDid you set the speech resource key and region values?");
        }
        synthesizer?.close();
        resolve(result);
      },
      function (err) {
        console.trace("err - " + err);
        synthesizer?.close();
        reject(err);
      });
  });
}

export const handleTextToSpeech = async (context: TelegramContext, message: string, voiceName?: string) => {
  try {
    await context.sendChatAction(ChatAction.Typing);

    // set random filename
    const fileId = uuidv4().replace(/-/g, '')
    const outputDir = `static/voices`
    const outputFile = `${outputDir}/voice_${fileId}.ogg`
    const encodedOutputFile = `${outputDir}/voice_${fileId}_encoded.ogg`

    const result = await textToSpeech(
      message || '',
      outputFile,
      voiceName || getAzureVoiceName(context)
    )
    await encodeOggWithOpus(outputFile, encodedOutputFile)

    const voiceUrl = `${process.env.PROD_API_URL}/${encodedOutputFile}`

    await context.sendVoice(voiceUrl)
  } catch (err) {
    console.trace("err - " + err);
  }
}

To send this audio file as a voice message, Telegram requires an .ogg file encoded with Opus, so we need one extra step to re-encode the file. One way to do this is with ffmpeg.

Let’s install this package on our machine first. You can check how to install it here.

On Windows (with Chocolatey), run this command

choco install ffmpeg

On Debian or Ubuntu Linux, run this command

sudo apt install ffmpeg

Next, we will run this command in our JS code to convert ogg file to an encoded file.

import { exec } from 'child_process';
import { promisify } from 'util';

const asyncExec = promisify(exec);

export const encodeOggWithOpus = async (inputFile: string, outputFile: string) => {
  try {
    const { stdout, stderr } = await asyncExec(`ffmpeg -loglevel error -i ${inputFile} -c:a libopus -b:a 96K ${outputFile}`);
    // console.log(stdout);

    if (stderr) {
      console.error(stderr);
    }
  } catch (err) {
    console.error(err);
  }
}

Great, after converting the file, we send it to the user.

Note that I send the file as a URL, which is why we store these files in the static directory we created earlier. The URL must be absolute, so remember to set the domain you use to run this bot.

const voiceUrl = `${process.env.PROD_API_URL}/${encodedOutputFile}`

await context.sendVoice(voiceUrl)

Set the environment variable PROD_API_URL in the .env file to your domain, for example https://example.com

PROD_API_URL=<your api url>
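
A stray slash in PROD_API_URL or a leading ./ in the file path is enough to produce a URL that Telegram cannot fetch. As a small precaution, the URL could be built with a helper like the one below. The publicUrl function is a hypothetical addition and not part of the original code.

```typescript
// Join the public base URL and a relative file path with exactly one slash between them.
function publicUrl(baseUrl: string, relativePath: string): string {
  // Remove trailing slashes from the base, e.g. "https://example.com/".
  const base = baseUrl.replace(/\/+$/, '');
  // Remove leading "./" or "/" from the path, e.g. "./static/voices/a.ogg".
  const path = relativePath.replace(/^(\.\/|\/)+/, '');
  return base + '/' + path;
}

console.log(publicUrl('https://example.com/', './static/voices/voice_1_encoded.ogg'));
```

In handleTextToSpeech, this would replace the template string that builds voiceUrl.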

Handling commands

All commands are handled by the code below.

async function HandleCommand(
  context: TelegramContext,
  {
    match: {
      groups: { command, content },
    },
  }: any
) {
  switch (command.toLowerCase()) {
    case 'new':
      await clearServiceData(context);
      break;
    case 'voice':
      await setAzureVoiceName(context, content)
      break;
    case 'language':
      await setWhisperLang(context, content)
      break;
    default:
      await context.sendText('Sorry! Command not found.');
      break;
  }
}

With the /new command, we simply clear the conversation data from the state.

export const clearServiceData = async (context: TelegramContext) => {
  context.setState({
    ...context.state,
    context: [],
  });
  await context.sendText('New conversation.');
};

With the /voice command, we save the chosen voice in the settings state.

const getSettings = (context: TelegramContext): any => {
  return context.state.settings || {}
}

export const setSettings = async (context: TelegramContext, key: string, value: string) => {
  let newValue: any = value
  if (value === 'true') {
    newValue = true
  } else if (value === 'false') {
    newValue = false
  }
  context.setState({
    ...context.state,
    settings: {
      ...getSettings(context),
      [key]: newValue,
    },
  })
}

export const setAzureVoiceName = async (context: TelegramContext, voiceName: string) => {
  await setSettings(context, 'azureVoiceName', voiceName)
}
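
The getAzureVoiceName function used earlier is not shown in this article. Presumably it reads the saved setting and falls back to a default voice. The sketch below shows that idea on a plain state object so it can run on its own; BotState, withSetting and voiceNameFrom are hypothetical names, and in the bot the state would come from context.state.

```typescript
type BotSettings = { [key: string]: string | boolean | undefined };

interface BotState {
  settings?: BotSettings;
}

const DEFAULT_VOICE = 'en-US-JennyNeural';

// Same coercion as setSettings: the strings 'true' and 'false' become booleans.
function coerceSetting(value: string): string | boolean {
  if (value === 'true') return true;
  if (value === 'false') return false;
  return value;
}

// Return a new state object with one setting changed; the input is not mutated.
function withSetting(state: BotState, key: string, value: string): BotState {
  const current: BotSettings = state.settings || {};
  const settings: BotSettings = { ...current, [key]: coerceSetting(value) };
  return { ...state, settings };
}

// Hypothetical counterpart of getAzureVoiceName: fall back to a default voice.
function voiceNameFrom(state: BotState): string {
  const name = state.settings ? state.settings['azureVoiceName'] : undefined;
  return typeof name === 'string' && name ? name : DEFAULT_VOICE;
}

const updated = withSetting({}, 'azureVoiceName', 'en-GB-SoniaNeural');
console.log(voiceNameFrom({}), voiceNameFrom(updated));
```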

You can check the available voices here:

Language support - Speech service - Azure Cognitive Services | Microsoft Learn

The /language command works the same way as /voice. We use it to set the Whisper API’s language parameter.

export const setWhisperLang = async (context: TelegramContext, language: string) => {
  await setSettings(context, 'whisperLang', language)
}

Handling voice messages

When the user sends a voice message to our bot, we need to transcribe it to text. We will use the Whisper API for this.

The voice event is handled by the code below.

async function HandleVoice(context: TelegramContext) {
  await handleAudioForChat(context)
}

export const handleAudioForChat = async (context: TelegramContext) => {
  let transcription: any
  const fileUrl = await getFileUrl(context.event.voice.fileId)
  if (fileUrl) {
    transcription = await getTranscription(context, fileUrl)
  }
  if (!transcription) {
    await context.sendText(`Error getting transcription!`);
    return
  }

  await context.sendMessage(`_${transcription}_`, { parseMode: ParseMode.Markdown });

  await context.sendChatAction(ChatAction.Typing);
  await handleChat(context, transcription)
}

When we receive a voice event from the webhook, it contains only the file ID. So in the next step we call the Telegram Bot API’s getFile method to get the file path and build the full download URL.

import axios from "axios"

export const getFileUrl = async (file_id: string) => {
  try {
    const response = await axios({
      method: 'GET',
      url: `https://api.telegram.org/bot${process.env.TELEGRAM_ACCESS_TOKEN}/getFile`,
      params: {
        file_id
      }
    })
    if (response.status !== 200) {
      console.error(response.data);
      return null;
    }

    const filePath = response.data.result.file_path;
    return `https://api.telegram.org/file/bot${process.env.TELEGRAM_ACCESS_TOKEN}/${filePath}`
  } catch (e) {
    console.error(e);
    return null;
  }
}

After receiving the file URL, we download the file to the static directory; Telegram delivers voice messages in .oga format. The Whisper API accepts only a fixed list of formats, and mp3 is one of them, so we convert the file to mp3, again using ffmpeg.

const asyncExec = promisify(exec);

export const convertOggToMp3 = async (inputFile: string, outputFile: string) => {
  try {
    const { stdout, stderr } = await asyncExec(`ffmpeg -loglevel error -i ${inputFile} -c:a libmp3lame -q:a 2 ${outputFile}`);
    // console.log(stdout);

    if (stderr) {
      console.error(stderr);
    }
  } catch (err) {
    console.error(err);
  }
}
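
Both ffmpeg calls in this article build a shell command by string interpolation, which breaks if a path ever contains spaces or special characters. One alternative is to build the arguments as an array and pass them to execFile from child_process, which does not go through a shell. The ffmpegArgs helper below is a hypothetical sketch of that idea. It reuses the codec options from the two commands above and adds -y, which tells ffmpeg to overwrite an existing output file.

```typescript
type AudioTarget = 'opus' | 'mp3';

// Build the ffmpeg argument list for the two conversions used in this article.
function ffmpegArgs(inputFile: string, outputFile: string, target: AudioTarget): string[] {
  const codec = target === 'opus'
    ? ['-c:a', 'libopus', '-b:a', '96K']   // voice reply sent to Telegram
    : ['-c:a', 'libmp3lame', '-q:a', '2']; // upload for Whisper
  // -y overwrites an existing output file instead of waiting for confirmation.
  return ['-y', '-loglevel', 'error', '-i', inputFile].concat(codec, [outputFile]);
}

console.log(ffmpegArgs('static/voices/in.oga', 'static/voices/in.mp3', 'mp3').join(' '));
```

With this helper, the conversion call would become something like execFile('ffmpeg', ffmpegArgs(inputFile, outputFile, 'mp3')), wrapped with promisify in the same way as exec.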

Now, let’s send this mp3 file to Whisper API to get the transcription.

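import fs from 'fs';

const downloadsPath = './static/voices';

export const getTranscription = async (context: TelegramContext, url: string, language?: string) => {
  try {
    let filePath = await downloadFile(url, downloadsPath);
    // Telegram voice messages arrive as .oga files; convert them to mp3 for Whisper
    if (filePath.endsWith('.oga')) {
      const newFilePath = filePath.replace('.oga', '.mp3')
      await convertOggToMp3(filePath, newFilePath)
      filePath = newFilePath
    }
    // Fall back to the language saved by the /language command
    const whisperLang = language || (context.state.settings as any)?.whisperLang;
    const response = await openai.createTranscription(
      fs.createReadStream(filePath) as any,
      'whisper-1',
      undefined, // prompt
      undefined, // response format
      undefined, // temperature
      whisperLang,
    );
    return response.data.text
  } catch (e) {
    console.error(e);
    return null;
  }
}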
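
The downloadFile helper is not shown in this article. From how it is used, it saves the file into the given directory and returns the local path. The sketch below covers only the path handling, under that assumption; localPathFor and mp3PathFor are hypothetical names, and the actual download (for example with axios) is left out.

```typescript
// Derive the local file path from the last path segment of the download URL.
function localPathFor(url: string, dir: string): string {
  const withoutQuery = url.split('?')[0];
  const segments = withoutQuery.split('/');
  // Fall back to a fixed name if the URL ends with a slash.
  const fileName = segments[segments.length - 1] || 'voice.oga';
  return dir.replace(/\/+$/, '') + '/' + fileName;
}

// Swap a trailing .oga extension for .mp3; other paths are returned unchanged.
function mp3PathFor(filePath: string): string {
  return filePath.replace(/\.oga$/i, '.mp3');
}

const local = localPathFor('https://api.telegram.org/file/bot<token>/voice/file_12.oga', './static/voices');
console.log(local, mp3PathFor(local));
```

File names taken from the URL may repeat across users, so adding a random prefix, as done for the generated voice files, would avoid overwriting.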

After we get the transcription, the rest is the same as for text messages: handleAudioForChat, shown at the start of this section, echoes the transcription back to the user in italics and passes it to handleChat, which sends ChatGPT’s reply back to the user.

You can see our bot also converts its reply to speech and responds back to the user.

You could also add a setting that disables text replies, so the bot answers with voice only.
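
One way such a setting could work is sketched below. The setting names textReply and voiceReply are hypothetical; they would be stored with setSettings like the other options, and handleChat would send the text message and the voice file only when the matching step is returned.

```typescript
interface ReplySettings {
  textReply?: boolean;
  voiceReply?: boolean;
}

type ReplyStep = 'text' | 'voice';

// Decide which replies to send. Both are on unless explicitly disabled.
function replySteps(settings: ReplySettings): ReplyStep[] {
  const steps: ReplyStep[] = [];
  if (settings.textReply !== false) steps.push('text');
  if (settings.voiceReply !== false) steps.push('voice');
  // Never leave the user without any reply.
  if (steps.length === 0) steps.push('text');
  return steps;
}

console.log(replySteps({ textReply: false }));
```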

Deployment

Run Bottender on your server with the following command:

npm start
# or use yarn
yarn start

Set Up Webhook for Production

Run this command to set up the webhook for the Telegram bot, assuming your URL is https://example.com/webhooks/telegram

npx bottender telegram webhook set -w https://example.com/webhooks/telegram

Now you are ready to talk to your bot.

Send a voice message and wait for the bot to respond.

Conclusion

I hope with this guide, you can build your own talking chatbot and have fun talking with it.

Hope you like it!
