Build a Telegram voice chatbot using ChatGPT API and Whisper
Viet Hoang
Posted on March 30, 2023
In this article, I will walk you through building your own voice chatbot in Telegram, step by step, so you can converse with it in a natural and intuitive way.
You can chat with the Telegram bot by text or send it a voice message, and it will reply with both a text response and a voice message. The conversation continues until you choose to reset it.
You can check the source code here.
https://github.com/ngviethoang/telegram-voice-chatbot
This guide requires you to deploy on your own server with a public IP so you can set up webhooks for your Telegram bot.
Also, this is my personal bot, built for both Messenger and Telegram; you can check it here.
https://github.com/ngviethoang/ai-chatbot
It also integrates DALL·E 2 and other models on Replicate. If you are curious about that, I will cover it in another post.
Set up project
We will use Bottender, a framework that makes it faster to write Telegram bots. It also supports sessions for storing past conversation messages and other data, which makes it convenient to build a chatbot with conversational memory.
To create a project, run this command:
npx create-bottender-app telegram-bot
When prompted, select the Telegram platform.
You can check their documentation here for more details. https://bottender.js.org/docs/
We also need to add an Express server for serving static files, and TypeScript so the project is easier to develop and maintain later.
npm install body-parser express
npm install typescript ts-node nodemon --save-dev
# or use yarn
yarn add body-parser express
yarn add typescript ts-node nodemon --dev
Update the scripts in the package.json file:
{
"scripts": {
"build": "tsc",
"dev": "nodemon --exec ts-node src/server.ts",
"lint": "eslint . --ext=js,ts",
"start": "tsc && node dist/server.js",
"test": "jest"
},
}
Add a tsconfig.json file:
{
"include": ["src/**/*"],
"exclude": ["**/__tests__", "**/*.{spec,test}.ts"],
"compilerOptions": {
"target": "es2016",
"lib": ["es2017", "es2018", "es2019", "es2020", "esnext.asynciterable"],
"module": "commonjs",
"skipLibCheck": true,
"moduleResolution": "node",
"esModuleInterop": true,
"strict": true,
"forceConsistentCasingInFileNames": true,
"resolveJsonModule": true,
"isolatedModules": true,
"rootDir": "./src",
"outDir": "./dist",
"types": ["node", "jest"]
}
}
Install NPM packages
In this project, we will use OpenAI APIs: the gpt-3.5-turbo model for chat completion and Whisper to transcribe audio to text.
To generate voice from text, I will use the Azure Speech service, so we will install the microsoft-cognitiveservices-speech-sdk package. You can use any other service, such as Google or Amazon, for the same purpose.
Also install a few helper packages: axios, util, and uuid.
npm install openai gpt-3-encoder microsoft-cognitiveservices-speech-sdk axios util uuid
npm install @types/uuid --save-dev
# or use yarn
yarn add openai gpt-3-encoder microsoft-cognitiveservices-speech-sdk axios util uuid
yarn add @types/uuid --dev
Telegram setup
Edit the bottender.config.js file to enable the Telegram channel:
module.exports = {
channels: {
telegram: {
enabled: true,
path: '/webhooks/telegram',
accessToken: process.env.TELEGRAM_ACCESS_TOKEN,
},
},
};
Make sure the channels.telegram.enabled field is set to true.
Create a bot and generate access token
You can get a Telegram bot account and a bot token by sending the /newbot command to @BotFather on Telegram.
After you get your Telegram bot token, paste the value into the TELEGRAM_ACCESS_TOKEN field in your .env file:
TELEGRAM_ACCESS_TOKEN=<Your Telegram Bot Token>
Set up commands for the bot
Run /setcommands in @BotFather to create the commands for our bot:
new - Clear old conversation and create a new one
voice - Set up voice for bot to speak
language - Set up whisper language
Set up Express server
Replace the code in the index.js file in the root directory with the following:
/* eslint-disable import/no-unresolved */
module.exports = require('./dist').default;
Create a server.ts file in the src directory and copy this code into it.
import bodyParser from 'body-parser';
import express from 'express';
import { bottender } from 'bottender';
const app = bottender({
dev: process.env.NODE_ENV !== 'production',
});
const port = Number(process.env.PORT) || 5000;
const handle = app.getRequestHandler();
app.prepare().then(() => {
const server = express();
server.use(
bodyParser.json({
verify: (req, _, buf) => {
(req as any).rawBody = buf.toString();
},
})
);
server.use('/static', express.static('static'));
server.get('/api', (req, res) => {
res.json({ ok: true });
});
server.all('*', (req, res) => {
return handle(req, res);
});
server.listen(port, () => {
console.log(`> Ready on http://localhost:${port}`);
});
});
As you can see, there is a static directory for static files; we will use it to store the voice files we generate in the next steps.
Let's create the static directory in the project root and a voices directory inside it.
mkdir static
mkdir static/voices
Handling Telegram events
There are three types of events we need to handle from Telegram:
- Commands
- Text messages
- Voice messages
We will go over each of these events in detail.
Delete the old index.js and index.test.js files, then create an index.ts file inside the src directory.
Let's use the router to route each type of event to its handler.
import { Action, TelegramContext } from 'bottender';
import { router, text } from 'bottender/router';
export default async function App(
context: TelegramContext
): Promise<Action<any> | void> {
if (context.event.voice) {
return HandleVoice;
}
return router([
text(/^[/.](?<command>\w+)(?:\s(?<content>.+))?/i, HandleCommand),
text('*', HandleText),
])
};
Handling text messages
First, we handle the user's text messages by sending them to the ChatGPT API, returning the response to the user, and then saving both messages so the conversation can continue.
async function HandleText(context: TelegramContext) {
await context.sendChatAction(ChatAction.Typing);
let { text, replyToMessage } = context.event;
// Add reply message to text content
const { text: replyText } = replyToMessage || {}
if (replyText) {
text += `\n${replyText}`
}
await handleChat(context, text)
}
Next, let's write a function to handle chat completion with the ChatGPT API.
import { Configuration, OpenAIApi, ChatCompletionRequestMessage } from 'openai';

const configuration = new Configuration({
apiKey: process.env.OPENAI_API_KEY,
});
const openai = new OpenAIApi(configuration);
export const createCompletion = async (messages: ChatCompletionRequestMessage[], max_tokens?: number, temperature?: number) => {
const response = await openai.createChatCompletion({
model: "gpt-3.5-turbo",
messages,
max_tokens,
temperature,
});
return response.data.choices;
};
export const createCompletionFromConversation = async (
context: TelegramContext,
messages: ChatCompletionRequestMessage[]) => {
try {
// limit response to avoid message length limit, you can change this if you want
const response_max_tokens = 500
const GPT3_MAX_TOKENS = 4096
const max_tokens = Math.min(getTokens(messages) + response_max_tokens, GPT3_MAX_TOKENS)
const response = await createCompletion(messages, max_tokens);
return response[0].message?.content;
} catch (e) {
return null;
}
};
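The getTokens helper used above is not shown in this snippet. Here is a minimal sketch of how it could be implemented with the gpt-3-encoder package we installed earlier; note that this is a rough approximation I am assuming, not the exact token accounting the chat API applies to roles and separators.

import { encode } from 'gpt-3-encoder';
import { ChatCompletionRequestMessage } from 'openai';

// Rough token count for a list of chat messages: encode each message's
// content and sum the lengths.
export const getTokens = (messages: ChatCompletionRequestMessage[]): number => {
  return messages.reduce(
    (total, message) => total + encode(message.content || '').length,
    0
  );
};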
We need to add the OPENAI_API_KEY variable to our .env file. You can get your API key here:
https://platform.openai.com/account/api-keys
OPENAI_API_KEY=<Your OpenAI API key>
You can also set a system role message with your own prompt. That way, your bot will have its own character and can serve your own purposes, such as a personal trainer, an advisor, or any character from movies or novels.
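For example, when we call createCompletionFromConversation in the handleChat function below, we could prepend a system message; the prompt text here is just an illustration.

// Illustrative system prompt; change the content to give the bot its own character.
const systemMessage: ChatCompletionRequestMessage = {
  role: 'system',
  content: 'You are a friendly personal trainer. Keep your answers short and motivating.',
};

const response = await createCompletionFromConversation(context, [
  systemMessage,
  ...context.state.context as any,
  { role: 'user', content: text },
]);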
Now we will send the response back to the user as a message in Markdown format.
We also want the conversation to keep going when the user sends a new message, so we will save these messages. Bottender supports this through session state; you can read about it here.
https://bottender.js.org/docs/the-basics-session
export const handleChat = async (context: TelegramContext, text: string) => {
const response = await createCompletionFromConversation(context, [
...context.state.context as any,
{ role: 'user', content: text },
]);
if (!response) {
await context.sendText(
'Sorry! Please try again.'
);
return;
}
let content = response.trim()
await context.sendMessage(content, { parseMode: ParseMode.Markdown });
await handleTextToSpeech(context, content, getAzureVoiceName(context))
// save current conversation in session
context.setState({
...context.state,
context: [
...context.state.context as any,
{ role: 'user', content: text },
{ role: 'assistant', content },
],
});
}
You can also set up a different session driver by editing session.driver in the bottender.config.js file.
// bottender.config.js
module.exports = {
session: {
driver: 'memory',
stores: {
memory: {
maxSize: 500,
},
file: {
dirname: '.sessions',
},
redis: {
port: 6379,
host: '127.0.0.1',
password: 'auth',
db: 0,
},
mongo: {
url: 'mongodb://localhost:27017',
collectionName: 'sessions',
},
},
},
};
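For example, to keep conversations across server restarts you could switch to the file driver; a minimal sketch (the other store options can stay as they are):

// bottender.config.js
module.exports = {
  session: {
    driver: 'file',
    stores: {
      file: {
        dirname: '.sessions', // session files are written to this directory
      },
    },
  },
};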
Send bot’s response as voice
Next, we want to convert the response into a voice file and send it to the user, so it feels like the bot is talking to them. To do this, I will use the Azure Speech service.
You can set up the Azure Speech service by following the documentation here.
Text-to-speech quickstart - Speech service - Azure Cognitive Services | Microsoft Learn
After creating the Speech resource, set these environment variables in the .env file:
AZURE_SPEECH_KEY=
AZURE_SPEECH_REGION=
The following code converts the bot's message to an audio file in OGG format.
import { SpeechConfig, AudioConfig, SpeechSynthesizer, ResultReason } from 'microsoft-cognitiveservices-speech-sdk'
export const textToSpeech = async (text: string, outputFile: string, voiceName?: string) => {
return new Promise((resolve, reject) => {
// Reads the credentials from the AZURE_SPEECH_KEY and AZURE_SPEECH_REGION environment variables
const speechConfig = SpeechConfig.fromSubscription(process.env.AZURE_SPEECH_KEY || '', process.env.AZURE_SPEECH_REGION || '');
const audioConfig = AudioConfig.fromAudioFileOutput(outputFile);
// The language of the voice that speaks.
speechConfig.speechSynthesisVoiceName = voiceName || "en-US-JennyNeural";
// Create the speech synthesizer.
const synthesizer = new SpeechSynthesizer(speechConfig, audioConfig);
synthesizer?.speakTextAsync(text,
function (result) {
if (result.reason === ResultReason.SynthesizingAudioCompleted) {
// console.log("synthesis finished.");
} else {
console.error("Speech synthesis canceled, " + result.errorDetails +
"\nDid you set the speech resource key and region values?");
}
synthesizer?.close();
resolve(result);
},
function (err) {
console.trace("err - " + err);
synthesizer?.close();
reject(err);
});
});
}
import { v4 as uuidv4 } from 'uuid';

export const handleTextToSpeech = async (context: TelegramContext, message: string, voiceName?: string) => {
try {
await context.sendChatAction(ChatAction.Typing);
// generate a random filename
const fileId = uuidv4().replace(/-/g, '')
const outputDir = `static/voices`
const outputFile = `${outputDir}/voice_${fileId}.ogg`
const encodedOutputFile = `${outputDir}/voice_${fileId}_encoded.ogg`
const result = await textToSpeech(
message || '',
outputFile,
voiceName || getAzureVoiceName(context)
)
await encodeOggWithOpus(outputFile, encodedOutputFile)
const voiceUrl = `${process.env.PROD_API_URL}/${encodedOutputFile}`
await context.sendVoice(voiceUrl)
} catch (err) {
console.trace("err - " + err);
}
}
In order to send this audio file as a voice message, we need one small extra step: Telegram expects voice messages to be OGG audio encoded with the Opus codec, so we re-encode the file first. One way to do this is with the ffmpeg tool.
Let's install ffmpeg on our machine first. You can check how to install it here.
On Windows, run this command:
choco install ffmpeg
On Linux, run this command:
sudo apt install ffmpeg
Next, we run this command from our code to re-encode the OGG file:
import { exec } from 'child_process';
import { promisify } from 'util';
const asyncExec = promisify(exec);
export const encodeOggWithOpus = async (inputFile: string, outputFile: string) => {
try {
const { stdout, stderr } = await asyncExec(`ffmpeg -loglevel error -i ${inputFile} -c:a libopus -b:a 96K ${outputFile}`);
// console.log(stdout);
if (stderr) {
console.error(stderr);
}
} catch (err) {
console.error(err);
}
}
After encoding the file, we send it to the user.
Note that the file is sent as a URL, so we store these files in the static directory we created earlier. You will need to build the full URL, so remember to use the domain where you run this bot.
const voiceUrl = `${process.env.PROD_API_URL}/${encodedOutputFile}`
await context.sendVoice(voiceUrl)
Set the PROD_API_URL environment variable in the .env file to your domain, for example https://example.com:
PROD_API_URL=<your api url>
Handling commands
We handle all the commands with the code below.
async function HandleCommand(
context: TelegramContext,
{
match: {
groups: { command, content },
},
}: any
) {
switch (command.toLowerCase()) {
case 'new':
await clearServiceData(context);
break;
case 'voice':
await setAzureVoiceName(context, content)
break;
case 'language':
await setWhisperLang(context, content)
break;
default:
await context.sendText('Sorry! Command not found.');
break;
}
}
With the /new command, we simply clear the conversation data from the state.
export const clearServiceData = async (context: TelegramContext) => {
context.setState({
...context.state,
context: [],
});
await context.sendText('New conversation.');
};
With the /voice command, we save this option to the settings state.
const getSettings = (context: TelegramContext): any => {
return context.state.settings || {}
}
export const setSettings = async (context: TelegramContext, key: string, value: string) => {
let newValue: any = value
if (value === 'true') {
newValue = true
} else if (value === 'false') {
newValue = false
}
context.setState({
...context.state,
settings: {
...getSettings(context),
[key]: newValue,
},
})
}
export const setAzureVoiceName = async (context: TelegramContext, voiceName: string) => {
await setSettings(context, 'azureVoiceName', voiceName)
}
You can check the available voices here:
Language support - Speech service - Azure Cognitive Services | Microsoft Learn
The /language command works the same way as /voice; we will use it to set the Whisper API's language parameter.
export const setWhisperLang = async (context: TelegramContext, language: string) => {
await setSettings(context, 'whisperLang', language)
}
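The getAzureVoiceName helper used in handleChat and handleTextToSpeech is not shown above. Here is a minimal sketch of both getters, assuming they simply read from the settings state with sensible defaults; you could also pass getWhisperLang(context) as the language argument when calling the transcription function later.

// Minimal getter sketches, assuming the settings shape written by setSettings above.
export const getAzureVoiceName = (context: TelegramContext): string => {
  return getSettings(context).azureVoiceName || 'en-US-JennyNeural';
};

export const getWhisperLang = (context: TelegramContext): string | undefined => {
  return getSettings(context).whisperLang;
};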
Handling voice messages
When the user sends a voice message to our bot, we need to transcribe it to text. We will use the Whisper API for this.
We handle the voice event with the code below.
async function HandleVoice(context: TelegramContext) {
await handleAudioForChat(context)
}
export const handleAudioForChat = async (context: TelegramContext) => {
let transcription: any
const fileUrl = await getFileUrl(context.event.voice.fileId)
if (fileUrl) {
transcription = await getTranscription(context, fileUrl)
}
if (!transcription) {
await context.sendText(`Error getting transcription!`);
return
}
await context.sendMessage(`_${transcription}_`, { parseMode: ParseMode.Markdown });
await context.sendChatAction(ChatAction.Typing);
await handleChat(context, transcription)
}
Webhook voice events only contain the file ID, so in the next step we use the Telegram API to get the full path of the voice file.
import axios from "axios"
export const getFileUrl = async (file_id: string) => {
try {
const response = await axios({
method: 'GET',
url: `https://api.telegram.org/bot${process.env.TELEGRAM_ACCESS_TOKEN}/getFile`,
params: {
file_id
}
})
if (response.status !== 200) {
console.error(response.data);
return null;
}
const filePath = response.data.result.file_path;
return `https://api.telegram.org/file/bot${process.env.TELEGRAM_ACCESS_TOKEN}/${filePath}`
} catch (e) {
console.error(e);
return null;
}
}
After getting the file path, we download the file into the static directory in OGA format. However, the Whisper API does not accept OGA files, so we convert the audio to MP3, again using ffmpeg.
const asyncExec = promisify(exec);
export const convertOggToMp3 = async (inputFile: string, outputFile: string) => {
try {
const { stdout, stderr } = await asyncExec(`ffmpeg -loglevel error -i ${inputFile} -c:a libmp3lame -q:a 2 ${outputFile}`);
// console.log(stdout);
if (stderr) {
console.error(stderr);
}
} catch (err) {
console.error(err);
}
}
Now, let's send this MP3 file to the Whisper API to get the transcription.
import fs from 'fs';

const downloadsPath = './static/voices';
export const getTranscription = async (context: TelegramContext, url: string, language?: string) => {
try {
let filePath = await downloadFile(url, downloadsPath);
if (filePath.endsWith('.oga')) {
const newFilePath = filePath.replace('.oga', '.mp3')
await convertOggToMp3(filePath, newFilePath)
filePath = newFilePath
}
const response = await openai.createTranscription(
fs.createReadStream(filePath) as any,
'whisper-1',
undefined, undefined, undefined,
language,
);
return response.data.text
} catch (e) {
return null;
}
}
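The downloadFile helper is not shown above. Here is a minimal sketch using axios, assuming the local filename is taken from the URL (Telegram voice file paths usually end with .oga):

import axios from 'axios';
import fs from 'fs';
import path from 'path';

// Download a file from a URL into the given directory and return the local path.
export const downloadFile = async (url: string, downloadsPath: string): Promise<string> => {
  const fileName = path.basename(new URL(url).pathname);
  const filePath = path.join(downloadsPath, fileName);
  const response = await axios({ method: 'GET', url, responseType: 'stream' });
  await new Promise<void>((resolve, reject) => {
    const writer = fs.createWriteStream(filePath);
    response.data.pipe(writer);
    writer.on('finish', resolve);
    writer.on('error', reject);
  });
  return filePath;
};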
After we get the transcription, the rest is the same as before: we call handleChat to handle the user's message and send ChatGPT's reply back to the user.
You can see our bot also converts its reply to speech and responds back to the user.
You could also add a settings option here to disable text responses and send only voice back to the user, as sketched below.
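A minimal sketch, assuming a hypothetical textResponse flag stored with the same setSettings helper (this flag is not part of the code above):

// Hypothetical flag: e.g. setSettings(context, 'textResponse', 'false') could turn text replies off.
const isTextResponseEnabled = (context: TelegramContext): boolean => {
  return getSettings(context).textResponse !== false;
};

// Inside handleChat, only send the text message when the flag is enabled;
// the voice reply is always sent.
if (isTextResponseEnabled(context)) {
  await context.sendMessage(content, { parseMode: ParseMode.Markdown });
}
await handleTextToSpeech(context, content, getAzureVoiceName(context));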
Deployment
Run Bottender on your server by the following command:
npm start
# or use yarn
yarn start
Set Up Webhook for Production
Run this command to set up the webhook for the Telegram bot, assuming your URL is https://example.com/webhooks/telegram:
npx bottender telegram webhook set -w https://example.com/webhooks/telegram
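If you want to verify that the webhook was registered, you can query Telegram's getWebhookInfo method; a small optional check using axios:

import axios from 'axios';

// Prints the webhook URL Telegram currently has on record for your bot.
const checkWebhook = async () => {
  const response = await axios.get(
    `https://api.telegram.org/bot${process.env.TELEGRAM_ACCESS_TOKEN}/getWebhookInfo`
  );
  console.log(response.data.result.url);
};

checkWebhook();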
Now you are ready to talk to your bot.
Send a text or voice message and wait for the bot to respond.
Conclusion
I hope this guide helps you build your own talking chatbot and have fun talking with it. Hope you like it!