How to create the smartest multilingual Virtual Assistant using AWS and ChatGPT
Robert Garcia Ventura
Posted on December 12, 2022
Last week ChatGPT was released and everyone has been trying amazing things with it. I also started playing with it and wanted to see how it would integrate with the AI services from AWS. The results are AWSome!
In this post I will explain step by step how I created this project so you can also do it!
Best of all, you don't need to be an AI expert to create this!
I will assume you already know what ChatGPT is and have an AWS account to play with. If you don't know ChatGPT yet, please first check what it is and how to try it yourself.
The project works through the following steps:
1. Record the audio with the question for ChatGPT
2. Upload the audio file to Amazon S3
3. Transcribe and detect the language of the audio saved in S3 using Amazon Transcribe
4. Amazon Transcribe saves the transcript in Amazon S3
5. Send the transcription to ChatGPT
6. Receive the text answer from ChatGPT and remove code chunks
7. Convert the text to audio using the language detected in step 3 using Amazon Polly and download the audio in MP3 format
8. Reproduce the audio file
Before we start, we need to define the general parameters that you will create and later replace in the code that follows. How to create these credentials is explained in the next steps.
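As a rough sketch, the parameters section could look like the block below. The variable names are the ones used by the code in this post; the placeholder strings, the sample rate of 44100 Hz and the small helper `unreplaced_placeholders` are assumptions added for illustration, not the original values.

```python
# General parameters (placeholders: replace them with your own values)
aws_default_region = "<AWS-REGION>"
aws_default_s3_bucket = "<BUCKET-NAME>"
aws_access_key_id = "<ACCESS-KEY-ID>"
aws_secret_access_key = "<SECRET-ACCESS-KEY>"
chatGPT_session_token = "<SESSION-TOKEN>"

# Recording settings
samplerate = 44100  # assumption: the original value is not shown in this post
duration = 4        # seconds of audio to record


def unreplaced_placeholders(params):
    """Return the names of the parameters that still hold a "<...>" placeholder."""
    return [
        name
        for name, value in params.items()
        if isinstance(value, str) and value.startswith("<") and value.endswith(">")
    ]


params = {
    "aws_default_region": aws_default_region,
    "aws_default_s3_bucket": aws_default_s3_bucket,
    "aws_access_key_id": aws_access_key_id,
    "aws_secret_access_key": aws_secret_access_key,
    "chatGPT_session_token": chatGPT_session_token,
}
print("Still to replace:", unreplaced_placeholders(params))
```

Running the check before the rest of the script is a cheap way to catch a forgotten placeholder before any call to AWS is made.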
1. Record the audio
First, we need to record the audio in which we ask the question we want ChatGPT to answer. For that we will use the package sounddevice. Make sure that the correct microphone is selected as the default in your OS configuration.
In this case, the voice is recorded for 4 seconds. If you want to increase or decrease this time, just modify the value of the parameter duration.
The script saves the audio inside a folder called audio in the current working directory. If this folder doesn't exist, the script creates it using the os module.
import os
import uuid

import sounddevice as sd
import soundfile as sf  # assumed: "sf" refers to the soundfile package

def record_audio(duration, filename):
    print("[INFO] Start of the recording")
    # blocking=True makes sd.rec() return only once the recording has finished
    mydata = sd.rec(int(samplerate * duration), samplerate=samplerate,
                    channels=1, blocking=True)
    print("[INFO] End of the recording")
    sd.wait()
    sf.write(filename, mydata, samplerate)
    print(f"[INFO] Recording saved on: {filename}")

# Check if the folder "audio" exists in the current directory; if not, create it
if not os.path.exists("audio"):
    os.makedirs("audio")

# Create a unique file name using UUID
filename = f'audio/{uuid.uuid4()}.wav'
record_audio(duration, filename)
2. Upload the audio file to Amazon S3
In this step, we first need to create an Amazon S3 bucket. To do that, go to the AWS Console and search for the service Amazon S3. Then click on Create bucket.
Enter a name for the bucket (bucket names must be unique across all AWS accounts in all AWS Regions) and select the AWS Region.
The rest of the parameters can be left at their defaults. Finally, click on Create bucket at the bottom of the page.
In the parameters section from the beginning, replace the values of aws_default_s3_bucket and aws_default_region with the bucket name and the selected region.
The next step is to create a new user that we will use to access this S3 bucket using boto3. Boto3 is the Amazon Web Services (AWS) Software Development Kit (SDK) for Python, which allows Python developers to write software that makes use of services like Amazon S3 and Amazon EC2.
To create the new user, search for IAM in the AWS Console. Then click on Users in the left menu under Access management.
Click on Add users in the top-right corner. Provide a user name and tick the checkbox Access key - Programmatic access.
Then click on Next: Permissions. Here click on Attach existing policies directly and then on Create policy.
Note that we could simply select the policy called AmazonS3FullAccess and it would work, but that goes against the principle of least privilege. Instead, we will only grant access to the bucket we created before.
On the Create policy page click on Choose a service and search for S3 and click on it. Then on Actions click the options:
ListBucket
GetObject
DeleteObject
PutObject
On Resources click on Specific. Then, under bucket, click Add ARN, enter the name of the bucket we created before and click on Add. Under object, also click on Add ARN, enter the same bucket name and, for Object name, tick the checkbox Any.
Then click on Next: Tags and Next: Review. Finally, give the new policy a name and click on Create policy.
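For reference, the policy built in the visual editor should be roughly equivalent to the JSON document generated by the sketch below. The helper name `build_bucket_policy` is hypothetical, and the split into two statements is one possible layout: ListBucket applies to the bucket itself, while the object actions apply to every object inside it.

```python
import json


def build_bucket_policy(bucket_name):
    """Build a least-privilege policy document for a single S3 bucket."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the bucket resource
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {
                # Reading, writing and deleting are granted on any object in the bucket
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }


# "chatgpt-transcribe" is the bucket name that appears in the sample output of this post
print(json.dumps(build_bucket_policy("chatgpt-transcribe"), indent=2))
```

The printed JSON can be compared with the JSON tab of the policy editor to confirm that only the intended bucket is covered.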
Once the policy has been created, go back to the user creation page and search for the new policy. If it doesn't appear, click on the refresh button.
Then click on Next: Tags and Next: Review. Finally, check that everything is correct and click on Create user.
On the next page we get the Access key ID and the Secret access key. Make sure to save them (especially the secret access key) and don't share them. In the parameters section from the beginning, replace the values of aws_access_key_id and aws_secret_access_key with them.
With that, we have a user with permission to write to the S3 bucket created before.
import boto3

# Connect to Amazon S3 using Boto3
def get_s3_client():
    return boto3.client(
        's3',
        region_name=aws_default_region,
        aws_access_key_id=aws_access_key_id,
        aws_secret_access_key=aws_secret_access_key,
    )

def upload_file_to_s3(filename):
    s3_client = get_s3_client()
    try:
        with open(filename, "rb") as f:
            # The local path is reused as the object key inside the bucket
            s3_client.upload_fileobj(f, aws_default_s3_bucket, filename)
        print(f"[INFO] File has been uploaded successfully in the S3 bucket: '{aws_default_s3_bucket}'")
    except Exception as err:
        # Chain the original exception so the real cause is not hidden
        raise ValueError(f"[ERROR] Error while uploading the file in the S3 bucket: '{aws_default_s3_bucket}'") from err

upload_file_to_s3(filename)
3-4. Transcribe and detect the language of the audio saved in S3 using Amazon Transcribe
Amazon Transcribe is an AWS Artificial Intelligence (AI) service that makes it easy for you to convert speech to text. It uses Automatic Speech Recognition (ASR) technology and can be used for a variety of business applications, including transcription of voice-based customer service calls, generation of subtitles for audio/video content, and (text-based) content analysis of audio/video content.
To be able to use Amazon Transcribe with the IAM user created in the previous step, we need to grant it access via an IAM policy.
For that we need to go to IAM in the AWS Console, click on Users on the left menu and then click on the user created before. Click on Add permissions and then Attach existing policies directly. Search for AmazonTranscribe and click the checkbox of AmazonTranscribeFullAccess.
Click on Next: Review and Add permissions.
At this point this user should have 2 attached policies: the S3 policy we created before and AmazonTranscribeFullAccess.
After adding this extra permission you don't need to modify or update the access key ID or the secret access key.
In the following Python code we use Amazon Transcribe via the boto3 package to transcribe the recorded voice to text. Amazon Transcribe also detects the language spoken in the audio.
You can read all the details about TranscribeService in the boto3 documentation.
The transcription is saved in a JSON file in Amazon S3. You can either save your transcript in your own Amazon S3 bucket or have Amazon Transcribe use a secure default bucket. In my case, I chose the default option, a bucket managed by the service. With the default option, the transcript is deleted when the job expires (90 days). If we want to keep the transcript past this expiration date, we must download it.
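To make the shape of that JSON file more concrete, here is a minimal sketch that extracts the detected language and the transcript from a trimmed sample. The sample document is hypothetical and heavily shortened; only the keys read by the code in this post (`language_identification`, `code`, `transcripts`, `transcript`) are relied upon, and the helper name `parse_transcript` is an assumption.

```python
import json

# Hypothetical, trimmed example of a transcript file (real files contain more fields)
sample_json = """
{
  "results": {
    "language_identification": [
      {"score": "0.98", "code": "en-US"},
      {"score": "0.01", "code": "en-GB"}
    ],
    "transcripts": [
      {"transcript": "What is Amazon Web Services?"}
    ]
  },
  "status": "COMPLETED"
}
"""


def parse_transcript(data):
    """Return (transcript, language_code) from a parsed transcript document."""
    results = data["results"]
    # The first entry is taken as the detected language, as in the code of this post
    language_code = results["language_identification"][0]["code"]
    transcript = results["transcripts"][0]["transcript"]
    return transcript, language_code


transcript_text, language_code = parse_transcript(json.loads(sample_json))
print(f"{language_code}: {transcript_text}")
```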
import json
import time
import urllib.request

# Generate UUID for the job id
job_id = str(uuid.uuid4())

# Connect to Amazon Transcribe using Boto3
def get_transcribe_client():
    return boto3.client(
        'transcribe',
        region_name=aws_default_region,
        aws_access_key_id=aws_access_key_id,
        aws_secret_access_key=aws_secret_access_key,
    )

def get_text_from_audio(filename):
    transcribe = get_transcribe_client()
    print("[INFO] Starting transcription of the audio to text")
    transcribe.start_transcription_job(
        TranscriptionJobName=job_id,
        Media={'MediaFileUri': f"https://{aws_default_s3_bucket}.s3.{aws_default_region}.amazonaws.com/{filename}"},
        MediaFormat='wav',
        IdentifyLanguage=True,
    )
    print("[INFO] Transcribing text: *", end="")
    # Poll the job every 2 seconds until it reaches a final state
    while True:
        status = transcribe.get_transcription_job(TranscriptionJobName=job_id)
        if status['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
            break
        print("*", end='')
        time.sleep(2)
    print("")  # End of line after loading bar

    if status['TranscriptionJob']['TranscriptionJobStatus'] == 'COMPLETED':
        # Download the transcript JSON and extract the language and the text
        response = urllib.request.urlopen(status['TranscriptionJob']['Transcript']['TranscriptFileUri'])
        data = json.loads(response.read())
        language_detected = data['results']['language_identification'][0]['code']
        transcript = data['results']['transcripts'][0]['transcript']
        print("[INFO] Transcription completed!")
        print(f"[INFO] Transcript language: {language_detected}")
        print(f"[INFO] Transcript text: {transcript}")
        return transcript, language_detected
    else:
        raise ValueError("[ERROR] The process to convert audio to text using Amazon Transcribe has failed.")

transcript, language_detected = get_text_from_audio(filename)
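The polling loop above waits indefinitely if the job never reaches a final state. A possible variation, sketched here with a hypothetical helper `wait_until_done`, adds a timeout. The status source is passed in as a function, so the example runs without AWS; with Amazon Transcribe, that function would wrap the call to get_transcription_job.

```python
import time


def wait_until_done(get_status, done_states=("COMPLETED", "FAILED"),
                    interval=2.0, timeout=300.0):
    """Poll get_status() until it returns a final state or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in done_states:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job still in state '{status}' after {timeout} seconds")
        time.sleep(interval)


# Simulated job that finishes on the third check
fake_states = iter(["IN_PROGRESS", "IN_PROGRESS", "COMPLETED"])
final_state = wait_until_done(lambda: next(fake_states), interval=0)
print(final_state)
```

The timeout of 300 seconds is an arbitrary default for this sketch; a short recording usually finishes much sooner.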
5. Send the transcription to ChatGPT
Once we have received the transcript from Amazon Transcribe, we need to send it to ChatGPT. For that, I am using the revChatGPT package. To use this package we need to authenticate to ChatGPT, which can be done using the username and password or using the session_token. In my case, because I am using the Google OAuth authentication method, I will use the session_token.
To get the session token, log in to ChatGPT and then press F12 or right-click and select Inspect. Open the Application tab and look for Cookies in the left menu. Select the website https://chat.openai.com, find the cookie named __Secure-next-auth.session-token and copy its value.
In the parameters section from the beginning, we need to replace this value with the session token value you have:
chatGPT_session_token="<SESSION-TOKEN>"
In case you want to use the email and password as an authentication method you can check the steps on how to do it here.
Once this is done, we should be able to connect to ChatGPT using Python.
def get_gpt_answer(prompt):
    print("[INFO] Sending transcript to ChatGPT")
    config = {"email": "<API-KEY>", "session_token": chatGPT_session_token}
    chatbot = Chatbot(config, conversation_id=None)
    chatbot.refresh_session()
    response = chatbot.get_chat_response(prompt, output="text")["message"]
    print(f"[INFO] ChatGPT answer: {response}")
    return response

chatgpt_answer = get_gpt_answer(transcript)
6. Receive the text answer from ChatGPT and remove code chunks
The answer from ChatGPT may contain one or more chunks of code, which should not be read aloud. In this case, I apply a regex function to remove the chunks of code.
Here you can also add your own rules on how to filter or clean the answer from ChatGPT.
import re

def clean_audio_text(text):
    # Remove the code chunks from the answer using regex.
    # `{3} matches the three backticks that open and close a code chunk.
    result = re.sub(r"`{3}[^\S\r\n]*[a-z]*\n.*?\n`{3}", '', text, flags=re.DOTALL)
    return result
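To see what the cleaning step does, here is a small self-contained sketch that applies the same pattern to a made-up answer. The function name `strip_code_blocks` and the sample text are assumptions for illustration; the three backticks are written as `{3} in the pattern and built with string multiplication in the sample.

```python
import re

FENCE = "`" * 3  # the three backticks that delimit a code chunk

# Opening fence, optional language name, then everything up to the closing fence
CODE_BLOCK = re.compile(r"`{3}[^\S\r\n]*[a-z]*\n.*?\n`{3}", flags=re.DOTALL)


def strip_code_blocks(text):
    """Remove fenced code chunks so that they are not sent to text-to-speech."""
    return CODE_BLOCK.sub("", text)


# Made-up answer containing one code chunk
answer = (
    "Here is an example:\n"
    f"{FENCE}python\n"
    "print('hi')\n"
    f"{FENCE}\n"
    "That prints a greeting."
)
cleaned = strip_code_blocks(answer)
print(cleaned)
```

Only the fenced chunk is removed; the surrounding sentences are left untouched.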
7. Convert the text to audio using the language detected in step 3 using Amazon Polly and download the audio in MP3 format
Amazon Polly uses deep learning technologies to synthesize natural-sounding human speech, so we can convert text to speech.
After cleaning the answer from ChatGPT we are ready to send it to Amazon Polly.
To be able to use Amazon Polly with the user created before we need to provide access to it using a policy like we did in the previous step with Amazon Transcribe.
For that we need to go to IAM in the AWS Console, click on Users on the left menu and then click on the user created before. Then click on Add permissions and then Attach existing policies directly. Search for AmazonPolly and click the checkbox of AmazonPollyFullAccess.
Click on Next: Review and Add permissions.
At this point this user should have 3 attached policies: the S3 policy we created before, AmazonTranscribeFullAccess and AmazonPollyFullAccess.
Amazon Polly supports multiple languages and voices of different genders. The code I provide predefines 3 languages: English, Spanish and Catalan. Also, note that each language can have different variants depending on the country. For example, for English we have en-US, en-GB, en-IN and others.
The full list of available languages and voices is in the Amazon Polly documentation: https://docs.aws.amazon.com/polly/latest/dg/voicelist.html
After sending the text to Amazon Polly, we receive a stream containing the synthesized speech.
def get_polly_client():
    return boto3.client(
        'polly',
        region_name=aws_default_region,
        endpoint_url=f"https://polly.{aws_default_region}.amazonaws.com",
        aws_access_key_id=aws_access_key_id,
        aws_secret_access_key=aws_secret_access_key,
    )

def generate_audio(polly, text, output_file, voice, output_format='mp3'):
    # Remove code chunks before synthesizing the speech
    text = clean_audio_text(text)
    resp = polly.synthesize_speech(Engine='neural', OutputFormat=output_format, Text=text, VoiceId=voice)
    # Write the returned audio stream to disk
    with open(output_file, 'wb') as soundfile:
        soundfile.write(resp['AudioStream'].read())
    print(f"[INFO] Response audio saved in: {output_file}")

def get_speaker(language_detected):
    # Get speaker based on the language detected by Amazon Transcribe
    # (more info about available voices: https://docs.aws.amazon.com/polly/latest/dg/voicelist.html)
    voice = ""
    if language_detected == "en-US":
        voice = "Joanna"
    elif language_detected == "en-GB":
        voice = "Amy"
    elif language_detected == "en-IN":
        voice = "Kajal"
    elif language_detected == "ca-ES":
        voice = "Arlet"
    elif language_detected == "es-ES":
        voice = "Lucia"
    elif language_detected == "es-MX":
        voice = "Mia"
    elif language_detected == "es-US":
        voice = "Lupe"
    else:
        voice = "Joanna"
        print(f"[WARNING] The language detected {language_detected} is not supported on this code. In this case the default voice is Joanna (en-US).")
    print(f"[INFO] Speaker selected: {voice}")
    return voice

polly = get_polly_client()
voice = get_speaker(language_detected)
output_file = f"audio/{job_id}.mp3"
generate_audio(polly, chatgpt_answer, output_file, voice=voice)
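If you plan to add more languages, the chain of conditions can also be written as a lookup table. The sketch below is an alternative layout, not the original code: the names `VOICES`, `DEFAULT_VOICE` and `pick_voice` are hypothetical, while the language codes and voices are the ones used above.

```python
# Language code detected by Amazon Transcribe -> Amazon Polly voice
VOICES = {
    "en-US": "Joanna",
    "en-GB": "Amy",
    "en-IN": "Kajal",
    "ca-ES": "Arlet",
    "es-ES": "Lucia",
    "es-MX": "Mia",
    "es-US": "Lupe",
}
DEFAULT_VOICE = "Joanna"  # en-US, used when the language is not in the table


def pick_voice(language_code):
    """Return the voice for a language code, falling back to the default."""
    if language_code not in VOICES:
        print(f"[WARNING] Language {language_code} is not in the table, using {DEFAULT_VOICE}")
    return VOICES.get(language_code, DEFAULT_VOICE)


print(pick_voice("ca-ES"))
print(pick_voice("fr-FR"))
```

Supporting a new language then means adding one entry to the table, provided the chosen voice is available for the neural engine.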
8. Reproduce the audio file
Finally, we just need to play the audio result from Amazon Polly.
Depending on your OS and on where you run the code, this may not work. In my case, running the function speak_script(output_file) from the Terminal on macOS works. If you are using a notebook such as Jupyter Notebook, use the function speak_notebook(output_file) instead.
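The playback functions themselves are not shown in this post, so the block below is only a sketch of one way to play the file from a script. The names `build_play_command` and `play_audio_file` are hypothetical. It assumes the afplay command that ships with macOS and, on Linux, that the mpg123 player is installed.

```python
import platform
import subprocess


def build_play_command(path, system=None):
    """Return the command used to play an audio file on this OS, or None."""
    system = system or platform.system()
    if system == "Darwin":
        return ["afplay", path]   # command-line player included in macOS
    if system == "Linux":
        return ["mpg123", path]   # assumption: mpg123 is installed
    return None


def play_audio_file(path):
    """Play the file with an external player; return False if none is known."""
    command = build_play_command(path)
    if command is None:
        print(f"[WARNING] No known audio player for this OS, open {path} manually")
        return False
    print("[INFO] Start reproducing response audio")
    subprocess.run(command, check=True)
    return True
```

Inside a notebook, an option would be IPython.display.Audio(output_file, autoplay=True), which embeds a player in the cell output instead of calling an external program.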
If we followed all the previous steps, we should be ready to start playing with our new multilingual virtual assistant. To show you what the output looks like, I recorded myself asking "What is Amazon Web Services?". In the output below you can see that this is exactly the transcript generated by Amazon Transcribe, followed by the answer provided by ChatGPT.
$ python3 ChatGPT-AWS.py
[INFO] Start of the recording
[INFO] End of the recording
[INFO] Recording saved on: audio/6032133a-ec26-4fa0-8d0b-ad705293be09.wav
[INFO] File has been uploaded successfully in the S3 bucket: 'chatgpt-transcribe'
[INFO] Starting transcription of the audio to text
[INFO] Transcribing text: *********
[INFO] Transcription completed!
[INFO] Transcript language: en-US
[INFO] Transcript text: What is Amazon Web Services?
[INFO] Sending transcript to ChatGPT
[INFO] ChatGPT answer: Amazon Web Services (AWS) is a cloud computing platform that provides a wide range of services, including computing, storage, and content delivery. AWS offers these services on a pay-as-you-go basis, allowing businesses and individuals to access the resources they need without having to invest in expensive infrastructure. AWS is widely used by organizations of all sizes, from small startups to large enterprises.
[INFO] Speaker selected: Joanna
[INFO] Response audio saved in: audio/168a94de-1ba2-4f65-8a4c-d3c9c832246d.mp3
[INFO] Start reproducing response audio
I hope you enjoy it as much as I did when I was building and playing with these services. I think these state-of-the-art technologies have a lot of opportunities/potential and when we use all of them together the results are AWSome!
If you have any questions, suggestions or comments, please feel free to leave them in the comments or contact me directly! :)