How to create the smartest multilingual Virtual Assistant using AWS and ChatGPT

rogave

Robert Garcia Ventura

Posted on December 12, 2022


Last week ChatGPT was released, and everyone has been trying amazing things with it. I also started playing with it and wanted to see how it would integrate with the AI services from AWS, and the results are AWSome!

In this post I will explain step by step how I created this project so you can also do it!

Best of all, you don't need to be an AI expert to create this!

I will assume you already know what ChatGPT is and have an AWS account to play with. If you don't know what ChatGPT is, please check here what it is and how to try it yourself.

The full code for this project can be found in the GitHub repository robertgv/chatgpt-aws.

Steps of the project


I have divided this project into 8 steps:

  1. Record audio and save it in WAV format
  2. Upload the audio file to Amazon S3
  3. Transcribe and detect the language of the audio saved in S3 using Amazon Transcribe
  4. Amazon Transcribe saves the transcript in Amazon S3
  5. Send the transcription to ChatGPT
  6. Receive the text answer from ChatGPT and remove code chunks
  7. Convert the text to speech with Amazon Polly, using the language detected in step 3, and download the audio in MP3 format
  8. Play the audio file

Before we start, we need to define the general parameters used in the following code. You will create these credentials in the next steps and then replace the placeholder values below.

```python
# ChatGPT params
chatGPT_session_token = "<SESSION-TOKEN>"

# AWS params
aws_access_key_id = "<ACCESS-KEY-ID>"
aws_secret_access_key = "<SECRET-ACCESS-KEY>"
aws_default_region = "<AWS-REGION>"
aws_default_s3_bucket = "<S3-BUCKET>"

# Voice recording params
samplerate = 48000  # samples per second
duration = 4  # seconds
```
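Hard-coding secrets in the script makes them easy to leak, for example by committing the file to Git. A minimal sketch, assuming you export the values as environment variables instead: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_DEFAULT_REGION are the standard names boto3 also reads on its own, while CHATGPT_SESSION_TOKEN and CHATGPT_AWS_S3_BUCKET are hypothetical names chosen for this example.

```python
import os

# Maps each script parameter to the environment variable it is read from.
# CHATGPT_SESSION_TOKEN and CHATGPT_AWS_S3_BUCKET are hypothetical names.
REQUIRED_VARS = {
    "chatGPT_session_token": "CHATGPT_SESSION_TOKEN",
    "aws_access_key_id": "AWS_ACCESS_KEY_ID",
    "aws_secret_access_key": "AWS_SECRET_ACCESS_KEY",
    "aws_default_region": "AWS_DEFAULT_REGION",
    "aws_default_s3_bucket": "CHATGPT_AWS_S3_BUCKET",
}

def load_params(environ=os.environ):
    """Read the general parameters from environment variables.

    Raises KeyError listing every missing (or empty) variable, so a
    misconfigured shell fails fast instead of sending placeholders to AWS.
    """
    missing = [env for env in REQUIRED_VARS.values() if not environ.get(env)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {param: environ[env] for param, env in REQUIRED_VARS.items()}
```

With this in place, the parameters section could become `params = load_params()` followed by, for example, `aws_default_region = params["aws_default_region"]`.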

1. Record audio and save it in WAV format

First, we need to record the audio in which we ask the question we want ChatGPT to answer. For that, we use the sounddevice package to record and the soundfile package to save the recording as a WAV file. Make sure that the correct microphone is selected as the default input device in your OS settings.
In this case, the recording lasts 4 seconds. If you want to increase or decrease this time, just modify the value of the duration parameter.
The script saves the audio inside a folder called audio in the current working directory. If this folder doesn't exist, the script creates it using the os module.

```python
import os
import uuid

import sounddevice as sd
import soundfile as sf

def record_audio(duration, filename):
    print("[INFO] Start of the recording")
    # Record mono audio from the default input device
    mydata = sd.rec(int(samplerate * duration), samplerate=samplerate, channels=1)
    sd.wait()  # Block until the recording has finished
    print("[INFO] End of the recording")
    sf.write(filename, mydata, samplerate)
    print(f"[INFO] Recording saved on: {filename}")

# Create the "audio" folder in the current directory if it doesn't exist
os.makedirs("audio", exist_ok=True)

# Create a unique file name using UUID
filename = f'audio/{uuid.uuid4()}.wav'

record_audio(duration, filename)
```
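The number of frames requested from `sd.rec` is `int(samplerate * duration)`, so 4 seconds at 48000 Hz gives 192000 frames. As a quick check of that arithmetic without a microphone or sounddevice, here is a stdlib-only sketch that writes a silent mono WAV with the same parameters using the `wave` module and reads its duration back. The helper names are hypothetical, and the 16-bit sample width is an assumption for the example.

```python
import wave

def write_silent_wav(path, samplerate=48000, duration=4):
    """Write a mono, 16-bit PCM WAV file containing silence."""
    n_frames = int(samplerate * duration)  # same formula as record_audio
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)          # mono, like channels=1 in sd.rec
        wav.setsampwidth(2)          # 2 bytes per sample = 16-bit PCM
        wav.setframerate(samplerate)
        wav.writeframes(b"\x00\x00" * n_frames)
    return n_frames

def wav_duration_seconds(path):
    """Return the length of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()
```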

2. Upload the audio file to Amazon S3

In this step, we first need to create an Amazon S3 bucket. In the AWS Console, search for the Amazon S3 service and click Create bucket.

Enter a name for the bucket (bucket names must be unique across all AWS accounts in all AWS Regions) and select the AWS Region.


The remaining settings can be left as default. Finally, click Create bucket at the bottom of the page.

In the parameters section at the beginning, replace these values with the bucket name and the selected region:
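The bucket name can be checked against the main S3 naming rules before you submit the form. A partial sketch covering the general purpose bucket rules I am confident about: 3 to 63 characters; only lowercase letters, numbers, dots and hyphens; starts and ends with a letter or number; no two adjacent dots; not formatted like an IP address. It does not check global uniqueness (only S3 can tell you that) or the reserved prefixes and suffixes listed in the S3 documentation, and the function name is hypothetical.

```python
import re

# Lowercase letters, digits, dots and hyphens; first and last char alphanumeric
_BUCKET_RE = re.compile(r"[a-z0-9][a-z0-9.-]*[a-z0-9]")
# Four dot-separated groups of digits, e.g. 192.168.5.4
_IP_RE = re.compile(r"\d{1,3}(\.\d{1,3}){3}")

def is_valid_bucket_name(name):
    """Check the main S3 bucket naming rules (not global uniqueness)."""
    if not 3 <= len(name) <= 63:
        return False
    if not _BUCKET_RE.fullmatch(name):
        return False
    if ".." in name:            # adjacent periods are not allowed
        return False
    if _IP_RE.fullmatch(name):  # names formatted as IP addresses are rejected
        return False
    return True
```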

```python
aws_default_region = "<AWS-REGION>"
aws_default_s3_bucket = "<S3-BUCKET>"
```

The next step is to create a new user that we will use to access this S3 bucket with boto3. Boto3 is the Amazon Web Services (AWS) Software Development Kit (SDK) for Python, which allows Python developers to write software that makes use of services like Amazon S3 and Amazon EC2.

To create the new user, search for IAM in the AWS Console, then click Users in the left menu under Access management.

Click Add users in the top-right corner. Provide a user name and tick the Access key - Programmatic access checkbox.

Then click on Next: Permissions. Here click on Attach existing policies directly and then on Create policy.


Here I would like to mention that we could just select the policy called AmazonS3FullAccess and it would work, but that goes against the principle of least privilege. In this case, we will only provide access to the bucket we created before.

On the Create policy page, click Choose a service, search for S3 and select it. Then, under Actions, select:

  • ListBucket
  • GetObject
  • DeleteObject
  • PutObject

Under Resources, select Specific. Next to bucket, click Add ARN, enter the name of the bucket created before and click Add. Next to object, also click Add ARN, enter the same bucket name and tick the Any checkbox for Object name.


Then click Next: Tags and Next: Review. Finally, give the new policy a name and click Create policy.

Once the policy has been created, go back to the user creation page and search for the new policy. If it doesn't appear, click the refresh button.
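For reference, the visual editor choices above correspond roughly to a policy document like the one this sketch builds: `s3:ListBucket` applies to the bucket ARN, while `GetObject`, `PutObject` and `DeleteObject` apply to the objects under it (the bucket ARN followed by `/*`). The helper name is hypothetical, and the JSON the console generates may differ in `Sid` values and ordering; the printed output could also be pasted into the JSON tab of the Create policy page instead of using the visual editor.

```python
import json

def build_bucket_policy(bucket):
    """Return a least-privilege IAM policy document for a single S3 bucket."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Bucket-level action: the resource is the bucket itself
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {   # Object-level actions: the resource is every object in the bucket
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }

print(json.dumps(build_bucket_policy("chatgpt-transcribe"), indent=2))
```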


Then click Next: Tags and Next: Review. Finally, check that everything is correct and click Create user.


On the next page, we get the Access key ID and the Secret access key. Make sure to save them (especially the secret access key, which cannot be viewed again after you leave this page) and don't share them. In the parameters section at the beginning, replace these values:

```python
aws_access_key_id = "<ACCESS-KEY-ID>"
aws_secret_access_key = "<SECRET-ACCESS-KEY>"
```

With that, we have a user with permission to list, read, write and delete objects in the S3 bucket created before.

```python
import boto3

# Connect to Amazon S3 using Boto3
def get_s3_client():
    return boto3.client('s3', region_name=aws_default_region,
                        aws_access_key_id=aws_access_key_id,
                        aws_secret_access_key=aws_secret_access_key)

def upload_file_to_s3(filename):
    s3_client = get_s3_client()
    try:
        # The local path (e.g. audio/<uuid>.wav) is also used as the S3 object key
        with open(filename, "rb") as f:
            s3_client.upload_fileobj(f, aws_default_s3_bucket, filename)
        print(f"[INFO] File has been uploaded successfully in the S3 bucket: '{aws_default_s3_bucket}'")
    except Exception as e:
        # Chain the original exception so the real cause is not hidden
        raise ValueError(f"[ERROR] Error while uploading the file in the S3 bucket: '{aws_default_s3_bucket}'") from e

upload_file_to_s3(filename)
```
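Because the local path doubles as the object key, the recording ends up at `audio/<uuid>.wav` inside the bucket, and step 3 points Amazon Transcribe at that same key. A small sketch of the two URI forms for such an object: the virtual-hosted HTTPS URL that the transcription code builds, and the `s3://bucket/key` form, which Amazon Transcribe's `MediaFileUri` also accepts. The helper name is hypothetical.

```python
def s3_object_uris(bucket, region, key):
    """Return (https_url, s3_uri) for an object stored in Amazon S3."""
    # Virtual-hosted-style URL, the form used in the transcription step
    https_url = f"https://{bucket}.s3.{region}.amazonaws.com/{key}"
    # S3 URI form, also accepted by Amazon Transcribe as MediaFileUri
    s3_uri = f"s3://{bucket}/{key}"
    return https_url, s3_uri
```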

3-4. Transcribe and detect the language of the audio saved in S3 using Amazon Transcribe

Amazon Transcribe is an AWS Artificial Intelligence (AI) service that makes it easy to convert speech to text. Using Automatic Speech Recognition (ASR) technology, you can use Amazon Transcribe for a variety of business applications, including transcribing voice-based customer service calls, generating subtitles for audio/video content and running (text-based) content analysis on audio/video content.

To use Amazon Transcribe with the IAM user created in the previous step, we need to grant it access through an IAM policy.

For that, go to IAM in the AWS Console, click Users in the left menu and select the user created before. Click Add permissions and then Attach existing policies directly. Search for AmazonTranscribe and tick the checkbox of AmazonTranscribeFullAccess.


Click on Next: Review and Add permissions.

At this point, this user should have 2 attached policies: the S3 policy created before and AmazonTranscribeFullAccess.

After adding this extra permission, you don't need to change the access key ID or the secret access key.

In the following Python code, we use Amazon Transcribe through the boto3 package to transcribe the recorded voice to text. Amazon Transcribe also detects the language spoken in the audio.

You can read the full documentation for TranscribeService in the boto3 documentation.

The transcription is saved in a JSON file in Amazon S3. You can either save the transcript in your own Amazon S3 bucket or let Amazon Transcribe use a secure default bucket. In my case, I chose the default option, a bucket managed by Amazon Transcribe. With the default option, the transcript is deleted when the job expires (90 days), so if we want to keep it past that date, we must download it. In this case, the job returns the transcript location as a temporary URL (TranscriptFileUri), which is why the code below can download it with urllib without passing AWS credentials.

```python
import json
import time
import urllib.request
import uuid

import boto3

# Generate UUID for the job id (also reused later for the MP3 file name)
job_id = str(uuid.uuid4())

# Connect to Amazon Transcribe using Boto3
def get_transcribe_client():
    return boto3.client('transcribe', region_name=aws_default_region,
                        aws_access_key_id=aws_access_key_id,
                        aws_secret_access_key=aws_secret_access_key)

def get_text_from_audio(filename):
    transcribe = get_transcribe_client()
    print("[INFO] Starting transcription of the audio to text")
    # IdentifyLanguage=True lets Amazon Transcribe detect the spoken language
    transcribe.start_transcription_job(
        TranscriptionJobName=job_id,
        Media={'MediaFileUri': f"https://{aws_default_s3_bucket}.s3.{aws_default_region}.amazonaws.com/{filename}"},
        MediaFormat='wav',
        IdentifyLanguage=True)
    print("[INFO] Transcribing text: *", end="", flush=True)
    # Poll the job every 2 seconds until it finishes
    while True:
        status = transcribe.get_transcription_job(TranscriptionJobName=job_id)
        if status['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
            break
        print("*", end='', flush=True)
        time.sleep(2)
    print("")  # End of line after loading bar
    if status['TranscriptionJob']['TranscriptionJobStatus'] == 'COMPLETED':
        # Download the transcript JSON from the URL returned by the job
        response = urllib.request.urlopen(status['TranscriptionJob']['Transcript']['TranscriptFileUri'])
        data = json.loads(response.read())
        language_detected = data['results']['language_identification'][0]['code']
        transcript = data['results']['transcripts'][0]['transcript']
        print("[INFO] Transcription completed!")
        print(f"[INFO] Transcript language: {language_detected}")
        print(f"[INFO] Transcript text: {transcript}")
        return transcript, language_detected
    else:
        raise ValueError("[ERROR] The process to convert audio to text using Amazon Transcribe has failed.")

transcript, language_detected = get_text_from_audio(filename)
```
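The polling loop in the transcription code waits until the job reaches COMPLETED or FAILED, but it never gives up if the job stalls. A hedged sketch of a reusable polling helper with a timeout: `get_status` is any zero-argument function returning the current status string, and the helper's name and defaults are assumptions, not part of boto3. The `sleep` and `clock` parameters only exist so the helper can be tested without actually waiting.

```python
import time

def wait_for_status(get_status, done=("COMPLETED", "FAILED"), interval=2.0,
                    timeout=300.0, sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns a value in `done`.

    Returns the final status, or raises TimeoutError once `timeout`
    seconds have passed without reaching a final status.
    """
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status in done:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"Job still '{status}' after {timeout} seconds")
        print("*", end="", flush=True)  # same progress bar as the original loop
        sleep(interval)

# Possible use with the Transcribe client from this step:
# final = wait_for_status(lambda: transcribe.get_transcription_job(
#     TranscriptionJobName=job_id)["TranscriptionJob"]["TranscriptionJobStatus"])
```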

5. Send the transcription to ChatGPT

Once we receive the transcript from Amazon Transcribe, we need to send it to ChatGPT. For that, I am using the revChatGPT package. To use this package, we need to authenticate to ChatGPT, which can be done with an email and password or with the session_token. In my case, because I log in to ChatGPT with Google OAuth, I will use the session_token.

To get the session token, log in to ChatGPT and press F12 (or right-click and select Inspect). Then open the Application tab and select Cookies in the left menu. Select the website https://chat.openai.com, search for the cookie named __Secure-next-auth.session-token and copy its value.

In the parameters section at the beginning, replace this value with your session token:

```python
chatGPT_session_token = "<SESSION-TOKEN>"
```

If you want to use email and password authentication instead, you can check the steps on how to do it here.

Once this is done, we should be able to connect to ChatGPT using Python.

```python
# Import path of the revChatGPT releases from December 2022;
# later versions of the package changed their API
from revChatGPT.revChatGPT import Chatbot

def get_gpt_answer(prompt):
    print("[INFO] Sending transcript to ChatGPT")
    # Authenticate with the session token copied from the browser cookie
    config = {"session_token": chatGPT_session_token}
    chatbot = Chatbot(config, conversation_id=None)
    chatbot.refresh_session()
    response = chatbot.get_chat_response(prompt, output="text")["message"]
    print(f"[INFO] ChatGPT answer: {response}")
    return response

chatgpt_answer = get_gpt_answer(transcript)
```

6. Receive the text answer from ChatGPT and remove code chunks

The answer from ChatGPT may contain one or more code blocks, which don't make sense when read aloud. To handle this, I apply a regular expression that removes them.

You can also add your own rules here to filter or clean the answer from ChatGPT.

```python
import re

def clean_audio_text(text):
    # Remove fenced code blocks (three backticks, optional language name,
    # the code, closing three backticks) so Amazon Polly does not read them
    result = re.sub(r"`{3}[^\S\r\n]*[a-z]*\n.*?\n`{3}", '', text, flags=re.DOTALL)
    return result
```
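One way to add your own rules is to chain several substitutions. A minimal sketch, assuming you also want to keep the words inside inline code spans and bold markers while dropping the Markdown symbols, and to collapse leftover blank lines before the text reaches Amazon Polly. The function name and the extra rules are illustrative choices, not part of the original project.

```python
import re

def clean_answer_text(text):
    """Strip Markdown artifacts that sound odd when read aloud."""
    # 1. Remove fenced code blocks (same pattern as clean_audio_text)
    text = re.sub(r"`{3}[^\S\r\n]*[a-z]*\n.*?\n`{3}", "", text, flags=re.DOTALL)
    # 2. Keep the content of inline code spans, drop the backticks
    text = re.sub(r"`([^`\n]*)`", r"\1", text)
    # 3. Keep the words inside **bold** markers, drop the asterisks
    text = re.sub(r"\*\*(.+?)\*\*", r"\1", text)
    # 4. Collapse runs of blank lines into one line break and trim the ends
    text = re.sub(r"\n\s*\n+", "\n", text)
    return text.strip()
```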

7. Convert the text to speech with Amazon Polly, using the language detected in step 3, and download the audio in MP3 format

Amazon Polly uses deep learning technologies to synthesize natural-sounding human speech, so we can convert text to speech.

After cleaning the answer from ChatGPT we are ready to send it to Amazon Polly.
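One thing to keep in mind before sending the text: the SynthesizeSpeech API caps the input size per request (the Amazon Polly documentation lists 3,000 billed characters, 6,000 characters in total), so a very long ChatGPT answer could be rejected. A hedged sketch that splits text at sentence endings into chunks under a configurable limit; `split_for_polly` is a hypothetical helper, you should check the limit against the current Polly quotas, and each chunk would go into its own `synthesize_speech` call.

```python
import re

def split_for_polly(text, max_chars=3000):
    """Split text into chunks of at most max_chars, preferring sentence breaks."""
    # Split after ., ! or ? followed by whitespace, keeping the punctuation
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-split
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```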

To use Amazon Polly with the user created before, we need to grant it access with a policy, as we did in the previous step for Amazon Transcribe.

For that, go to IAM in the AWS Console, click Users in the left menu and select the user created before. Then click Add permissions and then Attach existing policies directly. Search for AmazonPolly and tick the checkbox of AmazonPollyFullAccess.


Click on Next: Review and Add permissions.

At this point, this user should have 3 attached policies: the S3 policy created before, AmazonTranscribeFullAccess and AmazonPollyFullAccess.

Amazon Polly supports multiple languages and voices of different genders. The code I provide has voices predefined for 3 languages: English, Spanish and Catalan. Also note that each language can have different variants depending on the country; for example, English has en-US, en-GB, en-IN and others.

The full list of available languages and variants is available here.

After sending the text to Amazon Polly, we receive a stream containing the synthesized speech.

```python
import boto3

def get_polly_client():
    return boto3.client('polly', region_name=aws_default_region,
                        endpoint_url=f"https://polly.{aws_default_region}.amazonaws.com",
                        aws_access_key_id=aws_access_key_id,
                        aws_secret_access_key=aws_secret_access_key)

def generate_audio(polly, text, output_file, voice, output_format='mp3'):
    text = clean_audio_text(text)
    # Engine='neural' needs a voice (and AWS Region) that supports neural voices
    resp = polly.synthesize_speech(Engine='neural', OutputFormat=output_format, Text=text, VoiceId=voice)
    # AudioStream is a streaming body: read it completely and write it to disk
    with open(output_file, 'wb') as soundfile:
        soundfile.write(resp['AudioStream'].read())
    print(f"[INFO] Response audio saved in: {output_file}")

def get_speaker(language_detected):
    # Get speaker based on the language detected by Amazon Transcribe (more info about available voices: https://docs.aws.amazon.com/polly/latest/dg/voicelist.html)
    if language_detected == "en-US":
        voice = "Joanna"
    elif language_detected == "en-GB":
        voice = "Amy"
    elif language_detected == "en-IN":
        voice = "Kajal"
    elif language_detected == "ca-ES":
        voice = "Arlet"
    elif language_detected == "es-ES":
        voice = "Lucia"
    elif language_detected == "es-MX":
        voice = "Mia"
    elif language_detected == "es-US":
        voice = "Lupe"
    else:
        voice = "Joanna"
        print(f"[WARNING] The language detected {language_detected} is not supported on this code. In this case the default voice is Joanna (en-US).")
    print(f"[INFO] Speaker selected: {voice}")
    return voice

polly = get_polly_client()
voice = get_speaker(language_detected)
output_file = f"audio/{job_id}.mp3"
generate_audio(polly, chatgpt_answer, output_file, voice=voice)
```
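The if/elif chain in `get_speaker` can also be written as a lookup table, which makes supporting a new language variant a one-line change. A sketch using the same language-to-voice pairs and the same Joanna fallback; the names `VOICES`, `DEFAULT_VOICE` and `pick_voice` are hypothetical.

```python
# Same language-to-voice pairs as get_speaker, as a lookup table
VOICES = {
    "en-US": "Joanna",
    "en-GB": "Amy",
    "en-IN": "Kajal",
    "ca-ES": "Arlet",
    "es-ES": "Lucia",
    "es-MX": "Mia",
    "es-US": "Lupe",
}
DEFAULT_VOICE = "Joanna"

def pick_voice(language_detected):
    """Return the Polly voice for a language code, falling back to Joanna."""
    voice = VOICES.get(language_detected)
    if voice is None:
        print(f"[WARNING] The language detected {language_detected} is not supported "
              f"on this code. In this case the default voice is {DEFAULT_VOICE} (en-US).")
        return DEFAULT_VOICE
    return voice
```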

8. Play the audio file

Finally, we just need to play the audio result from Amazon Polly.

Depending on your OS and where you run the code, playback may not work out of the box. The function speak_script(output_file) uses the macOS afplay command, so it works when I run the script from the Terminal on macOS. If you are using a notebook such as Jupyter Notebook, use the function speak_notebook(output_file) instead.

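```python
import subprocess

def speak_notebook(output_file):
    # Imported here so the script also runs where IPython is not installed
    from IPython.display import Audio, display
    print("[INFO] Start reproducing response audio")
    # Render an audio player in the notebook output and start playing
    display(Audio(output_file, autoplay=True))

def speak_script(output_file):
    print("[INFO] Start reproducing response audio")
    # afplay is the command-line audio player bundled with macOS
    return subprocess.call(["afplay", output_file])

speak_script(output_file)
```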
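Because afplay only exists on macOS, speak_script fails on other systems. A sketch that picks a command-line player per operating system; it assumes ffplay (part of FFmpeg) is installed on Linux, and on Windows it hands the file to the default associated application through `os.startfile`, which does not wait for playback to finish. These player choices are assumptions, not part of the original project.

```python
import os
import platform
import subprocess

def player_command(path, system=None):
    """Return the command list used to play an audio file, or None."""
    system = system or platform.system()
    if system == "Darwin":
        return ["afplay", path]  # bundled with macOS
    if system == "Linux":
        # Assumes FFmpeg's ffplay is installed; -nodisp hides the video window
        return ["ffplay", "-nodisp", "-autoexit", "-loglevel", "quiet", path]
    return None  # e.g. Windows: no bundled command-line player assumed

def play(path):
    """Play an audio file with the player chosen for this OS."""
    cmd = player_command(path)
    if cmd is not None:
        subprocess.call(cmd)
    elif platform.system() == "Windows":
        os.startfile(path)  # opens the default app; returns immediately
    else:
        raise RuntimeError(f"No audio player configured for {platform.system()}")
```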

Example output

If you followed all the previous steps, you should be ready to start playing with your new multilingual virtual assistant. To show what the output looks like, I recorded myself asking "What is Amazon Web Services?" and you can see that this is exactly the transcript generated by Amazon Transcribe, followed by the answer provided by ChatGPT.

```text
$ python3 ChatGPT-AWS.py
[INFO] Start of the recording
[INFO] End of the recording
[INFO] Recording saved on: audio/6032133a-ec26-4fa0-8d0b-ad705293be09.wav
[INFO] File has been uploaded successfully in the S3 bucket: 'chatgpt-transcribe'
[INFO] Starting transcription of the audio to text
[INFO] Transcribing text: *********
[INFO] Transcription completed!
[INFO] Transcript language: en-US
[INFO] Transcript text: What is Amazon Web Services?
[INFO] Sending transcript to ChatGPT
[INFO] ChatGPT answer: Amazon Web Services (AWS) is a cloud computing platform that provides a wide range of services, including computing, storage, and content delivery. AWS offers these services on a pay-as-you-go basis, allowing businesses and individuals to access the resources they need without having to invest in expensive infrastructure. AWS is widely used by organizations of all sizes, from small startups to large enterprises.
[INFO] Speaker selected: Joanna
[INFO] Response audio saved in: audio/168a94de-1ba2-4f65-8a4c-d3c9c832246d.mp3
[INFO] Start reproducing response audio
```

I hope you enjoy it as much as I did when I was building and playing with these services. I think these state-of-the-art technologies have a lot of potential, and when we use all of them together the results are AWSome!

If you have any questions, suggestions or comments, please feel free to add them in the comments or contact me directly! :)
