Google AI Vision & Text to Speech on a Raspberry Pi
Matt Grofsky
Posted on November 11, 2020
As the CTO of Ytel, Inc., I work a lot with communications technology and machine learning. MMS is the de facto standard for sending photos back and forth on mobile devices, outside of downloaded OTT applications.
RCS is now starting to appear on mobile devices, and media sharing is expected to accelerate. I thought it would be interesting to see how hard it would be to build an IoT-style device, outside Google Cloud Platform proper, that can interact with some of Google’s prebuilt AI models and interpret this media.
In this post I provide the tools and code to build a demo that takes a photo of a scene, analyzes it, and then speaks the results back to you.
To fully build out the proof of concept, you will need:
A Raspberry Pi
A Raspberry Pi Camera
A Google Cloud Platform account
The first step is to make sure you have Python 3.7.x or higher installed on the Pi, then add the project’s dependencies to a requirements.txt.
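For reference, a minimal requirements.txt might look like the following (these are the official PyPI package names for the Google client libraries; picamera usually ships preinstalled on Raspberry Pi OS but is listed for completeness):

picamera
google-cloud-vision
google-cloud-texttospeech

Install everything with pip3 install -r requirements.txt.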
In your main.py, declare your imports, provide your GCP credentials, and instantiate your Google SDK clients. The credentials reference a service-account JSON key file, and that service account needs permissions for the Cloud Vision and Cloud Text-to-Speech APIs.
import picamera
import time
import os
from google.cloud import vision
from google.cloud import texttospeech

# Needs permission for Cloud Vision API and Cloud Text-to-Speech API
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "YourServiceAccount.json"

client_vision = vision.ImageAnnotatorClient()
client_tts = texttospeech.TextToSpeechClient()
To analyze a photo, you first have to take a picture. The beautiful thing about a Raspberry Pi camera is that this is a simple task.
Once your camera is plugged in and enabled (via sudo raspi-config or the Raspberry Pi Configuration tool), use the PiCamera library to take a photo. Below is a simple function for taking a picture with PiCamera.
def takephoto():
    camera = picamera.PiCamera()
    camera.resolution = (1024, 768)
    # Show me a quick preview before snapping the photo (if you have a monitor)
    camera.start_preview()
    time.sleep(1)
    # Take the photo
    camera.capture('image.jpg')
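One caveat with the function above: PiCamera keeps the camera hardware open, so calling takephoto() a second time in the same session will typically fail. A variant using the library’s context-manager support (a sketch, functionally equivalent otherwise) releases the camera cleanly:

def takephoto():
    # The with block closes the camera automatically when the capture is done
    with picamera.PiCamera() as camera:
        camera.resolution = (1024, 768)
        camera.start_preview()
        time.sleep(1)
        camera.capture('image.jpg')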
The main function executes takephoto(), which starts the process by writing an image.jpg file to the local drive. The file is read into memory, wrapped by the Cloud Vision SDK, and analyzed by Google’s Cloud Vision AI service.
In this instance, I chose the label_detection feature to help identify objects in the photo. The service also has separate functions to detect faces, famous logos, and more. For detailed info on what it can do, visit the official Google Cloud Vision AI docs page.
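As a sketch of what those other features look like, here is a hypothetical snippet that runs face and logo detection against the same photo (it assumes the client_vision client created earlier; response handling is kept minimal):

with open('image.jpg', 'rb') as image_file:
    image = vision.Image(content=image_file.read())

# Faces come back with likelihood scores for emotions such as joy
face_response = client_vision.face_detection(image=image)
for face in face_response.face_annotations:
    print('Face detected, joy likelihood:', face.joy_likelihood)

# Logos come back with the brand name the model recognized
logo_response = client_vision.logo_detection(image=image)
for logo in logo_response.logo_annotations:
    print('Logo:', logo.description)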
The Text-to-Speech step supports SSML and Google’s premium WaveNet voices. I stick to plain text in the example below, but if you would like to see documentation highlighting some of the deeper SSML capabilities, you can do so here.
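If you do want SSML, you pass markup instead of plain text when building the synthesis input. A minimal sketch (the markup itself is illustrative):

# SSML gives you control over pauses, emphasis, and pronunciation
ssml = ('<speak>I can see a dog, <break time="500ms"/> '
        'a frisbee, <break time="500ms"/> and some grass.</speak>')
synthesis_in = texttospeech.SynthesisInput(ssml=ssml)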
As for the voices, I highly recommend Google’s WaveNet voices for any TTS application that demands near-human-quality synthesis.
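To see which WaveNet voices your account can use, the Text-to-Speech client can enumerate them; a quick sketch using the client_tts client from earlier:

# List the English WaveNet voices available to this account
voices = client_tts.list_voices(language_code="en-US")
for v in voices.voices:
    if "Wavenet" in v.name:
        print(v.name, v.ssml_gender)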
The synthesized speech is returned in the response and stored as an MP3 file on the local drive. Once saved, mpg123 plays the MP3 over any speaker hooked up to the Raspberry Pi. If you have not done so already, install mpg123 with sudo apt install mpg123.
def main():
    takephoto()

    # Read the photo into memory for the Vision client
    with open('image.jpg', 'rb') as image_file:
        content = image_file.read()
    image = vision.Image(content=content)

    # Ask Cloud Vision for labels describing the scene
    response = client_vision.label_detection(image=image)
    labels = response.label_annotations
    print('Labels:')

    # Make a simple comma-delimited sentence out of the labels.
    synthesis_input = ''
    for label in labels:
        print(label.description)
        synthesis_input = label.description + ', ' + synthesis_input

    synthesis_in = texttospeech.SynthesisInput(text=synthesis_input)

    # Let's make this a premium WaveNet voice
    voice = texttospeech.VoiceSelectionParams(
        language_code="en-US",
        name="en-US-Wavenet-A",
        ssml_gender=texttospeech.SsmlVoiceGender.MALE)

    # Select the type of audio file you want returned
    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3)

    # Perform the text-to-speech request on the text input with the selected
    # voice parameters and audio file type
    response = client_tts.synthesize_speech(
        input=synthesis_in, voice=voice, audio_config=audio_config)

    # The response's audio_content is binary; write it to an MP3 file.
    with open("output.mp3", "wb") as out:
        out.write(response.audio_content)
        print('Audio content written to file "output.mp3"')

    # Play the MP3 over the speaker (requires mpg123: apt install mpg123)
    file = "output.mp3"
    os.system("mpg123 " + file)

if __name__ == '__main__':
    main()
All code for this tutorial is on GitHub. Feel free to take it and modify it into something better…stronger…faster. 💪