Google AI Vision & Text to Speech on a Raspberry Pi

code_munkee

Matt Grofsky

Posted on November 11, 2020

AI Vision

As the CTO of Ytel, Inc., I work a lot with communications technology and machine learning. MMS is typically the de facto standard for sending photos back and forth on a mobile device, outside of downloaded OTT applications.

RCS is now starting to appear on mobile devices, and media sharing is expected to accelerate. I thought it would be interesting to see how hard it would be to build out an IoT type device outside the Google Cloud Platform proper that can interact with some of Google’s prebuilt AI models and interpret this media.

Below, I provide tools and code for building a demo that takes a photo of a scene, analyzes it, and then speaks back the results.

To fully build out the proof of concept you will need:

  • A Raspberry Pi
  • A Raspberry Pi Camera
  • A Google Cloud Platform account

The first step is to make sure you have Python 3.7.x or higher installed on the Pi, then create a requirements.txt with the following pinned dependencies:

google-cloud-vision==1.0.0
google-cloud-texttospeech==2.2.0
picamera==1.13
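Before moving on, it's worth confirming your interpreter actually meets that floor; a quick sanity check:

```python
import sys

# The pinned Google client libraries above assume Python 3.7 or newer
assert sys.version_info >= (3, 7), "This demo requires Python 3.7+"
print("Python version OK:", sys.version.split()[0])
```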

Next, let’s build out the application.

In your main.py, declare your imports, provide your GCP credentials, and instantiate your Google SDK clients. Your credentials will reference a service-account JSON file, and that service account must have permission to use the Cloud Vision and Cloud Text-to-Speech APIs.
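If you haven't yet enabled the two APIs on your project, one way to do it is from the gcloud CLI (assuming it is installed and authenticated; the same can be done in the Cloud Console):

```shell
# Enable the two APIs this demo calls on the active GCP project
gcloud services enable vision.googleapis.com texttospeech.googleapis.com
```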

import picamera
import time
import os
from google.cloud import vision
from google.cloud import texttospeech

# Needs permission for Cloud Vision API and Cloud Text-to-Speech API

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "YourServiceAccount.json"
client_vision = vision.ImageAnnotatorClient()
client_tts = texttospeech.TextToSpeechClient()

To analyze a photo, you first have to take a picture. The beautiful thing about a Raspberry Pi camera is that this is a simple task.

Once your camera is plugged in and enabled in your Raspberry Pi configuration (via raspi-config), use the PiCamera library to take a photo. Below is a simple function for doing so.

def takephoto():
    camera = picamera.PiCamera()
    camera.resolution = (1024, 768)

    # Show a quick preview before snapping the photo (if you have a monitor)
    camera.start_preview()
    time.sleep(1)

    # Take the photo, then release the camera for the next run
    camera.capture('image.jpg')
    camera.close()

The primary function executes takephoto(), which writes an image.jpg file to the local drive. The file is then read into memory, handed to the Cloud Vision SDK, and analyzed by Google's Cloud Vision AI service.

In this instance, I chose to use the label_detection feature to help identify objects in the photo. The service also has separate functions to recognize the existence of faces, famous logos, and more. For some detailed info on what it can do, visit the official Google Cloud Vision AI docs page.
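Each entry in label_annotations carries a description and a confidence score, so you can trim low-confidence guesses before speaking them. A minimal sketch using made-up results (the filter_labels helper is mine, not part of the SDK):

```python
# Stand-in for Vision results: (description, score) pairs mimicking
# the label_annotations returned by label_detection
detected = [("Dog", 0.97), ("Mammal", 0.92), ("Snout", 0.61)]

def filter_labels(labels, min_score=0.7):
    """Keep only labels the model is reasonably confident about."""
    return [desc for desc, score in labels if score >= min_score]

print(filter_labels(detected))  # → ['Dog', 'Mammal']
```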

The Text-to-Speech service supports SSML and Google's premium WaveNet voices. The example below sends plain text rather than full SSML, but if you would like to see documentation highlighting some of the deeper SSML capabilities, you can do so here.

As for the voices, I highly recommend Google's WaveNet voices for any TTS application that demands near-human-quality synthesis.
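To give a taste of SSML, wrapping the labels in markup lets you insert a short pause between spoken items instead of relying on commas. A minimal sketch (the labels_to_ssml helper is my own; the resulting string would be passed via texttospeech.SynthesisInput(ssml=...) rather than text=...):

```python
def labels_to_ssml(labels):
    # Join labels with a 300 ms pause between each spoken item
    items = '<break time="300ms"/>'.join(labels)
    return '<speak>' + items + '</speak>'

print(labels_to_ssml(["Dog", "Mammal"]))
# → <speak>Dog<break time="300ms"/>Mammal</speak>
```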

The speech is streamed back and stored as an MP3 file on the local drive. Once saved, mpg123 plays the MP3 over any speaker hooked up to the Raspberry Pi. If you have not done so already, install it with sudo apt install mpg123.
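As a side note, if you'd rather not shell out with os.system, the same playback can be done with subprocess, which avoids concatenating the filename into a shell command string (the play_mp3 helper name is mine):

```python
import subprocess

def play_mp3(path="output.mp3"):
    # Invoke mpg123 quietly; check=True raises if playback fails
    subprocess.run(["mpg123", "-q", path], check=True)
```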

def main():
    takephoto()

    # Read the captured photo into memory
    with open('image.jpg', 'rb') as image_file:
        content = image_file.read()

    image = vision.types.Image(content=content)

    # Ask Cloud Vision for labels describing the scene
    response = client_vision.label_detection(image=image)
    labels = response.label_annotations
    print('Labels:')

    synthesis_input = ''

    # Build a simple comma-delimited sentence from the labels
    for label in labels:
        print(label.description)
        synthesis_input = label.description + ', ' + synthesis_input

    synthesis_in = texttospeech.SynthesisInput(text=synthesis_input)

    # Use a premium WaveNet voice
    voice = texttospeech.VoiceSelectionParams(
        language_code="en-US",
        name="en-US-Wavenet-A",
        ssml_gender=texttospeech.SsmlVoiceGender.MALE
    )

    # Select the type of audio file you want returned
    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    )

    # Perform the text-to-speech request on the text input with the
    # selected voice parameters and audio file type
    response = client_tts.synthesize_speech(
        input=synthesis_in, voice=voice, audio_config=audio_config
    )

    # The response's audio_content is binary; write it out as an MP3
    with open("output.mp3", "wb") as out:
        out.write(response.audio_content)

    print('Audio content written to file "output.mp3"')

    # Play the result over the Pi's speaker (requires: apt install mpg123)
    os.system("mpg123 output.mp3")

if __name__ == '__main__':
    main()

All code for this tutorial is on GitHub: mgrofsky/GoogleAI-Pi (Google AI Vision & Speech on a Raspberry Pi). Feel free to take it and modify it into something better…stronger…faster. 💪
