Wesley Chun (@wescpy)
Posted on April 25, 2024
TL;DR:
The previous post in this ongoing series introduced developers to the Gemini API by providing a more user-friendly and useful "Hello World!" sample than in the official Google documentation. The next steps: enhance that example to learn a few more features of the Gemini API, for example, support for streaming output and multi-turn conversations (chat), upgrade to the latest 1.0 or even 1.5 API versions, and switch to multimodality... stick around to find out how!
Sep 2024 update: Updated code samples from the Gemini 1.0 Pro (`gemini-pro`) and Pro Vision (`gemini-pro-vision`) models to Gemini 1.5 Flash (`gemini-1.5-flash`).
Introduction
Are you a developer interested in using Google APIs? You're in the right place: this blog is dedicated to that craft, from Python and sometimes Node.js. Previous posts showed you how to use Google credentials like API keys or OAuth client IDs with Google Workspace (GWS) APIs. Other posts introduced serverless computing or showed you how to export Google Docs as PDF.
The previous post kicked off the conversation about generative AI, presenting "Hello World!" examples that help you get started with the Gemini API in a more user-friendly way than in the docs. It presented samples showing you how to use the API from both Google AI as well as GCP Vertex AI.
This post follows up with a multimodal example, with local as well as online images, one that supports streaming output, another one leveraging multi-turn conversations ("chat"), and finally, another one that combines most of the above.
Whereas your initial journey began with code in both Python & Node.js plus API access from both Google AI & Vertex AI, this post focuses specifically on the "upgrades," so we're just going to stick with one of each: Python-only and only on Google AI. Use the previous post's variety of samples to "extrapolate" porting to Node.js or running on Vertex AI.
Prerequisites
The example assumes you've performed the prerequisites from the previous post:
- Installed the Google GenAI Python package with `pip install -U pip google-generativeai`
- Created an API key
- Saved the API key as a string to `settings.py` as `API_KEY = 'YOUR_API_KEY_HERE'` (and followed the suggestions for only hard-coding it in prototypes and keeping it safe when deploying to production)
For today's code samples, there are a couple more packages to install:
- The popular Python HTTP `requests` library
- Pillow, the flexible fork of the Python Imaging Library (PIL)

You can install both while updating the GenAI package with `pip install -U Pillow requests google-generativeai` (or `pip3`).
The "OG"
Let's start with the original script from the first post that we're going to upgrade here, `gemtxt-simple-gai.py`:
There are also JavaScript equivalents of this script, in traditional CommonJS format or as a modern JS/ECMAScript module. Review the original post for coverage of those samples.
The focus is on Python in this post, and that version is the starting point for the remaining examples here. JS developers can make equivalent adjustments to whichever version they started with.
Streaming
The next easiest update is to change to streaming output. When sending a request to an LLM (large language model), sometimes you don't want to wait for all of the output from the model to return before displaying to users. To give them a better experience, "stream" the output in chunks as they come instead of waiting until the LLM is completely done:
Switching to streaming requires only the `stream=True` flag passed to the model's `generate_content()` method. The loop displays the chunks of data returned by the LLM as they come in. To keep the spacing consistent, set Python's `print()` function to not output a NEWLINE (`\n`) after each chunk via its `end` parameter. Instead, keep chaining the chunks together, then issue the NEWLINE after all have been retrieved and displayed. This version is also available in the repo as `gemtxt-stream-gai.py`. Static output here can't show the chunks arriving as they stream in, so you'll have to take my word for it. :-)
$ python3 gemtxt-stream-gai.py
** GenAI text: 'gemini-1.5-flash' model & prompt 'Describe a
cat in a few sentences'
A cat is a small, carnivorous mammal with soft fur, retractable
claws, and sharp teeth. They are known for their independence,
cleanliness, and playful nature. With its keen senses and
graceful movements, a cat exudes both mystery and intrigue. Its
sleek body is covered in sleek fur that ranges in color from
black to white to tabby.
Multi-turn conversations (chat)
Next is chat, or "multi-turn conversations." You may be building a chat application or executing a workflow where your user or system must interact with the model more than once, keeping context between messages. To facilitate this exchange, Google provides a convenience chat object, obtained with `start_chat()`, which features a `send_message()` method for communicating with the model instead of `generate_content()`, as shown below:
While the flow is slightly different from what you've already seen, the basic operations are the same: send a prompt to the model and await the response. The core difference is that you're sending multiple messages in a row, with each subsequent message maintaining the full context of the ongoing "conversation." This version is found in the repo as `gemtxt-simple-chat-gai.py`, and shown here is one sample exchange with the model:
$ python3 gemtxt-simple-chat-gai.py
** GenAI text: 'gemini-1.5-flash' model
USER: Describe a cat in a few sentences
MODEL: The sleek, black cat padded silently across the
sunlit floor, its emerald eyes gleaming with mischief. A
flick of its tail, a twitch of its whiskers, and it
vanished into the shadows, leaving only the faint scent of
honeysuckle in its wake.
USER: Since you're now a feline expert, what are the top
three most friendly cat breeds for a family with small
children?
MODEL: While I can't claim to be a feline expert, I can
access and process a lot of information about cats! Here
are three breeds often considered friendly and good with
children, based on their general temperament and history:
1. **Ragdoll:** These gentle giants are known for their
laid-back personalities and tolerance for handling, making
them ideal for families with young children. They love to
cuddle and often enjoy being held and carried.
2. **Maine Coon:** These large, fluffy cats are playful
and affectionate. Their gentle nature and tolerance for
boisterous kids make them good companions for families.
They're also known for being relatively low-maintenance.
3. **Persian:** These luxurious cats are known for their
sweet, docile personalities. While not as active as some
breeds, they enjoy gentle interaction and are often
described as being "dog-like" in their affection.
Remember, even with these breeds, it's essential to
supervise interactions between cats and young children and
teach kids how to handle cats respectfully. Every cat is
an individual, and personality can vary within a breed.
I'm not a cat owner, so I can't vouch for Gemini's accuracy. Add a comment below if you have a take on it. Now let's switch gears a bit.
So far, all of the enhancements and corresponding samples are text-based, single-modality requests. A whole new class of functionality is available if a model can accept data in addition to text, in other form factors such as images, audio, or video content. The Google AI documentation states that this wider variety of input, "creates many additional possibilities for generating content, analyzing data, and solving problems."
Multimodal
Some Gemini models, and by extension their corresponding APIs, support multimodality: "prompting with text, image, and audio data." Video is also supported, but you need to use the File API to convert it to a series of image frames. You can also use the File API to upload the assets used in your prompts.
The sample script below takes an image and asks the LLM for some information about it, specifically this image:
The prompt is a fairly straightforward query: "Where is this located, and what's the waterfall's name?" Here is the multimodal version of the script posing this query... it's available in the repo as `gemmmd-simple-loc-gai.py`:
:
These are the key updates from the original app:
- Change to a multimodal model: Gemini 1.0 Pro to Gemini 1.0 Pro Vision (now Gemini 1.5 Flash, per the Sep 2024 update)
- Import Pillow and use it to read the image data given its filename
- New prompt: pass in prompt string plus image payload
The `MODEL` variable now points to `gemini-1.5-flash`, the image filename is passed to Pillow to read its `DATA`, and rather than a single `PROMPT` string, both the `PROMPT` and image `DATA` are passed as a 2-tuple to `generate_content()`. Everything else stays the same. Let's see what Gemini says:
$ python3 gemmmd-simple-loc-gai.py
** GenAI multimodal: 'gemini-1.5-flash' model & prompt
"Where is this located, and what's the waterfall's name?"
This is the Rain Vortex at Changi Airport in Singapore.
Online data vs. local
The final update takes the previous example and changes it to access an image online rather than requiring it to be available on the local filesystem. For this, we'll use one of Google's stock images:
This one is pretty much identical to the one above but uses the Python `requests` library to fetch the image for Pillow. The script below asks Gemini to "Describe the scene in this photo" and can be accessed in the repo as `gemmmd-simple-url-gai.py`:
New here is the import of `requests`, followed by its use to perform an HTTP GET on the image URL (`IMG_URL`), reading the binary payload into `IMG_RAW`, which is passed along with the text prompt to `generate_content()`. Running this script results in the following output:
$ python3 gemmmd-simple-url-gai.py
** GenAI multimodal: 'gemini-1.5-flash' model & prompt
'Describe the scene in this photo'
A man is sitting at a desk in front of a large window. He is
smiling and gesturing with his hands, as if he is talking to
someone. There is a couch behind him, and a lamp on a tripod
next to it. There are some papers and a laptop on the desk in
front of him. The room looks like an office or a home office.
The man is dressed in a light blue shirt. The window looks
out over a city skyline.
Online, multimodal, and multi-turn
For the last sample, let's merge the earlier incarnations into a fifth derivative: a multimodal, multi-turn chat app featuring the same (online) image as the previous sample, further querying the model with, "You are a marketing expert. If a company uses this photo in a press release, what product or products could they be selling?"
This version combines the previous two: the image is fetched with `requests` as before, then sent along with the first prompt in a chat session so that the follow-up question retains the full context of the conversation, image included. Running this script results in the following output:
$ python3 gemmmd-simple-url-chat-gai.py
** GenAI multimodal: 'gemini-1.5-flash' model
USER: Describe the scene in this photo
MODEL: A man is sitting at a desk with his hands up in a
welcoming gesture. He is smiling broadly and appears to be
excited. He is in an office setting, with a large window
behind him looking out at a city skyline. There is a couch
and a chair to his right, and there are two lamps in the room.
USER: You are a marketing expert. If a company uses this photo
in a press release, what product or products could they be
selling?
MODEL: This image could be used to promote a variety of
products, depending on the tone and message the company wants
to convey. Here are some possibilities:
* **Productivity software:** The man's excited expression and
the office setting suggest a product that helps people be more
efficient and successful. The laptop and papers on the desk
could further reinforce this idea.
* **Communication tools:** The man's open body language and
welcoming gestures could be used to market a product that
facilitates communication and collaboration, like a video
conferencing platform or a messaging app.
* **Workspace design or furniture:** The stylish office
environment could be used to promote products related to
office design and furniture, such as ergonomic chairs,
modern desks, or stylish lighting.
* **Online learning platform:** The image could suggest a
platform that helps people learn new skills or expand their
knowledge base. The man's enthusiasm could be linked to the
excitement of acquiring new knowledge.
* **Co-working space:** The image could be used to promote
a co-working space that offers a modern and inspiring
environment for professionals to work, connect, and
collaborate.
Ultimately, the best product to promote with this image will
depend on the company's specific goals and target audience.
Summary
Developers are eager to jump into the world of AI/ML, especially GenAI & LLMs, and accessing Google's Gemini models via API is part of that picture. The previous post in the series got your foot in the door, presenting a more digestible user-friendly "Hello World!" sample to help developers get started.
This post presents possible next steps, providing "102" samples that enhance the original script, furthering your exploration of Gemini API features but doing so without overburdening you with large swaths of code.
More advanced Gemini API features we didn't cover here merit separate posts of their own.
Look for more posts on Gemini coming soon, including using its API in web apps. If you found an error in this post or have a topic you want me to cover in the future, drop a note in the comments below! I've been on the road lately talking about Google APIs, AI included of course. Find the travel calendar at the bottom of my consulting site... I'd love to meet you IRL if I'm visiting your region!
NEXT POST: Part 3: Gemini API 102a... Putting together basic GenAI web apps
Resources
- Blog post code samples
- Gemini API (Google AI)
- Gemini API (GCP Vertex AI)
- Gemini API (differences between both platforms)
- Gemini 1.5 Flash
  - Home page
  - Flash launch (May 2024)
  - Pro launch (Feb 2024)
  - 1.5 models paper
- Other Generative AI and Gemini resources
WESLEY CHUN, MSCS, is a Google Developer Expert (GDE) in Google Cloud (GCP) & Google Workspace (GWS), author of Prentice Hall's bestselling "Core Python" series, co-author of "Python Web Development with Django", and has written for Linux Journal & CNET. He runs CyberWeb specializing in GCP & GWS APIs and serverless platforms, Python & App Engine migrations, and Python training & engineering. Wesley was one of the original Yahoo!Mail engineers and spent 13+ years on various Google product teams, speaking on behalf of their APIs, producing sample apps, codelabs, and videos for serverless migration and GWS developers. He holds degrees in Computer Science, Mathematics, and Music from the University of California, is a Fellow of the Python Software Foundation, and loves to travel to meet developers worldwide at conferences, user group events, and universities. Follow he/him @wescpy & his technical blog. Find this content useful? Contact CyberWeb or buy him a coffee (or tea)!