ppaanngggg
Posted on June 3, 2024
Introduction
Google Gemini performs strongly on multimodal tasks, particularly the latest Gemini 1.5 Flash and Gemini 1.5 Pro. The table below lists two multimodal benchmarks, one for reasoning (MMMU) and one for math (MathVista). Gemini 1.5 Pro performs on par with the latest GPT-4o on visual math tasks, although GPT-4o still leads on MMMU.
Benchmark | Description | Gemini 1.5 Flash | Gemini 1.5 Pro | GPT-4o |
---|---|---|---|---|
MMMU | Multi-discipline college-level reasoning problems | 56.1% | 62.2% | 69.1% |
MathVista | Mathematical reasoning in visual contexts | 58.4% | 63.9% | 63.8% |
In this post, I will show you how to unlock the vision capabilities of Google Gemini. Let's get started.
Prerequisite
In my previous post, I demonstrated how to use Google Gemini with Next.js for streaming output. That guide focused on text input; this article shows how to upload images to Google Gemini with a simple demo. If you're unfamiliar with registering a Google AI API key or using the Vercel AI SDK, I recommend reading the previous post first.
Server-Side
Here is the complete server-side function. I made a few modifications, namely removing the custom Message type and importing CoreMessage instead.
"use server";
import { google } from "@ai-sdk/google";
import { CoreMessage, streamText } from "ai";
import { createStreamableValue } from "ai/rsc";

export async function continueConversation(history: CoreMessage[]) {
  const stream = createStreamableValue();
  const model = google.chat("models/gemini-1.5-pro-latest");

  // Stream in the background so the action can return immediately.
  (async () => {
    try {
      const { textStream } = await streamText({
        model: model,
        messages: history,
      });
      for await (const text of textStream) {
        stream.update(text);
      }
      stream.done();
    } catch (error) {
      // Close the stream with an error so the client is not left waiting.
      stream.error(error);
    }
  })();

  return {
    messages: history,
    newMessage: stream.value,
  };
}
CoreMessage is a union of several message types that can carry different kinds of data. CoreUserMessage is a message sent by the user: it has a fixed role, user, and a flexible content field. UserContent can be either a plain string or an array of TextPart and ImagePart objects.
type CoreUserMessage = {
  role: 'user';
  content: UserContent;
};

type UserContent = string | Array<TextPart$1 | ImagePart>;

interface TextPart$1 {
  type: 'text';
  text: string;
}

interface ImagePart {
  type: 'image';
  /**
  Image data. Can either be:
  - data: a base64-encoded string, a Uint8Array, an ArrayBuffer, or a Buffer
  - URL: a URL that points to the image
   */
  image: DataContent | URL;
  /**
  Optional mime type of the image.
   */
  mimeType?: string;
}
Take a closer look at ImagePart. You can pass either base64-encoded image data or an image URL in the image field. To keep this demo simple, we will pass base64-encoded image data in the message.
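To make the two options concrete, here is a minimal sketch of building a user message with a text part and an image part. The types are simplified local copies of the ones quoted above (so the snippet runs without the ai package), buildUserMessage is a hypothetical helper, and the base64 string and URL are placeholders rather than real images. Putting both parts in a single message is one possible layout; the demo in this post sends the image and the text as separate messages.

```typescript
// Local, simplified copies of the AI SDK types quoted above, so this sketch
// runs without installing the "ai" package.
type TextPart = { type: "text"; text: string };
type ImagePart = { type: "image"; image: string | URL; mimeType?: string };
type UserMessage = {
  role: "user";
  content: string | Array<TextPart | ImagePart>;
};

// Hypothetical helper: build one user message holding a question and an image.
function buildUserMessage(question: string, image: string | URL): UserMessage {
  return {
    role: "user",
    content: [
      { type: "text", text: question },
      { type: "image", image },
    ],
  };
}

// Option 1: base64-encoded image data (placeholder payload, not a real image).
const fromBase64 = buildUserMessage("What is this picture about?", "aGVsbG8=");

// Option 2: a URL that points to the image (placeholder address).
const fromUrl = buildUserMessage(
  "What is this picture about?",
  new URL("https://example.com/cover.png")
);
```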
Client-Side
This page requires the most changes. We need to upload an image, encode it as base64, send it in a message, and preview the image in the conversation. Here is the complete code for the updated page. You can copy and paste it, and I'll explain the key points afterward.
"use client";
import { useState } from "react";
import { continueConversation } from "./actions";
import { readStreamableValue } from "ai/rsc";
import { CoreMessage } from "ai";

export default function Home() {
  const [conversation, setConversation] = useState<CoreMessage[]>([]);
  const [imageInput, setImageInput] = useState<string>("");
  const [textInput, setTextInput] = useState<string>("");

  // Read a file and resolve with its data URL ("data:<mime>;base64,<data>").
  async function getBase64(file: File): Promise<string> {
    return new Promise((resolve, reject) => {
      const reader = new FileReader();
      reader.onload = () => {
        resolve(reader.result as string);
      };
      reader.onerror = () => {
        reject(reader.error);
      };
      reader.readAsDataURL(file);
    });
  }

  return (
    <div>
      <div>
        {conversation.map((message, index) => (
          <div key={index}>
            {message.role}:{" "}
            {
              // if it's string, just show it, else if it is image, preview image, if it is text, show the text
              typeof message.content === "string" ? (
                message.content
              ) : message.content[0].type === "image" ? (
                <img
                  alt=""
                  src={
                    ("data:image;base64," + message.content[0].image) as string
                  }
                  width={640}
                />
              ) : message.content[0].type === "text" ? (
                message.content[0].text
              ) : (
                ""
              )
            }
          </div>
        ))}
      </div>
      <div>
        <input
          type="file"
          onChange={(event) => {
            // files is empty when the user cancels the file dialog
            const file = event.target.files?.[0];
            if (file) {
              getBase64(file).then((result) => {
                setImageInput(result);
              });
            } else {
              setImageInput("");
            }
          }}
        />
        <input
          type="text"
          value={textInput}
          onChange={(event) => {
            setTextInput(event.target.value);
          }}
        />
        <button
          onClick={async () => {
            // append user messages
            const userMessages: CoreMessage[] = [];
            if (imageInput.length) {
              // remove the "data:<mime>;base64," header from the data URL
              const pureBase64 = imageInput.replace(/^data:[^,]*;base64,/, "");
              userMessages.push({
                role: "user",
                content: [{ type: "image", image: pureBase64 }],
              });
            }
            if (textInput.length) {
              userMessages.push({
                role: "user",
                content: [{ type: "text", text: textInput }],
              });
            }
            const { messages, newMessage } = await continueConversation([
              ...conversation,
              ...userMessages,
            ]);
            // collect assistant message
            let textContent = "";
            for await (const delta of readStreamableValue(newMessage)) {
              textContent = `${textContent}${delta}`;
              setConversation([
                ...messages,
                {
                  role: "assistant",
                  content: [{ type: "text", text: textContent }],
                },
              ]);
            }
          }}
        >
          Send Message
        </button>
      </div>
    </div>
  );
}
- Due to the complexity of CoreMessage, I have added some conditional branches to handle message previews, in particular to display base64-encoded images with the <img /> tag.
- Add another <input> with type="file" to upload an image. When its value changes, read the image file and convert it into a base64 string.
- Finally, when the send button is clicked, convert the image and text inputs into an array of CoreMessage. Note that the base64 header must be stripped from the image input.
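The header-stripping step from the last point can be sketched as a small standalone function. splitDataUrl is a hypothetical helper, not part of the AI SDK. It assumes the input is a data URL of the form produced by FileReader.readAsDataURL, and it also returns the MIME type, which could be passed along in the optional mimeType field of ImagePart.

```typescript
// Hypothetical helper: split a data URL such as "data:image/png;base64,iVBOR..."
// into its MIME type and its raw base64 payload.
function splitDataUrl(dataUrl: string): { mimeType: string; base64: string } {
  // Group 1: MIME type (up to the first ";" or ",").
  // [^,]* skips optional parameters before the ";base64," marker.
  // Group 2: everything after the comma, i.e. the base64 payload.
  const match = /^data:([^;,]*)[^,]*;base64,(.*)$/.exec(dataUrl);
  if (!match) {
    throw new Error("Not a base64 data URL");
  }
  return { mimeType: match[1], base64: match[2] };
}

// Subtypes that contain characters such as "+" are handled too.
const svg = splitDataUrl("data:image/svg+xml;base64,PHN2Zz4=");
console.log(svg.mimeType); // image/svg+xml
console.log(svg.base64); // PHN2Zz4=
```

Throwing on input that is not a base64 data URL is a design choice for this sketch; it avoids silently sending the header to the model as if it were image data.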
Body Size Config
The default bodySizeLimit for Next.js Server Actions is 1MB. If you wish to upload files larger than that, adjust the configuration as follows.
const nextConfig = {
  experimental: {
    serverActions: {
      bodySizeLimit: '10mb'
    }
  }
};
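Keep in mind that base64 encoding makes the payload roughly one third larger than the original file, so an image can hit the limit before the file itself reaches 1MB. The following is a rough sketch of that arithmetic. base64Length and fitsBodyLimit are hypothetical helpers, and the check assumes the limit is counted in bytes with 1MB equal to 1024 * 1024. It ignores the rest of the request body, so treat the result as optimistic.

```typescript
// Length in characters of the padded base64 encoding of `byteLength` bytes:
// every group of 3 input bytes becomes 4 output characters.
function base64Length(byteLength: number): number {
  return 4 * Math.ceil(byteLength / 3);
}

// Hypothetical check: does a file of `fileBytes` bytes, once base64-encoded,
// fit under `limitBytes`? Text and earlier messages are not counted here.
function fitsBodyLimit(fileBytes: number, limitBytes: number): boolean {
  return base64Length(fileBytes) <= limitBytes;
}

const ONE_MB = 1024 * 1024;
console.log(base64Length(ONE_MB)); // 1398104
console.log(fitsBodyLimit(ONE_MB, ONE_MB)); // false
console.log(fitsBodyLimit(700 * 1024, ONE_MB)); // true
```

Because the demo resends the whole conversation on every call, earlier images count toward the limit again with each new message.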
Let's Test Now
I upload the cover image from the previous blog and ask, "What is this picture about?" Then, I click the send button.
Examine the assistant's output; it's quite impressive.
References
- Documentation for the AI SDK: https://sdk.vercel.ai/docs/introduction
- Google AI Studio: https://ai.google.dev/aistudio
Conclusion
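In this post, I showed how to send images to Google Gemini from a Next.js front end with the Vercel AI SDK: the server action accepts CoreMessage history, and the client uploads an image, encodes it as base64, and previews it in the conversation.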
If you're interested in seeing Google Gemini in action, check out these products that have successfully implemented it:
- AI Math Solver - A web app that helps users solve math problems. Learn more: AIMathSolver
Have you used Google Gemini in your projects? Share your experiences in the comments below!