Leverage Gemini Pro Vision on Android to get better at Bananagrams 🍌

lethargicpanda

Thomas Ezan

Posted on March 20, 2024

Bananagrams is a cooler version of Scrabble! In a race to the finish, players build word grids with their letters, aiming to be the first to use them all.

The Bananagrams pouch with a grid of letters

But word games can be tough when you're playing in a language that isn't your native one.
Unless you have an Android app leveraging Gemini Pro Vision!

What are we building

We will build Potassium (😁), an application that suggests words that can be spelled with the available tiles. To do this, we'll leverage Gemini Pro Vision to:

  1. Analyze a picture of the tiles and extract the list of available letters,
  2. List words that can be spelled with these letters.
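Since the word list comes from the model, it can help to double-check each suggestion locally. Here is a minimal sketch of such a check, assuming the tiles and the suggestions are already available as plain strings. The helper names `letterCounts` and `canSpell` are hypothetical and are not part of the Gemini SDK or of the original app.

```kotlin
// Count how many times each letter appears, ignoring case and non-letters.
fun letterCounts(text: String): Map<Char, Int> =
    text.uppercase().filter { it in 'A'..'Z' }.groupingBy { it }.eachCount()

// A word is playable when it needs no letter more often than the tiles provide.
fun canSpell(word: String, tiles: String): Boolean {
    val available = letterCounts(tiles)
    return letterCounts(word).all { (letter, needed) -> needed <= (available[letter] ?: 0) }
}

fun main() {
    val tiles = "BANANAS"
    val suggestions = listOf("BANANA", "NABS", "BANDANA", "SNOB")
    // Keep only the suggestions that fit the tiles.
    println(suggestions.filter { canSpell(it, tiles) }) // prints [BANANA, NABS]
}
```

This only verifies that the letters are available. It does not check that the word exists in a dictionary.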

Can Gemini Pro actually do this?

Experimenting with the ML model is key, as crafting a prompt often requires multiple iterations before reaching a satisfactory result.

Let’s use Google AI Studio to evaluate Gemini Pro Vision's capabilities.

Can the model create a list of the letters based on a picture of the tiles?

Then, can the model return a list of words made with these letters?

Add Gemini to your application

Now that we have crafted a prompt that returns a reasonably satisfying response (some suggested words may not be valid Scrabble words), let’s create the app!
On the top left of Google AI Studio, click on “Get API key” to get your Gemini API key.

Then, click on “Get code” on the top right of Google AI Studio to access the code snippet.

  1. Add the Gradle dependencies to your app’s build.gradle file:
implementation("com.google.ai.client.generativeai:generativeai:0.1.1")
  2. In your Kotlin code, create a GenerativeModel:

Define the generationConfig that the model will use, for example:

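import com.google.ai.client.generativeai.type.generationConfig

// Sampling settings copied from the "Run settings" panel in Google AI Studio.
val generationConfig = generationConfig {
    temperature = 0.15f
    topK = 32
    topP = 1f
    maxOutputTokens = 4096
}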

The configuration reflects the adjustments you made in the "Run settings" section of the console. These parameters control the creativity and diversity of the text generated during inference.

topK: the number (k) of most probable tokens, out of all the tokens scored by the model, that are considered for the output.

topP: the cumulative probability threshold applied to these k tokens (after their probabilities are normalized). Only the most probable tokens whose probabilities add up to topP remain candidates for the output.

temperature: controls how random the selection of the output token is. Lower values make the output more deterministic.

To learn more about the LLM sampling mechanism, see the explainer written by Vibudh Singh.
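To make these three parameters more concrete, here is a minimal sketch of how they could combine on a toy set of token scores. It is an illustration under stated assumptions and not how Gemini is implemented: the functions `filteredDistribution` and `sample`, the toy logits, and the order of the steps (temperature, then top-k, then top-p) are all hypothetical.

```kotlin
import kotlin.math.exp
import kotlin.random.Random

// Returns the renormalized distribution left after temperature, top-k and top-p are applied.
fun filteredDistribution(
    logits: Map<String, Double>,
    temperature: Double,
    topK: Int,
    topP: Double,
): List<Pair<String, Double>> {
    // Temperature rescales the logits: values below 1 sharpen the distribution.
    val scaled = logits.mapValues { it.value / temperature }
    val maxLogit = scaled.values.maxOf { it }
    val weights = scaled.mapValues { exp(it.value - maxLogit) }

    // Top-k: keep the k most likely tokens, then normalize their probabilities.
    val ranked = weights.entries.sortedByDescending { it.value }.take(topK)
    val rankedTotal = ranked.sumOf { it.value }
    val normalized = ranked.map { it.key to it.value / rankedTotal }

    // Top-p: keep the smallest prefix whose cumulative probability reaches topP.
    val kept = mutableListOf<Pair<String, Double>>()
    var cumulative = 0.0
    for (candidate in normalized) {
        kept += candidate
        cumulative += candidate.second
        if (cumulative >= topP) break
    }

    // Renormalize so the surviving probabilities sum to 1.
    val keptTotal = kept.sumOf { it.second }
    return kept.map { it.first to it.second / keptTotal }
}

// Draws one token at random from the filtered distribution.
fun sample(distribution: List<Pair<String, Double>>, random: Random): String {
    var threshold = random.nextDouble()
    for ((token, probability) in distribution) {
        threshold -= probability
        if (threshold < 0) return token
    }
    return distribution.last().first
}

fun main() {
    // Toy scores for four candidate tokens (made-up values).
    val logits = mapOf("CAT" to 2.0, "ACT" to 1.5, "TAC" to 0.2, "XYZ" to -3.0)
    val distribution = filteredDistribution(logits, temperature = 0.15, topK = 32, topP = 1.0)
    println(distribution)
    println(sample(distribution, Random(42)))
}
```

With a low temperature such as 0.15, almost all of the probability mass lands on the highest-scoring token, which is why the answers stay fairly stable from one call to the next.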

Then instantiate the GenerativeModel:

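import com.google.ai.client.generativeai.GenerativeModel
import com.google.ai.client.generativeai.type.BlockThreshold
import com.google.ai.client.generativeai.type.HarmCategory
import com.google.ai.client.generativeai.type.SafetySetting

val model = GenerativeModel(
    "gemini-pro-vision",
    "your_gemini_key",
    generationConfig = generationConfig,
    safetySettings = listOf(
        SafetySetting(HarmCategory.HARASSMENT, BlockThreshold.MEDIUM_AND_ABOVE),
        SafetySetting(HarmCategory.HATE_SPEECH, BlockThreshold.MEDIUM_AND_ABOVE),
        SafetySetting(HarmCategory.SEXUALLY_EXPLICIT, BlockThreshold.MEDIUM_AND_ABOVE),
        SafetySetting(HarmCategory.DANGEROUS_CONTENT, BlockThreshold.MEDIUM_AND_ABOVE),
    ),
)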
  3. You can then call the model as follows:
viewModelScope.launch {
   val result = model.generateContent(
      content {
         image(bitmap)
         text("What are the letters displayed on the tiles? " +
            "And given these letter which Scrabble words can you spell with it?")
      }
   )
}

You’ll note that we pass both an image (as a bitmap) and text as content.

To create your bitmap, you can access the camera using rememberLauncherForActivityResult in Compose:

val resultLauncher =
    rememberLauncherForActivityResult(ActivityResultContracts.StartActivityForResult()) { result: ActivityResult ->
        if (result.resultCode == Activity.RESULT_OK) {
            // Update the bitmap only if the "data" extra actually holds one.
            (result.data?.extras?.get("data") as? Bitmap)?.let { bitmap = it }
        }
    }

[...]
Button (
  onClick = {
    resultLauncher.launch(cameraIntent)
  },
)

You'll find a very basic Compose scaffolding in this gist.
