Information Extraction with Google Gemini

Introduction

This blog post shows how to extract information from unstructured text with ease using a large language model (LLM) such as Google Gemini Pro.

LLMs have been one of the hottest topics of 2022 and 2023. Demand has been high, and a great many applications are being built either directly on LLMs or in combination with other components such as vector databases. This post, however, covers only the information extraction aspect.

Background

Information extraction has long been a challenging problem. Turning unstructured data into structured data used to involve a great deal of complexity. These days, things have changed with the introduction of large language models.

Hands-on

Please head over to Google Colab.
Make sure you are logged in to Google Cloud, and note your Project ID and Location.
Use the code below to initialize Vertex AI.

import sys

# Additional authentication is required for Google Colab
if "google.colab" in sys.modules:
    # Authenticate user to Google Cloud
    from google.colab import auth

    auth.authenticate_user()

PROJECT_ID = "<<project_id>>"  # @param {type:"string"}
LOCATION = "<<location>>"  # @param {type:"string"}

# Initialize Vertex AI with the project and location defined above
import vertexai

vertexai.init(project=PROJECT_ID, location=LOCATION)

For the purpose of this post, let's consider a scenario of web data extraction.

Here's the code snippet that builds the prompt for the textual data extraction. Our goal is to extract the meaningful information from content that may also contain a lot of noise, such as links, images, and HTML tags. It could be anything, for that matter.

def get_text_extract_prompt(title, content):
  prompt = f"""
  Here is the title of a web page: {title}
  Here is some text extracted from it:
  ---------
  {content}
  ---------

  Web pages can have a lot of useless junk in them.
  For example, there might be a lot of ads, or a
  lot of navigation links, or a lot of text that
  is not relevant to the topic of the page. We want
  to extract only the useful information from the text.

  You can use the title to help you understand
  the context of the text.
  Please extract only the useful information from the text.
  Try not to rewrite the text, but instead extract
  only the useful information from the text.
  """
  return prompt
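
The prompt expects text extracted from a page, so it can help to reduce raw HTML to its visible text before building the prompt. The following is a minimal sketch of that step using only the Python standard library. The helper name html_to_text and the sample markup are hypothetical and not part of the original walkthrough; a real page may need a more robust parser.

```python
from html.parser import HTMLParser


class _TextOnlyParser(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped tags
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(raw_html):
    """Hypothetical helper: reduce raw HTML to newline-separated visible text."""
    parser = _TextOnlyParser()
    parser.feed(raw_html)
    parser.close()
    return parser.text()


# Illustrative markup only
sample = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<h1>Gemini</h1><p>Extract <a href='/x'>facts</a> here.</p>"
    "<script>track()</script></body></html>"
)
print(html_to_text(sample))
```

The result of html_to_text could then be passed as the content argument of get_text_extract_prompt, leaving the model to filter out the remaining noise such as navigation text.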

Now let's take a look at the code snippet that executes the prompt using the Google Gemini Pro LLM.

import vertexai
from vertexai.preview.generative_models import GenerativeModel, Part

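def execute_prompt(prompt, max_output_tokens=8192):
  model = GenerativeModel("gemini-pro")
  # temperature 0 keeps the output as repeatable as possible for extraction
  responses = model.generate_content(
    prompt,
    generation_config={
        "max_output_tokens": max_output_tokens,
        "temperature": 0,
        "top_p": 1
    },
    stream=True,
  )

  final_response = []

  # With stream=True the answer arrives in chunks; collect the text of each one
  for response in responses:
      final_response.append(response.candidates[0].content.parts[0].text)

  # Chunks are consecutive pieces of one answer, so join them without a separator
  return "".join(final_response)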
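
A streamed chunk may not always carry text, for example when a candidate list or part list comes back empty. The sketch below shows one possible way to read each chunk defensively and concatenate the pieces in order. The helper names chunk_text and join_chunks are hypothetical, the chunks are plain stand-in objects rather than Vertex AI types, and which exceptions the real SDK raises for empty or blocked chunks is an assumption here.

```python
from types import SimpleNamespace


def chunk_text(chunk):
    """Return the text of one streamed chunk, or "" if it carries none.

    Follows the same access path as execute_prompt
    (candidates[0].content.parts[0].text) but tolerates missing pieces.
    """
    try:
        return chunk.candidates[0].content.parts[0].text or ""
    except (IndexError, AttributeError, ValueError):
        # Assumption: an empty or blocked chunk surfaces as one of these errors
        return ""


def join_chunks(chunks):
    """Concatenate chunk texts in order without adding any separator."""
    return "".join(chunk_text(chunk) for chunk in chunks)


def _fake_chunk(text):
    # Stand-in object for local testing only; not a Vertex AI type
    part = SimpleNamespace(text=text)
    content = SimpleNamespace(parts=[part])
    return SimpleNamespace(candidates=[SimpleNamespace(content=content)])


empty_chunk = SimpleNamespace(candidates=[])
stream = [_fake_chunk("Gemini extracts "), empty_chunk, _fake_chunk("key facts.")]
print(join_chunks(stream))
```

Because the chunks split one answer at arbitrary points, the pieces are joined with an empty string so that no extra characters end up in the extracted text.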

Let's take a look at the code snippet that ties the two functions together. Notice that the text extraction prompt is constructed from the specific title and content, and execute_prompt is then called to perform the information extraction using the Gemini Pro LLM.

# title and content are assumed to hold the page title and its raw text
information_extraction = []
text_extract_prompt = get_text_extract_prompt(title, content)
prompt_response = execute_prompt(text_extract_prompt)
information_extraction.append(prompt_response)
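
When several pages need processing, the same build-then-execute step can be wrapped in a loop. The following is a minimal sketch under stated assumptions: the function extract_from_pages and the page dictionary shape with "title" and "content" keys are hypothetical, and the two fake functions are stand-ins so the example runs without a Vertex AI project. In practice get_text_extract_prompt and execute_prompt would be passed in instead.

```python
def extract_from_pages(pages, build_prompt, run_prompt):
    """Run the prompt-then-execute step once per page.

    pages: iterable of dicts with "title" and "content" keys (assumed shape).
    build_prompt: a function like get_text_extract_prompt.
    run_prompt: a function like execute_prompt.
    """
    results = []
    for page in pages:
        prompt = build_prompt(page["title"], page["content"])
        results.append({"title": page["title"], "extract": run_prompt(prompt)})
    return results


# Stand-ins so the sketch runs offline; they do not call any model
def fake_build_prompt(title, content):
    return f"{title}: {content}"


def fake_run_prompt(prompt):
    return prompt.upper()


pages = [
    {"title": "Page A", "content": "first body"},
    {"title": "Page B", "content": "second body"},
]
results = extract_from_pages(pages, fake_build_prompt, fake_run_prompt)
print(results)
```

Passing the two functions in as arguments keeps the loop easy to test and makes it simple to swap in a different prompt or model later.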

Ranjan Dailata