Steps to Build RAG Application with Gemma 7B LLM
Akriti Upadhyay
Posted on February 23, 2024
Introduction
As large language models advance, interest in building RAG (Retrieval Augmented Generation) applications keeps growing, and Google has just released an open-source model: Gemma. RAG fuses two fundamental methodologies: retrieval-based techniques and generative models. Retrieval-based techniques source pertinent information from large knowledge repositories or corpora in response to a specific query, while generative models draw on what they learned during training to craft original text and responses. With this launch, why not try the new open-source model for building a RAG pipeline and see how it performs?
Let’s get started and break the process into these steps:
- Loading the Dataset: Cosmopedia
- Embedding Generation with Hugging Face
- Storing in the FAISS DB
- Gemma: Introducing the SOTA model
- Querying the RAG Pipeline
Building a RAG Application with Gemma 7B
Before rolling up our sleeves, let’s install and import the required dependencies.
%pip install -q -U langchain torch transformers sentence-transformers datasets faiss-cpu
import torch
from datasets import load_dataset
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain import HuggingFacePipeline
from langchain.chains import RetrievalQA
Loading the Dataset: Cosmopedia
To make a RAG application, we have selected a Hugging Face dataset, Cosmopedia. This dataset consists of synthetic textbooks, blog posts, stories, posts, and WikiHow articles generated by Mixtral-8x7B-Instruct-v0.1. The dataset contains over 30 million files and 25 billion tokens, which makes it the largest open synthetic dataset to date.
The dataset contains 8 subsets. We’ll work with the ‘stories’ subset and load it using the datasets library.
data = load_dataset("HuggingFaceTB/cosmopedia", "stories", split="train")
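The full ‘stories’ split is large (millions of rows), so loading and converting all of it can be slow and memory-hungry. If you only want to experiment, the datasets library lets you load a slice of the split instead; the slice size below is arbitrary and purely illustrative.
# Optional: load only the first 20,000 rows of the 'stories' subset.
# The slice size is arbitrary -- adjust it to your memory budget.
data = load_dataset("HuggingFaceTB/cosmopedia", "stories", split="train[:20000]")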
Then we will convert it to a Pandas DataFrame and save it to a CSV file.
data = data.to_pandas()
data.to_csv("dataset.csv")
data.head()
Now that the dataset is saved on our system, we will use LangChain to load the dataset.
loader = CSVLoader(file_path='./dataset.csv')
data = loader.load()
Now that the data is loaded, we need to split the documents inside it. Here, we split the documents into chunks of 1,000 characters with an overlap of 150 characters. Smaller chunks keep each passage within the embedding model’s input limits and make retrieval more precise.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
docs = text_splitter.split_documents(data)
Embedding Generation with Hugging Face
After that, we will generate embeddings with LangChain’s Hugging Face Embeddings wrapper, backed by a Sentence Transformers model.
modelPath = "sentence-transformers/all-MiniLM-L6-v2"
model_kwargs = {'device':'cpu'}
encode_kwargs = {'normalize_embeddings': False}
embeddings = HuggingFaceEmbeddings(
    model_name=modelPath,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)
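As a quick, optional sanity check, you can embed a sample sentence and look at the resulting vector; all-MiniLM-L6-v2 produces 384-dimensional embeddings. The sample text is just an illustration.
# Optional sanity check: embed a sample sentence and inspect the vector size.
sample_vector = embeddings.embed_query("A curious kitten explores the village.")
print(len(sample_vector))  # all-MiniLM-L6-v2 returns 384-dimensional vectors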
Storing in the FAISS DB
With the embedding model ready, we need the document embeddings stored in a vector database. We’ll save them in the FAISS vector store; FAISS is a library for efficient similarity search and clustering of dense vectors.
db = FAISS.from_documents(docs, embeddings)
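Before wiring in the LLM, it can be useful to confirm that similarity search returns sensible chunks, and optionally to persist the index so it doesn’t have to be rebuilt. Both steps below are optional additions, not part of the original pipeline; the query string is illustrative.
# Optional: verify the index with a quick similarity search.
results = db.similarity_search("a story about a curious kitten", k=3)
print(results[0].page_content[:200])
# Optional: persist the index locally so it can be reloaded later.
db.save_local("faiss_index")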
Gemma: Introducing the SOTA model
Gemma offers two model sizes, with 2 billion and 7 billion parameters respectively, catering to different computational constraints and application scenarios. Both pre-trained and fine-tuned checkpoints are provided, along with an open-source codebase for inference and serving. Gemma is trained on up to 6 trillion tokens of text data and leverages architectures, datasets, and training methodologies similar to those of the Gemini models. Both sizes exhibit strong generalist capabilities across text domains and excel at large-scale understanding and reasoning tasks.
The release includes raw, pre-trained checkpoints as well as checkpoints fine-tuned for dialogue, instruction-following, helpfulness, and safety. Comprehensive evaluations have been conducted to assess the models' performance and address shortcomings, enabling thorough research into model tuning regimes and the development of safer, more responsible models. Gemma's performance surpasses that of comparable-scale open models across various domains, including question answering, commonsense reasoning, mathematics and science, and coding, as demonstrated through both automated benchmarks and human evaluations. To learn more about the model, see the Gemma technical report.
To get started with the Gemma model, you must first accept Google's usage terms on the model's Hugging Face page. Then pass your Hugging Face token while logging in.
from huggingface_hub import notebook_login
notebook_login()
Load the model and initialize the tokenizer.
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b", padding=True, truncation=True, max_length=512)
Create a text generation pipeline.
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=512,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda"
)
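Before handing the pipeline to LangChain, you can optionally run a quick standalone generation to confirm the model loads and responds; the prompt is only an example.
# Optional smoke test of the raw text-generation pipeline.
output = pipe("What is retrieval augmented generation?")
print(output[0]["generated_text"])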
Initialize the LLM with the pipeline and model kwargs.
llm = HuggingFacePipeline(
    pipeline=pipe,
    model_kwargs={"temperature": 0.7, "max_length": 512},
)
Now it is time to use the vector store and the LLM for question-answering retrieval.
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db.as_retriever()
)
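The chain above uses LangChain's default prompt for the ‘stuff’ chain type. If you want tighter control over how the retrieved context is presented to Gemma, you can pass a custom prompt through chain_type_kwargs; the template below is only a sketch, not a prescribed prompt.
from langchain.prompts import PromptTemplate

# A minimal custom prompt; the wording is illustrative.
template = """Use the following context to answer the question.
Context: {context}
Question: {question}
Answer:"""

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db.as_retriever(),
    chain_type_kwargs={"prompt": PromptTemplate.from_template(template)}
)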
Querying the RAG Pipeline
The RAG pipeline is ready; let’s pass the queries and see how it performs.
qa.invoke("Write an educational story for young children.")
The result is:
Once upon a time, in a cozy little village nestled between rolling hills and green meadows, there lived a curious kitten named Whiskers. Whiskers loved to explore every nook and cranny of the village, from the bustling marketplace to the quiet corners where flowers bloomed. One sunny morning, as Whiskers trotted down the cobblestone path, he spotted something shimmering in the distance. With his whiskers twitching in excitement, he scampered towards it, his little paws pitter-pattering on the ground. To his delight, he found a shiny object peeking out from beneath a bush--a beautiful, colorful kite! With a twinkle in his eye, Whiskers decided to take the kite on an adventure. He tugged at the string, and the kite soared into the sky, dancing gracefully with the gentle breeze. Whiskers giggled with joy as he watched the kite soar higher and higher, painting the sky with its vibrant colors.
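If you want to see which chunks were retrieved for the query (and therefore what grounded the answer), you can also query the retriever directly; this optional inspection step is not part of the original walkthrough.
# Optional: inspect which chunks the retriever pulls for the same query.
retrieved_docs = db.as_retriever().get_relevant_documents(
    "Write an educational story for young children."
)
for doc in retrieved_docs:
    print(doc.page_content[:150], "\n---")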
Final Words
The Gemma 7B model performed very well: we got a charming story about a curious kitten. The new SOTA model was exciting to use, and with the help of the FAISS vector store we were able to build a complete RAG pipeline. Thanks for reading!
This article was originally published here: https://blog.superteams.ai/steps-to-build-rag-application-with-gemma-7b-llm-43f7251a36a1