Build a Multimodal Chat App using LLaVA, Chainlit, and Replicate
alisdairbr
Posted on March 12, 2024
In the rapidly evolving landscape of artificial intelligence (AI), multimodal vision models stand out as an incredible innovation. These models merge the visual understanding of images with the linguistic comprehension of text to create systems that can interpret and interact with the world in ways similar to humans.
However, the true potential of such groundbreaking AI technologies can only be realized when they are made accessible to a broader audience. This is where the importance of user-friendly interfaces, like Chainlit, becomes clear. By wrapping complex AI capabilities in interfaces that are intuitive and easy to navigate, these tools enable users from various backgrounds to leverage the power of AI without the need for deep technical expertise.
This tutorial showcases how to build a chat interface, using Chainlit for the front end and LLaVA for powering the back-end API.
You can deploy the multimodal vision chat application as configured in this guide using the Deploy to Koyeb button below.
Requirements
To successfully follow this tutorial, you will need the following:
- Replicate Account: You need a Replicate account to use the Replicate API, which gives you access to a LLaVA model.
- Koyeb Account: A Koyeb account will be required for deploying and managing the chat application in a cloud environment, taking advantage of Koyeb’s seamless integration and deployment capabilities.
- GitHub Account: A GitHub account is necessary for version control and managing the project's codebase.
Understanding of the components
Overview of Multimodal Vision Models
Multimodal vision models are a major advancement in the field of artificial intelligence, standing at the intersection of visual perception and natural language processing (NLP). These sophisticated models interpret and analyze data from multiple sources or modalities, primarily focusing on visual (images, videos) and textual (descriptions, questions) inputs.
By using convolutional neural networks (CNNs) or transformers to analyze visual data, these models extract features and patterns that define the content and context of an image. Simultaneously, they employ NLP techniques to understand and process textual data, enabling them to grasp the semantics of user queries or descriptions related to the images.
When these models integrate the processed information from both modalities, they can not only identify objects within an image but also understand their attributes, the relationships between them, and the overall context or scene depicted. This more comprehensive understanding allows the models to generate nuanced responses to queries, make inferences, and even generate descriptive texts or answer questions about unseen images accurately.
Overview of LLaVA
LLaVA (Large Language and Vision Assistant) is an open-source multimodal model that connects a vision encoder to a large language model. It processes multimodal queries, where users ask questions or make requests that involve both textual and visual information. By combining the capabilities of language models with vision processing, LLaVA provides a robust backend capable of understanding and responding to complex queries about images.
It incorporates the latest advancements in AI and machine learning, including deep learning models for image recognition and natural language processing. Being open-source, it allows for customization and optimization according to specific project needs, making it a versatile choice for developers. Designed with scalability in mind, it is capable of handling a wide range of query volumes and complexities.
Overview of Replicate
Replicate is a web-based platform that allows users to deploy and scale machine learning models easily. The platform provides a simple interface for managing machine learning models and handling tasks such as data preprocessing, model training, and model deployment.
It is designed to make it easy for developers and data scientists to build and deploy machine learning applications without having to worry about the underlying infrastructure. The platform supports a variety of machine learning frameworks, including TensorFlow, PyTorch, and Scikit-learn, and allows users to deploy models in a variety of environments, including on-premises hardware, virtual machines, and cloud-based infrastructure.
It also provides features such as version control, collaboration tools, and automated testing to help teams work together more effectively and ensure that their machine-learning models are accurate and reliable. Overall, the platform is designed to simplify the process of building and deploying machine learning applications, allowing developers and data scientists to focus on building great models rather than worrying about infrastructure and deployment.
Steps
To build this chat interface you will follow these few steps:
- Set Up the Environment: Here you will set up your project folder, install any dependencies, and prepare environment variables.
- Set Up Chainlit: In this section, you will install Chainlit and set up the initial chat interface.
- Integrate LLaVA API from Replicate: In this section, you will integrate with a LLaVA API from Replicate that will process the images and return the text response.
- Run Examples: Testing the newly created application with a set of examples.
- Deploy to Koyeb
Set Up the Environment
First, let’s start by creating a new project. To keep your Python dependencies organized you should create a virtual environment.
You can create a local folder on your computer with:
# Create and move to the new folder
mkdir VisionChat
cd VisionChat
# Create a virtual environment
python -m venv venv
# Activate the virtual environment (Windows)
.\venv\Scripts\activate.bat
# Activate the virtual environment (Linux/macOS)
source ./venv/bin/activate
Next, you can install the required dependencies:
pip install chainlit openai replicate requests python-decouple
Along with the libraries for Chainlit and Replicate, this also installs openai (for Chainlit), requests (used later to upload the image files), and python-decouple (for loading environment variables).
Don’t forget to save your dependencies to the requirements.txt file:
pip freeze > requirements.txt
As mentioned before, you need a Replicate account to access a LLaVA model. If you don’t have an account, you can create one on the Replicate website. After that, you will get access to the API key.
The next step is to create a .env file to store the API key and model settings for Replicate:
REPLICATE_API_KEY=<YOUR_REPLICATE_API_KEY>
REPLICATE_MODEL=yorickvp/llava-v1.6-mistral-7b
REPLICATE_MODEL_VERSION=19be067b589d0c46689ffa7cc3ff321447a441986a7694c01225973c2eafc874
For this tutorial, we will use a LLaVA model that is built on the Mistral 7B language model.
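As a minimal sketch of how these three settings fit together, the helper below checks that they are present and builds the owner/name:version reference that is later passed to the Replicate client. The function name model_reference and the REQUIRED_KEYS tuple are hypothetical and exist only for illustration; the tutorial itself reads the values with config() from python-decouple rather than from a plain mapping.

```python
import os
from typing import Mapping

# Settings the tutorial stores in the .env file.
REQUIRED_KEYS = ("REPLICATE_API_KEY", "REPLICATE_MODEL", "REPLICATE_MODEL_VERSION")


def model_reference(env: Mapping[str, str] = os.environ) -> str:
    """Return '<owner>/<name>:<version>' or raise if a setting is missing.

    Hypothetical helper: it only illustrates how the settings combine.
    """
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError(f"Missing settings: {', '.join(missing)}")
    return f"{env['REPLICATE_MODEL']}:{env['REPLICATE_MODEL_VERSION']}"
```

Failing early on a missing setting gives a clearer error than a failed API call later in the chat.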
Set Up Chainlit
Now you can start implementing the Chainlit application. The code implementation will reside in one single file, so you can create a new file app.py:
import time
import chainlit as cl
import replicate
import requests
from chainlit import user_session
from decouple import config
Overall, this code sets up the imports for a script that uses the chainlit and replicate libraries to interact with a machine learning model. The requests and decouple libraries are used for making HTTP requests and managing configuration settings, respectively.
The first function to be called when a chat starts is on_chat_start, so let’s write its implementation:
# On chat start
@cl.on_chat_start
async def on_chat_start():
    # Message history
    message_history = []
    user_session.set("MESSAGE_HISTORY", message_history)
    # Replicate client
    client = replicate.Client(api_token=config("REPLICATE_API_KEY"))
    user_session.set("REPLICATE_CLIENT", client)
This code defines a function on_chat_start() that is decorated with @cl.on_chat_start. This decorator indicates that the function should be executed when a new chat session is started in a Chainlit application.
Inside the on_chat_start() function, two things happen:
- A new empty list called message_history is created. This list will be used to store the history of messages exchanged between the user and the chatbot. The list is then stored in the user's session using the user_session.set() function.
- A new instance of the replicate.Client class is created using an API token that is retrieved from the configuration using the config() function from the decouple library. The replicate.Client class is used to interact with the Replicate platform. The instance is then stored in the user's session using the user_session.set() function.
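The message history is a plain Python list, so it grows with every exchange. As a hedged sketch, the helper below appends one exchange and keeps only the most recent turns. The function name append_turn, the max_turns cap, and the "User: " and "Assistant: " prefixes are assumptions made for illustration; the tutorial code itself stores the full history without trimming.

```python
from typing import List


def append_turn(history: List[str], user_text: str, ai_text: str,
                max_turns: int = 10) -> List[str]:
    """Return a new history with one more exchange, capped at max_turns.

    Hypothetical helper: the cap is an assumption, not part of the tutorial.
    """
    if max_turns <= 0:
        return []
    updated = history + [f"User: {user_text}", f"Assistant: {ai_text}"]
    # Each turn is two entries (user + assistant), so keep 2 * max_turns items.
    return updated[-2 * max_turns:]
```

Returning a new list instead of mutating the argument keeps the helper easy to test outside a Chainlit session.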
Integrate LLaVA API from Replicate
The LLaVA model that we will use requires the uploaded image to be accessible over the internet through a remote URL. To make this possible, we will use a Replicate endpoint that accepts an image upload and returns a remote URL. This endpoint is normally used to upload images for training models, but in this case, we will use it as a file repository.
Let’s implement that function in the app.py file:
# Upload image to Replicate
def upload_image(image_path):
    # Get upload URL from Replicate (filename is hardcoded, but not relevant)
    response = requests.post(
        "https://dreambooth-api-experimental.replicate.com/v1/upload/filename.png",
        headers={"Authorization": f"Token {config('REPLICATE_API_KEY')}"},
    )
    # Fail early if the request was rejected
    response.raise_for_status()
    upload_response = response.json()
    # Read file (the context manager closes the file handle)
    with open(image_path, "rb") as image_file:
        file_binary = image_file.read()
    # Upload file to Replicate
    requests.put(
        upload_response["upload_url"],
        headers={"Content-Type": "image/png"},
        data=file_binary,
    ).raise_for_status()
    # Return the URL where the uploaded image is served
    return upload_response["serving_url"]
This code defines a function called upload_image() that takes a single argument, image_path, which is the file path of an image file to be uploaded to the Replicate platform.
The function performs the following steps:
- It sends a POST request with an authorization header that includes the Replicate API key. The response from this request is a JSON object containing an upload URL.
- It reads the binary data from the image file using the open() function with the "rb" (read binary) mode.
- It sends a PUT request to the upload URL obtained previously with the binary data of the image file as the request body and a header indicating the content type. This uploads the image file to the Replicate platform.
- It extracts the serving URL from the JSON response obtained previously and returns it as the function output.
Overall, this function uploads an image file to the Replicate platform and returns the serving URL that can be used to access the uploaded image.
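The upload_image() function always sends a Content-Type of image/png, even when the user attaches a JPEG or another format. As a small sketch, the helper below derives the MIME type from the file extension using the standard mimetypes module and falls back to image/png. The function name guess_image_content_type is hypothetical, and whether the upload endpoint accepts other content types is an assumption this tutorial does not confirm.

```python
import mimetypes


def guess_image_content_type(image_path: str, default: str = "image/png") -> str:
    """Guess an image MIME type from the file extension.

    Hypothetical helper: falls back to `default` for unknown or non-image files.
    """
    content_type, _encoding = mimetypes.guess_type(image_path)
    if content_type and content_type.startswith("image/"):
        return content_type
    return default
```

The result could be passed as the Content-Type header of the PUT request in place of the hardcoded value.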
With this helper function prepared, you can now write the core of the Chainlit application, the function that processes messages and integrates with the Replicate model:
# On message
@cl.on_message
async def main(message: cl.Message):
    # Send empty message for loading
    msg = cl.Message(
        content="",
        author="Vision Chat",
    )
    await msg.send()
    # Processing images (if any); skip elements without a MIME type
    images = [file for file in message.elements if file.mime and "image" in file.mime]
    # Setup prompt
    prompt = """You are a helpful Assistant that can help me with image recognition and text generation.\n\n"""
    prompt += """Prompt: """ + message.content
    # Retrieve message history
    message_history = user_session.get("MESSAGE_HISTORY")
    # Retrieve Replicate client
    client = user_session.get("REPLICATE_CLIENT")
    # Check if there are images and set input
    if len(images) >= 1:
        # Clear history (we clear history when we have a new image)
        message_history = []
        # Upload image to Replicate
        url = upload_image(images[0].path)
        # Set input with image and without history
        input_vision = {
            "image": url,
            "top_p": 1,
            "prompt": prompt,
            "max_tokens": 1024,
            "temperature": 0.5,
        }
    else:
        # Set input without image and with history
        input_vision = {
            "top_p": 1,
            "prompt": prompt,
            "max_tokens": 1024,
            "temperature": 0.5,
            "history": message_history,
        }
    # Call Replicate
    output = client.run(
        f"{config('REPLICATE_MODEL')}:{config('REPLICATE_MODEL_VERSION')}",
        input=input_vision,
    )
    # Process the output
    ai_message = ""
    for item in output:
        # Stream token by token
        await msg.stream_token(item)
        # Pause briefly for a better user experience, without blocking the event loop
        await cl.sleep(0.1)
        # Append to the AI message
        ai_message += item
    # Send the message
    await msg.send()
    # Add to history
    user_text = message.content
    message_history.append("User: " + user_text)
    message_history.append("Assistant: " + ai_message)
    user_session.set("MESSAGE_HISTORY", message_history)
This code defines a function called main() that is decorated with @cl.on_message. This decorator indicates that the function should be executed when a new message is received in a Chainlit chat session.
Inside the main() function, the following steps are performed:
- An empty message is sent to indicate that the chatbot is processing the user's message.
- Any images attached to the user's message are extracted and stored in the images list.
- A prompt is set up that includes a description of the chatbot's capabilities and the user's message.
- The chat history is retrieved from the user's session using the user_session.get() function.
- The replicate.Client instance is retrieved from the user's session using the user_session.get() function.
- If there are any images attached to the user's message, the chat history is cleared, the image is uploaded to the Replicate platform using the upload_image() function, and the input to the LLaVA model is set to include the uploaded image and the prompt. If there are no images, the input to the LLaVA model is set to include only the prompt and the chat history.
- The LLaVA model is called using the client.run() function with the appropriate input.
- The output from the LLaVA model is processed token by token and streamed to the user. The output is also stored in the ai_message variable.
- The final output message is sent to the user using the msg.send() function.
- The user's message and the chatbot's response are added to the chat history and stored in the user's session using the user_session.set() function.
Overall, this code implements a chatbot that can handle both text and image inputs. That is all the code necessary to implement a multimodal chat with Chainlit, Replicate, and LLaVA.
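The branching in main() between "image without history" and "history without image" can also be expressed as a small pure function, which makes it easy to test without a running chat session. The sketch below mirrors the parameter values used in this tutorial; the function name build_model_input is hypothetical and is not part of Chainlit or Replicate.

```python
from typing import Any, Dict, List, Optional


def build_model_input(prompt: str, history: List[str],
                      image_url: Optional[str] = None) -> Dict[str, Any]:
    """Build the model input dictionary used in main().

    Hypothetical helper: sends the image without history when an image URL
    is given, otherwise sends the history without an image.
    """
    model_input: Dict[str, Any] = {
        "top_p": 1,
        "prompt": prompt,
        "max_tokens": 1024,
        "temperature": 0.5,
    }
    if image_url is not None:
        model_input["image"] = image_url
    else:
        # Copy the list so later changes to the session history do not leak in.
        model_input["history"] = list(history)
    return model_input
```

Keeping the shared parameters in one place also avoids the two dictionaries drifting apart when a value such as temperature is tuned.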
Next, let’s take a look at some working examples.
Run Examples
To run the Chainlit application you just need to execute in the terminal:
chainlit run app.py
A browser window automatically opens with the landing screen for the Chainlit application:
Let’s see some examples in action:
Disclaimer: When using machine learning models deployed on the Replicate platform, it is important to note that the first execution of a model after a period of inactivity may take longer than subsequent executions. This is known as a "cold boot" or "cold start".
This is because when a model is not being used, the platform may shut down some of the resources allocated to it to conserve resources and reduce costs. When the model is invoked again, these resources need to be reallocated and the model needs to be loaded back into memory, which can take some time.
The duration of the cold boot period can vary depending on the size and complexity of the model, as well as the current load on the platform. In general, expect that the first image processed in the chat may take longer to process and it might even timeout. If that happens, simply upload the image and send a message again.
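Instead of asking the user to resend a message after a cold-start timeout, the call could be retried automatically. The following is a minimal sketch of retrying with exponential backoff; the function name call_with_retry and its parameters are hypothetical, and which exception types the Replicate client raises on a timeout is not covered by this tutorial, so the sketch retries on any Exception by default.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(func: Callable[[], T], attempts: int = 3, base_delay: float = 2.0,
                    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call func(), retrying with exponential backoff when it raises.

    Hypothetical helper: waits base_delay, then 2 * base_delay, and so on,
    and re-raises the last error once all attempts are used.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("unreachable")
```

This version is synchronous and blocks while it waits, so inside an async Chainlit handler it would need an async equivalent. It also only covers errors raised by the call itself, not errors that occur later while iterating over a streamed output.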
Deploy to Koyeb
Now that you have the application running locally you can also deploy it on Koyeb and make it available on the Internet.
Create a repository on your GitHub account, for instance, called VisionChat.
You can download a standard .gitignore file for Python from GitHub to exclude certain folders and files from being pushed to the repository:
curl -L https://raw.githubusercontent.com/github/gitignore/main/Python.gitignore -o .gitignore
Run the following commands in your terminal to commit and push your code to the repository:
echo "# VisionChat" >> README.md
git init
git add .
git commit -m "first commit"
git branch -M main
git remote add origin [Your GitHub repository URL]
git push -u origin main
You should now have all your local code in your remote repository. Now it is time to deploy the application.
Within the Koyeb control panel, while on the Overview tab, initiate the app creation and deployment process by clicking Create Web Service.
On the application deployment page:
- Select GitHub as your deployment method.
- Choose the repository where your code resides. For example, VisionChat.
- Under Builder configure your Run command by selecting Override and adding the command to run the application with public access:
chainlit run app.py
- Under Edit environment variables, click the Add Variable button to add your Replicate API key named REPLICATE_API_KEY. Also add REPLICATE_MODEL and REPLICATE_MODEL_VERSION.
- In the Instance selection, click “Eco” and select "Free".
- Under App and Service names, rename your App to whatever you’d like. For example, vision-chat. Note the name will be used to create the public URL for this app. You can add a custom domain later if you’d like.
- Finally, click the Deploy button or hit ⌘D.
Once the application is deployed, you can visit the Koyeb service URL (ending in .koyeb.app) to access the chatbot interface.
Conclusion
In this guide, we used Chainlit and LLaVA to create a user-friendly interface for complex AI operations. These tools enable building applications that enhance our collective interaction with digital content and democratize the use of advanced AI for a broader audience.
Multimodal models combine visual understanding with linguistic understanding. They open up new possibilities for human-computer interactions and make AI technologies more accessible and interactive.
The potential implications of this development are vast, ranging from education and accessibility to entertainment and beyond. It represents a shift towards more natural and intuitive ways of interacting with technology, where conversation with images becomes as commonplace as texting.