Apify
Posted on November 21, 2023
In this tutorial, we're going to build a custom AI chatbot. Our chatbot is going to work on top of data that will be fed to a large language model (LLM). In other words, we'll be developing a retrieval-augmented chatbot. The main tools we'll use are Streamlit and LangChain.
Streamlit is a tool for the quick creation of web apps. We'll use it to implement the chat interface.
LangChain is a framework that simplifies the building of LLM apps. It mostly acts as the glue between vector databases, LLMs, and your custom code.
We'll split this tutorial into 3 steps:
First, we'll get some data that can be used as context for the LLM.
Second, we'll use Streamlit to create the chat interface.
Lastly, we'll connect everything together using LangChain.
The code is available at https://github.com/apify/chat-with-a-website.
Obtaining the data and saving it in a vector database
First, we want to collect some data. We'll later use this as the context provided to the LLM when chatting. Our example code will use Apify's Website Content Crawler to scrape the selected website and store it in a local vector database.
Let's start by creating a .env file that will contain the website we want to chat with and API tokens for Apify and OpenAI:
OPENAI_API_KEY=your_api_key
APIFY_API_TOKEN=your_api_key
WEBSITE_URL="https://docs.apify.com/platform"
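Before going further, it can help to confirm that all three settings are actually present. Here's a minimal sketch of such a check. The `missing_settings` helper and the `REQUIRED_SETTINGS` tuple are hypothetical names that aren't part of the tutorial's code, and the check only looks at whether a value is non-empty, not whether a token is valid.

```python
import os

# The three settings the tutorial's .env file defines.
REQUIRED_SETTINGS = ('OPENAI_API_KEY', 'APIFY_API_TOKEN', 'WEBSITE_URL')


def missing_settings(env):
    """Return the names of required settings that are absent or blank in `env`."""
    return [name for name in REQUIRED_SETTINGS if not (env.get(name) or '').strip()]


if __name__ == '__main__':
    # Assumes the variables are already in the process environment
    # (the tutorial's scripts load them with python-dotenv's load_dotenv()).
    missing = missing_settings(os.environ)
    if missing:
        print('Missing settings: ' + ', '.join(missing))
    else:
        print('All settings are present.')
```

Passing the environment in as a plain mapping keeps the helper easy to try out with a regular dictionary.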
Next, let's install all the required packages:
pip install apify-client chromadb langchain openai python-dotenv streamlit tiktoken
Our environment's all set, so let's write some Python code!
Let's create a new file called scrape.py. First, we want to import the necessary packages and load our .env file:
import os
from apify_client import ApifyClient
from dotenv import load_dotenv
from langchain.document_loaders import ApifyDatasetLoader
from langchain.document_loaders.base import Document
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
# Load environment variables from a .env file
load_dotenv()
Next, we'll write the main part of the script:
if __name__ == '__main__':
    apify_client = ApifyClient(os.environ.get('APIFY_API_TOKEN'))
    website_url = os.environ.get('WEBSITE_URL')

    # Run the Website Content Crawler Actor and wait for it to finish
    print(f'Extracting data from "{website_url}". Please wait...')
    actor_run_info = apify_client.actor('apify/website-content-crawler').call(
        run_input={'startUrls': [{'url': website_url}]}
    )

    # Load the crawled pages from the run's default dataset
    print('Saving data into the vector database. Please wait...')
    loader = ApifyDatasetLoader(
        dataset_id=actor_run_info['defaultDatasetId'],
        dataset_mapping_function=lambda item: Document(
            page_content=item['text'] or '', metadata={'source': item['url']}
        ),
    )
    documents = loader.load()

    # Split the pages into overlapping chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=100)
    docs = text_splitter.split_documents(documents)

    # Embed the chunks and persist them in the local 'db' directory,
    # which is the same directory chat.py reads from
    embedding = OpenAIEmbeddings()
    vectordb = Chroma.from_documents(
        documents=docs,
        embedding=embedding,
        persist_directory='db',
    )
    vectordb.persist()
    print('All done!')
We run the Website Content Crawler Actor on Apify to scrape the target website, then use the ApifyDatasetLoader that is integrated into LangChain to load the scraped documents. Then, we use the RecursiveCharacterTextSplitter to chunk the documents, and finally, we use OpenAI's embeddings to convert our documents into vectors that get stored in the db directory.
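To get a feel for what chunk_size and chunk_overlap mean, here's a deliberately simplified sketch. It is not how RecursiveCharacterTextSplitter works internally (that splitter tries to break on separators such as paragraphs and newlines). It just cuts fixed-size windows so you can see how consecutive chunks share their edges. The `chunk_text` function is a hypothetical helper for illustration only.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Cut `text` into windows of at most `chunk_size` characters.

    Each window starts `chunk_size - chunk_overlap` characters after the
    previous one, so neighbouring chunks share `chunk_overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError('chunk_size must be positive')
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError('chunk_overlap must be in the range [0, chunk_size)')

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # This window already reached the end of the text.
    return chunks


if __name__ == '__main__':
    for chunk in chunk_text('abcdefghij', chunk_size=4, chunk_overlap=1):
        print(chunk)
```

With the tutorial's values of 1500 and 100, a scheme like this would repeat the last 100 characters of each chunk at the start of the next one, which helps keep a sentence that straddles a boundary retrievable.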
Creating the chat interface
We're going to use Streamlit to create the interface. We'll base it on examples provided at https://github.com/langchain-ai/streamlit-agent.
Let's create a new file called chat.py and start with the imports and some settings:
import os
import streamlit as st
from dotenv import load_dotenv
from langchain.callbacks.base import BaseCallbackHandler
from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.memory import ConversationBufferMemory
from langchain.memory.chat_message_histories import StreamlitChatMessageHistory
from langchain.vectorstores import Chroma
load_dotenv()
website_url = os.environ.get('WEBSITE_URL', 'a website')
st.set_page_config(page_title=f'Chat with {website_url}')
st.title('Chat with a website')
Next, we'll implement some helpers. The get_retriever function will create a retriever based on the data we extracted in the previous step using scrape.py. The StreamHandler class will be used for streaming the model's responses to our application token by token.
@st.cache_resource(ttl='1h')
def get_retriever():
    embeddings = OpenAIEmbeddings()
    vectordb = Chroma(persist_directory='db', embedding_function=embeddings)
    retriever = vectordb.as_retriever(search_type='mmr')
    return retriever


class StreamHandler(BaseCallbackHandler):
    def __init__(self, container: st.delta_generator.DeltaGenerator, initial_text: str = ''):
        self.container = container
        self.text = initial_text

    def on_llm_new_token(self, token: str, **kwargs) -> None:
        self.text += token
        self.container.markdown(self.text)
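If the streaming part feels abstract, here's a minimal sketch of the same accumulate-and-redraw idea with Streamlit and LangChain taken out. `FakeContainer` and `TokenAccumulator` are hypothetical stand-ins: the first records what would be rendered instead of drawing it, and the second mirrors what on_llm_new_token does in StreamHandler.

```python
class FakeContainer:
    """Stand-in for a Streamlit placeholder: remembers every render call."""

    def __init__(self):
        self.renders = []

    def markdown(self, text):
        self.renders.append(text)


class TokenAccumulator:
    """Collects tokens and re-renders the full text after each one."""

    def __init__(self, container, initial_text=''):
        self.container = container
        self.text = initial_text

    def on_new_token(self, token):
        self.text += token
        # Redraw the whole message so far, not just the new token.
        self.container.markdown(self.text)


if __name__ == '__main__':
    container = FakeContainer()
    handler = TokenAccumulator(container)
    for token in ['Hel', 'lo', '!']:
        handler.on_new_token(token)
    print(container.renders)
```

Each call overwrites the placeholder with the full text so far, which is why the answer appears to grow in place in the chat window.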
Finally, let's add the main code. We use the ConversationalRetrievalChain utility provided by LangChain along with OpenAI's gpt-3.5-turbo. The rest of the code sets up the Streamlit chat interface.
retriever = get_retriever()
msgs = StreamlitChatMessageHistory()
memory = ConversationBufferMemory(memory_key='chat_history', chat_memory=msgs, return_messages=True)
llm = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0, streaming=True)
qa_chain = ConversationalRetrievalChain.from_llm(
    llm, retriever=retriever, memory=memory, verbose=False
)

if st.sidebar.button('Clear message history') or len(msgs.messages) == 0:
    msgs.clear()
    msgs.add_ai_message(f'Ask me anything about {website_url}!')

avatars = {'human': 'user', 'ai': 'assistant'}
for msg in msgs.messages:
    st.chat_message(avatars[msg.type]).write(msg.content)

if user_query := st.chat_input(placeholder='Ask me anything!'):
    st.chat_message('user').write(user_query)
    with st.chat_message('assistant'):
        stream_handler = StreamHandler(st.empty())
        response = qa_chain.run(user_query, callbacks=[stream_handler])
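To make the moving parts a bit more concrete, here's a rough sketch of the flow a conversational retrieval chain follows, as we understand it: rephrase a follow-up question using the chat history, retrieve matching chunks, then ask the model to answer from them. This is not LangChain's code. `answer_question`, `llm`, and `retrieve` are hypothetical stand-ins, and the prompts are made up for illustration.

```python
def answer_question(question, history, llm, retrieve):
    """Handle one turn of a retrieval-augmented chat.

    `llm` is any callable mapping a prompt string to a reply string, and
    `retrieve` is any callable mapping a query string to a list of text chunks.
    `history` is a list of (question, answer) pairs and is updated in place.
    """
    if history:
        # Fold earlier turns into a standalone question so retrieval has context.
        transcript = '\n'.join(f'Q: {q}\nA: {a}' for q, a in history)
        standalone = llm(
            f'Chat so far:\n{transcript}\n\nRephrase as a standalone question: {question}'
        )
    else:
        standalone = question

    context = '\n\n'.join(retrieve(standalone))
    reply = llm(f'Answer using only this context:\n{context}\n\nQuestion: {standalone}')
    history.append((question, reply))
    return reply


if __name__ == '__main__':
    # Toy stand-ins so the sketch runs without any API keys.
    def fake_llm(prompt):
        return f'(reply to a {len(prompt)}-character prompt)'

    def fake_retrieve(query):
        return ['Toy chunk about the website.']

    history = []
    print(answer_question('What is this site about?', history, fake_llm, fake_retrieve))
    print(answer_question('Tell me more.', history, fake_llm, fake_retrieve))
```

In the tutorial's code, the retriever from get_retriever and the ChatOpenAI model play the roles of `retrieve` and `llm`, and ConversationBufferMemory keeps the history.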
Connecting everything together
If you've followed along with this tutorial, then by now, you should have three files: .env, scrape.py, and chat.py. Let's take what we've created and use it to chat with a website!
First, run python scrape.py to extract the relevant data from the target website. Note that this step may take a while since the website might be pretty big. You can check the progress at https://console.apify.com/actors/runs.
After the data extraction is done, you can start chatting with the website by running streamlit run chat.py!