Querying News Articles via a Streamlit App Using OpenAI, LangChain, and Qdrant DB

Introduction

Chatbots integrated into news querying serve various crucial purposes. They offer a convenient and conversational approach for users to access news, eliminating the need to navigate websites or apps. Users can simply ask a chatbot for news on a specific topic or event, making information more accessible, especially for those who find traditional methods challenging.

Personalization is a key feature of chatbots in the news industry. By learning from users' past queries and interactions, chatbots can present more relevant and personalized news articles, enhancing the overall user experience. This tailoring of content ensures that users receive information aligned with their preferences.

Time efficiency is another significant advantage. Chatbots quickly sift through vast amounts of information, presenting users with the most relevant news articles. This time-saving aspect is particularly beneficial for users who would otherwise have to manually search and filter through numerous sources.

Moreover, chatbots contribute to interactive news consumption. They engage users in a conversation, answering follow-up questions and providing additional context or related information. This interactive approach adds depth to the news-reading experience, surpassing the passive nature of traditional methods.

Information overload is a common issue in the digital age, and chatbots help mitigate it by filtering out noise. They deliver news that is most relevant to the user's interests and needs, streamlining the consumption process and enhancing user satisfaction.

Visually impaired users benefit significantly from chatbots, especially when integrated with voice technology. This combination provides an invaluable audio-based method for accessing news, promoting inclusivity in news consumption.

Integration into commonly used platforms, such as messaging apps, enhances user convenience. Users can receive news updates in the same environment where they communicate with others, streamlining their digital experience.

Automated updates and alerts are a proactive feature of chatbots. Programmed to send timely news updates or alerts about breaking news, chatbots ensure that users stay informed in real time, contributing to a more connected and aware user base.

Language and regional customization further extend the accessibility of news. Chatbots can be designed to deliver news in multiple languages and tailor content to regional or local interests, catering to diverse demographics and preferences.

In summary, chatbots in the news industry elevate user experience through convenient, personalized, and interactive access to news. They address various challenges in traditional news consumption methods while catering to a diverse range of user needs and preferences.

In this article, we’ll design a RAG pipeline using OpenAI, LangChain, and Qdrant DB and wrap it in a user interface built with Streamlit.

What Is RAG

"RAG" stands for "Retrieval-Augmented Generation." It's a technique used in natural language processing and machine learning, particularly in applications built on large language models, such as chatbots. In a RAG setup, when a query is input (like a question or a prompt), the retrieval system first searches its database for relevant information or documents. This information is then passed to the generative model, which synthesizes it into a coherent and contextually appropriate response.

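The retrieve-then-generate flow described above can be sketched in a few lines of plain Python. This is only a toy illustration under stated assumptions: the corpus, the bag-of-words "embedding", and the helper names (bag_of_words, retrieve, build_prompt) are all hypothetical stand-ins, and the final call to a language model is left out. A real pipeline would use learned embeddings and a vector database instead.

```python
import math
from collections import Counter

# Toy corpus standing in for the document store (made-up sentences).
DOCUMENTS = [
    "The central bank raised interest rates by a quarter point on Tuesday.",
    "The national football team won the championship final on Sunday.",
    "A new smartphone model was unveiled at the technology conference.",
]


def bag_of_words(text):
    """Lowercased word counts; a crude stand-in for a learned embedding."""
    cleaned = text.lower().replace(".", "").replace("?", "")
    return Counter(cleaned.split())


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, documents, k=1):
    """Retrieval step: return the k documents most similar to the query."""
    query_vec = bag_of_words(query)
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_vec, bag_of_words(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, context_docs):
    """Augmentation step: place the retrieved text in front of the question."""
    context = "\n".join(context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


query = "What happened to interest rates?"
context_docs = retrieve(query, DOCUMENTS)
prompt = build_prompt(query, context_docs)
print(prompt)  # this prompt is what would be sent to the generative model
```

The generation step is simply the language model answering the assembled prompt, which is the part the application later in this article delegates to LangChain and OpenAI.
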
A Brief Note on the Components

LangChain: LangChain is an open-source framework designed to simplify the creation of applications using large language models (LLMs). Its use cases largely overlap with those of language models in general, including document analysis and summarization, chatbots, and code analysis. LangChain lets developers connect LLMs to other data sources, interact with their environment, and build complex applications. It is available for Python and JavaScript and supports a variety of language models, including GPT-3, LLaMA, models hosted on Hugging Face, and Jurassic-1 Jumbo.

Qdrant: Qdrant is an open-source vector similarity search engine and vector database written in Rust. It provides a production-ready service with a convenient API to store, search, and manage points, which are vectors with an additional payload. Qdrant's extended filtering support makes it useful for neural network or semantic-based matching, faceted search, and other applications.
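
To make the idea of "points" and payload filtering more concrete, here is a small conceptual sketch in plain Python. It does not use Qdrant's API: the points list, the search function, and its must_match parameter are hypothetical names chosen for illustration, and the vectors are made up. It only shows the general pattern of filtering on payload fields and then ranking the remaining vectors by similarity.

```python
import math

# Each "point" pairs a vector with a payload of metadata (all values are made up).
points = [
    {"id": 1, "vector": [0.9, 0.1, 0.0], "payload": {"topic": "economy", "lang": "en"}},
    {"id": 2, "vector": [0.8, 0.2, 0.1], "payload": {"topic": "economy", "lang": "de"}},
    {"id": 3, "vector": [0.0, 0.1, 0.9], "payload": {"topic": "sports", "lang": "en"}},
]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_vector, points, must_match=None, limit=2):
    """Keep points whose payload matches every filter, then rank by similarity."""
    must_match = must_match or {}
    candidates = [
        point
        for point in points
        if all(point["payload"].get(key) == value for key, value in must_match.items())
    ]
    ranked = sorted(
        candidates,
        key=lambda point: cosine(query_vector, point["vector"]),
        reverse=True,
    )
    return ranked[:limit]


hits = search([1.0, 0.0, 0.0], points, must_match={"lang": "en"})
print([point["id"] for point in hits])
```

In this sketch the German-language point is excluded by the payload filter even though its vector is close to the query, which is the kind of combined metadata and similarity search a vector database is built to do efficiently at scale.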

Setting Up the Environment & Code

First, in your directory, create a requirements.txt file with the following content:

langchain
streamlit
requests
openai
qdrant-client
tiktoken

Then run the command to install these dependencies:

pip install -r requirements.txt

Now create a file named app.py and paste the following code into it. The comments explain what each part does:

#importing the needed libraries
import streamlit as st
import requests
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Qdrant
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
import os

#function to fetch text data from the links of news websites
def fetch_article_content(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
    }
    try:
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        return response.text
    except requests.RequestException as e:
        st.error(f"Error fetching {url}: {e}")
        return ""

#function to collate all the text from the news website into a single string
def process_links(links):
    all_contents = ""
    for link in links:
        link = link.strip()
        if not link:
            continue  # skip blank lines in the uploaded file
        content = fetch_article_content(link)
        all_contents += content + "\n\n"
    return all_contents

#function to chunk the articles before creating vector embeddings
def get_text_chunks_langchain(text):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    texts = text_splitter.split_text(text)
    return texts

#creating the streamlit app

def main():
    st.title('News Article Fetcher')

    # Initialize state variables
    if 'articles_fetched' not in st.session_state:
        st.session_state.articles_fetched = False
    if 'chat_history' not in st.session_state:
        st.session_state.chat_history = ""

    # Model selection
    model_choice = st.radio("Choose your model", ["GPT 3.5", "GPT 4"], key= "model_choice")
    model = "gpt-3.5-turbo-1106" if st.session_state.model_choice == "GPT 3.5" else "gpt-4-1106-preview"

    #API_KEY
    API_KEY = st.text_input("Enter your OpenAI API key", type="password", key= "API_KEY")

    # Ensure API_KEY is set before proceeding
    if not API_KEY:
        st.warning("Please enter your OpenAI API key.")
        st.stop()

    #asking user to upload a text file with links to news articles (1 link per line)
    uploaded_file = st.file_uploader("Upload a file with links", type="txt")

    # Read the file into a list of links
    if uploaded_file:
        stringio = uploaded_file.getvalue().decode("utf-8")
        links = stringio.splitlines()

    # Fetch the articles' content
    if st.button("Fetch Articles") and uploaded_file:
        progress_bar = st.progress(0)
        with st.spinner('Fetching articles...'):
            article_contents = process_links(links)
            progress_bar.progress(0.25)  # Update progress to 25%

            #Process the article contents
            texts = get_text_chunks_langchain(article_contents)
            progress_bar.progress(0.5)  # Update progress to 50%

            #storing the chunked articles as embeddings in Qdrant
            os.environ["OPENAI_API_KEY"] =  st.session_state.API_KEY
            embeddings = OpenAIEmbeddings()
            vector_store = Qdrant.from_texts(texts, embeddings, location=":memory:",)
            retriever = vector_store.as_retriever()
            progress_bar.progress(0.75)  # Update progress to 75%

            #Creating a QA chain against the vectorstore
            llm = ChatOpenAI(model_name= model)
            # Rebuild the chain on every fetch so it always points at the newest articles
            st.session_state.qa = RetrievalQA.from_llm(llm=llm, retriever=retriever)
            progress_bar.progress(1)

            st.success('Articles fetched successfully!')
            st.session_state.articles_fetched = True

    #once articles are fetched, take input for user query

    if 'articles_fetched' in st.session_state and st.session_state.articles_fetched:

        query = st.text_input("Enter your query here:", key="query")

        if query:
            # Process the query using your QA model (assuming it's already set up)
            with st.spinner('Analyzing query...'):
                qa = st.session_state.qa
                response = qa.run(st.session_state.query)  
            # Update chat history
            st.session_state.chat_history += f"> {st.session_state.query}\n{response}\n\n"

        # Display conversation history
        st.text_area("Conversation:", st.session_state.chat_history, height=1000, key="conversation_area")
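        # Attempt to scroll the text area to the bottom. Streamlit does not normally
        # execute <script> tags passed to st.markdown, and a widget key is not a DOM id,
        # so this snippet may have no visible effect.
        st.markdown(
            "<script>document.getElementById('conversation_area').scrollTop = document.getElementById('conversation_area').scrollHeight;</script>",
            unsafe_allow_html=True
        )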

if __name__ == "__main__":
    main()
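
The chunking step in get_text_chunks_langchain uses a chunk size of 1000 characters with an overlap of 100. As a rough illustration of what overlap means, the sketch below splits a string into fixed-width windows that share their boundary characters. Note that this is a simplification: RecursiveCharacterTextSplitter first tries to split on separators such as paragraph breaks, newlines, and spaces, so its chunks will generally differ from these. The function name chunk_with_overlap is hypothetical.

```python
def chunk_with_overlap(text, chunk_size=1000, chunk_overlap=100):
    """Split text into fixed-width chunks where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


# Small sizes make the overlap easy to see.
sample = "".join(str(i % 10) for i in range(25))
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

The overlap means a sentence that straddles a chunk boundary still appears mostly intact in at least one chunk, which tends to help retrieval at the cost of storing some text twice.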

Then save the app.py and run the following command in your terminal:

streamlit run app.py

This launches your application on localhost. By default, Streamlit serves it on port 8501.
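
One thing to keep in mind when trying the app: fetch_article_content returns response.text, which for most news sites is raw HTML, so markup, scripts, and styles are chunked and embedded along with the article text. A possible refinement, sketched here with only the standard library, is to strip tags before chunking. The class and function names (VisibleTextExtractor, html_to_text) are hypothetical, and this is a minimal approach; a dedicated HTML parsing or article extraction library would likely handle real pages more robustly.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes while ignoring the contents of script and style tags."""

    SKIPPED_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than zero while inside a skipped tag
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped tags.
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self):
        return " ".join(self._parts)


def html_to_text(html):
    """Return the visible text of an HTML document as a single string."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample_html = (
    "<html><head><style>p {color: red;}</style></head>"
    "<body><h1>Rates rise</h1><script>track();</script>"
    "<p>The bank raised rates.</p></body></html>"
)
print(html_to_text(sample_html))
```

In the app, a function like this could be applied to the return value of fetch_article_content before the text is passed to the splitter.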

Here’s what the UI of the news article fetcher application looks like:

Conclusion

In conclusion, integrating chatbots into news querying addresses several challenges of traditional news consumption and improves the user experience by providing convenient, personalized, and interactive access to information. The RAG pipeline discussed here, which combines OpenAI, LangChain, and Qdrant DB behind a Streamlit-based user interface, streamlines the process of fetching and querying news articles. It also shows the potential of AI-driven systems to deliver tailored content, mitigate information overload, and support inclusivity for visually impaired users. The code above serves as a practical starting point for developers interested in building chatbot applications for news retrieval, demonstrating how language models, vector similarity search engines, and a simple UI fit together.

