Using LLM, Postgres VectorDB, and OpenAI to Perform Semantic Search on PDF Documents

rajeshkumarbehura

Rajesh Kumar

Posted on February 15, 2024

Objective

The goal of this project is to build a prototype app that performs similarity search on PDF documents using vector search in Postgres. The app uses the LangChain LLM framework and OpenAI to generate semantic vectors (embeddings) from the PDF text and compare them.

TechStack

  1. Postgres with vector search (runs in Docker)
  2. OpenAI (you must have or create an OpenAI API key): https://openai.com/
  3. LangChain LLM framework
  4. Python

Source: https://github.com/rajeshkumarbehura/pdf-reader-search

Keywords to learn

LangChain framework, embedding, vector search or vector database, PGVector, document loader

Explanation

PDF is a common document format in organizations, which makes it an interesting format on which to test LLM semantic search.

We used the book "Teach yourself Java in 21 days" for our testing.

This app is a test case for extracting and embedding PDF content and storing it in a database. We use Postgres as the vector database and DBeaver as the database viewer. The PGVector framework class handles the data design and the embedding process.
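To make "comparing embeddings" concrete, here is a toy sketch of the idea behind similarity search. The three-dimensional vectors and the sample sentences are made up for illustration; real OpenAI embeddings have far more dimensions, and in this project the comparison happens inside Postgres rather than in Python. Cosine similarity is only one possible measure and is used here as an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_chunks(query_vector, chunk_vectors):
    """Return chunk texts ordered from most to least similar to the query."""
    return sorted(
        chunk_vectors,
        key=lambda text: cosine_similarity(query_vector, chunk_vectors[text]),
        reverse=True,
    )


# Made-up "embeddings" for two text chunks (illustration only).
chunks = {
    "The ++ operator increments a variable by one.": [0.9, 0.1, 0.0],
    "Applets run inside a web browser.": [0.1, 0.8, 0.3],
}

# Made-up embedding of a question about incrementing.
query_vector = [0.8, 0.2, 0.1]

best_match = rank_chunks(query_vector, chunks)[0]
print(best_match)
```

The chunk whose vector points in nearly the same direction as the query vector ranks first, which is the behavior a vector database provides at scale.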

Understanding the steps

  1. Extracted the PDF content and split it into a list of documents.
  2. Created a database connection string and used it for PGVector (framework class) to handle the creation of embeddings and push them into the database. If the table did not exist, it created it automatically.
  3. Loaded the documents and their embeddings into the database.
  4. Used the DBeaver tool (https://dbeaver.io/) to view the data and tables.

  5. Wrote a question for testing and ran it as ask_question(query="What is Incrementing and Decrementing ?")
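The steps above can be sketched roughly as follows. This is not the code from the repository: the function signatures, the default database name, the collection name "java_book", and the chunk sizes are all placeholders chosen for illustration. The import paths are the ones LangChain used around early 2024 and may have moved in newer releases; the PDF loader also needs the pypdf package installed.

```python
from urllib.parse import quote_plus


def build_connection_string(user, password, host="localhost", port=5432,
                            database="vectordb"):
    """Build a SQLAlchemy-style URL for the psycopg2 driver.

    All default values are placeholders, not the repository's settings.
    """
    # quote_plus escapes characters such as "@" that would break the URL.
    return (
        f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


def load_pdf_chunks(pdf_path, chunk_size=1000, chunk_overlap=100):
    """Step 1: read the PDF and split it into a list of documents."""
    # Imported lazily so the helper above works without LangChain installed.
    from langchain_community.document_loaders import PyPDFLoader
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    pages = PyPDFLoader(pdf_path).load()
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )
    return splitter.split_documents(pages)


def store_chunks(chunks, connection_string, openai_key,
                 collection_name="java_book"):
    """Steps 2 and 3: embed the chunks and push them into Postgres."""
    from langchain_community.vectorstores.pgvector import PGVector
    from langchain_openai import OpenAIEmbeddings

    embeddings = OpenAIEmbeddings(openai_api_key=openai_key)
    return PGVector.from_documents(
        documents=chunks,
        embedding=embeddings,
        collection_name=collection_name,
        connection_string=connection_string,
    )


def ask_question(query, connection_string, openai_key,
                 collection_name="java_book", k=4):
    """Step 5: return the k stored chunks closest to the question.

    Hypothetical re-creation; the repository's ask_question takes only query.
    """
    from langchain_community.vectorstores.pgvector import PGVector
    from langchain_openai import OpenAIEmbeddings

    store = PGVector(
        collection_name=collection_name,
        connection_string=connection_string,
        embedding_function=OpenAIEmbeddings(openai_api_key=openai_key),
    )
    return store.similarity_search(query, k=k)
```

With the database running, a call sequence might look like load_pdf_chunks("book.pdf"), then store_chunks(...), then ask_question(query="What is Incrementing and Decrementing ?", ...), where "book.pdf" stands in for whatever PDF you want to index.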

Note: For more search options, check out the LangChain documentation on the different vector store methods, such as:
similarity_search
search
similarity_search_with_score
Experiment with these functions to get a better understanding of how they differ.
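As a starting point for such experiments, the sketch below prints each hit next to its score. It assumes only that the store exposes similarity_search_with_score(query, k=...) returning (document, score) pairs, which is how LangChain vector stores behave. FakeStore, FakeDocument, and the scores are invented so the example runs without a database; with a real PGVector store you would pass that object instead.

```python
from dataclasses import dataclass


@dataclass
class FakeDocument:
    """Stand-in for a LangChain Document; only page_content is used here."""
    page_content: str


class FakeStore:
    """Stand-in for a vector store so the sketch runs without a database."""

    def similarity_search_with_score(self, query, k=4):
        # Invented hits and scores, already ordered best first.
        hits = [
            (FakeDocument("The ++ operator adds one\nto a variable."), 0.12),
            (FakeDocument("The -- operator subtracts one."), 0.15),
            (FakeDocument("Applets run in a browser."), 0.48),
        ]
        return hits[:k]


def format_scored_results(store, query, k=2, preview_chars=60):
    """Return one 'score  preview' line per hit, in the store's order."""
    lines = []
    for doc, score in store.similarity_search_with_score(query, k=k):
        # Flatten line breaks so each hit stays on a single line.
        preview = doc.page_content[:preview_chars].replace("\n", " ")
        lines.append(f"{score:.2f}  {preview}")
    return lines


report = format_scored_results(
    FakeStore(), "What is Incrementing and Decrementing ?"
)
for line in report:
    print(line)
```

With PGVector the score is, as far as we can tell, a distance, so a lower value means a closer match; it is worth confirming this against the documentation for the version you use. The search method takes an extra search_type argument (for example "similarity") and returns documents without scores.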

Execution



  1. Run postgess.yml using the docker-compose command.
  2. Update openai_key in the Reader.py file.
  3. Run the main function either to extract the PDF file or to run a search query.
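If you would rather not save the key inside a source file, one alternative is to read it from an environment variable. The helper below is a hypothetical addition, not part of the repository; OPENAI_API_KEY is the variable name the OpenAI client libraries look for by default.

```python
import os


def get_openai_key(env_var="OPENAI_API_KEY"):
    """Return the API key from the environment, or fail with a clear message."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running."
        )
    return key
```

In Reader.py the assignment could then become openai_key = get_openai_key(), assuming the variable is exported in the shell that runs the script.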




Reference

https://www.youtube.com/watch?v=zxo3T4aQj6Q&t=1224s
https://github.com/pgvector/pgvector
https://python.langchain.com/docs/integrations/vectorstores/pgvector
https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf
https://python.langchain.com/docs/integrations/vectorstores/pgembedding
