Querying PDFs with AI: A Beginner’s Guide to Using LangChain, FAISS, and OpenAI

Ravjot Singh
5 min read · Aug 27, 2024


Discover how ChatGPT can make finding info in PDFs as simple as asking a question!

This blog walks you through a project where we build an intelligent system to answer questions from PDF documents using AI. Even if you’re new to AI or programming, I’ll guide you step-by-step on how each part of the code works. By the end of this blog, you’ll understand how to set up a system that can search through documents and find answers to questions automatically.

1. Importing the Necessary Libraries

The following libraries give us the building blocks to read, break down, and search the text in our PDF. Each one plays a crucial role in the process.

from PyPDF2 import PdfReader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
  • PyPDF2: This library lets us read and extract text from PDF files. Since we want to pull information from a PDF, we need this tool to first get the text out.
  • OpenAIEmbeddings: This creates “embeddings,” which are numerical representations of text. Think of embeddings as a way to convert words into numbers that a computer can understand. These numbers help compare how similar different pieces of text are.
  • CharacterTextSplitter: Long pieces of text are hard to manage, so this tool splits the text into smaller parts. By splitting the text, we make it easier for the AI to process and find relevant answers.
  • FAISS: This is a powerful search tool created by Facebook AI. It helps us quickly find the most relevant text parts based on a question.

2. Setting up the OpenAI API Key

import os
os.environ["OPENAI_API_KEY"] = ""

Without the API key, we wouldn’t be able to use OpenAI’s language model, which is the brain behind answering questions.

  • os.environ: This command sets an environment variable. Here, we’re telling our code what the OpenAI API key is. The API key is like a password that lets you use OpenAI’s tools (like ChatGPT or other models).

3. Loading the PDF and Extracting Raw Text

Before we can analyze or search the text, we first need to get it out of the PDF. This code does exactly that, turning the PDF into something we can work with.

pdfreader = PdfReader('sample_report-pages.pdf')
raw_text = ''
for i, page in enumerate(pdfreader.pages):
    content = page.extract_text()
    if content:
        raw_text += content
  • PdfReader: We use this to open the PDF file and read its contents.
  • for i, page in enumerate(pdfreader.pages): This loop goes through each page in the PDF one by one.
  • content = page.extract_text(): For each page, it pulls out the text (if there is any).
  • raw_text += content: All the text from each page is combined into a single variable called raw_text.

4. Splitting the Text into Smaller Parts

Splitting the text helps us process large documents more effectively. If the text were left as one long piece, the AI could struggle to handle it. By breaking it down, we make it easier for the AI to understand and find answers.

text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=800,
    chunk_overlap=200,
    length_function=len
)
text = text_splitter.split_text(raw_text)
  • CharacterTextSplitter: This breaks the long text into smaller “chunks” that are easier to manage.
  • separator="\n": This tells the splitter to break the text at new lines (\n).
  • chunk_size=800: Each chunk will have a maximum of 800 characters. This size is chosen to give enough context to the AI without making the chunks too large.
  • chunk_overlap=200: This adds some overlap between chunks, ensuring that important information isn’t missed if it’s split across chunks.
  • split_text(raw_text): The split_text method applies the splitting to our extracted text, storing the results in a list called text.
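To see how chunk_size and chunk_overlap interact, here is a deliberately simplified splitter. It is not LangChain's actual algorithm (CharacterTextSplitter splits on the separator first and then merges pieces), but it shows the core idea: each new chunk starts before the previous one ends, so no sentence is lost at a boundary.

```python
def split_with_overlap(s: str, chunk_size: int = 800, chunk_overlap: int = 200) -> list[str]:
    """Naive fixed-width splitter: each chunk starts (chunk_size - chunk_overlap)
    characters after the previous one, so neighbouring chunks share
    chunk_overlap characters."""
    step = chunk_size - chunk_overlap
    return [s[i:i + chunk_size] for i in range(0, max(len(s) - chunk_overlap, 1), step)]

# A 2,000-character string becomes three 800-character chunks,
# with each pair of neighbours sharing 200 characters.
sample = "".join(str(i % 10) for i in range(2000))
chunks = split_with_overlap(sample, chunk_size=800, chunk_overlap=200)
```

The overlap is what keeps an answer intact even when it happens to straddle a chunk boundary.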

5. Generating Text Embeddings using OpenAI

Embeddings are the backbone of our search system. By turning text into numbers, we can compare it mathematically and find the best match for our query.

embeddings = OpenAIEmbeddings()
  • OpenAIEmbeddings: This creates a model that converts text into embeddings (numerical representations). Embeddings allow us to compare how similar two pieces of text are.
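"How similar two pieces of text are" is usually measured with cosine similarity between their embedding vectors. The sketch below uses made-up 3-dimensional vectors purely for illustration; real OpenAI embeddings have on the order of 1,500 dimensions, but the comparison works the same way.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" -- invented here for illustration only.
cat = [0.9, 0.1, 0.0]
kitten = [0.85, 0.15, 0.05]
car = [0.1, 0.0, 0.9]

# Texts with related meanings get vectors pointing in similar directions,
# so cosine_similarity(cat, kitten) comes out higher than cosine_similarity(cat, car).
```

This is exactly the kind of comparison FAISS performs at scale in the next step.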

6. Storing the Embeddings using FAISS

FAISS allows us to quickly search through large amounts of text and find the parts that are most relevant to our query. It’s like building a mini Google search engine just for our PDF!

document_search = FAISS.from_texts(text, embeddings)
  • FAISS.from_texts: This command takes all the chunks of text and their embeddings and stores them in a searchable database.

7. Setting up the LangChain Pipeline for Question Answering

The LangChain pipeline is where everything comes together. It takes the relevant text chunks and processes them with OpenAI to generate the answer to your question.

from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

chain = load_qa_chain(OpenAI(), chain_type='stuff')
  • load_qa_chain: This loads a pre-built “chain” for question-answering. A chain is a series of steps that LangChain follows to process the text and answer questions.
  • OpenAI(): This initializes the OpenAI model that will generate the answers.
  • chain_type='stuff': This option tells the chain to "stuff" all the retrieved text chunks into a single prompt for the model, rather than summarizing or processing them one at a time.

8. Asking a Question to the PDF

This step is like searching a document for keywords, but much smarter. Instead of just matching words, it considers the meaning and context of your query.

query = "The first six and half floors of the ISB are designed for"
docs = document_search.similarity_search(query)
  • query: This is the question you want to ask the PDF.
  • document_search.similarity_search(query): This searches through the text chunks using FAISS and returns the most relevant ones (the top four by default) based on the query.
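Conceptually, similarity_search embeds the query and returns the chunks whose vectors lie closest to it. The brute-force sketch below shows that idea with toy 2-dimensional vectors; a FAISS flat index does the same distance ranking, just with heavy optimizations, and the vectors and k value here are invented for illustration.

```python
import math

def nearest_chunks(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 2) -> list[int]:
    """Rank chunk vectors by Euclidean distance to the query (closest first)
    and return the indices of the top k -- a brute-force version of what a
    flat L2 index does."""
    def dist(v: list[float]) -> float:
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(query_vec, v)))
    ranked = sorted(range(len(chunk_vecs)), key=lambda i: dist(chunk_vecs[i]))
    return ranked[:k]

# Toy embeddings for three chunks; the query sits closest to chunk 0, then chunk 2.
chunk_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.2]]
query_vector = [0.95, 0.05]
top = nearest_chunks(query_vector, chunk_vectors, k=2)
```

Those top-ranked chunks are what get handed to the language model in the next step.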

9. Generating the Answer with the LangChain Pipeline

This is where the magic happens!

The AI reads through the relevant text and generates an answer based on it. The output is a concise response to your question, directly pulled from the document.

chain.run(input_documents=docs, question=query)
  • chain.run: This method runs the question-answering chain.
  • input_documents=docs: It takes the relevant text chunks found in the previous step as input.
  • question=query: The query is passed as the question the AI needs to answer.

Conclusion and Further Enhancements

By following these steps, you’ve built a basic but effective PDF query system. This setup allows you to search and extract answers from large documents in a way that’s far more intelligent than simple keyword search. Potential improvements include refining the text splitting method, experimenting with different prompt templates, or integrating this tool into a web app.

Final Thoughts

This project is a practical introduction to using AI for document analysis. By leveraging tools like LangChain, FAISS, and OpenAI, you can build powerful systems to automate tasks like document review, research, and content extraction.

Outputs

[Image: AI-powered PDF Query System in Action — the front-end interface where users can input queries and retrieve relevant answers from PDF documents.]

This brings us to the end of this article. I hope you found this guide useful and that it helps you in building your own AI-powered PDF Query system. Understanding the integration of tools like LangChain, FAISS, and OpenAI is crucial to creating efficient document search solutions.

Remember, practice is key to mastering these concepts and implementing them effectively in real-world scenarios.

If you’re interested in more resources related to Data Science, Machine Learning, and AI-driven projects, feel free to explore my GitHub account.

Let’s connect on LinkedIn — Ravjot Singh.
