Rapid Q&A on multiple PDFs using LangChain and ChromaDB as a local disk vector store

Diptiman Raichaudhuri
11 min read · Sep 26, 2023

Disclosure: All opinions expressed in this article are my own, and represent no one but myself and not those of my current or any previous employers.

A lot of content has been written on Q&A over PDFs using LLM chat agents. This is my turn!

In this post, I use chromadb as my local disk-based vector store, where I store the embeddings of the text extracted from PDF files. Then I build a rapid prototype using Streamlit.

I took this dataset, which is a set of unfilled clinical consent forms for various medical procedures like bronchoscopy, colonoscopy, etc. Ideally, in a hospital, these forms would be filled in by patients or their families, and an OCR version would be kept for records.

Here’s a subset of those consent files, where I picked only the ones whose text has already been OCR'd and which contain no images. Otherwise, I would have needed to OCR those images using Tesseract, poppler, etc.

To run my program as a quick web application, I started with a new file in PyCharm Community, “doc_finder.py”. Here’s the file on GitHub.

I created a new project in PyCharm and installed the following dependencies:

pip install chromadb langchain openai pypdf tiktoken streamlit python-dotenv

ChromaDB as my local disk based vector store for word embeddings

LangChain as my LLM framework

python-dotenv to load my API keys

Streamlit as the web runner, and so on.

The imports:

import os
from dotenv import load_dotenv

import streamlit as st

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.chains.question_answering import load_qa_chain
from langchain.chat_models import ChatOpenAI
from langchain.vectorstores import Chroma
import chromadb

# Load the API keys from the .env file into the environment
load_dotenv()

Next, I created a “.env” file containing the API key information, something like this:

HUGGINGFACEHUB_API_TOKEN="<YOUR_KEY>"
PINECONE_API_KEY="<YOUR_KEY>"
PINECONE_ENV="<YOUR_ENVIRONMENT>"
SENTENCE_TRANSFORMERS_HOME="D:\\testing_space\\sentence_transformers_home"
OPENAI_API_KEY="<YOUR_KEY>"
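
Of these entries, the code in this post only needs OPENAI_API_KEY. A missing key tends to surface late, as an authentication error on the first query, so a small fail-fast check can help. The sketch below is only an illustration: require_env is a hypothetical helper (not part of python-dotenv or LangChain), and it assumes load_dotenv() has already been called so the values from ".env" are in the environment.

```python
import os


def require_env(name, env=None):
    """Return the value of an environment variable or fail with a clear message."""
    # Hypothetical helper; an explicit mapping can be passed in for testing
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    # Treat the placeholder from the sample .env as "not set"
    if not value or value == "<YOUR_KEY>":
        raise RuntimeError(f"{name} is not set; add it to .env or export it first")
    return value


# Example with an explicit mapping instead of the real environment
print(require_env("OPENAI_API_KEY", {"OPENAI_API_KEY": "sk-example"}))
```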

Next, I loaded the subset of patient consent forms from my local folder, created a vector store using chromadb, and stored the text of those PDFs as embeddings locally on disk:

def load_chunk_persist_pdf() -> Chroma:
    pdf_folder_path = "D:\\diptiman\\dataset\\consent_forms_cleaned"
    # Load every page of every PDF in the folder as a LangChain document
    documents = []
    for file in os.listdir(pdf_folder_path):
        if file.endswith('.pdf'):
            pdf_path = os.path.join(pdf_folder_path, file)
            loader = PyPDFLoader(pdf_path)
            documents.extend(loader.load())
    # Split the pages into chunks before embedding them
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=10)
    chunked_documents = text_splitter.split_documents(documents)
    # Embed the chunks and store them in a named collection on local disk.
    # The collection name is passed here directly: a collection created on a
    # separate in-memory chromadb.Client() would not be used by this store.
    vectordb = Chroma.from_documents(
        documents=chunked_documents,
        embedding=OpenAIEmbeddings(),
        collection_name="consent_collection",
        persist_directory="D:\\testing_space\\chroma_store\\"
    )
    vectordb.persist()
    return vectordb

I created a chromadb collection called “consent_collection”, which was persisted on my local disk. Searches over the PDFs are served from this chromadb embeddings store.

Next, I created an LLM QA chain to run Q&A over the embeddings stored in the vector store and answer questions:
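
To make the chunk_size and chunk_overlap parameters more concrete, here is a plain-Python sketch of fixed-size chunking with overlap. It is not how CharacterTextSplitter is implemented (that splitter first splits on a separator and then merges the pieces, so its chunks can come out differently); chunk_text is a hypothetical name used only for illustration.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=10):
    """Cut text into windows of chunk_size characters sharing chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "x" * 2500
print([len(piece) for piece in chunk_text(sample)])  # [1000, 1000, 520]
```

Each chunk starts 990 characters after the previous one, so neighbouring chunks share their last and first 10 characters.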

def create_agent_chain():
    model_name = "gpt-3.5-turbo"
    llm = ChatOpenAI(model_name=model_name)
    chain = load_qa_chain(llm, chain_type="stuff")
    return chain

I have used “gpt-3.5-turbo”; HuggingFace OSS chat/instruct models could be used in the same way.

Finally, I served the answer from the LLM chain and created a quick Streamlit prototype:
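
The “stuff” chain type simply puts all the retrieved chunks into a single prompt. The sketch below imitates the prompt layout that shows up in the verbose log later in this post; it is an illustration, not LangChain's actual code. build_stuff_prompt is a hypothetical name, and joining the chunks with a blank line is an assumption.

```python
SYSTEM_TEMPLATE = (
    "Use the following pieces of context to answer the users question.\n"
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n"
    "----------------\n"
    "{context}"
)


def build_stuff_prompt(chunks, question):
    """Stuff every retrieved chunk into one system message, followed by the question."""
    context = "\n\n".join(chunks)  # assumed separator between chunks
    return [
        ("system", SYSTEM_TEMPLATE.format(context=context)),
        ("human", question),
    ]


messages = build_stuff_prompt(
    ["Consent for procedure A ...", "Consent for procedure B ..."],
    "What is anesthesia consent?",
)
for role, content in messages:
    print(f"{role}: {content}\n")
```

Because everything goes into one prompt, this approach only works while the retrieved chunks fit into the model's context window.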

def get_llm_response(query):
    # Note: this re-reads, re-chunks and re-embeds the PDFs on every query
    vectordb = load_chunk_persist_pdf()
    chain = create_agent_chain()
    # Retrieve the chunks most similar to the query and pass them to the chain
    matching_docs = vectordb.similarity_search(query)
    answer = chain.run(input_documents=matching_docs, question=query)
    return answer


# Streamlit UI
# ===============
st.set_page_config(page_title="Doc Searcher", page_icon=":robot:")
st.header("Query PDF Source")

form_input = st.text_input('Enter Query')
submit = st.button("Generate")

if submit:
    st.write(get_llm_response(form_input))
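
Conceptually, similarity_search embeds the query and returns the stored chunks whose vectors are closest to it. Here is a brute-force sketch of that idea using cosine similarity on tiny made-up vectors. Real embeddings have far more dimensions, and Chroma's own index and distance metric may differ; cosine_similarity, top_k and the sample vectors are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=4):
    """Return the k (text, score) pairs whose vectors are closest to query_vec."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up 3-dimensional "embeddings", purely for illustration
store = [
    ("anesthesia consent", [0.9, 0.1, 0.0]),
    ("colonoscopy consent", [0.1, 0.9, 0.0]),
    ("bronchoscopy consent", [0.0, 0.2, 0.9]),
]
for text, score in top_k([1.0, 0.0, 0.0], store, k=2):
    print(f"{score:.3f}  {text}")
```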

Running the app requires executing the command:

streamlit run <FILE_NAME>.py
[Screenshot: Streamlit Run]

The app opened in my default browser:

[Screenshot: Streamlit App]

I entered “What is anesthesia consent?” and got a reply.

Pretty impressive! All of the search is served from the embeddings stored on my local disk in chromadb!

I then changed my create_agent_chain() function to include “verbose=True”:

def create_agent_chain():
    model_name = "gpt-3.5-turbo"
    llm = ChatOpenAI(model_name=model_name)
    chain = load_qa_chain(llm, chain_type="stuff", verbose=True)
    return chain

I tried another question: “what is Esophagogastroduodenscopy consent ?”

And got the following chain Q&A log:

> Entering new LLMChain chain...
Prompt after formatting:
System: Use the following pieces of context to answer the users question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
----------------
CHAMBERSBURG ENDOSCOPY CENTER, LLC
835 Fifth Avenue, Chambersburg, PA 17201

Consent for Esophagogastroduodenscopy
(Do not sign without reading)
A.M.
Name: _____________________________________Date: ________________________Time: ________________ P.M.

______________________________________will be performing an esophagogastroduodenoscopy under sedation/analgesia
Name of Physician
and/or total IV anesthesia with possible biopsies , treatment of bleeding source, and/or dilatation of stricture if indicated.


____ It has been explained to me how the procedure will be don e and what to expect.
Yes

____ It has been explained why this procedure was recommended to me.
Yes

____ It has been explained the available alternatives and choices to this procedure, as well as the risks of not havi ng the procedure
Yes done. These include but are not limited to: Upper GI x -ray study. I understand in the event of a stricture that there are no
alternatives for treatment and my symptoms will likely continue if a dilatation is not performed I have been fully informed
in general terms of the risks, benefits, and alternatives associated with having the procedure at Chambersburg Endoscopy
Center instead of a hospital.

____ I have been provided an explanation of t he known and recognized risks to this procedure . Specific risks include but are not
Yes limited to: risk of aspiration pneumonia , bleeding, and perforation (uncommon risk), heart attack, breathing difficulties or
medication reactio n (rare complications.).
____ I have had an opportunity to ask questions.
Yes

____ I have had my questions answered and I believe I have all the information I need to fully agree to this procedure.
Yes

____ I have read and understand this form and truthfully answered all questions.
Yes

____ I consent to the procedure.
Yes

____
Patient Self -Determination Act of 1990/Advance Directives : Chambersburg Endoscopy Center has made available to me
Yes written information on my rights and responsibilities to make health care treatment decisions in compliance with th e Patient
Self-Determination Act of 1990. I also understand that I am consenting to have an elective procedure p erformed upon me at
this facility . If I have an Advance Directive and an untoward event occurs, Chambersburg Endoscopy center will stabilize
and transfer me to Chambersburg Hospital.

____ Personal Valuables : Chambersburg Endoscopy Center provides facilities for the safekeeping of any valuables and any
Yes valuables kept by the patient are kept at the patient’s risk. I hereby accept full responsibility for any personal effects taken to
the procedure room, including such things as dentures, eyeglasses, contact lenses, hearing aids.

_________________________________________ __________________________________________
Signature of patient or person authorized to consent for patient Signature of Witness

I certify th at the patient/parent/guardian or other legally responsible person has been provided information on the risks and hazards,
Benefits and alternatives to treatments outlined above, had questions within my area of expertise answered and has given cons ent.

_____/_____/_____ _________________ _______________________________________________
Date Time Signature of Physician


[The same consent form text is repeated three more times in the prompt; the repeats are omitted here.]

Human: what is Esophagogastroduodenscopy consent ?

> Finished chain.
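
In this log the same consent form appears four times in the prompt. One likely reason (an assumption, not something I verified) is that get_llm_response() calls load_chunk_persist_pdf() on every query, so the same chunks get embedded and added to the persisted store again, and the search then returns identical copies. A minimal sketch for dropping duplicate chunks before they reach the chain, working on plain strings; dedupe_chunks is a hypothetical helper, and with LangChain documents you would compare their page_content instead.

```python
import hashlib


def dedupe_chunks(chunks):
    """Keep the first occurrence of each chunk text, preserving the original order."""
    seen = set()
    unique = []
    for chunk in chunks:
        # Hash the stripped text so identical chunks map to the same key
        digest = hashlib.sha256(chunk.strip().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique


retrieved = ["Consent for EGD ...", "Consent for EGD ...", "Consent for colonoscopy ..."]
print(dedupe_chunks(retrieved))
```

This only treats the symptom. Building the store once and reopening the persisted directory on later runs would avoid both the duplicates and the repeated embedding cost.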

Similar functionality can be built using other local vector stores like Milvus, or SaaS vector stores such as Pinecone.

For in-memory stores, FAISS (pip install faiss-cpu) is another wonderful option!

So long!
