Knowledge Bases and Retrieval-Augmented LLMs: A Primer

Anindyadeep
Jul 30, 2023

So far in this series, we have learned how to connect custom Large Language Models using LangChain, where I showed an example of connecting an LLM provided by GPT4All with a custom download function. The same approach works for any kind of LLM: you can extend it with custom functions to connect with existing workflows while still getting the power of LangChain, like prompt chaining, connecting with vector DBs, building agents, etc.

In the second blog, we were more focused on organizing projects and how to easily manage complex configurations using configuration management tools like Hydra.


Today in this blog, I will cover the topic of knowledge bases in the most intuitive manner possible. After reading this blog, you will be able to answer these three questions easily.

  1. What is a knowledge base?
  2. How does an LLM get help from the knowledge base, and how does it find the “knowledge”?
  3. What are these vector DBs (like FAISS, Chroma, Pinecone, etc.)?

In the next part of this series, we will create our knowledge base using Chroma DB and connect it with our LLM using LangChain.

What is a knowledge base?

Well, a knowledge base is a very generic term, which (in tech) is often defined as a library of information about any kind of product, service, or topic. And yes, when it comes to an LLM it is the same: any document about any topic, product, or service that helps provide context to an LLM is considered a knowledge base.

How does an LLM find “knowledge” in a knowledge base?

That’s a very good question, because it builds the foundation for understanding how these things work. As you might already know, LLMs are very popular for question-answering tasks. Sometimes, you might also have observed that LLMs get confused and either provide some imaginary answer or go out of context. This means the bottleneck is a single word: CONTEXT.

So before answering the above question, we need to know what context means. Well, it’s very simple. In real life too, suppose you hopped into the middle of a “serious” conversation (like PewDiePie vs T-Series 🤭) in your friend group, and out of nowhere they want your point of view. Being confused, you ask, “What are you guys talking about?” or “Give me some context”, and one of your friends summarizes what they were discussing; now you can provide your point of view more confidently.

That’s it, I hope you got the intuition of what context is. Similarly, you might want your LLM to answer questions or explain things from a private document (it might be your book, or maybe your Notion pages), and since the LLM is unlikely to have seen that information, you provide the LLM with context before asking the question. Once the LLM gets the correct context, it gives you awesome answers/explanations with its generative capabilities.

Hence, to summarize: for an LLM, context is simply some additional, query- and domain-specific text that gets prepended to the user’s query before it is passed to the LLM.

Then how do we find that knowledge in a vast number of documents?

That’s a fair question to ask. Does the LLM find the knowledge itself, or is there something else that helps it find and provide the context? Well, the second option is correct.

Here I will introduce the concept of embeddings. Those who already know how embeddings work can skip this part. In this blog, our main goal is to integrate our knowledge base using a vector DB, so going into great depth about what embeddings are and how exactly they work would not make much sense. However, I will try to provide a primer and the knowledge required to understand the next part of the blog.

So before embeddings, what are vectors? We can think of vectors in two different ways.

One: An array of numbers. That’s it. 🤓

PS: Physics folks, please don’t think v and a are velocity and acceleration here, lol. It’s just an example.

Two: We can also think of a vector as a point in space, with some magnitude and direction (the classic definition, right?).

Here w, v, and p are vectors (with three dimensions each, since each list has three numbers). If you observe, vectors v and p are very close to each other, which means they share very similar properties, whereas w is far from both of them, which means it might hold dissimilar properties.
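
To make this “closeness” idea concrete, here is a tiny sketch (with made-up numbers) that scores how similar two vectors are using cosine similarity, one of the standard measures vector stores rely on:

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # 1.0 means the vectors point in the same direction, ~0 or negative means unrelated
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 3-dimensional vectors, purely for illustration
v = np.array([0.9, 0.1, 0.2])
p = np.array([0.8, 0.2, 0.3])    # close to v -> similar "properties"
w = np.array([-0.7, 0.9, -0.5])  # far from both -> dissimilar

print(cosine_similarity(v, p))  # high, close to 1
print(cosine_similarity(v, w))  # much lower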

Coming back to Embeddings

Now that you have a primer on vectors, imagine you have a very big chunk of text. What if I could convert these chunks of text into these kinds of vectors? But why? Because if I can do that, then I can also convert my question (the query, which is text) into a vector and find its similarity with the existing vectors (the text chunks). My top 10 most similar vectors might then contain topics similar to my query (which could hold our answers). And if I “de-vectorize” these top 10 similar vectors, I get back the top 10 texts (or text chunks) that likely contain our answers; giving those texts as context to the LLM, it can generate awesome explanations and answers that are factually correct and related to our query.

So now we know what vectors are and how they can help an LLM with information retrieval. Only one thing is left: what are embeddings, then? Simply think of an embedding as a specialized (n-dimensional) vector that represents the text, or in other words, a mapping of the text compressed into some (n-dimensional) vector.

The (black) box you are seeing in the above picture is responsible for this mapping. It is essentially a neural network (like BERT, a transformer-based encoder) that was trained on a huge number of text-based tasks (like classification, clustering, token classification, etc.), and due to this massive amount of training, these encoders (the entities that encode text into these kinds of vectors) have found an optimal mapping that compresses texts into this representation while conserving their semantic meaning.

This means that the properties of language and the mathematical properties of vectors are both conserved by these special representations called embeddings. The “n” in n-dimensional is generally a number like 384, 512, 768, or 1024.
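
For a concrete feel, here is roughly how you could turn text into such embeddings with the sentence-transformers library; all-MiniLM-L6-v2 is just one small, popular encoder (it happens to produce 384-dimensional vectors), not the only choice:

from sentence_transformers import SentenceTransformer

# Any encoder would do; this one is small and commonly used
model = SentenceTransformer("all-MiniLM-L6-v2")

texts = [
    "The French Revolution began in 1789.",
    "Heavy taxes fell disproportionately on the common people.",
]
embeddings = model.encode(texts)

print(embeddings.shape)  # (2, 384) -> each text is now an n-dimensional vector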

So here’s how our LLMs get the answer

When we have our documents (on which we want to build our knowledge base), we first decompose them into smaller chunks, for two reasons:

  1. Smaller chunks keep the context short (which helps the prompt stay within the LLM’s context length).
  2. Also, the answer might lie in different parts of a document, or across different documents. Chunking documents into smaller pieces therefore makes the search more efficient and more granular.

We then create an embedding for each of these chunks and store them in a database that holds both the vectors and the documents. We call these databases vector stores or vector DBs.

Step 1: Creation of chunks

Once we create our chunks, we save them and also keep some metadata. Metadata is data about our data; here, our data is the document chunks. The metadata will contain a mapping of which chunk belongs to which document, the page number, and so on… Here is a simple “dummy” example:

{
  'doc1': [
    {
      'chunk1': {'start page number': 1, 'end page number': 1},
      'chunk2': {'start page number': 1, 'end page number': 1},
    }
  ],
  'doc2': [
    {
      'chunk1': {'start page number': 13, 'end page number': 13},
      'chunk2': {'start page number': 13, 'end page number': 14},
      'chunk3': {'start page number': 14, 'end page number': 14},
    }
  ],
}

Now, please note: this is just an oversimplified example to give an intuition of what this metadata might look like. The point is that metadata is very important at retrieval time, for “de-vectorizing” the embeddings back into text and providing that text as context, and it also lets us cite the source the LLM is getting its context/answer from.
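
To make this concrete, here is a rough sketch of the chunking step using LangChain’s RecursiveCharacterTextSplitter; the documents, chunk sizes, and metadata fields are hypothetical (in practice, a document loader would also give you page numbers for the metadata):

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Hypothetical raw documents; in practice these would come from a PDF/Notion loader
raw_docs = {
    "doc1": "full text of the first document goes here ...",
    "doc2": "full text of the second document goes here ...",
}

# Chunk size / overlap are arbitrary here and usually tuned per use case
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

chunks, metadatas = [], []
for doc_id, text in raw_docs.items():
    for i, chunk in enumerate(splitter.split_text(text)):
        chunks.append(chunk)
        # Minimal metadata: which document and which chunk this text came from
        metadatas.append({"source": doc_id, "chunk": i})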

Step 2: Make embeddings from these chunks.

All of these chunks (i.e. pieces of text) now go to our embedding model (the neural network that maps our texts to embeddings), and we store this mapping {chunk : embedding} in a vector database.
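
Continuing the sketch above, storing those chunks (with their metadata) and querying them with the Chroma client could look roughly like this; the exact API can vary a bit between Chroma versions:

import chromadb

# In-memory client; Chroma can also persist the collection to disk
client = chromadb.Client()
collection = client.create_collection(name="knowledge_base")

# Chroma embeds the documents with its default embedding function,
# so we just hand it the chunks and metadata from the previous step
collection.add(
    documents=chunks,
    metadatas=metadatas,
    ids=[f"chunk-{i}" for i in range(len(chunks))],
)

# Retrieval: embed the query and return the top-n most similar chunks
results = collection.query(
    query_texts=["What was the main cause of the French Revolution?"],
    n_results=3,
)
print(results["documents"][0])  # the retrieved chunks
print(results["metadatas"][0])  # and where each of them came from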

So now our documents are indexed and can be retrieved easily when queried. Here is an awesome schematic diagram by Pinecone of the overall picture, from creation to retrieval.

Source: Pinecone Blog

So whenever we do question answering on our documents, every user query goes through this workflow: it is sent to the embedding model, we search for the top-n most similar embeddings (for example, the top 10), and these are returned as the query results.

Now NOTE: in the case of our LLM, the query result is not our final result. The raw query result is a data structure that contains the relevant chunks and their corresponding metadata. These chunks are prepended to our initial prompt, and then we tell our LLM to give an answer based on the context given above.

So suppose our query to the LLM was:

Query:

What was the main cause of the French revolution?

Then suppose these were our relevant chunks

(chunk 1)

The French Revolution occurred due to widespread economic inequality
and financial crisis in the country. The burden of heavy taxes and feudal
dues disproportionately fell on the common people

(chunk 2)

lack of opportunities, culminated in the revolution. The vast majority
of the population faced severe hardships, while the ruling elite lived
in luxury, exacerbating the resentment

(chunk 3)

liberty, and the questioning of traditional authority, played a
significant role in inspiring the French Revolution.

Now we take these top 3 chunks and build our final prompt before passing it to the LLM.

### Instruction:

You are only allowed to respond based on the context given below. The answer
should contain at most 5 points, not more than that. You are not allowed
to say anything which you think is not factually correct.

### Context:

(chunk 1)

The French Revolution occurred due to widespread economic inequality
and financial crisis in the country. The burden of heavy taxes and feudal
dues disproportionately fell on the common people

(chunk 2)

lack of opportunities, culminated in the revolution. The vast majority
of the population faced severe hardships, while the ruling elite lived
in luxury, exacerbating the resentment

(chunk 3)

liberty, and the questioning of traditional authority, played a
significant role in inspiring the French Revolution.

### Query:
What was the main cause of the French revolution?

This is, sort of, the final modified prompt, which contains enough context to produce better, factually correct answers. Below is the complete schematic diagram that summarizes the overall process (please zoom in to see it better).

Knowledge Retrieval Architecture for LLM’s by Matt Boyenger
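
In code, assembling this final prompt from the retrieved chunks might look roughly like the sketch below; build_prompt is a hypothetical helper, and the retrieved chunks are the ones returned by the earlier Chroma query:

def build_prompt(query: str, retrieved_chunks: list) -> str:
    # Prepend the retrieved chunks as context before the user's query
    context = "\n\n".join(retrieved_chunks)
    return (
        "### Instruction:\n"
        "You are only allowed to respond based on the context given below. The answer "
        "should contain at most 5 points, not more than that. You are not allowed "
        "to say anything which you think is not factually correct.\n\n"
        f"### Context:\n{context}\n\n"
        f"### Query:\n{query}"
    )

prompt = build_prompt(
    "What was the main cause of the French revolution?",
    results["documents"][0],  # chunks retrieved in the previous step
)
# `prompt` can now be passed to any LLM (GPT4All, OpenAI, etc.)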

If you are more interested in going into the depths of knowledge retrieval architecture and vector databases, please check out these two awesome links.

What are these vector DBs (like FAISS, Chroma, Pinecone, etc.)?

Now that we have a good understanding of knowledge bases and how LLMs leverage them for context and information retrieval, let’s explore the backbone that enables this process — vector databases. Vector databases are powerful tools that store and index high-dimensional vectors, allowing for efficient similarity search and retrieval.

1. FAISS (Facebook AI Similarity Search):

FAISS by Meta AI

FAISS, developed by Facebook’s AI Research (FAIR) team, is one of the most popular open-source vector databases available. It excels in handling large-scale vector datasets and performs blazing-fast similarity searches. FAISS offers a variety of index structures and algorithms, such as Product Quantization and Inverted Multi-Index, enabling quick and accurate nearest-neighbor searches. By integrating FAISS with LangChain, LLMs can harness the power of this database to efficiently find relevant contexts from vast numbers of documents and provide precise answers.
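
For a feel of what using FAISS directly looks like, here is a hedged, minimal sketch with a flat (exact, brute-force) index; it reuses the chunks and the encoder from the earlier sketches:

import faiss
import numpy as np

# Embed the chunks with the encoder from the earlier sketch (any encoder works)
xb = np.asarray(model.encode(chunks), dtype="float32")
index = faiss.IndexFlatL2(xb.shape[1])  # exact L2 (Euclidean) search
index.add(xb)                           # index all chunk embeddings

# Embed the query the same way, then fetch the 3 nearest chunks
xq = np.asarray(model.encode(["main cause of the French Revolution"]), dtype="float32")
distances, indices = index.search(xq, 3)
print([chunks[i] for i in indices[0]])  # the most similar chunks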

2. Chroma:

Another valuable option is Chroma, an open-source embedding database. Chroma is optimized for similarity search in high-dimensional spaces, making it an excellent choice for LLM applications. Its algorithms and data structures enable effective indexing and retrieval of vectors, letting LLMs quickly access relevant chunks of information from the knowledge base. By seamlessly integrating Chroma into the LangChain ecosystem, LLMs can benefit from its capabilities and enhance their question-answering tasks.

3. Pinecone:

Pinecone, provided by Pinecone.io, is a cloud-native vector database service designed for similarity search at scale. Its focus on real-time search and recommendations makes it ideal for LLM use cases, where prompt chaining, connecting with vector databases, and building agents are crucial. By leveraging Pinecone’s capabilities through LangChain, LLMs gain the ability to efficiently index and retrieve vector embeddings, enabling them to access relevant contexts and generate accurate answers swiftly.

In summary, these vector databases, including FAISS, Chroma, and Pinecone, are key components in the knowledge retrieval architecture for LLMs. They enable efficient storage, indexing, and retrieval of high-dimensional vectors, empowering LLMs to find and utilize the right context to provide factually correct and contextually relevant answers to complex queries.
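
As a small teaser for the next part, here is roughly how these pieces snap together in LangChain; this is a sketch, not the exact code we will use: class names depend on your LangChain version, and llm stands for whichever LangChain-wrapped LLM you already have set up (e.g. GPT4All from the first blog).

from langchain.chains import RetrievalQA
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

# Build the vector store from the chunks and metadata created earlier
embedding_fn = HuggingFaceEmbeddings()  # defaults to a sentence-transformers model
vectorstore = Chroma.from_texts(chunks, embedding=embedding_fn, metadatas=metadatas)

# `llm` is assumed to be an already-configured LangChain LLM (e.g. GPT4All)
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # "stuff" the retrieved chunks into the prompt as context
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
)
print(qa.run("What was the main cause of the French revolution?"))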

Conclusion

So in this blog, we learned about the following things:

  1. What knowledge bases are
  2. What embeddings are
  3. How embeddings help in knowledge retrieval and take part in making an effective knowledge base
  4. How LLMs use all of this to gain better context and generate better, more factually correct answers
  5. The complete workflow of building a knowledge base (from document chunking, to creating embeddings, to saving them in a vector store and retrieving relevant documents when queried)
  6. What the popular vector databases are

Up next, we will build our own document store and knowledge base and connect our custom LLM using LangChain and GPT4All. Stay tuned 💪

