Structuring projects and configuration management for LLM-powered Apps.
You can say that this is part 1.5 of my mini-series Building end-to-end LLM-powered applications without OpenAI's API.
What is covered so far in the series
- Integrating custom LLM using langchain (A GPT4ALL example)
- Configuration management for LLM-powered applications
- Connecting LLM to the knowledge base (coming soon)
- Serving LLM using Fast API (coming soon)
- Fine-tuning an LLM using transformers and integrating it into the existing pipeline for domain-specific use cases (coming soon)
- Putting it all together into a simple cloud-native LLM-powered application (coming soon)
Let's get started
But why? Why should I care about having configuration files? Because before building end-to-end applications, having properly organized, structured configurations is very important for managing a project. There are several reasons why you should care about configuration management. Some of them, with respect to our project, are as follows:
- Managing model paths. You might need to change a model path or an S3 link, and that path may be referenced in several project files.
- Hyperparameter management. Suppose you have a fine-tuning pipeline or an inference pipeline. Hyperparameters like learning rate, optimizer choice, etc. become hard to manage if they are hard-coded inside project files. In our case, this includes settings like whether to stream, temperature, top_p, top_k, etc.
- Configurations act like settings. Just as almost every application (even your Android/iOS apps) has in-app settings, configuration files act as settings that can be tweaked from outside the code yet still control how your code behaves. Config files are also very useful when writing infrastructure code to serve applications.
A lack of configuration management can cause serious problems in production, when scaling systems or tweaking/updating parameters.
The same goes for project structure. A modular project structure is essential: modern cloud-native applications require projects to be structured such that most modules are decoupled (independent) of each other and can be maintained or scaled independently. Now that you have a sense of why config files matter, let's jump right into building our project structure and writing our configuration files. For the project structure we will use Cookiecutter to generate our project boilerplate, and for configuration management we will use Hydra.
Structuring our Project
Cookiecutter is a command-line utility that creates a project structure from pre-existing project templates. You can install cookiecutter with
pip install cookiecutter
Then we will use a popular template repository called cookiecutter data science. We can set it up with this command
cookiecutter -c v1 https://github.com/drivendata/cookiecutter-data-science
Doing this will take you to a simple CLI prompt where you type things like the app name, description, etc., and it will create a base template (if you would rather script this step, see the sketch after the tree below). But as this is not a pure data science project, I deleted some files and folders, and the structure I use now looks like this.
.
├── configs
│   ├── config.yaml
│   ├── knowledge_base
│   │   └── default.yaml
│   └── model
│       └── default.yaml
├── data
│   ├── diff_lm.pdf
│   └── llama2.pdf
├── LICENSE
├── main.py
├── Makefile
├── README.md
├── requirements.txt
├── setup.py
├── src
│   ├── data
│   │   └── __init__.py
│   ├── __init__.py
│   └── models
│       └── __init__.py
├── test_environment.py
├── tests
└── tox.ini
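As mentioned above, if you prefer to script the scaffolding rather than answer the prompts interactively, cookiecutter also exposes a Python API. Here is a minimal sketch under assumptions: the `extra_context` keys and values shown (`project_name`, `repo_name`, `author_name`) are illustrative and should match whatever the template's `cookiecutter.json` actually asks for.

# scaffold.py -- a minimal sketch using cookiecutter's Python API
# (the extra_context keys/values are illustrative assumptions, not from the original post)
from cookiecutter.main import cookiecutter

cookiecutter(
    "https://github.com/drivendata/cookiecutter-data-science",
    checkout="v1",       # same as the -c v1 flag on the CLI
    no_input=True,       # skip the interactive prompts
    extra_context={
        "project_name": "llm-chat-app",
        "repo_name": "llm_chat_app",
        "author_name": "anindya",
    },
)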
Here is some information you need to know about what each folder is for.
- `configs`: This is mainly to store different configurations for each of the components in our application, for example: models, knowledge base, vector database, etc.
- `data`: This is used to save all the documents for our knowledge base. In practice, however, documents should be stored in a service like Amazon S3.
- `src`: This contains all our source files. It has subfolders like `data`, which will hold all the files to create and manage our knowledge base, and `models`, which will hold the files to create and manage our LLMs.
- `tests`: Here we will keep all our testing files: unit tests, integration tests, etc.
- `main.py`: For now, this is our main Python file, which uses the source files as helpers and runs the main code. We will run this file for our chatbot.
- `Makefile`: Useful to quickly set up the project: installing dependencies, linting, formatting, quick testing, etc.
The code from our previous blog will go inside the `src/models` folder. Today our focus will be on writing configuration files under the `configs` folder. I assume readers are already familiar with my previous blog on how we integrated a custom open-source LLM (using `gpt4all` and `langchain`), because today we are going to write configuration files for that model.
The plot
In short, let's imagine this plot: suppose your app allows users to chat with different kinds of data and with different variants of open-source LLMs. Those open-source LLMs are accessed through different providers like `LLAMA.cpp`, `GPT4ALL`, `HuggingFace`, etc. Each of the models, written with different libraries, might have different optimal configurations. How are you going to manage those? The answer is writing configuration files. Here we will write the configuration for our `gpt4all` LLM.
Hydra 101
We start by installing Hydra. We can do this with the following command
pip install hydra-core==1.1.0
Hydra operates on top of `OmegaConf`, which is a YAML-based hierarchical configuration system with support for merging configurations from multiple sources (files, CLI arguments, environment variables), providing a consistent API regardless of how the configuration was created.
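To get a feel for what OmegaConf does on its own, here is a minimal sketch, independent of Hydra; the values are just examples taken from the configs we use below.

# omegaconf_demo.py -- a minimal OmegaConf sketch, independent of Hydra
from omegaconf import OmegaConf

# build a hierarchical config from a plain dict (it could also come from a YAML file)
base = OmegaConf.create({
    "gpt4all": {
        "gpt4all_model_name": "ggml-gpt4all-j-v1.3-groovy.bin",
        "gpt4all_backend": "llama",
    }
})

# merge overrides from another source (e.g. CLI arguments or a second file)
override = OmegaConf.create({"gpt4all": {"gpt4all_backend": "gptj"}})
cfg = OmegaConf.merge(base, override)

print(cfg.gpt4all.gpt4all_backend)   # -> gptj (attribute-style access)
print(OmegaConf.to_yaml(cfg))        # dump the merged config back to YAML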
Writing your first configuration files
Let's get started by making a simple YAML file named `config.yaml`. The contents of the file look like this:
gpt4all:
  gpt4all_model_name: ggml-gpt4all-j-v1.3-groovy.bin
  gpt4all_model_folder_path: /home/anindya/.local/share/nomic.ai/GPT4All/
  gpt4all_backend: llama
The above is a configuration file I wrote using YAML. I could also have written the same thing like this:
gpt4all_model_name: ggml-gpt4all-j-v1.3-groovy.bin
gpt4all_model_folder_path: /home/anindya/.local/share/nomic.ai/GPT4All/
gpt4all_backend: llama
This one is also valid; it all depends. Let's call the first YAML format 1 and the second format 2. Format 1 lets me add configurations for different model providers all in one place. For example, if I now want to add Hugging Face configurations, I can do something like this:
gpt4all:
  gpt4all_model_name: ggml-gpt4all-j-v1.3-groovy.bin
  gpt4all_model_folder_path: /home/anindya/.local/share/nomic.ai/GPT4All/
  gpt4all_backend: llama

hugging_face:
  hugging_face_model_id: some model name
  hugging_face_adapter_name: some adapter id
  hugging_face_dataset: some dataset source id
Whereas in format 2, I might do the same but in separate files. In that case, one file would be named `gpt4all_config.yaml` and another `huggingface_config.yaml`. It all depends on how complex our project is, and based on that we have to choose our format.
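For completeness, here is a hedged sketch of how format 2 could be handled even without Hydra: each provider gets its own file and we merge them manually with OmegaConf. The file names are the hypothetical ones mentioned above, so this assumes both files exist next to the script.

# merge_configs.py -- a sketch of format 2: one YAML file per provider, merged manually
# (assumes the two hypothetical provider files mentioned above exist in the working directory)
from omegaconf import OmegaConf

gpt4all_cfg = OmegaConf.load("gpt4all_config.yaml")
hf_cfg = OmegaConf.load("huggingface_config.yaml")

# merge both providers into a single config object
cfg = OmegaConf.merge(gpt4all_cfg, hf_cfg)
print(OmegaConf.to_yaml(cfg))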
Now let's load the format 1 file using Hydra and print the configurations.
# file name: main.py
import hydra

@hydra.main(config_path='.', config_name='config')
def main(cfg):
    print('GPT4ALL Configurations')
    print(' Model name: ', cfg.gpt4all.gpt4all_model_name)
    print(' Model stored in path: ', cfg.gpt4all.gpt4all_model_folder_path)
    print(' Model using LLM backend of: ', cfg.gpt4all.gpt4all_backend)

    print('\nHuggingFace Configurations')
    print(' Hugging face model name: ', cfg.hugging_face.hugging_face_model_id)
    print(' Hugging face adapter name: ', cfg.hugging_face.hugging_face_adapter_name)
    print(' Hugging face dataset name: ', cfg.hugging_face.hugging_face_dataset)

if __name__ == '__main__':
    main()
Intuitively, all Hydra essentially does here is take whatever configuration we provide through a bunch of YAML files and expose it through a nice object-oriented interface, so the values are easy to access. Running this prints the following:
GPT4ALL Configurations
Model name: ggml-gpt4all-j-v1.3-groovy.bin
Model stored in path: /home/anindya/.local/share/nomic.ai/GPT4All/
Model using LLM backend of: llama
HuggingFace Configurations
Hugging face model name: some model name
Hugging face adapter name: some adapter id
Hugging face dataset name: some dataset source id
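Under the hood, the `cfg` object Hydra passes to `main` is an OmegaConf `DictConfig`, so besides the dotted attribute access shown above you can also turn it into a plain Python dict when some library insists on one. A minimal sketch, reusing the same dummy config as above:

# a minimal sketch: the cfg Hydra passes in is an OmegaConf DictConfig
import hydra
from omegaconf import OmegaConf

@hydra.main(config_path='.', config_name='config')
def main(cfg):
    # convert to a plain (nested) Python dict, resolving any interpolations
    plain = OmegaConf.to_container(cfg, resolve=True)
    print(type(plain))                              # <class 'dict'>
    print(plain['gpt4all']['gpt4all_model_name'])   # ordinary dict access

if __name__ == '__main__':
    main()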
There are several other things Hydra is famous for. One of them is overriding configurations: we can change configuration values at runtime and run our app accordingly. Here is an example of running the app with overrides:
python3 main.py \
gpt4all.gpt4all_model_name="my latest model name" \
gpt4all.gpt4all_model_folder_path="my latest folder" \
gpt4all.gpt4all_backend="GPTJ"
And this will print
GPT4ALL Configurations
Model name: my latest model name
Model stored in path: my latest folder
Model using LLM backend of: GPTJ
HuggingFace Configurations
Hugging face model name: some model name
Hugging face adapter name: some adapter id
Hugging face dataset name: some dataset source id
I guess this is enough to get you started with our existing project. However, if you are interested in learning more, check out Hydra's excellent documentation. Also, a huge part of this 101 was inspired by the awesome blog by Raviraja Ghanta, who used Hydra to show how to manage configurations while building end-to-end ML applications.
Applying Hydra to our project to manage our model
Previously I showed you how to create a single YAML file and use it as our configuration. But in real scenarios we will have much more complex things to handle. Just take our app as an example: right now we are only dealing with the model, but in the future we will have to do the same for our vector database (knowledge base), API, serving, etc. Hence it is better to create configs for the different components. Here is our `configs` folder structure.
configs
├── config.yaml
├── knowledge_base
│   └── default.yaml
└── model
    └── default.yaml
Think of `config.yaml` inside `configs` as the config orchestrator: it knows where the configurations of each component are located, and through `config.yaml` we can load and control the other components. Inside `model` and `knowledge_base` there is a `default.yaml` which holds the default configurations for our model and our knowledge base (we will talk about the knowledge base in later parts of this blog). Now let's take a look at our `config.yaml` file.
defaults:
- model: default
- knowledge_base: default
If we just use our intuition, this is not hard to understand. It says that our defaults are as follows: the default model configuration comes from `default.yaml` inside `configs/model`, and the same goes for our knowledge base. Similarly, `configs/model` might contain other config files like `model_dev.yaml` (all model configurations for the development phase) or `model_production.yaml` (untouched configurations for production, which we can also make read-only using Hydra if we want), and then we can reference those from `config.yaml` instead. But for now, let's keep this setting.
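To sanity-check that the composition is working (and, as just mentioned, to lock a config down as read-only if we want), a small hedged sketch like this can help; it only prints the composed config and freezes it, nothing more.

# inspect_config.py -- a sketch to print the composed config and optionally freeze it
import hydra
from omegaconf import OmegaConf

@hydra.main(config_path='./configs', config_name='config')
def main(cfg):
    # dump the fully composed config (model + knowledge_base defaults) as YAML
    print(OmegaConf.to_yaml(cfg))

    # optionally make it read-only so nothing downstream mutates it by accident
    OmegaConf.set_readonly(cfg, True)

if __name__ == '__main__':
    main()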
Once that is done, let's take a peek at what our `configs/model/default.yaml` looks like.
gpt4all_model:
  gpt4all_model_name: ggml-gpt4all-j-v1.3-groovy.bin
  gpt4all_model_folder_path: /home/anindya/.local/share/nomic.ai/GPT4All/
  gpt4all_backend: llama
  gpt4all_allow_streaming: true
  gpt4all_allow_downloading: false
  gpt4all_temperature: 1
  gpt4all_top_p: 0.1
  gpt4all_top_k: 40
  gpt4all_n_batch: 8
  gpt4all_n_threads: 4
  gpt4all_n_predict: 256
  gpt4all_max_tokens: 200
  gpt4all_repeat_last_n: 64
  gpt4all_penalty: 1.18
We have now put all the tweakable parameters inside the config, and we have referenced it from `config.yaml`. Next, let's use it inside our `main.py` file to test whether our configs are working or not. While running `main.py` we will also tweak some parameters, as we did previously in the dummy example, to check that overriding works here too. Here is our `main.py` file.
import hydra
from src.models.gpt4all_model import MyGPT4ALL

# reference the ./configs folder to tell hydra where the configs are located
# also tell hydra that our master config (which manages all other configs)
# is named config
@hydra.main(config_path='./configs', config_name='config')
def main(cfg):
    # instantiate the model and populate the arguments using hydra
    chat_model = MyGPT4ALL(
        model_folder_path=cfg.model.gpt4all_model.gpt4all_model_folder_path,
        model_name=cfg.model.gpt4all_model.gpt4all_model_name,
        allow_download=cfg.model.gpt4all_model.gpt4all_allow_downloading,
        allow_streaming=cfg.model.gpt4all_model.gpt4all_allow_streaming,
    )
    while True:
        query = input('Enter your Query: ')
        if query == 'exit':
            break
        # use hydra to fill the **kwargs
        response = chat_model(
            query,
            n_predict=cfg.model.gpt4all_model.gpt4all_n_predict,
            temp=cfg.model.gpt4all_model.gpt4all_temperature,
            top_p=cfg.model.gpt4all_model.gpt4all_top_p,
            top_k=cfg.model.gpt4all_model.gpt4all_top_k,
            n_batch=cfg.model.gpt4all_model.gpt4all_n_batch,
            repeat_last_n=cfg.model.gpt4all_model.gpt4all_repeat_last_n,
            repeat_penalty=cfg.model.gpt4all_model.gpt4all_penalty,
            max_tokens=cfg.model.gpt4all_model.gpt4all_max_tokens,
        )
        print()

if __name__ == '__main__':
    main()
This file is simple enough to understand, I believe. All it does is instantiate the `chat_model` using our Hydra configs and fill the `**kwargs` from those same configs while chatting with the model. If we just run `python3 main.py`, it runs with the defaults as expected. But suppose we want to tweak some parameters at runtime. Take a look at this, for example:
PYTHONPATH=. python3 main.py \
model.gpt4all_model.gpt4all_model_name=ggml-mpt-7b-instruct.bin \
model.gpt4all_model.gpt4all_temperature=1 \
model.gpt4all_model.gpt4all_top_k=50 \
model.gpt4all_model.gpt4all_max_tokens=10 \
model.gpt4all_model.gpt4all_penalty=1.00
I changed the model name, temperature, top_k, max_tokens, and penalty, so these parameters are overridden before the code runs. And the best part: this gives us a nice CLI interface without us having to build one. Isn't this awesome?
Conclusion
Congratulations! You just completed a very important part of the production-grade ML life cycle, i.e. configuration management, and we can apply this knowledge when building our LLM-powered applications. It might not look very cool right now, but I assure you, it makes life easier when applications become too complex to handle. In the next part of this blog, we will learn how to connect documents and build a knowledge base out of them, to make LLMs more robust and able to answer questions over unseen, private docs. All the code used here and previously lives in this GitHub repo; feel free to check it out. Until next time.
References and Acknowledgements
- Config management using Hydra by Raviraja Ghanta https://deep-learning-blogs.vercel.app/blog/mlops-hydra-config
- Mastering Config management using Hydra https://towardsdatascience.com/mastering-configuration-ml-with-hydra-ef138f1c1852