How to integrate a custom LLM using LangChain: a GPT4All example

Anindyadeep
Jul 16, 2023 · 12 min read

This is part 1 of my mini-series: Building end-to-end LLM-powered applications without OpenAI's API

What is covered so far in the series

Why not to use OpenAI's API (it's just three lines of code, bruh)

Building Large Language Models and creating LLM-powered applications have become popular ever since the rise of chatGPT. Communities right now are mainly focused on two fundamental things.

  1. Coming up with newer Open source Large Language Models (Llama, Falcon, etc)
  2. Building different kinds of use cases and crafting applications powered by LLMs.

If we pay close attention, most of the time the LLM behind these applications is our good old gpt-3.5-turbo (ChatGPT) API, which is fine. But consider these two scenarios:

  1. Experimenting with OpenAI's API might be the standard route, but it is expensive.
  2. A lot of use cases live in high-stakes environments that might not let makers use a third-party API.

Lemme expand on the second point for you. Suppose you want to deploy an assistant for financial/medical/legal use cases. There you want to control not only your overall application but also your LLM itself. One way of controlling your LLM from the ground up is fine-tuning it on custom (possibly proprietary) data. In such cases, where privacy and controllability are of utmost priority, ChatGPT might be your last option.

So, suppose you already have an open-source LLM, you have fine-tuned it for your use case (this might be optional initially), and now you want to build cool applications around it. We all know langchain is the first thing that comes to mind when it comes to building applications with LLMs. This tutorial focuses on how to integrate a custom LLM using langchain.

For those folks who are unaware of langchain, langchain is an amazing open-source framework that makes it easier for developers to build applications using language models. It provides a set of tools and abstractions that make it easier to connect language models to other data sources and interact with their environment. (Definition by bard 🤓)

Let’s get started 💪

So, before getting started, we must have access to an open-source LLM, right? There are several providers of open-source LLMs, for example 🤗 (those who know, know). But I had a pretty hard time running inference and kept hitting memory errors, as I do not have enough GPU. Then I came across GPT4All. For those who don't know what gpt4all is, it's a gem. Think of it as your ChatGPT-like application, but running inside your laptop, and you can even choose the model you want to chat with. Amazing, right? The best part is that it runs on local devices (even with 4 GB of RAM). GPT4All also provides a ChatGPT-like interface and has impressive features, all FREE.

gif from gpt4all website

For more, you may check out their website: NomicAI/gpt4all. For this tutorial, I will be using the Python bindings provided by gpt4all.
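If you want a quick sanity check that the bindings work on your machine, a minimal sketch looks roughly like this (the model file name is just an example; depending on your gpt4all version, the package will download it on first use if it is not already present):

from gpt4all import GPT4All

# example model name; any model listed on the gpt4all website should work
model = GPT4All("ggml-gpt4all-j-v1.3-groovy.bin")

# run a tiny prompt to verify that local inference works
print(model.generate("The capital of France is", max_tokens=20))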

A Disclaimer before getting started

An integration of gpt4all with langchain is already provided by langchain itself. It's just a matter of a few lines of code. Here it is:

from langchain.llms import GPT4All
from langchain import PromptTemplate, LLMChain

# create a prompt template that contains some initial instructions;
# here we ask our LLM to think step by step and then give the answer

template = """
Let's think step by step of the question: {question}
Based on all the thought the final answer becomes:
"""
prompt = PromptTemplate(template=template, input_variables=["question"])

# paste the path where your model's weights are located (.bin file)
# you can download the models by going to gpt4all's website.
# a script for downloading is also available in the later
# sections of this tutorial

local_path = "./models/GPT4All/ggml-gpt4all-j-v1.3-groovy.bin"

# initialize the LLM and chain it with the prompt
# (the groovy model is GPT-J based, so we use the gptj backend)

llm = GPT4All(
    model=local_path,
    backend="gptj",
)

llm_chain = LLMChain(prompt=prompt, llm=llm, verbose=True)

# run the chain with your query (question)

llm_chain('Who is the CEO of Google and why did he become the CEO of Google?')

Well, yes, it's just a handful of lines of code. Then why am I making this tutorial to write more lines of code when it is already done by langchain? Here are some reasons:

  1. We will be making a model class in this tutorial, and the same blueprint can be applied to any other model.
  2. More customizability: You can add more features to extend the class the way you want.
  3. Easier integration with other systems. This is an extension of the previous point. Your model is not only going to provide results but is also going to serve in production. So you can make a custom class that handles things like auto-downloading the latest versions, streaming tokens via API calls, etc. However, that is out of the scope of this tutorial. (Maybe in some other tutorial? Stay tuned 🤭)

Now let’s really get started

Also, one more thing: I hope you have gpt4all installed. If not, then install it using this command:

pip install gpt4all
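Depending on what is already in your environment, you may also need the other libraries used throughout this tutorial (exact versions are up to you):

pip install langchain pydantic requests tqdm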

Let's start by defining our class and the set of parameters it requires when we instantiate an object of it. Here is a sample code for that.

from typing import Any, Optional
from pydantic import Field
from langchain.llms.base import LLM

class MyGPT4ALL(LLM):
    """
    A custom LLM class that integrates gpt4all models

    Arguments:

    model_folder_path: (str) Folder path where the model lies
    model_name: (str) The name of the model to use (<model name>.bin)
    allow_download: (bool) whether to download the model or not

    backend: (str) The backend of the model (Supported: llama/gptj)
    n_threads: (int) The number of threads to use
    n_predict: (int) The maximum number of tokens to generate
    temp: (float) Temperature to use for sampling
    top_p: (float) The top-p value to use for sampling
    top_k: (int) The top-k value to use for sampling
    repeat_last_n: (int) Last n number of tokens to penalize
    repeat_penalty: (float) The penalty to apply to repeated tokens

    """
    model_folder_path: str = Field(None, alias='model_folder_path')
    model_name: str = Field(None, alias='model_name')
    allow_download: bool = Field(None, alias='allow_download')

    # all the optional arguments

    backend: Optional[str] = None
    top_p: Optional[float] = 0.1
    top_k: Optional[int] = 40
    max_tokens: Optional[int] = 200
    n_threads: Optional[int] = 4
    n_predict: Optional[int] = 256
    temp: Optional[float] = 0.7
    repeat_last_n: Optional[int] = 64
    repeat_penalty: Optional[float] = 1.18

    # the underlying gpt4all model instance (initialized later in __init__)
    gpt4_model_instance: Any = None

P.S. Please don't judge this ultra-aesthetic style of code formatting… lol. Also, please note: if you want to know what each of the optional parameters is for, you can check out this link. This blog focuses on how we design a custom LLM class; going into depth on what each parameter does is out of scope. We can have another blog explaining each parameter.

Moving forward, we now need to see which functions we are required to define when extending langchain's base LLM class, and which extra functions we might want to add to make it more custom. Below are the default requirements when extending the base LLM class.

class CustomLLM(LLM):

    @property
    def _llm_type(self) -> str:
        """
        It tells us what kind of LLM we are using
        For example: here it will be a gpt4all model
        """
        ...

    @property
    def _identifying_params(self) -> dict:
        """
        It should return a dict that provides
        the information of all the parameters
        that are used in the LLM. This is useful
        when we print our llm; it will give us the
        information of all the parameters.
        """
        ...

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        """
        This is the main method which will be called when
        we use the LLM with our prompt. Example:

        response = llm(prompt)
        """
        ...

Well, that's an awesome point to start. But let's add one more method, download(model_name, model_folder_path). This method can be used when defining the LLM so that it automatically downloads the model we specified to the specified path and loads it. The function is simple enough to demonstrate here, and GPT4All already has this functionality built in. But assume you have to download the model from an S3 bucket, load it from some other source automatically, or integrate it with some cloud service or an existing pipeline; then you might need to write your own custom function.
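For instance, here is a rough, hypothetical sketch of what a load_from_s3() alternative could look like. It is not part of the class we build in this tutorial; the bucket and key arguments are made up, and it assumes you have boto3 installed and AWS credentials configured:

import os
import boto3

def load_from_s3(self, bucket: str, key: str) -> None:
    """
    Hypothetical alternative to auto_download():
    fetch the model weights from an S3 bucket instead of gpt4all.io
    """
    download_path = os.path.join(self.model_folder_path, self.model_name)
    if not os.path.exists(download_path):
        # download_file streams the object to disk for us
        s3 = boto3.client("s3")
        s3.download_file(bucket, key, download_path)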

We can add more methods, like token streaming, sending tokens over a network, etc. But that would be overhead here, and it is left open for the user to experiment with.

Writing our model downloading function

It's simple: all we need to do is specify the folder where the model will get downloaded and the name of the model to download. Once those are specified, we need to check two things:

  1. Whether the model already exists in the specified folder or not.
  2. Whether allow_download is set as True or not.

If those two checks pass, then we download the model from the URL gpt4all.io/models/<model-name> by sending a GET request. Note that since the file is large, we are gonna stream the download and write the file chunk by chunk. Here is the code that does that.

def auto_download(self) -> None:
    """
    This method will download the model to the specified path
    reference: python.langchain.com/docs/modules/model_io/models/llms/integrations/gpt4all
    """

    # import all the required dependencies
    import os
    import requests
    from tqdm import tqdm

    # make sure the model name ends with .bin

    model_name = (
        f"{self.model_name}.bin"
        if not self.model_name.endswith(".bin")
        else self.model_name
    )

    download_path = os.path.join(self.model_folder_path, model_name)

    if not os.path.exists(download_path):
        if self.allow_download:

            # send a GET request to the URL to download the file.
            # stream it while downloading, since the file is large

            try:
                url = f'http://gpt4all.io/models/{model_name}'

                response = requests.get(url, stream=True)
                # open the file in binary mode and write the contents of
                # the response in chunks.

                with open(download_path, 'wb') as f:
                    for chunk in tqdm(response.iter_content(chunk_size=8912)):
                        if chunk:
                            f.write(chunk)

            except Exception as e:
                print(f"=> Download Failed. Error: {e}")
                return

            print(f"=> Model: {self.model_name} downloaded successfully 🥳")

        else:
            print(
                f"Model: {self.model_name} does not exist in {self.model_folder_path}",
                "Please either download the model by setting allow_download = True or change the path"
            )

Now that our download function is written, it's time to use it and load our LLM when our MyGPT4ALL class is instantiated. Here is the code for that.

def __init__(self, model_folder_path, model_name, allow_download):
    super(MyGPT4ALL, self).__init__()
    self.model_folder_path: str = model_folder_path
    self.model_name = model_name
    self.allow_download = allow_download

    # trigger auto download
    self.auto_download()

    # load the model once the download has finished (if the model downloads)
    # GPT4All here is the class from the gpt4all package (from gpt4all import GPT4All)
    self.gpt4_model_instance = GPT4All(
        model_name=self.model_name,
        model_path=self.model_folder_path,
    )

Congratulations!! You have written and integrated a custom function inside langchain's LLM class to build your custom LLM workflow. Instead of auto_download(), it could be any other function, like the load_from_s3() sketched earlier, load_from_drive(), or maybe something from your own custom pipeline.

Defining the required functions that a custom LLM needs to implement

There are some functions that langchain requires us to implement when making a custom LLM. Those are the _identifying_params property and the _call function (plus the _llm_type property we saw in the skeleton above).

_identifying_params tells us which parameters (like temperature, top-p, top-k) our LLM is using by default. Here is how we implement that.

@property
def _get_model_default_parameters(self):
    return {
        "max_tokens": self.max_tokens,
        "n_predict": self.n_predict,
        "top_k": self.top_k,
        "top_p": self.top_p,
        "temp": self.temp,
        "n_batch": self.n_batch,
        "repeat_penalty": self.repeat_penalty,
        "repeat_last_n": self.repeat_last_n,
    }

@property
def _identifying_params(self) -> Mapping[str, Any]:
    """
    Get all the identifying parameters
    """
    return {
        'model_name': self.model_name,
        'model_path': self.model_folder_path,
        'model_parameters': self._get_model_default_parameters
    }

Now, instead of model_path we could have s3_path if we were downloading from S3 or any other source. The point is, we can dump all the parameters and all the information we want to see when inspecting the model's properties.
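As a quick check (assuming the class is fully assembled as shown later in this post), printing the instantiated LLM should surface exactly these parameters, since langchain's base LLM class uses _identifying_params in its string representation (the exact output format depends on your langchain version):

llm = MyGPT4ALL(
    model_folder_path='./models/',
    model_name='ggml-gpt4all-j-v1.3-groovy.bin',
    allow_download=True,
)

# prints the model name, model path and the default generation parameters
print(llm)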

Defining the _call function

We are almost at the end. It's time to define our main function, the _call function. Below is a very simple implementation that uses gpt4all's built-in generate function.

def _call(self, prompt: str, stop: Optional[List[str]] = None, **kwargs) -> str:
    """
    Args:
        prompt: The prompt to pass into the model.
        stop: A list of strings to stop generation when encountered

    Returns:
        The string generated by the model
    """

    params = {
        **self._get_model_default_parameters,
        **kwargs
    }

    response = self.gpt4_model_instance.generate(prompt, **params)
    return response

Here **kwargs can add additional parameters (if any) or change the existing default parameters (temp, top_k, top_p, etc.). The above code is only a few lines, but you can still do wonders and paint your endless horizon of creativity.
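For example, assuming your langchain version forwards extra keyword arguments from the call down to _call, you could override a couple of the defaults for a single call like this:

# uses the class defaults (temp=0.7, top_k=40, ...)
llm('Explain what a binary search tree is.')

# overrides temp and max_tokens just for this call via **kwargs
llm('Explain what a binary search tree is.', temp=0.2, max_tokens=100)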

For example, you may want to output a stream of tokens rather than waiting to get the whole answer at once.
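Here is a rough sketch of how a streaming variant of _call could look, assuming the gpt4all Python bindings' generate() accepts a streaming=True flag that yields tokens one at a time (check the version you have installed):

def _call(self, prompt: str, stop: Optional[List[str]] = None, **kwargs) -> str:
    params = {
        **self._get_model_default_parameters,
        **kwargs
    }

    # with streaming=True, generate() yields tokens one by one
    response = ""
    for token in self.gpt4_model_instance.generate(prompt, streaming=True, **params):
        print(token, end="", flush=True)  # print each token as it arrives
        response += token
    return response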

Want to do something more advanced? You can even stream these tokens over an API to provide the response as a stream. Amazing, right? Literally like your own ChatGPT. GPT4All has amazing functionality: you can have built-in chat sessions that capture the chat (prompts and responses), which you can store or use to add more context, etc.

Congratulations 🥳 You have just created your own LLM pipeline using gpt4all and langchain. You might ask why we are using langchain at all; you could literally make a simpler class and do the same thing.

Well, yes, you can surely do that if you want. The only thing is, if your class is compatible with langchain, then you can take advantage of all the other amazing functionality that langchain provides with your own LLM. Here is the full code of our class.

import os
from pydantic import Field
from typing import List, Mapping, Optional, Any
from langchain.llms.base import LLM
from gpt4all import GPT4All

class MyGPT4ALL(LLM):
    """
    A custom LLM class that integrates gpt4all models

    Arguments:

    model_folder_path: (str) Folder path where the model lies
    model_name: (str) The name of the model to use (<model name>.bin)
    allow_download: (bool) whether to download the model or not

    backend: (str) The backend of the model (Supported backends: llama/gptj)
    n_threads: (int) The number of threads to use
    n_predict: (int) The maximum number of tokens to generate
    temp: (float) Temperature to use for sampling
    top_p: (float) The top-p value to use for sampling
    top_k: (int) The top-k value to use for sampling
    n_batch: (int) Batch size for prompt processing
    repeat_last_n: (int) Last n number of tokens to penalize
    repeat_penalty: (float) The penalty to apply to repeated tokens

    """
    model_folder_path: str = Field(None, alias='model_folder_path')
    model_name: str = Field(None, alias='model_name')
    allow_download: bool = Field(None, alias='allow_download')

    # all the optional arguments

    backend: Optional[str] = 'llama'
    temp: Optional[float] = 0.7
    top_p: Optional[float] = 0.1
    top_k: Optional[int] = 40
    n_batch: Optional[int] = 8
    n_threads: Optional[int] = 4
    n_predict: Optional[int] = 256
    max_tokens: Optional[int] = 200
    repeat_last_n: Optional[int] = 64
    repeat_penalty: Optional[float] = 1.18

    # the underlying gpt4all model instance (initialized in __init__)
    gpt4_model_instance: Any = None

    def __init__(self, model_folder_path, model_name, allow_download, **kwargs):
        super(MyGPT4ALL, self).__init__()
        self.model_folder_path: str = model_folder_path
        self.model_name = model_name
        self.allow_download = allow_download

        # trigger auto download
        self.auto_download()

        # load the model once the download has finished (if the model downloads)
        self.gpt4_model_instance = GPT4All(
            model_name=self.model_name,
            model_path=self.model_folder_path,
        )

    def auto_download(self) -> None:
        """
        This method will download the model to the specified path
        reference: python.langchain.com/docs/modules/model_io/models/llms/integrations/gpt4all
        """

        # import the dependencies only needed for downloading
        import requests
        from tqdm import tqdm

        # make sure the model name ends with .bin

        model_name = (
            f"{self.model_name}.bin"
            if not self.model_name.endswith(".bin")
            else self.model_name
        )

        download_path = os.path.join(self.model_folder_path, model_name)

        if not os.path.exists(download_path):
            if self.allow_download:

                # send a GET request to the URL to download the file.
                # stream it while downloading, since the file is large

                try:
                    url = f'http://gpt4all.io/models/{model_name}'

                    response = requests.get(url, stream=True)
                    # open the file in binary mode and write the contents of
                    # the response in chunks.

                    with open(download_path, 'wb') as f:
                        for chunk in tqdm(response.iter_content(chunk_size=8912)):
                            if chunk:
                                f.write(chunk)

                except Exception as e:
                    print(f"=> Download Failed. Error: {e}")
                    return

                print(f"=> Model: {self.model_name} downloaded successfully 🥳")

            else:
                print(
                    f"Model: {self.model_name} does not exist in {self.model_folder_path}",
                    "Please either download the model by setting allow_download = True or change the path"
                )

    @property
    def _get_model_default_parameters(self):
        return {
            "max_tokens": self.max_tokens,
            "n_predict": self.n_predict,
            "top_k": self.top_k,
            "top_p": self.top_p,
            "temp": self.temp,
            "n_batch": self.n_batch,
            "repeat_penalty": self.repeat_penalty,
            "repeat_last_n": self.repeat_last_n,
        }

    @property
    def _identifying_params(self) -> Mapping[str, Any]:
        """
        Get all the identifying parameters
        """
        return {
            'model_name': self.model_name,
            'model_path': self.model_folder_path,
            'model_parameters': self._get_model_default_parameters
        }

    @property
    def _llm_type(self) -> str:
        return 'llama'

    def _call(self, prompt: str, stop: Optional[List[str]] = None, **kwargs) -> str:
        """
        Args:
            prompt: The prompt to pass into the model.
            stop: A list of strings to stop generation when encountered

        Returns:
            The string generated by the model
        """

        params = {
            **self._get_model_default_parameters,
            **kwargs
        }

        response = self.gpt4_model_instance.generate(prompt, **params)
        return response
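And here is what using the class end to end could look like, reusing the prompt template from the beginning of the post (the model name and question are just examples):

from langchain import PromptTemplate, LLMChain

llm = MyGPT4ALL(
    model_folder_path='./models/',
    model_name='ggml-gpt4all-j-v1.3-groovy.bin',
    allow_download=True,
)

template = """
Let's think step by step of the question: {question}
Based on all the thought the final answer becomes:
"""
prompt = PromptTemplate(template=template, input_variables=["question"])

llm_chain = LLMChain(prompt=prompt, llm=llm, verbose=True)
print(llm_chain.run('Who is the CEO of Google?'))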

What’s Next?

The next part of this mini-series covers two things:

  1. How do we integrate llama-index with our custom LLM?
  2. How we can integrate knowledge bases with our custom LLM pipeline

LLMs are getting super hyped up, as they should be, given their incredible number of use cases. However, it is important to know how to build end-to-end applications around them. Writing just 10 lines of code with OpenAI's API is just not enough. Hence, in this mini-series, I am documenting my learning journey of how we can:

  1. Integrate custom open-source LLMs for in-house LLM-powered applications
  2. Integrate a bunch of services and cloud workflows with tools like langchain
  3. Extend existing ML pipelines and workflows to LLMs and provide better, more interpretable solutions (for example, combining Amazon's amazing recommendation system with ChatGPT to get the best product without having to be precise)

For more updates, please follow my page and stay tuned. You can find the full code in this GitHub repo.

Acknowledgments and References

Recently, I started my open-source journey by contributing to langchain. I specifically tried to improve the existing token streaming for the gpt4all integration. There I found the amazing code showing how the contributors integrated gpt4all with langchain. The above code is a simplified version of it. You can check out the original integration of gpt4all with langchain here.
