Practical insights while fine-tuning LLMs for domain-specific use cases and best practices

Anindyadeep
10 min read · Aug 21, 2023


With Aditya Khandekar, CorridorPlatforms

Image by AI for another AI

Our last post focused on building an analytical pipeline, setting up our experimentation environment, and best practices around it. The use case we worked on was a text classification task: given a user complaint, classify it into the category it belongs to. To learn more about how we defined our use case and set up our analytical pipeline, please review our previous blog, Practitioners Guide to Fine-tune LLMs for domain-specific use cases.

Several factors come into play when making an LLM work for a particular domain-specific use case: the model configuration, the need for fine-tuning, different prompt structures, the format of the dataset used to train the foundation model, and so on. Each of these factors contributes significantly to the performance of our LLMs. Let’s discuss them one by one.

What we have so far

Before diving into our insights, it is essential to baseline the models, datasets, and prompts we used, as those form the basis of the insights we share in this blog.

Dataset: The dataset is a text classification dataset where each sample is a user complaint in text form, along with a label that categorizes the complaint into one of five categories: ‘credit reporting’, ‘debt collection’, ‘mortgages and loans’, ‘credit cards’, and ‘retail banking’. You can find the dataset on Kaggle.

Figure 1: A sample of our dataset

Prompts: For our experiments, we considered two types of prompts. Please note: in both the images below, the text marked in green is the actual customer complaint, yellow is the prompt, red is the instruction part of the prompt, and blue is the ground truth (used for training).

  1. Simple prompting: where we augmented our customer complaint to look like a text classification task. Here is an example; we will call the prompt below a “simple prompt”.
Figure 2: A simple prompt that augments the customer complaint to frame it as a classification task.

2. Instruction prompting: Here we first provided an external instruction (explicitly telling our LLM to do classification and to abide by certain rules while it classifies). We will call the prompt below an “instruction prompt”.

Figure 3: An example of an instruction prompt, containing the instruction (red), input user complaint (green), output label (blue), and additional prompt text (yellow)
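To make the two formats concrete, here is a minimal sketch of how the two prompt templates could be assembled. The helper names, the instruction wording, and the exact field layout are illustrative assumptions, not the exact templates we used in our experiments.

```python
# Illustrative sketch of the two prompt formats; the wording is an assumption,
# not the exact templates used in our experiments.

CLASSES = [
    "credit reporting", "debt collection",
    "mortgages and loans", "credit cards", "retail banking",
]

def build_simple_prompt(complaint: str, label: str | None = None) -> str:
    """Frame the complaint as a text-completion-style classification task."""
    prompt = f"Customer complaint: {complaint}\nThis complaint is classified into the category:"
    # During training we append the ground-truth label; at inference we leave it blank.
    return f"{prompt} {label}" if label else prompt

def build_instruction_prompt(complaint: str, label: str | None = None) -> str:
    """Prepend an explicit instruction that constrains the model's response."""
    instruction = (
        "Classify the following customer complaint into exactly one of these "
        f"categories: {', '.join(CLASSES)}. Respond with the category only."
    )
    prompt = f"{instruction}\n\nComplaint: {complaint}\nCategory:"
    return f"{prompt} {label}" if label else prompt
```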

Model and fine-tuning method: We used Falcon, a decoder-only causal language foundation model with 7 billion parameters. Since these models fundamentally perform language completion, there are two ways we can fine-tune them (classified based on the type of input prompt).

  1. Traditional fine-tuning, where our prompt mimics the text completion task as classification. The prompt shown in Figure 2 is the input to our LLM in that case.
  2. Instruction fine-tuning, where our prompt follows the instruction prompt format (Figure 3). Here the instruction provided to the model acts as a signal to constrain the LLM’s response.

There is no real difference between the two when it comes to the fine-tuning procedure or pipeline; the way we construct our data with prompts is what makes the key difference. Please note that we are practicing supervised fine-tuning here (no Reinforcement Learning from Human Feedback is involved). To know more about LLM fine-tuning, we highly recommend this blog.
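This post does not walk through the training code itself, but as a rough reference, here is a minimal supervised fine-tuning sketch with Hugging Face transformers and peft. The use of LoRA adapters, the hyperparameters, and the toy training text below are illustrative assumptions, not our exact pipeline.

```python
# Minimal supervised fine-tuning sketch (LoRA adapters, hyperparameters, and the
# toy data are assumptions for illustration, not our exact configuration).
import torch
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model

model_name = "tiiuae/falcon-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, trust_remote_code=True
)
# Attach small trainable LoRA adapters instead of updating all 7B parameters.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["query_key_value"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# Toy training text; in practice these are the simple or instruction prompts
# built from the full dataset, with the ground-truth label appended.
prompts = [
    "Customer complaint: I was charged twice for the same purchase.\n"
    "This complaint is classified into the category: credit cards",
]
train_ds = Dataset.from_dict({"text": prompts}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="falcon-7b-complaints",
                           per_device_train_batch_size=4,
                           num_train_epochs=3, learning_rate=2e-4, bf16=True),
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```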

Soft Accuracy: This is an informal accuracy metric we use to quantify our model’s performance on the classification task. Unlike classical classification models, our LLM does not output a discrete class value but a sequence of tokens which, when decoded, may contain the class of interest. Hence we propose the term soft accuracy: a prediction matches the target if the target label appears anywhere in the predicted tokens. Let’s take an example:

For a given customer complaint, our
target: mortgages and loans
predicted: mortgages and loans are classified into the category: ### Assistant

Using the definition of soft accuracy, we can say the target matches the prediction. We also performed an additional check for whether more than one class appears in the output; however, the number of such cases was negligible.
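A minimal sketch of how this metric could be computed is shown below. The multi-class check is implemented here as a separate flag, which is our reading of the additional check described above; the helper names are illustrative.

```python
CLASSES = ["credit reporting", "debt collection",
           "mortgages and loans", "credit cards", "retail banking"]

def soft_accuracy(predictions: list[str], targets: list[str]) -> float:
    """A prediction counts as correct if the target label appears in the generated text."""
    hits = sum(1 for pred, target in zip(predictions, targets)
               if target.lower() in pred.lower())
    return hits / len(targets)

def has_multiple_classes(prediction: str) -> bool:
    """Flag generations that mention more than one class (rare in our runs)."""
    return sum(c in prediction.lower() for c in CLASSES) > 1

# Example from the text above: the target is contained in the generated continuation.
print(soft_accuracy(
    ["mortgages and loans are classified into the category: ### Assistant"],
    ["mortgages and loans"],
))  # -> 1.0
```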

Model’s generation configuration during inference: Parameters like temperature, top-k, top-p, maximum generation tokens, etc. affect the model’s stochasticity and generation. To keep our experiments fair, we fixed our generation config at the start of our experimentation. Here is what we chose:

Temperature: Temperature is a parameter used in language models that controls the randomness of the generated text; higher values make the output more diverse and creative, while lower values make it more focused and deterministic.

Top-p (Nucleus) Sampling: Top-p, also known as nucleus sampling, involves selecting the smallest set of words that collectively surpass a probability threshold; it’s a method for generating text that focuses on the most probable words while still allowing some diversity.

Top-k Sampling: Top-k sampling limits the vocabulary selection to the top-k most likely words at each step of text generation, which helps maintain a balance between randomness and coherence in the generated text.
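For reference, here is a sketch of how such a fixed generation configuration can be applied with Hugging Face transformers, reusing the model, tokenizer, and prompt helper from the earlier sketches. The 15-token cap is mentioned later in this post; the remaining values are illustrative placeholders, not necessarily our exact settings.

```python
from transformers import GenerationConfig

# Fixed generation settings applied across all experiments so results stay comparable.
generation_config = GenerationConfig(
    do_sample=True,
    temperature=0.7,    # lower = more deterministic, higher = more diverse
    top_p=0.9,          # nucleus sampling: smallest token set covering 90% of probability mass
    top_k=50,           # restrict sampling to the 50 most likely tokens per step
    max_new_tokens=15,  # short cap, since we only need a class label
)

# `model`, `tokenizer`, and `build_instruction_prompt` come from the sketches above.
prompt = build_instruction_prompt("I keep getting calls about a debt I never owed.")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, generation_config=generation_config)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```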

Laying down different combinations for our experiment

We experimented with use case pipelines based on these 3 parameters:

  1. Model: The state of the model (whether we are using a pre-trained LLM or fine-tuning it)
  2. Dataset: The quality and quantity of the dataset
  3. Prompts: The kind of prompts we used while training, and the kind of prompts we used during inference

Keeping the dataset factor constant, here are the combinations we built, each of which led to better results:

  1. Combination 1: Using raw Falcon 7B for inference with both simple and instruction prompts.
  2. Combination 2: Fine-tuning Falcon 7B (using simple prompts) and then doing inference with the same prompts
  3. Combination 3: Fine-tuning Falcon 7B (with simple prompts) and then doing inference with instruction prompts
  4. Combination 4: Fine-tuning Falcon 7B (with instruction prompt) and doing inference with the same

We also experimented with our models after curating the dataset for better-quality samples. More on that later in this post.

Experimentation Results and insights from combination 1

Falcon 7B for inference with both simple and instruction prompts

As mentioned earlier, in this combination we are not fine-tuning the model. At evaluation time, we used both types of prompts (Figure 2 and Figure 3) to validate the model’s performance. Since the model had not seen any prior examples, as expected, it did not perform well. Here are some sample outputs generated in this combination:

Figure 4: Model results from Combination 1

If you look at example 3, since the text contains “medical debt”, the completion also contains “medical debt ….”. All the model is doing here is randomly completing sentences with some imaginary classes. This means our language model has good language understanding; however, prompting alone was not able to teach it the domain.

We then decided to try in-context learning, where we provided some labeled examples inside the prompt (also called few-shot learning). Here is how it looked:

Figure 5: Instruction prompt with few-shot in-context learning

As you can see, we changed the prompt by providing some examples (we shortened the examples just to fit the images). But even then, our model was just printing either all the classes or one of the very first labels, i.e. credit reporting. Please head to this Weights & Biases table to see more examples of output generated by our LLM under this combination.
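For reference, a few-shot variant of the instruction prompt can be assembled roughly like this; the helper name and wording are assumptions, not our exact template.

```python
def build_few_shot_prompt(complaint: str, examples: list[tuple[str, str]]) -> str:
    """Prepend a handful of labeled examples before the complaint to classify."""
    instruction = (
        "Classify the customer complaint into one of: credit reporting, debt collection, "
        "mortgages and loans, credit cards, retail banking.\n\n"
    )
    shots = "".join(f"Complaint: {text}\nCategory: {label}\n\n" for text, label in examples)
    return f"{instruction}{shots}Complaint: {complaint}\nCategory:"
```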

The evaluation soft accuracy we got for this combination was 0%

Experimentation Results and insights from combination 2

Fine-tuning Falcon 7B (using simple prompts) and then doing inference with the same prompts.

In our previous experimental setup, we came to two very important conclusions:

  1. Our raw foundation model carries a good understanding of the language construct and semantic relationships.
  2. Our raw foundation model does not carry any domain-specific learning despite in-context learning with few-shot examples.

Hence, we fine-tuned our model to get better results. We used our training pipeline to fine-tune the model with simple prompts (Figure 2) as the input to the LLM. Here are some sample outputs of our fine-tuned Falcon 7B.

Figure 6: Results of our fine-tuned model with combination 2

In the table above, we can see good improvements. Our model is now able to classify the texts, though with some variance; for example, it also generates some additional text. This means that while our model is better than in combination 1, we have not yet fully achieved control over generation quality. You can see more examples here.

The evaluation soft accuracy we got for this combination was 44%.

Experimentation Results and insights from combination 3

Fine-tuning Falcon 7B (with simple prompts) and then doing inference with instruction prompts

In this combination, we did not do any additional fine-tuning but enhanced our inference inputs using the instruction prompt format. Our results improved! The figure below shows some examples after inference.

Figure 7: Results of our fine-tuned model with combination 3.

This time, the soft accuracy rose to 66%, and the generation was tighter, mostly limited to just the output class label. To see more results for this combination, please refer to this table.

Experimentation Results and insights from combination 4

Fine-tuning Falcon 7B (with instruction prompt) and doing inference with the same

We went one step further: we fine-tuned the model with instruction-prompted input (Figure 3) and used the same format for inference. This time, the results were quite solid. Here are some samples:

Figure 8: Sample evaluation results after we instruction-tuned our model

This model achieved a soft accuracy of over 74%, which was a huge improvement! However, when it came to controllability of the generation, there were still some issues: in almost all generations, some additional text was printed after the class label. Some examples:

  1. Credit reporting is
  2. Mortgages and Loans is ###
  3. Credit card ###

Since we had limited the maximum generation length to 15 tokens, only a small amount of additional text appears in the output. We are still researching how to control this variance. One possibility is that the model is somehow bypassing the instructions; this might be fixed by iterating further on the instruction prompts. You can find more examples here.
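One simple way to tame this trailing text in post-processing (rather than through the prompt) is to extract the first known class label from the generation; here is a sketch, reusing the CLASSES list from the soft-accuracy snippet above. This is an illustrative option, not necessarily what we will adopt.

```python
def extract_label(generated: str, classes: list[str]) -> str | None:
    """Return the first known class label mentioned in the generated text, if any."""
    text = generated.lower()
    hits = [(text.find(c), c) for c in classes if c in text]
    return min(hits)[1] if hits else None

# CLASSES is the label list defined in the soft-accuracy sketch above.
print(extract_label("Mortgages and Loans is ###", CLASSES))  # -> "mortgages and loans"
```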

Additional insights when fine-tuned on high-quality data

Until now, our combinations varied only the model’s state and which prompts were used, and how, during fine-tuning and inference.

An additional parameter to vary is the quality of the dataset (note that in combinations 1–4, we kept the dataset constant). While running our experiments, we noticed some issues with the underlying dataset. Here are some examples:

  1. There was a lot of null text (although we removed those initially)
  2. There was some gibberish text like “aaa aaaaaaa aaaaaaa”, or “hello my credit card is now working credit card is now working credit card is now working …”
  3. Some texts were just 1 to 3 words long
  4. Most importantly, some texts did not make sense semantically. For example, some were labeled as “credit reporting”, but when we checked them and consulted a domain expert, they better fit the “debt collection” category.

To solve these problems, we leveraged a human expert to manually annotate the data, rating the quality of each text and its classification on a scale of 1 (bad) to 5 (very good). We annotated a set of examples and, out of those, sub-selected 120 examples with quality ≥ 3.5.
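The filtering step itself is straightforward; here is a sketch with pandas, assuming hypothetical file and column names (text, label, quality_score) for the annotated data.

```python
import pandas as pd

# Hypothetical file and column names; the annotation schema here is illustrative.
annotated = pd.read_csv("annotated_complaints.csv")      # columns: text, label, quality_score
curated = annotated[annotated["quality_score"] >= 3.5]   # keep only high-quality samples
print(len(curated))                                      # ~120 examples in our case
curated[["text", "label"]].to_csv("curated_complaints.csv", index=False)
```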

We fine-tuned our LLM on this dataset, ran inference on our validation set, and got a soft accuracy of 67% with the most controllable output so far. One potential area of improvement would be reducing the class imbalance in our labeled training dataset; most of the data carried the label “credit reporting”. Nevertheless, this approach gave us the impetus to move forward and continue with this combination.

What we learned and where we are headed

So far, we have run many experiments, and in each iteration, changing a small parameter led to a new set of results that iteratively improved the overall performance of our LLM.

To wrap up, here is our summary table of all the combinations we discussed so far:

Figure 9: Overall summary of our LLM’s performance under different conditions

Some of the key insights we learned are as follows:

  1. Fine-tuning is an important step if we want to build domain-specific use cases. Just prompt engineering on the foundation model might not be sufficient. Augmenting the labeled dataset with prompts before fine-tuning does improve the model performance.
  2. The quality of labeled datasets (as evaluated by a human annotator) can be powerfully coupled with fine-tuning to drive higher-quality results — even with a smaller sample of quality examples. More research is required in this area.

As a final step, we will combine the best parameters from all the combinations: fine-tuning a foundation model on more curated, higher-quality (balanced) data, with instruction prompting during inference.

In our next blog, we will discuss the results of the final combination, along with some insights on how we approached getting good quality data programmatically instead of a completely manual human-in-the-loop process.

Also, we will compare the results of an encoder-only model (BERT) and an encoder-decoder model (Flan-T5) against our Falcon-based decoder-only approach.

Stay tuned. Cheers
