Beyond the Black Box — Demystifying LLM fine-tuning with data-driven insights

Have you ever wondered how AI models can be tailored to meet your specific needs? Let's dive into the world of fine-tuning large language models (LLMs) to find out!
LLMs are exceptional tools: portable, packed with information, and deployable almost anywhere. Writers leverage them to overcome writer’s block, developers use them as pair programmers, and students see them as a last-minute lifeline for assignments. While LLMs are powerful in their default state, their true potential shines when personalized for specific tasks.
Let's explore how to personalize portable LLMs without diving too deeply into technical details.
Using Retrieval-Augmented Generation (RAG)
RAG enhances an LLM by retrieving relevant documents, typically found via embedding similarity, and supplying them as extra context alongside the prompt. Because the model itself is unchanged, this personalization is temporary and happens per request. For more enduring customization, fine-tuning is employed.
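To make this concrete, here is a minimal, illustrative RAG loop. The embedding model, documents, and prompt template are assumptions for the sketch, not our production stack:

```python
# Minimal RAG sketch: retrieve the most relevant document by embedding
# similarity and prepend it to the prompt. The model and documents are
# illustrative placeholders.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Employees accrue 1.5 vacation days per month.",
    "Expense reports must be filed within 30 days.",
]  # hypothetical HR snippets
doc_embeddings = embedder.encode(documents, convert_to_tensor=True)

def build_prompt(question: str, top_k: int = 1) -> str:
    # Embed the question and find the closest document(s).
    q_emb = embedder.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, doc_embeddings, top_k=top_k)[0]
    context = "\n".join(documents[hit["corpus_id"]] for hit in hits)
    # The retrieved text enriches the input; the model itself is unchanged.
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

print(build_prompt("How many vacation days do I get?"))
```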
Fine-Tuning – Overview and Experiments
Fine-tuning adjusts only a portion of an LLM’s parameters rather than training the entire model from scratch. In this selective training, most model weights remain "frozen" while a small set is updated, so the new information is baked into the model itself and persists across requests, unlike RAG.
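To make "frozen" concrete: in a typical parameter-efficient setup, gradients stay disabled on the base weights and only small adapter matrices are trained. Here is a minimal sketch using the Hugging Face peft library; the model name and hyper-parameters are illustrative, not our exact configuration:

```python
# Sketch of parameter-efficient fine-tuning: the base model stays frozen and
# only small LoRA adapter matrices are trained. Values are illustrative.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.3")

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor for the adapter output
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # freezes base weights automatically
model.print_trainable_parameters()          # typically well under 1% of the total
```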
Our fine-tuning experiments used internal HR documentation, supported by an in-house framework for fine-tuning, inference, evaluation, and experiment tracking. We tested models including Llama 3 8B, Mistral 7B v0.2 and v0.3, Gemma 2B, and Phi 2 medium.

We applied various fine-tuning methods, with and without RAG, which provided insights into selecting the right LLM for a specific task.
Technical Breakdown
High-quality datasets are crucial for effective fine-tuning. Our framework generates them from source documents using Ollama and prompt engineering. Contentstack provided an NVIDIA A10 GPU with 24 GB of VRAM for the experiments, and each fine-tuning run took about 1.5 hours.
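A rough sketch of that generation step might look like the following; the model choice and prompt template are assumptions, not our exact pipeline:

```python
# Sketch: generate a Q&A training pair from a source document with a local
# Ollama model. The prompt and model name are illustrative assumptions.
import json
import ollama

doc = "Employees accrue 1.5 vacation days per month."  # hypothetical HR snippet

response = ollama.chat(
    model="llama3",
    messages=[{
        "role": "user",
        "content": (
            "Write one question a user might ask about the text below, then "
            "answer it. Reply as JSON with keys 'question' and 'answer'.\n\n"
            + doc
        ),
    }],
    format="json",  # ask Ollama to constrain the reply to valid JSON
)
pair = json.loads(response["message"]["content"])
print(pair["question"], "->", pair["answer"])
```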
We employed several methodologies and found that QLoRA (quantized LoRA) with 4-bit precision yielded the best results. To optimize the fine-tuning process, we also tested various hyper-parameters, a time-consuming task, particularly for models like Mistral 7B v0.2.
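For context, 4-bit QLoRA is commonly configured through the bitsandbytes integration in Transformers. The sketch below follows the widely used NF4 recipe; the exact values are illustrative, not our production settings:

```python
# Sketch: load a model in 4-bit for QLoRA fine-tuning via bitsandbytes.
# LoRA adapters are then attached as usual and trained in higher precision.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # normal-float 4-bit quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for stability
    bnb_4bit_use_double_quant=True,         # also quantize quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.3",  # illustrative; any supported causal LM works
    quantization_config=bnb_config,
    device_map="auto",            # place layers on the available GPU(s)
)
```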
After determining the optimal hyper-parameters, we conducted causal and instruct fine-tuning, saving the adapters to a private Hugging Face repository for easy deployment. The evaluation stage assessed the fine-tuned models against a curated dataset using metrics like BLEU, ROUGE, and n-gram similarity. Additionally, a RAGAS-based evaluation was implemented for subjective scoring of outputs on accuracy and relevance.
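As an illustration of the objective-metrics side, BLEU and ROUGE can be computed with the Hugging Face evaluate library; the prediction and reference strings below are placeholders, not our HR dataset:

```python
# Sketch: score a model output against a curated reference with BLEU and
# ROUGE via the Hugging Face `evaluate` library. Strings are placeholders.
import evaluate

predictions = ["Employees accrue 1.5 vacation days per month."]
references = [["Staff earn 1.5 days of vacation each month."]]

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

print(bleu.compute(predictions=predictions, references=references))
print(rouge.compute(predictions=predictions, references=references))
```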
Outcome and Findings
Results revealed:
- Instruct fine-tuning with RAG underperformed compared to the base model for Mistral 7B v0.3 and Llama 3 8B.
- Causal fine-tuning outperformed instruct fine-tuning for Mistral 7B v0.3.
- Smaller models like Gemma 2B benefited significantly from instruct fine-tuning with RAG.
These findings underscore the heuristic nature of working with LLMs: performance varies considerably with the use case.

We also analyzed raw textual metrics such as BLEU, ROUGE, and n-gram similarity. These metrics, averaged over all prompts, provide additional insights but should be interpreted cautiously.
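"N-gram similarity" has no single standard definition; one simple variant, an assumption here rather than necessarily the in-house framework's formula, is Jaccard overlap between the n-gram sets of two texts:

```python
# Simple n-gram similarity as Jaccard overlap of n-gram sets. This is one
# common variant; the in-house metric may be defined differently.
def ngrams(text: str, n: int = 2) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_similarity(a: str, b: str, n: int = 2) -> float:
    na, nb = ngrams(a, n), ngrams(b, n)
    if not na or not nb:
        return 0.0
    return len(na & nb) / len(na | nb)

# Shares 3 of 7 distinct bigrams -> about 0.43.
print(ngram_similarity("the cat sat on the mat", "the cat sat on a mat"))
```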

The experiments demonstrated that fine-tuning substantially benefits smaller models, while larger models show less significant gains. These insights enable the selection of cost-effective models tailored to specific tasks.
Conclusion
Fine-tuning LLMs can significantly enhance their performance for specific tasks, especially for smaller models. By carefully selecting and customizing LLMs, users can achieve cost-effective, high-performing AI solutions. Explore these techniques in your own projects and unlock the full potential of LLMs.
About Contentstack
The Contentstack team comprises highly skilled professionals specializing in product marketing, customer acquisition and retention, and digital marketing strategy. With extensive experience holding senior positions at renowned technology companies across Fortune 500, mid-size, and start-up sectors, our team offers impactful solutions based on diverse backgrounds and extensive industry knowledge.
Contentstack is on a mission to deliver the world’s best digital experiences through a fusion of cutting-edge content management, customer data, personalization, and AI technology. Iconic brands, such as AirFrance KLM, ASICS, Burberry, Mattel, Mitsubishi, and Walmart, depend on the platform to rise above the noise in today's crowded digital markets and gain their competitive edge.
In January 2025, Contentstack proudly secured its first-ever position as a Visionary in the 2025 Gartner® Magic Quadrant™ for Digital Experience Platforms (DXP). Further solidifying its prominent standing, Contentstack was recognized as a Leader in the Forrester Research, Inc. March 2025 report, “The Forrester Wave™: Content Management Systems (CMS), Q1 2025.” Contentstack was the only pure headless provider named as a Leader in the report, which evaluated 13 top CMS providers on 19 criteria for current offering and strategy.
Follow Contentstack on LinkedIn.
