
Fine Tuning Open Source LLMs like Llama 3.1, Mistral and Gemma - The Right Way!

September 17, 2025
Learn how you can train open-source models like Llama or Gemma on your own knowledge base
Muhammad Noraiz
Senior Software Engineer

Fine-Tuning Open-Source Large Language Models: A Comprehensive Guide

Fine-tuning open-source models lets businesses address complex, domain-specific use cases with AI without depending on closed-source enterprise models. It lets engineers adapt the capabilities of a pre-trained model to a specific task.

In this blog, we'll go through the detailed process of fine-tuning open-source Large Language Models using publicly available tools and frameworks.

Choosing an Open-Source Model

There are many open-source models available, such as LLaMA, Mistral, and Gemma. Choose the one that best suits your needs. For this blog, we'll be using LLaMA by Meta. LLaMA 3.1 is available in three parameter sizes (8B, 70B, and 405B). For this guide, we'll be using the LLaMA 3.1 8B model, the smallest of the three and the easiest to run on a single GPU.

Downloading and Running Open-Source Models on Your Local Machine or Server

The easiest way to run an open-source LLM on your local machine or server is by using Ollama.

Ollama is available for macOS, Windows, and Linux. After downloading and installing Ollama on your local machine or server, you can access it from your operating system's command line.

Run the Ollama server:

ollama serve

Download a model using Ollama:

ollama pull llama3.1:8b

Accessing Models Using the Ollama API

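To query a downloaded model through the Ollama API, send the model name and the prompt in the request body. Use the same tag you pulled (llama3.1:8b here). By default the endpoint streams its answer as a series of JSON objects; setting "stream" to false returns a single JSON response instead.

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'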
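The same request can be made from Python. The sketch below is a minimal example using only the standard library; it assumes an Ollama server is running on the default port 11434, and the helper names `build_generate_request` and `generate` are illustrative, not part of any Ollama client.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for Ollama's /api/generate endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one JSON object instead of a stream of chunks
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to a running Ollama server and return the generated text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["response"]
```

With the server running, `generate("llama3.1:8b", "Why is the sky blue?")` should return the model's answer as a string. Pass the exact tag you pulled as the model name.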

Fine-tuning LLaMA Using Unsloth

What is Unsloth?

Unsloth is an open-source library that accelerates fine-tuning of LLMs like LLaMA-3, Mistral, Phi-3, and Gemma. According to the project, it makes fine-tuning about twice as fast and uses up to 70% less memory, without compromising accuracy; the actual gains depend on your model, hardware, and settings. You can use Unsloth's Google Colab notebook or write a Python script to run the fine-tuning on your own server.

Step-by-Step Guide for Fine-tuning with Unsloth

Install Unsloth

First, install Unsloth and its dependencies:

pip install unsloth
pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps xformers trl peft accelerate bitsandbytes

Prepare Your Dataset

Create a dataset in the format Unsloth expects. Here's an example of how to structure your data:

dataset = [
    {
        "instruction": "Translate to French: Hello, how are you?",
        "output": "Bonjour, comment allez-vous?",
    },
    {
        "instruction": "Summarize this text: [Your long text here]",
        "output": "[Your summary here]",
    },
    # Add more examples...
]
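In practice, the examples usually live in a file rather than in the script. The sketch below assumes a JSON Lines file with one object per line, each holding the "instruction" and "output" keys shown above. The function name `load_records` and the file layout are assumptions for illustration, not something Unsloth requires.

```python
import json
from pathlib import Path


def load_records(path: str) -> list[dict]:
    """Read instruction/output pairs from a JSON Lines file.

    Rows with a missing or empty field and exact duplicates are skipped.
    """
    records = []
    seen = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        row = json.loads(line)
        instruction = str(row.get("instruction", "")).strip()
        output = str(row.get("output", "")).strip()
        if not instruction or not output:
            continue  # both fields are required
        if (instruction, output) in seen:
            continue  # drop exact duplicates
        seen.add((instruction, output))
        records.append({"instruction": instruction, "output": output})
    return records
```

Calling `dataset = load_records("train.jsonl")` (a hypothetical file name) yields a list in the same shape as the inline example.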

Load the Pre-trained Model

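Load the pre-trained LLaMA model in 4-bit precision using Unsloth, then attach LoRA adapters. A 4-bit quantized base model cannot be trained directly, so the adapters are the weights that actually get updated during fine-tuning:

from unsloth import FastLanguageModel

# Load the 4-bit quantized base model and its tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3.1-8b-bnb-4bit",
    max_seq_length=2048,
    dtype=None,  # None lets Unsloth pick a dtype for your GPU
    load_in_4bit=True,
)

# Attach LoRA adapters (rank and alpha are common starting values, not tuned)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)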
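Much of the memory saving comes from training only small LoRA adapter matrices on top of the frozen 4-bit weights. The sketch below estimates how many parameters such adapters add. It assumes the published Llama 3.1 8B dimensions and adapters on all seven attention and MLP projections; check both against the checkpoint and settings you actually use.

```python
# Llama 3.1 8B dimensions, taken from the published model config (assumption:
# verify against config.json of the checkpoint you actually load).
HIDDEN = 4096          # hidden_size
INTERMEDIATE = 14336   # intermediate_size of the MLP
KV_DIM = 1024          # 8 key/value heads x 128 dims per head
NUM_LAYERS = 32

# (input_dim, output_dim) of each linear layer that commonly receives an adapter
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_trainable_params(rank: int) -> int:
    """LoRA adds two matrices per adapted layer: (in x rank) and (rank x out)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in PROJECTIONS.values())
    return per_layer * NUM_LAYERS


print(lora_trainable_params(16))  # 41943040
```

At rank 16 this is roughly 42 million trainable parameters, about half a percent of the 8B base model, which is why optimizer state and gradients stay small.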

Configure Training Parameters

Set up your training configuration:

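from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",          # checkpoints are written here
    num_train_epochs=3,              # full passes over the dataset
    per_device_train_batch_size=4,   # examples per step on each GPU
    warmup_steps=500,                # lower this for small datasets
    weight_decay=0.01,               # mild regularization
    logging_dir="./logs",            # training logs are written here
)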
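It helps to check how many optimizer steps these settings produce, because 500 warmup steps can be longer than the whole run on a small dataset. The sketch below is an approximation for single-GPU training with the default gradient accumulation of 1; the helper name `total_optimizer_steps` is illustrative.

```python
import math


def total_optimizer_steps(num_examples, batch_size, epochs, grad_accum=1):
    """Approximate number of optimizer steps for single-device training."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return steps_per_epoch * epochs


# Compare the 500 warmup steps with the length of the run for a few dataset sizes
for n in (200, 1000, 10000):
    steps = total_optimizer_steps(n, batch_size=4, epochs=3)
    print(n, steps, f"warmup share: {min(500 / steps, 1.0):.0%}")
```

With 200 examples the run has only 150 steps, so the learning rate would never finish warming up. A warmup of a few percent of the total steps is a more common starting point.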

Fine-tune the Model

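Fine-tune the model on your dataset with TRL's SFTTrainer. The trainer reads its training text from a single column, so each instruction/output pair is first merged into one "text" string:

from datasets import Dataset
from trl import SFTTrainer

# Merge each pair into one training string; the prompt layout is one common
# convention, and the EOS token teaches the model where to stop.
def to_text(example):
    prompt = f"### Instruction:\n{example['instruction']}\n\n### Response:\n"
    return {"text": prompt + example["output"] + tokenizer.eos_token}

train_dataset = Dataset.from_list(dataset).map(to_text)

# Note: newer TRL releases expect dataset_text_field and max_seq_length
# in an SFTConfig rather than as SFTTrainer arguments.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    args=training_args,
    dataset_text_field="text",
    max_seq_length=2048,
)
trainer.train()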

Save the Fine-tuned Model

After training, save your fine-tuned model locally or push it to the Hugging Face Hub:

# Local saving
model.save_pretrained("./fine_tuned_model")
tokenizer.save_pretrained("./fine_tuned_model")

# Push the fine-tuned model to Hugging Face Hub (Optional):
# model.push_to_hub("your_name/fine_tuned_model", token="...")
# tokenizer.push_to_hub("your_name/fine_tuned_model", token="...")

Save the Model in GGUF Format for Ollama

To use your fine-tuned model with Ollama, you'll need to convert it to the GGUF format. This conversion allows you to create and use your fine-tuned model within Ollama's ecosystem.

# Save the model in GGUF format
model.save_pretrained_gguf("./model", tokenizer, quantization_method="q4_k_m")

After converting your model to GGUF format, you can create a new Ollama model using the following command:

ollama create model_name -f ./model/Modelfile

This command creates a new Ollama model named model_name using the Modelfile in the ./model directory. The Modelfile contains the necessary information for Ollama to use your fine-tuned model.
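If no Modelfile is present in the output directory, one can be written by hand. The sketch below renders a minimal Modelfile as text. The helper `render_modelfile` and the GGUF file name are hypothetical, and only the FROM, PARAMETER, and SYSTEM instructions are used.

```python
def render_modelfile(gguf_path: str, temperature: float = 0.7, system: str = "") -> str:
    """Return the text of a minimal Ollama Modelfile for a local GGUF file."""
    lines = [
        f"FROM {gguf_path}",  # weights to load
        f"PARAMETER temperature {temperature}",  # default sampling temperature
    ]
    if system:
        lines.append(f'SYSTEM """{system}"""')  # optional system prompt
    return "\n".join(lines) + "\n"


# Hypothetical file name; use whatever save_pretrained_gguf actually wrote.
print(render_modelfile("./model.gguf", system="You are a helpful translator."))
```

A fine-tuned model usually also needs a TEMPLATE instruction that matches the prompt format used in training, so treat this as a starting point rather than a complete Modelfile.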

Test the Fine-tuned Model

Finally, test your fine-tuned model with some example inputs:

# Switch the model to Unsloth's faster inference mode
FastLanguageModel.for_inference(model)

# Test with a sample prompt (format it the same way as your training examples)
inputs = tokenizer(
    ["Translate to French: Hello, how are you?"],
    return_tensors="pt",
).to("cuda")

outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)
result = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(result)
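Checking a single prompt by eye only goes so far. The sketch below scores a set of held-out examples with a simple exact-match rate. It is an illustrative helper, not part of Unsloth: `generate_fn` stands for any function you supply that takes a prompt and returns the model's answer as a string.

```python
from typing import Callable


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def exact_match_rate(examples: list[dict], generate_fn: Callable[[str], str]) -> float:
    """Fraction of examples whose generated answer matches the reference output."""
    if not examples:
        return 0.0
    hits = sum(
        normalize(generate_fn(ex["instruction"])) == normalize(ex["output"])
        for ex in examples
    )
    return hits / len(examples)
```

Exact match suits tasks with one correct answer, such as classification or short extraction. For open-ended tasks like translation or summarization, a softer metric or manual review is a better fit.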

Conclusion

This step-by-step guide provides a basic framework for fine-tuning with Unsloth. From here, you can extend each step with data preprocessing, model evaluation on held-out examples, and more advanced fine-tuning techniques.

You can also push your newly created fine-tuned Ollama model to your Ollama account, allowing you to download and use it anywhere.

RipeSeed - All Rights Reserved ©2025