Today we’re diving into something really cool: fine-tuning YOLO models for soccer detection. We’re going to detect balls, referees, players, and goalkeepers in soccer matches using Ultralytics’ fine-tuning tools.
The popular YOLO models you’ll find out there, like Ultralytics YOLO11 and YOLOX, come in different sizes and are mostly pre-trained on something called the COCO dataset.

What’s COCO? It’s a massive dataset containing 80 different object classes, things like person, bicycle, car, dog, cat, even kites. Basically, everyday objects you’d see around. These models work incredibly well for generic scenarios. If you’re trying to detect people and cars on a street, they’re fantastic right out of the box.
But here’s where it gets interesting, and this is the real magic of pre-trained models. They’re not just good at detecting those 80 classes. They’re actually amazing as a foundation, as a starting point we can build upon. We can take these models and fine-tune them on our specific datasets to make them specialists in whatever domain we care about.
You might be thinking, “Why not just train a model from scratch?”. Well, training from scratch is a tough road, especially for YOLO models. It requires an absolutely massive dataset (like COCO), thousands of images per class (perfectly labeled), serious compute power and time… and even after all that work, the final accuracy can be lower than that of a fine-tuned model.
Why fine-tuning works: Pre-trained models have already learned fundamental visual features: they understand edges, shapes, textures, and patterns. All this knowledge is compressed and encoded in their weights. When we fine-tune, we’re not starting from zero; we’re building on top of a solid foundation.
Here’s what we’re doing: fine-tuning a pre-trained Ultralytics YOLO model (trained on COCO) to focus on just 4 classes: ball, player, referee and goalkeeper.
Now, you might say, “Wait! Goalkeeper, player, and referee are all people, and YOLO already detects people really well on the COCO dataset. Why not just use that?”. That’s a great point, but here’s why fine-tuning is worth it:
- We want a soccer specialist: we want to differentiate between players, referees and goalkeepers. We also want our model to excel at detecting players in motion, goalkeepers diving, referees in their uniforms, and balls that might be partially hidden by players’ feet.
- Context matters: all these objects and people are in the specific context of a soccer field.
Getting the data
Since we’re fine-tuning (not training from scratch), we don’t need as much data, but we still need a few thousand images, ideally between 5,000 and 10,000.
For this project, I’m using Roboflow Universe, which is a fantastic resource for finding open-source datasets. It’s perfect for fine-tuning YOLO models, with datasets for everything from license plate recognition to people detection.
I found a soccer dataset with exactly our 4 classes, and the images are already split into three sets:
- 7,010 training images: these teach the model
- 1,056 validation images: these help us understand if the model is generalizing (not just memorizing)
- 1,002 test images: these give us the final report card on completely unseen data

You can download the dataset in many formats; we are going to use the YOLO format, which works with Ultralytics tooling out of the box. This is a standard way of organizing images and their bounding box labels.
You’ll see:
- data.yaml – contains class names and dataset paths
- train/
  - images/ – training images
  - labels/ – text files with bounding box coordinates
- val/
  - images/ – validation images
  - labels/ – validation labels
- test/
  - images/ – test images
  - labels/ – test labels
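For reference, the data.yaml for a dataset like this looks roughly as follows (the exact paths depend on where you unpack the download; the class order matches the indices explained below):

train: ../train/images
val: ../val/images
test: ../test/images

nc: 4
names: ['ball', 'goalkeeper', 'player', 'referee']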
Each label file contains lines like this:
2 0.537890625 0.4625 0.11328125 0.546875
2 0.644140625 0.469140625 0.15078125 0.51953125
Where:
- First number = class index (0=ball, 1=goalkeeper, 2=player, 3=referee)
- Next four numbers = bounding box coordinates (center x, center y, width, height)
- All coordinates are relative to image size (0.0 to 1.0)
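To make the format concrete, here’s a tiny Python sketch (the yolo_label_to_pixels helper is just for illustration, not part of any library) that converts one of these normalized label lines back into pixel coordinates:

def yolo_label_to_pixels(line, img_w, img_h):
    # "2 0.5378... 0.4625 0.1132... 0.5468..." -> class index + (x1, y1, x2, y2) box
    parts = line.split()
    cls_idx = int(parts[0])
    cx, cy, w, h = (float(v) for v in parts[1:])
    # All values are fractions of the image size, centered on the box
    box_w, box_h = w * img_w, h * img_h
    x1 = cx * img_w - box_w / 2
    y1 = cy * img_h - box_h / 2
    return cls_idx, (x1, y1, x1 + box_w, y1 + box_h)

cls_idx, box = yolo_label_to_pixels("2 0.537890625 0.4625 0.11328125 0.546875", 1920, 1080)
print(cls_idx, box)  # 2 (player) with its box in pixels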
Fine-tuning the model
First, you’ll need to set up a Python virtual environment and install the ultralytics library.
python3 -m venv venv
# on linux/unix
source venv/bin/activate
pip install ultralytics
The ultralytics library includes everything you need, not just for training YOLO models, but also for running inference. Once installed, you’ll have access to the yolo command-line tool.
Using Ultralytics’ tools, training is surprisingly simple:
yolo train data=data.yaml model=yolo11n.pt epochs=50 imgsz=640
- data=data.yaml – points to our dataset configuration
- model=yolo11n.pt – the pre-trained model we’re fine-tuning (nano version)
- epochs=50 – how many times we go through the entire training set
- imgsz=640 – input image size (640×640 pixels)
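If you prefer working from Python rather than the command line, the Ultralytics package exposes the same training run through its API. This should be equivalent to the command above:

from ultralytics import YOLO

# Load the pre-trained nano checkpoint and fine-tune it on our dataset
model = YOLO("yolo11n.pt")
model.train(data="data.yaml", epochs=50, imgsz=640)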
Understanding Epochs: Think of epochs like re-reading a book. An epoch is one complete pass through our entire training dataset. The data stays the same, but after each epoch, the model’s weights get updated based on what it learned. So when the model sees the same images again in the next epoch, it processes them differently because the model itself has changed. It’s like re-reading a book after gaining more experience: you notice different patterns and details because you’ve evolved, not because the book changed.
It’s also possible to add data augmentation, which I’m not going to use this time. If you want to learn more about it, there’s a page in the Ultralytics documentation.
Training metrics
During training, there are several key metrics to monitor, the most important being the loss values, which should steadily decrease. The model actually tracks three different types of loss: box loss, class loss and distribution focal loss (DFL). These losses measure how wrong the model’s predictions are at any given point, so seeing them trend downward tells us the model is learning and improving with each epoch.
However, it’s crucial that we see losses decreasing on both train and validation datasets. If the training loss goes down but validation loss stays flat or increases, it signals overfitting, where the model is memorizing the training images rather than learning generalizable patterns that work on new, unseen data.
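Ultralytics logs these per-epoch values to a results.csv file inside the run directory, so you can check for this yourself. A minimal sketch, assuming the usual column names (they may vary slightly between versions):

import pandas as pd

# Each training run writes a results.csv under runs/detect/train*/
df = pd.read_csv("runs/detect/train/results.csv")
df.columns = df.columns.str.strip()  # some versions pad column names with spaces

# Training loss falling while validation loss rises is the classic overfitting signal
print(df[["epoch", "train/box_loss", "val/box_loss"]].tail())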

The other crucial metric to watch is mAP50, or Mean Average Precision at 50% IoU, which you can think of as your model’s “accuracy grade” for object detection. This metric measures how well the model finds objects and draws decent bounding boxes around them, with the “50” meaning we accept a bounding box as correct if it overlaps with the true location by at least 50%. The mAP50 ranges from 0 to 1, where 1 represents perfect detection, and unlike loss, this is a metric we want to see climb higher throughout training. The mAP50 is evaluated on the validation set.
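To see what that 50% overlap threshold means in practice, here’s a small, self-contained IoU computation for two boxes given as (x1, y1, x2, y2) pixel tuples:

def iou(box_a, box_b):
    # Intersection rectangle (zero area if the boxes don't overlap)
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A prediction shifted by 10 pixels on a 100×100 box still clears the 0.5 bar
print(iou((100, 100, 200, 200), (110, 110, 210, 210)))  # ≈ 0.68, counted as a hit at IoU 50%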
Our final training results showed interesting performance differences across the four classes, with an overall mAP50 of 88% indicating really good performance. Player detection achieved the highest accuracy at around 90%, which makes sense since players are typically the largest objects in the images and likely had the most training examples. Both referee and goalkeeper detection performed solidly at around 85%, while ball detection came in at about 70%, still decent performance considering the ball is by far the smallest and most challenging object to detect in the images, often appearing as just a few pixels even in the original high-resolution frames.
Using the model in an Elixir Livebook
Once finished, you’ll find the weights under ./runs/detect/train*/weights/best.pt or last.pt. In ./runs/detect/train*/ you’ll also find different files describing how well the model was trained.
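You can also re-check the final metrics on the validation set from Python through the val() method (adjust the train* directory to match your run):

from ultralytics import YOLO

# Evaluate the fine-tuned weights against the validation split from data.yaml
model = YOLO("runs/detect/train/weights/best.pt")
metrics = model.val(data="data.yaml")
print(metrics.box.map50)  # overall mAP50 across the 4 classes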
If you want to use this model in an Elixir Livebook, before using the Elixir YOLO library you need to convert the PyTorch model to an ONNX model. In Elixir YOLO v0.2.0 – YoloX and custom models support I show how to convert the .pt model into an .onnx in a Livebook.
You simply need to run this Python script:
import json
from ultralytics import YOLO

MODEL_NAME = "yolov11m_soccer"
CLASSES_PATH = MODEL_NAME + ".json"
IMAGE_SIZE = 640

# Load the fine-tuned PyTorch weights
model = YOLO(MODEL_NAME + ".pt")

# Export to ONNX with a fixed 640×640 input size
model.export(format='onnx', imgsz=IMAGE_SIZE, opset=12)

# Export the categories as a JSON list, ordered by class index
with open(CLASSES_PATH, "w") as f:
    data = [model.names[idx] for idx in model.names]
    json.dump(data, f)
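As a quick sanity check, assuming you have onnxruntime installed, you can load the exported file and inspect its input shape:

import onnxruntime as ort

session = ort.InferenceSession("yolov11m_soccer.onnx")
inp = session.get_inputs()[0]
print(inp.name, inp.shape)  # expect something like (1, 3, 640, 640)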
Then you simply load the onnx model and the image, and run the inference.
mat = Evision.imread(soccer_image_frame_path)
model = YOLO.load(model_path: model_path, classes_path: classes_path)
detected_objects =
  model
  |> YOLO.detect(mat, iou_threshold: 0.45, prob_threshold: 0.25)
  |> YOLO.to_detected_objects(model.classes)
{:ok, image} = Image.from_evision(mat)
KinoYOLO.Draw.draw_detected_objects(image, detected_objects)

To evaluate our fine-tuned model, I compared it with the original COCO-trained YOLO11m on actual soccer match frames. The original COCO model detected people correctly but also picked up spectators in the background, and it couldn’t detect the soccer ball. Our fine-tuned model showed a better contextual understanding: it focused only on players, goalkeepers, and referees on the field, successfully detected the soccer ball even as a tiny dot in resized images, and provided proper role-specific classifications while ignoring background spectators.


I think the advantages of fine-tuning are evident in this comparison. Beyond requiring much less data and training time than starting from scratch, our model became specialized for soccer contexts rather than general object detection. It works reasonably well even on heavily downscaled images (1920×1080 → 640×640) and shows better contextual focus by concentrating on field action while filtering out crowd activity. Most importantly, it gained the ability to detect soccer balls. This indicates the model learned useful soccer-specific patterns.
Conclusion
With just a few thousand images and some training time, we transformed a generic 80-class detector into a specialized soccer model (88% mAP50), with clear improvements in context understanding and object-specific detection.
Whether you’re working on sports analysis, autonomous vehicles, or any specialized computer vision task, fine-tuning pre-trained YOLO models is often your best starting point. The combination of existing visual knowledge and domain-specific training creates powerful, practical solutions.
Try this approach with your own datasets and please let me know your results!