Competition
Retail shelf detection and classification for the Norwegian championship in AI
Placement
#22 of 361
Percentile
Top 6.1%
Test score
0.7059
Val score
0.8349
At the national championship in AI I worked on the NorgesGruppen Data challenge, a retail computer vision benchmark built around crowded grocery shelves. Each image contained many products, often with visually similar packaging. The task was to train a CV system that could both detect where in the image the products were and identify the exact SKU for each product. We were given a dataset of 248 images of grocery shelves with COCO annotations. I chose a disciplined approach to training with a large validation split of 48 images. That made validation scores map strongly to actual test scores, but the approach had one major drawback: it left me with much less training data. While the top performers trained all parts of their detector families, I only did one small fine-tune of the classifier head on the validation set. In hindsight I should have been less worried about overfitting and chosen a more aggressive split.
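The split itself is mechanical to reproduce. A minimal sketch of carving a fixed validation set out of a COCO-format annotation dict by image id (the function name and seed are illustrative, not the actual competition code):

```python
import random

def split_coco(coco, n_val=48, seed=0):
    """Split a COCO-format dict into train/val subsets by image id."""
    ids = sorted(img["id"] for img in coco["images"])
    random.Random(seed).shuffle(ids)
    val_ids = set(ids[:n_val])
    train_ids = set(ids[n_val:])

    def subset(keep):
        return {
            "images": [im for im in coco["images"] if im["id"] in keep],
            "annotations": [a for a in coco["annotations"]
                            if a["image_id"] in keep],
            "categories": coco["categories"],  # categories are shared
        }

    return subset(train_ids), subset(val_ids)
```

Fixing the seed keeps the 48 validation images identical across every training run, which is what made the validation score a reliable proxy for the leaderboard.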
Most of the early work went into maximizing detector performance, since the competition weighted box quality at 70% of the score. The winning idea here was ensembles: the top teams used a mix of different detector families to learn different data features. I also tried this approach early, but with less success. My ensemble (YOLOv8x + RF-DETR-M) had worse box performance than a single, bigger model at reduced precision. Eventually I settled on a single RF-DETR 2XL as the main box proposer, cast down to fp16. The reason for the precision reduction was space: the 420 MB limit for the entire submission forced me to drop precision across all models. On the validation set this didn't seem to affect performance at all. In hindsight the RF-DETR 2XL box proposal model should have been trained on the entire dataset.
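The storage arithmetic behind the fp16 cast is easy to see in isolation. A minimal sketch with a NumPy array standing in for real model weights (checkpoints are just many arrays like this):

```python
import numpy as np

# Stand-in weight matrix; a real checkpoint is a collection of such arrays.
w32 = np.ones((1024, 1024), dtype=np.float32)
w16 = w32.astype(np.float16)  # 2 bytes per parameter instead of 4

print(w32.nbytes, w16.nbytes)  # fp16 halves the on-disk footprint
```

Halving every tensor is what made three models plus an embedding bank fit under the 420 MB cap, at the cost of a small (here, unmeasurable) precision loss.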
The system
Eventually I landed on a staged pipeline for the final submission. The flow works like this: the trained RF-DETR 2XL model proposes boxes around products. Each box is then cropped from the full image (with some margin) and passed to the classifier, a trained EfficientNetV2-M, which produces a likelihood distribution over the product SKUs. Finally, if this distribution is low-margin, meaning there is no clear winner, the crop is sent to a trained DINOv3 ConvNeXt-S that uses an embedding bank of reference photographs to readjust the distribution.
Put simply: Draw boxes around products -> crop image to box -> classify what the box contains.
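The staged flow can be sketched end to end. Everything below is schematic: `detect_boxes`, `classify_crop`, and `rerank_with_bank` are hypothetical stand-ins for the three trained models, and the margin and padding values are illustrative:

```python
def run_pipeline(image, detect_boxes, classify_crop, rerank_with_bank,
                 margin=0.15, crop_pad=0.1):
    """Staged inference: detect -> crop (with padding) -> classify -> maybe rerank."""
    results = []
    for (x0, y0, x1, y1) in detect_boxes(image):            # RF-DETR 2XL stage
        pad_x = (x1 - x0) * crop_pad
        pad_y = (y1 - y0) * crop_pad
        crop = image.crop((x0 - pad_x, y0 - pad_y, x1 + pad_x, y1 + pad_y))
        probs = classify_crop(crop)                          # EfficientNetV2-M stage
        ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
        if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < margin:
            # Low-margin call: let the embedding bank settle it.
            sku = rerank_with_bank(crop, probs)              # DINOv3 ConvNeXt-S stage
        else:
            sku = ranked[0][0]
        results.append(((x0, y0, x1, y1), sku))
    return results
```

The key design property is that the expensive embedding stage only runs on the ambiguous crops, so the common case stays a cheap two-model path.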
This is also why everything needed reduced precision: storing three models and an embedding bank in 420 MB is not easy! Finally, the embedding model used a cosine-similarity approach:
ŷ(x) = argmax_{c ∈ C_K(x)} [ log p_cls(c | x) + λ · max_{r ∈ R_c} cos(z(x), r) ]   (1)

z(x) = f(x) / ‖f(x)‖₂   (2)

RF-DETR proposes the boxes, one selected box becomes a crop, EfficientNet scores the SKU candidates, and only the close calls are embedded, scored by max cosine against class references, and reranked with a margin gate.
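Equation (1) reduces to a short reranking routine. A minimal NumPy sketch, where the gate threshold, λ, and top-K are illustrative values rather than the competition settings:

```python
import numpy as np

def margin_gated_rerank(log_probs, z, bank, top_k=5, lam=0.5, gate=0.1):
    """Rerank low-margin classifier outputs by max cosine against class references.

    log_probs : (C,) classifier log-probabilities, one per SKU
    z         : (d,) crop embedding from the embedding model
    bank      : dict class_id -> (n_refs, d) reference embeddings
    """
    probs = np.exp(log_probs - log_probs.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]
    # Margin gate: keep the classifier's answer when it is already confident.
    if probs[order[0]] - probs[order[1]] >= gate:
        return int(order[0])
    z = z / np.linalg.norm(z)                                # eq. (2): L2-normalize
    best, best_score = int(order[0]), -np.inf
    for c in order[:top_k]:                                  # C_K(x): top-K candidates
        refs = bank[int(c)]
        refs = refs / np.linalg.norm(refs, axis=1, keepdims=True)
        score = log_probs[c] + lam * float((refs @ z).max()) # eq. (1)
        if score > best_score:
            best, best_score = int(c), score
    return best
```

Restricting the argmax to the classifier's top-K candidates keeps the cosine pass cheap even with a large SKU catalogue.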
What moved the score
The early improvements came from making the detector stack stronger and more stable. The larger jump came later, when the pipeline moved away from the ensemble and into a single RF-DETR 2XL path with a stronger crop classifier and, at the end, a selective reranker.
The progression tells the clearest story. Detector-centric baselines raised the floor, but the decisive movement came from improving crop-level decisions and only adding the embedder where it actually changed the answer.
The main shift was not another detector family. It was the move to a cleaner single-detector path with stronger crop classification and a targeted reranker.
Training setup
The final submission was not trained as one monolithic system. The detector, classifier, and reranker each had a separate lane, which made it possible to improve the crop stage without destabilizing the full-image detector, and to attach the reranker only after the rest of the runtime was already working.
In practice, this is also what made the compact runtime possible. The pipeline could keep the main path small and predictable while still gaining a more careful reference-based decision stage for the hardest cases.
The competition was sponsored by GCP, and fortunately the system could be trained on an overkill 8xA100 box, which made training runs incredibly fast.
The detector, classifier, and reranker were trained in separate lanes, then stitched into one compact runtime with extra compute reserved for the ambiguous cases.
Results
The remaining errors were mostly near-neighbor packaging confusions: products that already look close to a human observer, especially when they are partially occluded or visually compressed on a shelf. That is exactly the failure mode where a reference-based recheck makes sense.
The final result was therefore less about one dramatic architectural choice and more about making each stage do a narrower job well: detect, crop, classify, and only then recheck the ambiguous cases.