Measuring label quality (CVPR 2026 paper)

Label variations exist in all datasets, but they often remain hidden in datasets with only one rater per image. While reducing these variations is a direct way to mitigate the issues caused by “noisy labels,” it is first necessary to identify whether they stem from structural disagreements or individual errors.

For our recent CVPR publication, we developed the KαLOS (KaLOS) toolkit to rigorously evaluate dataset quality. The tool provides granular diagnostics to identify:

  • Hard images and difficult classes for annotators.
  • Collaboration clusters and “schools of thought”.
  • Annotator vitality and individual rater consistency.

KαLOS is designed for use during both the creation and post-hoc assessment of datasets. It is versatile enough to evaluate human labels, semi-automated proposals, and fully automated annotations.

Fig 1. Collaboration cluster analysis on the TexBiG dataset, visualizing agreement levels between raters. "NaN" values indicate raters who did not share any overlapping tasks.
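
To make the structure of such a matrix concrete, here is a minimal sketch of how a pairwise rater agreement matrix with "NaN" entries could be built. It is not the KαLOS API and does not use its agreement measure: the function name pairwise_agreement, the toy ratings, and the use of raw percent agreement are all assumptions chosen for illustration only.

```python
import math

def pairwise_agreement(labels):
    """Return an n_raters x n_raters matrix of raw percent agreement.

    labels: one list per rater, one entry per item; None marks an item
    the rater did not label. (Hypothetical input layout.)
    """
    n = len(labels)
    # Start with NaN everywhere; pairs without shared items stay NaN.
    matrix = [[math.nan] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Keep only the items that both raters labelled.
            shared = [(a, b) for a, b in zip(labels[i], labels[j])
                      if a is not None and b is not None]
            if shared:
                matches = sum(a == b for a, b in shared)
                matrix[i][j] = matches / len(shared)
    return matrix

# Toy data: raters 0 and 1 share three items, rater 2 overlaps with neither.
ratings = [
    ["cat", "dog", "dog", None, None],
    ["cat", "dog", "cat", None, None],
    [None, None, None, "bird", "bird"],
]
matrix = pairwise_agreement(ratings)
```

In this toy example, raters 0 and 1 agree on two of their three shared items, while every pairing with rater 2 stays NaN because there is no overlapping task. A chance-corrected coefficient would replace the raw match ratio in a real analysis.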

The KαLOS code is available on GitHub, and the package can be installed via pip install kalos.

I presented KαLOS at CVPR 2026 in Denver in June as a main track paper. If you missed it, here is a quick overview video on YouTube.

@inproceedings{tschirschwitz2026kalos,
  title = {KαLOS finds Consensus: A Meta-Algorithm for Evaluating Inter-Annotator Agreement in Complex Vision Tasks},
  shorttitle = {KαLOS},
  author = {Tschirschwitz, David and Rodehorst, Volker},
  booktitle = {Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR)},
  year = {2026}
}