State of the Art in Segmentation

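A quick survey of the state of the art (SOTA) in segmentation. SAM 2[1] is reported to be 6× faster than the original SAM on image segmentation and runs at about 44 FPS on an A100, using a streaming memory for real-time video segmentation. Grounded SAM 2 pairs SAM 2 with Grounding DINO for text-driven segmentation: the text prompt is grounded to boxes, which SAM 2 then turns into masks.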

Florence-2[2] unifies captioning, grounding, and segmentation in a single model, reaching 35.8 mIoU on zero-shot referring segmentation and 135.6 CIDEr on COCO captioning.

Unified transformer architectures dominate the current SOTA:

  • MaskDINO[3] – 54.5 AP (COCO instance), 59.4 PQ (COCO panoptic), 60.8 mIoU (ADE20K semantic).
  • DI-MaskDINO[4] – +1.2 box AP and +0.9 mask AP over MaskDINO on COCO.
  • Mask2Former[5] – one masked-attention transformer architecture for panoptic, instance, and semantic segmentation.
  • OneFormer[6] – a single model, trained once, that outperforms task-specific counterparts on panoptic, instance, and semantic segmentation.
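The PQ figure quoted above is panoptic quality: the summed IoU of matched segments divided by |TP| + ½|FP| + ½|FN|, where a predicted and a ground-truth segment match when their IoU exceeds 0.5. The sketch below is only an illustration for a single class on toy data, not the official evaluation code; the function name `panoptic_quality` and the representation of segments as sets of pixel indices are assumptions made for this example. Benchmarks compute this per class and then average over classes.

```python
def panoptic_quality(pred_segments, gt_segments):
    """Panoptic quality (PQ) for one class.

    Each segment is a non-empty set of pixel indices (an assumption of this
    sketch). A prediction matches a ground-truth segment when IoU > 0.5;
    with non-overlapping segments such a match is unique.
    """
    matched_ious = []
    matched_pred = set()
    for gt in gt_segments:
        for i, pred in enumerate(pred_segments):
            if i in matched_pred:
                continue
            iou = len(gt & pred) / len(gt | pred)
            if iou > 0.5:
                matched_ious.append(iou)
                matched_pred.add(i)
                break

    tp = len(matched_ious)             # matched pairs
    fp = len(pred_segments) - tp       # unmatched predictions
    fn = len(gt_segments) - tp         # unmatched ground-truth segments
    denom = tp + 0.5 * fp + 0.5 * fn
    return sum(matched_ious) / denom if denom else 0.0


# Toy example: one good match, one spurious prediction, one missed segment.
gt = [{0, 1, 2, 3}, {4, 5}]
pred = [{0, 1, 2}, {6, 7}]
print(panoptic_quality(pred, gt))
```

Here the single match has IoU 0.75, and there is one false positive and one false negative, so PQ = 0.75 / (1 + 0.5 + 0.5) = 0.375.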

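For efficiency, SegFormer-B5 reaches 51.8 mIoU on ADE20K with a positional-encoding-free encoder. SegNeXt, a convolutional-attention design, reports about +2.0 mIoU on ADE20K over prior SOTA methods at the same or lower compute, and outperforms EfficientNet-L2 with NAS-FPN on Pascal VOC using about 1/10 of its parameters.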
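Since most of the numbers above are mIoU scores, a minimal sketch of the metric may help: per-class IoU is intersection over union of predicted and ground-truth pixels, and mIoU is the mean over classes. The function name `mean_iou` and the flat-list input format are assumptions for illustration. Benchmark toolkits typically accumulate the per-class counts over the whole validation set before averaging, which this function also does if all images are flattened into one sequence.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that occur in pred or target.

    pred and target are flat sequences of integer class labels,
    one entry per pixel.
    """
    intersection = [0] * num_classes
    pred_count = [0] * num_classes
    target_count = [0] * num_classes
    for p, t in zip(pred, target):
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            intersection[p] += 1

    ious = []
    for c in range(num_classes):
        union = pred_count[c] + target_count[c] - intersection[c]
        if union > 0:  # skip classes absent from both maps
            ious.append(intersection[c] / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy 2x3 label maps, flattened row by row.
target = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
print(mean_iou(pred, target, num_classes=3))
```

In the toy example the per-class IoUs are 1/3, 2/3, and 1/2, giving an mIoU of 0.5 (reported as 50.0 on the percentage scale used in the text).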

In short: the field has consolidated around modular, transformer-based frameworks capable of multi-task, zero-shot segmentation across vision and language.

References

  [1] Ravi, Nikhila, et al. "SAM 2: Segment Anything in Images and Videos." arXiv preprint arXiv:2408.00714 (2024).
  [2] Xiao, Bin, et al. "Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks." arXiv preprint arXiv:2311.06242 (2023).
  [3] Li, Feng, et al. "Mask DINO: Towards a Unified Transformer-based Framework for Object Detection and Segmentation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  [4] Nan, Zhixiong, et al. "DI-MaskDINO: A Joint Object Detection and Instance Segmentation Model." arXiv preprint arXiv:2410.16707 (2024).
  [5] Cheng, Bowen, et al. "Masked-attention Mask Transformer for Universal Image Segmentation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  [6] Jain, Jitesh, et al. "OneFormer: One Transformer to Rule Universal Image Segmentation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.