State of the Art in Segmentation
A quick survey of the state of the art (SOTA) in segmentation. SAM 2[1] is reported to be 6× faster than the original SAM on image segmentation and runs at about 44 FPS on an A100, using a streaming memory for real-time video segmentation. Grounded SAM 2 pairs it with Grounding DINO for text-driven segmentation.
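To put the 44 FPS figure in context, throughput can be converted into a per-frame time budget and compared with the frame rate of the incoming video. The sketch below is plain arithmetic; the helper names `frame_budget_ms` and `keeps_up` are hypothetical, and the 30 and 60 FPS stream rates are illustrative assumptions, not figures from the SAM 2 paper.

```python
def frame_budget_ms(fps: float) -> float:
    """Average time available per frame, in milliseconds, at a given throughput."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def keeps_up(model_fps: float, stream_fps: float) -> bool:
    """True if the model processes frames at least as fast as the stream delivers them."""
    return model_fps >= stream_fps


if __name__ == "__main__":
    model_fps = 44.0  # throughput quoted above for SAM 2 on an A100
    print(f"per-frame budget: {frame_budget_ms(model_fps):.1f} ms")
    for stream_fps in (30.0, 60.0):  # assumed stream rates, for illustration only
        print(f"{stream_fps:.0f} FPS stream real-time: {keeps_up(model_fps, stream_fps)}")
```

At 44 FPS the model has roughly 22.7 ms per frame, which is enough for a 30 FPS stream but not a 60 FPS one.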
Florence-2[2] unifies captioning, grounding, and segmentation, reaching 35.8 mIoU on referring segmentation and 135.6 CIDEr on COCO captioning.
Unified transformer architectures dominate the current SOTA:
- MaskDINO[3] – 54.5 AP (COCO instance), 59.4 PQ (COCO panoptic), 60.8 mIoU (ADE20K semantic).
- DI-MaskDINO[4] – +1.2 box AP and +0.9 mask AP over MaskDINO on COCO.
- Mask2Former[5] – universal transformer for panoptic, instance, and semantic segmentation.
- OneFormer[6] – single-training model surpassing task-specific counterparts across all segmentation tasks.
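The PQ (panoptic quality) number in the list above follows the standard definition: predicted and ground-truth segments are matched when their IoU exceeds 0.5, and PQ is the sum of matched IoUs divided by TP + 0.5·FP + 0.5·FN. A minimal sketch for a single class follows. It assumes each segment is a set of pixel indices and that segments within one image do not overlap. The function names are hypothetical, and this is not the official COCO panoptic evaluation code.

```python
def iou(a: set, b: set) -> float:
    """Intersection over union of two segments given as sets of pixel indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def panoptic_quality(pred_segments: list, gt_segments: list) -> float:
    """PQ for one class: sum of matched IoUs / (TP + 0.5*FP + 0.5*FN).

    With non-overlapping segments, an IoU above 0.5 can hold for at most one
    prediction per ground-truth segment, so a simple scan finds the matching.
    """
    matched_ious = []
    used_preds = set()
    for gt in gt_segments:
        for idx, pred in enumerate(pred_segments):
            if idx in used_preds:
                continue
            overlap = iou(pred, gt)
            if overlap > 0.5:
                matched_ious.append(overlap)
                used_preds.add(idx)
                break
    tp = len(matched_ious)
    fp = len(pred_segments) - tp  # predictions with no match
    fn = len(gt_segments) - tp    # ground-truth segments with no match
    denom = tp + 0.5 * fp + 0.5 * fn
    return sum(matched_ious) / denom if denom else 0.0


if __name__ == "__main__":
    gt = [{0, 1, 2, 3}, {10, 11}]
    pred = [{0, 1, 2}, {20, 21}]
    print(panoptic_quality(pred, gt))
```

In the toy example one segment matches with IoU 0.75, one prediction is spurious, and one ground-truth segment is missed, so PQ is 0.75 / (1 + 0.5 + 0.5) = 0.375. Benchmark PQ additionally averages this quantity over classes.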
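For efficiency, SegFormer-B5 achieves 51.8 mIoU on ADE20K without positional encoding, while SegNeXt reports roughly +2.0 mIoU on average over prior methods on ADE20K at equal or lower compute, and about 1/10 the parameters of a much larger EfficientNet-L2-based model it outperforms on Pascal VOC.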
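The mIoU figures quoted for semantic segmentation are class-averaged intersection over union. A minimal sketch of the computation on one pair of label maps follows. It assumes label maps are nested lists of integer class IDs, that predictions lie in `[0, num_classes)`, and that 255 marks ignored pixels, which is a common convention but not one stated above. Benchmark implementations typically accumulate the counts over the whole dataset before averaging.

```python
def mean_iou(pred, gt, num_classes, ignore_index=255):
    """Class-averaged IoU between two label maps of identical shape.

    Classes that appear in neither map are left out of the average.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if g == ignore_index:
                continue  # unlabeled pixel: excluded from the metric
            if p == g:
                intersection[g] += 1
                union[g] += 1
            else:
                # a wrong pixel enlarges the union of both classes involved
                union[g] += 1
                union[p] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    gt = [[0, 0],
          [1, 1]]
    pred = [[0, 1],
            [1, 1]]
    print(f"mIoU = {mean_iou(pred, gt, num_classes=2):.4f}")
```

Here class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12, about 0.583, or 58.3 when reported on the 0 to 100 scale used above.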
In short: the field has consolidated around modular, transformer-based frameworks capable of multi-task, zero-shot segmentation across vision and language.
References
- [1] Ravi, Nikhila, et al. "SAM 2: Segment Anything in Images and Videos." arXiv preprint arXiv:2408.00714 (2024).
- [2] Xiao, Bin, et al. "Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks." arXiv preprint arXiv:2311.06242 (2023).
- [3] Li, Feng, et al. "Mask DINO: Towards a Unified Transformer-based Framework for Object Detection and Segmentation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
- [4] Nan, Zhixiong, et al. "DI-MaskDINO: A Joint Object Detection and Instance Segmentation Model." arXiv preprint arXiv:2410.16707 (2024).
- [5] Cheng, Bowen, et al. "Masked-attention Mask Transformer for Universal Image Segmentation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- [6] Jain, Jitesh, et al. "OneFormer: One Transformer to Rule Universal Image Segmentation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.