Meta has just released DINOv3, a vision model that learns entirely from unlabeled images, with no captions or annotations. This is a huge deal: DINOv3 outperforms models like CLIP, SAM, and even its predecessor DINOv2 on dense tasks such as segmentation, depth estimation, and 3D matching.
The magic behind DINOv3 lies in its ability to train a 7B-parameter ViT without its dense features degrading over a very long training run. This is achieved through a new technique called Gram Anchoring, which constrains the patch-to-patch similarity structure of the model's features (their Gram matrix) to stay close to that of an earlier checkpoint while the global training objective keeps improving.
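To make that concrete, here is a minimal PyTorch sketch of the Gram-anchoring idea, not Meta's actual code: the function name `gram_anchoring_loss` and the tensor shapes are my own assumptions, and the loss shown is just the Frobenius-norm distance between the two Gram matrices described in the paper.

```python
import torch
import torch.nn.functional as F

def gram_anchoring_loss(student_patches: torch.Tensor,
                        anchor_patches: torch.Tensor) -> torch.Tensor:
    """Sketch of a Gram-anchoring objective (shapes are illustrative).

    student_patches: (batch, num_patches, dim) patch features from the model being trained.
    anchor_patches:  (batch, num_patches, dim) patch features from a frozen earlier checkpoint.
    """
    # L2-normalize so each Gram matrix holds cosine similarities between patches
    s = F.normalize(student_patches, dim=-1)
    a = F.normalize(anchor_patches, dim=-1)

    # Gram matrices: pairwise patch-to-patch similarities, shape (batch, P, P)
    gram_s = s @ s.transpose(1, 2)
    gram_a = a @ a.transpose(1, 2)

    # Penalize drift of the student's similarity structure away from the anchor
    return (gram_s - gram_a).pow(2).sum(dim=(1, 2)).mean()
```

In the paper, a term like this is added on top of the usual self-distillation objective later in training, with the anchor ("Gram teacher") refreshed from an earlier iteration of the model, so the global features keep improving while the dense features stay well-structured.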
But what does this mean for the future of vision tasks? With DINOv3 as a strong frozen backbone, we can expect significant improvements in areas like object detection, segmentation, and other downstream pipelines that build on dense visual features.
If you’re interested in learning more, check out the [paper](https://ai.meta.com/dinov3/) and this [video overview](https://www.youtube.com/watch?v=VfYUQ2Qquxk).
This is a huge milestone in the field of AI, and we can’t wait to see what the future holds.