Unraveling the Mystery of DeepMind's Genie 3 Architecture | Ranjan Kumar

Have you seen the latest breakthrough from DeepMind, Genie 3? It’s a game-changer for world models. I was blown away by the comparison between Genie 2 and Genie 3. The most striking difference is that Genie 2’s frames carry a constant statistical noise, whereas Genie 3 seems to have eliminated it. This got me thinking: how did they achieve this?

I believe Genie 2 is a diffusion model that outputs one frame at a time, conditioned on the past frames and on keyboard inputs for movement. But Genie 3 preserves the environment so consistently that I suspect it works differently. Perhaps the model outputs the actual 3D physical world, stores it as some kind of 3D mesh plus textures, and then applies rules for what to generate and when (namely, anything the user can see in the frame).
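To make the two guesses concrete, here is a toy Python sketch. It is not DeepMind's code and says nothing about how Genie 2 or Genie 3 are actually built. The names `generate_cell`, `frame_by_frame`, and `persistent_world` are made up for illustration, the "world" is just a line of integer positions, and a random digit stands in for whatever a real model would render (there is no diffusion here at all).

```python
import random


def generate_cell(rng):
    """Hypothetical stand-in for a generative model rendering one location."""
    return rng.randint(0, 9)


def frame_by_frame(actions, context_len=2, seed=0):
    """Guess A (toy): each frame is conditioned only on the last
    `context_len` frames plus the current action. A location that has
    dropped out of that window is generated again from scratch."""
    rng = random.Random(seed)
    pos, context, frames = 0, [], []
    for action in actions:
        pos += action  # keyboard input moves the viewer
        window = context[-context_len:] if context_len > 0 else []
        remembered = dict(window)  # position -> content still in context
        value = remembered.get(pos)
        if value is None:
            value = generate_cell(rng)
        context.append((pos, value))
        frames.append((pos, value))
    return frames


def persistent_world(actions, seed=0):
    """Guess B (toy): content is generated once, the first time a location
    becomes visible, then stored and reused on every later visit."""
    rng = random.Random(seed)
    pos, world, frames = 0, {}, []
    for action in actions:
        pos += action
        if pos not in world:
            world[pos] = generate_cell(rng)
        frames.append((pos, world[pos]))
    return frames


if __name__ == "__main__":
    walk = [1, 1, 1, 1, -1, -1, -1, -1]  # walk right four steps, then back
    print("frame by frame:", frame_by_frame(walk))
    print("persistent    :", persistent_world(walk))
```

With a short context, the frame-by-frame version may repaint a spot differently when you walk back to it, while the persistent version always shows what it generated on the first visit. That difference in consistency is what I am speculating about, nothing more.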

What do you think? Let’s speculate together! The possibilities are endless, and I’d love to hear your thoughts on how DeepMind might have achieved this remarkable feat.

