I just came across an incredible Reddit post that’s got me excited about the future of AI. Apparently, the 120B model can run smoothly on just 8GB of VRAM! That’s right, you don’t need a supercomputer to run these massive models. With some clever tweaks, you can keep the bulk of the model in system RAM and reserve the GPU for only the parts that really need it, making it possible to run on consumer-grade hardware.
The secret lies in the `--cpu-moe` option, which keeps the expert (MoE) layers in system RAM and runs them on the CPU. Only the smaller pieces, the attention weights, routing tables, and layer norms, stay on the GPU, which keeps VRAM usage low. Even a 3060 Ti with 64GB of system RAM would be more than enough to run this model efficiently.
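For anyone curious what this looks like in practice, here’s a rough sketch of launching the model with that flag. I’m assuming llama.cpp’s `llama-server` binary here; the model file name, `-ngl` value, context size, and port are placeholder values for illustration, not settings taken from the original post:

```python
import subprocess

# Minimal sketch: launch llama.cpp's llama-server with the MoE expert
# tensors kept on the CPU via --cpu-moe (the flag described in the post).
# Binary path, model file name, context size, and -ngl value below are
# placeholders, not values from the source.
cmd = [
    "./llama-server",                 # assumed llama.cpp server binary
    "-m", "gpt-oss-120b-mxfp4.gguf",  # hypothetical model file name
    "--cpu-moe",                      # keep expert (MoE) weights in system RAM
    "-ngl", "99",                     # offload the remaining layers to the GPU
    "-c", "8192",                     # context length (example value)
    "--port", "8080",
]
subprocess.run(cmd, check=True)
```

The idea is that `--cpu-moe` pins the big expert tensors to system RAM while `-ngl` still pushes everything else onto the GPU, which is why 8GB of VRAM ends up being enough.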
What’s even more impressive is that the setup stays snappy, with prompt evaluation clocking in at 122.66 tokens per second. That’s fast! And the best part? You can do all this on a relatively affordable setup.
I think this breakthrough has huge implications for AI development. With the ability to run large models on consumer hardware, more people can participate in AI research, leading to faster innovation and progress. It’s an exciting time for AI enthusiasts, and I’m eager to see what the future holds.