I’ve been digging into model distillation lately, and I’m struck by the gap between the impressive results reported in research papers and the distilled models that are actually available as open source. Papers are full of examples of distilling huge LLMs into much smaller ones with minimal performance loss, yet when I look at open-source releases, most ‘distilled’ models are surprisingly tiny, like DistilBERT and DistilGPT-2.
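For context, the recipe these papers describe mostly comes down to the classic Hinton-style setup: train the student on a mix of soft targets from the teacher and the usual hard labels. Here’s a minimal PyTorch sketch of that loss; the function name and the `T`/`alpha` hyperparameters are illustrative choices on my part, not taken from any particular release:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft-target term: KL divergence between the temperature-softened
    # teacher and student distributions, scaled by T^2 as in Hinton et al. (2015).
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-label term: ordinary cross-entropy against the ground-truth tokens.
    hard_loss = F.cross_entropy(student_logits, labels)
    # Blend the two; alpha controls how much we trust the teacher's soft labels.
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```

Even this simple version implies running a full teacher forward pass for every student training token, which at today’s model and corpus sizes is a serious compute bill in its own right.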
So, what’s going on? Is it because distillation is still too resource-intensive at large scales? Are there legal or IP restrictions stopping labs from releasing larger distilled models? Or is there simply not enough demand for mid-sized, high-performance variants of today’s big models?
It feels like the research world is serving up five-star distillation recipes, but open-source only gives us the ‘instant noodles’ version. Has anyone else noticed this gap? Or am I missing a secret club where all the good distilled LLMs are hiding?
I’d love to hear your thoughts on this. Are there other factors at play that I’m not considering?