The Transformer architecture has reshaped natural language processing, but it has a well-known limitation: self-attention's cost grows quadratically with sequence length, which makes it slow and memory-hungry on long inputs. What if similar quality were achievable with fewer parameters and less computational overhead?
That’s the idea behind PosetLM, the alternative I’ve been exploring. Instead of full self-attention, PosetLM processes a sequence as a causal DAG: each token connects only to a small set of previous tokens, and information propagates along those explicit edges. This keeps both compute and memory usage down.
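To make the DAG idea concrete, here is a minimal sketch. The post doesn't specify how PosetLM picks its edges (they may be learned or scored); this sketch just assumes the simplest rule, a fixed window of the K most recent predecessors, and one uniform-weight propagation step. The function names are mine, not PosetLM's.

```python
import numpy as np

def build_poset_edges(T, K):
    """Causal DAG: position t gets edges from (at most) its K most recent
    predecessors. PosetLM may choose edges differently (e.g. learned top-K);
    a last-K window is just the simplest causal instance."""
    return {t: list(range(max(0, t - K), t)) for t in range(T)}

def sparse_causal_mix(x, edges):
    """One propagation step: each position averages its own state with the
    states of its DAG parents (uniform weights, purely for illustration)."""
    out = np.empty_like(x)
    for t, parents in edges.items():
        group = [x[t]] + [x[p] for p in parents]
        out[t] = np.mean(group, axis=0)
    return out

T, K, d = 8, 3, 4
x = np.random.randn(T, d)
edges = build_poset_edges(T, K)   # e.g. edges[5] == [2, 3, 4]
y = sparse_causal_mix(x, edges)   # shape (T, d), strictly causal
```

Each position touches at most K + 1 states, which is where the O(T·K) cost comes from; stacking several such steps lets information flow further back than K tokens.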
## Early Results
I’ve trained both PosetLM and a small Transformer on the enwik8 dataset, and the results are promising: the two models reach similar quality, while PosetLM uses around 35% fewer parameters. The catch is that my current implementation is slower and uses more memory than the Transformer.
## Why This Matters
PosetLM offers several potential advantages over a standard Transformer:

- **Structured sparsity.** With at most K edges per token, compute scales as O(T·K) rather than O(T²).
- **Interpretability.** The DAG’s edges are explicit, so you can inspect exactly which past tokens each position draws on.
- **Decoupled refinement.** Iterative refinement separates ‘which edges’ from ‘how many propagation steps,’ so quality can potentially improve just by running more iterations at evaluation time.
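The scaling claim is easy to sanity-check by counting edges. The sketch below assumes at most K parents per token (the exact edge budget in PosetLM may differ):

```python
def full_attention_edges(T: int) -> int:
    # Causal self-attention: position t attends to itself and all t predecessors.
    return T * (T + 1) // 2

def posetlm_edges(T: int, K: int) -> int:
    # K-sparse causal DAG: position t has at most K parents.
    return sum(min(K, t) for t in range(T))

T, K = 4096, 16
dense = full_attention_edges(T)   # 8,390,656 edges
sparse = posetlm_edges(T, K)      # 65,400 edges, roughly 128x fewer
```

The asymptotic win is real, but as the limitations below make clear, it only translates into wall-clock speed once the sparse gathers are implemented in an efficient kernel.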
## Limitations and Caveats
Of course, PosetLM still has real limitations. The naive implementation is far from kernel-optimal, so GPU utilization is poor, and both throughput and VRAM usage are currently worse than the small Transformer baseline. I’ve also only tested it on byte-level enwik8 with modest compute budgets, so it’s unclear how the approach holds up on larger datasets and models.
## The Future of PosetLM
So, is this direction worth exploring further? I think so. The engineering gaps are real, but the potential benefits above are concrete, and there are clear next steps: writing efficient kernels for the sparse gathers, scaling to larger datasets, and testing longer-context tasks where the O(T·K) scaling should matter most.
What do you think? Should I keep investing time and resources into PosetLM?