Have you noticed that large language models (LLMs) struggle when given longer prompts? Despite their increasing size, they seem to degrade in quality, hallucinate more, and experience latency spikes. But what’s behind this? Research suggests it’s not about model size, but about how we manage context. Most models don’t process long inputs as reliably as short ones: position bias, distractors, and bloated inputs all make things worse.
So, how do we handle this in production? Are we summarizing history, retrieving only what’s needed, building scratchpads, or adding autonomy sliders? I’d love to hear what’s working (or failing) for others building LLM-based apps.
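
To make the first option concrete, here’s a minimal sketch of what I mean by “summarizing history”: keep the last few turns verbatim and collapse everything older into a short summary so the prompt stays small. The `summarize` helper and the character budget are placeholders for whatever your stack actually uses, not a reference implementation.

```python
def summarize(turns: list[str]) -> str:
    # Placeholder: in a real app this would be an LLM call or an extractive pass.
    return "Summary of earlier conversation: " + " | ".join(t[:40] for t in turns)


def build_context(history: list[str], keep_recent: int = 4, budget_chars: int = 4000) -> str:
    # Keep the most recent turns verbatim; summarize everything older.
    older, recent = history[:-keep_recent], history[-keep_recent:]
    parts = ([summarize(older)] if older else []) + recent
    context = "\n".join(parts)
    # Hard cap as a last resort so a single long turn can't blow the budget.
    return context[-budget_chars:]


if __name__ == "__main__":
    history = [f"turn {i}: ..." for i in range(1, 11)]
    print(build_context(history))
```

The trade-off, of course, is that the summary itself can drop the one detail the model needed, which is exactly why I’m curious how others are handling it.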
It’s time to talk about context rot and how we can overcome it.