The Surprising Challenge of Testing AI with Codenames

Have you ever played the popular board game Codenames with friends? It’s a fun party game where your team tries to link words on a grid to a one-word clue given by your spymaster. But what if I told you that Large Language Models (LLMs) are really bad at this kind of game? I recently played a challenging round with friends and, afterwards, asked a few LLMs for clue suggestions using the word set from that game. The results were surprising, and not in a good way.

It got me thinking: if we’re worried about AGI emerging from LLMs, maybe we should forget the Turing test and focus on something more practical. What if we tested an LLM’s ability to play Codenames convincingly? It’s a game that requires understanding nuances of language, context, and relationships between words – skills that are essential for human-like intelligence.
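If you want to try the idea yourself, the probe is easy to sketch. Below is a minimal Python sketch of the spymaster side of such a test; ask_llm is a hypothetical placeholder for whatever model API you actually use, and the word lists are made up purely for illustration.

```python
# Minimal sketch of a Codenames-style probe for an LLM.
# `ask_llm` is a hypothetical stand-in for your model interface of choice;
# it returns a canned answer here so the sketch runs end to end.

def ask_llm(prompt: str) -> str:
    # Replace this with a real call to your LLM provider.
    return "FLOW 3"  # placeholder clue

def spymaster_prompt(team_words, avoid_words):
    """Build a prompt asking the model to act as a Codenames spymaster."""
    return (
        "You are the spymaster in a game of Codenames.\n"
        f"Your team must guess these words: {', '.join(team_words)}\n"
        f"They must NOT guess these words: {', '.join(avoid_words)}\n"
        "Give a single one-word clue and a number, e.g. 'OCEAN 3', that links "
        "as many of your team's words as possible while steering clear of the "
        "words to avoid."
    )

if __name__ == "__main__":
    # Example board slice (illustrative only)
    team = ["bank", "river", "note", "wave"]
    avoid = ["pirate", "doctor", "spring"]
    clue = ask_llm(spymaster_prompt(team, avoid))
    print(clue)  # judge the clue the way a human teammate would
```

Scoring is the hard part, of course: a human (or a second model acting as guesser) still has to decide whether the clue actually connects the intended words without pointing at the ones to avoid.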

The idea is intriguing, and it raises questions about how we evaluate the capabilities of LLMs. Are we relying too heavily on traditional tests, or is it time to think outside the box (or game board)? Share your thoughts – do you think Codenames could be a useful benchmark for LLMs?
