When working with large language models (LLMs), ensuring consistency across different sessions and seeds is crucial. But how do we actually measure that consistency? If we run identical blinded probes across N sessions/seeds, what statistics and tests let us claim a ‘stable signal’ rather than noise?
One approach is to quantify how much the per-probe scores vary between sessions, for example with mean absolute error (MAE) or mean squared error (MSE) computed between pairs of sessions: values near zero indicate that the sessions agree. A complementary approach is hypothesis testing, such as a paired t-test or one-way ANOVA across sessions, to check whether the between-session variation is larger than we would expect from noise alone.
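As a rough illustration, here is a minimal sketch in Python, assuming the probe results are numeric scores arranged as a sessions-by-probes matrix; the data below is synthetic and purely illustrative, not a real evaluation:

```python
import numpy as np
from scipy import stats

# Hypothetical scores: rows = sessions/seeds, columns = identical blinded probes.
rng = np.random.default_rng(0)
scores = rng.normal(loc=0.7, scale=0.05, size=(8, 50))  # 8 sessions, 50 probes

# Pairwise MAE between sessions: values near zero suggest consistency.
n_sessions = scores.shape[0]
pairwise_mae = [
    np.mean(np.abs(scores[i] - scores[j]))
    for i in range(n_sessions)
    for j in range(i + 1, n_sessions)
]
print(f"mean pairwise MAE across sessions: {np.mean(pairwise_mae):.4f}")

# One-way ANOVA treating each session as a group of per-probe scores
# (a simplification: it ignores that the same probes are paired across sessions).
f_stat, p_value = stats.f_oneway(*scores)
print(f"ANOVA F={f_stat:.3f}, p={p_value:.3f}")
```

In this framing, a low mean pairwise MAE together with a non-significant ANOVA is consistent with the sessions behaving like draws from the same distribution, while a significant ANOVA points to systematic differences between sessions.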
However, any claim of a ‘stable signal’ is only meaningful relative to a baseline. A simple baseline is chance-level behaviour, such as the agreement a random model (or randomly shuffled responses) would produce; a more demanding baseline is the consistency a state-of-the-art model achieves on a similar task. One way to operationalize the chance baseline is a permutation test, sketched below.
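The following sketch, again on synthetic data, compares observed cross-session agreement on binary probe outcomes against a permutation baseline in which each session's outcomes are shuffled independently; the data and threshold choices are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical binary outcomes (e.g., probe answered consistently): sessions x probes.
outcomes = rng.integers(0, 2, size=(8, 50))

def mean_pairwise_agreement(mat: np.ndarray) -> float:
    """Fraction of probes on which pairs of sessions match, averaged over all pairs."""
    n = mat.shape[0]
    agreements = [
        np.mean(mat[i] == mat[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]
    return float(np.mean(agreements))

observed = mean_pairwise_agreement(outcomes)

# Chance baseline: shuffle each session's outcomes independently, which breaks any
# probe-level signal while preserving each session's overall rate.
n_permutations = 1000
null = np.array([
    mean_pairwise_agreement(np.array([rng.permutation(row) for row in outcomes]))
    for _ in range(n_permutations)
])
p_value = np.mean(null >= observed)
print(f"observed agreement={observed:.3f}, permutation p={p_value:.3f}")
```

With real probe data, a ‘stable signal’ would show up as observed agreement sitting well above the permutation distribution (a small p-value); on the random data above, it should not.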
Ultimately, measuring cross-session consistency in LLMs requires a thoughtful approach to experimental design and statistical analysis. By using the right tools and methods, we can gain a deeper understanding of how these models work and how to improve their performance.