GPT-OSS: A Surprisingly Powerful Open-Source AI Model

I’ve been experimenting with GPT-OSS since its release, and contrary to much of the coverage I’ve seen, it’s surprisingly powerful, even on uncommon datasets. I recently tested it on SATA-Bench, a benchmark in which every question has at least two correct answers (a setup that’s rare in standard LLM evaluation).
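If you want to poke at the benchmark yourself, here’s a minimal sketch of loading it from the Hugging Face Hub. The split and column names are assumptions on my part, so check the dataset card for the actual schema.

```python
from datasets import load_dataset

# Minimal sketch: pull SATA-Bench from the Hugging Face Hub.
# The split and column names are assumptions -- inspect the printed
# structure (or the dataset card) for the real schema before scoring.
ds = load_dataset("sata-bench/sata-bench")
print(ds)                          # shows the available splits and columns
print(ds[list(ds.keys())[0]][0])   # peek at one example record
```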

The results are impressive: the 120B model performs roughly on par with GPT-4.1 on SATA-Bench, while the 20B model lags behind but still matches DeepSeek R1 and Llama-3.1-405B.

One of the key takeaways from this experiment is that repetitive reasoning hurts: about 11% of the 20B model’s outputs get stuck in loops, costing roughly 9 points of exact-match rate. Reason-answer mismatches are also common in the 20B model: it tends to commit to a single answer even when its own reasoning suggests that several choices are correct.
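To make those failure modes concrete, here’s a rough sketch of strict exact-match scoring for select-all-that-apply questions, plus a naive repetition detector. Both functions are my own illustration, not the benchmark’s official scoring code.

```python
def exact_match(predicted: set[str], gold: set[str]) -> bool:
    """Strict exact match for select-all-that-apply questions: the
    predicted set must equal the gold set exactly, so one missing or
    extra choice scores zero. This is why a reason-answer mismatch
    (reasoning names several choices, answer names one) is so costly."""
    return predicted == gold


def looks_like_loop(text: str, ngram: int = 8, threshold: int = 3) -> bool:
    """Naive repetition heuristic (an assumption, not SATA-Bench's
    method): flag an output if any n-gram of whitespace-separated
    tokens repeats `threshold` or more times."""
    tokens = text.split()
    counts: dict[tuple[str, ...], int] = {}
    for i in range(len(tokens) - ngram + 1):
        key = tuple(tokens[i:i + ngram])
        counts[key] = counts.get(key, 0) + 1
        if counts[key] >= threshold:
            return True
    return False


# A mismatch example: the model's reasoning pointed at A and C,
# but it only answered A. Under exact match that scores zero.
print(exact_match({"A"}, {"A", "C"}))       # False
print(exact_match({"A", "C"}, {"A", "C"}))  # True
```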

Interestingly, longer outputs don’t necessarily mean better results; in fact, overthinking reduces accuracy. You can find the detailed findings of this experiment here: https://weijiexu.com/posts/sata_bench_experiments.html. The SATA-Bench dataset is available here: https://huggingface.co/datasets/sata-bench/sata-bench.

Overall, GPT-OSS is a powerful open-source AI model that’s worth exploring, especially for those working with uncommon datasets.
