Have you ever wondered how we can protect AI agents from malicious prompts and attacks? I recently worked on a classifier that can detect these threats in real-time, and I’d love to share my findings with you.
I fine-tuned three SLMs (Smaller Language Models) to evaluate their performance in detecting prompt attacks like prompt injection and context manipulation. The models I tested were Qwen-3 0.6B, Qwen-2.5 0.5B, and SmolLM2-360M.
The results were fascinating. Qwen-3 0.6B outperformed the others, with a precision of 92.1%, recall of 88.4%, and accuracy of 90.3%. The other models didn’t fare as well, with Qwen-2.5 0.5B achieving an accuracy of 83.1% and SmolLM2-360M reaching 71.1%.
So, what did I learn from these experiments? Firstly, adding a simple chain-of-thought reasoning step to each training example significantly improved classification performance. Secondly, Qwen-3 0.6B handled nuanced and edge cases better than the other models. And lastly, with a good dataset and a small reasoning step, even smaller language models can perform surprisingly well.
If you’re interested in exploring this further, I’ve open-sourced the final model on HF, and the code is available in an easy-to-use package on GitHub.
What do you think is the most promising approach to detecting prompt attacks? Share your thoughts in the comments!