As AI agents become more prevalent, it’s essential to ensure they’re not vulnerable to prompt attacks like context manipulation or injection. I’ve been working on a classifier that can detect these attacks in real-time, and I wanted to share my experience fine-tuning three Small Language Models (SLMs) to achieve this goal.
I started by creating a dataset of 4,000 malicious and 4,000 harmless prompts, which I generated synthetically using a Large Language Model (LLM). To improve the performance of my models, I added a single line of reasoning to each training example, explaining why a prompt was malicious or safe.
I tested three SLMs: Qwen-3 0.6B, Qwen-2.5 0.5B, and SmolLM2-360M. The results were interesting – Qwen-3 0.6B outperformed the others, achieving a precision of 92.1%, recall of 88.4%, and accuracy of 90.3%. The smaller models struggled, especially with ambiguous queries.
My key takeaways from this experiment are that even short chain-of-thought reasoning can significantly improve classification performance, and that Qwen-3 0.6B handles nuance and edge cases better than the other models. Additionally, with a good dataset and a small reasoning step, SLMs can perform surprisingly well.
If you’re interested in exploring this further, I’ve made my final model open-source on Hugging Face, and the code is available in an easy-to-use package on GitHub.
What do you think about the potential of fine-tuned SLMs in detecting prompt attacks? Share your thoughts in the comments!