Detecting Prompt Attacks with Fine-Tuned SLMs: What I Learned

Detecting Prompt Attacks with Fine-Tuned SLMs: What I Learned

Have you ever wondered how we can protect AI agents from malicious prompts and attacks? I recently worked on a classifier that can detect these threats in real-time, and I’d love to share my findings with you.

I fine-tuned three SLMs (Smaller Language Models) to evaluate their performance in detecting prompt attacks like prompt injection and context manipulation. The models I tested were Qwen-3 0.6B, Qwen-2.5 0.5B, and SmolLM2-360M.

The results were fascinating. Qwen-3 0.6B outperformed the others, with a precision of 92.1%, recall of 88.4%, and accuracy of 90.3%. The other models didn’t fare as well, with Qwen-2.5 0.5B achieving an accuracy of 83.1% and SmolLM2-360M reaching 71.1%.

So, what did I learn from these experiments? Firstly, adding a simple chain-of-thought reasoning step to each training example significantly improved classification performance. Secondly, Qwen-3 0.6B handled nuanced and edge cases better than the other models. And lastly, with a good dataset and a small reasoning step, even smaller language models can perform surprisingly well.

If you’re interested in exploring this further, I’ve open-sourced the final model on HF, and the code is available in an easy-to-use package on GitHub.

What do you think is the most promising approach to detecting prompt attacks? Share your thoughts in the comments!

Leave a Comment

Your email address will not be published. Required fields are marked *