Fine-Tuning an SLM for Malicious Prompt Detection: Lessons Learned

Fine-Tuning an SLM for Malicious Prompt Detection: Lessons Learned

I recently fine-tuned a Small Language Model (SLM) to detect malicious prompts in user queries for my AI agents. My goal was to create a lightweight model that could accurately classify queries as malicious or harmless. In this post, I’ll share my journey, including the challenges I faced and the techniques that ultimately led to success.

I started by creating a dataset of over 4000 malicious queries using GPT-4, along with an equal number of harmless queries. My initial attempt involved using this dataset to fine-tune the base SLM, but the resulting model was unusable, classifying every query as malicious.

For my second attempt, I fine-tuned the Qwen-3 0.6B model and spent more time prompt-tuning the instructions. This approach yielded slightly improved accuracy, but the model still struggled with edge cases.

It wasn’t until I incorporated Chain of Thought into my approach that I achieved the desired results. By adding reasoning behind each malicious query, I was able to fine-tune the model to achieve high accuracy. The final model is now open-source on Hugging Face, and you can find the code on GitHub.

Through this process, I learned the importance of carefully crafting the dataset and fine-tuning approach to achieve accurate results. I’m excited to use this model as a middleware between users and my AI agents, and I hope my experience can help others working on similar projects.

Leave a Comment

Your email address will not be published. Required fields are marked *