Flagging Prompt Attacks with AI: A 95% Accurate Defense Model

Flagging Prompt Attacks with AI: A 95% Accurate Defense Model

Imagine being able to flag malicious prompts attacking your language models with an accuracy of 95%. That’s exactly what I’ve achieved with my neural network and embedding-based model, Bhairava-0.4B, which requires only 0.4 billion parameters.

The journey began with building small defense models to sit between users and large language models (LLMs). I wanted to detect prompt injection, jailbreak, context attack, and other types of malicious inputs. Initially, I used a ModernBERT model, but it struggled to classify tricky attack queries. Then, I moved to SLMs, which showed some improvement.

However, it wasn’t until I applied contrastive learning to a larger dataset that I saw a significant jump in performance. The new model outperforms my previous SLMs, and I’ve made it open-source on HF with an easy-to-use package on GitHub.

So, how did I achieve this? I trained on a dataset of 12,000 malicious and benign prompts, generated using an LLM. I used ModernBERT-large for embeddings and trained a small neural net to predict whether the input is an attack or not. The contrastive loss helps the model understand the semantic space of attacks, making it more effective.

The best part? This model is fast and efficient, running on just the embedding plus head, making it suitable for real-time filtering. During inference, it embeds the prompt, classifies it as safe or malicious, and passes it to the LLM or logs/block/reroutes the input accordingly.

I’m thrilled to share that Bhairava-0.4B is now able to classify 91% of queries correctly, given its size. If you’re looking to improve your LLM’s defense, I encourage you to give it a try and see how it performs in your stack.

Leave a Comment

Your email address will not be published. Required fields are marked *