When it comes to Large Language Models (LLMs), one of the biggest challenges is defending against prompt attacks. These attacks can be malicious, trying to trick the model into doing something it shouldn’t, or even jailbreak the system. To combat this, I’ve been building defense models that can sit between users and LLMs, flagging incoming user prompts as either safe or malicious.
My journey started with a ModernBERT model, but I soon realized it was hard to get it to classify tricky attack queries correctly. That’s when I turned to Sentence-Level Models (SLMs) to improve performance. However, I recently revisited this approach with contrastive learning and a larger dataset, and the results are astonishing.
My new model, called Bhairava-0.4B, uses ModernBERT-large for embeddings and trains a small neural net to predict whether the input is an attack or not. The key to its success lies in the contrastive loss function, which pulls embeddings of benign samples together and pushes them away from malicious ones. This allows the model to understand the semantic space of attacks.
The best part? This model is small (only 396M params) and optimized for real-time filtering. On my test set, it’s able to classify 91% of the queries as attack/benign correctly, which is a significant improvement.
If you’re interested in trying it out in your stack, you can find the open-source code on GitHub. I’d love to hear about your experience and how it performs in your use case.