When it comes to image classification, every percentage point of accuracy matters. So, when I stumbled upon some surprising results while experimenting with attention modules on CLIP RN50, I had to dig deeper.
As a first step in my audio-visual project, I built unimodal models before moving on to the multimodal stage. For the vision part, I started with CLIP RN50 as the backbone and fine-tuned only the classification layer. With that setup, I was able to reach around 84% accuracy on my dataset.
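For concreteness, here's a minimal sketch of that kind of setup: freeze the CLIP image encoder and train only a linear head on top of its embeddings. It assumes OpenAI's `clip` package, a placeholder `num_classes`, and RN50's 1024-dimensional image embedding; the training loop and data pipeline are omitted.

```python
import torch
import torch.nn as nn
import clip


class CLIPLinearProbe(nn.Module):
    """Frozen CLIP RN50 image encoder with a trainable classification head."""

    def __init__(self, num_classes: int, device: str = "cuda"):
        super().__init__()
        self.backbone, _ = clip.load("RN50", device=device)
        for p in self.backbone.parameters():
            p.requires_grad = False  # only the classifier below is trained
        # CLIP RN50 produces 1024-dimensional image embeddings
        self.classifier = nn.Linear(1024, num_classes)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feats = self.backbone.encode_image(images).float()
        return self.classifier(feats)
```

Only `self.classifier`'s parameters go into the optimizer in this setup.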
But I wanted more. So, I experimented with adding attention modules to see if I could push performance even further.
## The Surprising Results
With CBAM (Convolutional Block Attention Module), accuracy improved to 89%. Not bad. But then I tried SENet (Squeeze-and-Excitation Network), and I was surprised to get an even better result: 93%.
Here’s the thing: I thought CBAM, which combines both channel and spatial attention, would give a stronger boost than SENet, which only does channel attention. But in my experiments, the opposite happened.
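For anyone who hasn't used these modules, the difference is easiest to see in code: an SE block recalibrates channels only, while CBAM applies channel attention and then spatial attention. The sketches below are generic reference implementations (reduction ratio 16 and a 7x7 spatial kernel are the defaults from the papers), not the exact code I wired into CLIP.

```python
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    """Squeeze-and-Excitation: channel attention only."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))   # squeeze: global average pool -> (B, C)
        return x * w.view(b, c, 1, 1)     # excite: rescale each channel


class CBAM(nn.Module):
    """CBAM: channel attention followed by spatial attention."""

    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        # channel attention: shared MLP over avg- and max-pooled descriptors
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # spatial attention: conv over channel-wise avg and max maps
        s = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))
```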
## What’s Going On?
Am I missing something obvious here? Could this be due to dataset characteristics, training setup, or how I integrated CBAM into CLIP? I’m not sure, and that’s why I’d love to hear from others who have tried attention modules on CLIP or ResNet backbones.
## The Power of Attention Modules
Attention modules are designed to help a model focus on the most relevant parts of an image, and they can be especially useful with complex or noisy data. Even so, I didn't expect channel-only SENet to beat CBAM here.
This experience has left me wondering: are there other areas where attention modules can make a bigger impact than we think? And how can we better understand when to use which type of attention module?
If you have any insights or experiences to share, I’d love to hear them in the comments.