Have you ever tried to finetune a Vision-Language Model (VLM) on a Mac with a custom dataset, only to hit a wall? If so, you’re not alone. I recently came across a Reddit post from someone facing exactly this problem, and it’s worth walking through what went wrong and what you can try.
The Goal: Recognizing Ships on Images
The original poster wants to use a VLM to recognize ships in an image and return a JSON response with specific data about each one. They’re using YOLO11 to grab images of ships from video, then passing those images to SmolVLM through Ollama to get the JSON response with the additional data. The setup returns the data in around 200ms, but the data is imprecise, which is why they want to finetune the model.
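Since the pipeline depends on the model returning usable JSON, it can help to validate each reply before trusting it. The following is a minimal sketch using only the Python standard library. The field names (`ship_type`, `hull_color`, `confidence`) and the sample reply are hypothetical, since the post does not say which data the poster extracts.

```python
import json

# Hypothetical schema: the original post does not list the actual fields.
REQUIRED_FIELDS = {
    "ship_type": str,
    "hull_color": str,
    "confidence": (int, float),
}

def parse_ship_response(raw: str) -> dict:
    """Extract and validate the JSON object from a model reply."""
    # Small models sometimes wrap the JSON in extra text, so cut out
    # the span between the first "{" and the last "}".
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model reply")
    data = json.loads(raw[start:end + 1])

    # Check that every expected field is present with the expected type.
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        if not isinstance(data[name], expected_type):
            raise ValueError(f"wrong type for field: {name}")
    return data

# Made-up reply for illustration only.
reply = 'Result: {"ship_type": "container ship", "hull_color": "blue", "confidence": 0.82}'
ship = parse_ship_response(reply)
print(ship["ship_type"], ship["confidence"])
```

Counting how often this check fails on a held-out set of images is also a cheap way to measure whether finetuning actually improved the output.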
The Error: Runtime Error with Torch
The poster tried following a course on SmolVLM, but kept getting a RuntimeError with torch.searchsorted(): boundaries and input value tensors should have same device type, but got boundaries tensor device type cpu and input value tensor device type mps:0.
The Switch to MLX-VLM
Frustrated, the poster switched to MLX-VLM, but couldn’t find any proper documentation or examples on how to run the finetuning with Python and a custom dataset.
Overcoming the Hurdles
If you’re facing similar issues, here are a few things to keep in mind:
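- Check your device type: The error means the boundaries tensor was created on the CPU while the input values live on mps:0, the Mac’s GPU backend in PyTorch. torch.searchsorted() needs both tensors on the same device, so move one of them (or keep the whole run on the CPU) before the call.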
- Look for documentation and examples: While MLX-VLM might not have extensive documentation, you can try searching for examples or tutorials on finetuning VLMs with custom datasets.
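- Explore alternatives: If the SmolVLM course code isn’t working for you, consider different tooling such as MLX-VLM, which is a package for running VLMs on a Mac through MLX rather than a model in its own right, or try a different VLM. If you only need to classify the ship and don’t need free-form JSON, an image-only model such as a Vision Transformer may also be enough.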
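The device-type point above can be sketched in a few lines of PyTorch. This is an illustration of the mismatch and one way around it, not the course’s actual code: the helper name `searchsorted_same_device` and the sample tensors are made up for the example, and in the poster’s case the failing call may sit inside library code rather than their own script.

```python
import torch

def pick_device() -> torch.device:
    """Use the Mac GPU backend (MPS) when available, otherwise the CPU."""
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

def searchsorted_same_device(boundaries: torch.Tensor, values: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: move boundaries to the device of values first."""
    return torch.searchsorted(boundaries.to(values.device), values)

device = pick_device()

# Tensors are created on the CPU unless a device is given, which is
# how a cpu / mps:0 mismatch like the one in the error can arise.
boundaries = torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0])
values = torch.tensor([0.1, 0.6, 0.9], device=device)

indices = searchsorted_same_device(boundaries, values)
print(indices.cpu().tolist())
```

If the mismatch happens inside a library you don’t control, running the whole training loop on the CPU is slower but avoids the mixed-device state entirely.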
Final Thought
Finetuning a VLM on a Mac with a custom dataset can be challenging, but it’s not impossible. By being aware of the potential pitfalls and exploring different models and approaches, you can overcome the hurdles and achieve your goals.
*Further reading: SmolVLM course*