As AI models continue to evolve, one technique keeps gaining attention: Reinforcement Learning (RL). In this post, I’ll dive into my experience with RL and how I applied it to JoyCaption, a tool for captioning image datasets for Stable Diffusion and similar models.
## The Story Behind JoyCaption
JoyCaption was built primarily to help with captioning image datasets for Stable Diffusion and similar models. But have you ever wondered how a tool like that is made? In this article, I’ll share my entire process of putting JoyCaption through Reinforcement Learning to improve its performance.
## Understanding Reinforcement Learning
RL is often misunderstood as just Preference Tuning, but it’s so much more. In RL, an agent learns to make decisions by interacting with an environment and receiving rewards or penalties, and over many of these interactions its behavior is gradually pushed toward whatever earns higher reward.
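To make that loop concrete, here is a minimal, purely illustrative REINFORCE sketch in PyTorch: a toy policy acts, a toy environment hands back a reward, and the update nudges the policy toward higher-reward actions. None of this is JoyCaption’s actual training code; the policy, environment, and reward function are hypothetical stand-ins.

```python
import torch

# Toy "agent": maps a 4-dim state to logits over 3 possible actions.
policy = torch.nn.Linear(4, 3)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def environment_reward(state, action):
    # Hypothetical environment: pays 1.0 when the action matches a simple rule,
    # otherwise 0.0. In captioning, this would instead score caption quality.
    return 1.0 if action == int(state.argmax()) % 3 else 0.0

for step in range(1000):
    state = torch.randn(4)
    dist = torch.distributions.Categorical(logits=policy(state))
    action = dist.sample()                              # agent acts
    reward = environment_reward(state, action.item())   # environment responds

    # REINFORCE: raise the log-probability of actions in proportion to reward.
    # Real setups usually subtract a baseline to reduce variance; omitted here.
    loss = -dist.log_prob(action) * reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The important part is the shape of the loop, not the toy numbers: act, get rewarded, update, repeat.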
## The JoyCaption Process
The full write-up is a huge dump covering not only my entire process of putting JoyCaption through Reinforcement Learning, but also a breakdown of RL itself. You can read the details in the linked article.
## The Future of Vision Models
I believe that diffusion and vision models desperately need their ‘RL moment’ like LLMs had. By putting a VLM and a diffusion model in one big back-and-forth RL loop, we can hammer massive improvements into both.
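To make that idea a little more concrete, here is a purely hypothetical sketch of what such a loop could look like: two toy policies stand in for the diffusion model and the VLM, and both are updated with REINFORCE from a single shared round-trip reward. This is not an existing system, just an illustration of the loop structure.

```python
import torch
import torch.nn.functional as F

gen = torch.nn.Linear(8, 16)   # stand-in "diffusion model": prompt -> image logits
cap = torch.nn.Linear(16, 8)   # stand-in "VLM": image -> caption logits
opt = torch.optim.Adam(list(gen.parameters()) + list(cap.parameters()), lr=1e-2)

for step in range(2000):
    target = step % 8
    prompt = F.one_hot(torch.tensor(target), 8).float()        # toy "prompt"

    img_dist = torch.distributions.Categorical(logits=gen(prompt))
    image = img_dist.sample()                                   # generator's turn

    cap_dist = torch.distributions.Categorical(logits=cap(F.one_hot(image, 16).float()))
    caption = cap_dist.sample()                                 # captioner's turn

    # Shared reward: the round trip succeeds when the caption recovers the prompt.
    reward = 1.0 if caption.item() == target else 0.0

    # REINFORCE on both stand-ins from the same reward, closing the loop.
    loss = -(img_dist.log_prob(image) + cap_dist.log_prob(caption)) * reward
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The real version would of course involve actual images, actual captions, and a much richer reward, but the back-and-forth structure is the point.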
## Further Reading
If you’re interested in learning more about how JoyCaption was made, I’ve got another article underway that covers the base model training, building the core caption dataset, VQA, and training a sightless Llama 3.1 to see.
*Further reading: [How OpenAI Misled You on RLHF](https://aerial-toothpaste-34a.notion.site/How-OpenAI-Misled-You-on-RLHF-1f83f742d9dd80a68129d06503464aff)*