As a machine learning enthusiast, I've been there too: trying to fine-tune a model on my own dataset, only to run into a plethora of problems. One of the biggest hurdles is collecting clean data. I've written scripts to scrape websites, but the output always comes with a lot of noise. Writing a separate script for every website is tedious, and even then the extracted data can be wrong or incomplete.
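For what it's worth, the one thing that cut down my per-site script churn was switching from hand-written selectors to a generic article extractor. Here's a minimal sketch using the trafilatura library (the URL is just a placeholder); it isn't perfect on every site, but it handles boilerplate removal far better than my ad-hoc parsing ever did:

```python
import trafilatura

def scrape_clean_text(url: str) -> str | None:
    """Fetch a page and extract the main text, dropping nav/ads/boilerplate."""
    downloaded = trafilatura.fetch_url(url)
    if downloaded is None:  # network error or non-HTML response
        return None
    # extract() returns the main content as plain text, or None if it
    # can't find a confident article body.
    return trafilatura.extract(downloaded, include_comments=False)

if __name__ == "__main__":
    text = scrape_clean_text("https://example.com/some-article")  # placeholder URL
    if text:
        print(text[:500])
```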
I've also tried manually checking samples, but that's slow and frustratingly inefficient at any real dataset size. And then there are websites changing their HTML structure, which silently breaks scrapers and fills the dataset with noisy samples.
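Instead of eyeballing everything, I've started semi-automating the spot check: flag any sample that still looks like markup or is suspiciously short, and only review those by hand. A rough sketch along those lines (the patterns and threshold here are arbitrary assumptions, tune them for your data):

```python
import re

TAG_RE = re.compile(r"</?\w+[^>]*>")  # leftover HTML tags in a sample
MIN_LEN = 200                         # arbitrary minimum character count

def looks_noisy(sample: str) -> bool:
    """Heuristic: residual markup or too-short text suggests a broken scrape."""
    if len(sample) < MIN_LEN:
        return True
    if TAG_RE.search(sample):
        return True
    return False

def flag_for_review(samples: list[str]) -> list[int]:
    """Return the indices of samples worth a manual look."""
    return [i for i, s in enumerate(samples) if looks_noisy(s)]
```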
I've experimented with having ChatGPT generate samples, but the quality isn't good enough for fine-tuning. Writing samples by hand isn't viable either; it takes far too much time and effort. I've even tried cleaning the data with regex, but it's brittle and not always effective.
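For reference, this is roughly the kind of regex cleanup I mean: strip tags, decode entities, collapse whitespace. It's a minimal sketch, and it illustrates the brittleness too; any markup edge case the patterns don't anticipate slips straight through:

```python
import html
import re

def regex_clean(raw: str) -> str:
    """Best-effort cleanup of scraped text; brittle by nature."""
    text = re.sub(r"<script.*?</script>", " ", raw, flags=re.S | re.I)  # drop inline scripts
    text = re.sub(r"<style.*?</style>", " ", text, flags=re.S | re.I)   # drop inline styles
    text = re.sub(r"<[^>]+>", " ", text)         # strip remaining tags
    text = html.unescape(text)                   # decode &amp;, &quot;, etc.
    text = re.sub(r"\s+", " ", text).strip()     # collapse whitespace
    return text
```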
So, the question remains: is there an easier way to get clean data? What crawlers or scripts do people use to automate this process? Are there any go-to tools or techniques for collecting datasets?
The struggle is real, and I’m not alone. Many of us have been in this situation, and it’s essential to find better ways to collect and clean data. Maybe it’s time to explore new tools and techniques that can simplify this process and make our lives easier as machine learning practitioners.