As I delved deeper into my RAG project, I realized that text segmentation is more than just splitting text into chunks. I wanted to improve the quality of my segmentation, and I’m sure many of you can relate.
Initially, I was satisfied with the results, but over time I noticed that segments either came out too short or broke sentences apart mid-way. I knew I had to do something about it.
## Fine-tuning the Embedding Model
My first approach was to fine-tune the embedding model itself but, surprisingly, the base model outperformed my tuned versions. This led me to focus on improving the segmentation process instead.
## Local Improvements
Since my project has accumulated several library dependencies over time, I wanted to explore local improvements that wouldn’t require additional packages. This is where I need your help.
## Leveraging a Classification NN
I’ve also built a simple classification neural network that accurately identifies the top-N topics for a given segment. I believe this could help define cut-off points during segmentation. The question is how to use it effectively; here’s a rough sketch of one idea.
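For concreteness, this is a minimal sketch of the kind of integration I have in mind, assuming a hypothetical `predict_topics` wrapper around the NN that returns the top-N topic labels for a span of text. The idea: slide a window over the sentences and cut wherever the topic sets on either side of a candidate boundary barely overlap.

```python
# A minimal sketch of using a topic classifier to place cut-offs.
# `predict_topics` is a stand-in for the classification NN; it is
# assumed to return the set of top-N topic labels for a piece of text.

from typing import Callable, List, Set


def topic_cut_points(
    sentences: List[str],
    predict_topics: Callable[[str], Set[str]],
    window: int = 3,
    min_overlap: float = 0.34,
) -> List[int]:
    """Return sentence indices where the topic mix shifts enough to cut.

    Compares the top-N topic sets of the windows before and after each
    candidate boundary; a low Jaccard overlap suggests a topic change.
    """
    cuts = []
    for i in range(window, len(sentences) - window + 1):
        left = predict_topics(" ".join(sentences[i - window:i]))
        right = predict_topics(" ".join(sentences[i:i + window]))
        union = left | right
        overlap = len(left & right) / len(union) if union else 1.0
        if overlap < min_overlap:
            cuts.append(i)
    return cuts
```

Whether Jaccard overlap on hard labels is the right signal, or whether comparing the full probability distributions would be more stable, is exactly the kind of thing I’d love input on.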
## The Quest for Improvement
If you’ve faced similar challenges or have ideas on how to improve embedding-based segmentation, I’d love to hear them. Bonus points if they’re computationally efficient!
Some potential areas to explore include:
- **Sentence tokenization**: Could naive sentence splitting be the culprit behind broken sentences? (The sketch after this list includes a standard-library splitter as a baseline.)
- **Segmentation algorithms**: Are there alternative approaches, such as cutting at dips in adjacent-sentence embedding similarity (also sketched below), that could yield better results?
- **Integrating the classification NN**: How can I effectively use my topic classification model to enhance segmentation, beyond the rough idea sketched above?
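To make the first two items concrete, here’s a minimal sketch of the kind of embedding-based approach I’m picturing: a TextTiling-style pass that cuts wherever similarity between adjacent sentences dips well below average. The `embed` function is a placeholder for whatever embedding model the pipeline already uses, the sentence splitter is standard-library only (no new dependencies), and the mean-minus-drop threshold is just one possible heuristic.

```python
# A minimal sketch of embedding-similarity segmentation, assuming an
# `embed` function that maps text to a vector (placeholder for the
# model the pipeline already uses). Only the standard library is used.

import math
import re
from typing import Callable, List, Sequence


def split_sentences(text: str) -> List[str]:
    # Naive splitter: break after ., !, or ? followed by whitespace.
    # Abbreviations ("e.g.", "Dr.") will still fool it; a proper
    # tokenizer is worth it if broken sentences are the real culprit.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def segment(
    text: str,
    embed: Callable[[str], Sequence[float]],
    drop: float = 0.15,
) -> List[str]:
    """Group sentences, cutting where adjacent similarity dips sharply."""
    sentences = split_sentences(text)
    if len(sentences) < 2:
        return sentences
    vecs = [embed(s) for s in sentences]
    sims = [cosine(vecs[i], vecs[i + 1]) for i in range(len(vecs) - 1)]
    mean = sum(sims) / len(sims)
    segments, current = [], [sentences[0]]
    for i, sim in enumerate(sims):
        if sim < mean - drop:  # similarity well below average: cut here
            segments.append(" ".join(current))
            current = []
        current.append(sentences[i + 1])
    segments.append(" ".join(current))
    return segments
```

One thing I like about this shape: cuts only ever land on sentence boundaries, so no sentence gets broken apart, and the embedding cost stays linear in document length.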
Share your thoughts, and let’s take text segmentation to the next level!