I came across a Reddit post from someone deep in the weeds of a text classification problem, and it felt incredibly familiar. They were trying to tag business documents with one of ~500 different ‘Business Codes’ (BCs). To make things harder, the codes often had similar descriptions, and their training data was sparse—not every code had an example.
This is a classic NLP challenge. You have a ton of categories, messy text, and not enough labeled data. The poster, /u/Open-Occasion-3437, had already tried the usual suspects:
1. **TF-IDF with XGBoost/Random Forests:** This is often the first thing people try. It’s fast and simple, but as they found, it often gives poor results with nuanced text, since bag-of-words features can’t tell apart codes whose descriptions share most of their vocabulary.
2. **Word2Vec with XGBoost/Random Forests:** A step up, this approach tries to capture the *meaning* of words. But it still didn’t work well for them. It can struggle when the context is highly specific to a business domain.
3. **Clustering with KNN:** They tried to group similar business codes together first, then classify within those smaller groups. A smart idea, but the clusters weren’t making sense.
So, what’s next when you’re stuck in this exact spot? Before jumping to expensive, complex LLMs, there are a few other things worth exploring.
### Let’s Talk About Embeddings (But Better Ones)
Word2Vec is a good start, but modern NLP has moved way beyond it. The real power today lies in pre-trained transformer models. Think BERT, RoBERTa, or even smaller, more efficient models like DistilBERT.
Instead of just converting words to vectors, these models understand context. They know that “project plan” in a business document means something different from “project runway” in a fashion blog.
Here’s a simple way to use them:
* **Use a pre-trained model to get document embeddings.** Take your entire document, feed it into a model from the `sentence-transformers` library, and get a single, rich vector that represents its meaning.
* **Then, use a simple classifier.** You can feed these high-quality embeddings into a simple logistic regression, SVM, or even a lightweight neural network. You might be surprised at how well this works compared to XGBoost on older-style embeddings.
This approach gives you the best of both worlds: the deep contextual understanding of a large language model without the cost and complexity of fine-tuning it from scratch for classification.
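To make that concrete, here’s a minimal sketch of the idea. The document texts, the labels, and the `all-MiniLM-L6-v2` model name are placeholder assumptions for illustration, not anything from the original post.

```python
# Sketch: pre-trained sentence embeddings + a simple classifier.
# The documents, labels, and model name below are illustrative only.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

documents = [
    "Quarterly budget forecast for the sales team",
    "Invoice for office furniture purchase",
    "Employee onboarding checklist and policies",
]
labels = ["BC-101", "BC-205", "BC-317"]  # whatever codes your labeled docs carry

# 1. One contextual embedding vector per document.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
X = encoder.encode(documents)

# 2. A lightweight classifier on top of those embeddings.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, labels)

# Predict the code for a new, unseen document.
new_doc = "Purchase order for standing desks"
print(clf.predict(encoder.encode([new_doc]))[0])
```

Swapping `LogisticRegression` for an SVM or a small neural network is a one-line change once the embeddings are in place.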
### What If You Have No Examples for Some Codes?
This is the trickiest part. If the model has never seen an example of ‘BC-432’, how can it ever predict it? This is where you have to get a little creative, with what’s often called **zero-shot classification**.
Here’s an approach I’ve seen work:
1. **Embed *Everything***: Use a sentence-transformer model to create vector embeddings for all your training documents. But here’s the key: also create embeddings for the *descriptions* of all 500 business codes.
2. **Use Cosine Similarity:** Now, when a new, untagged document comes in, you create an embedding for it. Then, you simply compare that new document’s vector to the vectors of all 500 business code descriptions. The business code whose description is most ‘similar’ (has the highest cosine similarity score) to the document is your predicted tag.
This method doesn’t require a single training example for a business code, as long as you have a good description for it. It’s basically a sophisticated matching game.
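Here’s a rough sketch of what that matching game could look like with `sentence-transformers`. The code descriptions below are invented for illustration; the real version would hold all ~500 of them.

```python
# Sketch: zero-shot tagging by cosine similarity between a document
# embedding and the embeddings of the business code *descriptions*.
# The descriptions dict and model name are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

code_descriptions = {
    "BC-101": "Financial planning and budget forecasting documents",
    "BC-205": "Procurement and purchasing of office equipment",
    "BC-432": "Human resources onboarding and policy material",
}

encoder = SentenceTransformer("all-MiniLM-L6-v2")
codes = list(code_descriptions.keys())
desc_embeddings = encoder.encode(list(code_descriptions.values()), convert_to_tensor=True)

def predict_code(document_text: str) -> str:
    doc_embedding = encoder.encode(document_text, convert_to_tensor=True)
    # Cosine similarity between the document and every code description;
    # the most similar description wins.
    scores = util.cos_sim(doc_embedding, desc_embeddings)[0]
    return codes[scores.argmax().item()]

print(predict_code("New hire paperwork and first-week checklist"))  # likely BC-432
```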
### A Final Thought: Don’t Underestimate the Data
Before diving into more complex models, always ask: can I improve my data? In this case, since many business codes are similar, maybe the labels themselves are ambiguous. Or maybe the document text is full of noise.
Sometimes, a bit of clever data cleaning or feature engineering (like pulling out specific keywords or phrases) can make a bigger difference than the fanciest model in the world.
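As a trivial illustration, hand-picked keyword flags can be turned into extra features and concatenated with your embeddings or TF-IDF vectors. The phrase list here is hypothetical; in practice it would come from the code descriptions or a domain expert.

```python
# Sketch: binary keyword features as simple, hand-crafted signal.
# KEY_PHRASES is a made-up list for illustration only.
KEY_PHRASES = ["purchase order", "invoice", "onboarding", "budget forecast"]

def keyword_features(text: str) -> list[int]:
    text = text.lower()
    # One 0/1 feature per phrase: is it present in the document?
    return [int(phrase in text) for phrase in KEY_PHRASES]

print(keyword_features("Attached is the invoice for the budget forecast review."))
# -> [0, 1, 0, 1]
```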
So, if you find yourself in a similar situation, don’t give up. The classics like TF-IDF might not cut it anymore, but there’s a whole world of techniques between that and massive LLMs. Give sentence-transformers and a zero-shot approach a try. You might just get the breakthrough you’re looking for.
—
*Original Reddit post for context: [Advice on building a classification model for text classification](https://www.reddit.com/r/MLQuestions/comments/1mshvgk/advice_on_building_a_classification_model_for/)*