Have you ever built a machine learning model to predict something, like the price of a laptop, and added more columns of data thinking it would improve accuracy, only to find that the accuracy stayed exactly the same?
I’m not alone, right?
I recently stumbled upon a Reddit post where someone shared a similar experience. They built a simple model to predict laptop prices using a Kaggle dataset, starting with just two columns, 'Ram' and 'Inches'. But when they added more columns, their model's accuracy stayed stuck at 60%.
So, what’s going on?
The Problem: Correlated Features
Adding more columns doesn't necessarily mean better accuracy, because many of the features in a dataset are correlated with each other. For example, 'Ram', 'Storage', and 'Inches' often rise together: higher-end laptops tend to have more of all three. When that happens, the extra columns are largely redundant, and the model isn't learning anything new from them.
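It's easy to check this yourself. The sketch below assumes a DataFrame loaded from a laptop prices CSV with numeric 'Ram', 'Storage', 'Inches', and 'Price' columns; the actual Kaggle file uses slightly different names and string values like "8GB", so you may need to clean it first.

```python
import pandas as pd

# Assumed file name and column names: adjust to your own dataset.
df = pd.read_csv("laptop_prices.csv")

# Pearson correlation between candidate features and the target.
# A value near +1 or -1 between two *features* means they carry
# largely the same information.
corr = df[["Ram", "Storage", "Inches", "Price"]].corr()
print(corr.round(2))
```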
The Fix: Feature Selection and Engineering
So, how do we improve our model's accuracy? One approach is feature selection: keeping only the features that actually carry information about the target. Techniques like mutual information, correlation analysis, or recursive feature elimination can help us pick them.
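As a concrete sketch, here is how mutual information could look with scikit-learn, reusing the `df` from the correlation example above (the column names are assumptions, not the exact Kaggle schema):

```python
from sklearn.feature_selection import mutual_info_regression

# Assumed numeric feature columns and a 'Price' target from the
# DataFrame loaded above.
features = ["Ram", "Storage", "Inches"]
X = df[features]
y = df["Price"]

# Mutual information scores how much each feature, on its own,
# tells us about the target; near-zero scores flag columns that
# add little beyond noise.
scores = mutual_info_regression(X, y, random_state=0)
for name, score in sorted(zip(features, scores), key=lambda t: -t[1]):
    print(f"{name:10s} {score:.3f}")
```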
Another approach is to engineer new features that provide more information. For example, we could create a new feature that combines ‘Ram’ and ‘Storage’ into a single feature, like ‘System Performance’.
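How you combine them is up to you; the snippet below is just one illustrative option, and the weights are made up rather than any standard formula.

```python
# Made-up weighting, purely for illustration.
df["SystemPerformance"] = df["Ram"] * 2 + df["Storage"] / 128

# An interaction term is another common choice: it separates
# "lots of RAM *and* lots of storage" from machines that only
# have one of the two.
df["RamTimesStorage"] = df["Ram"] * df["Storage"]
```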
Other Reasons for Poor Accuracy
Of course, there could be other reasons why our model’s accuracy isn’t improving. Maybe our model is overfitting or underfitting, or maybe we need to tune our hyperparameters.
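A quick way to tell overfitting from underfitting is to compare training and test scores, and cross-validation takes care of the tuning part. The sketch below uses a random forest purely as an example model, reusing the `X` and `y` defined earlier.

```python
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# A big gap between train and test R^2 points to overfitting;
# two low scores point to underfitting.
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
print("train R^2:", model.score(X_train, y_train))
print("test  R^2:", model.score(X_test, y_test))

# Light hyperparameter tuning with 5-fold cross-validation.
grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10, 20]},
    cv=5,
)
grid.fit(X_train, y_train)
print("best params:", grid.best_params_)
```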
Takeaway
Adding more data columns isn’t a guarantee of better accuracy. We need to carefully select and engineer our features to provide the most information to our model. By doing so, we can improve our model’s accuracy and make better predictions.
*Further reading: Feature Selection in Machine Learning*