Navigating Small Datasets and Large Feature Spaces

When a dataset has far fewer observations than features, it is hard to get accurate predictions, and just as hard to tell whether the predictions you do get can be trusted. I recently came across a Reddit post where someone was struggling with exactly this problem. They had a spectral library with 56 observations and about 2000 features, and were trying to use Partial Least Squares Regression (PLSR) to predict a biochemical variable from the spectra.

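The poster had already reduced the feature count to around 100-150 by ranking each spectral feature by its Pearson correlation with the target variable. However, they were unsure how to proceed with data splitting and cross-validation, especially given the small dataset size. The order of these steps matters: if the correlation filter is computed on all 56 observations before splitting, the selected features have already seen the samples that are later held out, which tends to make validation scores look better than they are.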
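As a minimal sketch of what such a correlation filter might look like when it is fitted on the training rows only, here is a NumPy version. The data are synthetic noise standing in for the spectra, and the function name `top_k_by_pearson`, the 40/16 split, and `k=100` are illustrative assumptions, not details from the original post.

```python
import numpy as np

def top_k_by_pearson(X_train, y_train, k):
    """Return indices of the k features with the largest |Pearson r| to y."""
    Xc = X_train - X_train.mean(axis=0)   # center each feature
    yc = y_train - y_train.mean()         # center the target
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Constant features have zero variance; give them r = 0 instead of NaN.
    r = (Xc.T @ yc) / np.where(denom == 0, np.inf, denom)
    return np.argsort(-np.abs(r))[:k]

# Synthetic stand-in with the shape described in the post (56 x 2000).
rng = np.random.default_rng(0)
X = rng.normal(size=(56, 2000))
y = rng.normal(size=56)

# Hypothetical split: the filter only ever sees the training rows.
train, test = np.arange(40), np.arange(40, 56)
idx = top_k_by_pearson(X[train], y[train], k=100)

# The same column indices are then applied to the held-out rows.
X_train_sel = X[train][:, idx]
X_test_sel = X[test][:, idx]
```

The point of the design is that `idx` is derived without touching the test rows, so any score computed on `X_test_sel` is not influenced by the selection step.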

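One of the key concerns was how to implement nested cross-validation with PLSR, which is often recommended when the same small dataset has to serve for both tuning the model (for example, choosing the number of PLS components) and estimating its performance. Another issue was that some models were achieving higher R² values in the test set than in the training set, which seemed counterintuitive. With only a handful of held-out samples, though, R² is a noisy estimate, and a single lucky split can produce that pattern.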

These are common challenges that many of us face when working with small datasets and large feature spaces. In this post, we’ll explore some strategies for dealing with these issues and getting more accurate predictions.