Hey there, fellow machine learning enthusiasts! I’m working on my MSc thesis and I’ve run into a common dilemma in model evaluation. I’m using autoencoders for unsupervised fraud detection on the Kaggle credit card dataset. I’ve trained 8 different architectures, each combined with 8 different thresholding strategies. The problem is that one of my strategies is explicitly designed to find the best F1 score on the validation set, which makes it the obvious winner when I compare all strategies on that same set.
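For concreteness, here’s a minimal sketch of what that F1-maximizing strategy looks like in my setup, assuming the per-sample reconstruction errors on the validation set have already been computed (the names `val_errors` and `val_labels` are just placeholders, not my actual variable names):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(val_errors, val_labels):
    """Pick the reconstruction-error threshold that maximizes F1 on the validation set."""
    precision, recall, thresholds = precision_recall_curve(val_labels, val_errors)
    # precision and recall have one more entry than thresholds, so drop the last point
    f1 = 2 * precision[:-1] * recall[:-1] / np.clip(precision[:-1] + recall[:-1], 1e-12, None)
    best = np.argmax(f1)
    return thresholds[best], f1[best]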
So, I’m left wondering: is it valid to use the test set to compare all the strategies and pick the best ones? I wouldn’t be tuning anything on the test set, just comparing frozen models and thresholds. I’ve saved all the model states, threshold values, and predictions on both the validation and test sets.
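To make that comparison step concrete, this is roughly how I would score every frozen (architecture, strategy) pair on the test set; the dictionary layout and names (`saved_runs`, `test_labels`) are illustrative assumptions, not my actual code:

```python
from sklearn.metrics import f1_score

def compare_on_test(saved_runs, test_labels):
    """saved_runs maps (architecture, strategy) -> {'threshold': float, 'test_errors': np.ndarray}."""
    results = {}
    for key, run in saved_runs.items():
        # Thresholds were frozen on the validation set; the test set is only used for scoring.
        preds = (run['test_errors'] > run['threshold']).astype(int)
        results[key] = f1_score(test_labels, preds)
    return results
```

Picking the single best entry out of that results dictionary is exactly the selection step I’m unsure about.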
The question is: would I be risking data leakage or overfitting to the test set by doing this, or is it a legitimate way to evaluate my models?
In essence, I’m trying to avoid overfitting and make sure my models generalize to new, unseen data. By comparing strategies on the test set, I’m hoping to get a more realistic picture of which models would perform best in practice.
But I’m also aware that this approach might introduce a form of data leakage: not through training, since everything is frozen at this point, but through the selection step itself. If I pick the combination that happens to look best on the test set, its reported test score becomes an optimistically biased estimate, which would compromise the integrity of my evaluation.
So, what do you think? Is using the test set to compare strategies a valid approach, or should I explore other evaluation methods to avoid data leakage and overfitting?