Have you ever wondered how to predict employee departures using HR data? I’ve been working on a classification problem to do just that, and I’d love to share my experience and get your feedback.
My dataset is updated monthly, and for each employee, I’ve kept only one row: either their last available row if they’re still employed or the row corresponding to the month they left. This approach makes sense to me, but I’m not entirely sure if it’s the right one.
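To make the row selection concrete, here is a minimal sketch of that reduction in pandas. The table and its column names (`employee_id`, `month`, `left_this_month`) are hypothetical, and it assumes no rows are recorded for an employee after the month they leave, so the last row per employee is either the departure month or the latest month still employed.

```python
import pandas as pd

# Hypothetical monthly snapshot table: one row per employee per month.
snapshots = pd.DataFrame({
    "employee_id": [1, 1, 1, 2, 2, 3],
    "month": pd.to_datetime([
        "2024-01-01", "2024-02-01", "2024-03-01",
        "2024-01-01", "2024-02-01", "2024-03-01",
    ]),
    "left_this_month": [0, 0, 0, 0, 1, 0],
})

# Sort chronologically, then keep only the final row of each employee.
# Assumption: rows stop after the departure month, so "last row" covers
# both cases (still employed, or left).
latest = (
    snapshots.sort_values(["employee_id", "month"])
    .drop_duplicates("employee_id", keep="last")
    .reset_index(drop=True)
)
```

One side effect of this layout is that the rows for employees who stayed all come from the latest month, while the rows for leavers are spread across earlier months.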
I’ve cleaned the data and trained classification models using Decision Trees and Random Forests. My goal is to predict employee departures accurately – maximizing true positives (correctly predicting departures) while minimizing false positives and false negatives.
My best-performing model (a Random Forest classifier) gives me roughly the following, as shares of all predictions (the four figures sum to 100%):
• True Positives: ~88.6%
• False Negatives: ~2.4%
• False Positives: ~4.3%
• True Negatives: ~4.7%
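Taking the four figures above at face value, with departures as the positive class, the usual summary metrics follow from simple arithmetic. The sketch below is only that arithmetic; the variable names are illustrative.

```python
# Shares of all predictions, in percent, copied from the list above.
tp, fn, fp, tn = 88.6, 2.4, 4.3, 4.7

precision = tp / (tp + fp)        # of predicted departures, how many were real
recall = tp / (tp + fn)           # of real departures, how many were caught
specificity = tn / (tn + fp)      # of real "stays", how many were recognized
accuracy = (tp + tn) / (tp + fn + fp + tn)
positive_share = (tp + fn) / (tp + fn + fp + tn)  # prevalence of the positive class

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"specificity={specificity:.3f} accuracy={accuracy:.3f} "
      f"positive_share={positive_share:.3f}")
```

On these numbers recall is about 97% and precision about 95%, but specificity is only about 52%, and the positive class accounts for about 91% of rows. That is worth checking against which class is actually labeled positive.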
While the results are decent, I’m still looking to reduce false positives and false negatives. I’ve already tuned the model’s hyperparameters with a grid search, but I’m not seeing major improvements.
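For context, a grid search that targets both error types at once could score candidates on F1 rather than plain accuracy. The sketch below uses scikit-learn's `GridSearchCV` on synthetic data; the data, the parameter grid, and the choice of F1 are assumptions for illustration, not my actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the HR table: class 1 ("left") is the minority.
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.85, 0.15], random_state=0
)

# Deliberately small, illustrative grid.
param_grid = {"max_depth": [4, None], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid,
    scoring="f1",  # F1 on class 1 penalizes false positives and false negatives
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X, y)

best_params = search.best_params_
best_f1 = search.best_score_  # mean cross-validated F1 of the best candidate
```

Stratified folds keep the stay/leave ratio roughly constant across splits, which matters when one class is rare.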
That’s where you come in! I’m looking for advice on the following:
• Are there techniques (feature engineering, modeling approaches, sampling strategies, etc.) that are particularly effective for churn prediction or HR datasets?
• How can I further improve class separation, especially considering the imbalance between people who stay vs leave?
• Is it possible (and meaningful) to calculate an individual-level probability of churn (i.e., how likely a specific person is to leave), particularly when using a Random Forest? If yes, how would I extract and interpret that?
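On the last two points, scikit-learn's `RandomForestClassifier` does expose per-row class probabilities through `predict_proba`, and `class_weight="balanced"` is one built-in way to reweight an imbalanced target. The sketch below shows both on synthetic data; the data, the 0.3 threshold, and the variable names are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the HR table: class 1 ("left") is the minority.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.85, 0.15], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0
)

# class_weight="balanced" reweights classes inversely to their frequency.
model = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=0
)
model.fit(X_train, y_train)

# predict_proba returns one column per class, ordered as in model.classes_.
leave_col = list(model.classes_).index(1)
p_leave = model.predict_proba(X_test)[:, leave_col]

# Hypothetical decision threshold: lowering it from the default 0.5 trades
# more false positives for fewer false negatives.
threshold = 0.3
flagged = p_leave >= threshold
```

Each value in `p_leave` is the forest's averaged per-tree probability for that row, so it works as a ranking score for who is most at risk. It is not guaranteed to be well calibrated; if the number is to be read as a literal likelihood, checking calibration first (for example with scikit-learn's `CalibratedClassifierCV`) seems prudent.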
If you’ve worked on similar projects or have experience with HR data analysis, I’d really appreciate any tips, experience sharing, or suggestions – thanks in advance!
Let’s discuss and learn from each other!