Tal Korem
@tkorem.bsky.social
Microbiome, network inference, metabolism and reproductive health. All views are mine.
239 followers, 159 following, 51 posts
tkorem.bsky.social

With this in mind, we developed RebalancedCV, an sklearn-compatible package that drops the minimal number of samples from each training set so that the class balance is the same across the training sets of all folds, thus resolving distributional bias. github.com/korem-lab/Re...

GitHub - korem-lab/RebalancedCV
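To make the idea concrete, here is a minimal sketch of a rebalanced leave-one-out split for binary labels. It is an illustration written from the description in the post, not the package's own code: the function name `rebalanced_loo_splits` and its `seed` parameter are hypothetical, and it assumes that for each held-out sample exactly one randomly chosen training sample of the opposite class is dropped.

```python
import numpy as np


def rebalanced_loo_splits(y, seed=0):
    """Yield (train_idx, test_idx) pairs for a rebalanced leave-one-out scheme.

    Hypothetical helper for illustration only; assumes binary labels.
    For each held-out sample, one training sample of the opposite class
    is also dropped, so every training set has the same class counts.
    """
    y = np.asarray(y)
    if len(np.unique(y)) != 2:
        raise ValueError("this sketch assumes exactly two classes")
    rng = np.random.default_rng(seed)
    all_idx = np.arange(len(y))
    for test_i in all_idx:
        # Candidates to drop: samples whose class differs from the held-out one.
        other = all_idx[y != y[test_i]]
        drop_i = rng.choice(other)
        train_idx = all_idx[(all_idx != test_i) & (all_idx != drop_i)]
        yield train_idx, np.array([test_i])


y = np.array([0, 0, 0, 0, 1, 1, 1])

# Plain leave-one-out: the positive fraction in training depends on the held-out label.
plain_fractions = {
    float(np.mean(np.delete(y, i) == 1)) for i in range(len(y))
}

# Rebalanced leave-one-out: the positive fraction is identical in every fold.
fractions = {
    float(np.mean(y[train] == 1)) for train, _ in rebalanced_loo_splits(y)
}

print("plain LOOCV training fractions:", sorted(plain_fractions))
print("rebalanced LOOCV training fractions:", sorted(fractions))
```

With 4 negatives and 3 positives, plain LOOCV trains on either 3/6 or 2/6 positives depending on which class is held out, while the rebalanced split always trains on 3 negatives and 2 positives. The cost is one extra training sample per fold.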


tkorem.bsky.social

With RebalancedCV we could see the "real-life" impact of distributional bias. We reproduced 3 recently published analyses that used LOOCV, and showed that it under-evaluated performance in all of them. While the effect isn't major, it is consistent.

A reanalysis of 4 evaluations from 3 recently published studies comparing leave-one-out cross-validation (LOOCV) to a rebalanced version (RLOOCV), demonstrating the impact of distributional bias. Panel A shows two ROC curves of preterm birth prediction using vaginal microbiome data. LOOCV has auROC=0.692 while RLOOCV has auROC=0.697. Panel B is an ROC curve of a model predicting toxicity to immune checkpoint inhibitor blockade using T-cell measurements. LOOCV has auROC=0.817 while RLOOCV has auROC=0.833. Panel C is an ROC curve of a gradient boosted regressor model predicting chronic fatigue syndrome using blood test measurements. LOOCV has auROC=0.818 while RLOOCV has auROC=0.824. Panel D is the same analysis with an XGBoost model. LOOCV has auROC=0.796 while RLOOCV has auROC=0.817.