Tal Korem

@tkorem.bsky.social

Microbiome, network inference, metabolism and reproductive health. All views are mine.

239 followers · 159 following · 51 posts

tkorem.bsky.social · Jun 11, 2024 1:50pm

With this in mind, we developed RebalancedCV, an sklearn-compatible package that drops the minimum number of samples from the training set needed to keep the class balance identical across the training sets of all folds, thus resolving distributional bias. github.com/korem-lab/Re...

GitHub - korem-lab/RebalancedCV

Contribute to korem-lab/RebalancedCV development by creating an account on GitHub.
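
To make the idea in the post concrete, here is a minimal sketch of a rebalanced leave-one-out split for binary labels, using only the Python standard library. It is not the RebalancedCV API: the function name `rebalanced_loo_splits` and its `seed` parameter are hypothetical, and the choice to drop one randomly selected sample of the opposite class per fold is an assumption about how "the minimum number of samples" could be realized in the binary case.

```python
import random
from collections import Counter


def rebalanced_loo_splits(y, seed=0):
    """Yield (train_indices, test_index) pairs for a binary label list y.

    Hypothetical helper, not the package's implementation. For each
    held-out sample, one randomly chosen sample of the opposite class is
    also removed from the training set, so every fold trains on the same
    class counts.
    """
    rng = random.Random(seed)
    labels = sorted(set(y))
    if len(labels) != 2:
        raise ValueError("this sketch handles binary labels only")
    by_class = {c: [i for i, label in enumerate(y) if label == c] for c in labels}
    for test_idx, test_label in enumerate(y):
        other = labels[0] if test_label == labels[1] else labels[1]
        dropped = rng.choice(by_class[other])
        train = [i for i in range(len(y)) if i not in (test_idx, dropped)]
        yield train, test_idx


def class_counts(y, indices):
    """Return the class counts of the selected samples as a sorted tuple."""
    return tuple(sorted(Counter(y[i] for i in indices).items()))


y = [0, 0, 0, 0, 1, 1, 1]
n = len(y)

# Plain leave-one-out: the training class balance depends on the held-out label.
plain = {class_counts(y, [j for j in range(n) if j != i]) for i in range(n)}

# Rebalanced leave-one-out: the training class balance is the same in every fold.
rebalanced = {class_counts(y, train) for train, _ in rebalanced_loo_splits(y)}

print(len(plain))       # 2 distinct training class balances
print(len(rebalanced))  # 1 distinct training class balance
```

In this sketch each fold gives up exactly one extra training sample, which is the price paid for keeping the training class balance independent of the held-out label.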

tkorem.bsky.social · Jun 11, 2024 1:51pm

With RebalancedCV we could see the "real-life" impact of distributional bias. We reproduced 3 recently published analyses that used LOOCV, and showed that LOOCV underestimated performance in all of them. The effect is small, but it is consistent.

A reanalysis of 4 evaluations from 3 recently published studies comparing leave-one-out cross-validation (LOOCV) to a rebalanced version (RLOOCV), demonstrating the impact of distributional bias. Panel A shows two ROC curves of preterm birth prediction using vaginal microbiome data. LOOCV has auROC=0.692 while RLOOCV has auROC=0.697. Panel B is an ROC curve of a model predicting toxicity to immune checkpoint inhibitor blockade using T-cell measurements. LOOCV has auROC=0.817 while RLOOCV has auROC=0.833. Panel C is an ROC curve of a gradient boosted regressor model predicting chronic fatigue syndrome using blood test measurements. LOOCV has auROC=0.818 while RLOOCV has auROC=0.824. Panel D is the same analysis with an XGBoost model. LOOCV has auROC=0.796 while RLOOCV has auROC=0.817.
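
As a toy illustration of why plain LOOCV can score lower than a rebalanced version, the sketch below evaluates an uninformative predictor that outputs the mean of its training labels. It uses no data from the studies above; the labels, the helper `auroc`, and the rebalancing rule (also dropping one training sample of the opposite class) are assumptions made for this example only.

```python
def auroc(labels, scores):
    """Rank-based auROC: share of (positive, negative) pairs in which the
    positive sample scores higher, with ties counted as one half."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y = [0] * 6 + [1] * 4  # toy binary labels, no features at all
n, total = len(y), sum(y)

# Plain LOOCV: the prediction is the mean label of the other n - 1 samples,
# so holding out a positive lowers the prediction and vice versa.
loo_scores = [(total - y[i]) / (n - 1) for i in range(n)]

# Rebalanced LOOCV (assumed rule): one sample of the opposite class is
# dropped as well, so the training mean is identical in every fold.
rloo_scores = [(total - y[i] - (1 - y[i])) / (n - 2) for i in range(n)]

print(auroc(y, loo_scores))   # 0.0
print(auroc(y, rloo_scores))  # 0.5
```

A predictor with no information should land at auROC=0.5; under plain LOOCV it is pushed to 0.0 because each prediction is negatively tied to the held-out label, while the rebalanced folds return it to chance level.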
