Tal Korem @tkorem.bsky.social:

The issue is that every time one holds out a sample as a test set in LOOCV, the mean of the training set's labels shifts slightly, creating a perfect negative correlation across the folds between that mean and the test labels. We call this phenomenon distributional bias:

[Image: An illustration of how leave-one-out cross-validation introduces distributional bias. On the left are N dataset labels. In each training iteration, one sample is held out as a test set. When that sample has a positive label, it shifts the average of the training set's labels down; when it has a negative label, it shifts the average up. This creates a perfect negative correlation, across training iterations, between the average of the training set's labels and the held-out labels.]
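The mechanics are easy to verify directly. Here is a minimal sketch (assuming NumPy; the labels are arbitrary synthetic data, not from the thread) that computes the training-set label mean and the held-out label for each LOOCV fold, then checks that their correlation is exactly -1:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=30)  # arbitrary binary labels (both classes present)

held_out, train_means = [], []
for i in range(len(y)):          # one LOOCV fold per sample
    held_out.append(y[i])
    train_means.append(np.delete(y, i).mean())

# Holding out a positive lowers the training mean; holding out a negative
# raises it, so the correlation is exactly -1 (up to floating-point error).
print(np.corrcoef(held_out, train_means)[0, 1])  # -> -1.0
```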

Tal Korem @tkorem.bsky.social:

Distributional bias is a severe form of information leakage - so severe that we designed a dummy model that achieves a perfect auROC/auPR in ANY binary classification task evaluated via LOOCV, even without using any features. How? It simply outputs the negative of the mean of the training set's labels!

[Image: A receiver operating characteristic curve of a dummy predictor that always outputs a score equal to the negative of the average of the training set's labels. The curve goes from (0,0) to (0,1) to (1,1), giving an area under the curve of 1.]
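A minimal sketch of such a feature-free dummy predictor (assuming scikit-learn for the metrics; the labels are arbitrary synthetic data): under LOOCV it scores each held-out sample with the negative training-label mean and reaches an auROC and auPR of 1.0.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=50)  # any binary labels; no features needed

# Score for fold i = -(mean of the remaining training labels). Holding out a
# positive lowers that mean, so positives always get strictly higher scores.
scores = [-np.delete(y, i).mean() for i in range(len(y))]

print(roc_auc_score(y, scores))            # -> 1.0
print(average_precision_score(y, scores))  # -> 1.0
```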