Tal Korem @tkorem.bsky.social:

The issue is that every time one holds out a sample as a test set in LOOCV, the mean of the training set's labels shifts slightly, creating a perfect negative correlation across the folds between that mean and the test labels. We call this phenomenon distributional bias:

[Image: An illustration of how leave-one-out cross-validation introduces distributional bias. On the left are N dataset labels. In each training iteration, one sample is held out as a test set. When that sample has a positive label, it shifts the average of the training set's labels down; when it has a negative label, it shifts the average up. This creates a perfect negative correlation, across training iterations, between the average of the training set's labels and the held-out labels.]
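The mechanics are easy to verify directly. Here is a minimal sketch (assuming NumPy; the labels are arbitrary synthetic data, not from the thread) that computes the training-set label mean and the held-out label for each LOOCV fold, then checks that their correlation is exactly -1:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=30)  # arbitrary binary labels (both classes present)

held_out, train_means = [], []
for i in range(len(y)):          # one LOOCV fold per sample
    held_out.append(y[i])
    train_means.append(np.delete(y, i).mean())

# Holding out a positive lowers the training mean; holding out a negative
# raises it, so the correlation is exactly -1 (up to floating-point error).
print(np.corrcoef(held_out, train_means)[0, 1])  # -> -1.0
```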

Tal Korem @tkorem.bsky.social:

Distributional bias is a severe form of information leakage - so severe that we designed a dummy model that achieves a perfect auROC/auPR in ANY binary classification task evaluated via LOOCV, even without using any features. How? It simply outputs the negative of the mean of the training set's labels!

[Image: A receiver operating characteristic curve of a dummy predictor that always outputs a score equal to the negative of the average of the training set's labels. The curve goes from (0,0) to (0,1) to (1,1), giving an area under the curve of 1.]
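A minimal sketch of such a feature-free dummy predictor (assuming scikit-learn for the metrics; the labels are arbitrary synthetic data): under LOOCV it scores each held-out sample with the negative training-label mean and reaches an auROC and auPR of 1.0.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=50)  # any binary labels; no features needed

# Score for fold i = -(mean of the remaining training labels). Holding out a
# positive lowers that mean, so positives always get strictly higher scores.
scores = [-np.delete(y, i).mean() for i in range(len(y))]

print(roc_auc_score(y, scores))            # -> 1.0
print(average_precision_score(y, scores))  # -> 1.0
```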