LSTM, RNN, convolution(al), cross(-)entropy, entropy, logit, logistic, softmax, sampling, contrastive, CUDA, NCCL, QKV, FlashAttention, stochastic, SVM, Gauss, Gaussian, augmentation, backprop(agation), hessian, jacobian, optimizer, GAN, reinforcement learning, RLHF, layernorm, rmsnorm
this is AlignProp, which is reinforcement learning and is based on human feedback, but is not 'RLHF' in the explicit algorithmic sense that most people mean. arxiv.org/abs/2310.03739
Text-to-image diffusion models have recently emerged at the forefront of image generation, powered by very large-scale unsupervised or weakly supervised text-to-image training datasets. Due to their u...
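the core AlignProp trick, roughly: run the denoising chain, detach the graph for all but the last few steps, and backprop a differentiable reward straight through. a minimal toy sketch of that idea in PyTorch; the denoiser, reward model, and step count here are all made up for illustration, not from the paper's code:

```python
import torch
import torch.nn as nn

# Toy stand-in for a diffusion denoiser (a real one would be a UNet).
class ToyDenoiser(nn.Module):
    def __init__(self, dim=8):
        super().__init__()
        self.net = nn.Linear(dim + 1, dim)

    def forward(self, x, t):
        t_emb = torch.full((x.shape[0], 1), float(t))
        return self.net(torch.cat([x, t_emb], dim=-1))

denoiser = ToyDenoiser()
reward_model = nn.Linear(8, 1)  # stand-in for a differentiable reward model
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

x = torch.randn(4, 8)
steps, K = 10, 3  # truncated backprop: gradients only flow through the last few steps
for t in reversed(range(steps)):
    if t == K:
        x = x.detach()  # cut the graph so earlier denoising steps get no gradient
    x = x - 0.1 * denoiser(x, t)  # toy "denoising" update

loss = -reward_model(x).mean()  # maximize reward end-to-end through the sampler
loss.backward()
opt.step()
```

the detach is the whole point: without it you'd hold the full sampling chain in memory and backprop through every step.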
my guess is that it's something like AlignProp where RLHF steps are intercalated into traditional fine-tuning, rather than just plain-vanilla RLHF. also if i were doing it i would freeze almost the entire model and just RLHF the cross-attention layers.
much more familiar with LMs, where there's a lot of work on preventing RLHF from forcing mode collapse (you have to KL-penalize it, etc.). given that "vanilla" finetuning works nicely on the earlier SD releases, the swap to explicit RLHF feels risky
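what "KL-penalize it" means concretely in the LM setting: subtract a KL term to a frozen reference policy from the reward, so the policy can't drift arbitrarily far. an illustrative sketch (function name and `beta` value are mine, not from any specific RLHF codebase):

```python
import torch
import torch.nn.functional as F

def kl_penalized_reward(reward, policy_logits, ref_logits, beta=0.1):
    """Reward minus a KL penalty to a frozen reference policy,
    the standard RLHF-for-LMs trick against mode collapse."""
    logp = F.log_softmax(policy_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    # KL(policy || ref), summed over the vocab dimension
    kl = (logp.exp() * (logp - ref_logp)).sum(-1)
    return reward - beta * kl.mean()
```

when the policy matches the reference, the penalty is zero and you get the raw reward back; as it drifts, `beta` controls how hard it gets pulled home.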
what was the underlying technical concern here? the weird training instability issues you get with RLHF over image models? you can resolve some of that by using some sort of PEFT (e.g., LoRA) or by using something like AlignProp during training, although i have never gotten AlignProp to work myself.
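for reference, the LoRA flavor of PEFT: freeze the base weight and learn a low-rank additive update, which shrinks the trainable-parameter count and tends to tame RLHF instability. a minimal from-scratch sketch (not the `peft` library's API):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear plus a trainable low-rank update: W x + (alpha/r) B A x."""
    def __init__(self, base: nn.Linear, rank=4, alpha=4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # base weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero-init: no-op at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

base = nn.Linear(8, 8)
layer = LoRALinear(base, rank=4)
x = torch.randn(2, 8)
```

because `B` starts at zero, the wrapped layer is exactly the base layer at initialization, so training starts from the pretrained behavior.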
i agree that they are using it to tag data but i am real iffy on any sourcing that they're using RLHF specifically, which is a different thing than just generic finetuning
huh, you are correct, sometimes they are. this is an oddity to me. are there any models besides SDXL that are known to use RLHF, as opposed to finetuning directly on some more curated set?
One of the things that drives me crazy: a hole in the current RLHF literature. Human preferences: high noise, low bias. LLM as a judge: low noise, high bias. We don't know what this bias is. Drives me crazy. A lot of it seems like LLMs are better, but there is something we haven't measured.
my RLHF experience is LLMs, very different domain but same issue: your smaller curated dataset ends up defining your house style
RLHF is not that but you get the basic idea