BLUE
Profile banner
LC
Leshem Choshen
@lchoshen.bsky.social
🥇 #NLProc researcher 🥈 Opinionatedly Summarizing #ML & #NLP papers 🥉 Good science #scientivism
148 followers107 following317 posts
LClchoshen.bsky.social

What they see is that if you take every layer to be comparable to steps of an algorithm, the layers succeed much faster than SGD, note that they are even curved down, not up like SGD (Super linearity to the rescue🦸‍♀️)

1

LClchoshen.bsky.social

So, why do we care? Because we don't know any algorithm that computes only the gradient and converges that fast Transformers learn with their layers an algorithm that is better The paper has more theoretical and other measures to compare it to the newton's but the gist is up IMO

1
Profile banner
LC
Leshem Choshen
@lchoshen.bsky.social
🥇 #NLProc researcher 🥈 Opinionatedly Summarizing #ML & #NLP papers 🥉 Good science #scientivism
148 followers107 following317 posts