Leshem Choshen
@lchoshen.bsky.social
What they see is that if you treat every layer as one step of an algorithm, the layers converge much faster than SGD. Note that their curves even bend downward, not upward like SGD's (superlinearity to the rescue 🦸‍♀️).
So why do we care? Because we don't know of any algorithm that uses only the gradient and converges that fast. With their layers, transformers learn an algorithm that is better. The paper has more theoretical and other measures comparing it to Newton's method, but that's the gist IMO.
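To make the rate comparison concrete, here's a minimal sketch (mine, not from the paper): gradient descent, which only uses the gradient, shrinks its error by a roughly constant factor per step (a straight line on a log scale), while Newton's method, which also uses second-order information, shrinks it superlinearly (the curve bends down). The problem, step size, and ridge term below are illustrative assumptions, not anything from the paper.

```python
import numpy as np

# Small synthetic logistic-regression problem (illustrative only).
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (rng.random(n) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

def grad(w):
    p = 1 / (1 + np.exp(-X @ w))
    return X.T @ (p - y) / n + 1e-3 * w  # small ridge term keeps the Hessian positive definite

def hess(w):
    p = 1 / (1 + np.exp(-X @ w))
    return X.T @ (X * (p * (1 - p))[:, None]) / n + 1e-3 * np.eye(d)

# Reference solution: run Newton's method until it has effectively converged.
w_star = np.zeros(d)
for _ in range(50):
    w_star -= np.linalg.solve(hess(w_star), grad(w_star))

w_gd, w_nt = np.zeros(d), np.zeros(d)
for step in range(1, 11):
    w_gd -= 1.0 * grad(w_gd)                         # gradient-only update: linear convergence
    w_nt -= np.linalg.solve(hess(w_nt), grad(w_nt))  # Newton update: superlinear convergence
    print(f"step {step:2d}  GD err {np.linalg.norm(w_gd - w_star):.2e}  "
          f"Newton err {np.linalg.norm(w_nt - w_star):.2e}")
```

Printing the distance to the optimum per step shows the gradient-only error dropping by a steady factor while the Newton error collapses after a few steps; that "curving down" shape is the one the post attributes to the transformer layers.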