Leshem Choshen
@lchoshen.bsky.social
🥇 #NLProc researcher 🥈 Opinionatedly Summarizing #ML & #NLP papers 🥉 Good science #scientivism
148 followers, 107 following, 317 posts
lchoshen.bsky.social

It's not a lower bound, because if you are willing to spend enough compute you can do better than SGD for training models; it's inefficient but possible. The paper claims what the layers do is very similar to Newton's method (based on 2nd-order information, i.e. the gradient of the gradient).
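For intuition only (my toy sketch, not the paper's setup): on a quadratic least-squares problem, plain gradient descent needs many steps, while a single Newton step, which rescales the gradient by the inverse Hessian, lands on the exact solution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression task: recover w_true from (x, y) pairs.
d, n = 5, 20
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true

def grad(w):
    # Gradient of the quadratic loss L(w) = 0.5 * ||X @ w - y||^2 / n
    return X.T @ (X @ w - y) / n

def hess():
    # Hessian of the same loss (constant, since L is quadratic)
    return X.T @ X / n

# First-order: many small gradient steps (full-batch GD standing in for SGD).
w_gd = np.zeros(d)
for _ in range(100):
    w_gd -= 0.1 * grad(w_gd)

# Second-order: one Newton step, i.e. the gradient rescaled by the inverse Hessian.
# On a quadratic loss this lands on the exact minimizer immediately.
w_newton = np.zeros(d) - np.linalg.solve(hess(), grad(np.zeros(d)))

print("GD error after 100 steps :", np.linalg.norm(w_gd - w_true))
print("Newton error after 1 step:", np.linalg.norm(w_newton - w_true))
```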

lchoshen.bsky.social

So, why do we care? Because we don't know any algorithm that computes only the gradient and converges that fast. Transformers learn, across their layers, an algorithm that is better. The paper has more theoretical and empirical measures comparing it to Newton's method, but that's the gist, IMO.

lchoshen.bsky.social

What they see is that if you treat each layer as comparable to a step of an algorithm, the layers succeed much faster than SGD. Note that the curves even bend downward, not upward like SGD's (superlinearity to the rescue 🦸‍♀️).
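Roughly, that downward bend is the difference between linear and quadratic convergence; these are the standard textbook rates, not the paper's exact bounds:

```latex
% Linear convergence (first-order, SGD-like): a fixed fraction of the error
% remains after each step, a straight line on a log-scale plot.
\|w_{k+1} - w^\ast\| \;\le\; \rho \,\|w_k - w^\ast\|, \qquad 0 < \rho < 1
% Quadratic (superlinear) convergence (Newton-like, near the optimum):
% the error is squared each step, so the log-scale curve bends downward.
\|w_{k+1} - w^\ast\| \;\le\; C \,\|w_k - w^\ast\|^2
```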

lchoshen.bsky.social

So how do you compare transformers to classic algorithms? You train them to do ICL on simple, closed problems that we understand and know how to solve in other ways (e.g., SGD, or second-order methods like Newton's iterations). For example, computing a linear function from the examples seen.
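A minimal sketch of that setup (my illustration, not the paper's code): sample a hidden linear function, build a prompt of (x, y) example pairs plus a query x, and score predictions against the closed-form least-squares answer.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_examples = 5, 10  # toy sizes, chosen only for illustration

def sample_icl_task():
    # One in-context task: a hidden linear function plus labeled example pairs.
    w = rng.normal(size=d)
    xs = rng.normal(size=(n_examples, d))
    ys = xs @ w
    x_query = rng.normal(size=d)
    return xs, ys, x_query, x_query @ w  # examples, labels, query, target

def least_squares_baseline(xs, ys, x_query):
    # The classic, closed-form answer the transformer's prediction is scored against.
    w_hat, *_ = np.linalg.lstsq(xs, ys, rcond=None)
    return x_query @ w_hat

xs, ys, x_query, y_target = sample_icl_task()
print("baseline prediction:", least_squares_baseline(xs, ys, x_query))
print("target value:       ", y_target)
```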

lchoshen.bsky.social

In alphaxiv.org/pdf/2310.17086 they discuss what is internally learned when you give a model examples and it manages to figure out how to act upon them, also known as ICL.

lchoshen.bsky.social

It was claimed that in-context learning (ICL) is doing SGD inside the transformer layers. A new finding shows this is not possible: they must be doing something BETTER. In fact, exponentially better than SGD, so second-order methods? 🧵 🤖
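For reference, in standard textbook form (not the paper's notation), the two candidate inner algorithms differ only in how the gradient step is scaled:

```latex
% First-order (SGD-like) update: step along the raw gradient.
w_{k+1} = w_k - \eta \,\nabla L(w_k)
% Second-order (Newton-like) update: precondition the gradient with the inverse Hessian.
w_{k+1} = w_k - \big(\nabla^2 L(w_k)\big)^{-1} \nabla L(w_k)
```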

lchoshen.bsky.social

I wish; these days there are many days when I don't get to read it, especially during busy times.

lchoshen.bsky.social

We are all used by now to "delving" into topics full of LLM influence. But did you know we started speaking like that too? In other words, LLMs drive language change, our language, not only the copy-pasted text. alphaxiv.org/abs/2409.13686 #LLM #LLMs #ML #machinelearning #nlproc #language #change 🤖
