Alexander Doria
@dorialexander.bsky.social
LLM for the commons. Cofounder Pleias
533 followers, 238 following, 106 posts
dorialexander.bsky.social

Uh. Retried just in case. This is my usual email address, and it normally works without issues.

1
dorialexander.bsky.social

But actually now you make me wonder: did you receive my (late) answer?

1
dorialexander.bsky.social

Ah ah. As a matter of fact we have a completely new design training right now. If all goes well, it will be much smaller and more accurate (even if less capable when there are too many errors).

1
dorialexander.bsky.social

Yes, and I completely concur. Just had a meeting with Eleuther for the open Pile project and they are shifting heavily toward PDF/open library sources.

1
dorialexander.bsky.social

What is specific to this approach: this is neither "rephrasing" nor prompting in the traditional sense. The LLM only sees cultural analytics/stylistic indicators that can be easily tweaked, and tries to create a new text from them. One more for the road:

0
dorialexander.bsky.social

Experimenting with synthetic literature using a freshly fine-tuned Llama 8B: Plato's Republic as a film noir works surprisingly well.

1
dorialexander.bsky.social

Could be. But I think it's mostly due to high noise at the start of a text, which confuses token ID probabilities. Once the LLM has settled on a language, it switches to translation mode (also because the pretraining corpus contains many translation alignments between languages, which help with multilingual support).

0
dorialexander.bsky.social

I completely agree with that. We started testing pre-training on "dirty" sources and this seems to be a much more robust solution, not only because of the increased familiarity with historical sources but also because of a learned resilience to OCR mistakes.

1
dorialexander.bsky.social

Here it's mine and a rare example. With Haiku it was on the first go (along with many other issues: the translation is at least correct here!)

0
dorialexander.bsky.social

I strongly believe that LLM research and development should focus more on solving actual use cases. OCR correction is a concrete issue among many that we can address thanks to open LLM research.

1