
Alexander Doria

@dorialexander.bsky.social

LLM for the commons. Cofounder Pleias

533 followers · 238 following · 106 posts


dorialexander.bsky.social · Jul 5, 2024 9:18pm

Uh. Retried just in case. This is my usual email, which normally works without issues.

dorialexander.bsky.social · Jul 5, 2024 5:44pm

But actually now you make me wonder: did you receive my (late) answer?

dorialexander.bsky.social · Jul 5, 2024 5:44pm

Ha ha. As a matter of fact, we have a completely new design in training right now. If all goes well, it will be much smaller and more accurate (even if less capable when there are too many errors).

dorialexander.bsky.social · May 13, 2024 5:49pm

Yes, and I completely concur. Just had a meeting with Eleuther for the open Pile project, and they are shifting heavily to PDF/open library sources.

dorialexander.bsky.social · May 1, 2024 5:46pm

What is specific to this approach: it is neither "rephrasing" nor prompting in the traditional sense. The LLM only sees cultural analytics/stylistic indicators that can be easily tweaked, and tries to create a new text from them. One more for the road:

dorialexander.bsky.social · May 1, 2024 5:44pm

Running experiments in synthetic literature with a freshly fine-tuned Llama 8B: Plato's Republic as a film noir works surprisingly well.

dorialexander.bsky.social · Apr 26, 2024 1:26pm

Could be. But I think it's mostly due to high noise at the start of a text, which confuses the token id probabilities. Once the LLM has settled on a language, it switches to translation mode (also because the pretraining corpus contains many aligned translations to help with multilingual support).

dorialexander.bsky.social · Apr 26, 2024 1:19pm

I completely agree with that. We started testing pre-training on "dirty" sources, and this seems to be a much more robust solution, not only because of the increased familiarity with historical sources but also because of a learned resilience to OCR mistakes.

dorialexander.bsky.social · Apr 26, 2024 1:18pm

Here it's mine and a rare example. With Haiku it was on the first go (along with many other issues: the translation is at least correct here!)

dorialexander.bsky.social · Apr 26, 2024 12:44pm

I strongly believe that LLM research and development should focus more on solving actual use cases. OCR correction is a concrete issue among many that we can address thanks to open LLM research.
