Ryan Heuser
@heuser.bsky.social
Florida man abroad. Lapsed Catholic, vulgar marxist, phd'd @StanfordEnglish, now Assistant Professor of Digital Humanities @Cambridge. I make data about culture and am writing about forms of abstraction in literary history.
476 followers, 381 following, 68 posts
Actually, does anyone know a way we might estimate/plot the "publication date" (or equivalent) of texts in The Pile, or whatever is the most current openly accessible training dataset for LLMs?
Dolma is probably the best documented. Getting exact dates for the whole thing is going to be a very, very labor-intensive project. But you can see the size of the books and academic-paper corpora relative to the web scrape. allenai.github.io/dolma/docs/a...