Ryan Heuser
@heuser.bsky.social
Florida man abroad. Lapsed Catholic, vulgar marxist, phd'd @StanfordEnglish, now Assistant Professor of Digital Humanities @Cambridge. I make data about culture and am writing about forms of abstraction in literary history.
476 followers · 381 following · 68 posts
heuser.bsky.social

Actually, does anyone know a way we might estimate/plot the "publication date" (or equivalent) of texts in The Pile, or whatever is the most current openly accessible training dataset for LLMs?

tedunderwood.me

Dolma is probably the best documented. Getting exact dates for the whole thing is going to be a very labor-intensive project. But you can see the size of the books and academic paper corpora relative to the web scrape. allenai.github.io/dolma/docs/a...

Image description: a table labeled "Table 1: Composition of Dolma" that summarizes the dataset's data sources and their characteristics. The table lists the following:

- **Source** categories include Common Crawl, C4, The Stack, Project Gutenberg, and Wikipedia/Wikibooks.
- **Subset** specifications include various descriptions like "24 shards, 2020-05 to 2023-06" for Common Crawl and other specific details for each source.
- **Kind** of data is described, such as 'web', 'academic', 'code', 'books', and 'encyclopedic'.
- **Gzip files (GB)** show the compressed size of the data for each source, ranging from several to thousands of GB.
- **Documents (millions)** indicate the number of documents included in each source, varying from low to several millions.
- **Tokens (billions)** denote the number of words or tokens counted in the data, also varying widely.

The table concludes with a total row that sums the figures for each column, reflect