I think the value of social media posts will increasingly lie not in training data per se but in sifting through the noise to find the actual information (Reddit has a ton of it, after all, even if the modal post quality is bad) and using that as a retrieval database for grounding and factuality.
Ultimately, as synthetic data methods get better and corpora get built up for things like storytelling, English language styles, common-sense reasoning, etc., what will remain a moving target that requires refreshing is news-like information, skill and trade information, and other factual info.
Got it. The “sifter” there would itself require an interesting kind of intelligence—able to infer, from social/network kinds of evidence, how much to trust a particular source on a previously unseen topic.
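To make that concrete, here is a rough sketch of one way such a sifter might score a source on a topic it has never posted about. Everything in it is an assumption for illustration: the `Source` record, the smoothed track-record estimate, and the 50/50 blend with endorsers' trust are hypothetical choices, not an established method.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    name: str
    # Hypothetical track record: topic -> (claims later confirmed, claims later refuted).
    record: dict[str, tuple[int, int]] = field(default_factory=dict)
    # Names of other sources that vouch for this one (upvotes, citations, follows).
    endorsed_by: list[str] = field(default_factory=list)


def track_record_trust(source: Source, prior_hits: float = 1.0, prior_misses: float = 1.0) -> float:
    """Smoothed hit rate pooled across all topics the source has posted on.

    With no history this falls back to the prior (0.5 by default).
    """
    hits = sum(h for h, _ in source.record.values())
    misses = sum(m for _, m in source.record.values())
    return (hits + prior_hits) / (hits + misses + prior_hits + prior_misses)


def trust_on_unseen_topic(source: Source, sources: dict[str, Source],
                          endorsement_weight: float = 0.5) -> float:
    """Blend the source's own cross-topic record with its endorsers' records.

    The blend weight is an arbitrary assumption; a real system would fit it.
    """
    own = track_record_trust(source)
    endorsers = [sources[n] for n in source.endorsed_by if n in sources]
    if not endorsers:
        return own
    peer = sum(track_record_trust(e) for e in endorsers) / len(endorsers)
    return (1 - endorsement_weight) * own + endorsement_weight * peer


# Made-up example data: one source with a strong record, one vouched-for newcomer,
# and one complete unknown.
sources = {
    "veteran": Source("veteran", {"plumbing": (18, 2)}),
    "newcomer": Source("newcomer", endorsed_by=["veteran"]),
    "stranger": Source("stranger"),
}

for name, src in sources.items():
    print(name, round(trust_on_unseen_topic(src, sources), 3))
```

The point of the toy is just that the newcomer, with no record of its own, still ends up above the stranger because someone with a good record vouches for it. That is the "social/network evidence" part; the hard problems (sybil endorsements, topic similarity, stale records) are all left out.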