Adam Gurri 🥥
@adamgurri.liberalcurrents.com
Founder and Editor-in-Chief of Liberal Currents liberalcurrents.com

I'm neither a computer scientist nor a data scientist, but even more than large language models, I'm very skeptical of the utility of synthetic data, a phrase that seems to me to be a contradiction in terms.

erisianrite.com

I’ve worked on and adjacent to systems using synthetic data and can give you an example from the computer vision world. We had a model that was trained on video of human faces that we collected and paid participants for. For numerous non-tech reasons, this meant we couldn’t collect as much data on

llyfrgellbabel.bsky.social

It's useful in social science cases where you have long wait times to access datasets in secure storage -- while you wait you can at least check your code will run on it and maybe (in the case of high quality synthetic data) do a bit of preliminary analysis, >
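The "check your code will run" use case above can be sketched as follows. This is a minimal illustration, not anyone's actual workflow: the schema, the column names, and the `mean_income_by_region` analysis step are all hypothetical stand-ins for whatever the secured dataset and the real analysis look like.

```python
import random

# Hypothetical schema mirroring the columns of the real, access-restricted dataset.
SCHEMA = {
    "age": ("int", 18, 90),
    "household_income": ("float", 0.0, 200_000.0),
    "region": ("choice", ["north", "south", "east", "west"]),
}

def synthetic_rows(schema, n, seed=0):
    """Generate n rows with the right column names and types, but no real signal."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {}
        for name, spec in schema.items():
            kind = spec[0]
            if kind == "int":
                row[name] = rng.randint(spec[1], spec[2])
            elif kind == "float":
                row[name] = rng.uniform(spec[1], spec[2])
            elif kind == "choice":
                row[name] = rng.choice(spec[1])
            else:
                raise ValueError(f"unknown column kind: {kind}")
        rows.append(row)
    return rows

def mean_income_by_region(rows):
    """Hypothetical analysis step that will later run on the real data."""
    totals, counts = {}, {}
    for row in rows:
        region = row["region"]
        totals[region] = totals.get(region, 0.0) + row["household_income"]
        counts[region] = counts.get(region, 0) + 1
    return {region: totals[region] / counts[region] for region in totals}

rows = synthetic_rows(SCHEMA, n=100)
result = mean_income_by_region(rows)
```

Uniform random values like these only show that the pipeline runs end to end. Any preliminary analysis would need synthetic data that also preserves the real dataset's distributions and correlations.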

OMoranmagal.bsky.social

This is going to make a fantastic example one day for the crucial principle: garbage in, garbage out.

Simulation is obviously useful and has been used extensively for as long as the computing revolution has enabled it (and less extensively before that). I don't really see how pretending simulation = data is anything but a category error.

db-user.bsky.social

It should never be used where real data is preferred.

RCsearyanc.dev

This doesn't seem contradictory? e.g. to test the effectiveness of your speech recognition algorithm in the presence of environmental noise, there's nothing inherently unsound with taking an audio clip with a known transcript and mixing in authentic noise on top of it to see how it does
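The noise-mixing idea above can be sketched in a few lines. This is a rough illustration under stated assumptions: the function name `mix_at_snr` is made up here, the sine wave and Gaussian samples are stand-ins for a real speech clip and a real noise recording, and the choice of a target signal-to-noise ratio (SNR) in decibels is one common way to control how much noise is added, not something the post specifies.

```python
import math
import random

def mean_power(signal):
    """Average squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)

def mix_at_snr(clean, noise, snr_db):
    """Return clean + scaled noise so that the mix has the requested SNR in dB."""
    # Loop the noise if it is shorter than the clean clip, then trim to length.
    repeats = math.ceil(len(clean) / len(noise))
    noise = (list(noise) * repeats)[:len(clean)]
    # SNR_dB = 10 * log10(P_clean / P_noise), solved for the noise power we need.
    target_noise_power = mean_power(clean) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / mean_power(noise))
    return [c + gain * n for c, n in zip(clean, noise)]

# Stand-in signals: one second of a 440 Hz tone at 16 kHz, half a second of noise.
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(8000)]

noisy = mix_at_snr(clean, noise, snr_db=10.0)
```

Because the transcript of the clean clip is already known, the recognizer's output on `noisy` can be scored against it at several SNR levels to see where accuracy starts to fall off.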

brian.gawalt.com

Here's an example of using synthetic data to produce a model that improves performance on a pre-existing benchmark: arxiv.org/abs/2403.20327 (I now rely on this text embedder in my daily work.) Like most things ML, synthetic data falls into the "sure, worth a shot"/"trust but verify" bucket.

Gecko: Versatile Text Embeddings Distilled from Large Language Models

We present Gecko, a compact and versatile text embedding model. Gecko achieves strong retrieval performance by leveraging a key idea: distilling knowledge from large language models (LLMs) into a...

tacoauditor.bsky.social

Cheap, easy, and sidesteps privacy concerns (mostly). Whether it can deliver is another thing. *shrug*

MGwalmsley.bsky.social

aiui synthetic data is useful for fine-tuning. you have a model that is at some general level of performance and you want to make it do some specific task reliably (i.e. "I want art in this style").
