With our attack we could have poisoned the training dataset for anyone who has used LAION-400M (or other popular datasets) in the last six months. Our attack is trivial: we bought expired domains corresponding to URLs in popular image datasets. This gave us control over 0.01% of each of these datasets.

Source: Selected Recent Work by Nicholas Carlini

Firstly, this is a really fun attack. Find expired domains referenced in a dataset and then use those domains to serve malicious content to be included in training runs.1 When coupled with strategies like sleeper agents you could plausibly have poisoned models in a way that will remain hidden indefinitely.
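
To make the "control over 0.01%" figure concrete, here is a minimal Python sketch of how one might audit a URL-based dataset for exposure to lapsed domains. It assumes you already have the dataset's URL list and a list of domains whose registration has expired; both are hypothetical inputs here, and `host_of` and `controlled_fraction` are names invented for illustration rather than anything from the original research.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lowercased hostname of a URL, or "" if it has none."""
    return (urlsplit(url).hostname or "").lower()


def controlled_fraction(urls, lapsed_domains):
    """Fraction of URLs whose host is, or is a subdomain of, a lapsed domain.

    Both arguments are assumed inputs: `urls` is the dataset's URL column and
    `lapsed_domains` is a set of domains known (by some other means) to have expired.
    """
    lapsed = {d.lower() for d in lapsed_domains}
    hosts = Counter(host_of(u) for u in urls)
    total = sum(hosts.values())
    if total == 0:
        return 0.0
    hit = sum(
        count
        for host, count in hosts.items()
        # Match the domain itself or any subdomain, but not lookalike suffixes.
        if any(host == d or host.endswith("." + d) for d in lapsed)
    )
    return hit / total


# Toy data, not drawn from any real dataset.
sample_urls = [
    "https://img.example.org/a.jpg",
    "https://example.org/b.jpg",
    "https://cdn.example.net/c.jpg",
    "https://example.com/d.jpg",
]
print(controlled_fraction(sample_urls, {"example.org"}))
```

The sketch only counts; it does no network access and makes no attempt to discover which domains have expired, which would need a separate registration lookup.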

Secondly, there are so many attacks related to expired domain names!!! They’re a lot of fun, but also scary! Things live and die quickly on the internet, and when they die they don’t necessarily tell everyone who might be interested to know. How much data could you exfiltrate by just buying domains from dead SaaS companies? Probably a lot.

1 As I understand it, the LAION-400M dataset can be accessed as a collection of URL and image pairs, but also as a dataset of downsized images. Kinda wild that the approach implied by the former is that anyone who uses this dataset should re-scrape a huge chunk of the internet.
