A major AI training data set contains millions of examples of personal data

1.1K

•

A major AI training data set contains millions of examples of personal data (www.technologyreview.com)

So... What happens when I download datasets from the "darkweb" from leaks and feed it all into AI systems?

From the post:

>Millions of images of passports, credit cards, birth certificates, and other documents containing personally identifiable information are likely included in one of the biggest open-source AI training sets, new research has found. Thousands of images—including identifiable faces—were found in a small subset of DataComp CommonPool, a major AI training set for image generation scraped from the web. Because the researchers audited just 0.1% of CommonPool’s data, they estimate that the real number of images containing personally identifiable information, including faces and identity documents, is in the hundreds of millions. The study that details the breach was published on arXiv earlier this month.

So... What happens when I download datasets from the "darkweb" from leaks and feed it all into AI systems? Archive: https://archive.today/7P9ZQ From the post: >>Millions of images of passports, credit cards, birth certificates, and other documents containing personally identifiable information are likely included in one of the biggest open-source AI training sets, new research has found. Thousands of images—including identifiable faces—were found in a small subset of DataComp CommonPool, a major AI training set for image generation scraped from the web. Because the researchers audited just 0.1% of CommonPool’s data, they estimate that the real number of images containing personally identifiable information, including faces and identity documents, is in the hundreds of millions. The study that details the breach was published on arXiv earlier this month.

(post is archived)