Researchers find child sexual abuse images in LAION-5B AI training dataset


Researchers have found child sexual abuse material in LAION-5B, an open-source artificial intelligence training dataset used to build image generation models.

The discovery was made by the Stanford Internet Observatory, or SIO, which detailed its findings in a report published Tuesday. SIO researchers identified more than 1,000 exploitative images of children in LAION-5B. They noted in the report that they evaluated only a subset of the files in the dataset, which means it likely contains thousands of additional CSAM images that have not yet been found.

SIO identified the illegal images using a data management technique called hashing. With the technique, researchers can turn a file into a unique string of letters and numbers called a hash. After creating hashes of the images in LAION-5B, the SIO researchers compared them against the hashes of known CSAM images.

“Removal of the identified source material is currently in progress as researchers reported the image URLs to the National Center for Missing and Exploited Children (NCMEC) in the U.S. and the Canadian Centre for Child Protection (C3P),” SIO researchers wrote. “The study was primarily conducted using hashing tools such as PhotoDNA, which match a fingerprint of an image to databases maintained by nonprofits that receive and process reports of online child sexual exploitation and abuse.”
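PhotoDNA itself is proprietary and relies on perceptual hashing, which can match an image even after it has been resized or re-encoded. As a simplified illustration of the general workflow of comparing file fingerprints against a database of known hashes, here is a minimal Python sketch that uses exact SHA-256 hashes instead; the KNOWN_HASHES set and the dataset_images directory are hypothetical placeholders.

```python
import hashlib
from pathlib import Path

# Hypothetical set of known-bad hashes, e.g. loaded from a database
# maintained by a nonprofit clearinghouse (placeholder value shown).
KNOWN_HASHES = {
    "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
}

def file_hash(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def scan_directory(root: Path) -> list[Path]:
    """Return the paths of files whose hashes appear in KNOWN_HASHES."""
    return [
        p for p in root.rglob("*")
        if p.is_file() and file_hash(p) in KNOWN_HASHES
    ]

if __name__ == "__main__":
    for match in scan_directory(Path("dataset_images")):
        print(f"match found: {match}")
```

A real pipeline would differ in two ways: it would use a perceptual hash rather than an exact cryptographic one, so that slightly altered copies still match, and it would operate on image URLs rather than local files, since LAION-5B distributes links to images rather than the images themselves.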

LAION-5B was released in early 2022 by a German nonprofit that has received funding from several AI startups. The dataset comprises more than 5 billion images scraped from the web, along with accompanying captions. It’s an upgraded version of an earlier AI training dataset, called LAION-400M, that was published by the same nonprofit a few months earlier and includes about 400 million images.

In a statement issued to Bloomberg, the nonprofit said that it has a zero-tolerance policy for illegal content. It has taken down multiple versions of the dataset “to ensure they are safe before republishing them.” Moreover, the nonprofit has released filters for locating and removing illegal content from its datasets.

Since its release last year, LAION-5B has been used to train multiple image generation models. SIO determined that some of those models have been used to generate CSAM images.

One of the highest-profile companies to have leveraged LAION-5B to train its neural networks is Stability AI Ltd., the startup behind the popular Stable Diffusion series of image generation models. The company told Bloomberg that the relatively recent 2.0 version of Stable Diffusion wasn’t trained on LAION-5B, but rather on a subset of the dataset with less unsafe content. Stability AI has also equipped its newer models with filters designed to block unsafe inputs and outputs.

The new SIO report doesn’t mark the first time the LAION-5B dataset has come under scrutiny. Early last year, three artists filed a lawsuit against Stability AI and two other companies that allegedly used millions of copyrighted images from LAION-5B to train their image generation models. Earlier, photos from an artist’s medical record were discovered among the files in the dataset.

Image: LAION
