Beaumont & Sheridan

Artificial intelligence models, particularly large language models and image generators, are trained on vast amounts of data. That data includes text, images, audio, and video, much of it scraped from the public internet and much of it made by individual creators who never consented to its use.

How Training Data Is Collected

AI companies collect training data through automated web scraping — software that crawls the internet and downloads content. Common sources include:

Public websites, blogs, and online galleries
Social media platforms (photos, captions, comments)
Music streaming and video platforms
Online portfolios and personal websites
Digital libraries and archives
Code repositories and open-source projects

The scale is enormous. LAION-5B, one widely used image dataset, indexes roughly 5.85 billion image-text pairs. The content is downloaded, labeled, and fed into machine learning models that learn patterns, styles, and structures, then generate new content based on what they learned.
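
To make the collection step concrete, here is a minimal sketch of how a scraper might pair each image on a page with its caption. It assumes the caption comes from the image's alt text, which is one common approach and not necessarily what any particular company does. The page content, the example.com address, and the names ImageCaptionParser and extract_pairs are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCaptionParser(HTMLParser):
    """Collects (image URL, alt text) pairs from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src")
        alt = (attributes.get("alt") or "").strip()
        # Keep only images that have both a source and a usable caption.
        if src and alt:
            self.pairs.append((urljoin(self.base_url, src), alt))


def extract_pairs(html, base_url):
    """Parse one page and return its image-caption pairs."""
    parser = ImageCaptionParser(base_url)
    parser.feed(html)
    return parser.pairs


# Hypothetical page from an artist's online portfolio.
sample_html = """
<html><body>
  <img src="/art/harbor.jpg" alt="Oil painting of a harbor at dusk">
  <img src="/ui/spacer.gif" alt="">
</body></html>
"""

pairs = extract_pairs(sample_html, "https://example.com/portfolio")
print(pairs)
```

In this sketch the painting is captured along with its description, while the uncaptioned spacer image is skipped. Nothing in the process checks who made the work or whether they agreed to its use.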

The key question

When a creator's work is included in a training dataset without permission, is that copyright infringement? The courts are currently deciding this. Several major class-action lawsuits argue that it is — that AI companies are building commercial products on the backs of creators without paying for the raw material.

Why It Matters for Creators

If your work has been included in a training dataset without your knowledge or consent, several of your rights may have been violated:

Reproduction right: Your work was copied into a dataset
Derivative work right: The AI's output may be based on your work's style or content
Attribution right: Your name was likely removed from the work
Compensation: You received nothing for the use of your work

What to Do If Your Work Was Used

If you believe your work has been included in an AI training dataset without your permission:

Document everything. Save screenshots, URLs, and any evidence of your work appearing in training data.
Check dataset listings. Some datasets are publicly documented. Search for your work in the LAION dataset or others.
File opt-out requests. Some platforms allow creators to opt out of future training data collection.
Consider legal action. Several class actions are currently open to creators in a range of categories.
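
For creators who run their own website, one opt-out signal is a robots.txt file that tells specific crawlers to stay away. The sketch below uses Python's standard urllib.robotparser to check what a given robots.txt says. It rests on several assumptions: the crawler names are examples only (confirm the current user-agent string in each operator's own documentation), the robots.txt content and URL are made up, and crawler_access is a hypothetical helper. Compliance with robots.txt is voluntary, so this shows what you have asked for, not what a crawler actually did.

```python
from urllib.robotparser import RobotFileParser

# Example crawler names, listed here as assumptions for illustration.
# Check each operator's published documentation for the exact string.
AI_CRAWLERS = ["GPTBot", "CCBot"]


def crawler_access(robots_txt, page_url, crawlers=AI_CRAWLERS):
    """Return {crawler name: True if robots.txt allows it to fetch page_url}."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {name: parser.can_fetch(name, page_url) for name in crawlers}


# Hypothetical robots.txt: blocks one crawler, allows everyone else.
sample_robots = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

access = crawler_access(sample_robots, "https://example.com/portfolio/harbor.jpg")
for name, allowed in access.items():
    print(f"{name}: {'allowed' if allowed else 'blocked'}")
```

With this sample file, GPTBot is reported as blocked and CCBot as allowed, because only the first crawler is named in a Disallow rule. Saving a dated copy of your robots.txt can also help show when you first asked crawlers to stay out.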

See our Documenting Infringement guide for a complete checklist of what to save.
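
As one way to keep saved evidence organized, the sketch below builds a log entry for each file you save: where it came from, when you captured it, and a SHA-256 fingerprint that can later show the file has not changed. The field names, the evidence_record helper, and the example URL are hypothetical, and this is an illustration rather than a required format.

```python
import hashlib
import json
from datetime import datetime, timezone


def evidence_record(content, source_url, note=""):
    """Build one log entry for a saved file (screenshot, page copy, image)."""
    return {
        "source_url": source_url,
        # The fingerprint changes if even one byte of the file changes.
        "sha256": hashlib.sha256(content).hexdigest(),
        "size_bytes": len(content),
        "captured_at_utc": datetime.now(timezone.utc).isoformat(),
        "note": note,
    }


# Stand-in bytes; in practice, read the saved file with open(path, "rb").
screenshot_bytes = b"example screenshot bytes"

record = evidence_record(
    screenshot_bytes,
    "https://example.com/dataset-search?q=harbor",
    note="My painting appears in the search results",
)
print(json.dumps(record, indent=2))
```

Keeping these entries in a single file alongside the original screenshots gives you a dated inventory to hand over if you later pursue a claim.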

Contact us if you have questions about your specific situation.
