Artificial intelligence models, particularly large language models and image generators, are trained on vast amounts of data: text, images, audio, and video, much of it scraped from the public internet. Much of that material was made by individual creators who never consented to its use.
AI companies collect training data through automated web scraping — software that crawls the internet and downloads content. Common sources include:
The scale is enormous. A major training dataset such as LAION-5B references more than five billion image-text pairs. The content is downloaded, labeled, and fed into machine learning models that learn patterns, styles, and structures, then generate new content based on what they learned.
When a creator's work is included in a training dataset without permission, is that copyright infringement? The courts are currently deciding this. Several major class-action lawsuits argue that it is — that AI companies are building commercial products on the backs of creators without paying for the raw material.
If your work has been included in a training dataset without your knowledge or consent, several of your rights may have been violated:
If you believe your work has been included in an AI training dataset without your permission:
See our Documenting Infringement guide for a complete checklist of what to save.
Contact us if you have questions about your specific situation.