Is Your Work in a Training Dataset?

One of the most common questions we hear from creators is: how do I know if my work was used to train an AI model? The answer is not always straightforward, but there are steps you can take to find out.

Know Which Datasets Exist

The most well-known training datasets include:

LAION-5B: 5.85 billion image-text pairs scraped from the internet
Common Crawl: Billions of web pages, used for text-based models
The Pile: A large text dataset used for language models
ImageNet: Millions of labeled images
YouTube-8M: Millions of video URLs and labels

Some datasets are publicly documented and searchable. Others are proprietary and undisclosed.
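For a publicly documented dataset like Common Crawl, one way to check is to ask its index which pages from your site were captured. The sketch below is a minimal illustration, not an official client. It assumes the public index server at index.commoncrawl.org, uses an example crawl ID that you would swap for a current one, and relies on two hypothetical helper names of our own (build_index_query and parse_index_response). It only builds the query URL and parses a response body, so you can fetch the URL with whatever tool you prefer.

```python
import json
from urllib.parse import urlencode

# Assumption: this is Common Crawl's public index server. The crawl ID passed
# in below is only an example; choose a current one from the server's list.
INDEX_SERVER = "https://index.commoncrawl.org"


def build_index_query(domain: str, crawl_id: str) -> str:
    """Build a URL asking one crawl's index for every captured page under a domain."""
    params = urlencode({"url": f"{domain}/*", "output": "json"})
    return f"{INDEX_SERVER}/{crawl_id}-index?{params}"


def parse_index_response(body: str) -> list[dict]:
    """Parse a response body: one JSON object per line, one line per captured page."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


if __name__ == "__main__":
    # Replace example.com with the site where your work is published.
    print(build_index_query("example.com", "CC-MAIN-2023-50"))
```

Each crawl has its own index, so a site missing from one crawl may still appear in another. Repeating the query across several crawl IDs gives a fuller picture.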

How to Search LAION

The LAION dataset is indexed and searchable through community tools such as haveibeentrained.com. Upload or paste a sample of your work, and the tool will check whether it appears in the dataset. Results vary, and a search will not surface everything the dataset contains, but it's a starting point.
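If you have obtained a copy of a dataset's metadata, you can also scan it yourself for image links that point to your own website. The sketch below is one possible approach under stated assumptions: the metadata has been exported to rows with a URL column and a TEXT column (check the actual column names in your copy), and the function names host_matches and find_my_rows are our own, not part of any LAION tool.

```python
import csv
import io
from urllib.parse import urlparse


def host_matches(url: str, domains: set[str]) -> bool:
    """True if the URL's host is one of the domains or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in domains)


def find_my_rows(rows, domains, url_column="URL"):
    """Yield metadata rows whose image URL points at one of your domains."""
    wanted = {d.lower() for d in domains}
    for row in rows:
        # Assumption: the image link lives in a column named "URL".
        if host_matches(row.get(url_column) or "", wanted):
            yield row


if __name__ == "__main__":
    # Tiny made-up sample standing in for a real metadata export.
    sample = io.StringIO(
        "URL,TEXT\n"
        "https://portfolio.example.com/img/owl.jpg,Owl illustration\n"
        "https://other.example.org/cat.jpg,A cat\n"
    )
    for hit in find_my_rows(csv.DictReader(sample), ["example.com"]):
        print(hit["URL"], "|", hit["TEXT"])
```

Matching on the host only finds copies served from your own site. Work that was reposted elsewhere will not show up this way, so treat an empty result as inconclusive.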

Check Platform Policies

Some platforms now provide information about whether your content has been used for training. Check the terms of service and privacy policies for any platform where you've published work. Notable developments:

Some platforms now offer opt-out mechanisms for future training
Some have acknowledged using public content but deny using private or copyrighted work
Some have been sued for using content without permission

Look for Your Work in Model Outputs

Another approach is to test whether an AI model can reproduce or closely mimic your work. If you ask an image generator for something in your style and the result closely resembles one of your actual pieces, that's a strong signal that your work was part of its training data.

What If You Can't Find Evidence?

Absence of evidence is not evidence of absence. Many training datasets are proprietary and not publicly searchable. Even if you can't find your work in a specific dataset, it may still have been used. Documentation of your publication timeline — when your work was published, where it appeared, and how widely it was distributed — can help establish that it was available for scraping.

Preserve Your Evidence

Whether you find your work or not, document everything. See our Documenting Infringement guide for a complete checklist.
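One way to make your records easier to rely on later is to keep a manifest of every file you have saved, with a fingerprint of each. The sketch below is a minimal illustration, not a substitute for the checklist. It assumes your screenshots, exports, and original files sit in a single folder (the folder name "evidence" is hypothetical), and it records each file's size, SHA-256 hash, and the time the manifest was made.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large images or videos need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder: Path) -> list[dict]:
    """List every file under the folder with its size, hash, and recording time."""
    recorded_at = datetime.now(timezone.utc).isoformat()
    return [
        {
            "file": str(p.relative_to(folder)),
            "bytes": p.stat().st_size,
            "sha256": sha256_of(p),
            "recorded_at_utc": recorded_at,
        }
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    ]


if __name__ == "__main__":
    # "evidence" is a hypothetical folder name; point this at your own records.
    print(json.dumps(build_manifest(Path("evidence")), indent=2))
```

The hash changes if a file is altered, so a saved manifest lets you show later that a file is the same one you recorded. The timestamp comes from your own computer's clock, so it supports your records but is not independent proof of date.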

Contact us if you have questions about your specific situation.