Your Content Is Likely Training AI

If you create and share content online, it’s probably being used to train generative AI models. Even if you delete your ChatGPT account or opt out of data usage, your social media posts and other online content are likely still in the mix. For instance, in September 2023, X (formerly Twitter) announced it would use user posts to train AI. Similarly, Meta has trained its AI assistant on public Facebook and Instagram posts. Although Meta lets you request that personal data be removed from its AI training data, not all platforms provide this option. The policies of Snapchat and TikTok remain unclear, and there’s no standardized opt-out system in place.

Some platforms, like Medium, are engaging users in discussions about AI usage policies. However, this proactive approach is rare. Generally, it’s safe to assume that your online content is being used to train generative AI.

Shielding Your Content from AI Scraping

Your website, whether personal or business, may be your last line of defense against generative AI scraping. No blocking method is foolproof, but the following strategies can limit how much of your content these crawlers reach.

Understanding Crawlers and Robots.txt

Bots, spiders, and crawlers are programs that collect information from web pages. The robots.txt file, hosted at the root of your domain (e.g., example.com/robots.txt), contains rules telling these bots what they may and may not crawl. Note that compliance is voluntary: well-behaved crawlers honor the rules, but nothing technically enforces them.
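
To see what rules a site already publishes, you can fetch the file directly. For example, with curl (example.com is a placeholder):

curl https://example.com/robots.txt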

To ask every bot to stay away from your entire site, add the following rules to your robots.txt file (the User-agent: * line applies the rule to all bots):

User-agent: *
Disallow: /

For more targeted blocking, you can specify rules for individual bots using their user-agents.

User-Agents and Blocking Specific Bots

A user-agent is a string that identifies the software making a request, including bots. By adding directives for generative AI user-agents to your robots.txt file, you can block many of them. Here are some examples:

ChatGPT:

User-agent: GPTBot
Disallow: /

Google Bard and Vertex AI:

User-agent: Google-Extended
Disallow: /
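
Rules for multiple bots can live in the same file. Combining the two examples above into one robots.txt:

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /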

Google SGE:
Blocking Google’s Search Generative Experience (SGE) outright means blocking Googlebot entirely, since SGE crawls with the same user-agent as regular search. As a partial measure, the nosnippet robots meta tag prevents Google from displaying your content in SGE results, though it also removes text snippets and video previews from ordinary search results:

<meta name="robots" content="nosnippet">
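If hiding snippets site-wide is too blunt, Google also supports a data-nosnippet attribute on span, div, and section elements, which keeps just the marked text out of snippets, and therefore out of what SGE can display:

<p>This sentence can appear in snippets. <span data-nosnippet>This one cannot.</span></p>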

Blocking Bing Chat

To block Bing Chat from using your content, use meta tags instead of robots.txt directives:

<meta name="robots" content="nocache">

This limits Bing Chat to displaying only the page’s URL, title, or snippet. To keep Bing Chat from using the page in its training data at all, use:

<meta name="robots" content="noarchive">
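
Both tags belong in the page’s head. A minimal sketch of a page opting out with noarchive:

<!DOCTYPE html>
<html>
<head>
  <title>Example Page</title>
  <!-- Ask Bing not to use this page in Bing Chat answers or training -->
  <meta name="robots" content="noarchive">
</head>
<body>
  <p>Page content here.</p>
</body>
</html>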

Leveraging CDNs

Content Delivery Networks (CDNs) like Cloudflare and Akamai offer bot-detection tools that can help block generative AI crawlers. They don’t currently block the user-agents listed above by default, but customer demand may change that in the future.
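
Because robots.txt compliance is voluntary, you can also refuse known AI crawlers at the server level by matching their user-agent strings. Here is a minimal sketch for nginx, assuming the crawler identifies itself honestly (note that Google-Extended is a robots.txt token only and never appears as a request user-agent):

# Inside a server block: refuse any request whose User-Agent
# header contains "GPTBot" (case-insensitive match)
if ($http_user_agent ~* "GPTBot") {
    return 403;
}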

Conclusion

While you can’t completely prevent your content from being used to train AI models, you can take steps to limit it. Using robots.txt directives and meta tags can help protect your website content from many generative AI tools. Google SGE remains a challenge, but future accommodations might provide better control for webmasters.