How to Prevent ChatGPT From Stealing Your Content & Traffic

ChatGPT and similar large language models (LLMs) have added further complexity to the ever-growing online threat landscape. Cybercriminals no longer need advanced coding skills to execute fraud and other damaging attacks against online businesses and customers, thanks to bots-as-a-service, residential proxies, CAPTCHA farms, and other easily accessible tools.
Now, the latest technology damaging businesses’ bottom line is ChatGPT.
Not only have OpenAI and other LLM developers raised ethical issues by training their models on data scraped from across the internet, but LLMs are also negatively impacting enterprises’ web traffic, which can be extremely damaging to business.
Among the threats ChatGPT and ChatGPT plugins can pose against online businesses, there are three key risks we will focus on:
Depending on your business model, your company should consider ways to opt out of having your data used to train LLMs.
The most at-risk industries for ChatGPT-driven damage are those in which data privacy is a top concern, unique content and intellectual property are key differentiators, and ads, eyes, and unique visitors are an important source of revenue. These industries include:
According to a research paper published by OpenAI, GPT-3, the predecessor of the models behind ChatGPT, was trained on several datasets: Common Crawl, WebText2, Books1, Books2, and Wikipedia.
The largest share of the training data comes from Common Crawl, which provides access to web information through an open repository of web crawl data. The Common Crawl crawler bot, also known as CCBot, is based on Apache Nutch, a framework that lets developers build large-scale scrapers.
The most current version of CCBot crawls from Amazon AWS and identifies itself with a user agent of ‘CCBot/2.0’. But businesses that want to allow CCBot should not rely solely on the user agent to identify it, because many bad bots spoof their user agents to disguise themselves as good bots and avoid being blocked.
To allow CCBot on your website, verify it with attributes such as IP ranges or reverse DNS rather than the user agent alone. To block ChatGPT, your website should, at minimum, block traffic from CCBot.
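A minimal Python sketch of that check, assuming you already know the IP ranges the crawler operates from. The range below is a documentation placeholder (RFC 5737), not Common Crawl's real address space, and the names `ASSUMED_CCBOT_RANGES` and `classify_request` are hypothetical.

```python
import ipaddress

# ASSUMPTION: placeholder range from RFC 5737 (documentation addresses).
# Replace it with the ranges the crawler operator actually publishes.
ASSUMED_CCBOT_RANGES = [ipaddress.ip_network("192.0.2.0/24")]


def claims_ccbot(user_agent: str) -> bool:
    """True if the user agent declares itself as CCBot (case-insensitive)."""
    return "ccbot" in user_agent.lower()


def ip_in_ranges(remote_ip: str, ranges) -> bool:
    """True if the client IP falls inside one of the given networks."""
    addr = ipaddress.ip_address(remote_ip)
    return any(addr in network for network in ranges)


def classify_request(user_agent: str, remote_ip: str,
                     ranges=ASSUMED_CCBOT_RANGES) -> str:
    """Separate a declared-and-verified crawler from a spoofed one."""
    if not claims_ccbot(user_agent):
        return "not-ccbot"
    if ip_in_ranges(remote_ip, ranges):
        return "verified-ccbot"
    # The user agent says CCBot, but the request comes from elsewhere.
    return "spoofed-ccbot"


if __name__ == "__main__":
    print(classify_request("CCBot/2.0", "192.0.2.10"))
    print(classify_request("CCBot/2.0", "203.0.113.5"))
```

A site that wants to allow the crawler would let only the "verified-ccbot" case through; a site that wants to block it would reject both the verified and the spoofed case.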

LLMs use scraper bots to gather training data. While blocking CCBot might be effective for blocking ChatGPT scrapers today, there is no telling what the future holds for LLM scrapers. Moving forward, if too many websites block OpenAI (for example) from accessing their content, the developers could decide to stop respecting robots.txt and could stop declaring their crawler identity in the user agent.
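For crawlers that still honor robots.txt, the opt-out can be expressed as a rule addressed to the crawler's user agent token. The sketch below is illustrative only: it holds a sample robots.txt as a string and uses Python's standard `urllib.robotparser` to show how a compliant crawler would interpret it. The `example.com` URL and the helper name `build_parser` are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt: refuse CCBot everywhere, allow everyone else.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
url = "https://example.com/articles/some-post"  # placeholder URL

ccbot_allowed = parser.can_fetch("CCBot/2.0", url)
browser_allowed = parser.can_fetch("Mozilla/5.0", url)

if __name__ == "__main__":
    print("CCBot allowed:", ccbot_allowed)
    print("Browser allowed:", browser_allowed)
```

This only describes what a well-behaved crawler should do. A scraper that ignores robots.txt, or that stops declaring its identity, is unaffected by the rule.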
Another possibility is that OpenAI could use its partnership with Microsoft to access Microsoft Bing’s scraper data, making the situation more challenging for website owners. Bing’s bots identify as Bingbot, but blocking them could cause problems by preventing your site from being indexed on the Bing search engine, resulting in fewer human visitors.
You could face similar issues by blocking Google’s LLM Bard (a competitor to ChatGPT). Google is vague about the origin and collection of the public data used to train Bard, but it is possible that Bard is, or will be, trained with data collected by Googlebot scrapers. As with Bingbot, blocking Googlebot would likely be unwise, because it would affect how your website gets indexed and how the Google search engine drives traffic to your site. The result could be a serious drop in visitors.
One of the main limitations of models like ChatGPT is the lack of access to live data. Since it was trained on a dataset that stops in 2021, it is unable to provide the most relevant, up-to-date information. That’s where plugins come in.
Plugins are used to connect LLMs like ChatGPT to external tools and allow the LLMs to access external data available online, which can include private data and real-time news. Plugins also let users complete actions online (e.g. booking a flight or ordering groceries) through API calls.
Some businesses are developing their own plugins to provide a new way for users to interact with their content/services via ChatGPT. But, depending on your industry, letting users interact with your website through third-party ChatGPT plugins can mean fewer ads seen by your users, as well as lower traffic to your website.
You may also notice that users are less willing to pay for your premium features once your features can be replicated through third-party ChatGPT plugins. For example, an unofficial web client interacting with your site could offer premium features through their UI.
OpenAI documentation states that requests with a specific user agent HTTP header (with token: “ChatGPT-User”) come from ChatGPT plugins. But the documentation does not state that the disclosed user agent is the only user agent that can be used by plugins when making HTTP requests.
In practice, ChatGPT plugins interact with third-party APIs, and those APIs can then make any kind of HTTP request from their own infrastructure. The diagram below shows what happens when a fictitious “Live Sport Plugin” is used with ChatGPT to get an update about a sporting event.
A plugin can actually make a request to a sport API without having to scrape the sports website. In fact, when requests are made directly from the server hosting the plugin API, there is no constraint on the user agent.
As with ChatGPT’s web scrapers, you can block plugin requests that declare themselves with the “ChatGPT-User” token in the user agent. But doing so could also block ChatGPT users who have “browsing” mode activated. And, contrary to what OpenAI documentation might indicate, blocking requests from “ChatGPT-User” does not guarantee that ChatGPT and its plugins can’t reach your data under different user agent tokens.
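As a rough illustration, user agent blocking can be written as a small WSGI middleware in Python. This is a sketch under stated assumptions, not a production filter: the class name `BlockDeclaredLLMAgents`, the helper `status_for`, the demo app, and the sample user agent strings are all hypothetical, and the token list covers only the two tokens mentioned in this article.

```python
# Tokens named in this article; matching is a case-insensitive substring test.
BLOCKED_UA_TOKENS = ("ChatGPT-User", "CCBot")


def is_declared_llm_agent(user_agent: str) -> bool:
    """True if the user agent openly declares one of the blocked tokens."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in BLOCKED_UA_TOKENS)


class BlockDeclaredLLMAgents:
    """WSGI middleware that answers 403 to self-declared LLM agents."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if is_declared_llm_agent(user_agent):
            body = b"Forbidden"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        return self.app(environ, start_response)


def demo_app(environ, start_response):
    """Stand-in for the real site (hypothetical)."""
    body = b"Hello, human reader"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def status_for(user_agent: str) -> str:
    """Run one fake request through the middleware; return the status line."""
    captured = []

    def start_response(status, headers, exc_info=None):
        captured.append(status)

    app = BlockDeclaredLLMAgents(demo_app)
    list(app({"HTTP_USER_AGENT": user_agent}, start_response))
    return captured[0]


if __name__ == "__main__":
    print(status_for("Mozilla/5.0 (compatible; ChatGPT-User/1.0)"))
    print(status_for("Mozilla/5.0 (X11; Linux x86_64)"))
```

The second call shows the limitation described above: a plugin backend that sends an ordinary browser user agent passes straight through this filter.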
In fact, ChatGPT plugins can make requests directly from the servers hosting their APIs using any user agent, and even using automated (headless) browsers. Detecting plugins that do not declare their identity in the user agent requires advanced bot detection techniques.
Obtaining high-quality datasets of human-generated content will remain of critical importance to LLMs. In the long term, companies like OpenAI (funded partially by Microsoft) and Google may be tempted to use Bingbot and Googlebot to build datasets to train their LLMs. That would make it more difficult for websites to simply opt out of having their data collected, since most online businesses rely heavily on Bing and Google to index their content and drive traffic to their site.
Websites with valuable data will either want to look for ways to monetize the use of their data or opt out of AI model training to avoid losing web traffic and ad revenue to ChatGPT and its plugins. If you wish to opt out, you’ll need advanced bot detection techniques, such as fingerprinting, proxy detection, and behavioral analysis, to stop bots before they can access your data.
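To give a sense of what a behavioral signal can look like, here is a deliberately simple Python sketch: a sliding-window counter that flags a client sending more requests than a threshold within a time window. It is an illustration only and far cruder than the techniques named above. The class name `RequestRateSignal`, the thresholds, and the `client_id` key (which could be an IP address or a device fingerprint) are assumptions, not part of any particular product.

```python
from collections import defaultdict, deque


class RequestRateSignal:
    """Flag clients that exceed max_requests within window_seconds."""

    def __init__(self, max_requests: int = 20, window_seconds: float = 10.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # client_id -> timestamps of that client's recent requests
        self._hits = defaultdict(deque)

    def record(self, client_id: str, timestamp: float) -> bool:
        """Record one request; return True if the client looks automated."""
        hits = self._hits[client_id]
        hits.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while hits and timestamp - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests


if __name__ == "__main__":
    signal = RequestRateSignal(max_requests=3, window_seconds=1.0)
    for t in (0.0, 0.1, 0.2, 0.3, 5.0):
        print(t, signal.record("client-a", t))
```

On its own, a rate threshold is easy to evade by spreading requests across many IPs, which is why it would normally be only one input among several, alongside fingerprinting and proxy detection.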
Advanced solutions for bot and fraud protection leverage AI and machine learning (ML) to detect and stop unfamiliar bots from the first request, keeping your content safe from LLM scrapers, unknown plugins, and other rapidly evolving AI technologies.
Note: This article is expertly written and contributed by Antoine Vastel, PhD, Head of Research at DataDome.