Generative AI tools have quickly become a ubiquitous, almost unavoidable presence in many areas of our digital lives, both at work and at home. Yet while we feed them an ever-growing amount of information, the fundamental question of how they handle our private data remains largely overlooked.
All the major players in this field, from OpenAI's ChatGPT to Anthropic's Claude, Google's Gemini, and Microsoft's Copilot, collect the information we share in order to further train their models, in ways that are often opaque, poorly documented, and beyond our control. Such data has at times been leaked by the models themselves and can pose a real security risk to users.
The Data Ingestion Problem
Anat Baron, futurologist, speaker, and generative AI expert, explains:
“AI models are trained on data. Think Pac-Man. They ingest huge amounts of training data from public sources and publications. The more data, the more comprehensive the response. But beyond what they already train on, they are constantly ‘feasting’ on more. The more they know about you, the more helpful they can be. The AI companies warn us not to enter any confidential or sensitive information, but then again, they’re warning the same people who give up privacy willingly every time they use social media.”
While the data ingestion problem is real, it doesn't mean we have to stop using Gen AI tools altogether. We should, however, approach these tools with awareness and know the best practices for avoiding feeding them data we shouldn't be sharing in the first place.
The first rule, says Baron, is common sense:
“Don’t provide anything you wouldn’t give to a stranger. Your driver’s license number, social security number, passport, banking details. Don’t upload sensitive medical results. And never provide confidential company documents like financial statements. If you must do so, then anonymize the reports first.”
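For company documents, that anonymization step can be as simple as a scripted redaction pass before anything is pasted into a chatbot. Below is a minimal Python sketch of the idea; the regex patterns and placeholder labels are illustrative assumptions, not an exhaustive PII filter, and a real workflow would rely on a dedicated anonymization or data-loss-prevention tool.

```python
import re

# Illustrative patterns only -- real documents need a proper
# anonymization pass (or a DLP tool), not a handful of regexes.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b(?:\+?1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}\b"),
    "CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def redact(text: str) -> str:
    """Replace each matched pattern with a [LABEL] placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

if __name__ == "__main__":
    sample = "Contact Jane at jane.doe@acme.com or 555-867-5309; SSN 123-45-6789."
    print(redact(sample))
    # -> Contact Jane at [EMAIL] or [PHONE]; SSN [SSN].
```

The same approach extends to account numbers, names, or internal project codenames: redact first, prompt second, and keep the mapping of placeholders to real values on your own machine.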
Data Collection Opt-Out
Apart from this fundamental tenet, there are some basic methods we can employ to opt out of data collection when using some of the major LLMs.
ChatGPT
On OpenAI's ChatGPT, data sharing is enabled by default on both the free and paid plans. To opt out, users can open Settings from ChatGPT's web interface by clicking their account icon, select Data Controls, and toggle off the "Improve the model for everyone" option.
To achieve the same result for individual conversations only, users can turn on the "Temporary Chat" function by clicking the dotted speech-balloon icon at the top right of ChatGPT's web interface.
Claude
Unlike OpenAI, Anthropic doesn't collect conversation data for training unless users opt in. If you're a Claude user, you can review these choices under Settings > Privacy. There you can also disable the use of location metadata, which Anthropic collects to provide a more targeted product experience in the chatbot.
Gemini
Google's Gemini, especially on Android where the integration runs deep and system-wide, is the model with the most byzantine privacy settings and the least user control. Customers can disable Gemini Apps Activity under the privacy settings in their Google Account. With it disabled, the model will not collect and send information, although conversations may still be retained by Google for up to 72 hours for review and safety purposes.
Perplexity
The AI-based search engine Perplexity follows an OpenAI-style policy. Data collection is enabled by default, but users can opt out by turning off the "AI Data Usage" toggle under Settings.
Copilot
Microsoft's Copilot has a comprehensive privacy policy: organizational data and files are not uploaded, scanned, or transferred for any purpose, and prompts remain private as long as the user is signed in with a work or school account.
Things get riskier when Copilot is used through the Edge browser or Bing on the web without being signed in; in that case, data may be saved for training and safety purposes. The major risk here is that employees may not notice the difference between the two scenarios and end up uploading or sharing confidential or sensitive data through the web interface, expecting it to be as secure as the signed-in version of Copilot.
Other Models
For all the other models out there, the guidance follows much the same script: check your account settings for any data-sharing options that are turned on, and consult the relevant privacy policy to understand whether, and when, data collection is happening.
The DeepSeek Case
Among the most popular models, one that users may want to avoid altogether is DeepSeek. While extremely capable and useful, the Chinese-made LLM fares worst on privacy: the online version automatically collects and analyzes user data, with no option available to stop it from doing so.
Ultimately, control lies with the user, who can still decide what to share, how, and with which model. At least for now.
“As Gen AI becomes ubiquitous in our lives, we need to decide how much to share. After all, it’s always a tradeoff: privacy for convenience,” concludes Baron. “And we’re only in the early stages. Once we have digital twins — agents that act on our behalf to take over tasks like booking travel, ordering groceries, managing our calendar — we will need to come to grips with a new reality and hope for better cybersecurity.”