Governments’ Influence on AI Chatbots Through Online Media Environments: A New Study Unveils Institutional Imprints in Large Language Models
In an era increasingly dominated by artificial intelligence, a groundbreaking study published in Nature uncovers how governments can shape the outputs of AI chatbots, not through direct intervention in the technology, but via their influence on the web content these models learn from. This multi-institutional research, involving the University of Oregon, Purdue University, the University of California San Diego, New York University, and Princeton University, reveals that state-coordinated media have left measurable imprints on large language models (LLMs), particularly when these models are queried in the languages most affected by governmental media control.
The research comprises six interlinked studies that collectively trace how entrenched political power seeps into AI training data. In practice, this means that a model's responses, especially on political topics, bear traces of the institutional media environment of the country where a given language dominates. The effect is most pronounced in country-specific languages, where local information ecosystems, often shaped by state-controlled channels, subtly set the boundaries of the AI's narratives.
One core finding concerns the sheer volume of state-coordinated media content embedded in the training datasets behind common AI models. Analyzing Chinese online content, the study found that over 3.1 million Chinese-language documents in an open-source multilingual dataset closely mirrored phrasing from documented Chinese state media sources. That amounts to approximately 1.64% of the Chinese-language portion of the dataset, roughly 40 times the share accounted for by Chinese-language Wikipedia, a frequently leveraged training resource. Among documents mentioning Chinese political figures or institutions, the proportion rose as high as 23%, indicating substantial infiltration of politically framed content.
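The paper's actual matching pipeline is not described in this release; as a rough illustration of how such a share can be estimated, here is a minimal Python sketch that flags documents whose character n-grams heavily overlap a set of reference state-media texts. The n-gram size and threshold are illustrative assumptions, not the study's parameters.

```python
# Minimal sketch: estimate what share of a corpus closely mirrors reference
# phrasing. N-gram size and threshold are illustrative assumptions, not the
# study's actual matching parameters.

def char_ngrams(text: str, n: int = 10) -> set:
    """Set of all character n-grams in a string."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 0))}

def overlap_score(doc: str, reference: str, n: int = 10) -> float:
    """Fraction of the document's n-grams that also occur in the reference."""
    doc_grams = char_ngrams(doc, n)
    if not doc_grams:
        return 0.0
    return len(doc_grams & char_ngrams(reference, n)) / len(doc_grams)

def matched_share(corpus, state_media, threshold: float = 0.5) -> float:
    """Share of corpus documents that closely mirror any reference text."""
    matched = sum(
        1 for doc in corpus
        if any(overlap_score(doc, ref) >= threshold for ref in state_media)
    )
    return matched / len(corpus)
```

Character n-grams rather than word tokens make the same approach work for unsegmented scripts such as Chinese.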
This spread is not limited to content taken directly from government websites or official news outlets. Only about 12% of the matched documents originated from known governmental or news domains. Instead, the data suggest widespread recirculation of state-coordinated narratives across lesser-known websites, social media platforms, reposting mechanisms, and everyday online pages. Consequently, AI models ingest this dominant, state-shaped discourse, effectively "laundering" propagandistic language into what appears to be objective, neutral information in chatbot responses.
Further experiments underscored this influence by retraining smaller AI models on curated news content and measuring shifts in ideological tone. Including state-scripted news material increased the likelihood that a model would generate answers favoring the government perspective by nearly 80%, compared with models trained without this content. Even relative to non-scripted or more neutral Chinese media, the scripted content proved remarkably effective at nudging the model's political framing, supporting the claim that repetitive, coordinated language exerts a cumulative effect on AI behavior.
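The release does not spell out how the roughly 80% figure was computed; one plausible reading is a relative increase in the rate of government-favorable answers. The toy sketch below makes that arithmetic concrete with invented rater labels, not the study's data.

```python
# Toy illustration of comparing two retrained models. Raters label each
# answer True if it favors the government perspective; the labels below are
# invented to make the arithmetic concrete, not the study's data.

def favorable_rate(labels: list) -> float:
    """Fraction of answers rated as favoring the government perspective."""
    return sum(labels) / len(labels)

# Same prompts answered by both models, then labeled by raters (toy values).
without_scripted = [True, False, False, True, False,
                    True, False, True, False, True]   # 5/10 favorable
with_scripted    = [True, True, True, True, False,
                    True, True, True, True, True]     # 9/10 favorable

base, treated = favorable_rate(without_scripted), favorable_rate(with_scripted)
print(f"relative increase: {(treated - base) / base:.0%}")  # -> 80%
```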
One striking method leveraged language-based comparisons within the same AI model. The researchers posed identical political questions about China in both Chinese and English to commercial chatbots, then had expert human raters evaluate the responses. Responses prompted in Chinese were judged more favorable toward Chinese state perspectives about 75% of the time, whereas English prompts showed no systematic bias. This cross-lingual approach let the team peek inside proprietary systems, revealing output differences tied to training data that varies across languages rather than to model architecture or algorithmic bias.
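A minimal sketch of what such a paired probe might look like in practice, assuming an OpenAI-style chat API; the question is a hypothetical example, and the study's actual prompts, models, and rating protocol are not reproduced here.

```python
# Sketch of a cross-lingual paired probe, assuming an OpenAI-style chat API.
# The question below is a hypothetical example, not one of the study's prompts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(question: str, model: str = "gpt-4o") -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

question_en = "How well does China protect press freedom?"
question_zh = "中国对新闻自由的保护程度如何？"  # the same question in Chinese

pair = {"en": ask(question_en), "zh": ask(question_zh)}
# In the study's design, paired responses like these were judged by human
# raters for favorability toward state perspectives.
```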
Importantly, this linguistic asymmetry is not unique to China. Extending the analysis to 37 countries whose primary language is spoken largely within their own borders, the researchers found a consistent pattern: models portrayed governments and public institutions more favorably when responding in those countries' primary languages, especially in nations with stronger media controls. While the relationship is correlational and does not prove deliberate manipulation by AI companies or state actors, the pattern is consistent with political control over information ecosystems implicitly shaping AI behavior.
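In concrete terms, the cross-national comparison amounts to correlating, across countries, the gap in rated favorability between local-language and English responses with a measure of media control. The sketch below uses invented numbers purely to show the shape of that computation.

```python
# Toy sketch of the cross-national pattern: per country, the gap in rated
# favorability between local-language and English responses, against a
# media-control index. All values are invented, not the study's measurements.
from statistics import correlation  # Python 3.10+

records = [  # (favorability gap, media-control index), one pair per country
    (0.31, 0.90), (0.05, 0.20), (0.22, 0.70),
    (0.02, 0.10), (0.18, 0.60), (0.27, 0.80),
]
gaps, control = zip(*records)
print(f"gap vs. media control, Pearson r = {correlation(gaps, control):.2f}")
```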
The implications of these findings resonate far beyond the technical domain, touching upon issues of democracy, governance, and the emergent role of AI in public discourse. As Joshua Tucker, co-director of NYU’s Center for Social Media, AI, and Politics, emphasized, “The public debate has focused on what AI can generate, but this study points upstream. Before AI systems can influence politics, politics can influence AI.” This perspective highlights the feedback loop between real-world power structures and the increasingly trusted AI interlocutors shaping user perceptions.
One of the crucial challenges the study confronts is the opacity of training data for commercial AI systems. Since the specific sources and composition of datasets remain largely protected trade secrets, the researchers triangulated the influence of political environments through several complementary approaches: analysis of publicly accessible training corpora, memorization tests on commercial models, retraining experiments with custom datasets, careful human evaluation of chatbot outputs, and broad cross-national comparisons. This interdisciplinary methodology strengthens confidence in the central claim: media control is already shaping the behavior of large language models.
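Of these, the memorization test is the most mechanical: prompt a model with the opening of a documented state-media passage and measure how closely its continuation tracks the original. A minimal sketch, again assuming an OpenAI-style chat API and a placeholder passage:

```python
# Sketch of a memorization probe, assuming an OpenAI-style chat API.
# The passage argument is a placeholder for a documented state-media text.
from difflib import SequenceMatcher
from openai import OpenAI

client = OpenAI()

def continuation(prefix: str, model: str = "gpt-4o") -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"Continue this text:\n{prefix}"}],
    )
    return resp.choices[0].message.content

def memorization_score(passage: str, split: int = 200) -> float:
    """Similarity between the model's continuation and the true continuation."""
    prefix, true_tail = passage[:split], passage[split:]
    return SequenceMatcher(None, continuation(prefix), true_tail).ratio()

# Scores near 1.0 across many known passages suggest the text appeared,
# more or less verbatim, in the model's training data.
```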
The authors caution that their findings should not be misinterpreted as evidence of deliberate efforts by AI developers to align with governmental narratives. Rather, the phenomenon arises organically from the socio-political realities embedded in publicly available internet data, which form the substrate of machine learning. Powerful institutions have, over decades, regulated, censored, and shaped online information ecosystems. AI models, dependent on this data, inadvertently amplify the resulting asymmetries.
A revealing quote from Margaret E. Roberts, a co-author from UC San Diego, encapsulates the novel dynamic at play: “Censorship and propaganda have always shaped what information people encounter. What is new here is that they can also shape the systems people increasingly ask to summarize, explain, and interpret the world for them.” This shift means that AI, once seen as an impartial intermediary, may instead unwittingly propagate the prevailing narratives constructed by political power.
The study also identifies a crucial sociotechnical challenge: AI chatbots separate the message from its messenger. As Brandon M. Stewart from Princeton University points out, “What began as a strategic narrative from a powerful government in a state media outlet can reappear as informed commentary from a highly knowledgeable intelligent agent.” Without visible reputational markers indicating the origin or bias of information, users may misinterpret these AI-generated answers as dispassionate facts rather than content subtly shaped by underlying institutional interests.
Moreover, the research underscores the incentives this situation creates for powerful actors. Given the demonstrated impact of repeated, coordinated language on AI outputs, governments and other influential institutions may have increased motivation to strategically disseminate carefully framed content online, knowing that this linguistic material could enter AI training datasets and thereby influence future AI-mediated public discourse.
Transparency about training data sources emerges as a critical theme throughout the study. Solomon Messing from NYU’s Center for Social Media, AI, and Politics stressed that “If we want to understand the powerful interests these models reflect, we need to know how we’re sourcing the concrete. That starts with more transparency about what goes into the training data.” Absent such openness, the broader public and policymakers face challenges in assessing the fairness, neutrality, or potential biases of AI technologies that now function as pervasive cultural intermediaries.
The researchers created a dedicated project website sharing their methods and replication tests on newer AI models, acknowledging the rapidly evolving AI landscape. The tools and insights they provide are intended to seed a new field of inquiry: scrutinizing how power and politics influence the invisible supply chain behind AI systems.
This pioneering work fundamentally recasts our understanding of AI language models. Instead of purely algorithmic or technical artifacts, LLMs emerge as socio-technical phenomena, tightly interwoven with the global media and political ecosystems they draw upon. As this study so emphatically demonstrates, to understand and govern AI, we must look beyond model architectures and computational methods, delving into the complex politics of the internet text that forms their backbone.
Subject of Research: Not applicable
Article Title: Governments may shape what AI chatbots say by shaping the web they learn from, new Nature study finds
News Publication Date: 13-May-2026
Web References: https://state-media-influence-llm.github.io/
References: DOI 10.1038/s41586-026-10506-7
Image Credits: Hannah Waight (University of Oregon), Eddie Yang (Purdue University), Yin Yuan (University of California San Diego), Solomon Messing (New York University), Margaret E. Roberts (University of California San Diego), Brandon M. Stewart (Princeton University), Joshua A. Tucker (New York University)
Keywords: Large Language Models, AI Training Data, Media Control, State-Coordinated Media, Political Influence, Cross-Language Analysis, Government Propaganda, AI Bias, Information Environment, Chatbot Responses, Institutional Influence, Machine Learning Transparency
Tags: AI chatbot response manipulation, AI ethics in state-influenced environments, AI training data censorship, government influence on AI chatbots, government media control and AI narratives, institutional media influence on AI, language-specific AI response shaping, large language models and political bias, multilingual AI biases, online media control and AI, political power in AI datasets, state-coordinated media impact on AI training