Note: I’m sharing part of the talk I gave at Universidad Panamericana, as part of the 26th Annual Convention of the Media Ecology Association. The panel was titled: “Aesthetics, Narrative, and Artificial Intelligence”.
Artificial intelligence is transforming our informational landscape at an unprecedented speed. Just like in ecological systems, this transformation is driven by an insatiable hunger for resources —in this case, data. The parallels between data extraction in AI and natural resource exploitation are striking. In this presentation, we explore how AI reshapes our cognitive environment and propose sustainable approaches to address the challenges it brings.
What Is “Easy Data” in AI?
“Easy data” refers to public, abundant, and low-friction datasets with high informational value —resources that require minimal processing before being used to train AI models. Examples include Wikipedia, Common Crawl, public domain texts, and open social media posts. These datasets are essential to the development of large language models, as they provide diverse linguistic patterns and foundational knowledge.
However, this supply is diminishing. According to a 2024 WIRED article and a study by the Data Provenance Initiative at MIT, roughly 25% of the highest-quality data in major datasets such as C4, RefinedWeb, and Dolma has become inaccessible. The reason? Many websites now restrict automated data collection through robots.txt directives or have implemented paywalls, limiting their use in AI training. This not only affects big tech companies but also hinders academic research and innovation.
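To make the mechanism concrete: robots.txt is a plain-text file that tells crawlers which paths they may fetch. Below is a minimal sketch, using Python's standard `urllib.robotparser`, of how a compliant crawler checks permission. The file contents and bot names are illustrative of the kind of rules many publishers now serve, not taken from any specific site.

```python
from urllib import robotparser

# A hypothetical robots.txt of the kind many publishers now serve:
# AI training crawlers are blocked entirely, everyone else is allowed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A compliant crawler checks before fetching each page:
print(rp.can_fetch("GPTBot", "https://example.com/articles/1"))       # False
print(rp.can_fetch("Mozilla/5.0", "https://example.com/articles/1"))  # True
```

Compliance is voluntary: the file expresses the publisher's wishes, but nothing technically prevents a crawler from ignoring it — which is precisely why the consent crisis discussed below has become a governance problem rather than a purely technical one.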
The Data Gold Rush: Fading Out
The rapid deployment of AI systems has created unprecedented demand for data. Public sources are under increasing strain due to extraction pressures. In response, media outlets and online platforms have taken protective measures: modifying terms of service, erecting paywalls, or signing exclusive commercial agreements with AI developers. These moves are designed to safeguard intellectual property and ensure fair compensation.
This shift has led to what some experts, such as Shayne Longpre (MIT) and Yacine Jernite (Hugging Face), describe as an emerging consent crisis. The lack of clear, equitable agreements on data use has generated conflict between developers and content creators, diminishing the pool of accessible data and complicating the ethical landscape of AI development.
The Key to AI: Quality and Diversity of Sources
An AI model’s accuracy, versatility, and relevance are directly tied to the quality and diversity of its training data. Broad and up-to-date datasets from trustworthy sources —academic journals, verified news, official statistics, structured databases— enhance both performance and ethical reliability.
Conversely, training on limited or biased data may perpetuate misinformation, errors, and cultural insensitivity. Diverse sources enable models to be more culturally aware, reduce bias, and adapt better across contexts. But the scarcity of such data today is not just a technical challenge —it represents a rupture in the balance of the informational ecosystem.
AI as an Ecosystem Engineer
AI doesn’t just consume information; it actively reshapes the conditions under which information is created and shared. Like ecosystem engineers in nature, AI alters what knowledge is generated, how it circulates, and what gets prioritized. The unregulated extraction of massive datasets —text, image, audio— mirrors ecological overexploitation and leads to systemic strain.
This calls for a renewed analytical lens: we must ask not only what content AI produces, but also how the structures behind that content are formed and maintained. The extractive logic of data mining must give way to an ecological approach —one that sees data as part of a regenerative cycle of use, consent, and reciprocity.
Ecological Lessons: Over-Extraction and Feedback
A positive alternative lies in community-based data stewardship. Imagine a university, a local media collective, and a group of developers collaborating to train an AI model using curated texts and interviews. The data is labeled with context and consent; contributors receive credit and access to results. The model, in turn, supports the community through summaries, translations, or analytic tools.
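One way to picture such stewardship in practice is a record schema that carries consent and attribution alongside the content itself. The sketch below is a hypothetical design, not an existing standard; every field name and permitted-use label is an assumption made for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical schema for a consent-labeled training record, as a
# community stewardship project might define it. Field names and use
# labels are illustrative, not drawn from any existing standard.
@dataclass
class StewardedRecord:
    text: str          # the contributed content itself
    contributor: str   # who receives credit
    context: str       # where and why the text was produced
    consented_uses: list = field(default_factory=list)

    def permits(self, use: str) -> bool:
        """Check whether a proposed use was explicitly consented to."""
        return use in self.consented_uses

record = StewardedRecord(
    text="Interview transcript ...",
    contributor="Local Media Collective",
    context="Oral history project, 2024",
    consented_uses=["training", "summarization"],
)
print(record.permits("training"))  # True
print(record.permits("resale"))    # False
```

The design choice here is that permission is opt-in: a use absent from `consented_uses` is denied by default, inverting the extractive norm in which everything reachable is presumed fair game.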
In ecosystems, balance is maintained through feedback loops. In AI, this could mean governance structures that give voice and value back to the authors of the content. The transition from data mining to data stewardship is essential —prioritizing transparency, traceability, and redistribution of benefits.
Conclusion: Toward a Regenerative Data Ecosystem
The dwindling availability of high-quality data reflects a broader informational crisis. The overexploitation of open sources has triggered a partial collapse of the knowledge commons, where access to reliable content is increasingly restricted.
This mirrors warnings from thinkers like McLuhan and Postman: technologies don’t just change what we know —they reshape what counts as knowledge, truth, and participation. AI reconfigures the cognitive environment. It transforms, filters, and selects what we see and what we don’t. Therefore, a deeper understanding is needed —one that goes beyond content and interrogates the structures producing it.
It is time to move from an extractive logic to one that is ethical, regenerative, and collaborative. Data stewardship offers a new paradigm: one grounded in consent, contextualization, and shared benefit. This is not just a technical fix —it’s a cultural and ecological necessity for the future of AI and the societies it increasingly influences.