
Wikipedia Is Making a Dataset for Training AI Because It’s Overwhelmed by Bots
On Wednesday, the Wikimedia Foundation announced it is partnering with Google-owned Kaggle—a popular data science community platform—to release a version of Wikipedia optimized for training AI models. Starting with English and French, the foundation will offer stripped-down versions of raw Wikipedia text, excluding references and markdown code.
As a non-profit, volunteer-led platform, Wikipedia monetizes largely through donations and does not own the content it hosts, allowing anyone to use and remix material from the platform. It is fine with other organizations using its vast corpus of knowledge for all sorts of use cases—Kiwix, for example, is an offline version of Wikipedia that has been used to smuggle information into North Korea.
But a flood of bots constantly crawling the site for AI training data has led to a surge in non-human traffic to Wikipedia, something the foundation has been keen to address as costs have soared. Earlier this month, it said bandwidth consumption has increased 50% since January 2024. Offering a standard, JSON-formatted version of Wikipedia articles should dissuade AI developers from bombarding the website.
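For a rough sense of what working with such a dump might look like, here is a minimal sketch in Python that iterates over a JSON Lines file of articles. The file name and the "title"/"text" field names are assumptions for illustration only; the actual schema is documented on the Kaggle dataset page.

```python
import json

def iter_articles(path):
    # Hypothetical example: stream records from a JSON Lines dump of
    # stripped-down Wikipedia articles, one JSON object per line.
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

if __name__ == "__main__":
    # "enwiki_sample.jsonl" and the "title"/"text" keys are placeholders,
    # not the dataset's confirmed layout.
    for article in iter_articles("enwiki_sample.jsonl"):
        print(article.get("title"), len(article.get("text", "")))
```

The appeal for AI developers is that records like these arrive as plain text without wiki markup or references, so they can be fed into a training pipeline without first scraping and cleaning live pages.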
“As the place the machine learning community comes for tools and tests, Kaggle is extremely excited to be the host for the Wikimedia Foundation’s data,” Kaggle partnerships lead Brenda Flynn told The Verge. “Kaggle is excited to play a role in keeping this data accessible, available, and useful.”
It is no secret that tech companies fundamentally do not respect content creators and place little value on any individual’s creative work. There is a rising school of thought in the industry that all content should be free and that taking it from anywhere on the web to train an AI model constitutes fair use due to the transformative nature of language models.
But someone has to create the content in the first place, which is not cheap, and AI startups have been all too willing to ignore previously accepted norms around respecting a site’s wishes not to be crawled. Language models that produce human-like text outputs need to be trained on vast amounts of material, and training data has become something akin to oil in the AI boom. It is well known that the leading models are trained on copyrighted works, and several AI companies remain in litigation over the issue. The threat to companies from Chegg to Stack Overflow is that AI companies will ingest their content and return it to users without sending traffic to the sites that made the content in the first place.
Some contributors to Wikipedia may dislike their content being made available for AI training, for these reasons and others. All writing on the website is licensed under the Creative Commons Attribution-ShareAlike license, which allows anyone to freely share, adapt, and build upon a work, even commercially, as long as they credit the original creator and license their derivative works under the same terms.
The Wikimedia Foundation told Gizmodo that Kaggle is paying for the data through Wikimedia Enterprise, a premium offering that allows high-volume users to more easily reuse content. It said that reusers of the content, such as AI model companies, are still expected to respect Wikipedia’s attribution and licensing terms.
Originally posted on: https://gizmodo.com/wikipedia-is-making-a-dataset-for-training-ai-because-its-overwhelmed-by-bots-2000590704