Data Provenance Explorer Audits Popular AI Training Data Sets
Sharing data on the Internet is like giving car keys to a teenager: once you’ve done it, you’ll never track where they’ve gone. Channeling anxious parents everywhere, a coalition of researchers from MIT, Cohere for AI, and 11 other institutions has released Data Provenance Explorer, which audits the contents of nearly 2,000 widely used training data sets. You won’t be surprised that there’s a lot of missing information and unauthorized use; you may be surprised how often the researchers warn that licenses limited to non-commercial use will stifle growth of new AI-based businesses.