AI’s become so invasively popular and I’ve seen more evidence of its ineffectiveness than otherwise, but what I dislike most about it is that many run on datasets of stolen data for the sake of profitability à la OpenAI and Deepseek
https://mashable.com/article/openai-chatgpt-class-action-lawsuit https://petapixel.com/2025/01/30/openai-claims-deepseek-took-all-of-its-data-without-consent/
Are there any AI services that run on ethically obtained datasets, like stuff people explicitly consented to submitting (not as some side clause of a T&C), data bought by properly compensating the data’s original owners, or datasets contributed by the service providers themselves?
There are no legal sources big enough to train an AI on the level required to even perform basic interaction.
This is very true.
I was part of the OpenAssistant project, voluntarily submitting my personal writing to train open-source LLMs without having to steal data, in the hopes it would stop these companies from stealing people’s work and make “AI” less of a black box.
After thousands of people submitting millions of prompt-response pairs, and after some researchers said it was the highest quality natural language dataset they’d seen in a while, the base model was almost always incoherent. You only got a functioning model if you just used the data to fine-tune an existing larger model, Llama at the time.