Disclaimer: If this is your first time reading this series on our blog, you should take a step back and read the first piece. Where? Just here.
A few days ago, I introduced this topic, motivated by the desire to understand and discuss the dangers and ethical challenges surrounding AI. To start, I briefly explored what AI means and how an AI pipeline is built, which let us dig into the first step: data collection.
In today’s post, I’ll move further along the pipeline and talk about data exploration and model building. Let’s go through it!
2. Data exploration
2.1. Privacy
When we explore the data (maybe looking for biases?), we will inevitably go through data points that we should not be looking at. Earlier we explored the concerns around obtaining and storing that data; now other people have their hands on it.
Let’s pretend, for example, that we have gathered location information for some kind of mapping application. Let’s say that we have already gone through the ethical and legal hurdles of obtaining and storing that data. To make the example more real, feel free to check one of these open datasets which came out of a quick Google search: [1], [2], [3]. Data scientists might find patterns in the data that they were not looking for. “Huh, guess what — this person leaves their house every day at 9 AM and makes a 30-minute stop before going to work.”
These “insights” are baked into the data that we (GPS users) chose to share, but we never knew they came along with it. Did we agree to share that? Can data scientists even warn us about what they will find before actually finding it?
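To see how easily such an insight can surface, here is a minimal sketch in pandas. It assumes a hypothetical file `location_pings.csv` with `user_id`, `timestamp`, `lat`, and `lon` columns (these names are made up for illustration and don’t come from the datasets linked above); the point is only that a recurring stop falls out of a simple group-by, whether or not anyone was looking for it.

```python
# Minimal sketch (hypothetical data): how an "innocent" location dataset
# can reveal a person's daily routine with a few lines of pandas.
import pandas as pd

# Assumed columns: user_id, timestamp, lat, lon -- one row per GPS ping.
pings = pd.read_csv("location_pings.csv", parse_dates=["timestamp"])

# Round coordinates to roughly 100 m cells and extract the hour of day.
pings["cell"] = list(zip(pings["lat"].round(3), pings["lon"].round(3)))
pings["hour"] = pings["timestamp"].dt.hour

# Count how often each user sits in the same cell at the same hour.
routine = (
    pings.groupby(["user_id", "hour", "cell"])
         .size()
         .rename("visits")
         .reset_index()
)

# Places a user visits at the same hour on many different days are strong
# candidates for home, work, or that recurring 9 AM stop.
habits = routine[routine["visits"] >= 20].sort_values("visits", ascending=False)
print(habits.head())
```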
2.2. Insight rights
Once the insights are found, are data scientists under an obligation to let us know what they found? Should they warn the authorities if they find evidence of illegal activity? Should they contact people when health is at stake? Are they obligated to share the benefits of those insights?
Google Maps is probably not required to guarantee the fastest trip, but should a DNA sequencing company get in touch about a possible finding correlated with cancer?
Currently, we, the AI builders, are not under an obligation to share these insights. We are not under an obligation to share the benefits of what we find. Again, this is one of those cases where what we’re doing is not illegal [2], but is it ethical?
3. Model Building
3.1. Proprietary Algorithms
The best model for text-to-speech currently out there is not open source. It isn’t backed by a big company, and nobody knows exactly how it works. Nobody except the creator, who claims to be the only person behind it. This model, 15.ai, was created with a dataset from the Pony Preservation Project, a huge effort by people around the world to recreate the voices of the My Little Pony characters.
This is a striking example of many people working toward the same goal, with one of them getting there first with amazing results while the rest are left in the dark. Anons from the Pony Preservation Project were never able to replicate 15.ai’s results: text-to-speech with faster-than-real-time processing, multi-speaker support, context-aware pronunciation, dynamic realistic emotion, and voices that can be cloned from less than 30 seconds of data.
This example may seem trivial (unless you really miss Twilight Sparkle), but it is actually a common occurrence in the world of AI. Advances push technology and science forward, but companies and institutions are reluctant to share them with the world: sometimes out of fear of abuse, sometimes because they would lose the competitive edge they just acquired.
This has become such a problem that OpenAI was founded with the very objective of circumventing it: as a non-profit, it set out to make great strides in AI and to open those advances to the whole world. This way, if everyone shares the advances, we all win and nobody is left behind.
If you have been paying attention to the links in this blog post series, you might have been surprised that OpenAI was the one that initially withheld the full model behind its GPT-2 breakthrough. Even with good intentions, a breakthrough might be too dangerous to let out. We’ll talk about the societal impact later.
Is this really a problem, though? At this point, the challenge is that companies are not incentivized to share their progress but to benefit from it privately. It’s not a technical problem; it’s a business one. Most researchers are funded by private enterprises: if they’re not working for a lab like FAIR (Facebook) or DeepMind (Google), they’re under the watchful eye of government-funded institutions like CIFAR (Canada), the NSF (US), etc.
3.2. Technological monopoly
But what about competition? If a company happened to come across a breakthrough, it could release a product so good that it would simply wipe out the competition. Fortunately, today there is more than one company making great advances in research, but we remain on the verge of such a gap opening up.
This is so real that the Future of Humanity Institute has proposed the Windfall Clause, a policy under which AI firms would commit in advance to sharing any windfall profits, mitigating the risk of AI creating a gap so vast that some companies or governments would outrun all others.
Today it is just that: a proposal. The same document acknowledges that there are risks and downsides that haven’t been completely addressed. In a very real sense, this is still an open problem.