The AI Pipeline: Map of AI Ethics

We’ve all heard the news about how AI will revolutionize multiple industries, transform mankind and bring us into a new era of technology. But it all sounds too good to be true. Indeed, it is. Like any other technological tool, AI has serious consequences if it is left unchecked.

Discussing the ethics of AI has always been tricky, since the question most often asked is “whose fault is it?” That question only scratches the surface of the ethical discussion. A deeper analysis of what we mean is more revealing and gives us a better framework for these conversations.

Since this is not a simple subject, I would like to walk through the steps it takes to build a functional piece of AI, and go beyond the task itself to understand the ethical implications of each step.

This will be a three-post series in which I’ll walk through the five steps in building an AI pipeline:

Data collection
Data exploration
Model building
Model evaluation
Model deployment

Such an interesting topic deserves a little bit of background, so let’s start with one simple question…

Wait, what’s AI?

There are lots of articles about AI (Artificial Intelligence) and the different ways in which it can be defined. No wonder the topic is difficult to discuss, when even getting started on the right foot is a challenge. But here’s a quick definition that can help: AI consists of algorithms that sort through huge amounts of data, find patterns and insights, and make predictions or classifications that would require intelligence if a human made them. As these models are released into the real world to do their job, they get exposed to more data and can further learn and adapt.

I have always found that analyzing a difficult subject is easier if I can locate myself in the middle of it all. It’s like visiting a huge mall and feeling relieved to find a sign that says “you are here.” This is what will help us through this discussion: a map to guide us. Our map will be an AI/ML “pipeline,” which is the process of creating one of these tools from start to end.

The AI pipeline

A pipeline is roughly the set of steps that need to happen before we can have a functioning piece of AI. First, if we want our AI system to learn, we need something for it to learn from. It follows that our first step is data collection. As the name suggests, this is the action of collecting data so that our system can analyze it.

Data engineers, a rare breed between programmers and data scientists, then analyze and explore this data. This is actually the most important step, because any conclusions drawn here will drastically change what comes next. This step is called data exploration, and once it is done, we are ready to transform the data into a form a machine can learn from.

After this, data scientists build machine learning models, which are trained to predict or classify aspects of the data. Intuitive as it might be, this step is called model building, and it involves choosing and setting up the right algorithm to process the data. It sounds simple, and for some models it is. It can also be hauntingly difficult, leading to gigantic models like DeepMind’s AlphaGo or OpenAI’s GPT-3.

The models are then evaluated against one or more performance metrics, to decide whether the results are “good enough” (whatever that means) and whether they can be improved. Insights also happen here, sometimes about correlations from other fields that we didn’t expect to find.

Finally, the model is put into production and starts acting in the real world. This is called deployment. At this point, we might collect more data and repeat the cycle.
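To make the map a bit more tangible, here is a minimal sketch of the five steps as plain Python functions. Everything in it is hypothetical: the toy records (hours studied and whether an exam was passed), the function names, and the threshold "model" are assumptions chosen only to show how the steps chain together, not how a real pipeline is built.

```python
from statistics import mean

def collect_data():
    # Step 1: data collection. Hypothetical toy records: (hours_studied, passed_exam).
    return [(1, 0), (2, 0), (3, 0), (6, 1), (7, 1), (8, 1)]

def explore_data(records):
    # Step 2: data exploration. Summarize the data before modelling it.
    hours = [h for h, _ in records]
    return {"n": len(records), "min": min(hours), "max": max(hours)}

def build_model(records):
    # Step 3: model building. The "model" is a threshold placed halfway
    # between the average hours of those who passed and those who failed.
    passed = mean(h for h, y in records if y == 1)
    failed = mean(h for h, y in records if y == 0)
    threshold = (passed + failed) / 2
    return lambda hours: 1 if hours >= threshold else 0

def evaluate_model(model, records):
    # Step 4: model evaluation. Accuracy is only one of many possible metrics.
    # We score on the training data purely to keep the sketch short.
    return mean(1 if model(h) == y else 0 for h, y in records)

def deploy(model, new_hours):
    # Step 5: deployment. The model now makes predictions on unseen inputs.
    return model(new_hours)

records = collect_data()
summary = explore_data(records)
model = build_model(records)
accuracy = evaluate_model(model, records)
prediction = deploy(model, 5)
```

Each function is a place where an ethical question can hide: what goes into `collect_data`, what we choose to look at in `explore_data`, and what counts as "good enough" in `evaluate_model`.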

With our map at hand, let’s dive deeper into each of these steps and discuss what dangers and ethical challenges lurk ahead.

1.1. Data privacy

If we’re off to build a model, we need some data for it to learn from. Fantastic. Where should the data be sourced from? It is entirely possible that the sources are not legal or raise data privacy concerns. Web scraping is not illegal in itself, but is it ethical? Did the users consent to this use of their data? Should they? Some websites, like Stack Overflow, explicitly allow this by publishing their content under reusable licenses.

This issue goes hand in hand with changes that make users’ default privacy settings more lax, driven by a growing need for social networks to be more interconnected. Getting data out of Facebook today, for instance, is a lot easier than it was 10 years ago. We know that’s the case because a huge nationwide scandal grew out of that fact alone.

Even if data engineers are able to get the data, should they? Should users, system admins, legal teams and webmasters be notified and consent to its use?

The answer is not yet clear, as there are many different kinds of data online.

In most cases, models don’t really need to identify users, so we’re talking about anonymized data. But anonymized doesn’t mean private. What guarantees can we give that an anonymized data point won’t identify someone after all? Data re-identification is a full-blown discipline, and it gets real results.
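A rough way to see why anonymized doesn’t mean private is to count how many records share the same combination of seemingly harmless attributes. The sketch below uses a k-anonymity style check, which is my choice of illustration rather than anything prescribed above. The records, the field names and the choice of quasi-identifiers (`zip`, `birth_year`, `gender`) are all made up for this example.

```python
from collections import Counter

# Hypothetical "anonymized" records: names removed, quasi-identifiers kept.
records = [
    {"zip": "10001", "birth_year": 1985, "gender": "F", "diagnosis": "asthma"},
    {"zip": "10001", "birth_year": 1985, "gender": "F", "diagnosis": "flu"},
    {"zip": "10001", "birth_year": 1990, "gender": "M", "diagnosis": "diabetes"},
    {"zip": "10002", "birth_year": 1972, "gender": "F", "diagnosis": "migraine"},
    {"zip": "10002", "birth_year": 1972, "gender": "M", "diagnosis": "asthma"},
]

# Assumed quasi-identifiers: fields an attacker could plausibly know already.
QUASI_IDENTIFIERS = ("zip", "birth_year", "gender")

def group_sizes(rows, keys=QUASI_IDENTIFIERS):
    # Count how many rows share each combination of quasi-identifiers.
    return Counter(tuple(row[k] for k in keys) for row in rows)

def k_anonymity(rows, keys=QUASI_IDENTIFIERS):
    # k is the size of the smallest group; k == 1 means someone is unique.
    return min(group_sizes(rows, keys).values())

def unique_rows(rows, keys=QUASI_IDENTIFIERS):
    # Rows whose quasi-identifier combination appears exactly once.
    sizes = group_sizes(rows, keys)
    return [row for row in rows if sizes[tuple(row[k] for k in keys)] == 1]

k = k_anonymity(records)
exposed = unique_rows(records)
```

Anyone who knows a person’s zip code, birth year and gender could match them to a row in `exposed` and read off the diagnosis, even though no name was ever stored.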

1.2. Expert exploitation

There is a particular kind of machine learning called supervised learning. A supervised algorithm takes examples with known results, and learns how to get from the data points to the result. To train these models, we need to provide the results first, an activity sometimes called tagging (or labeling).
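As a minimal sketch of what "learning from tagged examples" means, the snippet below uses a 1-nearest-neighbour rule: a new data point simply receives the tag of the most similar tagged example. The features (message length and number of links), the tags and the function names are assumptions made up for this illustration. Real systems use far more data and more capable algorithms.

```python
# Hypothetical tagged examples: (message length, number of links) -> tag.
# Someone had to assign these tags by hand; that is the tagging work.
tagged_examples = [
    ((120, 0), "ham"),
    ((95, 1), "ham"),
    ((30, 4), "spam"),
    ((22, 6), "spam"),
]

def squared_distance(a, b):
    # Squared Euclidean distance between two feature tuples.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict(features, examples=tagged_examples):
    # 1-nearest-neighbour: copy the tag of the most similar tagged example.
    _, tag = min(examples, key=lambda ex: squared_distance(ex[0], features))
    return tag

prediction_a = predict((110, 0))  # long message, no links
prediction_b = predict((25, 5))   # short message, many links
```

The model can only ever be as good as the tags it was given, which is why the people doing the tagging matter so much.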

The work needed to tag datasets is sometimes easy, but not always. You might have done this yourself unknowingly when you typed out the words shown in an image and helped Google’s OCR. Sometimes, the initial work requires legal knowledge, or deep scientific training, or whatever-it-is in your industry of choice.

Creating these datasets is time-consuming, and industry experts often get stuck making them. To lower costs, companies like Amazon have turned these expert-level jobs into optimized minimal tasks, something we could call a data factory. This has raised concerns over how much people are paid for their time, and over how subjects are treated when the people involved are actual subjects in a study.

This problem, while not unique to AI, has grown immensely because AI is very data-hungry and attracts a lot of investment. It is what most companies, major and minor, are turning to. Before services like Amazon’s Mechanical Turk, our options were limited to a few online surveys, but now there are lots of other alternatives around. Some of these options seek to address the ethical concerns behind the Mechanical Turk approach, while others just replicate it.

1.3. Bias

There are several ways to measure the “quality” of data. But let’s shine another light on this problem to make it easier to understand. Let’s assume we build a system to predict who might be a good candidate for our workforce, based on their resume. After using it for a while, we identify a disturbing trend: it rejects women a lot more often than men.

If AI models could talk (they can, it’s called explainability and we’ll dive into it later) they could say: “I learned from a bunch of examples and they were 80% men and 20% women, hence men are better suited for your workforce.” Your intentions are good, little model, but you’ve got it wrong.

The disturbing realization is not that the model is sexist, but rather that it’s magnifying the underlying bias in our initial dataset. And it’s hard to fix biases in data, because we might not even be aware of them. Have we accounted for differences in gender? Race? Age? What about location? What about socioeconomic class? Are neurotypical low-income Latinos represented in the same fashion as Crazy Rich Asians (2018)? What about… You get the idea.
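One simple way to start looking for this kind of bias is to compare how each group is represented in the data and how often each group is selected. The sketch below does that for the resume example. The counts are invented so that the dataset is 80% men and 20% women, and the function names are mine. It is a first sanity check, not a complete fairness audit.

```python
from collections import defaultdict

# Hypothetical model decisions on past applicants: (group, was_selected).
# The counts are made up: 80 men and 20 women in total.
decisions = (
    [("men", True)] * 40 + [("men", False)] * 40
    + [("women", True)] * 4 + [("women", False)] * 16
)

def representation(rows):
    # Share of each group in the dataset.
    counts = defaultdict(int)
    for group, _ in rows:
        counts[group] += 1
    return {g: n / len(rows) for g, n in counts.items()}

def selection_rates(rows):
    # Fraction of each group that was selected.
    selected, totals = defaultdict(int), defaultdict(int)
    for group, was_selected in rows:
        totals[group] += 1
        selected[group] += int(was_selected)
    return {g: selected[g] / totals[g] for g in totals}

shares = representation(decisions)
rates = selection_rates(decisions)
# A ratio well below 1 means women are selected far less often than men.
ratio = rates["women"] / rates["men"]
```

A check like this only covers the groups we thought to record. The biases we are not aware of never show up in the `group` column in the first place.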

This issue is not just theoretical; it causes actual problems in real life. Predictive policing algorithms are the best example of this, and one of the worst cases in practice. Companies interested in predicting crime use police records to predict problematic areas. The objective is to be proactive and know where to police more. But the existing bias in policing (as shown in research and even FBI reports) means that certain neighborhoods are already patrolled more, which leads to more reports, which leads to the algorithm predicting them as more dangerous, which leads to further patrolling and further bias.
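The loop is easier to see in a toy simulation. In the sketch below, two neighborhoods have exactly the same true crime rate, but one starts with slightly more recorded reports. The "algorithm" sends most patrols to whichever area has more reports, and patrols generate reports. All of the numbers, names and the 70/30 patrol split are assumptions of mine. This is a deliberately crude model, not a description of any real predictive policing product.

```python
def simulate_patrols(rounds, initial_reports, hot_spot_share=0.7,
                     patrols_per_round=100, true_crime_rate=0.1):
    # Both neighborhoods share the SAME true crime rate by construction.
    # This toy model assumes exactly two neighborhoods.
    reports = dict(initial_reports)
    history = []
    for _ in range(rounds):
        # "Prediction": the area with more recorded reports is the hot spot.
        hot_spot = max(reports, key=reports.get)
        for area in reports:
            share = hot_spot_share if area == hot_spot else 1 - hot_spot_share
            patrols = patrols_per_round * share
            # More patrols means more recorded incidents, not more crime.
            reports[area] += patrols * true_crime_rate
        total = sum(reports.values())
        # Record each area's share of all reports after this round.
        history.append({area: count / total for area, count in reports.items()})
    return history

# "north" starts with 55% of the recorded reports, purely by historical accident.
history = simulate_patrols(rounds=20, initial_reports={"north": 11, "south": 9})
```

In this toy setup the north’s share of recorded reports climbs round after round from its initial 55% towards 70%, even though nothing about the underlying crime differs between the two areas. The data ends up "confirming" the allocation that produced it.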

This is a sneaky, dangerous problem. It might be a long time before we even know it exists. Back to the example of the sexist model: instead of taking for granted that men and women are equal, we could mistakenly accept the insights learned by the model. “Ah yes, this proves that men are really better suited for an office job.” It sounds silly because it’s obviously wrong. Sometimes it’s more subtle and hence more dangerous: “Ah yes, this proves this black neighborhood is more dangerous, as we always suspected.” This sort of bias is a serious problem.

Final words

Today I explored what AI is and how a pipeline gets an AI tool ready, and I went deep into the first step of the pipeline: data collection. As I said above, any AI system needs information to learn from and to build on, but there are many concerns around how we get this data and how well users are protected against privacy attacks.

This was just the beginning and I hope you enjoyed it. Stay tuned for the next post on this subject coming soon.