Artificial Intelligence (AI) and Machine Learning are powerful buzzwords representing areas of important innovation. If you’re fascinated by these topics like we are, you may have taken a deeper dive into them and run across the concept of Reinforcement Learning. Without getting too technical, here’s a quick explanation of what that is and why it’s a fundamental part of the world of AI.


The best way to break down the concept of Reinforcement Learning into manageable bites is to look at the way humans and animals think. If you’ve heard of positive reinforcement and negative reinforcement, then you’re halfway there to understanding Reinforcement Learning.

This is how it works: when we’re trying to train or educate someone, we ask them to perform a task. If they do well, we encourage and reward them. When this happens, biological brains adapt so that they can perform the task more easily in the future, effectively getting better.

This is positive reinforcement.

When this happens over and over, the brain actually adapts so as to better perform that reward-inducing task in the future. The brain is ‘optimizing’, so to speak, so those results can be duplicated.

But what happens when that same person produces some not-so-praiseworthy behavior?

If the result is incorrect, we discourage them or even punish them. Biological brains then adapt so that they don’t go down this path again, because they don’t want to repeat that feeling.

That’s when you bring in the negative reinforcement (psychologists would strictly call this punishment, but the everyday term works for our purposes).

Whether it’s a firm scolding or another form of punishment, it has to be something with negative consequences for the bad action. That will cause the behavior-learning mechanisms in the brain to adapt. Only this time, the adaptation has a different end goal: to avoid that feeling of being scolded or punished.

The picture below represents the basic process and elements involved in a reinforcement learning model.

Reinforcement Learning Process


Translating Human Behavior to a Computer Algorithm

The human brain is far more complex than current versions of AI, but scientists are getting closer to reproducing some functions of a biological brain with machine learning. Using algorithms, they can simulate the same behavior-response system by modeling the following three concepts:

  1. Some sort of agent that makes decisions (so that it attempts the task we assign it) by choosing the best from a set of options.
  2. A way to reward it when it selects the right choice, so that next time it is more likely to choose the right one.
  3. A way to penalize it when it selects the wrong choice, so that next time it is more likely to choose the right one.
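The three concepts above can be sketched in a few lines of Python. This is only a toy illustration, not a real reinforcement learning algorithm: the `Agent` class, its `scores` table, the `step` size and the `train` helper are all names made up for this example, and we assume the environment simply tells the agent whether its choice was right.

```python
import random


class Agent:
    """A toy decision-maker that keeps one preference score per option."""

    def __init__(self, options, step=1.0, seed=0):
        self.scores = {option: 0.0 for option in options}
        self.step = step                  # hypothetical size of each adjustment
        self.rng = random.Random(seed)

    def choose(self):
        # Concept 1: pick the option with the best score so far
        # (ties are broken at random).
        best = max(self.scores.values())
        candidates = [o for o, s in self.scores.items() if s == best]
        return self.rng.choice(candidates)

    def reinforce(self, option, correct):
        # Concept 2: a right choice raises that option's score.
        # Concept 3: a wrong choice lowers it, so other options win next time.
        self.scores[option] += self.step if correct else -self.step


def train(agent, right_answer, rounds=20):
    """Let the agent try the task repeatedly and give feedback each time."""
    for _ in range(rounds):
        choice = agent.choose()
        agent.reinforce(choice, correct=(choice == right_answer))
    return agent


agent = train(Agent(["left", "middle", "right"]), right_answer="middle")
print(agent.choose())
```

After a few rounds the wrong options have been pushed down and the right one pulled up, so the agent keeps selecting it.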

Recap: Reinforcement Learning is the equivalent of behavior training. A certain action is tried, the undesired results are punished, and the good ones are rewarded.

The 2 Sides of Reinforcement Learning: Exploration and Exploitation

Let’s imagine you are a decision-making agent, for a moment. Of course, you are one in your everyday life, but in a much more complex way. Right now, for the sake of this little exercise, we’re asking you to simplify your thought processes and use a much smaller portion of your brain.

Picture a bag of small plastic balls. Each ball has a number painted on it. The number of balls in the bag is unknown to you, as are the numbers on the balls. You have no idea if the balls are numbered 1 – 10 or even if they’re numbered sequentially at all.

Along with the bag of balls, you are given a task: find the ball with the highest number. You may draw one ball or you may draw them all out of the bag to find the highest number.

But there’s a catch, and here’s where the decision-making part comes in.

Only the last ball you draw from the bag is evaluated. You can draw 5 balls but only the fifth one out of the bag will count as your submission. If you draw 4, 19, 5, and 1 (in that order), it’s a huge shame because the only ball that counts is the one with ‘1’ painted on it. Your lucky 19 draw is irrelevant.

It’s a bit like playing Blackjack.

Only with Blackjack, you at least know what range of numbers you’re working with. Not so in this exercise. This task would be way easier if you only knew how high the numbers went… but you don’t. In scientific terms, we say the environment is only partially observable.

So you draw a ball and get a certain number (let’s say 7). As we said, you have no idea if this is a relatively high or low number, so what do you do? You draw another. Now you’ve pulled a 35. Way to go!

That’s a much better number but what if there are even higher numbers in there?

You think, What if I could pull something in the thousands?

So you pull another ball and you get a ‘2’. Now clearly you’ve made a bad decision. You can hardly be blamed, however, because you didn’t have much information to start with.

In this line of thinking, you’re exploring the environment and making inferences about it (learning). On the next run, this will help you decide when to play it safe and when to go for more.
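To see how a bit of exploration pays off, here is a small simulation of the ball game. It rests on assumptions that are not part of the exercise itself: the bag contents are invented, and the strategy (look at the first few balls without committing, then stop at the first ball that beats them) is just one plausible rule, with `play`, `average_result` and `explore_draws` being names made up for this sketch.

```python
import random


def play(bag, explore_draws, rng):
    """Draw without replacement; only the last ball drawn counts."""
    balls = list(bag)
    rng.shuffle(balls)                    # the draw order is random
    best_seen = None
    last = None
    for i, ball in enumerate(balls):
        last = ball
        if i < explore_draws:
            # Exploration: just learn what kind of numbers are in the bag.
            best_seen = ball if best_seen is None else max(best_seen, ball)
        elif best_seen is None or ball > best_seen:
            # Stop at the first ball that beats everything seen while exploring.
            break
    return last                           # if nothing better shows up, the bag runs out


def average_result(bag, explore_draws, runs=2000, seed=0):
    """Average final ball over many independent games."""
    rng = random.Random(seed)
    return sum(play(bag, explore_draws, rng) for _ in range(runs)) / runs


bag = [4, 19, 5, 1, 7, 35, 2, 12, 8, 3]   # invented contents, hidden from the player
for k in (0, 3, 9):
    print(k, round(average_result(bag, k), 1))
```

With `explore_draws=0` the player keeps the very first ball, and with a very large value they almost always empty the bag. Something in between tends to do better on average, which is the balance the exercise is hinting at.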

That was exploration of the environment. Now, here’s a look at exploitation of an environment.

Let me give you another example: for most of your life, you’ve loved Italian cuisine. You just really enjoy it, and you already know that it’s a 10/10 for you. And yet, even when Italian dishes are available to you, you might decide to go for something new. Depending on how adventurous you feel, you might decide to go with a variation of an Italian dish. Or maybe you change completely and go for something like Korean cuisine. Who knows? Maybe you’ll even like it more than you like Italian, but you won’t know until you try.

This kind of boredom, even with things we love, is our brain striving to explore new options, learn about the world, and find choices that might be even better than the ones we already have. It’s the brain thinking that you’ve got a 10 now, but the next one might be a 15. And this varies from person to person: some people really like to play it safe with their choices, while others are eager to try all sorts of new things.

As we’ve seen, exploring (trying new choices) helps you learn more about the world and be better informed about what to choose next. But you cannot make two choices at the same time, so when you’re exploring, you’re giving up the options that you know are good (and the new ones might turn out to be wrong choices!).

On the other hand, picking the choices that you already know are good, or “playing it safe”, is called exploitation. That is, you’re making use of the knowledge gained from your previous experiences.

Neither exploration nor exploitation is better than the other. There’s a balance that has to be met and there’s no one-size-fits-all way to find that balance.
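One common way to strike that balance is known as an epsilon-greedy strategy: most of the time you exploit your current favorite, and with a small probability you explore at random. The sketch below applies it to the cuisine example. The ratings, the noise level and the names `epsilon_greedy`, `true_ratings` and `meals` are all assumptions made for illustration.

```python
import random


def epsilon_greedy(true_ratings, epsilon, meals=1000, seed=0):
    """Count how often each cuisine is picked over a number of meals."""
    rng = random.Random(seed)
    cuisines = list(true_ratings)
    totals = {c: 0.0 for c in cuisines}   # summed enjoyment per cuisine
    counts = {c: 0 for c in cuisines}     # how often each was chosen

    def estimate(cuisine):
        # Average enjoyment so far; untried cuisines start at 0.
        return totals[cuisine] / counts[cuisine] if counts[cuisine] else 0.0

    for _ in range(meals):
        if rng.random() < epsilon:
            choice = rng.choice(cuisines)            # explore: try anything
        else:
            choice = max(cuisines, key=estimate)     # exploit: current favorite
        # Each meal is a noisy sample around the (hidden) true rating.
        enjoyment = rng.gauss(true_ratings[choice], 1.0)
        totals[choice] += enjoyment
        counts[choice] += 1
    return counts


# Hypothetical tastes: Italian is a solid 10, Korean is secretly a 15.
ratings = {"Italian": 10, "Korean": 15, "Thai": 6}
print(epsilon_greedy(ratings, epsilon=0.0))   # never explores
print(epsilon_greedy(ratings, epsilon=0.1))   # explores 10% of the time
```

With `epsilon=0.0` the agent settles on the first cuisine it tries and never finds out that something better exists. A small amount of exploration costs a few disappointing meals but lets it discover the higher-rated option.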

Recap: there’s risk involved with exploration but the upside is that we get to learn. There’s little immediate risk with exploitation but we don’t learn much by playing it safe.

The Importance of Environment

In the numbered ball task, it was difficult to make an informed decision because you, as the agent, didn’t know how the balls were numbered. There was very little information about the environment. You had to just keep on drawing balls until you got a sense of their numbering.

The cuisine example is a little bit easier, because while Italian and Korean cuisine are really different, if you know that you like the garlicky taste of spaghetti aglio e olio, it’s very likely that you’ll also enjoy some manul changachi, because it’s also garlic-based.

In such an environment, where there’s more clarity, exploration is less crucial.

To put that into a real-world example, if you’re playing blackjack, it’s not smart to ask for another card when you’re already at 20. You know this because you know the rules, you know your cards, and you know the chances of overshooting your goal of 21. There’s a rich environment from which to draw clues and make a good decision based on known rules.
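The blackjack intuition can be checked with a little arithmetic. The sketch below makes simplifying assumptions: every card rank is equally likely on the next draw (cards already dealt are ignored), the ace in the next draw counts as 1, and your current total is a "hard" total with no ace counted as 11. The name `bust_probability` is made up for this example.

```python
from fractions import Fraction

# One value per rank: ace counted as 1, then 2 to 10, then jack, queen, king as 10.
CARD_VALUES = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10]


def bust_probability(hand_total):
    """Chance that one more card pushes a hard total over 21."""
    busting = sum(1 for value in CARD_VALUES if hand_total + value > 21)
    return Fraction(busting, len(CARD_VALUES))


print(bust_probability(20))   # 12/13: only an ace keeps you in the game
print(bust_probability(11))   # 0: no single card can take you past 21
```

Because the rules are known, this number can be computed up front without any trial and error.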

In this type of environment, we’re not really working in the arena of reinforcement learning anymore. It’s not about exploration and exploitation. It’s something more akin to what neural networks use. However, it’s not a black-and-white classification, so there are techniques that mix the approaches from both areas, like deep reinforcement learning.

Takeaways: reinforcement learning is another kind of search (optimization process). Improving that search makes it easier to find good results.

Now that you’ve grasped the basics of reinforcement learning via our non-technical description, you should feel confident about launching a more in-depth exploration. As you may have guessed, there’s no shortage of internet-based resources for this.