Reinforcement Learning

Here we discuss and build on some of the principles of Reinforcement Learning, laying the foundations for future posts about our Concept Analysis & Decision Support tools.

Broadly speaking, the field of Machine Learning can be split into three categories: Supervised Learning, Unsupervised Learning and Reinforcement Learning.

Supervised Learning methods typically rely on large sets of annotated data. For example, if we wanted to create an algorithm that could distinguish between images of different animals, we’d need thousands of training images correctly labelled with the animal in each image. This labelling process can be very time-consuming, especially if it relies on human domain experts. During training, Supervised Learning algorithms repeatedly calculate the difference between each datapoint’s label and the model’s corresponding prediction, and use this difference as a signal to update the model’s internal parameters.
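
As a rough illustration of that update loop, here is a minimal sketch using a made-up linear model and random data (plain NumPy, not any particular framework):

```python
import numpy as np

# A minimal sketch of the supervised learning loop: compare each
# prediction against its label and nudge the parameters accordingly.
# The data, model and learning rate here are illustrative only.
X = np.random.randn(100, 5)          # 100 datapoints, 5 features
y = np.random.randn(100)             # human-provided labels
w = np.zeros(5)                      # model parameters

learning_rate = 0.01
for epoch in range(50):
    predictions = X @ w              # model's predictions
    errors = predictions - y         # difference from the labels
    gradient = X.T @ errors / len(y) # signal used to update parameters
    w -= learning_rate * gradient
```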

Unsupervised Learning algorithms usually don’t aim to classify datapoints; instead, they are typically used to find patterns and structure within datasets. As the name suggests, they don’t rely on having labelled training data. Clustering algorithms are a good example of Unsupervised Learning.
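
As a small illustrative sketch, a clustering run might look like the following (scikit-learn’s KMeans on made-up data; the choice of three clusters is an arbitrary assumption):

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabelled data: the algorithm groups similar points without any
# human-provided annotations.
X = np.random.randn(200, 2)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)   # which cluster each point fell into
```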

Finally, Reinforcement Learning (RL) algorithms are trained via repeated interaction with an environment. This environment could be physical, such as a real robot moving around a maze, or, more typically, simulated, such as a physics engine modelling a robot moving around a maze.

In RL problems, the algorithm’s goal is to learn a policy that maps the agent’s current perceived state of the environment to the action that will maximise the future reward gained. For example, if the agent were learning to play chess, it would need to learn a policy that mapped the current state of the board to the move that maximises its likelihood of winning the game.
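
At its core, the interaction looks something like the sketch below, where `env` and `policy` are hypothetical placeholders with a gym-style reset/step interface rather than any specific library:

```python
# A sketch of the generic RL interaction loop. `env` and `policy`
# are stand-ins: the policy maps the current state to an action,
# and the environment returns the next state and a reward.
def run_episode(env, policy):
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(state)                    # policy: state -> action
        state, reward, done = env.step(action)    # environment feedback
        total_reward += reward
    return total_reward
```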

Current RL algorithms typically rely on vast amounts of trial-and-error interaction with the environment in order to explore the state space fully and learn the optimal actions. In most non-trivial RL problems we have to rely on policies that can aggregate similar states, as it is computationally infeasible for the agent to experience and store every possible state during training. DeepMind’s breakthrough Atari games paper performed this state aggregation using Deep Neural Networks.

A Maze Example

We could imagine a situation where an agent needs to learn the shortest path to a position on a grid:

Here, at each time-step, the agent can choose one of four actions (North, East, South or West).

It’s important to note that in this example the agent has no model of the environment that it can use to plan its route. Its only way of learning is to repeatedly submit actions to the environment and see what it gets back.

In the diagram above we can see that each time the agent submits an action it gets back the new state from the environment (i.e. which square on the grid it now occupies) and a reward.

The reward is given by a function, specified as part of the problem, that maps the current state to a scalar value. It is the job of the agent to maximise the amount of reward it gets back over multiple time steps.

In our example above, because we want the agent to find the shortest route to the goal state, we give the agent a reward of -1 for every step it takes until it reaches the goal state. To maximise its total reward, the agent therefore has to learn the shortest route.
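
A toy version of this environment might be sketched as follows; the grid size, start square and goal square are illustrative assumptions:

```python
# A toy grid-world matching the description above: four actions, a
# reward of -1 per step, and the episode ends at the goal square.
ACTIONS = {"N": (-1, 0), "E": (0, 1), "S": (1, 0), "W": (0, -1)}

class GridWorld:
    def __init__(self, size=5, goal=(4, 4)):
        self.size, self.goal = size, goal

    def reset(self):
        self.pos = (0, 0)
        return self.pos

    def step(self, action):
        dr, dc = ACTIONS[action]
        r = min(max(self.pos[0] + dr, 0), self.size - 1)
        c = min(max(self.pos[1] + dc, 0), self.size - 1)
        self.pos = (r, c)
        done = self.pos == self.goal
        return self.pos, -1, done      # -1 reward per step until the goal
```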

The agent learns during training by repeatedly updating its internal policy based on the feedback it receives from the environment. The policy could be a simple lookup table, where the key is the current state and the value is the agent’s preference for each action in that state (i.e. whether to go North, South, East or West), or it could be a Neural Network that takes the state as input and outputs the action to take.
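
For the lookup-table case, a sketch of the learning loop could look like the following (standard tabular Q-learning with illustrative hyperparameters, reusing the toy GridWorld above; this is an example, not a description of our own tooling):

```python
import random
from collections import defaultdict

# A lookup-table policy trained with tabular Q-learning. Each entry
# Q[state][action] stores the agent's current preference for taking
# that action in that state.
Q = defaultdict(lambda: {a: 0.0 for a in ACTIONS})
alpha, gamma, epsilon = 0.1, 0.99, 0.1

env = GridWorld()
for episode in range(500):
    state, done = env.reset(), False
    while not done:
        if random.random() < epsilon:                 # occasional exploration
            action = random.choice(list(ACTIONS))
        else:                                         # otherwise act greedily
            action = max(Q[state], key=Q[state].get)
        next_state, reward, done = env.step(action)
        target = reward + gamma * max(Q[next_state].values()) * (not done)
        Q[state][action] += alpha * (target - Q[state][action])
        state = next_state
```

Given enough episodes of exploration, acting greedily with respect to Q then traces the shortest route to the goal.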

As mentioned previously, RL algorithms are typically very sample inefficient, meaning they require huge amounts of data (interactions with the environment) in order to learn the optimal policy. They are also typically very bad at transferring skills learnt in one environment to another.

For example, DeepMind’s algorithm is unable to transfer any of the skills it learnt playing Breakout to another Atari game. Instead, it essentially has to learn the new game from scratch, and in doing so forgets how to play the original Breakout game. This issue of ‘catastrophic forgetting’ is a very common problem for Reinforcement Learning algorithms, and highlights a clear difference from how humans learn.

Hierarchical Reinforcement Learning

Hierarchical Reinforcement Learning (HRL) aims to tackle some of the limitations described above. Specifically, HRL algorithms aim to reuse skills learnt on one problem when tackling a different one. For example, we could imagine a new problem where the agent needs to reach a new goal state:

We can now imagine how an HRL agent might tackle this problem.

Whereas before our agent could only choose one of four atomic actions (North, East, South or West), we will now give it the further option of deferring control to the policy learnt in the previous door-finding problem.

In doing so, this new meta-agent would build on the experience obtained during the previous task, meaning it would be able to solve the new problem with fewer interactions with the environment. Also, during the training of the meta-agent the parameters of the previous policy would not be updated, meaning it wouldn’t forget how to solve the original task.
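
One hypothetical way to express that extra choice, building on the toy sketches above, is shown below; `door_policy` and `reached_subgoal` are made-up placeholders for the frozen sub-policy and its termination condition:

```python
# A sketch of the augmented action set available to the meta-agent:
# the four atomic moves plus an option that defers control to a frozen
# sub-policy learnt on the earlier task.
META_ACTIONS = list(ACTIONS) + ["USE_DOOR_POLICY"]

def execute(env, state, meta_action, door_policy, reached_subgoal):
    if meta_action != "USE_DOOR_POLICY":
        return env.step(meta_action)                 # ordinary atomic action
    # Defer control: run the frozen sub-policy until its sub-goal (or the
    # end of the episode) is reached. Its parameters are never updated
    # here, so the original skill is not forgotten.
    total_reward, done = 0.0, False
    while not done and not reached_subgoal(state):
        state, reward, done = env.step(door_policy(state))
        total_reward += reward
    return state, total_reward, done
```

This mirrors the common ‘options’ formulation of HRL, in which a temporally extended action runs a sub-policy to completion before handing control back to the higher-level agent.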

There would then be nothing stopping us from creating another meta-agent a level above our current agent to solve yet another higher-level task.

In the example given above, it could be argued that we cheated by defining a sub-goal that we knew would help the agent solve the higher-level task.

Ideally, we’d like the agent to be able to autonomously determine which sub-goals it should learn in order to solve higher level tasks. We could imagine an agent exploring an environment without any reward function, and instead using features of the state space’s topology to identify effective sub-goals.
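
As one heavily simplified illustration of that idea (an assumption on our part, not a specific published method), an agent could log its transitions, build a graph over visited states and pick out ‘bottleneck’ states, for example via betweenness centrality:

```python
import networkx as nx

# A toy illustration of topology-based sub-goal identification: build a
# graph from logged (state, next_state) transitions and treat the states
# with the highest betweenness centrality (bottlenecks such as doorways)
# as candidate sub-goals. The transition log is a hypothetical input.
def candidate_subgoals(transitions, top_k=3):
    graph = nx.Graph()
    graph.add_edges_from(transitions)          # transitions: [(s, s'), ...]
    centrality = nx.betweenness_centrality(graph)
    ranked = sorted(centrality, key=centrality.get, reverse=True)
    return ranked[:top_k]
```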

Autonomous sub-goal discovery is very much an open problem in HRL and something that we will explore in a future blog post. Check back for further updates.