Reinforcement Learning
Often, we associate learning in humans or animals with discovering an optimal behavior to achieve certain goals through interaction with the environment. The interaction involves observing the current state of the environment and then taking an action that influences it. Such actions have consequences, and discovering a behavior that leads to favorable consequences toward some long-term goal is a natural way of thinking about intelligence. Reinforcement learning takes this approach of goal-directed learning through interaction with the environment. In reinforcement learning problems, we have an entity, or agent, that learns a policy, or behavior, associating environmental states with the actions that maximize a reward signal. One can think of the reward signal as a numerical measure of the consequences of the agent’s actions in a given state of the environment.
In the previous section, we learned about supervised learning. Supervised learning involves learning a mapping from input states, representative of all the states the system might encounter in the future, to correct labels that are known during training. Unlike supervised learning, reinforcement learning problems do not have access to correct labels for all the representative input states an agent will operate on in the future. Unsupervised learning methods, which discover hidden structure in unlabeled input data, also differ from reinforcement learning methods, which focus on learning from the agent’s experience of interacting with its environment to maximize the reward signal.
Figure 1: Diagram of the reinforcement learning system, showing the interaction between an agent and its environment.
In its simplest form, a reinforcement learning system contains an agent and the environment, or a model of the environment, in which the agent functions. Figure 1 depicts the flow of a typical reinforcement learning system. The policy that defines the behavior of the agent and the reward signal generated by the environment together define the whole reinforcement learning system.
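The loop in Figure 1 can be written down directly. Below is a minimal sketch of that agent-environment loop; the Environment and Agent classes here are illustrative stand-ins with made-up dynamics, not the interface of any particular library.

```python
# Minimal sketch of the agent-environment loop from Figure 1.
# Environment and Agent are hypothetical stand-ins, not a specific library API.

import random

class Environment:
    """Toy environment: two states, two actions, simple reward."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves to state 1 and pays reward 1; action 0 pays 0.
        reward = 1.0 if action == 1 else 0.0
        self.state = action
        done = False
        return self.state, reward, done

class Agent:
    def act(self, state):
        return random.choice([0, 1])      # placeholder policy

    def learn(self, state, action, reward, next_state):
        pass                              # update the policy from experience

env, agent = Environment(), Agent()
state = env.reset()
for t in range(10):
    action = agent.act(state)                       # agent selects an action
    next_state, reward, done = env.step(action)     # environment responds
    agent.learn(state, action, reward, next_state)  # agent updates its policy
    state = next_state
```

A real agent would replace the random act method with its policy and update that policy inside learn.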
There are several approaches to solving reinforcement learning problems. The class of reinforcement learning methods that search the space of policies includes evolutionary methods. Evolutionary methods use non-learning agents to execute different policies and accumulate the rewards received over the lifetime of each agent, then select the policy that generated the highest accumulated reward. These approaches use optimization techniques such as genetic algorithms, simulated annealing, and particle swarm optimization to compute the best policy.
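As a concrete illustration of policy search, the sketch below mutates policy parameters at random and keeps a mutant only if it accumulates more reward over an episode, which is the simplest form of evolutionary search; the toy task and mutation scheme are assumptions made for the example.

```python
# Sketch of evolutionary policy search: mutate policy parameters, keep the
# mutant if it accumulates more reward over an episode. The environment and
# policy representation are illustrative assumptions.

import random

def run_episode(policy, steps=20):
    """Accumulate reward for a non-learning agent executing `policy`.
    The toy task rewards choosing action 1 in every state."""
    total, state = 0.0, 0
    for _ in range(steps):
        action = 1 if policy[state] > 0 else 0
        total += 1.0 if action == 1 else 0.0
        state = (state + 1) % len(policy)
    return total

def evolve(n_states=4, generations=50):
    best = [random.uniform(-1, 1) for _ in range(n_states)]   # initial parameters
    best_return = run_episode(best)
    for _ in range(generations):
        candidate = [w + random.gauss(0, 0.5) for w in best]  # mutate
        candidate_return = run_episode(candidate)
        if candidate_return > best_return:                    # keep the better policy
            best, best_return = candidate, candidate_return
    return best, best_return

policy, ret = evolve()
print("best accumulated reward:", ret)
```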
A reward signal provides a measure of the immediate response of the environment to an action. To define the learning goal in terms of the reward signal, one can specify a value function. The value function specifies the long-term expectation of reward under a given policy. A class of reinforcement learning methods focuses on approximating this value function. To approximate the value function, the agent must explore new actions to develop its knowledge of the rewards. The approximated value function is then used to compute a deterministic policy. Temporal difference methods and Q-learning are examples of value function approximation methods. To maximize the value function, the agent must prefer the actions that generated higher rewards in the past; yet to discover such actions in the first place, it must try actions it has not selected before. The learning agent thus needs to consider a trade-off between exploring new actions that may prove highly rewarding and exploiting its existing knowledge of actions that already yield good rewards to achieve the long-term goal. This problem of balancing exploration against exploitation is unique to reinforcement learning.
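The following is a minimal sketch of tabular Q-learning with epsilon-greedy exploration on a small chain task, showing both the temporal-difference update and the exploration/exploitation trade-off in a few lines; the environment and the constants (learning rate, discount factor, epsilon) are illustrative choices.

```python
# Sketch of tabular Q-learning with epsilon-greedy exploration on a small
# chain environment (reach state 4 for reward 1). Constants are illustrative.

import random

N_STATES, ACTIONS = 5, [0, 1]          # action 0 = left, 1 = right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    next_state = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward, next_state == N_STATES - 1

for episode in range(200):
    state = 0
    for t in range(100):                        # cap episode length
        # Explore with probability EPSILON, otherwise exploit current knowledge.
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)
        else:
            best = max(Q[(state, a)] for a in ACTIONS)
            action = random.choice([a for a in ACTIONS if Q[(state, a)] == best])
        next_state, reward, done = step(state, action)
        # Temporal-difference update toward reward + discounted best future value.
        target = reward + GAMMA * max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (target - Q[(state, action)])
        state = next_state
        if done:
            break

# The deterministic (greedy) policy derived from the learned value estimates:
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)}
print(policy)
```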
Another approach to solving reinforcement learning problems is to use an independent function approximator to approximate the policy function directly. This class of reinforcement learning methods is called policy gradient methods. The policy function can be represented by a deep neural network or a parametric probability distribution. These methods estimate the parameters of the policy function approximator by adjusting their values approximately in proportion to the gradient, in the direction of higher reward. Some policy gradient methods also maintain an estimate of the value function; these are called Actor-Critic methods. Within this class, we have methods that compute the gradient explicitly, like the Natural Actor-Critic algorithm[i], and methods that estimate it from sampled returns, like REINFORCE[ii].
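To make the gradient step concrete, here is a sketch of a REINFORCE-style update on a two-armed bandit with a softmax policy over action preferences; the task, reward probabilities, and learning rate are assumptions made for the example.

```python
# Sketch of a REINFORCE-style policy gradient update on a two-armed bandit.
# The policy is a softmax over per-action preferences; parameters are nudged
# in the direction of higher expected reward. Constants are illustrative.

import math
import random

theta = [0.0, 0.0]          # policy parameters (action preferences)
LEARNING_RATE = 0.1
TRUE_REWARDS = [0.2, 0.8]   # assumed success probability of each action

def softmax(prefs):
    exps = [math.exp(p) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

for step in range(2000):
    probs = softmax(theta)
    action = random.choices([0, 1], weights=probs)[0]
    reward = 1.0 if random.random() < TRUE_REWARDS[action] else 0.0
    # Likelihood-ratio gradient: d log pi(a) / d theta_k = 1{k == a} - pi(k)
    for k in range(2):
        grad_log_pi = (1.0 if k == action else 0.0) - probs[k]
        theta[k] += LEARNING_RATE * reward * grad_log_pi

print("action probabilities:", softmax(theta))   # should favor the better arm
```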
Examples
Reinforcement learning algorithms are used in various control problems, most famously in playing games, and in intelligent transportation systems such as traffic control and self-driving cars.
a. Playing video games:
A reinforcement learning algorithm is used to learn optimal policies to play Atari 2600 games. A deep convolutional neural network directly encodes the raw pixels of the video game into a value function. A Q-learning algorithm combined with the deep convolutional neural network learns a policy that earns a higher score on these games. [iii] trained a model on seven Atari games and reported that it outperformed a human expert player on three of them.
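A sketch of such a convolutional Q-network is shown below, written with PyTorch and loosely in the spirit of [iii]: stacked game frames go in, one action value per button comes out. The layer sizes, the 84x84 four-frame input, and the six-action output are illustrative assumptions, and the frame preprocessing, replay memory, and training loop are omitted.

```python
# Sketch of a convolutional Q-network that maps raw game frames to one action
# value per button. Layer sizes and input shape are illustrative assumptions.

import torch
import torch.nn as nn

class ConvQNetwork(nn.Module):
    def __init__(self, n_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),  # 4 stacked grayscale frames
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),
            nn.ReLU(),
            nn.Linear(256, n_actions),  # one Q-value per possible action
        )

    def forward(self, frames):          # frames: (batch, 4, 84, 84)
        return self.head(self.features(frames))

q_net = ConvQNetwork(n_actions=6)
q_values = q_net(torch.zeros(1, 4, 84, 84))
action = q_values.argmax(dim=1)         # greedy action from the Q-values
```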
b. Adaptive traffic control:
The efficient management of traffic flow can significantly reduce traffic congestion. Reinforcement learning is used to develop control algorithms that read the current state of the transportation network and take actions with the goal of reducing congestion. In this case, the actions include changing the traffic lights[iv] or dynamically pricing toll roads[v].
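One way such a problem might be encoded for a learning agent is sketched below: the state is the queue length on each approach, the action is which approach gets a green light, and the reward is the negative total queue. The dynamics are an illustrative toy, not the formulation used in [iv] or [v].

```python
# Sketch of encoding a single intersection as a reinforcement learning problem:
# state = queue lengths per approach, action = which approach gets green,
# reward = negative total queue (less congestion means higher reward).

import random

class IntersectionEnv:
    def __init__(self):
        self.queues = [0, 0]            # cars waiting north-south and east-west

    def reset(self):
        self.queues = [random.randint(0, 5), random.randint(0, 5)]
        return tuple(self.queues)

    def step(self, action):
        # New cars arrive on both approaches.
        self.queues = [q + random.randint(0, 2) for q in self.queues]
        # The approach given green (action 0 or 1) discharges up to 3 cars.
        self.queues[action] = max(0, self.queues[action] - 3)
        reward = -sum(self.queues)      # penalize total congestion
        return tuple(self.queues), reward

env = IntersectionEnv()
state = env.reset()
for t in range(5):
    action = 0 if state[0] >= state[1] else 1   # simple heuristic: serve longer queue
    state, reward = env.step(action)
    print(t, state, reward)
```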
c. Self-driving Cars:
In self-driving cars, the autonomous system needs to plan its driving actions to navigate the streets. Supervised models detect objects such as other cars and pedestrians in the path of the car. Another set of algorithms predicts the future state of the car’s environment based on the movements of the detected objects. Based on this input, the car needs to make decisions to avoid these objects. A reinforcement learning system can learn an optimal policy to execute control actions like braking, accelerating, and turning.