This report aims to summarise the work done on reinforcement learning. The task considered here is a “homing” task in which our agent (e.g. a robot) starts in a specific location and must return to a particular location while avoiding an obstacle. The agent moves on a grid with four possible actions (Up, Down, Left, Right).
A single trial involves multiple steps on this grid. There are no explicit landmarks, but to simplify the task we assume that the robot has been familiarised with the environment.
Therefore, it has an internal representation of its position in the space. In this task, the grid is a space of 4 rows and 12 columns with a ’cliff’ obstacle along one edge [1, 2]. A schematic is shown in figure 1. For the coordinates (i, j) used in the following, the first index represents the row and the second the column. The agent starts in square (1,1), denoted by “S”, and must reach square (1,12), denoted by “F”, without stepping on the cliff, which occupies squares (1,2) to (1,11). Remember to convert these into Python 0-based indices if necessary. Every movement the agent makes gives a reward of r = −1, but stepping on the cliff gives a reward of r = −100 and ends the trial. The trial ends when the agent reaches the final ’F’ square, when it steps on the cliff, or when a maximum number of steps has been performed. Since these rewards are negative, the agent should try to reach the goal in as few steps as possible.
Figure 1: Schematic of the cliff walking task. The agent starts at (1,1) and must arrive at (1,12) by the shortest route possible. Each transition receives a reward of r = −1, apart from steps into the cliff region, which give a reward of r = −100.
The trial ends if the agent reaches its goal or falls off the cliff.
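The grid, rewards, and termination rules described above can be sketched in Python as follows. The names, the 0-based coordinates, and the choice of row 0 as the bottom row are my own assumptions, not a prescribed interface:

```python
# A minimal sketch of the cliff-walking grid (assumed names throughout).
# Coordinates are 0-based: (0, 0) is the start "S", (0, 11) the finish "F",
# and (0, 1)..(0, 10) form the cliff along the bottom edge.

N_ROWS, N_COLS = 4, 12
START, FINISH = (0, 0), (0, 11)
CLIFF = {(0, c) for c in range(1, 11)}

# Actions as (row, col) offsets; row index increases upwards here.
ACTIONS = {"U": (1, 0), "D": (-1, 0), "L": (0, -1), "R": (0, 1)}

def step(state, action):
    """Apply one action; return (next_state, reward, done)."""
    di, dj = ACTIONS[action]
    i = min(max(state[0] + di, 0), N_ROWS - 1)  # off-grid moves are clipped,
    j = min(max(state[1] + dj, 0), N_COLS - 1)  # so the agent stays put
    if (i, j) in CLIFF:
        return (i, j), -100, True   # falling off the cliff ends the trial
    if (i, j) == FINISH:
        return (i, j), -1, True     # reaching "F" also ends the trial
    return (i, j), -1, False
```

Whether rows count upwards or downwards is arbitrary; only the relative placement of start, finish, and cliff matters.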
Your task is to write a program where the above-mentioned goal-oriented behaviour (homing) can be learned by using reinforcement learning in the following way:
- The agent is placed at the start location.
- Calculate the Q-values and select an action based on your policy.
- Apply the selected action. If the agent moves to any square that is not designated as the cliff region then it receives a reward of r = −1. If the action moves it onto the cliff then it receives a reward of r = −100 and the trial ends. If the agent would move off the grid, the action returns it to its current square, with the same reward of r = −1 applied.
- As it explores the space, the agent learns the Q-values/weights using either SARSA or Q-learning.
- The trial ends when the final square is reached, a predefined number of steps is exceeded, or the agent enters the cliff. The procedure repeats from the first step until a predefined number of trials is reached.
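Tying these steps together, one run of the algorithm could look like the following self-contained SARSA sketch. All names, default parameter values, and the tie-breaking behaviour of the greedy choice are assumptions made for illustration:

```python
import random

# Assumed 0-based layout: start bottom-left, finish bottom-right,
# cliff along the bottom edge between them.
N_ROWS, N_COLS = 4, 12
START, FINISH = (0, 0), (0, 11)
CLIFF = {(0, c) for c in range(1, 11)}
ACTIONS = [(1, 0), (-1, 0), (0, -1), (0, 1)]  # Up, Down, Left, Right

def step(s, a):
    i = min(max(s[0] + a[0], 0), N_ROWS - 1)  # off-grid moves are clipped
    j = min(max(s[1] + a[1], 0), N_COLS - 1)
    if (i, j) in CLIFF:
        return (i, j), -100, True
    return (i, j), -1, (i, j) == FINISH

def policy(Q, s, eps):
    """Epsilon-greedy; ties are broken by action order."""
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q.get((s, a), 0.0))

def run_sarsa(n_trials=500, alpha=0.5, eps=0.1, max_steps=1000):
    """One run: repeated trials from the start square; returns (Q, curve)."""
    Q, curve = {}, []
    for _ in range(n_trials):
        s, total = START, 0
        a = policy(Q, s, eps)
        for _ in range(max_steps):
            s2, r, done = step(s, a)
            total += r
            a2 = policy(Q, s2, eps)
            q = Q.get((s, a), 0.0)
            target = r if done else r + Q.get((s2, a2), 0.0)  # gamma = 1
            Q[(s, a)] = q + alpha * (target - q)
            s, a = s2, a2
            if done:
                break
        curve.append(total)  # total reward of this trial (learning curve)
    return Q, curve
```

Swapping the SARSA target for a max over actions in the next state would turn this into Q-learning.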
This constitutes one run of the algorithm. You should use enough trials for the agent to learn the Q-values (or weights) properly, which may depend on the learning rate. The procedure will need to be repeated with different initial Q-values (or weights). During each run you should store the total reward (the sum of the rewards over the steps of a trial) against the trial number; we call this the learning curve. In order to produce learning curves that allow for a comparison, you need to repeat the procedure (at least 5 times, i.e. multiple runs) and produce a curve that is the average of the learning curves of the individual runs. On an average curve we typically use error bars that represent the standard deviation or standard error (use shaded error bars if this helps the visualisation). You may need to consider smoothing techniques for the curve, see for instance the exponential average.
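The averaging and smoothing described above can be sketched as follows; the helper names are my own, and each learning curve is assumed to be a list of total rewards, one per trial:

```python
import statistics

def average_curves(curves):
    """Mean and (population) standard deviation per trial across runs.

    `curves` is a list of equal-length lists, one per run.
    """
    means = [statistics.mean(vals) for vals in zip(*curves)]
    stds = [statistics.pstdev(vals) for vals in zip(*curves)]
    return means, stds

def exp_smooth(curve, alpha=0.1):
    """Exponential moving average: s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    smoothed = [curve[0]]
    for x in curve[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed
```

The standard deviations can be drawn as shaded error bars around the mean curve (e.g. with matplotlib's `fill_between`); the standard error is obtained by dividing by the square root of the number of runs.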
Your report should address these points:
- Give a brief introduction to reinforcement learning and the task being considered here. What is the objective of reinforcement learning?
- Implement the SARSA and Q-learning algorithms on the cliff walking task. What are the update rules for the Q-values (weights) for each? Explain the algorithms and highlight their differences. Due to the nature of the task, use an undiscounted version (γ = 1). Explain why we can use a high value of γ in this case. Explain what is meant by the terms “on-policy” and “off-policy” methods.
To which of these two categories do SARSA and Q-learning belong? Justify your answer.
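For reference, the two update rules can be sketched side by side, with Q stored as a dictionary from (state, action) pairs to values. The function names and the dictionary representation are assumptions, not a required design:

```python
def sarsa_update(Q, s, a, r, s2, a2, alpha=0.5, gamma=1.0, done=False):
    """SARSA (on-policy): bootstrap on the action a2 actually taken in s2.

    Q(s,a) <- Q(s,a) + alpha * (r + gamma * Q(s2,a2) - Q(s,a))
    """
    q = Q.get((s, a), 0.0)
    boot = 0.0 if done else gamma * Q.get((s2, a2), 0.0)  # no bootstrap at terminal
    Q[(s, a)] = q + alpha * (r + boot - q)

def q_learning_update(Q, s, a, r, s2, actions, alpha=0.5, gamma=1.0, done=False):
    """Q-learning (off-policy): bootstrap on the greedy (max) action in s2.

    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_b Q(s2,b) - Q(s,a))
    """
    q = Q.get((s, a), 0.0)
    boot = 0.0 if done else gamma * max(Q.get((s2, b), 0.0) for b in actions)
    Q[(s, a)] = q + alpha * (r + boot - q)
```

The only difference is the bootstrap term: SARSA uses the value of the action the behaviour policy actually selects next, while Q-learning uses the maximum over actions regardless of what is selected.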
- Implement a policy that aids exploration (such as ε-greedy). Introduce and discuss the algorithm that you are using, what parameters control it and explain why exploration is important.
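One common choice of exploration policy is ε-greedy, which could be sketched as below (the function name, the Q-dictionary layout, and the random tie-breaking among equally good actions are assumptions):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon pick a uniformly random action (explore);
    otherwise pick a highest-valued action (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions)
    values = [Q.get((state, a), 0.0) for a in actions]
    best = max(values)
    # break ties at random among the greedy actions
    return random.choice([a for a, v in zip(actions, values) if v == best])
```

Here ε controls the exploration/exploitation trade-off: ε = 0 is purely greedy, ε = 1 is purely random.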
- Train your agent to find the finish location using both SARSA and Q-learning with a learning rate of 0.5. Plot the average learning curves over the first 500 trials and discuss how the different learning algorithms compare for this task. How does your chosen method of exploration affect the performance of either algorithm?
- Average the total reward over the trials after convergence has been reached, and average this over multiple runs, to give a metric of the agent’s performance. Plot this metric for both SARSA and Q-learning for different values of the learning rate and exploration factor (i.e. ε for the ε-greedy algorithm). Discuss how these factors affect the performance.
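This performance metric might be computed as follows, where `run_fn` stands for any training function that returns a learning curve (the signature, the `tail` window used as a proxy for "after convergence", and all names are assumptions):

```python
def performance(run_fn, n_runs=5, n_trials=500, tail=100, **params):
    """Average the total reward over the last `tail` trials of each run,
    then average that value over `n_runs` independent runs."""
    totals = []
    for _ in range(n_runs):
        _, curve = run_fn(n_trials=n_trials, **params)
        totals.append(sum(curve[-tail:]) / tail)
    return sum(totals) / n_runs
```

Sweeping `params` (e.g. the learning rate and ε) and plotting the returned value gives the requested comparison.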
- Propose a method to reveal the information about the preferred direction stored in the weights (or Q-values). Plot these preferred directions for SARSA and Q-learning and explain the behaviour of both.
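One natural method is to take, in each square, the greedy action argmax_a Q(s, a) and draw it as an arrow. A text-based sketch (assumed names, 0-based coordinates with row 0 at the bottom, and arrow glyphs of my choosing):

```python
# Map each (row, col) action offset to an arrow glyph; row index
# increases upwards in this assumed layout.
ARROW = {(1, 0): "^", (-1, 0): "v", (0, -1): "<", (0, 1): ">"}
ACTIONS = list(ARROW)

def preferred_directions(Q, n_rows=4, n_cols=12):
    """Return one string of arrows per grid row, top row first.

    Q maps ((row, col), action) pairs to values; unseen pairs count as 0.
    Ties go to the first action in ACTIONS.
    """
    rows = []
    for i in reversed(range(n_rows)):  # print the top row first
        row = ""
        for j in range(n_cols):
            best = max(ACTIONS, key=lambda a: Q.get(((i, j), a), 0.0))
            row += ARROW[best]
        rows.append(row)
    return rows
```

The same argmax per square can equally be rendered graphically, e.g. with matplotlib's `quiver`; the point is that the learned Q-values implicitly store a preferred direction in every state.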