This report summarises the work done on reinforcement learning. The task considered here is a “homing” task in which our agent (e.g. a robot) starts at a specific location and must return to a particular location while avoiding an obstacle. The agent moves on a grid with four possible actions (Up, Down, Left, Right).
A single trial involves multiple steps on this grid. There are no explicit landmarks, but to simplify the task we assume that the robot has been familiarised with the environment.
Therefore, it has an internal representation of its position in the space. In this task, the grid is a space of 4 rows and 12 columns with a ‘cliff’ obstacle along one edge [1, 2]. A schematic is shown in figure 1. For the coordinates (i, j) used in the following, the first index represents the row and the second the column. The agent starts in square (1,1), denoted by “S”, and must reach square (1,12), denoted by “F”, without stepping on the cliff, which occupies squares (1,2) to (1,11). Remember to convert these coordinates into Python 0-based indices if necessary. Every movement the agent makes gives a reward of r = -1, but stepping on the cliff gives a reward of r = -100 and ends the trial. The trial ends when the agent reaches the goal square “F”, when it steps on the cliff, or when a maximum number of steps has been performed. Since these rewards are negative, the agent should try to reach the goal in as few steps as possible.
Figure 1: Schematic of the cliff walking task. The agent starts at (1,1) and must arrive at (1,12) by the shortest route possible. Each transition receives a reward of r = -1, apart from steps into the cliff region, which receive a reward of r = -100. The trial ends if the agent reaches its goal or falls off the cliff.
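As a concrete illustration, a minimal sketch of such an environment in Python is given below. The class name CliffWalk, its method names, and the max_steps default are illustrative rather than prescribed by the task; it uses 0-based indices internally, so the report's (1,1) start becomes (0,0) and the (1,12) goal becomes (0,11).

    class CliffWalk:
        """Sketch of a cliff-walking grid: 4 rows x 12 columns, 0-based indices."""

        # Action effects on (row, col); whether "Up" increases or decreases the
        # row index depends on how the grid is drawn -- here row 0 is the cliff edge.
        ACTIONS = {"Up": (1, 0), "Down": (-1, 0), "Left": (0, -1), "Right": (0, 1)}

        def __init__(self, n_rows=4, n_cols=12, max_steps=1000):
            self.n_rows, self.n_cols, self.max_steps = n_rows, n_cols, max_steps
            self.start, self.goal = (0, 0), (0, n_cols - 1)
            self.cliff = {(0, c) for c in range(1, n_cols - 1)}  # (0,1)..(0,10)

        def reset(self):
            self.pos, self.t = self.start, 0
            return self.pos

        def step(self, action):
            """Apply one action; return (next_state, reward, done)."""
            di, dj = self.ACTIONS[action]
            i = min(max(self.pos[0] + di, 0), self.n_rows - 1)  # walls: clip to grid
            j = min(max(self.pos[1] + dj, 0), self.n_cols - 1)
            self.pos = (i, j)
            self.t += 1
            if self.pos in self.cliff:
                return self.pos, -100, True  # stepping on the cliff ends the trial
            done = self.pos == self.goal or self.t >= self.max_steps
            return self.pos, -1, done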
Your task is to write a program in which the above-mentioned goal-oriented behaviour (homing) is learned using reinforcement learning in the following way:
This constitutes one run of the algorithm. You should use enough trials for the agent to learn the Q-values (or weights) properly; how many are needed may depend on the learning rate. The procedure will need to be repeated with different initial Q-values (or weights). During each run you should store the total reward (the sum of the rewards over the steps of each trial) against the trial number; we call this the learning curve. In order to produce learning curves that allow for a comparison, you need to repeat the procedure (at least 5 times, i.e. multiple runs) and produce a curve that is the average of the learning curves of the individual runs. On an average curve we typically use error bars that represent the standard deviation or standard error (use shaded error bars if this helps the visualisation). You may also need to smooth the curve, for instance with an exponential moving average; see the sketch below.
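A minimal sketch of this averaging, smoothing, and plotting step, assuming each run's learning curve is stored as one row of a NumPy array of shape (n_runs, n_trials). The file name learning_curves.npy, the variable names, and the smoothing factor alpha=0.1 are all illustrative:

    import numpy as np
    import matplotlib.pyplot as plt

    def exp_smooth(x, alpha=0.1):
        """Exponential moving average: s[t] = alpha*x[t] + (1-alpha)*s[t-1]."""
        s = np.empty_like(x, dtype=float)
        s[0] = x[0]
        for t in range(1, len(x)):
            s[t] = alpha * x[t] + (1 - alpha) * s[t - 1]
        return s

    # One learning curve (total reward per trial) per run, stacked as rows.
    rewards = np.load("learning_curves.npy")  # illustrative file name
    mean = rewards.mean(axis=0)
    sem = rewards.std(axis=0, ddof=1) / np.sqrt(rewards.shape[0])  # standard error

    smoothed = exp_smooth(mean)
    trials = np.arange(1, rewards.shape[1] + 1)
    plt.plot(trials, smoothed, label="mean over runs (smoothed)")
    plt.fill_between(trials, smoothed - sem, smoothed + sem, alpha=0.3)  # shaded error bars
    plt.xlabel("Trial")
    plt.ylabel("Total reward per trial")
    plt.legend()
    plt.show()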
Your report should address these points:
To which of these two categories do SARSA and Q-learning belong? Justify your answer.
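When justifying the answer, it may help to compare the two update rules directly: they differ only in the bootstrap target. A minimal sketch, assuming a Q-table stored as a NumPy array of shape (n_states, n_actions) with integer state and action indices (the function and parameter names are illustrative):

    import numpy as np

    def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
        # Target bootstraps on the action a_next that the agent actually takes next.
        Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

    def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
        # Target bootstraps on the maximising action, regardless of the action taken.
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])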