Instructions: Compile all solutions to the written problems on this assignment in a single PDF
file (typed, or handwritten legibly if absolutely necessary). Please show your work by either
writing down the relevant equations or expressions for each problem, or by explaining
any logic that you are using to bypass known equations. Coding solutions may be directly
implemented in the provided Python file(s). When ready, follow the submission instructions to
submit all files to Gradescope. Please be mindful of the deadline, as late submissions are not
accepted, as well as our course policies on academic honesty.
Problem 1: UCB Bandits (12 points)
In this problem you will be running several instances of the provided UCB bandit Python script.
You will not be writing or turning in code, although you will be asked to show certain plot outputs
in your PDF. The script simulates an n-armed UCB bandit, where the reward for each action follows
a normal distribution. The first two arguments to the UCB bandit function are a list of n distribution means
and a list of n distribution variances. The third argument is the exploration parameter c, and the
fourth (optional) argument is the number of iterations for which to simulate the bandit.
Three plots are shown when the script is run. The first two show the action values Qt and upper
confidence interval (first and second terms of the UCB expression) of each arm at each iteration.
The third plot shows the arm (action) taken in each iteration. Actions are represented as integers
starting from 0.
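
For orientation, the simulation described above can be sketched as follows. This is a hypothetical stand-in, not the provided script: the function name, the seed argument, the default iteration count, and the rule that untried arms are pulled first are all assumptions. It uses the standard UCB selection rule, argmax over a of Qt(a) + c * sqrt(ln t / Nt(a)), where Nt(a) is the number of times arm a has been pulled so far.

```python
import math
import random


def ucb_bandit_sketch(means, variances, c, iterations=1000, seed=0):
    """Simulate an n-armed UCB bandit with normally distributed rewards.

    Hypothetical illustration only; the provided script may differ.
    """
    rng = random.Random(seed)
    n = len(means)
    q = [0.0] * n      # running action-value estimates Qt(a)
    counts = [0] * n   # Nt(a): number of pulls of each arm so far
    actions = []       # arm index chosen at each iteration

    for t in range(1, iterations + 1):
        # Assumption: an untried arm gets an infinite score, so it is pulled first.
        scores = [
            q[a] + c * math.sqrt(math.log(t) / counts[a]) if counts[a] > 0 else math.inf
            for a in range(n)
        ]
        a = max(range(n), key=lambda i: scores[i])
        # random.gauss takes a standard deviation, so convert the variance.
        reward = rng.gauss(means[a], math.sqrt(variances[a]))
        counts[a] += 1
        q[a] += (reward - q[a]) / counts[a]  # incremental sample mean
        actions.append(a)

    return q, counts, actions
```

Under these assumptions, a call such as ucb_bandit_sketch([0, 0, 0], [1, 1, 1], c=2) returns the final estimates, the pull counts, and the action sequence, which correspond to the quantities shown in the three plots.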
(a) Simulate two 3-armed bandit experiments with means [0, 0, 0] and variances [1, 1, 1].
For experiment 1, select a value of c such that all Qt values converge to Q*; for experiment 2,
select c such that at least one Qt value does not converge to Q*. Show the three plots for each
experiment. Briefly explain why your c values lead to the two different outcomes. Contrast the
trend of the confidence intervals and distribution of actions taken in each.
(b) Now suppose our three arms have very different means: [5, 0, -5]. Simulate two bandit
experiments with these means, and variances [1, 1, 1], once with c = 1 and once with
c = 10. Again show the plots for each scenario. Comment (for each scenario) on how the Qt
values of each action update over time (if at all), how the confidence intervals evolve, and the
distribution of the actions taken.
(c) Let’s consider a scenario in which the means are different but very close, while variances are
very large. Simulate bandit experiments with means [1, 0, -1], variances [10, 10, 10],
and c = 1. Do so until you see an outcome in which the “dominant” action (the one taken most often) is
not action 0, which has the largest mean. Show the plots for this result and explain why we ended up
converging on a suboptimal action. Briefly explain the importance of c and exploration when
we have bandits with large variances.
Problem 2: Mini-Blackjack (12 points)
We will model a mini-blackjack game as an MDP. The goal is to draw cards from a deck containing
2s, 3s, and 4s (with replacement) and stop with a card sum as close to 6 as possible without going
over. The possible card sums form the states: 0, 2, 3, 4, 5, 6, “done”. The last state is terminal
and has no associated actions. From all other states, one action is to draw a card and advance
to a new state according to the new card sum, with “done” representing any card sum above 6.
Alternatively, one may stop and receive reward equal to the current card sum, also advancing to
“done”.
(a) Draw a state transition diagram of this MDP. The diagram should be a graph with seven
nodes, one for each state. Draw edges that represent transitions between states due to the
draw action only; you may omit transitions due to the stop action. Write the transition
probabilities adjacent to each edge.
(b) Based on the given information and without solving any equations, what are the optimal actions
and values of states 5 and 6? You may assume that V*(done) = 0. Then using γ = 1, solve for
the optimal actions and values of states 4, 3, 2, and 0 (you should do so in that order). Briefly
explain why dynamic programming is not required for this particular problem.
(c) Find the largest value of γ that could lead to different optimal actions in
both states 2 and 3 (compared to those above). Compute the values of states 3, 2, and 0 for
the discount factor that you found. Briefly explain why a lower value of γ decreases the values
of these states but not those of the other states.
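
The draw and stop dynamics of the mini-blackjack game described above can be written down as a small model. This is only a sketch under stated assumptions: the names below are hypothetical, the three card values are taken to be equally likely, and any sum above 6 is treated as a bust that leads to "done". It encodes the rules only and does not compute any values.

```python
from fractions import Fraction

CARDS = (2, 3, 4)   # assumption: each card value is drawn with equal probability
TARGET = 6
DONE = "done"
STATES = (0, 2, 3, 4, 5, 6, DONE)


def draw_transitions(state):
    """Return {next_state: probability} for the draw action from `state`."""
    if state == DONE:
        return {}  # terminal state: no actions available
    p = Fraction(1, len(CARDS))
    out = {}
    for card in CARDS:
        total = state + card
        # Assumption: every sum above the target is a bust and maps to "done".
        nxt = total if total <= TARGET else DONE
        out[nxt] = out.get(nxt, 0) + p
    return out


def stop_reward(state):
    """Reward for the stop action: the current card sum (the next state is "done")."""
    return state
```

Exact fractions are used so that the probabilities leaving each state can be checked to sum to 1 without rounding error.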
Problem 3: Dynamic Programming (12 points)
Let’s revisit the mini-blackjack game but from the perspective of dynamic programming. You will
be thinking about both value iteration and policy iteration at the same time. Assume γ = 1.
(a) Let’s initialize the time-limited state values: V0(s) = 0 for all s. Find the state values of V1
after one round of value iteration. You do not need to write out every calculation if you can
briefly explain how you infer the new values.
(b) Coincidentally, V0 are also the values for the (suboptimal) policy π0(s) = draw for all s. If we
were to run policy iteration starting from π0, what would be the new policy π1 after performing
policy improvement? Choose the stop action in the case of ties.
(c) Perform a second round of value iteration to find the values V2. Have the values converged?
(d) Perform a second round of policy iteration to find the policy π2. Has the policy converged?
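
The two operations used in this problem, one round of value iteration and one policy improvement step, can be sketched generically as below. The function names, the MDP dictionary layout, and the toy MDP are all hypothetical and deliberately unrelated to the mini-blackjack numbers. Each action maps to a list of (probability, next state, reward) outcomes, and a state with no actions is treated as terminal with value 0.

```python
def bellman_backup(V, mdp, gamma):
    """One round of value iteration: V_new(s) = max_a sum p * (r + gamma * V(s'))."""
    new_V = {}
    for s, actions in mdp.items():
        if not actions:          # terminal state keeps value 0
            new_V[s] = 0.0
            continue
        new_V[s] = max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
            for outcomes in actions.values()
        )
    return new_V


def greedy_policy(V, mdp, gamma, tie_order):
    """Policy improvement: choose the action with the best one-step lookahead.

    Ties go to whichever action appears first in `tie_order`.
    """
    policy = {}
    for s, actions in mdp.items():
        if not actions:
            continue

        def q(a):
            return sum(p * (r + gamma * V[s2]) for p, s2, r in actions[a])

        # max() returns the first maximal element, so tie_order breaks ties.
        policy[s] = max((a for a in tie_order if a in actions), key=q)
    return policy


# Hypothetical toy MDP for illustration only.
toy_mdp = {
    "A": {"stay": [(1.0, "end", 1.0)],
          "go": [(0.5, "B", 0.0), (0.5, "end", 0.0)]},
    "B": {"stay": [(1.0, "end", 4.0)]},
    "end": {},
}

V0 = {s: 0.0 for s in toy_mdp}
V1 = bellman_backup(V0, toy_mdp, gamma=1.0)   # {"A": 1.0, "B": 4.0, "end": 0.0}
V2 = bellman_backup(V1, toy_mdp, gamma=1.0)   # {"A": 2.0, "B": 4.0, "end": 0.0}
```

In the toy example the greedy action in state A changes from "stay" (with respect to V0) to "go" (with respect to V1), which shows how the values and the policy can be tracked side by side from one round to the next.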
Problem 4: Reinforcement Learning (12 points)
Let’s now study the mini-blackjack game from the perspective of using reinforcement learning to
learn an optimal policy. Again, assume γ = 1. We initialize all values and Q-values to 0 and
observe the following episodes of state-action sequences: