
Instructions: Compile all solutions to the written problems on this assignment in a single PDF

file (typed, or handwritten legibly if absolutely necessary). Please show your work by either

writing down the relevant equations or expressions for each problem, or by explaining

any logic that you are using to bypass known equations. Coding solutions may be directly

implemented in the provided Python file(s). When ready, follow the submission instructions to

submit all files to Gradescope. Please be mindful of the deadline, as late submissions are not

accepted, as well as our course policies on academic honesty.

## Problem 1: UCB Bandits (12 points)

In this problem you will be running several instances of the provided UCB bandit Python script.

You will not be writing or turning in code, although you will be asked to show certain plot outputs

in your PDF. The script simulates an n-armed UCB bandit, where the reward for each action follows

a normal distribution. The arguments to the UCB bandit function are a list of n distribution means

and a list of n distribution variances. The third argument is the exploration parameter c, and the

fourth (optional) argument is the number of iterations for which to simulate the bandit.

Three plots are shown when the script is run. The first two show the action values $Q_t$ and upper

confidence interval (first and second terms of the UCB expression) of each arm at each iteration.

The third plot shows the arm (action) taken in each iteration. Actions are represented as integers

starting from 0.
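
The provided script is not reproduced here, so the following is only a minimal sketch of what such a simulation might look like. It assumes the standard UCB rule of pulling the arm that maximizes $Q_t(a) + c\sqrt{\ln t / N_t(a)}$, with sample-average updates of $Q_t$. The function name `ucb_bandit_sketch`, the `seed` parameter, and the return values are hypothetical and may not match the course script, which also produces plots that are omitted here.

```python
import math
import random


def ucb_bandit_sketch(means, variances, c, steps=1000, seed=0):
    """Simulate a UCB bandit with normal rewards (illustrative, not the course script)."""
    rng = random.Random(seed)      # seeded generator so runs are repeatable
    n = len(means)
    q = [0.0] * n                  # Q_t(a): sample-average value estimates
    counts = [0] * n               # N_t(a): number of pulls of each arm
    actions = []                   # arm chosen at each iteration

    for t in range(1, steps + 1):
        untried = [a for a in range(n) if counts[a] == 0]
        if untried:
            # An untried arm has an unbounded confidence term, so pull it first.
            a = untried[0]
        else:
            # First term: value estimate. Second term: exploration bonus.
            ucb = [q[i] + c * math.sqrt(math.log(t) / counts[i]) for i in range(n)]
            a = max(range(n), key=lambda i: ucb[i])

        # Rewards are normal with the given mean and variance (std = sqrt(var)).
        reward = rng.gauss(means[a], math.sqrt(variances[a]))
        counts[a] += 1
        q[a] += (reward - q[a]) / counts[a]   # incremental sample average
        actions.append(a)

    return q, counts, actions
```

For instance, `ucb_bandit_sketch([5, 0, -5], [1, 1, 1], c=1)` returns the final value estimates, the pull counts, and the sequence of arms chosen, which correspond loosely to the quantities shown in the three plots.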

(a) Simulate two 3-armed bandit experiments with means [0, 0, 0] and variances [1, 1, 1].

For experiment 1, select a value of c such that all $Q_t$ values converge to $Q^*$; for experiment 2,

select c such that at least one $Q_t$ value does not converge to $Q^*$. Show the three plots for each

experiment. Briefly explain why your c values lead to the two different outcomes. Contrast the

trend of the confidence intervals and distribution of actions taken in each.

(b) Now suppose our three arms have very different means: [5, 0, -5]. Simulate two bandit

experiments with these means, and variances [1, 1, 1], once with c = 1 and once with

c = 10. Again show the plots for each scenario. Comment (for each scenario) on how the $Q_t$

values of each action update over time (if at all), how the confidence intervals evolve, and the

distribution of the actions taken.

(c) Let’s consider a scenario in which the means are different but very close, while variances are

very large. Simulate bandit experiments with means [1, 0, -1], variances [10, 10, 10],

and c = 1. Repeat until you see an outcome in which the “dominant” action (the one taken most often) is

not action 0, the arm with the largest mean. Show the plots for this result and explain why we ended up

converging on a suboptimal action. Briefly explain the importance of c and exploration when

we have bandits with large variances.

## Problem 2: Mini-Blackjack (12 points)

We will model a mini-blackjack game as an MDP. The goal is to draw cards from a deck containing

2s, 3s, and 4s (with replacement) and stop with a card sum as close to 6 as possible without going

over. The possible card sums form the states: 0, 2, 3, 4, 5, 6, “done”. The last state is terminal

and has no associated actions. From all other states, one action is to draw a card and advance

to a new state according to the new card sum, with “done” representing card sums of 7 and 8.

Alternatively, one may stop and receive reward equal to the current card sum, also advancing to

“done” afterward.
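
The draw dynamics described above can be sketched in a few lines of Python. This is an illustration only: it assumes each of the three card values is equally likely on every draw (the text says only that the deck contains 2s, 3s, and 4s and is sampled with replacement), and the names `CARDS`, `TARGET`, and `draw_transitions` are hypothetical.

```python
from fractions import Fraction

CARDS = (2, 3, 4)   # card values in the deck (assumed equally likely)
TARGET = 6          # largest card sum that does not go over


def draw_transitions(state):
    """Return {next_state: probability} for the draw action from `state`."""
    if state == "done":
        return {}                       # terminal state: no actions available
    probs = {}
    for card in CARDS:
        total = state + card
        # Any sum above the target collapses into the terminal state.
        nxt = total if total <= TARGET else "done"
        probs[nxt] = probs.get(nxt, Fraction(0)) + Fraction(1, len(CARDS))
    return probs
```

Exact fractions are used instead of floats so that the probabilities out of each state sum to exactly 1.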

(a) Draw a state transition diagram of this MDP. The diagram should be a graph with seven

nodes, one for each state. Draw edges that represent transitions between states due to the

draw action only; you may omit transitions due to the stop action. Write the transition

probabilities adjacent to each edge.

(b) Based on the given information and without solving any equations, what are the optimal actions

and values of states 5 and 6? You may assume that $V^*(\text{done}) = 0$. Then, using $\gamma = 1$, solve for

the optimal actions and values of states 4, 3, 2, and 0 (you should do so in that order). Briefly

explain why dynamic programming is not required for this particular problem.

(c) Find the largest value of $\gamma$ that could lead to different optimal actions in

both states 2 and 3 (compared to those above). Compute the values of states 3, 2, and 0 for

the discount factor that you found. Briefly explain why a lower value of $\gamma$ decreases the values

of these states but not those of the other states.
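
For reference, the Bellman optimality equation for this game can be written compactly. The form below is a sketch under two assumptions that the problem statement does not spell out: the draw action itself yields zero reward, and the stop reward is received immediately (undiscounted). Here $\gamma$ is the discount factor and $T(s' \mid s, \text{draw})$ denotes the draw transition probabilities.

```latex
V^*(s) = \max\Big\{ \underbrace{s}_{\text{stop}},\;
         \underbrace{\gamma \sum_{s'} T(s' \mid s, \text{draw})\, V^*(s')}_{\text{draw}} \Big\},
\qquad V^*(\text{done}) = 0 .
```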

## Problem 3: Dynamic Programming (12 points)

Let’s revisit the mini-blackjack game but from the perspective of dynamic programming. You will

be thinking about both value iteration and policy iteration at the same time. Assume $\gamma = 1$.
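
As a point of reference, one synchronous sweep of value iteration, together with the greedy policy it implies, might be sketched as follows. The toy two-state MDP (states `"A"` and `"end"`, actions `"stay"` and `"quit"`) and all function names are made up for illustration and are deliberately not the mini-blackjack game; rewards are assumed to depend only on the state and action.

```python
def value_iteration_sweep(values, actions, transitions, rewards, gamma=1.0):
    """One Bellman backup over all states; returns (new values, greedy policy)."""
    new_values, greedy = {}, {}
    for s in values:
        best_a, best_q = None, 0.0          # states with no actions keep value 0
        for a in actions(s):
            # Q(s, a) = R(s, a) + gamma * sum_s' T(s' | s, a) * V_k(s')
            q = rewards(s, a) + gamma * sum(
                p * values[s2] for s2, p in transitions(s, a).items()
            )
            if best_a is None or q > best_q:  # ties go to the earlier action
                best_a, best_q = a, q
        new_values[s] = best_q
        greedy[s] = best_a
    return new_values, greedy


# Hypothetical toy MDP: "stay" pays 1 and ends with probability 0.5;
# "quit" pays 1.5 and always ends.
def toy_actions(s):
    return ("stay", "quit") if s == "A" else ()


def toy_transitions(s, a):
    return {"A": 0.5, "end": 0.5} if a == "stay" else {"end": 1.0}


def toy_rewards(s, a):
    return 1.0 if a == "stay" else 1.5


v0 = {"A": 0.0, "end": 0.0}
v1, pi1 = value_iteration_sweep(v0, toy_actions, toy_transitions, toy_rewards)
v2, pi2 = value_iteration_sweep(v1, toy_actions, toy_transitions, toy_rewards)
```

In this toy example the greedy action in state `"A"` changes between the first and second sweep, which shows that neither values nor greedy policies are guaranteed to settle after a single round.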

(a) Let’s initialize the time-limited state values: $V_0(s) = 0$ for all $s$. Find the state values $V_1$

after one round of value iteration. You do not need to write out every calculation if you can

briefly explain how you infer the new values.

(b) Coincidentally, $V_0$ are also the values for the (suboptimal) policy $\pi_0(s) = \text{draw}$ for all $s$. If we

were to run policy iteration starting from $\pi_0$, what would be the new policy $\pi_1$ after performing

policy improvement? Choose the stop action in the case of ties.

(c) Perform a second round of value iteration to find the values $V_2$. Have the values converged?

(d) Perform a second round of policy iteration to find the policy $\pi_2$. Has the policy converged?

## Problem 4: Reinforcement Learning (12 points)

Let’s now study the mini-blackjack game from the perspective of using reinforcement learning to

learn an optimal policy. Again, assume $\gamma = 1$. We initialize all values and Q-values to 0 and

observe the following episodes of state-action sequences: