Assignment 3 EECS 4404/5327

**Backpropagation**

Consider a neural network with one hidden layer containing two nodes, input dimension 2 and output dimension 1. That is, the first layer contains two nodes $v_{0,1}, v_{0,2}$, the hidden layer has two nodes $v_{1,1}, v_{1,2}$, and the output layer one node $v_{2,1}$. All nodes between consecutive layers are connected by an edge. The weight between nodes $v_{t,i}$ and $v_{t+1,j}$ is denoted by $w_{t,j,i}$, as (partially) indicated in the accompanying figure. The nodes in the middle layer apply a differentiable activation function $\sigma : \mathbb{R} \to \mathbb{R}$, which has derivative $\sigma'$.
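For concreteness, the forward pass of this architecture can be sketched as below. This is a minimal sketch, not part of the assignment: `np.tanh` is a stand-in for the unspecified activation $\sigma$, and the matrix indexing follows the convention that $w_{t,j,i}$ connects $v_{t,i}$ to $v_{t+1,j}$.

```python
import numpy as np

def forward(x, w0, w1, sigma=np.tanh):
    # x:  length-2 input vector (x1, x2)
    # w0: 2x2 matrix, w0[j, i] = w_{0,j,i} (input layer -> hidden layer)
    # w1: length-2 vector, w1[j] = w_{1,1,j} (hidden layer -> output node)
    a1 = w0 @ x        # inputs a_{1,j} to the hidden nodes
    o1 = sigma(a1)     # outputs o_{1,j} of the hidden nodes
    return w1 @ o1     # network output N(x)
```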

(a) The network gets as input a 2-dimensional vector $\mathbf{x} = (x_1, x_2)$. Give an expression for the output $N(\mathbf{x})$ of the network as a function of $x_1, x_2$ and all the weights.

(b) Assume we employ the square loss. Give an expression for the loss $\ell(N(\cdot), (\mathbf{x}, t))$ of the network on an example $(\mathbf{x}, t)$ (again, as a function of $x_1, x_2, t$ and all the weights).

(c) Consider the above expression of the loss as a function of the set of weights, $L(w_{0,1,1}, w_{0,2,1}, w_{0,1,2}, w_{0,2,2}, w_{1,1,1}, w_{1,1,2}) = \ell(N(\cdot), (\mathbf{x}, t))$. Compute the 6 partial derivatives

$$\frac{\partial L}{\partial w_{1,1,1}},\ \frac{\partial L}{\partial w_{1,1,2}},\ \frac{\partial L}{\partial w_{0,1,1}},\ \frac{\partial L}{\partial w_{0,1,2}},\ \frac{\partial L}{\partial w_{0,2,1}},\ \frac{\partial L}{\partial w_{0,2,2}}.$$

Simplify your expressions by using the notation $a_{i,j}$ and $o_{i,j}$ for the input and output of nodes as appropriate.

(d) Can you identify the components $\delta_{k,i}$ from the backpropagation algorithm from lecture 20?
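As a reminder, under one common convention (check that it matches the definitions used in lecture 20), $\delta_{k,i}$ is the derivative of the loss with respect to the input of node $v_{k,i}$, and each weight derivative factors through it:

$$\delta_{k,i} = \frac{\partial L}{\partial a_{k,i}}, \qquad \frac{\partial L}{\partial w_{t,j,i}} = \delta_{t+1,j}\, o_{t,i},$$

i.e., the $\delta$ of the node an edge feeds into, times the output of the node it comes from.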

**(2 + 1 + 5 + 4 marks) **

**k-means clustering**

In this question you will implement and experiment with the *k*-means algorithm.

(a) Implement the *k*-means algorithm and include your code in your submission. As always, the data, consisting of *n* datapoints which are *d*-dimensional vectors, is given as an *n* × *d* matrix. The output of your algorithm should be one *n*-dimensional vector containing values in {1, …, *k*} indicating the cluster assignments of the points. We will work with three ways of initializing the centers.

In the first, you will simply set the initial centers by hand. In the second, you will choose *k* datapoints uniformly at random from the dataset as your initial centers. For the third initialization, choose the first center uniformly at random, and then choose each next center to be the datapoint that maximizes the sum of (Euclidean) distances from the previously chosen centers.
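A minimal NumPy sketch of Lloyd's iteration and the latter two initializations is given below. The function names, the convergence check, and the tie-breaking behavior are my own choices, not prescribed by the assignment.

```python
import numpy as np

def kmeans(X, centers, max_iter=100):
    # X: (n, d) data matrix; centers: (k, d) initial centers.
    # Returns assignments in {1, ..., k} and the final centers.
    for _ in range(max_iter):
        # assign each point to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each center as the mean of its cluster
        # (empty clusters keep their old center)
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(len(centers))])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels + 1, centers

def init_random(X, k, rng):
    # second initialization: k datapoints chosen uniformly at random
    return X[rng.choice(len(X), size=k, replace=False)]

def init_farthest(X, k, rng):
    # third initialization: first center uniform at random, then each
    # next center maximizes the sum of distances to the centers so far
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d = sum(np.linalg.norm(X - c, axis=1) for c in centers)
        centers.append(X[d.argmax()])
    return np.array(centers)
```

The first initialization method (centers set by hand) is just the `centers` argument of `kmeans` itself.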

(b) Load the first dataset twodpoints.txt. Plot the datapoints. Which number of clusters do you think would be suitable from looking at the points? Show that, for the number *k* that you chose by visually inspecting the data, the *k*-means algorithm can converge to different clusterings depending on the initialization.

Show two such instances by plotting the clusterings that *k*-means produced (use a different color for the points of each cluster) together with the initial centers.

In this part, set the initial centers manually.

(c) Can the above phenomenon also happen if the initial centers are chosen uniformly at random? What about the third way of initializing? For each way of initializing, run the algorithm several times and report your findings.

(d) From now on, we will work with the third method for initializing cluster centers.

Run the algorithm for *k* = 1, …, 10 and plot the *k*-means cost of the resulting clustering (with respect to the final set of centers, which are the means of the output clusters). What do you observe? How does the cost evolve as a function of the number of clusters? How would you determine a suitable number of clusters from this plot (e.g., in a situation where the data cannot be as easily visualized)?
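The *k*-means cost for such a plot is the usual sum of squared distances from each point to its cluster's center. A sketch, assuming the output format described in part (a), with labels in {1, …, *k*} (the commented usage and the `init_farthest`/`kmeans` names are assumptions carried over from that part):

```python
import numpy as np

def kmeans_cost(X, labels, centers):
    # sum of squared distances of each point to its cluster's center;
    # labels take values in {1, ..., k}, so cluster j+1 maps to centers[j]
    return sum(((X[labels == j + 1] - centers[j]) ** 2).sum()
               for j in range(len(centers)))

# Illustrative usage for the elbow plot:
# import matplotlib.pyplot as plt
# rng = np.random.default_rng(0)
# costs = [kmeans_cost(X, *reversed(kmeans(X, init_farthest(X, k, rng))[::-1]))
#          for k in range(1, 11)]
# plt.plot(range(1, 11), costs, marker="o")
# plt.xlabel("k"); plt.ylabel("k-means cost"); plt.show()
```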

(e) Repeat the last step (that is, plot the cost as a function of *k* = 1, …, 10) for the next dataset threedpoints.txt. What do you think is a suitable number of clusters here? Can you confirm this by a visualization?

(f) Download the “seeds” data set from the UCI repository:

https://archive.ics.uci.edu/ml/datasets/seeds.

That data is also stored in a text file and can be loaded the same way. It contains 210 instances with three different labels (the last column in the file corresponds to the label).

Remove the last column (which contains the class labels) before the clustering.
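Loading the file and stripping the label column might look like this (a sketch; the filename you save the UCI data under is up to you):

```python
import numpy as np

def load_features(path):
    # load a whitespace-separated text file and split off the final
    # column, which holds the class labels
    data = np.loadtxt(path)
    return data[:, :-1], data[:, -1].astype(int)

# hypothetical usage, assuming the UCI file was saved locally:
# X, y = load_features("seeds_dataset.txt")
```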

From how the *k*-means cost evolves, what seems like a suitable number of clusters for this dataset? How could you use this to design a simple 3-class classifier for this dataset? What is the empirical loss of this classifier?

(g) Design a simple (two-dimensional) dataset where the 2-means algorithm with the third initialization method will always fail to find the optimal 2-means clustering.

Explain why it will fail on your example, or provide plots of the data with initializations and costs showing that 2-means converges to a suboptimal clustering.

Discuss what you think is the issue with this initialization method and how it could be fixed.

**(2 + 2 + 2 + 3 + 3 + 3 + 3 marks) **
