R Data Analysis Writing Service | INFO411/INFO911 Final Examination Paper



Question 1  (1+1+1+1+1 = 5 marks)

Suggest plots that would be appropriate to explore datasets of the following types:

(A) A single continuous variable (e.g. height of a student).

(B) A single categorical variable (e.g. the days of a week).

(C) A single continuous variable (e.g. personal income) and a single categorical variable (e.g. gender).

(D) Two continuous variables (e.g. height and weight of a student).

(E) Two categorical variables (e.g. highest qualification and gender).

Question 2  (1+1+1+2 = 5 marks)

(A) Discuss the connections and differences between partitional clustering and hierarchical clustering.

(B) Explain what a self-organizing map is and what is meant by its “topology preserving properties”.

(C) Training a self-organizing map requires several parameters to be specified. Name three of these parameters and explain the purpose of each.

(D) Given a set of points, students A and B each apply k-means clustering to partition these points into M clusters. For various reasons, their clustering results are not identical. You are asked to determine whose clustering result is better. Describe how you would do so.
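One common way to compare two k-means results on the same points with the same M is the total within-cluster sum of squared distances (SSE), which is the objective k-means tries to minimise. The sketch below is only an illustration under that assumption; the function name `sse`, the four points, and both label assignments are hypothetical and not part of the question.

```python
def sse(points, labels):
    """Total within-cluster sum of squared Euclidean distances to each centroid."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        # Centroid = coordinate-wise mean of the cluster members.
        centroid = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        total += sum(
            sum((m[d] - centroid[d]) ** 2 for d in range(dim)) for m in members
        )
    return total


# Hypothetical data: the same points clustered two different ways (M = 2).
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
labels_a = [0, 0, 1, 1]  # hypothetical result of student A
labels_b = [0, 1, 1, 1]  # hypothetical result of student B

# The lower SSE indicates the tighter clustering under this criterion.
better = "A" if sse(points, labels_a) < sse(points, labels_b) else "B"
print(sse(points, labels_a), sse(points, labels_b), better)
```

The comparison is only meaningful when both results use the same number of clusters, since SSE generally decreases as the number of clusters grows.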

Question 3  (2+2+2 = 6 marks)

(A) Describe the main steps of the Apriori algorithm for mining association rules. Explain how the algorithm generates the sets of candidate itemsets and how the algorithm prunes the candidate itemsets.
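As a rough illustration of the level-wise search, the following sketch implements candidate generation (a join of frequent (k-1)-itemsets) and pruning (dropping any candidate that has an infrequent (k-1)-subset). The function names, the toy transactions, and the minimum support of 0.5 are assumptions chosen for the example; rule generation from the frequent itemsets is not shown.

```python
from itertools import combinations


def generate_candidates(frequent_prev, k):
    """Join frequent (k-1)-itemsets into k-itemsets, then prune by the Apriori property."""
    prev = set(frequent_prev)
    candidates = set()
    for a in prev:
        for b in prev:
            union = a | b
            if len(union) != k:
                continue
            # Prune: every (k-1)-subset of a candidate must itself be frequent.
            if all(frozenset(s) in prev for s in combinations(union, k - 1)):
                candidates.add(union)
    return candidates


def apriori(transactions, min_support):
    """Return a dict mapping each frequent itemset to its support."""
    n = len(transactions)
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        k += 1
        current = generate_candidates(level.keys(), k)
    return frequent


# Hypothetical toy transactions.
transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
frequent = apriori(transactions, min_support=0.5)
for itemset, supp in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), supp)
```

In this toy data all single items and all pairs are frequent, while {A, B, C} is generated as a candidate but rejected because its support of 0.4 is below the threshold.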

(B) Consider the following set of items {A, B, D, F, H}. Create a set of transactions such that the association rule {A, D} => {F, H} would have support 0.3 and confidence 0.6.
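A candidate transaction set can be checked mechanically. The sketch below uses one possible set of 10 transactions (an illustrative choice, not the only valid answer) in which {A, D} occurs 5 times and {A, D, F, H} occurs 3 times, so that support = 3/10 and confidence = 3/5. The helper name `support_count` is hypothetical.

```python
def support_count(transactions, itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


# One possible answer: 10 transactions over the items {A, B, D, F, H}.
transactions = (
    [{"A", "D", "F", "H"}] * 3                              # antecedent and consequent
    + [{"A", "D"}, {"A", "B", "D"}]                         # antecedent only
    + [{"B"}, {"B", "F"}, {"H"}, {"A", "B"}, {"D", "F"}]    # without {A, D}
)

antecedent = {"A", "D"}
consequent = {"F", "H"}

n_rule = support_count(transactions, antecedent | consequent)
support = n_rule / len(transactions)
confidence = n_rule / support_count(transactions, antecedent)
print(support, confidence)  # prints: 0.3 0.6
```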

(C) The measure “confidence” is commonly used to evaluate the interestingness of a mined association rule. However, sometimes a high confidence value does not necessarily mean a rule is indeed interesting. Discuss the potential issue of the measure “confidence” and explain how this issue is addressed in association analysis.
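One measure often used alongside confidence is lift, which divides the confidence of X => Y by the support of Y, so that a frequent consequent does not inflate the score. The sketch below uses hypothetical counts to show a rule with high confidence but lift below 1; the function name `rule_measures` and all numbers are assumptions made for illustration.

```python
def rule_measures(n_total, n_antecedent, n_consequent, n_both):
    """Return (confidence, lift) of a rule X => Y computed from transaction counts."""
    confidence = n_both / n_antecedent
    consequent_support = n_consequent / n_total
    lift = confidence / consequent_support
    return confidence, lift


# Hypothetical counts: 100 transactions, X in 20, Y in 90, X and Y together in 15.
confidence, lift = rule_measures(n_total=100, n_antecedent=20, n_consequent=90, n_both=15)
print(confidence, lift)
```

Here the confidence is 0.75, yet Y already appears in 90% of all transactions, so the lift is below 1: seeing X makes Y slightly less likely, not more.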

Question 4  (2+2+2+3 = 9 marks)

(A) The k-nearest neighbour (k-NN) classifier is a simple and effective classifier. Suppose you are given a set of M samples, together with the class label of each sample.

Meanwhile, another set of N samples is hidden from you and will be used only as a test set to evaluate the k-NN classifier that you have developed. Describe the procedure that you would follow in order to obtain a k-NN classifier that achieves the highest possible classification accuracy on the test set (i.e., the N samples).
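Because the N test samples are hidden, any tuning has to rely on the M labelled samples alone. A minimal sketch of one such procedure, choosing k by cross-validation, is given below. It assumes numeric features, Euclidean distance, and majority voting; the function names, the candidate values of k, and the toy data are all hypothetical, and steps such as feature scaling are left out.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Predict the label of `query` by majority vote among its k nearest training samples."""
    neighbours = sorted(zip(train_x, train_y), key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]


def cv_accuracy(xs, ys, k, folds=5):
    """Estimate the accuracy of k-NN for a given k by cross-validation.

    Fold membership is the sample index modulo `folds`; shuffle the samples
    first if they are stored in a meaningful order.
    """
    correct = 0
    for fold in range(folds):
        train_idx = [i for i in range(len(xs)) if i % folds != fold]
        valid_idx = [i for i in range(len(xs)) if i % folds == fold]
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        correct += sum(knn_predict(train_x, train_y, xs[i], k) == ys[i] for i in valid_idx)
    return correct / len(xs)


def select_k(xs, ys, candidates=(1, 3, 5, 7), folds=5):
    """Pick the candidate k with the highest cross-validated accuracy."""
    return max(candidates, key=lambda k: cv_accuracy(xs, ys, k, folds))


# Hypothetical toy data: two well-separated classes on a line.
xs = [(0.1 * i, 0.0) for i in range(10)] + [(10 + 0.1 * i, 0.0) for i in range(10)]
ys = [0] * 10 + [1] * 10

best_k = select_k(xs, ys)
print(best_k, cv_accuracy(xs, ys, best_k))
```

Once k has been selected, the final classifier would typically use all M labelled samples as its reference set.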

(B) Given a training set with the following properties:

Number of samples: 1900

Dimension of features in each sample: 45

Dimension of the target value for each sample: 3

Assume that this dataset is being used to train an MLP which has a single hidden layer with 20 neurons, and that the network is being trained for 400 iterations. What is the total number of weights (weight parameters) in this MLP? Show and explain how you derived your answer.
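The count depends only on the layer sizes, not on the number of samples or training iterations. The sketch below assumes a fully connected network with no skip connections; whether bias terms count as weights is a convention that varies, so both variants are shown. The function name `count_weights` is hypothetical.

```python
def count_weights(layer_sizes, include_biases=False):
    """Count the connection weights of a fully connected feed-forward network.

    `layer_sizes` lists the number of units per layer, from input to output.
    """
    total = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    if include_biases:
        # One bias per non-input neuron.
        total += sum(layer_sizes[1:])
    return total


layers = [45, 20, 3]  # 45 inputs, 20 hidden neurons, 3 outputs
print(count_weights(layers))                       # 45*20 + 20*3 = 960
print(count_weights(layers, include_biases=True))  # 960 + 20 + 3 = 983
```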

(C) Consider the 2-layer MLP depicted below. It consists of one hidden-layer neuron, one output-layer neuron, and 5 weights. The value of each weight is given by the number attached to the corresponding link (for example, the weight between input x1 and the output neuron is +1). Assume that the activation function for both the hidden-layer neuron and the output-layer neuron is a threshold function defined as

$$f(x) = \begin{cases} 1 & \text{if } x > u \\ 0 & \text{otherwise,} \end{cases}$$

where u is the threshold value, and x is the sum of all weighted inputs to a given neuron.

Thus, for example, if the threshold of a neuron is u = 0.5 and the sum of its weighted inputs is 0.35, then this neuron will produce 0 as its output.

Given an input set that contains the following four samples:

Sample1: x1=1.5, x2=1.2

Sample2: x1=0, x2=1.2

Sample3: x1=0.5, x2=0.5

Sample4: x1=1.6, x2=0

Compute the output produced by this network for each of these samples. You need to show the key steps of calculation.
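The calculation can be organised as in the sketch below. It assumes that the five weights connect x1 and x2 to the hidden neuron, the hidden neuron to the output neuron, and x1 and x2 directly to the output neuron; this layout is inferred from the description and should be checked against the figure. All weight and threshold values are left as parameters to be read off the figure, and every name here is hypothetical.

```python
def threshold_activation(x, threshold):
    """Return 1 if the weighted input sum strictly exceeds the threshold, else 0."""
    return 1 if x > threshold else 0


def forward(x1, x2, *, w_x1_h, w_x2_h, w_h_out, w_x1_out, w_x2_out, thr_h, thr_out):
    """Forward pass through the assumed layout: one hidden neuron plus direct input links."""
    hidden = threshold_activation(w_x1_h * x1 + w_x2_h * x2, thr_h)
    out_sum = w_h_out * hidden + w_x1_out * x1 + w_x2_out * x2
    return threshold_activation(out_sum, thr_out)


# The four samples given in the question.
samples = [(1.5, 1.2), (0.0, 1.2), (0.5, 0.5), (1.6, 0.0)]

# The worked example from the question text: sum 0.35, threshold 0.5.
print(threshold_activation(0.35, 0.5))  # prints: 0
```

For each sample, the hidden neuron's output is computed first and then fed into the output neuron together with the direct input links.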
