## 1 Examples for practice, NOT ASSESSED

Some sample solutions are available in the solution script for this practical, SB1-CompStats-Third-Prac.R.

The data in the file baseball.txt are a subset of data collected on professional American baseball players. Focus on the following variables:

1. salary – The yearly salary of each player (there are 59 NA’s in salary – remove these rows)

2. homeruns_career – The total number of homeruns a player made over their career

3. division – The division a player is in (either W or E)
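As a starting point, the data preparation described above might look like the sketch below. The layout of baseball.txt is not shown in this handout, so the block builds a small synthetic data frame with the three column names listed above; the values, the number of rows and the commented-out `read.table()` arguments are assumptions for illustration only.

```r
# Synthetic stand-in for baseball.txt (values are made up for illustration).
set.seed(1)
n <- 30
baseball <- data.frame(
  homeruns_career = rpois(n, 60),
  division        = factor(sample(c("E", "W"), n, replace = TRUE)),
  salary          = exp(rnorm(n, mean = 6, sd = 0.8))
)
baseball$salary[c(3, 11, 20)] <- NA   # mimic the missing salaries

# With the real file one would presumably start from something like
# (header and separator are guesses):
# baseball <- read.table("baseball.txt", header = TRUE)

# Drop the rows with a missing salary and add the log-salary used later on.
bb <- baseball[!is.na(baseball$salary), ]
bb$log_salary <- log(bb$salary)

summary(bb)
```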

### (a) EDA and Smoothing

We haven't covered this in lectures yet, but it is worth looking at the code in the answers to see an example of what we will be doing, explore its behaviour, and check the dependence between the variables. You may wish to focus on the next block of questions for this practical.

a.1 Examine the relationship between the log of the outcome variable salary and the explanatory variables homeruns_career and division using exploratory plots.

a.2 Can we predict salaries with the predictor variable homeruns_career? Perform a regression of the response variable log_salary on the variable homeruns_career. Plot the regression line onto the data and comment on the fit. Give an interpretation of the model, keeping in mind that the salaries are reported on a log scale. Comment on the effect of extreme observations in homeruns_career.
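One way to approach a.2 is sketched below on simulated data, since the real file is not reproduced here; the variable names follow the handout but the numbers are invented. Because the response is on a log scale, a slope `b1` corresponds to a multiplicative change of `exp(b1)` in salary per extra career home run. Leverage from `hatvalues()` is one possible way (an assumption, not necessarily the one used in the solution script) to flag extreme values of homeruns_career.

```r
# Simulated stand-in data: a skewed predictor with a few extreme values.
set.seed(2)
n <- 200
homeruns_career <- rexp(n, rate = 1 / 70)
log_salary <- 5.5 + 0.004 * homeruns_career + rnorm(n, sd = 0.7)

# Simple linear regression of log-salary on career home runs.
fit <- lm(log_salary ~ homeruns_career)
b1 <- coef(fit)[["homeruns_career"]]

# Interpretation on the original scale: percentage change in salary
# associated with 10 additional career home runs.
pct_per_10 <- 100 * (exp(10 * b1) - 1)

# Plot the data with the fitted line.
plot(homeruns_career, log_salary, col = "grey40")
abline(fit, col = "red", lwd = 2)

# Points with large leverage are the extreme observations in the predictor.
lev <- hatvalues(fit)
high_lev <- which(lev > 2 * mean(lev))
points(homeruns_career[high_lev], log_salary[high_lev], pch = 19, col = "blue")
```

Refitting without the `high_lev` points and comparing the slopes gives a quick impression of how much the extreme observations drive the fit.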

a.3 Try a smoothing procedure, for example a kernel smoother ksmooth() or local polynomial regression locpoly(). Plot the fit for various smoothing parameters and discuss what you find. Investigate also lowess() and dpill().
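The four functions named in a.3 can be tried side by side as in the sketch below. The data are simulated (the curve, noise level and bandwidth values are arbitrary choices for illustration); `ksmooth()` and `lowess()` come from base R, while `locpoly()` and `dpill()` are in the KernSmooth package.

```r
library(KernSmooth)

# Simulated data with a nonlinear trend.
set.seed(3)
n <- 200
x <- sort(runif(n, 0, 300))
y <- 5 + 1.5 * (1 - exp(-x / 60)) + rnorm(n, sd = 0.5)

plot(x, y, col = "grey60")

# Kernel smoother for several bandwidths: small values give a wiggly fit,
# large values oversmooth.
for (bw in c(10, 40, 150)) {
  ks <- ksmooth(x, y, kernel = "normal", bandwidth = bw)
  lines(ks, lty = 2)
}

# Local linear regression with the plug-in bandwidth chosen by dpill().
h <- dpill(x, y)
ll <- locpoly(x, y, degree = 1, bandwidth = h)
lines(ll, col = "blue", lwd = 2)

# lowess: robust local regression, f is the fraction of points used locally.
lo <- lowess(x, y, f = 2 / 3)
lines(lo, col = "red", lwd = 2)
```

Note that the `bandwidth` arguments of `ksmooth()` and `locpoly()` are on different scales, so the same number does not give the same amount of smoothing in both.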

a.4 Could anything be done to improve the fit?

a.5 Optional Extensions: (i) use LOO-CV to estimate the bandwidth in a local-polynomial fit of degree 1 (see lecture code L6.R for something nearly identical); (ii) calculate the smoothing matrix $S$ in this case (the expression for $S$ is given in PS2) and verify the fit by hand. Calculate the DOF for GCV, $\operatorname{trace}(S)$, and the DOF for variance estimation, $2\operatorname{trace}(S) - \operatorname{trace}(S^{T}S)$.
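A possible sketch of both extensions follows. The expression from PS2 is not reproduced in this handout, so the helper `local_linear_S()` below is a hypothetical implementation using the textbook weighted-least-squares form of a local linear fit with a Gaussian kernel; the data and the bandwidth grid are also made up. LOO-CV uses the usual shortcut for linear smoothers, in which the leave-one-out residual is the ordinary residual divided by one minus the corresponding diagonal entry of the smoothing matrix.

```r
# Smoothing matrix of a local linear fit (Gaussian kernel, bandwidth h),
# evaluated at the observed x values. Hypothetical helper, not from PS2.
local_linear_S <- function(x, h) {
  n <- length(x)
  S <- matrix(0, n, n)
  for (i in seq_len(n)) {
    d <- x - x[i]                 # centre the design at the target point
    w <- dnorm(d / h)             # kernel weights
    X <- cbind(1, d)              # local linear design matrix
    XtW <- t(X * w)               # X^T W, a 2 x n matrix
    # First row of (X^T W X)^{-1} X^T W gives the weights of the fit at x[i].
    S[i, ] <- solve(XtW %*% X, XtW)[1, ]
  }
  S
}

# Simulated data.
set.seed(4)
n <- 100
x <- seq(0, 10, length.out = n)
y <- sin(x) + rnorm(n, sd = 0.3)

# (i) LOO-CV score over a grid of bandwidths.
hs <- seq(0.3, 3, by = 0.1)
cv <- sapply(hs, function(h) {
  S <- local_linear_S(x, h)
  mean(((y - S %*% y) / (1 - diag(S)))^2)
})
h_cv <- hs[which.min(cv)]

# (ii) Fit by hand and the two degrees-of-freedom quantities.
S <- local_linear_S(x, h_cv)
fit_by_hand <- drop(S %*% y)
dof_gcv <- sum(diag(S))                              # trace(S)
dof_var <- 2 * sum(diag(S)) - sum(diag(t(S) %*% S))  # 2 trace(S) - trace(S^T S)
```

The fit can be checked pointwise against a weighted `lm()` with the same kernel weights, whose intercept is the local linear estimate at the target point.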

### (b) Rank sum test

b.1 Are the salaries similar in the two divisions? Examine whether the distributions of player salaries are the same in both divisions. Use the Wilcoxon test with the normal approximation and calculate by hand the value of the test statistic, its expected value and variance under the null hypothesis of no difference, and give the test result.
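The hand calculation in b.1 can be laid out as below. The two samples are simulated and rounded to create ties (all numbers are illustrative), and the variance used is the standard no-ties formula $mn(N+1)/12$ for group sizes $m$ and $n$ with $N = m + n$; whether the course notes apply a tie or continuity correction is not stated here, so none is used.

```r
# Simulated salaries for the two divisions, rounded to the nearest 10
# so that the data contain ties.
set.seed(5)
sal_E <- round(exp(rnorm(40, mean = 6.0, sd = 0.8)), -1)
sal_W <- round(exp(rnorm(35, mean = 5.9, sd = 0.8)), -1)

m <- length(sal_E)
n <- length(sal_W)
N <- m + n

# Rank the pooled sample (ties receive midranks) and sum the ranks of E.
r <- rank(c(sal_E, sal_W))
W <- sum(r[seq_len(m)])

# Mean and variance of the rank sum under the null of no difference.
EW <- m * (N + 1) / 2
VW <- m * n * (N + 1) / 12

# Normal approximation, two-sided p-value.
z <- (W - EW) / sqrt(VW)
p_norm <- 2 * pnorm(-abs(z))

c(W = W, EW = EW, VW = VW, z = z, p = p_norm)
```

R's `wilcox.test()` reports the Mann-Whitney form of the statistic, which equals `W - m * (m + 1) / 2`.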

b.2 The data contain ties. Explain how to simulate the distribution of the test statistic under the null. Implement this procedure and check that the significance levels you computed in the previous question are robust to the effect of ties.
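One way to simulate the null distribution in the presence of ties is a permutation scheme: under the null the division labels are exchangeable, so drawing $m$ of the pooled midranks at random without replacement reproduces the distribution of the rank sum conditional on the observed tie pattern. The sketch below uses simulated tied data and an arbitrary number of replicates `B`; both are assumptions for illustration.

```r
# Simulated salaries with ties, as a stand-in for the real data.
set.seed(6)
sal_E <- round(exp(rnorm(40, mean = 6.0, sd = 0.8)), -1)
sal_W <- round(exp(rnorm(35, mean = 5.9, sd = 0.8)), -1)
m <- length(sal_E)
n <- length(sal_W)
N <- m + n

# Observed rank sum using midranks.
r <- rank(c(sal_E, sal_W))
W_obs <- sum(r[seq_len(m)])
EW <- m * (N + 1) / 2

# Permutation null: reassign the labels by sampling m midranks at random.
B <- 10000
W_sim <- replicate(B, sum(sample(r, m)))

# Two-sided simulated p-value, to compare with the normal approximation.
p_sim <- mean(abs(W_sim - EW) >= abs(W_obs - EW))
p_norm <- 2 * pnorm(-abs((W_obs - EW) / sqrt(m * n * (N + 1) / 12)))

c(p_sim = p_sim, p_norm = p_norm, var_sim = var(W_sim))
```

If ties matter, `var(W_sim)` will be noticeably smaller than the no-ties variance and the two p-values will disagree.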

b.3 Test your result by implementing the Wilcoxon test in R and give a point estimate (using the Hodges-Lehmann estimator) and a confidence interval for the difference in median salary between the two divisions. State any assumptions.
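For b.3, base R's `wilcox.test()` returns both the interval and a location estimate when `conf.int = TRUE`. The sketch below runs it on simulated tied data (the numbers are illustrative) and also computes the Hodges-Lehmann estimate directly as the median of all pairwise differences. Interpreting this as a difference in medians relies on a shift model, that is, the two distributions having the same shape and differing only in location.

```r
# Simulated salaries with ties, as a stand-in for the real data.
set.seed(7)
sal_E <- round(exp(rnorm(40, mean = 6.0, sd = 0.8)), -1)
sal_W <- round(exp(rnorm(35, mean = 5.9, sd = 0.8)), -1)

# Rank sum test with a 95% confidence interval for the location shift.
# exact = FALSE requests the normal approximation, since ties rule out
# the exact null distribution.
wt <- wilcox.test(sal_E, sal_W, conf.int = TRUE, conf.level = 0.95,
                  exact = FALSE)
wt$p.value
wt$conf.int
wt$estimate

# Hodges-Lehmann estimate by hand: median of all pairwise differences.
hl <- median(outer(sal_E, sal_W, "-"))
hl
```

With the normal approximation, `wt$estimate` is found numerically and may differ slightly from `hl`.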

### (c) Rank sign test

c.1 Consider the possibility that walks and runs have the same median. Why should we treat these data as paired?

c.2 Find a confidence interval for the difference in medians and use it to test for equal medians at the 95% confidence level. What do you conclude? State any assumptions.
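A sketch of a paired analysis is given below. The handout does not say which paired procedure is intended, so the block shows two candidates on simulated data: the Wilcoxon signed rank test (which assumes the within-player differences are symmetric about their median) and the sign test (which does not). The variable names follow c.1; the values and the shared `ability` term that induces the pairing are invented.

```r
# Simulated paired data: both counts depend on the same player,
# which is why the observations are treated as pairs.
set.seed(8)
n <- 60
ability <- rnorm(n)
walks <- round(40 + 15 * ability + rnorm(n, sd = 8))
runs  <- round(50 + 15 * ability + rnorm(n, sd = 8))
d <- walks - runs

# Signed rank test with a 95% confidence interval for the centre of d.
sr <- wilcox.test(walks, runs, paired = TRUE, conf.int = TRUE,
                  conf.level = 0.95, exact = FALSE)
sr$conf.int

# Test via the interval: reject equal medians if 0 lies outside it.
reject <- !(sr$conf.int[1] <= 0 && 0 <= sr$conf.int[2])

# Sign test as an alternative: count positive differences, dropping zeros.
st <- binom.test(sum(d > 0), sum(d != 0), p = 0.5)
st$p.value
```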

## 2 ASSESSED EXERCISE

Write a short self-contained report introducing the data, clarifying and formalising the question, outlining and justifying your choice of statistical methods (smoother and test) and presenting your results and conclusions in a clear and physically interpretable manner. Draw attention to any assumptions you make in the course of selecting a test and what criterion was used to choose the smoother.

The dataset rainfall.csv, available on Canvas, contains monthly rainfall data for Oxford and Cambridge. The data are from https://www.metoffice.gov.uk/research/climate/maps-and-data/historic-station-data.

Total monthly rainfall in mm is given for both locations for the period between January 2014 and December 2020. Perform exploratory data analysis, including at least a plot of a smooth of rainfall for each location over months. Perform a test of whether the rainfall in the two locations has the same distribution against a suitable alternative.