This assignment consists of two sections, each contributing 50% of the total marks for this assignment. You should attempt both sections. Please submit one pdf via eBART containing your solutions – it should be written up using word processing software (e.g. LaTeX, R Markdown, or Word). Hand written solutions will be accepted where mathematical descriptions are required, but a professional word processed submission is preferred.
Deadline: Noon (12pm), on 8th August 2022
You are expected to work independently – strict disciplinary action will be taken for any plagiarism. Late submissions will also be penalised. The data required for this assignment MTHM506_refdef.RData can be loaded into R using the load() function. If you have any issues or questions, please email me (m.l.thomas@e xeter.ac.uk).
Section A – Exercises
In this section, you are required to answer a series of exercises based on the module. Note that the questions are organised in the order we covered the topics, and not in order of diffiffifficulty. Therefore it is advised that you read through the questions fifirst, and start working on those that you feel more comfortable with.
Solutions are expected to be concise, well structured and well presented. Commented R code (e.g. ‘model<- glm(…)’) and the outcomes/plots should form part of your solutions. Do not display too much raw R output (e.g. don’t display the full output of ‘summary(model)’), but edit this down to the essentials. Ensure to include justifification for each step of your analyses, providing comments alongside your R code to explain what you are doing and add appropriate titles and labelled axes to your plots. There are 60 marks in total for this section, with marks for each part question indicated.
The data frame dengue involves data on the count of weekly dengue fever cases in Rio de Janeiro (yi), starting on the 36th week of 2012 and goes up to the 15th week of 2013. A scatter plot of cases versus time suggests a strong non-linear relationship:
Yi ∼ NegBin(µi , θ), Yi independent
log(µi) = β0 + β1xi
where xi is time (in weeks). The goal is to capture the temporal structure of the disease outbreak in 2013.
The Negative Binomial distribution with mean µ and dispersion parameter θ has pmf:
p(yi; µi , θ) = yi + θ − 1yi θ θ +µi θ µi θ +µi yi
Note the R functions dnbinom and qnbinom call θ the size.
(a) [2 marks] Write down the log-likelihood ` (β0, β1, θ; y, x) for this model. Show your working.
(b) [2 marks] Write an R function mylike() which evaluates the negative log-likelihood (i.e. −` (β0, β1, θ; y, x)) for any values of the three parameters.
(c) [5 marks] Use the R function nlm() in association with your function mylike() to numerically minimise the negative log-likelihood. Provide some evidence of how you chose sensible starting values. Report the maximum likelihood estimates of the parameters and superimpose a plot of the associated mean relationship on a scatter plot of y versus x.
(d) [3 marks] Report the standard errors for β0 and β1, and use those to construct 95% confifidence intervals.
(e) [3 marks] Test the hypothesis that β1 = 0 at the 5% signifificance level (not using the confifidence interval) and compute the associated p-value of the test.
(f) [3 marks] Use plug-in prediction to construct and plot 95% prediction intervals. Looking at the estimated mean and prediction intervals, comment on the suitability of this model to capture this data.
In this question, data y1, . . . , yn will be simulated from a known model involving the Poisson distribution with a log link. You are then asked to fifit both a Gaussian GLM with a log-link to yi and a Poisson GLM with a log link to yi and comment on the difffferences in the results.
(a) [2 marks] In R, simulate values of an explanatory variable xi by sampling n = 200 observations from a uniform distribution between 0 and 1 (runif()). Now use rpois() to simulate corresponding response values yi from the following Poisson model
yi ∼ P ois(λi), i = 1, 2, . . . , 200
log(λi) = β0 + β1xi
with β0 = 0.3 and β1 = 5. Note: to make sure that you get the same simulated values you may want to use set.seed().
(b) [3 marks] In R, fifit a Gaussian GLM with a log-link to yi versus xi using glm() (careful, yi may contain zeros so add a small constant to each yi , e.g. 0.1). In addition, fifit a Poisson GLM to yi with a log link.
Compare parameter estimates and standard errors of the two models, making reference to the actual β0 and β1 values.
(c) [6 marks] Produce predictions of the mean trend of both models and superimpose them over the data with associated 95% prediction intervals. Furthermore, produce the residuals vs fifitted values plot and residual QQ plot for each model. Comment on both sets of plots and state which is the preferred model.