Your report should be clearly written. There are no marks awarded for presentation but
there are marks awarded for clarity. When you perform a statistical test do not just report
the p-value but also report your conclusion using plain language. The same holds for
interpreting models; reporting the coefficients and their standard errors is required but
try to link your results to the research question as well. You should use captions for your
tables and figures and include your commented R code, preferably in an appendix.
1 Exercise for practice, NOT ASSESSED
The dataset bw.csv gives details of 189 babies and mothers, focusing on low birth weight.
The dataset contains information on:
• low: birth weight status, 1 = birth weight less than 2.5 kg, 0 otherwise
• age: mother’s age in years
• mwt: mother’s pre-pregnancy weight in pounds
• race: mother’s race (1 = white, 2 = black, 3 = other)
• smoke: 1 if smoked during pregnancy, 0 otherwise
• ptlp1: 0 if no previous premature labours, 1 otherwise
• ht: 1 if mother has history of hypertension, 0 otherwise
1. Load the data, using for example read.csv():
bw <- read.csv("bw.csv")
2. Produce some suitable exploratory plots of the data, examining the relationships
between the variables.
# brief hints
bw$race <- as.factor(bw$race)
# to be able to refer to a column as e.g. race rather than bw$race
# here it is convenient to:
attach(bw)
# use detach(bw) to remove it when finished, can check using search()
# For plot examples
(tab1 <- table(low, race))
barplot(tab1, beside = TRUE)
# can use e.g. names.arg and col arguments of barplot() to improve plot
boxplot(mwt ~ low, xlab = 'low bw', ylab = 'mother weight')
3. Which GLM do you specify to analyse how the incidence of low birth weight depends
on the other variables? Motivate your choice. What are your priors with respect to the
directions of the effects of the other variables on the incidence of low birth weight?
4. Carry out model selection using likelihood ratio tests (ignoring any interaction terms).
5. Assess the quality of the model fit using suitable methods.
6. Interpret your findings fully.
7. Compute an estimate of the average marginal effect for mwt. How do you interpret
this effect? Sketch how you would obtain a standard error.
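A minimal sketch of steps 3, 4 and 7 follows; it is an illustration, not a model answer. It assumes a binomial GLM with logit link for the binary response low. Because bw.csv is not reproduced here, the code simulates a placeholder data frame sim with the same column names; the simulation coefficients are arbitrary assumptions and say nothing about the real data. The helper ame_mwt is a hypothetical name. For a logit model the marginal effect of mwt for one observation is the mwt coefficient times p(1 - p), and the average marginal effect is the sample mean of that quantity. A nonparametric bootstrap is used for the standard error; the delta method would be an alternative.

```r
set.seed(1)
n <- 189

# Simulated stand-in for bw.csv (assumption: arbitrary values, NOT the real data)
sim <- data.frame(
  age   = round(runif(n, 14, 45)),
  mwt   = round(rnorm(n, 130, 30)),
  race  = factor(sample(1:3, n, replace = TRUE)),
  smoke = rbinom(n, 1, 0.4),
  ptlp1 = rbinom(n, 1, 0.15),
  ht    = rbinom(n, 1, 0.06)
)
eta <- 1 - 0.015 * sim$mwt + 0.7 * sim$smoke + 1.2 * sim$ptlp1
sim$low <- rbinom(n, 1, plogis(eta))

# Step 3: binomial GLM with logit (canonical) link for a 0/1 response
full <- glm(low ~ age + mwt + race + smoke + ptlp1 + ht,
            family = binomial, data = sim)

# Step 4: likelihood ratio test for dropping each term in turn
drop1(full, test = "Chisq")

# The same idea as an explicit comparison of two nested models
reduced <- glm(low ~ mwt + race + smoke + ptlp1 + ht,
               family = binomial, data = sim)
lrt <- anova(reduced, full, test = "Chisq")

# Step 7: average marginal effect of mwt (hypothetical helper)
ame_mwt <- function(fit) {
  p <- fitted(fit)                      # fitted probabilities
  mean(coef(fit)["mwt"] * p * (1 - p))  # mean of beta * p * (1 - p)
}
ame <- ame_mwt(full)

# Bootstrap standard error: resample rows, refit, recompute the AME
B <- 200
boot_ame <- replicate(B, {
  idx <- sample(nrow(sim), replace = TRUE)
  refit <- glm(formula(full), family = binomial, data = sim[idx, ])
  ame_mwt(refit)
})
se_ame <- sd(boot_ame)
```

With the real data, sim would be replaced by bw (with race converted to a factor). The value ame is then read as the average change in the probability of low birth weight per additional pound of pre-pregnancy weight.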
2 ASSESSED EXERCISE
The data set smoke.csv contains a sample of individual level data on smoking status for
individuals that are in work, and work area smoking bans from the US. The available
variables are:
• smoker: 1 if current smoker, 0 otherwise
• ban: 1 if there is a work area smoking ban, 0 otherwise
• age: age in years
• age2: age-squared
• education: highest education level attained: 1 high school (hs) drop out, 2 hs
graduate, 3 some college, 4 college graduate, 5 master’s degree or higher
• aahisp: 1 if African-American or Hispanic, 0 otherwise
• female: 1 if female, 0 otherwise
Investigate and write a report on how smoking status depends on the other variables. The
main goal here is to obtain a suitable interpretable model and to give a full interpretation
of that model.
1. Perform an exploratory analysis of the data and summarise your findings. As well
as producing suitable plots that examine the relationship between the smoking
status and the available explanatory variables, you may also wish to consider some
2. Model the relation between smoking status and variables that are available using
the appropriate GLM with canonical link. Do not consider all possible interactions,
but only interactions of the aahisp indicator and the other variables, and include
the education levels as factors. Using appropriate tests, carry out model selection to
examine the relationship between the possible explanatory variables and smoking status.
3. Assess the quality of the model fit using suitable methods.
4. Interpret your final model carefully. In particular, present and interpret the estimated
effects on smoking status of each of the variables included in your final model. Include
95% confidence intervals for the effects.
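The mechanics of steps 2 and 4 can be sketched as below, under stated assumptions and not as a model answer. smoke.csv is not reproduced here, so the code simulates a placeholder data frame sm with the same column names; the simulated values are arbitrary and carry no information about the real data. The sketch fits a binomial GLM with logit link, compares a main-effects model with one that adds the aahisp interactions using a likelihood ratio test, and reports odds ratios with Wald 95% confidence intervals from confint.default(). Profile-likelihood intervals from confint() would be an alternative.

```r
set.seed(2)
n <- 1000

# Simulated stand-in for smoke.csv (assumption: arbitrary values, NOT the real data)
sm <- data.frame(
  ban       = rbinom(n, 1, 0.6),
  age       = sample(18:65, n, replace = TRUE),
  education = sample(1:5, n, replace = TRUE),
  aahisp    = rbinom(n, 1, 0.2),
  female    = rbinom(n, 1, 0.5)
)
sm$age2   <- sm$age^2
sm$smoker <- rbinom(n, 1, plogis(-1 - 0.3 * sm$ban))

# Main-effects model; education enters as a factor, not as a numeric score
main <- glm(smoker ~ ban + age + age2 + factor(education) + female + aahisp,
            family = binomial, data = sm)

# Add interactions of aahisp with every other variable (and no others)
inter <- glm(smoker ~ (ban + age + age2 + factor(education) + female) * aahisp,
             family = binomial, data = sm)

# Likelihood ratio test of all aahisp interactions jointly
lrt <- anova(main, inter, test = "Chisq")

# Odds ratios with Wald 95% confidence intervals
or_ci <- cbind(OR = exp(coef(inter)), exp(confint.default(inter)))

# With an interaction present, the ban effect for aahisp = 1 is the sum of
# two coefficients, so its standard error needs their covariance as well
b <- coef(inter)
V <- vcov(inter)
est <- b["ban"] + b["ban:aahisp"]
se  <- sqrt(V["ban", "ban"] + V["ban:aahisp", "ban:aahisp"] +
            2 * V["ban", "ban:aahisp"])
or_ban_aahisp <- exp(est + c(-1, 0, 1) * qnorm(0.975) * se)  # lower, OR, upper
```

An odds ratio below 1 for ban would indicate lower odds of being a current smoker where a work area ban is in place, holding the other variables fixed. When interactions are retained, effects should be reported separately for aahisp = 0 and aahisp = 1.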