This assignment consists of four sections. Section A requires you to work in Python, while Sections B and C have R-based exercises. Section D is a narrated PowerPoint presentation on Section C, question 4.
This part of the coursework involves Python code, written summaries, and figures.
A useful environment to work on these questions and submit your answers is a Jupyter notebook. If you use this format, submit the notebook file, and (optionally, but highly recommended) an export of your notebook to either pdf or html format. If we end up not being able to open your notebook, and don’t have such an export to look at, we cannot assign any marks.
Using a Jupyter notebook is not strictly necessary. You may also submit a Python source code file (plain text) that contains your Python code, your answers to the questions as commented lines in the code, and any figures as separate files with filenames indicating the question they answer.
Do not paste your code into a Word document or similar, because the formatting will likely alter the indentation and introduce errors in the code. For the submitted code and written answers, it is important that all text can be highlighted, copied and pasted from your submission. Do not submit screenshots of your code and answers.
Always state your answers in complete sentences, not just as the output of Python commands. Indicate very clearly which question you are answering, using an appropriate markdown heading or otherwise making it clear to the marker, for example:
Answer to question 1.2: The answer is 3.
Also clearly indicate the code chunks that you used to answer each question.
You may not use any programming language other than Python to answer the questions in this part (calling R from within Python counts as “using another language”).
(i) Download the csv file “accidents2019.csv” from the “week 1” tile of the MTHM503 ELE page (on ELE, the file is titled “UK Accidents 2019 data (accidents2019.csv)”; it is the same file that was used in one of the exercises). You don’t have to use a Python command to download the data.
The data was originally downloaded from the UK Department for Transport Road Safety database (“Road Safety Data – Accidents 2019”) (LINK), where further information about the data is available. Read the csv file into a pandas data frame named accidents, and display the first 2 lines of the data frame.
Unless stated otherwise, all of the following questions related to traffic accidents are to be answered by analysing the data frame accidents.
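Reading a csv file into a pandas data frame and showing its first rows, as described above, can be sketched as follows. This is only an illustration on a tiny in-memory stand-in for the real file; the column names and values are assumptions made for the example and are not taken from “accidents2019.csv”.

```python
import io

import pandas as pd

# Hypothetical stand-in for the contents of "accidents2019.csv".
# The real file has many more rows and columns; these names are assumptions.
csv_text = """Accident_Index,Longitude,Latitude,Number_of_Casualties
A001,-0.15,51.50,1
A002,-3.53,50.72,2
A003,-1.90,52.48,1
"""

# With the real file on disk this would be pd.read_csv("accidents2019.csv").
accidents = pd.read_csv(io.StringIO(csv_text))

# head(2) returns the first two rows of the data frame.
first_two = accidents.head(2)
print(first_two)
```

In a Jupyter notebook, leaving `accidents.head(2)` as the last expression of a cell displays the rows as a formatted table without `print`.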
(i) How many accidents happened in 2019?
(ii) What are the column names in the data frame?
(iii) Report the date (day/month/year) and coordinates (longitude and latitude) of the accident in the 100th row of the data frame. (The row at the top of the data frame counts as the 1st row.)
(iv) Do any of the columns contain information about the type of vehicle involved in the accident?
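Because pandas counts positions from zero while the questions above count rows from one, an off-by-one error is easy to make. A minimal sketch on a toy data frame (the column name `value` is hypothetical) shows the row count, the column names, and positional access to the 100th row:

```python
import pandas as pd

# Toy data frame with 150 rows; the column name "value" is hypothetical.
df = pd.DataFrame({"value": range(1000, 1150)})

# Number of rows and the list of column names.
n_rows = len(df)
column_names = list(df.columns)

# pandas positions are zero-based, so the 100th row sits at position 99.
row_100 = df.iloc[99]
print(n_rows, column_names, row_100["value"])
```

`iloc` selects by position regardless of the index labels, which makes it the safer choice when the question refers to "the n-th row".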
(i) Calculate the total number of casualties.
(ii) What is the difference in the number of casualties between the Lower Layer Super Output Areas (LSOAs) E01032739 (City of London 001F) and E01033708 (Hackney 027G)?
(iii) Which LSOA saw the highest total number of casualties in 2019?
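Totals per group, such as the casualty counts per area asked for above, are typically computed with a group-by followed by a sum. The sketch below uses a toy data frame; the column names `area` and `casualties` and all values are assumptions for illustration, not the names used in the accidents data.

```python
import pandas as pd

# Toy data; "area" and "casualties" are hypothetical column names.
toy = pd.DataFrame({
    "area": ["X", "Y", "X", "Z", "Y", "X"],
    "casualties": [1, 2, 3, 1, 1, 2],
})

# Overall total across all rows.
total = toy["casualties"].sum()

# Total per area: a Series indexed by the area label.
per_area = toy.groupby("area")["casualties"].sum()

# Difference between two specific areas, and the area with the largest total.
difference = per_area["X"] - per_area["Y"]
worst_area = per_area.idxmax()
print(total, difference, worst_area)
```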
(i) What are the possible values that occur in the column Did_Police_Officer_Attend_Scene_of_Accident, and how often does each value occur?
(ii) The value Did_Police_Officer_Attend_Scene_of_Accident = 1 indicates that a police officer attended the accident. What do the other value(s) in that column mean? (Consult the data web site to answer this question.)
(iii) What fraction of accidents was attended by a police officer?
(iv) What fraction of those accidents that happened on a weekday was attended by a police officer?
How does this number compare to the corresponding fraction for accidents on weekends?
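Counting how often each value occurs, and computing a fraction within a subset of rows, can be sketched as below. The toy columns `attended` and `day` are hypothetical; in the real data the day of the week may be stored as a numeric code, so the coding should be checked in the data documentation before defining the weekend.

```python
import pandas as pd

# Toy data; "attended" and "day" are hypothetical column names.
toy = pd.DataFrame({
    "attended": [1, 1, 2, 1, 2, 1, 1, 2],
    "day": ["Mon", "Sat", "Tue", "Sun", "Sat", "Wed", "Fri", "Sun"],
})

# Frequency of each distinct value in the column.
counts = toy["attended"].value_counts()

# The mean of a boolean Series is the fraction of True values.
frac_attended = (toy["attended"] == 1).mean()

# Split into weekend and weekday rows with a boolean mask.
is_weekend = toy["day"].isin(["Sat", "Sun"])
frac_weekday = (toy.loc[~is_weekend, "attended"] == 1).mean()
frac_weekend = (toy.loc[is_weekend, "attended"] == 1).mean()
print(counts, frac_attended, frac_weekday, frac_weekend)
```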
(i) Visualise the locations of all accidents by a scatter plot of Latitude vs Longitude. Annotate the axes, add a plot title, and increase the figure size to 10in by 10in.
(ii) Create a scatter plot similar to the one in the previous question 5(i), but zoom in on a 2 by 2 degree area that includes Exeter/Devon, and use different colours for accidents that happened in rural areas and in urban areas. Include a red marker that indicates the coordinates of Exeter/Devon (as per Wikipedia).
(iii) Are accidents at higher speed limits more likely to be fatal than at lower speed limits? Answer the question with an appropriate data visualisation and a short written summary.
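The plotting steps in parts (i) and (ii) above (figure size, axis labels, title, zooming, colouring by group, and a highlighted marker) can be sketched with matplotlib on random toy points. The column names, the colour choices and the reference coordinates below are all assumptions for illustration; they are not the Exeter coordinates and not the columns of the accidents data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random toy points; "lon", "lat" and "area_type" are hypothetical names.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "lon": rng.uniform(-5.0, -2.0, 200),
    "lat": rng.uniform(50.0, 52.0, 200),
    "area_type": rng.choice(["urban", "rural"], 200),
})

# figsize is given in inches as (width, height).
fig, ax = plt.subplots(figsize=(10, 10))

# One scatter call per group gives each group its own colour and legend entry.
colors = {"urban": "tab:blue", "rural": "tab:green"}
for label, group in toy.groupby("area_type"):
    ax.scatter(group["lon"], group["lat"], s=10, color=colors[label], label=label)

# Hypothetical reference point, drawn as a large red marker.
ref_lon, ref_lat = -3.5, 50.7
ax.scatter([ref_lon], [ref_lat], color="red", marker="*", s=200, label="reference point")

# Zoom to a 2 by 2 degree window centred on the reference point.
ax.set_xlim(ref_lon - 1, ref_lon + 1)
ax.set_ylim(ref_lat - 1, ref_lat + 1)

ax.set_xlabel("Longitude (degrees)")
ax.set_ylabel("Latitude (degrees)")
ax.set_title("Toy locations, coloured by area type")
ax.legend()
```

In a Jupyter notebook the `matplotlib.use("Agg")` line can be dropped, since the figure is rendered inline.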
Gym exercise and physiology data
To complete the following questions, you have to load the “Linnerud physical exercises data” from the scikit-learn package using the following commands:
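A minimal sketch of loading this data set, assuming the standard scikit-learn loader `load_linnerud` with `as_frame=True` (which returns pandas data frames), is shown here. The variable names `linnerud`, `exercise` and `physiology` are chosen for illustration only.

```python
from sklearn.datasets import load_linnerud

# as_frame=True returns the two tables as pandas DataFrames.
linnerud = load_linnerud(as_frame=True)

# Exercise variables: Chins, Situps, Jumps.
exercise = linnerud.data

# Physiological variables: Weight, Waist, Pulse.
physiology = linnerud.target

# DESCR holds the data set description that ships with scikit-learn.
print(linnerud.DESCR[:300])
```

Note that scikit-learn stores the exercise variables under `data` and the physiological variables under `target`, so a model that predicts an exercise variable from a physiological one has to pick its columns from both tables.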
(i) State the author’s last name and year of the study in which that data first appeared.
(ii) Using the appropriate function from the scikit-learn package, fit a simple linear regression model with number of chin-ups (“Chins”) as the target variable, and “Weight” as the covariate. Report the fitted regression coefficients, and interpret the slope coefficient.
(iii) Your lecturer is a middle-aged male with a weight of 170 pounds (“Weight”), a 32 inch waist size (“Waist”), and a resting heart rate of 70 (“Pulse”). How many chin-ups (“Chins”) do you think he can do? (Use linear regression in scikit-learn on the Linnerud data to answer the question.)
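The mechanics of fitting a linear regression in scikit-learn, reading off the coefficients, and predicting for a new observation can be sketched on synthetic data. The numbers below are made up so that the relationship is exactly linear; they are unrelated to the Linnerud data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data following y = 2 + 3 * x exactly (illustration only).
x = np.array([0.0, 1.0, 2.0, 3.0, 5.0])
y = 2.0 + 3.0 * x

# scikit-learn expects a 2-D feature array of shape (n_samples, n_features).
X = x.reshape(-1, 1)

model = LinearRegression().fit(X, y)

# intercept_ is a scalar; coef_ holds one slope per feature.
intercept = model.intercept_
slope = model.coef_[0]

# Predict for a new observation; the input must again be 2-D.
prediction = model.predict(np.array([[4.0]]))[0]
print(intercept, slope, prediction)
```

The slope is the expected change in the target for a one-unit increase in the covariate. With several covariates, `X` simply gets one column per covariate and `coef_` one entry per column.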
Imagine you are in a data science consulting role. You are tasked with automating the analyses in part A so that they browse and run accurately across a socio-economic database (NetCDF, csv and Excel files, including missing data cells) for multiple regions and countries. Write a 200-word paragraph explaining how you could automate the analysis, give an example of this, and describe in words the quantitative summary information you would provide in a two-page overview.
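One possible shape for such an automation is a loader that dispatches on the file extension, a summary function applied to every file, and a loop that collects the results into one table. The sketch below is an illustration under assumptions: the function names `load_table` and `summarise`, the file names and the toy contents are all hypothetical, only csv files are actually exercised, and NetCDF support (which would need an additional library) is left out.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_table(path):
    """Load one file into a DataFrame, choosing the reader by extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix in {".xlsx", ".xls"}:
        # Needs an Excel engine (for example openpyxl) to be installed.
        return pd.read_excel(path)
    raise ValueError(f"Unsupported file type: {suffix}")


def summarise(df):
    """Return a few summary numbers, including the count of missing cells."""
    return {
        "rows": len(df),
        "columns": df.shape[1],
        "missing_cells": int(df.isna().sum().sum()),
    }


# Write two small hypothetical regional files, then process them in a loop.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "region_a.csv").write_text("gdp,population\n1.0,10\n,20\n")
    (tmp / "region_b.csv").write_text("gdp,population\n2.5,\n3.0,\n4.0,5\n")
    report = pd.DataFrame(
        {p.stem: summarise(load_table(p)) for p in sorted(tmp.glob("*.csv"))}
    ).T

# One row per region, one column per summary quantity.
print(report)
```

Keeping loading and summarising in separate functions means a new file type or a new summary quantity can be added without touching the loop.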
THE REST OF THE QUESTIONS SHOULD BE SOLVED USING R