Exploring temperature and rainfall in Australia
Due date: Oct 1 at 11:59pm
The goal of this assignment is to get used to designing and writing functions along with practicing our tidyverse skills that we learned in our previous module. Writing functions involves thinking about how code should be divided up and what the interface/arguments should be. In addition, you need to think about what the function will return as output.
Please write up your project using R Markdown and processed with knitr
. Compile your document as an HTML file and submit your HTML file to the dropbox on Courseplus. Please show all your code (i.e. make sure to set echo = TRUE
) for each of the answers to each part.
Before attempting this assignment, you should first install the following packages, if they are not already installed:
install.packages("tidyverse")
install.packages("tidytuesdayR")
In this part, we are going to practice creating functions.
The exponential of a number can be written as an infinite series expansion of the form \[ \exp(x) = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots \] Of course, we cannot compute an infinite series by the end of this term and so we must truncate it at a certain point in the series. The truncated sum of terms represents an approximation to the true exponential, but the approximation may be usable.
Write a function that computes the exponential of a number using the truncated series expansion. The function should take two arguments:
x
: the number to be exponentiated
k
: the number of terms to be used in the series expansion beyond the constant 1. The value of k
is always \(\geq 1\).
For example, if \(k = 1\), then the Exp
function should return the number \(1 + x\). If \(k = 2\), then you should return the number \(1 + x + x^2/2!\).
You can assume that the input value x
will always be a single number.
You can assume that the value k
will always be an integer \(\geq 1\).
Do not use the exp()
function in R.
The factorial()
function can be used to compute factorials.
Exp <- function(x, k) {
# Add your solution here
}
Next, write two functions called sample_mean()
and sample_sd()
that takes as input a vector of data of length \(N\) and calculates the sample average and sample standard deviation for the set of \(N\) observations.
\[ \bar{x} = \frac{1}{N} \sum_{i=1}^n x_i \] \[ s = \sqrt{\frac{1}{N-1} \sum_{i=1}^N (x_i - \overline{x})^2} \]
You can assume that the input value x
will always be a vector of numbers of length \(N\).
Do not use the mean()
and sd()
functions in R.
sample_mean <- function(x) {
# Add your solution here
}
sample_sd <- function(x) {
# Add your solution here
}
Next, write a function called calculate_CI()
that:
Has two inputs to the calculate_CI()
function. First, it should take as input a vector of data of length \(N\). Second, the function should also have a conf
(\(=1-\alpha\)) argument that allows the confidence interval to be adapted for different \(\alpha\).
Calculates a confidence interval (CI) (e.g. a 95% CI) for the estimate of the mean in the population. If you are not familiar with confidence intervals, it is an interval that contains the population parameter with probability \(1-\alpha\) taking on this form
\[ \bar{x} \pm t_{\alpha/2, N-1} * s_{\bar{x}} \]
where \(t_{\alpha/2, N-1}\) is the value needed to generate an area of \(\alpha / 2\) in each tail of the \(t\)-distribution with \(N-1\) degrees of freedom and \(s_{\bar{x}} = \frac{s}{\sqrt{N}}\) is the standard error of the mean. For example, if we pick a 95% confidence interval and \(N\)=50, then you can calculate \(t_{\alpha/2, N-1}\) as
alpha <- 1 - 0.95
degrees_freedom = 50 - 1
t_score = qt(p=alpha/2, df=degrees_freedom, lower.tail=FALSE)
lower_bound
, the second value is the upper_bound
.calculate_CI <- function(x, conf = 0.95) {
# Add your solution here
}
You can assume that the input value x
will always be a vector of numbers of length \(N\).
Do not use the confint()
or other similar functions in R.
However, to help you check if your function output matches the correct answer, assume there exists a vector \(x\) of length \(N\) and see if the following two code chunks match.
calculate_CI(x, conf = 0.95)
dat = data.frame(x=x)
fit <- lm(x ~ 1, dat)
# Calculate a 95% confidence interval
confint(fit, level=0.95)
In this part, we will practice our wrangling skills with the tidyverse that we learned about in module 1.
The two datasets for this part of the assignment comes from TidyTuesday. Specifically, we will use the following data from January 2020, which I have provided for you below:
tuesdata <- tidytuesdayR::tt_load('2020-01-07')
rainfall <- tuesdata$rainfall
temperature <- tuesdata$temperature
Note: A useful way to avoid re-downloading data is to write code to check to see if those files already exist using and if()
statement:
library(here)
if(!file.exists(here("data","tuesdata_rainfall.RDS"))){
tuesdata <- tidytuesdayR::tt_load('2020-01-07')
rainfall <- tuesdata$rainfall
temperature <- tuesdata$temperature
# save the files to RDS objects
saveRDS(tuesdata$rainfall, file= here("data","tuesdata_rainfall.RDS"))
saveRDS(tuesdata$temperature, file= here("data","tuesdata_temperature.RDS"))
}
Note the above code will only run if it cannot find the path to the tuesdata_rainfall.RDS
on your computer. Then, we can just read in these files every time we knit the R Markdown, instead of redownloading them every time.
Now we can look at the data with glimpse()
Rows: 179,273
Columns: 11
$ station_code <chr> "009151", "009151", "009151", "009151", "009151…
$ city_name <chr> "Perth", "Perth", "Perth", "Perth", "Perth", "P…
$ year <dbl> 1967, 1967, 1967, 1967, 1967, 1967, 1967, 1967,…
$ month <chr> "01", "01", "01", "01", "01", "01", "01", "01",…
$ day <chr> "01", "02", "03", "04", "05", "06", "07", "08",…
$ rainfall <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ period <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ quality <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ lat <dbl> -31.96, -31.96, -31.96, -31.96, -31.96, -31.96,…
$ long <dbl> 115.79, 115.79, 115.79, 115.79, 115.79, 115.79,…
$ station_name <chr> "Subiaco Wastewater Treatment Plant", "Subiaco …
glimpse(temperature)
Rows: 528,278
Columns: 5
$ city_name <chr> "PERTH", "PERTH", "PERTH", "PERTH", "PERTH", "PE…
$ date <date> 1910-01-01, 1910-01-02, 1910-01-03, 1910-01-04,…
$ temperature <dbl> 26.7, 27.0, 27.5, 24.0, 24.8, 24.4, 25.3, 28.0, …
$ temp_type <chr> "max", "max", "max", "max", "max", "max", "max",…
$ site_name <chr> "PERTH AIRPORT", "PERTH AIRPORT", "PERTH AIRPORT…
If we look at the TidyTuesday github repo from 2020, we see this dataset contains temperature and rainfall data from Australia.
Here is a data dictionary for what all the column names mean:
Using the rainfall
and temperature
data, perform the following steps and create a new data frame called df
:
rainfall
dataset and drop any rows with NAs.date
that combines the columns year
, month
, day
into one column separated by “-”. (e.g. “2020-01-01”). This column should not be a character, but should be recognized as a date. (Hint check out the ymd()
function in lubridate
R package). You will also want to add a column that just keeps the year
.city_name
column, convert the city names (character strings) to all upper case.temperature
dataset such that it includes only observations that are in both data frames. (Hint there are two keys that you will need to join the two datasets together). (Hint: If all has gone well thus far, you should have a dataset with 83,964 rows and 13 columns).# Add your solution here
drop_na()
from tidyr
and str_to_upper()
function from stringr
useful.In this part, we will practice our ggplot2
plotting skills within the tidyverse that we learned about in module 1 starting with our wrangled df
data from Part 2. For full credit in this part (and for all plots that you make), your plots should include:
Consider playing around with the theme()
function to make the figure shine, including playing with background colors, font, etc.
Use the functions in ggplot2
package to make a line plot of the max and min temperature (y-axis) over time (x-axis) for each city in our wrangled data from Part 2. You should only consider years 2014 and onwards. For full credit, your plot should include:
city_name
to show all cities in one figure.# Add your solution here
Here we want to explore the distribution of rainfall (log scale) with histograms for a given city (indicated by the city_name
column) for a given year (indicated by the year
column) so we can make some exploratory plots of the data. Note: you are again using the wrangled data from Part 2.
The following code plots the data from one city (city_name == "PERTH"
) in a given year (year == 2000
).
While this code is useful, it only provides us information on one city in one year. We could cut and paste this code to look at other cities/years, but that can be error prone and just plain messy.
The aim here is to design and implement a function that can be re-used to visualize all of the data in this dataset.
There are 2 aspects that may vary in the dataset: The city_name and the year. Note that not all combinations of city_name
and year
have measurements.
Your function should take as input two arguments city_name and year.
Given the input from the user, your function should return a single histogram for that input. Furthermore, the data should be readable on that plot so that it is in fact useful. It should be possible visualize the entire dataset with your function (through repeated calls to your function).
If the user enters an input that does not exist in the dataset, your function should catch that and report an error (via the stop()
function).
For your homework submission
Write a short description of how you chose to design your function and why.
Present the code for your function in the R markdown document.
Include at least one example of output from your function.
# Add your solution here
In this part, we will apply the functions we wrote in Part 1 to our rainfall data starting with our wrangled df
data from Part 2.
sample_mean()
), the sample standard deviation (using your function sample_sd()
), and a 95% confidence interval for the average rainfall (using your function calculate_CI()
). Specifically, you should add two columns in this summarized dataset: a column titled lower_bound
and a column titled upper_bound
containing the lower and upper bounds for you CI that you calculated (using your function calculate_CI()
).rain_df
.# Add your solution here
Using the rain_df
, plots the estimates of mean rainfall and the 95% confidence intervals on the same plot. There should be a separate faceted plot for each city. Think about using ggplot()
with both geom_point()
(and geom_line()
to connect the points) for the means and geom_errorbar()
for the lower and upper bounds of the confidence interval.
# Add your solution here
Text and figures are licensed under Creative Commons Attribution CC BY-NC-SA 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Hicks (2021, Sept. 14). Statistical Computing: Project 2. Retrieved from https://stephaniehicks.com/jhustatcomputing2021/projects/2021-09-14-project-2/
BibTeX citation
@misc{hicks2021project, author = {Hicks, Stephanie}, title = {Statistical Computing: Project 2}, url = {https://stephaniehicks.com/jhustatcomputing2021/projects/2021-09-14-project-2/}, year = {2021} }