4 Workshop
4.1 Overview
In this workshop, you will explore Spotify songs!
Please write up your solution using R Markdown and knitr. Show all of the code you used to answer each part.
At the end of the workshop, we will go over the answers.
4.2 Data
The data for this part of the assignment comes from TidyTuesday, which is a weekly podcast and global community activity brought to you by the R4DS Online Learning Community. The goal of TidyTuesday is to help R learners learn in real-world contexts.
[Source: TidyTuesday]
To access the data, you need to install the tidytuesdayR R package and use the function tt_load() with the date '2020-01-21' to load the data.
install.packages("tidytuesdayR")
This is how you can download the data.
tuesdata <- tidytuesdayR::tt_load('2020-01-21')
spotify_songs <- tuesdata$spotify_songs
However, if you use this code, you will hit an API limit after compiling the document a few times. Instead, I suggest you use the code below, which downloads the data only once and saves a local copy:
library(here)
library(tidyverse)
# tests if a directory named "data" exists locally
if(!dir.exists(here("data"))) { dir.create(here("data")) }
# saves data only once (not each time you knit an R Markdown document)
if(!file.exists(here("data","spotify_songs.RDS"))) {
  url_csv <- 'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv'
  spotify_songs <- readr::read_csv(url_csv)
  # save the data as an RDS file
  saveRDS(spotify_songs, file = here("data","spotify_songs.RDS"))
}
Here we read in the .RDS dataset locally from our computing environment:
spotify_songs <- readRDS(here("data","spotify_songs.RDS"))
as_tibble(spotify_songs)
# A tibble: 32,833 × 23
track_id track_name track_artist track_popularity track_album_id
<chr> <chr> <chr> <dbl> <chr>
1 6f807x0ima9a1j3VPbc7… I Don't C… Ed Sheeran 66 2oCs0DGTsRO98…
2 0r7CVbZTWZgbTCYdfa2P… Memories … Maroon 5 67 63rPSO264uRjW…
3 1z1Hg7Vb0AhHDiEmnDE7… All the T… Zara Larsson 70 1HoSmj2eLcsrR…
4 75FpbthrwQmzHlBJLuGd… Call You … The Chainsm… 60 1nqYsOef1yKKu…
5 1e8PAfcKUYoKkxPhrHqw… Someone Y… Lewis Capal… 69 7m7vv9wlQ4i0L…
6 7fvUMiyapMsRRxr07cU8… Beautiful… Ed Sheeran 67 2yiy9cd2QktrN…
7 2OAylPUDDfwRGfe0lYql… Never Rea… Katy Perry 62 7INHYSeusaFly…
8 6b1RNvAcJjQH73eZO4BL… Post Malo… Sam Feldt 69 6703SRPsLkS4b…
9 7bF6tCO3gFb8INrEDcjN… Tough Lov… Avicii 68 7CvAfGvq4RlIw…
10 1IXGILkPm0tOCNeq00kC… If I Can'… Shawn Mendes 67 4QxzbfSsVryEQ…
# ℹ 32,823 more rows
# ℹ 18 more variables: track_album_name <chr>, track_album_release_date <chr>,
# playlist_name <chr>, playlist_id <chr>, playlist_genre <chr>,
# playlist_subgenre <chr>, danceability <dbl>, energy <dbl>, key <dbl>,
# loudness <dbl>, mode <dbl>, speechiness <dbl>, acousticness <dbl>,
# instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>,
# duration_ms <dbl>
We can take a glimpse at the data:
glimpse(spotify_songs)
Rows: 32,833
Columns: 23
$ track_id <chr> "6f807x0ima9a1j3VPbc7VN", "0r7CVbZTWZgbTCYdfa…
$ track_name <chr> "I Don't Care (with Justin Bieber) - Loud Lux…
$ track_artist <chr> "Ed Sheeran", "Maroon 5", "Zara Larsson", "Th…
$ track_popularity <dbl> 66, 67, 70, 60, 69, 67, 62, 69, 68, 67, 58, 6…
$ track_album_id <chr> "2oCs0DGTsRO98Gh5ZSl2Cx", "63rPSO264uRjW1X5E6…
$ track_album_name <chr> "I Don't Care (with Justin Bieber) [Loud Luxu…
$ track_album_release_date <chr> "2019-06-14", "2019-12-13", "2019-07-05", "20…
$ playlist_name <chr> "Pop Remix", "Pop Remix", "Pop Remix", "Pop R…
$ playlist_id <chr> "37i9dQZF1DXcZDD7cfEKhW", "37i9dQZF1DXcZDD7cf…
$ playlist_genre <chr> "pop", "pop", "pop", "pop", "pop", "pop", "po…
$ playlist_subgenre <chr> "dance pop", "dance pop", "dance pop", "dance…
$ danceability <dbl> 0.748, 0.726, 0.675, 0.718, 0.650, 0.675, 0.4…
$ energy <dbl> 0.916, 0.815, 0.931, 0.930, 0.833, 0.919, 0.8…
$ key <dbl> 6, 11, 1, 7, 1, 8, 5, 4, 8, 2, 6, 8, 1, 5, 5,…
$ loudness <dbl> -2.634, -4.969, -3.432, -3.778, -4.672, -5.38…
$ mode <dbl> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, …
$ speechiness <dbl> 0.0583, 0.0373, 0.0742, 0.1020, 0.0359, 0.127…
$ acousticness <dbl> 0.10200, 0.07240, 0.07940, 0.02870, 0.08030, …
$ instrumentalness <dbl> 0.00e+00, 4.21e-03, 2.33e-05, 9.43e-06, 0.00e…
$ liveness <dbl> 0.0653, 0.3570, 0.1100, 0.2040, 0.0833, 0.143…
$ valence <dbl> 0.518, 0.693, 0.613, 0.277, 0.725, 0.585, 0.1…
$ tempo <dbl> 122.036, 99.972, 124.008, 121.956, 123.976, 1…
$ duration_ms <dbl> 194754, 162600, 176616, 169093, 189052, 16304…
For all of the questions below, you can ignore the missing values in the dataset. For example, when taking an average, remove the missing values first if needed.
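As a minimal sketch of what removing missing values before averaging can look like, the example below uses the na.rm argument of mean() inside summarise(). The toy tibble and its values are made up for illustration and are not taken from spotify_songs.

```r
library(dplyr)

# Toy data with one missing value (hypothetical, not from spotify_songs)
toy <- tibble(
  genre  = c("a", "a", "b", "b"),
  energy = c(0.5, NA, 0.2, 0.4)
)

toy_means <- toy %>%
  group_by(genre) %>%
  summarise(
    # Without na.rm, any group containing an NA has an NA mean
    mean_with_na = mean(energy),
    # With na.rm = TRUE, missing values are dropped before averaging
    mean_no_na = mean(energy, na.rm = TRUE)
  )

toy_means
```

In this toy example, group "a" has a missing mean in the first column and a usable mean in the second, because only the second call drops the NA.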
4.3 Tasks
Use functions from dplyr and ggplot2 to answer the following questions.
- How many songs are in each genre?
# Add your solution here
- What is the average value of energy and acousticness in the latin genre in this dataset?
# Add your solution here
- Calculate the average duration of songs (in minutes) for each subgenre. Which subgenre has the longest songs on average?
# Add your solution here
- Make two boxplots side-by-side of the danceability of songs, stratifying by whether a song has a fast or slow tempo. Define fast tempo as any song that has a tempo above its median value. On average, which songs are more danceable?
Hint: You may find the case_when() function useful in this part, which can be used to map values from one variable to different values in a new variable (when used in a mutate() call).
## Generate some random numbers
dat <- tibble(x = rnorm(100))
slice(dat, 1:3)
# A tibble: 3 × 1
x
<dbl>
1 0.230
2 -0.0754
3 -0.536
## Create a new column that indicates whether the value of 'x' is positive or negative
dat %>%
  mutate(is_positive = case_when(
    x >= 0 ~ "Yes",
    x < 0 ~ "No"
  ))
# A tibble: 100 × 2
x is_positive
<dbl> <chr>
1 0.230 Yes
2 -0.0754 No
3 -0.536 No
4 -3.36 No
5 0.491 Yes
6 1.43 Yes
7 -0.0423 No
8 -0.339 No
9 0.201 Yes
10 0.192 Yes
# ℹ 90 more rows
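One caveat about the hint above: if a value matches none of the conditions (for example, because it is missing), case_when() returns NA for that row. The sketch below illustrates this on a made-up tibble, and shows one possible way to label such rows with a final catch-all condition. The column name is_positive_labeled and the label "Unknown" are hypothetical choices for this illustration.

```r
library(dplyr)

# Toy data with one missing value (hypothetical)
toy <- tibble(x = c(1.5, -0.3, NA))

toy_flagged <- toy %>%
  mutate(
    # NA >= 0 and NA < 0 are both NA, so neither condition matches
    # and the result for that row is NA
    is_positive = case_when(
      x >= 0 ~ "Yes",
      x < 0 ~ "No"
    ),
    # A final TRUE condition catches every row not matched above
    is_positive_labeled = case_when(
      x >= 0 ~ "Yes",
      x < 0 ~ "No",
      TRUE ~ "Unknown"
    )
  )

toy_flagged
```

Whether to keep such rows as NA or give them an explicit label depends on the question being asked. For this workshop, missing values can simply be ignored.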
# Add your solution here