4 Workshop
4.1 Overview
In this workshop, you will explore Spotify songs!
Please write up your solution using R Markdown and knitr. Please show all of your code for the answer to each part.
At the end of the workshop, we will go over the answers.
4.2 Data
The data for this part of the assignment comes from TidyTuesday, which is a weekly podcast and global community activity brought to you by the R4DS Online Learning Community. The goal of TidyTuesday is to help R learners learn in real-world contexts.
[Source: TidyTuesday]
To access the data, you need to install the tidytuesdayR R package and use the function tt_load() with the date '2020-01-21' to load the data.
install.packages("tidytuesdayR")
This is how you can download the data.
tuesdata <- tidytuesdayR::tt_load('2020-01-21')
spotify_songs <- tuesdata$spotify_songs
However, if you use this code, you will hit an API limit after compiling the document a few times. Instead, I suggest you use the code below, which downloads the data only once and saves it locally:
library(here)
library(tidyverse)
# tests if a directory named "data" exists locally
if(!dir.exists(here("data"))) { dir.create(here("data")) }
# saves data only once (not each time you knit a R Markdown)
if(!file.exists(here("data","spotify_songs.RDS"))) {
  url_csv <- 'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv'
  spotify_songs <- readr::read_csv(url_csv)
  # save the data as an RDS object
  saveRDS(spotify_songs, file = here("data","spotify_songs.RDS"))
}
Here we read in the .RDS dataset locally from our computing environment:
spotify_songs <- readRDS(here("data","spotify_songs.RDS"))
as_tibble(spotify_songs)
# A tibble: 32,833 × 23
track_id track…¹ track…² track…³ track…⁴ track…⁵ track…⁶ playl…⁷ playl…⁸
<chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
1 6f807x0ima9a… I Don'… Ed She… 66 2oCs0D… I Don'… 2019-0… Pop Re… 37i9dQ…
2 0r7CVbZTWZgb… Memori… Maroon… 67 63rPSO… Memori… 2019-1… Pop Re… 37i9dQ…
3 1z1Hg7Vb0AhH… All th… Zara L… 70 1HoSmj… All th… 2019-0… Pop Re… 37i9dQ…
4 75FpbthrwQmz… Call Y… The Ch… 60 1nqYsO… Call Y… 2019-0… Pop Re… 37i9dQ…
5 1e8PAfcKUYoK… Someon… Lewis … 69 7m7vv9… Someon… 2019-0… Pop Re… 37i9dQ…
6 7fvUMiyapMsR… Beauti… Ed She… 67 2yiy9c… Beauti… 2019-0… Pop Re… 37i9dQ…
7 2OAylPUDDfwR… Never … Katy P… 62 7INHYS… Never … 2019-0… Pop Re… 37i9dQ…
8 6b1RNvAcJjQH… Post M… Sam Fe… 69 6703SR… Post M… 2019-0… Pop Re… 37i9dQ…
9 7bF6tCO3gFb8… Tough … Avicii 68 7CvAfG… Tough … 2019-0… Pop Re… 37i9dQ…
10 1IXGILkPm0tO… If I C… Shawn … 67 4Qxzbf… If I C… 2019-0… Pop Re… 37i9dQ…
# … with 32,823 more rows, 14 more variables: playlist_genre <chr>,
# playlist_subgenre <chr>, danceability <dbl>, energy <dbl>, key <dbl>,
# loudness <dbl>, mode <dbl>, speechiness <dbl>, acousticness <dbl>,
# instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>,
# duration_ms <dbl>, and abbreviated variable names ¹track_name,
# ²track_artist, ³track_popularity, ⁴track_album_id, ⁵track_album_name,
# ⁶track_album_release_date, ⁷playlist_name, ⁸playlist_id
We can take a glimpse at the data:
glimpse(spotify_songs)
Rows: 32,833
Columns: 23
$ track_id <chr> "6f807x0ima9a1j3VPbc7VN", "0r7CVbZTWZgbTCYdfa…
$ track_name <chr> "I Don't Care (with Justin Bieber) - Loud Lux…
$ track_artist <chr> "Ed Sheeran", "Maroon 5", "Zara Larsson", "Th…
$ track_popularity <dbl> 66, 67, 70, 60, 69, 67, 62, 69, 68, 67, 58, 6…
$ track_album_id <chr> "2oCs0DGTsRO98Gh5ZSl2Cx", "63rPSO264uRjW1X5E6…
$ track_album_name <chr> "I Don't Care (with Justin Bieber) [Loud Luxu…
$ track_album_release_date <chr> "2019-06-14", "2019-12-13", "2019-07-05", "20…
$ playlist_name <chr> "Pop Remix", "Pop Remix", "Pop Remix", "Pop R…
$ playlist_id <chr> "37i9dQZF1DXcZDD7cfEKhW", "37i9dQZF1DXcZDD7cf…
$ playlist_genre <chr> "pop", "pop", "pop", "pop", "pop", "pop", "po…
$ playlist_subgenre <chr> "dance pop", "dance pop", "dance pop", "dance…
$ danceability <dbl> 0.748, 0.726, 0.675, 0.718, 0.650, 0.675, 0.4…
$ energy <dbl> 0.916, 0.815, 0.931, 0.930, 0.833, 0.919, 0.8…
$ key <dbl> 6, 11, 1, 7, 1, 8, 5, 4, 8, 2, 6, 8, 1, 5, 5,…
$ loudness <dbl> -2.634, -4.969, -3.432, -3.778, -4.672, -5.38…
$ mode <dbl> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, …
$ speechiness <dbl> 0.0583, 0.0373, 0.0742, 0.1020, 0.0359, 0.127…
$ acousticness <dbl> 0.10200, 0.07240, 0.07940, 0.02870, 0.08030, …
$ instrumentalness <dbl> 0.00e+00, 4.21e-03, 2.33e-05, 9.43e-06, 0.00e…
$ liveness <dbl> 0.0653, 0.3570, 0.1100, 0.2040, 0.0833, 0.143…
$ valence <dbl> 0.518, 0.693, 0.613, 0.277, 0.725, 0.585, 0.1…
$ tempo <dbl> 122.036, 99.972, 124.008, 121.956, 123.976, 1…
$ duration_ms <dbl> 194754, 162600, 176616, 169093, 189052, 16304…
For all of the questions below, you can ignore the missing values in the dataset. For example, when taking averages, remove the missing values before computing the average, if needed.
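As a minimal sketch of what removing missing values before averaging can look like, the example below uses a small made-up tibble (the names toy, group, and value are hypothetical and are not part of the Spotify data). It assumes the na.rm = TRUE argument of mean() is the approach you want; other approaches, such as filtering out rows first, would also work.

```r
library(dplyr)

# Hypothetical toy data, NOT the Spotify dataset
toy <- tibble(
  group = c("a", "a", "b", "b"),
  value = c(1, NA, 3, 5)
)

# Compare the group means with and without dropping missing values
toy_means <- toy %>%
  group_by(group) %>%
  summarize(
    mean_with_na    = mean(value),               # NA if any value in the group is missing
    mean_without_na = mean(value, na.rm = TRUE)  # missing values dropped first
  )

toy_means
```

In group "a", the mean without na.rm is NA because one value is missing, while na.rm = TRUE averages only the observed values.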
4.3 Tasks
Use functions from dplyr and ggplot2 to answer the following questions.
- How many songs are in each genre?
# Add your solution here
- What is the average value of energy and acousticness in the latin genre in this dataset?
# Add your solution here
- Calculate the average duration of songs (in minutes) for each subgenre. Which subgenre has the longest songs on average?
# Add your solution here
- Make two side-by-side boxplots of the danceability of songs, stratifying by whether a song has a fast or slow tempo. Define a fast-tempo song as any song with a tempo above the median tempo. On average, which songs are more danceable?
Hint: You may find the case_when() function useful in this part, which can be used to map values from one variable to different values in a new variable (when used in a mutate() call).
## Generate some random numbers
dat <- tibble(x = rnorm(100))
slice(dat, 1:3)
# A tibble: 3 × 1
x
<dbl>
1 -0.825
2 1.26
3 -0.934
## Create a new column that indicates whether the value of 'x' is positive or negative
dat %>%
  mutate(is_positive = case_when(
    x >= 0 ~ "Yes",
    x < 0 ~ "No"
  ))
# A tibble: 100 × 2
x is_positive
<dbl> <chr>
1 -0.825 No
2 1.26 Yes
3 -0.934 No
4 -0.760 No
5 0.327 Yes
6 -0.282 No
7 -0.786 No
8 -2.43 No
9 1.13 Yes
10 -0.329 No
# … with 90 more rows
# Add your solution here