4  Workshop

4.1 Overview

In this workshop, you will explore Spotify songs!

Please write up your solution using R Markdown and knitr, and show all of the code you used to answer each part.

At the end of the workshop, we will go over the answers.

4.2 Data

The data for this workshop comes from TidyTuesday, which is a weekly podcast and global community activity brought to you by the R4DS Online Learning Community. The goal of TidyTuesday is to help R learners learn in real-world contexts.

[Figure: TidyTuesday icon. Source: TidyTuesday]

To access the data, install the tidytuesdayR R package and call the function tt_load() with the date '2020-01-21'.

install.packages("tidytuesdayR")

This is how you can download the data.

tuesdata <- tidytuesdayR::tt_load('2020-01-21')
spotify_songs <- tuesdata$spotify_songs

However, if you use this code, you may hit an API rate limit after compiling the document a few times. Instead, I suggest you use the code below, which downloads the data once and saves a local copy so that it is not re-downloaded every time you knit:

library(here)
library(tidyverse)

# create a local directory named "data" if it does not already exist
if (!dir.exists(here("data"))) {
  dir.create(here("data"))
}

# download and save the data only once (not each time you knit the R Markdown document)
if (!file.exists(here("data", "spotify_songs.RDS"))) {
  url_csv <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv"
  spotify_songs <- readr::read_csv(url_csv)

  # save the data frame as an RDS file
  saveRDS(spotify_songs, file = here("data", "spotify_songs.RDS"))
}

Here we read in the .RDS file from the local data directory:

spotify_songs <- readRDS(here("data","spotify_songs.RDS"))
as_tibble(spotify_songs)
# A tibble: 32,833 × 23
   track_id              track_name track_artist track_popularity track_album_id
   <chr>                 <chr>      <chr>                   <dbl> <chr>         
 1 6f807x0ima9a1j3VPbc7… I Don't C… Ed Sheeran                 66 2oCs0DGTsRO98…
 2 0r7CVbZTWZgbTCYdfa2P… Memories … Maroon 5                   67 63rPSO264uRjW…
 3 1z1Hg7Vb0AhHDiEmnDE7… All the T… Zara Larsson               70 1HoSmj2eLcsrR…
 4 75FpbthrwQmzHlBJLuGd… Call You … The Chainsm…               60 1nqYsOef1yKKu…
 5 1e8PAfcKUYoKkxPhrHqw… Someone Y… Lewis Capal…               69 7m7vv9wlQ4i0L…
 6 7fvUMiyapMsRRxr07cU8… Beautiful… Ed Sheeran                 67 2yiy9cd2QktrN…
 7 2OAylPUDDfwRGfe0lYql… Never Rea… Katy Perry                 62 7INHYSeusaFly…
 8 6b1RNvAcJjQH73eZO4BL… Post Malo… Sam Feldt                  69 6703SRPsLkS4b…
 9 7bF6tCO3gFb8INrEDcjN… Tough Lov… Avicii                     68 7CvAfGvq4RlIw…
10 1IXGILkPm0tOCNeq00kC… If I Can'… Shawn Mendes               67 4QxzbfSsVryEQ…
# ℹ 32,823 more rows
# ℹ 18 more variables: track_album_name <chr>, track_album_release_date <chr>,
#   playlist_name <chr>, playlist_id <chr>, playlist_genre <chr>,
#   playlist_subgenre <chr>, danceability <dbl>, energy <dbl>, key <dbl>,
#   loudness <dbl>, mode <dbl>, speechiness <dbl>, acousticness <dbl>,
#   instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>,
#   duration_ms <dbl>

We can take a glimpse at the data:

glimpse(spotify_songs)
Rows: 32,833
Columns: 23
$ track_id                 <chr> "6f807x0ima9a1j3VPbc7VN", "0r7CVbZTWZgbTCYdfa…
$ track_name               <chr> "I Don't Care (with Justin Bieber) - Loud Lux…
$ track_artist             <chr> "Ed Sheeran", "Maroon 5", "Zara Larsson", "Th…
$ track_popularity         <dbl> 66, 67, 70, 60, 69, 67, 62, 69, 68, 67, 58, 6…
$ track_album_id           <chr> "2oCs0DGTsRO98Gh5ZSl2Cx", "63rPSO264uRjW1X5E6…
$ track_album_name         <chr> "I Don't Care (with Justin Bieber) [Loud Luxu…
$ track_album_release_date <chr> "2019-06-14", "2019-12-13", "2019-07-05", "20…
$ playlist_name            <chr> "Pop Remix", "Pop Remix", "Pop Remix", "Pop R…
$ playlist_id              <chr> "37i9dQZF1DXcZDD7cfEKhW", "37i9dQZF1DXcZDD7cf…
$ playlist_genre           <chr> "pop", "pop", "pop", "pop", "pop", "pop", "po…
$ playlist_subgenre        <chr> "dance pop", "dance pop", "dance pop", "dance…
$ danceability             <dbl> 0.748, 0.726, 0.675, 0.718, 0.650, 0.675, 0.4…
$ energy                   <dbl> 0.916, 0.815, 0.931, 0.930, 0.833, 0.919, 0.8…
$ key                      <dbl> 6, 11, 1, 7, 1, 8, 5, 4, 8, 2, 6, 8, 1, 5, 5,…
$ loudness                 <dbl> -2.634, -4.969, -3.432, -3.778, -4.672, -5.38…
$ mode                     <dbl> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, …
$ speechiness              <dbl> 0.0583, 0.0373, 0.0742, 0.1020, 0.0359, 0.127…
$ acousticness             <dbl> 0.10200, 0.07240, 0.07940, 0.02870, 0.08030, …
$ instrumentalness         <dbl> 0.00e+00, 4.21e-03, 2.33e-05, 9.43e-06, 0.00e…
$ liveness                 <dbl> 0.0653, 0.3570, 0.1100, 0.2040, 0.0833, 0.143…
$ valence                  <dbl> 0.518, 0.693, 0.613, 0.277, 0.725, 0.585, 0.1…
$ tempo                    <dbl> 122.036, 99.972, 124.008, 121.956, 123.976, 1…
$ duration_ms              <dbl> 194754, 162600, 176616, 169093, 189052, 16304…

For all of the questions below, you can ignore missing values in the dataset. For example, when taking an average, remove the missing values first if needed.
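As a minimal sketch of dropping missing values before averaging, the example below uses a small made-up tibble rather than spotify_songs. The name toy and its columns and values are hypothetical and only serve to show the effect of the na.rm argument of mean().

```r
library(dplyr)

# Toy data (hypothetical values, not taken from spotify_songs)
toy <- tibble(
  genre  = c("a", "a", "b", "b"),
  energy = c(0.5, NA, 0.8, 0.6)
)

# Without na.rm = TRUE, any group that contains an NA has an NA mean
toy_with_na <- toy %>%
  group_by(genre) %>%
  summarise(mean_energy = mean(energy))

# With na.rm = TRUE, missing values are removed before averaging
toy_means <- toy %>%
  group_by(genre) %>%
  summarise(mean_energy = mean(energy, na.rm = TRUE))

toy_means
```

Here group "a" averages only its one observed value, while group "b" is unaffected because it has no missing values.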

4.3 Tasks

Use functions from dplyr and ggplot2 to answer the following questions.

  1. How many songs are in each genre?
# Add your solution here
  2. What are the average values of energy and acousticness in the latin genre in this dataset?
# Add your solution here
  3. Calculate the average song duration (in minutes) for each subgenre. Which subgenre has the longest songs on average?
# Add your solution here
  4. Make two side-by-side boxplots of the danceability of songs, stratified by whether a song has a fast or slow tempo. Define a fast-tempo song as one whose tempo is above the median tempo in the dataset. On average, which songs are more danceable?

Hint: You may find the case_when() function useful in this part, which can be used to map values from one variable to different values in a new variable (when used in a mutate() call).

## Generate some random numbers
dat <- tibble(x = rnorm(100))
slice(dat, 1:3)
# A tibble: 3 × 1
        x
    <dbl>
1  0.230 
2 -0.0754
3 -0.536 
## Create a new column that indicates whether the value of 'x' is positive or negative
dat %>%
        mutate(is_positive = case_when(
                x >= 0 ~ "Yes",
                x < 0 ~ "No"
        ))
# A tibble: 100 × 2
         x is_positive
     <dbl> <chr>      
 1  0.230  Yes        
 2 -0.0754 No         
 3 -0.536  No         
 4 -3.36   No         
 5  0.491  Yes        
 6  1.43   Yes        
 7 -0.0423 No         
 8 -0.339  No         
 9  0.201  Yes        
10  0.192  Yes        
# ℹ 90 more rows
# Add your solution here