Background

Due date: Sept 18 at 11:59pm

To submit your problem set

Please write up your project using R Markdown and knitr. Please show all your code for each of the answers to each part. Follow previous instructions in Problem Set 1 to submit.

Part 1: Explore spotify songs!

Data

That data for this part of the assignment comes from TidyTuesday, which is a weekly podcast and global community activity brought to you by the R4DS Online Learning Community. The goal of TidyTuesday is to help R learners learn in real-world contexts.

Icon from TidyTuesday

Icon from TidyTuesday

[Source: TidyTuesday]

To access the data, you need to install the tidytuesdayR R package and use the function tt_load() with the date of ‘2020-01-21’ to load the data.

install.packages("tidytuesdayR")

This is how you can download the data.

tuesdata <- tidytuesdayR::tt_load('2020-01-21')
spotify_songs <- tuesdata$spotify_songs

However, if you use this code, you will hit an API limit after trying to compile the document a few times. Instead, I suggest you use the following code below. Here, I provide the code below for you to avoid re-downloading data:

library(here)
library(tidyverse)

# tests if a directory named "data" exists locally
if(!dir.exists(here("data"))) { dir.create(here("data")) }

# saves data only once (not each time you knit a R Markdown)
if(!file.exists(here("data","spotify_songs.RDS"))) {
  url_csv <- 'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv'
  spotify_songs <- readr::read_csv(url_csv)
  
  # save the file to RDS objects
  saveRDS(spotify_songs, file= here("data","spotify_songs.RDS"))
}

Here we read in the .RDS dataset locally from our computing environment:

spotify_songs <- readRDS(here("data","spotify_songs.RDS"))
as_tibble(spotify_songs)
## # A tibble: 32,833 × 23
##    track_id      track…¹ track…² track…³ track…⁴ track…⁵ track…⁶ playl…⁷ playl…⁸
##    <chr>         <chr>   <chr>     <dbl> <chr>   <chr>   <chr>   <chr>   <chr>  
##  1 6f807x0ima9a… I Don'… Ed She…      66 2oCs0D… I Don'… 2019-0… Pop Re… 37i9dQ…
##  2 0r7CVbZTWZgb… Memori… Maroon…      67 63rPSO… Memori… 2019-1… Pop Re… 37i9dQ…
##  3 1z1Hg7Vb0AhH… All th… Zara L…      70 1HoSmj… All th… 2019-0… Pop Re… 37i9dQ…
##  4 75FpbthrwQmz… Call Y… The Ch…      60 1nqYsO… Call Y… 2019-0… Pop Re… 37i9dQ…
##  5 1e8PAfcKUYoK… Someon… Lewis …      69 7m7vv9… Someon… 2019-0… Pop Re… 37i9dQ…
##  6 7fvUMiyapMsR… Beauti… Ed She…      67 2yiy9c… Beauti… 2019-0… Pop Re… 37i9dQ…
##  7 2OAylPUDDfwR… Never … Katy P…      62 7INHYS… Never … 2019-0… Pop Re… 37i9dQ…
##  8 6b1RNvAcJjQH… Post M… Sam Fe…      69 6703SR… Post M… 2019-0… Pop Re… 37i9dQ…
##  9 7bF6tCO3gFb8… Tough … Avicii       68 7CvAfG… Tough … 2019-0… Pop Re… 37i9dQ…
## 10 1IXGILkPm0tO… If I C… Shawn …      67 4Qxzbf… If I C… 2019-0… Pop Re… 37i9dQ…
## # … with 32,823 more rows, 14 more variables: playlist_genre <chr>,
## #   playlist_subgenre <chr>, danceability <dbl>, energy <dbl>, key <dbl>,
## #   loudness <dbl>, mode <dbl>, speechiness <dbl>, acousticness <dbl>,
## #   instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>,
## #   duration_ms <dbl>, and abbreviated variable names ¹​track_name,
## #   ²​track_artist, ³​track_popularity, ⁴​track_album_id, ⁵​track_album_name,
## #   ⁶​track_album_release_date, ⁷​playlist_name, ⁸​playlist_id

We can take a glimpse at the data

glimpse(spotify_songs)
## Rows: 32,833
## Columns: 23
## $ track_id                 <chr> "6f807x0ima9a1j3VPbc7VN", "0r7CVbZTWZgbTCYdfa…
## $ track_name               <chr> "I Don't Care (with Justin Bieber) - Loud Lux…
## $ track_artist             <chr> "Ed Sheeran", "Maroon 5", "Zara Larsson", "Th…
## $ track_popularity         <dbl> 66, 67, 70, 60, 69, 67, 62, 69, 68, 67, 58, 6…
## $ track_album_id           <chr> "2oCs0DGTsRO98Gh5ZSl2Cx", "63rPSO264uRjW1X5E6…
## $ track_album_name         <chr> "I Don't Care (with Justin Bieber) [Loud Luxu…
## $ track_album_release_date <chr> "2019-06-14", "2019-12-13", "2019-07-05", "20…
## $ playlist_name            <chr> "Pop Remix", "Pop Remix", "Pop Remix", "Pop R…
## $ playlist_id              <chr> "37i9dQZF1DXcZDD7cfEKhW", "37i9dQZF1DXcZDD7cf…
## $ playlist_genre           <chr> "pop", "pop", "pop", "pop", "pop", "pop", "po…
## $ playlist_subgenre        <chr> "dance pop", "dance pop", "dance pop", "dance…
## $ danceability             <dbl> 0.748, 0.726, 0.675, 0.718, 0.650, 0.675, 0.4…
## $ energy                   <dbl> 0.916, 0.815, 0.931, 0.930, 0.833, 0.919, 0.8…
## $ key                      <dbl> 6, 11, 1, 7, 1, 8, 5, 4, 8, 2, 6, 8, 1, 5, 5,…
## $ loudness                 <dbl> -2.634, -4.969, -3.432, -3.778, -4.672, -5.38…
## $ mode                     <dbl> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, …
## $ speechiness              <dbl> 0.0583, 0.0373, 0.0742, 0.1020, 0.0359, 0.127…
## $ acousticness             <dbl> 0.10200, 0.07240, 0.07940, 0.02870, 0.08030, …
## $ instrumentalness         <dbl> 0.00e+00, 4.21e-03, 2.33e-05, 9.43e-06, 0.00e…
## $ liveness                 <dbl> 0.0653, 0.3570, 0.1100, 0.2040, 0.0833, 0.143…
## $ valence                  <dbl> 0.518, 0.693, 0.613, 0.277, 0.725, 0.585, 0.1…
## $ tempo                    <dbl> 122.036, 99.972, 124.008, 121.956, 123.976, 1…
## $ duration_ms              <dbl> 194754, 162600, 176616, 169093, 189052, 16304…

For all of the questions below, you can ignore the missing values in the dataset, so e.g. when taking averages, just remove the missing values before taking the average, if needed.

Tasks

Use functions from dplyr and ggplot2 to answer the following questions.

  1. How many songs are in each genre?
# Add your solution here
  1. What is average value of energy and acousticness in the latin genre in this dataset?
# Add your solution here
  1. Calculate the average duration of song (in minutes) across all subgenres. Which subgenre has the longest song on average?
# Add your solution here
  1. Make two boxplots side-by-side of the danceability of songs stratifying by whether a song has a fast or slow tempo. Define fast tempo as any song that has a tempo above its median value. On average, which songs are more danceable?

Hint: You may find the case_when() function useful in this part, which can be used to map values from one variable to different values in a new variable (when used in a mutate() call).

## Generate some random numbers
dat <- tibble(x = rnorm(100))
slice(dat, 1:3)
## # A tibble: 3 × 1
##        x
##    <dbl>
## 1 -0.630
## 2  0.465
## 3  1.67
## Create a new column that indicates whether the value of 'x' is positive or negative
dat %>%
        mutate(is_positive = case_when(
                x >= 0 ~ "Yes",
                x < 0 ~ "No"
        ))
## # A tibble: 100 × 2
##         x is_positive
##     <dbl> <chr>      
##  1 -0.630 No         
##  2  0.465 Yes        
##  3  1.67  Yes        
##  4 -1.04  No         
##  5 -0.170 No         
##  6  1.97  Yes        
##  7 -1.55  No         
##  8 -0.276 No         
##  9  1.51  Yes        
## 10 -2.15  No         
## # … with 90 more rows
# Add your solution here

Part 2: Single-cell data analysis

Data

The scRNAseq package provides convenient access to several publicly available data sets in the form of SingleCellExperiment objects. The focus of this package is to capture datasets that are not easily read into R with a one-liner from, e.g., read_csv(). Instead, the necessary data munging is already done so that users only need to call a single function to obtain a well-formed SingleCellExperiment.

library(scRNAseq)

To see the list of available datasets, use the listDatasets() function:

out <- listDatasets() 
out
## DataFrame with 61 rows and 5 columns
##                  Reference  Taxonomy               Part    Number
##                <character> <integer>        <character> <integer>
## 1   @aztekin2019identifi..      8355               tail     13199
## 2   @bach2017differentia..     10090      mammary gland     25806
## 3           @bacher2020low      9606            T cells    104417
## 4     @baron2016singlecell      9606           pancreas      8569
## 5     @baron2016singlecell     10090           pancreas      1886
## ...                    ...       ...                ...       ...
## 57    @zeisel2018molecular     10090     nervous system    160796
## 58     @zhao2020singlecell      9606 liver immune cells     68100
## 59    @zhong2018singlecell      9606  prefrontal cortex      2394
## 60  @zilionis2019singlec..      9606               lung    173954
## 61  @zilionis2019singlec..     10090               lung     17549
##                       Call
##                <character>
## 1        AztekinTailData()
## 2        BachMammaryData()
## 3        BacherTCellData()
## 4   BaronPancreasData('h..
## 5   BaronPancreasData('m..
## ...                    ...
## 57     ZeiselNervousData()
## 58   ZhaoImmuneLiverData()
## 59   ZhongPrefrontalData()
## 60      ZilionisLungData()
## 61  ZilionisLungData('mo..

You can load a dataset the following way:

sce <- ZeiselBrainData()
sce
## class: SingleCellExperiment 
## dim: 20006 3005 
## metadata(0):
## assays(1): counts
## rownames(20006): Tspan12 Tshz1 ... mt-Rnr1 mt-Nd4l
## rowData names(1): featureType
## colnames(3005): 1772071015_C02 1772071017_G12 ... 1772066098_A12
##   1772058148_F03
## colData names(10): tissue group # ... level1class level2class
## reducedDimNames(0):
## mainExpName: endogenous
## altExpNames(2): ERCC repeat

Tasks

  1. Pick a scRNA-seq dataset that has more than 5,000 cells and load the SingleCellExperiment (or sce) object.

  2. Show the number of number of genes and number of observations in the sce object.

  3. Using the material we learned in the lecture, analyze the scRNA-seq data using the Biocondutor packages we learned about. This should include (but not be limited to)

    • Quality control (you must use at least two different QC metrics)
    • Normalization
    • Feature selection using highly variable genes
    • Dimensionality reduction using PCA
    • Data visualization using tSNE or UMAP
    • Unsupervised clustering (your choice of method!)
  4. At the end of your analysis, show a plot of both (i) the PCA plot and (ii) either the tSNE or UMAP plot with the colors represented by the predicted labels from the clustering algorithm.

  5. For each component described in Task #3, write 3-4 sentences naming and describing the idea behind the methodology you used, along with interpreting the output.

# Add your solution here

Useful tips

  • If the original dataset was not provided with Ensembl annotation, we can map the identifiers with ensembl=TRUE. Any genes without a corresponding Ensembl identifier is discarded from the dataset.
sce <- ZeiselBrainData(ensembl=TRUE)
head(rownames(sce))
## [1] "ENSMUSG00000029669" "ENSMUSG00000046982" "ENSMUSG00000039735"
## [4] "ENSMUSG00000033453" "ENSMUSG00000046798" "ENSMUSG00000034009"