problem-set.Rmd
That data for this part of the assignment comes from TidyTuesday, which is a weekly podcast and global community activity brought to you by the R4DS Online Learning Community. The goal of TidyTuesday is to help R learners learn in real-world contexts.
[Source: TidyTuesday]
To access the data, you need to install the tidytuesdayR
R package and use the function tt_load()
with the date of
‘2020-01-21’ to load the data.
install.packages("tidytuesdayR")
This is how you can download the data.
tuesdata <- tidytuesdayR::tt_load('2020-01-21')
spotify_songs <- tuesdata$spotify_songs
However, if you use this code, you will hit an API limit after trying to compile the document a few times. Instead, I suggest you use the following code below. Here, I provide the code below for you to avoid re-downloading data:
library(here)
library(tidyverse)
# tests if a directory named "data" exists locally
if(!dir.exists(here("data"))) { dir.create(here("data")) }
# saves data only once (not each time you knit a R Markdown)
if(!file.exists(here("data","spotify_songs.RDS"))) {
url_csv <- 'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv'
spotify_songs <- readr::read_csv(url_csv)
# save the file to RDS objects
saveRDS(spotify_songs, file= here("data","spotify_songs.RDS"))
}
Here we read in the .RDS
dataset locally from our
computing environment:
## # A tibble: 32,833 × 23
## track_id track…¹ track…² track…³ track…⁴ track…⁵ track…⁶ playl…⁷ playl…⁸
## <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 6f807x0ima9a… I Don'… Ed She… 66 2oCs0D… I Don'… 2019-0… Pop Re… 37i9dQ…
## 2 0r7CVbZTWZgb… Memori… Maroon… 67 63rPSO… Memori… 2019-1… Pop Re… 37i9dQ…
## 3 1z1Hg7Vb0AhH… All th… Zara L… 70 1HoSmj… All th… 2019-0… Pop Re… 37i9dQ…
## 4 75FpbthrwQmz… Call Y… The Ch… 60 1nqYsO… Call Y… 2019-0… Pop Re… 37i9dQ…
## 5 1e8PAfcKUYoK… Someon… Lewis … 69 7m7vv9… Someon… 2019-0… Pop Re… 37i9dQ…
## 6 7fvUMiyapMsR… Beauti… Ed She… 67 2yiy9c… Beauti… 2019-0… Pop Re… 37i9dQ…
## 7 2OAylPUDDfwR… Never … Katy P… 62 7INHYS… Never … 2019-0… Pop Re… 37i9dQ…
## 8 6b1RNvAcJjQH… Post M… Sam Fe… 69 6703SR… Post M… 2019-0… Pop Re… 37i9dQ…
## 9 7bF6tCO3gFb8… Tough … Avicii 68 7CvAfG… Tough … 2019-0… Pop Re… 37i9dQ…
## 10 1IXGILkPm0tO… If I C… Shawn … 67 4Qxzbf… If I C… 2019-0… Pop Re… 37i9dQ…
## # … with 32,823 more rows, 14 more variables: playlist_genre <chr>,
## # playlist_subgenre <chr>, danceability <dbl>, energy <dbl>, key <dbl>,
## # loudness <dbl>, mode <dbl>, speechiness <dbl>, acousticness <dbl>,
## # instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>,
## # duration_ms <dbl>, and abbreviated variable names ¹track_name,
## # ²track_artist, ³track_popularity, ⁴track_album_id, ⁵track_album_name,
## # ⁶track_album_release_date, ⁷playlist_name, ⁸playlist_id
We can take a glimpse at the data
glimpse(spotify_songs)
## Rows: 32,833
## Columns: 23
## $ track_id <chr> "6f807x0ima9a1j3VPbc7VN", "0r7CVbZTWZgbTCYdfa…
## $ track_name <chr> "I Don't Care (with Justin Bieber) - Loud Lux…
## $ track_artist <chr> "Ed Sheeran", "Maroon 5", "Zara Larsson", "Th…
## $ track_popularity <dbl> 66, 67, 70, 60, 69, 67, 62, 69, 68, 67, 58, 6…
## $ track_album_id <chr> "2oCs0DGTsRO98Gh5ZSl2Cx", "63rPSO264uRjW1X5E6…
## $ track_album_name <chr> "I Don't Care (with Justin Bieber) [Loud Luxu…
## $ track_album_release_date <chr> "2019-06-14", "2019-12-13", "2019-07-05", "20…
## $ playlist_name <chr> "Pop Remix", "Pop Remix", "Pop Remix", "Pop R…
## $ playlist_id <chr> "37i9dQZF1DXcZDD7cfEKhW", "37i9dQZF1DXcZDD7cf…
## $ playlist_genre <chr> "pop", "pop", "pop", "pop", "pop", "pop", "po…
## $ playlist_subgenre <chr> "dance pop", "dance pop", "dance pop", "dance…
## $ danceability <dbl> 0.748, 0.726, 0.675, 0.718, 0.650, 0.675, 0.4…
## $ energy <dbl> 0.916, 0.815, 0.931, 0.930, 0.833, 0.919, 0.8…
## $ key <dbl> 6, 11, 1, 7, 1, 8, 5, 4, 8, 2, 6, 8, 1, 5, 5,…
## $ loudness <dbl> -2.634, -4.969, -3.432, -3.778, -4.672, -5.38…
## $ mode <dbl> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, …
## $ speechiness <dbl> 0.0583, 0.0373, 0.0742, 0.1020, 0.0359, 0.127…
## $ acousticness <dbl> 0.10200, 0.07240, 0.07940, 0.02870, 0.08030, …
## $ instrumentalness <dbl> 0.00e+00, 4.21e-03, 2.33e-05, 9.43e-06, 0.00e…
## $ liveness <dbl> 0.0653, 0.3570, 0.1100, 0.2040, 0.0833, 0.143…
## $ valence <dbl> 0.518, 0.693, 0.613, 0.277, 0.725, 0.585, 0.1…
## $ tempo <dbl> 122.036, 99.972, 124.008, 121.956, 123.976, 1…
## $ duration_ms <dbl> 194754, 162600, 176616, 169093, 189052, 16304…
For all of the questions below, you can ignore the missing values in the dataset, so e.g. when taking averages, just remove the missing values before taking the average, if needed.
Use functions from dplyr
and ggplot2
to
answer the following questions.
# Add your solution here
energy
and
acousticness
in the latin
genre in this
dataset?
# Add your solution here
# Add your solution here
danceability
of
songs stratifying by whether a song has a fast or slow tempo. Define
fast tempo as any song that has a tempo
above its median
value. On average, which songs are more danceable?Hint: You may find the case_when()
function useful in this part, which can be used to map values from one
variable to different values in a new variable (when used in a
mutate()
call).
## # A tibble: 3 × 1
## x
## <dbl>
## 1 -0.630
## 2 0.465
## 3 1.67
## Create a new column that indicates whether the value of 'x' is positive or negative
dat %>%
mutate(is_positive = case_when(
x >= 0 ~ "Yes",
x < 0 ~ "No"
))
## # A tibble: 100 × 2
## x is_positive
## <dbl> <chr>
## 1 -0.630 No
## 2 0.465 Yes
## 3 1.67 Yes
## 4 -1.04 No
## 5 -0.170 No
## 6 1.97 Yes
## 7 -1.55 No
## 8 -0.276 No
## 9 1.51 Yes
## 10 -2.15 No
## # … with 90 more rows
# Add your solution here
The scRNAseq
package provides convenient access to several publicly available data
sets in the form of SingleCellExperiment
objects. The focus
of this package is to capture datasets that are not easily read into R
with a one-liner from, e.g., read_csv()
. Instead, the
necessary data munging is already done so that users only need to call a
single function to obtain a well-formed
SingleCellExperiment
.
library(scRNAseq)
To see the list of available datasets, use the
listDatasets()
function:
out <- listDatasets()
out
## DataFrame with 61 rows and 5 columns
## Reference Taxonomy Part Number
## <character> <integer> <character> <integer>
## 1 @aztekin2019identifi.. 8355 tail 13199
## 2 @bach2017differentia.. 10090 mammary gland 25806
## 3 @bacher2020low 9606 T cells 104417
## 4 @baron2016singlecell 9606 pancreas 8569
## 5 @baron2016singlecell 10090 pancreas 1886
## ... ... ... ... ...
## 57 @zeisel2018molecular 10090 nervous system 160796
## 58 @zhao2020singlecell 9606 liver immune cells 68100
## 59 @zhong2018singlecell 9606 prefrontal cortex 2394
## 60 @zilionis2019singlec.. 9606 lung 173954
## 61 @zilionis2019singlec.. 10090 lung 17549
## Call
## <character>
## 1 AztekinTailData()
## 2 BachMammaryData()
## 3 BacherTCellData()
## 4 BaronPancreasData('h..
## 5 BaronPancreasData('m..
## ... ...
## 57 ZeiselNervousData()
## 58 ZhaoImmuneLiverData()
## 59 ZhongPrefrontalData()
## 60 ZilionisLungData()
## 61 ZilionisLungData('mo..
You can load a dataset the following way:
sce <- ZeiselBrainData()
sce
## class: SingleCellExperiment
## dim: 20006 3005
## metadata(0):
## assays(1): counts
## rownames(20006): Tspan12 Tshz1 ... mt-Rnr1 mt-Nd4l
## rowData names(1): featureType
## colnames(3005): 1772071015_C02 1772071017_G12 ... 1772066098_A12
## 1772058148_F03
## colData names(10): tissue group # ... level1class level2class
## reducedDimNames(0):
## mainExpName: endogenous
## altExpNames(2): ERCC repeat
Pick a scRNA-seq dataset that has more than 5,000 cells and load
the SingleCellExperiment
(or sce
)
object.
Show the number of number of genes and number of observations in
the sce
object.
Using the material we learned in the lecture, analyze the scRNA-seq data using the Biocondutor packages we learned about. This should include (but not be limited to)
At the end of your analysis, show a plot of both (i) the PCA plot and (ii) either the tSNE or UMAP plot with the colors represented by the predicted labels from the clustering algorithm.
For each component described in Task #3, write 3-4 sentences naming and describing the idea behind the methodology you used, along with interpreting the output.
# Add your solution here
ensembl=TRUE
. Any genes
without a corresponding Ensembl
identifier is discarded
from the dataset.
sce <- ZeiselBrainData(ensembl=TRUE)
head(rownames(sce))
## [1] "ENSMUSG00000029669" "ENSMUSG00000046982" "ENSMUSG00000039735"
## [4] "ENSMUSG00000033453" "ENSMUSG00000046798" "ENSMUSG00000034009"