Describe what is the Bioconductor and how it’s different than CRAN
Describe the package types in Bioconductor
Recognize and work with core Bioconductor objects including SummarizedExperiment
Be able to perform a basic differential expression analysis with bulk RNA-seq
Introduce the core Bioconductor object called SingleCellExperiment.
Introduction to Bioconductor
A brief history of Bioconductor
The Bioconductor project was started in the Fall of 2001, as an initiative for the collaborative creation of extensible software for computational biology and bioinformatics.
The goal of the project is to develop tools for the statistical analysis and comprehension of large datasets and technological artifacts in rigorously and robustly designed experiments. Beyond statistical analyses, the interpretation of statistical results is supported by packages providing biological context, visualization, and reproducibility.
Over the years, software packages contributed to the Bioconductor project have reflected the evolution and emergence of several high-throughput technologies, from microarrays to single-cell genomics, through many variations of sequencing experiments (e.g., RNA-seq, ChIP-seq, DNA-seq), analyses (e.g., sequence variation, copy number variation, single nucleotide polymorphisms), and data modalities (e.g., flow cytometry, proteomics, microscropy and image analysis).
The Bioconductor project culminates at an annual conference in North America in the summer, while regional conferences offer great opportunities for networking in Europe, Asia, and North America.
The project is committed to promote a diverse and inclusive community, including a Code of Conduct enforced by a Code of Conduct committee.
Timeline of major Bioconductor milestones alongside technological advancements. Above the timeline, the figure marks the first occurrence of major events. Within the timeline, the name of packages providing core infrastructure indicate the release date. Below the timeline, major technological advancements contextualise the evolution of the Bioconductor project over time.
A scientific project
The original publication describes the aims and methods of the project at its inception is Gentleman et al. (2004).
Huber et al. (2015) illustrates the progression of the project, including descriptions of core infrastructure and case studies, from the perspective of both users and developers.
Amezquita et al. (2020) reviewed further developments of the project in the wake of single-cell genomics technologies.
Many more publications and book chapters cite the Bioconductor project, with recent example listed on the Bioconductor website.
In addition, there is a core team which is supported by an NIH grant, and developers who contribute to the open source Bioconductor packages.
A package repository
Overview and relationship to CRAN
Undoubtedly, software packages are the best-known aspect of the Bioconductor project. Since its inception in 2001, the repository has grown over time to host thousands of packages.
The Bioconductor project has extended the preexisting CRAN repository – much larger and general-purpose in scope – to comprise R packages primarily catering for bioinformatics and computational biology analyses.
The Bioconductor release cycle
The Bioconductor project also extended the packaging infrastructure of the CRAN repository to better support the deployment and management of packages at the user level.
In particular, the Bioconductor projects features a 6-month release cycle (typically around April and October), which sees a snapshot of the current version of all packages in the Bioconductor repository earmarked for a specific version of R.
R itself is released on an annual basis (typically around April), meaning that for each release of R, two compatible releases of Bioconductor packages are available.
As such, Bioconductor package developers are required to always use the version of R that will be associated with the next release of the Bioconductor project.
This means using the development version of R between October and April, and the release version of R between April and October.
Why? The strict Bioconductor release cycle prevents users from installing temporally distant versions of packages that were very likely never tested together.
This practice reflects the development cycle of packages of both CRAN and Bioconductor, where contemporaneous packages are regularly tested by automated systems to ensure that the latest software updates in package dependencies do not break downstream packages, or prompts those package maintainers to update their own software as a consequence.
Prior to each Bioconductor release, packages that do not pass the requires suites of automated tests are deprecated and subsequently removed from the repository.
This ensures that each Bioconductor release provides a suite of packages that are mutually compatible, traceable, and guaranteed to function for the associated version of R.
Timeline of release dates for selected Bioconductor and R versions. The upper section of the timeline indicates versions and approximate release dates for the R project. The lower section of the timeline indicates versions and release dates for the Bioconductor project. Source: Bioconductor.
Package types
Packages are broadly divided in four major categories:
software
annotation data
experiment data
workflows
Software packages themselves can be subdivided into packages that
provide infrastructure (i.e., classes) to store and access data
packages that provide methodological tools to process data stored in those data structures
This separation of structure and analysis is at the core of the Bioconductor project, encouraging developers of new methodological software packages to thoughtfully re-use existing data containers where possible, and reducing the cognitive burden imposed on users who can more easily experiment with alternative workflows without the need to learn and convert between different data structures.
Annotation data packages provide self-contained databases of diverse genomic annotations (e.g., gene identifiers, biological pathways).
Different collections of annotation packages can be found in the Bioconductor project.
They are identifiable by their respective naming pattern, and the information that they contain.
For instance, the so-called OrgDb packages (e.g., the org.Hs.eg.db package) provide information mapping different types of gene identifiers and pathway databases; the so-called EnsDb (e.g., EnsDb.Hsapiens.v86) packages encapsulate individual versions of the Ensembl annotations in Bioconductor packages; and the so-called TxDb packages (e.g., TxDb.Hsapiens.UCSC.hg38.knownGene) encapsulate individual versions UCSC gene annotation tables.
Experiment data packages provide self-contained datasets that are often used by software package developers to demonstrate the use of their package on well-known standard datasets in their package vignettes.
Finally, workflow packages exclusively provide collections of vignettes that demonstrate the combined usage of several other packages as a coherent workflow, but do not provide any new source code or functionality themselves.
Online communication channels
Support site
The Bioconductor support site provides a platform where users and developers can communicate freely (following the Bioconductor Code of Conduct) to discuss issues on a range of subjects, ranging from packages to conceptual questions about best practices.
Slack workspace
The Bioconductor Slack workspace is an open space that all community members are welcome to join (for free) and use for rapid interactions. Currently, the “Pro” pricing plan kindly supported by core funding provides:
Unlimited message archive
Unlimited apps
Group video calls with screen sharing
Work securely with other organizations using Slack Connect
A wide range of channels have been created to discuss a range of subjects, and community members can freely join the discussion on those channels of create new ones to discuss new subjects.
Important announcements are posted on the #general channel.
Developer Mailing List
The bioc-devel@r-project.org mailing list is used for communication between package developers, and announcements from the Biocondutor core team.
A scientific and technical community
Scientific Advisory Board (SAB) Meet Annually, External and Internal leader in the field who act as project advisors. No Term limits.
Technical Advisory Board (TAB). Meet monthly to consider technical aspects of core infastructure and scientific direction of the project. 15 members, 3 year term. Annual open-to-all elections to rotate members. Current officers are Vince Carey (chair), Levi Waldron (vice Chair) Charlotte Soneson (Secretary).
Community Advisory Board (CAB) Meet monthly to consider community outreach, events, education and training. 15 members, 3 year term. Annual open-to-all elections to rotate members. Current officers are Aedin Culhane (chair), Matt Ritchie (co Chair), Lori Kern (Secretary).
Bioconductor packages are linked in their versions, both to each other and to the version of R
Bioconductor’s installation function will look up your version of R and give you the appropriate versions of Bioconductor packages
If you want the latest version of Bioconductor, you need to use the latest version of R.
How do you know if a package is a Bioconductor package?
For one thing, you can just google the package name and you will see either CRAN or Bioconductor as a first result (packages must be in one or the other, they are not allowed to be on both repositories).
But also, you can use Bioconductor’s installation function to install any packages, even ones on CRAN.
By the way, you can install multiple packages at once by making a string vector: BiocManager::install(c("foo","bar"))
Working with Bioconductor objects
Bioconductor’s infrastructure is built up of object classes.
An example of a class is GRanges (stands for “genomic ranges”), which is a way to specify a set of ranges in a particular genome, e.g. from basepair 101 to basepair 200 on chromosome 1 of the human genome (version 38).
What’s an object?
Well everything in R is an object, but usually when we talk about Bioconductor objects, we mean data structures containing many attributes, so more complex than a vector or matrix.
And the objects have specific methods that help you either access the information in the object, run analyses on the object, plot the object, etc.
Bioconductor also allows for class hierarchy, which means that you can define a class of object that inherits the structure and methods of a superclass on which it depends. This last point is mostly important for people who are developing new software for Bioconductor (maybe that’s you!)
We will introduce the core Bioconductor objects here.
SummarizedExperiment (SE)
First, we will discuss one of the most important classes of object, which is the SummarizedExperiment, or SE.
SEs have the structure:
a matrix of data, rows are genomic features, and columns are samples
a table of data about the samples (columns)
a table of data about the features (rows)
A diagram of this 3-part structure can be found here.
In SE, the 3 parts of the object are called 1) assay, 2) colData and 3) rowData or rowRanges.
Note: There was a class of object that came before the SE, called the ExpressionSet, which was used primarily to store microarray data.
Here we will skip over the ExpressionSet, and just look at SEs.
It helps to start by making a small toy SE, to see how the pieces come together. (Often you won’t make an SE manually, but it will be downloaded from an external source, or generated by a function that you call, e.g. the tximeta or some other data loading function.)
sample condition treated
1 1 A 0
2 2 A 1
3 3 B 0
4 4 B 1
5 5 C 0
6 6 C 1
An important aspect of SEs is that the rows can optionally correspond to particular set of GRanges
e.g. a row of an SE could give the number of RNA-seq reads that can be assigned to a particular gene, and the row could also have metadata in the 3rd slot including, e.g. location of the gene in the genome.
In this case, we use the rowRanges slot to specify the information.
If we don’t have ranges, we can just put a table on the “side” of the SE by specifying rowData.
I will show in the example though how to provide rowRanges.
Let’s use the first 10 genes in the Ensembl database for human.
The following code loads a database, pulls out all the genes (as GRanges), removes extra “non-standard” chromosomes, and then subsets to the first 10 genes.
Adding data to the column metadata is even easier, we can just use $:
se$librarySize <-runif(6,1e6,2e6)colData(se)
DataFrame with 6 rows and 4 columns
sample condition treated librarySize
<factor> <factor> <factor> <numeric>
1 1 A 0 1100362
2 2 A 1 1461149
3 3 B 0 1465337
4 4 B 1 1557165
5 5 C 0 1991141
6 6 C 1 1564061
Using the ranges of a SE
How does this additional functionality of the rowRanges facilitate faster data analysis?
Suppose we are working with another data set besides se and we find a region of interest on chromsome 1.
If we want to pull out the expression data for that region, we just ask for the subset of se that overlaps.
First, we build the query region, and then use the GRanges function overlapsAny() within single square brackets (like you would subset any matrix-like object):
Another useful property is that we know metadata about the chromosomes, and the version of the genome.
Note: If you were not yet aware, the basepair position of a given feature, say gene XYZ, will change between versions of the genome, as sequences are added or rearranged.
Let’s download a SE object from recount2, which performs a basic summarization of public data sets with gene expression data.
This dataset contains RNA-seq samples from human airway epithelial cell cultures.
The paper is here. The structure of the experiment was that, cell cultures from 6 asthmatic and 6 non-asthmatics donors were treated with viral infection or left untreated (controls).
So we have 2 samples (control or treated) for each of the 12 donors.
library(here)
here() starts at /Users/stephaniehicks/Documents/github/projects/singlecelldemo
# tests if a directory named "data" exists locallyif(!dir.exists(here("data"))) { dir.create(here("data")) }file <-"asthma.rda"if (!file.exists(here("data", file))){ url <-"http://duffel.rail.bio/recount/SRP046226/rse_gene.Rdata"download.file(url, here("data", file))}load(here("data", file))
We use a custom function to produce a matrix which a count of RNA fragments for each gene (rows) and each sample (columns).
(Recount project calls these objects rse for RangedSummarizedExperiment, meaning it has rowRanges information.)
my_scale_counts <-function(rse_gene, round=TRUE) { cts <-assays(rse_gene)$counts# mapped_read_count includes the count for both reads of a pair paired <-ifelse(colData(rse_gene)$paired_end, 2, 1) x <- (colData(rse_gene)$mapped_read_count / paired) /colSums(cts)assays(rse_gene)$counts <-t(t(assays(rse_gene)$counts) * x)if (round) {assays(rse_gene)$counts <-round(assays(rse_gene)$counts) } rse_gene}rse <-my_scale_counts(rse_gene)
CharacterList of length 3
[[1]] cell type: Isolated from human trachea-bronchial tissues ...
[[2]] cell type: Isolated from human trachea-bronchial tissues ...
[[3]] cell type: Isolated from human trachea-bronchial tissues ...
rse$characteristics[[1]]
[1] "cell type: Isolated from human trachea-bronchial tissues"
[2] "passages: 2"
[3] "disease state: asthmatic"
[4] "treatment: HRV16"
We can pull out the 3 and 4 element using the sapply function and the square bracket function.
I know this syntax looks a little funny, but it’s really just saying, use the single square bracket, pull out the third element (or fourth element).
Here we just use a transformation so that we can compute meaningful distances on count data (without a larger discussion on normalization).
We build a DESeqDataSet and then specify the experimental design using a ~ and the variables that we expect to produce differences in the counts. (These variables are used to assess how much technical variability is in the data, but not used in the transformation function itself.)
This code pull out the top of the transformed data by variance, and adds an annotation to the top of the plot.
By default the rows and columns will be clustered by Euclidean distance. See ?pheatmap for more details on this function (it’s a very detailed manual page).
We can also easily make a PCA plot with dedicated functions:
plotPCA(vsd, intgroup="treatment")
SingleCellExperiment
An example of a class that extends the SE is SingleCellExperiment. This is a special object type for looking at single cell data.
For more details, there is a free online book “Orchestrating Single Cell Analysis With Bioconductor” produced by a group within the Bioconductor Project, with lots of example analyses: OSCA.
Here we show a quick example of how this object extends the SE.