Outline

0. Contact info

1. Course intro and prerequisites

2. Useful information

3. R fundamentals

4. Tidyverse: pipe and tibble

5. Data processing/visualization

6. Cancer omic resources

7. Differential gene expression and geneset analysis

8. References

0. Contact info

Yen-Hsieh Chen

yhchen@ibms.sinica.edu.tw

1. Course intro and prerequisites

1.1. brief course intro

This course is designed for R beginners, biologists, and those who are both.

Through this course, participants will learn R basics, visualization and graph formatting, public (cancer) data retrieval, parallel computing basics, and biological analysis (i.e. differential gene expression analysis followed by gene set enrichment analysis). The course starts from and relies heavily on the tidyverse R package, a beginner-friendly stepping stone into the R world.

[Highly Recommended] All course contents are provided in .ipynb format, and participants may download the notebooks or edit the code online to explore different results.

1.2. code in the cloud

Course contents are now ready for either cloud presentation on GitHub or demonstration via Binder.

For participants preferring to try the demonstration instead of installing the whole R environment locally, please refer to this Binder link and let Binder prepare the online Jupyter environment (it may take a while to build).

1.3. local environment prerequisites

Anaconda (optional but recommended), R or RStudio

To set up an R programming environment, either direct installation of R/RStudio or a virtual environment like Anaconda containing R/RStudio is feasible. The installation steps for these items are platform-dependent (Linux/Mac/Windows).

1.4. local R package requirements

Uncomment the lines below to install any packages you don't already have installed.

2. Useful information

2.1. R/RStudio from Anaconda

2.2. online R resources

2.2.1. basic/interactive R courses

2.2.2. biology-relevant R packages

2.2.3. eBooks

2.2.4. visualization guidance

3. R fundamentals

3.1. basic data type

Numbers (integer/float), characters (strings), and logicals (TRUE and FALSE) are the primitive elements, stored as atomic vectors, in the R environment; they are also common in other programming languages under interchangeable names. In most cases, any value quoted by single ('X') or double ("X") quotation marks is a character. Furthermore, to check whether an R object is of a certain class, the is.X() functions built into R return a TRUE/FALSE logical value, e.g. is.character(), is.numeric(), etc. Examples are listed below; we can also check the class of an object via the class() function.
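A minimal sketch of these checks (the variable names are illustrative):

```r
# Basic data types and class checks
x_num <- 3.14        # numeric (double)
x_int <- 5L          # integer (note the L suffix)
x_chr <- "gene"      # character
x_lgl <- TRUE        # logical

class(x_num)         # "numeric"
is.character(x_chr)  # TRUE
is.numeric(x_chr)    # FALSE
```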

[Tips] To understand the arguments/parameters and detailed documentation of a certain function, e.g. t.test(), you may use args(t.test), help(t.test), or ??t.test for a broader search.

3.2. basic data structures

Beyond the basic vector mentioned above, other traditional structures in R include:

3.2.1. list

Using the list() function, you may concatenate multiple items with different data types into an object, demo_list. The length() function confirms that demo_list contains 3 elements. To retrieve a specific element from a list, either the element name (demo_list$B) or the index (demo_list[[2]]) is feasible.
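The described list operations can be sketched as (element names and values are illustrative):

```r
# A list mixing data types under names A/B/C
demo_list <- list(A = 1:3, B = "hello", C = TRUE)

length(demo_list)   # 3 elements
demo_list$B         # "hello" (access by name)
demo_list[[2]]      # "hello" (access by index)
```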

3.2.2. matrix

For the matrix class here and the data.frame class below, since both are table-like classes, you may add or change the column/row names via the colnames() and rownames() functions; moreover, a specific column name can be altered by combining indexing.

To get the dimensions of either a matrix or a data.frame, the dim() function is often used; the result is two numbers, the first being the row count and the other the column count. Also, to access or change a specific element within a matrix or data.frame, or even a list, a coordinate system is applied, written as table[row_index, column_index] for 2D objects (matrix and data.frame), vec[index] for a 1D array/vector, and List[[index]] for a 1D list object. Such a coordinate system is commonly used across programming languages, including R.
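A small sketch of naming, dim(), and coordinate access (the row/column names are illustrative):

```r
# A 2x3 matrix filled column by column
m <- matrix(1:6, nrow = 2, ncol = 3)
rownames(m) <- c("r1", "r2")
colnames(m) <- c("c1", "c2", "c3")
colnames(m)[3] <- "c3_new"   # alter a single column name by combining an index

dim(m)               # 2 3  (row count, then column count)
m["r2", "c3_new"]    # 6
m[2, 3] <- 60        # change one cell via the coordinate system
```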

3.2.3. data.frame

The structure of data.frame is similar to matrix, presenting a table-like form. Its manipulation, coordinate system, and naming are all identical to matrix, except that not all elements are restricted to a single data type, which is effective and widely used for downstream analysis. data.frame is common; however, its drawbacks are (1) poor readability for high-dimensional data (either large column or row count) and (2) a larger memory footprint than a matrix of the same dimensions, as in the following comparison.

The first disadvantage is solved by the tibble class from the tidyverse package, which will be discussed in a later section; the second one is hard to solve since it relates to the nature of the R environment and the data itself.
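A minimal data.frame sketch with mixed column types (the gene names and counts are invented):

```r
# A small data.frame mixing numeric and character columns
demo_df <- data.frame(
  gene  = c("TP53", "EGFR", "MYC"),
  count = c(120, 85, 200)
)

dim(demo_df)         # 3 2
demo_df$count        # 120 85 200
demo_df[1, "gene"]   # "TP53"
```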

3.3. variable assignment

A variable is a container storing data values; by assigning a value to a variable, the variable is created and stored in the current R environment. Until 2001, assignment was restricted to the arrow symbol (<-), but nowadays the equals sign (=) is accepted as well. Each of these two assignment operators has its own supporters, and you may find the advantages of <- in this article.

A/B/C here are the named variables we created with their assigned values/data; both arrows (-> and <-) clearly represent the assignment direction, while = always assigns from right to left, i.e. variable A is 100.
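The three assignment forms can be sketched as:

```r
# Three equivalent assignments
A <- 100    # leftward arrow: value flows into A
100 -> B    # rightward arrow: value flows into B
C = 100     # equals sign: always assigns right to left

c(A, B, C)  # 100 100 100
```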

Please keep careful variable naming in mind, since it can increase or decrease your code's readability; names starting with numbers or special symbols are generally avoided unless quoted by backticks. Also, complicated or meaningless names, e.g. ThisCouse_startsfrom_2pmto4pm_byYHC or tfadef.gaerre, should be avoided.

3.4. basic arithmetic and logical operators

The following examples cover basic arithmetic (addition, subtraction, multiplication, and division) and logical operations.
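For instance, a few of these operators in action:

```r
# Arithmetic operators
7 + 3    # 10
7 - 3    # 4
7 * 3    # 21
7 / 2    # 3.5
7 %% 2   # 1 (remainder / modulo)

# Logical operators
7 > 3          # TRUE
7 == 3         # FALSE
TRUE & FALSE   # FALSE (and)
TRUE | FALSE   # TRUE  (or)
```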

3.5. condition and iteration

Condition and iteration are two ways to control your code flow when executing complicated programs. Scenarios would often be like: (a) execute code depending on a condition, e.g. whether a specific gene is expressed or not, and (b) execute code repeatedly many times, e.g. perform an addition 3 times in a row.

Basic conditions were covered by the previous logical operators, hence in this part we will focus on another condition flow control (the if-else statement) and common iterations, including for/while loops and the apply family.

3.5.1. if-else

The main purpose of the if-else statement is to permit a piece of code to run or not based on a certain condition; the condition can be singular or multiple, and the flow structure can be linear or nested. The basic concepts of the if-else statement are listed as follows.

simple if

if-else with 1 condition

if-else with 2 conditions, linear

if-else with 2 conditions, nested
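A sketch combining these variants; the gene-expression threshold values are invented for illustration:

```r
expr <- 7.5   # a hypothetical expression value

if (expr > 10) {           # first condition
  status <- "high"
} else if (expr > 5) {     # second condition, linear flow
  status <- "medium"
} else {                   # fallback branch
  status <- "low"
}

status   # "medium"
```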

3.5.2. for/while loop

The purpose and concept of for and while loops are similar in that both execute code repeatedly, but the former is limited to a range of iterations while the latter stops only when a certain condition is met. The following diagram and code examples illustrate the difference.

for-loop

while-loop
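For instance, both loop types computing simple running values:

```r
# for-loop: repeat over a fixed range of iterations
total <- 0
for (i in 1:3) {
  total <- total + i   # the addition runs 3 times in a row
}
total   # 6

# while-loop: repeat until the condition fails
n <- 1
while (n < 10) {
  n <- n * 2
}
n       # 16
```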

3.5.3. apply family

Like for/while loops, the apply family functions from base R are designed to perform a function repetitively on a collection of components (e.g. vector, data.frame, matrix, etc.) and comprise the apply(), lapply(), sapply(), vapply(), mapply(), rapply(), and tapply() functions. Which apply member to use depends on your input and output; here we will demonstrate the lapply() function, whose idea and syntax structure are similar to a for-loop.

As demonstrated here, the syntax structure is nearly identical to a for-loop in that the code chunk runs repeatedly within the iteration. However, the overall performance of lapply() is better than a for-loop, and when it comes to preserving the output, the syntax differs.

Given a range of numbers, how to take the square root of them?
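One possible sketch of both approaches, storing the results as res_forloop and res_lapply:

```r
nums <- 1:5

# for-loop version: grow the result vector step by step
res_forloop <- c()
for (x in nums) {
  res_forloop <- c(res_forloop, sqrt(x))
}

# lapply version: one call, returning a list
res_lapply <- lapply(nums, sqrt)

res_forloop          # a numeric vector
unlist(res_lapply)   # same values, flattened from the list
```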

As you can see, the outputs are identical except that the result format is either a vector (res_forloop) or a list (res_lapply). To get the vector format directly, one can use another family member named sapply(); for details about the apply family, please follow this link for more information.

3.6. vectorization

Vectorization is the parallel processing of a function on multiple values at a time without looping through each element, making the code more efficient and concise to execute and read compared to the non-vectorized form, as demonstrated in the following examples.

Given two numeric vectors, how to make the pair-wise addition of their elements?

3.6.1. non-vectorized

3.6.2. vectorized
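Both approaches can be sketched with two illustrative vectors a and b:

```r
a <- c(1, 2, 3)
b <- c(10, 20, 30)

# non-vectorized: loop over each index
res_loop <- numeric(length(a))
for (i in seq_along(a)) {
  res_loop[i] <- a[i] + b[i]
}

# vectorized: R adds element-wise in a single expression
res_vec <- a + b

res_vec   # 11 22 33
```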

3.7. package install and load

Several repositories provide R packages, including CRAN, Bioconductor, and GitHub. The installation syntax differs between these repositories.

To load an installed package, either of the two ways below is available.

To check the installed packages or the version of an installed package, one may use the following lines.
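A sketch covering install, load, and checks; the installation lines are commented so it runs without modifying your library, and "user/repo" is a placeholder, not a real repository:

```r
# Installation, per repository (commented out on purpose)
# install.packages("tidyverse")          # from CRAN
# BiocManager::install("edgeR")          # from Bioconductor (needs BiocManager)
# remotes::install_github("user/repo")   # from GitHub ("user/repo" is a placeholder)

# Loading an installed package: either form works
library(stats)
require(stats)

# Checking installed packages and a package's version
head(rownames(installed.packages()))
packageVersion("stats")
```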

3.8. function

Even though R and other repositories provide built-in and convenient packages to keep you from reinventing the wheel, one may encounter scenarios where a user-defined function is preferred. The basic syntax of a function is listed below.

A practical user-defined function would take one or multiple parameters to fit the user's needs and (sometimes) would need to print messages during the process, return the result, or even save intermediate results, as in the follow-up example.

Given a range of numbers, how to take the digit at a specific decimal position of each square root and save the results into a vector?
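One possible reconstruction of such a function is sketched here; the function name get_sqrt_digit and its messages are illustrative choices, not the original code:

```r
# Return the digit at decimal position dec_pos of sqrt(x)
get_sqrt_digit <- function(x, dec_pos = 1) {
  if (x >= 0) {
    res_sqrt <- sqrt(x)
    if (res_sqrt %% 1 == 0) {
      return("perfect square, no decimals")
    } else {
      # split e.g. "2.6457..." on the decimal point, keep the decimal digits
      dec_part <- unlist(strsplit(as.character(res_sqrt), ".", fixed = TRUE))[2]
      return(substr(dec_part, dec_pos, dec_pos))
    }
  } else {
    return("invalid input")
  }
}

res <- sapply(2:5, get_sqrt_digit)
res   # "4" "7" "perfect square, no decimals" "2"
```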

With the example above, our user-defined function has the parameter dec_pos indicating the decimal position, with a default of 1. Also, to determine whether the input is a perfect square (res_sqrt %% 1 == 0) or an invalid input (NOT x >= 0), a nested if-else statement is applied. Furthermore, we use the built-in unlist() function to convert the list object output by strsplit(). Lastly, since the final result should be a vector object, it is named res by applying the sapply() function to all input numbers iteratively.

3.9. data input/output

Using built-in data from the R environment is an easy task, but often you need to work with your own data in various formats, e.g. .txt, .csv, .xlsx, etc. R already has built-in functions for some of these file types, including read.csv(), read.delim(), and read.table(), but their drawbacks are the limited file types and the input speed. With the convenient tidyverse package providing solutions, you may use the read_*() functions (listed here and here) with your files as seamlessly as the built-in R functions.

[Tips] For either tidyverse or base R functions importing files of different formats, both local files and online files (a direct file link such as https://path_to_the_file.csv) are allowed.
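A self-contained sketch using base read.csv() on a temporary file; the file path and contents are invented, and tidyverse's read_csv() is a drop-in replacement here:

```r
# Write a tiny demo file, then read it back
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(gene = c("TP53", "MYC"), count = c(10, 20)),
          tmp, row.names = FALSE)

df <- read.csv(tmp)   # readr::read_csv(tmp) works the same way here
df$count              # 10 20

# Both base and tidyverse readers also accept a direct file link:
# read.csv("https://path_to_the_file.csv")
```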

4. Tidyverse: pipe and tibble

4.1. Pipe

The pipe (%>%), developed by Stefan Milton Bache (magrittr R package), is a useful tool to concatenate discrete steps with clear expression and readability. The strength of the pipe, compared to traditional workflows, will be demonstrated with the follow-up hypothetical scenario.

Given a range of numbers (e.g. 1 to 10), how to take the square root and round to two decimal places?

4.1.1. traditional step-wise approach

As an intermediate object is created at each step, these independent objects are passed on to the next step; with numerous steps, the accumulating named objects occupy memory and hamper workflow readability.

4.1.2. traditional one-sentence approach

For this hypothetical scenario, the workflow can be written in one-sentence form, but the nested structure, which organizes the workflow as a flashback, increases the reading difficulty as well.

4.1.3. tidy approach with pipe

Supported by %>%, the workflow reads in a direct tone, increasing readability with a clear syntax. Furthermore, %>% is also usable inside user-defined functions; putting these pieces together can make your code more efficient than the traditional ways.
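The three approaches can be sketched side by side; the sketch uses base R's native pipe |> (R >= 4.1) so it runs without extra packages, while magrittr's %>% reads the same way:

```r
nums <- 1:10

# (1) step-wise: an intermediate object at every step
tmp_sqrt <- sqrt(nums)
res_step <- round(tmp_sqrt, 2)

# (2) one-sentence: nested calls read inside-out ("flashback" order)
res_nest <- round(sqrt(nums), 2)

# (3) piped: steps read left to right
res_pipe <- nums |> sqrt() |> round(2)

identical(res_step, res_pipe)   # TRUE
```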

alternative expression

4.2. Tibble

matrix and data.frame are two common R classes presenting a table-like format as in MS Excel. The two classes are similar, but the contents of all cells within a matrix must share one type, e.g. all numeric or all character; data.frame is a hybrid form of matrix, accepting columns with different classes, i.e. number+number, number+character, and character+character.

Tibble (tbl_df or tbl) is a descendant of the traditional data.frame, and it accepts a wider variety of classes in columns, e.g. formula, tbl_df, images, etc., which greatly increases the flexibility.

To demonstrate the difference between matrix, data.frame and tbl_df, we will use classic iris dataset as example.

The iris flower data set is among the best-known datasets, introduced by the British statistician and biologist Ronald Fisher in 1936. The dataset consists of 50 instances of each of 3 iris classes (Iris setosa, Iris virginica, and Iris versicolor) with 4 measurements (the length and the width of the sepals and petals). As a paradigm in statistics, this dataset is already included in the R environment.

4.2.1. class data.frame

As illustrated by str(iris) (or alternatively glimpse(iris)), the default class of the iris dataset is data.frame, composed of 150 samples (rows) and 5 features/attributes (columns), with Species the only non-numeric column. In the following part, we will compare the differences between matrix, data.frame, and tbl_df via the iris dataset.

4.2.2. class matrix

full dataset

numeric part of dataset

4.2.3. class tbl_df

iris conversion

With the as_tibble() function, a data.frame object can be seamlessly converted to tbl_df form. To construct a new tbl_df object from scratch, one can apply the concept below.
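A minimal sketch of both directions (demo_tbl and its columns are illustrative):

```r
library(tibble)

# data.frame -> tbl_df
iris_tbl <- as_tibble(iris)
class(iris_tbl)            # "tbl_df" "tbl" "data.frame"

# tbl_df from scratch
demo_tbl <- tibble(id = 1:3, label = c("a", "b", "c"))
```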

flexibility of tbl_df

Unlike the traditional matrix and data.frame, which accept only numbers/characters/factors, components within a tbl_df can take other formats, as in the example below; these are mostly stored as lists.

Though tbl_df follows the rule that elements within each column should be of one kind, e.g. a string column or a number column, it accepts mixed data types, mostly via a list-structured column, as in the example.
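A sketch of such a list column (the column names and contents are invented):

```r
library(tibble)

# A list column: each cell holds an arbitrary object
mixed_tbl <- tibble(
  id    = 1:2,
  extra = list(c(1.5, 2.5), "a string")
)

mixed_tbl$extra[[2]]   # "a string"
```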

5. Data processing/visualization

Tidy data is an important stepping stone for R beginners in data analysis, since available data nowadays are mostly structured in an MS Excel-like format, where a table consists of feature columns (e.g. gene expression, height, histological status, etc.) and observation rows (e.g. patients, sample collections, etc.). For more complicated relational tables, please see here for more examples and demonstrations. The Palmer penguins data will be used here for demonstration.

The Palmer penguins data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network. The palmerpenguins package contains two datasets. One is called penguins and is a simplified version of the raw data, penguins_raw. This data is an alternative to the classic iris data since it is much closer to real-world data: it is complicated, contains time-series data, and has incomplete observations.

5.1. data overview

5.1.1. penguins_raw

5.1.2. penguins

5.2. data processing

Data processing (or wrangling) is often the step after data collection and right before analysis/visualization. The main objective of this step is to clean up real-world messy data into an organized form, including handling missing data, simplifying naming and content, or even creating a specific attribute (feature). Here we demonstrate the process by converting the penguins_raw data to the penguins data in multiple steps using functions from the tidyverse package, where you may also find other useful packages/functions to fit your research.
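A sketch of the typical wrangling verbs, shown on the built-in iris data (the penguins data can be processed the same way; the derived ratio column is an invented example feature):

```r
library(dplyr)

iris_summary <- iris %>%
  filter(Species != "setosa") %>%                  # drop unwanted rows
  mutate(ratio = Sepal.Length / Sepal.Width) %>%   # create a new feature
  group_by(Species) %>%
  summarise(mean_ratio = mean(ratio), .groups = "drop")

iris_summary   # one row per remaining species
```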

5.2.1. raw data processing

Comparing penguins_raw and penguins, you may find big differences in many ways, e.g. column removal, column renaming, upper/lower-case conversion, time-stamp summarization, and content manipulation.

With this initial comparison, we can now see that the main differences between the penguins_raw and penguins data are:

5.2.2. miscellaneous techniques for data wrangling

5.3. data visualization

Data visualization is the graphical representation of information and data using visual concepts like charts, graphs, and maps. By converting data into human-sensible images, symbols, colors, or textures, data that used to be hard to interpret are transformed and the information transparency is increased.

Several systems support R graphics, and ggplot2, included in the tidyverse package, is a good starting point since its grammar of graphics and user community are truly user-friendly. Also, numerous extensions are being developed, pushing the boundary of data visualization into various research fields (e.g. genomics/neuroscience/phylogenetics/image analysis) rather than being limited to statistical analysis. You may find scientific packages from these resources: the Bioconductor project, the Neuroconductor project, and useful ggplot2 extensions to fit your needs.

To produce a meaningful visualization, one needs to define the question of interest, the target audience, and the key information to convey. In this section, we will demonstrate common visualizations in exploratory data analysis using the ggplot2, ggpubr, and viridis packages for efficient, publication-ready, and colorblind-friendly graphs.
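As a minimal starting point, a ggplot2 sketch on the built-in iris data (the axis labels are illustrative):

```r
library(ggplot2)

# A basic scatter plot with per-species linear fits
p <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, colour = Species)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Sepal length (cm)", y = "Petal length (cm)")

# p                                                # draw it interactively
# ggsave("scatter.png", p, width = 5, height = 4)  # or save to a file
```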

5.3.1. vertical bar chart via standard ggplot2

5.3.2. violin plot via standard ggplot2

5.3.3. scatter plot with linear relations via standard ggplot2

5.3.4. density plot via ggpubr & ggsci

5.3.5. dot plot via ggpubr & ggsci

5.3.6. advanced violin plot via ggpubr & ggsci

5.3.7. advanced scatter plot via ggpubr & ggsci

6. Cancer omic resources

Various databases and web tools provide public cancer data, and some of them also visualize summary plots and advanced survival analyses upon user query. Among numerous cancer data contributors, The Cancer Genome Atlas (TCGA) is one of the most widely recognized by researchers and is archived by databases including the Genomic Data Commons (GDC), cBioPortal, Broad GDAC Firehose, and UCSC Xena. Furthermore, apart from direct download, these sources and numerous R packages (RTCGAToolbox, TCGAbiolinks, cBioPortalData, etc.) provide access to cancer data. In this section, we will use UCSC Xena for the following demonstration.

Xena is a genomic data hub hosted by UCSC providing a collection of data from other public databases (TCGA, Broad Institute, ICGC, GTEx, CCLE, etc.). Users can access, download, and analyze data such as cancer or single-cell datasets directly on the website or locally on a personal device, since the data deployed on Xena are tier-3 public data. UCSCXenaTools is the R package providing access inside the R environment for a convenient analysis workflow.

6.1. Xena overview & searching

6.1.1. overview

Xena datasets viewpage

6.1.2. searching

OR

AND

grep

6.2. access Xena via UCSCXenaTools package

6.2.1. step 1: generate

option 1: global search-based

option 2: official steps

option 3: use datasetID directly

6.2.2. step2~4: query, download, prepare
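The four steps can be sketched following the generate/query/download/prepare pattern from the UCSCXenaTools vignette; the calls need the package installed plus network access, so they are shown commented:

```r
## Sketch only; uncomment to run with UCSCXenaTools installed
# library(UCSCXenaTools)
# xe  <- XenaGenerate(subset = XenaHostNames == "tcgaHub") %>%  # step 1: generate
#        XenaFilter(filterDatasets = "BLCA")                    # narrow to a cohort
# xq  <- XenaQuery(xe)                                          # step 2: query
# xd  <- XenaDownload(xq)                                       # step 3: download
# dat <- XenaPrepare(xd)                                        # step 4: prepare
```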

6.3. direct download from Xena

demo webpage

7. Differential gene expression and geneset analysis

Over-representation analysis (ORA) is a statistical method to determine whether a geneset (e.g. a pathway from Gene Ontology or KEGG) is over-represented within a user gene list, e.g. differentially expressed genes (DEGs) between two groups selected by a significance cutoff. Geneset enrichment analysis (GSEA) is an extended version of ORA: DEGs ranked by fold change serve as input, and unlike ORA, which uses only the DEG list, it is better suited to situations where all genes in a predefined gene set change in a small but coordinated way. As mentioned above, both ORA and GSEA require DEGs as an input component, which can be identified through various methods/packages (e.g. DESeq2, edgeR, etc.); here we will start from the limma-voom workflow (built on edgeR) to discover DEGs for downstream ORA and GSEA.

clusterProfiler is an R package providing access to a list of well-known annotation databases, e.g. KEGG, Reactome, and WikiPathways. Apart from the annotations collected, it is also capable of both semantic similarity and functional enrichment analysis in a friendly and tidy fashion, allowing biologists without command-line experience to access, analyze, and visualize their results. According to the official documentation, genomic coordinates, genes, and gene clusters are supported as input formats. (source)

7.1. data retrieval and pre-process

To demonstrate the DEG workflow, here we start by downloading bladder cancer gene expression data from Xena as an example. Given the input format (an n*m numeric matrix object where columns are samples and rows are genomic features) suggested by the edgeR documentation, you may prepare your own input beyond this cancer data example. (data source)

7.1.1. data retrieval

7.1.2. data preprocess

data conversion (data-dependent)

Considering that the unit from Xena is log2(norm_count+1), here we reverse expected_count to the original count and take the nearest integer; you may see the difference below.
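The conversion can be sketched with a few hypothetical Xena values:

```r
# Hypothetical values on the log2(expected_count + 1) scale
log_vals <- c(0, 1, 3.17, 9.97)

# Reverse the transform, then take the nearest integer
raw_count <- round(2^log_vals - 1)
raw_count   # 0 1 8 1002
```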

To confirm that the input has no missing data or genes with all-zero counts, and has the correct format (i.e. matrix) for the follow-up analysis, data wrangling is required.

checking NA values

design matrix annotation

removal of rows with all zeros

As indicated, some rows (genes) have zero counts across all patients and hence will be removed.
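A toy sketch of the all-zero-row removal (gene names and counts are invented):

```r
# Toy count matrix; real TCGA matrices have thousands of genes
counts <- matrix(c(5, 0, 9,
                   0, 0, 0,
                   2, 7, 1),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("geneA", "geneB", "geneC"), NULL))

counts_kept <- counts[rowSums(counts) > 0, ]   # drop all-zero genes
rownames(counts_kept)   # "geneA" "geneC"
```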

sample group annotation

Sample group annotation is required since the main objective of this analysis is to distinguish the difference between two groups; here TCGA samples will be classified as tumor or adjacent normal by their barcodes accordingly.

Patient data collected by TCGA are labeled with its own barcode system, so a dataset may be composed of tumor and adjacent normal samples. Whether a sample is tumor (0) or adjacent normal (1) can be determined by the 14th position of the barcode. With all this processing, our data is ready for differential gene expression analysis.
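A sketch of the barcode-based grouping (the barcodes below are illustrative):

```r
# Character 14 of a TCGA barcode encodes the sample type
barcodes <- c("TCGA-BT-A20J-01A-11R-A16R-07",   # "0" -> tumor
              "TCGA-BT-A20J-11A-11R-A16R-07")   # "1" -> adjacent normal

group <- ifelse(substr(barcodes, 14, 14) == "0", "tumor", "adjacent")
group   # "tumor" "adjacent"
```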

(barcode table)

7.2. differential expression gene analysis

7.2.1. normalization

7.2.2. count/sample filter

With the desired cutoff set, genes with low expression counts or expressed in only a small number of samples are neglected. The before/after difference of this procedure can be visualized via the voom variance plot, where each dot represents a gene, showing the mean-variance relationship of the genes included in the dataset. The main purpose of this plot is to check whether low-count genes have been filtered and whether a lot of variation exists in the data. The cutoff setting is data-dependent.
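This filtering step can be sketched following the edgeR/limma pattern; it needs those packages, a count matrix, group labels, and a design matrix, so the calls are shown commented:

```r
## Sketch only; counts, group, and design are assumed to exist
# library(edgeR)
# dge  <- DGEList(counts = counts, group = group)
# keep <- filterByExpr(dge, group = group)     # data-driven count/sample cutoff
# dge  <- dge[keep, , keep.lib.sizes = FALSE]  # apply the filter
# v    <- voom(dge, design, plot = TRUE)       # draws the voom variance plot
```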

7.2.3. interpretation via voom variance plot

7.2.4. model fitting

7.3. geneset analysis

The main objective of geneset/pathway analysis is to determine whether pre-defined sets, such as pathways collected from KEGG or Gene Ontology, are over-represented in a subset of user data beyond what is expected. For instance, given a geneset composed of 100 genes (e.g. a liver cancer-specific pathway) and a knockout experiment with 500 DEGs of which 90 are in the geneset, the geneset is highly over-represented in the experiment. The above is the straightforward concept of conventional over-representation analysis (ORA); with statistical tests such as the cumulative hypergeometric test, the significance of the overlap can be calculated and filtered by a cutoff (e.g. p-value 0.05) to select the annotated genesets.
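The hypergeometric calculation for the example numbers above can be sketched directly in base R; the 20000-gene universe size is an assumption added for illustration:

```r
N <- 20000   # genes in the universe (assumed)
K <- 100     # genes in the geneset
n <- 500     # DEGs drawn
k <- 90      # overlap observed (expected by chance: n * K / N = 2.5)

# P(overlap >= k): cumulative hypergeometric upper tail
p_val <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)
p_val < 0.05   # TRUE: far beyond what is expected by chance
```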

Unlike ORA, Gene Set Enrichment Analysis (GSEA) ranks all genes in the genome, eliminating the need for a cutoff (e.g. on fold change) to define the input gene set as ORA does. With such a ranking, e.g. by level of differential expression, genesets enriched toward the high or low end are detected via a running-sum statistic.

7.3.1. ORA analysis and visualization

Here we take the down-regulated DEGs to determine the enrichment result in the Biological Process category of Gene Ontology, along with several visualization presentations.
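The call can be sketched with clusterProfiler's documented enrichGO() interface; down_genes is a hypothetical vector of Entrez gene IDs, and the packages must be installed, so the calls are shown commented:

```r
## Sketch only; down_genes is a hypothetical Entrez ID vector
# library(clusterProfiler)
# library(org.Hs.eg.db)
# ego <- enrichGO(gene         = down_genes,
#                 OrgDb        = org.Hs.eg.db,
#                 ont          = "BP",          # Biological Process
#                 pvalueCutoff = 0.05)
# dotplot(ego)                                  # dot plot display
# emapplot(enrichplot::pairwise_termsim(ego))   # enrichment map display
```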

dot plot display

enrichment map display

7.3.2. GSEA analysis and visualization

Here we take the up-regulated DEGs to determine the GSEA result in the Biological Process category of Gene Ontology, along with several visualization presentations. The visualization methods are interchangeable with those in the previous ORA section.
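The GSEA call can be sketched with clusterProfiler's gseGO() interface; gene_ranks is a hypothetical named fold-change vector sorted in decreasing order, and the packages must be installed, so the calls are shown commented:

```r
## Sketch only; gene_ranks is a hypothetical named, sorted numeric vector
# library(clusterProfiler)
# library(org.Hs.eg.db)
# gse <- gseGO(geneList = gene_ranks, OrgDb = org.Hs.eg.db, ont = "BP")
# enrichplot::cnetplot(gse)                   # concept map display
# enrichplot::gseaplot2(gse, geneSetID = 1)   # single-geneset running score
```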

concept map display

single/multiple geneset visualization

8. References