1. Course intro and prerequisite
5. Data processing/visualization
This course is designed for R beginner, biologist, and both.
Through this course, participants will understand R basics, visualization & graph formatting, public (cancer) data retrieving, parallel computing basics, and biological analysis (i.e. differential gene expression analysis followed by gene set enrichment analysis). The course will starts from and heavily rely on tidyverse R package, which is a beginner-friendly stepping stone entering R world.
[Highly Recommended] All course contents are applied in ipynb format and participants may download or try to edit the code online for different results.
Course contents are now ready for either cloud presentatin at Github or demonstration via Binder.
For participants prefering to try the demonstration instead of installing the whole R environment locally, please refer to this Binder link and have the Binder preparing the online Jupyter environment (it may take whiles to build up).
Anaconda(optional but recommend), R or RStudio
To have R programming environment, direct installatio†n of R/RStudio or a virtural environment like Anaconda containing R/RStudio are both feasible. Installation steps of these items are platform-dependent (Linux/Mac/Windows).
#install.packages('palmerpenguins')
library(palmerpenguins)
#install.packages('tidyverse')
library(tidyverse)
#install.packages('ggpubr')
library(ggpubr)
#install.packages('UCSCXenaTools')
library(UCSCXenaTools)
#if (!require('BiocManager', quietly = TRUE))
# install.packages('BiocManager')
#BiocManager::install("edgeR")
library(edgeR)
#BiocManager::install("clusterProfiler")
library(clusterProfiler)
#BiocManager::install("org.Hs.eg.db")
library(org.Hs.eg.db)
#BiocManager::install("enrichplot")
library(enrichplot)
Numbers (integer/float), characters (string), boolean (TRUE and FALSE) are the primitive elements, namely vectors, in R environment and also common in other programming languages with interchangeable namings. Under most cases, any object/variable would be character either quoted by single ('X'
) and double quatation ("X"
) marks. Furthermore, to check a R object is certain class, is.X()
functions are default in R programming returning TRUE/FALSE
logical statement, e.g. is.character()
, is.numeric()
, etc. Examples are listed below and we also can check the class of an object via class()
function.
class('X')
class(100)
is.numeric(123)
is.logical(FALSE)
[Tips] To understand argument/parameters and detailed information for a certain function, e.g.
t.test()
, you may useargs(t.test)
,help(t.test)
or??t.test
for advanced information.
As mentioned basic vecto`, other traditional structures in R includes:
list(
A = 'X',
B = c(1,3,5),
C = matrix(1:3, nrow =1, ncol = 3)
) -> demo_list
print(demo_list)
$A [1] "X" $B [1] 1 3 5 $C [,1] [,2] [,3] [1,] 1 2 3
length(demo_list)
names(demo_list)
demo_list$B
demo_list[[2]]
Using function list()
, you may concatenate multiple items with different data types as object demo_list
. Also, by functions length()
, object demo_list
is confirmed to be a vector with 3 elements within. And to retrieve specific element from list, either by element naming (demo_list$B
) or index (demo_list[[2]]
) method are feasible.
demo_matrix_chr = matrix(letters[1:15], byrow = TRUE, nrow = 3)
print(demo_matrix_chr)
[,1] [,2] [,3] [,4] [,5] [1,] "a" "b" "c" "d" "e" [2,] "f" "g" "h" "i" "j" [3,] "k" "l" "m" "n" "o"
demo_matrix_num = matrix(1:15, byrow = FALSE, nrow = 3)
rownames(demo_matrix_num) <- c('X', 'Y', 'Z')
colnames(demo_matrix_num) <- letters[1:5]
print(demo_matrix_num)
a b c d e X 1 4 7 10 13 Y 2 5 8 11 14 Z 3 6 9 12 15
colnames(demo_matrix_num)[c(3,5)] = c('new_C', 'new_E')
print(demo_matrix_num)
a b new_C d new_E X 1 4 7 10 13 Y 2 5 8 11 14 Z 3 6 9 12 15
print(dim(demo_matrix_num))
[1] 3 5
print(demo_matrix_num[3,2])
[1] 6
For class matrix and data.frame below, since these two are table-like classes, you may add/change the column/row name via colnames
and rownames()
functions, even more, a specific column name could be altered by combining index.
To have the dimension of either matrix or data.frame, dim()
function is often used and the results are twon numbers while the first one is row count and the other is column count. Also, to access or change the specific element within matrix or data.frame, even list, the coordinate system is applied while the representation would be like table[row_index, column_index]
in 2D objects (matrix and data.frame), vec[index]
in 1D array/vector and List[[index]]
in 1D list object. Such coordinate system is commonly used in various programming languages including R.
demo_df <- data.frame(
col_num = 1:5,
col_str = LETTERS[1:5],
row.names = paste('row', letters[1:5], sep = '_')
)
print(demo_df)
col_num col_str row_a 1 A row_b 2 B row_c 3 C row_d 4 D row_e 5 E
demo_df$add_col = demo_df$col_num*1.3-5
demo_df[,4] = sqrt(demo_df$col_num)
print(demo_df)
col_num col_str add_col V4 row_a 1 A -3.7 1.000000 row_b 2 B -2.4 1.414214 row_c 3 C -1.1 1.732051 row_d 4 D 0.2 2.000000 row_e 5 E 1.5 2.236068
str(demo_df)
'data.frame': 5 obs. of 4 variables: $ col_num: int 1 2 3 4 5 $ col_str: chr "A" "B" "C" "D" ... $ add_col: num -3.7 -2.4 -1.1 0.2 1.5 $ V4 : num 1 1.41 1.73 2 2.24
print(demo_df[,-1])
col_str add_col V4 row_a A -3.7 1.000000 row_b B -2.4 1.414214 row_c C -1.1 1.732051 row_d D 0.2 2.000000 row_e E 1.5 2.236068
print(demo_df[,'col_num'])
[1] 1 2 3 4 5
Structure of data.frame is similar with matrix presenting table-like form. The manupulation, coordinate system, naming, all are identical with matrix except not all elements within are restrcited to single data type, which is effective and widely used for downstream analysis. Data.frame is common however its drawback are (1) poor readability for high dimention data (either large column or row count) and (2) high occupacy for large data comparing to matrix with same dimension as following comparison.
The first disadvantage is solved by class tibble from tidyverse package, which will be discussed in later section, and the second one is hard to solve since it relates to the nature of R environment and data itself.
object_matrix = matrix(1:10000, nrow = 20)
object_dataframe = as.data.frame(object_matrix)
object.size(object_matrix)
object.size(object_dataframe)
40216 bytes
124608 bytes
Variable is a container storing data values and by assigning the value to a variable, such variable is created and stored in current R enviroment. Not till 2001, the assignment operator is stricted to arrow symbol (<-
) but nowadays equation symbol (=
) is accepted as well. These two assignment operators has their own supporters and you may find the advantages of <-
from this article.
A = 100
print(A)
[1] 100
B <- 100
print(B)
[1] 100
100 -> C
print(C)
[1] 100
A/B/C here are the named variables we created with assigned value/data and both arrows (->
and <-
) clearly represent the assignment directions while =
has the its own direction, i.e. variable A is 100.
Please keep in mind for careful variable naming since it would increase/decrease your code readability plus the naming started from numbers/special symbols are generally avoided unless quoted by back quotes. Also, complicated or meaningless namings, e.g. ThisCouse_startsfrom_2pmto4pm_byYHC or tfadef.gaerre, are avoided.
1hundred = 100
Error in parse(text = x, srcfile = src): <text>:1:2: unexpected symbol 1: 1hundred ^ Traceback:
$100 = 100
Error in parse(text = x, srcfile = src): <text>:1:1: 未預期的 '$' 1: $ ^ Traceback:
`$100` = 100
print(`$100`)
[1] 100
Following examples will cover basic arithmetic (addition, subtraction, multiplication and division) and logical operations.
A <- 10
B <- 100
C <- 100
# addition
print(A+B)
[1] 110
# subtraction
print(A-B)
[1] -90
# multiplication
print(A*B)
[1] 1000
# division
print(A/B)
[1] 0.1
# equal to
print(A == B)
[1] FALSE
# not equal to
print(A != B)
[1] TRUE
# (equal to or) larger than
print(A > B)
print(A >= B)
[1] FALSE [1] FALSE
# (equal to or) smaller than
print(A < B)
print(A <= B)
[1] TRUE [1] TRUE
# and
print(A == C & B == C)
[1] FALSE
# or
print(A ==C | B == C)
[1] TRUE
# is a member or not
print(C %in% c(A,B))
[1] TRUE
Condition and iteration are two ways to control your code flow when executing complicated programs and scenarios would be often like: (a) execute a code depending on the condition, e.g. a specific gene is expression or not, and (b) execute the code repeatly many times, e.g. make the addition process 3 times in a row.
Basic conditions are included in previous logical operators hence in this part we will focus on another condition flow control (f-else statement) and common iterations including for/while-loop and apply family.
The main purpose of if-else statement is to permit a code's running or not based on the certain condition, whereas the condition could be singular or multiple and the flow structure could be linear or nested. The basic concepts of if-else statement are listed as followed.
if(the condition is TRUE){
run code
}
if(the condition is TRUE){
run code A
} else {
run code B
}
if(the condition_1 is TRUE){
run code A
} else if(the condition_2 is TRUE){
run code B
} else {
run code C
}
if(the condition_1 is TRUE){
run code A
} else {
if(the condition_2 is TRUE){
run code B
} else {
run code C
}
}
The perupose and concept of for and while loop are similar that they both execute the code repeatly but the former one are limited within a range of iterations while the latter one will stop under certain condition. As followed diagram and code examples you may understand the difference.
iteration_range = 1:N
for(iteration in iteration_range){
run code
}
index = 1
while(index <= 50){
run code
}
Like for/while-loop, apply family functions from basic R are designed to perform functions repetitively on a list of components (e.g. vector, data.frame, matrix, etc.) and are composed of members including apply()
, lapply()
, sapply()
, vapply()
, mapply()
, rapply()
, and tapply()
functions. The usage of which apply member depends on your input and output and here we will demonstrate lapply()
function, which idea and syntax structure is similar to for-loop.
iteration_range = 1:N
lapply(iteration_range,
run code
)
As demosntrated here, the syntax structure is nearly identical with for-loop that the code chunck runs repeatly within the iteration. However, the overall performance of lapply()
is better than for-loop and when it comes to preserving the output, the syntax would be different.
Given a range of numbers, how to take the square root of them?
# for-loop
numbers = 1:15
res_forloop = c()
for(n in numbers){
res_forloop[n] = sqrt(n)
}
str(res_forloop)
num [1:15] 1 1.41 1.73 2 2.24 ...
# lapply
numbers = 1:15
res_lapply = lapply(numbers, sqrt)
# res_lapply = lapply(numbers, {sqrt})
str(res_lapply)
List of 15 $ : num 1 $ : num 1.41 $ : num 1.73 $ : num 2 $ : num 2.24 $ : num 2.45 $ : num 2.65 $ : num 2.83 $ : num 3 $ : num 3.16 $ : num 3.32 $ : num 3.46 $ : num 3.61 $ : num 3.74 $ : num 3.87
The output as you can see are identical except the result format is either in vector(res_forloop
) or list (res_lapply
) format respectively. To achieve the vector format, one could use another family named sapply()
and for details about the apply family, please find this link for more information.
Vectorization is the parallel process of a function on multiple values at a time and no need to loop through each element, making an more efficient and concise code to execute and read comparing to non-vectorized format as demonstrated with following examples.
Given two numeric vectors, how to make the pair-wise addition of the elements within?
vector_A = matrix(1:15, nrow = 5)
vector_B = matrix(11:25, nrow = 5)
result = matrix(nrow = 5, ncol = 3) # empty 5x3 matrix
for(i in 1:nrow(vector_A)){
for(j in 1:ncol(vector_A)){
result[i,j] <- vector_A[i,j]+vector_B[i,j]
}
}
print(result)
[,1] [,2] [,3] [1,] 12 22 32 [2,] 14 24 34 [3,] 16 26 36 [4,] 18 28 38 [5,] 20 30 40
vector_A = matrix(1:15, nrow = 5)
vector_B = matrix(11:25, nrow = 5)
print(vector_A+vector_B)
[,1] [,2] [,3] [1,] 12 22 32 [2,] 14 24 34 [3,] 16 26 36 [4,] 18 28 38 [5,] 20 30 40
Several repositories are providing R packages including CRAN, BioConductor and Github. Installation syntax is different between these repositories.
# CRAN
install.packages('specific_pkg')
# Bioconductor
if (!require('BiocManager', quietly = TRUE))
install.packages('BiocManager')
BiocManager::install('specific_pkg')
# Github
if (!require('devtools', quietly = TRUE))
install.packages('devtools')
devtools::install_github('DeveloperName/PackageName')
To load an installed package, either two ways below are avaliable.
library(specific_pkg)
require(specific_pkg)
To check the installed packages or the version of the installed package, one may use the following lines to have the result.
installed.packges()
packageVersion('specific_pkg')
Even though R and other repositories are providing built-in and convinient packages to keep you fron reinventing the wheel, one may encounter scenarios that a user-defined function is prefered. The basic/simple syntax of function is listed.
demo_func <- function(){
print('hello world')
}
demo_func()
[1] "hello world"
A practical user-defined functions would require one/multiple parameters to fit user need and (sometimes) would need printed messages during the process, return the result, or even save the intermediate results as example followed-up.
Given a range of numbers, how to take the integre of the specific position from its square root result and save into a vector?
sqrt_dec = function(x, dec_pos = 1){
if(x>=0){
res_sqrt = sqrt(x)
if(res_sqrt %% 1 == 0){
message('input is a square number')
return(0)
} else {
message('.... processing')
res_chr = as.character(res_sqrt)
res_split = unlist(strsplit(res_chr, split ='\\.'))
res_dec = unlist(strsplit(res_split[[2]], split =''))
res_final = as.numeric(res_dec[dec_pos])
res_final
return(res_final)
}
} else {
message('incorrect input: positive number required')
return(NA)
}
}
# check arguments
args(sqrt_dec)
function (x, dec_pos = 1)
NULL
numbers = c(10, 256, 101.11, -100, -1.3)
res = sapply(numbers, sqrt_dec, dec_pos = 5)
print(res)
.... processing input is a square number .... processing incorrect input: positive number required incorrect input: positive number required
[1] 7 0 4 NA NA
With the example above, we have parameter dec_pos
indicating the decimal position with default 1 in our user-defined function. Also, to determine the input is a square number (res_sqrt %%1 == 0
) or incorrect input (NOT x>=0
), nested if-else statement is applied as well. Furtheremore, we also have built-in unlist()
function to convert the list object from the strsplit()
output. Lastly, with our final result would be a vector object, it is named res
by applying the sapply()
function on all input numers interatively.
Using a built-in data from R environment is an easy task but often you need to work with your own data in various format, e.g. .txt, .csv, .xlsx, etc. R already have built-in functions dealing with some of these file type, including read.csv()
, read.delim()
and read.table()
, but the drawback are the limited file type and the input speed. With the convinient tidyverse
package providing the solutions, you may use `read*()`_ functions (listed here and here) with your file accordingly and seamlessly as built-in R functions.
[Tips] For either tidyverse or basic R functions importing files of different formats, both local files or online files (direct file link https://path_to_the_file.csv for example) are allowed.
library(tidyverse)
Pipe(%>%
), developed by Stefan Milton Bache (magrittr R package), is an useful tool to concatenate discrete steps with a clear expression and readability. The strength of pipe, comparing to traditional work flow, will be demonstrated with follow-up hypothetical scenario.
Given a range of numbers (e.g. 1 to 10), how to take the square root and round to two decimal places?
As intermediate object is created at each step, these independent objects will be applied to another step and with numerous steps, accumulating named objects will occupy hardware memory and hamper workflow readability.
numbers = 1:10
num_sqr = sqrt(numbers)
num_sqr_digit2 = format(num_sqr, digits = 2)
print(num_sqr_digit2) # string form
num_sqr_digit2_res = as.numeric(num_sqr_digit2)
print(num_sqr_digit2_res) # numeric form
[1] "1.0" "1.4" "1.7" "2.0" "2.2" "2.4" "2.6" "2.8" "3.0" "3.2" [1] 1.0 1.4 1.7 2.0 2.2 2.4 2.6 2.8 3.0 3.2
For the hypothetical scenario, workflow could be written into one-sentence format but the nested structure, which organize the workflow in a flashback, will increase the reading difficulty as well.
numbers = 1:10
as.numeric(
format(
sqrt(numbers),
digits = 2
)
)
# or written in this way
#as.numeric(format(sqrt(numbers), digits = 2))
Supported by the %>%
, the workflow is formated as direct tone and increase the readability plus clear syntax expression. Furthermore, %>%
is also feasible in user-defined function, putting these assembly together could make your codes efficient than traditional ways.
numbers = 1:10
numbers %>%
sqrt() %>% # take the square root
format(digits = 2) %>% # round to two decimal
as.numeric() # convert result from string to numeric format
# user-defined function
sqrt_num = function(number){
sqrt(number) %>%
format(digits = 2) %>%
as.numeric()
}
numbers = 1:10
sqrt_num(numbers)
matrix and data.frame are two common R classes presenting table-like format as in MS Excel. These two classes though are similar but the contens in all cells within matrix should be identical, e.g. either all numeric or character; data.frame is a hybrid form of matrix taking columns with different classes, i.e. number+number, number+character, and character+character.
Tibble (tbl_df or tbl) is a branch of traditional data.frame as it accepts various classess in columns, e.g. formula, tbl_df, images, etc., which greatly increase the flexibility.
To demonstrate the difference between matrix, data.frame and tbl_df, we will use classic iris dataset as example.
iris flower data set is the best known dataset introduced by the British statistician and biologist Ronald Fisher in 1936. The dataset consists of 50 instances from each of 3 iris classes (Iris setosa, Iris virginica and Iris versicolor) and 4 measurements (the length and the width of the sepals and petals). For its paradigm in statstics, this dataset is already included in R environment.
data(iris)
class(iris)
str(iris)
'data.frame': 150 obs. of 5 variables: $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ... $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ... $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ... $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ... $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
head(iris)
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
---|---|---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> | <fct> | |
1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
As illustrated by str(iris)
(alternative glimpse(iris)
), the default class of iris dataset is data.frame composed of 150 samples (row) and 5 features/attributes (column) while Species is the only non-numeric column. For the following part, we will compare the difference between matrix, data.frame and tbl_df via iris dataset.
matrix_iris = as.matrix(iris)
matrix_iris[1:6,]
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 3.0 | 1.4 | 0.2 | setosa |
4.7 | 3.2 | 1.3 | 0.2 | setosa |
4.6 | 3.1 | 1.5 | 0.2 | setosa |
5.0 | 3.6 | 1.4 | 0.2 | setosa |
5.4 | 3.9 | 1.7 | 0.4 | setosa |
str(matrix_iris)
chr [1:150, 1:5] "5.1" "4.9" "4.7" "4.6" "5.0" "5.4" "4.6" "5.0" "4.4" ... - attr(*, "dimnames")=List of 2 ..$ : NULL ..$ : chr [1:5] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" ...
matrix_iris_num = iris[,-5] # remove character-containing column
matrix_iris_num[1:6,]
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | |
---|---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> | |
1 | 5.1 | 3.5 | 1.4 | 0.2 |
2 | 4.9 | 3.0 | 1.4 | 0.2 |
3 | 4.7 | 3.2 | 1.3 | 0.2 |
4 | 4.6 | 3.1 | 1.5 | 0.2 |
5 | 5.0 | 3.6 | 1.4 | 0.2 |
6 | 5.4 | 3.9 | 1.7 | 0.4 |
str(matrix_iris_num)
'data.frame': 150 obs. of 4 variables: $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ... $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ... $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ... $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
as_tibble(iris) %>% head(4)
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> | <fct> |
5.1 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 3.0 | 1.4 | 0.2 | setosa |
4.7 | 3.2 | 1.3 | 0.2 | setosa |
4.6 | 3.1 | 1.5 | 0.2 | setosa |
as_tibble(iris) %>% str()
tibble [150 × 5] (S3: tbl_df/tbl/data.frame) $ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ... $ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ... $ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ... $ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ... $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
With _as_tibble()
_ function, a data.frame object could be seamlessly converted to tbl_df form. To construct a new tbl_df object from scratch, one can apply the concept below.
S.length = iris$Sepal.Length
S.width = iris$Sepal.Width
new_tbl = tibble(
SLength = S.length,
SWidth = S.width,
# suppose we want to multiply Sepal length by Sepal width
`Length*Width` = SLength*SWidth,
class = iris$Species)
str(new_tbl)
tibble [150 × 4] (S3: tbl_df/tbl/data.frame) $ SLength : num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ... $ SWidth : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ... $ Length*Width: num [1:150] 17.8 14.7 15 14.3 18 ... $ class : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Unlike the limitation of traditional matrix and data.frame that only number/character/factor are accepeted, components within tbl_df could be other format as the example below, which mostly are stored as list.
mixture_tbl =
tibble(
string = letters[1:5],
number = 1:5,
mixture = list(
A = c('AS', 'IBMS'), # character vector
B = c(5,6,8,100), # number vector
C = function(x){x^2}, # user-defined function
D = matrix(LETTERS[1:10], nrow = 2, ncol = 5), # simple matrix
E = y~x1*(x2-3)) # formula
)
mixture_tbl
string | number | mixture |
---|---|---|
<chr> | <int> | <named list> |
a | 1 | AS , IBMS |
b | 2 | 5, 6, 8, 100 |
c | 3 | function (x) , {, x^2, } |
d | 4 | A, B, C, D, E, F, G, H, I, J |
e | 5 | y ~ x1 * (x2 - 3) |
str(mixture_tbl$mixture)
List of 5 $ A: chr [1:2] "AS" "IBMS" $ B: num [1:4] 5 6 8 100 $ C:function (x) ..- attr(*, "srcref")= 'srcref' int [1:8] 8 17 8 32 17 32 8 8 .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x7fbcca4d8998> $ D: chr [1:2, 1:5] "A" "B" "C" "D" ... $ E:Class 'formula' language y ~ x1 * (x2 - 3) .. ..- attr(*, ".Environment")=<environment: 0x7fbcca570908>
Though tbl_df follows the rules that elements within each column should be identical, e.g. column string and number, it accepts different data type mostly in list structure as column mixture as the example.
Tidy data is an importnat stepping stone for R beginners for data analysis since avaliable data nowadays are mostly structured in MS Excel-like format, which table is consist of features columns (e.g. gene expression, height, histological status, etc.) and incidance rows (e.g. patients, sample collections, etc.). For more complicated relationational tables, please find here for more examples and demonstration. Palmer penguins data will be included here for demonstration.
Palmer penguins data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network. The palmerpenguins package contains two datasets. One is called penguins, and is a simplified version of the raw data penguins_raw. This data is an alternative for classic iris data since it is much closer to real world data, which is complicated, contains time series data, and having incomplete data in some observations.
library(palmerpenguins)
data(package = 'palmerpenguins')
Package | Item | Title |
---|---|---|
<chr> | <chr> | <chr> |
palmerpenguins | penguins | Size measurements for adult foraging penguins near Palmer Station, Antarctica |
palmerpenguins | penguins_raw (penguins) | Penguin size, clutch, and blood isotope data for foraging adults near Palmer Station, Antarctica |
penguins_raw
¶glimpse(penguins_raw)
head(penguins_raw)
Rows: 344 Columns: 17 $ studyName <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL… $ `Sample Number` <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1… $ Species <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie P… $ Region <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers"… $ Island <chr> "Torgersen", "Torgersen", "Torgersen", "Torgerse… $ Stage <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adu… $ `Individual ID` <chr> "N1A1", "N1A2", "N2A1", "N2A2", "N3A1", "N3A2", … $ `Clutch Completion` <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", … $ `Date Egg` <date> 2007-11-11, 2007-11-11, 2007-11-16, 2007-11-16,… $ `Culmen Length (mm)` <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34… $ `Culmen Depth (mm)` <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18… $ `Flipper Length (mm)` <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190,… $ `Body Mass (g)` <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 34… $ Sex <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE"… $ `Delta 15 N (o/oo)` <dbl> NA, 8.94956, 8.36821, NA, 8.76651, 8.66496, 9.18… $ `Delta 13 C (o/oo)` <dbl> NA, -24.69454, -25.33302, NA, -25.32426, -25.298… $ Comments <chr> "Not enough blood for isotopes.", NA, NA, "Adult…
studyName | Sample Number | Species | Region | Island | Stage | Individual ID | Clutch Completion | Date Egg | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Sex | Delta 15 N (o/oo) | Delta 13 C (o/oo) | Comments |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <dbl> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <date> | <dbl> | <dbl> | <dbl> | <dbl> | <chr> | <dbl> | <dbl> | <chr> |
PAL0708 | 1 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A1 | Yes | 2007-11-11 | 39.1 | 18.7 | 181 | 3750 | MALE | NA | NA | Not enough blood for isotopes. |
PAL0708 | 2 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A2 | Yes | 2007-11-11 | 39.5 | 17.4 | 186 | 3800 | FEMALE | 8.94956 | -24.69454 | NA |
PAL0708 | 3 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N2A1 | Yes | 2007-11-16 | 40.3 | 18.0 | 195 | 3250 | FEMALE | 8.36821 | -25.33302 | NA |
PAL0708 | 4 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N2A2 | Yes | 2007-11-16 | NA | NA | NA | NA | NA | NA | NA | Adult not sampled. |
PAL0708 | 5 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N3A1 | Yes | 2007-11-16 | 36.7 | 19.3 | 193 | 3450 | FEMALE | 8.76651 | -25.32426 | NA |
PAL0708 | 6 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N3A2 | Yes | 2007-11-16 | 39.3 | 20.6 | 190 | 3650 | MALE | 8.66496 | -25.29805 | NA |
penguins
¶glimpse(penguins)
head(penguins)
Rows: 344 Columns: 8 $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel… $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse… $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, … $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, … $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186… $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, … $ sex <fct> male, female, female, NA, female, male, female, male… $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|
<fct> | <fct> | <dbl> | <dbl> | <int> | <int> | <fct> | <int> |
Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
Adelie | Torgersen | NA | NA | NA | NA | NA | 2007 |
Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female | 2007 |
Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male | 2007 |
Data processing (or wrangling) is often the step after the data collection and right before analysis/visualization. The main object of this step is to clean-up the real world messy data into an organized form, including to handle missing data, simplify naming and content, or even to create a specific attribute (feature). Here we would demonstrate the process by converting penguins_raw
data to penguins
data in multiple steps by using functions from tidyverse package, where you may found other useful packages/functions to fit your research link.
Comparing penguins_raw
and penguins
you may found the big difference in many ways, e.g. column removal, column renaming, upper/lower case conversion, time stamp summarization, content manipulation
# retain certain columns
names(penguins_raw)
names(penguins)
col_keep = c('Species', 'Island', 'Culmen Length (mm)', 'Culmen Depth (mm)', 'Flipper Length (mm)', 'Body Mass (g)', 'Sex', 'Date Egg')
# compare "Species" column
penguins_raw$Species %>% unique()
penguins$species %>% unique()
# compare "Island" column
penguins_raw$Island %>% unique()
penguins$island %>% unique()
# compare "Sex" column
penguins_raw$Sex %>% unique()
penguins$sex %>% unique()
# compare "Date" column
penguins_raw$`Date Egg` %>% unique()
penguins$year %>% unique()
With initial comparison, now we can find the main difference between penguins_raw
and penguins
data are:
# user-defined function to simplify Species
species_convert = function(str){
if(str == 'Adelie Penguin (Pygoscelis adeliae)'){
return('Adelie')
} else if(str == 'Gentoo penguin (Pygoscelis papua)'){
return('Gentoo')
} else{
return('Chinstrap')
}
}
penguins_process =
penguins_raw %>%
# retain column
select(all_of(col_keep)) %>%
# rename column
rename(
'species' = 'Species', # new_name = old_name
'island' = 'Island',
'sex' = 'Sex',
'body_mass_g' = 'Body Mass (g)',
'year' = 'Date Egg',
'flipper_length_mm' = 'Flipper Length (mm)',
'bill_length_mm' = col_keep[3],
'bill_depth_mm' = col_keep[4],
) %>%
mutate(
# simplify Species column
species = map_chr(species, function(str) species_convert(str)),
# convert Sex column
sex = map_chr(sex, tolower),
# simplify Date column
year = map_dbl(year, function(date){
tmp =
as.character(date) %>%
str_split(pattern = '-') %>%
unlist()
return(as.numeric(tmp[[1]]))
})
) %>%
# convert character columns as factor columns
mutate_if(is.character, factor) %>%
# convert "body mass" and "year" columns as integer
mutate_at(all_of(c('body_mass_g', 'year')), as.integer)
head(penguins_process)
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|
<fct> | <fct> | <dbl> | <dbl> | <dbl> | <int> | <fct> | <int> |
Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
Adelie | Torgersen | NA | NA | NA | NA | NA | 2007 |
Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female | 2007 |
Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male | 2007 |
head(penguins)
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|
<fct> | <fct> | <dbl> | <dbl> | <int> | <int> | <fct> | <int> |
Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
Adelie | Torgersen | NA | NA | NA | NA | NA | 2007 |
Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female | 2007 |
Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male | 2007 |
# How many studies in each year?
count(penguins, year)
year | n |
---|---|
<int> | <int> |
2007 | 110 |
2008 | 114 |
2009 | 120 |
# What’s the distribution of species in each island?
penguins %>% count(species, island)
species | island | n |
---|---|---|
<fct> | <fct> | <int> |
Adelie | Biscoe | 44 |
Adelie | Dream | 56 |
Adelie | Torgersen | 52 |
Chinstrap | Dream | 68 |
Gentoo | Biscoe | 124 |
# How many penguins’ body mass are larger than 3700 g?
penguins %>%
filter(body_mass_g>3700) %>%
nrow()
# What's the largest size of flipper in each species?
penguins %>%
group_by(species) %>%
summarize(max_flipper = max(flipper_length_mm, na.rm = T))
species | max_flipper |
---|---|
<fct> | <int> |
Adelie | 210 |
Chinstrap | 212 |
Gentoo | 231 |
# What’s the distribution of bill’s length/depth ratio of each species?
penguins_billstat =
penguins %>%
mutate(bill_ratio = bill_length_mm/bill_depth_mm) %>%
group_by(species) %>%
nest() %>%
mutate(stat = purrr::map(data, function(d){summary(d$bill_ratio)}))
penguins_billstat$stat
[[1]] Min. 1st Qu. Median Mean 3rd Qu. Max. NA's 1.640 2.011 2.137 2.120 2.238 2.450 1 [[2]] Min. 1st Qu. Median Mean 3rd Qu. Max. NA's 2.566 3.066 3.167 3.176 3.285 3.613 1 [[3]] Min. 1st Qu. Median Mean 3rd Qu. Max. 2.351 2.560 2.662 2.654 2.728 3.258
Data visualization is the graphical representation of information and data by using visual concepts like charts, graphs, and maps. Through converting into human-sensible images, symbols, colors, or textures, data used to be hard to intepret are transformed and the information transparency are increased. Several systems are supporting R graphics and ggplot2 included in tidyverse package is one of the package that is a good point to start since its grammer of graphic and user communities are truly user-friendly. Also, numerous extensions are developing and pushing the boundary of data visualization to various research fields (e.g. genomic/neuron science/phylogenetic/image analysis) instead of limited in statistical analysis. You may find scientific packages from these resources: Bioconductor project, Neuroconductor project, and useful ggplot2 extensions to fit your need. And to have a meaningful visualization result, one may need to define the question of interest, target audience, and the key information to give. For this section, we will demonstrate common visualizations in exploratory data analysis using ggplot2, ggpubr and viridis packages for efficient publication-ready and colorblind-friendly graphs.
# How many studies in each year?
penguins %>%
count(year) %>%
# visualization
ggplot() +
geom_bar(
mapping = aes(x = factor(year), y = n, fill =
factor(year)),
stat = "identity",
width = 0.7
) +
# plot title/axis label setting
labs(
title = 'Penguin Studies in Each Year',
subtitle = 'Palmer penguins',
caption = 'data source: package("palmerpenguins")',
x = 'Study Year',
y = 'Sample Size', # change axis title
fill = 'Year' # change legend title
) +
# plot theme setting
theme_bw() +
theme(
legend.position = 'bottom', # change legend position
plot.title = element_text(face = 'bold', size = 24), # change font and size
axis.text = element_text(color = 'navy', size = 18), # change color and size
axis.title.x = element_blank(),
axis.title.y = element_text(size = 20) # hide x-axis title
)
# What’s the distribution of bill’s length/depth ratio of each species?
penguins %>%
mutate(
bill_ratio = bill_length_mm/bill_depth_mm,
# reorder species by the median of body mass
species = fct_reorder(.f = species, .x = body_mass_g, .fun =
median, na.rm = T, .desc = T)
) %>%
ggplot(aes(x = species, y = bill_ratio)) +
geom_violin(color = 'black') +
geom_jitter(aes(color = species, shape = species), alpha = 0.8, size = 3) +
theme_classic() +
theme(
legend.position = 'none',
axis.title = element_text(face = 'bold.italic', size = 20, color = 'orange'),
axis.text = element_text(size = 18)
)
Warning message: “Removed 2 rows containing non-finite values (stat_ydensity).” Warning message: “Removed 2 rows containing missing values (geom_point).”
# What’s the relationship between body mass and other feature in each species?
penguins %>%
ggplot(aes(x = body_mass_g, y = bill_depth_mm, color = species)) +
geom_point() +
geom_smooth(method = 'lm', color = 'black') +
facet_wrap(~species, nrow = 3, ncol = 1) +
theme_bw() +
theme(
legend.position = 'none',
axis.title = element_text(face = 'bold', size = 18),
strip.text = element_text(face = 'bold', size = 18)
)
`geom_smooth()` using formula 'y ~ x' Warning message: “Removed 2 rows containing non-finite values (stat_smooth).” Warning message: “Removed 2 rows containing missing values (geom_point).”
library(ggpubr)
library(ggsci)
# How to understand the distribution of culmen length between penguin speccies?
ggdensity(
penguins,
x = "bill_length_mm", # target variable (feature)
add = "median", # additional/optional group average (mean)
rug = TRUE, # add marginal rug for data distribution
color = "species", # group stratification reference
fill = "species", # group stratification reference
) +
scale_fill_viridis_d() + # colorblind-friendly coloring
scale_color_viridis_d() + # colorblind-friendly coloring
labs(x = 'Bill length (mm)', y = 'Density') +
theme_classic() +
theme(legend.position = 'bottom')
Warning message: “Removed 2 rows containing non-finite values (stat_density).”
# What's the egging frequency of these penguins in each month by years?
penguins_raw %>%
mutate(
yr_egg = map_dbl(`Date Egg`, function(date){
tmp =
as.character(date) %>%
str_split(pattern = '-') %>%
unlist()
return(as.numeric(tmp[[1]]))
}),
month_egg = map_dbl(`Date Egg`, function(date){
tmp =
as.character(date) %>%
str_split(pattern = '-') %>%
unlist()
return(as.numeric(tmp[[2]]))
})
) %>%
#dplyr::select(Species, yr_egg, month_egg)
group_by(yr_egg, month_egg) %>%
summarize(count = n()) %>%
mutate(
date_label = paste(yr_egg, month_egg, sep = '-'),
yr_egg = as.factor(yr_egg) # convert numeric `yr_egg` to factor for grouping
) %>%
# dot plot
ggdotchart(
x = "date_label", y = "count",
color = "yr_egg", # color group by `yr_egg`
sorting = "descending", # sort value in descending order
add = "segments", # add segments from y = 0 to dots
rotate = TRUE, # rotate vertically
dot.size = 8,
) +
ggsci::scale_color_nejm() + # using NEJM coloring
labs(
title = 'Monthly egging frequency (from 2007 to 2009)',
color = 'Year', X = 'Month', y = 'Observed egging') +
theme(legend.position = 'right')
`summarise()` has grouped output by 'yr_egg'. You can override using the `.groups` argument.
# What’s the distribution of body mass of each species? Please provide statistical significance as well.
compares = list(c('Adelie', 'Gentoo'), c('Adelie', 'Chinstrap'), c('Gentoo', 'Chinstrap'))
penguins %>%
mutate(
# reorder species by the median of body mass
species = fct_reorder(.f = species, .x = body_mass_g, .fun =
median, na.rm = T, .desc = T),
body_mass_kg = body_mass_g/1000
) %>%
ggplot(aes(x = species, y = body_mass_kg)) +
geom_violin(color = 'black') +
geom_jitter(aes(color = species, shape = species), alpha = 0.8, size = 3) +
theme_classic() +
theme(
legend.position = 'none',
axis.title = element_text(size = 16, face = 'italic'),
axis.text = element_text(size = 11)
) +
ggsci::scale_color_jco() + # using JCO coloring
labs(
x = 'Species', y = 'Body Mass (Kg)',
caption = 'Body mass difference between groups are done by Wilcoxon test'
) +
# add statistical results between species
stat_compare_means(
method = 'wilcox.test', # can change to *t.test*
comparisons = compares, # combination of group comparisons
label.y = c(6.5, 6.7, 6.9) # position for significance labling
)+
# add global anova p-value
stat_compare_means(label.y = 7.3)
Warning message: “Removed 2 rows containing non-finite values (stat_ydensity).” Warning message: “Removed 2 rows containing non-finite values (stat_signif).” Warning message: “Removed 2 rows containing non-finite values (stat_compare_means).” Warning message: “Removed 2 rows containing missing values (geom_point).”
# How to visualize the bill's depth and length relationships between penguin species?
ggscatterhist(
penguins,
x = "bill_depth_mm", y = "bill_length_mm",
color = "species", fill = 'species',
size = 3, alpha = 0.6,
palette = "simpsons",
margin.plot = "histogram",
ggtheme = theme_bw()
)
Warning message: “Removed 2 rows containing missing values (geom_point).” Warning message: “Removed 2 rows containing missing values (geom_point).” Warning message: “Removed 2 rows containing non-finite values (stat_bin).” Warning message: “Removed 2 rows containing missing values (geom_point).” Warning message: “Removed 2 rows containing non-finite values (stat_bin).”
Various databases and web tools are providing public cancer data and some of them are also visualizing summary plots and advanced survival analysis upon user query. And among numerous cancer data contributors, the Cancer Genome Atlas (TCGA) is one of widely recognized by researcher and also archived by databases including Genomic Data Commons (GDC), cBioPortal, Broad GDAC Firehose, UCSC Xena. Furthermore, apart from direct download, these sources and numerous R packages (RTCGAToolbox, TCGAbiolinks, cBioPortalData, etc.) are providing access for cancer data. With this section, we will use UCSC Xena for followed demonstration.
Xena is a genomic data accessing-hub hosted by UCSC providing the collection of other public databases (TCGA, Broad Institute, ICGC, GTex, CCLE, etc.). Users can access, download, analyze data like cancer or single-cell datasets directly on the website or locally on personal device for the data deployed on Xena are tier-3 public data. And UCSCXenaTools is the R package providing the accession inside R environment for convenient analysis workflow.
library(tidyverse)
library(UCSCXenaTools)
data(XenaData)
head(XenaData)
XenaHosts | XenaHostNames | XenaCohorts | XenaDatasets | SampleCount | DataSubtype | Label | Type | AnatomicalOrigin | SampleType | Tags | ProbeMap | LongTitle | Citation | Version | Unit | Platform |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <int> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
https://ucscpublic.xenahubs.net | publicHub | Breast Cancer Cell Lines (Neve 2006) | ucsfNeve_public/ucsfNeveExp_genomicMatrix | 51 | gene expression | Neve Cell Line gene expression | genomicMatrix | Breast | cell line | cell lines,breast cancer | probeMap/affyU133_ucscGenomeBrowser_hg18.probeMap | Cell Line Gene Expression (Neve et al. Cancer Cell 2006) | Cancer Cell. 2006 Dec;10(6):515-27. | 2011-11-01 | NA | NA |
https://ucscpublic.xenahubs.net | publicHub | Breast Cancer Cell Lines (Neve 2006) | ucsfNeve_public/ucsfNeve_public_clinicalMatrix | 57 | phenotype | Phenotypes | clinicalMatrix | cancer | NA | NA | NA | 2011-11-11 | NA | NA | ||
https://ucscpublic.xenahubs.net | publicHub | Glioma (Kotliarov 2006) | kotliarov2006_public/kotliarov2006_genomicMatrix | 194 | copy number | Kotliarov Glioma CGH | genomicMatrix | Brain | tumor | cancer,neural | probeMap/probesAffy100K | Glioma CGH (Kotliarov et al. 2006) | Cancer Res. 2006 Oct 1;66(19):9428-36. | 2011-11-01 | NA | NA |
https://ucscpublic.xenahubs.net | publicHub | Glioma (Kotliarov 2006) | kotliarov2006_public/kotliarov2006_public_clinicalMatrix | 194 | phenotype | Phenotypes | clinicalMatrix | cancer | NA | NA | NA | 2011-11-11 | NA | NA | ||
https://ucscpublic.xenahubs.net | publicHub | Lung Cancer CGH (Weir 2007) | weir2007_public/weir2007_genomicMatrix | 383 | copy number | CGH | genomicMatrix | Lung | tumor | cancer | probeMap/probeAffy500K | Lung CGH (Weir et al. 2007) | Nature. 2007 Dec 6;450(7171):893-8. Epub 2007 Nov 4. | 2011-11-01 | NA | NA |
https://ucscpublic.xenahubs.net | publicHub | Lung Cancer CGH (Weir 2007) | weir2007_public/weir2007_public_clinicalMatrix | 383 | phenotype | Phenotypes | clinicalMatrix | cancer | NA | NA | NA | 2011-11-11 | NA | NA |
XenaScan(
pattern = 'Glioblastoma|gene expression',
ignore.case = T) %>%
head(3)
XenaHosts | XenaHostNames | XenaCohorts | XenaDatasets | SampleCount | DataSubtype | Label | Type | AnatomicalOrigin | SampleType | Tags | ProbeMap | LongTitle | Citation | Version | Unit | Platform |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <int> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
https://ucscpublic.xenahubs.net | publicHub | Breast Cancer Cell Lines (Neve 2006) | ucsfNeve_public/ucsfNeveExp_genomicMatrix | 51 | gene expression | Neve Cell Line gene expression | genomicMatrix | Breast | cell line | cell lines,breast cancer | probeMap/affyU133_ucscGenomeBrowser_hg18.probeMap | Cell Line Gene Expression (Neve et al. Cancer Cell 2006) | Cancer Cell. 2006 Dec;10(6):515-27. | 2011-11-01 | NA | NA |
https://ucscpublic.xenahubs.net | publicHub | Breast Cancer (Chin 2006) | chin2006_public/chin2006Exp_genomicMatrix | 118 | gene expression | Gene expression | genomicMatrix | Breast | tumor | cancer | probeMap/affyU133_ucscGenomeBrowser_hg18.probeMap | Gene Expression (Chin et al. Cancer Cell 2006) | Cancer Cell. 2006 Dec;10(6):529-41. | 2011-11-01 | NA | NA |
https://ucscpublic.xenahubs.net | publicHub | Melanoma (Lin 2008) | lin2008_public/lin2008Exp_genomicMatrix | 95 | gene expression | Lin Gene Exp | genomicMatrix | Skin | tumor | cancer | probeMap/affyU133_ucscGenomeBrowser_hg18.probeMap | Melanoma Gene Expression (Lin et al. 2008) | Cancer Res. 2008 Feb 1;68(3):664-73. | 2011-11-01 | NA | NA |
XenaScan(
pattern = 'Glioblastoma.*gene expression|gene expression.*Glioblastoma',
ignore.case = T) %>%
head(3)
XenaHosts | XenaHostNames | XenaCohorts | XenaDatasets | SampleCount | DataSubtype | Label | Type | AnatomicalOrigin | SampleType | Tags | ProbeMap | LongTitle | Citation | Version | Unit | Platform |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <int> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
https://tcga.xenahubs.net | tcgaHub | TCGA Glioblastoma (GBM) | TCGA.GBM.sampleMap/HT_HG-U133A | 539 | gene expression array | AffyU133a | genomicMatrix | Brain | tumor | cancer,nervous system | NA | TCGA glioblastoma multiforme (GBM) gene expression (AffyU133a array) | NA | 2017-09-08 | log2(affy RMA) | AffyU133a |
https://tcga.xenahubs.net | tcgaHub | TCGA Glioblastoma (GBM) | TCGA.GBM.sampleMap/HiSeqV2_percentile | 172 | gene expression RNAseq | IlluminaHiSeq percentile | genomicMatrix | Brain | tumor | cancer,nervous system | probeMap/hugo_gencode_good_hg19_V24lift37_probemap | TCGA glioblastoma multiforme (GBM) gene expression by RNAseq (polyA+ IlluminaHiSeq percentile) | NA | 2017-10-13 | percentile rank | IlluminaHiSeq_RNASeqV2 |
https://tcga.xenahubs.net | tcgaHub | TCGA Glioblastoma (GBM) | TCGA.GBM.sampleMap/HiSeqV2_PANCAN | 172 | gene expression RNAseq | IlluminaHiSeq pancan normalized | genomicMatrix | Brain | tumor | cancer,nervous system | probeMap/hugo_gencode_good_hg19_V24lift37_probemap | TCGA glioblastoma multiforme (GBM) gene expression by RNAseq (ployA+ IlluminaHiSeq), pancan normalized | NA | 2017-10-13 | pan-cancer normalized log2(norm_count+1) | IlluminaHiSeq_RNASeqV2 |
XenaData %>%
filter(
grepl(XenaHostNames, pattern = 'tcga', ignore.case = T),
grepl(DataSubtype, pattern = 'gene expression', ignore.case = T),
grepl(AnatomicalOrigin, pattern = 'brain', ignore.case = T)
) %>%
head(3)
XenaHosts | XenaHostNames | XenaCohorts | XenaDatasets | SampleCount | DataSubtype | Label | Type | AnatomicalOrigin | SampleType | Tags | ProbeMap | LongTitle | Citation | Version | Unit | Platform |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <int> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
https://tcga.xenahubs.net | tcgaHub | TCGA Lower Grade Glioma (LGG) | TCGA.LGG.sampleMap/HiSeqV2_percentile | 530 | gene expression RNAseq | IlluminaHiSeq percentile | genomicMatrix | Brain | tumor | cancer,nervous system | probeMap/hugo_gencode_good_hg19_V24lift37_probemap | TCGA brain lower grade glioma (LGG) gene expression by RNAseq (polyA+ IlluminaHiSeq percentile) | NA | 2017-10-13 | percentile rank | IlluminaHiSeq_RNASeqV2 |
https://tcga.xenahubs.net | tcgaHub | TCGA Lower Grade Glioma (LGG) | TCGA.LGG.sampleMap/HiSeqV2_PANCAN | 530 | gene expression RNAseq | IlluminaHiSeq pancan normalized | genomicMatrix | Brain | tumor | cancer,nervous system | probeMap/hugo_gencode_good_hg19_V24lift37_probemap | TCGA brain lower grade glioma (LGG) gene expression by RNAseq (ployA+ IlluminaHiSeq), pancan normalized | NA | 2017-10-13 | pan-cancer normalized log2(norm_count+1) | IlluminaHiSeq_RNASeqV2 |
https://tcga.xenahubs.net | tcgaHub | TCGA Lower Grade Glioma (LGG) | TCGA.LGG.sampleMap/HiSeqV2 | 530 | gene expression RNAseq | IlluminaHiSeq | genomicMatrix | Brain | tumor | cancer,nervous system | probeMap/hugo_gencode_good_hg19_V24lift37_probemap | TCGA brain lower grade glioma (LGG) gene expression by RNAseq (polyA+ IlluminaHiSeq) | NA | 2017-10-13 | log2(norm_count+1) | IlluminaHiSeq_RNASeqV2 |
file_op1 =
XenaScan(pattern = 'Glioblastoma.*gene expression|gene expression.*Glioblastoma', ignore.case = T) %>% # search Glioblastoma AND gene expression
XenaGenerate()
file_op1@datasets
file_op2 =
XenaGenerate(subset = XenaHostNames=="tcgaHub") %>%
XenaFilter(filterDatasets = "GBM.*HiSeqV2_PANCAN|HiSeqV2_PANCAN.*GBM")
file_op2@datasets
datasetID = 'TCGA.GBM.sampleMap/HiSeqV2_PANCAN'
file_op3 =
XenaGenerate() %>%
XenaFilter(filterDatasets = datasetID)
file_op3@datasets
df_gbm_op3 =
file_op3 %>%
XenaQuery() %>%
XenaDownload() %>%
XenaPrepare()
dim(df_gbm_op3)
head(df_gbm_op3)
#file = 'https://raw.githubusercontent.com/yenhsieh/LSL_2022/main/data/TCGA_CHOL_RNAseq'
file_url = 'https://tcga-xena-hub.s3.us-east-1.amazonaws.com/download/TCGA.CHOL.sampleMap%2FHiSeqV2.gz'
data = read_delim(file_url, delim = '\t')
head(data)
Over-representation analysis (ORA) is a statistical method to determine if a geneset (e.g. pathway from Gene Ontology or KEGG) is over-representated within user gene list, e.g. differentially expressed genes (DEGs) between two groups selected by the significance cutoff. Geneset enrichment analysis (GSEA) is a extented version of ORA whereas DEGs are ranked by the fold change as input, unlike ORA using DEGs list only, is much feasible for situations where all genes in a predefined gene set change in a small but coordinated way. As metioned above, either ORA or GSEA are requiring DEGs as input component, which can be identified through various methods/packages (e.g. DESeq2, edgeR, etc) and here we will start from limma-voom of edgeR to discover DEGs for downstream ORA and GSEA analysis.
clusterProfiler is a R package which provide a various accession to a list of well-known annotation databases, e.g. KEGG, Reactome, Wiki Pathways. Apart from these annotation collected, it is also capable for both semantic similarity and functional enrichment analysis with a friendly and tidy fashion for biologist without command line tool experience to access, analyze and visualize their results. According to official document, genomic coordinates, gene and gene cluster are supported as input format. (source)
To demonstrate the workflow of DEG, here we start from downloding bladder cancer's gene expression data from Xena as example. With the input format (a n*m numeric matrix object whereas columns are samples and rows are genomic features) as suggested by edgeR document, you may prepare the input beyond this cancer data example. (data source)
dataID = 'TCGA.BLCA.sampleMap/HiSeqV2$'
df_exp =
XenaGenerate() %>%
XenaFilter(filterDatasets = dataID) %>%
XenaQuery() %>%
#XenaDownload(force = F, destdir = "C:/Users/YHC/Documents") %>% # Windows System
XenaDownload(force = F) %>%
XenaPrepare() %>%
dplyr::rename('gene'='sample')
Considering the unit from Xena is log2(norm_count+1), here we reverse expected_count to original count and take the nearest integer and you may find the difference below.
df_exp_new =
df_exp %>%
mutate_if(is.numeric, function(x){round(2^x-1, digits = 0)})
head(df_exp)
head(df_exp_new)
In order to confirm the input having any missing data, genes having all zero count, and correct format (i.e. matrix) for followed-up analysis, data wrangling are required.
# Check if missing value exist and convert dataframe into matrix format
df_exp_new %>%
select_if(is.numeric) %>% # select numeric columns
as.matrix() %>% {.->>m_exp} %>% # convert to matrix
as.vector() %>% # convert 2D matrix as 1D array
is.na() %>% # check if NA
sum() # NA count
# convert input as matrix and add row names
row.names(m_exp) = df_exp_new$gene
m_exp[1:6, 1:5]
# identify rows with all zero
all_zero = map_lgl(1:nrow(m_exp), function(i){ sum(m_exp[i,] == 0) == ncol(m_exp)})
table(all_zero)
# retain rows not all zero
m_exp_new = m_exp[-which(all_zero == TRUE),]
dim(m_exp)
dim(m_exp_new)
As indicated, some rows (genes) are having all patients with no expression count hence will be removed.
Sample group annotation is required since the main object for this analysis is to distinguish the difference between two groups and here TCGA samples will be classified as tumor or adjacent by the barcode accordingly.
sample_class = map_chr(colnames(m_exp_new), function(x){
ifelse(str_sub(x, 14,14)=='0', yes = 'tumor', no = 'adjNormal')
}) %>% factor()
table(sample_class)
sample_class adjNormal tumor 19 407
Patient data collected by TCGA are labeled with its own barcode system, which a dataset may composed of tumor and adjacent normal samples. Whether if the sample is tumor(0) or adjacent normal(1) could be defined by the 14th position of the barcode. With all these process, our data is ready for differential gene expression analysis.
library(edgeR)
# quantile normalization
y = normalizeQuantiles(m_exp_new)
Loading required package: limma
With the setting of desired cutoff, genes with low-expressed count and expressed in small size samples would be neglected. The difference of before/after this procedure could be visualized via Voom variance plot, whereas each dot in the figure represents a gene and also the mean-variance relationship of included genes in the dataset. The main purpose of this plot is to identify if low counts genes have been filtered and if lots of variation in the data exist. The cutoff setting is data-dependent.
# cutoff setting
cutoff_count = 10
cutoff_sample = 20
genes_keep = rowSums(y > cutoff_count) >= cutoff_sample
table(genes_keep)
genes_keep FALSE TRUE 3011 17203
design = model.matrix(~0 + sample_class)
colnames(design) = c("adjNormal","tumor")
head(design)
adjNormal | tumor | |
---|---|---|
1 | 1 | 0 |
2 | 0 | 1 |
3 | 0 | 1 |
4 | 0 | 1 |
5 | 0 | 1 |
6 | 0 | 1 |
bef = voom(y, design, plot = T)
y_after = y[genes_keep,]
aft = voom(y_after, design, plot = T)
contrast = makeContrasts(tumor-adjNormal, levels=design)
contrast
tumor - adjNormal | |
---|---|
adjNormal | -1 |
tumor | 1 |
fit = lmFit(y_after, design)
fitC = contrasts.fit(fit, contrast)
fitC = eBayes(fitC, robust=TRUE, trend=TRUE)
summary(decideTests(fitC))
tumor - adjNormal Down 3081 NotSig 10614 Up 3508
# fitting result table
tab_deg =
topTable(fitC, adjust="BH", n=Inf) %>%
na.omit() %>%
rownames_to_column('gene') %>%
as_tibble() %>%
dplyr::select(gene, everything())
dim(tab_deg)
head(tab_deg)
gene | logFC | AveExpr | t | P.Value | adj.P.Val | B |
---|---|---|---|---|---|---|
<chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
C16orf89 | -620.0308 | 50.57542 | -28.23560 | 1.408132e-99 | 2.422410e-95 | -1.788981 |
F10 | -352.6064 | 37.23251 | -26.80088 | 2.084259e-93 | 1.792775e-89 | -1.893120 |
CLEC3B | -2284.7473 | 219.54431 | -25.25750 | 1.158656e-86 | 6.644122e-83 | -2.015517 |
SCARA5 | -2271.2513 | 157.10191 | -24.93476 | 3.050763e-85 | 1.312057e-81 | -2.042396 |
PI16 | -5118.6775 | 333.11552 | -24.31372 | 1.688589e-82 | 5.809759e-79 | -2.095473 |
C1QTNF7 | -464.7453 | 57.62090 | -23.03347 | 8.268577e-77 | 2.370739e-73 | -2.210654 |
# retain significant DEGs
sig_deg = filter(tab_deg, abs(logFC)>0.5 & tab_deg$adj.P.Val < 0.05)
# classify as up-/down-regulated DEGs
deg_up = filter(sig_deg, logFC > 1) %>% arrange(desc(logFC))
deg_dn = filter(sig_deg, logFC < -1) %>% arrange(desc(logFC))
The main object of geneset/pathway analysis is to determines if a pre-defined sets, pathways collected from KEGG or Gene Ontology, are over-represented in a subset of user data beyond expected. For instance, given a geneset composed of 100 genes (e.g. liver cancer-specific pathway) and a knockout experiment with 500 DEGs while 90 of them are in the geneset, then the geneset is highly over/under-represented in the experiment. Above mentioned is the straightforward concept of conventional over-representation analysis (ORA) and followed by statistical tests such as the cumulative hyper-geometric test, significance of the overlapping could be calculated and filtered by cutoff (e.g. p-value 0.05) to select the annotated genesets.
Unlike ORA, Gene Set Enrichment Analysis (GSEA) ranks all genes in the genome to eliminate the need of cutoff as ORA (e.g. fold change) to define the input gene set. With such ranking, e.g. level of differential expression, genesets are expected as high or low via running-sum statistic.
library(org.Hs.eg.db)
library(clusterProfiler)
library(enrichplot)
Here we take down-regulated DEGs to determine the enrichment result in the Biological Process category of Gene Ontology and also several visualization presentation.
geneList_dn = deg_dn$logFC %>% setNames(deg_dn$gene)
DEG_ora_dn = enrichGO(
gene = names(geneList_dn), # only gene list is required
OrgDb = org.Hs.eg.db,
keyType="SYMBOL",
ont = "BP",
pAdjustMethod = "BH")
DEG_ora_dn
# # over-representation test # #...@organism Homo sapiens #...@ontology BP #...@keytype SYMBOL #...@gene chr [1:3081] "DMRTC1B" "SEC1" "FAM7A3" "C22orf26" "CCDC158" "CC2D2B" ... #...pvalues adjusted by 'BH' with cutoff <0.05 #...1314 enriched terms found 'data.frame': 1314 obs. of 9 variables: $ ID : chr "GO:0003012" "GO:0006936" "GO:0031589" "GO:0060537" ... $ Description: chr "muscle system process" "muscle contraction" "cell-substrate adhesion" "muscle tissue development" ... $ GeneRatio : chr "157/2558" "123/2558" "111/2558" "121/2558" ... $ BgRatio : chr "467/18866" "362/18866" "359/18866" "409/18866" ... $ pvalue : num 3.14e-29 1.41e-23 7.24e-18 1.12e-17 9.06e-17 ... $ p.adjust : num 1.95e-25 4.39e-20 1.50e-14 1.74e-14 1.13e-13 ... $ qvalue : num 1.40e-25 3.14e-20 1.08e-14 1.25e-14 8.08e-14 ... $ geneID : chr "HTR2A/SLC8A3/KCNJ3/HTR2B/NMUR1/SCN3B/CTNNA3/TNNI3K/SCN2B/TMOD4/MYOT/ACTA1/LMOD3/KCNN2/TRPC3/KCNA5/SMPX/CKMT2/TA"| __truncated__ "HTR2A/SLC8A3/KCNJ3/HTR2B/NMUR1/SCN3B/CTNNA3/TNNI3K/SCN2B/TMOD4/MYOT/ACTA1/LMOD3/KCNN2/KCNA5/SMPX/CKMT2/TACR1/KC"| __truncated__ "TECTA/MADCAM1/PPFIA2/FER/RADIL/EDA/COL13A1/CORO2B/EPB41L5/PRKCE/EDIL3/ANGPT1/ATP1B2/NTNG1/FREM1/TEK/CLASP2/ECM2"| __truncated__ "KCNK2/KCNAB1/MTM1/GLI1/SGCG/ACTA1/LMOD3/FGF9/MEOX2/S100B/TWIST1/MYOM2/PPARGC1A/MYO18B/SAV1/TBX20/MAP2K4/CREB1/B"| __truncated__ ... $ Count : int 157 123 111 121 126 133 117 114 110 125 ... #...Citation Guangchuang Yu, Li-Gen Wang, Yanyan Han and Qing-Yu He. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology 2012, 16(5):284-287
dotplot(DEG_ora_dn, showCategory = 20) +
labs(title = "ORA of down-regulated DEGs") +
theme_bw() +
theme(
plot.title = element_text(face = 'bold', size = 20),
axis.text = element_text(size = 15),
axis.title = element_text(size = 18)
)
wrong orderBy parameter; set to default `orderBy = "x"`
DEG_ora_dn %>%
pairwise_termsim() %>%
emapplot(layout = 'kk', cex_category = 1.5) +
labs(title = 'Enrichment map of down-regulated DEGs') +
theme(plot.title = element_text(size = 18, face = 'bold'))
Here again we take up-regulated DEGs to determine the GSEA result in the Biological Process category of Gene Ontology and also several visualization presentation. The visualization methods are interchangable with previous ORA section.
DEG_gsea_dn = gseGO(
gene = geneList_dn, # fold change value required
OrgDb = org.Hs.eg.db,
keyType="SYMBOL",
ont = "BP",
pAdjustMethod = "BH")
preparing geneSet collections... GSEA analysis... Warning message in fgseaMultilevel(...): “There were 3 pathways for which P-values were not calculated properly due to unbalanced (positive and negative) gene-level statistic values. For such pathways pval, padj, NES, log2err are set to NA. You can try to increase the value of the argument nPermSimple (for example set it nPermSimple = 10000)” leading edge analysis... done...
DEG_gsea_dn
# # Gene Set Enrichment Analysis # #...@organism Homo sapiens #...@setType BP #...@keytype SYMBOL #...@geneList Named num [1:3081] -3.67 -3.99 -4.11 -4.13 -4.3 ... - attr(*, "names")= chr [1:3081] "DMRTC1B" "SEC1" "FAM7A3" "C22orf26" ... #...nPerm #...pvalues adjusted by 'BH' with cutoff <0.05 #...6 enriched terms found 'data.frame': 6 obs. of 11 variables: $ ID : chr "GO:0003008" "GO:0009888" "GO:0003012" "GO:0006936" ... $ Description : chr "system process" "tissue development" "muscle system process" "muscle contraction" ... $ setSize : int 416 456 157 123 331 200 $ enrichmentScore: num -0.769 -0.76 -0.837 -0.861 -0.765 ... $ NES : num -1.29 -1.27 -1.4 -1.43 -1.28 ... $ pvalue : num 1.25e-08 4.62e-08 6.00e-07 6.77e-07 4.94e-06 ... $ p.adjust : num 3.62e-05 6.71e-05 4.92e-04 4.92e-04 2.87e-03 ... $ qvalues : num 3.41e-05 6.31e-05 4.63e-04 4.63e-04 2.70e-03 ... $ rank : num 345 464 130 130 412 432 $ leading_edge : chr "tags=21%, list=11%, signal=21%" "tags=29%, list=15%, signal=29%" "tags=28%, list=4%, signal=28%" "tags=22%, list=4%, signal=22%" ... $ core_enrichment: chr "AKAP13/EFEMP1/KCNMB1/FOSL1/PLN/P2RX1/JAM3/CYB5R3/MAP1A/CXCL12/ITPR1/LDLR/MGLL/ADARB1/CD34/HSPB7/DMD/ATP1A2/KCNM"| __truncated__ "STAT5B/PDCD4/TGM1/DCHS1/FZD7/CNFN/ANO6/LAMA2/SVEP1/LTBP3/STC1/PDLIM5/OVOL1/MYOCD/KDM6B/MEF2D/SBDS/DKK1/SPRY1/KL"| __truncated__ "DMD/ATP1A2/KCNMA1/ERRFI1/CASQ2/SRF/NR4A3/CALM1/DMPK/SORBS2/CRYAB/CACNA1H/KLF4/RGS2/PTGS2/PI16/SLMAP/CAV1/ANXA6/"| __truncated__ "RGS2/PTGS2/SLMAP/CAV1/ANXA6/ATP2B4/VCL/MYL6/SORBS1/SMTN/TLN1/HSPB6/PPP1R12B/LMOD1/ACTC1/CALD1/TPM2/MYLK/TPM1/CN"| __truncated__ ... #...Citation Guangchuang Yu, Li-Gen Wang, Yanyan Han and Qing-Yu He. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology 2012, 16(5):284-287
enrichplot::cnetplot(DEG_gsea_dn, foldChange = geneList_dn) +
scale_color_viridis_c() +
labs(color = 'Fold\nChange', title = 'Concept network of down-regulated DEGs') +
theme_void() +
theme(plot.title = element_text(face = 'bold', size = 24))
Scale for 'colour' is already present. Adding another scale for 'colour', which will replace the existing scale. Warning message: “ggrepel: 114 unlabeled data points (too many overlaps). Consider increasing max.overlaps”
as_tibble(DEG_gsea_dn@result)
ID | Description | setSize | enrichmentScore | NES | pvalue | p.adjust | qvalues | rank | leading_edge | core_enrichment |
---|---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <int> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <chr> | <chr> |
GO:0003008 | system process | 416 | -0.7691471 | -1.287183 | 1.246404e-08 | 3.622051e-05 | 3.407276e-05 | 345 | tags=21%, list=11%, signal=21% | AKAP13/EFEMP1/KCNMB1/FOSL1/PLN/P2RX1/JAM3/CYB5R3/MAP1A/CXCL12/ITPR1/LDLR/MGLL/ADARB1/CD34/HSPB7/DMD/ATP1A2/KCNMA1/SERPINF1/ERRFI1/CASQ2/ELN/FGFR1/SRF/NR4A3/CALM1/GAS6/ABL1/CCL2/SYNPO/DMPK/PJA2/SORBS2/CRYAB/NAMPT/LAMB2/CACNA1H/ABLIM1/HBEGF/KLF4/AKAP12/SGK1/FXYD6/MCAM/RGS2/PTGS2/PI16/LUM/SLMAP/CAV1/ANXA6/TIMP3/ATP2B4/EZR/VCL/AQP1/MYL6/ITGA5/CLDN4/LMNA/PTGS1/IGFBP5/SORBS1/SMTN/JUN/TLN1/HSPB6/PPP1R12B/LMOD1/ACTC1/BTG2/CALD1/AQP3/TPM2/MYLK/TPM1/GSN/CNN1/SYNM/ACTA2/MYL9/ACTG2/FOS/DES/FLNA/MYH11 |
GO:0009888 | tissue development | 456 | -0.7595425 | -1.271279 | 4.617575e-08 | 6.709337e-05 | 6.311496e-05 | 464 | tags=29%, list=15%, signal=29% | STAT5B/PDCD4/TGM1/DCHS1/FZD7/CNFN/ANO6/LAMA2/SVEP1/LTBP3/STC1/PDLIM5/OVOL1/MYOCD/KDM6B/MEF2D/SBDS/DKK1/SPRY1/KLF2/PITX1/FOXF1/BTG1/RAP1A/TCF21/IQGAP1/IL6/SFRP1/SERPINB5/AKAP13/PKD1/EFEMP1/PDGFRA/WLS/PLN/ZEB1/SLC25A25/CLEC3B/JPH2/APOLD1/FOSL2/ADARB1/CD34/IVL/TGFB1I1/ITGA7/TIPARP/TXNRD1/TGFBR2/ERRFI1/PIM1/ELN/TGFBR3/FGFR1/CFL2/SRF/NR4A3/ARID5B/GAS6/CORO1C/ABL1/PBX1/PPL/ADD1/CSTA/SORBS2/LAMC1/MAFF/LAMB2/DUSP2/MYC/HBEGF/KLF4/ILK/EPHA2/SPINK5/RGS2/PTGS2/PI16/LUM/DUSP5/CAV1/SFRP2/ANXA6/PTGIS/CD81/SIK1/TXNIP/MYADM/ANXA1/ZFP36L1/MGP/EZR/ETS2/KANK2/VCL/CLIC4/CDKN1A/MMP2/MYL6/HSPG2/RHCG/PGM5/SOCS3/SVIL/ITGA5/SPRR3/ATF3/LMNA/IGFBP5/JUN/RHOB/EMP1/JUNB/PALLD/ACTC1/BTG2/AQP3/KRT4/MYLK/TPM1/DCN/GSN/EGR1/ZFP36/DUSP1/CSRP1/ACTA2/TAGLN/ACTG2/KRT13/FOS/FLNA/MYH11 |
GO:0003012 | muscle system process | 157 | -0.8373947 | -1.398320 | 6.002587e-07 | 4.917579e-04 | 4.625983e-04 | 130 | tags=28%, list=4%, signal=28% | DMD/ATP1A2/KCNMA1/ERRFI1/CASQ2/SRF/NR4A3/CALM1/DMPK/SORBS2/CRYAB/CACNA1H/KLF4/RGS2/PTGS2/PI16/SLMAP/CAV1/ANXA6/ATP2B4/VCL/MYL6/LMNA/IGFBP5/SORBS1/SMTN/TLN1/HSPB6/PPP1R12B/LMOD1/ACTC1/CALD1/TPM2/MYLK/TPM1/GSN/CNN1/SYNM/ACTA2/MYL9/ACTG2/DES/FLNA/MYH11 |
GO:0006936 | muscle contraction | 123 | -0.8614637 | -1.434024 | 6.768862e-07 | 4.917579e-04 | 4.625983e-04 | 130 | tags=22%, list=4%, signal=22% | RGS2/PTGS2/SLMAP/CAV1/ANXA6/ATP2B4/VCL/MYL6/SORBS1/SMTN/TLN1/HSPB6/PPP1R12B/LMOD1/ACTC1/CALD1/TPM2/MYLK/TPM1/CNN1/SYNM/ACTA2/MYL9/ACTG2/DES/FLNA/MYH11 |
GO:0007010 | cytoskeleton organization | 331 | -0.7653192 | -1.280751 | 4.942689e-06 | 2.872691e-03 | 2.702350e-03 | 412 | tags=28%, list=13%, signal=27% | FAM107A/DIXDC1/WASF2/SPRY1/CLIP1/DPYSL2/SPTAN1/EHBP1L1/CDC42BPA/ENAH/IQGAP1/BIN1/SFRP1/NEXN/AKAP13/PKD1/PAM/PDGFRA/CLIP3/MYO1C/TUBB6/PARVA/TNFAIP3/JAM3/MAP1A/SYNE1/SPTBN1/CXCL12/DMD/EHD2/WEE1/RND3/KANK1/TACC2/MACF1/DST/CASQ2/ELN/CFL2/SRF/CALM1/CORO1C/ABL1/AMOTL1/CCL2/PPP1R12A/SYNPO/PPL/ADD1/PDLIM7/MAP4/SORBS2/ZYX/CRYAB/ABLIM1/CCDC69/ILK/TACC1/TUBA1A/LRP1/PDLIM3/MAP1B/ACTN1/MYADM/ANXA1/EZR/CLIC4/DPYSL3/AQP1/TNXB/PGM5/SVIL/LMNA/SORBS1/SMTN/TLN1/RHOB/DSTN/LMOD1/PALLD/ACTC1/KRT4/TPM2/TPM1/SYNPO2/GSN/CNN1/CSRP1/SYNM/KRT13/DES/FLNA/MYH11 |
GO:0061061 | muscle structure development | 200 | -0.7840662 | -1.309561 | 5.303902e-05 | 2.568856e-02 | 2.416532e-02 | 432 | tags=34%, list=14%, signal=32% | PDLIM5/MYOCD/KDM6B/MEF2D/DKK1/RBPMS2/PITX1/PLEKHO1/FOXF1/BTG1/TCF21/BIN1/AKAP13/SPEG/PDGFRA/EGR3/ZEB1/JPH2/SYNE1/ADARB1/ITGA7/DMD/EHD2/TGFBR2/PIM1/CASQ2/ELN/TGFBR3/FGFR1/CFL2/SRF/MBNL1/ARID5B/ABL1/DMPK/PDLIM7/SORBS2/MAFF/CRYAB/CACNA1H/HBEGF/ILK/RGS2/PI16/PDLIM3/CAV1/CD81/SIK1/ZFP36L1/MYL6/PGM5/SVIL/ATF3/LMNA/IGFBP5/SMTN/FHL1/LMOD1/ACTC1/BTG2/FLNC/MYLK/TPM1/DCN/EGR1/CSRP1/TAGLN/FOS/MYH11 |
term_of_interest = 'muscle'
idx = grep(DEG_gsea_dn@result$Description, pattern = term_of_interest, value = F, ignore.case = T)
print(idx)
[1] 3 4 6
gseaplot2(DEG_gsea_dn, geneSetID = idx)
ggpubr::ggarrange(
gseaplot(DEG_gsea_dn, geneSetID = 1, by = 'runningScore', title = DEG_gsea_dn@result$Description[1]) +
theme_classic(),
gseaplot(DEG_gsea_dn, geneSetID = 1, by = 'preranked') +
theme_classic(),
ncol =1, nrow = 2)
Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/.
Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. doi: 10.21105/joss.01686.
Alboukadel Kassambara (2020). ggpubr: 'ggplot2' Based Publication Ready Plots. R package version 0.4.0. https://CRAN.R-project.org/package=ggpubr
Wang et al., (2019). The UCSCXenaTools R package: a toolkit for accessing genomics data from UCSC Xena platform, from cancer multi-omics to single-cell RNA-seq. Journal of Open Source Software, 4(40), 1627, doi: 10.21105/joss.01627.
Robinson MD, McCarthy DJ, Smyth GK (2010). “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics, 26(1), 139-140. doi: 10.1093/bioinformatics/btp616.
Yu G, Wang L, Han Y, He Q (2012). “clusterProfiler: an R package for comparing biological themes among gene clusters.” OMICS: A Journal of Integrative Biology, 16(5), 284-287. doi: 10.1089/omi.2011.0118.
Carlson M (2019). org.Hs.eg.db: Genome wide annotation for Human. R package version 3.8.2.
Yu G (2022). enrichplot: Visualization of Functional Enrichment Result. R package version 1.14.2, https://yulab-smu.top/biomedical-knowledge-mining-book/.