Metabolomics is a vital area of systems biology that focuses on the analysis of small molecules in biological samples to understand physiological conditions and disease processes. This demonstration uses metabolomics data structured similarly to the Taiwan Biobank data and presents a practical analysis workflow using R and the MetaboAnalystR package1, including data preprocessing, normalization, and statistical analyses.
To complete this demonstration, prepare the following software and steps:
metanr_packages <- function() {
metr_pkgs <- c("impute", "pcaMethods", "globaltest", "GlobalAncova", "Rgraphviz",
"preprocessCore", "genefilter", "sva", "limma", "KEGGgraph", "siggenes",
"BiocParallel", "MSnbase", "multtest", "RBGL", "edgeR", "fgsea", "devtools",
"crmn", "httr", "qs")
list_installed <- installed.packages()
new_pkgs <- subset(metr_pkgs, !(metr_pkgs %in% list_installed[, "Package"]))
if (length(new_pkgs) != 0) {
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(new_pkgs)
print(c(new_pkgs, " packages added..."))
}
if ((length(new_pkgs) < 1)) {
print("No new packages added...")
}
}
metanr_packages()Packages under R need to be loaded before they can be used in the analysis. Use the following snippet to load the necessary packages:
library(data.table) # package for importing data
library(tidyverse) # package for merging and collating data
library(MetaboAnalystR) # package for analyzing metabolomic data
library(qs) # package for importing and exporting data quickly
library(waldo) # package for comparing data frames
setwd("C:/Users/Admin/Desktop/twb_mtb_demo") # set working directory
rm(list = ls()) # clear all existing objects in the environmentThe raw metabolomic data must be reformatted to ensure compatibility with the MetaboAnalystR package. The first column should contain sample IDs, the second column should specify phenotype or group labels, and all remaining columns should represent metabolite values. To prevent distortion in downstream analyses, zero values—which indicate concentrations below the limit of detection (LOD)—should be converted to NA.
fread("demo_mtb_raw_20250612.csv") %>%
mutate(across(4:ncol(.), ~ ifelse(. == 0, NA, .))) %>%
select(MTBB_ID, Phenotype:`K-EDTA`) %>%
arrange(Phenotype) %>%
fwrite(., file = "demo_mtb_20250612.csv", sep = ",", quote = FALSE, col.names = TRUE)Then, initialize the MetaboAnalystR data object (referred to as mSet) and import the formatted metabolic dataset.
mSet <- InitDataObjects(data.type = "conc", anal.type = "stat", paired = FALSE)
# data.type options:
# - "conc": compound concentration data
# - "specbin": binned spectral data
# - "pktable": peak intensity table
# - others may also be available
# anal.type (the type of analysis to perform) options:
# - "stat": statistical analysis
# - "pathora": pathway over-representation analysis
# - "pathqea": pathway quantitative enrichment analysis
# - "msetora": metabolite set over-representation analysis
# - other modules may also be available## Starting Rserve...
## "C:\Users\jsc1101.OA\AppData\Local\Programs\R\R-43~1.3\library\Rserve\libs\x64\Rserve.exe" --no-save
## [1] "MetaboAnalyst R objects initialized ..."
mSet <- Read.TextData(mSet, "demo_mtb_20250612.csv", format = "rowu", lbl.type = "disc")
# format options:
# - "rowp": samples are paired and in rows
# - "rowu": samples are unpaired and in rows
# - "colp": samples are paired and in columns
# - "colu": samples are unpaired and in columns
# lbl.type options:
# - "disc": categorical labels
# - "cont": continuous labels
# output:
# data_orig.qs
# raw_dataview.csvMetaboAnalystR automatically removes columns that contain only missing values. A comparison between the original dataset and the imported data reveals these changes—specifically, the columns Coline and Ca-EDTA are omitted due to being entirely missing and the MTBB_ID and Phenotype columns are stored separately within the mSet object.
## MTBB_ID Phenotype Ethanol Glutamic acid Glutamine Glycerol
## <char> <int> <num> <num> <num> <num>
## 1: MTBB_0048 0 0.02 0.07 0.02 0.08
## 2: MTBB_0049 0 0.18 0.26 0.34 0.75
## 3: MTBB_0050 0 0.04 0.01 0.08 0.08
## 4: MTBB_0051 0 0.09 0.05 0.01 0.08
## 5: MTBB_0052 0 0.01 0.02 0.03 NA
## 6: MTBB_0053 0 0.10 0.04 0.11 0.16
## Classes ‘data.table’ and 'data.frame': 400 obs. of 43 variables:
## $ MTBB_ID : character MTBB_0048, MTBB_0049, MTBB_0050, MTBB_0051, MTBB_0052, MTBB_0053, ...
## $ Phenotype : integer 0, 0, 0, 0, 0, 0, ...
## $ Ethanol : numeric 0.02, 0.18, 0.04, 0.09, 0.01, 0.1, ...
## $ Glutamic acid : numeric 0.07, 0.26, 0.01, 0.05, 0.02, 0.04, ...
## $ Glutamine : numeric 0.02, 0.34, 0.08, 0.01, 0.03, 0.11, ...
## $ Glycerol : numeric 0.08, 0.75, 0.08, 0.08, NA, 0.16, ...
## $ Tyrosine : numeric 0.02, 0.29, 0.04, 0.01, 0.01, 0.63, ...
## $ D-Galactose : numeric 0.07, 0.23, 0.07, 0.03, 0.02, 0.23, ...
## $ Alanine : numeric 0.07, 0.69, 0.11, 0.12, 0.03, 0.63, ...
## Ethanol Glutamic acid Glutamine Glycerol Tyrosine D-Galactose
## MTBB_0048 0.02 0.07 0.02 0.08 0.02 0.07
## MTBB_0049 0.18 0.26 0.34 0.75 0.29 0.23
## MTBB_0050 0.04 0.01 0.08 0.08 0.04 0.07
## MTBB_0051 0.09 0.05 0.01 0.08 0.01 0.03
## MTBB_0052 0.01 0.02 0.03 NA 0.01 0.02
## MTBB_0053 0.10 0.04 0.11 0.16 0.63 0.23
## Classes ‘data.table’ and 'data.frame': 400 obs. of 39 variables:
## $ Ethanol : numeric 0.02, 0.18, 0.04, 0.09, 0.01, 0.1, 0.03, 0.04, 0.02, ...
## $ Glutamic acid : numeric 0.07, 0.26, 0.01, 0.05, 0.02, 0.04, 0.28, 0.03, 0.02, ...
## $ Glutamine : numeric 0.02, 0.34, 0.08, 0.01, 0.03, 0.11, 0.41, 0.01, NA, ...
## $ Glycerol : numeric 0.08, 0.75, 0.08, 0.08, NA, 0.16, 0.84, 0.06, 0.06, ...
## $ Tyrosine : numeric 0.02, 0.29, 0.04, 0.01, 0.01, 0.63, 0.21, 0.03, 0.02, ...
## $ D-Galactose : numeric 0.07, 0.23, 0.07, 0.03, 0.02, 0.23, 0.18, 0.01, 0.09, ...
## $ Alanine : numeric 0.07, 0.69, 0.11, 0.12, 0.03, 0.63, 0.61, 0.03, NA, ...
## $ Sarcosine : numeric 0.02, 0.43, 0.04, 0.05, 0.08, 0.27, 0.2, 0.01, NA, ...
## $ Threonine : numeric NA, 0.3, NA, 0.07, 0.07, 0.17, 0.2, 0.04, NA, ...
## [1] "MTBB_ID" "Phenotype" "Choline" "Ca-EDTA"
Next, the data undergo integrity checks to ensure that all values are valid and numeric, and to assess the extent of missing data. During this process, the variable K-EDTA is automatically excluded, as it contains three or fewer valid (non-missing) values across all samples and thus does not contribute meaningful variance to the analysis.
## [1] "Successfully passed sanity check!"
## [2] "Samples are not paired."
## [3] "2 groups were detected in samples."
## [4] "Only English letters, numbers, underscore, hyphen and forward slash (/) are allowed."
## [5] "<font color=\"orange\">Other special characters or punctuations (if any) will be stripped off.</font>"
## [6] "All data values are numeric."
## [7] "<font color=\"red\"> 1 features with a constant or single value across samples were found and deleted.</font>"
## [8] "A total of 1249 (8.2%) missing values were detected."
## [9] "<u>By default, missing values will be replaced by 1/5 of min positive values of their corresponding variables</u>"
## [10] "Click the <b>Proceed</b> button if you accept the default practice;"
## [11] "Or click the <b>Missing Values</b> button to use other methods."
## `old` is length 39
## `new` is length 38
##
## `names(old)[36:39]`: "Pyruvic acid" "Glucose" "Dimethylsulfone" "K-EDTA"
## `names(new)[36:38]`: "Pyruvic acid" "Glucose" "Dimethylsulfone"
##
## `old$K-EDTA` is a double vector (NA, NA, NA, NA, NA, ...)
## `new$K-EDTA` is absent
Before conducting statistical analysis, it is essential to preprocess the raw metabolomic data to enhance data quality, reduce technical noise, and ensure comparability across samples and metabolites. This preprocessing step serves several key purposes:
The following section outlines the preprocessing steps required to prepare the dataset for statistical analysis. First, metabolites with more than 70% missing values are removed to minimize bias. Then, the remaining missing values are imputed using the K-nearest neighbors (KNN) method to retain as much informative data as possible.
# remove features with too many missing values
mSet <- RemoveMissingPercent(mSet, percent = 0.7)
# output: preproc.qs (renewed)
# estimate the remaining missing values
mSet <- ImputeMissingVar(mSet, method = "knn_smp")
# mSet <- ReplaceMin(mSet)
# output: data_proc.qs## [1] "0 variables were removed for threshold 70 percent."
## [2] "Missing variables were imputated using KNN-SMP"
Finally, transformation and scaling are applied to enhance the comparability of metabolite levels.
# prepare data for normalization
mSet <- PreparePrenormData(mSet)
# output: prenorm.qs
# normalize data
mSet <- Normalization(mSet, rowNorm = "NULL", transNorm = "LogNorm", scaleNorm = "AutoNorm")
# output:
# row_norm.qs (after row normalization)
# complete_norm.qs# check the normalization methods
mSet$dataSet$rownorm.method
mSet$dataSet$trans.method
mSet$dataSet$scale.method## [1] "N/A"
## [1] "Log10 Normalization"
## [1] "Autoscaling"
## Ethanol Glutamic acid Glutamine Glycerol Tyrosine
## MTBB_0048 -1.1387754 0.2356533 -1.3647775 -0.60111417 -1.4742597
## MTBB_0049 0.9402735 1.3348727 1.0819523 1.49399832 0.9778101
## MTBB_0050 -0.4831675 -1.3924520 -0.1678326 -0.60111417 -0.8389577
## MTBB_0051 0.2842443 -0.0461789 -1.9618907 -0.60111417 -2.1082794
## MTBB_0052 -1.7930600 -0.8133446 -1.0148452 -0.30301074 -2.1082794
## MTBB_0053 0.3839606 -0.2330660 0.1072249 0.04774856 1.6893814
## D-Galactose
## MTBB_0048 -0.4508742
## MTBB_0049 0.7117588
## MTBB_0050 -0.4508742
## MTBB_0051 -1.2787865
## MTBB_0052 -1.6747420
## MTBB_0053 0.7117588
We can also visualize the distributions of metabolites before and after preprocessing using histograms. These plots provide a clear view of how the data have been transformed, allowing us to assess the effectiveness of the preprocessing steps. Prior to preprocessing, most metabolites exhibit right-skewed distributions. After preprocessing, the distributions become more symmetric and closer to normal, indicating improved data quality and enhanced comparability.
qread("preproc.qs") %>% select(1:10) %>%
pivot_longer(cols = everything(), names_to = "metabolites", values_to = "value") %>%
ggplot(aes(x = value)) +
geom_histogram(fill = "skyblue", color = "white") +
facet_wrap(~ metabolites, scales = "free", ncol = 5) +
theme_minimal() +
labs(title = "Histograms of Metabolites", x = "", y = "")qread("complete_norm.qs") %>% select(1:10) %>%
pivot_longer(cols = everything(), names_to = "metabolites", values_to = "value") %>%
ggplot(aes(x = value)) +
geom_histogram(fill = "skyblue", color = "white") +
facet_wrap(~ metabolites, scales = "free", ncol = 5) +
theme_minimal() +
labs(title = "Histograms of Metabolites", x = "", y = "")If needed, the tidy dataset can be exported to a CSV file for future reference or reuse in subsequent analyses.
With the dataset prepared, we proceed to statistical analysis aimed at identifying meaningful patterns associated with the phenotype of interest. In metabolomics, the majority of metabolite variation is often driven by inherent biological variability, while only a subset reflects true associations with the phenotype under study. Distinguishing these informative metabolites is essential for biomarker discovery and for advancing our understanding of the underlying biological mechanisms. This section introduces statistical approaches for comparing two groups, starting with univariate methods such as fold-change analysis and t-tests. These methods are widely used for their simplicity and interpretability in identifying significantly different features. To aid in interpretation, volcano plots are used to combine fold-change and statistical significance into a single, intuitive visualization.
While univariate methods are effective for detecting individually significant metabolites, they ignore correlations between variables. To complement this, multivariate approaches like Orthogonal Partial Least Squares–Discriminant Analysis (OPLS-DA) consider all metabolites simultaneously, making them well-suited for high-dimensional data. These methods are useful for uncovering coordinated changes that distinguish between groups. By combining univariate and multivariate analyses, we can obtain a more comprehensive understanding of metabolic alterations.
Fold-change analysis is a simple yet effective method for identifying metabolites with substantial differences between two groups. By comparing the raw average concentrations, it highlights features with notable increases or decreases, serving as an initial filter for biologically meaningful changes. In the example below, the fold-change threshold is set to 2.0, meaning only metabolites with at least a two-fold increase or a 0.5-fold decrease are considered significant. The results are saved to fold_change.csv, which lists the significantly altered metabolites for further analysis.
mSet <- FC.Anal(mSet, fc.thresh = 2.0, cmp.type = 1)
# output: fold_change.csv
# cmp.type: comparison type, 0 for group 1 minus group 2, and 1 for group 2 minus group 1
# levels(mSet$dataSet$cls)
# [1] "0" "1"## Fold Change log2(FC)
## Citric acid 3.7201 1.8954
## Leucine 3.2533 1.7019
## Creatinine 3.0078 1.5887
## Valine 2.6226 1.3910
## Histidine 2.5297 1.3390
## Pyruvic acid 2.4280 1.2797
# Create a data frame that displays group means and fold-change values
qread("row_norm.qs") %>%
mutate(Phenotype = mSet$dataSet$cls) %>%
group_by(Phenotype) %>%
summarise(across(everything(), mean, na.rm = TRUE)) %>%
pivot_longer(-Phenotype, names_to = "Metabolites", values_to = "Mean") %>%
pivot_wider(names_from = Phenotype, values_from = Mean) %>%
bind_cols(data.frame(log2_fc = as.vector(mSet$analSet$fc$fc.log))) %>%
arrange(desc(abs(log2_fc)))## # A tibble: 38 × 4
## Metabolites `0` `1` log2_fc
## <chr> <dbl> <dbl> <dbl>
## 1 Citric acid 0.0227 0.0845 1.90
## 2 Leucine 0.0293 0.0954 1.70
## 3 Creatinine 0.0512 0.154 1.59
## 4 Valine 0.0564 0.148 1.39
## 5 Histidine 0.0916 0.232 1.34
## 6 Pyruvic acid 0.0280 0.0680 1.28
## 7 D-Galactose 0.128 0.298 1.22
## 8 Lysine 0.121 0.281 1.21
## 9 Glutamic acid 0.0836 0.191 1.19
## 10 Threonine 0.128 0.292 1.19
## # ℹ 28 more rows
T-tests are commonly used to identify metabolites with significantly different mean concentrations between two groups. In this example, each metabolite is tested individually, with statistical significance evaluated using a p-value threshold of 0.05, adjusted by the false discovery rate (FDR) to correct for multiple testing. The results are exported into two CSV files: t_test.csv, which contains only the significant metabolites along with their t-statistics, p-values, –log10(p), and FDR-adjusted p-values, and t_test_all.csv, which provides the complete set of results for all tested metabolites.
mSet <- Ttests.Anal(mSet, threshp = 0.05, pvalType = "fdr", all_results = TRUE)
# output:
# t_test.csv
# t_test_all.csv## [1] "Performing regular t-tests ...."
## [1] "A total of 35 significant features were found."
## t.stat p.value -log10(p) FDR
## Creatinine -8.6901 9.5448e-17 16.020 3.6270e-15
## Leucine -7.8393 4.1890e-14 13.378 7.9590e-13
## Glycerol -7.5003 4.1878e-13 12.378 5.3046e-12
## Glutamic acid -7.4491 5.8916e-13 12.230 5.3725e-12
## D-Galactose -7.4217 7.0691e-13 12.151 5.3725e-12
## N_N-Dimethylglycine -7.2986 1.5917e-12 11.798 1.0081e-11
Volcano plots provide an intuitive visual summary of the results from univariate analyses by combining fold-change and statistical significance into a single plot. Each point represents a metabolite, with the x-axis showing the log2 fold change and the y-axis showing the –log10 adjusted p-value. This allows for quick identification of metabolites that are both significantly and substantially different between groups. A fold-change threshold of 2.0 and an FDR-adjusted p-value threshold of 0.05 are used in the example. The results are saved in two files: volcano_all.mat, which includes all metabolites along with their fold-change and p-value information, and volcano.csv, which lists only the significant ones. A labeled volcano plot is also generated for visual interpretation.
mSet <- Volcano.Anal(mSet, fcthresh = 2.0, cmpType = 1, threshp = 0.05, pval.type = "fdr")
# output:
# volcano.csv
# volcano_all.mat## [1] "A total of 35 significant features were found."
mSet <- PlotVolcano(mSet, imgName = "volcano_", plotLbl = 1, labelNum = 15, plotTheme = 0, dpi = 300)
# output: volcano_dpi300.pngOrthogonal Partial Least Squares Discriminant Analysis (OPLS-DA) is a supervised multivariate technique designed to improve model interpretability by separating phenotype-associated variation from orthogonal variation unrelated to the phenotype. This approach enhances the clarity of group discrimination in complex metabolomic datasets.
In this analysis, sample projections onto the OPLS-DA components are saved in oplsda_score.csv, and the corresponding metabolite loadings are recorded in oplsda_loadings.csv. The contribution of each variable to phenotypic group separation is evaluated using Variable Importance in Projection (VIP) scores, with results stored in oplsda_vip.csv. VIP scores reflect the importance of each variable based on its contribution to the predictive (phenotype-associated) components of the model. Higher VIP scores indicate greater influence in distinguishing groups, with scores above 1 generally considered significant. A score plot showing the distribution of samples along the predictive and orthogonal (phenotype-unrelated) components is saved as opls_score_dpi300.png, while a VIP plot highlighting the most influential metabolites is provided in opls_imp_dpi300.png.
# Create the OPLS-DA score plot
mSet <- PlotOPLS2DScore(mSet, imgName = "opls_score_", dpi = 300, show = 0)
# output: opls_score_dpi300.png
# Create a plot of feature importance rankings
mSet <- PlotOPLS.Imp(mSet, imgName = "opls_imp_", feat.num = 20, dpi = 300)
# output:
# opls_imp_dpi300.png
# oplsda_vip.csvTo better identify metabolites that are truly relevant to the phenotype, we can integrate the results from both univariate and multivariate analyses. By focusing on metabolites that are not only significant in the univariate tests but also have high influence in the multivariate model (VIP scores above 1), we prioritize metabolites that are robust across different analytical perspectives. This combined approach helps to highlight metabolites with more strong evidence of association, improving the reliability of downstream interpretation and potential biomarker selection.
vip_mat <- mSet$analSet$oplsda$vip.mat
t_mat <- mSet$analSet$tt$sig.mat
common_metabolites <- intersect(rownames(t_mat), rownames(vip_mat[vip_mat[, 1] > 1, ]))
data.frame(
P_value = t_mat[common_metabolites, "FDR"],
VIP_score = vip_mat[common_metabolites, 1]) %>%
rownames_to_column("Metabolites") %>% print()## Metabolites P_value VIP_score
## 1 Creatinine 3.6270e-15 1.326049
## 2 Leucine 7.9590e-13 1.635045
## 3 Glycerol 5.3046e-12 1.195995
## 4 Glutamic acid 5.3725e-12 1.432961
## 5 D-Galactose 5.3725e-12 1.170988
## 6 Citric acid 1.4611e-11 1.474318
## 7 Lysine 3.4441e-11 1.250865
## 8 Valine 3.4795e-11 1.409012
## 9 Threonine 3.6089e-10 1.197866
## 10 Histidine 8.7364e-10 1.402364
## 11 2-Oxoglutaric acid 1.3353e-09 1.155958
## 12 Methionine 4.0324e-09 1.024290
## 13 Sarcosine 2.1264e-07 1.059835
## 14 Ornithine 3.6597e-05 1.297626
## 15 Pyruvic acid 9.6113e-04 1.083273
Beyond univariate and supervised analyses, some other methods are intended to demonstrate the broader potential of metabolomic data analysis. We explore unsupervised techniques including hierarchical heatmaps, principal component analysis (PCA), and K-means clustering in the following. These approaches can help uncover hidden patterns, visualize complex sample relationships, or generate new biological hypotheses from metabolomic profiles.
Heatmaps offer a visual summary of the expression patterns of multiple metabolites across samples. They are particularly useful for identifying clusters of metabolites with similar profiles or detecting outlier samples. The resulting image is saved as heatmap_dpi300.png, and the processed data used to generate the heatmap is exported to data_prefilter_pearson.csv.
mSet <- PlotSubHeatMap(mSet, imgName = "heatmap_",
dataOpt = "norm", scaleOpt = "none", smplDist = "euclidean", clstDist = "ward.D2", method.nm = "tanova",
palette = "bwr", fzCol = 1, fzRow = 8, dpi = 300, top.num = 25, viewOpt = "overview", download = TRUE)
# output:
# heatmap_dpi300.png
# dataOpt options: "norm", "raw"
# scaleOpt options: "none", "row", "col"
# smplDist options: "euclidean", "correlation", "minkowski"
# clstDist options: "ward.D2", "average", "complete", "single"
# method.nm options: "tanova", "vip", "rf"
# viewOpt options: "overview", "detail"PCA is an unsupervised dimensionality reduction technique that projects high-dimensional data into a low-dimensional space, allowing for visualization of major sources of variance. It helps assess sample clustering, detect outliers, and identify major trends in the data. After performing PCA analysis, the sample scores and metabolite loadings are saved in pca_score.csv and pca_loadings.csv files. The PlotPCAPairSummary function creates pairwise plots of the top principal components, while the PlotPCABiplot provides an integrated view of both sample distribution and top contributing metabolites along two selected components.
mSet <- PlotPCAPairSummary(mSet, "pca_pair_", dpi = 300, pc.num = 5)
mSet <- PlotPCABiplot(mSet, "pca_biplot_", dpi = 300, inx1 = 1, inx2 = 2, topnum = 10)
# output:
# pca_pair_dpi300.png
# pca_biplot_dpi300.pngK-means clustering is a widely used unsupervised learning algorithm that partitions samples into a specified number of clusters. This method groups samples with similar metabolite profiles together, potentially revealing underlying biological subtypes or patterns. The resulting cluster assignments can be examined through a PCA projection plot (km_pca_dpi300.png) for intuitive visualization. Cluster labels can also be extracted for further analysis.
## rowname mSet$analSet$kmeans$cluster
## 1 MTBB_0048 2
## 2 MTBB_0049 1
## 3 MTBB_0050 2
## 4 MTBB_0051 2
## 5 MTBB_0052 2
## 6 MTBB_0053 1
This demonstration presents a preliminary workflow for metabolomics data analysis. Beginning with essential preprocessing steps—such as handling missing values and applying normalization—it proceeds through basic statistical inference using fold-change analysis and t-tests, followed by visualization tools such as volcano plots and supervised modeling with OPLS-DA. In the final section, we introduce complementary unsupervised methods including heatmaps, PCA, and K-means clustering, which help reveal broader data structures and intrinsic patterns beyond hypothesis-driven approaches. Overall, this example serves as an initial exploratory framework designed to lay the groundwork for more detailed, hypothesis-driven, or integrative downstream analyses.