Statistical Analysis

Zhiqiang Pang, Jasmine Chong, Jeff Xia

2023-07-24

1. Introduction

MetaboAnalystR, in parallel with the webserver, provides a comprehensive suite of statistical analyses to perform on a user-uploaded data set. While researchers may have several goals for their metabolomics data, the ultimate goal is to identify any significant metabolite/s indicitive of a disease state, drugs, diet, environment, geographical location, etc.

The standard workflow for statistical analysis is as follows:

Processed metabolomic data -> Univariate analysis -> Multivariate analysis -> Biological interpretation

Univariate tests differ from multivariate tests in that they assess the importance of each variable seperately. Meanwhile, multivariate tests assess two or more variables at once, while also taking into consideration the relationship between the variables. Finally, biological interpretation provides a biological context for the significant metabolites identified using univariate/multivariate methods. Below, we will discuss the various statistical methods in greater detail. For the tutorial, we will be using a dataset consisting of concentrations of 77 urine samples from cancer patients (cachexic vs. control) measured by 1H NMR - Eisner et al. 2010.

2. Univariate Methods

To begin the SA module, we will start with identifying important features from the data set using various univariate tests, including classical methods such as the Student’s t-test and ANOVA, as well as other methods such as the volcano plot and correlation analysis.

# Load MetaboAnalystR
library(MetaboAnalystR)

# Clean global environment
rm(list = ls())
library(MetaboAnalystR)
mSet<-InitDataObjects("conc", "stat", FALSE);
## Starting Rserve:
##  /opt/R/4.2.2/lib/R/bin/R CMD /opt/R/4.2.2/lib/R/library/Rserve/libs//Rserve --no-save 
## 
## [1] "MetaboAnalyst R objects initialized ..."
mSet<-Read.TextData(mSet, "https://rest.xialab.ca/api/download/metaboanalyst/human_cachexia.csv", "rowu", "disc");
mSet<-SanityCheckData(mSet);
##  [1] "Successfully passed sanity check!"                                                                                
##  [2] "Samples are not paired."                                                                                          
##  [3] "2 groups were detected in samples."                                                                               
##  [4] "Only English letters, numbers, underscore, hyphen and forward slash (/) are allowed."                             
##  [5] "<font color=\"orange\">Other special characters or punctuations (if any) will be stripped off.</font>"            
##  [6] "All data values are numeric."                                                                                     
##  [7] "A total of 0 (0%) missing values were detected."                                                                  
##  [8] "<u>By default, missing values will be replaced by 1/5 of min positive values of their corresponding variables</u>"
##  [9] "Click the <b>Proceed</b> button if you accept the default practice;"                                              
## [10] "Or click the <b>Missing Values</b> button to use other methods."
mSet<-ReplaceMin(mSet);
mSet<-PreparePrenormData(mSet);
mSet<-Normalization(mSet, "NULL", "LogNorm", "MeanCenter", "S10T0", ratio=FALSE, ratioNum=20);
## [1] 77 63
mSet<-PlotNormSummary(mSet, "norm_0_", format ="png", dpi=72, width=NA);
mSet<-PlotSampleNormSummary(mSet, "snorm_0_", format = "png", dpi=72, width=NA);
Figure 1. Normalization results.

Figure 1. Normalization results.

2.1 Fold-change analysis

The goal of fold-change (FC) analysis is to compare the absolute value of change between two group means. Since column-wise normalization (i.e. log transformation, mean-centering) will significantly alter absolute values, FC is calculated as the ratio between two group means using the data before column-wise normalization was applied.

For paired analysis, the program first counts the number of pairs with consistent change above the given FC threshold. If this number exceeds a given count threshold, the variable will be reported as significant.

# Perform fold-change analysis on uploaded data, unpaired
mSet<-FC.Anal(mSet, 2.0, 0, FALSE)

# Plot fold-change analysis
mSet<-PlotFC(mSet, "fc_0_", "png", 72, width=NA)

# To view fold-change 
mSet$analSet$fc$fc.log
## 1,6-Anhydro-beta-D-glucose       1-Methylnicotinamide 
##                   0.888690                  -0.052019 
##            2-Aminobutyrate       2-Hydroxyisobutyrate 
##                   1.312700                   0.633520 
##             2-Oxoglutarate         3-Aminoisobutyrate 
##                   1.098400                   1.329100 
##          3-Hydroxybutyrate       3-Hydroxyisovalerate 
##                   1.563700                   1.164900 
##           3-Indoxylsulfate     4-Hydroxyphenylacetate 
##                   0.857170                   0.263810 
##                    Acetate                    Acetone 
##                   1.266100                   0.664560 
##                    Adipate                    Alanine 
##                   1.952900                   1.141300 
##                 Asparagine                    Betaine 
##                   0.852650                   1.004000 
##                  Carnitine                    Citrate 
##                   0.994090                   0.883620 
##                   Creatine                 Creatinine 
##                   1.763900                   0.932160 
##              Dimethylamine               Ethanolamine 
##                   1.120000                   0.729170 
##                    Formate                     Fucose 
##                   1.150600                   0.918770 
##                   Fumarate                    Glucose 
##                   1.262700                   2.553000 
##                  Glutamine                    Glycine 
##                   1.166100                   0.869890 
##                  Glycolate             Guanidoacetate 
##                   0.657790                   0.506100 
##                  Hippurate                  Histidine 
##                   1.075800                   1.013100 
##               Hypoxanthine                 Isoleucine 
##                   0.375470                   0.420550 
##                    Lactate                    Leucine 
##                   1.726900                   1.205400 
##                     Lysine                Methylamine 
##                   0.442780                   0.901220 
##            Methylguanidine        N,N-Dimethylglycine 
##                   0.517770                   1.342900 
##          O-Acetylcarnitine               Pantothenate 
##                   1.270300                  -0.397700 
##              Pyroglutamate                   Pyruvate 
##                   1.180400                   1.096200 
##                Quinolinate                     Serine 
##                   1.090600                   1.007700 
##                  Succinate                    Sucrose 
##                   1.416200                   1.432600 
##                   Tartrate                    Taurine 
##                   0.720000                   1.032700 
##                  Threonine               Trigonelline 
##                   0.990220                   1.460400 
##     Trimethylamine N-oxide                 Tryptophan 
##                   1.077700                   0.967880 
##                   Tyrosine                     Uracil 
##                   0.953700                   0.207270 
##                     Valine                     Xylose 
##                   1.178900                   1.194000 
##              cis-Aconitate               myo-Inositol 
##                   1.589400                   1.537500 
##            trans-Aconitate         pi-Methylhistidine 
##                   0.811730                   0.771640 
##        tau-Methylhistidine 
##                   0.708800
Figure 2. Fold change analysis results.

Figure 2. Fold change analysis results.

2.2 T-Test

MetaboAnalystR supports various options for performing T-test analysis. Users can select the analysis type (paired), the group variance (equal.var), whether the test is parametric or non-parametric (nonpar), and the adjusted p-value (FDR) cut-off (threshp).

Note, for a large data set (> 1000 variables), both the paired information and the group variance will be ignored, and the default parameters will be used for t-tests to save computational time. If you choose non-parametric tests (Wilcoxon rank-sum test), the group variance will be ignored.

# Perform T-test (parametric)
mSet<-Ttests.Anal(mSet, nonpar=F, threshp=0.05, paired=FALSE, equal.var=TRUE, "fdr", FALSE)
## Loading required package: memoise
## [1] "Performing regular t-tests ...."
## [1] "A total of 53 significant features were found."
# Plot of the T-test results
mSet<-PlotTT(mSet, imgName = "tt_0_", format = "png", dpi = 72, width=NA)
Figure 3. T test analysis results.

Figure 3. T test analysis results.

The t test results for MS1 feature table can be used for function analysis.

2.3 Volcano Plot

The volcano plot is a combination of fold change and t-test values. Note, for unpaired samples, the x-axis is log2(FC). For paired analysis, the x-axis is number of significant counts. The y-axis is -log10(p.value) for both cases, and can be based on raw or FDR adjusted p values from the t-tests. In Volcano.Anal, users can specify if the data are paired, the FC threshold, the comparison type, the signigicant count threshold if data are paired, whether the test is parametric or non-parametric, the p-value threshold, and the group variance.

# Perform the volcano analysis
mSet<-Volcano.Anal(mSet, FALSE, 2.0, 0, F, 0.1, TRUE, "raw")
## [1] "A total of 57 significant features were found."
# Create the volcano plot
mSet<-PlotVolcano(mSet, "volcano_0_", 1, 0, format ="png", dpi=72, width=NA)
## Loading required package: ggplot2
## Warning: Removed 30 rows containing missing values (`geom_text_repel()`).
Figure 4. Results of volcano analysis.

Figure 4. Results of volcano analysis.

2.4 One-way Analysis of Variance (ANOVA) - ONLY FOR MULTI-GROUP ANALYSIS!

ANOVA can only be computed on data with more than one group. Please note that the below example uses data from a different example dataset, which contains more than 2 groups.

# Perform ANOVA
mSet <- ANOVA.Anal(mSet, F, 0.05, "fisher")

# Plot ANOVA
mSet <- PlotANOVA(mSet, "aov_0_", "png", 72, width=NA)

2.5 Correlation Analysis

To perform correlation analysis to evaluate the correlation between all features or samples, use PlotCorrHeatMap. Here, users must specify the mSet object, the dimensions to be correlated, the name of the heatmap that will be created, the output, dpi, and width of the image, the distance measure (Pearson, Spearman, or Kendall), to view the heatmap as an overview or detailed view, to fix the colour distribution, the colors, and whether or not to perform clustering. Please refer to the user manual for more details.

Note, the heatmap will only show correlations for a maximum of 1000 features. For larger datasets, only top 1000 features will be selected based on their interquantile range (IQR). When color distribution is fixed, you can potentially compare the correlation patterns among different data sets. In this case, you can choose “do not perform clustering” for all data set, or only to perform clustering on a single reference data set, then manually re-arranged other data sets according to the clustering pattern of the reference data set.

### OPTION 1 - Heatmap specifying pearson distance and an overview
mSet<-PlotCorrHeatMap(mSet, "corr_0_", "png", 72, width=NA, "col", "pearson", "bwm", "overview", F, F, 0.0)
Figure 5. Results of heatmap analysis.

Figure 5. Results of heatmap analysis.

The following command shows the detailed view (zoomed-in) of the above heatmap.

### OPTION 2 - Heatmap specifying pearson correlation and a detailed view
mSet<-PlotCorrHeatMap(mSet, "corr_1_", format = "png", dpi=72, width=NA, "col", "spearman", "bwm", "detail", F, F, 999)

2.6 Pattern Searching

Correlation analysis can be performed either against a given feature or against a given pattern. The pattern is specified as a series of numbers separated by “-”. Each number corresponds to the expected expression pattern in the corresponding group. For example, a 1-2-3-4 pattern is used to search for features that increase linearly with time in a time-series data with four time points (or four groups). The order of the groups is given as the first item in the predefined patterns. Indicate the mSet object, the distance measure (Pearson, Spearman, or Kendall), and the pattern to use.

# Perform correlation analysis on a pattern (a feature of interest in this case)
mSet<-FeatureCorrelation(mSet, "pearson", "1,6-Anhydro-beta-D-glucose")

# Plot the correlation analysis on a pattern
mSet<-PlotCorr(mSet, "ptn_3_", format="png", dpi=72, width=NA)
Figure 6. Results of pattern search.

Figure 6. Results of pattern search.

2.7 Principal Component Analysis (PCA)

To perform PCA, first use PCA.Anal. MetaboAnalystR has the options to create a scree plot (PlotPCAScree), score plot (PlotPCA2DScore or PlotPCA3DScore), loadings plot (PlotPCALoading), and biplot (PlotPCABiplot). For the score plots, users will create both a static as well as interactive 3D score plot. To view the interactive plot, type “mSet$imgSet$pca.3d” into your R console. Please refer to the user manual for details about each function.

# Perform PCA analysis
mSet<-PCA.Anal(mSet)

# Create PCA overview
mSet<-PlotPCAPairSummary(mSet, "pca_pair_0_", format = "png", dpi = 72, width=NA, 5)

# Create PCA scree plot
mSet<-PlotPCAScree(mSet, "pca_scree_0_", "png", dpi = 72, width=NA, 5)

# Create a 2D PCA score plot
mSet<-PlotPCA2DScore(mSet, "pca_score2d_0_", format = "png", dpi=72, width=NA, 1, 2, 0.95, 1, 0)

# Create a 3D PCA score plot
mSet<-PlotPCA3DScoreImg(mSet, "pca_score3d_0_", "png", 72, width=NA, 1,2,3, 40)

# Create a PCA loadings Plots
mSet<-PlotPCALoading(mSet, "pca_loading_0_", "png", 72, width=NA, 1,2);

# Create a PCA Biplot
mSet<-PlotPCABiplot(mSet, "pca_biplot_0_", format = "png", dpi = 72, width=NA, 1, 2)
# View the 3D interactive PLS-DA score plot
mSet$imgSet$pca.3d
Figure 7. PCA score plot.

Figure 7. PCA score plot.

2.8 Partial Least Squares - Discriminant Analysis (PLS-DA)

To perform PLS-DA, first use PLSR.Anal. MetaboAnalystR has the options to create a plot component pairs (PlotPLSPairSummary), score plot (PlotPLS2DScore), loadings plot (PlotPLSLoading), perform cross-validation and permutation (PLSDA.CV and PLSDA.Permut), and plot the results of the cross-validation and permutation (PlotPLS.Imp and PlotPLS.Permutation). For the score plots, users will create both a static as well as interactive 3D score plot. To view the interactive plot, type “mSet$imgSet$plsda.3d” into your R console. Please refer to the user manual for details about each function.

mSet<-PLSR.Anal(mSet, reg=TRUE)

mSet<-PlotPLSPairSummary(mSet, "pls_pair_0_", "png", 72, width=NA, 5)

mSet<-PlotPLS2DScore(mSet, "pls_score2d_0_", "png", 72, width=NA, 1,2,0.95,1,0)

mSet<-PlotPLS3DScoreImg(mSet, "pls_score3d_0_", "png", 72, width=NA, 1,2,3, 40)

mSet<-PlotPLSLoading(mSet, "pls_loading_0_", "png", 72, width=NA, 1, 2);

mSet<-PLSDA.CV(mSet, "5", 5,5, "Q2")
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:MetaboAnalystR':
## 
##     splsda
## Loading required package: pls
## 
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
## 
##     R2
## The following object is masked from 'package:stats':
## 
##     loadings
mSet<-PlotPLS.Classification(mSet, "pls_cv_0_", "png", 72, width=NA)

mSet<-PlotPLS.Imp(mSet, "pls_imp_0_", "png", 72, width=NA, "vip", "Comp. 1", 15, FALSE)

mSet<-PLSDA.Permut(mSet, 100, "accu")
## [1] "performing 100 permutations ..."
## [1] "Empirical p value: p = 0.02 (2/100)"
mSet<-PlotPLS.Permutation(mSet, "pls_perm_1_", "png", 72, width=NA)
# View the 3D interactive PLS-DA score plot
mSet$imgSet$plsda.3d
Figure 8. PLS 2D score plot.

Figure 8. PLS 2D score plot.

Figure 9. PLS 3D score plot.

Figure 9. PLS 3D score plot.

Figure 10. PLS importance analysis results.

Figure 10. PLS importance analysis results.

Figure 11. PLS permutation test results.

Figure 11. PLS permutation test results.

2.9 Sparse Partial Least Squares - Discriminant Analysis (sPLS-DA)

The sparse PLS-DA (sPLS-DA) algorithm can effectively reduce the number of variables (metabolites) in high-dimensional metabolomics data to produce robust and easy-to-interpret models. Users can control the “sparseness” of the model by controlling the number of components included in the model and the number of variables within each component. More details can be found from Le Cao et. al 2011 (PMC3133555).

To begin, use SPLSR.Anal. MetaboAnalystR has the options to create an overview of the sPLS-DA analysis (PlotSPLSPairSummary), a score plot (PlotSPLS2DScore or PlotSPLS3DScore), loadings plot (PlotSPLSLoading), and classification plot (PlotSPLSDA.Classification). For the score plots, users will create both a static as well as interactive 3D score plot. To view the interactive plot, type “mSet$imgSet$splsda.3d” into your R console. Please refer to the user manual for details about each function. Note that the loadings plot shows the variables selected by the sPLS-DA model for a given component. The variables are ranked by the absolute values of their loadings.

To evaluate the performance of the created sPLS-DA models for classification, use PlotSPLSDA.Classification. The performance of the sPLS-DA models are evaluated using cross validations (CV) with increasing numbers of components created using the specified number of the variables. Users can choose to use either 5-fold CVs or leave one out cross-validation (LOOCV). Please note that the results from 5-fold CV may change slightly due to random subsampling procedures.

# Perform sPLS-DA analysis
mSet<-SPLSR.Anal(mSet, 5, 10, "same", "Mfold", 5, T)

# Plot sPLS-DA overview
mSet<-PlotSPLSPairSummary(mSet, "spls_pair_0_", format = "png", dpi=72, width=NA, 5)

# Create 2D sPLS-DA Score Plot
mSet<-PlotSPLS2DScore(mSet, "spls_score2d_0_", format = "png", dpi=72, width=NA, 1, 2, 0.95, 1, 0)

# Create 3D sPLS-DA Score Plot
mSet<-PlotSPLS3DScoreImg(mSet, "spls_score3d_0_", format = "png", 72, width=NA, 1, 2, 3, 40)

# Create sPLS-DA loadings plot
mSet<-PlotSPLSLoading(mSet, "spls_loading_0_", format = "png", dpi=72, width=NA, 1,"overview")

# Perform cross-validation and plot sPLS-DA classification
mSet<-PlotSPLSDA.Classification(mSet, "spls_cv_0_", format = "png", dpi=72, width=NA)
# View the 3D interactive PLS-DA score plot
mSet$imgSet$splsda.3d
Figure 12. sPLS 2D score plot.

Figure 12. sPLS 2D score plot.

Figure 13. sPLS 3D score plot.

Figure 13. sPLS 3D score plot.

Figure 14. sPLS loading plot

Figure 14. sPLS loading plot

Figure 15. sPLS cross-validation results.

Figure 15. sPLS cross-validation results.

2.10 Orthogonal Partial Least Squares - Discriminant Analysis (orthoPLS-DA)

MetaboAnalystR can create several outputs for oPLS-DA analysis. To begin, use the OPLSR.Anal function. The outputs for oPLS-DA include a score plot, a plot to identify significant features, a model overview plot, and a plot of model permutations. For the significant features plot, the plot visualizes the variable influence in the orthogonal PLS-DA model. It combines the covariance and correlation loading profiles. This corresponds to combining the contribution or magnitude (covariance) with the effect and reliability (correlation) for the model variables with respect to model component scores (details). To begin, please use the OPLSR.Anal function. Please refer to the user manual for further details on the functions.

# Perform oPLS-DA analysis
mSet<-OPLSR.Anal(mSet, reg=TRUE)

# Create a 2D oPLS-DA score plot
mSet<-PlotOPLS2DScore(mSet, "opls_score2d_0_", format = "png", dpi=72, width=NA, 1,2,0.95,1,0)

# Create a significant features plot
mSet<-PlotOPLS.Splot(mSet, "opls_splot_0_", "all", "png", 72, width=NA);

# Create a plot of features ranked by importance
mSet<-PlotOPLS.Imp(mSet, "opls_imp_0_", "png", 72, width=NA, "vip", "tscore", 15,FALSE)

# Create a plot of the model overview
mSet<-PlotOPLS.MDL(mSet, "opls_mdl_0_", format = "png", dpi=72, width=NA)

# Perform and plot oPLS-DA permutation 
mSet<-OPLSDA.Permut(mSet, 100)
## [1] "time taken for 10 permutations:  0.0561895370483398"
mSet<-PlotOPLS.Permutation(mSet, "opls_perm_2_", format = "png", dpi=72, width=NA)
## [1] "Empirical p-values Q2:  p < 0.01 (0/100)  and R2Y:  p = 0.01 (1/100)"
Figure 16. oPLS 2D score plotting.

Figure 16. oPLS 2D score plotting.

Figure 17. oPLS 2D permutation test result.

Figure 17. oPLS 2D permutation test result.

More plots can be found in your working directory.

2.11 Significance Analysis of Microarrary (and Metabolites) (SAM)

SAM is a well-established statistical method for identification of differentially expressed genes in mi- croarray data analysis. It is designed to address the false discovery rate (FDR) when running multiple tests on high-dimensional microarray data. SAM assigns a significance score to each variable based on its change relative to the standard deviation of repeated measurements. For a variable with scores greater than an adjustable threshold, its relative difference is compared to the distribution estimated by random permutations of the class labels. For each threshold, a certain proportion of the variables in the permutation set will be found to be significant by chance. The proportion is used to calculate the FDR. SAM is performed using the siggenes package. Users need to specify the Delta value to control FDR in order to proceed. To begin the SAM analysis, use the SAM.Anal function. Please refer to the user manual for further details on the functions.

# Perform SAM analysis
mSet<-SAM.Anal(mSet, "d.stat", FALSE, TRUE, 0.0, "sam_imp_0_")
## Loading required package: Biobase
## Loading required package: BiocGenerics
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
##     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
##     get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
##     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
##     table, tapply, union, unique, unsplit, which.max, which.min
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: multtest
## Loading required package: splines
## Warning: Expected class labels are 0 and 1. cachexic is set to 0, and control
## is set to 1.
## Warning: The spline based estimation of pi0 results in a non-positive value of pi0.
## Therefore, pi0 is estimated by using lambda = 0.5.
# Create a SAM plot of FDR values
mSet<-PlotSAM.FDR(mSet, "sam_view_0_", format = "png", dpi=72, width=NA)

# Create a SAM plot of results
mSet<-PlotSAM.Cmpd(mSet, "sam_imp_0_", format = "png", dpi=72, width=NA)
Figure 18. SAM analysis result (FDR).

Figure 18. SAM analysis result (FDR).

Figure 19. SAM analysis result.

Figure 19. SAM analysis result.

2.12 Empirical Bayesian Analysis of Microarray (and Metabolites) (EBAM)

EBAM is an empirical Bayesian method based on moderated t-statistics. EBAM uses a two-group mixture model for null and significant features. The prior and density parameters are estimated from the data. A feature is considered significant if its calculated posterior is larger than or equal to delta and no other features with a more extreme test score that is not called signicant. The default is delta = 0.9. The suggested fudge factor (a0) is chosen that leads to the largest number of significant features. EBAM is performed with ebam function in siggenes package.

To perform EBAM analysis, begin with the EBAM.Init function. To create a plot of the EBAM analysis, use PlotEBAM.Cmpd. Please refer to the user manual for further details on the functions.

# Perform EBAM analysis, plot EBAM analysis and create the EBAM matrix of significant features
mSet<-EBAM.Init(mSet, FALSE, TRUE, FALSE, -99.0, 0.9, "ebam_view_0_", "ebam_imp_0_")
## Warning: Some of the logit posterior probabilites are Inf. These probabilities
## are not plotted.
# Create a EBAM plot of results
PlotEBAM.Cmpd(mSet, "ebam_imp_0_", "png", 72, width=NA)
Figure 20. EBAM analysis result.

Figure 20. EBAM analysis result.

2.13 Hierarchical Clustering: Dendogram

In (agglomerative) hierarchical cluster analysis, each sample begins as a separate cluster and the algo- rithm proceeds to combine them until all samples belong to one cluster. Two parameters need to be considered when performing hierarchical clustering. The first one is distance measure, including the Euclidean distance, Pearson’s correlation, or Spearman’s rank correlation. The other parameter is the clustering algorithm, which includes the average linkage (clustering uses the centroids of the observations), complete linkage (clustering uses the farthest pair of observations between the two groups), single linkage (clustering uses the closest pair of observations) and Ward’s linkage (clustering to minimize the sum of squares of any two clusters). Heatmap is often presented as a visual aid in addition to the dendrogram. Hierachical clustering is performed with the hclust function in package stat. Use the PlotHCTree to create the dendogram, where users can specify the distance measure and the clustering algorithm. Please refer to the user manual for further details on the function.

# Perform hierarchical clustering and plot dendogram
mSet<-PlotHCTree(mSet, "tree_0_", format = "png", dpi=72, width=NA, "euclidean", "ward.D")
Figure 21. Hierarchical Clustering (Dendogram) analysis result.

Figure 21. Hierarchical Clustering (Dendogram) analysis result.

2.14 Hierarchical Clustering: Heatmaps

The heatmap provides intuitive visualization of the metabolomics data table. Each colored cell on the map corresponds to a concentration value in the user’s data table, with samples in rows and features/compounds in columns. Users can use a heatmap to identify samples/features that are unusually high/low. Further, the heatmap is often presented as a visual aid in addition to the dendrogram.

In (agglomerative) hierarchical cluster analysis, each sample begins as a separate cluster and the algorithm proceeds to combine them until all samples belong to one cluster. Two parameters need to be considered when performing hierarchical clustering. The first one is the similarity measure, options in MetaboAnalystR include Euclidean distance, Pearson’s correlation, and Spearman’s rank correlation. The other parameter is the choice of clustering algorithms, including average linkage (clustering uses the centroids of the observations), complete linkage (clustering uses the farthest pair of observations between the two groups), single linkage (clustering uses the closest pair of observations) and Ward’s linkage (clustering to minimize the sum of squares of any two clusters). Hierachical clustering is performed with the hclust function in package stat. Use the PlotHeatMap function, where users can specify the distance measure, the clustering algorithm, the color contrast, a detailed or overview view of the data, and options for the data input for the heatmap. Please refer to the user manual for further details.

# Perform hierarchical clustering and plot heat map
mSet<-PlotHeatMap(mSet, "heatmap_0_", "png", 72, width=NA, "norm", "row", "euclidean", "ward.D","bwm", 8, "overview", T, T, NULL, T, F, T, T, T)
Figure 22. Hierarchical Clustering (Heatmap) analysis result.

Figure 22. Hierarchical Clustering (Heatmap) analysis result.

2.15 Partitional Clustering: K-Means

K-means clustering is a nonhierarchical clustering technique. It begins by creating k random clusters (k is supplied by user). The program then calculates the mean of each cluster. If an observation is closer to the centroid of another cluster then the observation is made a member of that cluster. This process is repeated until none of the observations are reassigned to a different cluster. K-means analysis is performed using the kmeans function in the package stat. To begin, use the Kmeans.Anal function, and then the PlotKmeans function to create a plot of the analysis.

# Perform K-means analysis
mSet<-Kmeans.Anal(mSet, 3)

# Plot K-means analysis 
mSet<-PlotKmeans(mSet, "km_0_", format = "png", dpi=72, width=NA)
## Loading required package: data.table
Figure 23. Kmeans analysis result.

Figure 23. Kmeans analysis result.

2.16 Partitional Clustering: Self Organizing Maps (SOM)

SOM is an unsupervised neural network algorithm used to automatically identify major trends present in high-dimensional data. SOM is based on a grid of interconnected nodes, each of which represents a model. These models begin as random values, but during the process of iterative training they are updated to represent different subsets of the training set. Users need to specify the x and y dimension of the grid to perform SOM analysis. The SOM is performed using the som R package. Please refer to the user manual for further details on the functions.

MetaboAnalystR performs several outputs for self organizing maps. To begin, use the SOM.Anal function to perform SOM analysis. Then use the PlotSOM to create a plot of the SOM analysis.

# Perform SOM analysis
mSet<-SOM.Anal(mSet, 1, 3,"linear","gaussian")

# Plot SOM analysis
mSet<-PlotSOM(mSet, "som_0_", format = "png", dpi=72, width=NA)
Figure 24. SOM analysis result.

Figure 24. SOM analysis result.

2.17 Random Forest

Random Forest is a supervised learning algorithm suitable for high dimensional data analysis. It uses an ensemble of classification trees, each of which is grown by random feature selection from a bootstrap sample at each branch. Class prediction is based on the majority vote of the ensemble. RF also provides other useful information such as OOB (out-of-bag) error, variable importance measure, and outlier mea- sures. During tree construction, about one-third of the instances are left out of the bootstrap sample. This OOB data is then used as test sample to obtain an unbiased estimate of the classification error (OOB error). Variable importance is evaluated by measuring the increase of the OOB error when it is permuted. The outlier measures are based on the proximities during tree construction. RF analysis is performed using the randomForest R package.

MetaboAnalystR performs several outputs for the random forest analysis. To begin, use the RF.Anal function. The PlotRF.Classify function creates a plot of the out-of-the-bag error rate for the random forest classification. The PlotRF.VIP function creates a plot of the contributions of variables with high importance to the classification of the random forest model. In this plot, the features are ranked by their contributions to classification accuracy (Mean Dicrease Accuracy). The PlotRF.Outlier creates a plot of the outlying samples in the random forest model.

# Perform random forest analysis
mSet<-RF.Anal(mSet, 500, 7, 1)
## Warning in rm(.Random.seed): object '.Random.seed' not found
# Plot random forest classification
mSet<-PlotRF.Classify(mSet, "rf_cls_0_", format = "png", dpi=72, width=NA)

# Plot random forest variables of importance
mSet<-PlotRF.VIP(mSet, "rf_imp_0_", format = "png", dpi=72, width=NA)

# Plot random forest outliers 
mSet<-PlotRF.Outlier(mSet, "rf_outlier_0_", format = "png", dpi=72, width=NA)
Figure 25. Random Forest analysis result (classification).

Figure 25. Random Forest analysis result (classification).

Figure 26. Random Forest analysis result (variables of importance).

Figure 26. Random Forest analysis result (variables of importance).

Figure 27. Random Forest analysis result (outliers).

Figure 27. Random Forest analysis result (outliers).

2.18 Support Vector Machine (SVM)

SVM aims to find a nonlinear decision function in the input space by mapping the data into a higher dimensional feature space and separating it there by means of a maximum margin hyperplane. The SVM- based recursive feature selection and classification is performed using the R-SVM script11. The process is performed recursively using decreasing series of feature subsets (ladder) so that different classification models can be calculated. Feature importance is evaluated based on its frequencies being selected in the best classifier identified by recursive classification and cross-validation. Please note, R-SVM is very computationally intensive. Only the top 50 features (ranked by their p values from t-tests) will be evaluated.

# Perform SVM 
mSet<-RSVM.Anal(mSet, 10)

mSet<-PlotRSVM.Classification(mSet, "svm_cls_0_", format = "png", dpi=72, width=NA)

mSet<-PlotRSVM.Cmpd(mSet, "svm_imp_0_", format = "png", dpi=72, width=NA)
Figure 28. SVM analysis result (Classification).

Figure 28. SVM analysis result (Classification).

Figure 29. SVM analysis result (Importance).

Figure 29. SVM analysis result (Importance).