![]() |
|
MetaboAnalystR PackageMetaboAnalystR package is synchronized with the MetaboAnalyst website and is designed for metabolomics researchers who are comfortable using R to perform raw spectra processing and batch analysis. An optimized global metabolomics analysis workflow starting from raw spectra has been established. The following tutorials are meant to complement our web-based functions by providing step-by-step instructions for several of the most common tasks using the R package. |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Contents
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1. Overview1.1 IntroductionMetaboAnalystR 3 contains the R functions and libraries underlying the popular MetaboAnalyst web server, including metabolomic data analysis, visualization, and functional interpretation. The package is synchronized with the MetaboAnalyst web server. After installing and loading the package, users will be able to reproduce the same results from their local computers using the corresponding R command history downloaded from MetaboAnalyst web site, thereby achieving maximum flexibility and reproducibility. The version 3 aims to improve the current global metabolomics workflow by implementing a fast parameter optimization algorithm for peak picking, and automated identification of the most suitable method for batch effect correction from 12 well-established approaches. In addition, more support for functional interpretation directly from m/z peaks via mummichog2 (PMID: 23861661), and a new pathway-based method to integrate multiomics data has been added. To demonstrate this new functionality, we perform end-to-end metabolomics data analysis on the clinical IBD samples in this tutorial. More case study have been provided in the vignette of this R package. Here we'd prefer to provide a comprehensive starting tutorial from installation of MetaboAnalystR 3.2 (Updated at 16-Dec-2021). 1.2 InstallationStep 1. Install package dependenciesTo use MetaboAnalystR 3.2, first install all package dependencies. Ensure that you have necessary system environment configured. For Linux (e.g. Ubuntu 18.04/20.04): libcairo2-dev, libnetcdf-dev, libxml2, libxt-dev and libssl-dev should be installed at frist; For Windows (e.g. 7/8/8.1/10): Rtools should be installed. For Mac OS: In order to compile R for Mac OS, you need Xcode and GNU Fortran compiler installed (https://mac.r-project.org/tools/). We suggest you follow these steps: https://thecoatlessprofessor.com/programming/cpp/r-compiler-tools-for-rcpp-on-macos/ to help with your installation. R base with version > 4.0 is required. The compatibility of latest version (v4.2) is under evaluation. As for installation of package dependencies, there are two options: Option 1 Enter the R function (metanr_packages) and then use the function. A printed message will appear informing you whether or not any R packages were installed. Function to download packages:
Usage of function:
Option 2 Use the pacman R package (for those with >R 3.5.1).
Step 2. Install the packageMetaboAnalystR 3.2 is freely available from GitHub. The package documentation, including the vignettes for each module and user manual is available within the downloaded R package file. You can install the MetaboAnalylstR 3.0 via any of the three options: A) using the R package devtools, B) cloning the github, C) manually downloading the .tar.gz file. Note that the MetaboAnalystR 3.2 github will have the most up-to-date version of the package. Option A) Install the package directly from github using the devtools package. Open R and enter:Due to issues with Latex, some users may find that they are only able to install MetaboAnalystR 3.2 without any documentation (i.e. vignettes).
Option B) Install from a pre-built source package
Option C) Clone Github and install locallyThe * must be replaced by what is actually downloaded and built.
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2. Case Study IBDInflammatory bowel diseases, which include Crohn’s disease and ulcerative colitis, have affected several million individuals around the world. Jason L. et al. have performed a longtitude multiomics study on the role of microbiome in the pathogenesis of IBD. Metabolomics study on the facal samples is introduced here for example case study of this novel parameters optimization pipeline. 2.1 Raw Data ProcessingRaw data processing is the first for all metabolomics studies. MetaboAnalystR 3.2 support “.mzXML”, “.mzML” and “.CDF” formats. The original formats (e.g. “.raw” and “.RAW”) generated by vendors will be supported soon. You can convert your original data with ProteoWizard (PMID: 23051804) as the supported formats. Centroided data is preferred. 2.1.1 Load MetaboAnalystRIf you have finished the installation and been ready to use the package. Use the library() function to load the package into R.
2.1.2 Download IBD Example QC DataThe example Quality Control (QC) samples will be used for the next steps. Download the MS data for parameters' optimization at this step. To reach a goal of quicking learning and avoid the long time running for the whole samples (over 600 samples). We only provide 5 samples for each CD and nonIBD group here. If you want to repeat and verify the results in our manuscript, please go IBD MultiOmics Database, download the full batch data and run them using this pipeline.
2.1.3 Data InspectationBefore running the data analysis, the general data structure and information can be inspected with PerformDataInspect. If there are some extremly significant contaminats, it will be discovered directly.
2.1.4 ROI ExtractionQC samples data ROI extraction was performed with PerformROIExtraction or PerformDataTrimming now. There are 3 modes for data ROI extraction. Default is Standard Simulation Method (“ssm_trim”). Under this mode, the 3D MS will trimmed from m/z demionsion firstly and then from RT deminson. rt.idx could be used to adjust the RT range for RT data reminning. The other 2 modes, which are named as “rt_specific” and “mz_specific” can be used to extract (with positive values) or remove (with negative values) some parts of the spectra. Trimmed MS file can be saved with write = T. The trimmed file will be save as “.mzML”. The chromatogram can be plotted with plot = T.
2.1.5 Initalize Parameters' SettingInstrument specific initial parameters have been embedded. Currently, the most popular platforms that include UPLC-Q/E, UPLC-Q/TOF, UPLC-T/TOF, UPLC-Ion_trap, UPLC-Orbitrap, UPLC-G2S, HPLC-Q/TOF, HPLC-Ion_Trap, HPLC-Orbitrap and HPLC-S/Q. If your platform is not listed here, the “general” option can be applied for this setting. Both “centWave” and “matchedFIlter” modes have been configured for further processing. Besides, the specific parameters could also be set individually, if you have your own opinion about the parameters. This parameters optimization step can be skipped.
2.1.6 Parameters' OptimizationParameters optimization with “DoE” strategy. Initial parameters are from the optimized parameters above or the internal default parameters from embedded SetPeakParam function.
2.1.7 Peak ProfilingThe ImportRawMSData function reads in raw MS data files and saves it as an OnDiskMSnExp object. Two plots can be output with SetPlotParam function set with “Plot = T” - the Total Ion Chromatogram (TIC) which provides an overview of the entire spectra, and the Base Peak Chromatogram (BPC) which is a cleaner profile of the spectra based on the most abundant signals. These plots are useful to inform the setting of parameters downstream. For users who wish to view a peak of interest, an Extracted Ion Chromatogram (EIC) can be generated using the PlotEIC function.
The PerformPeakProfiling function is an updated peak processing pipeline from XCMS R functions that performs peak detection, alignment, and grouping in an automatical step. The function also generates two diagnostic plots including statistics on the total intensity of peaks in different samples, a retention time adjustment map, and a PCA plot showing the overall sample clustering prior to data cleaning and statistical analysis.
Intensity statistics of all samples. 2.1.8 Peak AnnotationThe PerformPeakAnnotation function annotates isotope and adduct peaks with the method from CAMERA (PMID: 22111785). It outputs the result as a CSV file (“annotated_peaklist.csv”) and saves the annotated peaks to mSet object. Finally, the peak list is formatted to the correct structure for MetaboAnalystR and filtered based upon user’s specifications using the FormatPeakList function. This function permits the filtering of adducts (i.e. removal of all adducts except for [M+H]+/[M-H]-) and filtering of isotopes (i.e. removal of all isotopes except for monoisotopic peaks). The goal of filtering peaks is to remove degenerative signals and reduce the file size.
Part of the table (first 8 rows and 2 samples) are shown below.
2.2 Data Processing and Statistical AnalysisThis step can be easily finished with 'Statistical Analysis' module in MetaboAnalyst website. The running R code generated by website could easily be used to repeat the same results. This brief chapter is established to show case how to generate the most results. More statistical utilities can be achieved with the same way. Remaining in the same working directory, we will read in the filtered peak table. Then, we will replace missing values with a small value (half of the minimum positive value within the original data). Next we will filter varibles out non-informative signals, then normalize the samples to their median and perform log transformation. Finally, we will perform t-test analysis to identify differentially enriched features, setting the FDR-adjusted p-value threshold to 0.25. 2.2.1 mSet initialized for further analysis
2.2.2 Perform Data Statistics
2.2.3 Univariate Analysis - t test
2.2.4 Multivariate Analysis - PCA
2.2.5 Multivariate Analysis - PLS-DA
2.3 From MS peaks to PathwaysPrevious versions of MetaboAnalyst encompassed two modules for functional analysis, metabolic pathway analysis (MetPA) (Xia et al. 2010, https://academic.oup.com/bioinformatics/article/26/18/2342/208464) and metabolite set enrichment analysis (MSEA) (Xia et al. 2010, https://academic.oup.com/nar/article/38/suppl_2/W71/1101310). However, these modules require metabolite identifications prior to use, which remains an important challenge in untargeted metabolomics. In comparison, the mummichog algorithm (Li et al. 2013, http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003123) bypasses the bottleneck of metabolite identification prior to pathway analysis, leveraging a priori pathway and network knowledge to directly infer biological activity based on mass peaks. We have therefore implemented the mummichog algorithm in R in a new module named “MS Peaks to Pathways”. The knowledge-base for this module consists of five genome-scale metabolic models from the original Python implementation which have either been manually curated or downloaded from BioCyc, as well as an expanded library of 21 organisms derived from KEGG metabolic pathways. To use this module, users must upload a table (.txt) using the Read.PeakListData function containing either:
All inputted files must be in .txt format. If the input is a three column table, both the mummichog and GSEA algorithms (and their combination) can be applied. If only p-values (or ranked by p-values) are provided, then only the mummichog algorithm will be applied. If only t-scores (or ranked by t-scores) are provided, then only the GSEA algorithm will be applied. If p-values have not yet been calculated, users can use the “Statistical Analysis” module of MetaboAnalyst website or this R package to upload their raw peak tables, process the data, perform t-tests or fold-change analysis, and then upload these results into the module. With the table, users also need to specify the type of MS instrument, the ion mode (positive or negative) using the UpdateInstrumentParameters function. Currently, MetaboAnalystR only supports the handling of peaks obtained from high-resolution MS instruments such as Orbitrap, or Fourier Transform (FT)-MS instruments as recommended by the original mummichog implementation. Following data upload, users much select the organism’s library from which to perform untargeted pathway analysis using the SetMass.PathLib function. Users can then perform the mummichog algorithm on their data using PerformPSEA. First, users will set algorithm to be used to mummichog using SetPeakEnrichMethod. Second, users will set the p-value cutoff to delineate between significantly enriched and non-significantly enriched m/z features using the SetMummichogPval function. Finally, use the PerformPSEA to calcaulte pathway activity. The output of this module first consists of a table of results identifying the top-pathways that are enriched in the user-uploaded data, which can be found in your working directory named “mummichog_pathway_enrichment.csv”. The table consists of the total number of hits, the raw p-value (Fisher’s or Hypergeometric), the EASE score, and the adjusted p-value (for permutations) per pathway. A second table can also be found in your working directory named “mummichog_matched_compound_all.csv”, that contains all matched metabolites from the user’s uploaded list of m/z features. For this tutorial we will first directly use the an example peak list data obtained from untargeted metabolomics of IBD data above (mass accuracy 5.0, negative ion mode). This is an raw data table generated by the raw data processing pipeline. 2.3.1 Mummichog Version 1Perform T test for further MS peaks to pathway analysis
2.3.2 Mummichog Version 2The SetPeakEnrichMethod now has a parameter to let users select whether to use Version 1 or Version 2 of the MS Peaks to Paths algorithms. With Version 2, users can now upload retention time information to perform pathway analysis. Note that Version 2 should only be used if user’s data contains a retention time (“rt” or “r.t”) column. Retention time is used to move pathway analysis from the “Compound” space to “Empirical Compound” space (details below in “How are Empirical Compounds calculated?”). The inclusion of retention time will increase the confidence and robustness of the potential compound matches. Another difference is that currency compounds are removed directly from the user’s selected pathway library, versus removed from potential compound hits during the permutations.
2.3.3 Adduct & Currency CustomizationAdduct CustomizationRaw MS peaks contain a significant amount of adducts specific to their MS instrument and analytical mode. A comprehensive adduct list is shown below in the “Available” panel. Use this to customize the adduct list used by the Mummichog/GSEA algorithms to match m/z peaks to potential compounds hits. The list of available adducts to choose from can be found in here: positive; here: negative ; here:mixed. For the negative ion mode, the adducts used are: M-H[-], M-2H[2-], M(C13)-H[-], M(S34)-H[-], M(Cl37)-H[-], M+Na-2H[-], M+K-2H[-], M-H2O-H[-], M+Cl[-], M+Cl37[-], M+Br[-], M+Br81[-], M+ACN-H[-], M+HCOO[-], M+CH3COO[-], and M-H+O[-]. For the positive ion mode, the adducts used are: M[1+], M+H[1+], M+2H[2+], M+3H[3+], M(C13)+H[1+], M(C13)+2H[2+], M(C13)+3H[3+], M(S34)+H[1+], M(Cl37)+H[1+], M+Na[1+], M+H+Na[2+], M+K[1+], M+H2O+H[1+], M-H2O+H[1+], M-H4O2+H[1+], M-NH3+H[1+], M-CO+H[1+], M-CO2+H[1+], M-HCOOH+H[1+], M+HCOONa[1+], M-HCOONa+H[1+], M+NaCl[1+], M-C3H4O2+H[1+], M+HCOOK[1+], and M-HCOOK+H[1+]. Currency CustomizationCurrency metabolites are abundant substances such as water and carbon dioxide known to occur in normal functioning cells and participate in a large number of metabolic reactions (Huss and Holme, 2007). Because of their ubiquitous nature, removing them can greatly improve pathway analysis and visualization. The list of available currency metabolites can be found here. By default, the MS Peaks to Paths module considers these metabolites as currency: ‘C00001’, ‘C00080’, ‘C00007’, ‘C00006’, ‘C00005’, ‘C00003’, ‘C00004’, ‘C00002’, ‘C00013’, ‘C00008’, ‘C00009’, ‘C00011’, ‘G11113’, ‘H2O’, ‘H+’, ‘Oxygen’, ‘NADP+’, ‘NADPH’, ‘NAD+’, ‘NADH’, ‘ATP’, ‘Pyrophosphate’, ‘ADP’, ‘Orthophosphate’, and ‘CO2’. Now we will use another example peak list data obtained from untargeted metabolomics of mice alveolar macrophages in lungs, using an Orbitrap LC-MS (C18 negative ion mode and HILIC positive ion mode). This is an example of the four-column table containing the m.z, mode, p.value, and t.score. We will perform both currency and adduct customization.
2.3.4 Customizing Empirical Compound FormationBy default, V2 of the MS Peaks to Pathways algorithms enforces primary ions to be present when creating Empirical Compounds (step 3 above). Users can disable this in the UpdateInstrumentParameters function by setting the “force_primary_ion” parameter to “no”. Additionally, by default the retention-time window used to split Empirical Compounds is calculated as the maximum retention time in the user’s data multiplied by the retention time fraction (default is 0.02). Users can either change the retention time fraction (rt_frac) or set the retention time-window (rt_tol - in seconds). Code details are shown below.
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
3. Featured Utilities3.1 Batch Effect CorrectionIntroduction of Batch Effect CorrectionThe batch effect correction analysis module performs an automatic or specific correction on the peak data table generated by peak picking step. Multiple correction methods have been embedded inside. They include combat, WaveICA, EigenMS, QC_RLSC, ANCOVA, RUV_random, RUV_2, RUVseq series (RUV_s, RUV_r and RUV_g), NOMIS, CCMN. The detailed limitation and mathmatic mechanism of different methods have been illustrated in our manuscript. Here, 4 cases is provided to show users a step to step workflow for both batch effect and signal drift correction. Data preparationThe peak table generated by FormatPeakList function needs to be manually prepared by hand to supplements the corresponding information below.
Here is an example table.
3.1.1 Case 1 - IBD Benchmark DataThis Benchmark data is a large batch data with over 600 samples. The data file size is over 70 MB. This part is designed for repeating our published results. Running of this part may take over 30min to finish. You are strongly recommonded to use your own data instead of this large file to avoid the consuming of patience. Data Downloading and Preparation
Library Package
Data Filtering and Normalization
Initializing mSet Object
Data Reading
Automatic Correction
|