Introduction:

This tool is a wrapper for GC-MS data processing with the R package 'eRah'. eRah has an integrated design that deconvolves GC-MS chromatograms using multivariate techniques based on blind source separation (BSS). It automatically detects and deconvolves the spectra of the compounds appearing in GC-MS chromatograms, and it processes the raw data files (netCDF or mzXML) of a complete metabolomics experiment in an automated manner.

After deconvolution, compounds are aligned by spectral similarity and retention time distance. eRah computes the Euclidean distance between retention time distance and spectral similarity for all compounds in the chromatograms, and keeps the compounds that appear across the maximum number of samples with the least retention time and spectral distance.

A missing compound recovery step can also be applied to recover compounds that are missing in some samples. A compound can be missing because of an incorrect deconvolution or alignment (for example, due to a low compound concentration in a sample), or because it is not present in the sample. This step forces the final data table, which lists compound names and compound areas, to contain no missing (zero) values.
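
The alignment idea described above can be illustrated with a small sketch. This is not eRah's actual implementation: the function names, the use of cosine similarity as the spectral similarity measure, and the scaling of the retention time term by the maximum allowed distance are all assumptions made for illustration. Both retention times are assumed to be in the same unit.

```python
import math


def cosine_similarity(spec_a, spec_b):
    """Cosine similarity between two spectra given as {mz: intensity} dicts."""
    dot = sum(inten * spec_b.get(mz, 0.0) for mz, inten in spec_a.items())
    norm_a = math.sqrt(sum(v * v for v in spec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in spec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def alignment_distance(rt_a, rt_b, spec_a, spec_b, max_rt_dist):
    """Hypothetical combined score (assumption, not eRah's formula):
    Euclidean norm of the scaled retention time distance and the
    spectral dissimilarity (1 - similarity). Smaller means a better match."""
    rt_term = abs(rt_a - rt_b) / max_rt_dist
    spec_term = 1.0 - cosine_similarity(spec_a, spec_b)
    return math.hypot(rt_term, spec_term)


# Two hypothetical compounds with the same spectrum, eluting 1.5 time units apart.
spectrum = {73: 1000.0, 147: 350.0}
score = alignment_distance(10.0, 11.5, spectrum, spectrum, max_rt_dist=3.0)
```

With a score of this kind, candidate pairs with a small value are close in both retention time and spectrum, which is the property the alignment step looks for.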

Input files:

1.         Multiple GC-MS raw data files in netCDF, mzXML or mzML format.

Parameters:

1.        RT window: The chromatographic retention time window to process. If 0, the whole chromatogram is processed.

2.        Minimum peak width: This is a critical parameter that conditions the efficiency of eRah. Typically, it should be half of the mean compound peak width.

3.        noise.threshold: Data below this threshold will be considered as noise.

4.        avoid.processing.mz: The masses (m/z) to exclude from processing. Typically, in GC-MS these are 73, 74, 75, 147, 148 and 149, since they are ubiquitous mass fragments generated from compounds carrying a trimethylsilyl moiety.

5.        Minimum spectral correlation value: From 0 (not similar) to 1 (very similar). This value sets how similar two or more compounds have to be to be considered for alignment with each other.

6.        Maximum retention time distance: This value (in seconds) sets how far apart two or more compounds can be to be considered for alignment with each other.

7.        Minimum samples: The minimum number of samples in which a compound has to appear for it to be searched for in the remaining samples, where it is missing.

8.        blocks.size: For experiments containing more than 100 (Windows) or 1000 (Mac or Linux) samples (the numbers depend on the computer resources and sample type), alignment can be conducted by block segmentation. For an experiment of, e.g., 1000 samples, blocks.size can be set to 100, so the alignment is performed as ten 100-sample experiments that are later aligned into a single experiment. This parameter is designed to solve the typical problem that appears when aligning under the Windows operating system: "Error: cannot allocate vector of size XX Gb". This problem does not appear on Mac or Linux, but several hours of computation are expected when aligning a large number of samples. Block segmentation greatly improves run-time performance.
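
The block segmentation described for blocks.size can be pictured with the following sketch. It only shows how a sample list would be cut into consecutive blocks. The function name is hypothetical, and how eRah forms and re-aligns its blocks internally is not covered here.

```python
def split_into_blocks(sample_names, blocks_size):
    """Split a list of samples into consecutive blocks of at most blocks_size.

    Illustrative helper (hypothetical name); the last block may be smaller
    when the number of samples is not a multiple of blocks_size.
    """
    if blocks_size < 1:
        raise ValueError("blocks_size must be a positive integer")
    return [
        sample_names[start:start + blocks_size]
        for start in range(0, len(sample_names), blocks_size)
    ]


# 1000 hypothetical samples with a block size of 100 give ten blocks of 100.
samples = [f"sample_{i:04d}" for i in range(1000)]
blocks = split_into_blocks(samples, 100)
```

Each block is small enough to align on its own, which is what keeps memory use bounded in the scenario described above.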

Output files:

1.      'gcms_raw_pkTable.txt': the raw peak table, with one line per "compound" and one column per sample.

For example:

AlignID    STDmix_GC_01    STDmix_GC_02    STDmix_GC_03
EC1        1486892478      561322777       3448620272
EC2        5492977592      684434115       3265669981
EC3        2265686433      4182838129      4365291513
EC4        13390154        12612932        21155307
EC5        14588107        8510918         7224351
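
A peak table like the one in this example can be loaded with a few lines of Python. The sketch below assumes the columns are separated by whitespace (tabs or spaces) and that the first column holds the AlignID. The real delimiter of 'gcms_raw_pkTable.txt' is not stated here, so check the file before relying on this. The function name is hypothetical.

```python
import io


def read_peak_table(handle):
    """Read a whitespace-separated peak table into {align_id: {sample: area}}.

    Assumes the first row is a header (AlignID followed by sample names)
    and every other non-empty row is one compound.
    """
    header = handle.readline().split()
    samples = header[1:]
    table = {}
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        areas = [float(value) for value in fields[1:]]
        table[fields[0]] = dict(zip(samples, areas))
    return samples, table


# Two rows taken from the example table, used here as inline test data.
example = io.StringIO(
    "AlignID STDmix_GC_01 STDmix_GC_02 STDmix_GC_03\n"
    "EC1 1486892478 561322777 3448620272\n"
    "EC4 13390154 12612932 21155307\n"
)
samples, table = read_peak_table(example)
```

For a file on disk, the same function can be called on an opened file, for example inside `with open("gcms_raw_pkTable.txt") as handle:`.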

 

2.      'gcms_mass_spectra.msp': the mass spectrum of each pseudospectrum (compound) in MSP format. The identifiers are the same as in the peak table file.

For example:

Name: EC1
rt: 3.4253
FoundIn: 3
Comments: MSP spectra exported by eRah
Num Peaks: 10
32 838; 33 60; 40 42; 41 54; 42 815; 43 713;
47 1000; 48 36; 49 6; 77 20;

Name: EC2
rt: 3.7521
FoundIn: 3
Comments: MSP spectra exported by eRah
Num Peaks: 13
30 1000; 31 335; 32 91; 33 11; 40 12; 41 47;
42 232; 43 299; 45 189; 46 831; 47 348;
48 6; 77 11;
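
Records in this layout can be read back with a short parser. The sketch below is an assumption-laden illustration rather than a full MSP reader: it treats every "Name:" line as the start of a record, every other "key: value" line as a metadata field, and every remaining line as "m/z intensity;" pairs. Peaks are stored in a dict keyed by m/z, so a pair that is repeated in the file is kept only once. The function name is hypothetical.

```python
def parse_msp(text):
    """Parse MSP-style text into a list of records.

    Each record is {"name": str, "fields": {key: value}, "peaks": {mz: intensity}}.
    """
    records = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # blank lines only separate records
        if line.lower().startswith("name:"):
            current = {"name": line.split(":", 1)[1].strip(), "fields": {}, "peaks": {}}
            records.append(current)
        elif current is None:
            continue  # ignore anything before the first record
        elif ":" in line:
            key, value = line.split(":", 1)
            current["fields"][key.strip()] = value.strip()
        else:
            for pair in line.split(";"):
                parts = pair.split()
                if len(parts) == 2:
                    current["peaks"][int(parts[0])] = float(parts[1])
    return records


# The EC1 record from the example, used as inline test data.
example = """Name: EC1
rt: 3.4253
FoundIn: 3
Comments: MSP spectra exported by eRah
Num Peaks: 10
32 838; 33 60; 40 42; 41 54; 42 815; 43 713;
47 1000; 48 36; 49 6; 77 20;
"""
records = parse_msp(example)
```

Comparing `len(record["peaks"])` with the "Num Peaks" field is a quick sanity check that a record was read completely.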

 

3.      'gcms_mass_spectra_999norm.msp': the same mass spectra in MSP format with normalized intensities (intensities sum = 999).

4.      'TICs.pdf', Total Ion Chromatograms.

5.      'BPCs.pdf', Base Peak Chromatograms.

6.      'EICs', Extracted Ion Chromatograms.
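
The normalization behind 'gcms_mass_spectra_999norm.msp' can be sketched as a simple rescaling so that the intensities of one spectrum add up to 999. This is an illustration only: the function name is hypothetical, and whether the tool rounds the normalized values to integers is not stated here, so the sketch keeps them as floats.

```python
def normalize_to_sum(peaks, total=999.0):
    """Rescale a {mz: intensity} spectrum so that its intensities sum to total."""
    intensity_sum = sum(peaks.values())
    if intensity_sum <= 0:
        raise ValueError("spectrum has no positive intensity to normalize")
    return {mz: intensity * total / intensity_sum for mz, intensity in peaks.items()}


# A few peaks from the EC1 example spectrum.
spectrum = {32: 838.0, 42: 815.0, 43: 713.0, 47: 1000.0}
normalized = normalize_to_sum(spectrum)
```

The rescaling keeps the ratios between peaks unchanged, so the spectral pattern is the same and only the absolute scale differs.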

Note:

ProteoWizard (http://proteowizard.sourceforge.net/doc_users.html) is recommended for converting raw data files. It supports reading and writing the following open formats on all platforms (note: vendor formats require Windows with vendor libraries).

mzML 1.1

mzML 1.0

mzXML

MGF

MS2/CMS2/BMS2

mzIdentML
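
As a sketch of how such a conversion could be scripted, the helper below builds a command line for ProteoWizard's msconvert program. It assumes msconvert is installed and on the PATH, and the helper name, file name and output directory are hypothetical. The sketch only builds the argument list. Running it is left to the caller.

```python
def msconvert_command(input_path, output_dir, fmt="mzXML"):
    """Build an msconvert argument list for converting one raw file.

    Only the two open formats this tool accepts besides netCDF are
    offered here: mzXML and mzML.
    """
    format_flags = {"mzXML": "--mzXML", "mzML": "--mzML"}
    if fmt not in format_flags:
        raise ValueError(f"unsupported format: {fmt}")
    return ["msconvert", input_path, format_flags[fmt], "-o", output_dir]


# Hypothetical input file and output directory.
command = msconvert_command("STDmix_GC_01.raw", "converted", fmt="mzXML")
# To run the conversion: subprocess.run(command, check=True)
```

Keeping the command as a list rather than a single string avoids quoting problems with file names that contain spaces.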

 

Please read the protocol of this software carefully. It cannot be used for any commercial purposes.

Reference:

[1]     X. Domingo-Almenara, et al., eRah: a computational tool integrating spectral deconvolution and alignment with quantification and identification of metabolites in GC-MS-based metabolomics. Analytical Chemistry 88 (2016) 9821-9829. DOI: 10.1021/acs.analchem.6b02927

[2]     X. Domingo-Almenara, et al., Compound identification in gas chromatography/mass spectrometry-based metabolomics by blind source separation. Journal of Chromatography A 1409 (2015) 226-233. DOI: 10.1016/j.chroma.2015.07.044

[3]     M. C. Chambers, B. Maclean, R. Burke, et al., A cross-platform toolkit for mass spectrometry and proteomics. Nature Biotechnology 30 (2012) 918-920. http://proteowizard.sourceforge.net/doc_users.html