Introduction:
This tool is a wrapper for GC-MS data processing with the R package 'eRah'.
eRah is an R package with an integrated design that deconvolves GC-MS
chromatograms using multivariate techniques based on blind source separation
(BSS). eRah automatically detects and deconvolves the spectra of the
compounds appearing in GC-MS chromatograms, and processes the raw data files
(netCDF or mzXML) of a complete metabolomics experiment in an automated
manner. After that, compounds are aligned by spectral similarity and
retention time distance. eRah computes the Euclidean distance between
retention time distance and spectral similarity for all compounds in the
chromatograms, resulting in compounds appearing across the maximum number of
samples and with the least retention time and spectral distance. A missing
compound recovery step can also be applied to recover compounds that are
missing in some samples. A compound can be missing either because of an
incorrect deconvolution or alignment (for example, due to a low compound
concentration in a sample) or because it is not present in the sample. This
step forces the final data table of compound names and compound areas to
contain no missing (zero) values.
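As a rough illustration of the alignment criterion described above, the sketch below combines a retention time distance and a spectral dissimilarity into a single Euclidean distance. It is not eRah's implementation: the function names, the use of 1 minus the Pearson correlation as the spectral distance, and the scaling of the retention time difference by a maximum allowed distance are all assumptions made for this example.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equally long intensity vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def alignment_distance(rt_a, rt_b, spec_a, spec_b, max_rt_dist):
    """Hypothetical combined distance between two deconvolved compounds.

    rt_a, rt_b     retention times in seconds
    spec_a, spec_b intensity vectors on a shared m/z axis
    max_rt_dist    scaling constant in seconds (assumption of this sketch)
    """
    rt_term = abs(rt_a - rt_b) / max_rt_dist       # 0 means same retention time
    spec_term = 1.0 - pearson(spec_a, spec_b)      # 0 means identical spectra
    return math.sqrt(rt_term ** 2 + spec_term ** 2)
```

Under these assumptions, two compounds with identical spectra and retention times have a distance of 0, and the distance grows as either the retention times or the spectra diverge.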
Input files:
1. Multiple GC-MS raw data files in netCDF, mzXML or mzML format.
Parameters:
1. RT window: The chromatographic retention time window to process. If 0,
the whole chromatogram is processed.
2. Minimum peak width: This is a critical parameter that conditions the
efficiency of eRah. Typically, this should be half of the mean compound
width.
3. noise.threshold: Data below this threshold will be considered as noise.
4. avoid.processing.mz: The masses (m/z) to exclude from processing.
Typically, in GC-MS those masses are 73, 74, 75, 147, 148 and 149, since they
are ubiquitous mass fragments typically generated from compounds carrying a
trimethylsilyl moiety.
5. Minimum spectral correlation value: From 0 (not similar) to 1 (very
similar). This value sets how similar two or more compounds have to be to be
considered for alignment between them.
6. Maximum retention time distance: This value (in seconds) sets how far
apart two or more compounds can be to be considered for alignment between
them.
7. Minimum samples: The minimum number of samples in which a compound has to
appear to be searched for in the rest of the samples, where this compound is
missing.
8. blocks.size: For experiments containing more than 100 (Windows) or 1000
(Mac or Linux) samples (numbers depending on the computer resources and
sample type), alignment can be conducted by block segmentation. For an
experiment of e.g. 1000 samples, blocks.size can be set to 100, so the
alignment is performed as multiple (ten) 100-sample experiments, which are
later aligned into a single experiment. This parameter is designed to solve
the typical problem that appears when aligning under the Windows operating
system: "Error: cannot allocate vector of size XX Gb". Such a problem will
not appear with Mac or Linux, but several hours of computation are expected
when aligning a large number of samples. Using block segmentation provides a
greatly improved run-time performance.
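To make two of the parameters above more concrete, here is a minimal sketch of what excluding masses (avoid.processing.mz) and splitting an experiment into blocks (blocks.size) amount to. The helper names and the representation of a spectrum as a list of (m/z, intensity) pairs are hypothetical; they are not part of eRah or of this tool, which handle both steps internally.

```python
# Hypothetical helpers for illustration only; not part of eRah or this tool.

# Masses typically excluded in GC-MS (trimethylsilyl-related fragments).
TMS_MASSES = (73, 74, 75, 147, 148, 149)

def drop_excluded_mz(peaks, excluded=TMS_MASSES):
    """Remove (m/z, intensity) pairs whose m/z is in the excluded set."""
    excluded = set(excluded)
    return [(mz, intensity) for mz, intensity in peaks if mz not in excluded]

def split_into_blocks(samples, block_size):
    """Split a list of samples into consecutive blocks of at most block_size."""
    if block_size < 1:
        raise ValueError("block_size must be a positive integer")
    return [samples[i:i + block_size] for i in range(0, len(samples), block_size)]

# An experiment of 1000 samples with a block size of 100 gives ten blocks.
samples = ["sample_%04d" % i for i in range(1000)]
blocks = split_into_blocks(samples, 100)
print(len(blocks), len(blocks[0]))  # prints "10 100"
```

Each block would then be aligned on its own, and the per-block results aligned again into a single experiment, which keeps the memory needed at any one time bounded by the block size.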
Output files:
1. 'gcms_raw_pkTable.txt': The raw peak table, with one line per "compound"
and one column per sample.
For example:
| AlignID | STDmix_GC_01 | STDmix_GC_02 | STDmix_GC_03 |
| EC1     | 1486892478   | 561322777    | 3448620272   |
| EC2     | 5492977592   | 684434115    | 3265669981   |
| EC3     | 2265686433   | 4182838129   | 4365291513   |
| EC4     | 13390154     | 12612932     | 21155307     |
| EC5     | 14588107     | 8510918      | 7224351      |
2. 'gcms_mass_spectra.msp': The mass spectrum of each pseudospectrum
(compound) in MSP format. The identifiers are the same as in the peak table
file.
For example:
Name: EC1
rt: 3.4253
FoundIn: 3
Comments: MSP spectra exported by eRah
Num Peaks: 10
32 838; 33 60; 40 42; 41 54; 42 815; 43 713;
47 1000; 48 36; 49 6; 77 20;

Name: EC2
rt: 3.7521
FoundIn: 3
Comments: MSP spectra exported by eRah
Num Peaks: 13
30 1000; 31 335; 32 91; 33 11; 40 12; 41 47;
42 232; 43 299; 45 189; 46 831; 47 348;
48 6; 77 11;
3. 'gcms_mass_spectra_999norm.msp': The same mass spectrum information in MSP
format, with intensities normalized so that they sum to 999.
4. 'TICs.pdf': Total Ion Chromatograms.
5. 'BPCs.pdf': Base Peak Chromatograms.
6. 'EICs': Extracted Ion Chromatograms.
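For downstream use of the MSP outputs listed above, the following minimal sketch reads MSP text shaped like the 'gcms_mass_spectra.msp' example and rescales the intensities of a spectrum so that they sum to 999, as described for 'gcms_mass_spectra_999norm.msp'. The function names are hypothetical, the parser only handles the simple "key: value" header lines and "m/z intensity;" peak lines of that example, and the normalization is an assumption about what the tool does rather than its actual code.

```python
def parse_msp(text):
    """Parse MSP text into a list of dicts, one per 'Name:' record.

    Header lines ('key: value') are stored as strings; peak lines
    ('mz intensity; mz intensity; ...') are collected under 'peaks'.
    """
    records = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.lower().startswith("name:"):
            current = {"Name": line.split(":", 1)[1].strip(), "peaks": []}
            records.append(current)
        elif current is None:
            continue  # ignore anything before the first record
        elif ":" in line:
            key, value = line.split(":", 1)
            current[key.strip()] = value.strip()
        else:
            for pair in line.split(";"):
                fields = pair.split()
                if len(fields) == 2:
                    current["peaks"].append((float(fields[0]), float(fields[1])))
    return records

def normalize_to_sum(peaks, total=999.0):
    """Rescale intensities so that they sum to `total` (999 by assumption)."""
    intensity_sum = sum(intensity for _, intensity in peaks)
    if intensity_sum == 0:
        raise ValueError("spectrum has no signal to normalize")
    return [(mz, intensity * total / intensity_sum) for mz, intensity in peaks]
```

A typical use would be `records = parse_msp(open("gcms_mass_spectra.msp").read())`, followed by `normalize_to_sum(records[0]["peaks"])` for the first compound.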
Note:
ProteoWizard software (http://proteowizard.sourceforge.net/doc_users.html)
is recommended for converting raw data files to an open format. It supports
reading and writing the following open formats on all platforms (note:
vendor formats require Windows with vendor libraries):
mzML 1.1
mzML 1.0
mzXML
MGF
MS2/CMS2/BMS2
mzIdentML
Please read the protocol of this software carefully. It cannot be used for
any commercial purposes.
Reference:
[1] X. Domingo-Almenara, et al., eRah: a computational tool integrating
spectral deconvolution and alignment with quantification and identification
of metabolites in GC-MS-based metabolomics. Analytical Chemistry 88 (2016)
9821-9829. DOI: 10.1021/acs.analchem.6b02927
[2] X. Domingo-Almenara, et al., Compound identification in gas
chromatography/mass spectrometry-based metabolomics by blind source
separation. Journal of Chromatography A 1409 (2015) 226-233.
DOI: 10.1016/j.chroma.2015.07.044
[3] M. C. Chambers, B. MacLean, R. Burke, et al., A cross-platform toolkit
for mass spectrometry and proteomics. Nature Biotechnology 30 (2012)
918-920. http://proteowizard.sourceforge.net/doc_users.html