Introduction:

This tool trains a support vector machine (SVM) and ranks the peaks in the input table by SVM-RFE (recursive feature elimination). Group information is supplied by a group design file (tab-delimited text file).

The SVM-RFE algorithm proposed by Guyon et al. returns a ranking of the features of a classification problem by repeatedly training an SVM with a linear kernel and removing the feature with the smallest ranking criterion. This criterion is derived from the weight vector w of the decision hyperplane given by the SVM. For more detailed information, please see the original paper.
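As a minimal illustration of the procedure above (a sketch in Python with scikit-learn for the binary case; this is not necessarily the tool's internal implementation):

```python
import numpy as np
from sklearn.svm import SVC

def svm_rfe(X, y):
    """Rank features by SVM-RFE (binary classification).

    Repeatedly train a linear SVM and drop the feature with the
    smallest ranking criterion, computed from the weight vector w
    of the decision hyperplane. Returns feature indices, most
    important first.
    """
    remaining = list(range(X.shape[1]))
    eliminated = []                      # filled least-important first
    while remaining:
        clf = SVC(kernel="linear").fit(X[:, remaining], y)
        w = clf.coef_.ravel()            # hyperplane weights (binary case)
        worst = int(np.argmin(w ** 2))   # feature with smallest criterion
        eliminated.append(remaining.pop(worst))
    return eliminated[::-1]              # most important first
```

In practice features can also be removed in chunks for speed, and the multiclass case needs one weight vector per class pair; the sketch keeps the simplest one-at-a-time, two-class form.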

Input files:

1.      Peak table file in tab-delimited txt format, with the first column as the compound identifier and the remaining columns as samples.

For example:

	HU_011	HU_014	HU_015	HU_017	HU_018	HU_019
(2-methoxyethoxy)propanoic acid isomer	3.019766	3.814339	3.519691	2.562183	3.781922	4.161074
(gamma)Glu-Leu/Ile	3.888479	4.277149	4.195649	4.32376	4.629329	4.412266
1-Methyluric acid	3.869006	3.837704	4.102254	4.53852	4.178829	4.516805
1-Methylxanthine	3.717259	3.776851	4.291665	4.432216	4.11736	4.562052
1,3-Dimethyluric acid	3.535461	3.932581	3.955376	4.228491	4.005545	4.320582
1,7-Dimethyluric acid	3.325199	4.025125	3.972904	4.109927	4.024092	4.326856
2-acetamido-4-methylphenyl acetate	4.204754	5.181858	3.88568	4.237915	1.852994	4.080681
2-Aminoadipic acid	4.080204	4.359246	4.249111	4.231404	4.323679	4.244485
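For illustration, a file in this format can be read with Python's standard library alone (a sketch; the in-memory string below stands in for the file and uses names from the example above):

```python
import csv
import io

# Small in-memory stand-in for the tab-delimited peak table file:
# first header cell is empty, the rest are sample names.
peak_txt = (
    "\tHU_011\tHU_014\n"
    "1-Methylxanthine\t3.717259\t3.776851\n"
    "2-Aminoadipic acid\t4.080204\t4.359246\n"
)

rows = list(csv.reader(io.StringIO(peak_txt), delimiter="\t"))
samples = rows[0][1:]                           # sample names from the header
peaks = {r[0]: [float(v) for v in r[1:]]        # compound -> intensities
         for r in rows[1:]}
```

To read an actual file, replace `io.StringIO(peak_txt)` with `open(path)`.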

 

2.      Group design file in tab-delimited text format with two columns (samplename	groupname).

For example:

HU_011	M
HU_014	F
HU_015	M
HU_017	M
HU_018	M
HU_019	M
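A matching sketch for the group design file, aligning the group labels with the sample order of the peak table (sample names taken from the example above):

```python
# In-memory stand-in for the two-column group design file.
group_txt = "HU_011\tM\nHU_014\tF\nHU_015\tM\n"

# sample name -> group label
groups = dict(line.split("\t") for line in group_txt.splitlines())

# Class labels in the same order as the peak-table sample columns.
samples = ["HU_011", "HU_014", "HU_015"]
y = [groups[s] for s in samples]
```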

Parameters:

1.          kernel function: The kernel function measures the similarity between the input samples. The choice of kernel and its parameters is crucial for obtaining good results, which in practice means that an extensive search over the parameter space must be conducted before the results can be trusted.

       

2.          Linear kernel: Simple and safe; try it first. The model is interpretable: it indicates which features or data points are important. However, it cannot handle data that are not linearly separable.

3.          Polynomial kernel: Less restrictive than the linear kernel; it can separate data that are not linearly separable, but it is more complicated to tune, having three parameters (degree, scale, and offset).

4.          Radial basis function (RBF): Usually defined as a monotonic function of the Euclidean distance between a point in space and a certain center. It implicitly maps the original features into an infinite-dimensional space, can model nonlinear decision boundaries, and tends to have fewer numerical difficulties.

5.          Sigmoid: Squashes its input to the range [0, 1]. Historically popular since it has a nice interpretation as the saturating "firing rate" of a neuron, but it has some serious disadvantages: saturated neurons "kill" the gradients, sigmoid outputs are not zero-centered, and evaluating exp() is somewhat computationally expensive.
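The trade-offs above can be explored directly (a sketch in Python with scikit-learn on synthetic two-class data; default kernel parameters are used here, whereas in practice they should be tuned, e.g. by cross-validated grid search):

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic two-class data: 50 samples, 2 features, classes shifted apart.
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 25)
X = rng.normal(size=(50, 2)) + 3.0 * y[:, None]

# Training accuracy of each kernel with default parameters.
scores = {k: SVC(kernel=k).fit(X, y).score(X, y)
          for k in ("linear", "poly", "rbf", "sigmoid")}
```

On this linearly separable toy problem the linear and RBF kernels both score near 1.0; on real peak tables the ranking of kernels can only be established by such a comparison with proper parameter search and cross-validation.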

Output files:

1.      'SVM_Prediction.txt': per-sample prediction results of the SVM model on the input data.

2.      'SVM_Prediction_Summary.txt': summary of the predictions.

3.      'SVM_Imp_Rank.txt': feature ranking produced by SVM-RFE.

4.      'SVM_Imp.pdf': scatter plot of feature importance.

5.      'SVM_Top10_Imp.pdf': plot of the top 10 features.

Note:

Group names consisting of characters or strings are preferred. Numeric group names are also supported but not recommended.

Reference:

Marchiori E, Sebag M. Bayesian learning with local support vector machines for cancer classification with gene expression data. In: Applications of Evolutionary Computing. Springer Berlin Heidelberg; 2005. p. 74-83.