Introduction:

This tool performs feature selection with the Boruta algorithm. Boruta is an all-relevant feature selection wrapper algorithm, capable of working with any classification method that outputs a variable importance measure (VIM); by default, Boruta uses Random Forest. The method performs a top-down search for relevant features by comparing the importance of the original attributes with the importance achievable at random, estimated using their permuted copies, and progressively eliminates irrelevant features to stabilise that test.

Group information is given by a group design file (tab-delimited text file).

Input files:

1.      Peak table file in tab-delimited text format, with the first column as the compound identifier and the remaining columns as samples (one column per sample, with the sample names in the header row).

For example:

                                        HU_011    HU_014    HU_015    HU_017    HU_018    HU_019
(2-methoxyethoxy)propanoic acid isomer  3.019766  3.814339  3.519691  2.562183  3.781922  4.161074
(gamma)Glu-Leu/Ile                      3.888479  4.277149  4.195649  4.32376   4.629329  4.412266
1-Methyluric acid                       3.869006  3.837704  4.102254  4.53852   4.178829  4.516805
1-Methylxanthine                        3.717259  3.776851  4.291665  4.432216  4.11736   4.562052
1,3-Dimethyluric acid                   3.535461  3.932581  3.955376  4.228491  4.005545  4.320582
1,7-Dimethyluric acid                   3.325199  4.025125  3.972904  4.109927  4.024092  4.326856
2-acetamido-4-methylphenyl acetate      4.204754  5.181858  3.88568   4.237915  1.852994  4.080681
2-Aminoadipic acid                      4.080204  4.359246  4.249111  4.231404  4.323679  4.244485

2.      Group design file in Tab-delimited text format with two columns (samplename     groupname).

For example:

HU_011  M
HU_014  F
HU_015  M
HU_017  M
HU_018  M
HU_019  M
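
The two input layouts above can be read and matched up as in the following minimal Python sketch. It is not the tool's own loader: the function names (read_peak_table, read_group_design, to_sample_rows) are hypothetical, and it assumes that the peak table has a header row of sample names and that the group design file has no header row, as in the examples.

```python
import csv
import io

def read_peak_table(handle):
    """Read a tab-delimited peak table: header row of sample names,
    then one row per compound (identifier first, intensities after)."""
    rows = csv.reader(handle, delimiter="\t")
    header = next(rows)
    samples = header[1:]  # the first column holds the compound identifier
    peaks = {}
    for row in rows:
        if not row:
            continue  # skip blank lines
        if len(row) - 1 != len(samples):
            raise ValueError(f"row for {row[0]!r} does not match the header width")
        peaks[row[0]] = [float(v) for v in row[1:]]
    return samples, peaks

def read_group_design(handle):
    """Read a two-column tab-delimited file: samplename, groupname (no header assumed)."""
    return {row[0]: row[1] for row in csv.reader(handle, delimiter="\t") if len(row) >= 2}

def to_sample_rows(samples, peaks, groups):
    """Transpose to one row per sample and attach the group label of each sample."""
    missing = [s for s in samples if s not in groups]
    if missing:
        raise ValueError(f"samples missing from the group design file: {missing}")
    features = list(peaks)
    matrix = [[peaks[f][i] for f in features] for i in range(len(samples))]
    labels = [groups[s] for s in samples]
    return features, matrix, labels

# Two compounds and two samples taken from the examples above.
peak_text = (
    "\tHU_011\tHU_014\n"
    "1-Methyluric acid\t3.869006\t3.837704\n"
    "1-Methylxanthine\t3.717259\t3.776851\n"
)
design_text = "HU_011\tM\nHU_014\tF\n"

samples, peaks = read_peak_table(io.StringIO(peak_text))
groups = read_group_design(io.StringIO(design_text))
features, matrix, labels = to_sample_rows(samples, peaks, groups)
print(features)
print(labels)
print(matrix)
```

Sample names must match exactly between the two files; in this sketch a name that appears in the peak table but not in the group design file raises an error instead of being dropped silently.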

Parameters:

1.        maxRuns: maximum number of importance source runs. Increase it to resolve attributes left Tentative.

2.        pValue: confidence level, i.e. the significance threshold applied when attribute importances are compared with shadow importances.
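
The two parameters interact: a decision needs enough runs for the test to reach the pValue threshold. The Python sketch below illustrates this under a simplifying assumption that is not stated in this help text: each run counts as a "hit" when the attribute beats the best shadow, hits are tested against a Binomial(runs, 0.5) distribution, and the threshold is divided by the number of attributes (Bonferroni correction). The function names are hypothetical, and the tool's actual test may differ in detail.

```python
from math import comb

def binom_upper_tail(hits, runs):
    """P(X >= hits) for X ~ Binomial(runs, 0.5)."""
    return sum(comb(runs, k) for k in range(hits, runs + 1)) / 2 ** runs

def min_runs_to_confirm(p_value, n_features=1):
    """Smallest number of runs after which an attribute that scored a hit
    in every run falls below the (Bonferroni-adjusted) threshold."""
    threshold = p_value / n_features
    runs = 1
    while binom_upper_tail(runs, runs) >= threshold:
        runs += 1
    return runs

print(min_runs_to_confirm(0.01))                # 7, since 0.5**7 < 0.01 <= 0.5**6
print(min_runs_to_confirm(0.01, n_features=8))  # 10, since 0.5**10 < 0.00125 <= 0.5**9
```

Under these assumptions even a perfectly consistent attribute cannot be decided in fewer than this many runs, so a very small maxRuns will leave every attribute Tentative.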

Details:

Boruta iteratively compares the importances of attributes with the importances of shadow attributes, which are created by shuffling the original ones. Attributes whose importance is significantly worse than that of the shadows are consecutively dropped, while attributes that are significantly better than the shadows are marked Confirmed. Shadows are re-created in each iteration. The algorithm stops when only Confirmed attributes are left, or when it reaches maxRuns importance source runs. In the latter case, some attributes may be left without a decision; these are marked Tentative. You may try to extend maxRuns or lower pValue to resolve them, but in some cases their importances fluctuate too much for Boruta to converge.
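
The loop described above can be sketched in plain Python. This is a toy illustration, not the Boruta package or this tool's code: the importance score is a stand-in (a standardised gap between two group means) rather than Random Forest importance, the decision rule is an assumed Binomial(runs, 0.5) test with Bonferroni correction, and all function names and the synthetic data are hypothetical.

```python
import random
import statistics
from math import comb

def importance(column, labels):
    """Stand-in importance (NOT Random Forest): gap between the two
    group means divided by the overall spread of the column."""
    first = sorted(set(labels))[0]
    a = [v for v, g in zip(column, labels) if g == first]
    b = [v for v, g in zip(column, labels) if g != first]
    spread = statistics.pstdev(column)
    if not a or not b or spread == 0:
        return 0.0
    return abs(statistics.mean(a) - statistics.mean(b)) / spread

def tail(hits, runs, upper):
    """Binomial(runs, 0.5) tail: P(X >= hits) if upper, else P(X <= hits)."""
    ks = range(hits, runs + 1) if upper else range(0, hits + 1)
    return sum(comb(runs, k) for k in ks) / 2 ** runs

def boruta_sketch(columns, labels, max_runs=100, p_value=0.01, seed=0):
    """columns: one list of values per attribute. Returns one decision per attribute."""
    rng = random.Random(seed)
    n = len(columns)
    threshold = p_value / n  # Bonferroni correction (an assumption of this sketch)
    hits = [0] * n
    decision = ["Tentative"] * n
    for run in range(1, max_runs + 1):
        undecided = [j for j in range(n) if decision[j] == "Tentative"]
        if not undecided:
            break  # every attribute is Confirmed or Rejected
        kept = [j for j in range(n) if decision[j] != "Rejected"]
        # Shadows are re-created in each run by shuffling the kept attributes.
        shadow_max = max(
            importance(rng.sample(columns[j], len(columns[j])), labels) for j in kept
        )
        for j in undecided:
            if importance(columns[j], labels) > shadow_max:
                hits[j] += 1
            if tail(hits[j], run, upper=True) < threshold:
                decision[j] = "Confirmed"
            elif tail(hits[j], run, upper=False) < threshold:
                decision[j] = "Rejected"
    return decision

# Synthetic data: one attribute that separates the groups, two noise attributes.
labels = ["M"] * 10 + ["F"] * 10
data_rng = random.Random(1)
signal = [0.1 * i for i in range(10)] + [5 + 0.1 * i for i in range(10)]
noise_a = [data_rng.gauss(0, 1) for _ in labels]
noise_b = [data_rng.gauss(0, 1) for _ in labels]
decisions = boruta_sketch([signal, noise_a, noise_b], labels)
print(decisions)
```

Attributes still marked Tentative when the loop ends are the ones for which neither tail of the test dropped below the threshold within max_runs runs.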

In the boxplot, the three blue boxes represent the minimum, average, and maximum importance of the shadow attributes created in each iteration. The remaining green and red boxes show the importance of the real features over the iterations. A green box indicates that the feature is significant, while a red box indicates that it is not.
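
As a small illustration of what the three blue boxes summarise, the Python sketch below reduces the shadow importances of each iteration to their minimum, mean, and maximum; each resulting series would then be drawn as one box. The function name, the dictionary keys, and the numbers are illustrative assumptions, not output of this tool.

```python
import statistics

def shadow_summary(shadow_importances_per_run):
    """Reduce each run's shadow importances to min, mean and max.
    Each of the three returned series corresponds to one blue box."""
    return {
        "shadowMin": [min(run) for run in shadow_importances_per_run],
        "shadowMean": [statistics.mean(run) for run in shadow_importances_per_run],
        "shadowMax": [max(run) for run in shadow_importances_per_run],
    }

# Made-up shadow importances for two iterations with three shadows each.
history = [[0.25, 0.5, 0.75], [0.5, 1.0, 1.5]]
summary = shadow_summary(history)
print(summary)
```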

Output files:

1.      'Boruta_Decision_Info.txt': final result of the feature selection.

2.      'Boruta_Decision_Boxplot.pdf': boxplot of the feature importances (see Details).

Reference:

Miron B. Kursa, Witold R. Rudnicki (2010). Feature Selection with the Boruta Package. Journal of Statistical Software, 36(11), p. 1-13.