Introduction:
This tool performs feature selection with the Boruta algorithm. Boruta is an all-relevant feature selection wrapper algorithm, capable of working with any classification method that outputs a variable importance measure (VIM); by default, Boruta uses Random Forest. The method performs a top-down search for relevant features by comparing the importance of the original attributes with the importance achievable at random, estimated using their permuted copies, and progressively eliminating irrelevant features to stabilise that test.
Group information is supplied in a group design file (a Tab-delimited text file).
Input files:
1. Peak table file in Tab-delimited text format, with the first column as the compound identifier and the remaining columns as samples.
For example:
| | HU_011 | HU_014 | HU_015 | HU_017 | HU_018 | HU_019 |
| (2-methoxyethoxy)propanoic acid isomer | 3.019766 | 3.814339 | 3.519691 | 2.562183 | 3.781922 | 4.161074 |
| (gamma)Glu-Leu/Ile | 3.888479 | 4.277149 | 4.195649 | 4.32376 | 4.629329 | 4.412266 |
| 1-Methyluric acid | 3.869006 | 3.837704 | 4.102254 | 4.53852 | 4.178829 | 4.516805 |
| 1-Methylxanthine | 3.717259 | 3.776851 | 4.291665 | 4.432216 | 4.11736 | 4.562052 |
| 1,3-Dimethyluric acid | 3.535461 | 3.932581 | 3.955376 | 4.228491 | 4.005545 | 4.320582 |
| 1,7-Dimethyluric acid | 3.325199 | 4.025125 | 3.972904 | 4.109927 | 4.024092 | 4.326856 |
| 2-acetamido-4-methylphenyl acetate | 4.204754 | 5.181858 | 3.88568 | 4.237915 | 1.852994 | 4.080681 |
| 2-Aminoadipic acid | 4.080204 | 4.359246 | 4.249111 | 4.231404 | 4.323679 | 4.244485 |
2. Group design file in Tab-delimited text format with two columns (sample name, group name).
For example:
| HU_011 | M |
| HU_014 | F |
| HU_015 | M |
| HU_017 | M |
| HU_018 | M |
| HU_019 | M |
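The two input files have to agree on sample names, since each sample column in the peak table needs a group label. A minimal sketch of reading and aligning them is given below. It is not the tool's own loader: the function names are hypothetical, the inline text is a shortened copy of the examples above, and the group design file is assumed to have no header row.

```python
import csv
import io

# Shortened copies of the example files above (Tab-delimited).
PEAK_TABLE = (
    "\tHU_011\tHU_014\tHU_015\n"
    "1-Methyluric acid\t3.869006\t3.837704\t4.102254\n"
    "1-Methylxanthine\t3.717259\t3.776851\t4.291665\n"
)
GROUP_DESIGN = "HU_011\tM\nHU_014\tF\nHU_015\tM\n"


def read_peak_table(text):
    """Return {sample: {compound: value}}, i.e. the table transposed to samples."""
    rows = [r for r in csv.reader(io.StringIO(text), delimiter="\t") if r]
    samples = rows[0][1:]  # first header cell belongs to the compound column
    table = {sample: {} for sample in samples}
    for row in rows[1:]:
        compound, values = row[0], row[1:]
        for sample, value in zip(samples, values):
            table[sample][compound] = float(value)
    return table


def read_group_design(text):
    """Return {sample: group}; assumes two columns and no header row."""
    rows = csv.reader(io.StringIO(text), delimiter="\t")
    return {row[0]: row[1] for row in rows if row}


def align(table, groups):
    """Return the samples and their group labels in peak-table order."""
    missing = sorted(set(table) - set(groups))
    if missing:
        raise ValueError("samples without a group: " + ", ".join(missing))
    samples = list(table)
    return samples, [groups[s] for s in samples]


table = read_peak_table(PEAK_TABLE)
samples, labels = align(table, read_group_design(GROUP_DESIGN))
print(samples, labels)
```

A sample name that differs between the two files, even by a single character, would raise the error here instead of being silently dropped.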
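Parameters:
1. maxRuns: maximal number of importance source runs. You may increase it to resolve attributes left Tentative.
2. pValue: confidence level used when deciding whether an attribute's importance differs significantly from that of the shadow attributes.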
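The interaction between the two parameters can be illustrated with a simplified significance test. The sketch below assumes that, for an irrelevant attribute, beating the shadow attributes in a given run is a fair coin flip, and it ignores any multiple-testing correction the actual Boruta package may apply. The function names are hypothetical.

```python
from math import comb


def upper_tail(hits, runs):
    """P(X >= hits) for X ~ Binomial(runs, 0.5)."""
    return sum(comb(runs, k) for k in range(hits, runs + 1)) / 2 ** runs


def decide(hits, runs, p_value):
    """Simplified decision for one attribute after `runs` importance runs."""
    if upper_tail(hits, runs) < p_value:
        return "Confirmed"
    # By symmetry of Binomial(runs, 0.5), this is P(X <= hits).
    if upper_tail(runs - hits, runs) < p_value:
        return "Rejected"
    return "Tentative"


def min_runs_to_confirm(p_value):
    """Fewest runs after which an attribute that always wins can be Confirmed."""
    runs = 1
    while upper_tail(runs, runs) >= p_value:
        runs += 1
    return runs


print(decide(15, 20, 0.01))       # Tentative: 75% wins, but only 20 runs
print(decide(75, 100, 0.01))      # Confirmed: same win rate over 100 runs
print(min_runs_to_confirm(0.01))  # 7
```

Under these assumptions the same win rate that is left Tentative after 20 runs becomes Confirmed after 100, which is why increasing maxRuns can resolve Tentative attributes.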
Details:
Boruta iteratively compares the importances of attributes with the importances of shadow attributes, created by shuffling the original ones. Attributes that have significantly worse importance than the shadow ones are consecutively dropped. On the other hand, attributes that are significantly better than the shadows are admitted to be Confirmed. Shadows are re-created in each iteration. The algorithm stops when only Confirmed attributes are left, or when it reaches maxRuns importance source runs. If the second scenario occurs, some attributes may be left without a decision; these are marked Tentative. You may try to extend maxRuns or lower pValue to clarify them, but in some cases their importances fluctuate too much for Boruta to converge.
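The shadow comparison described above can be sketched in a few lines. This is a toy illustration only: it replaces the Random Forest importance with a stand-in score (the absolute difference between the two group means), uses invented data, and merely counts how often each attribute beats the best shadow, without the statistical test or the removal of rejected attributes. All names are hypothetical.

```python
import random


def importance(values, labels):
    """Stand-in importance score: absolute difference of the two group means."""
    means = []
    for group in sorted(set(labels)):
        vals = [v for v, lab in zip(values, labels) if lab == group]
        means.append(sum(vals) / len(vals))
    return abs(means[0] - means[1])


def count_hits(features, labels, n_runs, seed=0):
    """Count, per attribute, the runs in which it beats the best shadow."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(n_runs):
        # Shadows are re-created in every run by shuffling each attribute.
        shadow_scores = []
        for values in features.values():
            shadow = values[:]
            rng.shuffle(shadow)
            shadow_scores.append(importance(shadow, labels))
        threshold = max(shadow_scores)
        for name, values in features.items():
            if importance(values, labels) > threshold:
                hits[name] += 1
    return hits


# Invented data: one attribute separates the groups, one is pure noise.
labels = ["M"] * 10 + ["F"] * 10
data_rng = random.Random(1)
features = {
    "signal": [0.0] * 10 + [10.0] * 10,
    "noise": [data_rng.random() for _ in labels],
}
hits = count_hits(features, labels, n_runs=20)
print(hits)
```

In this toy setup the separating attribute should collect far more hits than the noise attribute. Hit counts of this kind are what a significance test would then turn into Confirmed, Rejected, or Tentative decisions.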
In the boxplot, the first three blue boxes represent the minimum, average, and maximum importance values of the shadow attributes created in each iteration. The remaining green and red boxes show the importance of the real features across iterations. A red box indicates that the feature is not significant, while a green box indicates that it is significant.
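As a rough illustration of where the three blue boxes come from, the sketch below reduces the shadow importances of each iteration to their minimum, average, and maximum. The numbers are invented for illustration and the function name is hypothetical; the tool itself derives these values from the Boruta run.

```python
from statistics import mean

# Invented shadow importances: one inner list per iteration.
shadow_history = [
    [0.2, 1.1, 0.7],
    [0.4, 0.9, 0.5],
    [0.1, 1.3, 0.4],
]


def summarise_shadows(history):
    """Collapse each iteration's shadow importances to min, mean, and max."""
    return {
        "min": [min(run) for run in history],
        "mean": [mean(run) for run in history],
        "max": [max(run) for run in history],
    }


summary = summarise_shadows(shadow_history)
for name, series in summary.items():
    print(name, series)
```

Each of the three resulting series holds one value per iteration, and a boxplot of a series gives one blue box.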
Output files:
1. 'Boruta_Decision_Info.txt': final result of the feature selection.
2. 'Boruta_Decision_Boxplot.pdf': boxplot of the attribute importances, as described above.
Reference:
Miron B. Kursa, Witold R. Rudnicki (2010). Feature Selection with the Boruta Package. Journal of Statistical Software, 36(11), 1-13.