Multifactorial diseases arise from complex patterns of interaction between a set of genetic traits and the environment. To fully capture the genetic biomarkers that jointly explain the heritability component of a disease, all SNPs from a genome-wide association study should therefore be analyzed simultaneously.
In this paper, we present Bag of Naïve Bayes (BoNB), an algorithm for genetic biomarker selection and subject classification from the simultaneous analysis of genome-wide SNP data. BoNB is based on the Naïve Bayes classification framework, enriched by three main features: bootstrap aggregating of an ensemble of Naïve Bayes classifiers, a novel strategy for ranking and selecting the attributes used by each classifier in the ensemble, and a permutation-based procedure for selecting significant biomarkers, based on their marginal utility in the classification process. BoNB is tested on the Wellcome Trust Case-Control study on Type 1 Diabetes, and its performance is compared with that of both a standard Naïve Bayes algorithm and HyperLASSO, a state-of-the-art penalized logistic regression algorithm for simultaneous genome-wide data analysis.
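To make the first of these three features concrete, the following is a minimal sketch of bootstrap aggregating over Naïve Bayes classifiers. It is not the BoNB implementation: the type and function names (`Genotypes`, `NaiveBayes`, `BaggedNaiveBayes`) are hypothetical, genotypes are assumed to be coded 0, 1 or 2, Laplace smoothing and majority voting are assumptions, and the attribute ranking and permutation-based selection described above are deliberately left out.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Assumed coding: genotype in {0, 1, 2}, label 0 = control, 1 = case.
using Genotypes = std::vector<std::vector<int>>;           // [subject][snp]
using Table = std::array<std::array<double, 3>, 2>;        // [label][genotype]

struct NaiveBayes {
    std::array<double, 2> logPrior{};
    std::vector<Table> logLik;                             // one table per SNP

    // Train on the subjects listed in `rows` (indices may repeat).
    void train(const Genotypes& x, const std::vector<int>& y,
               const std::vector<std::size_t>& rows) {
        const std::size_t nSnps = x[0].size();
        std::array<double, 2> classCount{1.0, 1.0};        // Laplace smoothing
        std::vector<Table> count(nSnps);
        for (auto& snp : count)
            for (auto& cls : snp) cls.fill(1.0);           // Laplace smoothing
        for (std::size_t r : rows) {
            classCount[y[r]] += 1.0;
            for (std::size_t j = 0; j < nSnps; ++j) count[j][y[r]][x[r][j]] += 1.0;
        }
        logLik.assign(nSnps, Table{});
        for (int c = 0; c < 2; ++c) {
            logPrior[c] = std::log(classCount[c] / (classCount[0] + classCount[1]));
            for (std::size_t j = 0; j < nSnps; ++j) {
                const double n = count[j][c][0] + count[j][c][1] + count[j][c][2];
                for (int g = 0; g < 3; ++g) logLik[j][c][g] = std::log(count[j][c][g] / n);
            }
        }
    }

    // Return the label with the larger log posterior score.
    int predict(const std::vector<int>& subject) const {
        double score[2] = {logPrior[0], logPrior[1]};
        for (std::size_t j = 0; j < logLik.size(); ++j)
            for (int c = 0; c < 2; ++c) score[c] += logLik[j][c][subject[j]];
        return score[1] > score[0] ? 1 : 0;
    }
};

struct BaggedNaiveBayes {
    std::vector<NaiveBayes> members;

    // Train one classifier per bootstrap replicate (sampling with replacement).
    void train(const Genotypes& x, const std::vector<int>& y,
               std::size_t nReplicates, unsigned seed) {
        assert(!x.empty() && x.size() == y.size());
        std::mt19937 rng(seed);
        std::uniform_int_distribution<std::size_t> pick(0, x.size() - 1);
        members.assign(nReplicates, NaiveBayes{});
        for (auto& nb : members) {
            std::vector<std::size_t> rows(x.size());
            for (auto& r : rows) r = pick(rng);
            nb.train(x, y, rows);
        }
    }

    // Majority vote over the ensemble (an assumption of this sketch).
    int predict(const std::vector<int>& subject) const {
        std::size_t votes = 0;
        for (const auto& nb : members) votes += nb.predict(subject);
        return 2 * votes > members.size() ? 1 : 0;
    }
};
```

A caller would fill a `Genotypes` matrix and a label vector, call `train` with a chosen number of replicates and a seed, and then call `predict` on each held-out subject. Every member here sees all SNPs, whereas BoNB selects a ranked subset of attributes for each classifier.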
The significantly higher classification accuracy obtained by BoNB, together with the significance of the biomarkers identified from the Type 1 Diabetes dataset, demonstrates the effectiveness of BoNB as an algorithm for both classification and biomarker selection from genome-wide SNP data.
The source code of the BoNB algorithm is released under the GNU General Public License and is available at http://www.dei.unipd.it/~sambofra/bonb.html.
BoNB is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License.
If you use the software for your research, please refer to the original BoNB paper with the citation below.
C++ source code of the BoNB algorithm can be downloaded here.
Download the tar archive, move to its location and decompress it:
tar -xf bonb-X.Y.tar.gz
then change directory and compile:
Usage: bonb [OPTIONS] root_file_name

root_file_name is the common file name, without extension, of a triplet of
.bim, .fam and .bed files, in PLINK binary format, SNP major
(http://pngu.mgh.harvard.edu/~purcell/plink/).

Options:
  -s, --seed=NUMBER       Seed of the random number generator (default 0)
  -b, --bootstrap=NUMBER  Number of bootstrap replicates (default 200)
  -t, --threshold=NUMBER  Threshold of the squared correlation coefficient
                          below which two SNPs at distance < 1Mb are
                          considered independent (default 0.1)
  -c, --crossval=NUMBER   Sample at random NUMBER% of the subjects and use
                          them as an independent set to assess MCC and
                          classification accuracy of the classifier
                          (default 0)
  -S, --simple            Use a simple Naive Bayes classifier, instead of
                          the ensemble of classifiers, trained on all SNPs
                          with p-value of association < 5e-7 for a general
                          2df chi-square test
  -v, --verbose           Outputs additional information on the training
                          process and the list of attributes used by the
                          classifiers in the ensemble, with the
                          corresponding marginal utilities. If used in
                          conjunction with -S, outputs the list of SNPs
                          passing the chi-square test
  -h, --help              Print this help and exit
  -m, --multiple          For data in multiple triplets, one triplet per
                          chromosome, in the form:
                            root_file_name01.bed root_file_name02.bed ... root_file_name22.bed
                            root_file_name01.bim root_file_name02.bim ... root_file_name22.bim
                            root_file_name01.fam root_file_name02.fam ... root_file_name22.fam
                          Note: analysis of sex chromosomes is, for the
                          moment, disabled

Mandatory arguments to long options are also mandatory for any corresponding
short options.
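The usage text expects genotypes in PLINK binary format, SNP major. The sketch below shows how that layout can be decoded, which may help when checking input files. It follows the published PLINK .bed layout, not the BoNB source; the names `Genotype` and `decodeBed` are hypothetical, and the function assumes the file has already been read into memory and that the subject and SNP counts have been taken from the line counts of the .fam and .bim files.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Two-bit genotype codes as stored in a PLINK .bed file.
enum class Genotype : std::uint8_t {
    HomA1 = 0,    // homozygous for the first allele in the .bim file
    Missing = 1,  // missing genotype
    Het = 2,      // heterozygous
    HomA2 = 3     // homozygous for the second allele in the .bim file
};

// Decode the raw bytes of a SNP-major .bed file into a [snp][subject] matrix.
std::vector<std::vector<Genotype>> decodeBed(const std::vector<std::uint8_t>& bytes,
                                             std::size_t nSubjects,
                                             std::size_t nSnps) {
    // Zero counts are treated as a caller error rather than bad input.
    assert(nSubjects > 0 && nSnps > 0);

    // The first two bytes are the magic number, the third is the mode flag.
    if (bytes.size() < 3 || bytes[0] != 0x6C || bytes[1] != 0x1B)
        throw std::runtime_error("not a PLINK .bed file");
    if (bytes[2] != 0x01)
        throw std::runtime_error("file is not in SNP-major mode");

    // Each SNP occupies a whole number of bytes, four subjects per byte.
    const std::size_t bytesPerSnp = (nSubjects + 3) / 4;
    if (bytes.size() != 3 + nSnps * bytesPerSnp)
        throw std::runtime_error("file size does not match subject/SNP counts");

    std::vector<std::vector<Genotype>> out(nSnps, std::vector<Genotype>(nSubjects));
    for (std::size_t s = 0; s < nSnps; ++s) {
        for (std::size_t i = 0; i < nSubjects; ++i) {
            const std::uint8_t byte = bytes[3 + s * bytesPerSnp + i / 4];
            // The first subject sits in the two lowest-order bits of the byte.
            const unsigned code = (byte >> (2 * (i % 4))) & 0x3u;
            out[s][i] = static_cast<Genotype>(code);
        }
    }
    return out;
}
```

The mode check also explains why the usage text specifies SNP major: a subject-major file has the same magic number but a different third byte, so this decoder rejects it instead of misreading it.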
Sambo F, Trifoglio E, Di Camillo B, Toffolo GM, Cobelli C: Bag of Naïve Bayes: biomarker selection and classification from genome-wide SNP data. BMC Bioinformatics 2012, 13(Suppl 14):S2.