Online Material for the article:

Bag of Naïve Bayes: biomarker selection and classification from genome-wide SNP data.

Authors: Francesco Sambo, Emanuele Trifoglio, Barbara Di Camillo, Gianna Maria Toffolo and Claudio Cobelli
Last updated: Jan 2014

Table of Contents
  1. Abstract
  2. Source Code
  3. Installation Instructions
  4. Software Usage
  5. Citation

1. Abstract

Background
Multifactorial diseases arise from complex patterns of interaction between a set of genetic traits and the environment. To fully capture the genetic biomarkers that jointly explain the heritability component of a disease, all SNPs from a genome-wide association study should therefore be analyzed simultaneously.

Results
In this paper, we present Bag of Naïve Bayes (BoNB), an algorithm for genetic biomarker selection and subject classification from the simultaneous analysis of genome-wide SNP data. BoNB is based on the Naïve Bayes classification framework, enriched by three main features: bootstrap aggregating of an ensemble of Naïve Bayes classifiers, a novel strategy for ranking and selecting the attributes used by each classifier in the ensemble, and a permutation-based procedure for selecting significant biomarkers, based on their marginal utility in the classification process. BoNB is tested on the Wellcome Trust Case-Control study on Type 1 Diabetes and its performance is compared with that of both a standard Naïve Bayes algorithm and HyperLASSO, a state-of-the-art penalized logistic regression algorithm for simultaneous genome-wide data analysis.
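
The bootstrap aggregating idea mentioned above can be illustrated with a minimal Python sketch. This is not the BoNB implementation: it omits the attribute ranking and selection strategy and the permutation-based biomarker selection, and it assumes genotypes coded as 0, 1 or 2, binary case/control labels, Laplace smoothing and majority voting. All function names and the toy data are hypothetical.

```python
import math
import random
from collections import Counter

def train_nb(X, y, n_levels=3, alpha=1.0):
    """Fit a categorical Naive Bayes model with Laplace smoothing (assumed).

    X: list of genotype rows (values 0, 1, 2), y: list of 0/1 labels.
    Returns log class priors and per-attribute log conditional probabilities.
    """
    classes = (0, 1)
    n_attr = len(X[0])
    log_prior, log_cond = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log((len(rows) + alpha) / (len(X) + alpha * len(classes)))
        denom = len(rows) + alpha * n_levels
        log_cond[c] = []
        for j in range(n_attr):
            counts = Counter(row[j] for row in rows)
            log_cond[c].append([math.log((counts[g] + alpha) / denom)
                                for g in range(n_levels)])
    return log_prior, log_cond

def predict_nb(model, x):
    """Return the class with the highest posterior log score for genotype row x."""
    log_prior, log_cond = model
    scores = {c: log_prior[c] + sum(log_cond[c][j][g] for j, g in enumerate(x))
              for c in log_prior}
    return max(scores, key=scores.get)

def train_bag(X, y, n_boot=200, seed=0):
    """Train one Naive Bayes classifier per bootstrap replicate of the subjects."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample subjects with replacement
        models.append(train_nb([X[i] for i in idx], [y[i] for i in idx]))
    return models

def predict_bag(models, x):
    """Aggregate the ensemble by majority vote (an assumption of this sketch)."""
    votes = Counter(predict_nb(m, x) for m in models)
    return votes.most_common(1)[0][0]

# Toy data: the first SNP is associated with the label, the second is mostly noise.
X = [[0, 1], [0, 2], [0, 0], [1, 1], [2, 1], [2, 0], [2, 2], [1, 0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
models = train_bag(X, y, n_boot=50, seed=0)
print(predict_bag(models, [0, 1]), predict_bag(models, [2, 0]))
```

Training each classifier on a resampled set of subjects and voting reduces the variance of a single Naïve Bayes model; the sketch uses all attributes in every classifier, whereas BoNB selects them per classifier.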

Conclusions
The significantly higher classification accuracy obtained by BoNB, together with the significance of the biomarkers identified from the Type 1 Diabetes dataset, proves the effectiveness of BoNB as an algorithm for both classification and biomarker selection from genome-wide SNP data.

Availability
Source code of the BoNB algorithm is released under the GNU General Public License and is available at http://www.dei.unipd.it/~sambofra/bonb.html.

2. Source Code

BoNB is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License.

If you use the software for your research, please refer to the original BoNB paper with the citation below.

C++ source code of the BoNB algorithm can be downloaded here.

3. Installation Instructions

Download the tar archive, move to its location and decompress it:

tar -xf bonb-X.Y.tar.gz

then change directory and compile:

cd bonb
make

to compile the sequential version, or

make parallel

for the OpenMP parallel version.

4. Software Usage

     Usage: bonb [OPTIONS] root_file_name

     root_file_name is the common file name, without extension, of a triplet of
     .bim, .fam and .bed files, in PLINK binary format, SNP major
     (http://pngu.mgh.harvard.edu/~purcell/plink/).

     Options:

       -s, --seed=NUMBER          Seed of the random number generator (default 0)

       -b, --bootstrap=NUMBER     Number of bootstrap replicates (default 200)

       -t, --threshold=NUMBER     Threshold of the squared correlation coefficient
                                  below which two SNPs at distance < 1Mb are
                                  considered independent (default 0.1)

       -c, --crossval=NUMBER      Sample at random NUMBER% of the subjects and use
                                  them as an independent set to assess MCC and
                                  classification accuracy of the classifier
                                  (default 0)

       -C, --covdatafile=FNAME    Tab delimited file containing numerical
                                  covariates in each column, with one row for each
                                  subject and no header (must be provided together
                                  with -i)

       -i, --covinfofile=FNAME    Tab delimited file containing, for each line,
                                  information on a specific covariate to be taken
                                  from the covariate data file. Information must
                                  be in the form:

       col_num    cov_name    cov_type    cov_levels

                                  where col_num is the column number in the
                                  covariates file (starting from 0), cov_name is
                                  the name of the covariate, cov_type can be
                                  either D (discrete) or C (continuous) and
                                  cov_levels is the number of levels of the
                                  covariate, if discrete, or the number of levels
                                  to be used for discretizing a continuous
                                  covariate (must be provided together with -C).

       -S, --simple               Use a simple Naive Bayes classifier, instead of
                                  the ensemble of classifiers, trained on all SNPs
                                  with p-value of association < 5e-7 for a general
                                  2df chi-square test

       -v, --verbose              Outputs additional information on the training
                                  process and the list of attributes used by the
                                  classifiers in the ensemble, with the
                                  corresponding marginal utilities. If used in
                                  conjunction with -S, outputs the list of SNPs
                                  passing the chi-square test

       -h, --help                 Print this help and exit

       -m, --multiple             For data in multiple triplets, one triplet per
                                  chromosome, in the form:

       root_file_name01.bed   root_file_name02.bed   ...   root_file_name22.bed
       root_file_name01.bim   root_file_name02.bim   ...   root_file_name22.bim
       root_file_name01.fam   root_file_name02.fam   ...   root_file_name22.fam

     Note: analysis of sex chromosomes is, for the moment, disabled

     Mandatory arguments to long options are also mandatory for any corresponding
     short options.
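
As an illustration of the covariate file formats described for -C and -i, the following Python sketch writes a small pair of files and assembles a matching command line. The file names, the covariates (sex and age), the root name "mydata" and the "./bonb" path are all hypothetical, and the sketch assumes that rows of the covariate data file follow the subject order of the .fam file.

```python
import csv
import os
import tempfile

def write_covariate_files(rows, info, data_path, info_path):
    """Write the two tab-delimited files passed to -C and -i (no header rows)."""
    with open(data_path, "w", newline="") as f:
        csv.writer(f, delimiter="\t", lineterminator="\n").writerows(rows)
    with open(info_path, "w", newline="") as f:
        csv.writer(f, delimiter="\t", lineterminator="\n").writerows(info)

# Hypothetical covariates: column 0 = sex (discrete, 2 levels),
# column 1 = age in years (continuous, discretized into 3 levels).
rows = [[1, 34.5], [2, 51.0], [1, 42.2]]           # one row per subject
info = [[0, "sex", "D", 2], [1, "age", "C", 3]]    # col_num, cov_name, cov_type, cov_levels

workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, "covariates.tsv")
info_path = os.path.join(workdir, "covinfo.tsv")
write_covariate_files(rows, info, data_path, info_path)

# Command line built only from the options listed in the usage text;
# "mydata" stands for the root name of mydata.bed, mydata.bim and mydata.fam.
cmd = ["./bonb", "-b", "200", "-t", "0.1", "-c", "20",
       "-C", data_path, "-i", info_path, "-v", "mydata"]
print(" ".join(cmd))
```

The printed command can be pasted into a shell from the directory where bonb was compiled; the values given to -b and -t are the documented defaults, while -c 20 holds out 20% of the subjects.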

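For readers unfamiliar with the two measures reported when -c is used, the sketch below computes classification accuracy and the Matthews correlation coefficient (MCC) from true and predicted labels, using the standard definitions. It does not reproduce BoNB's internal code; the function names, the 0/1 label coding and the example labels are hypothetical.

```python
import math

def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels (1 = case)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(y_true, y_pred):
    """Fraction of subjects classified correctly."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(y_true, y_pred):
    """Matthews correlation coefficient; defined as 0 when the denominator is 0."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical held-out subjects: 4 cases and 4 controls.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(accuracy(y_true, y_pred))  # 0.75
print(mcc(y_true, y_pred))       # 0.5
```

Unlike accuracy, MCC stays informative when cases and controls are unbalanced, since it uses all four cells of the confusion matrix.
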
5. Citation

Sambo F, Trifoglio E, Di Camillo B, Toffolo GM, Cobelli C: Bag of Naïve Bayes: biomarker selection and classification from genome-wide SNP data. BMC Bioinformatics 2012, 13(Suppl 14):S2.
