Alignment-Free Phylogeny of Whole Genomes using Underlying Subwords
With the progress of modern sequencing technologies a large number
of complete genomes are now available. Traditionally the comparison
of two related genomes is carried out by sequence alignment. There
are cases where these techniques cannot be applied, for example if
two genomes do not share the same set of genes, or if they are not
alignable to each other due to low sequence similarity, rearrangements and inversions, or more specifically to their lengths when the organisms belong to different species.
For these cases the comparison of complete genomes can be carried out
only with ad hoc methods that are usually called alignment-free methods.
In this paper we propose a distance function based on subword
compositions called Underlying Approach (UA).
We prove that the matching statistics, a popular concept in the field of
string algorithms able to capture the statistics of common words
between two sequences, can be derived from a small set of
``independent'' subwords, namely the irredundant common subwords. We
define a distance-like measure based on these subwords, such that
each region of genomes contributes only once, thus avoiding to
count shared subwords a multiple number of times. In a nutshell, this filter discards
subwords occurring in regions covered by other more significant
The Underlying Approach (UA) builds a scoring function based on this
set of patterns, called underlying. We prove that this set is by
construction linear in the size of input, without overlaps, and can be efficiently constructed. Results show the validity of our
method in the reconstruction of phylogenetic trees, where the
Underlying Approach outperforms the current state of the art
methods. Moreover, we show that the accuracy of UA is achieved with a
very small number of subwords, which in some cases carry meaningful
Here you can find the java application UnderlyingApproach with an example.
Unzip the following file:
Run UnderlyingApproach using the command:
java -jar UnderlyingApproach.jar Sequences.txt NumberOfSequences MinLength OutputScoringTabele.txt
Where the file "Sequences.txt" is a text file with one genome per line (see test.fa),
"NumberOfSequences" is the number of genomes in the input file "Sequences.txt"
and "MinLength" is the minimum length of subwords used (we suggest to use the value 2).
To run the example included type:
java -jar UnderlyingApproach.jar test.fa 4 2 output.txt
The software is freely available for academic use.
For questions about the tool, please contact Matteo Comin
or Davide Verzotto.
Please cite the following papers:
M. Comin, D. Verzotto,
"Alignment-Free Phylogeny of Whole Genomes using Underlying Subwords",
In BMC Algorithms for Molecular Biology 2012, 7:34.
M.Comin, D. Verzotto,
"Whole-Genome Phylogeny by Virtue of Unic Subwords"
Proceedings of 23rd International Workshop on Database and Expert Systems Applications, BIOKDD 2012.