## Assembly-free Genome Comparison based on Next-Generation Sequencing Reads and Variable Length Patterns

Abstract:
Background
With the advent of Next-Generation Sequencing (NGS) technologies, a large amount of short read data has been generated. If a reference genome is not available, the assembly of a template sequence is usually challenging because of repeats and the short length of reads. When NGS reads cannot be mapped onto a reference genome, alignment-based methods are not applicable. However, it is still possible to study the evolutionary relationship of unassembled genomes based on NGS data.
Results
We present a parameter-free, alignment-free method, called $\overline{Under_2}$, based on variable-length patterns, for the direct comparison of sets of NGS reads. We define a similarity measure using variable-length patterns, as well as their reverses and reverse-complements, along with their statistical and syntactical properties. We evaluate several alignment-free statistics on the comparison of NGS reads coming from simulated and real genomes. In almost all simulations, our method $\overline{Under_2}$ outperforms all other statistics. The performance gain becomes more evident when real genomes are used.
Conclusion
The new alignment-free statistic is highly successful in discriminating related genomes based on NGS read data. In almost all experiments, it outperforms traditional alignment-free statistics that are based on fixed-length patterns.
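
For readers unfamiliar with the fixed-length baselines mentioned above, the following is a minimal sketch of one such statistic, the classic $D_2$ count (the inner product of k-mer count vectors), computed directly on two sets of reads. It is not the $\overline{Under_2}$ measure and is not taken from this software; the function names, the choice to count both strands, and the handling of non-ACGT characters are illustrative assumptions.

```python
from collections import Counter

# Translation table for complementing DNA bases (assumes upper-case ACGT input).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmer_counts(reads, k):
    """Count fixed-length k-mers over all reads and their reverse complements.

    k-mers containing characters other than A, C, G, T (e.g. N) are skipped;
    this filtering is an assumption of the sketch.
    """
    counts = Counter()
    for read in reads:
        for strand in (read, reverse_complement(read)):
            for i in range(len(strand) - k + 1):
                kmer = strand[i:i + k]
                if set(kmer) <= set("ACGT"):
                    counts[kmer] += 1
    return counts


def d2(reads_a, reads_b, k):
    """Classic D2 statistic: sum over k-mers of count_a(w) * count_b(w)."""
    counts_a = kmer_counts(reads_a, k)
    counts_b = kmer_counts(reads_b, k)
    # Counter returns 0 for k-mers absent from counts_b.
    return sum(n * counts_b[w] for w, n in counts_a.items())
```

Note that this baseline requires choosing `k` in advance, which is the kind of parameter a variable-length, parameter-free approach is meant to avoid.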

SOON AVAILABLE.

### Licence

The software is freely available for academic use.