Alignment-free genome comparison based on Sequencing Reads and Quality Values
Sequencing technologies are generating enormous amounts of
read data, however the assembly of genomes and metagenomes, from fragments,
is still one of the most challenging task. If a reference genome is not available the basic
task of reads mapping cannot be carried out. For metagenomic
communities the problem is even more challenging because we can not
know in advance which genomes are present in the metagenomic sample.
In this paper we study the comparison of genomes and metagenomes based on
read data, without assembly, using alignment-free sequence statistics.
Quality scores produced by sequencing platforms are fundamental for various
analysis, like reads mapping and error detection. Moreover
future-generation sequencing platforms, e.g. PacBio and MinION, will produce long
reads but with an error rate around 15\%.
In this context it will be fundamental to exploit quality value information
within the framework of alignment-free statistics.
In this paper we present a family of alignment-free measures, called $d^q$-type,
that are based on $k$-mer counts and quality values. These statistics can be used
to compare genomes and metagenomes based on their read sets, without the need
to assembly. Results on simulated genomes show that the evolutionary relationship of genomes can be reconstructed based on the direct comparison of sets of reads.
The use of quality values on average improves the classification accuracy, and
its contribution increases when the reads are more noisy.
Preliminary experiments on microbial communities show that similar metagenomes can be detected just by processing their read data.
The zip file contains source code (file c2q.cpp) and example reads files (reads subdirectory).
To compile 'make' can be used or the line (the former includes -O3 optimization flag)
c++ c2q.cpp -o c2q
The program c2q requires 4 mandatory arguments
c2q [k] [list] [type] [M]
k k-mer length
list a file containing a list of input sequences file in the
form [fast file] [seq name]
type either Q for quality value based output or A for non
M length of the reads (needed to compute averages)
To run the provided example with k=5 use the command
c2q 5 reads.txt Q 300
The software outputs (along with several support files) the 4 matrices d2,d2*,d2q and d2q* in 4 files with extension .out
The software is freely available for academic use.
For questions about the tool, please contact Matteo Comin.
Please cite the following paper:
"Fast Comparison of Genomic and Meta-Genomic Reads with Alignment-Free Measures based on Quality Values",
Accepted for publication in TBC 2015.