Bioinformatics and Computational Biology

Research activities:

With thousands of genomes made available by next-generation sequencing technologies, one of the core challenges for bioinformaticians is how to analyze and compare them on a large scale. In this context, it is essential to develop efficient algorithms and tools that can handle whole genomes, represented either as long sequences or as huge sets of reads, using appropriate data structures and combinatorial pattern matching techniques. Current research includes:

  • Design and development of alignment-free techniques, in particular models that account for biological variability through approximate components
  • Design and development of algorithms based on efficient data structures to speed up sequence analysis and to handle larger datasets
  • Design and development of tools for data analysis with applications to phylogenetics, metagenomics, and motif discovery
  • Design of models to better characterize the content of biological sequences (approximate-pattern based models, entropic profiles, weighted patterns) and fast algorithms to compute related statistics
  • Theoretical studies of mathematical models, data structures and combinatorial properties of strings: the outcome of this more abstract research line makes it possible to develop conceptual tools for sequence analysis that also have potential applications in several other contexts (e.g., text analysis, time series analysis, social data analysis)
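
As an illustration of the alignment-free idea mentioned in the list above, the following is a minimal sketch that compares two sequences through their k-mer count profiles rather than through an alignment. It is not the group's method: the function names `kmer_counts` and `cosine_distance`, the choice of cosine distance, and the decision to skip k-mers containing characters other than A, C, G, T are all assumptions made for the example, and exact k-mers are used instead of the approximate patterns described above.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in seq.

    Assumption for this sketch: k-mers containing symbols other than
    A, C, G, T (for example N) are skipped.
    """
    seq = seq.upper()
    alphabet = set("ACGT")
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= alphabet:
            counts[kmer] += 1
    return counts


def cosine_distance(a: Counter, b: Counter) -> float:
    """Return 1 - cosine similarity between two k-mer count profiles."""
    dot = sum(a[x] * b[x] for x in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty profile shares nothing with any other profile.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


if __name__ == "__main__":
    p = kmer_counts("ACGTACGTACGT", 3)
    q = kmer_counts("ACGTTCGTACGA", 3)
    print(cosine_distance(p, q))
```

Because only k-mer counts are compared, the same functions can be applied to a set of unassembled reads by summing the per-read counters, which is one reason such profiles scale to large datasets.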

People: Cinzia Pizzi (contact person), Matteo Comin, Fabio Vandin

Modern sequencing technologies generate data more efficiently, more economically, and at greater depth than previously possible. This has fostered a number of sequencing-based applications such as genome re-sequencing, RNA-Seq, and ChIP-Seq. However, the volume of data generated is growing at a pace that now challenges the storage and data processing capacities of modern computer systems. In particular, core research activities in the field are:

  • Comparison of unassembled genomes with alignment-free techniques
  • Boosting assembly with read clustering
  • Compression of sequencing data, quality score sparsification
  • Microbial community analysis: metagenomic read binning, abundance rate estimation
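
To give a concrete feel for why reducing the variability of quality scores helps compression, here is a small sketch that quantizes Phred quality scores into coarse bins and measures the empirical entropy before and after. This is a deliberately simpler stand-in, uniform binning, and not the sparsification techniques named in the list above; the bin width of 10 and the function names are assumptions made for the example. The only external convention relied on is the common Phred+33 FASTQ encoding.

```python
from collections import Counter
from math import log2


def decode_phred(quality: str, offset: int = 33) -> list[int]:
    """Convert a FASTQ quality string to integer Phred scores (Phred+33 by default)."""
    return [ord(c) - offset for c in quality]


def quantize(scores: list[int], bin_width: int = 10) -> list[int]:
    """Map each score to the lower edge of its bin (hypothetical uniform binning)."""
    return [(q // bin_width) * bin_width for q in scores]


def entropy_bits(symbols: list[int]) -> float:
    """Empirical zeroth-order entropy, in bits per symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * log2(c / n) for c in counts.values())


if __name__ == "__main__":
    scores = decode_phred("IIIIHHGF#5:BBDD?")
    binned = quantize(scores)
    # Fewer distinct symbols cannot increase the empirical entropy.
    print(entropy_bits(scores), entropy_bits(binned))
```

The entropy after binning is never higher than before, since binning only merges symbols; the open question in practice is how much information downstream analyses lose, which this sketch does not address.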

People: Matteo Comin (contact person), Cinzia Pizzi

Next-generation sequencing technologies allow the collection of massive amounts of genomic measurements, including somatic mutations, in large cohorts of cancer patients. Analyzing these data poses many computational challenges and requires the design of efficient and rigorous algorithmic techniques. We design efficient and mathematically well-founded computational and statistical methods to solve problems that arise in the analysis of large datasets from cancer studies, with a major focus on the identification of mutations and genomic features associated with the disease. Specific areas of investigation include:

  • Finding significantly mutated pathways: efficient methods to find groups of interacting genes that are significantly mutated in cancer using various computational techniques (e.g., analysis of large interaction networks, discovery of combinatorial patterns of mutations such as mutual exclusivity)
  • Discovery of mutations associated with clinical parameters: rigorous and efficient methods for the identification of groups of genes with mutations associated with survival time, drug response, etc.
  • Inference of cancer evolution from sequencing data: combinatorial and statistical methods for the reconstruction of cancer evolution from sequencing data (e.g., cross-sectional datasets or multiple samples from one patient)
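
The notion of exclusivity between mutations in the list above can be illustrated with a textbook calculation for a single pair of genes. The sketch below computes a one-sided hypergeometric p-value for observing as few or fewer co-mutated samples as were seen, assuming mutations in the two genes are placed independently across samples. It is a generic statistical test chosen for the example, not the methods developed by the group, and the function names and sample identifiers are hypothetical.

```python
from math import comb


def exclusivity_pvalue(n: int, a: int, b: int, both: int) -> float:
    """P(overlap <= both) under a hypergeometric null.

    n    : total number of samples
    a, b : number of samples mutated in gene A and in gene B
    both : number of samples mutated in both genes
    """
    total = comb(n, b)
    lowest = max(0, a + b - n)  # smallest overlap that is possible at all
    favourable = sum(
        comb(a, x) * comb(n - a, b - x) for x in range(lowest, both + 1)
    )
    return favourable / total


def exclusivity_from_samples(mut_a: set, mut_b: set, n: int) -> float:
    """Convenience wrapper taking the sets of mutated sample identifiers."""
    return exclusivity_pvalue(n, len(mut_a), len(mut_b), len(mut_a & mut_b))


if __name__ == "__main__":
    # Hypothetical cohort of 10 samples; the two genes never co-occur.
    gene_a = {"s1", "s2", "s3", "s4", "s5"}
    gene_b = {"s6", "s7", "s8", "s9", "s10"}
    print(exclusivity_from_samples(gene_a, gene_b, n=10))
```

A small p-value suggests the two genes are mutated together less often than chance would predict. Testing many pairs or larger gene sets requires multiple-testing correction and more efficient search, which this pairwise sketch leaves out.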

People: Fabio Vandin (contact person)

  • Multivariate selection of genetic markers in complex diseases
  • Prediction of disease evolution based on genetic and phenotypic markers
  • Compression and fast retrieval of genome-wide genetic variation data

People: Silvana Badaloni (contact person)