Filtering Degenerate Motifs with Application to Protein Sequence Analysis


Abstract:
In biology the notion of degenerate pattern/motif plays a central role in describing various phenomena. For example, protein functional patterns, like those contained in the PROSITE database, e.g. [FY]DPC[LIM][ASG]C[ASG], are in general represented by degenerate patterns with character classes. Researchers have developed several approaches over the years to discover degenerate patterns. Although such methods have been exhaustively and successfully tested on genomes and proteins, their outcomes often far exceed the size of the original input, making the output hard to manage and to interpret in refined analyses that require manual inspection. In this paper we discuss a characterization of degenerate motifs with character classes without gaps, and we introduce the concept of pattern priority for comparing and ranking different motifs. We define the class of underlying motifs for filtering any set of degenerate motifs into a new set that is linear in the size of the input sequence. We present some preliminary results on the detection of subtle signals in protein families. Results show that our approach drastically reduces the number of motifs in output for any state-of-the-art tool for protein analysis, while retaining the functional patterns.
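
To make the notion of a gapless degenerate motif with character classes concrete, here is a minimal Java sketch. It is not the UnderlyingFilter implementation: it only relies on the fact that a motif such as [FY]DPC[LIM][ASG]C[ASG] is also a valid regular expression, so standard regex matching can count its occurrences and the number of sequences that contain it. The class name DegenerateMotifDemo, its method names, and the three toy sequences are assumptions made for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: hypothetical helper, not part of UnderlyingFilter.
public class DegenerateMotifDemo {

    // Number of positions in a gapless motif; a character class counts as one position.
    public static int motifLength(String motif) {
        int length = 0;
        boolean inClass = false;
        for (char c : motif.toCharArray()) {
            if (c == '[') {
                inClass = true;
                length++;
            } else if (c == ']') {
                inClass = false;
            } else if (!inClass) {
                length++;
            }
        }
        return length;
    }

    // Counts all occurrences of the motif in one sequence, including overlapping ones.
    public static int countOccurrences(String motif, String sequence) {
        Matcher matcher = Pattern.compile(motif).matcher(sequence);
        int count = 0;
        int from = 0;
        while (from <= sequence.length() && matcher.find(from)) {
            count++;
            from = matcher.start() + 1; // restart one position after the last hit
        }
        return count;
    }

    // Number of sequences that contain the motif at least once.
    public static int sequencesContaining(String motif, List<String> sequences) {
        Pattern pattern = Pattern.compile(motif);
        int count = 0;
        for (String sequence : sequences) {
            if (pattern.matcher(sequence).find()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String motif = "[FY]DPC[LIM][ASG]C[ASG]";
        // Toy sequences invented for this example.
        List<String> sequences = Arrays.asList(
                "AAFDPCLACAKK", "YDPCMGCSYDPCIACG", "FDPCKACA");
        System.out.println("length: " + motifLength(motif));
        for (String s : sequences) {
            System.out.println(s + " -> " + countOccurrences(motif, s));
        }
        System.out.println("sequences containing motif: "
                + sequencesContaining(motif, sequences));
    }
}
```

The third toy sequence does not match because K is not in the class [LIM], which shows how a single position outside its character class rules out an occurrence.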

Software

Here you can find the Java application UnderlyingFilter with some examples.
Unzip the following file: ZIP

Run UnderlyingFilter using the command:
java -jar UnderlyingFilter.jar MotifsFileName SequenceFileName NumberOfSequences VarunStyleWithZScore UnderlyingQuorum MinLength SortByZScore

Here, the file "MotifsFileName" contains a set of patterns with character classes and no gaps, in Varun/Teiresias style (see ni_D22). "SequenceFileName" is a text file with one genome per line (see ni_hgenase.txt), and "NumberOfSequences" is the number of genomes in that file. "UnderlyingQuorum" is the quorum for the Underlying patterns in output, and "MinLength" is the minimum length of the patterns used. If the patterns file "MotifsFileName" also contains statistical information such as the Z-Score (see ni_zscore_22), it can be processed by setting the boolean options "VarunStyleWithZScore" and/or "SortByZScore" to true.

To run the included examples, type:
java -jar UnderlyingFilter.jar ni_22 ni_hgenase.txt 22 false 2 3 false > output22.txt
java -jar UnderlyingFilter.jar ni_zscore_22 ni_hgenase.txt 22 true 2 4 false > output_zscore_22.txt
java -jar UnderlyingFilter.jar ni_2 ni_hgenase.txt 22 false 15 5 false > output15-5.txt
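
Because the seven arguments are positional, it is easy to swap "UnderlyingQuorum" and "MinLength". The following Java sketch mirrors the argument order of the command line above and can serve as a quick check before launching the tool. The class UnderlyingFilterArgs is hypothetical and is not taken from the UnderlyingFilter sources; the requirement that the numeric values be positive is also an assumption of this sketch.

```java
// Illustrative only: hypothetical holder for the seven positional arguments.
public class UnderlyingFilterArgs {
    public final String motifsFileName;
    public final String sequenceFileName;
    public final int numberOfSequences;
    public final boolean varunStyleWithZScore;
    public final int underlyingQuorum;
    public final int minLength;
    public final boolean sortByZScore;

    private UnderlyingFilterArgs(String motifsFileName, String sequenceFileName,
                                 int numberOfSequences, boolean varunStyleWithZScore,
                                 int underlyingQuorum, int minLength,
                                 boolean sortByZScore) {
        this.motifsFileName = motifsFileName;
        this.sequenceFileName = sequenceFileName;
        this.numberOfSequences = numberOfSequences;
        this.varunStyleWithZScore = varunStyleWithZScore;
        this.underlyingQuorum = underlyingQuorum;
        this.minLength = minLength;
        this.sortByZScore = sortByZScore;
    }

    // Parses the arguments in the order used on the command line:
    // MotifsFileName SequenceFileName NumberOfSequences VarunStyleWithZScore
    // UnderlyingQuorum MinLength SortByZScore
    public static UnderlyingFilterArgs parse(String[] args) {
        if (args.length != 7) {
            throw new IllegalArgumentException(
                    "expected 7 arguments, got " + args.length);
        }
        int numberOfSequences = Integer.parseInt(args[2]);
        int underlyingQuorum = Integer.parseInt(args[4]);
        int minLength = Integer.parseInt(args[5]);
        // Assumption of this sketch: all three numeric values must be positive.
        if (numberOfSequences < 1 || underlyingQuorum < 1 || minLength < 1) {
            throw new IllegalArgumentException(
                    "NumberOfSequences, UnderlyingQuorum and MinLength must be positive");
        }
        return new UnderlyingFilterArgs(
                args[0], args[1], numberOfSequences,
                Boolean.parseBoolean(args[3]),
                underlyingQuorum, minLength,
                Boolean.parseBoolean(args[6]));
    }

    public static void main(String[] args) {
        UnderlyingFilterArgs parsed = parse(args);
        System.out.println("quorum=" + parsed.underlyingQuorum
                + ", minLength=" + parsed.minLength);
    }
}
```

Read this way, the first example above uses a quorum of 2 and a minimum length of 3, and the third uses a quorum of 15 and a minimum length of 5.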

For more information about the software Varun please visit:
www.research.ibm.com/computationalgenomics

Licence

The software is freely available for academic use.
For questions about the tool, please contact Matteo Comin or Davide Verzotto.

Reference

Please cite the following papers:
M. Comin, D. Verzotto,
"Filtering Degenerate Motifs with Application to Protein Sequence Analysis",
Algorithms 6, no. 2: pp. 352-370.
Pdf
M. Comin, D. Verzotto,
"Reducing the space of degenerate patterns in protein remote homology detection",
Proceedings of 24th International Workshop on Database and Expert Systems Applications, BIOKDD 2013
pp. 76-80. Pdf