Filtering Degenerate Motifs with Application to Protein Sequence Analysis
In biology the notion of a degenerate pattern (or motif) plays a
central role in describing various phenomena. For example, protein
functional patterns, like those contained in the PROSITE
database, e.g. [FY]DPC[LIM][ASG]C[ASG], are in
general represented by degenerate patterns with character classes.
Researchers have developed several approaches over the years to
discover degenerate patterns.
Although such methods have been exhaustively and successfully tested
on genomes and proteins, their outcomes often far exceed the size of
the original input, making the output hard to manage and to
interpret in any refined analysis that requires manual inspection. In this
paper we discuss a characterization of degenerate motifs with
character classes without gaps, and we introduce the concept of
pattern priority for comparing and ranking different motifs.
We define the class of underlying motifs for filtering any
set of degenerate motifs into a new set that is linear in the size of
the input sequence. We present some preliminary results on the
detection of subtle signals in protein families. Results show that
our approach drastically reduces the number of motifs in output for
any state-of-the-art tool for protein analysis, while retaining the
Here you can find the Java application UnderlyingFilter with some examples.
Unzip the following file: ZIP
Run UnderlyingFilter using the command:
java -jar UnderlyingFilter.jar MotifsFileName SequenceFileName NumberOfSequences VarunStyleWithZScore UnderlyingQuorum MinLength SortByZScore
Where the file "MotifsFileName" contains a set of patterns with character
classes and no gaps in Varun/Teiresias style (see ni_D22).
"SequenceFileName" is a text file with one genome per line (see ni_hgenase.txt),
"NumberOfSequences" is the number of genomes in the input file "SequenceFileName",
"UnderlyingQuorum" is the quorum required for the underlying patterns in output,
and "MinLength" is the minimum length of the patterns used.
If the patterns file "MotifsFileName" also contains statistical information such as Z-scores (see ni_zscore_22),
this can be processed by setting the boolean options "VarunStyleWithZScore"
and/or "SortByZScore" to true.
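To make the roles of the quorum and minimum-length parameters more concrete, here is a minimal Python sketch. It is not the UnderlyingFilter implementation. It assumes that a motif is written in the bracketed character-class notation of the PROSITE example above (not necessarily the Varun/Teiresias file format), that the quorum is the minimum number of input sequences containing the motif, and that each character class counts as one position when measuring length. The function names and the toy sequences are hypothetical.

```python
import re

def motif_length(motif):
    """Number of positions in the motif; a bracketed class counts as one."""
    return len(re.findall(r"\[[^\]]+\]|.", motif))

def motif_support(motif, sequences):
    """Number of sequences containing at least one occurrence of the motif.

    A motif with character classes and no gaps is already a valid regular
    expression, so it can be compiled directly.
    """
    pattern = re.compile(motif)
    return sum(1 for seq in sequences if pattern.search(seq))

def passes_filter(motif, sequences, quorum, min_length):
    """Assumed reading of the options: long enough and frequent enough."""
    return (motif_length(motif) >= min_length
            and motif_support(motif, sequences) >= quorum)

# Made-up toy input, one sequence per line as in "SequenceFileName".
toy_sequences = [
    "MKFDPCLACASTT",  # contains FDPCLACA, which matches the motif
    "AAYDPCMGCGKK",   # contains YDPCMGCG, which matches the motif
    "MKFDPCLACTTT",   # FDPCLACT fails on the last class [ASG]
]
motif = "[FY]DPC[LIM][ASG]C[ASG]"

print(motif_length(motif), motif_support(motif, toy_sequences))
print(passes_filter(motif, toy_sequences, quorum=2, min_length=3))
```

Under these assumptions, a check of this kind can help confirm that the values chosen for "UnderlyingQuorum" and "MinLength" are sensible for a given sequence file before running the tool.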
To run the included examples, type:
java -jar UnderlyingFilter.jar ni_22 ni_hgenase.txt 22 false 2 3 false > output22.txt
java -jar UnderlyingFilter.jar ni_zscore_22 ni_hgenase.txt 22 true 2 4 false > output_zscore_22.txt
java -jar UnderlyingFilter.jar ni_2 ni_hgenase.txt 22 false 15 5 false > output15-5.txt
For more information about the software Varun please visit:
The software is freely available for academic use.
For questions about the tool, please contact Matteo Comin
or Davide Verzotto.
Please cite the following papers:
M. Comin, D. Verzotto,
"Filtering Degenerate Motifs with Application to Protein Sequence Analysis",
Algorithms 6, no. 2: pp. 352-370.
M. Comin, D. Verzotto,
"Reducing the space of degenerate patterns in protein remote homology detection",
Proceedings of the 24th International Workshop on Database and Expert Systems Applications, BIOKDD 2013.