Methods for finding motifs in sets of related biosequences
Inge Jonassen
Department of Informatics, University of Bergen, Norway. June 1996.
Abstract
The automatic discovery of patterns
conserved in groups of related biological sequences is an important
problem in molecular biology. This thesis discusses this problem,
and presents a systematisation of a large number of reported methods.
New methods for the automatic discovery of patterns and collection
of patterns in sets of unaligned protein sequences, are proposed.
The methods are able to discover patterns of a quite general type,
and are guaranteed to find the best, according to a defined evaluation
function, conserved patterns. Both nonheuristic and heuristic
search methods are proposed. The problem of evaluating discovered
patterns is discussed and several new evaluation functions are
proposed. The new functions are shown to have useful properties
for a set of test cases. The methods proposed in this thesis have
been primarily designed for analysing protein sequences, but they
may also be applicable to the analysis of nucleotide (DNA/RNA)
sequences and possibly other types of sequence data.
Keywords: bio-informatics, protein sequences,
pattern discovery, machine learning, search methods, PROSITE,
minimum descript length principle
Contents
1 Introduction
2 Machine Learning
2.1 Occam's razor principle
2.2 Learning about strings
2.2.1 Grammatical inference
2.2.2 Pattern inference
3 Discovering biopatterns
3.1 Training set
3.1.1 Only positive or both positive and negative examples
3.1.2 Clean or noisy training data
3.1.3 Training sets containing multiple groups
3.2 Solution space
3.2.1 Regular patterns and protein families
3.2.2 The PROSITE pattern language
3.2.3 Pattern languages used in systems for discovery of biopatterns
3.3 Evaluation of biopatterns
3.4 Discovering patterns from examples.
3.4.1 Pattern discovery and sequence alignment
3.4.2 Pattern discovery methods
4 Summary of contributions
4.1 Approaches to the automatic discovery of patterns in biosequences
4.2 Finding flexible patterns in unaligned protein sequences
4.3 Efficient discovery of conserved patterns using a pattern graph
4.4 Scoring function for pattern discovery programs taking into account sequence diversity
4.5 Discovering patterns and subfamilies in biosequences
5 Further work
5.1 Discovering approximately conserved patterns
5.1.1 Allowing M mismatches
5.1.2 Insertions/deletions
5.2 Discovering patterns of secondary structure in RNA sequences
5.3 Discovery of higherlevel patterns
5.3.1 Use of a shortest path algorithm
5.3.2 Discovering patterns of patterns
6 Conclusion
References
Appendix: Portfolio of research papers
A: Approaches to the automatic discovery
of patterns in biosequences.
A. Brazma, I. Jonassen, I. Eidhammer,
and D. R. Gilbert. 1995.
B: Finding flexible patterns in unaligned
protein sequences.
I. Jonassen, J. F. Collins, and D. G. Higgins. 1995.
C: Efficient discovery of conserved
patterns using a pattern graph.
I. Jonassen. 1996.
D: Scoring function for pattern discovery
programs taking into account sequence diversity.
I. Jonassen, C. Helgesen, and D. G. Higgins. 1996.
E: Discovering patterns and subfamilies
in biosequences.
A. Brazma, I. Jonassen, E. Ukkonen, and J. Vilo. 1996.
Abstract
This paper is a survey of approaches
and algorithms used for the automatic discovery of patterns in
biosequences. Patterns with the expressive power in the class
of regular languages are considered, and a classification of pattern
languages in this class is developed, covering those patterns
which are the most frequently used in molecular bioinformatics.
A formulation is given of the problem of the automatic discovery
of such patterns from a set of sequences, and an analysis presented
of the ways in which an assessment can be made of the significance
and usefulness of the discovered patterns. It is shown that this
problem is related to problems studied in the field of machine
learning. The largest part of this paper comprises a review of
a number of existing methods developed to solve this problem and
how these relate to each other, focusing on the algorithms underlying
the approaches. A comparison is given of the algorithms, and examples
are given of patterns that have been discovered using the different
methods.
Keywords: automatic discovery, bioinformatics,
biosequences, machine learning, patterns.
Finding flexible patterns in unaligned
protein sequences
Abstract
We present a new method for the identification
of conserved patterns in a set of unaligned related protein sequences.
It is able to discover patterns of a quite general form, allowing
for both ambiguous positions and for variable length wildcard
regions. It allows the user to define a class of patterns (e.g.
the degree of ambiguity allowed and the length and number of gaps),
and the method is then guaranteed to find the conserved patterns
in this class scoring highest according to a significance measure
defined. Identified patterns may be refined using one of two new
algorithms. We present a new (nonstatistical) significance
measure for flexible patterns. The method is shown to recover
known motifs for PROSITE families, and is also applied to some
recently described families from the literature.
Keywords: Patterns, protein families,
algorithm, flexible gaps, PROSITE.
Abstract
We have earlier reported an algorithm
[JCH95] for discovering patterns conserved in sets of related
protein sequences. The algorithm was implemented in a program
called Pratt. It uses a depth first search of the pattern space,
together with a data structure making it possible to find all
segments matching a pattern very quickly (block data structure).
Here we present a new search having two main advantages over the
old strategy. Firstly it allows for easier integration with programs
for multiple sequence alignment and data base search. Secondly,
it makes it possible to use branchandbound search,
and heuristics, to speed the search. The new search strategy has
been implemented in a new version of the Pratt program.
Abstract
An important problem in sequence analysis
is to discover patterns matching subsets of a given set of biosequences.
When a pattern common to a subset is found, the quality of the
match should be evaluated. This paper proposes that an evaluation
scheme for measuring the quality of a match between a sequence
set and a common pattern should take into account both the strength
of the pattern and the diversity of the sequences matched. A pattern
matching a diverse set of sequences (i.e., having low degree of
similarity) should get a higher score than an equally strong pattern
matching a set of more similar sequences. This will avoid sets
of very similar sequences from biasing the score of the patterns.
Ideally a measure of statistical significance of a match taking
sequence diversity into account, should be defined. As this is
a nontrivial problem, this paper proposes a nonstatistical
scoring scheme. Ideal requirements for such a scoring scheme are
given. It is assumed that the strength of a pattern and the diversity
of the sequence set can be evaluated independently, and combined
into a total score for the match. We use a restricted class of
PROSITElike patterns, and an earlier reported method for
evaluating pattern strength. Two alternative schemes for evaluating
the diversity of a set of sequences are proposed: one based on
the sum of edge lengths in an estimated phylogenetic tree for
the set of sequences, and one based on the minimum weight spanning
tree on the graph of all pairwise distances between sequences.
For both cases, algorithms and practical application cases are
given. The combined measures are shown to have useful properties
for a set of test cases.
Keywords: pattern discovery, PROSITE,
scoring scheme, sequence diversity, pattern strength.
Abstract
We consider the problem of automatic
discovery of patterns and the corresponding subfamilies in a set
of biosequences. The sequences are unaligned and may contain noise
of unknown level. The patterns are of the type used in PROSITE
database. In our approach we discover patterns and the respective
subfamilies simultaneously. We develop a theoretically substantiated
significance measure for a set of such patterns and an algorithm
approximating the best pattern set and the subfamilies. The approach
is based on the minimum description length (MDL) principle. We
report a computing experiment correctly finding subfamilies in
the family of chromo domains and revealing new strong patterns.
Keywords: pattern discovery, sequence
motifs, machine learning, protein subfamilies, PROSITE, clustering,
algorithms, Bayesian inference, MDL principle
Dissertation Database
- Bergen University Library