The automatic discovery of patterns conserved in groups of related biological sequences is an important problem in molecular biology. This thesis discusses this problem, and presents a systematisation of a large number of reported methods. New methods for the automatic discovery of patterns and collection of patterns in sets of unaligned protein sequences, are proposed. The methods are able to discover patterns of a quite general type, and are guaranteed to find the best, according to a defined evaluation function, conserved patterns. Both nonheuristic and heuristic search methods are proposed. The problem of evaluating discovered patterns is discussed and several new evaluation functions are proposed. The new functions are shown to have useful properties for a set of test cases. The methods proposed in this thesis have been primarily designed for analysing protein sequences, but they may also be applicable to the analysis of nucleotide (DNA/RNA) sequences and possibly other types of sequence data.

1 Introduction

2 Machine Learning

- 2.1 Occam's razor principle

2.2 Learning about strings

- 2.2.1 Grammatical inference

2.2.2 Pattern inference

3 Discovering biopatterns

- 3.1 Training set

- 3.1.1 Only positive or both positive and negative examples

3.1.2 Clean or noisy training data

3.1.3 Training sets containing multiple groups

3.2 Solution space

- 3.2.1 Regular patterns and protein families

3.2.2 The PROSITE pattern language

3.2.3 Pattern languages used in systems for discovery of biopatterns

3.3 Evaluation of biopatterns

3.4 Discovering patterns from examples.

- 3.4.1 Pattern discovery and sequence alignment

3.4.2 Pattern discovery methods

4 Summary of contributions

- 4.1 Approaches to the automatic discovery of patterns in biosequences

4.2 Finding flexible patterns in unaligned protein sequences

4.3 Efficient discovery of conserved patterns using a pattern graph

4.4 Scoring function for pattern discovery programs taking into account sequence diversity

4.5 Discovering patterns and subfamilies in biosequences

5 Further work

- 5.1 Discovering approximately conserved patterns

- 5.1.1 Allowing

5.1.2 Insertions/deletions

5.2 Discovering patterns of secondary structure in RNA sequences

5.3 Discovery of higherlevel patterns

- 5.3.1 Use of a shortest path algorithm

5.3.2 Discovering patterns of patterns

6 Conclusion

References

This paper is a survey of approaches and algorithms used for the automatic discovery of patterns in biosequences. Patterns with the expressive power in the class of regular languages are considered, and a classification of pattern languages in this class is developed, covering those patterns which are the most frequently used in molecular bioinformatics. A formulation is given of the problem of the automatic discovery of such patterns from a set of sequences, and an analysis presented of the ways in which an assessment can be made of the significance and usefulness of the discovered patterns. It is shown that this problem is related to problems studied in the field of machine learning. The largest part of this paper comprises a review of a number of existing methods developed to solve this problem and how these relate to each other, focusing on the algorithms underlying the approaches. A comparison is given of the algorithms, and examples are given of patterns that have been discovered using the different methods.

We present a new method for the identification of conserved patterns in a set of unaligned related protein sequences. It is able to discover patterns of a quite general form, allowing for both ambiguous positions and for variable length wildcard regions. It allows the user to define a class of patterns (e.g. the degree of ambiguity allowed and the length and number of gaps), and the method is then guaranteed to find the conserved patterns in this class scoring highest according to a significance measure defined. Identified patterns may be refined using one of two new algorithms. We present a new (nonstatistical) significance measure for flexible patterns. The method is shown to recover known motifs for PROSITE families, and is also applied to some recently described families from the literature.

We have earlier reported an algorithm [JCH95] for discovering patterns conserved in sets of related protein sequences. The algorithm was implemented in a program called Pratt. It uses a depth first search of the pattern space, together with a data structure making it possible to find all segments matching a pattern very quickly (block data structure). Here we present a new search having two main advantages over the old strategy. Firstly it allows for easier integration with programs for multiple sequence alignment and data base search. Secondly, it makes it possible to use branchandbound search, and heuristics, to speed the search. The new search strategy has been implemented in a new version of the Pratt program.

An important problem in sequence analysis is to discover patterns matching subsets of a given set of biosequences. When a pattern common to a subset is found, the quality of the match should be evaluated. This paper proposes that an evaluation scheme for measuring the quality of a match between a sequence set and a common pattern should take into account both the strength of the pattern and the diversity of the sequences matched. A pattern matching a diverse set of sequences (i.e., having low degree of similarity) should get a higher score than an equally strong pattern matching a set of more similar sequences. This will avoid sets of very similar sequences from biasing the score of the patterns. Ideally a measure of statistical significance of a match taking sequence diversity into account, should be defined. As this is a nontrivial problem, this paper proposes a nonstatistical scoring scheme. Ideal requirements for such a scoring scheme are given. It is assumed that the strength of a pattern and the diversity of the sequence set can be evaluated independently, and combined into a total score for the match. We use a restricted class of PROSITElike patterns, and an earlier reported method for evaluating pattern strength. Two alternative schemes for evaluating the diversity of a set of sequences are proposed: one based on the sum of edge lengths in an estimated phylogenetic tree for the set of sequences, and one based on the minimum weight spanning tree on the graph of all pairwise distances between sequences. For both cases, algorithms and practical application cases are given. The combined measures are shown to have useful properties for a set of test cases.

We consider the problem of automatic discovery of patterns and the corresponding subfamilies in a set of biosequences. The sequences are unaligned and may contain noise of unknown level. The patterns are of the type used in PROSITE database. In our approach we discover patterns and the respective subfamilies simultaneously. We develop a theoretically substantiated significance measure for a set of such patterns and an algorithm approximating the best pattern set and the subfamilies. The approach is based on the minimum description length (MDL) principle. We report a computing experiment correctly finding subfamilies in the family of chromo domains and revealing new strong patterns.

Dissertation Database - Bergen University Library