# Biology 4900, Fall '05 Course Diary (Wareham)

## Wednesday, September 1

• 9:30 -- 10:40: What is Bioinformatics and Why Should You Care?

• Why Bioinformatics / Computational Biology?
• The main reason for applying computational techniques to biological data is that they are now too large to analyze by hand.
• Biological datasets are getting larger by the day; fast computer programs are our only hope of being able to perform the necessary analyses.
• Two truths about application-oriented computing:
• Computers do exactly what you tell them to do, and nothing else => you need to know exactly what you want to do.
• Chances are, someone else besides you, i.e., a computer programmer, will be telling the computer what to do => you need to know how to deal with CS folk and their various concerns, strengths, and bias.
• Dealing with Bioinformatics Software I: What Are They Talking About?
• To understand the software, you need to understand its creators. One way of doing this is to look at the words they use to describe what they do.
• Frequently-heard CS terms:
• Problems, Algorithms, and Programs
• A (computational) problem is an abstract relation of inputs and outputs; it just says what you're given and what you want, not how you go about getting it.

DNA Sequence Assembly
Input: A set S of m DNA fragments of maximum length n.
Output: The original DNA sequence, i.e., a DNA sequence r such that each s in S is a substring of r.

Evolutionary Tree Reconstruction
Input: An m x n object by character matrix M.
Output: The tree T with m leaves that gives the evolutionary history of speciation events that derives the m taxa in M from their most recent common ancestor.

• Many problems are optimization problems in which, among all possible solutions for a given input, you want the one that maximizes or minimizes some function.

Shortest DNA Sequence Assembly
Input: A set S of m DNA fragments of maximum length n.
Output: The shortest DNA sequence r such that each s in S is a substring of r.

Optimal Evolutionary Tree under Criterion X
Input: An m x n object-by-character matrix M.
Output: The tree T with m leaves that has optimal score for criterion X relative to M.

Note that in the context of computational problems in biology, the optimal solution need not be the right one -- rather, we invoke Occam's Razor / the Principle of Parsimony (albeit in a warped fashion) and consider the optimal solution most likely to be right and hence a good one to investigate first.

• An algorithm for a problem X is an abstractly-stated method for computing the output associated with an input of X.
• A program for problem X is an executable implementation of an algorithm for X that is written in some computer language, e.g., C, C++, Java.
• CS folk make this three-way distinction in order to write programs good for all computers; also makes analysis easier (albeit more abstract).
• Exact vs. Approximation vs. Heuristic Algorithms
• A useful three-way classification of algorithms for optimization problems.
• An exact algorithm always returns the optimal solution for a given input.
• An approximation algorithm always returns a solution for a given input whose cost is within some additive of multiplicative factor of the cost of the optimal solution.
• A heuristic algorithm always returns a solution for a given input.
• Heuristics tend to run faster than approximation algorithms, which tend to run faster than exact algorithms.
• When dealing with algorithms, ask yourself what is more important, solution optimality or quick runtime -- you can seldom have both. If you do not answer this question yourself, someone else may answer it for you (and not to your benefit).
• (Worst-case) Time / Space Complexity: Big-Oh (O(f(n)) notation
• A useful measure of how fast an algorithm runs.
• In Big-Oh notation, n is some measure of the size of the input, e.g., maximum sequence length, number of taxa x number of characters, and f(n) is a function of n that upper-bounds the time or resource usage of the algorithm as n goes to infinity.
• Average-case analyses may be more appropriate in many situations but are typically much harder to do.
• In general, functions f() that are polynomials are good, and anything else, e.g., exponential functions, are bad; this, however, can be modified for an individual input (if large, low-order polynomial is good; if very small, exponential is OK).
• Computational Intractability: NP-hardness/completeness.
• A useful measure of the difficulty of a problem.
• If a problem X is NP-hard/complete, it probably does not have a fast, i.e., polynomial-time, algorithm.
• Not necessarily the end of the road; there may still be other types of fast algorithms for X, e.g., approximation / heuristic algorithms.
• Consequences for biologists:
• CS version of problem may be simplified version of what you want => actual meaning of CS analysis results must be considered carefully.
• Best algorithm over all computers may not be the best algorithm for your computer.
• Best algorithm over all inputs may not be the best algorithm for your inputs.
• Three basic types of analysis problems:

• Infer structure from partial information, e.g., DNA sequence assembly, evolutionary tree inference.
• Find occurrences of known patterns in a given structure (pattern matching).
• Derive common patterns in a given set of structures (pattern detection).

We will spend most of our lecture-time looking at the latter two problems relative to molecular sequence data.

• The role of evolutionary theory in computational biology.
• Evolution as descent with modification.
• As a first approximation, molecular sequence mutations and indels occur at random in the sequence.
• Once species diverge from a common ancestor, mutations in sequences inherited from that ancestor occur independently in the descendant species.
• Mutations in non-functional regions do not affect an organism and are almost always passed on to descendants; mutations in functional regions are almost always deleterious, i.e., they kill the organism, and are rarely passed on to descendants => functional regions evolve more slowly than non-functional regions.
• Two consequences:

• Similar structures tend to have similar functions => To infer function of new structure, assess similarity to structures of known function! (pattern matching)
• Conservation is indicative of function => To find functionally important regions, look for conserved regions! (pattern detection)

• Evolution is the reason we can assume similarity (and hence the basic problem of looking for similarity among structures) is meaningful in computational biology.

• 10:50 -- 11:55: Pattern Matching

• Recall our two basic pattern-problems:

• Pattern Matching: Given a pattern-string P and a text-string T, find all occurrences of P in T.
• Pattern Detection: Given a set T of text strings, find the set P of all pattern strings p such that p occurs at least once in each string t in T.

A finer further distinction for pattern problems is exact vs. approximate pattern matching, i.e., how do you interpret "occurrence" of a pattern in a text?

• Pattern Matching
• Pattern matching good for seeing what sequences contain a putative functional element (match putative functional element against sequence database) or for seeing what known functional elements appear in (and hence might be indicative of the function of) a new sequence (match new sequence against database of known functional elements).
• Background
• Pattern Types: Strings, Profiles, Regular expressions
• Exact Occurrence of Pattern (Substrings)
• Approximate Occurrence of Pattern
• Phrase in terms of a matching-function m(P,t) that measures the strength of the match of pattern P with a particular substring t of the text T.
• Phrase matching-function in terms of mutation operations: (single-symbol) substitution, insertion, deletion
• Note that each operation can be expressed as a pair of symbols (x,y) where x is the input to the operation and y is the output.
• If the input (output) is no character as it is in insertion (deletion), denote the input (output) by a dash-symbol ('-').
• Note that insertions and deletions are opposite operations, i.e., an insertion done in reverse is a deletion. As in many cases we will not know the "direction" of an operation, insertions and deletions are lumped together and called indels.
• One can express similarity in a very basic way in terms of the smallest number of operations required to transform a pattern into a given piece of text; however, this implies that all operations are equally likely, which is not typical, e.g., transitions are more frequent than transversions are more frequent than insertions/deletions. Perhaps we should model this by associating costs with individual operations ...
• Save costs of mutation operations in a mutation cost-matrix M.
• Distance matrix (all operations have cost greater than 0; also satisfies other axioms); lower is better.

```    MD1:                  MD2:
A  G  T  C  -         A  G  T  C  -
A   0  1  1  1  1     A   0  1  2  2  4
G   1  0  1  1  1     G   1  0  2  2  4
T   1  1  0  1  1     T   2  2  0  1  4
C   1  1  1  0  1     C   2  2  1  0  4
-   1  1  1  1  *     -   4  4  4  4  *
```

• Similarity matrix (operations may have negative values; is (typically) symmetric); higher is better.

```    MS1:                  MS2:
A  G  T  C  -         A  G  T  C  -
A   2 -1 -1 -1 -1     A   2 -1 -2 -2 -3
G  -1  2 -1 -1 -1     G  -1  2 -2 -2 -3
T  -1 -1  2 -1 -1     T  -2 -2  2 -1 -3
C  -1 -1 -1  2 -1     C  -2 -2 -1  2 -3
-  -1 -1 -1 -1  *     -  -3 -3 -3 -3  *
```

In all examples below, we will use similarity matrix MS2.
• Classical mutation cost-matrices:
• Protein matrices (PAM, BLOSSUM)
• DNA matrices (model-based (Jukes-Canter, Kimura 2-Parameter))
• Transform pattern into occurrence by sequence of operations; cost of operation-sequence = sum of costs of individual operations.
• Note that such an operation-sequence cannot contain self-substitutions, e.g., A => A
• Example: Three operation-sequences for transforming AGA into GA.

```   S1:        S2:      S3:

123        123      123

AGA        AGA      AGA       Cost of S1 =
|          |        |         M(A,-) = -3
| Del 1    | Del 2  | Del 3
V          V        V        Cost of S2 =
GA         AA       AG         M(G,-) + M(A,G) =
|        |         -3 + -1 = -4
| Sub 1  | Del 2
V        V        Cost of S3 =
GA       G          M(A,-) + M(A,-) +
|         M(-,A) =
| Ins 1   -3 + -3 + -3 = -9
V
GA
```

• Degree of similarity m(P,t) = cost of maximum-cost operation-sequence that transforms P into t.
• In the example above, the operation-sequence with maximum cost is S1; hence, the degree of similarity of AGA and GA is -3.
• Best match of P in T is substring t of T such that m(P,t) has largest value over all t in T.
• Example: Sample matches of pattern TAC to text TTGTCA:

```    Match #1: TTGTCA  cost = -1 + -1 = -2
TAC

Match #2: TTGTCA  cost = -1 + -3 = -4
TA-C

Match #3: TTGTCA  cost = -3 + -2 = -5
T-AC
```

The reader can verify that Match #1 above has the highest score of any match of the given pattern to the given text, and is therefore the best match.

• Algorithms
• Exact pattern matching
• Can be done in linear, i.e., O(|P| + |T|), time and space (Gusfield (1997), Section I).
• Approximate pattern matching
• Can be done in quadratic, i.e., O(|P||T|), time and space; however, faster algorithms are possible if you are looking at patterns that have very good matches in the text ( Gusfield (1997), Chapters 11 and 12; Setubal and Meidanis (1997), Sections 3.1-3.3).
• Software

## Thursday, September 2

• 9:30 -- 10:40: Pairwise Sequence Alignment and Database Search

• Pairwise Sequence Alignment (Mount (2004), Chapter 3)
• Alignment generalizes notion of match to two whole sequences s and s' -- are no longer just looking for best match of a pattern and a text but are instead looking for best match of s against s'. An alignment also provides a very nice visual display of the sequence-matching.
• Pairwise alignments good for detecting patterns common to two sequences.
• Background
• Derive alignment score by summing the scores of all columns of the alignment relative to cost-matrix M.
• Example: Three alignments of GCTGT and TGCCTAG:

```    Alignment #1:   TGCCTAG
-GC-TGT

cost = -3 + 2 + 2 + -3 + 2 + -1 + -2
= -3

Alignment #2:   TGCCTAG-
--GCT-GT

cost = -3 + -3 + -2 + 2 + 2 + -3 + 2 + -3
= -8

Alignment #3:   TGCCTAG-
-GC-TG-T

cost = -3 + 2 + 2 + -3 + 2 + -1 + -3 + -3
= -7
```

• On occasion, want to further mold alignment to minimize the number of gaps i.e., number maximal strings of indels.
• In the example above, Alignment #1 has 2 gaps, Alignment #2 has 3 gaps, and Alignment #3 has 4 gaps.
• Several types of gap-cost schemes:
• Constant, e.g., score + WgG, where Wg is the penalty for a gap and G is the number of gaps.
• Affine, e.g., score + sum Wgs + GiWgc where Wgs is the penalty for starting a gap, Gi is the length in symbols of gap i, and Wgc is the penalty for extending a gap by one symbol.
• Arbitrary, e.g., score + sum f(Gi), where f is an arbitrary function.
• Example: Three alignments of GCTGT and TGCCTAG (constant gap-cost scheme / Wg = -2):

```    Alignment #1:   TGCCTAG
-GC-TGT

cost = -3 + 2 + 2 + -3 + 2 + -1 + -2 + (2 * -2)
= -7

Alignment #2:   TGCCTAG-
--GCT-GT

cost = -3 + -3 + -2 + 2 + 2 + -3 + 2 + -3 + (3 * -2)
= -14

Alignment #3:   TGCCTAG-
-GC-TG-T

cost = -3 + 2 + 2 + -3 + 2 + -1 + -3 + -3 + (4 * -2)
= -15
```

• Best alignment = alignment with highest score.
• In both of the examples, the best alignment is Alignment #1. The reader can verify the that Alignment #1 is the best alignment relative to all possible alignments of the given sequences.
• Several options for match-extent of s and s':

• Global alignment (Needleman-Wunsh): Look for best alignment of all of s with all of s'.
• Useful in comparing very closely related sequences.

```        -------------------------
|#####--###-##-####-##--|
|--##-####-######-######|
-------------------------
```

• Semi-global alignment: Look for best alignment of s with s' that ignores various combinations of initial and final indels.
• There are sixteen possible forms of semi-global alignment, one of which is global alignment (= do not ignore any initial or final indels).
• Two variants (ignore initial indels in s and final indels in s' / ignore final indels in s and initial indels in ) are useful in determining degree of sequence overlap (DNA sequence assembly).

```             ------------
---------|##-######-|########
#########|####-##-##|--------
------------

------------
#########|##-######-|--------
---------|####-##-##|########
------------
```

• Note that (approximate) pattern matching looks for best alignment of all of s with part of s', ignoring initial and final indels w.r.t. s', and hence is a special case of semi-global alignment.

```          --------------
######|#-#####--##-|########
------|####-####-##|--------
--------------
```

• Local alignment (Smith-Waterman): Look for the best alignment of a region of s with a region of s' => pattern detection!
• Useful in comparing distantly related sequences.

```          ----------
######|##-##-##|####
##|#-#####-|#######
----------
```

• Algorithms
• Exact
• All forms of non-gap-cost as well as global constant/affine gap-cost pairwise alignment can be done in quadratic, i.e., O(|s||s'|), time and space; however, faster algorithms are possible if you are looking for alignments of very similar sequences (Gusfield (1997), Chapters 11 and 12; Setubal and Meidanis (1997), Sections 3.1-3.3).
• There are versions of these algorithms that run in quadratic time and linear space (Gusfield (1997), Section 12.1; Setubal and Meidanis (1997), Section 3.3.1)
• Global arbitrary gap-cost pairwise alignment can be done in cubic time time and quadratic space (Gusfield (1997), Section 11.8.5).
• Heuristic
• Used in database search! (see below)
• Software
• Global: GAP (DNA/proteins)
• Local: SIM (proteins)
• Database Search (Mount (2004), Chapter 6)
• Background
• Heuristic pairwise alignment algorithms required to perform database search in reasonable amount of time on large databases.
• Even if sufficient computing power is available to do quadratic time string comparison in database search in a reasonable amount of time, heuristics are still useful as filters to remove "obviously" bad database entries from being considered with quadratic-time exact algorithms.
• Algorithms
• Exact
• Uses exact pairwise local alignment algorithm (Smith-Waterman).
• Heuristic
• Two most famous algorithms are FASTA and BLAST.
• Run in "approximately" linear time.
• Software

• 10:50 -- 11:55: Multiple Sequence Alignment

• Multiple Sequence Alignment (Mount (2004), Chapter 5)
• Multiple sequence alignments good for detecting patterns common to multiple sequences; are especially useful in detecting mutation-degraded patterns common in a phylogenetically diverse set of sequences, e.g,, protein family diagnostic patterns.
• If you are doing phylogenetic analysis, multiple sequence alignment is critical for establishing homology of sequence-positions across sequences.
• Background
• Most popular alignment functions constructed from global pairwise alignments among pairs of strings from the given set.
• Phylogenetic multiple sequence alignment
• Given a (evolutionary) tree, label the leaves of this tree with the given sequences and assign sequences to the internal nodes; the cost of the alignment is then the sum of the scores of the the best pairwise alignments along the tree edges.
• Example: Tree alignment for given sequences {AAGA, TGGA, ATAA, ATTA} relative to given tree T below:

```    T:
AAGA            TGGA
\          /
AAAA--TTAA
/          \
ATTA            ATAA

Edge alignments:

AAGA      cost = 2 + 2 + -1 + 2
AAAA           = 5

ATTA      cost = 2 + -2 + -2 + 2
AAAA           = 0

AAAA      cost = -2 + -2 + 2 + 2
TTAA           = 0

TTAA      cost = 2 + -2 + -1 + 2
TGGA           = 1

TTAA      cost = -2 + 2 + 2 + 2
ATAA           = 4

cost of tree alignment = 5 + 0 + 0 + 1 + 4
= 10
```

• Has ideal biological justification; create alignment that assume least amount of evolutionary change relative to a given tree,and hence may be most plausible in an evolutionary sense.
• Also has version in which tree is inferred at same time as alignment.
• Both versions are NP-hard (Wang and Jiang (1994); Wareham (1995))
• Sum-of-Pairs (SP) multiple sequence alignment
• The cost of an alignment is the sum of all the scores of all best pairwise alignments between distinct pairs of given strings.
• Example: SP alignment for given sequences {AAGA, TGGA, ATAA, ATTA}:

```    Pairwise alignments:

AAGA      cost = -2 + -1 + -1 + 2
TGAA           = -2

AAGA      cost = 2 + -2 + 2 + 2
ATGA           = 4

AAGA      cost = 2 + -2 + -2 + 2
ATTA           = 0

TGAA      cost = -2 + -2 + 1 + 2
ATGA           = -3

TGAA      cost = -2 + -2 + -2 + 2
ATTA           = -4

ATGA      cost = 2 + 2 + -1 + 2
ATTA           = 5

cost of SP alignment = -2 + 4 + 0 + -3 + -4 + 5
= 0
```

• Has no biological justification; however, does have nice mathematical properties.
• Is NP-hard (Wang and Jiang (1994))
• Algorithms
• Exact
• Phylogenetic: Exponential time and space.
• Sum-of-Pairs: Exponential time and space.
• Contrary to many statements in the literature, SP alignment requires more time to compute than phylogenetic alignment; however, SP alignment is popular because there are nice cost-cutting tricks that can be applied to the algorithm to reduce the running time in most (but not all) cases (see program MSA below).
• Approximation
• There are various low-order polynomial time approximation algorithms for computing phylogenetic and SP alignments; however, they produce alignments that are within a multiplicative factor of two of optimal.
• Heuristics
• Many such low-order polynomial-time heuristics. Two main types are:
• Progressive methods: Build an evolutionary tree for a group of sequences and build up multiple alignment by aligning closest pair of sequences in the tree and then adding other sequences.
• The evolutionary tree used in this heuristic algorithm is typically itself built using a heuristic algorithm, e.g., cluster analysis, neighbor-joining.
• Repeated-motif methods: Find blocks common to all sequences and use as alignment "anchors", and recursively repeat this process on interblock regions.
• Software
• Exact
• MSA (global SP)
• Heuristic
• CLUSTALW (progressive)
• DALIGN (repeated-motif)
• T-COFFEE (progressive)
• Manual (Sequence Editors)
• Dealing with Bioinformatics Software II: Caveat Emptor
• Two important questions:
• What (exactly) do you want to do?
• What is the appropriate technology to do what you want to do?
• Advanced technology is not always the answer (Rutherford's Dictum); however, if you need technology, where should you go?
• Bioinformaticians
• Other biologists who have worked on problems similar to your own
• Recent books on bioinformatics for biologists, e.g., Mount (2004).
• Recent review articles in biological journals.
• On-line repositories
• CS folk
Note that this list is ordered by decreasing utility of advice. Any source listed above can be used; however, the further down the list it is, the more salt should be taken with said advice ...

## References

• Gusfield, D. (1997) Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press.
• Mount, D.W. (2004) Bioinformatics: Sequence and Genome Analysis. Second edition. Cold Spring Harbor Laboratory Press; Cold Spring Harbor, NY.
• Setubal, J. and Meidanis, J. (1997) Introduction to Computational Molecular Biology. PWS Publishing Company.
• Wang, L. and Jiang, T. (1994) "On the complexity of multiple sequence alignment." Journal of Computational Biology, 1(4), 337-348.
• Wareham, H.T. (1995) "A Simplified Proof of the NP- and MAX SNP-hardness of Multiple Sequence Tree Alignment." Journal of Computational Biology, 2(4), 509-514.