# Biology 4900, Spring '07 Course Diary (Wareham)

## Wednesday, May 9

• 9:30 -- 10:40: What is Bioinformatics and Why Should You Care?

• Why Bioinformatics / Computational Biology?
• The main reason for applying computational techniques to biological data is that such datasets are now too large to analyze by hand.
• Biological datasets are getting larger by the day; fast computer programs are our only hope of being able to perform the necessary analyses.
• Two truths about application-oriented computing:
• Computers do exactly what you tell them to do, and nothing else => you need to know exactly what you want to do.
• Chances are, someone else besides you, i.e., a computer programmer, will be telling the computer what to do => you need to know how to deal with CS folk and their various concerns, strengths, and biases.
• Dealing with Bioinformatics Software I: What Are They Talking About?
• To understand the software, you need to understand its creators. One way of doing this is to look at the words they use to describe what they do.
• Frequently-heard CS terms (Cormen et al. (2001); Wareham (2004)):
• Problems, Algorithms, and Programs
• A (computational) problem is an abstract relation of inputs and outputs; it just says what you're given and what you want, not how you go about getting it.

DNA Sequence Assembly
Input: A set S of m DNA fragments of maximum length n.
Output: The original DNA sequence, i.e., a DNA sequence r such that each s in S is a substring of r.

Evolutionary Tree Reconstruction
Input: An m x n object by character matrix M.
Output: The tree T with m leaves that gives the evolutionary history of speciation events that derives the m taxa in M from their most recent common ancestor.

• Many problems are optimization problems in which, among all possible solutions for a given input, you want the one that maximizes or minimizes some function.

Shortest DNA Sequence Assembly
Input: A set S of m DNA fragments of maximum length n.
Output: The shortest DNA sequence r such that each s in S is a substring of r.

Optimal Evolutionary Tree under Criterion X
Input: An m x n object-by-character matrix M.
Output: The tree T with m leaves that has optimal score for criterion X relative to M.

Note that in the context of computational problems in biology, the optimal solution need not be the right one -- rather, we invoke Occam's Razor / the Principle of Parsimony (albeit in a warped fashion) and consider the optimal solution most likely to be right and hence a good one to investigate first.

• An algorithm for a problem X is an abstractly-stated method for computing the output associated with an input of X.
• A program for problem X is an executable implementation of an algorithm for X that is written in some computer language, e.g., C, C++, Java.
• CS folk make this three-way distinction so that programs can be written to run well on many different computers; it also makes analysis of required running time and space easier (albeit more abstract).
• CS folk typically work on simplified versions of problems to make analysis easier; moreover, such simplification (or misunderstandings) may accidentally and fundamentally modify your original problem (spherical cows)
• Take-Home Lesson #1: When formulating and dealing with computational problems, you need to both (1) be careful when interpreting what CS analyses of problems actually mean, and (2) make sure that the problem being analyzed and/or programmed is actually the one you want solved.
• Exact vs. Approximation vs. Heuristic Algorithms
• A useful three-way classification of algorithms for optimization problems:
• An exact algorithm always returns the optimal solution for a given input.
• An approximation algorithm always returns a solution for a given input whose cost is within some additive or multiplicative factor of the cost of the optimal solution.
• A heuristic algorithm always returns a solution for a given input.
• Heuristics tend to run faster than approximation algorithms, which tend to run faster than exact algorithms.
• Take-Home Lesson #2: When choosing an algorithm, ask yourself which is more important, solution optimality or quick runtime -- you can seldom have both. If you do not answer this question yourself, someone else may answer it for you (and not to your benefit).
• (Worst-case) Time / Space Complexity: Big-Oh (O(f(n))) notation
• Useful measures of how fast an algorithm runs and how much computer memory an algorithm will require.
• In Big-Oh notation, n is some measure of the size of the input, e.g., maximum sequence length, number of taxa x number of characters, and f(n) is a function of n that upper-bounds the time or space usage of the algorithm as n goes to infinity.
• Average-case analyses may be more appropriate in many situations but are typically much harder to do.
• Function-description shorthand:
• O(n) => linear
• O(n^2) = O(n * n) => quadratic
• O(n^3) = O(n * n * n) => cubic
• O(n^c) for some constant c >= 1 => polynomial
• O(c^n) for some constant c > 1 => exponential
• Take-Home Lesson #3: In general, time and space complexity functions f() that are polynomials are good, and anything larger, e.g., exponential functions, is bad [handoutTSC.pdf]; however, this may be relaxed for particular inputs (if n is large, low-order polynomial is good; if n is *** very *** small, exponential is OK).
• Take-Home Lesson #4: Existing computer hardware can be left to run as long as necessary, but computer memory is typically limited and adding more memory can get very expensive; hence, space is often the limiting resource for computations on large inputs, and space complexity should be very low-order polynomial whenever possible.
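Take-Home Lesson #3 can be made concrete with a few numbers. The sketch below (illustrative; the function name `growth_table` and the chosen values of n are my own) tabulates quadratic, cubic, and exponential growth side by side:

```python
# Illustration of Take-Home Lesson #3: how quickly polynomial vs.
# exponential complexity functions grow with input size n.
# (A minimal sketch; the specific n values are arbitrary.)

def growth_table(ns):
    """Return one (n, n^2, n^3, 2^n) row per input size n."""
    return [(n, n**2, n**3, 2**n) for n in ns]

for n, quad, cubic, expo in growth_table([10, 20, 50]):
    print(f"n={n:3d}  n^2={quad:6d}  n^3={cubic:8d}  2^n={expo}")
```

Already at n = 50, 2^n dwarfs n^3, which is why exponential-time algorithms are only usable on very small inputs.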
• Computational Intractability: NP-hardness/completeness.
• A useful measure of the difficulty of a problem.
• If a problem X is NP-hard/complete, it probably does not have a fast, i.e., polynomial-time, algorithm.
• Take-Home Lesson #5: Computational intractability is not necessarily the end of the road; other types of fast algorithms may still be possible, e.g., approximation / heuristic algorithms.
• Learning to understand CS folk can indeed be difficult; however, your efforts will be repaid many times over, both by opening up new opportunities for research collaboration and by dramatically increasing the chances that you will actually get what you want and need out of any such collaboration.
• Three basic types of biological-data analysis problems:

• Infer structure from partial information, e.g., DNA sequence (contig) assembly, evolutionary tree (phylogeny) inference.
• Find occurrences of known patterns in a given structure (pattern matching).
• Derive common patterns in a given set of structures (pattern detection).

In Dr. Carr's lectures on Monday and Tuesday, you saw several examples of the first type of problem relative to molecular-sequence data. We will spend most of our remaining lecture-time today and tomorrow looking at examples of the latter two types of problems.

• The role of evolutionary theory in computational biology.
• Evolution as descent with modification.
• As a first approximation, molecular sequence mutations and indels occur at random in the sequence.
• Once species diverge from a common ancestor, mutations in sequences inherited from that ancestor occur independently in the descendant species.
• Mutations in non-functional regions do not affect an organism and are almost always passed on to descendants; mutations in functional regions are almost always deleterious, i.e., they kill the organism, and are rarely passed on to descendants => functional regions evolve more slowly than non-functional regions.
• Two consequences:

• Similar structures tend to have similar functions => To infer function of new structure, assess similarity to structures of known function! (pattern matching)
• Conservation is indicative of function => To find functionally important regions, look for conserved regions! (pattern detection)

• Evolution is the reason why we can assume that similarity (and hence the basic problem of looking for similarity among structures) is meaningful in computational biology.

• 10:50 -- 11:55: Pattern Matching

• Recall our two basic pattern-problems:

• Pattern Matching: Given a pattern P and a text-string T, find all occurrences of P in T.
• Pattern Detection: Given a set T of text strings, find the set P of all patterns p such that p occurs at least once in each string t in T.

A further distinction for pattern-problems is exact vs. approximate pattern matching, i.e., how do you interpret an "occurrence" of a pattern in a text?

• Pattern Matching
• Pattern matching is good for seeing which sequences contain a putative functional element (match the putative functional element against a sequence database) or for seeing which known functional elements appear in (and hence might be indicative of the function of) a new sequence (match the new sequence against a database of known functional elements).
• Background
• Pattern Types
• The appropriate type varies depending on how specific the pattern is.
• Most common types:
• String: Fixed pattern length, no variation across positions, e.g., TATA, AAGCA.
• Profile (Gusfield (1997), Section 14.3.1; Mount (2004), pp. 198-204): Fixed pattern length, variation across positions (specified by symbol-probabilities at each position), e.g.,

```              Position
Base   1    2    3    4    5
A    95%  25%   0%   0% 100%
C     0%  25%   0%  50%   0%
G     5%  30% 100%  50%   0%
T     0%  20%   0%   0%   0%
```
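A profile like the one above can be used to score a candidate site by multiplying the per-position base probabilities. A minimal sketch (the names `PROFILE` and `profile_score` are illustrative, not from any standard library):

```python
# Score a candidate 5-mer against the profile above by multiplying the
# per-position base probabilities (a sketch; names are illustrative).

PROFILE = [  # one dict of base -> probability per pattern position
    {"A": 0.95, "C": 0.00, "G": 0.05, "T": 0.00},
    {"A": 0.25, "C": 0.25, "G": 0.30, "T": 0.20},
    {"A": 0.00, "C": 0.00, "G": 1.00, "T": 0.00},
    {"A": 0.00, "C": 0.50, "G": 0.50, "T": 0.00},
    {"A": 1.00, "C": 0.00, "G": 0.00, "T": 0.00},
]

def profile_score(site):
    """Probability that the profile generates the given 5-mer."""
    p = 1.0
    for column, base in zip(PROFILE, site):
        p *= column[base]
    return p

print(profile_score("AAGCA"))  # a strong match: 0.95 * 0.25 * 1.0 * 0.5 * 1.0
print(profile_score("TAGCA"))  # position 1 forbids T, so probability 0.0
```

In practice profile scores are usually computed as sums of log-odds values rather than raw probability products, but the idea is the same.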

• Regular expression (Gusfield (1997), Section 3.6): Variable pattern length, variation across positions, e.g., (A)[*](A|C|G|T)(A|G)[2-5](G)(A).
• Focus here on patterns as strings.
• Exact Occurrence of Pattern
• Occurrence of pattern P in a text T is a substring t of T that is identical to P, e.g., pattern AAG occurs twice in text AAAGCACTTAAGC at positions 2 and 10, while pattern CC does not occur in this text.
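The example above can be checked by brute force: slide the pattern along the text and report every start position where the substring is identical. A sketch (the function name `exact_occurrences` is illustrative; real tools use the linear-time algorithms cited below):

```python
# Exact pattern matching by brute force: slide the pattern along the text
# and report all (1-based) start positions of identical substrings.

def exact_occurrences(pattern, text):
    m = len(pattern)
    return [i + 1 for i in range(len(text) - m + 1)
            if text[i:i + m] == pattern]

print(exact_occurrences("AAG", "AAAGCACTTAAGC"))  # [2, 10]
print(exact_occurrences("CC", "AAAGCACTTAAGC"))   # []
```

This takes O(|P||T|) time in the worst case; the linear-time algorithms mentioned under Algorithms below avoid re-examining text characters.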
• Approximate Occurrence of Pattern
• Phrase in terms of a matching-function m(P,t) that measures the strength of the match of pattern P with a particular substring t of the text T.
• Phrase matching-function in terms of mutation operations: (single-symbol) substitution, insertion, deletion
• Note that each operation can be expressed as a pair of symbols (x,y) where x is the input to the operation and y is the output.
• If the operation has no input character (as in insertion) or no output character (as in deletion), denote that input (output) by a dash-symbol ('-').
• Note that insertions and deletions are opposite operations, i.e., an insertion done in reverse is a deletion. As in many cases we will not know the "direction" of an operation, insertions and deletions are lumped together and called indels.
• One can express similarity in a very basic way in terms of the smallest number of operations required to transform a pattern into a given piece of text; however, this implies that all operations are equally likely, which is not typical, e.g., transitions are more frequent than transversions, which are in turn more frequent than insertions/deletions. Perhaps we should model this by associating costs with individual operations ...
• Save costs of mutation operations in a mutation cost-matrix M.
• Distance matrix (all non-identity operations have cost greater than 0; also satisfies other axioms); lower is better.

```
MD1:                    MD2:

    A  G  T  C  -           A  G  T  C  -
A   0  1  1  1  1       A   0  1  2  2  4
G   1  0  1  1  1       G   1  0  2  2  4
T   1  1  0  1  1       T   2  2  0  1  4
C   1  1  1  0  1       C   2  2  1  0  4
-   1  1  1  1  *       -   4  4  4  4  *
```

• Similarity matrix (operations may have negative values; is (typically) symmetric); higher is better.

```
MS1:                    MS2:

    A  G  T  C  -           A  G  T  C  -
A   2 -1 -1 -1 -1       A   2 -1 -2 -2 -3
G  -1  2 -1 -1 -1       G  -1  2 -2 -2 -3
T  -1 -1  2 -1 -1       T  -2 -2  2 -1 -3
C  -1 -1 -1  2 -1       C  -2 -2 -1  2 -3
-  -1 -1 -1 -1  *       -  -3 -3 -3 -3  *
```

In all examples below, we will use similarity matrix MS2.
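Matrix MS2 has a compact logic worth spelling out: matches score 2, transitions (substitutions within the purines {A,G} or within the pyrimidines {C,T}) score -1, transversions score -2, and indels score -3. A sketch encoding this (the helper name `ms2` is my own):

```python
# A compact encoding of similarity matrix MS2: matches score 2, transitions
# score -1, transversions -2, and indels -3 (a sketch; name is illustrative).

def ms2(x, y):
    if x == "-" or y == "-":
        return -3                                    # indel
    if x == y:
        return 2                                     # match
    if {x, y} <= {"A", "G"} or {x, y} <= {"C", "T"}:
        return -1                                    # transition
    return -2                                        # transversion

print(ms2("A", "G"), ms2("T", "C"), ms2("A", "T"), ms2("G", "-"))
```

Encoding a matrix as a function like this only works when the matrix has such regular structure; matrices such as PAM and BLOSUM below must be stored entry by entry.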
• Classical mutation cost-matrices:
• Protein matrices (PAM, BLOSUM)
• DNA matrices (model-based (Jukes-Cantor, Kimura 2-Parameter))
• Transform pattern into occurrence by sequence of operations; cost of operation-sequence = sum of costs of individual operations.
• Example: Three operation-sequences for transforming AGA into GA.

```
   S1:        S2:        S3:

   123        123        123

   AGA        AGA        AGA       Cost of S1 =
    |          |          |          M(A,-) = -3
    | Del 1    | Del 2    | Del 3
    V          V          V         Cost of S2 =
   -GA        A-A        AG-         M(G,-) + M(A,G) =
               |          |          -3 + -1 = -4
               | Sub 1    | Del 1
               V          V         Cost of S3 =
              G-A        -G-         M(A,-) + M(A,-) + M(-,A) =
                          |          -3 + -3 + -3 = -9
                          | Ins 3
                          V
                         -GA
```

• Degree of similarity m(P,t) = cost of maximum-cost operation-sequence that transforms P into t.
• In the example above, the operation-sequence with maximum cost is S1; hence, the degree of similarity of AGA and GA is -3.
• This simple view of similarity as operation-sequences has several problems:
• Can artificially lower the score via a series of operations on a position that do not change that position (position 3 in S3 above).
• Can artificially raise the score via a series of operations on a position of the form X => X.
• Does not take into account positions that are not modified by any operation (and hence were similar to begin with).
Fix this by:
• Allowing at most one operation per position; and
• Allowing X => X operations in the operation-cost sum for positions that are not modified.
• Example: Three operation-sequences for transforming AGA into GA (Revised).

```
   S1:        S2:        S3 = S1:

   123        123        123

   AGA        AGA        AGA       Cost of S1 =
    |          | Sub 1,   |          M(A,-) + M(G,G) + M(A,A) =
    | Del 1    | Del 2    | Del 1    -3 + 2 + 2 = 1
    V          V          V
   -GA        G-A        -GA       Cost of S2 =
                                     M(A,A) + M(G,-) + M(A,G) =
                                     2 + -3 + -1 = -2

                                   Cost of S3 = Cost of S1 = 1
```

• In the revised example, the operation-sequence with maximum cost is still S1; however, the degree of similarity of AGA and GA is now 1.
• Best match of P in T is a substring t of T such that m(P,t) has the largest value over all substrings t of T. Note that there may be several such best matches (and hence several best approximate occurrences of P in T).
• Example: Sample matches of pattern TAC to text TTGTCA:

```
Match #1: TTGTCA    cost = 2 + -1 + -1 = 0
           TAC

Match #2: TTGTCA    cost = 2 + -3 + -2 + 2 = -1
           T-AC

Match #3: TT-GTCA   cost = 2 + -3 + -2 = -3
           TAC

Match #4: TTGTCA    cost = 2 + -1 + -3 + 2 = 0
           TA-C

Match #5: TTGT-CA   cost = 2 + -3 + 2 = 1
             TAC
```

The reader can verify that Match #5 above has the highest score of any match of the given pattern to the given text, and is therefore the best match.
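The ungapped matches can be checked by brute force: score the pattern against every text window of the same length under MS2. A sketch (function names are my own; note that the best ungapped match is Match #1 with score 0, while the gapped Match #5 does better):

```python
# Brute-force search for the best *ungapped* match of a pattern in a text
# under similarity matrix MS2 (a sketch; gapped matches can score higher).

def ms2(x, y):
    if x == "-" or y == "-":
        return -3
    if x == y:
        return 2
    return -1 if {x, y} <= {"A", "G"} or {x, y} <= {"C", "T"} else -2

def best_ungapped_match(pattern, text):
    """Return (score, window) for the best-scoring ungapped window."""
    m = len(pattern)
    windows = [text[i:i + m] for i in range(len(text) - m + 1)]
    return max((sum(ms2(p, t) for p, t in zip(pattern, w)), w)
               for w in windows)

print(best_ungapped_match("TAC", "TTGTCA"))  # (0, 'TGT'), i.e., Match #1
```

Finding the best match with gaps allowed requires the dynamic-programming algorithms discussed under Algorithms below.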

• Algorithms
• Exact pattern matching
• Can be done in linear, i.e., O(|P| + |T|), time and space (Gusfield (1997), Section I).
• Approximate pattern matching
• Can be done in quadratic, i.e., O(|P||T|), time and space; however, faster algorithms are possible if you are looking at patterns that have very good matches in the text (Gusfield (1997), Chapters 11 and 12; Setubal and Meidanis (1997), Sections 3.1-3.3).
• Software
• CDART (protein domain architectures)
• eMOTIF (generic sequence patterns)
• InterPro (protein families, domains, and functional sites)
• BLOCKS (protein domains)
• Prosite (protein families and domains)
• SMART (signaling domains)
• TargetDB (cellular-localization domains)

## Thursday, May 10

• 9:30 -- 10:40: Pairwise Sequence Alignment and Database Search

• Pairwise Sequence Alignment (Mount (2004), Chapter 3)
• Alignment generalizes the notion of a match to two whole sequences s and s' -- we are no longer just looking for the best match of a pattern and a text but are instead looking for the best match of s against s'. An alignment also provides a very nice visual display of the sequence-matching.
• Pairwise alignments good for detecting patterns common to two sequences.
• Background
• Derive alignment score by summing the scores of all columns of the alignment relative to cost-matrix M.
• Example: Three alignments of GCTGT and TGCCTAG:

```
Alignment #1:   TGCCTAG
                -GC-TGT

   cost = -3 + 2 + 2 + -3 + 2 + -1 + -2
        = -3

Alignment #2:   TGCCTAG-
                --GCT-GT

   cost = -3 + -3 + -2 + 2 + 2 + -3 + 2 + -3
        = -8

Alignment #3:   TGCCTAG-
                -GC-TG-T

   cost = -3 + 2 + 2 + -3 + 2 + -1 + -3 + -3
        = -7
```
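The column-by-column scoring above is easy to mechanize: pair up the two rows of the alignment and sum the MS2 value of each column. A sketch (function names are my own):

```python
# Score a pairwise alignment by summing per-column MS2 values
# (no gap penalty yet; a sketch checked against the three alignments above).

def ms2(x, y):
    if x == "-" or y == "-":
        return -3
    if x == y:
        return 2
    return -1 if {x, y} <= {"A", "G"} or {x, y} <= {"C", "T"} else -2

def alignment_score(row1, row2):
    """Sum of per-column similarity values for two equal-length rows."""
    assert len(row1) == len(row2)
    return sum(ms2(a, b) for a, b in zip(row1, row2))

print(alignment_score("TGCCTAG",  "-GC-TGT"))   # Alignment #1: -3
print(alignment_score("TGCCTAG-", "--GCT-GT"))  # Alignment #2: -8
print(alignment_score("TGCCTAG-", "-GC-TG-T"))  # Alignment #3: -7
```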

• On occasion, want to further mold the alignment to minimize the number of gaps, i.e., the number of maximal runs of indels.
• In the example above, Alignment #1 has 2 gaps, Alignment #2 has 3 gaps, and Alignment #3 has 4 gaps.
• Several types of gap-cost schemes:
• Constant, e.g., score + Wg * G, where Wg is the penalty for a gap and G is the number of gaps.
• Affine, e.g., score + sum over gaps i of (Wgs + Gi * Wgc), where Wgs is the penalty for starting a gap, Gi is the length in symbols of gap i, and Wgc is the penalty for extending a gap by one symbol.
• Arbitrary, e.g., score + sum over gaps i of f(Gi), where f is an arbitrary function of gap length.
• Example: Three alignments of GCTGT and TGCCTAG (constant gap-cost scheme / Wg = -2):

```
Alignment #1:   TGCCTAG
                -GC-TGT

   cost = -3 + 2 + 2 + -3 + 2 + -1 + -2 + (2 * -2)
        = -7

Alignment #2:   TGCCTAG-
                --GCT-GT

   cost = -3 + -3 + -2 + 2 + 2 + -3 + 2 + -3 + (3 * -2)
        = -14

Alignment #3:   TGCCTAG-
                -GC-TG-T

   cost = -3 + 2 + 2 + -3 + 2 + -1 + -3 + -3 + (4 * -2)
        = -15
```
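The constant gap-cost scheme above can be mechanized by counting the maximal runs of '-' in each row and adding Wg per run to the column-sum score. A sketch (function names are my own; `re.findall(r"-+", ...)` counts maximal dash runs):

```python
# Constant gap-cost scoring (Wg = -2 per gap) on top of the column-sum
# score; a gap is a maximal run of '-' in either row (a sketch).

import re

def ms2(x, y):
    if x == "-" or y == "-":
        return -3
    if x == y:
        return 2
    return -1 if {x, y} <= {"A", "G"} or {x, y} <= {"C", "T"} else -2

def count_gaps(row):
    """Number of maximal runs of '-' in one alignment row."""
    return len(re.findall(r"-+", row))

def gapped_score(row1, row2, wg=-2):
    base = sum(ms2(a, b) for a, b in zip(row1, row2))
    return base + wg * (count_gaps(row1) + count_gaps(row2))

print(gapped_score("TGCCTAG",  "-GC-TGT"))   # Alignment #1: -7
print(gapped_score("TGCCTAG-", "--GCT-GT"))  # Alignment #2: -14
print(gapped_score("TGCCTAG-", "-GC-TG-T"))  # Alignment #3: -15
```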

• Best alignment = alignment with highest score.
• In both of the examples, the best alignment is Alignment #1. The reader can verify that Alignment #1 is the best alignment relative to all possible alignments of the given sequences.
• Several options for match-extent of s and s':

• Global alignment (Needleman-Wunsch): Look for best alignment of all of s with all of s'.
• Useful in comparing very closely related sequences.

```        -------------------------
|#####--###-##-####-##--|
|--##-####-######-######|
-------------------------
```

• Semi-global alignment: Look for best alignment of s with s' that ignores various combinations of initial and final indels.
• There are sixteen possible forms of semi-global alignment, one of which is global alignment (= do not ignore any initial or final indels).
• Two variants (ignore initial indels in s and final indels in s' / ignore final indels in s and initial indels in s') are useful in determining degree of sequence overlap (DNA sequence assembly).

```
         ------------
---------|##-######-|########
#########|####-##-##|--------
         ------------

         ------------
#########|##-######-|--------
---------|####-##-##|########
         ------------
```

• Note that (approximate) pattern matching looks for best alignment of all of s with part of s', ignoring initial and final indels w.r.t. s', and hence is a special case of semi-global alignment.

```
      --------------
######|#-#####--##-|########
------|####-####-##|--------
      --------------
```

• Local alignment (Smith-Waterman): Look for the best alignment of a region of s with a region of s' => pattern detection!
• Useful in comparing distantly related sequences.

```
        ----------
  ######|##-##-##|####
      ##|#-#####-|#######
        ----------
```

• Algorithms
• Exact
• All forms of non-gap-cost as well as global constant/affine gap-cost pairwise alignment can be done in quadratic, i.e., O(|s||s'|), time and space; however, faster algorithms are possible if you are looking for alignments of very similar sequences (Gusfield (1997), Chapters 11 and 12; Setubal and Meidanis (1997), Sections 3.1-3.3).
• There are versions of these algorithms that run in quadratic time and linear space (Gusfield (1997), Section 12.1; Setubal and Meidanis (1997), Section 3.3.1)
• Global arbitrary gap-cost pairwise alignment can be done in cubic time and quadratic space (Gusfield (1997), Section 11.8.5).
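The quadratic-time global alignment computation can be sketched as a Needleman-Wunsch dynamic program over a |s| x |t| table, scored here with MS2 (per-symbol indel cost -3, no separate gap-opening penalty; function names are my own). Its result agrees with the best alignment of GCTGT and TGCCTAG claimed above:

```python
# A minimal Needleman-Wunsch global alignment score under MS2
# (quadratic time and space; a sketch, not production code).

def ms2(x, y):
    if x == "-" or y == "-":
        return -3
    if x == y:
        return 2
    return -1 if {x, y} <= {"A", "G"} or {x, y} <= {"C", "T"} else -2

def nw_score(s, t):
    """Best global alignment score of s and t under MS2."""
    rows, cols = len(s) + 1, len(t) + 1
    D = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):                  # align prefix of s to nothing
        D[i][0] = D[i - 1][0] + ms2(s[i - 1], "-")
    for j in range(1, cols):                  # align prefix of t to nothing
        D[0][j] = D[0][j - 1] + ms2("-", t[j - 1])
    for i in range(1, rows):
        for j in range(1, cols):
            D[i][j] = max(D[i - 1][j - 1] + ms2(s[i - 1], t[j - 1]),  # sub/match
                          D[i - 1][j] + ms2(s[i - 1], "-"),           # delete
                          D[i][j - 1] + ms2("-", t[j - 1]))           # insert
    return D[-1][-1]

print(nw_score("GCTGT", "TGCCTAG"))  # -3, the score of Alignment #1 above
```

Tracing back through the table recovers the alignment itself; the linear-space versions cited above recompute rows rather than storing the whole table.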
• Heuristic
• Used in database search! (see below)
• Software
• Global: GAP (DNA/proteins)
• Local: SIM (proteins)
• Database Search (Mount (2004), Chapter 6; Gusfield (1997), Section 15)
• Background
• Heuristic pairwise alignment algorithms are required to perform database search in a reasonable amount of time on large databases.
• Even if sufficient computing power is available to do quadratic-time string comparison in database search in a reasonable amount of time, heuristics are still useful as filters that remove "obviously" bad database entries from consideration by quadratic-time exact algorithms.
• Algorithms
• Exact
• Uses exact pairwise local alignment algorithm (Smith-Waterman).
• Heuristic
• Two most famous algorithms are FASTA and BLAST.
• Run in "approximately" linear time and space.
• Software

• 10:50 -- 11:55: Multiple Sequence Alignment

• Multiple Sequence Alignment (Mount (2004), Chapter 5)
• Multiple sequence alignments are good for detecting patterns common to multiple sequences; they are especially useful in detecting mutation-degraded patterns common to a phylogenetically diverse set of sequences, e.g., protein family diagnostic patterns.
• If you are doing phylogenetic analysis, multiple sequence alignment is critical for establishing homology of sequence-positions across sequences.
• Background
• Most popular alignment functions constructed from global pairwise alignments among pairs of strings from the given set.
• Phylogenetic multiple sequence alignment
• Given an (evolutionary) tree, label the leaves of this tree with the given sequences and assign sequences to the internal nodes; the cost of the alignment is then the sum of the scores of the best pairwise alignments along the tree edges.
• Example: Tree alignment for given sequences {AAGA, TGGA, ATAA, ATTA} relative to given tree T below:

```
T:
    AAGA            TGGA
        \          /
         AAAA--TTAA
        /          \
    ATTA            ATAA

Edge alignments:

    AAGA      cost = 2 + 2 + -1 + 2
    AAAA           = 5

    ATTA      cost = 2 + -2 + -2 + 2
    AAAA           = 0

    AAAA      cost = -2 + -2 + 2 + 2
    TTAA           = 0

    TTAA      cost = 2 + -2 + -1 + 2
    TGGA           = 1

    TTAA      cost = -2 + 2 + 2 + 2
    ATAA           = 4

cost of tree alignment = 5 + 0 + 0 + 1 + 4
                       = 10
```
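Given the internal-node labels, the tree alignment cost is just a sum of edge-wise pairwise scores. A sketch checking the example above (names `pair_score` and `EDGES` are my own; the hard part of the problem, choosing the internal labels, is assumed done):

```python
# Sum the per-edge pairwise scores for the tree alignment example above
# (a sketch; internal labels AAAA and TTAA are taken as given).

def ms2(x, y):
    if x == "-" or y == "-":
        return -3
    if x == y:
        return 2
    return -1 if {x, y} <= {"A", "G"} or {x, y} <= {"C", "T"} else -2

def pair_score(u, v):
    return sum(ms2(a, b) for a, b in zip(u, v))

EDGES = [("AAGA", "AAAA"), ("ATTA", "AAAA"),
         ("AAAA", "TTAA"), ("TTAA", "TGGA"), ("TTAA", "ATAA")]

total = sum(pair_score(u, v) for u, v in EDGES)
print(total)  # 10
```

The NP-hardness cited below comes from having to search over all possible internal-node sequences (and, in one version, all trees), not from this final summation.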

• Has ideal biological justification; creates the alignment that assumes the least amount of evolutionary change relative to a given tree, and hence may be most plausible in an evolutionary sense.
• Also has version in which tree is inferred at same time as alignment.
• Both versions are NP-hard (Wang and Jiang (1994); Wareham (1995))
• Sum-of-Pairs (SP) multiple sequence alignment
• The cost of an alignment is the sum of the scores of the best pairwise alignments between all distinct pairs of given strings.
• Example: SP alignment for given sequences {AAGA, TGGA, ATAA, ATTA}:

```    Pairwise alignments:

AAGA      cost = -2 + -1 + 2 + 2
TGGA           = 1

AAGA      cost = 2 + -2 + -1 + 2
ATAA           = 1

AAGA      cost = 2 + -2 + -2 + 2
ATTA           = 0

TGGA      cost = -2 + -2 + -1 + 2
ATAA           = -3

TGGA      cost = -2 + -2 + -2 + 2
ATTA           = -4

ATAA      cost = 2 + 2 + -2 + 2
ATTA           = 4

cost of SP alignment = 1 + 1 + 0 + -3 + -4 + 4
= -1
```
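The SP cost of the example can be checked by summing pairwise scores over all distinct pairs, e.g., with `itertools.combinations`. A sketch (the function name `sp_cost` is my own; the example's sequences are equal-length and gap-free, so no pairwise alignment search is needed here):

```python
# Sum-of-Pairs cost: sum the pairwise scores over all distinct pairs of
# the given (equal-length, gap-free) sequences (a sketch of the example).

from itertools import combinations

def ms2(x, y):
    if x == "-" or y == "-":
        return -3
    if x == y:
        return 2
    return -1 if {x, y} <= {"A", "G"} or {x, y} <= {"C", "T"} else -2

SEQS = ["AAGA", "TGGA", "ATAA", "ATTA"]

def sp_cost(seqs):
    return sum(sum(ms2(a, b) for a, b in zip(u, v))
               for u, v in combinations(seqs, 2))

print(sp_cost(SEQS))  # -1
```

For m sequences this sums over m(m-1)/2 pairs, which is where the "nice mathematical properties" (and the computational expense) of SP alignment come from.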

• Has no biological justification; however, does have nice mathematical properties.
• Is NP-hard (Wang and Jiang (1994))
• Algorithms
• Exact
• Phylogenetic: Exponential time and space.
• Sum-of-Pairs: Exponential time and space.
• Contrary to many statements in the literature, SP alignment requires more time to compute than phylogenetic alignment; however, SP alignment is popular because there are nice cost-cutting tricks that can be applied to the algorithm to reduce the running time and space for most (but not all) inputs (see program MSA below).
• Approximation
• There are various low-order polynomial time approximation algorithms for computing phylogenetic and SP alignments; however, they produce alignments that are within a multiplicative factor of two of optimal.
• Heuristics
• Many low-order polynomial-time heuristics. Two main types are:
• Progressive methods: Build an evolutionary tree for a group of sequences and build up multiple alignment by aligning closest pair of sequences in the tree and then adding other sequences.
• The evolutionary tree used in this heuristic algorithm is typically itself built using a heuristic algorithm, e.g., cluster analysis, neighbor-joining.
• Repeated-motif methods: Find blocks common to all sequences and use as alignment "anchors", and recursively repeat this process on interblock regions.
• Software
• Dealing with Bioinformatics Software II: Caveat Emptor
• Two important questions:
• What (exactly) do you want to do?
• What is the appropriate technology to do what you want to do?
• Advanced technology is not always the answer (Rutherford's Dictum); however, if you need technology, where should you go?
• Bioinformaticians
• Other biologists who have worked on problems similar to your own
• Recent books on bioinformatics for biologists, e.g., Mount (2004).
• Recent review articles in biological journals.
• On-line repositories
• CS folk
Note that this list is ordered by decreasing utility of advice. Any source listed above can be used; however, the further down the list it is, the more salt should be taken with said advice ...

## References

• Cormen, T.H., Leiserson, C.E., Rivest, R.L., and Stein, C. (2001) Introduction to Algorithms (Second Edition). MIT Press.
• Gusfield, D. (1997) Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press.
• Mount, D.W. (2004) Bioinformatics: Sequence and Genome Analysis. Second edition. Cold Spring Harbor Laboratory Press; Cold Spring Harbor, NY.
• Setubal, J. and Meidanis, J. (1997) Introduction to Computational Molecular Biology. PWS Publishing Company.
• Wang, L. and Jiang, T. (1994) "On the complexity of multiple sequence alignment." Journal of Computational Biology, 1(4), 337-348.
• Wareham, T. (1995) "A Simplified Proof of the NP- and MAX SNP-hardness of Multiple Sequence Tree Alignment." Journal of Computational Biology, 2(4), 509-514.
• Wareham, T. (2004) "Inside Computer Programming." Enrichment Mini-Course Notes. [PDF (42 pages)]