# Biology 4900, Fall '09 Course Diary (Wareham)

## Friday, Nov 13

• 2:00 -- 3:20: What is Bioinformatics and Why Should You Care?

• Why Bioinformatics / Computational Biology?
• The main reason for applying computational techniques to biological data is that they are now too large to analyze by hand.
• Biological datasets are getting larger by the day; fast computer programs are our only hope of being able to perform the necessary analyses.
• Two truths about application-oriented computing:
• Computers do exactly what you tell them to do, and nothing else => you need to know exactly what you want to do.
• Chances are, someone else besides you, i.e., a computer programmer, will be telling the computer what to do => you need to know how to deal with CS folk and their various concerns, strengths, and bias.
• Dealing with Bioinformatics Software I: What Are They Talking About?
• To understand the software, you need to understand its creators. One way of doing this is to look at the words they use to describe what they do.
• Frequently-heard CS terms (Cormen et al. (2001); Wareham (2004)):
• Problems, Algorithms, and Programs
• A (computational) problem is an abstract relation of inputs and outputs; it just says what you're given and what you want, not how you go about getting it.

DNA Sequence Assembly
Input: A set S of m DNA fragments of maximum length n.
Output: The original DNA sequence, i.e., a DNA sequence r such that each s in S is a substring of r.

Evolutionary Tree Reconstruction
Input: An m x n object by character matrix M.
Output: The tree T with m leaves that gives the evolutionary history of speciation events that derives the m taxa in M from their most recent common ancestor.

• Many problems are optimization problems in which, among all possible solutions for a given input, you want the one that maximizes or minimizes some function.

Shortest DNA Sequence Assembly
Input: A set S of m DNA fragments of maximum length n.
Output: The shortest DNA sequence r such that each s in S is a substring of r.

Optimal Evolutionary Tree under Criterion X
Input: An m x n object-by-character matrix M.
Output: The tree T with m leaves that has optimal score for criterion X relative to M.

Note that in the context of computational problems in biology, the optimal solution need not be the right one -- rather, we invoke Occam's Razor / the Principle of Parsimony (albeit in a warped fashion) and consider the optimal solution most likely to be right and hence a good one to investigate first.

• An algorithm for a problem X is an abstractly-stated method for computing the output associated with an input of X.
• A program for problem X is an executable implementation of an algorithm for X that is written in some computer language, e.g., C, C++, Java.
• CS folk make this three-way distinction in order to write programs good for all computers; also makes analysis of required running time and space easier (albeit more abstract).
• CS folk typically work on simplified versions of problems to make analysis easier; moreover, such simplification (or misunderstandings) may accidentally and fundamentally modify your original problem (spherical cows).
• Take-Home Lesson #1: When formulating and dealing with computational problems, you need to both (1) be careful when interpreting what CS analyses of problems actually mean, and (2) make sure that the problem being analyzed and/or programmed is actually the one you want solved.
• Exact vs. Approximation vs. Heuristic Algorithms
• A useful three-way classification of algorithms for optimization problems:
• An exact algorithm always returns the optimal solution for a given input.
• An approximation algorithm always returns a solution for a given input whose cost is within some additive of multiplicative factor of the cost of the optimal solution.
• A heuristic algorithm always returns a solution for a given input.
• Heuristics tend to run faster than approximation algorithms, which tend to run faster than exact algorithms.
• Take-Home Lesson #2: When choosing an algorithms, ask yourself what is more important, solution optimality or quick runtime -- you can seldom have both. If you do not answer this question yourself, someone else may answer it for you (and not to your benefit).
• (Worst-case) Time / Space Complexity: Big-Oh (O(f(n)) notation
• Useful measures of how fast an algorithm runs and how much computer memory an algorithm will require.
• In Big-Oh notation, n is some measure of the size of the input, e.g., maximum sequence length, number of taxa x number of characters, and f(n) is a function of n that upper-bounds the time or space usage of the algorithm as n goes to infinity.
• Average-case analyses may be more appropriate in many situations but are typically much harder to do.
• Function-description shorthand:
• O(n) => linear
• O(n^2) = O(n * n) => quadratic
• O(n^3) = O(n * n * n) => cubic
• O(n^c) for some constant c >= 1 => polynomial
• O(c^n) for some constant c > 1 => exponential
• Take-Home Lesson #3: In general, time and space complexity functions f() that are polynomials are good, and anything larger, e.g., exponential functions, is bad [handoutTSC.pdf]; however, this may be relaxed for particular inputs (if n is large, low-order polynomial is good; if n is *** very *** small, exponential is OK).
• Take-Home Lesson #4: Existing computer hardware can be left to run as long as necessary, but computer memory is typically limited and adding more memory can get very expensive; hence, space is often the limiting resource for computations on large inputs, and space complexity should be very low-order polynomial whenever possible.
• Computational Intractability: NP-hardness/completeness.
• A useful measure of the difficulty of a problem.
• If a problem X is NP-hard/complete, it probably does not have a fast, i.e., polynomial-time, algorithm.
• Take-Home Lesson #5: Computational intractability is not necessarily the end of the road; other types of fast algorithms may still be possible, e.g., approximation / heuristic algorithms.
• Learning to understand CS folk can indeed be difficult; however, your efforts will be repaid many times over, both by opening up new opportunities for research collaboration and by dramatically increasing the chances that you will actually get what you want and need out of any such collaboration.
• Three basic types of biological-data analysis problems:

• Infer structure from partial information, e.g., DNA sequence (contig) assembly, evolutionary tree (phylogeny) inference.
• Find occurrences of known patterns in a given structure (pattern matching).
• Derive common patterns in a given set of structures (pattern detection).

In Dr. Marshall's lectures next week, you will see several examples of the first type of problem relative to molecular-sequence data. We will spend most of our remaining lecture-time today looking at examples of the latter two types of problems.

• The role of evolutionary theory in computational biology.
• Evolution as descent with modification.
• As a first approximation, molecular sequence mutations and indels occur at random in the sequence.
• Once species diverge from a common ancestor, mutations in sequences inherited from that ancestor occur independently in the descendant species.
• Mutations in non-functional regions do not affect an organism and are almost always passed on to descendants; mutations in functional regions are almost always deleterious, i.e., they kill the organism, and are rarely passed on to descendants => functional regions evolve more slowly than non-functional regions.
• This is commonly assumed, but recent evidence suggests the relationship between conservation and functionality is more complex (see Monroe (2009)).
• Two consequences:

• Similar structures tend to have similar functions => To infer function of new structure, assess similarity to structures of known function! (pattern matching)
• Conservation is indicative of function => To find functionally important regions, look for conserved regions! (pattern detection)

• Evolution is the reason why we can assume that similarity (and hence the basic problem of looking for similarity among structures) is meaningful in computational biology.

• 3:30 -- 4:50: Common Bioinformatics problems

• Recall our two basic pattern-problems:

• Pattern Matching: Given a pattern P and a text-string T, find all occurrences of P in T.
• Pattern Detection: Given a set T of text strings, find the set P of all patterns p such that p occurs at least once in each string t in T.

A finer further distinction for pattern-problems is exact vs. approximate pattern matching, i.e., how do you interpret "occurrence" of a pattern in a text?

• We will now give a brief overview of each problem and its known algorithmic options; for more details on these problems, see the previous Biology 4900 diary for this lecture and the references cited below.
• Pattern Matching
• Algorithms
• Exact pattern matching
• Can be done in linear, i.e., O(|P| + |T|), time and space (Gusfield (1997), Section I).
• Approximate pattern matching
• Can be done in quadratic, i.e., O(|P||T|), time and space; however, faster algorithms are possible if you are looking at patterns that have very good matches in the text (Gusfield (1997), Chapters 11 and 12; Setubal and Meidanis (1997), Sections 3.1-3.3).
• CDART (protein domain architectures)
• eMOTIF (generic sequence patterns)
• InterPro (protein families, domains, and functional sites)
• BLOCKS (protein domains)
• Prosite (protein families and domains)
• SMART (signaling domains)
• TargetDB (cellular-localization domains)
• Pattern Detection: Pairwise Sequence Alignment (Mount (2004), Chapter 3)
• Alignment generalizes notion of match to two whole sequences s and s' -- are no longer just looking for best match of a pattern and a text but are instead looking for best match of s against s'. An alignment also provides a very nice visual display of the sequence-matching.
• Pairwise alignments good for detecting patterns common to two sequences.
• Several options for match-extent of s and s':

• Global alignment (Needleman-Wunsh): Look for best alignment of all of s with all of s'.
• Useful in comparing very closely related sequences.

```        -------------------------
|#####--###-##-####-##--|
|--##-####-######-######|
-------------------------
```

• Semi-global alignment: Look for best alignment of s with s' that ignores various combinations of initial and final indels.
• There are sixteen possible forms of semi-global alignment, one of which is global alignment (= do not ignore any initial or final indels).
• Two variants (ignore initial indels in s and final indels in s' / ignore final indels in s and initial indels in ) are useful in determining degree of sequence overlap (DNA sequence assembly).

```             ------------
---------|##-######-|########
#########|####-##-##|--------
------------

------------
#########|##-######-|--------
---------|####-##-##|########
------------
```

• Note that (approximate) pattern matching looks for best alignment of all of s with part of s', ignoring initial and final indels w.r.t. s', and hence is a special case of semi-global alignment.

```          --------------
######|#-#####--##-|########
------|####-####-##|--------
--------------
```

• Local alignment (Smith-Waterman): Look for the best alignment of a region of s with a region of s' => pattern detection!
• Useful in comparing distantly related sequences.

```          ----------
######|##-##-##|####
##|#-#####-|#######
----------
```

• Algorithms
• Exact
• All forms of non-gap-cost as well as global constant/affine gap-cost pairwise alignment can be done in quadratic, i.e., O(|s||s'|), time and space; however, faster algorithms are possible if you are looking for alignments of very similar sequences (Gusfield (1997), Chapters 11 and 12; Setubal and Meidanis (1997), Sections 3.1-3.3).
• There are versions of these algorithms that run in quadratic time and linear space (Gusfield (1997), Section 12.1; Setubal and Meidanis (1997), Section 3.3.1)
• Global arbitrary gap-cost pairwise alignment can be done in cubic time time and quadratic space (Gusfield (1997), Section 11.8.5).
• Heuristic
• Used in database search! (see below)
• Pattern Detection: Database Search (Mount (2004), Chapter 6; Gusfield (1997), Section 15)
• Background
• Heuristic pairwise alignment algorithms required to perform database search in reasonable amount of time on large databases.
• Even if sufficient computing power is available to do quadratic time string comparison in database search in a reasonable amount of time, heuristics are still useful as filters to remove "obviously" bad database entries from being considered with quadratic-time exact algorithms.
• Algorithms
• Exact
• Uses exact pairwise local alignment algorithm (Smith-Waterman).
• Heuristic
• Two most famous algorithms are FASTA and BLAST.
• Run in "approximately" linear time and space.

• Pattern Detection: Multiple Sequence Alignment (Mount (2004), Chapter 5)
• Multiple sequence alignments good for detecting patterns common to multiple sequences; are especially useful in detecting mutation-degraded patterns common in a phylogenetically diverse set of sequences, e.g,, protein family diagnostic patterns.
• If you are doing phylogenetic analysis, multiple sequence alignment is critical for establishing homology of sequence-positions across sequences.
• Algorithms
• Exact
• Exponential time and space in general; however, there are nice cost-cutting tricks that can be applied to reduce the running time and space for most (but not all) inputs (see program MSA below).
• Approximation
• There are various low-order polynomial time approximation algorithms; however, they produce alignments that are within a multiplicative factor of two of optimal.
• Heuristics
• Many low-order polynomial-time heuristics. Two main types are:
• Progressive methods: Build an evolutionary tree for a group of sequences and build up multiple alignment by aligning closest pair of sequences in the tree and then adding other sequences.
• The evolutionary tree used in this heuristic algorithm is typically itself built using a heuristic algorithm, e.g., cluster analysis, neighbor-joining.
• Repeated-motif methods: Find blocks common to all sequences and use as alignment "anchors", and recursively repeat this process on interblock regions.
• Dealing with Bioinformatics Software II: Caveat Emptor
• Two important questions:
• What (exactly) do you want to do?
• What is the appropriate technology to do what you want to do?
• Advanced technology is not always the answer; however, if you need technology, where should you go?
• Bioinformaticians
• Other biologists who have worked on problems similar to your own
• Recent books on bioinformatics for biologists, e.g., Mount (2004).
• Recent review articles in biological journals.
• On-line repositories
• CS folk
Note that this list is ordered by decreasing utility of advice. Any source listed above can be used; however, the further down the list it is, the more salt should be taken with said advice ...

## References

• Cormen, T.H., Leiserson, C.E., Rivest, R.L., and Stein, C. (2001) Introduction to to Algorithms (Second Edition). MIT Press.
• Gusfield, D. (1997) Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press.
• Monroe, D. (2009) "Genomic Clues to DNA Treasure Sometimes Lead Nowhere." Science, 325, 142-143. [PDF]
• Mount, D.W. (2004) Bioinformatics: Sequence and Genome Analysis. Second edition. Cold Spring Harbor Laboratory Press; Cold Spring Harbor, NY.
• Setubal, J. and Meidanis, J. (1997) Introduction to Computational Molecular Biology. PWS Publishing Company.
• Wang, L. and Jiang, T. (1994) "On the complexity of multiple sequence alignment." Journal of Computational Biology, 1(4), 337-348.
• Wareham, T. (1995) "A Simplified Proof of the NP- and MAX SNP-hardness of Multiple Sequence Tree Alignment." Journal of Computational Biology, 2(4), 509-514.
• Wareham, T. (2004) "Inside Computer Programming." Enrichment Mini-Course Notes. [PDF (42 pages)]