Evolutionary Tree Reconstruction
Input: An m x n object by
character matrix M.
Output: The tree T with m
leaves that gives the evolutionary history
of speciation events that derives the
m taxa in M from their most
recent common ancestor.
Shortest DNA Sequence Assembly
Input: A set S of m
DNA fragments of maximum length n.
Output: The shortest
DNA sequence r such
that each s in S is a
substring of r.
Optimal Evolutionary Tree under Criterion X
Input: An m x n object-by-character matrix M.
Output: The tree T with m
leaves that has optimal score for criterion
X relative to M.
Note that in the context of computational problems in biology, the optimal solution need not be the right one -- rather, we invoke Occam's Razor / the Principle of Parsimony (albeit in a warped fashion) and consider the optimal solution most likely to be right and hence a good one to investigate first.
We will spend most of our lecture-time looking at the latter two problems relative to molecular sequence data.
A finer further distinction for pattern problems is exact vs. approximate pattern matching, i.e., how do you interpret "occurrence" of a pattern in a text?
MD1: MD2: A G T C - A G T C - A 0 1 1 1 1 A 0 1 2 2 4 G 1 0 1 1 1 G 1 0 2 2 4 T 1 1 0 1 1 T 2 2 0 1 4 C 1 1 1 0 1 C 2 2 1 0 4 - 1 1 1 1 * - 4 4 4 4 *
MS1: MS2: A G T C - A G T C - A 2 -1 -1 -1 -1 A 2 -1 -2 -2 -3 G -1 2 -1 -1 -1 G -1 2 -2 -2 -3 T -1 -1 2 -1 -1 T -2 -2 2 -1 -3 C -1 -1 -1 2 -1 C -2 -2 -1 2 -3 - -1 -1 -1 -1 * - -3 -3 -3 -3 *
S1: S2: S3: 123 123 123 AGA AGA AGA Cost of S1 = | | | M(A,-) = -3 | Del 1 | Del 2 | Del 3 V V V Cost of S2 = GA AA AG M(G,-) + M(A,G) = | | -3 + -1 = -4 | Sub 1 | Del 2 V V Cost of S3 = GA G M(A,-) + M(A,-) + | M(-,A) = | Ins 1 -3 + -3 + -3 = -9 V GA
Match #1: TTGTCA cost = -1 + -1 = -2 TAC Match #2: TTGTCA cost = -1 + -3 = -4 TA-C Match #3: TTGTCA cost = -3 + -2 = -5 T-AC
The reader can verify that Match #1 above has the highest score of any match of the given pattern to the given text, and is therefore the best match.
Alignment #1: TGCCTAG -GC-TGT cost = -3 + 2 + 2 + -3 + 2 + -1 + -2 = -3 Alignment #2: TGCCTAG- --GCT-GT cost = -3 + -3 + -2 + 2 + 2 + -3 + 2 + -3 = -8 Alignment #3: TGCCTAG- -GC-TG-T cost = -3 + 2 + 2 + -3 + 2 + -1 + -3 + -3 = -7
Alignment #1: TGCCTAG -GC-TGT cost = -3 + 2 + 2 + -3 + 2 + -1 + -2 + (2 * -2) = -7 Alignment #2: TGCCTAG- --GCT-GT cost = -3 + -3 + -2 + 2 + 2 + -3 + 2 + -3 + (3 * -2) = -14 Alignment #3: TGCCTAG- -GC-TG-T cost = -3 + 2 + 2 + -3 + 2 + -1 + -3 + -3 + (4 * -2) = -15
------------------------- |#####--###-##-####-##--| |--##-####-######-######| -------------------------
------------ ---------|##-######-|######## #########|####-##-##|-------- ------------ ------------ #########|##-######-|-------- ---------|####-##-##|######## ------------
-------------- ######|#-#####--##-|######## ------|####-####-##|-------- --------------
---------- ######|##-##-##|#### ##|#-#####-|####### ----------
T: AAGA TGGA \ / AAAA--TTAA / \ ATTA ATAA Edge alignments: AAGA cost = 2 + 2 + -1 + 2 AAAA = 5 ATTA cost = 2 + -2 + -2 + 2 AAAA = 0 AAAA cost = -2 + -2 + 2 + 2 TTAA = 0 TTAA cost = 2 + -2 + -1 + 2 TGGA = 1 TTAA cost = -2 + 2 + 2 + 2 ATAA = 4 cost of tree alignment = 5 + 0 + 0 + 1 + 4 = 10
Pairwise alignments: AAGA cost = -2 + -1 + -1 + 2 TGAA = -2 AAGA cost = 2 + -2 + 2 + 2 ATGA = 4 AAGA cost = 2 + -2 + -2 + 2 ATTA = 0 TGAA cost = -2 + -2 + 1 + 2 ATGA = -3 TGAA cost = -2 + -2 + -2 + 2 ATTA = -4 ATGA cost = 2 + 2 + -1 + 2 ATTA = 5 cost of SP alignment = -2 + 4 + 0 + -3 + -4 + 5 = 0