Evolutionary Tree Reconstruction
Input: An m x n object by
character matrix M.
Output: The tree T with m
leaves that gives the evolutionary history
of speciation events that derives the
m taxa in M from their most
recent common ancestor.
Shortest DNA Sequence Assembly
Input: A set S of m
DNA fragments of maximum length n.
Output: The shortest
DNA sequence r such
that each s in S is a
substring of r.
Optimal Evolutionary Tree under Criterion X
Input: An m x n object-by-character matrix M.
Output: The tree T with m
leaves that has optimal score for criterion
X relative to M.
Note that in the context of computational problems in biology, the optimal solution need not be the right one -- rather, we invoke Occam's Razor / the Principle of Parsimony (albeit in a warped fashion) and consider the optimal solution most likely to be right and hence a good one to investigate first.
We will spend most of our lecture-time looking at the latter two problems relative to molecular sequence data.
A finer further distinction for pattern problems is exact vs. approximate pattern matching, i.e., how do you interpret "occurrence" of a pattern in a text?
MD1: MD2: A T G C - A T G C - A 0 1 1 1 1 A 0 1 2 2 4 T 1 0 1 1 1 T 1 0 2 2 4 G 1 1 0 1 1 G 2 2 0 1 4 C 1 1 1 0 1 C 2 2 1 0 4 - 1 1 1 1 * - 4 4 4 4 *
MS1: MS2: A T G C - A T G C - A 2 -1 -1 -1 -1 A 2 -1 -2 -2 -3 T -1 2 -1 -1 -1 T -1 2 -2 -2 -3 G -1 -1 2 -1 -1 G -2 -2 2 -1 -3 C -1 -1 -1 2 -1 C -2 -2 -1 2 -3 - -1 -1 -1 -1 * - -3 -3 -3 -3 *
S1: S2: S3: 123 123 123 AGA AGA AGA Cost of S1 = | | | M(A,-) = -3 | Del 1 | Del 2 | Del 3 V V V Cost of S2 = GA AA AG M(G,-) + M(A,G) = | | -3 + -2 = -5 | Sub 1 | Del 2 V V Cost of S3 = GA G M(A,-) + M(A,-) + | M(-,A) = | Ins 1 -3 + -3 + -3 = -9 V GA
Match #1: TTGATA cost = -1 + -1 = -2 TCT Match #2: TTGATA cost = -1 + -3 = -4 TC-T Match #3: TTGATA cost = -3 + -2 = -5 T-CT
The reader can verify that Match #1 above has the highest score of any match of the given pattern to the given text, and is therefore the best match.
Alignment #1: TGCCTAG -GC-ATT cost = -3 + 4 + 4 + -3 + -1 + -1 + -2 = -2 Alignment #2: TGCCTAG- --GC-ATT cost = -3 + -3 + -1 + 4 + -3 + 4 + -2 + -3 = -7 Alignment #3: TGCCTAG- -GC-AT-T cost = -3 + 4 + 4 + -3 + -1 + -1 + -3 + -3 = -6
Alignment #1: TGCCTAG -GC-ATT cost = -3 + 4 + 4 + -3 + -1 + -1 + -2 + (2 * -2) = -6 Alignment #2: TGCCTAG- --GC-ATT cost = -3 + -3 + -1 + 4 + -3 + 4 + -2 + -3 + (3 * -2) = -13 Alignment #3: TGCCTAG- -GC-AT-T cost = -3 + 4 + 4 + -3 + -1 + -1 + -3 + -3 + (4 * -2) = -14
------------------------- |#####--###-##-####-##--| |--##-####-######-######| -------------------------
------------ ---------|##-######-|######## #########|####-##-##|-------- ------------ ------------ #########|##-######-|-------- ---------|####-##-##|######## ------------
-------------- ######|#-#####--##-|######## ------|####-####-##|-------- --------------
---------- ######|##-##-##|#### ##|#-#####-|####### ----------
T: AATA TGAA \ / AAAA--TTAA / \ ATTA ATAA Edge alignments: AATA cost = 4 + 4 + -1 + 4 AAAA = 11 ATTA cost = 4 + -1 + -1 + 4 AAAA = 6 AAAA cost = -1 + -1 + 4 + 4 TTAA = 6 TTAA cost = 4 + -2 + 4 + 4 TGAA = 10 TTAA cost = -1 + 4 + 4 + 4 ATAA = 11 cost of tree alignment = 11 + 6 + 6 + 10 + 11 = 44
Pairwise alignments: AATA cost = -1 + -2 + -1 + 4 TGAA = 0 AATA cost = 4 + -1 + -1 + 4 ATAA = 6 AATA cost = 4 + -1 + 4 + 4 ATTA = 11 TGAA cost = -1 + -2 + 4 + 4 ATAA = 5 TGAA cost = -1 + -2 + -1 + 4 ATTA = 0 ATAA cost = 4 + 4 + -1 + 4 ATTA = 11 cost of SP alignment = 0 + 6 + 11 + 5 + 0 + 11 = 33