Evolutionary Tree Reconstruction
Input: An m x n object by
character matrix M.
Output: The tree T with m
leaves that gives the evolutionary history
of speciation events that derives the
m taxa in M from their most
recent common ancestor.
Shortest DNA Sequence Assembly
Input: A set S of m
DNA fragments of maximum length n.
Output: The shortest
DNA sequence r such
that each s in S is a
substring of r.
Optimal Evolutionary Tree under Criterion X
Input: An m x n object-by-character matrix M.
Output: The tree T with m
leaves that has optimal score for criterion
X relative to M.
Note that in the context of computational problems in biology, the optimal solution need not be the right one -- rather, we invoke Occam's Razor / the Principle of Parsimony (albeit in a warped fashion) and consider the optimal solution most likely to be right and hence a good one to investigate first.
In Dr. Carr's lectures next week, you will see several examples of the first type of problem relative to molecular-sequence data. We will spend most of our remaining lecture-time today looking at examples of the latter two types of problems.
A finer further distinction for pattern-problems is exact vs. approximate pattern matching, i.e., how do you interpret "occurrence" of a pattern in a text?
Position Base 1 2 3 4 5 A 95% 25% 0% 0% 100% C 0% 25% 0% 50% 0% G 5% 30% 100% 50% 0% T 0% 20% 0% 0% 0%
MD1: MD2: A G T C - A G T C - A 0 1 1 1 1 A 0 1 2 2 4 G 1 0 1 1 1 G 1 0 2 2 4 T 1 1 0 1 1 T 2 2 0 1 4 C 1 1 1 0 1 C 2 2 1 0 4 - 1 1 1 1 * - 4 4 4 4 *
MS1: MS2: A G T C - A G T C - A 2 -1 -1 -1 -1 A 2 -1 -2 -2 -3 G -1 2 -1 -1 -1 G -1 2 -2 -2 -3 T -1 -1 2 -1 -1 T -2 -2 2 -1 -3 C -1 -1 -1 2 -1 C -2 -2 -1 2 -3 - -1 -1 -1 -1 * - -3 -3 -3 -3 *
S1: S2: S3: 123 123 123 AGA AGA AGA Cost of S1 = | | | M(A,-) = -3 | Del 1 | Del 2 | Del 3 V V V Cost of S2 = -GA A-A AG- M(G,-) + M(A,G) = | | -3 + -1 = -4 | Sub 1 | Del 1 V V Cost of S3 = G-A -G- M(A,-) + M(A,-) + | M(-,A) = | Ins 3 -3 + -3 + -3 = -9 V -GA
S1: S2: S3 = S1: 123 123 123 AGA AGA AGA Cost of S1 = | | Sub 1, | M(A,-) + M(G,G) + M(A,A) = | Del 1 | Del 2 | Del 1 -3 + 2 + 2 = 1 V V V Cost of S2 = -GA G-A -GA M(A,A) + M(G,-) + M(A,G) = 2 + -3 + -1 = -2 Cost of S3 = Cost of S1 = 1
Match #1: TTGTCA cost = 2 + -1 + -1 = 0 TAC Match #2: TTGTCA cost = 2 + -3 + -2 + 2 = -1 T-AC Match #3: TT-GTCA cost = 2 + -3 + -2 = -3 TAC Match #4: TTGTCA cost = 2 + -1 + -3 + 2 = 0 TA-C Match #5: TTGT-CA cost = 2 + -3 2 = 1 TAC
The reader can verify that Match #5 above has the highest scores of any match of the given pattern to the given text, and is therefore the best match.
Alignment #1: TGCCTAG -GC-TGT cost = -3 + 2 + 2 + -3 + 2 + -1 + -2 = -3 Alignment #2: TGCCTAG- --GCT-GT cost = -3 + -3 + -2 + 2 + 2 + -3 + 2 + -3 = -8 Alignment #3: TGCCTAG- -GC-TG-T cost = -3 + 2 + 2 + -3 + 2 + -1 + -3 + -3 = -7
Alignment #1: TGCCTAG -GC-TGT cost = -3 + 2 + 2 + -3 + 2 + -1 + -2 + (2 * -2) = -7 Alignment #2: TGCCTAG- --GCT-GT cost = -3 + -3 + -2 + 2 + 2 + -3 + 2 + -3 + (3 * -2) = -14 Alignment #3: TGCCTAG- -GC-TG-T cost = -3 + 2 + 2 + -3 + 2 + -1 + -3 + -3 + (4 * -2) = -15
------------------------- |#####--###-##-####-##--| |--##-####-######-######| -------------------------
------------ ---------|##-######-|######## #########|####-##-##|-------- ------------ ------------ #########|##-######-|-------- ---------|####-##-##|######## ------------
-------------- ######|#-#####--##-|######## ------|####-####-##|-------- --------------
---------- ######|##-##-##|#### ##|#-#####-|####### ----------
T: AAGA TGGA \ / AAAA--TTAA / \ ATTA ATAA Edge alignments: AAGA cost = 2 + 2 + -1 + 2 AAAA = 5 ATTA cost = 2 + -2 + -2 + 2 AAAA = 0 AAAA cost = -2 + -2 + 2 + 2 TTAA = 0 TTAA cost = 2 + -2 + -1 + 2 TGGA = 1 TTAA cost = -2 + 2 + 2 + 2 ATAA = 4 cost of tree alignment = 5 + 0 + 0 + 1 + 4 = 10
Pairwise alignments: AAGA cost = -2 + -1 + 2 + 2 TGGA = 1 AAGA cost = 2 + -2 + -1 + 2 ATAA = 1 AAGA cost = 2 + -2 + -2 + 2 ATTA = 0 TGGA cost = -2 + -2 + -1 + 2 ATAA = -3 TGGA cost = -2 + -2 + -2 + 2 ATTA = -4 ATAA cost = 2 + 2 + -2 + 2 ATTA = 4 cost of SP alignment = 1 + 1 + 0 + -3 + -4 + 4 = -1