Evolutionary Tree Reconstruction
Input: An m x n object by
character matrix M.
Output: The tree T with m
leaves that gives the evolutionary history
of speciation events that derives the
m taxa in M from their most
recent common ancestor.
Shortest DNA Sequence Assembly
Input: A set S of m
DNA fragments of maximum length n.
Output: The shortest
DNA sequence r such
that each s in S is a
substring of r.
Optimal Evolutionary Tree under Criterion X
Input: An m x n object-by-character matrix M.
Output: The tree T with m
leaves that has optimal score for criterion
X relative to M.
Note that in the context of computational problems in biology, the optimal solution need not be the right one -- rather, we invoke Occam's Razor / the Principle of Parsimony (albeit in a warped fashion) and consider the optimal solution most likely to be right and hence a good one to investigate first.
We will spend most of our lecture-time looking at the latter two problems relative to molecular sequence data.
A finer further distinction for pattern problems is exact vs. approximate pattern matching, i.e., how do you interpret "occurrence" of a pattern in a text?
MD1: MD2:
A T G C - A T G C -
A 0 1 1 1 1 A 0 1 2 2 4
T 1 0 1 1 1 T 1 0 2 2 4
G 1 1 0 1 1 G 2 2 0 1 4
C 1 1 1 0 1 C 2 2 1 0 4
- 1 1 1 1 * - 4 4 4 4 *
MS1: MS2:
A T G C - A T G C -
A 2 -1 -1 -1 -1 A 2 -1 -2 -2 -3
T -1 2 -1 -1 -1 T -1 2 -2 -2 -3
G -1 -1 2 -1 -1 G -2 -2 2 -1 -3
C -1 -1 -1 2 -1 C -2 -2 -1 2 -3
- -1 -1 -1 -1 * - -3 -3 -3 -3 *
S1: S2: S3:
123 123 123
AGA AGA AGA Cost of S1 =
| | | M(A,-) = -3
| Del 1 | Del 2 | Del 3
V V V Cost of S2 =
GA AA AG M(G,-) + M(A,G) =
| | -3 + -2 = -5
| Sub 1 | Del 2
V V Cost of S3 =
GA G M(A,-) + M(A,-) +
| M(-,A) =
| Ins 1 -3 + -3 + -3 = -9
V
GA
Match #1: TTGATA cost = -1 + -1 = -2
TCT
Match #2: TTGATA cost = -1 + -3 = -4
TC-T
Match #3: TTGATA cost = -3 + -2 = -5
T-CT
The reader can verify that Match #1 above has the highest score of any match of the given pattern to the given text, and is therefore the best match.
Alignment #1: TGCCTAG
-GC-ATT
cost = -3 + 4 + 4 + -3 + -1 + -1 + -2
= -2
Alignment #2: TGCCTAG-
--GC-ATT
cost = -3 + -3 + -1 + 4 + -3 + 4 + -2 + -3
= -7
Alignment #3: TGCCTAG-
-GC-AT-T
cost = -3 + 4 + 4 + -3 + -1 + -1 + -3 + -3
= -6
Alignment #1: TGCCTAG
-GC-ATT
cost = -3 + 4 + 4 + -3 + -1 + -1 + -2 + (2 * -2)
= -6
Alignment #2: TGCCTAG-
--GC-ATT
cost = -3 + -3 + -1 + 4 + -3 + 4 + -2 + -3 + (3 * -2)
= -13
Alignment #3: TGCCTAG-
-GC-AT-T
cost = -3 + 4 + 4 + -3 + -1 + -1 + -3 + -3 + (4 * -2)
= -14
-------------------------
|#####--###-##-####-##--|
|--##-####-######-######|
-------------------------
------------
---------|##-######-|########
#########|####-##-##|--------
------------
------------
#########|##-######-|--------
---------|####-##-##|########
------------
--------------
######|#-#####--##-|########
------|####-####-##|--------
--------------
----------
######|##-##-##|####
##|#-#####-|#######
----------
T:
AATA TGAA
\ /
AAAA--TTAA
/ \
ATTA ATAA
Edge alignments:
AATA cost = 4 + 4 + -1 + 4
AAAA = 11
ATTA cost = 4 + -1 + -1 + 4
AAAA = 6
AAAA cost = -1 + -1 + 4 + 4
TTAA = 6
TTAA cost = 4 + -2 + 4 + 4
TGAA = 10
TTAA cost = -1 + 4 + 4 + 4
ATAA = 11
cost of tree alignment = 11 + 6 + 6 + 10 + 11
= 44
Pairwise alignments:
AATA cost = -1 + -2 + -1 + 4
TGAA = 0
AATA cost = 4 + -1 + -1 + 4
ATAA = 6
AATA cost = 4 + -1 + 4 + 4
ATTA = 11
TGAA cost = -1 + -2 + 4 + 4
ATAA = 5
TGAA cost = -1 + -2 + -1 + 4
ATTA = 0
ATAA cost = 4 + 4 + -1 + 4
ATTA = 11
cost of SP alignment = 0 + 6 + 11 + 5 + 0 + 11
= 33