Evolutionary Tree Reconstruction
Input: An m x n object by
character matrix M.
Output: The tree T with m
leaves that gives the evolutionary history
of speciation events that derives the
m taxa in M from their most
recent common ancestor.
Shortest DNA Sequence Assembly
Input: A set S of m
DNA fragments of maximum length n.
Output: The shortest
DNA sequence r such
that each s in S is a
substring of r.
Optimal Evolutionary Tree under Criterion X
Input: An m x n object-by-character matrix M.
Output: The tree T with m
leaves that has optimal score for criterion
X relative to M.
Note that in the context of computational problems in biology, the optimal solution need not be the right one -- rather, we invoke Occam's Razor / the Principle of Parsimony (albeit in a warped fashion) and consider the optimal solution most likely to be right and hence a good one to investigate first.
We will spend most of our lecture-time looking at the latter two problems relative to molecular sequence data.
A finer further distinction for pattern problems is exact vs. approximate pattern matching, i.e., how do you interpret "occurrence" of a pattern in a text?
MD1: MD2:
A G T C - A G T C -
A 0 1 1 1 1 A 0 1 2 2 4
G 1 0 1 1 1 G 1 0 2 2 4
T 1 1 0 1 1 T 2 2 0 1 4
C 1 1 1 0 1 C 2 2 1 0 4
- 1 1 1 1 * - 4 4 4 4 *
MS1: MS2:
A G T C - A G T C -
A 2 -1 -1 -1 -1 A 2 -1 -2 -2 -3
G -1 2 -1 -1 -1 G -1 2 -2 -2 -3
T -1 -1 2 -1 -1 T -2 -2 2 -1 -3
C -1 -1 -1 2 -1 C -2 -2 -1 2 -3
- -1 -1 -1 -1 * - -3 -3 -3 -3 *
S1: S2: S3:
123 123 123
AGA AGA AGA Cost of S1 =
| | | M(A,-) = -3
| Del 1 | Del 2 | Del 3
V V V Cost of S2 =
GA AA AG M(G,-) + M(A,G) =
| | -3 + -1 = -4
| Sub 1 | Del 2
V V Cost of S3 =
GA G M(A,-) + M(A,-) +
| M(-,A) =
| Ins 1 -3 + -3 + -3 = -9
V
GA
Match #1: TTGTCA cost = -1 + -1 = -2
TAC
Match #2: TTGTCA cost = -1 + -3 = -4
TA-C
Match #3: TTGTCA cost = -3 + -2 = -5
T-AC
The reader can verify that Match #1 above has the highest score of any match of the given pattern to the given text, and is therefore the best match.
Alignment #1: TGCCTAG
-GC-TGT
cost = -3 + 2 + 2 + -3 + 2 + -1 + -2
= -3
Alignment #2: TGCCTAG-
--GCT-GT
cost = -3 + -3 + -2 + 2 + 2 + -3 + 2 + -3
= -8
Alignment #3: TGCCTAG-
-GC-TG-T
cost = -3 + 2 + 2 + -3 + 2 + -1 + -3 + -3
= -7
Alignment #1: TGCCTAG
-GC-TGT
cost = -3 + 2 + 2 + -3 + 2 + -1 + -2 + (2 * -2)
= -7
Alignment #2: TGCCTAG-
--GCT-GT
cost = -3 + -3 + -2 + 2 + 2 + -3 + 2 + -3 + (3 * -2)
= -14
Alignment #3: TGCCTAG-
-GC-TG-T
cost = -3 + 2 + 2 + -3 + 2 + -1 + -3 + -3 + (4 * -2)
= -15
-------------------------
|#####--###-##-####-##--|
|--##-####-######-######|
-------------------------
------------
---------|##-######-|########
#########|####-##-##|--------
------------
------------
#########|##-######-|--------
---------|####-##-##|########
------------
--------------
######|#-#####--##-|########
------|####-####-##|--------
--------------
----------
######|##-##-##|####
##|#-#####-|#######
----------
T:
AAGA TGGA
\ /
AAAA--TTAA
/ \
ATTA ATAA
Edge alignments:
AAGA cost = 2 + 2 + -1 + 2
AAAA = 5
ATTA cost = 2 + -2 + -2 + 2
AAAA = 0
AAAA cost = -2 + -2 + 2 + 2
TTAA = 0
TTAA cost = 2 + -2 + -1 + 2
TGGA = 1
TTAA cost = -2 + 2 + 2 + 2
ATAA = 4
cost of tree alignment = 5 + 0 + 0 + 1 + 4
= 10
Pairwise alignments:
AAGA cost = -2 + -1 + -1 + 2
TGAA = -2
AAGA cost = 2 + -2 + 2 + 2
ATGA = 4
AAGA cost = 2 + -2 + -2 + 2
ATTA = 0
TGAA cost = -2 + -2 + 1 + 2
ATGA = -3
TGAA cost = -2 + -2 + -2 + 2
ATTA = -4
ATGA cost = 2 + 2 + -1 + 2
ATTA = 5
cost of SP alignment = -2 + 4 + 0 + -3 + -4 + 5
= 0