Dynamic Programming Practice Problems : Genes and Sequences
Here we look at a problem from computational biology. You can think of a DNA sequence as sequence of the characters “a”,”c”,”g”,”t”. Suppose you are given DNA sequences D1 of n1 characters and DNA sequence D2 of n2 characters. You might want to know if these sequences appear to be from the same object. However, in obtaining the sequences, laboratory errors could cause reversed, repeated or missing characters. This leads to the following sequence alignment problem.
An alignment is defined by inserting any number of spaces in D1 and D2 so that the resulting strings D_1 and D_2 both have the same length (with the spaces included as part of the sequence). Each character of D_1 (including each space as a character) has a corresponding character (matching or non-matching) in the same position in D_2.
For a particular alignment A we say cost(A) is the number of mismatches (where you can think of a space as just another character and hence a space matches a space but does not match one of the other 4 characters). To be sure this problem is clear suppose that D1 is ctatg and D2 is ttaagc. One possible alignment is given by:
- D1 is a DNA sequence with n1 characters and D1 is a DNA sequence with n2 characters.
- The general form of the subproblem we solve will be: Find the best alignment for the first i characters of D1 and the first j characters of D2 for 1 < i <= n1 and 1 < j <= n2.
- Let D(i) be the ith character in string D.
- Let c[i, j] be the cost of an optimal alignment for D1(1), . . . ,D1(i) and D2(1), . . . ,D2(j). We can define c[i, j] recursively as shown (where c[n1, n2] gives the optimal cost to the original problem):
- As usual we take i =n and j =m and start from the back of the sequence and then solve the problem
- In an optimal alignment either the last character of D_1 is a space or it is the last character (character i) of D1 and the last character of D_2 is a space or it is the last character (character j) of D2.
- If D1(i) = D2(j) then clearly it is best to align them (so no need to add any space).
- However, if D1(i) != D2(j) then a space could be added to neither or just one. In all three cases one mismatch is caused by the last characters.
- Notice that there would never be any benefit in ending both D1 and D2 with a space. Hence the above recursive definition considers all possible cases that the optimal alignment could have. Since the solution to the original problem is either the value of the subproblem solution (if D1(i) = D2(j)) or otherwise one plus the subproblem solution, the optimal substructure property clearly holds.
- Thus the solution output is correct.
ct at g
tta agc
In the above both D_1 and D_2 have length 8. The cost is 5. (There are mismatches in
position 1, 3, 5, 6 and 8).
Give the most efficient algorithm you can (analyzed as a function of n1 and n2) to compute the alignment of minimum cost.
Solution :
So,
PROOF
We now argue this recursive definition is correct.
You can form D_1 and D_2 (and hence the alignment) for the subproblem from the right to left as follows.
For the time complexity it is clearly O(n1 · n2) since there are n1 · n2 subproblems each of which is solved in constant time. Finally, the c[i, j] matrix can be computed in row major order and just as in the LCS problem a second matrix that contains which of the above 4 cases was applied can also be stored and then used to construct an optimal alignment.
No comments:
Post a Comment