DNA

Programming Assignment 8

Due:
Version I: 8pm, Thursday 22 April 2004
Version II: 8pm, Thursday 29 April 2004


Preface

In this assignment, you will be responsible for creating an entire program, from start to finish, without any files or libraries provided by the instructor. You have been given two weeks to complete the assignment and should use that entire period. Those who do not start early and make steady progress throughout are unlikely to succeed.

For this reason, we have set a preliminary due date by which you must submit what we denote as "Version I" of your program in the Suggested Steps of Progress. You should later resubmit your completed version of the program by the second due date.


Overview

In this assignment, we explore one of the important computational techniques used in sequencing genomes. We will give a brief overview of the background material, sufficient for this assignment. However, those who are interested can view a very nice explanation of the sequencing process titled "Sequence for Yourself", provided by PBS's Nova series, at http://www.pbs.org/wgbh/nova/genome/sequencer.html (this assignment is related to Part 5 of that process).

Abbreviated Background

In this assignment, we will implement the final step, namely taking as input the many smaller strands of DNA and trying to reconstruct the longer original sequence.

Newly Introduced Techniques


Collaboration Policy

For this assignment, you are allowed to work with one other student if you wish. In this regard, please make sure you adhere to the policies on academic integrity.


Your Task

For this assignment, you will reconstruct the DNA molecule from multiple strands of characters representing the bases. Your program will read, as input, the original collection of smaller strands, and will report as output the larger reconstructed DNA molecule(s).

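There are two key operations you will devise:

  1. A containing match: if one strand is a substring of another, the shorter strand can be removed, since the longer one already accounts for it.

  2. An overlapping match: if the end of one strand matches the beginning of another, the two strands can be merged into a single strand in which the shared portion appears only once.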

Notice that either of these operations results in reducing the number of strands in your collection by one. Ideally, by repeatedly applying such operations, the original collection of strands can be reduced to one single strand which contains all of the originals. Of course, the success in reaching such a single strand depends in part on the integrity of the original data, but also on the order in which you apply these operations.

For this reason, your program should behave as follows.

  1. Remove from the original collection every strand that is contained in another strand (a containing match).

  2. Once the containing matches have been removed, look for overlapping matches. In particular, repeatedly process the overlapping match with the largest overlap.

    As a side note, following these two rules ensures that no further containing matches are introduced. You might wish to check this fact while developing your software, though in the final version such a check is unnecessary and time-consuming.

  3. If no overlapping matches exist, stop and output your results.
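The three rules above can be sketched as a greedy loop. What follows is only an illustration under stated assumptions, not a required design: the class name GreedyAssemblySketch and its method names are made up for this example, strands are held in a List of String objects, exact duplicates are treated as contained in one another (the first copy is kept), and ties for the largest overlap go to whichever pair is found first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; names and structure are assumptions, not the assignment's design.
class GreedyAssemblySketch {

    // Length of the longest suffix of x that equals a prefix of y (0 if none).
    static int overlap(String x, String y) {
        for (int k = Math.min(x.length(), y.length()); k > 0; k--) {
            if (x.endsWith(y.substring(0, k))) {
                return k;
            }
        }
        return 0;
    }

    // Rule 1: drop every strand that is a substring of some other strand.
    static List<String> removeContained(List<String> strands) {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < strands.size(); i++) {
            String x = strands.get(i);
            boolean contained = false;
            for (int j = 0; j < strands.size() && !contained; j++) {
                if (i == j) continue;
                String y = strands.get(j);
                // Assumption: for exact duplicates, only the first copy is kept.
                contained = y.contains(x) && (!x.equals(y) || j < i);
            }
            if (!contained) kept.add(x);
        }
        return kept;
    }

    // Rules 2 and 3: merge the pair with the largest overlap until none remains.
    static List<String> assemble(List<String> input) {
        List<String> strands = removeContained(input);
        while (true) {
            int best = 0, left = -1, right = -1;
            for (int i = 0; i < strands.size(); i++) {
                for (int j = 0; j < strands.size(); j++) {
                    if (i == j) continue;
                    int k = overlap(strands.get(i), strands.get(j));
                    if (k > best) { best = k; left = i; right = j; }
                }
            }
            if (best == 0) return strands;   // no overlapping match exists
            String merged = strands.get(left) + strands.get(right).substring(best);
            strands.remove(Math.max(left, right));  // remove the higher index first
            strands.remove(Math.min(left, right));
            strands.add(merged);
        }
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("ggtta", "atgcaatggcattagggcggttaa",
                "taggcccatgcaatggcattagggc");
        System.out.println(assemble(input));
    }
}
```

Rescanning every pair after each merge is the simplest way to find the largest overlap, though it is not the fastest; it is chosen here only for clarity.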

An example

Consider the following relatively simple example, consisting of 7 original strands:

aaattcctttctattttaggccc
tgaaaattcctttctattttaggcccatgcaat
ggcattagggcggt
atgcaatggcattagggcggttaa
ggtta
tgaaaattcctttctattt
taggcccatgcaatggcattagggc

In the first step, we notice that four of the original strands are contained in other strands:

tgaaaattcctttctattt is a substring of tgaaaattcctttctattttaggcccatgcaat
aaattcctttctattttaggccc is a substring of tgaaaattcctttctattttaggcccatgcaat
ggcattagggcggt is a substring of atgcaatggcattagggcggttaa
ggtta is a substring of atgcaatggcattagggcggttaa

Having removed those matches, we are left with the following three strands:

tgaaaattcctttctattttaggcccatgcaat
atgcaatggcattagggcggttaa
taggcccatgcaatggcattagggc

The biggest overlap among these has size 18, between:

taggcccatgcaatggcattagggc
       atgcaatggcattagggcggttaa

Merging those two, we now consider the collection:

tgaaaattcctttctattttaggcccatgcaat
taggcccatgcaatggcattagggcggttaa

These two have an overlap of 14 characters if the first is aligned to the left of the second:

tgaaaattcctttctattttaggcccatgcaat
                   taggcccatgcaatggcattagggcggttaa

After merging those two, we have only one remaining strand:

tgaaaattcctttctattttaggcccatgcaatggcattagggcggttaa

A prettier summary of this example is discussed as Extra Credit.


Programming Details

To aid in the organization of your software, please adhere to the following design:

Suggested Steps of Progress

You should set progressive intermediate goals, and test that you have met each one before moving on. To get you going, we suggest the following early goals:

  1. Make sure that you can read input with your driver. Demonstrate this by printing a count of the number of initial strands, or by waiting until all of the input is read and then echoing all of the initial strands back as output.

  2. Begin to define your Sequencer class, making sure to consider how information will be passed between the driver and it.

    For example, see if you can echo the original input from WITHIN this class, rather than from the original driver. This will ensure that you have accurately passed the necessary information to the sequencer.

  3. Write the code to copy the original array of String objects into a private array of String objects. Again, test your success by echoing the original strands based upon this new array.

  4. Write a low-level utility method which checks whether strand X is contained in strand Y.

  5. Demonstrate the success of this method by outputting, for every possible distinct pair (X,Y), whether X is contained in Y.

  6. Write a low-level utility method which computes the size of an overlapping match between a pair of distinct strands (X,Y).

  7. Demonstrate the success of this method by writing loops which output the size of the achievable overlapping match between every possible distinct pair (X,Y).

  8. Getting to this point is what we denote as "Version I" of your program. By the first of the two due dates, you are expected to submit code which demonstrates this accomplishment.

  9. Keep going, while continuing to set manageable, intermediate goals. For example, remember to start by removing all contained matches.

    Look for more opportunities to write low-level utility methods when appropriate.
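As one possible shape for the utility methods in steps 4 through 7, here is a minimal sketch. The class name PairwiseSketch and the method names isContainedIn and overlapSize are hypothetical, chosen only for this illustration, and the overlap is measured with X on the left and Y on the right. The main method prints both results for every distinct ordered pair.

```java
// Hypothetical helper class; names are illustrative, not part of the required design.
class PairwiseSketch {

    // True if strand x occurs somewhere inside strand y.
    static boolean isContainedIn(String x, String y) {
        return y.contains(x);
    }

    // Largest k such that the last k characters of x equal the first k of y.
    static int overlapSize(String x, String y) {
        for (int k = Math.min(x.length(), y.length()); k > 0; k--) {
            // Compare the suffix of x starting at x.length() - k with the prefix of y.
            if (x.regionMatches(x.length() - k, y, 0, k)) {
                return k;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        String[] strands = {
            "ggtta",
            "atgcaatggcattagggcggttaa",
            "taggcccatgcaatggcattagggc"
        };
        // Report on every distinct ordered pair (X,Y).
        for (int i = 0; i < strands.length; i++) {
            for (int j = 0; j < strands.length; j++) {
                if (i == j) continue;
                System.out.println("X=" + strands[i] + " Y=" + strands[j]
                        + " contained=" + isContainedIn(strands[i], strands[j])
                        + " overlap=" + overlapSize(strands[i], strands[j]));
            }
        }
    }
}
```

Note that the pair (X,Y) and the pair (Y,X) generally give different overlap sizes, which is why the loops visit both orders.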


Interesting Input

We have created a directory /home/csa120/assignments/DNA/input/ which contains a variety of sample input files that we have prepared. To use these files, either copy them into the same directory as the rest of your project or give the full pathname, such as:

    java DNADriver /home/csa120/assignments/DNA/input/simple.txt
at the command line, or as
    {"/home/csa120/assignments/DNA/input/simple.txt"}
if starting the main method via BlueJ.
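As a rough illustration of how a driver might pick up the pathname from its argument list, consider the sketch below. It rests on assumptions that are not stated in this handout: the class name DriverSketch and the method readStrands are invented for the example, and the input file is assumed to hold one strand per line, with blank lines ignored. It simply reports how many strands were read, in the spirit of the first suggested step.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical driver sketch; the file format (one strand per line) is an assumption.
class DriverSketch {

    // Collects every non-blank line from the scanner as one strand.
    static List<String> readStrands(Scanner in) {
        List<String> strands = new ArrayList<>();
        while (in.hasNextLine()) {
            String line = in.nextLine().trim();
            if (!line.isEmpty()) {
                strands.add(line);
            }
        }
        return strands;
    }

    public static void main(String[] args) throws FileNotFoundException {
        if (args.length != 1) {
            System.err.println("usage: java DriverSketch <input-file>");
            return;
        }
        // args[0] is the pathname given on the command line or in BlueJ's dialog.
        try (Scanner in = new Scanner(new File(args[0]))) {
            List<String> strands = readStrands(in);
            System.out.println("Read " + strands.size() + " strands");
        }
    }
}
```

Keeping the reading logic in its own method makes it easy to test with a Scanner built from a literal string before trying it on a real file.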

The sample inputs we provide are:


Submitting Your Assignment

Please see details regarding the submission process from the general programming web page, as well as a discussion of the late policy.

Grading Standards

The assignment is worth 10 points. The general criteria will be the correctness and style of your implementation.


Extra Credit

Generate additional output which not only gives the resulting strand(s), but also diagrams the placement of each of the original strands within the result. For example, here is such an expanded output for the file input/simple.txt:


The sequencing resulted in a total of 1 distinct strand:

Strand 1 (length=50)
tgaaaattcctttctattttaggcccatgcaatggcattagggcggttaa
--------------------------------------------------
tgaaaattcctttctattt
tgaaaattcctttctattttaggcccatgcaat
   aaattcctttctattttaggccc
                   taggcccatgcaatggcattagggc
                          atgcaatggcattagggcggttaa
                                 ggcattagggcggt
                                            ggtta

To accomplish this, you can wait until the results are computed and then go back and locate each of the original strands in one of the resulting strands. Note that this will be more complicated in the case where there is more than one strand in the final results. Also, taking care to diagram the contained strands from left to right in terms of starting point makes the diagram more pleasing.
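The approach just described might look something like the following for a single resulting strand. This is a sketch under assumptions: the class name DiagramSketch and its methods are hypothetical, each original is placed at its first occurrence in the result (found with indexOf), originals that do not occur in this result are skipped, and originals with the same starting point are listed shortest first.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the diagram for one resulting strand.
class DiagramSketch {

    // Builds a string of n copies of the character c.
    static String repeat(char c, int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append(c);
        }
        return sb.toString();
    }

    static String diagram(String result, List<String> originals) {
        // Keep only the originals that actually occur in this result strand.
        List<String> placed = new ArrayList<>();
        for (String s : originals) {
            if (result.contains(s)) {
                placed.add(s);
            }
        }
        // Order by starting point, then by length for equal starting points.
        placed.sort(Comparator
                .comparingInt((String s) -> result.indexOf(s))
                .thenComparingInt(String::length));

        StringBuilder sb = new StringBuilder();
        sb.append(result).append('\n');
        sb.append(repeat('-', result.length())).append('\n');
        for (String s : placed) {
            // Indent each original by the index of its first occurrence.
            sb.append(repeat(' ', result.indexOf(s))).append(s).append('\n');
        }
        return sb.toString();
    }
}
```

When the final results contain several strands, one option is to call such a method once per resulting strand, although an original that occurs in more than one result would then be drawn more than once.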


Michael Goldwasser
Last modified: Wednesday, 21 April 2004