| Version I | 8pm, Thursday 22 April 2004 |
| Version II | 8pm, Thursday 29 April 2004 |
In this assignment, you will be responsible for creating an entire program, from start to finish, without any files or libraries provided by the instructor. You have been given two weeks to complete the assignment and should use that entire period. Those who do not start early and make continued progress throughout are unlikely to succeed.
For this reason, we have set a preliminary due date by which you must submit what we denote as "Version I" of your program in the Suggested Steps of Progress. You should later resubmit your completed version of the program by the second due date.
In this assignment, we explore one of the important computational techniques used in sequencing genomes. We will give a brief overview of the background material, sufficient for this assignment. However, those who are interested can view a very nice explanation of the sequencing process titled "Sequence for Yourself", provided by PBS' Nova series, at http://www.pbs.org/wgbh/nova/genome/sequencer.html (this assignment is related to Part 5 of that process).
A base is one of the underlying nucleotides that make up DNA. There are four types of nucleotides, commonly abbreviated as A, C, G, and T.
A chromosome is a long DNA sequence, and can thus be modeled as a string over this four-character alphabet, such as ACATGCGTACAGGATC. The 23 chromosome pairs in the human genome contain roughly 3 billion such bases.
Laboratory techniques exist for determining the individual bases comprising a strand of DNA; however, those techniques can only be used for relatively short strands, perhaps limited to 500 bases.
Therefore, the bases of a significantly longer DNA molecule can only be determined if that molecule is first broken into such smaller pieces.
However, a problem exists. If we take a single original molecule and break it up, we will not know how to reassemble the smaller pieces in reconstructing the original DNA molecule. We could not determine the original order of the pieces.
The solution to this problem is the following. Many copies of the original chromosome are made, and then all of those copies are broken down into manageable size strands, though with some random-like effects in the precise locations of the breaks.
In this way, we will likely be able to find significant overlaps between some of the resulting strings, and we can use those overlaps to piece together the original sequence.
For this assignment, you are allowed to work with one other student if you wish. In this regard, please make sure you adhere to the policies on academic integrity.
For this assignment, you will reconstruct the DNA molecule from multiple strands of characters representing the bases. Your program will read, as input, the original collection of smaller strands, and will report as output the larger reconstructed DNA molecule(s).
There are two key operations you will devise:
If we find one strand which is completely contained in another strand, we can simply ignore the smaller strand. For example consider the case if X is the strand acggtcac and Y is cggt:
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| a | c | g | g | t | c | a | c |
|   | c | g | g | t |   |   |   |
|   | 0 | 1 | 2 | 3 |   |   |   |
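A containing match can be tested with the standard `String.contains` method. The following is only a minimal sketch of one possible approach; the class name `ContainDemo` and the method name `isContained` are hypothetical and not part of the assignment.

```java
// Hypothetical helper class; the names here are illustrative only.
class ContainDemo {
    // Returns true if strand y occurs somewhere inside strand x.
    static boolean isContained(String y, String x) {
        return x.contains(y);
    }

    public static void main(String[] args) {
        String x = "acggtcac";
        String y = "cggt";
        System.out.println(isContained(y, x));  // prints true
        System.out.println(x.indexOf(y));       // prints 1, the index in x where y begins
    }
}
```

Note that two identical strands each contain the other, so a complete program needs some rule for discarding only one of them.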
If the right-end of one strand overlaps the left-end of another strand, those two strands might be merged into one longer strand.
For example, if X is the strand acggtcac and Y is gtcacatta then X and Y have an overlap of size 5, as diagramed below:
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |   |   |   |   |
| a | c | g | g | t | c | a | c |   |   |   |   |
|   |   |   | g | t | c | a | c | a | t | t | a |
|   |   |   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
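One way to measure an overlapping match is to try the largest possible overlap first and shrink it until the end of X agrees with the start of Y. The sketch below assumes that approach; the names `OverlapDemo`, `overlapSize` and `merge` are hypothetical, and other designs are equally valid.

```java
// Hypothetical helper class; method names are illustrative only.
class OverlapDemo {
    // Largest k such that the last k characters of x equal the
    // first k characters of y, or 0 if no such k exists.
    static int overlapSize(String x, String y) {
        int max = Math.min(x.length(), y.length());
        for (int k = max; k > 0; k--) {
            if (x.regionMatches(x.length() - k, y, 0, k)) {
                return k;
            }
        }
        return 0;
    }

    // Joins x and y, writing the k shared characters only once.
    static String merge(String x, String y, int k) {
        return x + y.substring(k);
    }

    public static void main(String[] args) {
        String x = "acggtcac";
        String y = "gtcacatta";
        int k = overlapSize(x, y);
        System.out.println(k);               // prints 5
        System.out.println(merge(x, y, k));  // prints acggtcacatta
    }
}
```

The operation is directional: with the same two strands, `overlapSize(y, x)` is only 1, because just the final `a` of Y matches the first character of X.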
Notice that either of these operations results in reducing the number of strands in your collection by one. Ideally, by repeatedly applying such operations, the original collection of strands can be reduced to one single strand which contains all of the originals. Of course, the success in reaching such a single strand depends in part on the integrity of the original data, but also on the order in which you apply these operations.
For this reason, your program should behave as follows.
Remove from the original collection every strand that is contained in another strand (a containing match).
Once the containing matches have been removed, look for overlapping matches. In particular, repeatedly process that overlapping match which displays the largest size overlap.
As a side note, following these two rules ensures that no further containing matches are introduced. You might wish to check this fact while developing your software, though in the final version such a check is unnecessary and time-consuming.
If no overlapping matches exist, stop and output your results.
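The three rules above might be sketched as follows. This is one possible arrangement under stated assumptions, not a required design: the names `GreedyDemo`, `removeContained` and `mergeAll` are hypothetical, ties between equally large overlaps are broken by whichever pair is found first, and of two identical strands the earlier one is kept. The strands live in the first `count` cells of an array.

```java
// A sketch only: class and method names are hypothetical.
class GreedyDemo {
    // Largest k such that the end of x matches the start of y.
    static int overlapSize(String x, String y) {
        for (int k = Math.min(x.length(), y.length()); k > 0; k--) {
            if (x.regionMatches(x.length() - k, y, 0, k)) return k;
        }
        return 0;
    }

    // Rule 1: drop every strand contained in another strand.
    // Survivors are packed into the front of s; the new count is returned.
    static int removeContained(String[] s, int count) {
        boolean[] drop = new boolean[count];
        for (int i = 0; i < count; i++) {
            for (int j = 0; j < count && !drop[i]; j++) {
                if (i == j || !s[j].contains(s[i])) continue;
                // Assumption: of two identical strands, keep the earlier one.
                if (s[j].length() > s[i].length() || j < i) drop[i] = true;
            }
        }
        int kept = 0;
        for (int i = 0; i < count; i++) {
            if (!drop[i]) s[kept++] = s[i];
        }
        return kept;
    }

    // Rules 2 and 3: repeatedly merge the pair with the largest overlap,
    // stopping when no overlapping match remains. Returns the final count.
    static int mergeAll(String[] s, int count) {
        while (true) {
            int best = 0, bi = -1, bj = -1;
            for (int i = 0; i < count; i++) {
                for (int j = 0; j < count; j++) {
                    if (i == j) continue;
                    int k = overlapSize(s[i], s[j]);
                    if (k > best) { best = k; bi = i; bj = j; }
                }
            }
            if (best == 0) return count;            // nothing left to merge
            s[bi] = s[bi] + s[bj].substring(best);  // merged strand replaces s[bi]
            s[bj] = s[count - 1];                   // last strand fills the freed cell
            count--;
        }
    }

    public static void main(String[] args) {
        String[] s = { "acggtcac", "cggt", "gtcacatta" };
        int count = removeContained(s, s.length);
        count = mergeAll(s, count);
        System.out.println(count);  // prints 1
        System.out.println(s[0]);   // prints acggtcacatta
    }
}
```

Searching every ordered pair on every round is simple rather than fast; it is adequate for small inputs but may be slow for large collections.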
Consider the following relatively simple example,
consisting of 7 original strands:
aaattcctttctattttaggccc
tgaaaattcctttctattttaggcccatgcaat
ggcattagggcggt
atgcaatggcattagggcggttaa
ggtta
tgaaaattcctttctattt
taggcccatgcaatggcattagggc
In the first step, we notice that four of the original strands are
contained in other strands, as
| tgaaaattcctttctattt | is a substring of | tgaaaattcctttctattttaggcccatgcaat |
| aaattcctttctattttaggccc | is a substring of | tgaaaattcctttctattttaggcccatgcaat |
| ggcattagggcggt | is a substring of | atgcaatggcattagggcggttaa |
| ggtta | is a substring of | atgcaatggcattagggcggttaa |
Having removed those matches, we are left with the following three strands:
tgaaaattcctttctattttaggcccatgcaat
atgcaatggcattagggcggttaa
taggcccatgcaatggcattagggc
The biggest overlap among these has size 18, between:
taggcccatgcaatggcattagggc
       atgcaatggcattagggcggttaa
Merging those two, we now consider the collection:
tgaaaattcctttctattttaggcccatgcaat
taggcccatgcaatggcattagggcggttaa
These two have an overlap of 14 characters, if the first is aligned to
the left of the second as
tgaaaattcctttctattttaggcccatgcaat
                   taggcccatgcaatggcattagggcggttaa
After merging those two, we have only one remaining strand:
tgaaaattcctttctattttaggcccatgcaatggcattagggcggttaa
A prettier summary of this example is discussed as Extra Credit.
This driver class should have a main method which is used to start your program. This method should be responsible for the following:
The user is supposed to provide a legitimate file name as the first (and only) command-line argument when starting the program. Therefore, you should assume that args[0] is a string which is that filename (though you might certainly take the additional step of checking that such an argument was provided). Additional discussion of command-line arguments was given in lecture. Given the filename, your program should open that file and read the original strands, stored one per line. As you read the lines, you should store each one into an array of Strings.
We have not explicitly discussed how to open and read a file in Java; however, your code can be modeled very closely after Figure 14-11 from p. 705 of the text. That code fragment is based on displaying the text in a text area named "display", whereas your goal is to store each string into an array, so every line in the figure involving 'display' should be replaced appropriately for our needs. If anyone is interested in understanding this code fragment, please read the discussion on pp. 702-705. Note Well: To use the I/O related classes such as BufferedReader and FileReader, you should first use the import command to make them available. That is, the first line of your source code should read:
import java.io.*;
This import statement was not given in the code fragment of Figure 14-11, but is included in the more complete code in Figure 14-14. The issues involved in reading the original input are not meant to be a significant hurdle; please feel free to ask for help if needed.
There is one problem in that you have to instantiate the array with a specific length before you start reading the file, yet you do not know how many strands exist until after you are finished reading the file. For the purpose of this assignment, you can assume that there will be at most 10000 strands, and thus use an array of size 10000. Of course, you should keep a true count of how many strings were actually given by the user and place those strings in the first so many cells of the array. In this way, the rest of your program can proceed based on the actual number of strands rather than the artificial maximum capacity.
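The reading step might look roughly like the sketch below. It is not a reproduction of Figure 14-11; it is an independent illustration using the standard `BufferedReader` and `FileReader` classes. The names `ReadDemo`, `MAX_STRANDS` and `readStrands` are hypothetical, and skipping blank lines is an assumption that the assignment does not state.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// A sketch only: the class name and helper method are hypothetical.
class ReadDemo {
    static final int MAX_STRANDS = 10000;  // capacity allowed by the assignment

    // Stores one strand per line in the first cells of 'strands'
    // and returns how many strands were actually read.
    static int readStrands(BufferedReader in, String[] strands) throws IOException {
        int count = 0;
        String line = in.readLine();
        while (line != null && count < strands.length) {
            line = line.trim();
            if (line.length() > 0) {  // assumption: blank lines are skipped
                strands[count] = line;
                count++;
            }
            line = in.readLine();
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("Usage: java ReadDemo <filename>");
            return;
        }
        String[] strands = new String[MAX_STRANDS];
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        int count = readStrands(in, strands);
        in.close();
        System.out.println("Read " + count + " strands");
    }
}
```

Passing the reader into a helper method, rather than opening the file inside it, makes the counting logic easy to test without a real file.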
The Sequencer class is discussed more in the remainder of this section. When instantiating a Sequencer object, the array of original Strings can be passed as a parameter to the constructor, as well as the number of such strings.
The Sequencer's processing method should have a return value which is a final array of the resulting strand(s). The reason to use an array is that there may be more than one resulting string in some cases.
Please make sure that the output is generated from the main routine, rather than from within the Sequencer class.
Your output should be expressed as follows. First print the number of resulting strands. Then, print each such strand on a separate line of output.
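The output format just described could be produced along these lines. This is a minimal sketch; the names `OutputDemo` and `formatResults` are hypothetical, the strands in `main` are made-up data, and it assumes the array holds exactly the resulting strands with no unused cells.

```java
// A sketch only: the class and method names are hypothetical.
class OutputDemo {
    // Builds the report: the number of strands, then one strand per line.
    static String formatResults(String[] results) {
        StringBuilder sb = new StringBuilder();
        sb.append(results.length).append("\n");
        for (int i = 0; i < results.length; i++) {
            sb.append(results[i]).append("\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] results = { "acggtcacatta", "ttgca" };  // made-up data
        System.out.print(formatResults(results));
    }
}
```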
The constructor for this class should immediately create its own private array of String objects, which is a copy of the original array. Of course, by this point you should know precisely how many original strings were given, so your new array can be sized accordingly (rather than using the maximum cap of 10000 items as with the original array). This new array is the one that you should process as the program continues.
The advantage of copying the original array is that it leaves this original data as a record; this may be helpful in later evaluating or demonstrating the correctness of your sequencer.
As you process strands using the two key operations, the number of strands in your collection will be reduced. Given that you are using an array to keep track of these strands, you will need to take great care to keep track of the number of "remaining" strands, and to make sure that those strands are found in the first so-many cells of the array.
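A skeleton for this bookkeeping might look like the following. It is a sketch under assumptions: the field names, `removeAt` and `getStrands` are hypothetical, and the actual processing logic is deliberately left out. Filling a freed cell with the last remaining strand keeps the collection packed into the front of the array, at the cost of changing the order of the strands.

```java
// A sketch only: field and method names are hypothetical, and the
// sequencing logic itself is omitted.
class Sequencer {
    private String[] strands;  // private working copy of the input
    private int remaining;     // strands[0..remaining-1] are in use

    Sequencer(String[] original, int count) {
        strands = new String[count];  // sized to the true count
        for (int i = 0; i < count; i++) {
            strands[i] = original[i];
        }
        remaining = count;
    }

    // Removes the strand at 'index' by moving the last strand into its cell.
    void removeAt(int index) {
        strands[index] = strands[remaining - 1];
        strands[remaining - 1] = null;
        remaining--;
    }

    // Returns a right-sized array of the strands still in the collection.
    String[] getStrands() {
        String[] result = new String[remaining];
        for (int i = 0; i < remaining; i++) {
            result[i] = strands[i];
        }
        return result;
    }
}
```

Because the constructor copies the strings into a new array, later removals never disturb the caller's original array.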
Code is always more easily maintained when you can separate out distinct behaviors into separate methods. You should find many opportunities for such natural methods in this assignment, at varying levels of abstraction. For example, low level tasks such as checking for matches, determining the size of overlaps, and merging two strands into one could be defined, as well as higher-level tasks such as identifying the pair with maximum overlap, or removing all contained strands, and so on.
You should make sure to set progressive goals for your intermediate progress, and to make sure to test your success in meeting each of those intermediate goals. To get you going, we suggest the following early goals:
Make sure that you can read input with your driver. Demonstrate this by printing a count of the number of initial strands, or by waiting until all of the input is read and then echoing all of the initial strands back as output.
Begin to define your Sequencer class, making sure to consider how information will be passed between the driver and it.
For example, see if you can echo the original input from WITHIN this class, rather than from the original driver. This will ensure that you have accurately passed the necessary information to the sequencer.
Write the code to copy the original array of String objects into a private array of String objects. Again, test your success by echoing the original strands based upon this new array.
Write a low-level utility method which checks whether strand X is contained in strand Y.
Demonstrate the success of this method by outputting, for every possible distinct pair (X,Y), whether X is contained in Y.
Write a low-level utility method which checks the size of an overlapping match between a pair of distinct strands (X,Y).
Demonstrate the success of this method by writing loops which output the size of the achievable overlapping match between every possible distinct pair (X,Y).
Keep going, while continuing to set manageable, intermediate goals. For example, remember to start by removing all contained matches.
Look for more opportunities to write low-level utility methods when appropriate.
We have created a directory /home/csa120/assignments/DNA/input/
which contains a variety of sample input files that we have prepared.
To use these files, you must copy those files into the same directory
as the rest of your project, or else you must give the full pathname,
such as:
java DNADriver /home/csa120/assignments/DNA/input/simple.txt
at the command line, or as
{"/home/csa120/assignments/DNA/input/simple.txt"}
if starting the main method via BlueJ.
The sample inputs we provide are:
The assignment is worth 10 points. The general criteria will be the correctness and style of your implementation.
Generate additional output which not only gives the resulting strand(s), but also diagrams the placement of each of the original strings within the result. For example, here is such an expanded output for the file input/simple.txt:
The sequencing resulted in a total of 1 distinct strand:
Strand 1 (length=50)
tgaaaattcctttctattttaggcccatgcaatggcattagggcggttaa
--------------------------------------------------
tgaaaattcctttctattt
tgaaaattcctttctattttaggcccatgcaat
   aaattcctttctattttaggccc
                   taggcccatgcaatggcattagggc
                          atgcaatggcattagggcggttaa
                                 ggcattagggcggt
                                            ggtta
To accomplish this, you can wait until the results have been computed and then go back and locate each of the original strings in one of the resulting strands. Note that this will be more complicated in the case where there is more than one strand in the final results. Also, taking care to diagram the contained strands from left-to-right in terms of starting point makes the diagram more pleasing.
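For the single-strand case, the locating step could be sketched with `String.indexOf`, as below. The names `DiagramDemo` and `diagram` are hypothetical. The sketch assumes each original belongs at its first occurrence in the result, which may be ambiguous if the result contains repeated regions, and it does not handle multiple resulting strands.

```java
// A sketch only: names are hypothetical, and only one result strand is handled.
class DiagramDemo {
    // Returns the result, a dashed rule, and then each original indented
    // to its starting position, ordered left-to-right by that position.
    static String diagram(String result, String[] originals) {
        StringBuilder sb = new StringBuilder();
        sb.append(result).append("\n");
        for (int i = 0; i < result.length(); i++) {
            sb.append('-');
        }
        sb.append("\n");
        // Visit starting positions from left to right.
        for (int start = 0; start < result.length(); start++) {
            for (int i = 0; i < originals.length; i++) {
                // indexOf gives the first place the original occurs, or -1.
                if (result.indexOf(originals[i]) == start) {
                    for (int p = 0; p < start; p++) {
                        sb.append(' ');
                    }
                    sb.append(originals[i]).append("\n");
                }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] originals = { "gtcacatta", "acggtcac", "cggt" };
        System.out.print(diagram("acggtcacatta", originals));
    }
}
```

Strands that share a starting position are printed in their input order here, so the order of such ties may differ from the sample diagram above.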