
Saint Louis University

CS A 120
Computer Science I
Michael Goldwasser

Spring 2004

Dept. of Mathematics and
Mathematical Computer Science

`DNA`

Programming Assignment 8

Due:

Version I 8pm, Thursday 22 April 2004

Version II 8pm, Thursday 29 April 2004


Preface

In this assignment, you will be responsible for creating an entire program, from start to finish, without any files or libraries provided by the instructor. You have been given two weeks to complete the assignment and should use that entire period. Those who do not start early and make continued progress throughout are unlikely to find success.

For this reason, we have set a preliminary due date by which you must submit what the Suggested Steps of Progress section denotes as "Version I" of your program. You should later resubmit your completed version of the program by the second due date.

Overview

In this assignment, we explore one of the important computational techniques used in sequencing genomes. We will give a brief overview of the background material, sufficient for this assignment. However, those who are interested can view a very nice explanation of the sequencing process titled "Sequence for Yourself", provided by PBS' Nova series, at http://www.pbs.org/wgbh/nova/genome/sequencer.html (this assignment is related to Part 5 of that process).

Abbreviated Background

A base is one of the underlying nucleotides that make up DNA. There are four types of nucleotides commonly abbreviated as A, T, C, and G, for adenine, thymine, cytosine, and guanine, respectively.
A chromosome is a long DNA sequence, and can thus be modeled as a string over this four character alphabet, such as ACATGCGTACAGGATC. The 23 chromosome pairs in the human genome contain roughly 3 billion such bases.
Laboratory techniques exist for determining the individual bases comprising a strand of DNA; however, those techniques can only be used for relatively short strands, perhaps limited to 500 bases.
Therefore, the bases of a significantly longer DNA molecule can only be determined if that molecule is first broken into such smaller pieces.

However, a problem exists. If we take a single original molecule and break it up, we will not know how to reassemble the smaller pieces in reconstructing the original DNA molecule. We could not determine the original order of the pieces.
The solution to this problem is the following. Many copies of the original chromosome are made, and then all of those copies are broken down into manageable size strands, though with some random-like effects in the precise locations of the breaks.

In this way, we will likely be able to find significant overlaps between some of the resulting strings, and we can use those overlaps to piece together the original sequence.

In this assignment, we will implement the final step, namely taking as input the many smaller strands of DNA and trying to reconstruct the longer original sequence.

Newly Introduced Techniques

Writing a stand-alone, self-contained Java program
Using a command-line argument
Reading input from a file
Instantiating arrays

Collaboration Policy

For this assignment, you are allowed to work with one other student if you wish. In this regard, please make sure you adhere to the policies on academic integrity.

Your Task

For this assignment, you will reconstruct the DNA molecule from multiple strands of characters representing the bases. Your program will read, as input, the original collection of smaller strands, and will report as output the larger reconstructed DNA molecule(s).

There are two key operations you will devise:

Removing a Contained Match
If we find one strand which is completely contained in another strand, we can simply ignore the smaller strand. For example, consider the case where X is the strand acggtcac and Y is cggt:

0	1	2	3	4	5	6	7
a	c	g	g	t	c	a	c
	c	g	g	t
	0	1	2	3

In this case, we can simply remove Y from our collection, since it is already subsumed by the existence of X.
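
As a minimal sketch of this check, assuming a hypothetical helper named `isContained` (the name and class are ours, not part of the assignment), the standard `String.contains` and `String.indexOf` methods already do the substring search:

```java
class ContainedSketch {
    // Hypothetical helper: true when strand y occurs somewhere inside strand x.
    static boolean isContained(String y, String x) {
        return x.contains(y);
    }

    public static void main(String[] args) {
        String x = "acggtcac";
        String y = "cggt";
        System.out.println(isContained(y, x)); // true
        System.out.println(isContained(x, y)); // false
        System.out.println(x.indexOf(y));      // 1, the offset of Y within X
    }
}
```

Note that two identical strands each contain the other, so a removal routine built on such a check should discard only one of them.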

Merging an Overlapping Match

If the right-end of one strand overlaps the left-end of another strand, those two strands might be merged into one longer strand.

For example, if X is the strand acggtcac and Y is gtcacatta then X and Y have an overlap of size 5, as diagrammed below:

0	1	2	3	4	5	6	7
a	c	g	g	t	c	a	c
			g	t	c	a	c	a	t	t	a
			0	1	2	3	4	5	6	7	8

In such a case, we could replace the two strands with the single, merged strand acggtcacatta, noting that the merged strand contains each of the two originals.
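
One possible way to compute and apply such an overlap is sketched below. The method names `overlap` and `merge` are illustrative assumptions, and the sketch only considers overlaps shorter than both strands, on the assumption that contained strands have already been removed.

```java
class OverlapSketch {
    // Largest k such that the last k characters of x equal the first k of y.
    // Returns 0 when the right end of x does not overlap the left end of y.
    static int overlap(String x, String y) {
        for (int k = Math.min(x.length(), y.length()) - 1; k > 0; k--) {
            if (x.endsWith(y.substring(0, k))) {
                return k;
            }
        }
        return 0;
    }

    // Joins x and y, writing the k shared characters only once.
    static String merge(String x, String y, int k) {
        return x + y.substring(k);
    }

    public static void main(String[] args) {
        String x = "acggtcac";
        String y = "gtcacatta";
        int k = overlap(x, y);
        System.out.println(k);              // 5
        System.out.println(merge(x, y, k)); // acggtcacatta
    }
}
```

The order of the arguments matters: `overlap(x, y)` measures the right end of x against the left end of y, so both orders must be examined for each pair.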

Notice that either of these operations results in reducing the number of strands in your collection by one. Ideally, by repeatedly applying such operations, the original collection of strands can be reduced to one single strand which contains all of the originals. Of course, the success in reaching such a single strand depends in part on the integrity of the original data, but also on the order in which you apply these operations.

For this reason, your program should behave as follows.

Remove all strands from the original collection that are contained via a containing match.
Once the containing matches have been removed, look for overlapping matches. In particular, repeatedly process that overlapping match which displays the largest size overlap.

As a side note, following these two rules ensures that no further containing matches are introduced. You might wish to check this fact while developing your software, though in the final version such a check is unnecessary and time-consuming.
If no overlapping matches exist, stop and output your results.

An example

Consider the following relatively simple example, consisting of 7 original strands:

aaattcctttctattttaggccc
tgaaaattcctttctattttaggcccatgcaat
ggcattagggcggt
atgcaatggcattagggcggttaa
ggtta
tgaaaattcctttctattt
taggcccatgcaatggcattagggc

In the first step, we notice that four of the original strands are contained in other strands, as

`tgaaaattcctttctattt`	is a substring of	`tgaaaattcctttctattttaggcccatgcaat`
`aaattcctttctattttaggccc`	is a substring of	`tgaaaattcctttctattttaggcccatgcaat`
`ggcattagggcggt`	is a substring of	`atgcaatggcattagggcggttaa`
`ggtta`	is a substring of	`atgcaatggcattagggcggttaa`

Having removed those matches, we are left with the following three strands:

tgaaaattcctttctattttaggcccatgcaat
atgcaatggcattagggcggttaa
taggcccatgcaatggcattagggc

The biggest overlap among these has size 18, between:

taggcccatgcaatggcattagggc
       atgcaatggcattagggcggttaa

Merging those two, we now consider the collection:

tgaaaattcctttctattttaggcccatgcaat
taggcccatgcaatggcattagggcggttaa

These two have an overlap of 14 characters, if the first is aligned to the left of the second as

tgaaaattcctttctattttaggcccatgcaat
                   taggcccatgcaatggcattagggcggttaa

After merging those two, we have only one remaining strand:

tgaaaattcctttctattttaggcccatgcaatggcattagggcggttaa

A prettier summary of this example is discussed as Extra Credit.
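
The walkthrough above can be replayed with a short sketch of the two rules. This is one possible arrangement, not the required design: the class name `GreedySketch` and the methods `overlap` and `reduce` are assumptions of ours, ties between equally large overlaps are broken by whichever pair is found first, and everything lives in one class for brevity rather than in the driver and Sequencer classes the assignment asks for.

```java
import java.util.Arrays;

class GreedySketch {
    // Largest k such that the last k characters of x equal the first k of y.
    static int overlap(String x, String y) {
        for (int k = Math.min(x.length(), y.length()) - 1; k > 0; k--) {
            if (x.endsWith(y.substring(0, k))) return k;
        }
        return 0;
    }

    // Applies both rules and returns the strands that remain.
    static String[] reduce(String[] input) {
        String[] s = Arrays.copyOf(input, input.length); // caller's array stays untouched
        int n = s.length;                                // live strands are s[0..n-1]

        // Rule 1: drop every strand contained in some other strand.
        for (int i = 0; i < n; ) {
            boolean contained = false;
            for (int j = 0; j < n && !contained; j++) {
                contained = (j != i) && s[j].contains(s[i]);
            }
            if (contained) s[i] = s[--n]; // move the last live strand into cell i
            else i++;
        }

        // Rule 2: repeatedly merge the ordered pair with the largest overlap.
        while (true) {
            int best = 0, bi = -1, bj = -1;
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    if (i == j) continue;
                    int k = overlap(s[i], s[j]);
                    if (k > best) { best = k; bi = i; bj = j; }
                }
            }
            if (best == 0) break;                  // no overlapping match is left
            s[bi] = s[bi] + s[bj].substring(best); // merged strand replaces the left one
            s[bj] = s[--n];                        // cell of the right one is reused
        }
        return Arrays.copyOf(s, n);
    }

    public static void main(String[] args) {
        String[] strands = {
            "aaattcctttctattttaggccc",
            "tgaaaattcctttctattttaggcccatgcaat",
            "ggcattagggcggt",
            "atgcaatggcattagggcggttaa",
            "ggtta",
            "tgaaaattcctttctattt",
            "taggcccatgcaatggcattagggc"
        };
        String[] result = reduce(strands);
        System.out.println(result.length); // 1
        for (String r : result) System.out.println(r);
        // tgaaaattcctttctattttaggcccatgcaatggcattagggcggttaa
    }
}
```

Recomputing every overlap on each pass is simple but slow for large inputs; it is a reasonable starting point before any optimization.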

Programming Details

To aid in the organization of your software, please adhere to the following design:

Create a driver class which has a main method
The main method of this driver class is used to start your program. It should carry out the following steps:
1. Read the input
  The user is supposed to provide a legitimate file name as the first (and only) command-line argument when starting the program. Therefore, you should assume that args[0] is a string containing that filename (though you might certainly take the additional step of checking that such an argument was provided). Additional discussion of command-line arguments was given in lecture. Given the filename, your program should open that file and read the original strands, stored one per line. As you read the lines, you should store each one into an array of Strings.
  
  We have not explicitly discussed how to open and read a file in Java; however, your code can be modeled very closely after Figure 14-11 from p. 705 of the text. That code fragment displays the text in a text area named "display", whereas your goal is to store each string into an array, so every line of the fragment involving 'display' should be replaced appropriately for our needs. If anyone is interested in understanding this code fragment, please read the discussion on pp. 702-705. Note well: to use the I/O related classes such as BufferedReader and FileReader, you should first use the import command to make them available. That is, the first line of your source code should read:
```
    import java.io.*;
```
  This import statement was not given in the code fragment of Figure 14-11, but is included in the more complete code in Figure 14-14. The issues involved in reading the original input are not meant to be a significant hurdle; please feel free to ask for help if needed.
  
  There is one problem in that you have to instantiate the array with a specific length before you start reading the file, yet you do not know how many strands exist until after you are finished reading the file. For the purpose of this assignment, you can assume that there will be at most 10000 strands, and thus use an array of size 10000. Of course, you should keep a true count of how many strings were actually given by the user and place those strings in the first so many cells of the array. In this way, the rest of your program can proceed based on the actual number of strands rather than the artificial maximum capacity.
2. Instantiate a Sequencer object
  Such a class is discussed more in the remainder of this section. When instantiating the object, the array of original Strings can be passed as a parameter to the constructor, as well as the number of such strings.
3. Use the Sequencer to process the collection of strands
  Such a processing method should have a return value which is a final array of the resulting strand(s). The reason to use an array is that there may be more than one resulting string in some cases.
4. Generate Desired Output
  Please make sure that the output is generated from the main routine, rather than from within the Sequencer class.
  
  Your output should be expressed as follows. First print the number of resulting strands. Then, print each such strand on a separate line of output.
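
The input step of the driver might look roughly like the following sketch. It is not the code from Figure 14-11 of the text; the class name `ReadStrandsSketch` and the method `readStrands` are hypothetical, skipping blank lines is our own assumption, and the try-with-resources statement requires Java 7 or later.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

class ReadStrandsSketch {
    static final int MAX_STRANDS = 10000; // capacity stated in the assignment

    // Stores the strands of the file in strands[0..count-1] and returns count.
    static int readStrands(String filename, String[] strands) throws IOException {
        int count = 0;
        try (BufferedReader in = new BufferedReader(new FileReader(filename))) {
            String line;
            while (count < strands.length && (line = in.readLine()) != null) {
                line = line.trim();
                if (!line.isEmpty()) {        // assumption: blank lines are ignored
                    strands[count++] = line;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ReadStrandsSketch <filename>");
            return;
        }
        String[] strands = new String[MAX_STRANDS];
        int count = readStrands(args[0], strands);

        // Echo the input: first the number of strands, then one strand per line.
        System.out.println(count);
        for (int i = 0; i < count; i++) {
            System.out.println(strands[i]);
        }
    }
}
```

Only the first `count` cells of the array hold real data, so every later loop should run to `count` rather than to the array's length.
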
Design a Sequencer class
- Have constructor copy the original array of Strings to a new private array
  The constructor for this class should immediately create its own private array of String objects, which is a copy of the original array. Of course, by this point you should know precisely how many original strings were given, so your new array can be sized accordingly (rather than using the maximum cap of 10000 items as with the original array). This new array is the one that you should process as the program continues.
  
  The advantage of copying the original array is that it leaves this original data as a record; this may be helpful in later evaluating or demonstrating the correctness of your sequencer.
- Design a method which controls the processing algorithm, starting with a large array of strings and resulting in an array of the final, non-matching string(s).
  As you process strands using the two key operations, the number of strands in your collection will be reduced. Given that you are using an array to keep track of these strands, you will need to take great care to keep track of the number of "remaining" strands, and to make sure that those strands are found in the first so-many cells of the array.
- Defining your own private methods
  Code is always more easily maintained when you can separate out distinct behaviors into separate methods. You should find many opportunities for such natural methods in this assignment, at varying levels of abstraction. For example, low level tasks such as checking for matches, determining the size of overlaps, and merging two strands into one could be defined, as well as higher-level tasks such as identifying the pair with maximum overlap, or removing all contained strands, and so on.
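
A skeleton along these lines is sketched below, under our own naming assumptions: `SequencerSketch`, `removeAt`, `removeContained`, and `remainingStrands` are hypothetical names, and the merging phase is deliberately left out. It shows the constructor's private copy and one way to keep the live strands in the first `remaining` cells of the array.

```java
class SequencerSketch {
    private final String[] strands; // private working copy of the input
    private int remaining;          // live strands are strands[0..remaining-1]

    // Copies the first count cells of the original array; the original is never modified.
    SequencerSketch(String[] original, int count) {
        strands = new String[count];
        System.arraycopy(original, 0, strands, 0, count);
        remaining = count;
    }

    // Removes the strand in cell i by moving the last live strand into that cell.
    private void removeAt(int i) {
        strands[i] = strands[remaining - 1];
        strands[remaining - 1] = null;
        remaining--;
    }

    // Discards every strand that is contained in some other live strand.
    void removeContained() {
        int i = 0;
        while (i < remaining) {
            boolean contained = false;
            for (int j = 0; j < remaining && !contained; j++) {
                contained = (j != i) && strands[j].contains(strands[i]);
            }
            if (contained) {
                removeAt(i); // cell i now holds a different strand, so re-check it
            } else {
                i++;
            }
        }
    }

    // Returns a copy of the live strands, sized exactly.
    String[] remainingStrands() {
        String[] result = new String[remaining];
        System.arraycopy(strands, 0, result, 0, remaining);
        return result;
    }
}
```

Moving the last live strand into the vacated cell avoids shifting the rest of the array, at the cost of not preserving the original order of the strands.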

Suggested Steps of Progress

You should make sure to set progressive goals for your intermediate progress, and to make sure to test your success in meeting each of those intermediate goals. To get you going, we suggest the following early goals:

Make sure that you can read input with your driver. Demonstrate this by printing a count of the number of initial strands, or by waiting until all of the input is read and then echoing all of the initial strands back as output.
Begin to define your Sequencer class, making sure to consider how information will be passed between the driver and it.

For example, see if you can echo the original input from WITHIN this class, rather than from the original driver. This will ensure that you have accurately passed the necessary information to the sequencer.
Write the code to copy the original array of String objects into a private array of String objects. Again, test your success by echoing the original strands based upon this new array.
Write a low-level utility method which checks whether strand X is contained in strand Y.
Demonstrate the success of this method by outputting, for every possible distinct pair (X,Y), whether X is contained in Y.
Write a low-level utility method which checks the size of an overlapping match between a pair of distinct strands (X,Y).
Demonstrate the success of this method by writing loops which output the size of the achievable overlapping match between every possible distinct pair (X,Y).

Getting to this point is what we denote as "Version I" of your program. By the first of the two due dates, you are expected to submit code which demonstrates this accomplishment.

Keep going, while continuing to set manageable, intermediate goals. For example, remember to start by removing all contained matches.

Look for more opportunities to write low-level utility methods when appropriate.

Interesting Input

We have created a directory /home/csa120/assignments/DNA/input/ which contains a variety of sample input files that we have prepared. To use these files, you must copy those files into the same directory as the rest of your project, or else you must give the full pathname, such as:

    java DNADriver /home/csa120/assignments/DNA/input/simple.txt

at the command line, or as {"/home/csa120/assignments/DNA/input/simple.txt"} if starting the main method via BlueJ.

The sample inputs we provide are:

nocontained.txt (solution)
7 original strands, none of which are contained in others; result is one strand of length 26.
simple.txt (solution)
7 original strands; only 3 remain after removing contained matches; result is one strand of length 50.
greedy.txt (solution)
only 3 original strands, none contained; following the rule of taking the largest overlap first results in two distinct strands in output.
complex.txt (solution)
13 original strands, though significantly longer; 11 remain after removing contained matches.

Submitting Your Assignment

Please see details regarding the submission process from the general programming web page, as well as a discussion of the late policy.

Grading Standards

The assignment is worth 10 points. The general criteria will be the correctness and style of your implementation.

Extra Credit

Generate additional output which not only gives the resulting strand(s), but also diagrams the placement of each of the original strings within the result. For example, here is such an expanded output for the file input/simple.txt


The sequencing resulted in a total of 1 distinct strand:

Strand 1 (length=50)
tgaaaattcctttctattttaggcccatgcaatggcattagggcggttaa
--------------------------------------------------
tgaaaattcctttctattt
tgaaaattcctttctattttaggcccatgcaat
   aaattcctttctattttaggccc
                   taggcccatgcaatggcattagggc
                          atgcaatggcattagggcggttaa
                                 ggcattagggcggt
                                            ggtta

To accomplish this, you can wait until the results are computed and then go back and locate each of the original strings in one of the resulting strands. Note that this will be more complicated in the case where there is more than one strand in the final results. Also, taking care to diagram the contained strands from left-to-right in terms of starting point makes the diagram more pleasing.
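
A rough sketch of that approach for a single result strand follows. The class `DiagramSketch` and method `diagram` are hypothetical names, and the sketch assumes that the first occurrence reported by `indexOf` is the intended placement, which may not hold if an original strand occurs more than once in the result.

```java
import java.util.ArrayList;
import java.util.List;

class DiagramSketch {
    // Builds the diagram for one result strand from the original strands.
    static String diagram(String result, String[] originals) {
        // Keep only the originals that occur in this result strand.
        List<String> found = new ArrayList<>();
        for (String s : originals) {
            if (result.contains(s)) found.add(s);
        }
        // Order by starting offset; on a tie, the shorter strand comes first.
        found.sort((a, b) -> {
            int byStart = Integer.compare(result.indexOf(a), result.indexOf(b));
            return byStart != 0 ? byStart : Integer.compare(a.length(), b.length());
        });

        StringBuilder out = new StringBuilder();
        out.append(result).append('\n');
        for (int i = 0; i < result.length(); i++) out.append('-');
        out.append('\n');
        for (String s : found) {
            // Indent each original by its offset within the result.
            for (int i = result.indexOf(s); i > 0; i--) out.append(' ');
            out.append(s).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[] originals = { "gtcacatta", "acggtcac", "cggt" };
        System.out.print(diagram("acggtcacatta", originals));
    }
}
```

When several result strands exist, the same method could be called once per result, since it skips originals that do not occur in the strand it is given.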

Michael Goldwasser

Last modified: Wednesday, 21 April 2004
