Shakespearl

Programming Assignment 6

Due: 8pm, Monday 5 April 2004


(image from
http://members.aol.com/LearnNothing/TypingMonkeys.htm)

Overview

We have all heard the argument that having a million monkeys typing on a million keyboards would eventually lead to the creation of each of Shakespeare's works. In essence, there is a minuscule chance that a selection of random characters might match such a work, and so with enough random experiments it should eventually happen. Not surprisingly, small-scale tests of this experiment have thus far failed.


(image from http://www.vivaria.net/experiments/notes/documentation/)

Our goal in this assignment will be to help those monkeys out just a bit. In fact, perhaps they can create works even better than those of Shakespeare himself. The idea is the following. Rather than modeling the process of generating text by choosing each keystroke uniformly at random, we will use an existing work of Shakespeare to seed a random process for language generation.

Newly Introduced Techniques


Collaboration Policy

For this assignment, you are allowed to work with one other student if you wish. In this regard, please make sure you adhere to the policies on academic integrity.


A Model for Language Generation

The input to the generator will be a source document together with a parameter k, which we will call the order of the model. Our output will always begin with the first k characters of the source document. After that point, however, each additional character is chosen based upon the most recent k-character string in the output. Let us denote this k-character string as the key. We will find all occurrences of that key in the source document, and select one of those occurrences uniformly at random. The next character of the output will then be chosen to match the character that followed the chosen occurrence in the source.

For example, assume that k=2 and that the current key is th. If the source document had the following form,

*******th***thth******th****th************th****th***********th**

we would hope to pick one of the eight occurrences of th uniformly at random, such as the sixth,

*******th***thth******th****th************th****th***********th**

and then generate the next character of output based on which character followed the chosen occurrence in the source document, e.g.,

*******th***thth******th****th************the***th***********th**

Next time we pick a new character of the output, we would continue with he as the key.

Note well: If you implement this scheme correctly, you can always be assured that the key occurs at least once in the source document. However, one special case which may occur is when the randomly chosen occurrence of the key happens to be the final k characters of the source. In this case, there is no follow-up character to use; your generator can return a special value which will signify the end of the output.

Another Note: Consider a scenario with k=3, a current key of aaa, and a source string "aaaaaaab". You should consider this string to have five different occurrences of the key, with four of those five followed by an additional a and the fifth occurrence followed by b, and thus we'd expect that the next character should be chosen as an a with probability of 4/5.
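The counting in the note above can be checked with a short sketch. The class and method names here (OccurrenceDemo, countOccurrences, countFollowedBy) are hypothetical and not part of the assignment's template files; the sketch assumes a non-empty key and relies only on String.indexOf, stepping forward by one character so that overlapping occurrences are included.

```java
// Hypothetical helper class for illustration; not part of the provided template.
class OccurrenceDemo {

    // Counts all occurrences of key in source, overlaps allowed.
    // Assumes key is non-empty.
    static int countOccurrences(String source, String key) {
        int count = 0;
        int pos = source.indexOf(key);
        while (pos != -1) {
            count++;
            // Advance by one character, not by key.length(),
            // so that overlapping occurrences are found as well.
            pos = source.indexOf(key, pos + 1);
        }
        return count;
    }

    // Counts occurrences of key that are immediately followed by `next`.
    static int countFollowedBy(String source, String key, char next) {
        int count = 0;
        int pos = source.indexOf(key);
        while (pos != -1) {
            int after = pos + key.length();
            if (after < source.length() && source.charAt(after) == next) {
                count++;
            }
            pos = source.indexOf(key, pos + 1);
        }
        return count;
    }

    public static void main(String[] args) {
        String source = "aaaaaaab";
        System.out.println(countOccurrences(source, "aaa"));     // 5
        System.out.println(countFollowedBy(source, "aaa", 'a')); // 4
        System.out.println(countFollowedBy(source, "aaa", 'b')); // 1
    }
}
```

With five occurrences, four of them followed by a, a fair choice among occurrences yields a with probability 4/5, in agreement with the note.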


Your Task

Your task will be to write code for a new class Generator which supports the following two public methods:

Your class should not have any reason to store the entire output, though you will need to maintain a string which represents the most recent k-character key.

The biggest challenge will be the implementation of the nextChar() method, in particular properly identifying all of the occurrences of the key in the source and choosing one of those occurrences uniformly at random. For clarity, we first discuss a straightforward approach, though not the one that you should implement, as it is less efficient. It requires two passes through the input to generate a new character.
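As a rough illustration of what a two-pass approach might look like, the sketch below counts the occurrences in a first pass, draws a random index, and then walks to that occurrence in a second pass. This is one plausible reading of "two passes", not necessarily the exact procedure intended. The class name, the method name pickFollowerTwoPass, and the sentinel END_OF_OUTPUT are all assumptions made for this example.

```java
import java.util.Random;

// Hypothetical sketch of a two-pass selection; names are illustrative only.
class TwoPassSketch {

    // Assumed sentinel signifying "no follow-up character"; the assignment
    // only says that some special value may be returned.
    static final char END_OF_OUTPUT = '\0';

    static char pickFollowerTwoPass(String source, String key, Random rand) {
        // Pass 1: count the occurrences of key (overlaps allowed).
        int count = 0;
        for (int pos = source.indexOf(key); pos != -1;
                 pos = source.indexOf(key, pos + 1)) {
            count++;
        }
        if (count == 0) {
            throw new IllegalArgumentException("key does not occur in source");
        }

        // Choose which occurrence to use, uniformly from 0..count-1.
        int target = rand.nextInt(count);

        // Pass 2: walk forward to the chosen occurrence.
        int pos = source.indexOf(key);
        for (int i = 0; i < target; i++) {
            pos = source.indexOf(key, pos + 1);
        }

        int after = pos + key.length();
        return (after < source.length()) ? source.charAt(after) : END_OF_OUTPUT;
    }
}
```

The cost of this version is that the source is scanned twice for every generated character, which is the inefficiency noted above.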

Instead, you should implement this more efficient approach, requiring only a single pass. The apparent hurdle is that we want to choose among those occurrences of the key with equal probability, but we do not know how many such occurrences we are choosing between until we have made a pass through the input. Fortunately, there is a clever mathematical trick to make this selection on the fly, while still ensuring that the end result is a fair choice.
Though you certainly do not need to prove that this technique results in all occurrences being equally likely to be chosen, you may want to try it out by hand to convince yourself. For example, trace through what happens when there are a total of three occurrences between which to choose.
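One well-known technique that fits this description is reservoir sampling with a reservoir of size one: when the i-th occurrence is found, it replaces the current choice with probability 1/i. Whether this is exactly the trick intended here is an assumption. The sketch below illustrates it, and the names OnePassSketch, chooseOccurrence, followerAt, and END_OF_OUTPUT are hypothetical.

```java
import java.util.Random;

// Hypothetical sketch of single-pass selection (reservoir sampling, size one).
class OnePassSketch {

    // Assumed sentinel signifying "no follow-up character".
    static final char END_OF_OUTPUT = '\0';

    // Returns the starting index of a uniformly chosen occurrence of key,
    // or -1 if key does not occur. Assumes key is non-empty.
    static int chooseOccurrence(String source, String key, Random rand) {
        int seen = 0;     // occurrences found so far
        int chosen = -1;  // index of the currently selected occurrence
        for (int pos = source.indexOf(key); pos != -1;
                 pos = source.indexOf(key, pos + 1)) {
            seen++;
            // nextInt(seen) is uniform on 0..seen-1, so this branch is
            // taken with probability 1/seen (always for the first one).
            if (rand.nextInt(seen) == 0) {
                chosen = pos;
            }
        }
        return chosen;
    }

    // Returns the character following the occurrence at index pos,
    // or the sentinel if that occurrence ends the source.
    static char followerAt(String source, String key, int pos) {
        int after = pos + key.length();
        return (after < source.length()) ? source.charAt(after) : END_OF_OUTPUT;
    }
}
```

Tracing three occurrences by hand: the first is selected with probability 1, the second replaces it with probability 1/2, and the third replaces the current choice with probability 1/3. The third therefore survives with probability 1/3, and each of the first two survives with probability (1/2)(2/3) = 1/3, so the choice is fair.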


Additional Programming Details

To successfully complete this assignment, you will need to make use of two existing classes from the Java library, String and Random.
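For orientation, here is a small sketch of the String and Random operations that are likely to be useful. The helper names (LibraryDemo, advanceKey, draws) are hypothetical and exist only for this illustration; the library calls themselves (indexOf, substring, charAt, nextInt, and the seeded Random constructor) are standard.

```java
import java.util.Random;

// Hypothetical demo class; only the java.lang.String and java.util.Random
// calls are real library API.
class LibraryDemo {

    // Slides a k-character key forward: drop the oldest character,
    // append the newly generated one. Assumes key is non-empty.
    static String advanceKey(String key, char next) {
        return key.substring(1) + next;
    }

    // Returns n draws from 0..bound-1 using a generator seeded with `seed`.
    static int[] draws(long seed, int n, int bound) {
        Random rand = new Random(seed);
        int[] result = new int[n];
        for (int i = 0; i < n; i++) {
            result[i] = rand.nextInt(bound);
        }
        return result;
    }

    public static void main(String[] args) {
        String source = "the other path";

        // indexOf(str, fromIndex) resumes the search at fromIndex
        // and returns -1 when there is no further match.
        int first = source.indexOf("th");              // 0
        int second = source.indexOf("th", first + 1);  // 5
        int third = source.indexOf("th", second + 1);  // 12
        int none = source.indexOf("th", third + 1);    // -1
        System.out.println(first + " " + second + " " + third + " " + none);

        // The first k characters of the source, here with k = 2.
        System.out.println(source.substring(0, 2));    // th

        // Moving from key "th" to key "he" after generating 'e'.
        System.out.println(advanceKey("th", 'e'));     // he
    }
}
```

Two Random objects constructed with the same seed produce the same sequence of values, which makes a run repeatable and is helpful when debugging.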

Testing Your Program

The text generated by this language model will be greatly affected by the parameter k. When k is small, the output will seem almost like gibberish; if k were quite large, the output would eventually be identical to the source document. But as k varies in between, we get some interesting texts which are original, though in the style of the source work.

This may seem like fun and games as you watch your program spitting out random looking text. So how are you to tell whether or not your program is working correctly, given the presence of some randomness in the model? This is not so easy, but indeed you are responsible for assuring the accuracy of your program in carrying out the prescribed model. (We will certainly be evaluating your work in this regard.)

We suggest that you start by using somewhat small, controlled input sources. This will allow you to test and debug your program while you are still able to hand simulate the scenario. We will offer the following three simple examples. What behavior would you expect for these source strings:

If you feel that you have things under control for these types of examples, then have fun and move on to some larger examples, such as those discussed in the next section.

Alternative Document Sources

To create your own input file, use one of the text editors available from the popup menu at the lower-left corner of your virtual desktop, under the subheading "Editors" (much as you find BlueJ in the submenu labeled "Java"). When saving those files, it will be most convenient if you place them in your Shakespearl project directory.

For lengthier and more interesting examples, you will find a subdirectory named input in your own project directory, containing the following files:


Files You Will Need

We have placed a copy of the template files for this assignment, Shakespearl, in each of your home directories on patel2.slu.edu.

We start you with two classes in the project:

In addition, you can rely on the provided input files in subdirectory input.

Running the Shakespearl Program

To run the program, execute the main() method of the class Shakespearl. You need to provide up to four parameters when starting the program:

  1. The number of characters of output which should be generated.

  2. The value of k to be used in the generation.

  3. The full pathname and filename for the source document.

  4. (Optionally) A numerical value to be used in seeding the random number generator. If this fourth parameter is not specified, the program will choose its own seed.

There are two ways by which you can provide these parameters when running the Shakespearl program. When you start the main method of that class in BlueJ, it initially prompts you with a window such as the following:

The parameters would need to be entered using a precise syntax, as a collection of string literals, for example:
{"500", "5", "input/billiken.txt", "12345678"}

If you omit the optional fourth argument, the program will pick the seed itself, leading to a different "random" trial each time you run the program. For example, the following snapshot was taken when specifying only the first three parameters:

Alternatively, if you click 'OK' without specifying the parameters, the program will begin by explicitly prompting you for each of the three required parameters.


Submitting Your Assignment

Please see details regarding the submission process from the general programming web page, as well as a discussion of the late policy.

Grading Standards

The assignment is worth 10 points. The primary criterion will be whether your program generates text in accordance with the precise algorithm described in this assignment. Please make sure you read the section on testing your program.

A secondary criterion will be the style, readability and documentation of your source code.


Extra Credit

The use of the indexOf method of class String greatly simplifies your task in this assignment. For 1 point of extra credit, write a method nextCharAlternate() which has the same behavior as nextChar() but which is implemented without any use of the indexOf method.

Note: Do not make any changes to your original nextChar() routine. That one will still be the one which is graded for the required assignment. We do not want you to accidentally harm the first 10 points of your assignment because of a mistake you made in attempting the extra credit.

You can test your extra credit routine by starting the program with the method Shakespearl.mainExtra(), rather than with the method Shakespearl.main().


Michael Goldwasser
Last modified: Friday, 02 April 2004