We have all heard the argument that having a million monkeys
typing on a million keyboards would eventually lead to the creation of
each of Shakespeare's works. In essence, there is a minuscule chance
that a selection of random characters might match such a work, and so
with enough random experiments it should eventually happen.
Not surprisingly, small-scale tests of this experiment have thus far
failed.
Our goal in this assignment will be to help those monkeys out just a bit. In fact, perhaps they can create works even better than those of Shakespeare himself. The idea is the following. Rather than modeling the process of generating text by choosing each keystroke uniformly at random, we will use an existing work of Shakespeare to seed a random process for language generation.
For this assignment, you are allowed to work with one other student if you wish. In this regard, please make sure you adhere to the policies on academic integrity.
The input to the generator will be a source document together with a parameter k which we will call the order of the model. Our output will always begin with the first k characters of the source document. After that point, however, each additional character is chosen based upon the most recent k-character string in the output. Let us denote this k-character string as the key. We will find all occurrences of that key in the source document, and select one of those occurrences uniformly at random. The next character of the output will then be chosen to match the character which followed the chosen occurrence in the source.
For example, assume that k=2 and that the current key is
th. If the source document had the following form,
*******th***thth******th****th************th****th***********th**
we would hope to pick one of the eight occurrences of th
uniformly at random, such as,
*******th***thth******th****th************th****th***********th**
and then generate the next character of output based on which
character followed the chosen occurrence in the source document, e.g.,
*******th***thth******th****th************the***th***********th**
Next time we pick a new character of the output,
we would continue with he as the key.
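The way the key moves forward in this example can be sketched as follows. This is only an illustration and not part of the provided template; the class name KeySlide and the helper advance are hypothetical names chosen here.

```java
// Hypothetical helper (not part of the assignment template): shows how the
// k-character key slides forward by one position after each new character.
class KeySlide {

    // Drop the oldest character of the key and append the newly generated
    // one, so the key always holds the k most recent output characters.
    static String advance(String key, char next) {
        if (key.isEmpty()) {
            return key; // assumption made here: with k = 0 the key stays empty
        }
        return key.substring(1) + next;
    }

    public static void main(String[] args) {
        String key = "th";
        key = advance(key, 'e');
        System.out.println(key); // prints "he"
    }
}
```

The key never grows: its length stays equal to k no matter how many characters have been generated.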
public Generator(String source, int k, long seed)
A constructor for your generator which can perform any
initialization which you deem necessary. The first
parameter is the original source document; the second is the
value of k as discussed above; the third is a 'seed'
which you should use when initializing the random number
generator (more details below).
public char nextChar()
This method is responsible for generating another character of
the output. For the first k calls, the method should
be returning each of the first k characters of the
original source. From that point on, each new character should
be based upon the k most recently generated characters,
as per the assignment description. In the special case discussed
earlier, you should return the value 0.
Your class should not have any reason to store the entire output, though you will need to maintain a string which represents the most recent k-character key.
The biggest challenge will be the implementation of the nextChar() method, in particular properly identifying all of the occurrences of the key in the source and choosing one of those occurrences uniformly at random. For clarity, we first discuss a straightforward approach, though not the one that you should implement, as it is less efficient. It requires two passes through the input to generate a new character.
- During a first pass through the source, count the number of occurrences of the key. Let's say that there are p such occurrences.
- Now use a random number generator to choose a number between 1 and p, then make a second pass through the source to locate the desired occurrence of the key.
Instead, you should implement a more efficient approach, requiring only a single pass. The apparent hurdle is that we want to choose among the occurrences of the key with equal probability, but we do not know how many such occurrences we are choosing between until we have made a pass through the input. Fortunately, there is a clever mathematical trick to make this selection on the fly, while still ensuring that the end result is a fair choice.
Perform a pass through the source, searching for occurrences of the key. While doing this, maintain two important pieces of additional information:
- Identify one of the occurrences so far as the current candidate to become the final choice.
- Keep track of a count of the number of occurrences which you have seen thus far.
When you come upon the nth occurrence of the key, select this one with probability 1/n to become the candidate. Do this for each occurrence which is found, until the end of the source is reached. Then use the leading candidate as the chosen occurrence for determining the new output character.
Though you certainly do not need to prove that this technique results in all occurrences being equally likely to be chosen, you may want to try it out by hand to convince yourself. For example, trace through what happens when there are a total of three occurrences between which to choose.
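The single-pass selection described above might be sketched as follows. This is a minimal illustration under stated assumptions, not the required implementation: the class name SinglePassPick and the method pickOccurrence are hypothetical, the key is assumed to be non-empty (k at least 1), and the sketch only returns the position of the chosen occurrence rather than the character that follows it.

```java
import java.util.Random;

// A sketch of the single-pass selection; all names here are hypothetical.
class SinglePassPick {

    // Returns the index in source of one occurrence of key, chosen uniformly
    // at random among all occurrences, or -1 if the key never occurs.
    static int pickOccurrence(String source, String key, Random rng) {
        if (key.isEmpty()) {
            // Assumption of this sketch: an empty key is not supported.
            throw new IllegalArgumentException("key must be non-empty");
        }
        int candidate = -1; // current candidate for the final choice
        int seen = 0;       // number of occurrences found so far
        int pos = source.indexOf(key);
        while (pos != -1) {
            seen++;
            // The nth occurrence replaces the candidate with probability 1/n.
            if (rng.nextInt(seen) == 0) {
                candidate = pos;
            }
            // Advance by one character so overlapping occurrences are found.
            pos = source.indexOf(key, pos + 1);
        }
        return candidate;
    }

    public static void main(String[] args) {
        String source = "xxthaxxthbxxthc"; // "th" occurs at indices 2, 7, 12
        Random rng = new Random(42L);
        int[] tally = new int[source.length()];
        for (int trial = 0; trial < 30000; trial++) {
            tally[pickOccurrence(source, "th", rng)]++;
        }
        // Each of the three occurrences should be picked roughly 10000 times.
        System.out.println(tally[2] + " " + tally[7] + " " + tally[12]);
    }
}
```

With three occurrences, the first becomes the candidate with probability 1 and then survives the second and third with probabilities 1/2 and 2/3, giving 1 x 1/2 x 2/3 = 1/3. The second is selected with probability 1/2 and survives the third with probability 2/3, again 1/3, and the third is selected with probability 1/3 directly. The character to output would be the one at index candidate + key.length(), provided that index lies inside the source; an occurrence that ends exactly at the end of the source has no following character and needs separate handling.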
The String class is discussed in great detail in Ch. 7 of the text, as well as in our own lecture notes. The original source document will be represented as one long String object. You will also want to use a String (or perhaps a StringBuffer) to represent the current key.
The indexOf method will be very useful for finding occurrences of the key in the source document. Though not precisely identical to this application, you will probably find Chapter 7.4 of the text particularly instructive.
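As an illustration of how indexOf can walk through every occurrence of a key, consider the sketch below. The class name OccurrenceScan and the method occurrences are hypothetical, and the key is assumed to be non-empty. The list of positions is collected only to make the result visible; the generator itself has no need to store it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demonstration of scanning a String with indexOf.
class OccurrenceScan {

    // Collects the starting index of every occurrence of key in source,
    // including occurrences that overlap one another.
    static List<Integer> occurrences(String source, String key) {
        if (key.isEmpty()) {
            // Assumption of this sketch: an empty key is not supported.
            throw new IllegalArgumentException("key must be non-empty");
        }
        List<Integer> found = new ArrayList<>();
        // indexOf(String) searches from the start and returns -1 on failure.
        int pos = source.indexOf(key);
        while (pos != -1) {
            found.add(pos);
            // indexOf(String, int) resumes the search at the given index.
            // Resuming at pos + 1 (not pos + key.length()) keeps overlaps.
            pos = source.indexOf(key, pos + 1);
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(occurrences("aaaaX", "aa")); // prints [0, 1, 2]
    }
}
```

Overlapping matches matter for sources such as example3.txt, where a key made of repeated letters can begin at several consecutive positions.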
Much like your own Generator class is used in this assignment to generate characters by this model, Java provides a class Random which can serve as a random number generator. This class does not truly have the ability to pick things at random; rather, it simulates this by using a mathematical function whose output seems random. There are only two methods of this class you will need to know about.
To construct a new object from class Random, call the constructor with a single parameter seed (which was given to your object during its own constructor). The seed is used to start off the pseudo-random process. The reason that we specify this is that it allows us to repeat the identical random sequence while testing the program, so long as the identical seed is provided. Each different seed will likely result in a different eventual output for your program.
The method int nextInt(int n) returns a (pseudo)random integer within the range [0, n-1], with each of those values produced with probability approximately 1/n.
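A small sketch of these two methods in use is given below. The class name SeedDemo and the helper draw are hypothetical, and the seed value 2024 is arbitrary.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical demonstration of seeding java.util.Random.
class SeedDemo {

    // Builds a generator from the given seed and draws count values,
    // each in the range [0, bound - 1].
    static int[] draw(long seed, int bound, int count) {
        Random rng = new Random(seed);
        int[] values = new int[count];
        for (int i = 0; i < count; i++) {
            values[i] = rng.nextInt(bound);
        }
        return values;
    }

    public static void main(String[] args) {
        int[] first = draw(2024L, 6, 10);
        int[] second = draw(2024L, 6, 10);
        // Two generators built from the same seed produce the same sequence.
        System.out.println(Arrays.equals(first, second)); // prints true
    }
}
```

Note that the generator is created once and then reused for every draw. Creating a new Random with the same seed before each call would restart the sequence each time.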
The text generated by this language model will be greatly affected by the parameter k. When k is small, the output will seem almost like gibberish; if k were quite large, the output would eventually be identical to the source document. But as k varies in between, we get some interesting texts which are original, though in the style of the source work.
This may seem like fun and games as you watch your program spitting out random-looking text. So how are you to tell whether or not your program is working correctly, given the presence of some randomness in the model? This is not so easy, but indeed you are responsible for ensuring the accuracy of your program in carrying out the prescribed model. (We will certainly be evaluating your work in this regard.)
We suggest that you start by using somewhat small, controlled input sources. This will allow you to test and debug your program while you are still able to hand simulate the scenario. We will offer the following three simple examples. What behavior would you expect for these source strings:
example1.txt
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890!@#$%^&*()
using any value of k
example2.txt
aaabaaacaaadaaabaaabaaac
using value of k=3
example3.txt
aaaaaaaaaaaaXaaaaaaaaaaaaXaaaaaaaaaaaaXaaaaaaaaaaaaXc
using value of k=4
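One way to support a hand simulation on such small sources is to tally which characters follow a given key, since those counts determine how often each character should be generated. The sketch below is only a testing aid under stated assumptions: the class name FollowerTally and the method tally are hypothetical, the key is assumed to be non-empty, and an occurrence with no following character (one ending exactly at the end of the source) is simply left out of the tally.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical testing aid: counts the characters that follow a key.
class FollowerTally {

    // For each occurrence of key in source, counts the character that
    // immediately follows it. Occurrences at the very end are skipped.
    static Map<Character, Integer> tally(String source, String key) {
        if (key.isEmpty()) {
            // Assumption of this sketch: an empty key is not supported.
            throw new IllegalArgumentException("key must be non-empty");
        }
        Map<Character, Integer> counts = new TreeMap<>();
        int pos = source.indexOf(key);
        while (pos != -1) {
            int next = pos + key.length();
            if (next < source.length()) {
                counts.merge(source.charAt(next), 1, Integer::sum);
            }
            pos = source.indexOf(key, pos + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        String example2 = "aaabaaacaaadaaabaaabaaac";
        System.out.println(tally(example2, "aaa")); // prints {b=3, c=2, d=1}
    }
}
```

For example2.txt with k=3, the key aaa occurs six times, so a generator following the model should produce b after aaa about half of the time, c about a third of the time, and d about a sixth of the time.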
For lengthier and more interesting examples, you will find a subdirectory named input in your own project directory, containing the following files:
muchado.txt (123413 characters)
Much Ado About Nothing. (Let's see how those monkeys do with this).
billiken.txt (3938 characters)
An article on the history of the Billiken mascot.
amendments.txt (18369 characters)
The Amendments to the Constitution (given the current political climate, it will be interesting to see which side of the debate is supported by our experiments).
manifesto.txt (72955 characters)
Communist Party Manifesto (to consider perhaps alternative political structures).
aesopshort.txt (10280 characters)
A short collection of Aesop's fables.
aesop.txt (191945 characters)
A much longer collection of Aesop's fables.
lilwomen.txt (1042048 characters)
Little Women.
We have placed a copy of the template files for this assignment, Shakespearl, in each of your home directories on patel2.slu.edu.
We start you with two classes in the project:
Generator
A template for the class which you must write. This is the only
file that you should be modifying.
Shakespearl
This is a file which you should not edit. It is the main driver
to execute your program.
To run the program, execute the main() method of the class Shakespearl. You need to provide up to four parameters when starting the program:
The number of characters of output which should be generated.
The value of k to be used in the generation.
The full pathname and filename for the source document.
(Optionally) A numerical value to be used in seeding the random number generator. If this fourth parameter is not specified, the program will choose its own seed.
There are two ways by which you can provide these parameters when
running the Shakespearl program. When you start the main
method of that class in BlueJ, it initially prompts you with a window
such as the following:
The parameters would need to be entered using a precise syntax, as a
collection of string literals, for example:
If you choose to omit the optional fourth argument, the
program will pick the seed itself, thus leading to a different
"random" trial each time you run the program. For example, the
following snapshot was taken when specifying only the first three
parameters:
Alternatively, if you click 'OK' without specifying the parameters, the program will begin by explicitly prompting you for each of the three required parameters.
The assignment is worth 10 points. The primary criteria will be whether your program generates text in accordance with the precise algorithm described in this assignment. Please make sure you read the section on testing your program.
A secondary criterion will be the style, readability and documentation of your source code.
The use of the indexOf method of class String greatly simplifies your task in this assignment. For 1 point of extra credit, write a method nextCharAlternate() which has the same behavior as nextChar() but is implemented without any use of the indexOf method.
Note: Do not make any changes to your original nextChar() routine. That one will still be the one which is graded for the required assignment. We do not want you to accidentally harm the first 10 points of your assignment because of a mistake you made in attempting the extra credit.
You can test your extra credit routine by starting the program with the method Shakespearl.mainExtra(), rather than with the method Shakespearl.main().