Difference between revisions of "CISC181 S2017 Lab8"

From class_wiki
Jump to: navigation, search
(Instructions)
(Instructions)
Line 81: Line 81:
 
Now the seed is "ir".  " " or "a" is chosen with equal probability.  Suppose "a" is chosen.  Now the output is "theira" and the seed is "ra".
 
Now the seed is "ir".  " " or "a" is chosen with equal probability.  Suppose "a" is chosen.  Now the output is "theira" and the seed is "ra".
  
And so on.  If your program ever gets into a situation in which there are no characters to choose from (which can happen if the only occurrence of the current seed is at the exact end of the source), your program should pick a new random seed and continue.
+
And so on.  If your program ever gets into a situation in which there are no characters to choose from (which can happen if the only occurrence of the current seed is at the exact end of the source), pick a new random seed and continue.
  
 
====RandomWriter====
 
====RandomWriter====

Revision as of 11:15, 17 April 2017

Preliminaries

  • Make a new project with n = 8 (following these instructions)
  • Name your main class "Lab8" (when creating a new module in the instructions above, in the Java class name field)
  • Modify Lab8.java by adding your name and section number in a comment before the Lab8 class body.

Instructions

In this lab you will analyze text files by breaking them into n-grams at the character level, and use those n-grams to generate random text in the same "style" (in a statistical sense). An n-gram is a sequence of n consecutive characters from the input. The complete set of n-grams for a text overlap each other--for example, if the text is "woodchucks", the 3-grams are "woo", "ood", "odc", "dch", "chu", "huc", "uck", and "cks".

Furthermore, we can keep track of what characters follow each n-gram. For example, if the text is "the three pirates ate their pie", the 2-grams and a list of the characters following them are shown below:

2-gram Characters after 2-gram Characters after
"th" "e", "r", "e" "ra" "t"
"he" " ", "I" "at" "e", "e"
"e " "t", "p", "t" "te" "s", " "
" t" "h", "h" "es" " "
"hr" "e" "s " "a"
"re" "e" " a" "t"
"ee" " " "ei" "r"
" p" "i", "i" "r " "p"
"pi" "r", "e" "ie" null
"ir" "a", " "

Note that non-alphabetic characters are also recorded: spaces, punctuation, digits, and so on. However, we will ignore capitalization.

Now consider how you might generate a new random text with the same statistics as the one you analyzed. Start with a "seed" n-gram chosen randomly from the text. Suppose "th" is chosen for the 2-gram pirate example. This will be the beginning of your output.

The next character output is chosen randomly from the list associated with "th": "e" is chosen with a 2/3 chance and "r" with a 1/3 chance. Suppose an "e" is picked. The output is now "the".

Now we drop the first character "t" from the last n-gram (the seed) that we were using and append the new character "e" to get our new seed "he". We select a character randomly from the list associated with "he": " " (space) with 1/2 chance and "i" with 1/2 chance. Suppose we choose "i". The output is now "thei".

Update the seed again; now we have "ei". There is only one character, "r", in the list associated with this 2-gram, so we pick it. The output is now "their".

Now the seed is "ir". " " or "a" is chosen with equal probability. Suppose "a" is chosen. Now the output is "theira" and the seed is "ra".

And so on. If your program ever gets into a situation in which there are no characters to choose from (which can happen if the only occurrence of the current seed is at the exact end of the source), pick a new random seed and continue.

RandomWriter

You are to implement a Java public class RandomWriter that provides a random writing application. Your class should have a two-argument constructor that takes:

  • String source: The name of an input file to read and analyze
  • int n: A non-negative number indicating the length of each "gram," or character sequence, to break the file into

and also a method generateText() that takes the following two parameters:

  • int length: A non-negative number of characters to generate.
  • String result: The name of the output file

Testing

In main(), run your code on the following files:

Generate approximately 500 characters of text for each input. Print the text in reasonable length lines, breaking only at spaces (not in the middle of a word). Do this for 1-grams, 2-grams, 4-grams, and 6-grams.

Acknowledgments

This assignment is shamelessly copied from one created by David Matuszek at the University of Pennsylvania.