Difference between revisions of "CISC181 S2017 Lab8"

Latest revision as of 09:16, 15 May 2017

In this lab you will analyze text files by breaking them into character-level n-grams, and use those n-grams to generate random text in the same "style" (in a statistical sense). An n-gram is a sequence of n consecutive characters from the input. The n-grams of a text overlap one another: for example, if the text is "woodchucks", the 3-grams are "woo", "ood", "odc", "dch", "chu", "huc", "uck", and "cks".
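As a minimal sketch of the definition above, the following Java snippet slides a window of width n across a string and collects each substring. The class name NGramDemo and the method name nGrams are illustrative only and not part of the assignment; returning an empty list for n < 1 is an assumption made here for simplicity.

```java
import java.util.ArrayList;
import java.util.List;

class NGramDemo {
    // Returns every run of n consecutive characters, in order of appearance.
    static List<String> nGrams(String text, int n) {
        List<String> grams = new ArrayList<>();
        if (n < 1) {
            return grams; // assumption: no grams are produced for n < 1
        }
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // prints [woo, ood, odc, dch, chu, huc, uck, cks]
        System.out.println(nGrams("woodchucks", 3));
    }
}
```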

Furthermore, we can keep track of what characters follow each n-gram. For example, if the text is "the three pirates ate their pie", the 2-grams and a list of the characters following them are shown below:

2-gram	Characters after	2-gram	Characters after
"th"	"e", "r", "e"	"ra"	"t"
"he"	" ", "i"	"at"	"e", "e"
"e "	"t", "p", "t"	"te"	"s", " "
" t"	"h", "h"	"es"	" "
"hr"	"e"	"s "	"a"
"re"	"e"	" a"	"t"
"ee"	" "	"ei"	"r"
" p"	"i", "i"	"r "	"p"
"pi"	"r", "e"	"ie"	null
"ir"	"a", " "

Note that non-alphabetic characters are also recorded: spaces, punctuation, digits, and so on. However, we will ignore capitalization.
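One possible way to build the table above is sketched below. It lowercases the text first, then records the character following each n-gram; duplicates are kept so that frequent followers appear more often in the list. The names FollowerTableDemo and buildTable are hypothetical, and representing "no followers" as an empty list (where the table shows null) is a choice made for this sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class FollowerTableDemo {
    // Maps each n-gram to the list of characters that follow it in the text.
    static Map<String, List<Character>> buildTable(String source, int n) {
        String text = source.toLowerCase(Locale.ROOT); // ignore capitalization
        // LinkedHashMap keeps first-seen order, which makes the table easy to print.
        Map<String, List<Character>> table = new LinkedHashMap<>();
        for (int i = 0; i + n <= text.length(); i++) {
            String gram = text.substring(i, i + n);
            List<Character> followers = table.computeIfAbsent(gram, k -> new ArrayList<>());
            if (i + n < text.length()) {
                followers.add(text.charAt(i + n)); // the final n-gram has no follower
            }
        }
        return table;
    }
}
```

For the pirate sentence with n = 2, buildTable(...).get("th") holds 'e', 'r', 'e', and the entry for "ie" is an empty list.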

Now consider how you might generate a new random text with the same statistics as the one you analyzed. Start with a "seed" n-gram chosen randomly from the text. Suppose "th" is chosen for the 2-gram pirate example. This will be the beginning of your output.

The next character output is chosen randomly from the list associated with "th": "e" is chosen with a 2/3 chance and "r" with a 1/3 chance. Suppose an "e" is picked. The output is now "the".

Now we drop the first character "t" from the last n-gram (the seed) that we were using and append the new character "e" to get our new seed "he". We select a character randomly from the list associated with "he": " " (space) with 1/2 chance and "i" with 1/2 chance. Suppose we choose "i". The output is now "thei".

Update the seed again; now we have "ei". There is only one character, "r", in the list associated with this 2-gram, so we pick it. The output is now "their".

Now the seed is "ir". " " or "a" is chosen with equal probability. Suppose "a" is chosen. Now the output is "theira" and the seed is "ra".

And so on. If your program ever gets into a situation in which there are no characters to choose from (which can happen if the only occurrence of the current seed is at the exact end of the source), pick a new random seed and continue.
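The walk-through above might be sketched as the loop below. This is not a reference solution: the names GenerateDemo and generate are hypothetical, the random generator is passed in so runs can be reproduced, and two details are assumptions, namely that a fresh seed is appended to the output whenever one is picked, and that the source must be longer than n so that at least one n-gram has a follower.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Random;

class GenerateDemo {
    static String generate(String source, int n, int length, Random rng) {
        String text = source.toLowerCase(Locale.ROOT);
        if (n < 1 || text.length() <= n) {
            throw new IllegalArgumentException("need n >= 1 and a source longer than n");
        }
        // Build the n-gram -> followers table.
        Map<String, List<Character>> table = new HashMap<>();
        for (int i = 0; i + n <= text.length(); i++) {
            List<Character> followers =
                table.computeIfAbsent(text.substring(i, i + n), k -> new ArrayList<>());
            if (i + n < text.length()) {
                followers.add(text.charAt(i + n));
            }
        }

        StringBuilder out = new StringBuilder();
        String seed = null;
        while (out.length() < length) {
            List<Character> followers = (seed == null) ? null : table.get(seed);
            if (followers == null || followers.isEmpty()) {
                // No seed yet, or a dead end: pick a random position in the text.
                int start = rng.nextInt(text.length() - n + 1);
                seed = text.substring(start, start + n);
                out.append(seed); // assumption: the new seed is part of the output
                continue;
            }
            // A uniform pick from a list with duplicates gives the weighted choice.
            char next = followers.get(rng.nextInt(followers.size()));
            out.append(next);
            seed = seed.substring(1) + next; // drop first character, append the new one
        }
        out.setLength(length); // trim if an appended seed overshot the target
        return out.toString();
    }
}
```

Calling generate("the three pirates ate their pie", 2, 50, new Random()) returns 50 characters; the exact text varies from run to run.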

RandomWriter

You are to implement a public Java class RandomWriter that provides a random writing application. Your class should have a two-argument constructor that takes:

String source: The name of an input file to read and analyze
int n: A non-negative number indicating the length of each "gram," or character sequence, to break the file into

and also a method generateText() that takes the following two parameters:

int length: A non-negative number of characters to generate.
String result: The name of the output file

Some kind of map is the recommended data structure to store your n-grams and their character list associations.
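Since the constructor receives an input file name and generateText() receives an output file name, some file handling is needed around the map. The sketch below shows one way to do that with java.nio.file; the helper names readSource and writeResult are hypothetical, and UTF-8 encoding is an assumption about the input files.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Locale;

class FileIoDemo {
    // Reads the whole source file and lowercases it, since capitalization is ignored.
    static String readSource(String source) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(source));
        return new String(bytes, StandardCharsets.UTF_8).toLowerCase(Locale.ROOT);
    }

    // Writes the generated text to the named output file, replacing any old content.
    static void writeResult(String result, String text) throws IOException {
        Files.write(Paths.get(result), text.getBytes(StandardCharsets.UTF_8));
    }
}
```

A map field such as Map<String, List<Character>> could then be filled from the string returned by readSource.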

Testing

In main(), run your code on the following files:

Generate 500 characters of text for each input. Print the text in reasonable length lines, breaking only at spaces (not in the middle of a word). Do this for 1-grams, 2-grams, 4-grams, and 6-grams.
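Breaking lines only at spaces can be done in several ways; the following is one small sketch. The names WrapDemo and wrap and the width parameter are illustrative. It assumes newlines in the generated text are treated as ordinary characters, and it lets a word longer than the width run past the limit because there is no space at which to break it.

```java
class WrapDemo {
    // Breaks text into lines of at most width characters, splitting only at spaces.
    static String wrap(String text, int width) {
        StringBuilder out = new StringBuilder();
        int lineStart = 0;
        while (text.length() - lineStart > width) {
            // Last space that still keeps the line within the width.
            int breakAt = text.lastIndexOf(' ', lineStart + width);
            if (breakAt <= lineStart) {
                // No usable space in range: the word is too long, so break after it.
                breakAt = text.indexOf(' ', lineStart + width);
                if (breakAt < 0) {
                    break; // no more spaces; the rest becomes the last line
                }
            }
            out.append(text, lineStart, breakAt).append('\n');
            lineStart = breakAt + 1; // skip the space that was replaced by the newline
        }
        out.append(text.substring(lineStart));
        return out.toString();
    }

    public static void main(String[] args) {
        // prints three lines: "the three", "pirates ate", "their pie"
        System.out.println(wrap("the three pirates ate their pie", 12));
    }
}
```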

Submission

Submit your RandomWriter.java to Sakai, along with a text file results.txt containing the outputs of your program for the different input files and n-gram lengths. Inside results.txt, clearly label the source file and the value of n for each block of output text (there should be 3 input files x 4 values of n = 12 such blocks). Put your name in both files.

Acknowledgments

This assignment is adapted from one created by David Matuszek at the University of Pennsylvania and from Joe Zachary's random writer assignment.

@@ Line 1: / Line 1: @@
+<p style="font-size:40px">[http://nameless.cis.udel.edu/class_data/181_s2017/lab8_grading.pdf Lab #8 grading]</p>
 ===Preliminaries===
@@ Line 69: / Line 71: @@
 |}
-Note that non-alphabetic characters are also recorded: spaces, punctuation, digits, and so on.  However, we will ignore capitalization.
+Note that non-alphabetic characters are also recorded: spaces, punctuation, digits, and so on.  However, we will '''ignore capitalization'''.
 Now consider how you might generate a new random text with the same statistics as the one you analyzed.  Start with a "seed" n-gram chosen randomly from the text.  Suppose "th" is chosen for the 2-gram pirate example.  This will be the beginning of your output.
@@ Line 81: / Line 83: @@
 Now the seed is "ir".  " " or "a" is chosen with equal probability.  Suppose "a" is chosen.  Now the output is "theira" and the seed is "ra".
-And so on.  If your program ever gets into a situation in which there are no characters to choose from (which can happen if the only occurrence of the current seed is at the exact end of the source), your program should pick a new random seed and continue.
+And so on.  If your program ever gets into a situation in which there are no characters to choose from (which can happen if the only occurrence of the current seed is at the exact end of the source), pick a new random seed and continue.
 ====RandomWriter====
@@ Line 94: / Line 96: @@
 * <tt>int length</tt>: A non-negative number of characters to generate.
 * <tt>String result</tt>: The name of the output file
+Some kind of [https://docs.oracle.com/javase/7/docs/api/java/util/Map.html ''map''] is the recommended data structure to store your n-grams and their character list associations.
 ====Testing====
@@ Line 103: / Line 107: @@
 * [http://nameless.cis.udel.edu/class_data/181_s2017/greatexp.txt greatexp]
-Generate approximately 500 characters of text for each input. Print the text in reasonable length lines, breaking only at spaces (not in the middle of a word).
+Generate 500 characters of text for each input. Print the text in reasonable length lines, breaking only at spaces (not in the middle of a word).  Do this for 1-grams, 2-grams, 4-grams, and 6-grams.
-Do this for 1-grams, 2-grams, 4-grams, and 6-grams.
+===Submission===
+Submit your <tt>RandomWriter.java</tt> to Sakai, as well as a text file <tt>results.txt</tt> containing the outputs of your program for the different input files and n-gram lengths.  Inside the <tt>results.txt</tt>, clearly label what the source file and value of n was for each block of output text (there should be 3 input files x 4 values of n = 12 such blocks).  Put your name in both files.
 ===Acknowledgments===
-This assignment is shamelessly copied from [http://www.cis.upenn.edu/~matuszek/cis554-2016/Assignments/scala-2-ngrams.html one created by David Matuszek] at the University of Pennsylvania.
+This assignment is adapted from [http://www.cis.upenn.edu/~matuszek/cis554-2016/Assignments/scala-2-ngrams.html one created by David Matuszek] at the University of Pennsylvania and Joe Zachary's [http://nifty.stanford.edu/2003/randomwriter/handout.html random writer assignment].
