Difference between revisions of "CISC181 S2017 Lab8"

Latest revision as of 08:16, 15 May 2017

In this lab you will analyze text files by breaking them into character-level n-grams, then use those n-grams to generate random text in the same "style" (in a statistical sense). An n-gram is a sequence of n consecutive characters from the input. The n-grams of a text overlap one another: for example, if the text is "woodchucks", the 3-grams are "woo", "ood", "odc", "dch", "chu", "huc", "uck", and "cks".
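As a rough illustration of the definition above, the following sketch lists the overlapping n-grams of a string. The class name NGramDemo and the method ngrams are hypothetical names chosen for this example, and the helper assumes n is at least 1.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for illustration; not part of the required RandomWriter API.
class NGramDemo {
    // Returns every run of n consecutive characters in text, in order of appearance.
    // Assumption: n >= 1. A text shorter than n has no n-grams.
    static List<String> ngrams(String text, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("this sketch only handles n >= 1");
        }
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Prints the eight 3-grams of "woodchucks" listed in the text.
        System.out.println(ngrams("woodchucks", 3));
    }
}
```

A text of length L has L - n + 1 n-grams, which is why the 10-character "woodchucks" yields eight 3-grams.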

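Furthermore, we can keep track of which characters follow each n-gram. For example, if the text is "the three pirates ate their pie", the 2-grams and the characters that follow each occurrence of them are shown below. A character is listed once per occurrence, so repeats matter. The final 2-gram, "ie", ends the text, so nothing follows it.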

2-gram	Characters after	2-gram	Characters after
"th"	"e", "r", "e"	"ra"	"t"
"he"	" ", "i"	"at"	"e", "e"
"e "	"t", "p", "t"	"te"	"s", " "
" t"	"h", "h"	"es"	" "
"hr"	"e"	"s "	"a"
"re"	"e"	" a"	"t"
"ee"	" "	"ei"	"r"
" p"	"i", "i"	"r "	"p"
"pi"	"r", "e"	"ie"	null
"ir"	"a", " "

Note that non-alphabetic characters are also recorded: spaces, punctuation, digits, and so on. However, we will ignore capitalization.
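One possible way to build such a table is sketched below. The names FollowerTableDemo and buildTable are hypothetical. The sketch lowercases the text to ignore capitalization, and as a design choice of this example (not something the lab specifies) it stores an empty list, rather than null, for an n-gram that only occurs at the very end of the text.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper class for illustration only.
class FollowerTableDemo {
    // Maps each n-gram to the characters that follow it, one entry per occurrence.
    static Map<String, List<Character>> buildTable(String text, int n) {
        String lower = text.toLowerCase(Locale.ROOT); // ignore capitalization
        // LinkedHashMap keeps first-seen order, which makes the printout easy to compare
        // with the table above; any Map implementation would do.
        Map<String, List<Character>> table = new LinkedHashMap<>();
        for (int i = 0; i + n <= lower.length(); i++) {
            String gram = lower.substring(i, i + n);
            List<Character> followers = table.computeIfAbsent(gram, k -> new ArrayList<>());
            if (i + n < lower.length()) {
                followers.add(lower.charAt(i + n)); // spaces, digits, punctuation all count
            }
        }
        return table;
    }

    public static void main(String[] args) {
        Map<String, List<Character>> table = buildTable("The three pirates ate their pie", 2);
        for (Map.Entry<String, List<Character>> e : table.entrySet()) {
            System.out.println("\"" + e.getKey() + "\" -> " + e.getValue());
        }
    }
}
```

Keeping duplicates in each list is what lets a uniform random pick from the list reproduce the frequencies in the source text.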

Now consider how you might generate a new random text with the same statistics as the one you analyzed. Start with a "seed" n-gram chosen randomly from the text. Suppose "th" is chosen for the 2-gram pirate example. This will be the beginning of your output.

The next character output is chosen randomly from the list associated with "th": "e" is chosen with a 2/3 chance and "r" with a 1/3 chance. Suppose an "e" is picked. The output is now "the".

Now we drop the first character "t" from the last n-gram (the seed) that we were using and append the new character "e" to get our new seed "he". We select a character randomly from the list associated with "he": " " (space) with 1/2 chance and "i" with 1/2 chance. Suppose we choose "i". The output is now "thei".

Update the seed again; now we have "ei". There is only one character, "r", in the list associated with this 2-gram, so we pick it. The output is now "their".

Now the seed is "ir". " " or "a" is chosen with equal probability. Suppose "a" is chosen. Now the output is "theira" and the seed is "ra".

And so on. If your program ever gets into a situation in which there are no characters to choose from (which can happen if the only occurrence of the current seed is at the exact end of the source), pick a new random seed and continue.
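The walk-through above can be sketched roughly as follows. GenerateDemo and generate are hypothetical names, and the follower table is assumed to have already been built as a map from n-gram to list of following characters. Two details are assumptions of this sketch rather than requirements of the lab: seeds are drawn uniformly from the distinct n-grams, and when a dead end forces a new seed, that seed is appended to the output and is drawn only from n-grams that have at least one follower (so the loop cannot spin forever).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical helper class for illustration only.
class GenerateDemo {
    static String generate(Map<String, List<Character>> table, int length, Random rng) {
        List<String> seeds = new ArrayList<>(table.keySet());
        if (seeds.isEmpty()) {
            return "";
        }
        // N-grams with at least one follower; used when restarting after a dead end.
        List<String> live = new ArrayList<>();
        for (String gram : seeds) {
            if (!table.get(gram).isEmpty()) {
                live.add(gram);
            }
        }
        String seed = seeds.get(rng.nextInt(seeds.size()));
        StringBuilder out = new StringBuilder(seed); // the seed begins the output
        while (out.length() < length) {
            List<Character> followers = table.get(seed);
            if (followers == null || followers.isEmpty()) {
                if (live.isEmpty()) {
                    break; // nothing can ever be generated from this table
                }
                seed = live.get(rng.nextInt(live.size())); // dead end: pick a new seed
                out.append(seed);
                continue;
            }
            // Duplicates in the list make frequent followers proportionally more likely.
            char next = followers.get(rng.nextInt(followers.size()));
            out.append(next);
            seed = (seed + next).substring(1); // drop first character, append the new one
        }
        if (out.length() > length) {
            out.setLength(length); // a seed may overshoot the requested length
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, List<Character>> table = new HashMap<>();
        table.put("th", Arrays.asList('e', 'r', 'e'));
        table.put("he", Arrays.asList(' ', 'i'));
        table.put("ei", Arrays.asList('r'));
        System.out.println(generate(table, 20, new Random()));
    }
}
```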

RandomWriter

You are to implement a public Java class RandomWriter that provides a random writing application. Your class should have a two-argument constructor that takes:

String source: The name of an input file to read and analyze
int n: A non-negative number indicating the length of each "gram," or character sequence, to break the file into

and also a method generateText() that takes the following two parameters:

int length: A non-negative number of characters to generate.
String result: The name of the output file

Some kind of map is the recommended data structure to store your n-grams and their character list associations.
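To show how the constructor, the map, and generateText() might fit together, here is a minimal sketch under several assumptions. The class is named RandomWriterSketch (hypothetical) to make clear it is not the required solution, it assumes the input file is UTF-8 text, and it records only n-grams that have a follower, so a missing map entry signals a dead end. Line wrapping of the output is left out.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch; the class is left non-public so it can be compiled from any file.
class RandomWriterSketch {
    private final Map<String, List<Character>> followers = new HashMap<>();
    private final List<String> seeds = new ArrayList<>();
    private final Random rng = new Random();

    // Reads the whole source file (assumed UTF-8) and records what follows each n-gram.
    public RandomWriterSketch(String source, int n) throws IOException {
        String text = new String(Files.readAllBytes(Paths.get(source)), StandardCharsets.UTF_8)
                .toLowerCase(Locale.ROOT);
        for (int i = 0; i + n < text.length(); i++) {
            String gram = text.substring(i, i + n);
            followers.computeIfAbsent(gram, k -> new ArrayList<>()).add(text.charAt(i + n));
        }
        seeds.addAll(followers.keySet());
    }

    // Generates `length` characters and writes them to the file named by `result`.
    public void generateText(int length, String result) throws IOException {
        StringBuilder out = new StringBuilder();
        String seed = null;
        while (out.length() < length && !seeds.isEmpty()) {
            List<Character> choices = (seed == null) ? null : followers.get(seed);
            if (choices == null) {
                // First pass or dead end: pick a seed and append it (an assumption of this sketch).
                seed = seeds.get(rng.nextInt(seeds.size()));
                out.append(seed);
                continue;
            }
            char next = choices.get(rng.nextInt(choices.size()));
            out.append(next);
            seed = (seed + next).substring(1);
        }
        if (out.length() > length) {
            out.setLength(length);
        }
        Files.write(Paths.get(result), out.toString().getBytes(StandardCharsets.UTF_8));
    }
}
```

With hypothetical file names, a call might look like `new RandomWriterSketch("input.txt", 4).generateText(500, "output.txt");`. A HashMap is used here because lookups by n-gram dominate the work, but any Map implementation should serve.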

Testing

In main(), run your code on the following files:

Generate 500 characters of text for each input. Print the text in lines of reasonable length, breaking only at spaces (not in the middle of a word). Do this for 1-grams, 2-grams, 4-grams, and 6-grams.
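Breaking lines only at spaces can be done with a simple greedy pass, sketched below. WrapDemo, wrap, and the width parameter are hypothetical. The sketch treats any run of whitespace (including newlines in the generated text) as a single space, which is a simplification rather than something the lab requires, and a word longer than the width is left unbroken on its own line.

```java
// Hypothetical helper class for illustration only.
class WrapDemo {
    // Greedy word wrap: start a new line whenever the next word would exceed `width`.
    static String wrap(String text, int width) {
        StringBuilder out = new StringBuilder();
        int lineLen = 0;
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // happens only when the text is empty or all whitespace
            }
            if (lineLen > 0 && lineLen + 1 + word.length() > width) {
                out.append('\n');
                lineLen = 0;
            }
            if (lineLen > 0) {
                out.append(' ');
                lineLen++;
            }
            out.append(word);
            lineLen += word.length();
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(wrap("the three pirates ate their pie", 12));
    }
}
```

With a width of 12, the pirate sentence comes out as three lines: "the three", "pirates ate", and "their pie".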

Submission

Submit your RandomWriter.java to Sakai, along with a text file results.txt containing the outputs of your program for the different input files and n-gram lengths. Inside results.txt, clearly label the source file and the value of n for each block of output text (there should be 3 input files x 4 values of n = 12 such blocks). Put your name in both files.

Acknowledgments

This assignment is adapted from one created by David Matuszek at the University of Pennsylvania and Joe Zachary's random writer assignment.

@@ Line 1: / Line 1: @@
+<p style="font-size:40px">[http://nameless.cis.udel.edu/class_data/181_s2017/lab8_grading.pdf Lab #8 grading]</p>
 ===Preliminaries===
@@ Line 69: / Line 71: @@
 |}
-Note that non-alphabetic characters are also recorded: spaces, punctuation, digits, and so on.  However, we will ignore capitalization.
+Note that non-alphabetic characters are also recorded: spaces, punctuation, digits, and so on.  However, we will '''ignore capitalization'''.
 Now consider how you might generate a new random text with the same statistics as the one you analyzed.  Start with a "seed" n-gram chosen randomly from the text.  Suppose "th" is chosen for the 2-gram pirate example.  This will be the beginning of your output.
@@ Line 94: / Line 96: @@
 * <tt>int length</tt>: A non-negative number of characters to generate.
 * <tt>String result</tt>: The name of the output file
+Some kind of [https://docs.oracle.com/javase/7/docs/api/java/util/Map.html ''map''] is the recommended data structure to store your n-grams and their character list associations.
 ====Testing====
@@ Line 103: / Line 107: @@
 * [http://nameless.cis.udel.edu/class_data/181_s2017/greatexp.txt greatexp]
-Generate approximately 500 characters of text for each input. Print the text in reasonable length lines, breaking only at spaces (not in the middle of a word).
+Generate 500 characters of text for each input. Print the text in reasonable length lines, breaking only at spaces (not in the middle of a word).  Do this for 1-grams, 2-grams, 4-grams, and 6-grams.
-Do this for 1-grams, 2-grams, 4-grams, and 6-grams.
+===Submission===
+Submit your <tt>RandomWriter.java</tt> to Sakai, as well as a text file <tt>results.txt</tt> containing the outputs of your program for the different input files and n-gram lengths.  Inside the <tt>results.txt</tt>, clearly label what the source file and value of n was for each block of output text (there should be 3 input files x 4 values of n = 12 such blocks).  Put your name in both files.
 ===Acknowledgments===
-This assignment is shamelessly copied from [http://www.cis.upenn.edu/~matuszek/cis554-2016/Assignments/scala-2-ngrams.html one created by David Matuszek] at the University of Pennsylvania.
+This assignment is adapted from [http://www.cis.upenn.edu/~matuszek/cis554-2016/Assignments/scala-2-ngrams.html one created by David Matuszek] at the University of Pennsylvania and Joe Zachary's [http://nifty.stanford.edu/2003/randomwriter/handout.html random writer assignment].
