Difference between revisions of "CISC181 S2017 Lab8"
(→RandomWriter) |
|||
(26 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
+ | <p style="font-size:40px">[http://nameless.cis.udel.edu/class_data/181_s2017/lab8_grading.pdf Lab #8 grading]</p> | ||
+ | |||
===Preliminaries=== | ===Preliminaries=== | ||
Line 8: | Line 10: | ||
===Instructions=== | ===Instructions=== | ||
− | In this lab you will analyze text files by breaking them into [https://en.wikipedia.org/wiki/N-gram ''n-grams''] at the character level, and use those n-grams to generate random text in the same "style" (in a statistical sense). | + | In this lab you will analyze text files by breaking them into [https://en.wikipedia.org/wiki/N-gram ''n-grams''] at the character level, and use those n-grams to generate random text in the same "style" (in a statistical sense). An n-gram is a sequence of n consecutive characters from the input. The complete set of n-grams for a text overlap each other--for example, if the text is "woodchucks", the 3-grams are "woo", "ood", "odc", "dch", "chu", "huc", "uck", and "cks". |
+ | |||
+ | Furthermore, we can keep track of what characters '''follow''' each n-gram. For example, if the text is "the three pirates ate their pie", the 2-grams and a list of the characters following them are shown below: | ||
+ | |||
+ | {| class="wikitable" style="text-align: center" border="1" cellpadding="5" | ||
+ | !width="10%"|2-gram | ||
+ | !width="40%"|Characters after | ||
+ | !width="10%"|2-gram | ||
+ | !width="40%"|Characters after | ||
+ | |- | ||
+ | |"th" | ||
+ | |"e", "r", "e" | ||
+ | |"ra" | ||
+ | |"t" | ||
+ | |- | ||
+ | |"he" | ||
+ | |" ", "I" | ||
+ | |"at" | ||
+ | |"e", "e" | ||
+ | |- | ||
+ | |"e " | ||
+ | |"t", "p", "t" | ||
+ | |"te" | ||
+ | |"s", " " | ||
+ | |- | ||
+ | |" t" | ||
+ | |"h", "h" | ||
+ | |"es" | ||
+ | |" " | ||
+ | |- | ||
+ | |"hr" | ||
+ | |"e" | ||
+ | |"s " | ||
+ | |"a" | ||
+ | |- | ||
+ | |"re" | ||
+ | |"e" | ||
+ | |" a" | ||
+ | |"t" | ||
+ | |- | ||
+ | |"ee" | ||
+ | |" " | ||
+ | |"ei" | ||
+ | |"r" | ||
+ | |- | ||
+ | |" p" | ||
+ | |"i", "i" | ||
+ | |"r " | ||
+ | |"p" | ||
+ | |- | ||
+ | |"pi" | ||
+ | |"r", "e" | ||
+ | |"ie" | ||
+ | |null | ||
+ | |- | ||
+ | |"ir" | ||
+ | |"a", " " | ||
+ | | | ||
+ | | | ||
+ | |} | ||
+ | |||
+ | Note that non-alphabetic characters are also recorded: spaces, punctuation, digits, and so on. However, we will '''ignore capitalization'''. | ||
+ | |||
+ | Now consider how you might generate a new random text with the same statistics as the one you analyzed. Start with a "seed" n-gram chosen randomly from the text. Suppose "th" is chosen for the 2-gram pirate example. This will be the beginning of your output. | ||
+ | |||
+ | The next character output is chosen randomly from the list associated with "th": "e" is chosen with a 2/3 chance and "r" with a 1/3 chance. Suppose an "e" is picked. The output is now "the". | ||
+ | |||
+ | Now we drop the first character "t" from the last n-gram (the seed) that we were using and append the new character "e" to get our new seed "he". We select a character randomly from the list associated with "he": " " (space) with 1/2 chance and "i" with 1/2 chance. Suppose we choose "i". The output is now "thei". | ||
+ | |||
+ | Update the seed again; now we have "ei". There is only one character, "r", in the list associated with this 2-gram, so we pick it. The output is now "their". | ||
+ | |||
+ | Now the seed is "ir". " " or "a" is chosen with equal probability. Suppose "a" is chosen. Now the output is "theira" and the seed is "ra". | ||
− | + | And so on. If your program ever gets into a situation in which there are no characters to choose from (which can happen if the only occurrence of the current seed is at the exact end of the source), pick a new random seed and continue. | |
====RandomWriter==== | ====RandomWriter==== | ||
Line 17: | Line 90: | ||
* <tt>String source</tt>: The name of an input file to read and analyze | * <tt>String source</tt>: The name of an input file to read and analyze | ||
− | * <tt>int n</tt> A non-negative number indicating the length of each "gram," or character sequence, to break the file into | + | * <tt>int n</tt>: A non-negative number indicating the length of each "gram," or character sequence, to break the file into |
and also a method <tt>generateText()</tt> that takes the following two parameters: | and also a method <tt>generateText()</tt> that takes the following two parameters: | ||
− | * <tt>int length</tt> A non-negative number of characters to generate. | + | * <tt>int length</tt>: A non-negative number of characters to generate. |
* <tt>String result</tt>: The name of the output file | * <tt>String result</tt>: The name of the output file | ||
+ | |||
+ | Some kind of [https://docs.oracle.com/javase/7/docs/api/java/util/Map.html ''map''] is the recommended data structure to store your n-grams and their character list associations. | ||
====Testing==== | ====Testing==== | ||
Line 28: | Line 103: | ||
In <tt>main()</tt>, run your code on the following files: | In <tt>main()</tt>, run your code on the following files: | ||
− | |||
* [http://nameless.cis.udel.edu/class_data/181_s2017/doi.txt doi] | * [http://nameless.cis.udel.edu/class_data/181_s2017/doi.txt doi] | ||
* [http://nameless.cis.udel.edu/class_data/181_s2017/bts.txt bts] | * [http://nameless.cis.udel.edu/class_data/181_s2017/bts.txt bts] | ||
* [http://nameless.cis.udel.edu/class_data/181_s2017/greatexp.txt greatexp] | * [http://nameless.cis.udel.edu/class_data/181_s2017/greatexp.txt greatexp] | ||
− | Generate | + | Generate 500 characters of text for each input. Print the text in reasonable length lines, breaking only at spaces (not in the middle of a word). Do this for 1-grams, 2-grams, 4-grams, and 6-grams. |
− | Do this for 1-grams, 2-grams, | + | |
+ | ===Submission=== | ||
+ | |||
+ | Submit your <tt>RandomWriter.java</tt> to Sakai, as well as a text file <tt>results.txt</tt> containing the outputs of your program for the different input files and n-gram lengths. Inside the <tt>results.txt</tt>, clearly label what the source file and value of n was for each block of output text (there should be 3 input files x 4 values of n = 12 such blocks). Put your name in both files. | ||
===Acknowledgments=== | ===Acknowledgments=== | ||
− | This assignment is | + | This assignment is adapted from [http://www.cis.upenn.edu/~matuszek/cis554-2016/Assignments/scala-2-ngrams.html one created by David Matuszek] at the University of Pennsylvania and Joe Zachary's [http://nifty.stanford.edu/2003/randomwriter/handout.html random writer assignment]. |
Latest revision as of 08:16, 15 May 2017
Preliminaries
- Make a new project with n = 8 (following these instructions)
- Name your main class "Lab8" (when creating a new module in the instructions above, in the Java class name field)
- Modify Lab8.java by adding your name and section number in a comment before the Lab8 class body.
Instructions
In this lab you will analyze text files by breaking them into n-grams at the character level, and use those n-grams to generate random text in the same "style" (in a statistical sense). An n-gram is a sequence of n consecutive characters from the input. The complete set of n-grams for a text overlap each other--for example, if the text is "woodchucks", the 3-grams are "woo", "ood", "odc", "dch", "chu", "huc", "uck", and "cks".
Furthermore, we can keep track of what characters follow each n-gram. For example, if the text is "the three pirates ate their pie", the 2-grams and a list of the characters following them are shown below:
2-gram | Characters after | 2-gram | Characters after |
---|---|---|---|
"th" | "e", "r", "e" | "ra" | "t" |
"he" | " ", "I" | "at" | "e", "e" |
"e " | "t", "p", "t" | "te" | "s", " " |
" t" | "h", "h" | "es" | " " |
"hr" | "e" | "s " | "a" |
"re" | "e" | " a" | "t" |
"ee" | " " | "ei" | "r" |
" p" | "i", "i" | "r " | "p" |
"pi" | "r", "e" | "ie" | null |
"ir" | "a", " " |
Note that non-alphabetic characters are also recorded: spaces, punctuation, digits, and so on. However, we will ignore capitalization.
Now consider how you might generate a new random text with the same statistics as the one you analyzed. Start with a "seed" n-gram chosen randomly from the text. Suppose "th" is chosen for the 2-gram pirate example. This will be the beginning of your output.
The next character output is chosen randomly from the list associated with "th": "e" is chosen with a 2/3 chance and "r" with a 1/3 chance. Suppose an "e" is picked. The output is now "the".
Now we drop the first character "t" from the last n-gram (the seed) that we were using and append the new character "e" to get our new seed "he". We select a character randomly from the list associated with "he": " " (space) with 1/2 chance and "i" with 1/2 chance. Suppose we choose "i". The output is now "thei".
Update the seed again; now we have "ei". There is only one character, "r", in the list associated with this 2-gram, so we pick it. The output is now "their".
Now the seed is "ir". " " or "a" is chosen with equal probability. Suppose "a" is chosen. Now the output is "theira" and the seed is "ra".
And so on. If your program ever gets into a situation in which there are no characters to choose from (which can happen if the only occurrence of the current seed is at the exact end of the source), pick a new random seed and continue.
RandomWriter
You are to implement a Java public class RandomWriter that provides a random writing application. Your class should have a two-argument constructor that takes:
- String source: The name of an input file to read and analyze
- int n: A non-negative number indicating the length of each "gram," or character sequence, to break the file into
and also a method generateText() that takes the following two parameters:
- int length: A non-negative number of characters to generate.
- String result: The name of the output file
Some kind of map is the recommended data structure to store your n-grams and their character list associations.
Testing
In main(), run your code on the following files:
Generate 500 characters of text for each input. Print the text in reasonable length lines, breaking only at spaces (not in the middle of a word). Do this for 1-grams, 2-grams, 4-grams, and 6-grams.
Submission
Submit your RandomWriter.java to Sakai, as well as a text file results.txt containing the outputs of your program for the different input files and n-gram lengths. Inside the results.txt, clearly label what the source file and value of n was for each block of output text (there should be 3 input files x 4 values of n = 12 such blocks). Put your name in both files.
Acknowledgments
This assignment is adapted from one created by David Matuszek at the University of Pennsylvania and Joe Zachary's random writer assignment.