Difference between revisions of "CISC181 S2017 Lab8"
Revision as of 09:59, 17 April 2017
Preliminaries
- Make a new project with n = 8 (following these instructions)
- Name your main class "Lab8" (when creating a new module in the instructions above, in the Java class name field)
- Modify Lab8.java by adding your name and section number in a comment before the Lab8 class body.
Instructions
In this lab you will analyze text files by breaking them into n-grams at the character level, and use those n-grams to generate random text in the same "style" (in a statistical sense). An n-gram is a sequence of n consecutive characters from the input. The n-grams of a text overlap one another: for example, if the text is "woodchucks", the 3-grams are "woo", "ood", "odc", "dch", "chu", "huc", "uck", and "cks".
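As a rough illustration of the overlap, the n-grams can be collected by sliding a window of width n across the text one character at a time. The class and method names below (NGramExtractor, extract) are made up for this sketch and are not part of the assignment.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the required RandomWriter API.
class NGramExtractor {
    /** Returns every overlapping n-gram of text, in order of appearance. */
    static List<String> extract(String text, int n) {
        List<String> grams = new ArrayList<>();
        // The last full window starts at index text.length() - n.
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Prints [woo, ood, odc, dch, chu, huc, uck, cks]
        System.out.println(extract("woodchucks", 3));
    }
}
```

A text of length L yields L - n + 1 n-grams, which is why the 10-character "woodchucks" produces eight 3-grams.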
Furthermore, we can keep track of which characters follow each n-gram. For example, if the text is "the three pirates ate their pie", each 2-gram is listed below together with the characters that follow it:
| 2-gram | Characters after | 2-gram | Characters after |
|---|---|---|---|
| "th" | "e", "r", "e" | "ra" | "t" |
| "he" | " ", "i" | "at" | "e", "e" |
| "e " | "t", "p", "t" | "te" | "s", " " |
| " t" | "h", "h" | "es" | " " |
| "hr" | "e" | "s " | "a" |
| "re" | "e" | " a" | "t" |
| "ee" | " " | "ei" | "r" |
| " p" | "i", "i" | "r " | "p" |
| "pi" | "r", "e" | "ie" | null |
| "ir" | "a", " " | | |
Note that non-alphabetic characters are also recorded: spaces, punctuation, digits, and so on. However, we will ignore capitalization.
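One possible way to record this information is a map from each n-gram to the list of characters seen after it, with duplicates kept. The sketch below assumes that approach; the names FollowerTable and build are invented for illustration, and an n-gram with no follower (such as "ie" at the very end of the example text) gets an empty list here rather than null.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of the required RandomWriter API.
class FollowerTable {
    /** Maps each n-gram to every character that follows it, duplicates included. */
    static Map<String, List<Character>> build(String text, int n) {
        // Capitalization is ignored; spaces, digits and punctuation are kept.
        String lower = text.toLowerCase(Locale.ROOT);
        Map<String, List<Character>> followers = new LinkedHashMap<>();
        for (int i = 0; i + n <= lower.length(); i++) {
            String gram = lower.substring(i, i + n);
            List<Character> list = followers.computeIfAbsent(gram, k -> new ArrayList<>());
            // The final n-gram of the text has nothing after it.
            if (i + n < lower.length()) {
                list.add(lower.charAt(i + n));
            }
        }
        return followers;
    }

    public static void main(String[] args) {
        Map<String, List<Character>> table = build("the three pirates ate their pie", 2);
        System.out.println(table.get("th")); // prints [e, r, e]
    }
}
```

Keeping duplicates in each list means the list itself records how often each follower occurred.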
Suppose, for example, that you are working with 2-grams, and you have found that 80% of the time "th" is followed by "e ", 10% by "is", 7% by "at", and 3% by "es". Then, when you are generating text, after you have generated "th" you should randomly choose "e " with probability 0.8, "is" with probability 0.1, "at" with probability 0.07, and "es" with probability 0.03.
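If the followers of an n-gram are stored as a list with duplicates, picking one element uniformly at random already gives each follower its observed probability, so no explicit probability table is needed. The sketch below assumes single-character followers stored that way; FollowerPicker and pick are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of the required RandomWriter API.
class FollowerPicker {
    /**
     * Picks one follower uniformly from the list. Because duplicates are kept,
     * a character that appears k times out of m is chosen with probability k/m.
     * Throws IllegalArgumentException (from nextInt) if the list is empty.
     */
    static char pick(List<Character> followers, Random rng) {
        return followers.get(rng.nextInt(followers.size()));
    }

    public static void main(String[] args) {
        // In "the three pirates ate their pie", "th" is followed by e, r, e,
        // so 'e' should come up about two thirds of the time.
        List<Character> afterTh = Arrays.asList('e', 'r', 'e');
        Random rng = new Random();
        int eCount = 0;
        for (int i = 0; i < 3000; i++) {
            if (pick(afterTh, rng) == 'e') {
                eCount++;
            }
        }
        System.out.println("'e' chosen " + eCount + " times out of 3000");
    }
}
```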
RandomWriter
You are to implement a public Java class RandomWriter that provides a random writing application. Your class should have a two-argument constructor that takes:
- String source: The name of an input file to read and analyze
- int n: A non-negative number indicating the length of each "gram," or character sequence, to break the file into
and also a method generateText() that takes the following two parameters:
- int length: A non-negative number of characters to generate.
- String result: The name of the output file
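The shape of such a class might look roughly like the sketch below. Only the constructor and generateText() parameters come from the description above; everything else is an assumption made for illustration: the void return type, the checked IOException, UTF-8 encoding, starting from a random n-gram of the source, and restarting from another random n-gram when the current one has no recorded follower.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Random;

// Sketch only; the lab asks for a public class in its own file.
class RandomWriter {
    private final String text;
    private final int n;
    private final Map<String, List<Character>> followers = new HashMap<>();
    private final Random rng = new Random();

    RandomWriter(String source, int n) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(source));
        this.text = new String(bytes, StandardCharsets.UTF_8).toLowerCase(Locale.ROOT);
        this.n = n;
        if (n < 0 || text.length() <= n) {
            throw new IllegalArgumentException("source must be longer than n");
        }
        // Record the character that follows each n-gram, duplicates included.
        for (int i = 0; i + n < text.length(); i++) {
            followers.computeIfAbsent(text.substring(i, i + n), k -> new ArrayList<>())
                     .add(text.charAt(i + n));
        }
    }

    // Assumption: seeds are n-grams taken from a random position in the source.
    private String randomSeed() {
        int start = rng.nextInt(text.length() - n + 1);
        return text.substring(start, start + n);
    }

    void generateText(int length, String result) throws IOException {
        StringBuilder out = new StringBuilder();
        String gram = randomSeed();
        out.append(gram);
        while (out.length() < length) {
            List<Character> next = followers.get(gram);
            if (next == null) {
                // Dead end (n-gram seen only at the end of the source): reseed.
                gram = randomSeed();
                out.append(gram);
                continue;
            }
            char c = next.get(rng.nextInt(next.size()));
            out.append(c);
            gram = (gram + c).substring(1); // slide the window by one character
        }
        out.setLength(length); // trim any overshoot from seeding
        Files.write(Paths.get(result), out.toString().getBytes(StandardCharsets.UTF_8));
    }
}
```

Other policies for dead ends (stopping early, for instance) are equally defensible; the description above does not say which one is expected.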
Testing
In main(), run your code on the following files:
Generate approximately 500 characters of text for each input. Print the text in lines of reasonable length, breaking only at spaces (not in the middle of a word). Do this for 1-grams, 2-grams, 4-grams, and 6-grams.
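Breaking only at spaces can be done with a simple greedy wrap: keep adding words to the current line until the next word would push it past a chosen width. The sketch below is one way to do that; the name LineWrapper and the width parameter are assumptions, only the space character is treated as a break point, and a word longer than the width is left unbroken on its own line.

```java
// Hypothetical helper, not part of the required RandomWriter API.
class LineWrapper {
    /** Wraps text so lines stay within width where possible, breaking only at spaces. */
    static String wrap(String text, int width) {
        StringBuilder out = new StringBuilder();
        int lineLength = 0;
        for (String word : text.split(" ")) {
            if (lineLength > 0 && lineLength + 1 + word.length() > width) {
                // The word does not fit: start a new line instead of adding a space.
                out.append('\n');
                lineLength = 0;
            } else if (lineLength > 0) {
                out.append(' ');
                lineLength++;
            }
            out.append(word);
            lineLength += word.length();
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(wrap("the three pirates ate their pie", 10));
    }
}
```

With a width of 10, the example sentence comes out as four lines: "the three", "pirates", "ate their", and "pie".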
Acknowledgments
This assignment is shamelessly copied from one created by David Matuszek at the University of Pennsylvania.