Difference between revisions of "CISC181 S2017 Lab8"

From class_wiki
Jump to: navigation, search
(n-grams)
 
(28 intermediate revisions by the same user not shown)
Line 1: Line 1:
 +
<p style="font-size:40px">[http://nameless.cis.udel.edu/class_data/181_s2017/lab8_grading.pdf Lab #8 grading]</p>
 +
 
===Preliminaries===
 
===Preliminaries===
  
Line 8: Line 10:
 
===Instructions===
 
===Instructions===
  
In this lab you will analyze text files by breaking them into [https://en.wikipedia.org/wiki/N-gram ''n-grams''] at the character level, and use those n-grams to generate random text in the same "style" (in a statistical sense).
+
In this lab you will analyze text files by breaking them into [https://en.wikipedia.org/wiki/N-gram ''n-grams''] at the character level, and use those n-grams to generate random text in the same "style" (in a statistical sense).  An n-gram is a sequence of n consecutive characters from the input.  The complete set of n-grams for a text overlap each other--for example, if the text is "woodchucks", the 3-grams are "woo", "ood", "odc", "dch", "chu", "huc", "uck", and "cks". 
 +
 
 +
Furthermore, we can keep track of what characters '''follow''' each n-gram.  For example, if the text is "the three pirates ate their pie", the 2-grams and a list of the characters following them are shown below:
 +
 
 +
{| class="wikitable" style="text-align: center" border="1" cellpadding="5"
 +
!width="10%"|2-gram
 +
!width="40%"|Characters after
 +
!width="10%"|2-gram
 +
!width="40%"|Characters after
 +
|-
 +
|"th"
 +
|"e", "r", "e"
 +
|"ra"
 +
|"t"
 +
|-
 +
|"he"
 +
|" ", "I"
 +
|"at"
 +
|"e", "e"
 +
|-
 +
|"e "
 +
|"t", "p", "t"
 +
|"te"
 +
|"s", " "
 +
|-
 +
|" t"
 +
|"h", "h"
 +
|"es"
 +
|" "
 +
|-
 +
|"hr"
 +
|"e"
 +
|"s "
 +
|"a"
 +
|-
 +
|"re"
 +
|"e"
 +
|" a"
 +
|"t"
 +
|-
 +
|"ee"
 +
|" "
 +
|"ei"
 +
|"r"
 +
|-
 +
|" p"
 +
|"i", "i"
 +
|"r "
 +
|"p"
 +
|-
 +
|"pi"
 +
|"r", "e"
 +
|"ie"
 +
|null
 +
|-
 +
|"ir"
 +
|"a", " "
 +
|
 +
|
 +
|}
 +
 
 +
Note that non-alphabetic characters are also recorded: spaces, punctuation, digits, and so on.  However, we will '''ignore capitalization'''.
 +
 
 +
Now consider how you might generate a new random text with the same statistics as the one you analyzed.  Start with a "seed" n-gram chosen randomly from the text.  Suppose "th" is chosen for the 2-gram pirate example.  This will be the beginning of your output. 
 +
 
 +
The next character output is chosen randomly from the list associated with "th": "e" is chosen with a 2/3 chance and "r" with a 1/3 chance.  Suppose an "e" is picked.  The output is now "the".
 +
 
 +
Now we drop the first character "t" from the last n-gram (the seed) that we were using and append the new character "e" to get our new seed "he".  We select a character randomly from the list associated with "he": " " (space) with 1/2 chance and "i" with 1/2 chance.  Suppose we choose "i".  The output is now "thei".
 +
 
 +
Update the seed again; now we have "ei".  There is only one character, "r", in the list associated with this 2-gram, so we pick it.  The output is now "their".
 +
 
 +
Now the seed is "ir".  " " or "a" is chosen with equal probability.  Suppose "a" is chosen.  Now the output is "theira" and the seed is "ra".
  
Suppose, for example, that you are working with 2-grams, and you have found that 80% of the time "th" is followed by "e ", 10% by "is", 7% by "at", and 3% by "es" (made-up numbers!). Then, when you are generating text, after you have generated "th" you should randomly choose "e " with probability 0.8, "is" with probability 0.1,  "at" with probability 0.07, and "es" with probability 0.03.
+
And so on.  If your program ever gets into a situation in which there are no characters to choose from (which can happen if the only occurrence of the current seed is at the exact end of the source), pick a new random seed and continue.
  
 
====RandomWriter====
 
====RandomWriter====
Line 17: Line 90:
  
 
* <tt>String source</tt>: The name of an input file to read and analyze
 
* <tt>String source</tt>: The name of an input file to read and analyze
* <tt>int n</tt> A non-negative number indicating the length of each "gram," or character sequence, to break the file into
+
* <tt>int n</tt>: A non-negative number indicating the length of each "gram," or character sequence, to break the file into
  
 
and also a method <tt>generateText()</tt> that takes the following two parameters:
 
and also a method <tt>generateText()</tt> that takes the following two parameters:
  
* <tt>int length</tt> A non-negative number of characters to generate.
+
* <tt>int length</tt>: A non-negative number of characters to generate.
* <tt>String result</tt>: The name of the output file
+
* <tt>String result</tt>: The name of the output file
 
 
After reading every word in the file, print the following information:
 
  
# Number of words
+
Some kind of [https://docs.oracle.com/javase/7/docs/api/java/util/Map.html ''map''] is the recommended data structure to store your n-grams and their character list associations.
# Longest word. Note that if there are multiple words which "tie", the expected behavior is to output the first one found
 
# Word with most vowels. Treat 'y' as a consonant
 
# Alphabetically first word with 4 or more letters (treating upper-case and lower-case the same). Do not count words that start with a non-alphabetic character
 
# Alphabetically last word with 4 or more letters (treating upper-case and lower-case the same)
 
  
After printing this information, make sure to close the file, then prompt the user again until they want to quit.
+
====Testing====
  
All of this should be be in a public class <tt>WordStats</tt>.  <tt>WordStats</tt> should have a constructor which takes the base directory string as its sole parameter, and handles the console input and looping itself.  '''Note: <tt>FileInputStream</tt>'s constructor wants the full-path name of the file.  The user should only have to enter the simple file name.  So the <tt>WordStats</tt> constructor parameter is the full path to the directory that the files live in, then concatenate that with the filename typed by the user'''
+
In <tt>main()</tt>, run your code on the following files:
  
Please instantiate your <tt>WordStats</tt> class by creating an object in <tt>main()</tt>.  You should test it with the following files:
 
 
* [http://nameless.cis.udel.edu/class_data/181_s2017/getty.txt getty]
 
 
* [http://nameless.cis.udel.edu/class_data/181_s2017/doi.txt doi]
 
* [http://nameless.cis.udel.edu/class_data/181_s2017/doi.txt doi]
 
* [http://nameless.cis.udel.edu/class_data/181_s2017/bts.txt bts]
 
* [http://nameless.cis.udel.edu/class_data/181_s2017/bts.txt bts]
 
* [http://nameless.cis.udel.edu/class_data/181_s2017/greatexp.txt greatexp]
 
* [http://nameless.cis.udel.edu/class_data/181_s2017/greatexp.txt greatexp]
  
Here is the expected output for the above files (you don't have to print it in this format--these are just the right values):
+
Generate 500 characters of text for each input. Print the text in reasonable length lines, breaking only at spaces (not in the middle of a word).  Do this for 1-grams, 2-grams, 4-grams, and 6-grams.
 +
 
 +
===Submission===
  
* getty.txt: 268 words, longest word = "proposition", word with most vowels = "proposition", first word = "above", last word = "years"
+
Submit your <tt>RandomWriter.java</tt> to Sakai, as well as a text file <tt>results.txt</tt> containing the outputs of your program for the different input files and n-gram lengths.  Inside the <tt>results.txt</tt>, clearly label what the source file and value of n was for each block of output text (there should be 3 input files x 4 values of n = 12 such blocks).  Put your name in both files.
* doi.txt: 1325 words, longest word = "undistinguished", word with most vowels = "naturalization", first word = "abdicated", last word = "would"
 
* bts.txt: 14574 words, longest word = "obstreperousness", word with most vowels = "qualifications", first word = "abate", last word = "youth"
 
* greatexp.txt: 186685 words, longest word = "architectooralooral", word with most vowels = "architectooralooral", first word = "a'most", last word = "zest"
 
  
 
===Acknowledgments===
 
===Acknowledgments===
  
This assignment is shamelessly copied from one created by David Matuszek at the University of Pennsylvania: [http://www.cis.upenn.edu/~matuszek/cis554-2016/Assignments/scala-2-ngrams.html].
+
This assignment is adapted from [http://www.cis.upenn.edu/~matuszek/cis554-2016/Assignments/scala-2-ngrams.html one created by David Matuszek] at the University of Pennsylvania and Joe Zachary's [http://nifty.stanford.edu/2003/randomwriter/handout.html random writer assignment].

Latest revision as of 08:16, 15 May 2017

Lab #8 grading

Preliminaries

  • Make a new project with n = 8 (following these instructions)
  • Name your main class "Lab8" (when creating a new module in the instructions above, in the Java class name field)
  • Modify Lab8.java by adding your name and section number in a comment before the Lab8 class body.

Instructions

In this lab you will analyze text files by breaking them into n-grams at the character level, and use those n-grams to generate random text in the same "style" (in a statistical sense). An n-gram is a sequence of n consecutive characters from the input. The complete set of n-grams for a text overlap each other--for example, if the text is "woodchucks", the 3-grams are "woo", "ood", "odc", "dch", "chu", "huc", "uck", and "cks".

Furthermore, we can keep track of what characters follow each n-gram. For example, if the text is "the three pirates ate their pie", the 2-grams and a list of the characters following them are shown below:

2-gram Characters after 2-gram Characters after
"th" "e", "r", "e" "ra" "t"
"he" " ", "I" "at" "e", "e"
"e " "t", "p", "t" "te" "s", " "
" t" "h", "h" "es" " "
"hr" "e" "s " "a"
"re" "e" " a" "t"
"ee" " " "ei" "r"
" p" "i", "i" "r " "p"
"pi" "r", "e" "ie" null
"ir" "a", " "

Note that non-alphabetic characters are also recorded: spaces, punctuation, digits, and so on. However, we will ignore capitalization.

Now consider how you might generate a new random text with the same statistics as the one you analyzed. Start with a "seed" n-gram chosen randomly from the text. Suppose "th" is chosen for the 2-gram pirate example. This will be the beginning of your output.

The next character output is chosen randomly from the list associated with "th": "e" is chosen with a 2/3 chance and "r" with a 1/3 chance. Suppose an "e" is picked. The output is now "the".

Now we drop the first character "t" from the last n-gram (the seed) that we were using and append the new character "e" to get our new seed "he". We select a character randomly from the list associated with "he": " " (space) with 1/2 chance and "i" with 1/2 chance. Suppose we choose "i". The output is now "thei".

Update the seed again; now we have "ei". There is only one character, "r", in the list associated with this 2-gram, so we pick it. The output is now "their".

Now the seed is "ir". " " or "a" is chosen with equal probability. Suppose "a" is chosen. Now the output is "theira" and the seed is "ra".

And so on. If your program ever gets into a situation in which there are no characters to choose from (which can happen if the only occurrence of the current seed is at the exact end of the source), pick a new random seed and continue.

RandomWriter

You are to implement a Java public class RandomWriter that provides a random writing application. Your class should have a two-argument constructor that takes:

  • String source: The name of an input file to read and analyze
  • int n: A non-negative number indicating the length of each "gram," or character sequence, to break the file into

and also a method generateText() that takes the following two parameters:

  • int length: A non-negative number of characters to generate.
  • String result: The name of the output file

Some kind of map is the recommended data structure to store your n-grams and their character list associations.

Testing

In main(), run your code on the following files:

Generate 500 characters of text for each input. Print the text in reasonable length lines, breaking only at spaces (not in the middle of a word). Do this for 1-grams, 2-grams, 4-grams, and 6-grams.

Submission

Submit your RandomWriter.java to Sakai, as well as a text file results.txt containing the outputs of your program for the different input files and n-gram lengths. Inside the results.txt, clearly label what the source file and value of n was for each block of output text (there should be 3 input files x 4 values of n = 12 such blocks). Put your name in both files.

Acknowledgments

This assignment is adapted from one created by David Matuszek at the University of Pennsylvania and Joe Zachary's random writer assignment.