Difference between revisions of "CISC181 S2017 Lab8"

Revision as of 14:34, 14 April 2017

Preliminaries

  • Make a new project with n = 8 (following these instructions)
  • Name your main class "Lab8" (enter it in the Java class name field when creating the new module, per the instructions above)
  • Modify Lab8.java by adding your name and section number in a comment before the Lab8 class body.

Instructions

In this lab you will analyze text files by breaking them into n-grams at the character level, and use those n-grams to generate random text in the same "style" (in a statistical sense).
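One way to picture the analysis step is a table that maps each n-gram to a count of the n-grams seen immediately after it. The sketch below is only an illustration under assumptions: the class name NGramTable and the method countFollowers are hypothetical, the window slides one character at a time, and n is assumed to be at least 1.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of the lab's required API.
final class NGramTable {
    /** Maps each n-character gram to counts of the n-character grams that follow it. */
    static Map<String, Map<String, Integer>> countFollowers(String text, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("this sketch assumes n >= 1");
        }
        Map<String, Map<String, Integer>> table = new HashMap<>();
        // Slide one character at a time; stop when no full follower gram remains.
        for (int i = 0; i + 2 * n <= text.length(); i++) {
            String gram = text.substring(i, i + n);
            String next = text.substring(i + n, i + 2 * n);
            table.computeIfAbsent(gram, k -> new HashMap<>()).merge(next, 1, Integer::sum);
        }
        return table;
    }

    public static void main(String[] args) {
        // In "the theme", the 2-gram "th" is followed once by "e " and once by "em".
        System.out.println(countFollowers("the theme", 2).get("th"));
    }
}
```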

Suppose, for example, that you are working with 2-grams, and you have found that 80% of the time "th" is followed by "e ", 10% by "is", 7% by "at", and 3% by "es" (made-up numbers!). Then, when you are generating text, after you have generated "th" you should randomly choose "e " with probability 0.8, "is" with probability 0.1, "at" with probability 0.07, and "es" with probability 0.03.
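The weighted choice described above can be sketched by drawing a random integer below the total count and walking through the followers until that number is used up. The names WeightedChoice and pick are hypothetical, and the counts 80, 10, 7, and 3 simply stand in for the made-up percentages in the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

// Hypothetical helper, not part of the lab's required API.
final class WeightedChoice {
    /** Picks a key with probability proportional to its count. */
    static String pick(Map<String, Integer> counts, Random rng) {
        int total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        if (total <= 0) {
            throw new IllegalArgumentException("counts must have a positive total");
        }
        int r = rng.nextInt(total); // uniform in [0, total)
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            r -= e.getValue();
            if (r < 0) {
                return e.getKey();
            }
        }
        throw new IllegalStateException("unreachable when total > 0");
    }

    public static void main(String[] args) {
        Map<String, Integer> afterTh = new LinkedHashMap<>();
        afterTh.put("e ", 80);
        afterTh.put("is", 10);
        afterTh.put("at", 7);
        afterTh.put("es", 3);
        Random rng = new Random();
        // Each call returns "e " about 80% of the time, "is" about 10%, and so on.
        System.out.println("th" + pick(afterTh, rng));
    }
}
```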

RandomWriter

You are to implement a Java public class RandomWriter that provides a random writing application. Your class should have a two-argument constructor that takes:

  • String source: The name of an input file to read and analyze
  • int n: A non-negative number indicating the length of each "gram," or character sequence, to break the file into

and also a method generateText() that takes the following two parameters:

  • int length: A non-negative number of characters to generate
  • String result: The name of the output file
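A minimal sketch of how the constructor and generateText() might fit together follows. It is one possible shape, not the required solution, and it rests on several assumptions that are not in the lab text: n is at least 1, the file is read as UTF-8, followers are stored in a list so that repeats encode frequency, and generation restarts from a random gram when the current gram has no recorded follower. The class is declared without public only so the sketch compiles in any file; the lab asks for a public class.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

class RandomWriter {
    private final int n;
    private final String text;
    // Each gram maps to every gram observed right after it; repeats encode frequency.
    private final Map<String, List<String>> followers = new HashMap<>();
    private final Random rng = new Random();

    RandomWriter(String source, int n) throws IOException {
        if (n < 1) {
            throw new IllegalArgumentException("this sketch assumes n >= 1");
        }
        this.n = n;
        this.text = new String(Files.readAllBytes(Paths.get(source)), StandardCharsets.UTF_8);
        if (text.length() < n) {
            throw new IllegalArgumentException("source is shorter than n");
        }
        for (int i = 0; i + 2 * n <= text.length(); i++) {
            followers.computeIfAbsent(text.substring(i, i + n), k -> new ArrayList<>())
                     .add(text.substring(i + n, i + 2 * n));
        }
    }

    void generateText(int length, String result) throws IOException {
        StringBuilder out = new StringBuilder();
        String gram = randomGram();
        while (out.length() < length) {
            out.append(gram);
            List<String> next = followers.get(gram);
            // Dead end (gram never seen with a full follower): restart from a random gram.
            gram = (next == null) ? randomGram() : next.get(rng.nextInt(next.size()));
        }
        out.setLength(length); // trim any overshoot from the last gram
        Files.write(Paths.get(result), out.toString().getBytes(StandardCharsets.UTF_8));
    }

    private String randomGram() {
        int start = rng.nextInt(text.length() - n + 1);
        return text.substring(start, start + n);
    }

    public static void main(String[] args) throws IOException {
        // Usage: java RandomWriter <source file> <n> <output file>
        new RandomWriter(args[0], Integer.parseInt(args[1])).generateText(500, args[2]);
    }
}
```

Picking uniformly from a list that contains repeats gives the same distribution as weighting by counts, at the cost of extra memory on large inputs.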

Testing

In main(), run your code on the following files:

  • getty (http://nameless.cis.udel.edu/class_data/181_s2017/getty.txt)
  • greatexp (http://nameless.cis.udel.edu/class_data/181_s2017/greatexp.txt)

Generate approximately 500 characters of text for each input. Print the text in reasonable length lines, breaking only at spaces (not in the middle of a word). Do this for 1-grams, 2-grams, 3-grams, 4-grams, and 5-grams.
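Breaking lines only at spaces can be done with a simple greedy wrap, sketched below. The names LineWrapper and wrap are hypothetical, and the width of 60 in main() is an arbitrary choice for "reasonable length". Note that this version collapses runs of whitespace (including newlines in the generated text) into single spaces, and a word longer than the width is placed on a line of its own rather than split.

```java
// Hypothetical helper, not part of the lab's required API.
final class LineWrapper {
    /** Greedy wrap: fills each line up to width characters, breaking only at spaces. */
    static String wrap(String text, int width) {
        StringBuilder out = new StringBuilder();
        int lineLength = 0;
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // happens only when the input is blank
            }
            if (lineLength > 0 && lineLength + 1 + word.length() > width) {
                out.append('\n'); // word does not fit: start a new line
                lineLength = 0;
            } else if (lineLength > 0) {
                out.append(' ');
                lineLength++;
            }
            out.append(word);
            lineLength += word.length();
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "four score and seven years ago our fathers brought forth "
                + "on this continent a new nation";
        System.out.println(wrap(sample, 60));
    }
}
```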

Acknowledgments

This assignment is shamelessly copied from one created by David Matuszek at the University of Pennsylvania: http://www.cis.upenn.edu/~matuszek/cis554-2016/Assignments/scala-2-ngrams.html