Difference between revisions of "CISC181 S2017 Lab8"
Revision as of 13:21, 14 April 2017
Preliminaries
- Make a new project with n = 8 (following these instructions)
- Name your main class "Lab8" (when creating a new module in the instructions above, in the Java class name field)
- Modify Lab8.java by adding your name and section number in a comment before the Lab8 class body.
Instructions
In this lab you will analyze text files by breaking them into n-grams at the character level, and use those n-grams to generate random text in the same "style" (in a statistical sense).
Suppose, for example, that you are working with 2-grams, and you have found that 80% of the time "th" is followed by "e ", 10% by "is", 7% by "at", and 3% by "es" (made-up numbers!). Then, when you are generating text, after you have generated "th" you should randomly choose "e " with probability 0.8, "is" with probability 0.1, "at" with probability 0.07, and "es" with probability 0.03.
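One way to make that random choice is to store, for each n-gram, a count of how often each follower was seen, and then pick a follower with probability proportional to its count. The sketch below assumes the counts live in a Map<String, Integer>; the class name FollowerSampler and the method pick are hypothetical names for illustration and are not required by the lab.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

// Hypothetical helper, not one of the lab's required classes.
final class FollowerSampler {
    // Returns one key of 'counts', chosen with probability count / total.
    static String pick(Map<String, Integer> counts, Random rng) {
        int total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        if (total <= 0) {
            throw new IllegalArgumentException("no followers recorded");
        }
        int r = rng.nextInt(total); // uniform integer in [0, total)
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            r -= e.getValue();
            if (r < 0) {
                return e.getKey();
            }
        }
        throw new IllegalStateException("unreachable when total > 0");
    }

    public static void main(String[] args) {
        // The made-up percentages from the text, stored as counts out of 100.
        Map<String, Integer> afterTh = new LinkedHashMap<>();
        afterTh.put("e ", 80);
        afterTh.put("is", 10);
        afterTh.put("at", 7);
        afterTh.put("es", 3);
        System.out.println("th" + pick(afterTh, new Random()));
    }
}
```

Working with raw counts rather than probabilities avoids floating-point rounding, and the counts can be updated directly while the file is being read.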
n-grams
In this exercise you will repeatedly prompt the user in the console to enter the name of a text file (relative to a base directory of your choosing) or 'q' to quit. If they do not want to quit, open the file with FileInputStream and read it with a Scanner, using this regular expression as your delimiter/word separator:
"[\\s.!?,;:\\-()_\"]+"
Be careful about cutting and pasting this into Android Studio. I have seen extra backslashes inserted automatically for several students, so make sure your delimiter string matches what you see here.
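As a quick check that the delimiter string survived pasting, a minimal sketch like the following can be run on a short sample before trying a real file. It reads from an in-memory stream so that it is self-contained; in the lab the stream would be a FileInputStream. The class name DelimiterDemo and the method readWords are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical demo class, not one of the lab's required classes.
final class DelimiterDemo {
    // The delimiter from the instructions, exactly as it should appear in Java source.
    static final String DELIMITER = "[\\s.!?,;:\\-()_\"]+";

    // Reads every token from the stream, splitting on the delimiter above.
    static List<String> readWords(InputStream in) {
        List<String> words = new ArrayList<>();
        // try-with-resources closes the Scanner and the underlying stream.
        try (Scanner scanner = new Scanner(in, "UTF-8")) {
            scanner.useDelimiter(DELIMITER);
            while (scanner.hasNext()) {
                words.add(scanner.next());
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String sample = "Four score, and seven years ago -- our fathers...";
        InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
        System.out.println(readWords(in));
    }
}
```

Note that the apostrophe is not in the delimiter set, so a token such as "a'most" is kept as a single word.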
After reading every word in the file, print the following information:
- Number of words
- Longest word. Note that if there are multiple words which "tie", the expected behavior is to output the first one found
- Word with most vowels. Treat 'y' as a consonant
- Alphabetically first word with 4 or more letters (treating upper-case and lower-case the same). Do not count words that start with a non-alphabetic character
- Alphabetically last word with 4 or more letters (treating upper-case and lower-case the same)
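The rules in the list above can be sketched as small helper methods. This is only one possible reading, with two assumptions: "4 or more letters" is treated as a string length of at least 4, and the non-alphabetic-first-character rule is applied only to the first-word search, since that is the only item that states it. The class name WordStatsSketch and all method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; the lab itself asks for a class named WordStats.
final class WordStatsSketch {
    // Counts a, e, i, o, u in either case; 'y' is deliberately excluded.
    static int countVowels(String word) {
        int n = 0;
        for (char c : word.toLowerCase().toCharArray()) {
            if ("aeiou".indexOf(c) >= 0) {
                n++;
            }
        }
        return n;
    }

    // Strict '>' keeps the earliest word when lengths tie.
    static String longest(List<String> words) {
        String best = null;
        for (String w : words) {
            if (best == null || w.length() > best.length()) {
                best = w;
            }
        }
        return best;
    }

    // Same tie rule as longest(): the first word with the top count wins.
    static String mostVowels(List<String> words) {
        String best = null;
        int bestCount = -1;
        for (String w : words) {
            int c = countVowels(w);
            if (c > bestCount) {
                best = w;
                bestCount = c;
            }
        }
        return best;
    }

    // Assumption: "4 or more letters" is read as length >= 4.
    static String alphabeticallyFirst(List<String> words) {
        String best = null;
        for (String w : words) {
            if (w.length() < 4 || !Character.isLetter(w.charAt(0))) {
                continue;
            }
            if (best == null || w.compareToIgnoreCase(best) < 0) {
                best = w;
            }
        }
        return best;
    }

    // Assumption: only the length rule applies here, as in the list item.
    static String alphabeticallyLast(List<String> words) {
        String best = null;
        for (String w : words) {
            if (w.length() < 4) {
                continue;
            }
            if (best == null || w.compareToIgnoreCase(best) > 0) {
                best = w;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "Four", "score", "and", "seven", "years", "ago", "our", "fathers");
        System.out.println("words = " + words.size());
        System.out.println("longest = " + longest(words));
        System.out.println("most vowels = " + mostVowels(words));
        System.out.println("first = " + alphabeticallyFirst(words));
        System.out.println("last = " + alphabeticallyLast(words));
    }
}
```

Each method returns null for an empty list (or when no word qualifies), so a caller should check for that before printing.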
After printing this information, make sure to close the file, then prompt the user again until they want to quit.
All of this should be in a public class WordStats. WordStats should have a constructor which takes the base directory string as its sole parameter, and handles the console input and looping itself. Note: FileInputStream's constructor wants the full-path name of the file, but the user should only have to enter the simple file name. So the WordStats constructor parameter should be the full path to the directory that the files live in, and your code should concatenate that with the filename typed by the user.
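A minimal sketch of the path handling, assuming the base directory and the typed file name are both plain strings: java.io.File can join the two parts so that you do not have to worry about whether the base directory ends in a separator. The class name PathSketch and the methods isQuit and fullPath are hypothetical and are not part of the required WordStats interface.

```java
import java.io.File;

// Hypothetical helper, shown separately from the required WordStats class.
final class PathSketch {
    // True when the user typed the quit command described in the instructions.
    static boolean isQuit(String input) {
        return input.trim().equals("q");
    }

    // Joins the base directory and the simple file name typed by the user.
    static String fullPath(String baseDir, String fileName) {
        return new File(baseDir, fileName.trim()).getPath();
    }

    public static void main(String[] args) {
        // "texts" is a placeholder base directory chosen for this example.
        System.out.println(fullPath("texts", "getty.txt"));
    }
}
```

The resulting string can be passed directly to the FileInputStream constructor.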
Please instantiate your WordStats class by creating an object in main(). You should test it with the following files: getty.txt, doi.txt, bts.txt, and greatexp.txt.
Here is the expected output for the above files (you don't have to print it in this format--these are just the right values):
- getty.txt: 268 words, longest word = "proposition", word with most vowels = "proposition", first word = "above", last word = "years"
- doi.txt: 1325 words, longest word = "undistinguished", word with most vowels = "naturalization", first word = "abdicated", last word = "would"
- bts.txt: 14574 words, longest word = "obstreperousness", word with most vowels = "qualifications", first word = "abate", last word = "youth"
- greatexp.txt: 186685 words, longest word = "architectooralooral", word with most vowels = "architectooralooral", first word = "a'most", last word = "zest"
Acknowledgments
This assignment is shamelessly copied from one created by David Matuszek at the University of Pennsylvania: [1].