CISC181 S2017 Lab8
Revision as of 13:34, 14 April 2017
Preliminaries
- Make a new project with n = 8 (following these instructions)
- Name your main class "Lab8" (enter it in the Java class name field when creating a new module, as described in the instructions above)
- Modify Lab8.java by adding your name and section number in a comment before the Lab8 class body.
Instructions
In this lab you will analyze text files by breaking them into n-grams at the character level, and use those n-grams to generate random text in the same "style" (in a statistical sense).
Suppose, for example, that you are working with 2-grams, and you have found that 80% of the time "th" is followed by "e ", 10% by "is", 7% by "at", and 3% by "es" (made-up numbers!). Then, when you are generating text, after you have generated "th" you should randomly choose "e " with probability 0.8, "is" with probability 0.1, "at" with probability 0.07, and "es" with probability 0.03.
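One way to make such a weighted choice is sketched below in Java: keep a count for each follower, draw a random number below the total, and walk the counts until it is used up. The class name FollowerPicker and the method pick are hypothetical names chosen for illustration, and the counts simply reuse the made-up percentages above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

// Hypothetical helper, not part of the lab's required API.
class FollowerPicker {
    // Returns a key chosen with probability proportional to its count.
    // Assumes the map is non-empty and every count is positive.
    static String pick(Map<String, Integer> counts, Random rng) {
        int total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        int r = rng.nextInt(total); // uniform in [0, total)
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            r -= e.getValue();
            if (r < 0) {
                return e.getKey();
            }
        }
        throw new IllegalStateException("counts must be positive");
    }

    public static void main(String[] args) {
        // Made-up follower counts for the 2-gram "th", as in the example.
        Map<String, Integer> afterTh = new LinkedHashMap<>();
        afterTh.put("e ", 80);
        afterTh.put("is", 10);
        afterTh.put("at", 7);
        afterTh.put("es", 3);

        // Tally 1000 draws; the tallies should roughly track the counts.
        Random rng = new Random();
        Map<String, Integer> tally = new LinkedHashMap<>();
        for (int i = 0; i < 1000; i++) {
            tally.merge(pick(afterTh, rng), 1, Integer::sum);
        }
        System.out.println(tally);
    }
}
```

Storing every observed follower in a list, repeats included, and picking a uniformly random element gives the same distribution with less code, at the cost of more memory.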
RandomWriter
Implement a public Java class RandomWriter that provides a random-writing application. The class should have a two-argument constructor that takes:
- String source: The name of an input file to read and analyze
- int n: A non-negative number indicating the length of each "gram," or character sequence, to break the file into
and also a method generateText() that takes the following two parameters:
- int length: A non-negative number of characters to generate
- String result: The name of the output file
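A minimal sketch of a class with this shape follows. It is one possible reading of the specification, not a reference solution. It assumes that the n-gram starting at every character position is paired with the n-gram immediately after it, that generation restarts from a random gram when it reaches one with no recorded follower, and that n = 0 yields an empty output file. The fields followers and starts are hypothetical names.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// The lab asks for a public class; "public" is omitted here only so the
// sketch compiles regardless of the file it is pasted into.
class RandomWriter {
    private final Random rng = new Random();
    // Assumed model: each n-gram maps to every n-gram seen right after it.
    // Repeats are kept, so a uniform pick from a list is frequency-weighted.
    private final Map<String, List<String>> followers = new HashMap<>();
    // Every gram that has at least one follower (used to start or restart).
    private final List<String> starts = new ArrayList<>();

    RandomWriter(String source, int n) throws IOException {
        if (n < 0) {
            throw new IllegalArgumentException("n must be non-negative");
        }
        String text = new String(
                Files.readAllBytes(Paths.get(source)), StandardCharsets.UTF_8);
        for (int i = 0; n > 0 && i + 2 * n <= text.length(); i++) {
            String gram = text.substring(i, i + n);
            String next = text.substring(i + n, i + 2 * n);
            followers.computeIfAbsent(gram, k -> new ArrayList<>()).add(next);
            starts.add(gram);
        }
    }

    void generateText(int length, String result) throws IOException {
        if (length < 0) {
            throw new IllegalArgumentException("length must be non-negative");
        }
        StringBuilder out = new StringBuilder();
        if (!starts.isEmpty()) {
            String gram = starts.get(rng.nextInt(starts.size()));
            while (out.length() < length) {
                out.append(gram);
                List<String> options = followers.get(gram);
                // No recorded follower: restart from a random known gram.
                gram = (options == null)
                        ? starts.get(rng.nextInt(starts.size()))
                        : options.get(rng.nextInt(options.size()));
            }
            out.setLength(length); // trim the final gram to the requested size
        }
        Files.write(Paths.get(result),
                out.toString().getBytes(StandardCharsets.UTF_8));
    }
}
```

Under these assumptions, new RandomWriter("getty.txt", 3).generateText(500, "out.txt") would write 500 characters built from 3-grams; the file names are placeholders.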
Testing
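In main(), run your code on the provided text files, which include:
- getty: http://nameless.cis.udel.edu/class_data/181_s2017/getty.txt
- greatexp: http://nameless.cis.udel.edu/class_data/181_s2017/greatexp.txt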
Generate approximately 500 characters of text for each input. Print the text in reasonable-length lines, breaking only at spaces (not in the middle of a word). Do this for 1-grams, 2-grams, 3-grams, 4-grams, and 5-grams.
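The line-breaking requirement can be met with a greedy word wrap, sketched below. LineWrapper and wrap are hypothetical names, and the width of 72 in the demo is an arbitrary choice for "reasonable length". The sketch treats any run of whitespace, including newlines already present in the generated text, as a single space, and it leaves a word longer than the width unbroken on its own line.

```java
// Hypothetical helper, not part of the lab's required API.
class LineWrapper {
    // Greedy wrap: starts a new line whenever the next word would push
    // the current line past `width` characters. Never splits a word.
    static String wrap(String text, int width) {
        StringBuilder out = new StringBuilder();
        int lineLen = 0;
        for (String word : text.trim().split("\\s+")) {
            if (lineLen > 0 && lineLen + 1 + word.length() > width) {
                out.append('\n');
                lineLen = 0;
            } else if (lineLen > 0) {
                out.append(' ');
                lineLen++;
            }
            out.append(word);
            lineLen += word.length();
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "four score and seven years ago our fathers brought "
                + "forth on this continent a new nation conceived in liberty";
        System.out.println(wrap(sample, 72));
    }
}
```

In main(), a loop over n from 1 to 5 could call generateText(500, ...) for each input file, read the output file back, and print the wrapped text.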
Acknowledgments
This assignment is shamelessly copied from one created by David Matuszek at the University of Pennsylvania: http://www.cis.upenn.edu/~matuszek/cis554-2016/Assignments/scala-2-ngrams.html.