Difference between revisions of "CISC181 S2017 Lab8"

Revision as of 09:43, 17 April 2017

In this lab you will analyze text files by breaking them into n-grams at the character level, and use those n-grams to generate random text in the same "style" (in a statistical sense). An n-gram is a sequence of n consecutive characters from the input. The n-grams of a text overlap one another. For example, if the text is "the three pirates ate their pie", its 2-grams are listed below:

2-gram	Characters after	2-gram	Characters after
"th"	"e", "r", "e"	"ra"
"he"		"at"
"e "		"te"
" t"		"es"
"hr"		"s "
"re"		" a"
"ee"		" t"
" p"		"ei"
"pi"		"r "
"ir"		"ie"

Your job is to find all of the n-grams for a text and to record every character that follows each n-gram. Take the text "woodchucks" as an example. No 3-gram in it is repeated, but suppose you look at 1-grams. The distinct characters are "w", "o", "d", "c", "h", "u", "k", and "s". "o" is followed by "o" once and by "d" once; "c" is followed by "h" once and by "k" once.
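The bookkeeping described above can be sketched as a map from each n-gram to the list of characters seen after it. This is only one possible approach, not the required design: the class name NGramTable and the method build are hypothetical, and duplicates are kept in each list on purpose so that the list also records how often each follower occurs. Note that the final n-gram of the text has no following character, so it gets an entry only if it also occurs earlier.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: names are illustrative, not part of the assignment.
class NGramTable {
    /** Maps each n-gram of text to the characters that follow it (duplicates kept). */
    static Map<String, List<Character>> build(String text, int n) {
        // LinkedHashMap keeps n-grams in order of first appearance, which makes output easy to read.
        Map<String, List<Character>> followers = new LinkedHashMap<>();
        // Stop while there is still one character left after the n-gram.
        for (int i = 0; i + n < text.length(); i++) {
            String gram = text.substring(i, i + n);
            char next = text.charAt(i + n);
            followers.computeIfAbsent(gram, k -> new ArrayList<>()).add(next);
        }
        return followers;
    }

    public static void main(String[] args) {
        // 1-grams of "woodchucks": "o" maps to [o, d] and "c" maps to [h, k].
        System.out.println(build("woodchucks", 1));
        // 2-grams of the pirate sentence: "th" maps to [e, r, e].
        System.out.println(build("the three pirates ate their pie", 2).get("th"));
    }
}
```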

Suppose, for example, that you are working with 2-grams, and you have found that 80% of the time "th" is followed by "e ", 10% by "is", 7% by "at", and 3% by "es". Then, when you are generating text, after you have generated "th" you should randomly choose "e " with probability 0.8, "is" with probability 0.1, "at" with probability 0.07, and "es" with probability 0.03.
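One simple way to get these probabilities, assuming followers are stored in a list with one entry per occurrence, is to pick a uniformly random element of that list: a follower that appears in 80 of 100 entries is then chosen about 80% of the time. The sketch below uses hypothetical names (FollowerChooser, choose) and fabricates a 100-entry list purely to mirror the percentages in the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper: names are illustrative, not part of the assignment.
class FollowerChooser {
    /** Picks one element uniformly at random; repeated entries make common followers likelier. */
    static <T> T choose(List<T> followers, Random rng) {
        return followers.get(rng.nextInt(followers.size()));
    }

    public static void main(String[] args) {
        // Illustrative observations matching the 80/10/7/3 split in the text.
        List<String> afterTh = new ArrayList<>();
        afterTh.addAll(Collections.nCopies(80, "e "));
        afterTh.addAll(Collections.nCopies(10, "is"));
        afterTh.addAll(Collections.nCopies(7, "at"));
        afterTh.addAll(Collections.nCopies(3, "es"));

        Random rng = new Random();
        int trials = 100_000;
        int hits = 0;
        for (int i = 0; i < trials; i++) {
            if (choose(afterTh, rng).equals("e ")) {
                hits++;
            }
        }
        // The printed share should be close to 80%, varying slightly from run to run.
        System.out.printf("\"e \" chosen %.1f%% of the time%n", 100.0 * hits / trials);
    }
}
```

Storing duplicates costs memory on large inputs; keeping a count per follower and sampling against cumulative counts is an alternative with the same distribution.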

RandomWriter

Implement a public Java class named RandomWriter that provides a random writing application. Your class should have a two-argument constructor that takes:

String source: The name of an input file to read and analyze
int n: A non-negative number indicating the length of each "gram," or character sequence, to break the file into

and also a method generateText() that takes the following two parameters:

int length: A non-negative number of characters to generate.
String result: The name of the output file
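A minimal sketch of how such a class might fit together is shown below. Only the constructor and generateText() signatures come from the description above; everything else is an assumption: the file is read as UTF-8, I/O failures are rethrown as UncheckedIOException, the source must be longer than n characters, and a dead end (an n-gram with no recorded follower) is handled by restarting from a random position in the source. The class is declared without `public` here so the snippet compiles from any file name; add `public` when it lives in RandomWriter.java.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Sketch only: field names and error handling are assumptions, not requirements.
class RandomWriter {
    private final String text;
    private final int n;
    private final Map<String, List<Character>> followers = new HashMap<>();
    private final Random rng = new Random();

    RandomWriter(String source, int n) {
        if (n < 0) {
            throw new IllegalArgumentException("n must be non-negative");
        }
        try {
            this.text = new String(Files.readAllBytes(Paths.get(source)), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (text.length() <= n) {
            throw new IllegalArgumentException("source must contain more than n characters");
        }
        this.n = n;
        // Record, for every n-gram, each character that follows it (duplicates kept).
        for (int i = 0; i + n < text.length(); i++) {
            followers.computeIfAbsent(text.substring(i, i + n), k -> new ArrayList<>())
                     .add(text.charAt(i + n));
        }
    }

    /** Returns an n-gram from a random position that is guaranteed to have a follower. */
    private String randomSeed() {
        int start = rng.nextInt(text.length() - n);
        return text.substring(start, start + n);
    }

    void generateText(int length, String result) {
        if (length < 0) {
            throw new IllegalArgumentException("length must be non-negative");
        }
        StringBuilder out = new StringBuilder();
        String gram = randomSeed();
        out.append(gram);
        while (out.length() < length) {
            List<Character> next = followers.get(gram);
            if (next == null) {
                // Dead end: this n-gram only occurs at the very end of the source.
                gram = randomSeed();
                out.append(gram);
                continue;
            }
            char c = next.get(rng.nextInt(next.size()));
            out.append(c);
            // Slide the window: drop the first character, add the new one.
            gram = (gram + c).substring(1);
        }
        out.setLength(length); // trim if a seed pushed the output past the target
        try {
            Files.write(Paths.get(result), out.toString().getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With n equal to 0 the only "n-gram" is the empty string, so every character is drawn according to its overall frequency in the source.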

Testing

In main(), run your code on the following files:

Generate approximately 500 characters of text for each input. Print the text in lines of reasonable length, breaking only at spaces (not in the middle of a word). Do this for 1-grams, 2-grams, 4-grams, and 6-grams.
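Breaking only at spaces can be sketched as a greedy word wrap: keep adding words to the current line until the next word would exceed the chosen width. The names LineWrapper and wrap are hypothetical, and the sketch makes two assumptions: runs of whitespace in the generated text (including newlines) are collapsed to a single space, and a word longer than the width is left unbroken on its own line.

```java
// Hypothetical helper: names are illustrative, not part of the assignment.
class LineWrapper {
    /** Breaks text into lines of at most width characters, splitting only at whitespace. */
    static String wrap(String text, int width) {
        StringBuilder out = new StringBuilder();
        int lineLength = 0;
        for (String word : text.trim().split("\\s+")) {
            if (lineLength > 0 && lineLength + 1 + word.length() > width) {
                // The word does not fit on the current line: start a new one.
                out.append('\n');
                lineLength = 0;
            } else if (lineLength > 0) {
                // The word fits: separate it from the previous word.
                out.append(' ');
                lineLength++;
            }
            out.append(word);
            lineLength += word.length();
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints three lines: "the three", "pirates ate", "their pie".
        System.out.println(wrap("the three pirates ate their pie", 12));
    }
}
```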

Acknowledgments

This assignment is shamelessly copied from one created by David Matuszek at the University of Pennsylvania.

