Difference between revisions of "CISC181 S2017 Lab8"

Revision as of 09:59, 17 April 2017

Preliminaries

  • Make a new project with n = 8 (following these instructions)
  • Name your main class "Lab8" (when creating a new module in the instructions above, in the Java class name field)
  • Modify Lab8.java by adding your name and section number in a comment before the Lab8 class body.

Instructions

In this lab you will analyze text files by breaking them into n-grams at the character level, and then use those n-grams to generate random text in the same "style" (in a statistical sense). An n-gram is a sequence of n consecutive characters from the input. The n-grams of a text overlap one another: for example, if the text is "woodchucks", the 3-grams are "woo", "ood", "odc", "dch", "chu", "huc", "uck", and "cks".
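The overlapping extraction described above can be sketched as a sliding window over the text. This is only an illustration: the class name NGramDemo and the method ngrams() are made up for this example and are not part of the lab's required interface.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, used only to illustrate n-gram extraction.
class NGramDemo {
    // Returns every run of n consecutive characters in text, in order of appearance.
    // Assumes n >= 1; neighboring n-grams overlap in n - 1 characters.
    static List<String> ngrams(String text, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Prints [woo, ood, odc, dch, chu, huc, uck, cks]
        System.out.println(ngrams("woodchucks", 3));
    }
}
```

A text of length L yields L - n + 1 n-grams, which is why the 10-character "woodchucks" produces eight 3-grams.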

Furthermore, we can keep track of what characters follow each n-gram. For example, if the text is "the three pirates ate their pie", the 2-grams and a list of the characters following them are shown below:

2-gram   Characters after    2-gram   Characters after
"th"     "e", "r", "e"       "ra"     "t"
"he"     " ", "i"            "at"     "e", "e"
"e "     "t", "p", "t"       "te"     "s", " "
" t"     "h", "h"            "es"     " "
"hr"     "e"                 "s "     "a"
"re"     "e"                 " a"     "t"
"ee"     " "                 "ei"     "r"
" p"     "i", "i"            "r "     "p"
"pi"     "r", "e"            "ie"     null
"ir"     "a", " "

Note that non-alphabetic characters are also recorded: spaces, punctuation, digits, and so on. However, we will ignore capitalization.
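One possible way to record the followers of each n-gram is a map from the n-gram to a list of the characters seen after it, keeping duplicates. The sketch below makes several assumptions: the names FollowerTable and build() are invented for illustration, capitalization is ignored by lowercasing the whole text, and the final n-gram (shown as null in the table) is stored with an empty list if it has no other occurrences.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, used only to illustrate follower bookkeeping.
class FollowerTable {
    // Maps each n-gram to the characters that follow it, duplicates included.
    static Map<String, List<Character>> build(String text, int n) {
        String lower = text.toLowerCase(); // ignore capitalization
        Map<String, List<Character>> followers = new LinkedHashMap<>();
        for (int i = 0; i + n <= lower.length(); i++) {
            String gram = lower.substring(i, i + n);
            List<Character> list = followers.computeIfAbsent(gram, k -> new ArrayList<>());
            // The last n-gram in the text has no following character.
            if (i + n < lower.length()) {
                list.add(lower.charAt(i + n));
            }
        }
        return followers;
    }

    public static void main(String[] args) {
        Map<String, List<Character>> table = build("The three pirates ate their pie", 2);
        System.out.println(table.get("th")); // prints [e, r, e]
    }
}
```

Keeping duplicates in the list means the list itself encodes how often each character follows the n-gram, which is useful when generating text later.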

Suppose, for example, that you are working with 2-grams, and you have found that 80% of the time "th" is followed by "e ", 10% by "is", 7% by "at", and 3% by "es". Then, when you are generating text, after you have generated "th" you should randomly choose "e " with probability 0.8, "is" with probability 0.1, "at" with probability 0.07, and "es" with probability 0.03.
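If the followers are stored in a list that keeps every observed occurrence, then choosing a list element uniformly at random reproduces these probabilities without any extra arithmetic. The following is a minimal sketch of that idea, not a required design; WeightedPick, pick(), and exampleFollowers() are hypothetical names, and the 100-entry list simply mirrors the percentages in the example above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper class, used only to illustrate frequency-weighted choice.
class WeightedPick {
    // Picks one element uniformly from the observed list. Because duplicates
    // are kept, an item that appears k times out of m is chosen with probability k/m.
    static <T> T pick(List<T> observed, Random rng) {
        return observed.get(rng.nextInt(observed.size()));
    }

    // Builds a list matching the example: 80 x "e ", 10 x "is", 7 x "at", 3 x "es".
    static List<String> exampleFollowers() {
        List<String> observed = new ArrayList<>();
        for (int i = 0; i < 80; i++) observed.add("e ");
        for (int i = 0; i < 10; i++) observed.add("is");
        for (int i = 0; i < 7; i++) observed.add("at");
        for (int i = 0; i < 3; i++) observed.add("es");
        return observed;
    }

    public static void main(String[] args) {
        Random rng = new Random();
        // Most runs print "e ", since it makes up 80% of the list.
        System.out.println("th" + pick(exampleFollowers(), rng));
    }
}
```

An alternative is to store counts per follower and pick by cumulative weight; that uses less memory for large inputs but takes a little more code.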

RandomWriter

You are to implement a public Java class RandomWriter that provides a random writing application. Your class should have a two-argument constructor that takes:

  • String source: The name of an input file to read and analyze
  • int n: A non-negative number indicating the length of each "gram," or character sequence, to break the file into

and also a method generateText() that takes the following two parameters:

  • int length: A non-negative number of characters to generate.
  • String result: The name of the output file
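To make the shape of this interface concrete, here is one rough sketch of how the constructor and generateText() could fit together. It is deliberately named RandomWriterSketch to show that it is not the expected solution, and it rests on several assumptions the lab text does not state: the file is read as UTF-8, the text is lowercased, generation starts from a randomly chosen n-gram, a dead end (an n-gram with no recorded follower) restarts from another random n-gram, and n = 0 produces an empty output file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Illustrative sketch only; names other than generateText() are assumptions.
class RandomWriterSketch {
    private final int n;
    private final Map<String, List<Character>> followers = new HashMap<>();
    private final List<String> grams = new ArrayList<>(); // every n-gram, in order
    private final Random rng = new Random();

    RandomWriterSketch(String source, int n) throws IOException {
        this.n = n;
        String text = new String(Files.readAllBytes(Paths.get(source)),
                StandardCharsets.UTF_8).toLowerCase();
        for (int i = 0; i + n <= text.length(); i++) {
            String gram = text.substring(i, i + n);
            grams.add(gram);
            List<Character> list = followers.computeIfAbsent(gram, k -> new ArrayList<>());
            if (i + n < text.length()) {
                list.add(text.charAt(i + n));
            }
        }
    }

    void generateText(int length, String result) throws IOException {
        StringBuilder out = new StringBuilder();
        if (n > 0 && !grams.isEmpty()) {
            out.append(randomGram()); // assumed starting rule
            while (out.length() < length) {
                String current = out.substring(out.length() - n);
                List<Character> next = followers.get(current);
                if (next == null || next.isEmpty()) {
                    out.append(randomGram()); // dead end: assumed restart rule
                } else {
                    out.append(next.get(rng.nextInt(next.size())));
                }
            }
            out.setLength(length); // trim any overshoot
        }
        Files.write(Paths.get(result), out.toString().getBytes(StandardCharsets.UTF_8));
    }

    private String randomGram() {
        return grams.get(rng.nextInt(grams.size()));
    }
}
```

A call such as new RandomWriterSketch("input.txt", 2).generateText(500, "output.txt") would then write roughly 500 characters; both file names here are placeholders.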

Testing

In main(), run your code on the following files:

Generate approximately 500 characters of text for each input. Print the text in reasonable length lines, breaking only at spaces (not in the middle of a word). Do this for 1-grams, 2-grams, 4-grams, and 6-grams.
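Breaking lines only at spaces can be done with a simple greedy wrap: keep adding words to the current line until the next word would exceed the chosen width. The sketch below is one way to do this under stated simplifications; LineWrapper, wrap(), and the width parameter are hypothetical, runs of whitespace (including newlines in the generated text) are collapsed to single spaces, and a word longer than the width is left on its own line rather than split.

```java
// Hypothetical helper class, used only to illustrate wrapping at spaces.
class LineWrapper {
    // Greedily packs whitespace-separated words into lines of at most width
    // characters. A single word longer than width is kept whole on its own line.
    static String wrap(String text, int width) {
        StringBuilder out = new StringBuilder();
        int lineLength = 0;
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // happens only when text is blank
            }
            if (lineLength > 0 && lineLength + 1 + word.length() > width) {
                out.append('\n'); // word does not fit: start a new line
                lineLength = 0;
            } else if (lineLength > 0) {
                out.append(' ');
                lineLength++;
            }
            out.append(word);
            lineLength += word.length();
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints three lines: "the three", "pirates ate", "their pie"
        System.out.println(wrap("the three pirates ate their pie", 12));
    }
}
```

For the roughly 500 characters of output asked for here, a width somewhere around 60 to 80 characters gives lines of reasonable length.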

Acknowledgments

This assignment is shamelessly copied from one created by David Matuszek at the University of Pennsylvania.