Difference between revisions of "CISC220 F2023 Lab8"

From class_wiki
Jump to: navigation, search
(1. File compression with Huffman coding)
(2. Sample inputs/outputs)
 
(25 intermediate revisions by the same user not shown)
Line 3: Line 3:
 
===1. File compression with Huffman coding===
 
===1. File compression with Huffman coding===
  
As discussed in class on Nov. 2, tries and Huffman coding can be used to compress files/messages by analyzing character frequencies and choosing bitcodes accordingly.
+
As discussed in class on Nov. 2, [https://en.wikipedia.org/wiki/Trie tries] and [https://en.wikipedia.org/wiki/Huffman_coding Huffman coding] can be used to compress files/messages by analyzing character frequencies and choosing bitcodes accordingly.
  
 
A nearly-complete <tt>Huffman</tt> class is defined for you in <tt>huffman.hh</tt> ([https://drive.google.com/file/d/1UXuhd4Xq6yypZvWZWaoGTrTEMZLOFdyC/view?usp=sharing code here]).  It uses a <tt>TrieNode</tt> class that stores parent and child links as well as the character being encoded and its frequency in the file
 
A nearly-complete <tt>Huffman</tt> class is defined for you in <tt>huffman.hh</tt> ([https://drive.google.com/file/d/1UXuhd4Xq6yypZvWZWaoGTrTEMZLOFdyC/view?usp=sharing code here]).  It uses a <tt>TrieNode</tt> class that stores parent and child links as well as the character being encoded and its frequency in the file
Line 15: Line 15:
 
** Writes <tt>decompression_map</tt> as a header to the output file
 
** Writes <tt>decompression_map</tt> as a header to the output file
 
** Re-reads input file character by character and, using <tt>compression_map</tt>, writes encoded versions to the output file body (in ASCII or binary depending on <tt>do_binary</tt> flag)
 
** Re-reads input file character by character and, using <tt>compression_map</tt>, writes encoded versions to the output file body (in ASCII or binary depending on <tt>do_binary</tt> flag)
* <tt>decompress</tt>  
+
* <tt>decompress()</tt>  
 
** Reads the code look-up table from header of input file to <tt>decompression_map</tt>
 
** Reads the code look-up table from header of input file to <tt>decompression_map</tt>
 
** Reads rest of input file bitstream, gets character for each variable-length code read using <tt>decompression_map</tt>, and writes decoded versions to output file.  The ASCII part of this function, <tt>ascii_decompress_body()</tt> is TO BE WRITTEN BY YOU
 
** Reads rest of input file bitstream, gets character for each variable-length code read using <tt>decompression_map</tt>, and writes decoded versions to output file.  The ASCII part of this function, <tt>ascii_decompress_body()</tt> is TO BE WRITTEN BY YOU
Line 21: Line 21:
 
The "forest of tries" discussed in lecture is implemented in the <tt>Huffman</tt> class with an [https://en.cppreference.com/w/cpp/container/priority_queue STL priority queue] of <tt>TrieNode</tt> pointers (each representing the root of a subtrie), ordered on the combined frequency of each subtrie.  This priority queue is called <tt>trie_pq</tt>.
 
The "forest of tries" discussed in lecture is implemented in the <tt>Huffman</tt> class with an [https://en.cppreference.com/w/cpp/container/priority_queue STL priority queue] of <tt>TrieNode</tt> pointers (each representing the root of a subtrie), ordered on the combined frequency of each subtrie.  This priority queue is called <tt>trie_pq</tt>.
  
Several major functions are unfinished.  In order to separate them from the finished ones, they are located in main.cpp -- this
+
The compiled executable is run as follows: '''<tt>huff <optional flags> <path to input file></tt>'''
is where you will be writing code.
 
  
The Huffman member functions to finish in main.cpp:
+
If the input filename has the suffix <tt>.huf</tt>, it will be ''decompressed'' to a plain-text file with an identical name except the suffix is changed to <tt>.HUF</tt>.
  
* '''[2 points]''' ''merge_two_least_frequent_subtries()'' : This function should do the following (there are further hints in the comments):
+
If the input filename has any other suffix, it will be ''compressed'' to a file with the suffix <tt>.huf</tt>.
** (1) remove trie in PQ with SMALLEST frequency, assign to first
 
** (2) remove trie in PQ with NEXT SMALLEST frequency, assign to second
 
** (3) make new trie node (new_root) and plug first and second tries in as children, set new_root's frequency to first->frequency + second->frequency, and insert in new_root in PQ
 
* '''[1 point]''' ''compute_all_codes_from_trie(TrieNode *T)'' :  The function should recursively traverse the binary tree (trie) rooted at T to determine the huffcode string at every leaf. If T is a leaf, make sure to execute the compression_map and decompression_map assignments
 
* '''[1 point]''' ''calculate_huffman_file_size()'' : Return how many bits Huffman-coded version of the file takes. This is the sum of the frequency * huffcode length over every leaf
 
  
===2. Testing [1 point] ===
+
The default compressor creates a binary file as output and the default decompressor expects a binary file as input.  Feel free to compare the sizes of compressed files to the originals. However, for grading, compressed output will be written in ASCII and compressed input will also be expected to be ASCII.  This is indicated by the optional flag <tt>-ascii</tt>.
  
* [http://nameless.cis.udel.edu/class_data/220_f2014/data/cleaned_doi.txt DOI] (about 1K total words), [http://nameless.cis.udel.edu/class_data/220_f2014/data/cleaned_bts.txt BTS] (~14K words), and [http://nameless.cis.udel.edu/class_data/220_f2014/data/cleaned_greatexp.txt GE] (~185K words) require 8588 bits, 83964 bits, and 1013761 bytes, respectively according to the Unix ''ls'' command
+
The output files will be echoed to the terminal for grading for both compression (to test your <tt>build_optimal_trie()</tt> + <tt>compute_all_codes_from_trie()</tt> combination) and decompression (to test your <tt>ascii_decompress_body()</tt> in isolation). In order to test <tt>build_optimal_trie()</tt> alone, another optional flag <tt>-trie</tt> will print only a representation of the built trie and not echo the output file.
* How many *bytes* are required for each of these files after compression when you set the ''-debug'' flag in (a) "custom" form (shortest fixed-length code) and (b) using the variable-length Huffman code that your completed class generates?
 
  
===3. Submission===
+
===2. Sample inputs/outputs===
  
* Make a '''PDF file''' <Your Name>_Lab9_README.pdf with your answers to the testing questions above
+
[https://drive.google.com/file/d/1esJTzIJFqqVSP8QzfLqkvGwP3DpQoxLK/view?usp=sharing This tarfile] contains the following files:
* Rename your code directory <Your Last Name>_Lab9 and create a single tar/zip/rar file out of it named <Your Last Name>_Lab9.tar (or .zip or .rar, etc.).
+
* <tt>seashells.txt</tt> (20 total chars because there is a newline, 7 different chars)
* Submit it in Canvas by ''midnight at the end of Tuesday, November 16''
+
* <tt>seashells.txt.huf</tt>
 +
* <tt>output1</tt>, <tt>output2</tt>, <tt>output3</tt> (these are what is written to the terminal, not what is saved in a file)
 +
 
 +
<tt>output1</tt> results from running <tt>./huff -ascii seashells.txt</tt>
 +
 
 +
<tt>output2</tt> results from running <tt>./huff -ascii -trie seashells.txt</tt>
 +
 
 +
<tt>output3</tt> results from running <tt>./huff -ascii seashells.txt.huf</tt>
 +
 
 +
===3. Programming tasks===
 +
 
 +
The <tt>Huffman</tt> member functions to finish are in <tt>main.cpp</tt> -- don't make any changes in <tt>huffman.hh</tt>:
 +
 
 +
* '''[1 point]''' <tt>Huffman::ascii_decompress_body()</tt> : This function should read one "bit" (a '0' or '1' char) at a time from the input stream until a legal code is assembled, then write the corresponding decoded character to the output stream, and repeat until reaching the end of file
 +
* '''[2 points]''' <tt>Huffman::build_optimal_trie()</tt> : This function should create a forest of one-node tries (aka <tt>trie_pq</tt>) representing the characters in the input file and their frequencies, then repeatedly combine the two smallest tries until only one is left according to the following instructions:
 +
** 1. Remove trie in <tt>trie_pq</tt> with SMALLEST frequency, assign to ''first'', remove trie in PQ with NEXT SMALLEST frequency, assign to ''second''
 +
** 2. Create new <tt>TrieNode</tt> and make ''first'' and ''second'' its left and right children, respectively. Set new node's frequency to the sum of ''first'' and ''second'' 's frequencies, and then put the whole thing back into the priority queue
 +
* '''[2 points]''' <tt>Huffman::compute_all_codes_from_trie()</tt> :  The function should recursively traverse the trie rooted at <tt>T</tt> to determine the <tt>huffcode</tt> string at every leaf and initialize <tt>compression_map</tt> and <tt>decompression_map</tt> with this information
 +
 
 +
===4. Submission===
 +
 
 +
Once again, you may work with a (human) partner on this lab.
 +
 
 +
Submit 2 files to Gradescope: (1) your README and (2) your modified <tt>main.cpp</tt>. The README should contain your name and your partner's name, notes on any limitations or issues with your code, and your interpretation of any ambiguities in the assignment instructions. Your <tt>main.cpp</tt> should also contain your name(s)

Latest revision as of 12:42, 3 November 2023

Lab #8

1. File compression with Huffman coding

As discussed in class on Nov. 2, tries and Huffman coding can be used to compress files/messages by analyzing character frequencies and choosing bitcodes accordingly.

A nearly-complete Huffman class is defined for you in huffman.hh (code here). It uses a TrieNode class that stores parent and child links as well as the character being encoded and its frequency in the file being compressed.

The main work of Huffman happens in the following functions:

  • compress()
    • Reads input file character by character and counts occurrences (aka computes frequency) for each, storing the results by ASCII index in the char_frequency vector
    • Applies the Huffman trie-building algorithm presented in class 11/2 in build_optimal_trie(). This function is TO BE WRITTEN BY YOU
    • Creates code look-up tables (string to char decompression_map and char to string compression_map) from the optimal trie in compute_all_codes_from_trie(). This function is TO BE WRITTEN BY YOU
    • Writes decompression_map as a header to the output file
    • Re-reads input file character by character and, using compression_map, writes encoded versions to the output file body (in ASCII or binary depending on do_binary flag)
  • decompress()
    • Reads the code look-up table from header of input file to decompression_map
    • Reads rest of input file bitstream, gets character for each variable-length code read using decompression_map, and writes decoded versions to output file. The ASCII part of this function, ascii_decompress_body() is TO BE WRITTEN BY YOU

The "forest of tries" discussed in lecture is implemented in the Huffman class with an STL priority queue of TrieNode pointers (each representing the root of a subtrie), ordered on the combined frequency of each subtrie. This priority queue is called trie_pq.

The compiled executable is run as follows: huff <optional flags> <path to input file>

If the input filename has the suffix .huf, it will be decompressed to a plain-text file with an identical name except the suffix is changed to .HUF.

If the input filename has any other suffix, it will be compressed to a file with the suffix .huf.

The default compressor creates a binary file as output and the default decompressor expects a binary file as input. Feel free to compare the sizes of compressed files to the originals. However, for grading, compressed output will be written in ASCII and compressed input will also be expected to be ASCII. This is indicated by the optional flag -ascii.

The output files will be echoed to the terminal for grading for both compression (to test your build_optimal_trie() + compute_all_codes_from_trie() combination) and decompression (to test your ascii_decompress_body() in isolation). In order to test build_optimal_trie() alone, another optional flag -trie will print only a representation of the built trie and not echo the output file.

2. Sample inputs/outputs

This tarfile contains the following files:

  • seashells.txt (20 total chars because there is a newline, 7 different chars)
  • seashells.txt.huf
  • output1, output2, output3 (these are what is written to the terminal, not what is saved in a file)

output1 results from running ./huff -ascii seashells.txt

output2 results from running ./huff -ascii -trie seashells.txt

output3 results from running ./huff -ascii seashells.txt.huf

3. Programming tasks

The Huffman member functions to finish are in main.cpp -- don't make any changes in huffman.hh:

  • [1 point] Huffman::ascii_decompress_body() : This function should read one "bit" (a '0' or '1' char) at a time from the input stream until a legal code is assembled, then write the corresponding decoded character to the output stream, and repeat until reaching the end of file
  • [2 points] Huffman::build_optimal_trie() : This function should create a forest of one-node tries (aka trie_pq) representing the characters in the input file and their frequencies, then repeatedly combine the two smallest tries until only one is left according to the following instructions:
    • 1. Remove trie in trie_pq with SMALLEST frequency, assign to first, remove trie in PQ with NEXT SMALLEST frequency, assign to second
    • 2. Create new TrieNode and make first and second its left and right children, respectively. Set new node's frequency to the sum of first and second 's frequencies, and then put the whole thing back into the priority queue
  • [2 points] Huffman::compute_all_codes_from_trie() : The function should recursively traverse the trie rooted at T to determine the huffcode string at every leaf and initialize compression_map and decompression_map with this information

4. Submission

Once again, you may work with a (human) partner on this lab.

Submit 2 files to Gradescope: (1) your README and (2) your modified main.cpp. The README should contain your name and your partner's name, notes on any limitations or issues with your code, and your interpretation of any ambiguities in the assignment instructions. Your main.cpp should also contain your name(s)