Difference between revisions of "CISC220 F2023 Lab8"

From class_wiki
Jump to: navigation, search
(Created page with "==Lab #8== ===1. File compression with Huffman coding=== As discussed in class on Nov. 2, tries and Huffman coding can be used to compress files/messages by analyzing charac...")
 
(1. File compression with Huffman coding)
Line 5: Line 5:
 
As discussed in class on Nov. 2, tries and Huffman coding can be used to compress files/messages by analyzing character frequencies and choosing bitcodes accordingly.
 
As discussed in class on Nov. 2, tries and Huffman coding can be used to compress files/messages by analyzing character frequencies and choosing bitcodes accordingly.
  
A nearly-complete <tt>Huffman</tt> class is provided in starter code [https://drive.google.com/file/d/1UXuhd4Xq6yypZvWZWaoGTrTEMZLOFdyC/view?usp=sharing here].  It uses a <tt>TrieNode</tt> class that stores parent and child links as well as the character being encoded and its frequency in the file
+
A nearly-complete <tt>Huffman</tt> class is defined in <tt>huffman.hh</tt> [https://drive.google.com/file/d/1UXuhd4Xq6yypZvWZWaoGTrTEMZLOFdyC/view?usp=sharing code here].  It uses a <tt>TrieNode</tt> class that stores parent and child links as well as the character being encoded and its frequency in the file
 
being compressed.
 
being compressed.
  
The main work of ''Huffman'' happens in the following functions:
+
The main work of <tt>Huffman</tt> happens in the following functions:
* ''compress(in_filename, out_filename, do_binary)''
+
* <tt>compress()</tt>
** Reads in_filename character by character and counts occurrences (aka computes frequency) for each, storing the result by ASCII index in the ''char_counter'' vector
+
** Reads input file character by character and counts occurrences (aka computes frequency) for each, storing the results by ASCII index in the <tt>char_frequency</tt> vector
** Applies the Huffman trie-building algorithm presented in class 11/9 in ''build_optimal_trie()''.  This function is UNFINISHED, as it calls ''merge_two_least_frequent_subtries()'' which you must write
+
** Applies the Huffman trie-building algorithm presented in class 11/2 in <tt>build_optimal_trie()</tt>.  This function is TO BE WRITTEN BY YOU
** Writes code table (the decompression_map) to out_filename
+
** Writes code table (the <tt>string</tt> to <tt>char</tt> <tt>decompression_map</tt>) as a header to the output file
** Re-reads in_filename character by character and writes encoded versions to out_filename (in ASCII or binary depending on do_binary)
+
** Re-reads input file character by character and writes encoded versions to the output file body (in ASCII or binary depending on <tt>do_binary</tt> flag)
 
* ''decompress(in_filename, out_filename, do_binary)''  
 
* ''decompress(in_filename, out_filename, do_binary)''  
 
** Reads the code table from in_filename
 
** Reads the code table from in_filename

Revision as of 22:02, 2 November 2023

Lab #8

1. File compression with Huffman coding

As discussed in class on Nov. 2, tries and Huffman coding can be used to compress files/messages by analyzing character frequencies and choosing bitcodes accordingly.

A nearly-complete Huffman class is defined in huffman.hh code here. It uses a TrieNode class that stores parent and child links as well as the character being encoded and its frequency in the file being compressed.

The main work of Huffman happens in the following functions:

  • compress()
    • Reads input file character by character and counts occurrences (aka computes frequency) for each, storing the results by ASCII index in the char_frequency vector
    • Applies the Huffman trie-building algorithm presented in class 11/2 in build_optimal_trie(). This function is TO BE WRITTEN BY YOU
    • Writes code table (the string to char decompression_map) as a header to the output file
    • Re-reads input file character by character and writes encoded versions to the output file body (in ASCII or binary depending on do_binary flag)
  • decompress(in_filename, out_filename, do_binary)
    • Reads the code table from in_filename
    • Reads in_filename character by character and writes decoded versions to out_filename

The "forest of tries" discussed in lecture on 11/9 is implemented with an STL priority queue of TrieNode pointers (each representing the root of a subtrie), ordered on the combined frequency of each subtrie.

Several major functions are unfinished. In order to separate them from the finished ones, they are located in main.cpp -- this is where you will be writing code.

The Huffman member functions to finish in main.cpp:

  • [2 points] merge_two_least_frequent_subtries() : This function should do the following (there are further hints in the comments):
    • (1) remove trie in PQ with SMALLEST frequency, assign to first
    • (2) remove trie in PQ with NEXT SMALLEST frequency, assign to second
    • (3) make new trie node (new_root) and plug first and second tries in as children, set new_root's frequency to first->frequency + second->frequency, and insert in new_root in PQ
  • [1 point] compute_all_codes_from_trie(TrieNode *T) : The function should recursively traverse the binary tree (trie) rooted at T to determine the huffcode string at every leaf. If T is a leaf, make sure to execute the compression_map and decompression_map assignments
  • [1 point] calculate_huffman_file_size() : Return how many bits Huffman-coded version of the file takes. This is the sum of the frequency * huffcode length over every leaf

2. Testing [1 point]

  • DOI (about 1K total words), BTS (~14K words), and GE (~185K words) require 8588 bits, 83964 bits, and 1013761 bytes, respectively according to the Unix ls command
  • How many *bytes* are required for each of these files after compression when you set the -debug flag in (a) "custom" form (shortest fixed-length code) and (b) using the variable-length Huffman code that your completed class generates?

3. Submission

  • Make a PDF file <Your Name>_Lab9_README.pdf with your answers to the testing questions above
  • Rename your code directory <Your Last Name>_Lab9 and create a single tar/zip/rar file out of it named <Your Last Name>_Lab9.tar (or .zip or .rar, etc.).
  • Submit it in Canvas by midnight at the end of Tuesday, November 16