buarki
FollowSite Reliability Engineer, Software Engineer, coffee addicted, traveler
Building A File Compressor Using C And Wasm
February 17, 2024
6984 views
Share it
Checking Wasm Tools For C
In a previous article I did a small experiment with Go and Wasm to check the state of tools available for it in 2024. During such endeavour I find a tool called Emscripten that drawn my attention and made my mind to do one experiment, now using C.
Emscripten is an Open Source compiler toolchain to WebAssembly. As its docs says, practically any portable C or C++ codebase can be compiled into Wasm using it.
With the tool to use defined next step was defining a problem to approach.
Problem To Solve: Compress Files Using Huffman Algorithm
I must be honest, C is boo of mine <3! Due to that, I really wanted to build something real that could make me do some bit fiddling, and I decided to implement a very interesting compression algorithm called Huffman Coding.
Huffman Coding was created by David A. Huffman back in 1952. It is a lossless compression method, which means that all bits present on original file will be carried to the decompressed one. The basic idea of the algorithm is to collect all the symbols present on a file besides its frequency to create a binary tree in which all symbols will be leaves and the path from the tree root to the leaf should be used to represent the symbol. I know that at first glance it looks like rocket science, but trust me, it is far away from that and in this one we will see it in baby steps :)
This project was something really nice doing and I hope you enjoy following up.
Non-Goals Of This Project And Its Limitations
This project was not intended to be a production ready file compressor like WinRAR. It also was not intended to support files bigger than 150MB to avoid memory issues on user browser. For sure it can be adjusted, but I guess this size is fine for a hobby project :)
How Can Use The Program?
To use the program, zipper as I named it, just access it, and in case you want to check the details, the full code it is available on my Github.
Explaining Huffman Coding In Baby Steps
There's no better way to understand something rather than a practical example, so let's apply the algorithm with the following text: coding is fun and fun is coding.
First step: collect all the symbols besides its frequencies. By doing so we get:
Second step: sort collected symbols by its frequencies. The output in ascending order from top to the bottom is:
Third step: create a binary tree from this sorted list. Do so by removing the two least frequenct symbols and creating a node tree with them, in which the less frequent symbol goes to the left and the most frequent to the right. And also, the "frequency" of this new node is the sum of the subtree frequencies. Put this new created node back into that list and repeat this process until the list have only one item. Keep calm! It seems more complex than the portuguese grammar but trust me, it is not. Let's just execute this algorithm and you'll agree with me.
It says, "remove the two least frequent symbols". The two least frequent symbols are a and c, so let's remove them.
Then it says "create a tree node where the least frequent symbol goes to the left and most frequent to the right. And the frequency of this new node is the sum of the subtrees". In order to have a way to distinguish the regular file symbols to this "joining" symbol let's adopt the @ as the joining symbol. The result of such process is this one:
Once the new node is built we must put it back to the list. As the list is sorted we should keep it sorted while adding it. The list items, arranged horizontally to help you visualize it, looks like this now:
Now we just repeat the process, until only one item remains on the list. The next two items to process are f and g. The new tree node created from them is this one:
Adding it to the list we get:
The next two items to process are o and o. The new tree node created from them is this one:
Adding it to the list we get:
The next two items to process are u and the joining symbol @:3. The new tree node created from them is this one:
Adding it to the list we get:
The next two items to process are d and the first joining symbol @:4. The new tree node created from them is this one:
Adding it to the list we get:
The next two items to process are @:4 and the joining symbol i:4. The new tree node created from them is this one:
Adding it to the list we get:
The next two items to process are @:5 and n:5. The new tree node created from them is this one:
Adding it to the list we get:
The next two items to process are _space_:6 and @:7. The new tree node created from them is this one:
Adding it to the list we get:
The next two items to process are @:8 and @:10. The new tree node created from them is this one:
Adding it to the list we get:
The last two items to process are @:10 and @:18. The new tree node created is the complete Huffman tree:
Fourth step: find the corresponding symbol code by walking the tree from the root to the leaf and use 0 to mark a jump to the left and 1 to the right. Adding the 0s and 1s we get this:
Thus, the corresponding codes are:
It worth noticing that that the higher the frequency of a symbol, the smaller the corresponding code length is. Just compare the length of _space_ and a.
Fifth step: Replace the symbol for its corresponding code. If we rewrite the present words using found codes we get:
Thus, the full text coding is fun and fun is coding is: 11011 0110 010 101 111 1001 00 101 0111 00 1000 1100 111 00 11010 111 010 00 1000 1100 111 00 101 0111 00 1011 0110 010 101 111 1001.
As ASCII characters uses 1 byte, the original text coding is fun and fun is coding requires 31 (length of text) x 1 byte = 31 bytes to be stored. On the other hand, the compressed version requires only 12 bytes and 6 bits of one extra byte, so in total 13 bytes. To visualize it, just group the 0s and 1s in chunks of 8 (because 1 byte is a group of 8 bits): 11011011 00101011 11100100 10101110 01000110 01110011 01011101 00010001 10011100 10101110 01011011 00101011 111001.
With simple math we can see that the compressed content is almost 58% smaller than the original one! This simple example shows how powerful Huffman Coding is.
Once you became a Huffman Coding master we can proceed to next step :)
The (Funny) C Part Of Compression
In order to elaborate the details of the compression part we must define the Compression API written in C that will be used on the JavaScript layer through Wasm. The compression header file can be seen here and as we can see it expects the full file content to be passed alongside with its size in bytes. Such API imposes some limitations, like the size of the file we can handle, because it would not be feasible loading a 2GB file into an unsigned char buffer, but that's totally ok for this project as it was not intending to be a production ready compressor like WinRAR ;)
That said, we can go on with compression details. As shown above, we start by collecting the symbols with its frequencies, and the implementation of it can be found here. The idea is creating an array with 256 buckets, each one representing the frequency of the corresponding ASCII symbol in the file. Thus, the frequency of symbol 'a' would be stored at index 97. It worth mentioning it uses calloc to create such array, to ensure that all buckets have zero values once created, otherwise we could have some "trash" on such buckets, which could lead to bugs during this stage.
With the frequencies of symbols collected, we can now sort them. To do so, I decided to use a Min Heap data structure as it is easy to find the lower value present inside of it. As we are always getting the two lowest values, using such data structure seems a good design decision. I implemented the Min Heap used on this project, and you can find the Abstract Data Type (ADT) of it here.
Creating the Min Heap from the array with symbols frequency is a straightforward process and the implementation can be seen here.
Once we have the Min Heap ready, we can now build the Huffman tree. You can find the implementation of such process here and also the Huffman Tree ADT.
From the Huffman Coding execution we did above, the last step missing is building the Huffman table with the codes. You can find the implementation of it here. The table itself is a 256x256 2D array of type unsigned char, where the table[i] represents the code of the a symbol present in the file. For instance, if the symbol 'a' is present and its code turns out to be 1101 it means that the table[97] will carry the string '1', '1', '0', '1', '\0'.
All above points is all we need to implement Huffman Coding, but that's not all yet. Because in order to decompress the compressed file, the sequence of 0s and 1s, we need the tree. Thus, to give the compressed file to someone expecting to decompress it, the tree must be sent together, and it introduces a nice problem to solve.
The Architecture Of The Compressed File
Let's assume that with the tree and the codes table on hand we get the 0s and 1s of the compressed file. Now we want to give the compressed content to a friend besides the tree to it knows how to decompress the file. How should we pass the tree? With The symbols traversed in some order? maybe postorder? And should we pass the tree content before or after the 1s and 0s?
There's no right or wrong answer for above questions, but the one picked must be very well implemented. For this project, I decided to build the "file that we share" with the following structure: the contents of the tree traversed in preorder and right after it the 0s and 1s.
Traversing a tree in preorder means, in simple terms, that you'll collect the symbol of a node, look for the left subtree and then to the right subtree. A concreate example of it, using the tree we created for the text coding is fun and fun is coding is: @@_space_@d@os@@@fgi@@u@acn.
That said, the "file we share" with the compressed file "would be": @@_space_@d@os@@@fgi@@u@acn 11011011 00101011 11100100 10101110 01000110 01110011 01011101 00010001 10011100 10101110 01011011 00101011 111001. Ok, now we defined the basic layout of the file that is sent.
But how the one that has received the compressed file would know where the content of tree ends? Different files would have different trees and consequently different series of 0s and 1s, so we need a strategy to inform to the receiver where the tree content and the 0s and 1s starts and ends.
In order to the receiver knows how to properly find the tree and the compressed content we need to add a header to the sent file.You can think about the header as some special bytes at the beginning of the file that will provide some information about it. For this project, it must inform is how to properly find the content of the tree and the compressed files separately. A header must be well planned because it inevitably increases the final size of the file shared. Let's see how we built the header for this project by addressing the tree content size first.
Using above example of file that could be shared, the one attempting to decompress it must be able to find properly @@_space_@d@os@@@fgi@@u@acn as the content of the tree. In such example we see that the tree content uses 27 bytes. It's important pointing that by tree size I mean the amount of nodes on the tree, not the height.
Counting the number of nodes on the tree is a rather easy task, and the implementation can be seen here. As we know that the Huffman tree is a binary tree, if a tree completely uses all 256 ASCII symbols, we can say it would have (2 * 256) - 1 = 511 nodes. But for this project this equation requires one small adjustment. If you remember the execution we did above, you'll notice that for this project we used the symbol @ as the "joining symbol". The problem is that a file can also have this symbol in its content, due to that we need a way to distinguish when we are dealing the the @ of the joining symbol and the @ of the file per se.
The strategy adopted here was using the symbol \ as "escaping". But the same problem goes for it, so we need to use \ not only for the @ but also for the \. The implication of it is that, if @ or \ are present we would require 2 additional bytes, or "fake nodes", to comport it. Thus, for the worst case scenario, the maximum number of nodes on the tree would be ((2 * leaves) - 1) + 2, thus, 513 nodes. Such number in binary is 1000000001, which means it requires 1 byte and 2 additional bits to be stored.
So far, we know that we need two additional bytes on the "compressed file" to represent the size of tree content (two because 10 bits is bigger than 8 bits, one byte, so we need one extra full byte to comport it). But, even if we put the information of tree size in the header there is still one more problem to be addressed: how to know how many bits the compressed part exactly uses?
If we look again to the content of compressed file, 11011011 00101011 11100100 10101110 01000110 01110011 01011101 00010001 10011100 10101110 01011011 00101011 111001, we can see it uses 12 complete bytes with one extra byte that uses only 6 bits. Thus, even if we know where the content of the compressed file starts, the number of bytes required by the tree + 1, we must know exactly how many bits do we need to look for, especially on the last byte to skip the padding bits.
Padding bits are nothing but bits used to "complete" the byte. It happens because a byte is a sequence of 8 bits, even if we use less. Check bellow image to visualize it:
The good part is that calculating how many padding bits we would have is easy. To do so, we need to first compute the "length" of the compressed content, in other other words, the length of 0s and 1s. This is done by summing all the products (symbol frequency * respective code length) and you can see the implementation here. With the length of compressed content on hand, we just need to check the remainder of it with 8 bits to know how many bits are being used, and then, we check how many bits are missing to complete 8 bits, in other others: 8 - (compressed content length % 8). You can see the the implementation here.The text coding is fun and fun is coding has 102 bits length, so 8 - (102 % 8) = 2 padding bits :)
Unless that the last byte is fully used, the max number of padding bits will be 7, and it can be represented using 3 bits: 111.
Thus, we got a puzzle to solve: we have two information to store in the header, the tree content size, which requires at most 10 bits, and the number of padding bits, that requires at most 3 bits, resulting in 13 bits for the header, and consequently, 2 bytes. Due to the importance of the header in this project I decided to define an ADT for it to highlight its API. The solution adopted for the header puzzle was: the 3 first bits of the header first byte will store the padding bits, and the last 2 bits of first byte besides the entire second byte will store the tree content size. For the text coding is fun and fun is coding the padding is 2, thus 10 in binary, and the tree size is 27, or 11011. Bellow image illustrates how the header bits are used:
Building the header like above is pure bit fiddling, and is done like this: we create a byte where the 3 bits of padding are placed at the most significant bits by doing a left shift of 5 bits. If the padding for the text example is 2 (2 in 8 bits is 00000010), 2 << 5 will gives us 01000000. Then, we collect the most significant bytes of tree content size. In our example, the size is 27, or 0000000000011011 in binary using 2 bytes. To collect the most significant bytes we do a right shift of 8 bits like 0000000000011011 >> 8 and we get 00000000. We also need to collect the least significant bits of tree size, we do it executing a bitwise AND with the tree size and a full set byte: 0000000000011011 & 11111111 and we get 00011011. The last step is creating a 2 bytes sized buffer to carry the header, where the first byte will have the result of the byte with the 3 bits of padding at the most significant bits bitwise OR the most significant bits of tree size, in our case 01000000 | 00000000 = 01000000. And the second byte will just receive the least significant bits of the tree size, which is 00011011. The final result is 0100000000011011. You can check the implementation of this step here.
And we finally have the three parts of the file:
The full compression implementation can be seen here.
Decompressing And Reconstructing The Original File
I must say that the decompression is way easier than compression, and it also requires an API, similar to the compression. The process kicks off identifying the metada we added during compression (padding bits and tree size) to know where to look for the content of tree and compressed file. You can check the implementation here.
Once we know how many bytes are used to store the tree content we can properly find the beginning of the tree and compressed file. If the bytes of the "shared file" are in the buffer compressedContent, then the tree content is placed right after the header, thus it can be found at &compressedContent[HEADER_REQUIRED_BYTES], while the compressed content is right after the tree, thus &compressedContent[HEADER_REQUIRED_BYTES + treeContentSize]. It worth pointing that by doing so we are not using extra memory spaces, just pointers pointing to the proper place.
By knowing where to look for things, we start rebuilding the tree from its content that was attached to the file in preorder. The implementation can be seen here and it worth noticing that it handles the special symbols we have in this project, the "@" and the \.
With the tree rebuilt, we must calculate how many bytes we need for the decompressed file (the original one). This is needed to do because I decided to not include the original file size on the header, because it'd increase the total "shareable" file size. The implementation was a little bit tricky as you can see but it seems to work :)
Last, but certainly not least, we effectively fill the buffer of the "decompressed" file, in other words, we rebuild the original file that was compressed. You can check the implementation here. And we are almost done with the C part.
Building The Wasm Binary
In order to create a binary able to run the compression and decompression Emscripten was used. It was done by providing a main.c file exposing the compression and decompression API. And to build it we can use the tool emcc as implemented in the project Makefile.
Something very important that must be mentioned is that Wasm only understand numbers and pointers, due to that I broke the compression and decompression in the main.c file in two steps: one to execute the operation and return the number of bytes required by the created compressed or decompressed file, and other one to just collect the compressed or decompressed file content.
To understand it, let's check the compression flow from the JavaScript to C. The compress() function (in the app.js file) is placing the content of the uploaded file into a Uint8Array with length equals to the number of bytes the uploaded file has. It then passes such array alongside its size to the c_compress function via the built in ccall function and getting as output the number of bytes that the compressed file (more precisely, the shareable file, with header, tree, 0s and 1s) requires. Then, it uses this number of bytes to allocate a buffer with this size in the heap. As final step, the created buffer besides the number of bytes required for the compressed size are passed to function receiveCompressedContent, also via ccall, where each byte of the compressed file is copied to it. Once the very content gets copied, we do some pointer math to extract it on the JavaScript layer.
As mentioned, the content of compressed file is copied to a buffer allocated in the Heap. Emscripten allows us to access the heap as a typed array by providing some objects, as we are dealing with chunks of 8 bits here, we should use HEAPU8 (unsigned 8 bits). Such object provides access to the heap through the attribute buffer. Thus, if we have access to the heap, we know where the sequence of bytes of the compressed file is (the address of created buffer) and we know the size of this sequence (how many bytes the compressed file requires) we can extract such bytes creating a "slice" of the Heap from "the buffer address" until the "compressed size bytes". If you didn't get it, keep calm and check bellow example.
Consider the array unsigned char buffer[] = {'a', 'b', 'c'}. The content of this array is placed at 3 sequential bytes. If you know C a little bit you know that buffer is just a pointer to some address. Assume that for instance such address is 99. Then, we can say that buffer points to 99. As it is a contiguous block of memory, we can access the neighboring byte by doing buffer+ 1, and we could get (in this example, for sure) 100. And buffer + 2 would gives us 101. Thus, if we collect the content stored at the address 99, 100 and 101 we would properly collect 'a', 'b' and 'c'. Actually, in C notation, *buffer, or *(buffer + 0), is equal to 'a', *(buffer + 1) is equal to 'b', and *(buffer + 2) is equal to 'c'. And in case you didn't know, yes, buffer[2] is just a syntactic sugar for *(buffer + 2) :)
Once we collected all the bytes of the compressed file to a JavaScript variable/object, we must deallocate the pointers used to interact with the C binary (the one to store the number of bytes needed and the buffer to store the file bytes). This is done by simply applying them to the free() function we exported from C during the compilation.
With the pointers deallocated, the JavaScript promise is resolved returning an object carrying the size of original file, the size of compressed file and the bytes of the compressed file that will be downloaded to the user's device. And the same process goes for the decompression.
Takeaways And My Conclusion
First of all, I just loved it all!
I found the Emscripten tool + documentation nice, actually, I found it more practical to "check and use" than the resources I've found for Go.
Something worth pointing is that sharing content from the C layer to the JavaScript using the Heap might be a bottleneck depending on the size of the object you are passing. In this project, for instance, to handle such "processing times" I added a spinner during compression and decompression. It doesn't seems to be a dealbreaker though, because depending on the problem we can use different approaches, like processing the hard tasks using service workers.