by Lord Julus / 29A
January 2000
³ Foreword ³
Hello people, and welcome to the first breath of the much talked about
Y2K… It looks to me like everything worked out just fine, so I decided to
move my butt and start writing something again, after quite a while.
You might consider this article a little off-topic for the virus
writing field, but actually things are not like that at all. The rise of
Win32 programming and the few ways of stealthing code within the
portable executable generated a quest for inserting code into the PE
without modifying the size of the file. This was done by many authors in
many ways, like overwriting a useless section or hiding in the slack space
which remains after rounding sections up to the file alignment. Others,
however, chose to use a compression algorithm to shrink a portion of the
file and insert into the remaining space. This was done for example by
JackyQwerty in the Redemption virus with his JQCoding algorithm, or by
Vecna in Babylonia. The examples are probably many, but this was just to
show you the link…
Now, there exist many compression algorithms in the world, some
better than others. A few of them rose to be recognized by everybody as
the best ones: the Lempel-Ziv algorithms (also known as LZ), the
Shannon-Fano algorithm, Run-Length Encoding, and the Huffman algorithm.
It is very hard to say which is the best, because that depends very much
on the raw data being compressed. However, as a starter I decided to focus
on the Huffman algorithm. Unfortunately it is not quite the best of them,
as it gives lower compression ratios than most of the LZ algorithms. But
it is still a very interesting algorithm and it deserves all attention.
So, in this article I will present the basic theory behind the
Huffman algorithm with examples, and then show you how I implemented this
method in one of my latest releases: Lord Julus' Huffman Compression
Engine (LJHCE32).
³ Theory ³
The Huffman algorithm is based on the following idea: encode the
characters (which are normally stored on 8 bits) on fewer or more bits, so
that on average the total number of bits will be smaller. So, this
algorithm, unlike others, doesn't work on strings, but on single
characters, and this is why its results are not very good.
The first idea anyone would have is to encode each of the characters
that appear in the input buffer with the fewest bits possible. Let's
imagine such an encoding:
A = 1
B = 0
C = 01
D = 10
E = 11, etc…
However, this is a non-decodable code, because a sequence like
111101 can be decoded as AAAAC, or EEC, or EAABA, etc… So, a special
method must be used to obtain a properly decodable code. Actually the
real problem does not come from the way the codes look, but from the fact
that their boundaries are not uniquely defined; this would require a mark
to be set between them so that at decompression you can tell which code
is which. Of course, using a separator between the codes is out of the
question. Another method must be found.
Here is the first example that comes into my mind of unique codes:
A = 1
B = 01
C = 001
D = 0001
If you want to encode the string ABCDAABC you get:
10100100011101001
If you read bit by bit, you can strip the bits off as they match a
combination in the definition table above… Of course, this is just a
simple example, but it shows one of the important properties of Huffman
codes: they are uniquely decodable when read from left to right. The idea
is to take the encoded stream bit by bit and match it against the codes
in the definition table. Once you have matched a code you can be
absolutely sure it is the right one, because no code appears as the
prefix of another (the so-called prefix property).
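To make the prefix property concrete, here is a tiny C sketch that
checks whether a set of codes is prefix-free. This is illustrative only,
not part of LJHCE32; the is_prefix_free name and table layout are mine:

#include <stdio.h>
#include <string.h>

/* Returns 1 if no code is a prefix of another, 0 otherwise. */
static int is_prefix_free(const char *codes[], int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            if (i != j &&
                strncmp(codes[i], codes[j], strlen(codes[i])) == 0)
                return 0;       /* codes[i] is a prefix of codes[j] */
    return 1;
}

int main(void)
{
    const char *bad[]  = { "1", "0", "01", "10", "11" };  /* ambiguous */
    const char *good[] = { "1", "01", "001", "0001" };    /* decodable */
    printf("%d %d\n", is_prefix_free(bad, 5), is_prefix_free(good, 4));
    return 0;                   /* prints "0 1" */
}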
Of course, this is only the first part of the problem: how to get
unique codes. The other part of the problem is: which code should each
character get?
Here we come to the second idea of Huffman's algorithm, and that is
the percentage of appearance. Let us consider the following string:
AABCEDAAABCDEAABCCCCDAA
Let’s see how many times each character appears in the string:
A – 9
B – 3
C – 6
D – 3
E – 2
———-
Total – 23
Now, let's transform each appearance count into a percentage relative
to the total, following the formula:
   appearances * 100
   -----------------
         total
A – 39.13%
B – 13.04%
C – 26.09%
D – 13.04%
E – 8.70%
—————
Total – 100.00%
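In C, the counting pass might look like the following minimal sketch
(the buffer and variable names are mine, purely for illustration):

#include <stdio.h>

/* First pass: count appearances, then turn them into percentages. */
int main(void)
{
    const unsigned char buf[] = "AABCEDAAABCDEAABCCCCDAA";
    unsigned long count[256] = { 0 };
    unsigned long total = sizeof(buf) - 1;      /* 23, minus the NUL */

    for (unsigned long i = 0; i < total; i++)
        count[buf[i]]++;

    for (int c = 0; c < 256; c++)
        if (count[c])
            printf("%c - %6.2f%%\n", c,
                   count[c] * 100.0 / total);   /* e.g. A -  39.13% */
    return 0;
}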
The next step is to sort our characters in descending order of their
percentage of appearance. After this we move on with the following idea:
we add the two smallest values and obtain a composite position with a
percentage equal to the sum of the two. We rearrange the positions in
descending order and keep going until we reach 100%:
³ Phase 1       ³ Phase 2       ³ Phase 3       ³ Phase 4       ³ Phase 5      ³
³ A    - 39.13% ³ A    - 39.13% ³ A    - 39.13% ³ BCDE - 60.87% ³ ABCDE - 100% ³
³ C    - 26.09% ³ C    - 26.09% ³ BDE  - 34.78% ³ A    - 39.13% ³              ³
³ B    - 13.04% ³ DE   - 21.74% ³ C    - 26.09% ³               ³              ³
³ D    - 13.04% ³ B    - 13.04% ³               ³               ³              ³
³ E    -  8.70% ³               ³               ³               ³              ³
Now we are far into the algorithm and almost at the end. All we need
to do is associate bits with the characters, and this is done by taking
from each phase the two values that get merged and giving a 0 to the
smaller one and a 1 to the larger one (not considering the last phase,
which gives 100%):
³ D - 1 ³ DE - 1 ³ BDE - 1 ³ BCDE - 1 ³
³ E - 0 ³ B  - 0 ³ C   - 0 ³ A    - 0 ³
And now let's read the codes. Start from the last phase towards the
first, and wherever a character appears (alone or inside a composite),
append that phase's bit to its code. We get:
A = 0
B = 110
C = 10
D = 1111
E = 1110
Ok, that's it!! We got our Huffman codes!!
After all this is done, we just need to take the input buffer and
store each character's code instead of the character itself. Let's use
the above codification and compress the string: AABCEDAAABCDEAABCCCCDAA
00110101110111100011010111111100011010101010111100
Dividing it into bytes we get:
00110101 = 35h
11011110 = DEh
00110101 = 35h
11111100 = FCh
01101010 = 6Ah
10101111 = AFh
00______ = 00h
So we compressed a 23-byte string into a 7-byte string; the output
is only about 30% of the input size (not counting the dictionary we will
soon have to store).
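In C, packing the variable-length codes into bytes might look like
this sketch (put_code is a hypothetical helper of mine, not an LJHCE32
routine, which does all of this in assembly):

#include <stdio.h>

/* Append 'len' bits of 'code' (MSB first) to the output buffer.
 * 'bitpos' counts the bits already written.                       */
static void put_code(unsigned char *out, unsigned long *bitpos,
                     unsigned code, int len)
{
    for (int i = len - 1; i >= 0; i--, (*bitpos)++)
        if ((code >> i) & 1)
            out[*bitpos / 8] |= 0x80 >> (*bitpos % 8);
}

int main(void)
{
    /* codes from our example: A=0, B=110, C=10, D=1111, E=1110 */
    unsigned code[256] = { 0 };
    int      len[256]  = { 0 };
    code['A'] = 0x0, len['A'] = 1;  code['B'] = 0x6, len['B'] = 3;
    code['C'] = 0x2, len['C'] = 2;  code['D'] = 0xF, len['D'] = 4;
    code['E'] = 0xE, len['E'] = 4;

    const unsigned char in[] = "AABCEDAAABCDEAABCCCCDAA";
    unsigned char out[16] = { 0 };
    unsigned long bits = 0;

    for (int i = 0; in[i]; i++)
        put_code(out, &bits, code[in[i]], len[in[i]]);

    for (unsigned long i = 0; i < (bits + 7) / 8; i++)
        printf("%02Xh ", out[i]);   /* 35h DEh 35h FCh 6Ah AFh 00h */
    printf("(%lu bits)\n", bits);   /* 50 bits                    */
    return 0;
}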
This looks a little bit too easy, right? Yep… And I will explain
why in a second… You noticed above how we encoded each character with a
code, and did you notice how the sizes of the codes were variable? Then
how can we decode the compressed string? Answer: we *must* save the
character definitions into the compressed output buffer. This is one of
the biggest problems… how to store the definition table without
affecting the output too much? Anyway, before looking into that, let's
see how we can decompress our string:
00110101110111100011010111111100011010101010111100
We start from the left. We say: 0, is it in the definition table?
Yes, it's A. We output A and drop the 0. Next bit, 0 again… Next bit: 1.
Is it in the definition table? No! Read one more: 11. Is it in the
definition table? No, read one more: 110. Is it in the definition table?
Yes! It's a B; output the B and go on. Briefly:
0 0 110 10 1110 1111 0 0 0 110 10 1111 1110 0 0 110 10 10 10 10 1111 0 0
A A B   C  E    D    A A A B   C  D    E    A A B   C  C  C  C  D    A A
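In C, this bit-by-bit matching might look like the following sketch
(the dictionary layout here is my own, purely for illustration):

#include <stdio.h>
#include <string.h>

/* Decode bit by bit: grow a candidate code until it matches an
 * entry in the definition table (first byte = char, rest = code). */
int main(void)
{
    const char *dict[] = { "A0", "B110", "C10", "D1111", "E1110" };
    const char bits[] =
        "00110101110111100011010111111100011010101010111100";

    char cand[17];                  /* current candidate code */
    int  clen = 0;

    for (int i = 0; bits[i]; i++) {
        cand[clen++] = bits[i];
        cand[clen] = '\0';
        for (int d = 0; d < 5; d++)
            if (strcmp(cand, dict[d] + 1) == 0) {   /* skip the char  */
                putchar(dict[d][0]);                /* emit character */
                clen = 0;
                break;
            }
    }
    putchar('\n');                  /* prints AABCEDAAABCDEAABCCCCDAA */
    return 0;
}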
Easy decompression!! Now let's focus a little on the data structures
we need to use in order to implement such an algorithm.
³ Data structures ³
Ok, before going to the data structures themselves, let's try and see
what we really need to store. First of all, we are working on an alphabet
made up of at most 256 characters. For each character we must store the
character itself (1 byte) and its percentage of appearance (which we will
discuss later: 1 double word). Also, note that after each phase the
composite groups grow, until the last step, where they cover the whole
alphabet… Storing each phase's groups explicitly would require a huge
matrix of 256*256*9 = 589824 bytes!!!! Of course that is totally bad!
Let’s see the structures used:
element struc
  value  dd ?             ; percentage of appearance
  start  dd ?             ; pointer into the chars list
element ends
element_size equ size element

list struc
  char   db ?             ; the character itself
  next   dd ?             ; pointer to the next character (0 = end)
list ends
list_size equ size list
The first structure has two fields. 'value' represents the
percentage of appearance and 'start' is a pointer to an element of the
second structure. The second structure holds the character definition and
a pointer to the next character. Defining them goes like this:
chars label byte
I = 0
REPT 256
  db I                    ; character ascii code
  dd 0                    ; null pointer in the beginning
  I = I + 1
ENDM
To define the huffman codes we simply write:
huffman element 256 dup(<0>)
The size of the data so far is then 256*13 = 3328 bytes (8 bytes per
'element' entry plus 5 bytes per 'list' entry). Quite a decrease, right?
Ok, the above structures will be used to generate the Huffman tree. We
need one more structure to store the actual Huffman codes:
huffman_code struc
  code     dw ?           ; the code bits themselves
  huff_len dw 0           ; the length of the code in bits
huffman_code ends
huffman_size equ size huffman_code
codes huffman_code 256 dup(<0>)
As you can see, a Huffman code is stored on a maximum of 16 bits.
The second field holds the length of the code in bits. This adds another
256*4 = 1024 bytes, making our data total 4352 bytes.
For the beginning we will have the data set as follows:
huffman table chars table
value:0, start:——> char:0 , next=0
value:0, start:——> char:1 , next=0
…
value:0, start:——> char:255, next=0
With this initialized, we read the input buffer from beginning to
end and increment the value corresponding to each character we meet.
After doing so, we transform the number of appearances into percentages.
This is the reason why we need to store the value on a double word: when
calculating the percentage we want to keep all the decimals, so we use
the FPU:
fild dword ptr [value]   ; ST = appearance count
fild dword ptr [_100]    ; ST = 100, ST(1) = count
fmul                     ; ST = count * 100
fild dword ptr [length]  ; ST = total length, ST(1) = count * 100
fdiv                     ; ST = count * 100 / length
fstp dword ptr [value]   ; store the percentage as a float
After doing so, we sort the first matrix in descending order. Let's
consider our example from above:
A – 39.13%
B – 13.04%
C – 26.09%
D – 13.04%
E – 8.70%
huffman table chars table
value:39.13, start:——> char:’A’, next=0
value:26.09, start:——> char:’C’, next=0
value:13.04, start:——> char:’B’, next=0
value:13.04, start:——> char:’D’, next=0
value: 8.70, start:——> char:’E’, next=0
So, we have each pointer pointing to the right character in the
other matrix (note that you don't need to rearrange the second matrix,
because you work on pointers to its elements; you only rearrange the
entries of the first matrix).
Then, this is what the first phase looks like:
huffman table chars table
value:39.13, start:——> char:’A’, next=0
value:26.09, start:——> char:’C’, next=0
value:21.74, start:——> char:’D’, next:—->char:’E’,next=0
value:13.04, start:——> char:’B’, next=0
So, we added the values of the last two elements, reordered the first
matrix and made the 'next' pointer of the first of them point to the
second element, basically creating a chain within the second list. We go
on with the next phase:
huffman table chars table
value:39.13, start:——> char:’A’, next=0
value:34.78, start:——> char:’D’, next->char:’E’,next->char:’B’,next=0
value:26.09, start:——> char:’C’, next=0
And the final phase:
huffman table chars table
value:60.87 start:——> char:’D’, next->
char:’E’, next->
char:’B’, next->
char:’C’,next=0
value:39.13, start:——> char:’A’, next=0
By now I hope you get the picture: we have two parallel arrays, one
holding the percentages and pointers to the first character of each
group. Each character is either alone or the end of a list, and thus its
pointer is null, or it is followed by other characters, and thus its
pointer points to the next character. At every phase you simply have to
locate the last two items in the first list and then follow each chain of
characters until you reach a null pointer. For the characters under the
higher percentage you must set a 1, and for those under the lower one you
must set a 0. The setting occurs in the third matrix defined: the Huffman
codes matrix. For each character you set the specific bit and increase
the length. Beware that you must add each new bit at the left end of the
code.
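To make the whole walk concrete, here is how the merge-and-assign
loop might look in C. This is a simplified sketch of the scheme described
above, not LJHCE32's actual assembly: all names (prepend_bit, build_codes
and so on) are mine, chain ends are marked with -1 instead of a null
pointer, and the codes live in plain integers instead of the
huffman_code structure:

#include <stdio.h>

#define NCHARS 5                        /* A..E from our example      */

struct elem { double value; int start; };  /* percentage + chain head */
struct node { char ch; int next; };        /* character + next (-1 = end) */

static struct elem    huffman[NCHARS];     /* the sorted first matrix */
static struct node    chars[NCHARS];       /* the second matrix       */
static unsigned short code[256], len[256];

/* Prepend 'bit' to the code of every character in the chain 'head':
 * later bits land in higher positions, i.e. at the left of the code. */
static void prepend_bit(int head, int bit)
{
    for (int i = head; i != -1; i = chars[i].next) {
        unsigned char c = (unsigned char)chars[i].ch;
        code[c] |= (unsigned short)(bit << len[c]);
        len[c]++;
    }
}

static void build_codes(int n)
{
    while (n > 1) {
        /* huffman[] stays sorted descending: the two smallest sit last */
        prepend_bit(huffman[n - 2].start, 1);   /* larger gets a 1  */
        prepend_bit(huffman[n - 1].start, 0);   /* smaller gets a 0 */

        /* chain the smaller group onto the tail of the larger one */
        int i = huffman[n - 2].start;
        while (chars[i].next != -1)
            i = chars[i].next;
        chars[i].next = huffman[n - 1].start;

        huffman[n - 2].value += huffman[n - 1].value;
        n--;
        /* bubble the merged entry up to keep the array sorted */
        for (int j = n - 1;
             j > 0 && huffman[j].value > huffman[j - 1].value; j--) {
            struct elem t = huffman[j];
            huffman[j] = huffman[j - 1];
            huffman[j - 1] = t;
        }
    }
}

int main(void)
{
    const char   cs[NCHARS] = { 'A', 'C', 'B', 'D', 'E' };
    const double vs[NCHARS] = { 39.13, 26.09, 13.04, 13.04, 8.70 };

    for (int i = 0; i < NCHARS; i++) {
        chars[i].ch = cs[i];       chars[i].next = -1;
        huffman[i].value = vs[i];  huffman[i].start = i;
    }
    build_codes(NCHARS);

    for (int i = 0; i < NCHARS; i++) {  /* A=0, C=10, B=110, ...    */
        unsigned char c = (unsigned char)cs[i];
        printf("%c = ", c);
        for (int b = len[c] - 1; b >= 0; b--)
            putchar('0' + ((code[c] >> b) & 1));
        putchar('\n');
    }
    return 0;
}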
Ok, this was the hard part: generating the Huffman tree and creating
the codes. From here on things are very simple: just read each character
from the input buffer and output the corresponding Huffman code to the
output buffer. This is done easily using an algorithm that concatenates
variable-length groups of bits.
You have probably already noticed a weakness of the Huffman
encoding: the need to read the input buffer twice, once to get the
percentages and once to actually compress.
Also, please note that in order to be able to decompress, you must
store all the required information at the beginning of the compressed
data. For a start you should store:
sign dw 0 ; a file signature
ver dw 0 ; the version of the compressor
orig_size dd 0 ; the original file size
comp_size dd 0 ; the compressed file size
file_crc dd 0 ; the crc of the file
dic_size dw 0 ; the dictionary size
After this, the dictionary itself should follow. My way was to store
each character followed by its code length and then by the code itself.
This provides an easy way of reading the dictionary back at
decompression.
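In C, storing the dictionary in that layout might look like this
sketch. The entry layout (character, code length, code) follows the
description above, but the function name and the 16-bit code field are
my assumptions:

/* Store the dictionary as: char, code length, code (one entry per
 * character that actually appears). Returns the first byte after
 * the dictionary, so dic_size = return value - start.             */
static unsigned char *store_dictionary(unsigned char *p,
                                       const unsigned short *code,
                                       const unsigned short *len)
{
    for (int c = 0; c < 256; c++) {
        if (len[c] == 0)
            continue;                     /* character never appears */
        *p++ = (unsigned char)c;          /* the character itself    */
        *p++ = (unsigned char)len[c];     /* code length in bits     */
        *p++ = (unsigned char)(code[c] & 0xFF);      /* code, low    */
        *p++ = (unsigned char)(code[c] >> 8);        /* code, high   */
    }
    return p;
}

At decompression you simply walk the same entries back, rebuilding the
codes table, until dic_size bytes have been consumed.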
These are the basic structures needed to create a Huffman compressor
without being forced to use big amounts of memory. This algorithm,
however, is far less optimal than the LZ family, where first of all one
pass over the input buffer is enough, and there is no need to store
information like the dictionary. Still, the Huffman algorithm is most
useful when combined with LZ methods (like ARJ or PKZip do).
Furthermore, let's take a look at some of the most important coding
parts of a Huffman compression algorithm, as I coded them in my LJHCE32
compression engine.
³ Coding it ³
While coding a compression algorithm there are a few goals that go
without saying:
1.Best compression
2.Good speed in compression
3.Almost unnoticeable at decompression
Also, the decompression routine should be as small as possible (with
an eye towards creating a self-extractor).
In order to have a good compression speed, one of the most important
parts is the sorting algorithm. In my code I used a bubble sort, which
works pretty well.
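For reference, a bubble sort over the element table might look like
this in C (a minimal sketch; LJHCE32 does the same thing in assembly):

/* Bubble sort the element table descending by percentage; the chars
 * table is never moved, only the pointers travel with the entries.  */
struct elem { double value; int start; };

static void sort_descending(struct elem *tab, int n)
{
    for (int swapped = 1; swapped; ) {
        swapped = 0;
        for (int i = 0; i < n - 1; i++) {
            if (tab[i].value < tab[i + 1].value) {
                struct elem t = tab[i];     /* swap the whole entry, */
                tab[i] = tab[i + 1];        /* value and pointer     */
                tab[i + 1] = t;
                swapped = 1;
            }
        }
    }
}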
An almost unnoticeable decompressor is one which manages to locate
the codes as quickly as possible. A good routine is needed when checking
the input data and matching it against the dictionary. Unfortunately, in
my LJHCE32 I used a not so optimized algorithm, and most of the time the
decompression takes longer than the compression (which is not normal). I
will upgrade it in a newer version.
Also, your routines must take care of all errors that might occur
and return the proper error code. The compression routine should notice
if the output is bigger than the input and act accordingly. The
decompression routine should check the integrity of the Huffman table and
codes, and also check that the CRC of the decompressed data matches the
initial one.
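As for the CRC, a plain bitwise CRC-32 like the following C sketch
would do the job. This is the generic algorithm with polynomial
EDB88320h; LJHCE32's own CRC routine may differ:

#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32: computed over the original data before compression
 * and checked again after decompression.                             */
static uint32_t crc32_buf(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1)));
    }
    return ~crc;
}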
For the complete implementation check out the source code for the
LJHCE32 compression engine.
³ Final word ³
I do consider that compression is a great tool that helps in virus
writing. This is because you get more from one shot: you get the
encryption, you get more available space, and the decoding is very
difficult unless you know exactly the decompression algorithm. I think
everybody should consider writing compression engines and using them in
their code. As far as I know there already exist a few freeware libs with
compression engines which are pretty good, like ApLib, for example.
If somebody manages to write a compression engine using the Huffman
algorithm, please send it over to me too… I am always eager for new
stuff…
All the Best,
Lord Julus/29A (Jan.2000)