Compression (Part I) - The Huffman Algorithm -

by  Lord Julus / 29A

January 2000

[ Foreword ]

Hello people, and welcome to the first breath of the much talked about
Y2K... It looks to me that everything worked out just fine, so I decided to
move my butt and start writing something again, after quite a while.

You might consider this article a little off-topic for the virus writing
field, but things are not really like that. The rise of Win32 programming
and the limited ways of stealthing code within the portable executable
started a quest for inserting code into the PE without modifying the size
of the file. This was done by many authors in many ways, like overwriting a
useless section or hiding in the slack space which remains after rounding
up to the disk alignment. Others, however, chose to use a compression
algorithm to shrink a portion of the file and insert their code into the
freed space. This was done, for example, by JackyQwerty in the Redemption
virus with his JQCoding algorithm, or by Vecna in Babylonia. The examples
are probably many, but this was just to show you the link...

Now, in the world there exist many compression algorithms, some better
than others. From them, a few rose up to being recognized by everybody as
the best ones. Let me name a few: the Lempel-Ziv algorithms (also known as
LZ), the Shannon-Fano algorithm, the Run-Length Encoding algorithm, and the
Huffman algorithm. It is very hard to say which is the best, because it
depends very much on the raw data it works on. However, as a starter I
decided to focus on the Huffman algorithm. Unfortunately it is not quite
the best of them, as it gives lower compression ratios than most of the LZ
algorithms. But still, it is a very interesting algorithm and it deserves
full attention.

So, in this article I will present the basic theory behind the Huffman
algorithm with examples, and then show you how I implemented this method in
one of my latest releases: Lord Julus' Huffman Compression Engine
(LJHCE32).

[ Theory ]

The Huffman algorithm is based on the following idea: encode the
characters (which are normally coded on 8 bits) on fewer or more bits
(fewer for frequent characters, more for rare ones) so that on average the
total number of bits will be smaller. So this algorithm, unlike others,
doesn't work on strings, but on single characters, and this is why its
results are not that good.

The first idea anyone would have is to encode each of the characters
that appear in the input buffer with the least bits possible. Let's imagine
such an encoding:

A = 1
B = 0
C = 01
D = 10
E = 11, etc...

However, this is a non-decodable codification, because an encoding
like 111101 can be decoded as AAAAC, or EEC, or EAABA, etc... So a special
method must be used to obtain a properly decodable code. Actually, the real
problem does not come from the way the codes look, but from the fact that
they are not uniquely delimited from one another, and this would require a
mark to be set between them so that at decompression you can know which
code is which. Of course, using a separator between the codes is out of the
question. Another method must be found.

Here is the first example of unique codes that comes to my mind:

A = 1
B = 01
C = 001
D = 0001

If you want to encode the string ABCDAABC you get:

10100100011101001

If you read bit by bit, you can remove the bits as they match a
combination in the definition table above... Of course, this is just a
simple example, but it shows you one of the important assets of Huffman
codes, and that is their unique form when read from left to right: no code
is a prefix of another. The idea is to compare each bit of the encoded data
and match it against the codes in the definition table. Once you have
matched a code you can be absolutely sure it is the right one, because no
code appears at the beginning of another.
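
Just to make this property tangible, here is a tiny check written in C
(the article's engine is in assembly; the names below are only for this
example): it verifies that none of the codes above is a prefix of another,
which is exactly what makes them decodable without separators.

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* the A=1, B=01, C=001, D=0001 example table */
    const char *code[] = { "1", "01", "001", "0001" };
    int n = 4, ok = 1;

    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            if (i != j && strncmp(code[i], code[j], strlen(code[i])) == 0)
                ok = 0;  /* code[i] starts code[j]: ambiguous, not decodable */

    printf(ok ? "prefix-free\n" : "not prefix-free\n");
    return 0;
}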

Of course, this is only the first part of the problem: how to get unique
codes. The other part of the problem is: which code should each character
get?

Here we come to the second idea of Huffman's algorithm, and that is
the percentage of appearance. Let us consider the following string:

AABCEDAAABCDEAABCCCCDAA

Let's see how many times each character appears in the string:

A - 9
B - 3
C - 6
D - 3
E - 2
----------
Total - 23

Now, let's transform each appearance into a percentage relative to
the total, following the formula:

appearance * 100
----------------
      total

A -  39.13%
B -  13.04%
C -  26.09%
D -  13.04%
E -   8.70%
---------------
Total - 100.00%
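
As a small illustration (a sketch in C rather than the engine's assembly),
this is all the counting step amounts to: one pass over the buffer to count
each byte, then the formula above applied to each non-zero count. Run on
the example string, it reproduces the table.

#include <stdio.h>
#include <string.h>

int main(void)
{
    const unsigned char *buf =
        (const unsigned char *)"AABCEDAAABCDEAABCCCCDAA";
    size_t len = strlen((const char *)buf);
    unsigned long count[256] = {0};

    for (size_t i = 0; i < len; i++)     /* first pass over the input buffer */
        count[buf[i]]++;

    for (int c = 0; c < 256; c++)
        if (count[c])                    /* appearance * 100 / total */
            printf("%c - %6.2f%%\n", c, count[c] * 100.0 / len);
    return 0;
}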

The next step is to order our characters in descending order of their
percentage of appearance. After this we move on with the following idea: we
add the two smallest values and obtain a composite position with a
percentage equal to the sum of the two. We rearrange the positions in
descending order and keep going until we reach 100%:

| Phase 1    | Phase 2     | Phase 3      | Phase 4       | Phase 5      |
| A - 39.13% | A  - 39.13% | A   - 39.13% | BCDE - 60.87% | ABCDE - 100% |
| C - 26.09% | C  - 26.09% | BDE - 34.78% | A    - 39.13% |              |
| B - 13.04% | DE - 21.74% | C   - 26.09% |               |              |
| D - 13.04% | B  - 13.04% |              |               |              |
| E -  8.70% |             |              |               |              |

Now we are well into the algorithm and almost at the end. All we need
to do is associate bits with the characters, and this is done by selecting
from each phase the last two values and giving a 0 to the smaller one and a
1 to the higher one (not considering the last phase, which gives 100%):

|  D  - 1  |  DE - 1  |  BDE - 1  |  BCDE - 1  |
|  E  - 0  |  B  - 0  |  C   - 0  |  A    - 0  |

And now, let's read the codes. Start from the last phase towards the
first and, wherever you see the character, append the specified bit to its
code. We get:

A = 0
B = 110
C = 10
D = 1111
E = 1110

Ok, that's it!! We got our Huffman codes!!

After all this is done, we just need to take the input buffer and,
instead of each character, store its code. Let's use the above codification
and compress the string: AABCEDAAABCDEAABCCCCDAA

00110101110111100011010111111100011010101010111100

Dividing it into bytes we get:

00110101 = 35h
11011110 = DEh
00110101 = 35h
11111100 = FCh
01101010 = 6Ah
10101111 = AFh
00______ = 00h

So we compressed a 23-byte string into a 7-byte string: the output is
only about 30% of the original size, a reduction of roughly 70% (not
counting the dictionary, as we will see in a moment).

This looks a little bit too easy, right? Yep... and I will explain why
in a second... You noticed above how we encoded each character with a code,
and did you notice how their size was variable? Then how can we decode the
compressed string? Answer: we *must* save the character definitions into
the compressed output buffer. This is one of the biggest problems... how to
store the definition table without affecting the output too much? Anyway,
before looking into that, let's see how we can decompress our string:

00110101110111100011010111111100011010101010111100

We start from the left. We say: 0, is it in the definition table? Yes,
it's A. We output A and drop the 0. Next bit, 0 again... Next bit: 1. Is it
in the definition table? No! Read one more: 11. Is it in the definition
table? No, read one more: 110. Is it in the definition table? Yes! It's a
B, output the B and go on. Briefly:

00110101110111100011010111111100011010101010111100
AAB  C E   D   AAAB  C D   E   AAB  C C C C D    AA
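
The same walk, sketched in C (again, just an illustration, not the LJHCE32
decompressor; the code table is the one derived above): collect bits from
left to right and, as soon as they match an entry of the definition table,
output that character and start collecting again.

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *sym    = "ABCDE";                             /* symbols     */
    const char *code[] = { "0", "110", "10", "1111", "1110" };/* their codes */

    const char *bits =
        "00110101110111100011010111111100011010101010111100";

    char cur[17] = "";                    /* bits collected so far           */
    size_t cl = 0;

    for (size_t i = 0; bits[i] && cl < sizeof cur - 1; i++) {
        cur[cl++] = bits[i];
        cur[cl] = '\0';
        for (int k = 0; k < 5; k++)
            if (strcmp(cur, code[k]) == 0) {  /* matched a complete code     */
                putchar(sym[k]);
                cl = 0;                       /* start over with the next bit */
                break;
            }
    }
    putchar('\n');                      /* prints AABCEDAAABCDEAABCCCCDAA    */
    return 0;
}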

Easy decompression!! Now let's focus a little on the data structures
we need to use in order to implement such an algorithm.

[ Data structures ]

Ok, before going to the data structures themselves, let's try and see
what we really need to store. First of all, we are working on an alphabet
made up of at most 256 characters. For each character we must store its
ASCII code (1 byte) and its percentage of appearance (which we will discuss
later; 1 double word). Also, note that after each phase the composite codes
grow, until the last step, where they cover the whole alphabet... A naive
layout would make us need a huge matrix of 256*256*9 = 589824 bytes!!!! Of
course that is totally bad!

Let's see the structures used:

element struc
  value dd ?          ; percentage of appearance
  start dd ?          ; pointer to an entry in the chars list
element ends
element_size equ size element

list struc
  char db ?           ; the character itself
  next dd ?           ; pointer to the next chained character
list ends
list_size equ size list

The first structure has two fields: 'value' represents the percentage
of appearance and 'start' is a pointer to an element of the second
structure. The second structure holds the character itself and a pointer to
the next character. Defining them goes like this:

chars label byte
I=0
REPT 256
  db I       ; character ascii code
  dd 0       ; null pointer in the beginning
  I=I+1
ENDM

To define the huffman codes we simply write:

huffman element 256 dup(<0>)

The size of the data so far is then 256*13 = 3328 bytes. Quite a
decrease, right? Ok, the above structures will be used to generate the
Huffman tree. We need one more structure to store the actual Huffman codes:

huffman_code struc
  code     dw ?      ; the huffman code itself (maximum 16 bits)
  huff_len dw 0      ; length of the code, in bits
huffman_code ends
huffman_size equ size huffman_code

codes huffman_code 256 dup(<0>)

As you can see, the Huffman code is stored on a maximum of 16 bits.
The next field represents the length of the code in bits. This adds another
256*4 = 1024 bytes, making our data total 4352 bytes.

To begin with, the data is set up as follows:

huffman table              chars table
value:0, start: ----> char:0  , next = 0
value:0, start: ----> char:1  , next = 0
...
value:0, start: ----> char:255, next = 0

With this initialized, we read the input buffer from beginning to end
and increment the value corresponding to each character we meet. After
doing so, we transform the number of appearances into a percentage. This is
the reason why we need to store the value on a double word: when
calculating the percentage we want to keep all the decimals, so we use the
FPU:

fild dword ptr [value]      ; load the value
fild dword ptr [_100]      ; load 100
fmul              ; multiply
fild dword ptr [length]   ; load length
fdiv              ; divide
fstp dword ptr [value]      ; and store the floating number

After doing so, we sort the first matrix descending. Let's consider
our example above:

A -  39.13%
B -  13.04%
C -  26.09%
D -  13.04%
E -   8.70%

huffman table              chars table
value:39.13, start: ----> char:'A', next = 0
value:26.09, start: ----> char:'C', next = 0
value:13.04, start: ----> char:'B', next = 0
value:13.04, start: ----> char:'D', next = 0
value: 8.70, start: ----> char:'E', next = 0

So, we have each pointer pointing to the right character in the other
matrix (note that you don't need to rearrange the second matrix, as you
work on pointers to its elements; you only rearranged the pointers in the
first matrix).

Then, this is how the first phase looks:

huffman table              chars table
value:39.13, start: ----> char:'A', next = 0
value:26.09, start: ----> char:'C', next = 0
value:21.74, start: ----> char:'D', next ----> char:'E', next = 0
value:13.04, start: ----> char:'B', next = 0

So, we added the values of the last two elements, reordered the first
matrix and made the 'next' pointer of the first of them point to the second
element, basically creating a chain within the second list. We go on with
the next phase:

huffman table              chars table
value:39.13, start: ----> char:'A', next = 0
value:34.78, start: ----> char:'D', next ----> char:'E', next ----> char:'B', next = 0
value:26.09, start: ----> char:'C', next = 0

And the final phase:

huffman table              chars table
value:60.87, start: ----> char:'D', next ---->
                          char:'E', next ---->
                          char:'B', next ---->
                          char:'C', next = 0
value:39.13, start: ----> char:'A', next = 0

As we got this far, I hope you got the picture: we have two parallel
arrays, one holding the percentage and a pointer to the first character.
Each character either is alone or at the end of a list, and thus its
pointer is null, or it is followed by other characters, and thus its
pointer points to the next character. At every phase you simply have to
locate the last two items in the first list and then follow the chain of
characters until you reach a null pointer. For the higher percentage you
must set a 1, and for the lower one you must set a 0. The setting occurs in
the third matrix defined: the huffman_codes matrix. For each character you
set the specific bit and increase the length. Beware that each new bit must
be added to the left of the bits set in the earlier phases, since the codes
are read starting from the last phase.
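
To make the phase mechanics concrete, here is a rough sketch in C of the
merge loop described above, using the same three structures (percentages
plus chained characters, and the code table). The names and layout are
mine, not the ones from LJHCE32; raw counts work just as well as
percentages, and the bubble sort is the one mentioned further down.

#include <stddef.h>

typedef struct Node {                /* the "list" struc: char + next        */
    unsigned char ch;
    struct Node  *next;
} Node;

typedef struct {                     /* the "element" struc: value + start   */
    double value;                    /* percentage (or raw count)            */
    Node  *start;
} Element;

typedef struct {                     /* the "huffman_code" struc             */
    unsigned short code;
    unsigned short len;
} HuffCode;

/* sort the element array descending by value (simple bubble sort) */
static void sort_desc(Element *e, int n)
{
    for (int i = 0; i < n - 1; i++)
        for (int j = 0; j < n - 1 - i; j++)
            if (e[j].value < e[j + 1].value) {
                Element t = e[j]; e[j] = e[j + 1]; e[j + 1] = t;
            }
}

/* add bit 'b' to the left of the code of every character chained at 'start' */
static void set_bit(HuffCode *codes, Node *start, int b)
{
    for (Node *p = start; p != NULL; p = p->next) {
        codes[p->ch].code |= (unsigned short)(b << codes[p->ch].len);
        codes[p->ch].len++;
    }
}

/* build the codes for n active symbols; e[] is filled and sorted descending */
static void build_codes(Element *e, int n, HuffCode *codes)
{
    while (n > 1) {
        Element *hi = &e[n - 2];            /* larger of the last two items  */
        Element *lo = &e[n - 1];            /* smaller of the last two items */
        set_bit(codes, hi->start, 1);       /* 1 for the higher percentage   */
        set_bit(codes, lo->start, 0);       /* 0 for the lower one           */

        Node *p = hi->start;                /* chain lo's characters after   */
        while (p->next) p = p->next;        /* hi's last one...              */
        p->next = lo->start;
        hi->value += lo->value;             /* ...and merge the values       */

        n--;                                /* one element less...           */
        sort_desc(e, n);                    /* ...and keep the array sorted  */
    }
}

/* usage: fill one Element per symbol that actually appears (value = its
   percentage, start = a pointer to that symbol's Node), call sort_desc(),
   then build_codes(); afterwards bit (len-1) of each code is the first
   (leftmost) bit to write out. */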

Ok, this was the hard part: generating the Huffman tree and creating
the codes. From here on it is very simple: just read each character from
the input buffer and output the corresponding Huffman code to the output
buffer. This is done easily using a routine that concatenates
variable-length bit strings.
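
Such a routine is basically a small bit writer. Here is a sketch in C (not
the actual LJHCE32 code) of what it has to do: append an n-bit code to the
output, most significant bit of the code first, so the last byte ends up
padded with zero bits exactly as in the worked example above.

#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint8_t *out;        /* output buffer, must be zero-initialized         */
    size_t   byte;       /* index of the byte currently being filled        */
    int      bit;        /* bits already used inside that byte              */
} BitWriter;

static void put_code(BitWriter *w, uint16_t code, int len)
{
    for (int i = len - 1; i >= 0; i--) {        /* emit the code MSB-first  */
        if ((code >> i) & 1)
            w->out[w->byte] |= (uint8_t)(0x80 >> w->bit);
        if (++w->bit == 8) {                    /* current byte is full     */
            w->bit = 0;
            w->byte++;
        }
    }
}

/* compression loop: for each input byte c,
   put_code(&w, codes[c].code, codes[c].len);  */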

You probably already noticed a weakness of the Huffman encoding: the
need to read the input buffer twice, once to get the percentages and once
to actually compress.

Also, please note that in order to be able to decompress you must
store at the beginning of the compressed data all the information required.
To start with, you should store:

sign           dw 0  ; a file signature
ver           dw 0  ; the version of the compressor
orig_size       dd 0  ; the original file size
comp_size       dd 0  ; the compressed file size
file_crc        dd 0  ; the crc of the file
dic_size        dw 0  ; the dictionary size

After this, the dictionary itself should follow. My way was to store
each character followed by its code length and then by the code itself.
This provides an easy way of reading the dictionary at decompression.
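
In other words, the layout on disk could look like the sketch below (in C;
the field order and sizes of the real LJHCE32 header may differ, this only
restates the scheme just described): the fixed header, then one entry per
character consisting of the character, its code length and its code.

#include <stdint.h>
#include <stdio.h>

#pragma pack(push, 1)
typedef struct {
    uint16_t sign;         /* file signature                    */
    uint16_t ver;          /* version of the compressor         */
    uint32_t orig_size;    /* original file size                */
    uint32_t comp_size;    /* compressed file size              */
    uint32_t file_crc;     /* CRC of the original file          */
    uint16_t dic_size;     /* number of dictionary entries      */
} Header;
#pragma pack(pop)

/* write one dictionary entry: character, code length, then the code itself */
static void write_entry(FILE *f, uint8_t ch, uint16_t len, uint16_t code)
{
    fwrite(&ch,   sizeof ch,   1, f);
    fwrite(&len,  sizeof len,  1, f);
    fwrite(&code, sizeof code, 1, f);
}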

These are the basic structures needed to create a Huffman compressor
without being forced to use big amounts of memory. This algorithm, however,
is far less optimal than the LZ family, where first of all one pass over
the input buffer is enough and there is no need to store information like
the dictionary. Still, this algorithm is useful most of the time when
combined with LZ methods (as ARJ or PKZip do).

Furthermore, let's take a look at some of the most important coding
parts of a Huffman compression algorithm, as I coded them in my LJHCE32
compression engine.

[ Coding it ]

While coding a compression algorithm there are a few goals that go
without saying:

1. Best compression
2. Good speed in compression
3. Almost unnoticeable at decompression

Also, the decompression routine should be as small as possible (with
the idea of creating a self-extractor in mind).

In order to have a good compression speed, one of the most important
parts is the sorting algorithm. In my code I used a bubble sort, which
works pretty well.

An almost unnoticeable decompressor is one which manages to locate the
codes as quickly as possible. A good routine is needed, though, for
checking the input data and matching it against the dictionary.
Unfortunately, in my LJHCE32 I used a not so optimized algorithm, and most
of the time the decompression takes longer than the compression (which is
not normal). I will upgrade it in a newer version.

Also, your routines must take care of all errors that might occur and
return the proper error code. The compression routine should notice if the
output is bigger than the input and act accordingly. The decompression
routine should check the integrity of the Huffman table and codes, and also
check that the CRC of the decompressed data matches the initial one.
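
For the CRC part, any decent checksum will do; a sketch of the usual CRC-32
(the polynomial EDB88320h used by PKZip) is given below, in C. Whether
LJHCE32 uses exactly this routine is beside the point; the idea is simply
to compute it over the decompressed data and compare it with the value
stored in the header.

#include <stdint.h>
#include <stddef.h>

static uint32_t crc32_buf(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int b = 0; b < 8; b++)            /* process one bit at a time */
            crc = (crc >> 1) ^ (0xEDB88320u & (uint32_t)-(int32_t)(crc & 1));
    }
    return ~crc;
}

/* after decompression:
   if (crc32_buf(data, hdr.orig_size) != hdr.file_crc) -> report a CRC error */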

For  the  complete  implementation  check out the source code for the
LJHCE32 compression engine.

[ Final word ]

I do consider compression a great tool that helps in virus writing.
This is because you get more from one shot: you get encryption, you get
more available space, and the decoding is very difficult unless you know
exactly the decompression algorithm. I think everybody should consider
writing compression engines and using them in their code. As far as I know
there already exist a few freeware libs with pretty good compression
engines, like ApLib, for example.

If somebody manages to write a compression engine using the Huffman
algorithm, please send it over to me too... I am always eager for new
stuff...

All the Best,

Lord Julus / 29A (Jan. 2000)
