
GREETINGS!

   This is the README for BZIP, my block-sorting file compressor.
   BZIP is distributed under the GNU General Public License version 2;
   for details, see the file LICENSE.  Pointers to the algorithms used
   are in ALGORITHMS.  Instructions for use are in bzip.1.preformatted.

   Please read this file carefully.



HOW TO BUILD

   Type `make'.     (tough, huh? :-)

   This creates the binary "bzip" and a symbolic link to it
   named "bunzip".

   It also runs four compress-decompress tests to check that
   things are working properly.  If all goes well, you should be up &
   running.  Please read the output from `make' to confirm that the
   tests passed.

   A manual page is supplied, both unformatted (bzip.1)
   and preformatted (bzip.1.preformatted).

   To install bzip properly:

      -- Copy the binary "bzip" to a publicly visible place,
         possibly /usr/bin, /usr/common/bin or /usr/local/bin.

      -- In that directory, make "bunzip" be a symbolic link
         to "bzip".

      -- Copy the manual page, bzip.1, to the relevant place.
         Probably the right place is /usr/man/man1/.
   
   

COMPILATION NOTES

   bzip should work on 32-bit Unix boxes.  It is known to work
   [meaning: it has compiled and passed self-tests] on the 
   following platform-OS combinations:

      Intel i386/i486        running Linux 1.2.13 and Linux 2.0.0
      Sun Sparcs (various)   running SunOS 4.1.3 and Solaris 2.5
      SGI Indy R3000         running Irix 5.3
      HP 9000/700            running HPUX 9.03
      HP 9000/300            running NetBSD 1.1
      Acorn R260             running RISC iX (a BSD 4.? derivative)

   If you #define the symbol BZUNIX to 0, you should be able to
   compile the code with any ANSI C compiler (that's the theory,
   anyway!).  You'll still need a 32-bit machine to run it on, though.
  
   I have had reports that an earlier version of bzip worked on an
   Alpha, a 64-bit machine, but I cannot verify this version as such,
   not having an Alpha to hand, alas.  It might work if you modify the
   definitions of the types Int32 and UInt32 so they really are 32
   bits long, even on a 64-bit machine.  If you do succeed in making
   bzip work on a 64-bit machine, I would be pleased to hear from you.
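
   To illustrate the point about Int32 and UInt32: on most 64-bit
   Unix compilers "int" is still 32 bits while "long" is 64, so
   definitions along the following lines may be what is needed.  This
   is only a hedged sketch -- the actual definitions live somewhere in
   the bzip sources, and your compiler's type sizes may differ:

   ```c
   #include <assert.h>
   #include <limits.h>

   /* Assumption: on this machine "int" is exactly 32 bits.  If it
      is not, substitute whichever type is. */
   typedef int           Int32;
   typedef unsigned int  UInt32;

   int main ( void )
   {
      /* a sanity check worth running before attempting a port */
      assert ( sizeof(Int32)  * CHAR_BIT == 32 );
      assert ( sizeof(UInt32) * CHAR_BIT == 32 );
      return 0;
   }
   ```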

   I recommend GNU C for compilation.  The code is standard ANSI C,
   except for the Unix-specific file handling, so any ANSI C compiler
   should work.  Note however that the many routines marked INLINE
   should be inlined by your compiler, else performance will be very
   poor.
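
   For the curious, an INLINE marker of this kind is commonly set up
   as below.  This is a hedged sketch of the general technique, not
   the exact definition used in the bzip sources; the helper function
   is made up for illustration:

   ```c
   /* Under gcc, __inline__ requests inlining; with other compilers
      the macro degrades to an ordinary (and, in hot loops, much
      slower) function call. */
   #ifdef __GNUC__
   #define INLINE __inline__
   #else
   #define INLINE
   #endif

   /* hypothetical hot-loop helper, for illustration only */
   INLINE static int nextByte ( const unsigned char* buf, int* pos )
   {
      return buf[(*pos)++];
   }

   int main ( void )
   {
      const unsigned char data[] = { 10, 20, 30 };
      int pos = 0;
      int first = nextByte ( data, &pos );
      return ( first == 10 && pos == 1 ) ? 0 : 1;
   }
   ```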

   On 386/486 machines, I'd recommend giving gcc the
   -fomit-frame-pointer flag; this liberates another register for
   allocation, which measurably improves performance.

   On SPARCs (and, I guess, on many low-range RISC machines) there is no
   hardware implementation of integer multiply and divide.  This can
   mean poor decompression performance.  It also means it is important
   to generate code for the version of the SPARC instruction set you
   intend to use.  gcc -mcypress (for older SPARCs) and gcc
   -msupersparc (for newer ones) give binaries which run at strikingly
   different speeds on different flavours of SPARC.  If you are
   interested in performance figures, try both.



VALIDATION

   Correct operation, in the sense that a compressed file can always be
   decompressed to reproduce the original, is obviously of paramount
   importance.  To validate bzip, I used a modified version of 
   Mark Nelson's churn program.  Churn is an automated test driver
   which recursively traverses a directory structure, using bzip to
   compress and then decompress each file it encounters, and checking
   that the decompressed data is the same as the original.  As test 
   material, I used the entirety of my Linux filesystem, constituting
   390 megabytes in 20,440 files.  The largest file was about seventeen
   megabytes long.  Included in this filesystem was a directory containing
   39 specially constructed test files, designed to break the sorting
   phase of compression, the most elaborate part of the machinery.
   These included files of zero length, various long, highly repetitive 
   files, and some files which generate blocks with all values the same.
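
   The degenerate cases described above are easy to construct.  A
   sketch (with made-up file names -- the actual 39 test files are
   not reproduced here) of two of them:

   ```c
   #include <stdio.h>

   int main ( void )
   {
      FILE* f;
      int   i;

      /* a zero-length file */
      f = fopen ( "empty.test", "wb" );
      if (f) fclose ( f );

      /* a file in which every byte is the same, so whole blocks
         consist of a single value -- a worst case for the sorter */
      f = fopen ( "allsame.test", "wb" );
      if (f) {
         for ( i = 0; i < 200000; i++ ) putc ( 'a', f );
         fclose ( f );
      }
      return 0;
   }
   ```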

   There were actually six test runs on this filesystem, taking about
   50 CPU hours on an Intel 486DX4-100 machine:

      One with the block size set to 900k (ie, with the -9 flag, the default).

      One with the block size set to 500k (ie, with -5).

      One with the block size set to 100k (ie, with -1).

      One where the parameters for the arithmetic coder were
      set to smallB == 14 and smallF == 11, rather than the
      usual values of 26 and 18.  This was intended to expose 
      possible boundary-case problems with the arithmetic coder;
      in particular, setting smallB == 14 keeps the coding values
      all below or equal to 8192.  Doing this, I hoped that the
      values actually would hit their endpoints from time to time,
      so I'd see problems if any lurked.  With smallB == 26, the 
      range of values goes up to 2^26 (about 67 million), which makes
      potential bugs associated with endpoint effects vastly less
      likely to be detected.

      One where the block size was set to a trivial value, 173,
      so as to invoke the blocking/unblocking machinery tens of
      thousands of times over the run, and expose any potential
      problem there.

      One with normal settings, the block size set to 900k, but
      compiled with the symbol DEBUG set to 1, which turns on
      many assertion-checks in the compressor.

   None of these test runs exposed any problems.

   In addition, earlier versions of bzip have been in informal use
   for a while without difficulties.  The largest file I have tried
   so far is a log file from a chip-simulator, 52 megabytes long, 
   and that decompressed correctly.
   
   The distribution does four tests after building bzip.  These tests
   include test decompressions of pre-supplied compressed files, so
   they not only test that bzip works correctly on the machine it was
   built on, but can also decompress files compressed on a different
   machine.  This guards against unforeseen interoperability problems.



Please read and be aware of the following:

WARNING:

   This program (attempts to) compress data by performing several
   non-trivial transformations on it.  Unless you are 100% familiar
   with *all* the algorithms contained herein, and with the
   consequences of modifying them, you should NOT meddle with the
   compression or decompression machinery.  Incorrect changes can and
   very likely *will* lead to disastrous loss of data.


DISCLAIMER:

   I TAKE NO RESPONSIBILITY FOR ANY LOSS OF DATA ARISING FROM THE
   USE OF THIS PROGRAM, HOWSOEVER CAUSED.

   Every compression of a file implies an assumption that the
   compressed file can be decompressed to reproduce the original.
   Great efforts in design, coding and testing have been made to
   ensure that this program works correctly.  However, the complexity
   of the algorithms, and, in particular, the presence of various
   special cases in the code which occur with very low but non-zero
   probability make it impossible to rule out the possibility of bugs
   remaining in the program.  DO NOT COMPRESS ANY DATA WITH THIS
   PROGRAM UNLESS YOU ARE PREPARED TO ACCEPT THE POSSIBILITY, HOWEVER
   SMALL, THAT THE DATA WILL NOT BE RECOVERABLE.

   That is not to say this program is inherently unreliable.  Indeed,
   I very much hope the opposite is true.  BZIP has been carefully
   constructed and extensively tested.

End of nasty legalities.


I hope you find bzip useful.  Feel free to contact me at
   sewardj@cs.man.ac.uk
if you have any suggestions or queries.

Julian Seward
Manchester, UK
18 July 1996

