Haruhiko Okumura
Matsusaka University, Matsusaka, Japan

/***********************************************************
    Data Compression Algorithms:
    LZSS, LArc, LZARI, LHarc, ar, and All That
    (C) Haruhiko Okumura, 1991
***********************************************************/

What follows is a very brief description of the past and present activities of Japanese hobbyists interested in data compression.

In 1987, I was asked by a magazine editor to write an article about data compression. I wrote a manuscript and an accompanying program, sent them to the editor, and forgot about them. The next time I heard from him, I was told that the magazine had been discontinued. So I uploaded my program, LZSS, to a Japanese BBS on May 1, 1988. Soon a number of hobby programmers gathered and began improving on it. The project culminated in Kazuhiko Miki's archiver LArc, which was fairly widely used in Japan. (Dr. Miki is a medical specialist working at a governmental office.)

My program LZSS, and hence LArc, were based on a very simple idea. Suppose I'm going to write "compression" here. I have probably already used that word earlier in this file. If I used it 57 characters before, I might as well write "go 57 characters back, and read 11 characters," or <57, 11> for short. In general, when a string of characters has already appeared among the most recent 4096 characters, say, I encode the string by such a <position, length> pair. In Storer's [8] terminology, this is a sliding dictionary algorithm, analyzed first by Ziv and Lempel [14] and then by Storer and Szymanski [9], among others. Later versions of LZSS and LArc use binary search trees to speed up string matching; see Bell [1].

Incidentally, there are two distinct Ziv-Lempel (LZ) methods: sliding dictionary [14] and dynamic dictionary [15], in Storer's terminology. The so-called LZW method [12] belongs to the latter. Most pre-LHarc compression tools, such as 'compress', 'ARC', and 'PKARC', were based on the LZW method.

During the summer of 1988, I wrote another compression program, LZARI. It is based on the following observation: each output of LZSS is either a single character or a <position, length> pair. A single character can be coded as an integer between 0 and 255. As for the <length> field, if its upper limit is 256, say, it can be coded as an integer between 256 and 511. Thus there are 512 kinds of "characters," and the "characters" 256 through 511 are accompanied by a <position> field. These 512 "characters" can be Huffman-coded or, better still, arithmetically coded, and the <position> field can be coded in the same manner. In LZARI I used adaptive arithmetic coding [13, 2] to encode the "characters," and static arithmetic coding to encode the <position> field. (There are several versions of LZARI; some of them differ slightly from this description.) The compression of LZARI is very tight, though rather slow. Incidentally, I was told that there is now a shareware compression program whose output is bit-for-bit identical to that of one version of my LZARI.
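To make the token model concrete, here is a minimal sketch in C of how the 512 "characters" and the <position> field might be emitted. It is not the original LZSS/LZARI source: the window size of 4096 and the length limit of 256 are taken from the description above, while the helper names (emit_code, emit_position, put_literal, put_match) are illustrative assumptions. In LZARI the emitted codes would go to an adaptive arithmetic coder and the positions to a static one; here they are simply printed.

/* Sketch of the token model described above (not the original code).
   Each LZSS output is either a literal byte (0..255) or a
   <position, length> pair; the length is folded into the same
   "character" space as a value between 256 and 511.               */

#include <stdio.h>

#define WINDOW_SIZE 4096    /* "among the recent 4096 characters"        */
#define MAX_LENGTH   256    /* assumed upper limit of the <length> field */

static void emit_code(int code)    { printf("code %3d\n", code); }
static void emit_position(int pos) { printf("  position %d\n", pos); }

/* A single character is sent as one of the "characters" 0..255. */
static void put_literal(unsigned char c) { emit_code(c); }

/* A <position, length> pair is sent as a "character" 256..511
   followed by a separate <position> field.                       */
static void put_match(int position, int length)
{
    emit_code(255 + length);    /* length 1..MAX_LENGTH -> code 256..511 */
    emit_position(position);    /* 0 <= position < WINDOW_SIZE           */
}

int main(void)
{
    /* "go 57 characters back, and read 11 characters", i.e. <57, 11> */
    put_literal('c');
    put_match(57, 11);
    return 0;
}

Running this prints code 99 (the letter 'c'), then code 266 and position 57, which is simply the <57, 11> example above in numeric form.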
Haruyasu Yoshizaki (Yoshi), a physician and guru hobby programmer, worked very hard to make LZARI faster. Most importantly, he replaced LZARI's arithmetic coding with dynamic Huffman coding. His program, LZHUF, was very successful. It was much faster than my LZARI. As for compression ratio, Huffman coding cannot beat arithmetic coding, but the difference turned out to be very small.

Yoshi rewrote the compression engine of LZHUF in assembler and added a nifty user interface. His archiver, LHarc, is now the de facto standard on Japanese BBSs. After Prof. Kenjirou Okubo, a mathematician, introduced LHarc to the United States, it became world-famous. Other vendors began using similar techniques: a sliding dictionary plus statistical coding such as Huffman or Shannon-Fano coding.

Although LHarc was much faster than LZARI, we weren't quite satisfied with its speed. Because LHarc was based on dynamic Huffman coding, it had to update the Huffman tree every time it received a character. Yoshi and I tried other dynamic Huffman algorithms [5, 10, 11], but the improvements were not as great as we desired.

So I took a different approach: replacing LHarc's dynamic Huffman coding with a static Huffman method. The traditional static Huffman coding algorithm first scans the input file to count the character distribution, then builds the Huffman tree and encodes the file. In my approach, the input file is read only once. It is first compressed by a sliding dictionary method, as in LZARI and LHarc, and at the same time the distributions of the "characters" (see above) and positions are counted. The output of this process is stored in main memory. When the memory is full (or the input is exhausted), the Huffman trees are constructed, and the half-processed data is actually compressed and output.

In static Huffman coding, the Huffman tree must be stored in the compressed file, and in the traditional approach this information consumes hundreds of bytes. My approach is to standardize the Huffman trees so that (1) each left subtree is no deeper than its right counterpart, and (2) the leaves at the same level are sorted in ascending order. In this way the Huffman tree can be uniquely specified by the lengths of the codewords alone. Moreover, the resulting table of codeword lengths is itself compressed by the same Huffman coding algorithm, a kind of bootstrapping. To make the decoding program simpler, the code is adjusted so that no codeword is longer than 16 bits. Since this adjustment is rarely needed, the algorithm is kept very simple; it does not create optimal length-limited Huffman trees (see e.g. [6] for an optimal algorithm). Incidentally, my early program had a bug here, which has since been removed thanks to Yoshi.
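The standardized tree described above is what is now usually called a canonical Huffman code: once the tree shape is fixed by convention, the decoder can rebuild every codeword from the codeword lengths alone. The following small C sketch illustrates that reconstruction. It is not the code used in 'ar' or LH; the alphabet size, the example lengths, and the name assign_codes are illustrative assumptions, and it takes 0 to be the left branch, so shorter codewords receive smaller values.

/* Sketch: rebuilding a standardized Huffman code from codeword lengths
   alone (not the actual 'ar'/LH code).  Symbols with shorter lengths get
   smaller codes, and symbols of equal length get consecutive codes in
   ascending symbol order, matching rules (1) and (2) above.             */

#include <stdio.h>

#define NSYMBOLS   8    /* illustrative alphabet size              */
#define MAX_BITS  16    /* codeword lengths are limited to 16 bits */

static void assign_codes(const int len[], unsigned code[])
{
    unsigned next = 0;
    for (int bits = 1; bits <= MAX_BITS; bits++) {
        for (int sym = 0; sym < NSYMBOLS; sym++)
            if (len[sym] == bits)
                code[sym] = next++;
        next <<= 1;     /* descend one level of the tree */
    }
}

int main(void)
{
    /* Example codeword lengths; in the archive the length table would
       itself be Huffman-compressed (the bootstrapping noted above).  */
    int len[NSYMBOLS] = { 3, 3, 2, 3, 4, 3, 4, 3 };
    unsigned code[NSYMBOLS];

    assign_codes(len, code);
    for (int sym = 0; sym < NSYMBOLS; sym++) {
        printf("symbol %d: length %d, code ", sym, len[sym]);
        for (int bit = len[sym] - 1; bit >= 0; bit--)
            putchar((code[sym] >> bit & 1) ? '1' : '0');
        putchar('\n');
    }
    return 0;
}

Because the lengths determine the codewords completely, only the length table has to be stored in the compressed file, which is why this approach needs far less header space than writing out an explicit tree.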
The sliding dictionary algorithm was also improved by Yoshi, using a "PATRICIA tree" data structure; see McCreight [7] and Fiala and Greene [4].

After completing my algorithm, I learned that Brent [3] had also used a sliding dictionary plus Huffman coding. His method, SLH, is simple and elegant, but since it doesn't find the most recent longest match, the distribution of match positions becomes flat. This makes the second-stage Huffman compression less efficient.

On the basis of these new algorithms, Yoshi began rewriting his LHarc, but it took him so long that I decided to write my own archiver. My archiver was quite recklessly named 'ar'. I should have named it 'har' (after my name), say, because 'ar' collides with the name of UNIX's archiver. I didn't want my program to compete with LHarc, but I wanted many people to try the algorithm, so I wrote it in pure ANSI C. This is why 'ar' lacks many features necessary for a real archiver. Incidentally, I think one recent shareware archiver written by an American uses my algorithm.

Yoshi finally showed us his new archiver written in C. It was tentatively named LHx. He then rewrote the main logic in assembler. His new archiver, LH, is in beta testing now. The suffix 'arc' was dropped in deference to the archiver ARC.

The foregoing is a brief excerpt from what Yoshi and I wrote for the January 1991 issue of "C Magazine" (in Japanese). Recently we learned that in DOS 5.0, LH means LOADHIGH, an internal command. We will have to abandon the name unless Microsoft abandons it!

Also, I was told that Fiala and Greene's algorithm has been patented ("Textual Substitution Data Compression With Finite Length Search Windows," U.S. Patent 4,906,991, Mar. 6, 1990). The document says that it is one of three patent applications made by the authors relating to data compression; the other two are "Start, Step, Stop Unary Encoding for Data Compression," Application Ser. No. 07/187,697, and "Search Tree Data Structure Encoding for Textual Substitution Data Compression Systems," Application Ser. No. 07/187,699. Furthermore, I learned that the original Ziv-Lempel compression method (Eastman et al., U.S. Patent 4,464,650, 8/1984) and the LZW method (Welch, U.S. Patent 4,558,302, 12/1985) are also patented. I also heard that Richard Stallman, of the Free Software Foundation, author of the EMACS editor and leader of the GNU project, has stopped using the 'compress' program because its LZW algorithm was patented. Are algorithms patentable? (See [16].) If these patents are taken seriously, all compression programs now in use may infringe some of them.

REFERENCES

 [1] Timothy C. Bell. Better OPM/L text compression. IEEE Transactions on Communications, COM-34(12):1176--1182, 1986.
 [2] Timothy C. Bell, John G. Cleary, and Ian H. Witten. Text Compression. Prentice Hall, 1990.
 [3] R. P. Brent. A linear algorithm for data compression. The Australian Computer Journal, 19(2):64--68, 1987.
 [4] Edward R. Fiala and Daniel H. Greene. Data compression with finite windows. Communications of the ACM, 32(4):490--505, 1989.
 [5] Donald E. Knuth. Dynamic Huffman coding. Journal of Algorithms, 6:163--180, 1985.
 [6] Lawrence L. Larmore and Daniel S. Hirschberg. A fast algorithm for optimal length-limited Huffman codes. Journal of the Association for Computing Machinery, 37(3):464--473, 1990.
 [7] Edward M. McCreight. A space-economical suffix tree construction algorithm. Journal of the Association for Computing Machinery, 23(2):262--272, 1976.
 [8] James A. Storer. Data Compression: Methods and Theory. Computer Science Press, Rockville, MD, 1988.
 [9] James A. Storer and Thomas G. Szymanski. Data compression via textual substitution. Journal of the Association for Computing Machinery, 29(4):928--951, 1982.
[10] Jeffrey Scott Vitter. Design and analysis of dynamic Huffman codes. Journal of the Association for Computing Machinery, 34(4):825--845, 1987.
[11] Jeffrey Scott Vitter. Algorithm 673: Dynamic Huffman coding. ACM Transactions on Mathematical Software, 15(2):158--167, 1989.
[12] Terry A. Welch. A technique for high-performance data compression. IEEE Computer, 17(6):8--19, 1984.
[13] Ian H. Witten, Radford M. Neal, and John G. Cleary. Arithmetic coding for data compression. Communications of the ACM, 30(6):520--540, 1987.
[14] Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, IT-23(3):337--343, 1977.
[15] Jacob Ziv and Abraham Lempel. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, IT-24(5):530--536, 1978.
[16] Edward N. Zalta. Are algorithms patentable? Notices of the American Mathematical Society, 35(6):796--799, 1988.
------------------------------------------------------------

LZSS, LZARI, LZHUF, LHarc, and AR002 ('ar') can be downloaded from many BBSs, such as CompuServe's IBMPRO Library. Note that the final version of LHarc is 1.13d. "LHICE" is a _fake_ version of LHarc (contrary to what the DDJ magazine says).

I welcome your comments. Please write to:

    CompuServe: 74050,1022
    Internet:   74050.1022@compuserve.com
    Snailmail:  12-2-404 Green Heights, Nagasawa, Yokosuka 239, Japan

Haruhiko Okumura