| | @@ -1,972 +0,0 @@ |
| 1 | | -
|
| 2 | | -
|
| 3 | | -
|
| 4 | | -
|
| 5 | | -
|
| 6 | | -
|
| 7 | | -Network Working Group P. Deutsch
|
| 8 | | -Request for Comments: 1951 Aladdin Enterprises
|
| 9 | | -Category: Informational May 1996
|
| 10 | | -
|
| 11 | | -
|
| 12 | | - DEFLATE Compressed Data Format Specification version 1.3
|
| 13 | | -
|
| 14 | | -Status of This Memo
|
| 15 | | -
|
| 16 | | - This memo provides information for the Internet community. This memo
|
| 17 | | - does not specify an Internet standard of any kind. Distribution of
|
| 18 | | - this memo is unlimited.
|
| 19 | | -
|
| 20 | | -IESG Note:
|
| 21 | | -
|
| 22 | | - The IESG takes no position on the validity of any Intellectual
|
| 23 | | - Property Rights statements contained in this document.
|
| 24 | | -
|
| 25 | | -Notices
|
| 26 | | -
|
| 27 | | - Copyright (c) 1996 L. Peter Deutsch
|
| 28 | | -
|
| 29 | | - Permission is granted to copy and distribute this document for any
|
| 30 | | - purpose and without charge, including translations into other
|
| 31 | | - languages and incorporation into compilations, provided that the
|
| 32 | | - copyright notice and this notice are preserved, and that any
|
| 33 | | - substantive changes or deletions from the original are clearly
|
| 34 | | - marked.
|
| 35 | | -
|
| 36 | | - A pointer to the latest version of this and related documentation in
|
| 37 | | - HTML format can be found at the URL
|
| 38 | | - <ftp://ftp.uu.net/graphics/png/documents/zlib/zdoc-index.html>.
|
| 39 | | -
|
| 40 | | -Abstract
|
| 41 | | -
|
| 42 | | - This specification defines a lossless compressed data format that
|
| 43 | | - compresses data using a combination of the LZ77 algorithm and Huffman
|
| 44 | | - coding, with efficiency comparable to the best currently available
|
| 45 | | - general-purpose compression methods. The data can be produced or
|
| 46 | | - consumed, even for an arbitrarily long sequentially presented input
|
| 47 | | - data stream, using only an a priori bounded amount of intermediate
|
| 48 | | - storage. The format can be implemented readily in a manner not
|
| 49 | | - covered by patents.
|
| 50 | | -
|
| 51 | | -
|
| 52 | | -
|
| 53 | | -
|
| 54 | | -
|
| 55 | | -
|
| 56 | | -
|
| 57 | | -
|
| 58 | | -Deutsch Informational [Page 1]
|
| 59 | | - |
| 60 | | -
|
| 61 | | -RFC 1951 DEFLATE Compressed Data Format Specification May 1996
|
| 62 | | -
|
| 63 | | -
|
| 64 | | -Table of Contents
|
| 65 | | -
|
| 66 | | - 1. Introduction ................................................... 2
|
| 67 | | - 1.1. Purpose ................................................... 2
|
| 68 | | - 1.2. Intended audience ......................................... 3
|
| 69 | | - 1.3. Scope ..................................................... 3
|
| 70 | | - 1.4. Compliance ................................................ 3
|
| 71 | | - 1.5. Definitions of terms and conventions used ................ 3
|
| 72 | | - 1.6. Changes from previous versions ............................ 4
|
| 73 | | - 2. Compressed representation overview ............................. 4
|
| 74 | | - 3. Detailed specification ......................................... 5
|
| 75 | | - 3.1. Overall conventions ....................................... 5
|
| 76 | | - 3.1.1. Packing into bytes .................................. 5
|
| 77 | | - 3.2. Compressed block format ................................... 6
|
| 78 | | - 3.2.1. Synopsis of prefix and Huffman coding ............... 6
|
| 79 | | - 3.2.2. Use of Huffman coding in the "deflate" format ....... 7
|
| 80 | | - 3.2.3. Details of block format ............................. 9
|
| 81 | | - 3.2.4. Non-compressed blocks (BTYPE=00) ................... 11
|
| 82 | | - 3.2.5. Compressed blocks (length and distance codes) ...... 11
|
| 83 | | - 3.2.6. Compression with fixed Huffman codes (BTYPE=01) .... 12
|
| 84 | | - 3.2.7. Compression with dynamic Huffman codes (BTYPE=10) .. 13
|
| 85 | | - 3.3. Compliance ............................................... 14
|
| 86 | | - 4. Compression algorithm details ................................. 14
|
| 87 | | - 5. References .................................................... 16
|
| 88 | | - 6. Security Considerations ....................................... 16
|
| 89 | | - 7. Source code ................................................... 16
|
| 90 | | - 8. Acknowledgements .............................................. 16
|
| 91 | | - 9. Author's Address .............................................. 17
|
| 92 | | -
|
| 93 | | -1. Introduction
|
| 94 | | -
|
| 95 | | - 1.1. Purpose
|
| 96 | | -
|
| 97 | | - The purpose of this specification is to define a lossless
|
| 98 | | - compressed data format that:
|
| 99 | | - * Is independent of CPU type, operating system, file system,
|
| 100 | | - and character set, and hence can be used for interchange;
|
| 101 | | - * Can be produced or consumed, even for an arbitrarily long
|
| 102 | | - sequentially presented input data stream, using only an a
|
| 103 | | - priori bounded amount of intermediate storage, and hence
|
| 104 | | - can be used in data communications or similar structures
|
| 105 | | - such as Unix filters;
|
| 106 | | - * Compresses data with efficiency comparable to the best
|
| 107 | | - currently available general-purpose compression methods,
|
| 108 | | - and in particular considerably better than the "compress"
|
| 109 | | - program;
|
| 110 | | - * Can be implemented readily in a manner not covered by
|
| 111 | | - patents, and hence can be practiced freely;
|
| 112 | | -
|
| 113 | | -
|
| 114 | | -
|
| 115 | | -Deutsch Informational [Page 2]
|
| 116 | | - |
| 117 | | -
|
| 118 | | -RFC 1951 DEFLATE Compressed Data Format Specification May 1996
|
| 119 | | -
|
| 120 | | -
|
| 121 | | - * Is compatible with the file format produced by the current
|
| 122 | | - widely used gzip utility, in that conforming decompressors
|
| 123 | | - will be able to read data produced by the existing gzip
|
| 124 | | - compressor.
|
| 125 | | -
|
| 126 | | - The data format defined by this specification does not attempt to:
|
| 127 | | -
|
| 128 | | - * Allow random access to compressed data;
|
| 129 | | - * Compress specialized data (e.g., raster graphics) as well
|
| 130 | | - as the best currently available specialized algorithms.
|
| 131 | | -
|
| 132 | | - A simple counting argument shows that no lossless compression
|
| 133 | | - algorithm can compress every possible input data set. For the
|
| 134 | | - format defined here, the worst case expansion is 5 bytes per 32K-
|
| 135 | | - byte block, i.e., a size increase of 0.015% for large data sets.
|
| 136 | | - English text usually compresses by a factor of 2.5 to 3;
|
| 137 | | - executable files usually compress somewhat less; graphical data
|
| 138 | | - such as raster images may compress much more.
|
| 139 | | -
|
| 140 | | - 1.2. Intended audience
|
| 141 | | -
|
| 142 | | - This specification is intended for use by implementors of software
|
| 143 | | - to compress data into "deflate" format and/or decompress data from
|
| 144 | | - "deflate" format.
|
| 145 | | -
|
| 146 | | - The text of the specification assumes a basic background in
|
| 147 | | - programming at the level of bits and other primitive data
|
| 148 | | - representations. Familiarity with the technique of Huffman coding
|
| 149 | | - is helpful but not required.
|
| 150 | | -
|
| 151 | | - 1.3. Scope
|
| 152 | | -
|
| 153 | | - The specification specifies a method for representing a sequence
|
| 154 | | - of bytes as a (usually shorter) sequence of bits, and a method for
|
| 155 | | - packing the latter bit sequence into bytes.
|
| 156 | | -
|
| 157 | | - 1.4. Compliance
|
| 158 | | -
|
| 159 | | - Unless otherwise indicated below, a compliant decompressor must be
|
| 160 | | - able to accept and decompress any data set that conforms to all
|
| 161 | | - the specifications presented here; a compliant compressor must
|
| 162 | | - produce data sets that conform to all the specifications presented
|
| 163 | | - here.
|
| 164 | | -
|
| 165 | | - 1.5. Definitions of terms and conventions used
|
| 166 | | -
|
| 167 | | - Byte: 8 bits stored or transmitted as a unit (same as an octet).
|
| 168 | | - For this specification, a byte is exactly 8 bits, even on machines
|
| 169 | | -
|
| 170 | | -
|
| 171 | | -
|
| 172 | | -Deutsch Informational [Page 3]
|
| 173 | | - |
| 174 | | -
|
| 175 | | -RFC 1951 DEFLATE Compressed Data Format Specification May 1996
|
| 176 | | -
|
| 177 | | -
|
| 178 | | - which store a character on a number of bits different from eight.
|
| 179 | | - See below, for the numbering of bits within a byte.
|
| 180 | | -
|
| 181 | | - String: a sequence of arbitrary bytes.
|
| 182 | | -
|
| 183 | | - 1.6. Changes from previous versions
|
| 184 | | -
|
| 185 | | - There have been no technical changes to the deflate format since
|
| 186 | | - version 1.1 of this specification. In version 1.2, some
|
| 187 | | - terminology was changed. Version 1.3 is a conversion of the
|
| 188 | | - specification to RFC style.
|
| 189 | | -
|
| 190 | | -2. Compressed representation overview
|
| 191 | | -
|
| 192 | | - A compressed data set consists of a series of blocks, corresponding
|
| 193 | | - to successive blocks of input data. The block sizes are arbitrary,
|
| 194 | | - except that non-compressible blocks are limited to 65,535 bytes.
|
| 195 | | -
|
| 196 | | - Each block is compressed using a combination of the LZ77 algorithm
|
| 197 | | - and Huffman coding. The Huffman trees for each block are independent
|
| 198 | | - of those for previous or subsequent blocks; the LZ77 algorithm may
|
| 199 | | - use a reference to a duplicated string occurring in a previous block,
|
| 200 | | - up to 32K input bytes before.
|
| 201 | | -
|
| 202 | | - Each block consists of two parts: a pair of Huffman code trees that
|
| 203 | | - describe the representation of the compressed data part, and a
|
| 204 | | - compressed data part. (The Huffman trees themselves are compressed
|
| 205 | | - using Huffman encoding.) The compressed data consists of a series of
|
| 206 | | - elements of two types: literal bytes (of strings that have not been
|
| 207 | | - detected as duplicated within the previous 32K input bytes), and
|
| 208 | | - pointers to duplicated strings, where a pointer is represented as a
|
| 209 | | - pair <length, backward distance>. The representation used in the
|
| 210 | | - "deflate" format limits distances to 32K bytes and lengths to 258
|
| 211 | | - bytes, but does not limit the size of a block, except for
|
| 212 | | - uncompressible blocks, which are limited as noted above.
|
| 213 | | -
|
| 214 | | - Each type of value (literals, distances, and lengths) in the
|
| 215 | | - compressed data is represented using a Huffman code, using one code
|
| 216 | | - tree for literals and lengths and a separate code tree for distances.
|
| 217 | | - The code trees for each block appear in a compact form just before
|
| 218 | | - the compressed data for that block.
|
| 219 | | -
|
| 220 | | -
|
| 221 | | -
|
| 222 | | -
|
| 223 | | -
|
| 224 | | -
|
| 225 | | -
|
| 226 | | -
|
| 227 | | -
|
| 228 | | -
|
| 229 | | -Deutsch Informational [Page 4]
|
| 230 | | - |
| 231 | | -
|
| 232 | | -RFC 1951 DEFLATE Compressed Data Format Specification May 1996
|
| 233 | | -
|
| 234 | | -
|
| 235 | | -3. Detailed specification
|
| 236 | | -
|
| 237 | | - 3.1. Overall conventions In the diagrams below, a box like this:
|
| 238 | | -
|
| 239 | | - +---+
|
| 240 | | - | | <-- the vertical bars might be missing
|
| 241 | | - +---+
|
| 242 | | -
|
| 243 | | - represents one byte; a box like this:
|
| 244 | | -
|
| 245 | | - +==============+
|
| 246 | | - | |
|
| 247 | | - +==============+
|
| 248 | | -
|
| 249 | | - represents a variable number of bytes.
|
| 250 | | -
|
| 251 | | - Bytes stored within a computer do not have a "bit order", since
|
| 252 | | - they are always treated as a unit. However, a byte considered as
|
| 253 | | - an integer between 0 and 255 does have a most- and least-
|
| 254 | | - significant bit, and since we write numbers with the most-
|
| 255 | | - significant digit on the left, we also write bytes with the most-
|
| 256 | | - significant bit on the left. In the diagrams below, we number the
|
| 257 | | - bits of a byte so that bit 0 is the least-significant bit, i.e.,
|
| 258 | | - the bits are numbered:
|
| 259 | | -
|
| 260 | | - +--------+
|
| 261 | | - |76543210|
|
| 262 | | - +--------+
|
| 263 | | -
|
| 264 | | - Within a computer, a number may occupy multiple bytes. All
|
| 265 | | - multi-byte numbers in the format described here are stored with
|
| 266 | | - the least-significant byte first (at the lower memory address).
|
| 267 | | - For example, the decimal number 520 is stored as:
|
| 268 | | -
|
| 269 | | - 0 1
|
| 270 | | - +--------+--------+
|
| 271 | | - |00001000|00000010|
|
| 272 | | - +--------+--------+
|
| 273 | | - ^ ^
|
| 274 | | - | |
|
| 275 | | - | + more significant byte = 2 x 256
|
| 276 | | - + less significant byte = 8
|
| 277 | | -
|
| 278 | | - 3.1.1. Packing into bytes
|
| 279 | | -
|
| 280 | | - This document does not address the issue of the order in which
|
| 281 | | - bits of a byte are transmitted on a bit-sequential medium,
|
| 282 | | - since the final data format described here is byte- rather than
|
| 283 | | -
|
| 284 | | -
|
| 285 | | -
|
| 286 | | -Deutsch Informational [Page 5]
|
| 287 | | - |
| 288 | | -
|
| 289 | | -RFC 1951 DEFLATE Compressed Data Format Specification May 1996
|
| 290 | | -
|
| 291 | | -
|
| 292 | | - bit-oriented. However, we describe the compressed block format
|
| 293 | | - in below, as a sequence of data elements of various bit
|
| 294 | | - lengths, not a sequence of bytes. We must therefore specify
|
| 295 | | - how to pack these data elements into bytes to form the final
|
| 296 | | - compressed byte sequence:
|
| 297 | | -
|
| 298 | | - * Data elements are packed into bytes in order of
|
| 299 | | - increasing bit number within the byte, i.e., starting
|
| 300 | | - with the least-significant bit of the byte.
|
| 301 | | - * Data elements other than Huffman codes are packed
|
| 302 | | - starting with the least-significant bit of the data
|
| 303 | | - element.
|
| 304 | | - * Huffman codes are packed starting with the most-
|
| 305 | | - significant bit of the code.
|
| 306 | | -
|
| 307 | | - In other words, if one were to print out the compressed data as
|
| 308 | | - a sequence of bytes, starting with the first byte at the
|
| 309 | | - *right* margin and proceeding to the *left*, with the most-
|
| 310 | | - significant bit of each byte on the left as usual, one would be
|
| 311 | | - able to parse the result from right to left, with fixed-width
|
| 312 | | - elements in the correct MSB-to-LSB order and Huffman codes in
|
| 313 | | - bit-reversed order (i.e., with the first bit of the code in the
|
| 314 | | - relative LSB position).
|
| 315 | | -
|
| 316 | | - 3.2. Compressed block format
|
| 317 | | -
|
| 318 | | - 3.2.1. Synopsis of prefix and Huffman coding
|
| 319 | | -
|
| 320 | | - Prefix coding represents symbols from an a priori known
|
| 321 | | - alphabet by bit sequences (codes), one code for each symbol, in
|
| 322 | | - a manner such that different symbols may be represented by bit
|
| 323 | | - sequences of different lengths, but a parser can always parse
|
| 324 | | - an encoded string unambiguously symbol-by-symbol.
|
| 325 | | -
|
| 326 | | - We define a prefix code in terms of a binary tree in which the
|
| 327 | | - two edges descending from each non-leaf node are labeled 0 and
|
| 328 | | - 1 and in which the leaf nodes correspond one-for-one with (are
|
| 329 | | - labeled with) the symbols of the alphabet; then the code for a
|
| 330 | | - symbol is the sequence of 0's and 1's on the edges leading from
|
| 331 | | - the root to the leaf labeled with that symbol. For example:
|
| 332 | | -
|
| 333 | | -
|
| 334 | | -
|
| 335 | | -
|
| 336 | | -
|
| 337 | | -
|
| 338 | | -
|
| 339 | | -
|
| 340 | | -
|
| 341 | | -
|
| 342 | | -
|
| 343 | | -Deutsch Informational [Page 6]
|
| 344 | | - |
| 345 | | -
|
| 346 | | -RFC 1951 DEFLATE Compressed Data Format Specification May 1996
|
| 347 | | -
|
| 348 | | -
|
| 349 | | - /\ Symbol Code
|
| 350 | | - 0 1 ------ ----
|
| 351 | | - / \ A 00
|
| 352 | | - /\ B B 1
|
| 353 | | - 0 1 C 011
|
| 354 | | - / \ D 010
|
| 355 | | - A /\
|
| 356 | | - 0 1
|
| 357 | | - / \
|
| 358 | | - D C
|
| 359 | | -
|
| 360 | | - A parser can decode the next symbol from an encoded input
|
| 361 | | - stream by walking down the tree from the root, at each step
|
| 362 | | - choosing the edge corresponding to the next input bit.
|
| 363 | | -
|
| 364 | | - Given an alphabet with known symbol frequencies, the Huffman
|
| 365 | | - algorithm allows the construction of an optimal prefix code
|
| 366 | | - (one which represents strings with those symbol frequencies
|
| 367 | | - using the fewest bits of any possible prefix codes for that
|
| 368 | | - alphabet). Such a code is called a Huffman code. (See
|
| 369 | | - reference [1] in Chapter 5, references for additional
|
| 370 | | - information on Huffman codes.)
|
| 371 | | -
|
| 372 | | - Note that in the "deflate" format, the Huffman codes for the
|
| 373 | | - various alphabets must not exceed certain maximum code lengths.
|
| 374 | | - This constraint complicates the algorithm for computing code
|
| 375 | | - lengths from symbol frequencies. Again, see Chapter 5,
|
| 376 | | - references for details.
|
| 377 | | -
|
| 378 | | - 3.2.2. Use of Huffman coding in the "deflate" format
|
| 379 | | -
|
| 380 | | - The Huffman codes used for each alphabet in the "deflate"
|
| 381 | | - format have two additional rules:
|
| 382 | | -
|
| 383 | | - * All codes of a given bit length have lexicographically
|
| 384 | | - consecutive values, in the same order as the symbols
|
| 385 | | - they represent;
|
| 386 | | -
|
| 387 | | - * Shorter codes lexicographically precede longer codes.
|
| 388 | | -
|
| 389 | | -
|
| 390 | | -
|
| 391 | | -
|
| 392 | | -
|
| 393 | | -
|
| 394 | | -
|
| 395 | | -
|
| 396 | | -
|
| 397 | | -
|
| 398 | | -
|
| 399 | | -
|
| 400 | | -Deutsch Informational [Page 7]
|
| 401 | | - |
| 402 | | -
|
| 403 | | -RFC 1951 DEFLATE Compressed Data Format Specification May 1996
|
| 404 | | -
|
| 405 | | -
|
| 406 | | - We could recode the example above to follow this rule as
|
| 407 | | - follows, assuming that the order of the alphabet is ABCD:
|
| 408 | | -
|
| 409 | | - Symbol Code
|
| 410 | | - ------ ----
|
| 411 | | - A 10
|
| 412 | | - B 0
|
| 413 | | - C 110
|
| 414 | | - D 111
|
| 415 | | -
|
| 416 | | - I.e., 0 precedes 10 which precedes 11x, and 110 and 111 are
|
| 417 | | - lexicographically consecutive.
|
| 418 | | -
|
| 419 | | - Given this rule, we can define the Huffman code for an alphabet
|
| 420 | | - just by giving the bit lengths of the codes for each symbol of
|
| 421 | | - the alphabet in order; this is sufficient to determine the
|
| 422 | | - actual codes. In our example, the code is completely defined
|
| 423 | | - by the sequence of bit lengths (2, 1, 3, 3). The following
|
| 424 | | - algorithm generates the codes as integers, intended to be read
|
| 425 | | - from most- to least-significant bit. The code lengths are
|
| 426 | | - initially in tree[I].Len; the codes are produced in
|
| 427 | | - tree[I].Code.
|
| 428 | | -
|
| 429 | | - 1) Count the number of codes for each code length. Let
|
| 430 | | - bl_count[N] be the number of codes of length N, N >= 1.
|
| 431 | | -
|
| 432 | | - 2) Find the numerical value of the smallest code for each
|
| 433 | | - code length:
|
| 434 | | -
|
| 435 | | - code = 0;
|
| 436 | | - bl_count[0] = 0;
|
| 437 | | - for (bits = 1; bits <= MAX_BITS; bits++) {
|
| 438 | | - code = (code + bl_count[bits-1]) << 1;
|
| 439 | | - next_code[bits] = code;
|
| 440 | | - }
|
| 441 | | -
|
| 442 | | - 3) Assign numerical values to all codes, using consecutive
|
| 443 | | - values for all codes of the same length with the base
|
| 444 | | - values determined at step 2. Codes that are never used
|
| 445 | | - (which have a bit length of zero) must not be assigned a
|
| 446 | | - value.
|
| 447 | | -
|
| 448 | | - for (n = 0; n <= max_code; n++) {
|
| 449 | | - len = tree[n].Len;
|
| 450 | | - if (len != 0) {
|
| 451 | | - tree[n].Code = next_code[len];
|
| 452 | | - next_code[len]++;
|
| 453 | | - }
|
| 454 | | -
|
| 455 | | -
|
| 456 | | -
|
| 457 | | -Deutsch Informational [Page 8]
|
| 458 | | - |
| 459 | | -
|
| 460 | | -RFC 1951 DEFLATE Compressed Data Format Specification May 1996
|
| 461 | | -
|
| 462 | | -
|
| 463 | | - }
|
| 464 | | -
|
| 465 | | - Example:
|
| 466 | | -
|
| 467 | | - Consider the alphabet ABCDEFGH, with bit lengths (3, 3, 3, 3,
|
| 468 | | - 3, 2, 4, 4). After step 1, we have:
|
| 469 | | -
|
| 470 | | - N bl_count[N]
|
| 471 | | - - -----------
|
| 472 | | - 2 1
|
| 473 | | - 3 5
|
| 474 | | - 4 2
|
| 475 | | -
|
| 476 | | - Step 2 computes the following next_code values:
|
| 477 | | -
|
| 478 | | - N next_code[N]
|
| 479 | | - - ------------
|
| 480 | | - 1 0
|
| 481 | | - 2 0
|
| 482 | | - 3 2
|
| 483 | | - 4 14
|
| 484 | | -
|
| 485 | | - Step 3 produces the following code values:
|
| 486 | | -
|
| 487 | | - Symbol Length Code
|
| 488 | | - ------ ------ ----
|
| 489 | | - A 3 010
|
| 490 | | - B 3 011
|
| 491 | | - C 3 100
|
| 492 | | - D 3 101
|
| 493 | | - E 3 110
|
| 494 | | - F 2 00
|
| 495 | | - G 4 1110
|
| 496 | | - H 4 1111
|
| 497 | | -
|
| 498 | | - 3.2.3. Details of block format
|
| 499 | | -
|
| 500 | | - Each block of compressed data begins with 3 header bits
|
| 501 | | - containing the following data:
|
| 502 | | -
|
| 503 | | - first bit BFINAL
|
| 504 | | - next 2 bits BTYPE
|
| 505 | | -
|
| 506 | | - Note that the header bits do not necessarily begin on a byte
|
| 507 | | - boundary, since a block does not necessarily occupy an integral
|
| 508 | | - number of bytes.
|
| 509 | | -
|
| 510 | | -
|
| 511 | | -
|
| 512 | | -
|
| 513 | | -
|
| 514 | | -Deutsch Informational [Page 9]
|
| 515 | | - |
| 516 | | -
|
| 517 | | -RFC 1951 DEFLATE Compressed Data Format Specification May 1996
|
| 518 | | -
|
| 519 | | -
|
| 520 | | - BFINAL is set if and only if this is the last block of the data
|
| 521 | | - set.
|
| 522 | | -
|
| 523 | | - BTYPE specifies how the data are compressed, as follows:
|
| 524 | | -
|
| 525 | | - 00 - no compression
|
| 526 | | - 01 - compressed with fixed Huffman codes
|
| 527 | | - 10 - compressed with dynamic Huffman codes
|
| 528 | | - 11 - reserved (error)
|
| 529 | | -
|
| 530 | | - The only difference between the two compressed cases is how the
|
| 531 | | - Huffman codes for the literal/length and distance alphabets are
|
| 532 | | - defined.
|
| 533 | | -
|
| 534 | | - In all cases, the decoding algorithm for the actual data is as
|
| 535 | | - follows:
|
| 536 | | -
|
| 537 | | - do
|
| 538 | | - read block header from input stream.
|
| 539 | | - if stored with no compression
|
| 540 | | - skip any remaining bits in current partially
|
| 541 | | - processed byte
|
| 542 | | - read LEN and NLEN (see next section)
|
| 543 | | - copy LEN bytes of data to output
|
| 544 | | - otherwise
|
| 545 | | - if compressed with dynamic Huffman codes
|
| 546 | | - read representation of code trees (see
|
| 547 | | - subsection below)
|
| 548 | | - loop (until end of block code recognized)
|
| 549 | | - decode literal/length value from input stream
|
| 550 | | - if value < 256
|
| 551 | | - copy value (literal byte) to output stream
|
| 552 | | - otherwise
|
| 553 | | - if value = end of block (256)
|
| 554 | | - break from loop
|
| 555 | | - otherwise (value = 257..285)
|
| 556 | | - decode distance from input stream
|
| 557 | | -
|
| 558 | | - move backwards distance bytes in the output
|
| 559 | | - stream, and copy length bytes from this
|
| 560 | | - position to the output stream.
|
| 561 | | - end loop
|
| 562 | | - while not last block
|
| 563 | | -
|
| 564 | | - Note that a duplicated string reference may refer to a string
|
| 565 | | - in a previous block; i.e., the backward distance may cross one
|
| 566 | | - or more block boundaries. However a distance cannot refer past
|
| 567 | | - the beginning of the output stream. (An application using a
|
| 568 | | -
|
| 569 | | -
|
| 570 | | -
|
| 571 | | -Deutsch Informational [Page 10]
|
| 572 | | - |
| 573 | | -
|
| 574 | | -RFC 1951 DEFLATE Compressed Data Format Specification May 1996
|
| 575 | | -
|
| 576 | | -
|
| 577 | | - preset dictionary might discard part of the output stream; a
|
| 578 | | - distance can refer to that part of the output stream anyway)
|
| 579 | | - Note also that the referenced string may overlap the current
|
| 580 | | - position; for example, if the last 2 bytes decoded have values
|
| 581 | | - X and Y, a string reference with <length = 5, distance = 2>
|
| 582 | | - adds X,Y,X,Y,X to the output stream.
|
| 583 | | -
|
| 584 | | - We now specify each compression method in turn.
|
| 585 | | -
|
| 586 | | - 3.2.4. Non-compressed blocks (BTYPE=00)
|
| 587 | | -
|
| 588 | | - Any bits of input up to the next byte boundary are ignored.
|
| 589 | | - The rest of the block consists of the following information:
|
| 590 | | -
|
| 591 | | - 0 1 2 3 4...
|
| 592 | | - +---+---+---+---+================================+
|
| 593 | | - | LEN | NLEN |... LEN bytes of literal data...|
|
| 594 | | - +---+---+---+---+================================+
|
| 595 | | -
|
| 596 | | - LEN is the number of data bytes in the block. NLEN is the
|
| 597 | | - one's complement of LEN.
|
| 598 | | -
|
| 599 | | - 3.2.5. Compressed blocks (length and distance codes)
|
| 600 | | -
|
| 601 | | - As noted above, encoded data blocks in the "deflate" format
|
| 602 | | - consist of sequences of symbols drawn from three conceptually
|
| 603 | | - distinct alphabets: either literal bytes, from the alphabet of
|
| 604 | | - byte values (0..255), or <length, backward distance> pairs,
|
| 605 | | - where the length is drawn from (3..258) and the distance is
|
| 606 | | - drawn from (1..32,768). In fact, the literal and length
|
| 607 | | - alphabets are merged into a single alphabet (0..285), where
|
| 608 | | - values 0..255 represent literal bytes, the value 256 indicates
|
| 609 | | - end-of-block, and values 257..285 represent length codes
|
| 610 | | - (possibly in conjunction with extra bits following the symbol
|
| 611 | | - code) as follows:
|
| 612 | | -
|
| 613 | | -
|
| 614 | | -
|
| 615 | | -
|
| 616 | | -
|
| 617 | | -
|
| 618 | | -
|
| 619 | | -
|
| 620 | | -
|
| 621 | | -
|
| 622 | | -
|
| 623 | | -
|
| 624 | | -
|
| 625 | | -
|
| 626 | | -
|
| 627 | | -
|
| 628 | | -Deutsch Informational [Page 11]
|
| 629 | | - |
| 630 | | -
|
| 631 | | -RFC 1951 DEFLATE Compressed Data Format Specification May 1996
|
| 632 | | -
|
| 633 | | -
|
| 634 | | - Extra Extra Extra
|
| 635 | | - Code Bits Length(s) Code Bits Lengths Code Bits Length(s)
|
| 636 | | - ---- ---- ------ ---- ---- ------- ---- ---- -------
|
| 637 | | - 257 0 3 267 1 15,16 277 4 67-82
|
| 638 | | - 258 0 4 268 1 17,18 278 4 83-98
|
| 639 | | - 259 0 5 269 2 19-22 279 4 99-114
|
| 640 | | - 260 0 6 270 2 23-26 280 4 115-130
|
| 641 | | - 261 0 7 271 2 27-30 281 5 131-162
|
| 642 | | - 262 0 8 272 2 31-34 282 5 163-194
|
| 643 | | - 263 0 9 273 3 35-42 283 5 195-226
|
| 644 | | - 264 0 10 274 3 43-50 284 5 227-257
|
| 645 | | - 265 1 11,12 275 3 51-58 285 0 258
|
| 646 | | - 266 1 13,14 276 3 59-66
|
| 647 | | -
|
| 648 | | - The extra bits should be interpreted as a machine integer
|
| 649 | | - stored with the most-significant bit first, e.g., bits 1110
|
| 650 | | - represent the value 14.
|
| 651 | | -
|
| 652 | | - Extra Extra Extra
|
| 653 | | - Code Bits Dist Code Bits Dist Code Bits Distance
|
| 654 | | - ---- ---- ---- ---- ---- ------ ---- ---- --------
|
| 655 | | - 0 0 1 10 4 33-48 20 9 1025-1536
|
| 656 | | - 1 0 2 11 4 49-64 21 9 1537-2048
|
| 657 | | - 2 0 3 12 5 65-96 22 10 2049-3072
|
| 658 | | - 3 0 4 13 5 97-128 23 10 3073-4096
|
| 659 | | - 4 1 5,6 14 6 129-192 24 11 4097-6144
|
| 660 | | - 5 1 7,8 15 6 193-256 25 11 6145-8192
|
| 661 | | - 6 2 9-12 16 7 257-384 26 12 8193-12288
|
| 662 | | - 7 2 13-16 17 7 385-512 27 12 12289-16384
|
| 663 | | - 8 3 17-24 18 8 513-768 28 13 16385-24576
|
| 664 | | - 9 3 25-32 19 8 769-1024 29 13 24577-32768
|
| 665 | | -
|
| 666 | | - 3.2.6. Compression with fixed Huffman codes (BTYPE=01)
|
| 667 | | -
|
| 668 | | - The Huffman codes for the two alphabets are fixed, and are not
|
| 669 | | - represented explicitly in the data. The Huffman code lengths
|
| 670 | | - for the literal/length alphabet are:
|
| 671 | | -
|
| 672 | | - Lit Value Bits Codes
|
| 673 | | - --------- ---- -----
|
| 674 | | - 0 - 143 8 00110000 through
|
| 675 | | - 10111111
|
| 676 | | - 144 - 255 9 110010000 through
|
| 677 | | - 111111111
|
| 678 | | - 256 - 279 7 0000000 through
|
| 679 | | - 0010111
|
| 680 | | - 280 - 287 8 11000000 through
|
| 681 | | - 11000111
|
| 682 | | -
|
| 683 | | -
|
| 684 | | -
|
| 685 | | -Deutsch Informational [Page 12]
|
| 686 | | - |
| 687 | | -
|
| 688 | | -RFC 1951 DEFLATE Compressed Data Format Specification May 1996
|
| 689 | | -
|
| 690 | | -
|
| 691 | | - The code lengths are sufficient to generate the actual codes,
|
| 692 | | - as described above; we show the codes in the table for added
|
| 693 | | - clarity. Literal/length values 286-287 will never actually
|
| 694 | | - occur in the compressed data, but participate in the code
|
| 695 | | - construction.
|
| 696 | | -
|
| 697 | | - Distance codes 0-31 are represented by (fixed-length) 5-bit
|
| 698 | | - codes, with possible additional bits as shown in the table
|
| 699 | | - shown in Paragraph 3.2.5, above. Note that distance codes 30-
|
| 700 | | - 31 will never actually occur in the compressed data.
|
| 701 | | -
|
| 702 | | - 3.2.7. Compression with dynamic Huffman codes (BTYPE=10)
|
| 703 | | -
|
| 704 | | - The Huffman codes for the two alphabets appear in the block
|
| 705 | | - immediately after the header bits and before the actual
|
| 706 | | - compressed data, first the literal/length code and then the
|
| 707 | | - distance code. Each code is defined by a sequence of code
|
| 708 | | - lengths, as discussed in Paragraph 3.2.2, above. For even
|
| 709 | | - greater compactness, the code length sequences themselves are
|
| 710 | | - compressed using a Huffman code. The alphabet for code lengths
|
| 711 | | - is as follows:
|
| 712 | | -
|
| 713 | | - 0 - 15: Represent code lengths of 0 - 15
|
| 714 | | - 16: Copy the previous code length 3 - 6 times.
|
| 715 | | - The next 2 bits indicate repeat length
|
| 716 | | - (0 = 3, ... , 3 = 6)
|
| 717 | | - Example: Codes 8, 16 (+2 bits 11),
|
| 718 | | - 16 (+2 bits 10) will expand to
|
| 719 | | - 12 code lengths of 8 (1 + 6 + 5)
|
| 720 | | - 17: Repeat a code length of 0 for 3 - 10 times.
|
| 721 | | - (3 bits of length)
|
| 722 | | - 18: Repeat a code length of 0 for 11 - 138 times
|
| 723 | | - (7 bits of length)
|
| 724 | | -
|
| 725 | | - A code length of 0 indicates that the corresponding symbol in
|
| 726 | | - the literal/length or distance alphabet will not occur in the
|
| 727 | | - block, and should not participate in the Huffman code
|
| 728 | | - construction algorithm given earlier. If only one distance
|
| 729 | | - code is used, it is encoded using one bit, not zero bits; in
|
| 730 | | - this case there is a single code length of one, with one unused
|
| 731 | | - code. One distance code of zero bits means that there are no
|
| 732 | | - distance codes used at all (the data is all literals).
|
| 733 | | -
|
| 734 | | - We can now define the format of the block:
|
| 735 | | -
|
| 736 | | - 5 Bits: HLIT, # of Literal/Length codes - 257 (257 - 286)
|
| 737 | | - 5 Bits: HDIST, # of Distance codes - 1 (1 - 32)
|
| 738 | | - 4 Bits: HCLEN, # of Code Length codes - 4 (4 - 19)
|
| 739 | | -
|
| 740 | | -
|
| 741 | | -
|
| 742 | | -Deutsch Informational [Page 13]
|
| 743 | | - |
| 744 | | -
|
| 745 | | -RFC 1951 DEFLATE Compressed Data Format Specification May 1996
|
| 746 | | -
|
| 747 | | -
|
| 748 | | - (HCLEN + 4) x 3 bits: code lengths for the code length
|
| 749 | | - alphabet given just above, in the order: 16, 17, 18,
|
| 750 | | - 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15
|
| 751 | | -
|
| 752 | | - These code lengths are interpreted as 3-bit integers
|
| 753 | | - (0-7); as above, a code length of 0 means the
|
| 754 | | - corresponding symbol (literal/length or distance code
|
| 755 | | - length) is not used.
|
| 756 | | -
|
| 757 | | - HLIT + 257 code lengths for the literal/length alphabet,
|
| 758 | | - encoded using the code length Huffman code
|
| 759 | | -
|
| 760 | | - HDIST + 1 code lengths for the distance alphabet,
|
| 761 | | - encoded using the code length Huffman code
|
| 762 | | -
|
| 763 | | - The actual compressed data of the block,
|
| 764 | | - encoded using the literal/length and distance Huffman
|
| 765 | | - codes
|
| 766 | | -
|
| 767 | | - The literal/length symbol 256 (end of data),
|
| 768 | | - encoded using the literal/length Huffman code
|
| 769 | | -
|
| 770 | | - The code length repeat codes can cross from HLIT + 257 to the
|
| 771 | | - HDIST + 1 code lengths. In other words, all code lengths form
|
| 772 | | - a single sequence of HLIT + HDIST + 258 values.
|
| 773 | | -
|
| 774 | | - 3.3. Compliance
|
| 775 | | -
|
| 776 | | - A compressor may limit further the ranges of values specified in
|
| 777 | | - the previous section and still be compliant; for example, it may
|
| 778 | | - limit the range of backward pointers to some value smaller than
|
| 779 | | - 32K. Similarly, a compressor may limit the size of blocks so that
|
| 780 | | - a compressible block fits in memory.
|
| 781 | | -
|
| 782 | | - A compliant decompressor must accept the full range of possible
|
| 783 | | - values defined in the previous section, and must accept blocks of
|
| 784 | | - arbitrary size.
|
| 785 | | -
|
| 786 | | -4. Compression algorithm details
|
| 787 | | -
|
| 788 | | - While it is the intent of this document to define the "deflate"
|
| 789 | | - compressed data format without reference to any particular
|
| 790 | | - compression algorithm, the format is related to the compressed
|
| 791 | | - formats produced by LZ77 (Lempel-Ziv 1977, see reference [2] below);
|
| 792 | | - since many variations of LZ77 are patented, it is strongly
|
| 793 | | - recommended that the implementor of a compressor follow the general
|
| 794 | | - algorithm presented here, which is known not to be patented per se.
|
| 795 | | - The material in this section is not part of the definition of the
|
| 796 | | -
|
| 797 | | -
|
| 798 | | -
|
| 799 | | -Deutsch Informational [Page 14]
|
| 800 | | - |
| 801 | | -
|
| 802 | | -RFC 1951 DEFLATE Compressed Data Format Specification May 1996
|
| 803 | | -
|
| 804 | | -
|
| 805 | | - specification per se, and a compressor need not follow it in order to
|
| 806 | | - be compliant.
|
| 807 | | -
|
| 808 | | - The compressor terminates a block when it determines that starting a
|
| 809 | | - new block with fresh trees would be useful, or when the block size
|
| 810 | | - fills up the compressor's block buffer.
|
| 811 | | -
|
| 812 | | - The compressor uses a chained hash table to find duplicated strings,
|
| 813 | | - using a hash function that operates on 3-byte sequences. At any
|
| 814 | | - given point during compression, let XYZ be the next 3 input bytes to
|
| 815 | | - be examined (not necessarily all different, of course). First, the
|
| 816 | | - compressor examines the hash chain for XYZ. If the chain is empty,
|
| 817 | | - the compressor simply writes out X as a literal byte and advances one
|
| 818 | | - byte in the input. If the hash chain is not empty, indicating that
|
| 819 | | - the sequence XYZ (or, if we are unlucky, some other 3 bytes with the
|
| 820 | | - same hash function value) has occurred recently, the compressor
|
| 821 | | - compares all strings on the XYZ hash chain with the actual input data
|
| 822 | | - sequence starting at the current point, and selects the longest
|
| 823 | | - match.
|
| 824 | | -
|
| 825 | | - The compressor searches the hash chains starting with the most recent
|
| 826 | | - strings, to favor small distances and thus take advantage of the
|
| 827 | | - Huffman encoding. The hash chains are singly linked. There are no
|
| 828 | | - deletions from the hash chains; the algorithm simply discards matches
|
| 829 | | - that are too old. To avoid a worst-case situation, very long hash
|
| 830 | | - chains are arbitrarily truncated at a certain length, determined by a
|
| 831 | | - run-time parameter.
|
| 832 | | -
|
| 833 | | - To improve overall compression, the compressor optionally defers the
|
| 834 | | - selection of matches ("lazy matching"): after a match of length N has
|
| 835 | | - been found, the compressor searches for a longer match starting at
|
| 836 | | - the next input byte. If it finds a longer match, it truncates the
|
| 837 | | - previous match to a length of one (thus producing a single literal
|
| 838 | | - byte) and then emits the longer match. Otherwise, it emits the
|
| 839 | | - original match, and, as described above, advances N bytes before
|
| 840 | | - continuing.
|
| 841 | | -
|
| 842 | | - Run-time parameters also control this "lazy match" procedure. If
|
| 843 | | - compression ratio is most important, the compressor attempts a
|
| 844 | | - complete second search regardless of the length of the first match.
|
| 845 | | - In the normal case, if the current match is "long enough", the
|
| 846 | | - compressor reduces the search for a longer match, thus speeding up
|
| 847 | | - the process. If speed is most important, the compressor inserts new
|
| 848 | | - strings in the hash table only when no match was found, or when the
|
| 849 | | - match is not "too long". This degrades the compression ratio but
|
| 850 | | - saves time since there are both fewer insertions and fewer searches.
|
| 851 | | -
|
| 852 | | -
|
| 853 | | -
|
| 854 | | -
|
| 855 | | -
|
| 856 | | -Deutsch Informational [Page 15]
|
| 857 | | - |
| 858 | | -
|
| 859 | | -RFC 1951 DEFLATE Compressed Data Format Specification May 1996
|
| 860 | | -
|
| 861 | | -
|
| 862 | | -5. References
|
| 863 | | -
|
| 864 | | - [1] Huffman, D. A., "A Method for the Construction of Minimum
|
| 865 | | - Redundancy Codes", Proceedings of the Institute of Radio
|
| 866 | | - Engineers, September 1952, Volume 40, Number 9, pp. 1098-1101.
|
| 867 | | -
|
| 868 | | - [2] Ziv J., Lempel A., "A Universal Algorithm for Sequential Data
|
| 869 | | - Compression", IEEE Transactions on Information Theory, Vol. 23,
|
| 870 | | - No. 3, pp. 337-343.
|
| 871 | | -
|
| 872 | | - [3] Gailly, J.-L., and Adler, M., ZLIB documentation and sources,
|
| 873 | | - available in ftp://ftp.uu.net/pub/archiving/zip/doc/
|
| 874 | | -
|
| 875 | | - [4] Gailly, J.-L., and Adler, M., GZIP documentation and sources,
|
| 876 | | - available as gzip-*.tar in ftp://prep.ai.mit.edu/pub/gnu/
|
| 877 | | -
|
| 878 | | - [5] Schwartz, E. S., and Kallick, B. "Generating a canonical prefix
|
| 879 | | - encoding." Comm. ACM, 7,3 (Mar. 1964), pp. 166-169.
|
| 880 | | -
|
| 881 | | - [6] Hirschberg and Lelewer, "Efficient decoding of prefix codes,"
|
| 882 | | - Comm. ACM, 33,4, April 1990, pp. 449-459.
|
| 883 | | -
|
| 884 | | -6. Security Considerations
|
| 885 | | -
|
| 886 | | - Any data compression method involves the reduction of redundancy in
|
| 887 | | - the data. Consequently, any corruption of the data is likely to have
|
| 888 | | - severe effects and be difficult to correct. Uncompressed text, on
|
| 889 | | - the other hand, will probably still be readable despite the presence
|
| 890 | | - of some corrupted bytes.
|
| 891 | | -
|
| 892 | | - It is recommended that systems using this data format provide some
|
| 893 | | - means of validating the integrity of the compressed data. See
|
| 894 | | - reference [3], for example.
|
| 895 | | -
|
| 896 | | -7. Source code
|
| 897 | | -
|
| 898 | | - Source code for a C language implementation of a "deflate" compliant
|
| 899 | | - compressor and decompressor is available within the zlib package at
|
| 900 | | - ftp://ftp.uu.net/pub/archiving/zip/zlib/.
|
| 901 | | -
|
| 902 | | -8. Acknowledgements
|
| 903 | | -
|
| 904 | | - Trademarks cited in this document are the property of their
|
| 905 | | - respective owners.
|
| 906 | | -
|
| 907 | | - Phil Katz designed the deflate format. Jean-Loup Gailly and Mark
|
| 908 | | - Adler wrote the related software described in this specification.
|
| 909 | | - Glenn Randers-Pehrson converted this document to RFC and HTML format.
|
| 910 | | -
|
| 911 | | -
|
| 912 | | -
|
| 913 | | -Deutsch Informational [Page 16]
|
| 914 | | - |
| 915 | | -
|
| 916 | | -RFC 1951 DEFLATE Compressed Data Format Specification May 1996
|
| 917 | | -
|
| 918 | | -
|
| 919 | | -9. Author's Address
|
| 920 | | -
|
| 921 | | - L. Peter Deutsch
|
| 922 | | - Aladdin Enterprises
|
| 923 | | - 203 Santa Margarita Ave.
|
| 924 | | - Menlo Park, CA 94025
|
| 925 | | -
|
| 926 | | - Phone: (415) 322-0103 (AM only)
|
| 927 | | - FAX: (415) 322-1734
|
| 928 | | - EMail: <[email protected]>
|
| 929 | | -
|
| 930 | | - Questions about the technical content of this specification can be
|
| 931 | | - sent by email to:
|
| 932 | | -
|
| 933 | | - Jean-Loup Gailly <[email protected]> and
|
| 934 | | - Mark Adler <[email protected]>
|
| 935 | | -
|
| 936 | | - Editorial comments on this specification can be sent by email to:
|
| 937 | | -
|
| 938 | | - L. Peter Deutsch <[email protected]> and
|
| 939 | | - Glenn Randers-Pehrson <[email protected]>
|
| 940 | | -
|
| 941 | | -
|
| 942 | | -
|
| 943 | | -
|
| 944 | | -
|
| 945 | | -
|
| 946 | | -
|
| 947 | | -
|
| 948 | | -
|
| 949 | | -
|
| 950 | | -
|
| 951 | | -
|
| 952 | | -
|
| 953 | | -
|
| 954 | | -
|
| 955 | | -
|
| 956 | | -
|
| 957 | | -
|
| 958 | | -
|
| 959 | | -
|
| 960 | | -
|
| 961 | | -
|
| 962 | | -
|
| 963 | | -
|
| 964 | | -
|
| 965 | | -
|
| 966 | | -
|
| 967 | | -
|
| 968 | | -
|
| 969 | | -
|
| 970 | | -Deutsch Informational [Page 17]
|
| 971 | | - |
| 972 | | -
|