Fossil SCM

fossil-scm / www / image-format-vs-repo-size.md
Source Blame History 302 lines
05f95db… wyoung 1 # Image Format vs Fossil Repo Size
05f95db… wyoung 2
05f95db… wyoung 3 ## The Problem
05f95db… wyoung 4
05f95db… wyoung 5 Fossil has a [delta compression][dc] feature which removes redundant
389e3fb… wyoung 6 information from a file relative to its parent on check-in.[^delta-prgs]
f281b3d… wyoung 7 That delta is then [zlib][zl]-compressed before being stored
41e5237… wyoung 8 in the Fossil repository database file.
41e5237… wyoung 9
41e5237… wyoung 10 Storing pre-compressed data files in a Fossil repository defeats both of
41e5237… wyoung 11 these space-saving measures:
41e5237… wyoung 12
f281b3d… wyoung 13 1. Binary data compression algorithms turn the file data into
389e3fb… wyoung 14 pseudorandom noise.[^prn]
fd1282e… drh 15
41e5237… wyoung 16 Typical data compression algorithms are not [hash functions][hf],
41e5237… wyoung 17 where the goal is that a change to each bit in the input has a
41e5237… wyoung 18 statistically even chance of changing every bit in the output, but
41e5237… wyoung 19 because they do approach that pathological condition, pre-compressed
41e5237… wyoung 20 data tends to defeat Fossil’s delta compression algorithm, there
41e5237… wyoung 21 being so little correlation between two different outputs from the
41e5237… wyoung 22 binary data compression algorithm.
05f95db… wyoung 23
05f95db… wyoung 24 2. An ideal lossless binary data compression algorithm cannot be
05f95db… wyoung 25 applied more than once to make the data even smaller, since random
05f95db… wyoung 26 noise is incompressible. The consequence for our purposes here is
05f95db… wyoung 27 that pre-compressed data doesn’t benefit from Fossil’s zlib
05f95db… wyoung 28 compression.
05f95db… wyoung 29
41e5237… wyoung 30 You might then ask, what does it matter if the space savings comes from
f281b3d… wyoung 31 the application file format (e.g. JPEG, DOCX, Zip, etc.) or from Fossil
41e5237… wyoung 32 itself? It really doesn’t, as far as point 2 above goes, but point 1
41e5237… wyoung 33 causes the Fossil repository to balloon out of proportion to the size of
41e5237… wyoung 34 the input data change on each checkin. This article will illustrate that
41e5237… wyoung 35 problem, quantify it, and give a solution to it.
41e5237… wyoung 36
05f95db… wyoung 37 [dc]: ./delta_format.wiki
05f95db… wyoung 38 [hf]: https://en.wikipedia.org/wiki/Hash_function
05f95db… wyoung 39 [zl]: http://www.zlib.net/
05f95db… wyoung 40
05f95db… wyoung 41
7de2410… wyoung 42 ## <a id="formats"></a>Affected File Formats
41e5237… wyoung 43
41e5237… wyoung 44 In this article’s core experiment, we use 2D image file formats, but
41e5237… wyoung 45 this article’s advice also applies to many other file types. For just a
41e5237… wyoung 46 few examples out of what must be thousands:
41e5237… wyoung 47
41e5237… wyoung 48 * **Microsoft Office**: The [OOXML document format][oox] used from
41e5237… wyoung 49 Office 2003 onward (`.docx`, `.xlsx`, `.pptx`, etc.) are Zip files
41e5237… wyoung 50 containing an XML document file and several collateral files.
41e5237… wyoung 51
f281b3d… wyoung 52 * **Libre Office**: For the purposes of this article, its
f281b3d… wyoung 53 [OpenDocument Format][odf] is designed the same basic way as OOXML.
41e5237… wyoung 54
41e5237… wyoung 55 * **Java**: A Java [`.jar` file][jcl] is a Zip file containing JVM
05f95db… wyoung 56 `.class` files, manifest files, and more.
05f95db… wyoung 57
41e5237… wyoung 58 * **Windows Installer:** An [`*.msi` file][wi] is a proprietary
05f95db… wyoung 59 database format that contains, among other things, [Microsoft
05f95db… wyoung 60 Cabinet][cab]-compressed files, which in turn may hold Windows
05f95db… wyoung 61 executables, which [may themselves be compressed][exc].
05f95db… wyoung 62
41e5237… wyoung 63 * **SVG, PDF, TIFF, etc.**: Many file formats are available in both
41e5237… wyoung 64 compressed and uncompressed forms. You should use the uncompressed
41e5237… wyoung 65 form with Fossil wherever practical, as we will show below.
05f95db… wyoung 66
05f95db… wyoung 67
05f95db… wyoung 68 [cab]: https://en.wikipedia.org/wiki/Cabinet_(file_format)
05f95db… wyoung 69 [exc]: https://en.wikipedia.org/wiki/Executable_compression
05f95db… wyoung 70 [jcl]: https://en.wikipedia.org/wiki/Java_(programming_language)
05f95db… wyoung 71 [odf]: https://en.wikipedia.org/wiki/OpenDocument
05f95db… wyoung 72 [oox]: https://en.wikipedia.org/wiki/Office_Open_XML
05f95db… wyoung 73 [wi]: https://en.wikipedia.org/wiki/Windows_Installer
05f95db… wyoung 74
05f95db… wyoung 75
05f95db… wyoung 76
7de2410… wyoung 77 ## <a id="demo"></a>Demonstration
05f95db… wyoung 78
f52d63e… wyoung 79 The companion `image-format-vs-repo-size.ipynb` file ([download][nbd],
f281b3d… wyoung 80 [preview][nbp]) is a [JupyterLab][jl] notebook implementing the following
f52d63e… wyoung 81 experiment:
05f95db… wyoung 82
f281b3d… wyoung 83 1. Create a new minimum-size Fossil repository. Save this initial size.
05f95db… wyoung 84
05f95db… wyoung 85 2. Use [ImageMagick][im] via [Wand][wp] to generate a JPEG file of a
05f95db… wyoung 86 particular size — currently 256 px² — filled with Gaussian noise to
41e5237… wyoung 87 make data compression more difficult than with a solid-color image.
05f95db… wyoung 88
05f95db… wyoung 89 3. Check that image into the new Fossil repo, and remember that size.
05f95db… wyoung 90
05f95db… wyoung 91 4. Change a random pixel in the image to a random RGB value, save that
05f95db… wyoung 92 image, check it in, and remember the new Fossil repo size.
05f95db… wyoung 93
05f95db… wyoung 94 5. Iterate on step 4 some number of times — currently 10 — and remember
05f95db… wyoung 95 the Fossil repo size at each step.
05f95db… wyoung 96
389e3fb… wyoung 97 6. Repeat the above steps for BMP, PNG, and TIFF.[^tiff-cmp]
05f95db… wyoung 98
05f95db… wyoung 99 7. Create a bar chart showing how the Fossil repository size changes
05f95db… wyoung 100 with each checkin.
05f95db… wyoung 101
f281b3d… wyoung 102 We chose to use JupyterLab for this because it makes it easy for you to
05f95db… wyoung 103 modify the notebook to try different things. Want to see how the
05f95db… wyoung 104 results change with a different image size? Easy, change the `size`
05f95db… wyoung 105 value in the second cell of the notebook. Want to try more image
05f95db… wyoung 106 formats? You can put anything ImageMagick can recognize into the
05f95db… wyoung 107 `formats` list. Want to find the break-even point for images like those
79c2cb0… wyoung 108 in your own repository? Easily done with a small amount of code.
f52d63e… wyoung 109
f52d63e… wyoung 110 [im]: https://www.imagemagick.org/
f281b3d… wyoung 111 [jl]: https://jupyter.org/
f52d63e… wyoung 112 [nbd]: ./image-format-vs-repo-size.ipynb
f52d63e… wyoung 113 [nbp]: https://nbviewer.jupyter.org/urls/fossil-scm.org/fossil/doc/trunk/www/image-format-vs-repo-size.ipynb
f52d63e… wyoung 114 [wp]: http://wand-py.org/
f52d63e… wyoung 115
f52d63e… wyoung 116
7de2410… wyoung 117 ## <a id="results"></a>Results
f52d63e… wyoung 118
389e3fb… wyoung 119 Running the notebook gives a bar chart something like[^variance] this:
05f95db… wyoung 120
05f95db… wyoung 121 ![results bar chart](./image-format-vs-repo-size.svg)
05f95db… wyoung 122
41e5237… wyoung 123 There are a few key things we want to draw your attention to in that
41e5237… wyoung 124 chart:
05f95db… wyoung 125
05f95db… wyoung 126 * BMP and uncompressed TIFF are nearly identical in size for all
f281b3d… wyoung 127 checkins, and the repository growth rate is negligible past the
389e3fb… wyoung 128 first commit.[^size-jump] We owe this economy to Fossil’s delta compression
f281b3d… wyoung 129 feature: it is encoding each of those single-pixel changes in a very
f281b3d… wyoung 130 small amount of repository space.
f281b3d… wyoung 131
309af34… andygoth 132 * The JPEG and PNG bars increase by large amounts on most checkins
f281b3d… wyoung 133 even though each checkin *also* encodes only a *single-pixel change*.
41e5237… wyoung 134
f281b3d… wyoung 135 * The size of the first checkin in the BMP and TIFF cases is roughly
f281b3d… wyoung 136 the same as that for the PNG case, because both PNG and Fossil use
f281b3d… wyoung 137 the zlib binary data compression algorithm. This shows that for
f281b3d… wyoung 138 repos where the image files are committed only once, there is
f281b3d… wyoung 139 virtually no penalty to using BMP or TIFF over PNG. The file sizes
f281b3d… wyoung 140 likely differ only because of differences in zlib settings between
f281b3d… wyoung 141 the cases.
05f95db… wyoung 142
05f95db… wyoung 143 * Because JPEG’s lossy nature allows it to start smaller and have
e755561… danield 144 smaller size increases than PNG, the crossover point with
41e5237… wyoung 145 BMP/TIFF isn’t until 7-9 checkins in typical runs of this [Monte
41e5237… wyoung 146 Carlo experiment][mce]. Given a choice among these four file
41e5237… wyoung 147 formats and a willingness to use lossy image compression, a rational
41e5237… wyoung 148 tradeoff is to choose JPEG for repositories where each image will
41e5237… wyoung 149 change fewer than that number of times.
41e5237… wyoung 150
f281b3d… wyoung 151
05f95db… wyoung 152 [mce]: https://en.wikipedia.org/wiki/Monte_Carlo_method
05f95db… wyoung 153
05f95db… wyoung 154
7de2410… wyoung 155 ## <a id="makefile"></a>Automated Recompression
05f95db… wyoung 156
05f95db… wyoung 157 Since programs that produce and consume binary-compressed data files
05f95db… wyoung 158 often make it either difficult or impossible to work with the
05f95db… wyoung 159 uncompressed form, we want an automated method for producing the
05f95db… wyoung 160 uncompressed form to make Fossil happy while still having the compressed
05f95db… wyoung 161 form to keep our content creation applications happy. This `Makefile`
389e3fb… wyoung 162 should[^makefile] do that for BMP, PNG, SVG, and XLSX files:
389e3fb… wyoung 163
8a1ba49… wyoung 164 .SUFFIXES: .bmp .png .svg .svgz
8a1ba49… wyoung 165
8a1ba49… wyoung 166 .svgz.svg:
8a1ba49… wyoung 167 gzip -dc < $< > $@
8a1ba49… wyoung 168
8a1ba49… wyoung 169 .svg.svgz:
8a1ba49… wyoung 170 gzip -9c < $< > $@
8a1ba49… wyoung 171
8a1ba49… wyoung 172 .bmp.png:
8a1ba49… wyoung 173 convert -quality 95 $< $@
8a1ba49… wyoung 174
8a1ba49… wyoung 175 .png.bmp:
8a1ba49… wyoung 176 convert $< $@
8a1ba49… wyoung 177
8a1ba49… wyoung 178 SS_FILES := $(wildcard spreadsheet/*)
8a1ba49… wyoung 179
8a1ba49… wyoung 180
8a1ba49… wyoung 181 all: $(SS_FILES) illus.svg image.bmp doc-big.pdf
8a1ba49… wyoung 182
8a1ba49… wyoung 183 reconstitute: illus.svgz image.png
8a1ba49… wyoung 184 ( cd spreadsheet ; zip -9 ../spreadsheet.xlsx) * )
8a1ba49… wyoung 185 qpdf doc-big.pdf doc-small.pdf
8a1ba49… wyoung 186
8a1ba49… wyoung 187
8a1ba49… wyoung 188 $(SS_FILES): spreadsheet.xlsx
8a1ba49… wyoung 189 unzip $@ -d $<
8a1ba49… wyoung 190
8a1ba49… wyoung 191 doc-big.pdf: doc-small.pdf
8a1ba49… wyoung 192 qpdf --stream-data=uncompress $@ $<
05f95db… wyoung 193
05f95db… wyoung 194 This `Makefile` allows you to treat the compressed version as the
05f95db… wyoung 195 process input, but to actually check in only the changes against the
41e5237… wyoung 196 uncompressed version by typing “`make`” before “`fossil ci`”. This is
41e5237… wyoung 197 not actually an extra step in practice, since if you’ve got a
389e3fb… wyoung 198 `Makefile`-based project, you should be building — and testing — it
41e5237… wyoung 199 before checking each change in anyway!
05f95db… wyoung 200
41e5237… wyoung 201 Because this technique is based on dependency rules, only the necessary
41e5237… wyoung 202 files are generated on each `make` command.
05f95db… wyoung 203
05f95db… wyoung 204 You only have to run “`make reconstitute`” *once* after opening a fresh
05f95db… wyoung 205 Fossil checkout to produce those compressed sources. After that, you
41e5237… wyoung 206 work with the compressed files in your content creation programs. Your
41e5237… wyoung 207 build system might include some kind of bootstrapping or
41e5237… wyoung 208 auto-configuration step that you could attach this to, so that it
41e5237… wyoung 209 doesn’t need to be run by hand.
05f95db… wyoung 210
41e5237… wyoung 211 This `Makefile` illustrates two primary strategies:
05f95db… wyoung 212
05f95db… wyoung 213
79c2cb0… wyoung 214 ### Input and Output File Formats Differ by Extension
05f95db… wyoung 215
05f95db… wyoung 216 In the case of SVG and the bitmap image formats, the file name extension
05f95db… wyoung 217 differs between the cases, so we can use `make` suffix rules to get the
05f95db… wyoung 218 behavior we want. The top half of the `Makefile` just tells `make` how
05f95db… wyoung 219 to map from `*.svg` to `*.svgz` and vice versa, and the same for `*.bmp`
05f95db… wyoung 220 to/from `*.png`.
05f95db… wyoung 221
05f95db… wyoung 222
41e5237… wyoung 223 ### Input and Output Use the Same Extension
05f95db… wyoung 224
41e5237… wyoung 225 We don’t have that luxury for Excel and PDF files, each for a different
41e5237… wyoung 226 reason:
05f95db… wyoung 227
05f95db… wyoung 228 * **Excel:** Excel has no way to work with the unpacked Zip file
05f95db… wyoung 229 contents at all, so we have to unpack it into a subdirectory, which
05f95db… wyoung 230 is what we check into Fossil. On making a fresh Fossil checkout, we
05f95db… wyoung 231 have to pack that subdirectory’s contents back up into an `*.xlsx`
05f95db… wyoung 232 file with “`make reconstitute`” so we can edit it with Excel again.
05f95db… wyoung 233
05f95db… wyoung 234 * **PDF:** All PDF readers can display an uncompressed PDF file, but
05f95db… wyoung 235 many PDF-*producing* programs have no option for uncompressed
05f95db… wyoung 236 output. Since the file name extension is the same either way, we
05f95db… wyoung 237 treat the compressed PDF as the source to the process, yielding an
05f95db… wyoung 238 automatically-uncompressed PDF for the benefit of Fossil. Unlike
05f95db… wyoung 239 with the Excel case, there is no simple “file base name to directory
05f95db… wyoung 240 name” mapping, so we just created the `-big` to `-small` name scheme
05f95db… wyoung 241 here.
05f95db… wyoung 242
f281b3d… wyoung 243
389e3fb… wyoung 244 [^delta-prgs]:
389e3fb… wyoung 245 This problem is not Fossil-specific. Several other programs also do
f281b3d… wyoung 246 delta compression, so they’ll also be affected by this problem:
f281b3d… wyoung 247 [rsync][rs], [Unison][us], [Git][git], etc. You should take this
f281b3d… wyoung 248 article’s advice when using all such programs, not just Fossil.
f281b3d… wyoung 249 When using file copying and synchronization programs *without* delta
f281b3d… wyoung 250 compression, on the other hand, it’s best to use the most
f281b3d… wyoung 251 highly-compressed file format you can tolerate, since they copy the
f281b3d… wyoung 252 whole file any time any bit of it changes.
f281b3d… wyoung 253
389e3fb… wyoung 254 [^prn]:
389e3fb… wyoung 255 In fact, a good way to gauge the effectiveness of a given
41e5237… wyoung 256 compression scheme is to run its output through the same sort of
41e5237… wyoung 257 tests we use to gauge how “random” a given [PRNG][prng] is. Another
41e5237… wyoung 258 way to look at it is that if there is a discernible pattern in the
f281b3d… wyoung 259 output of a compression scheme, that constitutes *information* (in
f281b3d… wyoung 260 [the technical sense of that word][ith]) that could be further
f281b3d… wyoung 261 compressed.
41e5237… wyoung 262
389e3fb… wyoung 263 [^tiff-cmp]:
389e3fb… wyoung 264 We're using *uncompressed* TIFF here, not [LZW][lzw]- or
05f95db… wyoung 265 Zip-compressed TIFF, either of which would give similar results to
05f95db… wyoung 266 PNG, which is always zlib-compressed.
05f95db… wyoung 267
389e3fb… wyoung 268 [^variance]:
389e3fb… wyoung 269 The raw data changes somewhat from one run to the next due to the
05f95db… wyoung 270 use of random noise in the image to make the zlib/PNG compression
05f95db… wyoung 271 more difficult, and the random pixel changes. Those test design
79c2cb0… wyoung 272 choices make this a [Monte Carlo experiment][mce]. We’ve found that
41e5237… wyoung 273 the overall character of the results doesn’t change from one run to
41e5237… wyoung 274 the next.
41e5237… wyoung 275
389e3fb… wyoung 276 [^size-jump]:
389e3fb… wyoung 277 It’s not clear to me why there is a one-time jump in size for BMP
f281b3d… wyoung 278 and TIFF past the first commit. I suspect it is due to the SQLite
f281b3d… wyoung 279 indices being initialized for the first time.
f281b3d… wyoung 280 Page size inflation might have something to do with it as well,
f281b3d… wyoung 281 though we tried to control that by rebuilding the initial DB with a
f281b3d… wyoung 282 minimal page size. If you re-run the program often enough, you will
f281b3d… wyoung 283 sometimes see the BMP or TIFF bar jump higher than the other, again
f281b3d… wyoung 284 likely due to one of the repos crossing a page boundary.
f281b3d… wyoung 285 Another curious artifact in the data is that the BMP is slightly
f281b3d… wyoung 286 larger than for the TIFF. This goes against expectation because a
f281b3d… wyoung 287 low-tech format like BMP should have a small edge in this test
41e5237… wyoung 288 because TIFF metadata includes the option for multiple timestamps,
41e5237… wyoung 289 UUIDs, etc., which bloat the checkin size by creating many small
f281b3d… wyoung 290 deltas.
82ad591… wyoung 291
389e3fb… wyoung 292 [^makefile]:
389e3fb… wyoung 293 The `Makefile` above is not battle-tested. Please report bugs and
82ad591… wyoung 294 needed extensions [on the forum][for].
82ad591… wyoung 295
41e5237… wyoung 296 [for]: https://fossil-scm.org/forum/forumpost/15e677f2c8
41e5237… wyoung 297 [git]: https://git-scm.com/
f281b3d… wyoung 298 [ith]: https://en.wikipedia.org/wiki/Information_theory
41e5237… wyoung 299 [lzw]: https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch
41e5237… wyoung 300 [prng]: https://en.wikipedia.org/wiki/Pseudorandom_number_generator
41e5237… wyoung 301 [rs]: https://rsync.samba.org/
41e5237… wyoung 302 [us]: http://www.cis.upenn.edu/~bcpierce/unison/

Keyboard Shortcuts

Open search /
Next entry (timeline) j
Previous entry (timeline) k
Open focused entry Enter
Show this help ?
Toggle theme Top nav button