Fossil SCM

fossil-scm / www / image-format-vs-repo-size.md
1
# Image Format vs Fossil Repo Size
2
3
## The Problem
4
5
Fossil has a [delta compression][dc] feature which removes redundant
6
information from a file relative to its parent on check-in.[^delta-prgs]
7
That delta is then [zlib][zl]-compressed before being stored
8
in the Fossil repository database file.
9
10
Storing pre-compressed data files in a Fossil repository defeats both of
11
these space-saving measures:
12
13
1. Binary data compression algorithms turn the file data into
14
pseudorandom noise.[^prn]
15
16
Typical data compression algorithms are not [hash functions][hf],
17
where the goal is that a change to each bit in the input has a
18
statistically even chance of changing every bit in the output, but
19
because they do approach that pathological condition, pre-compressed
20
data tends to defeat Fossil’s delta compression algorithm, there
21
being so little correlation between two different outputs from the
22
binary data compression algorithm.
23
24
2. An ideal lossless binary data compression algorithm cannot be
25
applied more than once to make the data even smaller, since random
26
noise is incompressible. The consequence for our purposes here is
27
that pre-compressed data doesn’t benefit from Fossil’s zlib
28
compression.
29
30
You might then ask, what does it matter if the space savings comes from
31
the application file format (e.g. JPEG, DOCX, Zip, etc.) or from Fossil
32
itself? It really doesn’t, as far as point 2 above goes, but point 1
33
causes the Fossil repository to balloon out of proportion to the size of
34
the input data change on each checkin. This article will illustrate that
35
problem, quantify it, and give a solution to it.
36
37
[dc]: ./delta_format.wiki
38
[hf]: https://en.wikipedia.org/wiki/Hash_function
39
[zl]: http://www.zlib.net/
40
41
42
## <a id="formats"></a>Affected File Formats
43
44
In this article’s core experiment, we use 2D image file formats, but
45
this article’s advice also applies to many other file types. For just a
46
few examples out of what must be thousands:
47
48
* **Microsoft Office**: The [OOXML document format][oox] used from
49
Office 2003 onward (`.docx`, `.xlsx`, `.pptx`, etc.) are Zip files
50
containing an XML document file and several collateral files.
51
52
* **Libre Office**: For the purposes of this article, its
53
[OpenDocument Format][odf] is designed the same basic way as OOXML.
54
55
* **Java**: A Java [`.jar` file][jcl] is a Zip file containing JVM
56
`.class` files, manifest files, and more.
57
58
* **Windows Installer:** An [`*.msi` file][wi] is a proprietary
59
database format that contains, among other things, [Microsoft
60
Cabinet][cab]-compressed files, which in turn may hold Windows
61
executables, which [may themselves be compressed][exc].
62
63
* **SVG, PDF, TIFF, etc.**: Many file formats are available in both
64
compressed and uncompressed forms. You should use the uncompressed
65
form with Fossil wherever practical, as we will show below.
66
67
68
[cab]: https://en.wikipedia.org/wiki/Cabinet_(file_format)
69
[exc]: https://en.wikipedia.org/wiki/Executable_compression
70
[jcl]: https://en.wikipedia.org/wiki/Java_(programming_language)
71
[odf]: https://en.wikipedia.org/wiki/OpenDocument
72
[oox]: https://en.wikipedia.org/wiki/Office_Open_XML
73
[wi]: https://en.wikipedia.org/wiki/Windows_Installer
74
75
76
77
## <a id="demo"></a>Demonstration
78
79
The companion `image-format-vs-repo-size.ipynb` file ([download][nbd],
80
[preview][nbp]) is a [JupyterLab][jl] notebook implementing the following
81
experiment:
82
83
1. Create a new minimum-size Fossil repository. Save this initial size.
84
85
2. Use [ImageMagick][im] via [Wand][wp] to generate a JPEG file of a
86
particular size — currently 256 px² — filled with Gaussian noise to
87
make data compression more difficult than with a solid-color image.
88
89
3. Check that image into the new Fossil repo, and remember that size.
90
91
4. Change a random pixel in the image to a random RGB value, save that
92
image, check it in, and remember the new Fossil repo size.
93
94
5. Iterate on step 4 some number of times — currently 10 — and remember
95
the Fossil repo size at each step.
96
97
6. Repeat the above steps for BMP, PNG, and TIFF.[^tiff-cmp]
98
99
7. Create a bar chart showing how the Fossil repository size changes
100
with each checkin.
101
102
We chose to use JupyterLab for this because it makes it easy for you to
103
modify the notebook to try different things. Want to see how the
104
results change with a different image size? Easy, change the `size`
105
value in the second cell of the notebook. Want to try more image
106
formats? You can put anything ImageMagick can recognize into the
107
`formats` list. Want to find the break-even point for images like those
108
in your own repository? Easily done with a small amount of code.
109
110
[im]: https://www.imagemagick.org/
111
[jl]: https://jupyter.org/
112
[nbd]: ./image-format-vs-repo-size.ipynb
113
[nbp]: https://nbviewer.jupyter.org/urls/fossil-scm.org/fossil/doc/trunk/www/image-format-vs-repo-size.ipynb
114
[wp]: http://wand-py.org/
115
116
117
## <a id="results"></a>Results
118
119
Running the notebook gives a bar chart something like[^variance] this:
120
121
![results bar chart](./image-format-vs-repo-size.svg)
122
123
There are a few key things we want to draw your attention to in that
124
chart:
125
126
* BMP and uncompressed TIFF are nearly identical in size for all
127
checkins, and the repository growth rate is negligible past the
128
first commit.[^size-jump] We owe this economy to Fossil’s delta compression
129
feature: it is encoding each of those single-pixel changes in a very
130
small amount of repository space.
131
132
* The JPEG and PNG bars increase by large amounts on most checkins
133
even though each checkin *also* encodes only a *single-pixel change*.
134
135
* The size of the first checkin in the BMP and TIFF cases is roughly
136
the same as that for the PNG case, because both PNG and Fossil use
137
the zlib binary data compression algorithm. This shows that for
138
repos where the image files are committed only once, there is
139
virtually no penalty to using BMP or TIFF over PNG. The file sizes
140
likely differ only because of differences in zlib settings between
141
the cases.
142
143
* Because JPEG’s lossy nature allows it to start smaller and have
144
smaller size increases than PNG, the crossover point with
145
BMP/TIFF isn’t until 7-9 checkins in typical runs of this [Monte
146
Carlo experiment][mce]. Given a choice among these four file
147
formats and a willingness to use lossy image compression, a rational
148
tradeoff is to choose JPEG for repositories where each image will
149
change fewer than that number of times.
150
151
152
[mce]: https://en.wikipedia.org/wiki/Monte_Carlo_method
153
154
155
## <a id="makefile"></a>Automated Recompression
156
157
Since programs that produce and consume binary-compressed data files
158
often make it either difficult or impossible to work with the
159
uncompressed form, we want an automated method for producing the
160
uncompressed form to make Fossil happy while still having the compressed
161
form to keep our content creation applications happy. This `Makefile`
162
should[^makefile] do that for BMP, PNG, SVG, and XLSX files:
163
164
.SUFFIXES: .bmp .png .svg .svgz
165
166
.svgz.svg:
167
gzip -dc < $< > $@
168
169
.svg.svgz:
170
gzip -9c < $< > $@
171
172
.bmp.png:
173
convert -quality 95 $< $@
174
175
.png.bmp:
176
convert $< $@
177
178
SS_FILES := $(wildcard spreadsheet/*)
179
180
181
all: $(SS_FILES) illus.svg image.bmp doc-big.pdf
182
183
reconstitute: illus.svgz image.png
184
( cd spreadsheet ; zip -9 ../spreadsheet.xlsx) * )
185
qpdf doc-big.pdf doc-small.pdf
186
187
188
$(SS_FILES): spreadsheet.xlsx
189
unzip $@ -d $<
190
191
doc-big.pdf: doc-small.pdf
192
qpdf --stream-data=uncompress $@ $<
193
194
This `Makefile` allows you to treat the compressed version as the
195
process input, but to actually check in only the changes against the
196
uncompressed version by typing “`make`” before “`fossil ci`”. This is
197
not actually an extra step in practice, since if you’ve got a
198
`Makefile`-based project, you should be building — and testing — it
199
before checking each change in anyway!
200
201
Because this technique is based on dependency rules, only the necessary
202
files are generated on each `make` command.
203
204
You only have to run “`make reconstitute`” *once* after opening a fresh
205
Fossil checkout to produce those compressed sources. After that, you
206
work with the compressed files in your content creation programs. Your
207
build system might include some kind of bootstrapping or
208
auto-configuration step that you could attach this to, so that it
209
doesn’t need to be run by hand.
210
211
This `Makefile` illustrates two primary strategies:
212
213
214
### Input and Output File Formats Differ by Extension
215
216
In the case of SVG and the bitmap image formats, the file name extension
217
differs between the cases, so we can use `make` suffix rules to get the
218
behavior we want. The top half of the `Makefile` just tells `make` how
219
to map from `*.svg` to `*.svgz` and vice versa, and the same for `*.bmp`
220
to/from `*.png`.
221
222
223
### Input and Output Use the Same Extension
224
225
We don’t have that luxury for Excel and PDF files, each for a different
226
reason:
227
228
* **Excel:** Excel has no way to work with the unpacked Zip file
229
contents at all, so we have to unpack it into a subdirectory, which
230
is what we check into Fossil. On making a fresh Fossil checkout, we
231
have to pack that subdirectory’s contents back up into an `*.xlsx`
232
file with “`make reconstitute`” so we can edit it with Excel again.
233
234
* **PDF:** All PDF readers can display an uncompressed PDF file, but
235
many PDF-*producing* programs have no option for uncompressed
236
output. Since the file name extension is the same either way, we
237
treat the compressed PDF as the source to the process, yielding an
238
automatically-uncompressed PDF for the benefit of Fossil. Unlike
239
with the Excel case, there is no simple “file base name to directory
240
name” mapping, so we just created the `-big` to `-small` name scheme
241
here.
242
243
244
[^delta-prgs]:
245
This problem is not Fossil-specific. Several other programs also do
246
delta compression, so they’ll also be affected by this problem:
247
[rsync][rs], [Unison][us], [Git][git], etc. You should take this
248
article’s advice when using all such programs, not just Fossil.
249
When using file copying and synchronization programs *without* delta
250
compression, on the other hand, it’s best to use the most
251
highly-compressed file format you can tolerate, since they copy the
252
whole file any time any bit of it changes.
253
254
[^prn]:
255
In fact, a good way to gauge the effectiveness of a given
256
compression scheme is to run its output through the same sort of
257
tests we use to gauge how “random” a given [PRNG][prng] is. Another
258
way to look at it is that if there is a discernible pattern in the
259
output of a compression scheme, that constitutes *information* (in
260
[the technical sense of that word][ith]) that could be further
261
compressed.
262
263
[^tiff-cmp]:
264
We're using *uncompressed* TIFF here, not [LZW][lzw]- or
265
Zip-compressed TIFF, either of which would give similar results to
266
PNG, which is always zlib-compressed.
267
268
[^variance]:
269
The raw data changes somewhat from one run to the next due to the
270
use of random noise in the image to make the zlib/PNG compression
271
more difficult, and the random pixel changes. Those test design
272
choices make this a [Monte Carlo experiment][mce]. We’ve found that
273
the overall character of the results doesn’t change from one run to
274
the next.
275
276
[^size-jump]:
277
It’s not clear to me why there is a one-time jump in size for BMP
278
and TIFF past the first commit. I suspect it is due to the SQLite
279
indices being initialized for the first time.
280
Page size inflation might have something to do with it as well,
281
though we tried to control that by rebuilding the initial DB with a
282
minimal page size. If you re-run the program often enough, you will
283
sometimes see the BMP or TIFF bar jump higher than the other, again
284
likely due to one of the repos crossing a page boundary.
285
Another curious artifact in the data is that the BMP is slightly
286
larger than for the TIFF. This goes against expectation because a
287
low-tech format like BMP should have a small edge in this test
288
because TIFF metadata includes the option for multiple timestamps,
289
UUIDs, etc., which bloat the checkin size by creating many small
290
deltas.
291
292
[^makefile]:
293
The `Makefile` above is not battle-tested. Please report bugs and
294
needed extensions [on the forum][for].
295
296
[for]: https://fossil-scm.org/forum/forumpost/15e677f2c8
297
[git]: https://git-scm.com/
298
[ith]: https://en.wikipedia.org/wiki/Information_theory
299
[lzw]: https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch
300
[prng]: https://en.wikipedia.org/wiki/Pseudorandom_number_generator
301
[rs]: https://rsync.samba.org/
302
[us]: http://www.cis.upenn.edu/~bcpierce/unison/
303

Keyboard Shortcuts

Open search /
Next entry (timeline) j
Previous entry (timeline) k
Open focused entry Enter
Show this help ?
Toggle theme Top nav button