|
1
|
# Image Format vs Fossil Repo Size |
|
2
|
|
|
3
|
## The Problem |
|
4
|
|
|
5
|
Fossil has a [delta compression][dc] feature which removes redundant |
|
6
|
information from a file relative to its parent on check-in.[^delta-prgs] |
|
7
|
That delta is then [zlib][zl]-compressed before being stored |
|
8
|
in the Fossil repository database file. |
|
9
|
|
|
10
|
Storing pre-compressed data files in a Fossil repository defeats both of |
|
11
|
these space-saving measures: |
|
12
|
|
|
13
|
1. Binary data compression algorithms turn the file data into |
|
14
|
pseudorandom noise.[^prn] |
|
15
|
|
|
16
|
Typical data compression algorithms are not [hash functions][hf], |
|
17
|
where the goal is that a change to each bit in the input has a |
|
18
|
statistically even chance of changing every bit in the output, but |
|
19
|
because they do approach that pathological condition, pre-compressed |
|
20
|
data tends to defeat Fossil’s delta compression algorithm, there |
|
21
|
being so little correlation between two different outputs from the |
|
22
|
binary data compression algorithm. |
|
23
|
|
|
24
|
2. An ideal lossless binary data compression algorithm cannot be |
|
25
|
applied more than once to make the data even smaller, since random |
|
26
|
noise is incompressible. The consequence for our purposes here is |
|
27
|
that pre-compressed data doesn’t benefit from Fossil’s zlib |
|
28
|
compression. |
|
29
|
|
|
30
|
You might then ask, what does it matter if the space savings comes from |
|
31
|
the application file format (e.g. JPEG, DOCX, Zip, etc.) or from Fossil |
|
32
|
itself? It really doesn’t, as far as point 2 above goes, but point 1 |
|
33
|
causes the Fossil repository to balloon out of proportion to the size of |
|
34
|
the input data change on each checkin. This article will illustrate that |
|
35
|
problem, quantify it, and give a solution to it. |
|
36
|
|
|
37
|
[dc]: ./delta_format.wiki |
|
38
|
[hf]: https://en.wikipedia.org/wiki/Hash_function |
|
39
|
[zl]: http://www.zlib.net/ |
|
40
|
|
|
41
|
|
|
42
|
## <a id="formats"></a>Affected File Formats |
|
43
|
|
|
44
|
In this article’s core experiment, we use 2D image file formats, but |
|
45
|
this article’s advice also applies to many other file types. For just a |
|
46
|
few examples out of what must be thousands: |
|
47
|
|
|
48
|
* **Microsoft Office**: The [OOXML document format][oox] used from |
|
49
|
Office 2003 onward (`.docx`, `.xlsx`, `.pptx`, etc.) are Zip files |
|
50
|
containing an XML document file and several collateral files. |
|
51
|
|
|
52
|
* **Libre Office**: For the purposes of this article, its |
|
53
|
[OpenDocument Format][odf] is designed the same basic way as OOXML. |
|
54
|
|
|
55
|
* **Java**: A Java [`.jar` file][jcl] is a Zip file containing JVM |
|
56
|
`.class` files, manifest files, and more. |
|
57
|
|
|
58
|
* **Windows Installer:** An [`*.msi` file][wi] is a proprietary |
|
59
|
database format that contains, among other things, [Microsoft |
|
60
|
Cabinet][cab]-compressed files, which in turn may hold Windows |
|
61
|
executables, which [may themselves be compressed][exc]. |
|
62
|
|
|
63
|
* **SVG, PDF, TIFF, etc.**: Many file formats are available in both |
|
64
|
compressed and uncompressed forms. You should use the uncompressed |
|
65
|
form with Fossil wherever practical, as we will show below. |
|
66
|
|
|
67
|
|
|
68
|
[cab]: https://en.wikipedia.org/wiki/Cabinet_(file_format) |
|
69
|
[exc]: https://en.wikipedia.org/wiki/Executable_compression |
|
70
|
[jcl]: https://en.wikipedia.org/wiki/Java_(programming_language) |
|
71
|
[odf]: https://en.wikipedia.org/wiki/OpenDocument |
|
72
|
[oox]: https://en.wikipedia.org/wiki/Office_Open_XML |
|
73
|
[wi]: https://en.wikipedia.org/wiki/Windows_Installer |
|
74
|
|
|
75
|
|
|
76
|
|
|
77
|
## <a id="demo"></a>Demonstration |
|
78
|
|
|
79
|
The companion `image-format-vs-repo-size.ipynb` file ([download][nbd], |
|
80
|
[preview][nbp]) is a [JupyterLab][jl] notebook implementing the following |
|
81
|
experiment: |
|
82
|
|
|
83
|
1. Create a new minimum-size Fossil repository. Save this initial size. |
|
84
|
|
|
85
|
2. Use [ImageMagick][im] via [Wand][wp] to generate a JPEG file of a |
|
86
|
particular size — currently 256 px² — filled with Gaussian noise to |
|
87
|
make data compression more difficult than with a solid-color image. |
|
88
|
|
|
89
|
3. Check that image into the new Fossil repo, and remember that size. |
|
90
|
|
|
91
|
4. Change a random pixel in the image to a random RGB value, save that |
|
92
|
image, check it in, and remember the new Fossil repo size. |
|
93
|
|
|
94
|
5. Iterate on step 4 some number of times — currently 10 — and remember |
|
95
|
the Fossil repo size at each step. |
|
96
|
|
|
97
|
6. Repeat the above steps for BMP, PNG, and TIFF.[^tiff-cmp] |
|
98
|
|
|
99
|
7. Create a bar chart showing how the Fossil repository size changes |
|
100
|
with each checkin. |
|
101
|
|
|
102
|
We chose to use JupyterLab for this because it makes it easy for you to |
|
103
|
modify the notebook to try different things. Want to see how the |
|
104
|
results change with a different image size? Easy, change the `size` |
|
105
|
value in the second cell of the notebook. Want to try more image |
|
106
|
formats? You can put anything ImageMagick can recognize into the |
|
107
|
`formats` list. Want to find the break-even point for images like those |
|
108
|
in your own repository? Easily done with a small amount of code. |
|
109
|
|
|
110
|
[im]: https://www.imagemagick.org/ |
|
111
|
[jl]: https://jupyter.org/ |
|
112
|
[nbd]: ./image-format-vs-repo-size.ipynb |
|
113
|
[nbp]: https://nbviewer.jupyter.org/urls/fossil-scm.org/fossil/doc/trunk/www/image-format-vs-repo-size.ipynb |
|
114
|
[wp]: http://wand-py.org/ |
|
115
|
|
|
116
|
|
|
117
|
## <a id="results"></a>Results |
|
118
|
|
|
119
|
Running the notebook gives a bar chart something like[^variance] this: |
|
120
|
|
|
121
|
 |
|
122
|
|
|
123
|
There are a few key things we want to draw your attention to in that |
|
124
|
chart: |
|
125
|
|
|
126
|
* BMP and uncompressed TIFF are nearly identical in size for all |
|
127
|
checkins, and the repository growth rate is negligible past the |
|
128
|
first commit.[^size-jump] We owe this economy to Fossil’s delta compression |
|
129
|
feature: it is encoding each of those single-pixel changes in a very |
|
130
|
small amount of repository space. |
|
131
|
|
|
132
|
* The JPEG and PNG bars increase by large amounts on most checkins |
|
133
|
even though each checkin *also* encodes only a *single-pixel change*. |
|
134
|
|
|
135
|
* The size of the first checkin in the BMP and TIFF cases is roughly |
|
136
|
the same as that for the PNG case, because both PNG and Fossil use |
|
137
|
the zlib binary data compression algorithm. This shows that for |
|
138
|
repos where the image files are committed only once, there is |
|
139
|
virtually no penalty to using BMP or TIFF over PNG. The file sizes |
|
140
|
likely differ only because of differences in zlib settings between |
|
141
|
the cases. |
|
142
|
|
|
143
|
* Because JPEG’s lossy nature allows it to start smaller and have |
|
144
|
smaller size increases than PNG, the crossover point with |
|
145
|
BMP/TIFF isn’t until 7-9 checkins in typical runs of this [Monte |
|
146
|
Carlo experiment][mce]. Given a choice among these four file |
|
147
|
formats and a willingness to use lossy image compression, a rational |
|
148
|
tradeoff is to choose JPEG for repositories where each image will |
|
149
|
change fewer than that number of times. |
|
150
|
|
|
151
|
|
|
152
|
[mce]: https://en.wikipedia.org/wiki/Monte_Carlo_method |
|
153
|
|
|
154
|
|
|
155
|
## <a id="makefile"></a>Automated Recompression |
|
156
|
|
|
157
|
Since programs that produce and consume binary-compressed data files |
|
158
|
often make it either difficult or impossible to work with the |
|
159
|
uncompressed form, we want an automated method for producing the |
|
160
|
uncompressed form to make Fossil happy while still having the compressed |
|
161
|
form to keep our content creation applications happy. This `Makefile` |
|
162
|
should[^makefile] do that for BMP, PNG, SVG, and XLSX files: |
|
163
|
|
|
164
|
.SUFFIXES: .bmp .png .svg .svgz |
|
165
|
|
|
166
|
.svgz.svg: |
|
167
|
gzip -dc < $< > $@ |
|
168
|
|
|
169
|
.svg.svgz: |
|
170
|
gzip -9c < $< > $@ |
|
171
|
|
|
172
|
.bmp.png: |
|
173
|
convert -quality 95 $< $@ |
|
174
|
|
|
175
|
.png.bmp: |
|
176
|
convert $< $@ |
|
177
|
|
|
178
|
SS_FILES := $(wildcard spreadsheet/*) |
|
179
|
|
|
180
|
|
|
181
|
all: $(SS_FILES) illus.svg image.bmp doc-big.pdf |
|
182
|
|
|
183
|
reconstitute: illus.svgz image.png |
|
184
|
( cd spreadsheet ; zip -9 ../spreadsheet.xlsx) * ) |
|
185
|
qpdf doc-big.pdf doc-small.pdf |
|
186
|
|
|
187
|
|
|
188
|
$(SS_FILES): spreadsheet.xlsx |
|
189
|
unzip $@ -d $< |
|
190
|
|
|
191
|
doc-big.pdf: doc-small.pdf |
|
192
|
qpdf --stream-data=uncompress $@ $< |
|
193
|
|
|
194
|
This `Makefile` allows you to treat the compressed version as the |
|
195
|
process input, but to actually check in only the changes against the |
|
196
|
uncompressed version by typing “`make`” before “`fossil ci`”. This is |
|
197
|
not actually an extra step in practice, since if you’ve got a |
|
198
|
`Makefile`-based project, you should be building — and testing — it |
|
199
|
before checking each change in anyway! |
|
200
|
|
|
201
|
Because this technique is based on dependency rules, only the necessary |
|
202
|
files are generated on each `make` command. |
|
203
|
|
|
204
|
You only have to run “`make reconstitute`” *once* after opening a fresh |
|
205
|
Fossil checkout to produce those compressed sources. After that, you |
|
206
|
work with the compressed files in your content creation programs. Your |
|
207
|
build system might include some kind of bootstrapping or |
|
208
|
auto-configuration step that you could attach this to, so that it |
|
209
|
doesn’t need to be run by hand. |
|
210
|
|
|
211
|
This `Makefile` illustrates two primary strategies: |
|
212
|
|
|
213
|
|
|
214
|
### Input and Output File Formats Differ by Extension |
|
215
|
|
|
216
|
In the case of SVG and the bitmap image formats, the file name extension |
|
217
|
differs between the cases, so we can use `make` suffix rules to get the |
|
218
|
behavior we want. The top half of the `Makefile` just tells `make` how |
|
219
|
to map from `*.svg` to `*.svgz` and vice versa, and the same for `*.bmp` |
|
220
|
to/from `*.png`. |
|
221
|
|
|
222
|
|
|
223
|
### Input and Output Use the Same Extension |
|
224
|
|
|
225
|
We don’t have that luxury for Excel and PDF files, each for a different |
|
226
|
reason: |
|
227
|
|
|
228
|
* **Excel:** Excel has no way to work with the unpacked Zip file |
|
229
|
contents at all, so we have to unpack it into a subdirectory, which |
|
230
|
is what we check into Fossil. On making a fresh Fossil checkout, we |
|
231
|
have to pack that subdirectory’s contents back up into an `*.xlsx` |
|
232
|
file with “`make reconstitute`” so we can edit it with Excel again. |
|
233
|
|
|
234
|
* **PDF:** All PDF readers can display an uncompressed PDF file, but |
|
235
|
many PDF-*producing* programs have no option for uncompressed |
|
236
|
output. Since the file name extension is the same either way, we |
|
237
|
treat the compressed PDF as the source to the process, yielding an |
|
238
|
automatically-uncompressed PDF for the benefit of Fossil. Unlike |
|
239
|
with the Excel case, there is no simple “file base name to directory |
|
240
|
name” mapping, so we just created the `-big` to `-small` name scheme |
|
241
|
here. |
|
242
|
|
|
243
|
|
|
244
|
[^delta-prgs]: |
|
245
|
This problem is not Fossil-specific. Several other programs also do |
|
246
|
delta compression, so they’ll also be affected by this problem: |
|
247
|
[rsync][rs], [Unison][us], [Git][git], etc. You should take this |
|
248
|
article’s advice when using all such programs, not just Fossil. |
|
249
|
When using file copying and synchronization programs *without* delta |
|
250
|
compression, on the other hand, it’s best to use the most |
|
251
|
highly-compressed file format you can tolerate, since they copy the |
|
252
|
whole file any time any bit of it changes. |
|
253
|
|
|
254
|
[^prn]: |
|
255
|
In fact, a good way to gauge the effectiveness of a given |
|
256
|
compression scheme is to run its output through the same sort of |
|
257
|
tests we use to gauge how “random” a given [PRNG][prng] is. Another |
|
258
|
way to look at it is that if there is a discernible pattern in the |
|
259
|
output of a compression scheme, that constitutes *information* (in |
|
260
|
[the technical sense of that word][ith]) that could be further |
|
261
|
compressed. |
|
262
|
|
|
263
|
[^tiff-cmp]: |
|
264
|
We're using *uncompressed* TIFF here, not [LZW][lzw]- or |
|
265
|
Zip-compressed TIFF, either of which would give similar results to |
|
266
|
PNG, which is always zlib-compressed. |
|
267
|
|
|
268
|
[^variance]: |
|
269
|
The raw data changes somewhat from one run to the next due to the |
|
270
|
use of random noise in the image to make the zlib/PNG compression |
|
271
|
more difficult, and the random pixel changes. Those test design |
|
272
|
choices make this a [Monte Carlo experiment][mce]. We’ve found that |
|
273
|
the overall character of the results doesn’t change from one run to |
|
274
|
the next. |
|
275
|
|
|
276
|
[^size-jump]: |
|
277
|
It’s not clear to me why there is a one-time jump in size for BMP |
|
278
|
and TIFF past the first commit. I suspect it is due to the SQLite |
|
279
|
indices being initialized for the first time. |
|
280
|
Page size inflation might have something to do with it as well, |
|
281
|
though we tried to control that by rebuilding the initial DB with a |
|
282
|
minimal page size. If you re-run the program often enough, you will |
|
283
|
sometimes see the BMP or TIFF bar jump higher than the other, again |
|
284
|
likely due to one of the repos crossing a page boundary. |
|
285
|
Another curious artifact in the data is that the BMP is slightly |
|
286
|
larger than for the TIFF. This goes against expectation because a |
|
287
|
low-tech format like BMP should have a small edge in this test |
|
288
|
because TIFF metadata includes the option for multiple timestamps, |
|
289
|
UUIDs, etc., which bloat the checkin size by creating many small |
|
290
|
deltas. |
|
291
|
|
|
292
|
[^makefile]: |
|
293
|
The `Makefile` above is not battle-tested. Please report bugs and |
|
294
|
needed extensions [on the forum][for]. |
|
295
|
|
|
296
|
[for]: https://fossil-scm.org/forum/forumpost/15e677f2c8 |
|
297
|
[git]: https://git-scm.com/ |
|
298
|
[ith]: https://en.wikipedia.org/wiki/Information_theory |
|
299
|
[lzw]: https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch |
|
300
|
[prng]: https://en.wikipedia.org/wiki/Pseudorandom_number_generator |
|
301
|
[rs]: https://rsync.samba.org/ |
|
302
|
[us]: http://www.cis.upenn.edu/~bcpierce/unison/ |
|
303
|
|