Fossil SCM

fossil-scm / www / image-format-vs-repo-size.md

Source Rendered

Blame History Raw 303 lines

1	`# Image Format vs Fossil Repo Size`
2
3	`## The Problem`
4
5	`Fossil has a [delta compression][dc] feature which removes redundant`
6	`information from a file relative to its parent on check-in.[^delta-prgs]`
7	`That delta is then [zlib][zl]-compressed before being stored`
8	`in the Fossil repository database file.`
9
10	`Storing pre-compressed data files in a Fossil repository defeats both of`
11	`these space-saving measures:`
12
13	`1. Binary data compression algorithms turn the file data into`
14	`pseudorandom noise.[^prn]`
15
16	`Typical data compression algorithms are not [hash functions][hf],`
17	`where the goal is that a change to each bit in the input has a`
18	`statistically even chance of changing every bit in the output, but`
19	`because they do approach that pathological condition, pre-compressed`
20	`data tends to defeat Fossil’s delta compression algorithm, there`
21	`being so little correlation between two different outputs from the`
22	`binary data compression algorithm.`
23
24	`2. An ideal lossless binary data compression algorithm cannot be`
25	`applied more than once to make the data even smaller, since random`
26	`noise is incompressible. The consequence for our purposes here is`
27	`that pre-compressed data doesn’t benefit from Fossil’s zlib`
28	`compression.`
29
30	`You might then ask, what does it matter if the space savings comes from`
31	`the application file format (e.g. JPEG, DOCX, Zip, etc.) or from Fossil`
32	`itself? It really doesn’t, as far as point 2 above goes, but point 1`
33	`causes the Fossil repository to balloon out of proportion to the size of`
34	`the input data change on each checkin. This article will illustrate that`
35	`problem, quantify it, and give a solution to it.`
36
37	`[dc]: ./delta_format.wiki`
38	`[hf]: https://en.wikipedia.org/wiki/Hash_function`
39	`[zl]: http://www.zlib.net/`
40
41
42	`## <a id="formats"></a>Affected File Formats`
43
44	`In this article’s core experiment, we use 2D image file formats, but`
45	`this article’s advice also applies to many other file types. For just a`
46	`few examples out of what must be thousands:`
47
48	`* Microsoft Office: The [OOXML document format][oox] used from`
49	Office 2003 onward (`.docx`, `.xlsx`, `.pptx`, etc.) are Zip files
50	`containing an XML document file and several collateral files.`
51
52	`* Libre Office: For the purposes of this article, its`
53	`[OpenDocument Format][odf] is designed the same basic way as OOXML.`
54
55	* Java: A Java [`.jar` file][jcl] is a Zip file containing JVM
56	`.class` files, manifest files, and more.
57
58	* Windows Installer: An [`*.msi` file][wi] is a proprietary
59	`database format that contains, among other things, [Microsoft`
60	`Cabinet][cab]-compressed files, which in turn may hold Windows`
61	`executables, which [may themselves be compressed][exc].`
62
63	`* SVG, PDF, TIFF, etc.: Many file formats are available in both`
64	`compressed and uncompressed forms. You should use the uncompressed`
65	`form with Fossil wherever practical, as we will show below.`
66
67
68	`[cab]: https://en.wikipedia.org/wiki/Cabinet_(file_format)`
69	`[exc]: https://en.wikipedia.org/wiki/Executable_compression`
70	`[jcl]: https://en.wikipedia.org/wiki/Java_(programming_language)`
71	`[odf]: https://en.wikipedia.org/wiki/OpenDocument`
72	`[oox]: https://en.wikipedia.org/wiki/Office_Open_XML`
73	`[wi]: https://en.wikipedia.org/wiki/Windows_Installer`
74
75
76
77	`## <a id="demo"></a>Demonstration`
78
79	The companion `image-format-vs-repo-size.ipynb` file ([download][nbd],
80	`[preview][nbp]) is a [JupyterLab][jl] notebook implementing the following`
81	`experiment:`
82
83	`1. Create a new minimum-size Fossil repository. Save this initial size.`
84
85	`2. Use [ImageMagick][im] via [Wand][wp] to generate a JPEG file of a`
86	`particular size — currently 256 px² — filled with Gaussian noise to`
87	`make data compression more difficult than with a solid-color image.`
88
89	`3. Check that image into the new Fossil repo, and remember that size.`
90
91	`4. Change a random pixel in the image to a random RGB value, save that`
92	`image, check it in, and remember the new Fossil repo size.`
93
94	`5. Iterate on step 4 some number of times — currently 10 — and remember`
95	`the Fossil repo size at each step.`
96
97	`6. Repeat the above steps for BMP, PNG, and TIFF.[^tiff-cmp]`
98
99	`7. Create a bar chart showing how the Fossil repository size changes`
100	`with each checkin.`
101
102	`We chose to use JupyterLab for this because it makes it easy for you to`
103	`modify the notebook to try different things. Want to see how the`
104	results change with a different image size? Easy, change the `size`
105	`value in the second cell of the notebook. Want to try more image`
106	`formats? You can put anything ImageMagick can recognize into the`
107	`formats` list. Want to find the break-even point for images like those
108	`in your own repository? Easily done with a small amount of code.`
109
110	`[im]: https://www.imagemagick.org/`
111	`[jl]: https://jupyter.org/`
112	`[nbd]: ./image-format-vs-repo-size.ipynb`
113	`[nbp]: https://nbviewer.jupyter.org/urls/fossil-scm.org/fossil/doc/trunk/www/image-format-vs-repo-size.ipynb`
114	`[wp]: http://wand-py.org/`
115
116
117	`## <a id="results"></a>Results`
118
119	`Running the notebook gives a bar chart something like[^variance] this:`
120
121	`![results bar chart](./image-format-vs-repo-size.svg)`
122
123	`There are a few key things we want to draw your attention to in that`
124	`chart:`
125
126	`* BMP and uncompressed TIFF are nearly identical in size for all`
127	`checkins, and the repository growth rate is negligible past the`
128	`first commit.[^size-jump] We owe this economy to Fossil’s delta compression`
129	`feature: it is encoding each of those single-pixel changes in a very`
130	`small amount of repository space.`
131
132	`* The JPEG and PNG bars increase by large amounts on most checkins`
133	`even though each checkin also encodes only a single-pixel change.`
134
135	`* The size of the first checkin in the BMP and TIFF cases is roughly`
136	`the same as that for the PNG case, because both PNG and Fossil use`
137	`the zlib binary data compression algorithm. This shows that for`
138	`repos where the image files are committed only once, there is`
139	`virtually no penalty to using BMP or TIFF over PNG. The file sizes`
140	`likely differ only because of differences in zlib settings between`
141	`the cases.`
142
143	`* Because JPEG’s lossy nature allows it to start smaller and have`
144	`smaller size increases than PNG, the crossover point with`
145	`BMP/TIFF isn’t until 7-9 checkins in typical runs of this [Monte`
146	`Carlo experiment][mce]. Given a choice among these four file`
147	`formats and a willingness to use lossy image compression, a rational`
148	`tradeoff is to choose JPEG for repositories where each image will`
149	`change fewer than that number of times.`
150
151
152	`[mce]: https://en.wikipedia.org/wiki/Monte_Carlo_method`
153
154
155	`## <a id="makefile"></a>Automated Recompression`
156
157	`Since programs that produce and consume binary-compressed data files`
158	`often make it either difficult or impossible to work with the`
159	`uncompressed form, we want an automated method for producing the`
160	`uncompressed form to make Fossil happy while still having the compressed`
161	form to keep our content creation applications happy. This `Makefile`
162	`should[^makefile] do that for BMP, PNG, SVG, and XLSX files:`
163
164	`.SUFFIXES: .bmp .png .svg .svgz`
165
166	`.svgz.svg:`
167	`gzip -dc < $< > $@`
168
169	`.svg.svgz:`
170	`gzip -9c < $< > $@`
171
172	`.bmp.png:`
173	`convert -quality 95 $< $@`
174
175	`.png.bmp:`
176	`convert $< $@`
177
178	`SS_FILES := $(wildcard spreadsheet/*)`
179
180
181	`all: $(SS_FILES) illus.svg image.bmp doc-big.pdf`
182
183	`reconstitute: illus.svgz image.png`
184	`( cd spreadsheet ; zip -9 ../spreadsheet.xlsx) * )`
185	`qpdf doc-big.pdf doc-small.pdf`
186
187
188	`$(SS_FILES): spreadsheet.xlsx`
189	`unzip $@ -d $<`
190
191	`doc-big.pdf: doc-small.pdf`
192	`qpdf --stream-data=uncompress $@ $<`
193
194	This `Makefile` allows you to treat the compressed version as the
195	`process input, but to actually check in only the changes against the`
196	uncompressed version by typing “`make`” before “`fossil ci`”. This is
197	`not actually an extra step in practice, since if you’ve got a`
198	`Makefile`-based project, you should be building — and testing — it
199	`before checking each change in anyway!`
200
201	`Because this technique is based on dependency rules, only the necessary`
202	files are generated on each `make` command.
203
204	You only have to run “`make reconstitute`” once after opening a fresh
205	`Fossil checkout to produce those compressed sources. After that, you`
206	`work with the compressed files in your content creation programs. Your`
207	`build system might include some kind of bootstrapping or`
208	`auto-configuration step that you could attach this to, so that it`
209	`doesn’t need to be run by hand.`
210
211	This `Makefile` illustrates two primary strategies:
212
213
214	`### Input and Output File Formats Differ by Extension`
215
216	`In the case of SVG and the bitmap image formats, the file name extension`
217	differs between the cases, so we can use `make` suffix rules to get the
218	behavior we want. The top half of the `Makefile` just tells `make` how
219	to map from `.svg` to `.svgz` and vice versa, and the same for `*.bmp`
220	to/from `*.png`.
221
222
223	`### Input and Output Use the Same Extension`
224
225	`We don’t have that luxury for Excel and PDF files, each for a different`
226	`reason:`
227
228	`* Excel: Excel has no way to work with the unpacked Zip file`
229	`contents at all, so we have to unpack it into a subdirectory, which`
230	`is what we check into Fossil. On making a fresh Fossil checkout, we`
231	have to pack that subdirectory’s contents back up into an `*.xlsx`
232	file with “`make reconstitute`” so we can edit it with Excel again.
233
234	`* PDF: All PDF readers can display an uncompressed PDF file, but`
235	`many PDF-producing programs have no option for uncompressed`
236	`output. Since the file name extension is the same either way, we`
237	`treat the compressed PDF as the source to the process, yielding an`
238	`automatically-uncompressed PDF for the benefit of Fossil. Unlike`
239	`with the Excel case, there is no simple “file base name to directory`
240	name” mapping, so we just created the `-big` to `-small` name scheme
241	`here.`
242
243
244	`[^delta-prgs]:`
245	`This problem is not Fossil-specific. Several other programs also do`
246	`delta compression, so they’ll also be affected by this problem:`
247	`[rsync][rs], [Unison][us], [Git][git], etc. You should take this`
248	`article’s advice when using all such programs, not just Fossil.`
249	`When using file copying and synchronization programs without delta`
250	`compression, on the other hand, it’s best to use the most`
251	`highly-compressed file format you can tolerate, since they copy the`
252	`whole file any time any bit of it changes.`
253
254	`[^prn]:`
255	`In fact, a good way to gauge the effectiveness of a given`
256	`compression scheme is to run its output through the same sort of`
257	`tests we use to gauge how “random” a given [PRNG][prng] is. Another`
258	`way to look at it is that if there is a discernible pattern in the`
259	`output of a compression scheme, that constitutes information (in`
260	`[the technical sense of that word][ith]) that could be further`
261	`compressed.`
262
263	`[^tiff-cmp]:`
264	`We're using uncompressed TIFF here, not [LZW][lzw]- or`
265	`Zip-compressed TIFF, either of which would give similar results to`
266	`PNG, which is always zlib-compressed.`
267
268	`[^variance]:`
269	`The raw data changes somewhat from one run to the next due to the`
270	`use of random noise in the image to make the zlib/PNG compression`
271	`more difficult, and the random pixel changes. Those test design`
272	`choices make this a [Monte Carlo experiment][mce]. We’ve found that`
273	`the overall character of the results doesn’t change from one run to`
274	`the next.`
275
276	`[^size-jump]:`
277	`It’s not clear to me why there is a one-time jump in size for BMP`
278	`and TIFF past the first commit. I suspect it is due to the SQLite`
279	`indices being initialized for the first time.`
280	`Page size inflation might have something to do with it as well,`
281	`though we tried to control that by rebuilding the initial DB with a`
282	`minimal page size. If you re-run the program often enough, you will`
283	`sometimes see the BMP or TIFF bar jump higher than the other, again`
284	`likely due to one of the repos crossing a page boundary.`
285	`Another curious artifact in the data is that the BMP is slightly`
286	`larger than for the TIFF. This goes against expectation because a`
287	`low-tech format like BMP should have a small edge in this test`
288	`because TIFF metadata includes the option for multiple timestamps,`
289	`UUIDs, etc., which bloat the checkin size by creating many small`
290	`deltas.`
291
292	`[^makefile]:`
293	The `Makefile` above is not battle-tested. Please report bugs and
294	`needed extensions [on the forum][for].`
295
296	`[for]: https://fossil-scm.org/forum/forumpost/15e677f2c8`
297	`[git]: https://git-scm.com/`
298	`[ith]: https://en.wikipedia.org/wiki/Information_theory`
299	`[lzw]: https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch`
300	`[prng]: https://en.wikipedia.org/wiki/Pseudorandom_number_generator`
301	`[rs]: https://rsync.samba.org/`
302	`[us]: http://www.cis.upenn.edu/~bcpierce/unison/`
303

Fossil SCM

Keyboard Shortcuts