Fossil SCM

Converted the hand-crafted footnotes in the "Image Format vs Fossil Repo Size" doc to use the new Markdown affordance.

wyoung 2023-04-25 22:09 trunk
Commit 389e3fb976ea4a23d04b847c9c2bea77e73312be27f17ab564ac961783d9593e
--- www/image-format-vs-repo-size.md
+++ www/image-format-vs-repo-size.md
@@ -1,19 +1,19 @@
11
# Image Format vs Fossil Repo Size
22
33
## The Problem
44
55
Fossil has a [delta compression][dc] feature which removes redundant
6
-information from a file relative to its parent on check-in.¹
6
+information from a file relative to its parent on check-in.[^delta-prgs]
77
That delta is then [zlib][zl]-compressed before being stored
88
in the Fossil repository database file.
99
1010
Storing pre-compressed data files in a Fossil repository defeats both of
1111
these space-saving measures:
1212
1313
1. Binary data compression algorithms turn the file data into
14
- [pseudorandom noise][prn].²
14
+ pseudorandom noise.[^prn]
1515
1616
Typical data compression algorithms are not [hash functions][hf],
1717
where the goal is that a change to each bit in the input has a
1818
statistically even chance of changing every bit in the output, but
1919
because they do approach that pathological condition, pre-compressed
@@ -34,11 +34,10 @@
3434
the input data change on each checkin. This article will illustrate that
3535
problem, quantify it, and give a solution to it.
3636
3737
[dc]: ./delta_format.wiki
3838
[hf]: https://en.wikipedia.org/wiki/Hash_function
39
-[prn]: https://en.wikipedia.org/wiki/Pseudorandomness
4039
[zl]: http://www.zlib.net/
4140
4241
4342
## <a id="formats"></a>Affected File Formats
4443
@@ -93,11 +92,11 @@
9392
image, check it in, and remember the new Fossil repo size.
9493
9594
5. Iterate on step 4 some number of times — currently 10 — and remember
9695
the Fossil repo size at each step.
9796
98
-6. Repeat the above steps for BMP, TIFF,³ and PNG.
97
+6. Repeat the above steps for BMP, PNG, and TIFF.[^tiff-cmp]
9998
10099
7. Create a bar chart showing how the Fossil repository size changes
101100
with each checkin.
102101
103102
We chose to use JupyterLab for this because it makes it easy for you to
@@ -115,20 +114,20 @@
115114
[wp]: http://wand-py.org/
116115
117116
118117
## <a id="results"></a>Results
119118
120
-Running the notebook gives a bar chart something like⁴ this:
119
+Running the notebook gives a bar chart something like[^variance] this:
121120
122121
![results bar chart](./image-format-vs-repo-size.svg)
123122
124123
There are a few key things we want to draw your attention to in that
125124
chart:
126125
127126
* BMP and uncompressed TIFF are nearly identical in size for all
128127
checkins, and the repository growth rate is negligible past the
129
- first commit.⁵ We owe this economy to Fossil’s delta compression
128
+ first commit.[^size-jump] We owe this economy to Fossil’s delta compression
130129
feature: it is encoding each of those single-pixel changes in a very
131130
small amount of repository space.
132131
133132
* The JPEG and PNG bars increase by large amounts on most checkins
134133
even though each checkin *also* encodes only a *single-pixel change*.
@@ -158,11 +157,11 @@
158157
Since programs that produce and consume binary-compressed data files
159158
often make it either difficult or impossible to work with the
160159
uncompressed form, we want an automated method for producing the
161160
uncompressed form to make Fossil happy while still having the compressed
162161
form to keep our content creation applications happy. This `Makefile`
163
-should⁶ do that for BMP, PNG, SVG, and XLSX files:
162
+should[^makefile] do that for BMP, PNG, SVG, and XLSX files:
164163
165164
.SUFFIXES: .bmp .png .svg .svgz
166165
167166
.svgz.svg:
168167
gzip -dc < $< > $@
@@ -194,11 +193,11 @@
194193
195194
This `Makefile` allows you to treat the compressed version as the
196195
process input, but to actually check in only the changes against the
197196
uncompressed version by typing “`make`” before “`fossil ci`”. This is
198197
not actually an extra step in practice, since if you’ve got a
199
-`Makefile`-based project, you should be building (and testing!) it
198
+`Makefile`-based project, you should be building — and testing — it
200199
before checking each change in anyway!
201200
202201
Because this technique is based on dependency rules, only the necessary
203202
files are generated on each `make` command.
204203
@@ -239,66 +238,65 @@
239238
automatically-uncompressed PDF for the benefit of Fossil. Unlike
240239
with the Excel case, there is no simple “file base name to directory
241240
name” mapping, so we just created the `-big` to `-small` name scheme
242241
here.
243242
-----
244243
245
-
246
-## <a id="notes"></a>Footnotes and Digressions
247
-
248
-1. This problem is not Fossil-specific. Several other programs also do
244
+[^delta-prgs]:
245
+ This problem is not Fossil-specific. Several other programs also do
249246
delta compression, so they’ll also be affected by this problem:
250247
[rsync][rs], [Unison][us], [Git][git], etc. You should take this
251248
article’s advice when using all such programs, not just Fossil.
252
-
253249
When using file copying and synchronization programs *without* delta
254250
compression, on the other hand, it’s best to use the most
255251
highly-compressed file format you can tolerate, since they copy the
256252
whole file any time any bit of it changes.
257253
258
-2. In fact, a good way to gauge the effectiveness of a given
254
+[^prn]:
255
+ In fact, a good way to gauge the effectiveness of a given
259256
compression scheme is to run its output through the same sort of
260257
tests we use to gauge how “random” a given [PRNG][prng] is. Another
261258
way to look at it is that if there is a discernible pattern in the
262259
output of a compression scheme, that constitutes *information* (in
263260
[the technical sense of that word][ith]) that could be further
264261
compressed.
265262
266
-3. We're using *uncompressed* TIFF here, not [LZW][lzw]- or
263
+[^tiff-cmp]:
264
+ We're using *uncompressed* TIFF here, not [LZW][lzw]- or
267265
Zip-compressed TIFF, either of which would give similar results to
268266
PNG, which is always zlib-compressed.
269267
270
-4. The raw data changes somewhat from one run to the next due to the
268
+[^variance]:
269
+ The raw data changes somewhat from one run to the next due to the
271270
use of random noise in the image to make the zlib/PNG compression
272271
more difficult, and the random pixel changes. Those test design
273272
choices make this a [Monte Carlo experiment][mce]. We’ve found that
274273
the overall character of the results doesn’t change from one run to
275274
the next.
276275
277
-5. It’s not clear to me why there is a one-time jump in size for BMP
276
+[^size-jump]:
277
+ It’s not clear to me why there is a one-time jump in size for BMP
278278
and TIFF past the first commit. I suspect it is due to the SQLite
279279
indices being initialized for the first time.
280
-
281280
Page size inflation might have something to do with it as well,
282281
though we tried to control that by rebuilding the initial DB with a
283282
minimal page size. If you re-run the program often enough, you will
284283
sometimes see the BMP or TIFF bar jump higher than the other, again
285284
likely due to one of the repos crossing a page boundary.
286
-
287285
Another curious artifact in the data is that the BMP is slightly
288286
larger than for the TIFF. This goes against expectation because a
289287
low-tech format like BMP should have a small edge in this test
290288
because TIFF metadata includes the option for multiple timestamps,
291289
UUIDs, etc., which bloat the checkin size by creating many small
292290
deltas.
293291
294
-6. The `Makefile` above is not battle-tested. Please report bugs and
292
+[^makefile]:
293
+ The `Makefile` above is not battle-tested. Please report bugs and
295294
needed extensions [on the forum][for].
296295
297296
[for]: https://fossil-scm.org/forum/forumpost/15e677f2c8
298297
[git]: https://git-scm.com/
299298
[ith]: https://en.wikipedia.org/wiki/Information_theory
300299
[lzw]: https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch
301300
[prng]: https://en.wikipedia.org/wiki/Pseudorandom_number_generator
302301
[rs]: https://rsync.samba.org/
303302
[us]: http://www.cis.upenn.edu/~bcpierce/unison/
304303
--- www/image-format-vs-repo-size.md
+++ www/image-format-vs-repo-size.md
@@ -1,19 +1,19 @@
1 # Image Format vs Fossil Repo Size
2
3 ## The Problem
4
5 Fossil has a [delta compression][dc] feature which removes redundant
6 information from a file relative to its parent on check-in.¹
7 That delta is then [zlib][zl]-compressed before being stored
8 in the Fossil repository database file.
9
10 Storing pre-compressed data files in a Fossil repository defeats both of
11 these space-saving measures:
12
13 1. Binary data compression algorithms turn the file data into
14 [pseudorandom noise][prn].²
15
16 Typical data compression algorithms are not [hash functions][hf],
17 where the goal is that a change to each bit in the input has a
18 statistically even chance of changing every bit in the output, but
19 because they do approach that pathological condition, pre-compressed
@@ -34,11 +34,10 @@
34 the input data change on each checkin. This article will illustrate that
35 problem, quantify it, and give a solution to it.
36
37 [dc]: ./delta_format.wiki
38 [hf]: https://en.wikipedia.org/wiki/Hash_function
39 [prn]: https://en.wikipedia.org/wiki/Pseudorandomness
40 [zl]: http://www.zlib.net/
41
42
43 ## <a id="formats"></a>Affected File Formats
44
@@ -93,11 +92,11 @@
93 image, check it in, and remember the new Fossil repo size.
94
95 5. Iterate on step 4 some number of times — currently 10 — and remember
96 the Fossil repo size at each step.
97
98 6. Repeat the above steps for BMP, TIFF,³ and PNG.
99
100 7. Create a bar chart showing how the Fossil repository size changes
101 with each checkin.
102
103 We chose to use JupyterLab for this because it makes it easy for you to
@@ -115,20 +114,20 @@
115 [wp]: http://wand-py.org/
116
117
118 ## <a id="results"></a>Results
119
120 Running the notebook gives a bar chart something like⁴ this:
121
122 ![results bar chart](./image-format-vs-repo-size.svg)
123
124 There are a few key things we want to draw your attention to in that
125 chart:
126
127 * BMP and uncompressed TIFF are nearly identical in size for all
128 checkins, and the repository growth rate is negligible past the
129 first commit.⁵ We owe this economy to Fossil’s delta compression
130 feature: it is encoding each of those single-pixel changes in a very
131 small amount of repository space.
132
133 * The JPEG and PNG bars increase by large amounts on most checkins
134 even though each checkin *also* encodes only a *single-pixel change*.
@@ -158,11 +157,11 @@
158 Since programs that produce and consume binary-compressed data files
159 often make it either difficult or impossible to work with the
160 uncompressed form, we want an automated method for producing the
161 uncompressed form to make Fossil happy while still having the compressed
162 form to keep our content creation applications happy. This `Makefile`
163 should⁶ do that for BMP, PNG, SVG, and XLSX files:
164
165 .SUFFIXES: .bmp .png .svg .svgz
166
167 .svgz.svg:
168 gzip -dc < $< > $@
@@ -194,11 +193,11 @@
194
195 This `Makefile` allows you to treat the compressed version as the
196 process input, but to actually check in only the changes against the
197 uncompressed version by typing “`make`” before “`fossil ci`”. This is
198 not actually an extra step in practice, since if you’ve got a
199 `Makefile`-based project, you should be building (and testing!) it
200 before checking each change in anyway!
201
202 Because this technique is based on dependency rules, only the necessary
203 files are generated on each `make` command.
204
@@ -239,66 +238,65 @@
239 automatically-uncompressed PDF for the benefit of Fossil. Unlike
240 with the Excel case, there is no simple “file base name to directory
241 name” mapping, so we just created the `-big` to `-small` name scheme
242 here.
243
-----
244
245
246 ## <a id="notes"></a>Footnotes and Digressions
247
248 1. This problem is not Fossil-specific. Several other programs also do
249 delta compression, so they’ll also be affected by this problem:
250 [rsync][rs], [Unison][us], [Git][git], etc. You should take this
251 article’s advice when using all such programs, not just Fossil.
252
253 When using file copying and synchronization programs *without* delta
254 compression, on the other hand, it’s best to use the most
255 highly-compressed file format you can tolerate, since they copy the
256 whole file any time any bit of it changes.
257
258 2. In fact, a good way to gauge the effectiveness of a given
 
259 compression scheme is to run its output through the same sort of
260 tests we use to gauge how “random” a given [PRNG][prng] is. Another
261 way to look at it is that if there is a discernible pattern in the
262 output of a compression scheme, that constitutes *information* (in
263 [the technical sense of that word][ith]) that could be further
264 compressed.
265
266 3. We're using *uncompressed* TIFF here, not [LZW][lzw]- or
 
267 Zip-compressed TIFF, either of which would give similar results to
268 PNG, which is always zlib-compressed.
269
270 4. The raw data changes somewhat from one run to the next due to the
 
271 use of random noise in the image to make the zlib/PNG compression
272 more difficult, and the random pixel changes. Those test design
273 choices make this a [Monte Carlo experiment][mce]. We’ve found that
274 the overall character of the results doesn’t change from one run to
275 the next.
276
277 5. It’s not clear to me why there is a one-time jump in size for BMP
 
278 and TIFF past the first commit. I suspect it is due to the SQLite
279 indices being initialized for the first time.
280
281 Page size inflation might have something to do with it as well,
282 though we tried to control that by rebuilding the initial DB with a
283 minimal page size. If you re-run the program often enough, you will
284 sometimes see the BMP or TIFF bar jump higher than the other, again
285 likely due to one of the repos crossing a page boundary.
286
287 Another curious artifact in the data is that the BMP is slightly
288 larger than for the TIFF. This goes against expectation because a
289 low-tech format like BMP should have a small edge in this test
290 because TIFF metadata includes the option for multiple timestamps,
291 UUIDs, etc., which bloat the checkin size by creating many small
292 deltas.
293
294 6. The `Makefile` above is not battle-tested. Please report bugs and
 
295 needed extensions [on the forum][for].
296
297 [for]: https://fossil-scm.org/forum/forumpost/15e677f2c8
298 [git]: https://git-scm.com/
299 [ith]: https://en.wikipedia.org/wiki/Information_theory
300 [lzw]: https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch
301 [prng]: https://en.wikipedia.org/wiki/Pseudorandom_number_generator
302 [rs]: https://rsync.samba.org/
303 [us]: http://www.cis.upenn.edu/~bcpierce/unison/
304
--- www/image-format-vs-repo-size.md
+++ www/image-format-vs-repo-size.md
@@ -1,19 +1,19 @@
1 # Image Format vs Fossil Repo Size
2
3 ## The Problem
4
5 Fossil has a [delta compression][dc] feature which removes redundant
6 information from a file relative to its parent on check-in.[^delta-prgs]
7 That delta is then [zlib][zl]-compressed before being stored
8 in the Fossil repository database file.
9
10 Storing pre-compressed data files in a Fossil repository defeats both of
11 these space-saving measures:
12
13 1. Binary data compression algorithms turn the file data into
14 pseudorandom noise.[^prn]
15
16 Typical data compression algorithms are not [hash functions][hf],
17 where the goal is that a change to each bit in the input has a
18 statistically even chance of changing every bit in the output, but
19 because they do approach that pathological condition, pre-compressed
@@ -34,11 +34,10 @@
34 the input data change on each checkin. This article will illustrate that
35 problem, quantify it, and give a solution to it.
36
37 [dc]: ./delta_format.wiki
38 [hf]: https://en.wikipedia.org/wiki/Hash_function
 
39 [zl]: http://www.zlib.net/
40
41
42 ## <a id="formats"></a>Affected File Formats
43
@@ -93,11 +92,11 @@
92 image, check it in, and remember the new Fossil repo size.
93
94 5. Iterate on step 4 some number of times — currently 10 — and remember
95 the Fossil repo size at each step.
96
97 6. Repeat the above steps for BMP, PNG, and TIFF.[^tiff-cmp]
98
99 7. Create a bar chart showing how the Fossil repository size changes
100 with each checkin.
101
102 We chose to use JupyterLab for this because it makes it easy for you to
@@ -115,20 +114,20 @@
114 [wp]: http://wand-py.org/
115
116
117 ## <a id="results"></a>Results
118
119 Running the notebook gives a bar chart something like[^variance] this:
120
121 ![results bar chart](./image-format-vs-repo-size.svg)
122
123 There are a few key things we want to draw your attention to in that
124 chart:
125
126 * BMP and uncompressed TIFF are nearly identical in size for all
127 checkins, and the repository growth rate is negligible past the
128 first commit.[^size-jump] We owe this economy to Fossil’s delta compression
129 feature: it is encoding each of those single-pixel changes in a very
130 small amount of repository space.
131
132 * The JPEG and PNG bars increase by large amounts on most checkins
133 even though each checkin *also* encodes only a *single-pixel change*.
@@ -158,11 +157,11 @@
157 Since programs that produce and consume binary-compressed data files
158 often make it either difficult or impossible to work with the
159 uncompressed form, we want an automated method for producing the
160 uncompressed form to make Fossil happy while still having the compressed
161 form to keep our content creation applications happy. This `Makefile`
162 should[^makefile] do that for BMP, PNG, SVG, and XLSX files:
163
164 .SUFFIXES: .bmp .png .svg .svgz
165
166 .svgz.svg:
167 gzip -dc < $< > $@
@@ -194,11 +193,11 @@
193
194 This `Makefile` allows you to treat the compressed version as the
195 process input, but to actually check in only the changes against the
196 uncompressed version by typing “`make`” before “`fossil ci`”. This is
197 not actually an extra step in practice, since if you’ve got a
198 `Makefile`-based project, you should be building — and testing — it
199 before checking each change in anyway!
200
201 Because this technique is based on dependency rules, only the necessary
202 files are generated on each `make` command.
203
@@ -239,66 +238,65 @@
238 automatically-uncompressed PDF for the benefit of Fossil. Unlike
239 with the Excel case, there is no simple “file base name to directory
240 name” mapping, so we just created the `-big` to `-small` name scheme
241 here.
242
-----
243
244 [^delta-prgs]:
245 This problem is not Fossil-specific. Several other programs also do
 
 
246 delta compression, so they’ll also be affected by this problem:
247 [rsync][rs], [Unison][us], [Git][git], etc. You should take this
248 article’s advice when using all such programs, not just Fossil.
 
249 When using file copying and synchronization programs *without* delta
250 compression, on the other hand, it’s best to use the most
251 highly-compressed file format you can tolerate, since they copy the
252 whole file any time any bit of it changes.
253
254 [^prn]:
255 In fact, a good way to gauge the effectiveness of a given
256 compression scheme is to run its output through the same sort of
257 tests we use to gauge how “random” a given [PRNG][prng] is. Another
258 way to look at it is that if there is a discernible pattern in the
259 output of a compression scheme, that constitutes *information* (in
260 [the technical sense of that word][ith]) that could be further
261 compressed.
262
263 [^tiff-cmp]:
264 We're using *uncompressed* TIFF here, not [LZW][lzw]- or
265 Zip-compressed TIFF, either of which would give similar results to
266 PNG, which is always zlib-compressed.
267
268 [^variance]:
269 The raw data changes somewhat from one run to the next due to the
270 use of random noise in the image to make the zlib/PNG compression
271 more difficult, and the random pixel changes. Those test design
272 choices make this a [Monte Carlo experiment][mce]. We’ve found that
273 the overall character of the results doesn’t change from one run to
274 the next.
275
276 [^size-jump]:
277 It’s not clear to me why there is a one-time jump in size for BMP
278 and TIFF past the first commit. I suspect it is due to the SQLite
279 indices being initialized for the first time.
 
280 Page size inflation might have something to do with it as well,
281 though we tried to control that by rebuilding the initial DB with a
282 minimal page size. If you re-run the program often enough, you will
283 sometimes see the BMP or TIFF bar jump higher than the other, again
284 likely due to one of the repos crossing a page boundary.
 
285 Another curious artifact in the data is that the BMP is slightly
286 larger than for the TIFF. This goes against expectation because a
287 low-tech format like BMP should have a small edge in this test
288 because TIFF metadata includes the option for multiple timestamps,
289 UUIDs, etc., which bloat the checkin size by creating many small
290 deltas.
291
292 [^makefile]:
293 The `Makefile` above is not battle-tested. Please report bugs and
294 needed extensions [on the forum][for].
295
296 [for]: https://fossil-scm.org/forum/forumpost/15e677f2c8
297 [git]: https://git-scm.com/
298 [ith]: https://en.wikipedia.org/wiki/Information_theory
299 [lzw]: https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch
300 [prng]: https://en.wikipedia.org/wiki/Pseudorandom_number_generator
301 [rs]: https://rsync.samba.org/
302 [us]: http://www.cis.upenn.edu/~bcpierce/unison/
303

Keyboard Shortcuts

Open search /
Next entry (timeline) j
Previous entry (timeline) k
Open focused entry Enter
Show this help ?
Toggle theme Top nav button