Fossil SCM

Converted the hand-crafted footnotes in the "Image Format vs Fossil Repo Size" doc to use the new Markdown footnote affordance.

wyoung 2023-04-25 22:09 trunk

Commit 389e3fb976ea4a23d04b847c9c2bea77e73312be27f17ab564ac961783d9593e

Parent 5c8f5575657c0e3…

1 file changed +19 -20

M www/image-format-vs-repo-size.md

+19 -20

		--- www/image-format-vs-repo-size.md
		+++ www/image-format-vs-repo-size.md
		@@ -1,19 +1,19 @@
1	1	# Image Format vs Fossil Repo Size
2	2
3	3	## The Problem
4	4
5	5	Fossil has a [delta compression][dc] feature which removes redundant
6		-information from a file relative to its parent on check-in.¹
	6	+information from a file relative to its parent on check-in.[^delta-prgs]
7	7	That delta is then [zlib][zl]-compressed before being stored
8	8	in the Fossil repository database file.
9	9
10	10	Storing pre-compressed data files in a Fossil repository defeats both of
11	11	these space-saving measures:
12	12
13	13	1. Binary data compression algorithms turn the file data into
14		- [pseudorandom noise][prn].²
	14	+ pseudorandom noise.[^prn]
15	15
16	16	Typical data compression algorithms are not [hash functions][hf],
17	17	where the goal is that a change to each bit in the input has a
18	18	statistically even chance of changing every bit in the output, but
19	19	because they do approach that pathological condition, pre-compressed
		@@ -34,11 +34,10 @@
34	34	the input data change on each checkin. This article will illustrate that
35	35	problem, quantify it, and give a solution to it.
36	36
37	37	[dc]: ./delta_format.wiki
38	38	[hf]: https://en.wikipedia.org/wiki/Hash_function
39		-[prn]: https://en.wikipedia.org/wiki/Pseudorandomness
40	39	[zl]: http://www.zlib.net/
41	40
42	41
43	42	## <a id="formats"></a>Affected File Formats
44	43
		@@ -93,11 +92,11 @@
93	92	image, check it in, and remember the new Fossil repo size.
94	93
95	94	5. Iterate on step 4 some number of times — currently 10 — and remember
96	95	the Fossil repo size at each step.
97	96
98		-6. Repeat the above steps for BMP, TIFF,³ and PNG.
	97	+6. Repeat the above steps for BMP, PNG, and TIFF.[^tiff-cmp]
99	98
100	99	7. Create a bar chart showing how the Fossil repository size changes
101	100	with each checkin.
102	101
103	102	We chose to use JupyterLab for this because it makes it easy for you to
		@@ -115,20 +114,20 @@
115	114	[wp]: http://wand-py.org/
116	115
117	116
118	117	## <a id="results"></a>Results
119	118
120		-Running the notebook gives a bar chart something like⁴ this:
	119	+Running the notebook gives a bar chart something like[^variance] this:
121	120
122	121	![results bar chart](./image-format-vs-repo-size.svg)
123	122
124	123	There are a few key things we want to draw your attention to in that
125	124	chart:
126	125
127	126	* BMP and uncompressed TIFF are nearly identical in size for all
128	127	checkins, and the repository growth rate is negligible past the
129		- first commit.⁵ We owe this economy to Fossil’s delta compression
	128	+ first commit.[^size-jump] We owe this economy to Fossil’s delta compression
130	129	feature: it is encoding each of those single-pixel changes in a very
131	130	small amount of repository space.
132	131
133	132	* The JPEG and PNG bars increase by large amounts on most checkins
134	133	even though each checkin also encodes only a single-pixel change.
		@@ -158,11 +157,11 @@
158	157	Since programs that produce and consume binary-compressed data files
159	158	often make it either difficult or impossible to work with the
160	159	uncompressed form, we want an automated method for producing the
161	160	uncompressed form to make Fossil happy while still having the compressed
162	161	form to keep our content creation applications happy. This `Makefile`
163		-should⁶ do that for BMP, PNG, SVG, and XLSX files:
	162	+should[^makefile] do that for BMP, PNG, SVG, and XLSX files:
164	163
165	164	.SUFFIXES: .bmp .png .svg .svgz
166	165
167	166	.svgz.svg:
168	167	gzip -dc < $< > $@
		@@ -194,11 +193,11 @@
194	193
195	194	This `Makefile` allows you to treat the compressed version as the
196	195	process input, but to actually check in only the changes against the
197	196	uncompressed version by typing “`make`” before “`fossil ci`”. This is
198	197	not actually an extra step in practice, since if you’ve got a
199		-`Makefile`-based project, you should be building (and testing!) it
	198	+`Makefile`-based project, you should be building — and testing — it
200	199	before checking each change in anyway!
201	200
202	201	Because this technique is based on dependency rules, only the necessary
203	202	files are generated on each `make` command.
204	203
		@@ -239,66 +238,65 @@
239	238	automatically-uncompressed PDF for the benefit of Fossil. Unlike
240	239	with the Excel case, there is no simple “file base name to directory
241	240	name” mapping, so we just created the `-big` to `-small` name scheme
242	241	here.
243	242
		-----
244	243
245		-
246		-## <a id="notes"></a>Footnotes and Digressions
247		-
248		-1. This problem is not Fossil-specific. Several other programs also do
	244	+[^delta-prgs]:
	245	+ This problem is not Fossil-specific. Several other programs also do
249	246	delta compression, so they’ll also be affected by this problem:
250	247	[rsync][rs], [Unison][us], [Git][git], etc. You should take this
251	248	article’s advice when using all such programs, not just Fossil.
252		-
253	249	When using file copying and synchronization programs without delta
254	250	compression, on the other hand, it’s best to use the most
255	251	highly-compressed file format you can tolerate, since they copy the
256	252	whole file any time any bit of it changes.
257	253
258		-2. In fact, a good way to gauge the effectiveness of a given
	254	+[^prn]:
	255	+ In fact, a good way to gauge the effectiveness of a given
259	256	compression scheme is to run its output through the same sort of
260	257	tests we use to gauge how “random” a given [PRNG][prng] is. Another
261	258	way to look at it is that if there is a discernible pattern in the
262	259	output of a compression scheme, that constitutes information (in
263	260	[the technical sense of that word][ith]) that could be further
264	261	compressed.
265	262
266		-3. We're using uncompressed TIFF here, not [LZW][lzw]- or
	263	+[^tiff-cmp]:
	264	+ We're using uncompressed TIFF here, not [LZW][lzw]- or
267	265	Zip-compressed TIFF, either of which would give similar results to
268	266	PNG, which is always zlib-compressed.
269	267
270		-4. The raw data changes somewhat from one run to the next due to the
	268	+[^variance]:
	269	+ The raw data changes somewhat from one run to the next due to the
271	270	use of random noise in the image to make the zlib/PNG compression
272	271	more difficult, and the random pixel changes. Those test design
273	272	choices make this a [Monte Carlo experiment][mce]. We’ve found that
274	273	the overall character of the results doesn’t change from one run to
275	274	the next.
276	275
277		-5. It’s not clear to me why there is a one-time jump in size for BMP
	276	+[^size-jump]:
	277	+ It’s not clear to me why there is a one-time jump in size for BMP
278	278	and TIFF past the first commit. I suspect it is due to the SQLite
279	279	indices being initialized for the first time.
280		-
281	280	Page size inflation might have something to do with it as well,
282	281	though we tried to control that by rebuilding the initial DB with a
283	282	minimal page size. If you re-run the program often enough, you will
284	283	sometimes see the BMP or TIFF bar jump higher than the other, again
285	284	likely due to one of the repos crossing a page boundary.
286		-
287	285	Another curious artifact in the data is that the BMP is slightly
288	286	larger than for the TIFF. This goes against expectation because a
289	287	low-tech format like BMP should have a small edge in this test
290	288	because TIFF metadata includes the option for multiple timestamps,
291	289	UUIDs, etc., which bloat the checkin size by creating many small
292	290	deltas.
293	291
294		-6. The `Makefile` above is not battle-tested. Please report bugs and
	292	+[^makefile]:
	293	+ The `Makefile` above is not battle-tested. Please report bugs and
295	294	needed extensions [on the forum][for].
296	295
297	296	[for]: https://fossil-scm.org/forum/forumpost/15e677f2c8
298	297	[git]: https://git-scm.com/
299	298	[ith]: https://en.wikipedia.org/wiki/Information_theory
300	299	[lzw]: https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch
301	300	[prng]: https://en.wikipedia.org/wiki/Pseudorandom_number_generator
302	301	[rs]: https://rsync.samba.org/
303	302	[us]: http://www.cis.upenn.edu/~bcpierce/unison/
304	303
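
The in-text half of this conversion is mechanical, so it can be sketched in a few lines of Python. This is only an illustration, not the method used for this commit: the function name `convert_markers` and the `labels` mapping are hypothetical, and the sketch handles only the superscript markers in running text. The other changes in the diff above (rewriting the numbered definitions as `[^label]:` blocks, unlinking "pseudorandom noise", and reordering the format list) would still need to be done by hand.

```python
import re

# Unicode superscript digits, in order, as used by the hand-crafted markers.
SUPERSCRIPTS = "⁰¹²³⁴⁵⁶⁷⁸⁹"
_TO_DIGIT = {ch: str(i) for i, ch in enumerate(SUPERSCRIPTS)}


def convert_markers(text, labels):
    """Replace superscript markers such as "¹" with "[^label]" references.

    `labels` (hypothetical) maps footnote numbers to label names.
    Markers with no entry in `labels` are left untouched.
    """
    def repl(match):
        # Translate a run of superscript digits into an ordinary integer.
        number = int("".join(_TO_DIGIT[ch] for ch in match.group(0)))
        label = labels.get(number)
        return f"[^{label}]" if label else match.group(0)

    return re.sub(f"[{SUPERSCRIPTS}]+", repl, text)
```

With the labels chosen in this commit, `convert_markers("on check-in.¹", {1: "delta-prgs"})` yields `on check-in.[^delta-prgs]`.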


