Fossil SCM
Reworked my notes regarding 'reconstruct' based on my reading of content.c, checkin.c, and manifest.c
Commit
10062df2fa9b02069f0d3c90dd1d990717a921a9
Parent
8857e1eabbea146…
1 file changed
+65
-38
--- ci_fossil.txt
+++ ci_fossil.txt
@@ -58,43 +58,70 @@
 
 ----------------------------------
 
 reconstruct
 
-It is currently not clear to me when and how fossil does
-delta-compression. Does it use deltas, or reverse deltas, or a
-combination of both ? And when does it generate the deltas ?
-
-The trivial solution is that it uses deltas, i.e. the first revision
-of a file is stored without delta compression and all future versions
-are deltas from that, and the delta is generated when the new revision
-is committed. With the obvious disadvantage that newer revisions take
-more and more time to be decompressed as the set of deltas to apply
-grows.
-
-And during xfer it simply sends the deltas as is, making for easy
-integration on the remote side.
-
-However reconstruct, that method sees initially only an unordered
-collection of files, some of which may be manifests, others are data
-files, and if it imports them in a random order it might find that
-file X, which was imported first and therefore has no delta
-compression, is actually somewhere in the middle of a line of
-revisions, and should be delta-compressed, and then it has to find out
-the predecessor and do the compression, etc.
-
-So depending on how the internal logic of delta-compression is done
-reconstruct might need more logic to help the lower level achieve good
-compression.
-
-Like, in a first pass determine which files are manifests, and read
-enough of them to determine their parent/child structure, and in a
-second pass actually imports them, in topological order, with all
-relevant non-manifest files for a manifest imported as that time
-too. With that the underlying engine would see the files basically in
-the same order as generated by a regular series of commits.
-
-Problems for reconstruct: Files referenced, but not present, and,
-conversely, files present, but not referenced. This can done as part
-of the second pass, aborting when a missing file is encountered, with
-(un)marking of used files, and at the end we know the unused
-files. Could also be a separate pass between first and second.
+The core logic for handling content is in the file "content.c", in
+particular the functions 'content_put' and 'content_deltify'. One of
+the main users of these functions is in the file "checkin.c", see the
+function 'commit_cmd'.
+
+The logic is clear. The new and modified files are simply stored
+without delta-compression, using 'content_put'. And should fossil have
+an id for the _previous_ revision of the committed file it uses
+'content_deltify' to convert the already stored data for that revision
+into a delta with the just stored new revision as origin.
+
+In other words, fossil produces reverse deltas, with leaf revisions
+stored just zip-compressed (plain) and older revisions using both zip-
+and delta-compression.
+
+Of note is that the underlying logic in 'content_deltify' gives up on
+delta compression if the involved files are either not large enough,
+or if the achieved compression factor is not high enough. In that
+case the old revision of the file is left plain.
+
+The scheme can thus be called a 'truncated reverse delta'.
+
+The manifest is created and committed after the modified files. It
+uses the same logic as for the regular files. The new leaf is stored
+plain, and storage of the parent manifest is modified to be a delta
+with the current one as origin.
+
+Further note that for a checkin of a merge result only the primary
+parent is treated in that way. The secondary parent, the one merged
+into the current revision, is not touched. I.e. from the storage
+layer's point of view that revision is still a leaf and its data is
+kept stored plain, not delta-compressed.
+
+
+
+Now the "reconstruct" can be done like so:
+
+- Scan the files in the indicated directory, and look for a manifest.
+
+- When a manifest has been found, parse its contents and follow the
+  chain of parent links to locate the root manifest (no parent).
+
+- Import the files referenced by the root manifest, then the manifest
+  itself. This can be done using a modified form of 'commit_cmd'
+  which does not have to construct a manifest on its own from vfile,
+  vmerge, etc.
+
+- After that, recursively apply the import of the previous step to the
+  children of the root, and so on.
+
+For an incremental "reconstruct" the collection of files would not be
+a single tree with a root, but a forest, and the roots to look for are
+not manifests without a parent, but ones whose parent is already
+present in the repository. After one such root has been found and
+processed, the unprocessed files have to be searched further for more
+roots, and only when no more are found will the remaining files be
+considered superfluous.
+
+We can use the functions in "manifest.c" for parsing and following
+the parental chain.
+
+Hm. But we have no direct child information. So the above algorithm
+has to be modified: we have to scan all manifests before we start
+importing, and we have to create a reverse index, from manifest to
+children, so that we can perform the import from root to leaves.
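The store-plain, then deltify-the-parent sequence described in the new notes can be sketched as a toy model. This is not fossil's actual code: the real 'content_put' and 'content_deltify' in content.c operate on blobs inside the SQLite repository and use fossil's own delta encoder, and the size and ratio cutoffs below are invented purely for illustration.

```c
/* Toy model of fossil's "truncated reverse delta" storage.  The
 * function names mirror content.c, but the delta format and the
 * MIN_SIZE / ratio thresholds are made up for this sketch. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

enum { MIN_SIZE = 50 };    /* below this, deltifying is not worth it */

typedef struct {
    char *data;            /* stored bytes: plain text or a fake delta */
    int   is_delta;        /* 1 once re-stored as a reverse delta      */
} Blob;

static char *dupstr(const char *s) {
    char *d = malloc(strlen(s) + 1);
    strcpy(d, s);
    return d;
}

/* Store a new revision plain (stand-in for content_put). */
static Blob content_put(const char *text) {
    Blob b;
    b.data = dupstr(text);
    b.is_delta = 0;
    return b;
}

/* Re-store the OLD revision as a delta with the NEW revision as
 * origin (stand-in for content_deltify).  Gives up, leaving the old
 * revision plain, if the content is too small or the delta would not
 * save enough space: that is the "truncated" part of the scheme. */
static void content_deltify(Blob *old, const char *old_text,
                            const char *new_text) {
    size_t n = strlen(old_text), p = 0;
    if (n < MIN_SIZE) return;                  /* too small: stay plain  */
    while (old_text[p] && new_text[p] && old_text[p] == new_text[p]) p++;
    size_t delta_len = (n - p) + 16;           /* fake delta + overhead  */
    if (delta_len * 4 > n * 3) return;         /* poor ratio: stay plain */
    char *d = malloc(delta_len + 32);
    snprintf(d, delta_len + 32, "@%zu:%s", p, old_text + p);
    free(old->data);
    old->data = d;
    old->is_delta = 1;
}
```

Note that the leaf always stays plain; only the already-stored parent is ever rewritten, which matches the observation above that leaves read fast while older revisions pay the delta cost.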
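The reverse-index variant worked out at the end of the notes (scan all manifests first, then import from the roots toward the leaves) can be sketched like so. The 'Manifest' struct and the small integer ids are hypothetical stand-ins for what the parser in manifest.c would actually yield; fossil itself identifies artifacts by hash, not by small integers.

```c
/* Two-pass "reconstruct" ordering: build a reverse (parent ->
 * children) index over the scanned manifests, then walk it
 * breadth-first from the roots so every manifest is imported after
 * its parent.  Fixed-size arrays keep the sketch short. */
enum { MAXM = 64, NONE = -1 };

typedef struct {
    int id;        /* artifact id of this manifest              */
    int parent;    /* id of the primary parent, NONE for a root */
} Manifest;

/* Fill 'order' with manifest ids, parents before children; returns
 * how many were placed.  A manifest whose parent is not among the
 * scanned files is also treated as a root, which is exactly the
 * incremental case where the parent is already in the repository. */
static int import_order(const Manifest *m, int n, int *order) {
    static int child[MAXM][MAXM];  /* reverse index: slot -> child slots */
    int nchild[MAXM] = {0};
    int queue[MAXM];
    int head = 0, tail = 0, count = 0;

    /* Pass 1: build the reverse index and collect the roots. */
    for (int i = 0; i < n; i++) {
        int p = -1;
        for (int j = 0; j < n; j++)
            if (m[i].parent != NONE && m[j].id == m[i].parent) p = j;
        if (p < 0) queue[tail++] = i;          /* root of a tree */
        else child[p][nchild[p]++] = i;
    }
    /* Pass 2: breadth-first walk, roots first, then their children. */
    while (head < tail) {
        int i = queue[head++];
        order[count++] = m[i].id;
        for (int k = 0; k < nchild[i]; k++)
            queue[tail++] = child[i][k];
    }
    return count;   /* count < n means some files were never reached */
}
```

A return value below n flags exactly the leftover files that the incremental case must treat as superfluous (or as referencing a missing parent).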