Fossil SCM
Reworked my notes regarding 'reconstruct' based on my reading of content.c, checkin.c, and manifest.c
Commit
10062df2fa9b02069f0d3c90dd1d990717a921a9
Parent
8857e1eabbea146…
1 file changed
+65
-38
--- ci_fossil.txt
+++ ci_fossil.txt
@@ -58,43 +58,70 @@
 
 ----------------------------------
 
 reconstruct
 
-It is currently not clear to me when and how fossil does
-delta-compression. Does it use deltas, or reverse deltas, or a
-combination of both ? And when does it generate the deltas ?
-
-The trivial solution is that it uses deltas, i.e. the first revision
-of a file is stored without delta compression and all future versions
-are deltas from that, and the delta is generated when the new revision
-is committed. With the obvious disadvantage that newer revisions take
-more and more time to be decompressed as the set of deltas to apply
-grows.
-
-And during xfer it simply sends the deltas as is, making for easy
-integration on the remote side.
-
-However reconstruct, that method sees initially only an unordered
-collection of files, some of which may be manifests, others are data
-files, and if it imports them in a random order it might find that
-file X, which was imported first and therefore has no delta
-compression, is actually somewhere in the middle of a line of
-revisions, and should be delta-compressed, and then it has to find out
-the predecessor and do the compression, etc.
-
-So depending on how the internal logic of delta-compression is done
-reconstruct might need more logic to help the lower level achieve good
-compression.
-
-Like, in a first pass determine which files are manifests, and read
-enough of them to determine their parent/child structure, and in a
-second pass actually imports them, in topological order, with all
-relevant non-manifest files for a manifest imported as that time
-too. With that the underlying engine would see the files basically in
-the same order as generated by a regular series of commits.
-
-Problems for reconstruct: Files referenced, but not present, and,
-conversely, files present, but not referenced. This can done as part
-of the second pass, aborting when a missing file is encountered, with
-(un)marking of used files, and at the end we know the unused
-files. Could also be a separate pass between first and second.
+The core logic for handling content is in the file "content.c", in
+particular the functions 'content_put' and 'content_deltify'. One of
+the main users of these functions is in the file "checkin.c", see the
+function 'commit_cmd'.
+
+The logic is clear. The new and modified files are simply stored
+without delta-compression, using 'content_put'. And should fossil have
+an id for the _previous_ revision of the committed file it uses
+'content_deltify' to convert the already stored data for that revision
+into a delta with the just stored new revision as origin.
+
+In other words, fossil produces reverse deltas, with leaf revisions
+stored just zip-compressed (plain) and older revisions using both zip-
+and delta-compression.
+
+Of note is that the underlying logic in 'content_deltify' gives up on
+delta compression if the involved files are either not large enough,
+or if the achieved compression factor is not high enough. In that
+case the old revision of the file is left plain.
+
+The scheme can thus be called a 'truncated reverse delta'.
+
+The manifest is created and committed after the modified files. It
+uses the same logic as for the regular files. The new leaf is stored
+plain, and storage of the parent manifest is modified to be a delta
+with the current one as origin.
+
+Further note that for a checkin of a merge result only the primary
+parent is treated in that way. The secondary parent, the one merged
+into the current revision, is not touched. I.e. from the storage
+layer's point of view that revision is still a leaf and its data is
+kept stored plain, not delta-compressed.
+
+
+
+Now the "reconstruct" can be done like so:
+
+- Scan the files in the indicated directory, and look for a manifest.
+
+- When a manifest has been found, parse its contents and follow the
+  chain of parent links to locate the root manifest (no parent).
+
+- Import the files referenced by the root manifest, then the manifest
+  itself. This can be done using a modified form of 'commit_cmd'
+  which does not have to construct a manifest on its own from vfile,
+  vmerge, etc.
+
+- After that, recursively apply the import of the previous step to the
+  children of the root, and so on.
+
+For an incremental "reconstruct" the collection of files would not be
+a single tree with a root, but a forest, and the roots to look for are
+not manifests without a parent, but ones whose parent is already
+present in the repository. After one such root has been found and
+processed, the unprocessed files have to be searched further for more
+roots, and only when no more are found will the remaining files be
+considered superfluous.
+
+We can use the functions in "manifest.c" for parsing and following
+the parental chain.
+
+Hm. But we have no direct child information. So the above algorithm
+has to be modified: we have to scan all manifests before we start
+importing, and we have to create a reverse index, from manifest to
+children, so that we can perform the import from root to leaves.
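The store-plain, then deltify-the-parent sequence described in the new notes can be sketched as a toy model. This is not fossil's actual code: the real 'content_put' and 'content_deltify' in content.c operate on blobs inside the SQLite repository and use fossil's own delta encoder, and the size and ratio cutoffs below are invented purely for illustration.

```c
/* Toy model of fossil's "truncated reverse delta" storage.  The
 * function names mirror content.c, but the delta format and the
 * MIN_SIZE / ratio thresholds are made up for this sketch. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

enum { MIN_SIZE = 50 };    /* below this, deltifying is not worth it */

typedef struct {
    char *data;            /* stored bytes: plain text or a fake delta */
    int   is_delta;        /* 1 once re-stored as a reverse delta      */
} Blob;

static char *dupstr(const char *s) {
    char *d = malloc(strlen(s) + 1);
    strcpy(d, s);
    return d;
}

/* Store a new revision plain (stand-in for content_put). */
static Blob content_put(const char *text) {
    Blob b;
    b.data = dupstr(text);
    b.is_delta = 0;
    return b;
}

/* Re-store the OLD revision as a delta with the NEW revision as
 * origin (stand-in for content_deltify).  Gives up, leaving the old
 * revision plain, if the content is too small or the delta would not
 * save enough space: that is the "truncated" part of the scheme. */
static void content_deltify(Blob *old, const char *old_text,
                            const char *new_text) {
    size_t n = strlen(old_text), p = 0;
    if (n < MIN_SIZE) return;                  /* too small: stay plain  */
    while (old_text[p] && new_text[p] && old_text[p] == new_text[p]) p++;
    size_t delta_len = (n - p) + 16;           /* fake delta + overhead  */
    if (delta_len * 4 > n * 3) return;         /* poor ratio: stay plain */
    char *d = malloc(delta_len + 32);
    snprintf(d, delta_len + 32, "@%zu:%s", p, old_text + p);
    free(old->data);
    old->data = d;
    old->is_delta = 1;
}
```

Note that the leaf always stays plain; only the already-stored parent is ever rewritten, which matches the observation above that leaves read fast while older revisions pay the delta cost.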
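The reverse-index variant worked out at the end of the notes (scan all manifests first, then import from the roots toward the leaves) can be sketched like so. The 'Manifest' struct and the small integer ids are hypothetical stand-ins for what the parser in manifest.c would actually yield; fossil itself identifies artifacts by hash, not by small integers.

```c
/* Two-pass "reconstruct" ordering: build a reverse (parent ->
 * children) index over the scanned manifests, then walk it
 * breadth-first from the roots so every manifest is imported after
 * its parent.  Fixed-size arrays keep the sketch short. */
enum { MAXM = 64, NONE = -1 };

typedef struct {
    int id;        /* artifact id of this manifest              */
    int parent;    /* id of the primary parent, NONE for a root */
} Manifest;

/* Fill 'order' with manifest ids, parents before children; returns
 * how many were placed.  A manifest whose parent is not among the
 * scanned files is also treated as a root, which is exactly the
 * incremental case where the parent is already in the repository. */
static int import_order(const Manifest *m, int n, int *order) {
    static int child[MAXM][MAXM];  /* reverse index: slot -> child slots */
    int nchild[MAXM] = {0};
    int queue[MAXM];
    int head = 0, tail = 0, count = 0;

    /* Pass 1: build the reverse index and collect the roots. */
    for (int i = 0; i < n; i++) {
        int p = -1;
        for (int j = 0; j < n; j++)
            if (m[i].parent != NONE && m[j].id == m[i].parent) p = j;
        if (p < 0) queue[tail++] = i;          /* root of a tree */
        else child[p][nchild[p]++] = i;
    }
    /* Pass 2: breadth-first walk, roots first, then their children. */
    while (head < tail) {
        int i = queue[head++];
        order[count++] = m[i].id;
        for (int k = 0; k < nchild[i]; k++)
            queue[tail++] = child[i][k];
    }
    return count;   /* count < n means some files were never reached */
}
```

A return value below n flags exactly the leftover files that the incremental case must treat as superfluous (or as referencing a missing parent).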