Fossil SCM

fossil-scm / www / delta_encoder_algorithm.wiki

Source Rendered

Blame History Raw 249 lines

1	`<title>Fossil Delta Encoding Algorithm</title>`
2
3	`<h2>Abstract</h2>`
4
5	`<p>A key component for the efficient storage of multiple revisions of`
6	`a file in fossil repositories is the use of delta-compression, i.e. to`
7	`store only the changes between revisions instead of the whole`
8	`file.</p>`
9
10	`<p>This document describes the encoding algorithm used by Fossil to`
11	`generate deltas. It is targeted at developers working on either`
12	`<a href="index.wiki">fossil</a> itself, or on tools compatible with`
13	`it. The exact format of the generated byte-sequences, while in general`
14	`not necessary to understand encoder operation, can be found in the`
15	`companion specification titled "<a href="delta_format.wiki">Fossil`
16	`Delta Format</a>".`
17	`</p>`
18
19	`<p>The algorithm is inspired`
20	`by <a href="http://samba.anu.edu.au/rsync/">rsync</a>.</p>`
21
22	`<h2 id="argresparam">1.0 Arguments, Results, and Parameters</h2>`
23
24	`<p>The encoder takes two byte-sequences as input, the "original", and`
25	`the "target", and returns a single byte-sequence containing the`
26	`"delta" which transforms the original into the target upon its`
27	`application.</p>`
28
29	`<p>Note that the data of a "byte-sequence" includes its length,`
30	`i.e. the number of bytes contained in the sequence.</p>`
31
32	`<p>The algorithm has one parameter named "NHASH", the size of the`
33	`"sliding window" for the "rolling hash", in bytes. These two terms are`
34	`explained in the next section. The value of this parameter has to be a`
35	`power of two for the algorithm to work. For Fossil the value of this`
36	`parameter is set to "16".</p>`
37
38	`<h2 id="operation">2.0 Operation</h2>`
39
40	`<p>The algorithm is split into three phases which generate`
41	`the <a href="delta_format.wiki#header">header</a>,`
42	`<a href="delta_format.wiki#slist">segment list</a>,`
43	`and <a href="delta_format.wiki#trailer">trailer</a> of the delta, per`
44	`its general <a href="delta_format.wiki#structure">structure</a>.</p>`
45
46	`<p>The two phases generating header and trailer are not covered here`
47	`as their implementation trivially follows directly from the`
48	`specification of the <a href="delta_format.wiki">delta format</a>.</p>`
49
50	`<p>This leaves the segment-list. Its generation is done in two phases,`
51	`a pre-processing step operating on the "original" byte-sequence,`
52	`followed by the processing of the "target" byte-sequence using the`
53	`information gathered by the first step.</p>`
54
55	`<h3 id="preprocessing">2.1 Preprocessing the original</h3>`
56
57	`<p>A major part of the processing of the "target" is to find a range`
58	`in the "original" which contains the same content as found at the`
59	`current location in the "target".</p>`
60
61	`<p>A naive approach to this would be to search the whole "original"`
62	`for such content. This however is very inefficient as it would search`
63	`the same parts of the "original" over and over. What is done instead`
64	`is to sample the "original" at regular intervals, compute signatures`
65	`for the sampled locations and store them in a hash table keyed by`
66	`these signatures.</p>`
67
68	`<p>That is what happens in this step. The following processing step`
69	`can then the compute signature for its current location and then has`
70	`to search only a narrow set of locations in the "original" for`
71	`possible matches, namely those which have the same signature.</p>`
72
73	`<p>In detail:</p>`
74
75	`<ol>`
76	`<li>The "original" is split into chunks of NHASH bytes. Note that a`
77	`partial chunk of less than NHASH bytes at the end of "original" is`
78	`ignored.`
79	`</li>`
80	`<li>The <a href="#rollhash">rolling hash</a> of each chunk is`
81	`computed.`
82	`</li>`
83	`<li>A hash table is filled, mapping from the hashes of the chunks to`
84	`the list of chunk locations having this hash.`
85	`</li>`
86	`</ol>`
87
88	`<h3 id="processing">2.2 Processing the target</h3>`
89
90	`<p>This, the main phase of the encoder, processes the target in a loop`
91	`from beginning to end. The state of the encoder is captured by two`
92	`locations, the "base" and the "slide". "base" points to the first byte`
93	`of the target for which no delta output has been generated yet, and`
94	`"slide" is the location of the window used to look in the "origin" for`
95	`commonalities. This window is NHASH bytes long.</p>`
96
97	`<p>Initially both "base" and "slide" point to the beginning of the`
98	`"target". In each iteration of the loop the encoder decides whether to`
99	`<ul>`
100	`<li>emit a single instruction`
101	`to <a href="delta_format.wiki#copyrange">copy a range</a>, or`
102	`</li>`
103	`<li>emit two instructions, first`
104	`to <a href="delta_format.wiki#insertlit">insert a literal</a>, then`
105	`to <a href="delta_format.wiki#copyrange">copy a range</a>, or`
106	`</li>`
107	`<li>move the window forward one byte.`
108	`</li>`
109	`</ul>`
110	`</p>`
111
112	`<verbatim type="pikchr float-right">`
113	`TARGET: [`
114	`scale = 0.8`
115	`down`
116	`"Target" bold`
117	`box fill palegreen width 150% height 200% "Processed"`
118	`GI: box same as first box fill yellow height 25% "Gap → Insert"`
119	`CC: box same fill orange height 200% "Common → Copy"`
120	`W: box same as GI fill lightgray width 125% height 200% "Window" bold`
121	`box same as CC height 125% ""`
122	`box same fill white ""`
123	`]`
124
125	`[ "Base" bold ; right ; arrow 33% ] with .e at TARGET.GI.nw`
126	`[ "Slide" bold ; right ; arrow 33% ] with .e at TARGET.W.nw`
127
128	`ORIGIN: [`
129	`down`
130	`"Origin" bold`
131	`B1: box fill white`
132	`B2: box fill orange height 200%`
133	`B3: box fill white height 200%`
134	`] with .nw at 0.75 right of TARGET.ne`
135
136	`arrow from TARGET.W.e to ORIGIN.B2.w "Signature" aligned above`
137	`</verbatim>`
138
139	`<p>To make this decision the encoder first computes the hash value for`
140	`the NHASH bytes in the window and then looks at all the locations in`
141	`the "origin" which have the same signature. This part uses the hash`
142	`table created by the pre-processing step to efficiently find these`
143	`locations.</p>`
144
145	`<p>For each of the possible candidates the encoder finds the maximal`
146	`range of bytes common to both "origin" and "target", going forward and`
147	`backward from "slide" in the "target", and the candidate location in`
148	`the "origin". This search is constrained on the side of the "target"`
149	`by the "base" (backward search), and the end of the "target" (forward`
150	`search), and on the side of the "origin" by the beginning and end of`
151	`the "origin", respectively.</p>`
152
153	`<p>There are input files for which the hash chains generated by the`
154	`pre-processing step can become very long, leading to long search times`
155	`and affecting the performance of the delta generator. To limit the`
156	`effect such long chains can have the actual search for candidates is`
157	`bounded, looking at most N candidates. Currently N is set to 250.</p>`
158
159	`<p>From the ranges for all the candidates the best (= largest) common`
160	`range is taken and it is determined how many bytes are needed to`
161	`encode the bytes between the "base" and the end of that range. If the`
162	`range extended back to the "base" then this can be done in a single`
163	`copy instruction. Otherwise, i.e if there is a gap between the "base"`
164	`and the beginning of the range then two instructions are needed, one`
165	`to insert the bytes in the gap as a literal, and a copy instruction`
166	`for the range itself. The general situation at this point can be seen`
167	`in the picture to the right.</p>`
168
169	`<p>If the number of bytes needed to encode both gap (if present), and`
170	`range is less than the number of bytes we are encoding the encoder`
171	`will emit the necessary instructions as described above, set "base"`
172	`and "slide" to the end of the encoded range and start the next`
173	`iteration at that point.</p>`
174
175	`<p>If, on the other hand, the encoder either did not find candidate`
176	`locations in the origin, or the best range coming out of the search`
177	`needed more bytes to encode the range than there were bytes in the`
178	`range, then no instructions are emitted and the window is moved one`
179	`byte forward. The "base" is left unchanged in that case.</p>`
180
181	`<p>The processing loop stops at one of two conditions:`
182	`<ol>`
183	`<li>The encoder decided to move the window forward, but the end of the`
184	`window reached the end of the "target".`
185	`</li>`
186	`<li>After the emission of instructions the new "base" location is`
187	`within NHASH bytes of end of the "target", i.e. there are no more than`
188	`at most NHASH bytes left.`
189	`</li>`
190	`</ol>`
191	`</p>`
192
193	`<p>If the processing loop left bytes unencoded, i.e. "base" not`
194	`exactly at the end of the "target", as is possible for both end`
195	`conditions, then one last insert instruction is emitted to put these`
196	`bytes into the delta.<p>`
197
198	`<h2 id="exceptions">3.0 Exceptions</h2>`
199
200	`<p>If the "original" is at most NHASH bytes long no compression of`
201	`changes is possible, and the segment-list of the delta consists of a`
202	`single literal which contains the entire "target".</p>`
203
204	`<p>This is actually equivalent to the second end condition of the`
205	`processing loop described in the previous section, just checked before`
206	`actually entering the loop.</p>`
207
208	`<h2 id="rollhash">4.0 The rolling hash</h2>`
209
210	`<p>The rolling hash described below and used to compute content`
211	`signatures was chosen not only for good hashing properties, but also`
212	`to enable the easy (incremental) recalculation of its value for a`
213	`sliding window, i.e. where the oldest byte is removed from the window`
214	`and a new byte is shifted in.<p>`
215
216	`<h3 id="rhdef">4.1 Definition</h3>`
217
218	`<p>Assuming an array Z of NHASH bytes (indexing starting at 0) the`
219	`hash V is computed via</p>`
220
221	`<div align="center"><img src="encode1.gif"></div>`
222	`<div align="center"><img src="encode2.gif"></div>`
223	`<div align="center"><img src="encode3.gif"></div>`
224
225	`where A and B are unsigned 16-bit integers (hence the <u>mod</u>), and`
226	`V is a 32-bit unsigned integer with B as MSB, A as LSB.`
227
228	`<h3 id="rhincr">4.2 Incremental recalculation</h3>`
229
230	`<p>Assuming an array Z of NHASH bytes (indexing starting at 0) with`
231	`hash V (and components A and B), the dropped`
232	`byte <img src="encode4.gif" align="center">, and the new byte`
233	`<img src="encode5.gif" align="center"> , the new hash can`
234	`be computed incrementally via: </p>`
235
236	`<div align="center"><img src="encode6.gif"></div>`
237	`<div align="center"><img src="encode7.gif"></div>`
238	`<div align="center"><img src="encode8.gif"></div>`
239
240	`<p>For A, the regular sum, it can be seen easily that this the correct`
241	`way recomputing that component.</p>`
242
243	`<p>For B, the weighted sum, note first that <img src="encode4.gif"`
244	`align="center"> has the weight NHASH in the sum, so that is what has`
245	`to be removed. Then adding in <img src="encode9.gif" align="center">`
246	`adds one weight factor to all the other values of Z, and at last adds`
247	`in <img src="encode5.gif" align="center"> with weight 1, also`
248	`generating the correct new sum</p>`
249

Fossil SCM

Keyboard Shortcuts