|
1
|
<title>Fossil Delta Encoding Algorithm</title> |
|
2
|
|
|
3
|
<h2>Abstract</h2> |
|
4
|
|
|
5
|
<p>A key component for the efficient storage of multiple revisions of |
|
6
|
a file in fossil repositories is the use of delta-compression, i.e. to |
|
7
|
store only the changes between revisions instead of the whole |
|
8
|
file.</p> |
|
9
|
|
|
10
|
<p>This document describes the encoding algorithm used by Fossil to |
|
11
|
generate deltas. It is targeted at developers working on either |
|
12
|
<a href="index.wiki">fossil</a> itself, or on tools compatible with |
|
13
|
it. The exact format of the generated byte-sequences, while in general |
|
14
|
not necessary to understand encoder operation, can be found in the |
|
15
|
companion specification titled "<a href="delta_format.wiki">Fossil |
|
16
|
Delta Format</a>". |
|
17
|
</p> |
|
18
|
|
|
19
|
<p>The algorithm is inspired |
|
20
|
by <a href="http://samba.anu.edu.au/rsync/">rsync</a>.</p> |
|
21
|
|
|
22
|
<h2 id="argresparam">1.0 Arguments, Results, and Parameters</h2> |
|
23
|
|
|
24
|
<p>The encoder takes two byte-sequences as input, the "original", and |
|
25
|
the "target", and returns a single byte-sequence containing the |
|
26
|
"delta" which transforms the original into the target upon its |
|
27
|
application.</p> |
|
28
|
|
|
29
|
<p>Note that the data of a "byte-sequence" includes its length, |
|
30
|
i.e. the number of bytes contained in the sequence.</p> |
|
31
|
|
|
32
|
<p>The algorithm has one parameter named "NHASH", the size of the |
|
33
|
"sliding window" for the "rolling hash", in bytes. These two terms are |
|
34
|
explained in the next section. The value of this parameter has to be a |
|
35
|
power of two for the algorithm to work. For Fossil the value of this |
|
36
|
parameter is set to "16".</p> |
|
37
|
|
|
38
|
<h2 id="operation">2.0 Operation</h2> |
|
39
|
|
|
40
|
<p>The algorithm is split into three phases which generate |
|
41
|
the <a href="delta_format.wiki#header">header</a>, |
|
42
|
<a href="delta_format.wiki#slist">segment list</a>, |
|
43
|
and <a href="delta_format.wiki#trailer">trailer</a> of the delta, per |
|
44
|
its general <a href="delta_format.wiki#structure">structure</a>.</p> |
|
45
|
|
|
46
|
<p>The two phases generating header and trailer are not covered here |
|
47
|
as their implementation trivially follows directly from the |
|
48
|
specification of the <a href="delta_format.wiki">delta format</a>.</p> |
|
49
|
|
|
50
|
<p>This leaves the segment-list. Its generation is done in two phases, |
|
51
|
a pre-processing step operating on the "original" byte-sequence, |
|
52
|
followed by the processing of the "target" byte-sequence using the |
|
53
|
information gathered by the first step.</p> |
|
54
|
|
|
55
|
<h3 id="preprocessing">2.1 Preprocessing the original</h3> |
|
56
|
|
|
57
|
<p>A major part of the processing of the "target" is to find a range |
|
58
|
in the "original" which contains the same content as found at the |
|
59
|
current location in the "target".</p> |
|
60
|
|
|
61
|
<p>A naive approach to this would be to search the whole "original" |
|
62
|
for such content. This however is very inefficient as it would search |
|
63
|
the same parts of the "original" over and over. What is done instead |
|
64
|
is to sample the "original" at regular intervals, compute signatures |
|
65
|
for the sampled locations and store them in a hash table keyed by |
|
66
|
these signatures.</p> |
|
67
|
|
|
68
|
<p>That is what happens in this step. The following processing step |
|
69
|
can then the compute signature for its current location and then has |
|
70
|
to search only a narrow set of locations in the "original" for |
|
71
|
possible matches, namely those which have the same signature.</p> |
|
72
|
|
|
73
|
<p>In detail:</p> |
|
74
|
|
|
75
|
<ol> |
|
76
|
<li>The "original" is split into chunks of NHASH bytes. Note that a |
|
77
|
partial chunk of less than NHASH bytes at the end of "original" is |
|
78
|
ignored. |
|
79
|
</li> |
|
80
|
<li>The <a href="#rollhash">rolling hash</a> of each chunk is |
|
81
|
computed. |
|
82
|
</li> |
|
83
|
<li>A hash table is filled, mapping from the hashes of the chunks to |
|
84
|
the list of chunk locations having this hash. |
|
85
|
</li> |
|
86
|
</ol> |
|
87
|
|
|
88
|
<h3 id="processing">2.2 Processing the target</h3> |
|
89
|
|
|
90
|
<p>This, the main phase of the encoder, processes the target in a loop |
|
91
|
from beginning to end. The state of the encoder is captured by two |
|
92
|
locations, the "base" and the "slide". "base" points to the first byte |
|
93
|
of the target for which no delta output has been generated yet, and |
|
94
|
"slide" is the location of the window used to look in the "origin" for |
|
95
|
commonalities. This window is NHASH bytes long.</p> |
|
96
|
|
|
97
|
<p>Initially both "base" and "slide" point to the beginning of the |
|
98
|
"target". In each iteration of the loop the encoder decides whether to |
|
99
|
<ul> |
|
100
|
<li>emit a single instruction |
|
101
|
to <a href="delta_format.wiki#copyrange">copy a range</a>, or |
|
102
|
</li> |
|
103
|
<li>emit two instructions, first |
|
104
|
to <a href="delta_format.wiki#insertlit">insert a literal</a>, then |
|
105
|
to <a href="delta_format.wiki#copyrange">copy a range</a>, or |
|
106
|
</li> |
|
107
|
<li>move the window forward one byte. |
|
108
|
</li> |
|
109
|
</ul> |
|
110
|
</p> |
|
111
|
|
|
112
|
<verbatim type="pikchr float-right"> |
|
113
|
TARGET: [ |
|
114
|
scale = 0.8 |
|
115
|
down |
|
116
|
"Target" bold |
|
117
|
box fill palegreen width 150% height 200% "Processed" |
|
118
|
GI: box same as first box fill yellow height 25% "Gap → Insert" |
|
119
|
CC: box same fill orange height 200% "Common → Copy" |
|
120
|
W: box same as GI fill lightgray width 125% height 200% "Window" bold |
|
121
|
box same as CC height 125% "" |
|
122
|
box same fill white "" |
|
123
|
] |
|
124
|
|
|
125
|
[ "Base" bold ; right ; arrow 33% ] with .e at TARGET.GI.nw |
|
126
|
[ "Slide" bold ; right ; arrow 33% ] with .e at TARGET.W.nw |
|
127
|
|
|
128
|
ORIGIN: [ |
|
129
|
down |
|
130
|
"Origin" bold |
|
131
|
B1: box fill white |
|
132
|
B2: box fill orange height 200% |
|
133
|
B3: box fill white height 200% |
|
134
|
] with .nw at 0.75 right of TARGET.ne |
|
135
|
|
|
136
|
arrow from TARGET.W.e to ORIGIN.B2.w "Signature" aligned above |
|
137
|
</verbatim> |
|
138
|
|
|
139
|
<p>To make this decision the encoder first computes the hash value for |
|
140
|
the NHASH bytes in the window and then looks at all the locations in |
|
141
|
the "origin" which have the same signature. This part uses the hash |
|
142
|
table created by the pre-processing step to efficiently find these |
|
143
|
locations.</p> |
|
144
|
|
|
145
|
<p>For each of the possible candidates the encoder finds the maximal |
|
146
|
range of bytes common to both "origin" and "target", going forward and |
|
147
|
backward from "slide" in the "target", and the candidate location in |
|
148
|
the "origin". This search is constrained on the side of the "target" |
|
149
|
by the "base" (backward search), and the end of the "target" (forward |
|
150
|
search), and on the side of the "origin" by the beginning and end of |
|
151
|
the "origin", respectively.</p> |
|
152
|
|
|
153
|
<p>There are input files for which the hash chains generated by the |
|
154
|
pre-processing step can become very long, leading to long search times |
|
155
|
and affecting the performance of the delta generator. To limit the |
|
156
|
effect such long chains can have the actual search for candidates is |
|
157
|
bounded, looking at most N candidates. Currently N is set to 250.</p> |
|
158
|
|
|
159
|
<p>From the ranges for all the candidates the best (= largest) common |
|
160
|
range is taken and it is determined how many bytes are needed to |
|
161
|
encode the bytes between the "base" and the end of that range. If the |
|
162
|
range extended back to the "base" then this can be done in a single |
|
163
|
copy instruction. Otherwise, i.e if there is a gap between the "base" |
|
164
|
and the beginning of the range then two instructions are needed, one |
|
165
|
to insert the bytes in the gap as a literal, and a copy instruction |
|
166
|
for the range itself. The general situation at this point can be seen |
|
167
|
in the picture to the right.</p> |
|
168
|
|
|
169
|
<p>If the number of bytes needed to encode both gap (if present), and |
|
170
|
range is less than the number of bytes we are encoding the encoder |
|
171
|
will emit the necessary instructions as described above, set "base" |
|
172
|
and "slide" to the end of the encoded range and start the next |
|
173
|
iteration at that point.</p> |
|
174
|
|
|
175
|
<p>If, on the other hand, the encoder either did not find candidate |
|
176
|
locations in the origin, or the best range coming out of the search |
|
177
|
needed more bytes to encode the range than there were bytes in the |
|
178
|
range, then no instructions are emitted and the window is moved one |
|
179
|
byte forward. The "base" is left unchanged in that case.</p> |
|
180
|
|
|
181
|
<p>The processing loop stops at one of two conditions: |
|
182
|
<ol> |
|
183
|
<li>The encoder decided to move the window forward, but the end of the |
|
184
|
window reached the end of the "target". |
|
185
|
</li> |
|
186
|
<li>After the emission of instructions the new "base" location is |
|
187
|
within NHASH bytes of end of the "target", i.e. there are no more than |
|
188
|
at most NHASH bytes left. |
|
189
|
</li> |
|
190
|
</ol> |
|
191
|
</p> |
|
192
|
|
|
193
|
<p>If the processing loop left bytes unencoded, i.e. "base" not |
|
194
|
exactly at the end of the "target", as is possible for both end |
|
195
|
conditions, then one last insert instruction is emitted to put these |
|
196
|
bytes into the delta.<p> |
|
197
|
|
|
198
|
<h2 id="exceptions">3.0 Exceptions</h2> |
|
199
|
|
|
200
|
<p>If the "original" is at most NHASH bytes long no compression of |
|
201
|
changes is possible, and the segment-list of the delta consists of a |
|
202
|
single literal which contains the entire "target".</p> |
|
203
|
|
|
204
|
<p>This is actually equivalent to the second end condition of the |
|
205
|
processing loop described in the previous section, just checked before |
|
206
|
actually entering the loop.</p> |
|
207
|
|
|
208
|
<h2 id="rollhash">4.0 The rolling hash</h2> |
|
209
|
|
|
210
|
<p>The rolling hash described below and used to compute content |
|
211
|
signatures was chosen not only for good hashing properties, but also |
|
212
|
to enable the easy (incremental) recalculation of its value for a |
|
213
|
sliding window, i.e. where the oldest byte is removed from the window |
|
214
|
and a new byte is shifted in.<p> |
|
215
|
|
|
216
|
<h3 id="rhdef">4.1 Definition</h3> |
|
217
|
|
|
218
|
<p>Assuming an array Z of NHASH bytes (indexing starting at 0) the |
|
219
|
hash V is computed via</p> |
|
220
|
|
|
221
|
<div align="center"><img src="encode1.gif"></div> |
|
222
|
<div align="center"><img src="encode2.gif"></div> |
|
223
|
<div align="center"><img src="encode3.gif"></div> |
|
224
|
|
|
225
|
where A and B are unsigned 16-bit integers (hence the <u>mod</u>), and |
|
226
|
V is a 32-bit unsigned integer with B as MSB, A as LSB. |
|
227
|
|
|
228
|
<h3 id="rhincr">4.2 Incremental recalculation</h3> |
|
229
|
|
|
230
|
<p>Assuming an array Z of NHASH bytes (indexing starting at 0) with |
|
231
|
hash V (and components A and B), the dropped |
|
232
|
byte <img src="encode4.gif" align="center">, and the new byte |
|
233
|
<img src="encode5.gif" align="center"> , the new hash can |
|
234
|
be computed incrementally via: </p> |
|
235
|
|
|
236
|
<div align="center"><img src="encode6.gif"></div> |
|
237
|
<div align="center"><img src="encode7.gif"></div> |
|
238
|
<div align="center"><img src="encode8.gif"></div> |
|
239
|
|
|
240
|
<p>For A, the regular sum, it can be seen easily that this the correct |
|
241
|
way recomputing that component.</p> |
|
242
|
|
|
243
|
<p>For B, the weighted sum, note first that <img src="encode4.gif" |
|
244
|
align="center"> has the weight NHASH in the sum, so that is what has |
|
245
|
to be removed. Then adding in <img src="encode9.gif" align="center"> |
|
246
|
adds one weight factor to all the other values of Z, and at last adds |
|
247
|
in <img src="encode5.gif" align="center"> with weight 1, also |
|
248
|
generating the correct new sum</p> |
|
249
|
|