Fossil SCM

fossil-scm / www / delta_encoder_algorithm.wiki
1
<title>Fossil Delta Encoding Algorithm</title>
2
3
<h2>Abstract</h2>
4
5
<p>A key component for the efficient storage of multiple revisions of
6
a file in fossil repositories is the use of delta-compression, i.e. to
7
store only the changes between revisions instead of the whole
8
file.</p>
9
10
<p>This document describes the encoding algorithm used by Fossil to
11
generate deltas. It is targeted at developers working on either
12
<a href="index.wiki">fossil</a> itself, or on tools compatible with
13
it. The exact format of the generated byte-sequences, while in general
14
not necessary to understand encoder operation, can be found in the
15
companion specification titled "<a href="delta_format.wiki">Fossil
16
Delta Format</a>".
17
</p>
18
19
<p>The algorithm is inspired
20
by <a href="http://samba.anu.edu.au/rsync/">rsync</a>.</p>
21
22
<h2 id="argresparam">1.0 Arguments, Results, and Parameters</h2>
23
24
<p>The encoder takes two byte-sequences as input, the "original", and
25
the "target", and returns a single byte-sequence containing the
26
"delta" which transforms the original into the target upon its
27
application.</p>
28
29
<p>Note that the data of a "byte-sequence" includes its length,
30
i.e. the number of bytes contained in the sequence.</p>
31
32
<p>The algorithm has one parameter named "NHASH", the size of the
33
"sliding window" for the "rolling hash", in bytes. These two terms are
34
explained in the next section. The value of this parameter has to be a
35
power of two for the algorithm to work. For Fossil the value of this
36
parameter is set to "16".</p>
37
38
<h2 id="operation">2.0 Operation</h2>
39
40
<p>The algorithm is split into three phases which generate
41
the <a href="delta_format.wiki#header">header</a>,
42
<a href="delta_format.wiki#slist">segment list</a>,
43
and <a href="delta_format.wiki#trailer">trailer</a> of the delta, per
44
its general <a href="delta_format.wiki#structure">structure</a>.</p>
45
46
<p>The two phases generating header and trailer are not covered here
47
as their implementation trivially follows directly from the
48
specification of the <a href="delta_format.wiki">delta format</a>.</p>
49
50
<p>This leaves the segment-list. Its generation is done in two phases,
51
a pre-processing step operating on the "original" byte-sequence,
52
followed by the processing of the "target" byte-sequence using the
53
information gathered by the first step.</p>
54
55
<h3 id="preprocessing">2.1 Preprocessing the original</h3>
56
57
<p>A major part of the processing of the "target" is to find a range
58
in the "original" which contains the same content as found at the
59
current location in the "target".</p>
60
61
<p>A naive approach to this would be to search the whole "original"
62
for such content. This however is very inefficient as it would search
63
the same parts of the "original" over and over. What is done instead
64
is to sample the "original" at regular intervals, compute signatures
65
for the sampled locations and store them in a hash table keyed by
66
these signatures.</p>
67
68
<p>That is what happens in this step. The following processing step
69
can then the compute signature for its current location and then has
70
to search only a narrow set of locations in the "original" for
71
possible matches, namely those which have the same signature.</p>
72
73
<p>In detail:</p>
74
75
<ol>
76
<li>The "original" is split into chunks of NHASH bytes. Note that a
77
partial chunk of less than NHASH bytes at the end of "original" is
78
ignored.
79
</li>
80
<li>The <a href="#rollhash">rolling hash</a> of each chunk is
81
computed.
82
</li>
83
<li>A hash table is filled, mapping from the hashes of the chunks to
84
the list of chunk locations having this hash.
85
</li>
86
</ol>
87
88
<h3 id="processing">2.2 Processing the target</h3>
89
90
<p>This, the main phase of the encoder, processes the target in a loop
91
from beginning to end. The state of the encoder is captured by two
92
locations, the "base" and the "slide". "base" points to the first byte
93
of the target for which no delta output has been generated yet, and
94
"slide" is the location of the window used to look in the "origin" for
95
commonalities. This window is NHASH bytes long.</p>
96
97
<p>Initially both "base" and "slide" point to the beginning of the
98
"target". In each iteration of the loop the encoder decides whether to
99
<ul>
100
<li>emit a single instruction
101
to <a href="delta_format.wiki#copyrange">copy a range</a>, or
102
</li>
103
<li>emit two instructions, first
104
to <a href="delta_format.wiki#insertlit">insert a literal</a>, then
105
to <a href="delta_format.wiki#copyrange">copy a range</a>, or
106
</li>
107
<li>move the window forward one byte.
108
</li>
109
</ul>
110
</p>
111
112
<verbatim type="pikchr float-right">
113
TARGET: [
114
scale = 0.8
115
down
116
"Target" bold
117
box fill palegreen width 150% height 200% "Processed"
118
GI: box same as first box fill yellow height 25% "Gap → Insert"
119
CC: box same fill orange height 200% "Common → Copy"
120
W: box same as GI fill lightgray width 125% height 200% "Window" bold
121
box same as CC height 125% ""
122
box same fill white ""
123
]
124
125
[ "Base" bold ; right ; arrow 33% ] with .e at TARGET.GI.nw
126
[ "Slide" bold ; right ; arrow 33% ] with .e at TARGET.W.nw
127
128
ORIGIN: [
129
down
130
"Origin" bold
131
B1: box fill white
132
B2: box fill orange height 200%
133
B3: box fill white height 200%
134
] with .nw at 0.75 right of TARGET.ne
135
136
arrow from TARGET.W.e to ORIGIN.B2.w "Signature" aligned above
137
</verbatim>
138
139
<p>To make this decision the encoder first computes the hash value for
140
the NHASH bytes in the window and then looks at all the locations in
141
the "origin" which have the same signature. This part uses the hash
142
table created by the pre-processing step to efficiently find these
143
locations.</p>
144
145
<p>For each of the possible candidates the encoder finds the maximal
146
range of bytes common to both "origin" and "target", going forward and
147
backward from "slide" in the "target", and the candidate location in
148
the "origin". This search is constrained on the side of the "target"
149
by the "base" (backward search), and the end of the "target" (forward
150
search), and on the side of the "origin" by the beginning and end of
151
the "origin", respectively.</p>
152
153
<p>There are input files for which the hash chains generated by the
154
pre-processing step can become very long, leading to long search times
155
and affecting the performance of the delta generator. To limit the
156
effect such long chains can have the actual search for candidates is
157
bounded, looking at most N candidates. Currently N is set to 250.</p>
158
159
<p>From the ranges for all the candidates the best (= largest) common
160
range is taken and it is determined how many bytes are needed to
161
encode the bytes between the "base" and the end of that range. If the
162
range extended back to the "base" then this can be done in a single
163
copy instruction. Otherwise, i.e if there is a gap between the "base"
164
and the beginning of the range then two instructions are needed, one
165
to insert the bytes in the gap as a literal, and a copy instruction
166
for the range itself. The general situation at this point can be seen
167
in the picture to the right.</p>
168
169
<p>If the number of bytes needed to encode both gap (if present), and
170
range is less than the number of bytes we are encoding the encoder
171
will emit the necessary instructions as described above, set "base"
172
and "slide" to the end of the encoded range and start the next
173
iteration at that point.</p>
174
175
<p>If, on the other hand, the encoder either did not find candidate
176
locations in the origin, or the best range coming out of the search
177
needed more bytes to encode the range than there were bytes in the
178
range, then no instructions are emitted and the window is moved one
179
byte forward. The "base" is left unchanged in that case.</p>
180
181
<p>The processing loop stops at one of two conditions:
182
<ol>
183
<li>The encoder decided to move the window forward, but the end of the
184
window reached the end of the "target".
185
</li>
186
<li>After the emission of instructions the new "base" location is
187
within NHASH bytes of end of the "target", i.e. there are no more than
188
at most NHASH bytes left.
189
</li>
190
</ol>
191
</p>
192
193
<p>If the processing loop left bytes unencoded, i.e. "base" not
194
exactly at the end of the "target", as is possible for both end
195
conditions, then one last insert instruction is emitted to put these
196
bytes into the delta.<p>
197
198
<h2 id="exceptions">3.0 Exceptions</h2>
199
200
<p>If the "original" is at most NHASH bytes long no compression of
201
changes is possible, and the segment-list of the delta consists of a
202
single literal which contains the entire "target".</p>
203
204
<p>This is actually equivalent to the second end condition of the
205
processing loop described in the previous section, just checked before
206
actually entering the loop.</p>
207
208
<h2 id="rollhash">4.0 The rolling hash</h2>
209
210
<p>The rolling hash described below and used to compute content
211
signatures was chosen not only for good hashing properties, but also
212
to enable the easy (incremental) recalculation of its value for a
213
sliding window, i.e. where the oldest byte is removed from the window
214
and a new byte is shifted in.<p>
215
216
<h3 id="rhdef">4.1 Definition</h3>
217
218
<p>Assuming an array Z of NHASH bytes (indexing starting at 0) the
219
hash V is computed via</p>
220
221
<div align="center"><img src="encode1.gif"></div>
222
<div align="center"><img src="encode2.gif"></div>
223
<div align="center"><img src="encode3.gif"></div>
224
225
where A and B are unsigned 16-bit integers (hence the <u>mod</u>), and
226
V is a 32-bit unsigned integer with B as MSB, A as LSB.
227
228
<h3 id="rhincr">4.2 Incremental recalculation</h3>
229
230
<p>Assuming an array Z of NHASH bytes (indexing starting at 0) with
231
hash V (and components A and B), the dropped
232
byte <img src="encode4.gif" align="center">, and the new byte
233
<img src="encode5.gif" align="center"> , the new hash can
234
be computed incrementally via: </p>
235
236
<div align="center"><img src="encode6.gif"></div>
237
<div align="center"><img src="encode7.gif"></div>
238
<div align="center"><img src="encode8.gif"></div>
239
240
<p>For A, the regular sum, it can be seen easily that this the correct
241
way recomputing that component.</p>
242
243
<p>For B, the weighted sum, note first that <img src="encode4.gif"
244
align="center"> has the weight NHASH in the sum, so that is what has
245
to be removed. Then adding in <img src="encode9.gif" align="center">
246
adds one weight factor to all the other values of Z, and at last adds
247
in <img src="encode5.gif" align="center"> with weight 1, also
248
generating the correct new sum</p>
249

Keyboard Shortcuts

Open search /
Next entry (timeline) j
Previous entry (timeline) k
Open focused entry Enter
Show this help ?
Toggle theme Top nav button