Fossil SCM

fossil-scm / www / fossil-is-not-relational.md

Source Rendered

Blame History Raw 407 lines

1	`# Fossil is not Relational`
2
3	`*An Introduction to the Fossil Data Model*`
4
5	`Upon hearing that Fossil is based on sqlite, it's natural for people`
6	`unfamiliar with its internals to assume that Fossil stores its`
7	`SCM-relevant data in a database-friendly way and that the SCM history`
8	`can be modified via SQL. The truth, however, is *far stranger than`
9	`that.*`
10
11	`This document introduces, at a relatively high level:`
12
13	`1) The underlying enduring and immutable data format, which is`
14	`independent of any specific storage engine.`
15
16	2) The `blob` table: Fossil's single point of SCM-relevant data
17	`storage.`
18
19	`3) The transformation of (1) from its immutable raw form to a`
20	`transient database-friendly form.`
21
22	`4) Some of the consequences of this model.`
23
24
25	`# Part 1: Artifacts`
26
27	```pikchr center
28	`AllObjects: [`
29	`A: file "Artifacts" fill lightskyblue;`
30	`down; move to A.s; move 50%;`
31	`F: file "Client" "files";`
32	`right; move 1; up; move 50%;`
33	`B: cylinder "blob table"`
34	`right;`
35	`arrow from A.e to B.w;`
36	`arrow from F.e to B.w;`
37	`arrow dashed from B.e;`
38	`C: box rad 0.1 "Crosslink" "process";`
39	`arrow`
40	`AUX: cylinder "Auxiliary" "tables"`
41	`arc -> cw dotted from AUX.s to B.s;`
42	`] # end of AllObjects`
43	```
44
45
46	`The centerpiece of Fossil's architecture is a data format which`
47	`describes what we call "artifacts." Each artifact represents the state`
48	`of one atomic unit of SCM-relevant data, such as a single checkin, a`
49	`single wiki page edit, a single modification to a ticket, creation or`
50	`cancellation of tags, and similar SCM constructs. In the cases of`
51	`checkins and ticket updates, an artifact may record changes to`
52	`multiple files resp. ticket fields, but the change as a whole`
53	`is atomic. Though we often refer to both fossil-specific SCM data`
54	`and client-side content as artifacts, this document uses the term`
55	`artifact solely for the former purpose.`
56
57	`From [the data format's main documentation][dataformat]:`
58
59	`> The global state of a fossil repository is kept simple so that it`
60	`> can endure in useful form for decades or centuries. A fossil`
61	`> repository is intended to be readable, searchable, and extensible by`
62	`> people not yet born.`
63
64	`[dataformat]: ./fileformat.wiki`
65
66	`This format has the following major properties:`
67
68	`- It is <u>syntactically simple</u>, easily and efficiently`
69	`parsable in any programming language. It is also entirely`
70	`human-readable.`
71
72	`- It is <u>immutable</u>. An artifact is identified by its unique`
73	`hash value. Any modification to an artifact changes that hash,`
74	`thereby changing its identity.`
75
76	`- It is <u>not generic</u>. It is custom-made for its purpose and`
77	`makes no attempt at providing a generic format. It contains only`
78	`what it needs to function, with zero bloat.`
79
80	`- It <u>**holds all SCM-relevant data except for client-level file`
81	`content**</u>, the latter instead being referenced by their unique`
82	`hash values. Storage of the client-side content is an implementation`
83	`detail delegated to higher-level applications.`
84
85	`- <u>Auditability</u>. By following the hash references in`
86	`artifacts it is possible to unambiguously trace the origin of any`
87	`modification to the SCM state. Combined with higher-level tools`
88	`(specifically, Fossil's database), this audit trail can easily be`
89	`traced both backwards and forwards in time, using any given version`
90	`in the SCM history as a starting point.`
91
92	`Notably, the artifact file format <u>does not</u>...`
93
94	`- Specify any specific storage mechanism for the SCM's raw bytes,`
95	`which includes both artifacts themselves and client-side file`
96	`content. The file format refers to all such content solely by its`
97	`unique hash value.`
98
99	`- Specify any optimizations such as storing file-level changes as`
100	`deltas between two versions of that content.`
101
102	`Such aspects are all considered to be implementation details of`
103	`higher-level applications (be they in the main fossil binary or a`
104	`hypothetical 3rd-party application), and have no effect on the`
105	`underlying artifact data model. That said, in Fossil:`
106
107	`- All raw byte content (artifacts and client files) is stored in`
108	the `blob` database table.
109
110	`- Fossil uses delta and zlib compression to keep the storage size of`
111	`changes from one version of a piece of content to the next to a`
112	`minimum.`
113
114
115	`## Sidebar: SCM-relevant vs Non-SCM-relevant State`
116
117	`Certain data in Fossil are "SCM-relevant" and certain data are not. In`
118	`short, SCM-relevant data are managed in a way consistent with`
119	`controlled versioning of that data. Conversely, non-SCM-relevant data`
120	`are essentially any state neither specified by nor unambiguously`
121	`refererenced by the artifact file format and are therefore not`
122	`versioned.`
123
124	`SCM-relevant state includes:`
125
126	`- Any and all data stored in the bodies of artifacts. This includes,`
127	`but is not limited to: wiki/ticket/forum content, tags, file names`
128	`and Fossil-side permissions, the name of each user who introduces`
129	`any given artifact into the data store, the timestamp of each such`
130	`change, the inheritance tree of checkins, and many other pieces of`
131	`metadata.`
132
133	`- Raw file content of versioned files. These data are external to`
134	`artifacts, which refer to them by their hashes. How they are stored`
135	`is not the concern of the data model, but (spoiler alert!) Fossil`
136	`stores them in an SQLite database, one record per distinct hash, in`
137	its `blob` table (which we will cover more very soon).
138
139	`Non-SCM-relevant state includes:`
140
141	`- Fossil's list of users and their metadata (permissions, email`
142	`address, etc.). Artifacts themselves reference users only by their`
143	`user names. Artifacts neither care whether, nor guarantee that, user`
144	`"drh" in one artifact is in fact the same "drh" referenced in`
145	`another artifact.`
146
147	`- All Fossil UI configuration, e.g. the site's skin, config settings,`
148	`and project name.`
149
150	`- In short, any tables in a Fossil repository file except for the`
151	`blob` table. Most, but not all, of these tables are transient
152	`caches for the data specified by the artifact files (which are`
153	stored in the `blob` table), and can safely be destroyed and rebuilt
154	`from the collection of artifacts with no loss of state to the`
155	repository. All of them, except for `blob` and `delta`, can be
156	`destroyed with no loss of SCM-relevant data.`
157
158	`## Terminology Hair-splitting: Manifest vs. Artifact`
159
160	`We sometimes refer to artifacts as "manifests," which is technically a`
161	`term for artifacts which record checkins. The various other artifact`
162	`types are arguably not "manifests," but are sometimes referred to as`
163	`such because the internal APIs use that term.`
164
165
166	`## A Very Basic Example`
167
168	`The following artifact, truncated for brevity, represents a typical`
169	`checkin artifact (a.k.a. a manifest):`
170
171	```
172	`C Bug\sfix\sin\sthe\slocal\sdatabase\sfinder.`
173	`D 2007-07-30T13:01:08`
174	`F src/VERSION 24bbb3aad63325ff33c56d777007d7cd63dc19ea`
175	`F src/add.c 1a5dfcdbfd24c65fa04da865b2e21486d075e154`
176	`F src/blob.c 8ec1e279a6cd0cfd5f1e3f8a39f2e9a1682e0113`
177	`<SNIP>`
178	`F www/selfcheck.html 849df9860df602dc2c55163d658c6b138213122f`
179	`P 01e7596a984e2cd2bc12abc0a741415b902cbeea`
180	`R 74a0432d81b956bfc3ff5a1a2bb46eb5`
181	`U drh`
182	`Z c9dcc06ecead312b1c310711cb360bc3`
183	```
184
185	`Each line is a single data record called a "card." The first letter of`
186	`each line tells us the type of data stored on that line and the`
187	`following space-separated tokens contain the data for that`
188	`line. Tokens which themselves contain spaces (notably the checkin`
189	comment) have those escaped as `\s`. The raw text of wiki
190	`pages/comments, forum posts, and ticket bodies/comments is stored`
191	`directly in the corresponding artifact, but is stored in a way which`
192	`makes such escaping unnecessary.`
193
194	`The hashes seen above are a critical component of the architecture:`
195
196	- The `F` (file) records refer to the content of those files by the
197	`hash of that content. Where that content is stored is not specified`
198	`by the data model.`
199
200	- The `P` (parent) line is the hash code of the parent version (itself
201	`an artifact).`
202
203	- The `Z` line is a hash of all of the content of this artifact
204	which precedes the `Z` line. Thus any change to the content of an
205	artifact changes both the artifact's identity (its hash) and its `Z`
206	`value, making it impossible to inject modified artifacts into an`
207	`existing artifact tree.`
208
209	- The `R` line is yet another consistency-checking hash which we won't
210	`go into here except to say that it's an internal consistency`
211	`check/line of defense against modification of file content`
212	`referenced by the artifact.`
213
214	# Part 2: The `blob` Table
215
216	```pikchr center
217	`AllObjects: [`
218	`A: file "Artifacts";`
219	`down; move to A.s; move 50%;`
220	`F: file "Client" "files" fill lightskyblue;`
221	`right; move 1; up; move 50%;`
222	`B: cylinder "blob table" fill lightskyblue;`
223	`right;`
224	`arrow from A.e to B.w;`
225	`arrow from F.e to B.w;`
226	`arrow dashed from B.e;`
227	`C: box rad 0.1 "Crosslink" "process";`
228	`arrow`
229	`AUX: cylinder "Auxiliary" "tables"`
230	`arc -> cw dotted from AUX.s to B.s;`
231	`] # end of AllObjects`
232	```
233
234
235	The `blob` table is the core-most storage of a Fossil repository
236	`database, storing all SCM-relevant data (and only SCM-relevant`
237	`data). Each row of this table holds a single artifact or the content`
238	`for a single version of a single client-side file. Slightly truncated`
239	`for clarity, its schema contains the following fields:`
240
241	- `uuid`: the hash code of the blob's contents.
242	- `rid`: a unique integer key for this record. This is how the
243	`blob table is mapped to other (transient) tables, but the RIDs are`
244	`specific to one given copy of a repository and must not be used for`
245	`cross-repository referencing. The RID is a private/internal value of`
246	`no use to a user unless they're building SQL queries for use with`
247	`the Fossil db schema.`
248	- `size`: the size, in bytes, of the blob's contents, or -1 for
249	`"phantom" blobs (those which Fossil knows should exist because it's`
250	`seen them referenced somewhere, but for which it has not been given`
251	`any content).`
252	- `content`: the blob's raw content bytes, with the caveat that
253	`Fossil is free to store it in an "alternate representation."`
254	Specifically, the `content` field often holds a zlib-compressed
255	`delta from a previous version of the blob's content (a separate`
256	entry in the `blob` table), and an auxiliary table named `delta`
257	`maps such blobs to their previous versions, such that Fossil can`
258	`reconstruct the real content from them by applying the delta to its`
259	`previous version (and such deltas may be chained). Thus extraction`
260	`of the content from this field cannot be performed via vanilla SQL,`
261	`and requires a Fossil-specific function which knows how to convert`
262	`any internal representations of the content to its original form.`
263
264
265	## Sidebar: How does `blob` Distinguish Between Artifacts and Client Content?
266
267	Notice that the `blob` table has no flag saying "this record is an
268	`artifact" or "this record is client data." Similarly, there is no`
269	place in the database dedicated to keeping track of which `blob`
270	`records are artifacts and which are file content.`
271
272	`That said, (A) the type of a blob can be implied via certain table`
273	relationships and (B) the `event` table (the `/timeline`'s main data
274	`source) incidentally has a list of artifacts and their sub-types`
275	`(checkin, wiki, tag, etc.). However, given that all of those`
276	`relationships, including the timeline, are transient, how can Fossil`
277	`distinguish between the two types of data?`
278
279	`Fossil's artifact format is extremely rigid and is strictly enforced`
280	`internally, with zero room provided for leniency. Every artifact which`
281	`is internally created is re-parsed for validity before it is committed`
282	`to the database, making it impossible that Fossil can inject an`
283	`invalid artifact into the repository. Because of the strictness of the`
284	`artifact parser, the chances that any given piece of arbitrary client`
285	`data could be successfully parsed as an artifact, even if it is`
286	`syntactically 99% similar to an artifact, are effectively zero.`
287
288	`Thus Fossil's rule of interpreting the contents of the blob table is:`
289	`if it can be parsed as an artifact, it is an artifact, else it is`
290	`opaque client-side data.`
291
292	That rule is most often relevant in operations like `rebuild` and
293	`reconstruct`, both of which necessarily have to sort out artifacts
294	`and non-artifact blobs from arbitrary collections of blobs.`
295
296	`It is, in fact, possible to store an artifact unrelated to the current`
297	`repository in that repository, and it *will be parsed and processed as`
298	`an artifact* (see below), but it likely refers to other artifacts or`
299	`blobs which are not part of the current repository, thereby possibly`
300	`introducing "strange" data into the UI. If this happens, it's`
301	`potentially slightly confusing but is functionally harmless.`
302
303
304	`# Part 3: Crosslinking`
305
306	```pikchr center
307	`AllObjects: [`
308	`A: file "Artifacts";`
309	`down; move to A.s; move 50%;`
310	`F: file "Client" "files";`
311	`right; move 1; up; move 50%;`
312	`B: cylinder "blob table"`
313	`right;`
314	`arrow from A.e to B.w;`
315	`arrow from F.e to B.w;`
316	`arrow dashed from B.e;`
317	`C: box rad 0.1 "Crosslink" "process" fill lightskyblue;`
318	`arrow`
319	`AUX: cylinder "Auxiliary" "tables" fill lightskyblue;`
320	`arc -> cw dotted from AUX.s to B.s;`
321	`] # end of AllObjects`
322	```
323
324	Once an artifact is stored in the `blob` table, how does one perform
325	`SQL queries against its plain-text format? In short: *One Does Not`
326	`Simply Query the Artifacts*.`
327
328	`Crosslinking, as its colloquially known, is a one-way processing step`
329	`which transforms an immutable artifact's state into something`
330	`database-friendly. Crosslinking happens automatically every time`
331	`Fossil generates, or is given, a new artifact. Crosslinking of any`
332	`given artifact may update many different auxiliary tables, all of`
333	`which are transient in the sense that they may be destroyed and then`
334	recreated by crosslinking all artifacts from the `blob` table (which
335	is exactly what the `rebuild` command does). The overwhelming majority
336	`of individual database records in any Fossil repository are found in`
337	these transient auxiliary tables, though the `blob` table tends to
338	`account for the overwhelming majority of a repository's disk space.`
339
340	`This approach to mapping data from artifacts to the db gives Fossil`
341	`the freedom to change its database model, effectively at will, with`
342	minimal client-side disruption (at most, a call to `rebuild`). This
343	`allows, for example, Fossil to take advantage of new improvements in`
344	`sqlite without affecting compatibility with older repositories.`
345
346	`Auxiliary tables hold data mappings such as:`
347
348	- Child/parent relationships of checkins. (The `plink` table.)
349	- Records of file names and changes to files. (The `mlink` and `filename` tables.)
350	- Timeline entries. (The `event` table.)
351
352	`And numerous other bits and pieces.`
353
354	`The many auxiliary tables maintained by the app-level code reference`
355	the `blob` table via its RID field, as that's far more efficient than
356	using hashes (`blob.uuid`) as foreign keys. The contexts of those
357	`auxiliary data unambiguously tell us whether the referenced blobs are`
358	`artifacts or file content, so there is no efficiency penalty there for`
359	hosting both opaque blobs and artifacts in the `blob` table.
360
361	`The complete SQL schemas for the core-most auxiliary tables can be found`
362	`at:`
363
364	`[](/finfo/src/schema.c?ci=trunk)`
365
366	`Noting, however, that all database tables are effectively internal`
367	`APIs, with no API stability guarantees and subject to change at any`
368	`time. Thus their structures generally should not be relied upon in`
369	`client-side scripts.`
370
371
372	`# Part 4: Implications and Consequences of the Model`
373
374	`Some of the implications and consequences of Fossil's data model`
375	`combined with the higher-level access via SQL include:`
376
377	`- Provable immutability of history. Fossil offers only one option`
378	`for modifying history: "shunning" is the forceful removal of an`
379	artifact from the `blob` table and the creation of a db record
380	`stating that the shunned hash may no longer be synced into this`
381	`repository. Shunning effectively leaves a hole in the SCM history,`
382	`and is only intended to be used for removal of illegal, dangerous,`
383	`or private information which should never have been added to the`
384	`repository.`
385
386	`- **Complete separation of SCM-relevant data and app-level data`
387	`structures**. This allows the application to update its structures`
388	`at will without significant backwards-compatibility concerns. In`
389	`Fossil's case, "data structures" primarily refers to the SQL`
390	`schema. Bringing a given repository schema up to date vis a vis a`
391	`given fossil binary version simply means rebuilding the repository`
392	`with that fossil binary. There are exceptionally rare cases, namely`
393	`the switch from SHA1 to SHA3-256 ushered in with Fossil 2.0, which`
394	`can lead to true incompatibility. e.g. a Fossil 1.x client cannot`
395	`use a repository database which contains SHA3 hashes, regardless of`
396	`a rebuild.`
397
398	`- Two-way compatibility with other hypothetical clients which also`
399	`implement the same underlying data model. So far there are none, but`
400	`it's conceivably possible.`
401
402	`- Provides a solid basis for reporting. Fossil's real-time metrics`
403	`and reporting options are arguably the most powerful and flexible`
404	`yet seen in an SCM.`
405
406	`- Very probably several more things.`
407

Fossil SCM

Keyboard Shortcuts