<title>Thoughts On The Design Of The Fossil DVCS</title>

Two questions (or criticisms) that arise frequently regarding Fossil
can be summarized as follows:

1. Why is Fossil based on SQLite instead of a distributed NoSQL database?

2. Why is Fossil written in C instead of a modern high-level language?

Neither question can be answered directly because both are
based on false assumptions. We claim that Fossil is not based on SQLite
at all and that Fossil is not based on a distributed NoSQL database
because Fossil is a distributed NoSQL database. And Fossil does use
a modern high-level language for its implementation, namely SQL.

<h2>Fossil Is A NoSQL Database</h2>

We begin with the first question: Fossil is not based on a distributed
NoSQL database because Fossil <u><i>is</i></u> a distributed NoSQL database.
Fossil is <u>not</u> based on SQLite.
The current implementation of Fossil uses
SQLite as a local store for the content of the distributed database and as
a cache for meta-information about the distributed database that is precomputed
for quick and easy presentation. But the use of SQLite in this role is an
implementation detail and is not fundamental to the design. Some future
version of Fossil might do away with SQLite and substitute a pile-of-files or
a key/value database in its place.
(Actually, that is very unlikely
to happen since SQLite works amazingly well in its current role, but the point
is that omitting SQLite from Fossil is a theoretical possibility.)

The underlying database that Fossil implements has nothing to do with
SQLite, or SQL, or even relational database theory. The underlying
database is very simple: it is an unordered collection of "artifacts".
An artifact is a list of bytes - a "file" in the usual manner of thinking.
Many artifacts are simply the content of source files that have
been checked into the Fossil repository. Call these "content artifacts".
Other artifacts, known as
"control artifacts", contain ASCII text in a particular format that
defines relationships between other artifacts, such as which
content artifacts go together to form a particular version of the
project. Each artifact is named by its SHA1 or SHA3-256 hash and is
thus immutable.
Artifacts can be added to the database but not removed (if we ignore
the exceptional case of [./shunning.wiki | shunning]). Repositories
synchronize by computing the union of their artifact sets. SQL and
relational theory play no role in any of this.
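
The bag-of-artifacts model can be sketched in a few lines of Python. This is
only an illustration of the idea, not Fossil's code or its sync protocol: the
<tt>ArtifactBag</tt> class and its methods are hypothetical names, and the
sketch always uses SHA3-256 for naming.

```python
import hashlib


class ArtifactBag:
    """Hypothetical in-memory model of an unordered bag of artifacts."""

    def __init__(self):
        self._artifacts = {}  # hash name (hex) -> artifact bytes

    def add(self, content: bytes) -> str:
        # The name is derived from the bytes, so a name can never be
        # re-pointed at different content: artifacts are immutable.
        name = hashlib.sha3_256(content).hexdigest()
        self._artifacts.setdefault(name, content)
        return name

    def get(self, name: str) -> bytes:
        return self._artifacts[name]

    def names(self) -> set:
        return set(self._artifacts)

    def sync(self, other: "ArtifactBag") -> None:
        # Synchronizing means both sides end up with the union of the sets.
        for name in other.names() - self.names():
            self._artifacts[name] = other.get(name)
        for name in self.names() - other.names():
            other._artifacts[name] = self.get(name)


alice, bob = ArtifactBag(), ArtifactBag()
shared = alice.add(b"int main(void){ return 0; }\n")
bob.add(b"int main(void){ return 0; }\n")  # same bytes, same name
only_bob = bob.add(b"README\n")
alice.sync(bob)
```

Note that there is no delete operation and no notion of ordering; nothing in
the sketch needs SQL.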

SQL enters the picture only in the implementation details. The current
implementation of Fossil stores each artifact as a BLOB in a SQLite
database.
The current implementation also parses each control artifact as it
arrives and stores the information discovered from that parse in various
other SQLite tables to facilitate rapid generation of reports such as
timelines, file histories, file lists, branch lists, and so forth. Note
that all of this additional information is derived from the artifacts.
The artifacts are canonical. The relational tables serve only as a cache.
Everything in the relational tables can be recomputed
from the artifacts, and in fact that is exactly what happens when one runs
the "fossil rebuild" command on a repository.
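
A minimal sketch of the "artifacts are canonical, tables are a cache" idea,
using Python's <tt>sqlite3</tt> module. The table names (<tt>artifact</tt>,
<tt>checkin_cache</tt>), the <tt>rebuild()</tt> helper, and the one-line
"checkin DATE COMMENT" text format are all assumptions made up for this
example; they are not Fossil's schema or its control-artifact format.

```python
import hashlib
import sqlite3


def add_artifact(db, content: bytes) -> None:
    # Canonical storage: each artifact is a BLOB keyed by its hash name.
    name = hashlib.sha3_256(content).hexdigest()
    db.execute(
        "INSERT OR IGNORE INTO artifact(name, content) VALUES (?, ?)",
        (name, content),
    )


def rebuild(db) -> None:
    """Discard the derived table and recompute it from the artifacts alone."""
    db.execute("DROP TABLE IF EXISTS checkin_cache")
    db.execute(
        "CREATE TABLE checkin_cache(name TEXT PRIMARY KEY, date TEXT, comment TEXT)"
    )
    rows = db.execute("SELECT name, content FROM artifact").fetchall()
    for name, content in rows:
        # Toy control-artifact format (hypothetical): "checkin DATE COMMENT".
        if content.startswith(b"checkin "):
            _, date, comment = content.decode("ascii").rstrip("\n").split(" ", 2)
            db.execute(
                "INSERT INTO checkin_cache VALUES (?, ?, ?)", (name, date, comment)
            )


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE artifact(name TEXT PRIMARY KEY, content BLOB)")
add_artifact(db, b"int x = 1;\n")                         # content artifact
add_artifact(db, b"checkin 2024-01-02 second change\n")   # control artifacts
add_artifact(db, b"checkin 2024-01-01 first change\n")

QUERY = "SELECT date, comment FROM checkin_cache ORDER BY date"
rebuild(db)
before = db.execute(QUERY).fetchall()
db.execute("DELETE FROM checkin_cache")  # simulate losing the cache entirely
rebuild(db)
after = db.execute(QUERY).fetchall()
```

Because <tt>rebuild()</tt> reads nothing but the <tt>artifact</tt> table, the
cache can be dropped, or its schema changed, without losing information.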

So really, Fossil works with two separate databases. There is the
bag-of-artifacts database which is non-relational and distributed (like
a NoSQL database) and there is the local relational database. The
bag-of-artifacts database has a fixed format and is what defines a Fossil
repository. Fossil will never modify the file format of the bag-of-artifacts
database in an incompatible way because to do so would be to make something
that is no longer "Fossil". The local relational database, on the other hand,
is a cache that contains information derived from the bag-of-artifacts.
The schema of the local relational database changes from time to time as
the Fossil implementation is enhanced, and the content is recomputed from
the unchanging bag of artifacts. The local relational database is an
implementation detail which currently happens to use SQLite.

Another way to think of the relational tables in a Fossil repository is
as an index for the artifacts. Without the relational tables,
generating a report like a timeline would require scanning every artifact -
the equivalent of a full table scan. The relational tables hold pointers to
the relevant artifacts in presorted order so that generating a timeline
is much more efficient. So like an index in a relational database, the
relational tables in a Fossil repository do not add any new information;
they merely make the information in the artifacts faster and easier to
look up.
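
The index analogy can be made concrete with a small Python sketch. As before,
the "checkin DATE COMMENT" text format, the <tt>timeline</tt> table, and the
two function names are hypothetical and exist only for this illustration.

```python
import hashlib
import sqlite3

# A toy bag of artifacts: hash name -> bytes.
artifacts = {}
for body in (
    b"checkin 2024-03-01 third\n",
    b"hello.c contents\n",
    b"checkin 2024-01-15 first\n",
    b"checkin 2024-02-10 second\n",
):
    artifacts[hashlib.sha3_256(body).hexdigest()] = body


def checkin_date(body: bytes):
    """Return the date of a toy check-in artifact, or None for anything else."""
    if not body.startswith(b"checkin "):
        return None
    return body.decode("ascii").split(" ", 2)[1]


def timeline_by_scan(artifacts):
    # No index: read and parse every artifact, then sort (a "full table scan").
    rows = [(checkin_date(b), n) for n, b in artifacts.items() if checkin_date(b)]
    return [name for _date, name in sorted(rows)]


# The relational side holds only dates and pointers (names), kept in order.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE timeline(date TEXT, name TEXT)")
db.execute("CREATE INDEX timeline_by_date ON timeline(date)")
for name, body in artifacts.items():
    date = checkin_date(body)
    if date is not None:
        db.execute("INSERT INTO timeline VALUES (?, ?)", (date, name))


def timeline_by_index(db):
    # With the index: no artifact is opened at all.
    return [name for (name,) in db.execute("SELECT name FROM timeline ORDER BY date")]
```

Both functions return the same list of artifact names. The table contributes
no information that the scan could not find; it only avoids the scan.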

Fossil is not "based" on SQLite. Fossil simply exploits SQLite as
a powerful tool to make the implementation easier.
And Fossil doesn't use a distributed
NoSQL database because Fossil is a distributed NoSQL database. That answers
the first question.

<h2>SQL Is A High-Level Scripting Language</h2>

The second concern states that Fossil does not use a high-level scripting
language. But that is not true. Fossil uses SQL (as implemented by SQLite)
as its scripting language.

This misunderstanding likely arises because people fail
to appreciate that SQL is a programming language. People are taught that SQL
is a "query language" as if that were somehow different from a
"programming language". But they really are two different flavors of the
same thing. I find that people do better with SQL if they think of
SQL as a programming language and of each SQL statement
as a separate program. SQL is a peculiar programming language
in that one uses SQL to specify <i>what</i> to compute whereas in
most other programming languages one specifies <i>how</i>
to carry out the computation.
This difference means that SQL
is an extraordinarily high-level programming language, but it is still
just a programming language.
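
To illustrate the <i>what</i> versus <i>how</i> distinction, here is the same
small computation written both ways. The <tt>checkin</tt> table, its columns,
and the sample rows are invented for this example and are not taken from
Fossil.

```python
import sqlite3

rows = [
    ("alice", "2024-01-01"), ("bob", "2024-01-02"), ("alice", "2024-01-03"),
    ("carol", "2024-01-04"), ("alice", "2024-01-05"), ("bob", "2024-01-06"),
]


def busiest_how(rows):
    # "How": spell out the counting, the sorting, and the truncation.
    counts = {}
    for author, _date in rows:
        counts[author] = counts.get(author, 0) + 1
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:2]


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE checkin(author TEXT, date TEXT)")
db.executemany("INSERT INTO checkin VALUES (?, ?)", rows)

# "What": one statement, which is a complete program in its own right.
# The SQL engine decides how to scan, group, and sort.
BUSIEST_WHAT = """
    SELECT author, count(*) AS n
      FROM checkin
     GROUP BY author
     ORDER BY n DESC, author
     LIMIT 2
"""


def busiest_what(db):
    return db.execute(BUSIEST_WHAT).fetchall()
```

Both versions return the two most active authors with their check-in counts.
The SQL version states only the desired result and leaves the algorithm to
the query planner.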

For certain types of problems, SQL has a huge advantage over other
programming languages because it