|
a630398…
|
drh
|
1 |
<title>Thoughts On The Design Of The Fossil DVCS</title> |
|
a630398…
|
drh
|
2 |
|
|
a630398…
|
drh
|
3 |
Two questions (or criticisms) that arise frequently regarding Fossil |
|
a630398…
|
drh
|
4 |
can be summarized as follows: |
|
a630398…
|
drh
|
5 |
|
|
a630398…
|
drh
|
6 |
1. Why is Fossil based on SQLite instead of a distributed NoSQL database? |
|
a630398…
|
drh
|
7 |
|
|
a630398…
|
drh
|
8 |
2. Why is Fossil written in C instead of a modern high-level language? |
|
a630398…
|
drh
|
9 |
|
|
a630398…
|
drh
|
10 |
Neither question can be answered directly because they are both |
|
a630398…
|
drh
|
11 |
based on false assumptions. We claim that Fossil is not based on SQLite |
|
a630398…
|
drh
|
12 |
at all and that Fossil is not based on a distributed NoSQL database |
|
a630398…
|
drh
|
13 |
because Fossil is a distributed NoSQL database. And, Fossil does use |
|
a630398…
|
drh
|
14 |
a modern high-level language for its implementation, namely SQL. |
|
a630398…
|
drh
|
15 |
|
|
a630398…
|
drh
|
16 |
<h2>Fossil Is A NoSQL Database</h2> |
|
a630398…
|
drh
|
17 |
|
|
a630398…
|
drh
|
18 |
We begin with the first question: Fossil is not based on a distributed |
|
a630398…
|
drh
|
19 |
NoSQL database because Fossil <u><i>is</i></u> a distributed NoSQL database. |
|
f47b705…
|
jan.nijtmans
|
20 |
Fossil is <u>not</u> based on SQLite. |
|
a630398…
|
drh
|
21 |
The current implementation of Fossil uses |
|
a630398…
|
drh
|
22 |
SQLite as a local store for the content of the distributed database and as |
|
a630398…
|
drh
|
23 |
a cache for meta-information about the distributed database that is precomputed |
|
a630398…
|
drh
|
24 |
for quick and easy presentation. But the use of SQLite in this role is an |
|
a630398…
|
drh
|
25 |
implementation detail and is not fundamental to the design. Some future |
|
a630398…
|
drh
|
26 |
version of Fossil might do away with SQLite and substitute a pile-of-files or |
|
f47b705…
|
jan.nijtmans
|
27 |
a key/value database in place of SQLite. |
|
a630398…
|
drh
|
28 |
(Actually, that is very unlikely |
|
a630398…
|
drh
|
29 |
to happen since SQLite works amazingly well in its current role, but the point |
|
a630398…
|
drh
|
30 |
is that omitting SQLite from Fossil is a theoretical possibility.) |
|
a630398…
|
drh
|
31 |
|
|
a630398…
|
drh
|
32 |
The underlying database that Fossil implements has nothing to do with |
|
a630398…
|
drh
|
33 |
SQLite, or SQL, or even relational database theory. The underlying |
|
a630398…
|
drh
|
34 |
database is very simple: it is an unordered collection of "artifacts". |
|
a630398…
|
drh
|
35 |
An artifact is a list of bytes - a "file" in the usual manner of thinking. |
|
a630398…
|
drh
|
36 |
Many artifacts are simply the content of source files that have |
|
a630398…
|
drh
|
37 |
been checked into the Fossil repository. Call these "content artifacts". |
|
a630398…
|
drh
|
38 |
Other artifacts, known as |
|
a630398…
|
drh
|
39 |
"control artifacts", contain ASCII text in a particular format that |
|
a630398…
|
drh
|
40 |
defines relationships between other artifacts, such as which |
|
a630398…
|
drh
|
41 |
content artifacts that go together to form a particular version of the |
|
796db89…
|
drh
|
42 |
project. Each artifact is named by its SHA1 or SHA3-256 hash and is |
|
796db89…
|
drh
|
43 |
thus immutable. |
|
a630398…
|
drh
|
44 |
Artifacts can be added to the database but not removed (if we ignore |
|
a630398…
|
drh
|
45 |
the exceptional case of [./shunning.wiki | shunning].) Repositories |
|
a630398…
|
drh
|
46 |
synchronize by computing the union of their artifact sets. SQL and |
|
a630398…
|
drh
|
47 |
relation theory play no role in any of this. |
|
a630398…
|
drh
|
48 |
|
|
a630398…
|
drh
|
49 |
SQL enters the picture only in the implementation details. The current |
|
a630398…
|
drh
|
50 |
implementation of Fossil stores each artifact as a BLOB in an SQLite |
|
a630398…
|
drh
|
51 |
database. |
|
a630398…
|
drh
|
52 |
The current implementation also parses up each control artifact as it |
|
a630398…
|
drh
|
53 |
arrives and stores the information discovered from that parse in various |
|
a630398…
|
drh
|
54 |
other SQLite tables to facilitate rapid generation of reports such as |
|
a630398…
|
drh
|
55 |
timelines, file histories, file lists, branch lists, and so forth. Note |
|
a630398…
|
drh
|
56 |
that all of this additional information is derived from the artifacts. |
|
a630398…
|
drh
|
57 |
The artifacts are canonical. The relational tables serve only as a cache. |
|
a630398…
|
drh
|
58 |
Everything in the relational tables can be recomputed |
|
a630398…
|
drh
|
59 |
from the artifacts, and in fact that is exactly what happens when one runs |
|
a630398…
|
drh
|
60 |
the "fossil rebuild" command on a repository. |
|
a630398…
|
drh
|
61 |
|
|
a630398…
|
drh
|
62 |
So really, Fossil works with two separate databases. There is the |
|
a630398…
|
drh
|
63 |
bag-of-artifacts database which is non-relational and distributed (like |
|
a630398…
|
drh
|
64 |
a NoSQL database) and there is the local relational database. The |
|
a630398…
|
drh
|
65 |
bag-of-artifacts database has a fixed format and is what defines a Fossil |
|
a630398…
|
drh
|
66 |
repository. Fossil will never modify the file format of the bag-of-artifacts |
|
a630398…
|
drh
|
67 |
database in an incompatible way because to do so would be to make something |
|
a630398…
|
drh
|
68 |
that is no longer "Fossil". The local relational database, on the other hand, |
|
f47b705…
|
jan.nijtmans
|
69 |
is a cache that contains information derived from the bag-of-artifacts. |
|
a630398…
|
drh
|
70 |
The schema of the local relational database changes from time to time as |
|
a630398…
|
drh
|
71 |
the Fossil implementation is enhanced, and the content is recomputed from |
|
a630398…
|
drh
|
72 |
the unchanging bag of artifacts. The local relational database is an |
|
a630398…
|
drh
|
73 |
implementation detail which currently happens to use SQLite. |
|
a630398…
|
drh
|
74 |
|
|
a630398…
|
drh
|
75 |
Another way to think of the relational tables in a Fossil repository is |
|
a630398…
|
drh
|
76 |
as an index for the artifacts. Without the relational tables, |
|
a630398…
|
drh
|
77 |
to generate a report like a timeline would require scanning every artifact - |
|
a3be0b8…
|
drh
|
78 |
the equivalent of a full table scan. The relational tables hold pointers to |
|
44b02c3…
|
BMorgat
|
79 |
the relevant artifacts in presorted order so that generating a timeline |
|
a630398…
|
drh
|
80 |
is much more efficient. So like an index in a relational database, the |
|
a3be0b8…
|
drh
|
81 |
relational tables in a Fossil repository do not add any new information, |
|
a630398…
|
drh
|
82 |
they merely make the information in the artifacts faster and easier to |
|
a630398…
|
drh
|
83 |
look up. |
|
a630398…
|
drh
|
84 |
|
|
a630398…
|
drh
|
85 |
Fossil is not "based" on SQLite. Fossil simply exploits SQLite as |
|
a630398…
|
drh
|
86 |
a powerful tool to make the implementation easier. |
|
a630398…
|
drh
|
87 |
And Fossil doesn't use a distributed |
|
a630398…
|
drh
|
88 |
NoSQL database because Fossil is a distributed NoSQL database. That answers |
|
a630398…
|
drh
|
89 |
the first question. |
|
a630398…
|
drh
|
90 |
|
|
a630398…
|
drh
|
91 |
<h2>SQL Is A High-Level Scripting Language</h2> |
|
a630398…
|
drh
|
92 |
|
|
a630398…
|
drh
|
93 |
The second concern states that Fossil does not use a high-level scripting |
|
f47b705…
|
jan.nijtmans
|
94 |
language. But that is not true. Fossil uses SQL (as implemented by SQLite) |
|
ce6b68c…
|
drh
|
95 |
as its scripting language. |
|
a630398…
|
drh
|
96 |
|
|
a630398…
|
drh
|
97 |
This misunderstanding likely arises because people fail |
|
a630398…
|
drh
|
98 |
to appreciate that SQL is a programming language. People are taught that SQL |
|
ce6b68c…
|
drh
|
99 |
is a "query language" as if that were somehow different from a |
|
7b011ab…
|
stephan
|
100 |
"programming language". But they really are two different flavors of the |
|
a630398…
|
drh
|
101 |
same thing. I find that people do better with SQL if they think of |
|
a630398…
|
drh
|
102 |
SQL as a programming language and each statement |
|
81b0597…
|
drh
|
103 |
of SQL is a separate program. SQL is a peculiar programming language |
|
8748d75…
|
eric
|
104 |
in that one uses SQL to specify <i>what</i> to compute whereas in |
|
a630398…
|
drh
|
105 |
most other programming languages one specifies <i>how</i> |
|
a630398…
|
drh
|
106 |
to carry out the computation. |
|
a630398…
|
drh
|
107 |
This difference means that SQL |
|
a630398…
|
drh
|
108 |
is an extraordinary high-level programming language, but it is still |
|
a630398…
|
drh
|
109 |
just a programming language. |
|
a630398…
|
drh
|
110 |
|
|
a630398…
|
drh
|
111 |
For certain types of problems, SQL has a huge advantage over other |
|
a630398…
|
drh
|
112 |
programming languages because it is so high level and because it allows |
|
4d1ac68…
|
drh
|
113 |
programmers to focus more on the <i>what</i> and less on the <i>how</i> |
|
a630398…
|
drh
|
114 |
of a computation. In other words, |
|
a630398…
|
drh
|
115 |
programmers tend to think about problems at a much higher level when |
|
16e703b…
|
drh
|
116 |
using SQL; this can result in better applications. |
|
a630398…
|
drh
|
117 |
SQL is also very dense. |
|
a630398…
|
drh
|
118 |
In practice, this often means that a few |
|
a630398…
|
drh
|
119 |
lines of SQL can often replace hundreds or thousands of lines of |
|
a630398…
|
drh
|
120 |
procedural code, with a corresponding decrease in programming effort |
|
a630398…
|
drh
|
121 |
and opportunities to introduce bugs. |
|
a630398…
|
drh
|
122 |
Fossil happens to be one of those problems for which SQL is well suited. |
|
a630398…
|
drh
|
123 |
|
|
a630398…
|
drh
|
124 |
Much of the "heavy lifting" within the Fossil implementation is carried |
|
a630398…
|
drh
|
125 |
out using SQL statements. It is true that these SQL statements are glued |
|
a630398…
|
drh
|
126 |
together with C code, but it turns out that C works surprisingly well in |
|
a630398…
|
drh
|
127 |
that role. Several early prototypes of Fossil were written in a scripting |
|
a630398…
|
drh
|
128 |
language (TCL). We normally find that TCL programs are shorter than the |
|
a630398…
|
drh
|
129 |
equivalent C code by a factor of 10 or more. But in the case of Fossil, |
|
f47b705…
|
jan.nijtmans
|
130 |
the use of TCL was actually making the code longer and more difficult to |
|
a630398…
|
drh
|
131 |
understand. |
|
a630398…
|
drh
|
132 |
And so in the final design, we switched from TCL to C in order to make |
|
a630398…
|
drh
|
133 |
the code easier to implement and debug. |
|
a630398…
|
drh
|
134 |
|
|
a630398…
|
drh
|
135 |
Without the advantages of having SQLite built in, the design might well |
|
a630398…
|
drh
|
136 |
have followed a different path. Most reports generated by Fossil involve |
|
a630398…
|
drh
|
137 |
a complex set of queries against the relational tables of the repository |
|
a630398…
|
drh
|
138 |
database. These queries are normally implemented in only a few dozen |
|
a630398…
|
drh
|
139 |
lines of SQL code. But if those queries had been implemented procedurally |
|
a630398…
|
drh
|
140 |
using a key/value or pile-of-files database, it |
|
a630398…
|
drh
|
141 |
may have well been the case that a high-level scripting language such as |
|
a630398…
|
drh
|
142 |
Tcl, Python, or Ruby may have worked out better than C. |