<title>Thoughts On The Design Of The Fossil DVCS</title>

Two questions (or criticisms) that arise frequently regarding Fossil
can be summarized as follows:

1. Why is Fossil based on SQLite instead of a distributed NoSQL database?

2. Why is Fossil written in C instead of a modern high-level language?

Neither question can be answered directly because both are
based on false assumptions. We claim that Fossil is not based on SQLite
at all and that Fossil is not based on a distributed NoSQL database
because Fossil is a distributed NoSQL database. And Fossil does use
a modern high-level language for its implementation, namely SQL.

<h2>Fossil Is A NoSQL Database</h2>

We begin with the first question: Fossil is not based on a distributed
NoSQL database because Fossil <u><i>is</i></u> a distributed NoSQL database.
Fossil is <u>not</u> based on SQLite.
The current implementation of Fossil uses
SQLite as a local store for the content of the distributed database and as
a cache for meta-information about the distributed database that is precomputed
for quick and easy presentation. But the use of SQLite in this role is an
implementation detail and is not fundamental to the design. Some future
version of Fossil might do away with SQLite and substitute a pile-of-files or
a key/value database.
(Actually, that is very unlikely
to happen since SQLite works amazingly well in its current role, but the point
is that omitting SQLite from Fossil is a theoretical possibility.)

The underlying database that Fossil implements has nothing to do with
SQLite, or SQL, or even relational database theory. The underlying
database is very simple: it is an unordered collection of "artifacts".
An artifact is a list of bytes - a "file" in the usual manner of thinking.
Many artifacts are simply the content of source files that have
been checked into the Fossil repository. Call these "content artifacts".
Other artifacts, known as
"control artifacts", contain ASCII text in a particular format that
defines relationships between other artifacts, such as which
content artifacts go together to form a particular version of the
project. Each artifact is named by its SHA1 or SHA3-256 hash and is
thus immutable.
Artifacts can be added to the database but not removed (if we ignore
the exceptional case of [./shunning.wiki | shunning]). Repositories
synchronize by computing the union of their artifact sets. SQL and
relational theory play no role in any of this.

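The bag-of-artifacts model described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Fossil's implementation: the class name ArtifactBag and its methods are invented here, only SHA3-256 naming is modeled, and synchronization is reduced to a plain set union with no wire protocol and no shunning.

```python
import hashlib


class ArtifactBag:
    """Toy model (hypothetical, not Fossil code) of an unordered,
    append-only collection of artifacts named by their hash."""

    def __init__(self):
        self._artifacts = {}  # hex hash name -> immutable bytes

    def add(self, content: bytes) -> str:
        """Store content under its SHA3-256 hash and return that name."""
        name = hashlib.sha3_256(content).hexdigest()
        # Adding identical content twice is a no-op: same bytes, same name.
        self._artifacts.setdefault(name, bytes(content))
        return name

    def get(self, name: str) -> bytes:
        return self._artifacts[name]

    def names(self) -> set:
        return set(self._artifacts)

    def sync_with(self, other: "ArtifactBag") -> None:
        """Leave both bags holding the union of their artifact sets."""
        for name in other.names() - self.names():
            self._artifacts[name] = other.get(name)
        for name in self.names() - other.names():
            other._artifacts[name] = self.get(name)


alice, bob = ArtifactBag(), ArtifactBag()
shared = alice.add(b"int main(void){ return 0; }\n")
bob.add(b"int main(void){ return 0; }\n")      # same content, same name
only_alice = alice.add(b"README, first draft\n")
only_bob = bob.add(b"Makefile, first draft\n")
alice.sync_with(bob)
```

After the sync, both bags hold the same three names. Nothing in the sketch needs SQL, tables, or a schema, which is the point of the paragraph above.
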
SQL enters the picture only in the implementation details. The current
implementation of Fossil stores each artifact as a BLOB in a SQLite
database.
The current implementation also parses each control artifact as it
arrives and stores the information discovered from that parse in various
other SQLite tables to facilitate rapid generation of reports such as
timelines, file histories, file lists, branch lists, and so forth. Note
that all of this additional information is derived from the artifacts.
The artifacts are canonical. The relational tables serve only as a cache.
Everything in the relational tables can be recomputed
from the artifacts, and in fact that is exactly what happens when one runs
the "fossil rebuild" command on a repository.

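The relationship between canonical artifacts and derived tables can be sketched with Python's sqlite3 module. Everything specific here is an assumption made for illustration: the table names artifact and timeline_cache, their columns, and the one-line "checkin" text format are invented and are not Fossil's actual schema or control-artifact format. The sketch shows only the principle that the derived table can be thrown away and recomputed from the blobs.

```python
import sqlite3


def is_control(content: bytes) -> bool:
    # Hypothetical toy format, NOT Fossil's: b"checkin <unix-time> <comment>"
    return content.startswith(b"checkin ")


def rebuild(db: sqlite3.Connection) -> None:
    """Discard the derived table and recompute it from the blobs alone."""
    db.execute("DROP TABLE IF EXISTS timeline_cache")
    db.execute(
        "CREATE TABLE timeline_cache(artifact_id INTEGER, mtime INTEGER, comment TEXT)"
    )
    rows = db.execute("SELECT id, content FROM artifact").fetchall()
    for artifact_id, content in rows:
        if is_control(content):
            _, mtime, comment = content.decode("ascii").split(" ", 2)
            db.execute(
                "INSERT INTO timeline_cache VALUES (?, ?, ?)",
                (artifact_id, int(mtime), comment),
            )
    db.commit()


def timeline(db: sqlite3.Connection) -> list:
    return db.execute(
        "SELECT mtime, comment FROM timeline_cache ORDER BY mtime DESC"
    ).fetchall()


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE artifact(id INTEGER PRIMARY KEY, content BLOB)")
contents = [b"checkin 200 second commit", b"print('hello')\n", b"checkin 100 first commit"]
db.executemany("INSERT INTO artifact(content) VALUES (?)", [(c,) for c in contents])

rebuild(db)
before = timeline(db)
db.execute("DROP TABLE timeline_cache")  # simulate losing the cache entirely
rebuild(db)
after = timeline(db)
```

Here before and after are identical, because the cache never held anything that was not already implied by the artifacts.
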
So really, Fossil works with two separate databases. There is the
bag-of-artifacts database which is non-relational and distributed (like
a NoSQL database) and there is the local relational database. The
bag-of-artifacts database has a fixed format and is what defines a Fossil
repository. Fossil will never modify the file format of the bag-of-artifacts
database in an incompatible way because to do so would be to make something
that is no longer "Fossil". The local relational database, on the other hand,
is a cache that contains information derived from the bag-of-artifacts.
The schema of the local relational database changes from time to time as
the Fossil implementation is enhanced, and the content is recomputed from
the unchanging bag of artifacts. The local relational database is an
implementation detail which currently happens to use SQLite.

Another way to think of the relational tables in a Fossil repository is
as an index for the artifacts. Without the relational tables,
generating a report like a timeline would require scanning every artifact -
the equivalent of a full table scan. The relational tables hold pointers to
the relevant artifacts in presorted order so that generating a timeline
is much more efficient. So like an index in a relational database, the
relational tables in a Fossil repository do not add any new information;
they merely make the information in the artifacts faster and easier to
look up.

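The index analogy can be illustrated without any SQL at all. The following Python sketch is an assumption-laden toy, not Fossil code: the "checkin" text format and all function names are invented. It answers the same timeline question twice, once by visiting every artifact and once by reading a presorted list of pointers that is maintained as artifacts arrive.

```python
import bisect
import hashlib


def parse_checkin(content: bytes):
    # Hypothetical toy format, NOT Fossil's: b"checkin <unix-time> <comment>"
    if not content.startswith(b"checkin "):
        return None
    _, mtime, comment = content.decode("ascii").split(" ", 2)
    return int(mtime), comment


artifacts = {}       # hash name -> bytes; the canonical data
timeline_index = []  # sorted (mtime, name) pointers; derived data only


def add_artifact(content: bytes) -> str:
    name = hashlib.sha3_256(content).hexdigest()
    if name not in artifacts:
        artifacts[name] = content
        parsed = parse_checkin(content)
        if parsed is not None:
            # Keep the pointers in sorted order as each artifact arrives.
            bisect.insort(timeline_index, (parsed[0], name))
    return name


def timeline_by_scan(limit: int) -> list:
    """Visit every artifact, the equivalent of a full table scan."""
    events = [(parse_checkin(c), n) for n, c in artifacts.items()]
    found = sorted(((p[0], n) for p, n in events if p is not None), reverse=True)
    return [n for _, n in found[:limit]]


def timeline_by_index(limit: int) -> list:
    """Read the newest entries straight from the presorted pointers."""
    return [n for _, n in reversed(timeline_index[-limit:])]


first = add_artifact(b"checkin 100 first")
add_artifact(b"def f(): pass\n")
third = add_artifact(b"checkin 300 third")
second = add_artifact(b"checkin 200 second")
```

Both functions return the same names in the same order. The index holds no information that the scan could not recover; it only spares the scan.
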
Fossil is not "based" on SQLite. Fossil simply exploits SQLite as
a powerful tool to make the implementation easier.
And Fossil doesn't use a distributed
NoSQL database because Fossil is a distributed NoSQL database. That answers
the first question.

<h2>SQL Is A High-Level Scripting Language</h2>

The second concern states that Fossil does not use a high-level scripting
language. But that is not true. Fossil uses SQL (as implemented by SQLite)
as its scripting language.

This misunderstanding likely arises because people fail
to appreciate that SQL is a programming language. People are taught that SQL
is a "query language" as if that were somehow different from a
"programming language". But they really are two different flavors of the
same thing. I find that people do better with SQL if they think of
SQL as a programming language and of each SQL statement
as a separate program. SQL is a peculiar programming language
in that one uses SQL to specify <i>what</i> to compute whereas in
most other programming languages one specifies <i>how</i>
to carry out the computation.
This difference means that SQL
is an extraordinarily high-level programming language, but it is still
just a programming language.

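The difference between saying <i>what</i> and saying <i>how</i> shows up even in a tiny example. The sketch below uses Python with its sqlite3 module; the table name checkin, its columns, and the sample rows are invented for illustration and are not taken from Fossil. Both functions compute the most recent check-in time per branch.

```python
import sqlite3

rows = [
    ("trunk", 100),
    ("feature", 150),
    ("trunk", 200),
    ("trunk", 300),
    ("feature", 400),
]


def latest_per_branch_by_hand(rows):
    """'How': spell out the iteration, the bookkeeping, and the sort."""
    latest = {}
    for branch, mtime in rows:
        if branch not in latest or mtime > latest[branch]:
            latest[branch] = mtime
    return sorted(latest.items())


def latest_per_branch_sql(rows):
    """'What': describe the result and let the SQL engine plan the work."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE checkin(branch TEXT, mtime INTEGER)")
    db.executemany("INSERT INTO checkin VALUES (?, ?)", rows)
    return db.execute(
        "SELECT branch, max(mtime) FROM checkin GROUP BY branch ORDER BY branch"
    ).fetchall()
```

The SQL statement is a complete program in one sentence. The hand-written version has to commit to a particular loop, a particular data structure, and a particular sort, all of which the SQL version leaves to the engine.
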
For certain types of problems, SQL has a huge advantage over other
programming languages because it
