Fossil SCM

fossil-scm / www / theory1.wiki
Source Blame History 142 lines
a630398… drh 1 <title>Thoughts On The Design Of The Fossil DVCS</title>
a630398… drh 2
a630398… drh 3 Two questions (or criticisms) that arise frequently regarding Fossil
a630398… drh 4 can be summarized as follows:
a630398… drh 5
a630398… drh 6 1. Why is Fossil based on SQLite instead of a distributed NoSQL database?
a630398… drh 7
a630398… drh 8 2. Why is Fossil written in C instead of a modern high-level language?
a630398… drh 9
a630398… drh 10 Neither question can be answered directly because they are both
a630398… drh 11 based on false assumptions. We claim that Fossil is not based on SQLite
a630398… drh 12 at all and that Fossil is not based on a distributed NoSQL database
a630398… drh 13 because Fossil is a distributed NoSQL database. And, Fossil does use
a630398… drh 14 a modern high-level language for its implementation, namely SQL.
a630398… drh 15
a630398… drh 16 <h2>Fossil Is A NoSQL Database</h2>
a630398… drh 17
a630398… drh 18 We begin with the first question: Fossil is not based on a distributed
a630398… drh 19 NoSQL database because Fossil <u><i>is</i></u> a distributed NoSQL database.
f47b705… jan.nijtmans 20 Fossil is <u>not</u> based on SQLite.
a630398… drh 21 The current implementation of Fossil uses
a630398… drh 22 SQLite as a local store for the content of the distributed database and as
a630398… drh 23 a cache for meta-information about the distributed database that is precomputed
a630398… drh 24 for quick and easy presentation. But the use of SQLite in this role is an
a630398… drh 25 implementation detail and is not fundamental to the design. Some future
a630398… drh 26 version of Fossil might do away with SQLite and substitute a pile-of-files or
f47b705… jan.nijtmans 27 a key/value database in place of SQLite.
a630398… drh 28 (Actually, that is very unlikely
a630398… drh 29 to happen since SQLite works amazingly well in its current role, but the point
a630398… drh 30 is that omitting SQLite from Fossil is a theoretical possibility.)
a630398… drh 31
a630398… drh 32 The underlying database that Fossil implements has nothing to do with
a630398… drh 33 SQLite, or SQL, or even relational database theory. The underlying
a630398… drh 34 database is very simple: it is an unordered collection of "artifacts".
a630398… drh 35 An artifact is a list of bytes - a "file" in the usual manner of thinking.
a630398… drh 36 Many artifacts are simply the content of source files that have
a630398… drh 37 been checked into the Fossil repository. Call these "content artifacts".
a630398… drh 38 Other artifacts, known as
a630398… drh 39 "control artifacts", contain ASCII text in a particular format that
a630398… drh 40 defines relationships between other artifacts, such as which
a630398… drh 41 content artifacts that go together to form a particular version of the
796db89… drh 42 project. Each artifact is named by its SHA1 or SHA3-256 hash and is
796db89… drh 43 thus immutable.
a630398… drh 44 Artifacts can be added to the database but not removed (if we ignore
a630398… drh 45 the exceptional case of [./shunning.wiki | shunning].) Repositories
a630398… drh 46 synchronize by computing the union of their artifact sets. SQL and
a630398… drh 47 relation theory play no role in any of this.
a630398… drh 48
a630398… drh 49 SQL enters the picture only in the implementation details. The current
a630398… drh 50 implementation of Fossil stores each artifact as a BLOB in an SQLite
a630398… drh 51 database.
a630398… drh 52 The current implementation also parses up each control artifact as it
a630398… drh 53 arrives and stores the information discovered from that parse in various
a630398… drh 54 other SQLite tables to facilitate rapid generation of reports such as
a630398… drh 55 timelines, file histories, file lists, branch lists, and so forth. Note
a630398… drh 56 that all of this additional information is derived from the artifacts.
a630398… drh 57 The artifacts are canonical. The relational tables serve only as a cache.
a630398… drh 58 Everything in the relational tables can be recomputed
a630398… drh 59 from the artifacts, and in fact that is exactly what happens when one runs
a630398… drh 60 the "fossil rebuild" command on a repository.
a630398… drh 61
a630398… drh 62 So really, Fossil works with two separate databases. There is the
a630398… drh 63 bag-of-artifacts database which is non-relational and distributed (like
a630398… drh 64 a NoSQL database) and there is the local relational database. The
a630398… drh 65 bag-of-artifacts database has a fixed format and is what defines a Fossil
a630398… drh 66 repository. Fossil will never modify the file format of the bag-of-artifacts
a630398… drh 67 database in an incompatible way because to do so would be to make something
a630398… drh 68 that is no longer "Fossil". The local relational database, on the other hand,
f47b705… jan.nijtmans 69 is a cache that contains information derived from the bag-of-artifacts.
a630398… drh 70 The schema of the local relational database changes from time to time as
a630398… drh 71 the Fossil implementation is enhanced, and the content is recomputed from
a630398… drh 72 the unchanging bag of artifacts. The local relational database is an
a630398… drh 73 implementation detail which currently happens to use SQLite.
a630398… drh 74
a630398… drh 75 Another way to think of the relational tables in a Fossil repository is
a630398… drh 76 as an index for the artifacts. Without the relational tables,
a630398… drh 77 to generate a report like a timeline would require scanning every artifact -
a3be0b8… drh 78 the equivalent of a full table scan. The relational tables hold pointers to
44b02c3… BMorgat 79 the relevant artifacts in presorted order so that generating a timeline
a630398… drh 80 is much more efficient. So like an index in a relational database, the
a3be0b8… drh 81 relational tables in a Fossil repository do not add any new information,
a630398… drh 82 they merely make the information in the artifacts faster and easier to
a630398… drh 83 look up.
a630398… drh 84
a630398… drh 85 Fossil is not "based" on SQLite. Fossil simply exploits SQLite as
a630398… drh 86 a powerful tool to make the implementation easier.
a630398… drh 87 And Fossil doesn't use a distributed
a630398… drh 88 NoSQL database because Fossil is a distributed NoSQL database. That answers
a630398… drh 89 the first question.
a630398… drh 90
a630398… drh 91 <h2>SQL Is A High-Level Scripting Language</h2>
a630398… drh 92
a630398… drh 93 The second concern states that Fossil does not use a high-level scripting
f47b705… jan.nijtmans 94 language. But that is not true. Fossil uses SQL (as implemented by SQLite)
ce6b68c… drh 95 as its scripting language.
a630398… drh 96
a630398… drh 97 This misunderstanding likely arises because people fail
a630398… drh 98 to appreciate that SQL is a programming language. People are taught that SQL
ce6b68c… drh 99 is a "query language" as if that were somehow different from a
7b011ab… stephan 100 "programming language". But they really are two different flavors of the
a630398… drh 101 same thing. I find that people do better with SQL if they think of
a630398… drh 102 SQL as a programming language and each statement
81b0597… drh 103 of SQL is a separate program. SQL is a peculiar programming language
8748d75… eric 104 in that one uses SQL to specify <i>what</i> to compute whereas in
a630398… drh 105 most other programming languages one specifies <i>how</i>
a630398… drh 106 to carry out the computation.
a630398… drh 107 This difference means that SQL
a630398… drh 108 is an extraordinary high-level programming language, but it is still
a630398… drh 109 just a programming language.
a630398… drh 110
a630398… drh 111 For certain types of problems, SQL has a huge advantage over other
a630398… drh 112 programming languages because it is so high level and because it allows
4d1ac68… drh 113 programmers to focus more on the <i>what</i> and less on the <i>how</i>
a630398… drh 114 of a computation. In other words,
a630398… drh 115 programmers tend to think about problems at a much higher level when
16e703b… drh 116 using SQL; this can result in better applications.
a630398… drh 117 SQL is also very dense.
a630398… drh 118 In practice, this often means that a few
a630398… drh 119 lines of SQL can often replace hundreds or thousands of lines of
a630398… drh 120 procedural code, with a corresponding decrease in programming effort
a630398… drh 121 and opportunities to introduce bugs.
a630398… drh 122 Fossil happens to be one of those problems for which SQL is well suited.
a630398… drh 123
a630398… drh 124 Much of the "heavy lifting" within the Fossil implementation is carried
a630398… drh 125 out using SQL statements. It is true that these SQL statements are glued
a630398… drh 126 together with C code, but it turns out that C works surprisingly well in
a630398… drh 127 that role. Several early prototypes of Fossil were written in a scripting
a630398… drh 128 language (TCL). We normally find that TCL programs are shorter than the
a630398… drh 129 equivalent C code by a factor of 10 or more. But in the case of Fossil,
f47b705… jan.nijtmans 130 the use of TCL was actually making the code longer and more difficult to
a630398… drh 131 understand.
a630398… drh 132 And so in the final design, we switched from TCL to C in order to make
a630398… drh 133 the code easier to implement and debug.
a630398… drh 134
a630398… drh 135 Without the advantages of having SQLite built in, the design might well
a630398… drh 136 have followed a different path. Most reports generated by Fossil involve
a630398… drh 137 a complex set of queries against the relational tables of the repository
a630398… drh 138 database. These queries are normally implemented in only a few dozen
a630398… drh 139 lines of SQL code. But if those queries had been implemented procedurally
a630398… drh 140 using a key/value or pile-of-files database, it
a630398… drh 141 may have well been the case that a high-level scripting language such as
a630398… drh 142 Tcl, Python, or Ruby may have worked out better than C.

Keyboard Shortcuts

Open search /
Next entry (timeline) j
Previous entry (timeline) k
Open focused entry Enter
Show this help ?
Toggle theme Top nav button