Fossil SCM

checkin/fdeebddea7abf6addf1640da01804f230e9a10c2b081b0aafdcac1642d0412ba

5 years, 3 months ago by drh

An analysis of the SQL problem that this check-in attempts to resolve.

Consider the following schema in which there are a large number objects and each object can have zero or more tags. A single tag might apply to any number of objects. Most tags apply to only a few objects, but a few tags might apply to many objects. One or two tags might apply to 95% of the objects.

The "objects" here are really EVENT table entries and "tags" are the combination of the TAG and TAGXREF tables. Let's assume that there are 1,000,000 objects and 50,000 distinct tagNames. Most objects have exactly one tag. 5% of the objects have two or more tags. Most tagNames are assigned to less than 20 objects, but a few tags are assigned to many objects, and one tag in particular is assigned to about 95% of the objects.

~~~ CREATE TABLE object( objectId INTEGER PRIMARY KEY, value REAL ); CREATE INDEX object_value ON object(value); CREATE TABLE tag( tagName TEXT, objectId INTEGER REFERENCES object ON DELETE CASCADE, PRIMARY KEY(tagName, objectId) );


Given a tagName $TAG, we want to find the object with that
tag that has the largest value.  The code was computing that
objectId as follows:

> ~~~
SELECT objectId
  FROM object JOIN tag USING(objectId)
 WHERE tagName=$TAG
 ORDER BY value DESC
 LIMIT 1;

The previous query works well when the number of objects associated with $TAG are small (at least compared to the total number of objects). But what if 95% of the objects have $TAG? In that case, the query above is inefficient since it needs to search most of the object table to find the object with the largest value. The following query works by scaning objects in decreasing value order looking for the first one that has $TAG:

~~~ SELECT objectId FROM object WHERE EXISTS(SELECT 1 FROM tag WHERE tag.objectId=object.objectId AND tag.tagName=$TAG) ORDER BY value DESC LIMIT 1;


The problem is, we don't know in advance if $TAG is abundant or rare.
This check-in attempts to resolve the problem by using the following grotesque
query:

> ~~~
SELECT objectId, value
  FROM (SELECT objectId, value FROM object ORDER BY value DESC LIMIT 30) AS ox
 WHERE EXISTS(SELECT 1 FROM tag
               WHERE tag.objectId=ox.objectId
                 AND tag.tagName=$TAG)
UNION ALL
SELECT objectId, value FROM (
  SELECT objectId, value
    FROM object JOIN tag USING(objectId)
   WHERE tagName=$TAG
   ORDER BY object.value DESC
   LIMIT 1
)
LIMIT 1;

The previous query works as follows: The first term of the UNION ALL searches for an object with $TAG amoung the 30 objects with the largest values. If there are no matches, then it falls down into the second term of the UNION ALL which searches all objects with $TAG for the one with the largest value. The query works because SQLite evaluates UNION ALL compounds sequentially. (This is not guaranteed by the SQLite documentation, but it has always been so. Perhaps SQLite should make it a hard guarantee moving forward?)

The grotesque query is still inefficient if, for example, all one million objects have $TAG except for the 50 largest. We could denormalize the schema, making a copy of the object.value into the tag table, and keeping the values in sync using triggers. like this:

~~~ CREATE TABLE object( objectId INTEGER PRIMARY KEY, value REAL ); CREATE INDEX object_value ON object(value); CREATE TABLE tag( tagName TEXT, objectId INTEGER REFERENCES object ON DELETE CASCASE, objectValue REAL, -- Copy of object.value PRIMARY KEY(tagName, objectId) ); CREATE INDEX tag_value ON tag(objectValue); CREATE TRIGGER AFTER UPDATE ON object(value) BEGIN UPDATE tag SET objectValue=new.value WHERE objectId=new.objectId; END; CREATE TRIGGER AFTER INSERT ON tag BEGIN UPDATE tag SET objectValue=(SELECT value FROM object WHERE objectId=new.objectId) WHERE tagName=new.tagName and objectId=new.objectId; END; CREATE TRIGGER AFTER UPDATE ON tag BEGIN UPDATE tag SET objectValue=(SELECT value FROM object WHERE objectId=new.objectId) WHERE tagName=new.tagName and objectId=new.objectId; END;


In that case, the following query is always efficient:

> ~~~
SELECT objectId
  FROM tag
 WHERE tagName=$TAG
 ORDER BY objectValue DESC
 LIMIT 1;

The downside is that there are now three triggers and an extra index in the schema and the schema is denormalized.

Keyboard Shortcuts

Open search /
Next entry (timeline) j
Previous entry (timeline) k
Open focused entry Enter
Show this help ?
Toggle theme Top nav button