Fossil SCM
Post: The cluster storm of 2024-12-20
On 2024-12-20 09:05Z, Andy Goth pushed a chain of 50 cluster artifacts to the main Fossil repository, each artifact sent as a separate "push". The entire operation took about 7 seconds. Subsequent sync operations by other users had to pull down these clusters, one by one, using 50 round-trips to the server.
Do you know what a cluster is? If not, here is more information: https://fossil-scm.org/home/doc/trunk/www/fileformat.wiki#cluster
Two issues here:
- How and why did these 50 redundant cluster artifacts get generated?
- Why does it take 50 round-trips to the server to push or pull the cluster chain?
I speculate that somehow Andy was running a Fossil instance separately for a long time, and it generated about fifty new and redundant clusters, then he decided to "sync" that repo. Perhaps he can add a follow-up to this post better explaining what happened.
As for why it takes 50 server round-trips to sync them all, I think that is because the machine with the extra clusters initially notifies the far side about only the most recent cluster. The other side gets the most recent cluster (only) and sees that it already has all the artifacts in that cluster, except for the next cluster in the chain, which it then requests. That process repeats 50 times.
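That chain-following behavior can be illustrated with a toy model (this is a sketch of the mechanism described above, not Fossil's actual code): each cluster names its predecessor, and the receiver only learns about that predecessor after parsing the cluster it just received, so a chain of N clusters costs N round-trips.

```python
def pull_chain(server_clusters, newest):
    """server_clusters maps each cluster id to the id of the cluster it
    references, or None at the end of the chain."""
    have = set()
    next_wanted = newest       # the server's initial "igot" names only the newest cluster
    round_trips = 0
    while next_wanted is not None:
        round_trips += 1
        have.add(next_wanted)                  # one cluster arrives per round-trip
        pred = server_clusters[next_wanted]    # parsing it reveals its predecessor
        next_wanted = pred if pred not in have else None
    return round_trips

# A chain of 50 clusters, each pointing at the previous one:
chain = {i: (i - 1 if i > 1 else None) for i in range(1, 51)}
print(pull_chain(chain, newest=50))   # -> 50
```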
In the meantime, the 50 redundant clusters that were pushed have been shunned, so they will not trouble passers-by who just want to do a quick "sync" of the main Fossil self-hosting repository. I have also added new web pages to help analyze clusters, and I will be improving those pages in the coming days. Having 50 redundant clusters in the repository is not ideal, but neither is it the end of the world. It would be great to be able to suppress them.
Doing 50 round-trips to the server in order to load a linked list of clusters seems like a bigger problem. I'm not sure exactly how to solve that one yet. Perhaps Fossil can detect the situation and preemptively send a list of all of its cluster artifacts on the second round-trip, thereby shortening the sync process from 50 round-trips down to 3 or 4. Analysis is ongoing.
Remediation:
- Added new reporting for clusters, including the /clusterlist web page and an improved display of cluster contents on the /info web page.
- The sync mechanism now recognizes situations where it might be dealing with a long chain of cluster artifacts and sends "igot" cards on the third round-trip. This causes the sync to finish by the 5th or 6th round-trip.
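The effect of the remediation can be sketched with a simplified model (assumed behavior, not Fossil's actual implementation): if the puller is still chasing cluster predecessors by the third round-trip, the server advertises all of its clusters at once via "igot" cards, so the remainder can be requested together.

```python
def pull(server_clusters, newest, igot_round=None):
    """server_clusters maps cluster id -> predecessor id (None at the end)."""
    have = set()
    wanted = {newest}
    rounds = 0
    while wanted:
        rounds += 1
        received, wanted = set(wanted), set()
        have |= received
        for c in received:                       # parsing each cluster reveals its predecessor
            pred = server_clusters[c]
            if pred is not None and pred not in have:
                wanted.add(pred)
        if rounds == igot_round:                 # server advertises ALL clusters this round
            wanted |= set(server_clusters) - have
    return rounds

chain = {i: (i - 1 if i > 1 else None) for i in range(1, 51)}
print(pull(chain, 50))                 # -> 50 round-trips without the fix
print(pull(chain, 50, igot_round=3))   # -> 4 in this toy model
```

In this toy model the sync finishes in 4 round-trips; the real protocol's additional handshaking accounts for the 5th-or-6th figure above.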
We still do not know how the 50 redundant cluster artifacts got into the server to begin with.
Maybe fossil rebuild --cluster was run regularly, for some reason?
- Create a server repository (S) with 99 artifacts
- Clone the repository to (C)
- Amend the comment of tip in the S check-out
- Amend the comment of tip in the C check-out
- Rebuild C with the --cluster option
- Now C has one cluster
- Have C pull from S
- S creates a cluster for its 100 artifacts during the pull, and also sends it to C
- Now S has one cluster, and C has two clusters
- Have C sync with S
- Now both S and C are in sync and have 103 artifacts:
- The 99 original artifacts
- Two control artifacts, one for each comment amendment made on S and C
- Two clusters, each with 100 entries, differing only by the amendment control artifacts
- One cluster has the amendment control artifact generated by S, the other has the one generated by C
- The remaining 99 entries in each cluster are redundant
- No unknown artifacts, no phantoms, and nothing else that looks conspicuous
The test scenario above was run with the current Fossil trunk [54e4222237].
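The bookkeeping in the scenario above can be sketched with a toy model (this is not Fossil code; the artifact names are illustrative): the two repositories each cluster "their" 100 artifacts, and the resulting clusters overlap in 99 entries.

```python
base = {f"artifact{i}" for i in range(99)}   # the 99 original artifacts (hypothetical names)
amend_S = "control-amend-S"                  # control artifact from the amendment on S
amend_C = "control-amend-C"                  # control artifact from the amendment on C

cluster_C = frozenset(base | {amend_C})      # made by 'fossil rebuild --cluster' on C
cluster_S = frozenset(base | {amend_S})      # made by S during the pull

# After the final sync, both repositories hold everything:
repo = base | {amend_S, amend_C, cluster_S, cluster_C}

redundant = cluster_S & cluster_C            # entries present in both clusters
print(len(repo), len(redundant))             # -> 103 99
```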
This is one way redundant clusters might appear. The step of rebuilding C with
the --cluster option is required to reproduce it, so maybe this feature should
be disabled? (I'd suggest making --cluster a no-op for backwards/script
compatibility.) Or is the --cluster option required for some scenarios?
But there may still be other ways redundant clusters can be generated...
Another way redundant clusters may appear:
- Clone a server repository S with unclustered artifacts to a local clone C0.
- Clone C0 to another local clone C1.
- Push, pull and/or sync between C0 and C1:
- New clusters for unclustered artifacts are created in both C0 and C1.
- An independent 3rd party does a commit on their own local clone X followed by a push and/or sync between X and S:
- New clusters for unclustered artifacts are created in S and X.
- Push and/or sync between S and C0:
- The existing clusters from C0, which lack the commit made on X, are sent to S, where they become redundant clusters.
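The redundancy in this second scenario can also be shown with toy bookkeeping (not Fossil; names are illustrative): the cluster made during the C0/C1 exchange covers only the pre-existing artifacts, while the cluster S made with X also covers X's new commit, so the cluster later sent from C0 to S is strictly subsumed.

```python
base = {f"a{i}" for i in range(100)}                # unclustered artifacts present everywhere
cluster_from_C0 = frozenset(base)                   # created during the C0/C1 push/pull/sync
cluster_on_S = frozenset(base | {"commit-from-X"})  # created during the X/S exchange

# When C0 and S later push/sync, S receives cluster_from_C0 as well:
print(cluster_from_C0 < cluster_on_S)               # -> True: strict subset, i.e. redundant
```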
The mitigation here might be to have clone generate clusters for unclustered
artifacts in S.
However, both clone and pull seem like "read-only" operations with respect to
the server, and it feels a bit odd for them to change the contents of the
server (though pull already does).
Also, if enough new artifacts are added to the C0/C1 ecosystem (likely during development of a new feature on a separate branch) to reach the threshold for creating new clusters, without regular syncing to S, there will again be redundant clusters.
It looks like clusters require a central authority server to avoid redundancy?
Or, maybe servers and clients could replace or reject on-the-fly (not shun) clusters that only contain artifacts already included in larger clusters.
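A minimal sketch of that last idea (assumed logic, not an existing Fossil feature; names are illustrative): before storing an incoming cluster, check whether every artifact it names is already covered by the clusters we hold, and if so, reject or replace it on the fly rather than keeping a redundant cluster. This checks coverage by the union of existing clusters, a slight generalization of "included in a larger cluster".

```python
def is_redundant(incoming, existing_clusters):
    """incoming: set of artifact hashes named by the new cluster.
    existing_clusters: iterable of sets of artifact hashes."""
    covered = set().union(*existing_clusters)   # everything already clustered
    return incoming <= covered                  # nothing new -> redundant

held = [{"a1", "a2", "a3"}, {"a4", "a5"}]
print(is_redundant({"a1", "a4"}, held))         # -> True: fully covered
print(is_redundant({"a1", "a6"}, held))         # -> False: a6 is not yet clustered
```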