Fossil Forum

anonymous 1 week, 3 days ago

Post: How do merges work in Fossil (and optionally other VCSs)?

Hi.

I'm creating my own VCS and I was wondering how Fossil went about implementing merges. I'm more interested in the concept as opposed to the direct source code, which is why I chose to write a forum post instead.

I'm curious on the following:

  • How do you implement a merge across more than two branches? If you were merging branches A, B, C and D, with E as the common ancestor, is it just merge3(A, merge3(B, merge3(C, D, E), E), E) where the signature is merge3(ours, theirs, base)?

  • How do you check if a file has been merged or not? How do you check if it's been cleaned up by the user and is then ready to commit? Is this managed in the .fslckout or somewhere else?

It's 1am right now for me (my timezone is UTC+0). If I have any more questions, I'll drop them tomorrow. If there are any resources for implementing this kind of merge functionality that I can check out, please do share. I'm very interested in this sort of thing.

chungy 1 week, 3 days ago

How do you implement a merge across more than two branches? If you were merging branches A, B, C and D, with E as the common ancestor, is it just merge3(A, merge3(B, merge3(C, D, E), E), E) where the signature is merge3(ours, theirs, base)?

That's effectively what's happening, even though the code actually just loops back to progressively merge each branch until the list is empty. See merge.c for those details.
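
That loop can be sketched in a few lines. A minimal illustration, not Fossil's actual code: `merge3` stands for any 3-way line merge with the signature from the question, and the names here are my own.

```python
# Sketch of an n-way merge as a loop of 3-way merges, as described above.
# `merge3(ours, theirs, base)` is assumed to return (merged_lines, had_conflict).
def merge_all(current, branches, base, merge3):
    """Progressively merge each branch into `current` against the common base."""
    for branch in branches:
        current, had_conflict = merge3(current, branch, base)
        if had_conflict:
            # Fossil-style behavior: stop and let the human resolve first
            raise RuntimeError("conflict: resolve before merging the next branch")
    return current
```

The point being that the nested merge3(A, merge3(B, ...), E) expression from the question and this loop are the same computation, just written iteratively.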

How do you check if a file has been merged or not? How do you check if it's been cleaned up by the user and is then ready to commit? Is this managed in the .fslckout or somewhere else?

The fact of a merge is recorded/staged in the .fslckout file, and the working tree is automatically modified to reflect the result of the merge operation. All file changes, removals, renames, and conflicts should be present. Fossil does give a loud warning about the presence of conflicts, which are left to the user to resolve. You might notice that the fossil status command dutifully reports messages such as MERGED_WITH, ADDED_BY_MERGE, and so forth.

Fossil makes it your responsibility to test the results of the merge before committing it as a new checkin. You can even use the fossil undo or fossil revert commands to wipe the merge state from the working directory. Deleting the working directory works too, of course. :-)

Because of this working method, it's also possible to simply merge one branch at a time before committing. It might be easier to deal with conflicts one-at-a-time this way.

stephan 1 week, 3 days ago

I'm creating my own VCS ...

When you get it working, please post an announcement. It may come as no surprise that many of us here are interested in VCSs :).

To add only a small amount of detail to Mike's excellent response...

It does give a loud warning about the presence of conflicts, left to the user to resolve.

And it does so by adding distinctive "conflict markers" in the conflicting merged files.

Deleting the working directory works too, of course. :-)

That's colloquially known as "git-style revert" ;).

andybradford 1 week, 2 days ago

Also, for some conceptual reading on merging:

https://fossil-scm.org/home/doc/trunk/www/branching.wiki

anonymous 1 week, 2 days ago

This is incredibly helpful. Thank you. I will check out merge.c and do my best to figure out how this works.

drh 1 week, 2 days ago

I designed and wrote the merge algorithm used in Fossil. The way I think about merging is this: You have two different versions of the same file - call them A and B. You find the most recent common ancestor of A and B - call it X. To merge B into A, you compute the differences (a.k.a. a "patch") for all changes going from X to B, then you apply that patch to A.

If A and B change the same part of the file in different ways, then that is a conflict. Conflicts must be resolved by a human. If A and B both independently make the same exact change, then the merge algorithm can recognize that fact and avoid the conflict.
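
A minimal sketch of that description, assuming line-level granularity and using Python's difflib (the function names are mine, not Fossil's): compute the regions where each side rewrote the base, apply the non-overlapping ones, and flag overlapping, non-identical ones as conflicts.

```python
# Compute the base->derived patch, then apply the base->theirs patch to ours.
from difflib import SequenceMatcher

def diff_regions(base, derived):
    """List of (lo, hi, replacement): derived rewrites base[lo:hi]."""
    sm = SequenceMatcher(a=base, b=derived, autojunk=False)
    return [(lo, hi, derived[j1:j2])
            for tag, lo, hi, j1, j2 in sm.get_opcodes() if tag != "equal"]

def merge3(ours, theirs, base):
    """Line-level 3-way merge; returns (merged_lines, had_conflict)."""
    ha, hb = diff_regions(base, ours), diff_regions(base, theirs)
    out, i, conflict = [], 0, False
    while ha or hb:
        if ha and hb and ha[0][0] < hb[0][1] and hb[0][0] < ha[0][1]:
            a, b = ha.pop(0), hb.pop(0)   # both sides touched this base region
            lo, hi = min(a[0], b[0]), max(a[1], b[1])
            out.extend(base[i:lo])
            if a == b:
                out.extend(a[2])          # same exact change: not a conflict
            else:
                conflict = True
                out += ["<<<<<<< ours", *a[2], "=======", *b[2], ">>>>>>> theirs"]
            i = hi
        else:
            # apply whichever non-overlapping hunk starts first in the base
            lo, hi, repl = (hb if hb and (not ha or hb[0][0] < ha[0][0]) else ha).pop(0)
            out.extend(base[i:lo])
            out.extend(repl)
            i = hi
    out.extend(base[i:])
    return out, conflict
```

Note that when both sides make the same exact change, the hunks compare equal and no conflict is raised, matching the behaviour described above.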

Fossil merges at the line-level. In other words, each line of text is considered to be an indivisible unit. However, if there is a conflict, Fossil falls back to comparing the two files at the character level, and tries to do a character-level merge as the "Suggested Conflict Resolution". Character-level merging for conflict resolution works in roughly 50% of the cases, in my experience, so it is still left to the human operator to approve or disapprove.

Just because a merge is conflict-free, does not mean that the output is correct! It is quite possible that the changes in A and B are incompatible in some way, such that even though they are widely separated from each other and merge together cleanly, they do not compile, or they do compile but give the wrong answer. For this reason Fossil does not store the results of a merge back into the repository. The merge results are left in your local check-out so that they can be tested first, and then committed only after all tests pass. Git does not have such scruples.

If you want to merge together multiple variants of the file, you repeat the process for each file to be merged in.

anonymous 1 week, 2 days ago

I have attempted this once before, but I became dissatisfied with the project, and I craved the self-contained nature of Fossil's repositories as opposed to the on-filesystem layout I originally chose to be more like Git. I'm reattempting this from a pure key-value standpoint using ReDB and dlhn (i <3 rust).

One issue I've had with this project is that I do it solely for the sake of doing it. I don't preplan a philosophy and instead just do what kind of feels right, which leads to a somewhat jumbled mess that's infuriating to work with. I also don't really use what I make because there are just better options like Fossil (wink wink - I'm using it to track progress on my second iteration now). Again, even on my rewrite, I question whether a key-value approach even properly works.

anonymous 1 week, 2 days ago

I'm taking a look at it now.

stephan 1 week, 2 days ago

One issue I've had with this project is that I do it solely for the sake of doing it. I don't preplan a philosophy and instead just do what kind of feels right, which leads to a somewhat mess that's jumbled enough to be infuriating to work with...

FWIW, you're not alone in the "throw it all against the wall and see what sticks" approach.

You may also want to look at https://bramcohen.com/p/manyana, which only recently came about.

Again, even on my rewrite, I question if a key-value approach even properly works.

At its lowest level, fossil is even more basic than a KV store - it's a "bag of unordered blobs", but it uses the db features to sort through that bag and make sense of the data. See the Fossil docs for a high-level overview of that.

It may be worth noting that Fossil's architecture and merge code are both almost 19 years old now, and neither has undergone any fundamental changes since its initial implementation. The merge code was recently (18 months ago, IIRC) extended a bit to provide more feedback, but its initial code has held up perfectly fine in all that time. That is to say: if you're looking for a stable model as a basis for study, fossil provides that. Don't just re-make fossil, though - do your own thing (and then tell us about it). The world could certainly benefit from more experimentation in the field of SCM/VCS.

chungy 1 week, 2 days ago

Deleting the working directory works too, of course. :-)

That's colloquially known as "git-style revert" ;).

Well, maybe, but it was more to highlight that if you don't commit, the changes and state only exist in the working directory and the checkout database. The repository doesn't automatically get, nor know, anything about the merge before you commit it. You usually don't need to delete the whole directory to undo whatever state you're in, but it's sometimes the fastest way to just forget things.

Git defaults to creating a commit immediately at the time you do a merge (unless you specify --no-commit). It's core to the design philosophies of the two systems that they differ in this detail. Git doesn't have an autosync mode and encourages history rewriting, with plenty of commands to assist in that operation. So it'll commit a merge first, you can test it, and if you don't like it, you can get rid of that commit and start over. Rinse and repeat until you're happy enough to push it to a remote.

Fossil is very keen on never losing history, as well as having autosync. These two facts alone would make it very dangerous for fossil merge to automatically perform the commit step, and you are encouraged to do local testing before making it permanent.


On another note: even though both systems have the same subcommand, git revert is actually equivalent to fossil merge --backout. Fossil instead uses the command in the way that CVS and Subversion did: undoing local changes.

kostix 1 week, 1 day ago

If you're curious enough, you might also consider Pijul and the theory of patches it's based on.

That may be of interest because most DVCSes (including Fossil and Git) are somewhat "lame" when it comes to merging, because merging is implemented on a purely "compare these pieces of opaque text" basis (albeit with occasional tweaks and improvements, like using a three-way diff algorithm).

(Note that I do not advocate that approach as being "better" — just interesting to study.)

MelvaigT 1 week ago

Interesting, to me as well as the OP (I hope).

I have been reading the whole thread pondering the question - is it a mandatory / fundamental part of a VCS to provide a merging algorithm at all? Or is it an option to let the user of your VCS choose their own, and just provide the means to record that a 'merge' has occurred?

In short, there are two questions implied in the OP, not one:

  • how do you represent the idea that my new checkin is a successor/replacement/new version of all of A, B, C and D, not just one of them. i.e. it is a child of more than one other checkin.

  • is there any help to be provided in constructing the files which are included in the checkin, or comparing the differences between this version and its ancestors (as a group rather than one at a time).

As far as I can see there is no option in Fossil to do the database bit of 'merge' (i.e. add the current checkout as a future parent of the merged version) without the file editing part. Did I miss something?

A fictitious use case where this would be helpful: I have versions 1.0, 2.0, 2.1. There is a bug or missing feature in 1.0 which I have dealt with by creating a branch 1.1. When I come to make a matching change to create 2.2 I want to do the work in a different way but indicate in the history that the different changes are related.

I seem to have two choices with most existing VCSes:

  • ignore the provided 'merge' function and concoct some other way of recording the relation between 2.2 and 1.1 (visible to readers of either version, especially 1.1 which might otherwise look like an open branch which can be used as a base for a new version).

  • use the provided 'merge' but reinstate the files afterwards (which runs the risk of mistakenly leaving in part of the 1.1 version of the change).

Note: I left the D (distributed) out of DVCS because even as a lone developer I can still create branches and conflicts all by myself :-)

andybradford 1 week ago

When I come to make a matching change to create 2.2 I want to do the work in a different way but indicate in the history that the different changes are related.

One way to accomplish this is to use "fossil merge --cherrypick", make the alterations that you want, and then commit it.

Another way would be to make similar changes in 2.2 and then, in the commit comment, reference the hash of the 1.1 checkin. E.g.

fossil commit -m 'Changes inspired by [b4a769c9d2735ae1]'

Or even do both.

drh 1 week ago

It would be relatively simple in Fossil to provide a setting and/or command-line option to specify an external program to do a 3-way merge, and to have Fossil use that external program for all merge actions. Fossil would generate three files: the A and B versions and the X common ancestor. The external program would read those files and generate a fourth file M, which is the merge of X→B into A. Fossil would then use the fourth file.

But I don't see that as being particularly useful, as the built-in Fossil merge algorithm works great. If you find a case where something else works better, point it out and the Fossil algorithm will probably get enhanced.
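
The file-based protocol described above can be sketched as follows, with GNU diff3 standing in for the hypothetical external program (diff3 here is my assumption for illustration; Fossil has no such setting today, and the function name is made up):

```python
# Write the three versions (A = ours, X = ancestor, B = theirs) to files,
# run an external 3-way merge program, and read back the merged result M.
import os
import subprocess
import tempfile

def external_merge(ours, base, theirs, tool=("diff3", "-m")):
    """Merge base->theirs into ours via an external tool; returns (text, ok)."""
    with tempfile.TemporaryDirectory() as d:
        paths = []
        for name, text in (("A", ours), ("X", base), ("B", theirs)):
            p = os.path.join(d, name)
            with open(p, "w") as f:
                f.write(text)
            paths.append(p)
        a, x, b = paths
        # diff3 -m expects MINE OLDER YOURS; exit status 1 signals conflicts
        proc = subprocess.run([*tool, a, x, b], capture_output=True, text=True)
        return proc.stdout, proc.returncode == 0
```

The interesting property is that the VCS only needs to agree on this tiny file-in/file-out contract; the merge algorithm itself becomes pluggable.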

kostix 1 week ago

is it a mandatory / fundamental part of a VCS to provide a merging algorithm at all?

JFTR, Git has that due to its approach of having a set of "high-level" — user-facing — commands, and a set of low-level, "plumbing", commands. So it has commands which basically implement stuff like "hash the contents of this file, add them to the object DB and print the hash", or "place a record in the would-be next commit under such-and-such pathname referring to such-and-such hash", so one can script committing stuff which came about in any way, including via custom merging algorithms.

Of course, Git provides its own implementation of merging (actually, multiple, with subvariants).

I think the latter is just plain convenience: while one may benefit from a VCS supporting a certain degree of flexibility, most real-world use cases of a VCS still involve dealing with small- to mid-sized textual files, and so having the ability to do merging of this sort of data out of the box is taken for granted ;-)

anonymous 5 days, 19 hours ago

I think the idea of using CRDTs to make merges, rebases and other operations much easier and more user-friendly is exactly what I'm looking for. I'll see if I can find a Rust equivalent of their Python example, or just write my own. Either way, I think I'll stop with trying to rush this project and do more experimenting and theory.

Thank you for this resource.

anonymous 5 days, 14 hours ago

I do see a bright future in a patch-based VCS like Pijul, and I plan on redirecting my project to go in that direction. It does mean I'd have to study the same mathematical theory of patches, and then expand on that knowledge by doing research (effort, yuck). I imagine it'll be worth it though - something to do in the meantime.

I've seen this thing called weave-crdt that works as a better Git merger using tree-sitter. Would something like that be a good idea for merging?

Also, what would be the drawbacks of a patch-based system? Why would it not be "better" than a snapshot-based one if it means (almost) no-conflict merges and lighter network loads (I imagine) for updating remotes?

danield 5 days, 2 hours ago

Too bad such deep insights must remain anonymous.

drh 5 days, 1 hour ago

I don't think a "patch-based" VCS is fundamentally a different algorithm than a "snapshot-based" system. I do not believe you get fewer merge conflicts with one over the other.

The key difference between patch-based and snapshot-based is what part of the code is stored. A patch-based system stores the differences between successive check-ins, requiring you to compute the snapshot when needed. A snapshot-based system stores snapshots, requiring you to compute the differences (a.k.a. patches) when needed.
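
That duality is easy to see in code. A hedged sketch using Python's difflib (the helper names are mine): either you store snapshots and derive patches on demand, or you store patches and replay them to derive snapshots. The same information is recoverable either way; only the stored representation differs.

```python
# A delta is just the opcode list between two snapshots; replaying it
# against the base reconstructs the newer snapshot.
from difflib import SequenceMatcher

def make_patch(old, new):
    """Delta as (tag, old_lo, old_hi, new_lines) tuples."""
    sm = SequenceMatcher(a=old, b=new, autojunk=False)
    return [(tag, i1, i2, new[j1:j2]) for tag, i1, i2, j1, j2 in sm.get_opcodes()]

def apply_patch(old, patch):
    """Replay a delta against its base to recover the new snapshot."""
    out = []
    for tag, i1, i2, new_lines in patch:
        out.extend(old[i1:i2] if tag == "equal" else new_lines)
    return out
```

A patch-based store would keep only the `make_patch` outputs and call `apply_patch` repeatedly to materialize a check-in; a snapshot-based store keeps the file lists and calls `make_patch` only when someone asks for a diff.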

Which is more important? The actual code itself, or the changes in the code from the previous check-in? If you are looking at a project XYZ, do you want to see a snapshot of the latest check-in, or do you want a list of patches that when applied in a particular order will recreate the latest check-in? If you want to build XYZ yourself, what do you pass to the compiler? A bunch of patches that the compiler has to assemble to recreate the code, or the actual code?

I intuit from anonymous's writings here that he thinks that a patch-based VCS is some kind of silver bullet that magically makes version control simple. It is not. It organizes the information differently, and makes some kinds of information easier to extract, at the expense of making other kinds of information more difficult to extract. The problem does not get any simpler. It just moves the difficulty around. That might be either a win or a loss, depending on what you are trying to do.

Most (but not all) people who have looked deeply into the matter, and who have actually written working code to do this kind of thing, have determined that snapshot-based VCSes tend to do a better job of managing complexity. But you should do whatever you think is best, of course.

anonymous 5 days ago

I think I'll make an account. This community seems like something very useful to me.

For reference, I'm the same guy from this post about the branches.

anonymous 4 days, 23 hours ago

Which is more important? The actual code itself, or the changes in the code from the previous check-in?

Well, my instinct is that in a collaborative setting, especially after pulling, seeing the changes made would be better; but then you only have the changes, and to make those meaningful you'd have to reconstruct the version from before to see the difference (unless there's another way to do that representation).

If you want to build XYZ yourself, what do you pass to the compiler? A bunch of patches that the compiler has to assemble to recreate the code, or the actual code?

Isn't that just a non-point? Compilers would take in source code, which they'd get from your current check-in. I don't see how you would compile code that's stored in a repository in some SCM-specific format.

I intuit from anonymous's writings here that he thinks that a patch-based VCS is some kind of silver bullet that magically makes version control simple. It is not.

Yes, that's what I originally thought. According to your insightful response, my interpretation was wrong.

The problem does not get any simpler. It just moves the difficulty around.

What does that look like?

That might be either a win or a loss, depending on what you are trying to do.

Hmm, interesting. What cases would this be a win and what cases would this be a loss?


Just a heads up, apologies if I'm asking seemingly trivial questions. I only want to learn and you guys seem like the right place for these questions :)

MelvaigT 4 days, 18 hours ago

There seem to be two unstated assumptions built into the approach described which bear thinking about:

  • that most of the files to be controlled are complete examples of specific programming languages. For a counter example see Literate Programming.

  • that reducing the number of identified conflicts must always be a good thing. Designing a good merging algorithm is a bit like designing a good test for Covid - it is impossible to achieve both zero false positives and zero false negatives. To try to reduce one of them to zero just leads to the trivial and useless solution of always returning positive, or always returning negative. You need to identify, and then aim for, the right balance.

To illustrate, consider the trivial, probably incorrect, program:

int A, B;
A=1; A=1;

User #1 sees the "obvious" mistake, that the intention is to initialise both variables, and so they change the second A=1 to B=1.

User #2 also sees the mistake, but for reasons best known to themselves changes the first A=1 to B=1.

We end up with two competing solutions for the same problem. I know this is not a common issue, but bear with me - I have my reasons beyond the example.

So, what do the available merge algorithms do?

Line oriented schemes will see the changes as conflicting. Tree-oriented ones will resolve the situation by accepting both changes: so the second line becomes

B=1; B=1;

This is (seems?) as obviously wrong as the original.

Before anyone thinks that I am making a case for tree-oriented / crdt approaches being worthless: repeat the experiment with the (semantically) identical initial condition:

int A, B;
A=1;
A=1;

Now the line-oriented approach will make the same mistake (usually, depending on whether it takes a precautionary approach when two changes are in close proximity).

What are we missing? Here is the twist:

User #0 now comes back from holiday and says sorry guys, I started off thinking I needed two variables, but then realised one would do and was trying to eliminate B.

My point is that merge tools fail to produce the expected result when they are missing information of some kind. So as a review of the various approaches that have been mentioned so far I would make these points:

  • line-oriented diffs are usually better than character-oriented ones because coders tend not to put unrelated information on the same line, and conversely they will often draw attention to a relationship by not breaking a line. This technique uses line breaks as a proxy for structure. This is how I constructed the above example - by knowing the difference between the structure and the information available to the algorithms.

  • tree-oriented diffs derive the structure by parsing. The information being added is about the language being used - in effect the syntax is being used as a proxy for structure rather than line breaks. That should work fairly well in practice, except that in most languages line breaks are not significant, yet the human programmer still tends to use them to express structure not captured by the language. The very fact that parsing is language-specific makes this a poor choice for me, because my files are not (fragments of) program code. Good luck with Scala BTW, because the formal syntax doesn't match the programmer's intuitive idea of the structure very well.

  • crdt based diffs, as far as I understand with only a cursory read, try to capture more information from the actual editing of code. So in the above example it would not just see 'replace a particular A with B' (twice for different A's) but rather 'replace the second/first A by B'. Basically the programmer is somehow indicating that they are editing both assignment statements as a unit, not just one character. The fact the other A remains unchanged is then known to be relevant, and we can avoid the erroneous merge of the two changes. If I am right, this constrains the database to record deltas, because you can no longer reconstruct them from the two versions of the file. If on the other hand the scheme is trying to derive crdts to express an edit (patch) then, like Richard in a parallel reply, I don't see how it helps. You don't find extra information this way, you just express it differently. Editing itself is not conflict-free; you have to do more than just invoke the term. To emphasise: the additional information we are looking for is that the two assignments are related - the tree-oriented algorithm doesn't have that, it sees two separate syntax nodes (regardless of an intervening line).

  • no scheme can ever be perfect, even if it has humans in it. See above. As a user I would choose a system which produces more conflicts rather than more instances of invalid merges (which I only find out about after a future release). In the same way that false positives in a COVID test are far less damaging than false negatives. I only discover the latter when my friends start to keel over.

Rant: I notice in one of the documents on the tree-based approach that the increased conflicts are being generated by AI coders. Why not sack the coders? Don't you want the work yourself??

Trevor

MelvaigT 4 days, 16 hours ago

Thanks for the offer, and for taking the time to consider the problem.

And also apologies for a poorly constructed / worded example. Bad habit of mine.

In terms of yours and others time, bear in mind this is intended to help the OP think about the concepts involved, and out of my own curiosity. Fossil is fine as it is for my actual usage.

What I was trying to get at, and which might be better expressed in my other post of today, was an example of where any merging algorithm fails not because it is not good enough, but because it does not have access to the required information to do the right thing.

I have come across this option to fossil merge:

--force
    Force the merge even if it would be a no-op

I take this to mean "add the arrow to the timeline etc" as if I had found some files to merge, even if there are none. But I suspect it would still try to do the merge anyway rather than take my word for it?

So perhaps

fossil merge --binary '*' --force

the idea being that the --binary * ensures that the merge is indeed a no-op.

If I had a real need to do this I would also look at what merge-info would tell me - ideally I would still like to see which files have changed relative to the baseline, even if it wasn't Fossil itself that calculated the changes.

If that worked, then this would also fit the bill where somebody does want to run an alternative (not necessarily better in general) algorithm. It would even fit a scheme where the other algorithm considers more than one file at a time. You run your chosen tool, then do perhaps merge --manual which would be equivalent to what I suggested above.

If this doesn't convince at least the OP that the algorithm is best kept separate from the information held, I'll give up :-)

Trevor
