Promit's Ventspace

March 9, 2011

Understanding Subversion’s Problems

Filed under: Software Engineering — Promit @ 6:13 pm

I’ve used Subversion for a long time, and CVS before that. Recently there’s been a lot of momentum behind moving away from Subversion to a distributed system, like git or Mercurial. I myself wrote a series of posts on the subject, but I skipped over the reasons WHY you might want to switch away from Subversion. This post is motivated in part by Richard Fine’s post, but it’s a response to a general trend and not his entry specifically.

SVN is a longtime stalwart as version control systems go, created to patch up the idiocies of CVS. It’s a mature, well understood system that has been and continues to be used in a vast variety of production projects, open and closed source, across widely divergent team sizes and workflows. Never mind the hyperbole: SVN is good by practically any real world measure. And like any real world production system, it has flaws in nearly every respect. A perfect product is a product no one uses, after all. It’s important to understand what the flaws are, and in particular I want to discuss them without advocating for any alternative. I don’t want to compare SVN to git or explain why git fixes the problems, because that distorts the actual problems and also implies that distributed version control is the solution. It can be a solution, but the problems reach a bit beyond that.

Committing vs publishing
Fundamentally, a commit creates a revision, and a revision is something we want as part of the permanent record of a file. However, a lot of those revisions are not really meant for public consumption. When I’m working on something complex, there are a lot of points where I want to freeze frame without actually telling the world about my work. Subversion understands this perfectly well, and the mechanism for doing so is branches. The caveat is that this always requires server round-trips. That’s fine as long as you’re in the office with a fast, highly available server, but it fails the moment you’re traveling or your connection to the server drops for whatever reason. Subversion cannot queue up revisions locally. It has exactly two reference points: the revision you started with and the working copy.

In general though, we are working in high availability environments and making a round trip to the server is not a big deal. Private branches are supposed to be the solution to this problem of work-in-progress revisions. Do everything you need, with as many revisions as you want, and then merge to trunk. Simple as that! If only merges actually worked.

SVN merges are broken
Yes, they’re broken. Everybody knows merges are broken in Subversion and that they work great in distributed systems. What tends to happen is people gloss over why they’re broken. There are essentially two problems in merges: the actual merge process, and the metadata about the merge. Neither works in SVN. The fatal mistake in the merge process is one I didn’t fully understand until reading HgInit (several times). Subversion’s world revolves around revisions, which are snapshots of the whole project. Merges basically take diffs from the common root and smash the results together. But the merged files didn’t magically drop from the sky — we made a whole series of changes to get them where they are. There’s a lot of contextual information in those changes which SVN has completely and utterly forgotten. Not only that, but the new revision it spits out necessarily has to jam a potentially complicated history into a property field, and naturally it doesn’t work.

For added impact, this context problem shows up without branches if two people happen to make more than trivial unrelated changes to the same trunk file. So not only does the branch approach not work, you get hit by the same bug even if you eschew it entirely! And invariably the reason this shows up is that you don’t want to make small changes to trunk. Damned if you do, damned if you don’t.
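To make the "diff from the common root and smash the results together" idea concrete, here is a rough sketch in Python. It is a toy model and not Subversion's actual merge algorithm: a snapshot is assumed to be a plain dict of path to content, and `three_way_merge` is a name made up for this illustration. The point is only to show what gets lost when a merge sees three snapshots and nothing about the steps in between.

```python
def three_way_merge(base, ours, theirs):
    """Merge two snapshots against their common root, one path at a time.

    Each snapshot is a dict mapping path -> content (an illustrative model).
    Returns (merged_snapshot, list_of_conflicting_paths).
    """
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:        # both sides ended up the same (or both deleted it)
            result = o
        elif o == b:      # only their side touched this path
            result = t
        elif t == b:      # only our side touched this path
            result = o
        else:             # both sides changed it in different ways
            conflicts.append(path)
            result = o    # keep our side as a placeholder for manual work
        if result is not None:
            merged[path] = result
    return merged, conflicts


# Our side renamed util.c to helpers.c; their side edited util.c.
base = {"util.c": "v1"}
ours = {"helpers.c": "v1"}
theirs = {"util.c": "v2"}
merged, conflicts = three_way_merge(base, ours, theirs)
print(merged, conflicts)
```

In this toy, the snapshots alone cannot say that helpers.c used to be util.c, so the edit shows up as a conflict on a path one side deleted, and the renamed file keeps the old content. A tool that had the rename as a recorded change could, in principle, carry the edit over to the new name.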

Newer version control systems are typically designed around changes rather than revisions. (Critically, this has nothing at all to do with decentralization.) By defining a particular ‘version’ of a file as a directed graph of changes resulting in a particular result, there’s a ton of context about where things came from and how they got there. Unfortunately the complex history tends to make assignment of revision numbers complicated (and in fact impossible in distributed systems), so you are no longer able to point people to r3359 for their bug fix. Instead it’s a graph node, probably assigned some arcane unique identifier like a GUID or hash.
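As a rough illustration of why the identifier ends up being a hash, here is a minimal Python sketch of a history stored as a graph of changes. The `History` class and `commit_id` function are hypothetical names for this example, and the ID format is an assumption for illustration, not the real object format of git or Mercurial. Each ID is computed locally from the change and its parents, so two people working offline never need a shared counter.

```python
import hashlib


def commit_id(parents, change):
    """Derive an ID from the change text and the IDs of its parents."""
    payload = "\n".join(sorted(parents)) + "\n" + change
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


class History:
    """A toy directed graph of changes, keyed by content-derived IDs."""

    def __init__(self):
        self.parents = {}  # commit id -> tuple of parent ids

    def commit(self, change, *parents):
        cid = commit_id(parents, change)
        self.parents[cid] = tuple(parents)
        return cid

    def ancestors(self, cid):
        """Return cid plus every commit reachable through parent links."""
        seen, stack = set(), [cid]
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(self.parents[node])
        return seen

    def common_ancestors(self, a, b):
        return self.ancestors(a) & self.ancestors(b)


h = History()
root = h.commit("initial import")
alice = h.commit("fix parser bug", root)   # made on Alice's machine
bob = h.commit("add logging", root)        # made on Bob's machine
merge = h.commit("merge alice and bob", alice, bob)
print(merge[:12], sorted(c[:8] for c in h.common_ancestors(alice, bob)))
```

Both Alice's and Bob's commits are "the second revision" from their own point of view, which is why a single r3359-style number stops making sense. The graph does keep what a merge needs, though: the common ancestor and the path each side took from it.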

File system headaches
.svn. This stupid little folder is the cause of so many headaches. Essentially it contains all of the metadata from the repository about whatever you synced, including the undamaged versions of files. But if you forget to copy it (because it’s hidden), Subversion suddenly forgets all about what you were doing. You just lost its tracking information, after all. Now you get to do a clean update and a hand merge. Overwrite it by accident, and now Subversion will get confused. And here’s the one that gets me every time with externals like boost: copy folders from a different repository, and all of a sudden Subversion sees folders from something else entirely and will refuse to touch them at all until you go through and nuke the folders by hand. Nope, you were supposed to svn export it, never mind that the offending files are marked hidden.

And of course because there’s no understanding of the native file system, move/copy/delete operations are all deeply confusing to Subversion unless it’s the one who handles those changes. If you’re working with an IDE that isn’t integrated into source control, you have another headache coming because IDEs are usually built for rearranging files. (In fact I think this is probably the only good reason to install IDE integration.)

It’s not clear to me if there’s any productive way to handle this particular issue, especially cross platform. I can imagine a particular set of rules — copying or moving files within a working copy does the same to the version control, moving them out is equivalent to delete. (What happens if they come back?) This tends to suggest integration at the filesystem layer, and our best bet for that is probably a FUSE implementation for the client. FUSE isn’t available on Windows, though apparently a similar tool called Dokan is. Its maturity level is unclear.

Changelists are missing
Okay, this one is straight out of Perforce. There’s a client side and a server side to this, and I actually have the client side via my favorite client SmartSVN. The idea on the client is that you group changed files together into changelists, and send them off all at once. It’s basically a queued commit you can use to stage. Perforce adds a server side, where pending changelists actually exist on the server, you can see what people are working on (and a description of what they’re doing!), and so forth. Subversion has no idea of anything except when files are different from their copies living in the .svn shadow directory, and that’s only on the client. If you have a couple different live streams of work, separating them out is a bit of a hassle. Branches are no solution at all, since it isn’t always clear upfront what goes in which branch. Changelists are much more flexible.

Locking is useless
The point of a lock in version control systems is to signal that it’s not safe to change a file. The most common use is for binary files that can’t be merged, but there are other useful situations too. Here’s the catch: Subversion checks locks when you attempt to commit. That’s how it has to work. In other words, by the time you find out there’s a lock on a file, you’ve already gone and started working on it, unless you obsessively check repository status for files. There’s also no way to know if you’re putting a lock on a file somebody has pending edits to.
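The timing problem can be shown with a toy model in Python. Everything here (`Server`, `LockError`, the two workflow functions, the three hours of work) is hypothetical and only mimics the shape of the problem; it is not Subversion's protocol or API. One workflow finds out about the lock at commit time, the other asks the server before touching the file.

```python
class LockError(Exception):
    pass


class Server:
    """A toy lock table: path -> user holding the lock."""

    def __init__(self):
        self.locks = {}

    def lock(self, path, user):
        owner = self.locks.setdefault(path, user)
        if owner != user:
            raise LockError(f"{path} is locked by {owner}")

    def commit(self, path, user):
        owner = self.locks.get(path)
        if owner is not None and owner != user:
            raise LockError(f"{path} is locked by {owner}")


def edit_then_commit(server, path, user, hours=3):
    """Lock checked at commit: the work is already done when we find out."""
    try:
        server.commit(path, user)
        return 0
    except LockError:
        return hours  # hours of work that now cannot be committed


def lock_then_edit(server, path, user, hours=3):
    """Lock checked before editing: a refusal costs nothing."""
    try:
        server.lock(path, user)
    except LockError:
        return 0      # we never started, so nothing is wasted
    server.commit(path, user)
    return 0


server = Server()
server.lock("logo.psd", "alice")
print(edit_then_commit(server, "logo.psd", "bob"))  # hours wasted
print(lock_then_edit(server, "logo.psd", "bob"))    # hours wasted
```

The second workflow only helps if something makes people ask first, which is the part a commit-time check leaves to discipline.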

The long and short of it is if you’re going to use a server, really use it. Perforce does. There’s no need to have the drawbacks of both centralized and distributed systems at once.

I think that’s everything that bothers me about Subversion. What about you?

8 Comments »

  1. “Everybody knows merges are broken in Subversion and that they work great in distributed systems.”

    I keep hearing that a lot when people praise git over SVN, but I’ve never heard a single example or a satisfying explanation why this should be the case. What exactly is broken? Which problems would I run into using SVN? I’ve used SVN since 2004 and I have no idea what people are talking about. Please enlighten me.

    Comment by bla@blo.com — March 11, 2011 @ 7:02 am | Reply

    • Before you ask: Yes, we’ve used release and feature branches all the time. Very easy and lightweight to create, almost as easy to merge if you don’t let them drift too far apart. Even if you do, the only “problem” arising is that you will have to resolve more conflicts.

      Comment by bla@blo.com — March 11, 2011 @ 7:08 am | Reply

      • You don’t consider “having to resolve more conflicts” to be a problem?

        You’re a strange person.

        And I’m kind of confused by your first question because Promit just *gave* you “a satisfying example why this should be the case”.

        SVN doesn’t track the necessary metadata to smoothly merge branches. It takes a diff between two snapshots, and hopes and prays they can be trivially merged together.

        A sane tool will look at the sequence of changes that have been made to each branch, and use them to understand the context of each change, and so merge them together in a meaningful way.

        Comment by jalf — March 12, 2011 @ 4:53 am | Reply

        • > You don’t consider “having to resolve more conflicts” to be a problem?
          >
          > You’re a strange person.

          Ha! You should see me at full moon! ;)

          What I meant to say is that a conflict is a conflict and there’s only so much an automatic merge tool can do, right?

          In my experience, conflicts are the exception if you follow a reasonable merging policy (which may be nothing more than “merge once a week”), even with supposedly clumsy SVN merging. That’s why I didn’t consider it a problem to invest a few minutes to resolve conflicts every couple of weeks.

          What I understand from yours and Kevin’s comments is that git produces FEWER conflicts by applying diffs incrementally. Alright, sounds reasonable. But it also conflicts with git’s supposedly lightning fast merging. How does that go together?

          Also, I can’t see how SVN has forgotten all the in-between changes. It knows all about them. SVN chooses not to use it, though. I guess it’s a performance vs. “merge is good enough” trade-off.

          Don’t get me wrong. Actually, I’m not here to defend SVN. I’m just looking for a real reason to switch to git.

          Comment by bla@blo.com — March 14, 2011 @ 7:13 am

    • I think the problem is the conflicts. The system should remember the sequences of changes that happened to a file, then simply do the same changes to the other file. This will cut down on conflicts and manual merge situations when the two branches are “too far” apart. You can’t eliminate conflicts when two people make conflicting edits, but that should be the ONLY reason to do a manual merge.

      Comment by Kevin P — March 11, 2011 @ 10:49 pm | Reply

  2. > Unfortunately the complex history tends to make assignment of revision numbers complicated (and in fact impossible in distributed systems), so you are no longer able to point people to r3359 for their bug fix. Instead it’s a graph node, probably assigned some arcane unique identifier like a GUID or hash.

    I think that’s one thing Bazaar does right. For many common cases, it is still possible to get away with a simple revision number, even in a decentralized system. Bazaar obviously has to keep a unique id for every commit, like Git or Hg does, but it usually just displays a “friendly” ID which is just a single integer that is incremented with each commit. For many common use cases, you don’t need to even look at the “real” unique ID, and can just merge/branch/revert/whatever using these simpler friendly ID’s.

    Comment by jalf — March 13, 2011 @ 6:55 am | Reply

  3. I think you’re a bit unfair to SVN’s “lock” feature. It works like this:

    - Binary files are tagged with the svn:needs-lock property
    - When someone downloads a working copy from the server, all files marked like that will have their read-only attribute set
    - In case you want to edit the file, you don’t remove the read-only attribute, but “lock” the file. At which point you either get the lock, removing the read-only attribute or Subversion tells you that the file is locked by someone else.

    Comment by Markus Ewald — May 9, 2011 @ 1:06 am | Reply

      • That sounds a little bit like how Perforce handles all files by default, but is read-only checkout a property you have to set per file? I’ve never seen it do that by default.

      Comment by Promit — May 9, 2011 @ 9:44 am | Reply

