Promit's Ventspace

March 9, 2011

Understanding Subversion’s Problems

Filed under: Software Engineering — Promit @ 6:13 pm

I’ve used Subversion for a long time, even CVS before that. Recently there’s a lot of momentum behind moving away from Subversion to a distributed system, like git or Mercurial. I myself wrote a series of posts on the subject, but I skipped over the reasons WHY you might want to switch away from Subversion. This post is motivated in part by Richard Fine’s post, but it’s a response to a general trend and not his entry specifically.

SVN is a longtime stalwart as version control systems go, created to patch up the idiocies of CVS. It’s a mature, well understood system that has been and continues to be used in a vast variety of production projects, open and closed source, across widely divergent team sizes and workflows. Never mind the hyperbole, SVN is good by practically any real world measure. And like any real world production system, it has flaws in nearly every respect. A perfect product is a product no one uses, after all. It’s important to understand what the flaws are, and in particular I want to discuss them without advocating for any alternative. I don’t want to compare to git or explain why it fixes the problems, because that distorts the view of the actual problems and also implies that distributed version control is the solution. It can be a solution, but the problems reach a bit beyond that.

Committing vs publishing
Fundamentally, a commit creates a revision, and a revision is something we want as part of the permanent record of a file. However, a lot of those revisions are not really meant for public consumption. When I’m working on something complex, there are a lot of points where I want to freeze frame without actually telling the world about my work. Subversion understands this perfectly well, and the mechanism for doing so is branches. The caveat is that this always requires server round-trips. That is fine as long as you’re in the office with a fast, highly available server, but it fails the moment you’re traveling or your connection to the server drops for whatever reason. Subversion cannot queue up revisions locally. It has exactly two reference points: the revision you started with and the working copy.

In general though, we are working in high availability environments and making a round trip to the server is not a big deal. Private branches are supposed to be the solution to this problem of work-in-progress revisions. Do everything you need, with as many revisions as you want, and then merge to trunk. Simple as that! If only merges actually worked.

SVN merges are broken
Yes, they’re broken. Everybody knows merges are broken in Subversion and that they work great in distributed systems. What tends to happen is people gloss over why they’re broken. There are essentially two problems in merges: the actual merge process, and the metadata about the merge. Neither works in SVN. The fatal mistake in the merge process is one I didn’t fully understand until reading HgInit (several times). Subversion’s world revolves around revisions, which are snapshots of the whole project. Merges basically take diffs from the common root and smash the results together. But the merged files didn’t magically drop from the sky — we made a whole series of changes to get them where they are. There’s a lot of contextual information in those changes which SVN has completely and utterly forgotten. Not only that, but the new revision it spits out necessarily has to jam a potentially complicated history into a property field, and naturally it doesn’t work.

For added impact, this context problem shows up without branches if two people happen to make more than trivial unrelated changes to the same trunk file. So not only does the branch approach not work, you get hit by the same bug even if you eschew it entirely! And invariably the reason this shows up is because you don’t want to make small changes to trunk. Damned if you do, damned if you don’t.

Newer version control systems are typically designed around changes rather than revisions. (Critically, this has nothing at all to do with decentralization.) By defining a particular ‘version’ of a file as a directed graph of changes resulting in a particular result, there’s a ton of context about where things came from and how they got there. Unfortunately the complex history tends to make assignment of revision numbers complicated (and in fact impossible in distributed systems), so you are no longer able to point people to r3359 for their bug fix. Instead it’s a graph node, probably assigned some arcane unique identifier like a GUID or hash.
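To make the "graph of changes" idea concrete, here is a minimal sketch in Python. It is not how git, Mercurial, or any real tool stores history; the ChangeGraph class, its method names, and the 12-character truncated SHA-1 identifiers are all assumptions made up for illustration. It only shows two things from the paragraph above: identifiers are derived from the change and its parents rather than handed out as sequential numbers, and the graph remembers where two lines of work diverged.

```python
import hashlib


class ChangeGraph:
    """Toy history: each node is a change identified by a hash (hypothetical design)."""

    def __init__(self):
        self.parents = {}  # node id -> tuple of parent ids

    def add_change(self, description, parents=()):
        # The identifier depends on the change and its ancestry, so no
        # central server is needed to hand out revision numbers.
        payload = description + "|" + ",".join(parents)
        node_id = hashlib.sha1(payload.encode("utf-8")).hexdigest()[:12]
        self.parents[node_id] = tuple(parents)
        return node_id

    def ancestors(self, node_id):
        """Return node_id plus everything reachable through parent links."""
        seen = set()
        stack = [node_id]
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.parents[current])
        return seen

    def common_ancestors(self, a, b):
        """Changes shared by both histories: the context a merge can draw on."""
        return self.ancestors(a) & self.ancestors(b)


graph = ChangeGraph()
root = graph.add_change("initial import")
fix = graph.add_change("fix crash in loader", [root])
feature = graph.add_change("add exporter", [root])
merged = graph.add_change("merge exporter into fix line", [fix, feature])
```

The merge node has two parents, so the full path each side took is still there to be walked, and nothing has to be squeezed into a property field.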

File system headaches
.svn. This stupid little folder is the cause of so many headaches. Essentially it contains all of the metadata from the repository about whatever you synced, including the undamaged versions of files. But if you forget to copy it (because it’s hidden), Subversion suddenly forgets all about what you were doing. You just lost its tracking information, after all. Now you get to do a clean update and a hand merge. Overwrite it by accident, and now Subversion will get confused. And here’s the one that gets me every time with externals like boost: copy folders from a different repository, and all of a sudden Subversion sees folders from something else entirely and will refuse to touch them at all until you go through and nuke the .svn folders by hand. Nope, you were supposed to svn export it, never mind that the offending files are marked hidden.

And of course because there’s no understanding of the native file system, move/copy/delete operations are all deeply confusing to Subversion unless it’s the one who handles those changes. If you’re working with an IDE that isn’t integrated into source control, you have another headache coming because IDEs are usually built for rearranging files. (In fact I think this is probably the only good reason to install IDE integration.)

It’s not clear to me if there’s any productive way to handle this particular issue, especially cross platform. I can imagine a particular set of rules — copying or moving files within a working copy does the same to the version control, moving them out is equivalent to delete. (What happens if they come back?) This tends to suggest integration at the filesystem layer, and our best bet for that is probably a FUSE implementation for the client. FUSE isn’t available on Windows, though apparently a similar tool called Dokan is. Its maturity level is unclear.
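The rule set I’m imagining can be written down as a small sketch. Everything here is hypothetical: the event kinds, the action names like "vcs-move", and the translate_event function are invented for illustration and do not correspond to any Subversion or FUSE API. The case of a file coming back into the working copy is treated as a plain add, which is only one possible answer to the open question above.

```python
from pathlib import PurePosixPath


def inside(path, working_copy):
    """True if path is the working copy root or lives beneath it."""
    path = PurePosixPath(path)
    root = PurePosixPath(working_copy)
    return path == root or root in path.parents


def translate_event(kind, source, destination, working_copy):
    """Map a filesystem event to a hypothetical version control action."""
    src_in = inside(source, working_copy)
    dst_in = destination is not None and inside(destination, working_copy)
    if kind == "copy" and src_in and dst_in:
        return ("vcs-copy", source, destination)
    if kind == "move" and src_in and dst_in:
        return ("vcs-move", source, destination)
    if kind == "move" and src_in and not dst_in:
        # Moving a file out of the working copy is equivalent to a delete.
        return ("vcs-delete", source, None)
    if kind == "delete" and src_in:
        return ("vcs-delete", source, None)
    if kind in ("copy", "move") and dst_in:
        # A file arriving from outside, including one that "comes back".
        return ("vcs-add", destination, None)
    return ("ignore", source, destination)


wc = "/work/project"
moved = translate_event("move", "/work/project/a.c", "/work/project/src/a.c", wc)
removed = translate_event("move", "/work/project/a.c", "/tmp/a.c", wc)
returned = translate_event("move", "/tmp/a.c", "/work/project/a.c", wc)
```

Writing the rules is the easy part. The hard part is getting the events in the first place, which is why this points toward integration at the filesystem layer.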

Changelists are missing
Okay, this one is straight out of Perforce. There’s a client side and a server side to this, and I actually have the client side via my favorite client SmartSVN. The idea on the client is that you group changed files together into changelists, and send them off all at once. It’s basically a queued commit you can use to stage. Perforce adds a server side, where pending changelists actually exist on the server, you can see what people are working on (and a description of what they’re doing!), and so forth. Subversion has no idea of anything except when files are different from their copies living in the .svn shadow directory, and that’s only on the client. If you have a couple different live streams of work, separating them out is a bit of a hassle. Branches are no solution at all, since it isn’t always clear upfront what goes in which branch. Changelists are much more flexible.

Locking is useless
The point of a lock in version control systems is to signal that it’s not safe to change a file. The most common use is for binary files that can’t be merged, but there are other useful situations too. Here’s the catch: Subversion checks locks when you attempt to commit. That’s how it has to work. In other words, by the time you find out there’s a lock on a file, you’ve already gone and started working on it, unless you obsessively check repository status for files. There’s also no way to know if you’re putting a lock on a file somebody has pending edits to.
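Here is a toy model of that timing problem, as a sketch only. ToyServer, ToyWorkingCopy, and LockedError are made-up names and have nothing to do with Subversion’s actual implementation. The one behavior they mirror is the one described above: editing is purely local, and the lock is only consulted when the commit reaches the server.

```python
class LockedError(Exception):
    """Raised when a commit touches a file someone else has locked."""


class ToyServer:
    def __init__(self):
        self.locks = {}  # path -> owner
        self.revision = 0

    def lock(self, path, owner):
        # The server has no idea who already has local edits to this path.
        self.locks[path] = owner

    def commit(self, path, author):
        owner = self.locks.get(path)
        if owner is not None and owner != author:
            raise LockedError(f"{path} is locked by {owner}")
        self.revision += 1
        return self.revision


class ToyWorkingCopy:
    def __init__(self, server, user):
        self.server = server
        self.user = user
        self.edited = []

    def edit(self, path):
        # Purely local: nothing is checked against the server here.
        self.edited.append(path)

    def commit(self):
        return [self.server.commit(path, self.user) for path in self.edited]


server = ToyServer()
bob = ToyWorkingCopy(server, "bob")
bob.edit("art/logo.psd")               # Bob starts working
server.lock("art/logo.psd", "alice")   # Alice locks, unaware of Bob's edits
try:
    bob.commit()
    found_out_at_commit = False
except LockedError:
    found_out_at_commit = True         # Bob only learns about the lock now
```

Both failures from the paragraph show up: Bob has already done the work by the time he hears about the lock, and Alice had no way to see that Bob had pending edits.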

The long and short of it is if you’re going to use a server, really use it. Perforce does. There’s no need to have the drawbacks of both centralized and distributed systems at once.

I think that’s everything that bothers me about Subversion. What about you?


October 10, 2010

Evaluation: Git

Filed under: Software Engineering — Promit @ 6:59 pm

Last time I talked about Mercurial, and was generally disappointed with it. I also evaluated Git, another major distributed version control system (DVCS).

Short Review: Quirky, but a promising winner.

Git, like Mercurial, was spawned as a result of the Linux-BitKeeper feud. It was written initially by Linus Torvalds, apparently during quite a lull in Linux development. It is for obvious reasons a very Linux focused tool, and I’d heard that performance is poor on Windows. I was not optimistic about it being usable on Windows.

Installation actually went very smoothly. Git for Windows is basically powered by MSYS, the same Unix tools port that accompanies the Windows GCC port called MinGW. The installer is neat and sets up everything for you. It even offers a set of shell extensions that provide a graphical interface. Note that I opted not to install this interface, and I have no idea what it’s like. A friend tells me it’s awful.

Once the installer is done, git is ready to go. It’s added to PATH and you can start cloning things right off the bat. Command line usage is simple and straightforward, and there’s even a ‘config’ option that lets you set things up nicely without having to figure out what config file you want and where it lives. It’s still a bit annoying, but I like it a lot better than Mercurial. I’ve heard some people complain about git being composed of dozens of binaries, but I haven’t seen this on either my Windows or Linux boxes. I suspect this is a complaint about old versions, where each git command was its own binary (git-commit, git-clone, git-svn, etc), but that’s long since been retired. Most of the installed binaries are just the MSYS ports of core Unix programs like ls.

I was also thrilled with the git-svn integration. Unlike Mercurial, the support is built in and flat out works with no drama whatsoever. I didn’t try committing back into the Subversion repository from git, but apparently there is fabulous two way support. It was simple enough to create a git repository but it can be time consuming, since git replays every single check-in from Subversion to itself. I tested on a small repository with only about 120 revisions, which took maybe two minutes.

This is where I have to admit I have another motive for choosing Git. My favorite VCS frontend comes in a version called SmartGit. It’s a stand-alone (not shell integrated) client that is free for non commercial use and works really well. It even handled SSH beautifully, which I’m thankful about. It’s still beta technically, but I haven’t noticed any problems.

Now the rough stuff. I already mentioned that Git for Windows comes with a GUI that is apparently not good. What I discovered is that getting git to authenticate from Windows is fairly awful. In Subversion, you actually configure users and passwords explicitly in a plain-text file. Git doesn’t support anything of the sort; their ‘git-daemon’ server allows fully anonymous pulls, and the only kind of push it can be configured to accept is an anonymous one. Authentication is entirely dependent on the filesystem permissions and the users configured on the server (barring workarounds), which means that most of the time, authenticated Git transactions happen inside SSH sessions. If you want to do something else, it’s complicated at best. Oh, and good luck with HTTP integration if you chose a web server other than Apache. I have to imagine running a Windows based git server is difficult.

Let me tell you about SSH on Windows. It can be unpleasant. Most people use PuTTY (which is very nice), and if you use a server with public key authentication, you’ll end up using a program called Pageant that provides that service to various applications. Pageant doesn’t use OpenSSH compatible keys, so you have to convert the keys over, and watch out because the current stable version of Pageant won’t do RSA keys. Git in turn depends on a third program called Plink, which exists to help programs talk to Pageant, and it finds that program via the GIT_SSH environment variable. The long and short of it is that getting Git to log into a public key auth SSH server is quite painful. I discovered that SmartGit simply reads OpenSSH keys and connects without any complications, so I rely on it for transactions with our main server.

I am planning to transition over to git soon, because I think that the workflow of a DVCS is better overall. It’s really clear, though, that these are raw tools compared to the much more established and stable Subversion. It’s also a little more complicated to understand; whether you’re using git, Mercurial, or something else it’s valuable to read the free ebooks that explain how to work with them. There are all kinds of quirks in these tools. Git, for example, uses a ‘staging area’ that clones your files for commit, and if you’re not careful you can wind up committing older versions of your files than what’s on disk. I don’t know why — seems like the opposite extreme from Mercurial.
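The staging area surprise is easier to see in a toy model. This is a sketch under heavy simplification: ToyRepo and its write, add, and commit methods are invented for illustration and are not git’s data model or API. The only point is that staging copies a file’s contents at that moment, so later edits on disk are not part of the commit unless you stage again.

```python
class ToyRepo:
    """Hypothetical model of a working directory, a staging area, and commits."""

    def __init__(self):
        self.disk = {}     # path -> contents in the working directory
        self.staged = {}   # path -> contents captured at staging time
        self.commits = []  # list of full snapshots, oldest first

    def write(self, path, contents):
        self.disk[path] = contents

    def add(self, path):
        # Staging copies the contents as they are right now.
        self.staged[path] = self.disk[path]

    def commit(self):
        # A commit is the previous snapshot plus whatever was staged.
        snapshot = dict(self.commits[-1]) if self.commits else {}
        snapshot.update(self.staged)
        self.commits.append(snapshot)
        self.staged = {}
        return snapshot


repo = ToyRepo()
repo.write("main.c", "version 1")
repo.add("main.c")                  # stage version 1
repo.write("main.c", "version 2")   # keep editing after staging
snapshot = repo.commit()            # commits what was staged, not what is on disk
```

After this runs, the commit holds "version 1" while the disk holds "version 2", which is exactly the older-than-disk commit described above.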

It’s because of these types of issues that I favor choosing the version control system with the most momentum behind it. Git and Mercurial aren’t the only two DVCS out there; Bazaar, monotone, and many more are available. But these tools already have rough (and sharp!) edges, and by sticking to the popular ones you are likely to get the most community support. Both Git and Mercurial have full blown books dedicated to them that are available electronically for free. My advice is that you read them.

October 6, 2010

Evaluation: Mercurial

Filed under: Software Engineering — Promit @ 11:13 am

I’ve been a long time Subversion user, and I’m very comfortable with its quirks and limitations. It’s an example of a centralized version control system (CVCS), which is very easy to understand. However, there’s been a lot of talk lately about distributed version control systems (DVCS), of which there are two well known examples: git and Mercurial. I’ve spent a moderate amount of time evaluating both, and I decided to post my thoughts. This entry is about Mercurial.

Short review: A half baked, annoying system.

I started with Mercurial, because I’d heard anecdotally that it’s more Windows friendly and generally nicer to work with than git. I was additionally spurred by reading the first chapter of HgInit, an e-book by Joel Spolsky of ‘Joel on Software’ fame. Say what you will about Joel — it’s a concise and coherent explanation of why distributed version control is, in a general sense, preferable to centralized. Armed with that knowledge, I began looking at what’s involved in transitioning from Subversion to Mercurial.

Installation was smooth. Mercurial’s site has a Windows installer ready to go that sets everything up beautifully. Configuration, however, was unpleasant. The Mercurial guide starts with this as your very first step:

As first step, you should teach Mercurial your name. For that you open the file ~/.hgrc with a text-editor and add the ui section (user interaction) with your username:

Yes, because what I’ve always wanted from my VCS is for it to be a hassle every time I move to a new machine. Setting up extensions is similarly a pain in the neck. More on that in a moment. Basically Mercurial’s configurations are a headache.

Then there’s the actual VCS. You see, I have one gigantic problem with Mercurial, and it’s summed up by Joel:

Whereas, in Mercurial, all commands always apply to the entire tree. If your code is in c:\code, when you issue the hg commit command, you can be in c:\code or in any subdirectory and it has the same effect.

This is an incredibly awkward design decision. The basic idea, I guess, is that somebody got really frustrated about forgetting to check in changes and decided this was the solution. My take is that this is a stupid restriction that makes development unpleasant.

When I’m working on something, I usually have several related projects in a repository. (Mercurial fans freely admit this is a bad way to work with it.) Within each project, I usually wind up making a few sets of parallel changes. These changes are independent and shouldn’t be part of the same check-in. The idea with Mercurial is, I think, that you simply produce new branches every time you do something like this, and then merge back together. Should be no problem, since branching is such a trivial operation in Mercurial.

So now I have to stop and think about whether I should be branching every time I make a tweak somewhere?

Oh but wait, how about the extension mechanism? I should be able to patch in whatever behavior I need, and surely this is something that bothers other people! As it turns out, that’s definitely the case. Apart from the branching suggestions, there’s not one but half a dozen extensions to handle this problem, all of which have their own quirks and pretty much all of which involve jumping back into the VCS frequently. This is apparently a problem the Mercurial developers are still puzzling over.

Actually there is one tool that’s solved this the way you would expect: TortoiseHg. Which is great, save two problems. Number one, I want my VCS features to be available from the command line and front-end both. Two, I really dislike Tortoise. Alternative Mercurial frontends are both trash, and an unbelievable pain to set up. If you’re working with Mercurial, TortoiseHg and command line are really your only sane options.

It comes down to one thing: workflow. With Mercurial, I have to be constantly conscious about whether I’m in the right branch, doing the right thing. Should I be shelving these changes? Do they go together or not? How many branches should I maintain privately? Ugh.

Apart from all that, I ran into one serious show stopper. Part of this test was migrating my existing Subversion repository, and Mercurial includes a convenient extension for it. Wait, did I say convenient? I meant borderline functional:

Subversion’s Python bindings are a prerequisite. The bindings (generated with SWIG) are installed separately on Windows, and can be found on http://subversion.tigris.org/ . Note that you can’t do this with the Win32 Mercurial binaries — there’s no way to install the Subversion bindings into its built-in Python library. So you’ll need to use a Mercurial installed on top of a stand-alone Python, and you may also need to do something like “set HG=python c:\Python25\Scripts\hg” to override the default Win32 binaries if you have those installed also. For Mac OS X, the easiest way is to install the CollabNet Subversion build, and then copy the content of /opt/subversion/lib/svn-python to the site-package directory of the python installation.

The silver lining is there are apparently third party tools to handle this that are far better, but at this point Mercurial has tallied up a lot of irritations and I’m ready to move on.

Spoiler: I’m transitioning to git. I’ll go into all the gory details in my next post, but I found git to be vastly better to work with.

December 22, 2009

My Favorite SVN Tool: SmartSVN

Filed under: Software Engineering — Promit @ 4:41 pm

SmartSVN is my go-to tool for Subversion work. I believe that most people use either the command line ‘svn’ tool, or TortoiseSVN. Now I like command line from time to time too, and for that I use SlikSVN under Windows. I don’t like command line for the majority of the work though, so I stick to SmartSVN.

I’ve used Tortoise a lot, as well as the VS plugins AnkhSVN and VisualSVN. I’m not going to criticize them from a technical standpoint, but what it comes down to is this — integration sucks. Visual Studio integration is pointless when half your files aren’t in VS anyway, and Explorer was never designed to be Subversion. I went looking for a stand-alone SVN client and tried a couple (RapidSVN comes to mind) before settling on SmartSVN.

Why is a stand-alone client better? I can see a lot more information about the repository at once for starters, like the revision history, working copy status (with all kinds of sort options), my recent transactions, etc. The menu structure is also a lot nicer to work with than one giant embedded shell menu. I also like SmartSVN’s project management, so that I don’t have to go hunting through the filesystem to pull up projects. And as a simple practical matter, it keeps my Subversion windows separate from my Explorer windows in the taskbar.

Is this an advertisement? Yeah, a bit. The people at Syntevo were nice enough to kick a pro license my way, which adds a bunch of features — I’m looking forward to Perforce style change sets. And I won’t lie, at $70 USD the pro license is a bit steep. But I’ve been using the free version for a couple years now and I’m not planning on going back to Tortoise any time soon. (And yes, Smart has shell integration if you still want it.)

And there’s my pitch. I seem to have gotten into a routine of highlighting the tools I develop with, so I’ll continue that trend for a while.
