Complex SVN repository conversion to GIT

When I converted our two Visual Source Safe (vss from now on) repositories into one Subversion (svn from now on) repository I did it in an ugly fashion. I dumped the file based history for the current live trees, then loaded these into svn one after the other. They both had their own root names, so I shuffled things around to make a single Trunk, Branch and Tags arrangement.

This did not matter at the time, as we had the history for the code we cared about, and you could do per-file blame, etc.

But when I tried to import this tree into Git, it was messy: the svn tree did not have the standard layout for its whole history, so the default scripts didn’t know what to do with it.

The history timeline (names changed to protect the innocent, but revisions are not altered, otherwise I’d make editing mistakes):

  • Tree-A imported from vss (single branch) – rev 1 – 3163
  • Tree-B imported from vss (single branch) – rev 3164 – 5710
  • Rename of roots, Trunk, Branch etc – rev 5711 – 5713 (we want to skip this stuff)
  • Current Unified Tree (multiple branches) – 5714 – HEAD

1. Setup our repo

Create new directory

  • Run ‘git init’ – this will create the .git directory
  • Git by default is set up for Unix text files, since most projects using Git on Windows are Unix-based projects. Thus we want to turn off the automatic crlf -> lf conversion. We do this by adding the following to the .git/config file:
[core]
    filemode = false
    symlinks = false
    autocrlf = false
    whitespace = nowarn

If on Windows, also add this to the [core] section:

    ignorecase = true
  • Add the authors file (.git/authors), this maps from svn users to git users (full email address)
    (no author) = Simeon Pilgrim <simeon.pilgrim@example.com>
    user1 = user one <user1@example.com>
    user2 = user two <user2@example.com>
    spilgrim = Simeon Pilgrim <simeon.pilgrim@example.com>
  • Set up git svn to use the authors file by adding this to .git/config (otherwise we have to mention it every time we do a fetch, with ‘git svn fetch -A .git/authors …’, which is a pain)
[svn]
    authorsfile = .git/authors
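
The same setup can also be done from the command line instead of editing .git/config by hand. This is only a sketch: it works in a throwaway directory from mktemp (an assumption, use your real conversion directory instead), and the ignorecase line is left commented out because it only applies on Windows.

```shell
set -e

# Throwaway directory for the sketch; in practice this is your conversion directory.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# Same values as the [core] section above.
git config core.filemode false
git config core.symlinks false
git config core.autocrlf false
git config core.whitespace nowarn
# Windows only:
# git config core.ignorecase true

# The authors file maps svn user names to git identities.
cat > .git/authors <<'EOF'
(no author) = Simeon Pilgrim <simeon.pilgrim@example.com>
user1 = user one <user1@example.com>
user2 = user two <user2@example.com>
spilgrim = Simeon Pilgrim <simeon.pilgrim@example.com>
EOF

# Same as the [svn] section above.
git config svn.authorsfile .git/authors

# Show what ended up in the config.
git config --get-regexp '^(core|svn)\.'
```

Using ‘git config’ avoids typos in section names, and the result in .git/config should be equivalent to the hand-edited version.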

2. Create three remotes, one for each of the three chunks.

  • Normally this would be done via git svn init, but doing it manually gives us more control.
  • Edit .git/config to add the three remotes

Add:

[svn-remote "Tree-A-Old"]
    url = svn://svn-server-url/repository
    fetch = Tree-A-RootName:refs/remotes/Tree-A-Old

[svn-remote "Tree-B-Old"]
    url = svn://svn-server-url/repository
    fetch = Tree-B-RootName:refs/remotes/Tree-B-Old

[svn-remote "svn"]
    url = svn://svn-server-url/repository
    fetch = Trunk:refs/remotes/svn/trunk
    branches = Branches/*:refs/remotes/svn/*
    tags = Tags/*:refs/remotes/svn/tags/*

This defines three svn-remotes, which we can fetch individually. The first two will fetch from svn://svn-server-url…/Tree-A-RootName and svn://svn…/Tree-B-RootName to the remote branches Tree-A-Old and Tree-B-Old. We fetch them by using ‘git svn fetch Tree-A-Old’ and ‘git svn fetch Tree-B-Old’. The last remote will fetch trunk from svn://svn…/Trunk to svn/trunk, and all branches and tags from svn://svn…/Branches and svn://svn…/Tags as git remote branches with the names svn/<Branch folder name> and svn/tags/<Tag folder name>.
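
If you would rather not hand-edit the file, the three svn-remote sections can be written with ‘git config’. This is a sketch under assumptions: it runs in a throwaway repository from mktemp, and the url is the placeholder used above, so nothing here contacts a real server.

```shell
set -e

# Throwaway repository for the sketch.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# Placeholder url from the text above, not a real server.
url=svn://svn-server-url/repository

# The two old vss trees: one fetch line each, no branches or tags.
git config svn-remote.Tree-A-Old.url "$url"
git config svn-remote.Tree-A-Old.fetch "Tree-A-RootName:refs/remotes/Tree-A-Old"
git config svn-remote.Tree-B-Old.url "$url"
git config svn-remote.Tree-B-Old.fetch "Tree-B-RootName:refs/remotes/Tree-B-Old"

# The current unified tree with the Trunk/Branches/Tags layout.
git config svn-remote.svn.url "$url"
git config svn-remote.svn.fetch "Trunk:refs/remotes/svn/trunk"
git config svn-remote.svn.branches "Branches/*:refs/remotes/svn/*"
git config svn-remote.svn.tags "Tags/*:refs/remotes/svn/tags/*"

# List the resulting svn-remote entries.
git config --get-regexp '^svn-remote\.'
```

The values are quoted so the shell does not try to expand the * in the branches and tags lines.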

The former two can just be fetched straight, without specifying a revision range (e.g. ‘git svn fetch Tree-A-Old’), but the last requires a range to be specified. This is for two reasons:

  • git svn fetch with branches and tags seems to check every revision up to the start point
  • We actually don’t want to grab the first few revisions after the svn merge as I was purely renaming/moving directories. Otherwise files will randomly disappear and then reappear a commit later in the middle of the history.

Thus to grab the recent svn history we run fetch as ‘git svn fetch -r 5714:HEAD svn’.

3. Merge the two old vss branches

We want to merge the two old vss histories together (in correct time order) so we replicate the svn merge in a nicer fashion.

  • First we check out one of the branches with ‘git checkout -b merged Tree-B-Old’. This creates a new local branch ‘merged’ which is a copy of the remote ‘Tree-B-Old’, since we can’t edit remote branches directly.
  • We then merge the Tree-A-Old remote branch into ‘merged’ by running ‘git merge Tree-A-Old’. This creates a merge commit (i.e. a commit with more than one parent). The message can be edited by using ‘git commit --amend’ but the existing one is probably fine.
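
The merge step can be tried out on a toy repository first. In this sketch the branches tree-a and tree-b are hypothetical stand-ins for Tree-A-Old and Tree-B-Old, built as two unrelated histories. One thing that differs from when this was written: Git 2.9 and later refuse to merge histories with no common ancestor unless you pass --allow-unrelated-histories, so the sketch includes it.

```shell
set -e

demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.name "Example User"
git config user.email "user@example.com"

# Stand-in for Tree-A-Old: a history with its own root commit.
git symbolic-ref HEAD refs/heads/tree-a
echo a > a.txt
git add a.txt
git commit -q -m "Tree-A history"

# Stand-in for Tree-B-Old: a second, unrelated root commit.
git checkout -q --orphan tree-b
git rm -q -rf .
echo b > b.txt
git add b.txt
git commit -q -m "Tree-B history"

# Local 'merged' branch starting from Tree-B, then merge Tree-A into it.
git checkout -q -b merged tree-b
git merge -q --allow-unrelated-histories -m "Merge Tree-A into merged" tree-a

# A merge commit has more than one parent.
parents=$(git log -1 --pretty=%P merged | wc -w | tr -d ' ')
echo "merge commit has $parents parents"
```

After the merge, the working tree of ‘merged’ contains the files from both histories.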

4. Grafting the branches together

The next step is to graft the branches. Essentially we take the start point and force it to have the merge commit as a parent.

  • Note that since git stores the state of the tree at each commit rather than the diff from the parents, we can essentially tell it that a given commit has a different parent commit and it will handle it fine.
  • Add a graft entry to .git/info/grafts. The format of this file is a series of lines of the form $child_id $parent_id (the commit first, followed by the parent or parents it should appear to have). We thus add an entry with $trunk_base_id $merged_head_id. To get the ids we can run ‘git log merged’ and take the head/top commit id, and run ‘git log svn/trunk’ and take the root/bottom commit id.
  • This is a temporary graft (i.e. the link will only remain whilst the grafts file is there), which most tools should honour, but we want to make it permanent. We do this by running git filter-branch over each of the svn branches: ‘git filter-branch svn/trunk svn/v3.0 …’
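
Here is a toy version of the graft, as a sketch only. Note it swaps in ‘git replace --graft’ rather than the .git/info/grafts file used above, because newer Git versions have deprecated the grafts file; the idea is the same. The branches merged and trunk are hypothetical stand-ins for the real ‘merged’ and ‘svn/trunk’.

```shell
set -e

demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.name "Example User"
git config user.email "user@example.com"

# Stand-in for the 'merged' branch holding the old vss history.
git symbolic-ref HEAD refs/heads/merged
git commit -q --allow-empty -m "old history 1"
git commit -q --allow-empty -m "old history 2"

# Stand-in for svn/trunk: an unrelated history with its own root.
git checkout -q --orphan trunk
git commit -q --allow-empty -m "trunk base"
git commit -q --allow-empty -m "trunk work"

# Head/top commit of merged, root/bottom commit of trunk.
merged_head_id=$(git rev-parse merged)
trunk_base_id=$(git rev-list --max-parents=0 trunk)

# Make the trunk root appear to have the merged head as its parent.
git replace --graft "$trunk_base_id" "$merged_head_id"

# trunk now reaches its own two commits plus the two old ones.
count=$(git rev-list --count trunk)
echo "trunk now reaches $count commits"
```

As with the grafts file, the link is only a local overlay at this point. The git filter-branch documentation says it honours both grafts and replace refs and makes them permanent, so the filter-branch step above should still apply.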

5. Get git-svn to use the new remote branches

  • git-svn adds an entry of the form git-svn-id: svn://svn…/Trunk@40403 <UUID of svn serv> to each commit. It uses this to rebuild its cache in the event of corruption.
  • To force git-svn to use the new branches is quite easy. Remove the folder .git/svn and rerun ‘git svn fetch svn‘. git-svn will then rebuild its cache and start from where we left off.

B1. Cleaning history

The history can also be imported directly from a local copy of the svn data. This makes the svn import much faster, but the commits need some post cleanup for git-svn to sync up with the real svn server. Specifically the git-svn-id entries will come out as follows:

git-svn-id: file:///home/james/src/repo/Tree-A-Rootname@3161 1923097a-7eed-ce49-a323-f810e19527ea

but we need it in the form of:

git-svn-id: svn://svn-server-url/repository/Tree-A-Rootname@3161 1923097a-7eed-ce49-a323-f810e19527ea

This is where we use the git filter-branch command again, in particular its --msg-filter argument. It takes a chunk of shell script that is run for every commit, with the current message given on stdin and the new message read from stdout. In this case we use it as follows:

git filter-branch --msg-filter "sed -e 's#file:///home/james/src/#svn://svn-server-url/#g' " -f svn-trunk/trunk svn-trunk/v3.1 svn-trunk/v3.2 svn-trunk/v3.3 svn-trunk/v4.0

The other arguments given here are -f, which forces filter-branch to run even if there are backups from previous filter runs (it also seems to be needed on Windows), and the names of the branches to be updated.
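
It is worth testing the sed expression on a single message before running it over every commit. The sketch below feeds the sample git-svn-id line from above through the same expression. With the anonymised paths exactly as printed in this post, the result contains /repo/ rather than /repository/, so treat the pattern as something to adjust to your own paths rather than copy verbatim.

```shell
set -e

# Sample commit message line, as produced by the local file:// import.
old='git-svn-id: file:///home/james/src/repo/Tree-A-Rootname@3161 1923097a-7eed-ce49-a323-f810e19527ea'

# The same sed expression that is passed to --msg-filter.
new=$(printf '%s\n' "$old" | sed -e 's#file:///home/james/src/#svn://svn-server-url/#g')

echo "$new"
```

If the printed line does not exactly match what git-svn would have written when fetching from the real server, git-svn will not be able to pick up where the import left off.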

In my particular case I was also running an email/address fixup with the entire command being:

git filter-branch --msg-filter "sed -e 's#file:///home/james/src/#svn://svn-server-url/#g'" --env-filter 'export GIT_AUTHOR_NAME="$(echo $GIT_AUTHOR_NAME | sed -f /c/SG/authors)";export GIT_AUTHOR_EMAIL="$(echo $GIT_AUTHOR_EMAIL | sed -f /c/SG/emails)"' -f svn-trunk/trunk svn-trunk/v3.1 svn-trunk/v3.2 svn-trunk/v3.3 svn-trunk/v4.0

The env-filter lets you change environment variables, in this case GIT_AUTHOR_NAME and GIT_AUTHOR_EMAIL, which I updated using two sed scripts.

The contents of /c/SG/authors were:

s#^.*no author.*#Simeon Pilgrim#gi
s#user1#User One#gi
s#user2#User Two#gi
s#spilgrim#Simeon Pilgrim#gi

And /c/SG/emails:

s#^.*no author.*#simeon.pilgrim@example.com#gi
s#user1.*#user1@example.com#gi
s#user2.*#user2@example.com#gi
s#spilgrim.*#simeon.pilgrim@example.com#gi
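
The two sed scripts can be checked outside of filter-branch as well. In this sketch they are written to a temporary directory instead of /c/SG, and the sample inputs are assumptions about what an import without an authors file produces (the svn user name as the author name, and user@<repository UUID> as the email). The trailing i flag is a GNU sed extension, so this assumes GNU sed.

```shell
set -e

tmp=$(mktemp -d)

# Copy of the authors script from above.
cat > "$tmp/authors" <<'EOF'
s#^.*no author.*#Simeon Pilgrim#gi
s#user1#User One#gi
s#user2#User Two#gi
s#spilgrim#Simeon Pilgrim#gi
EOF

# Copy of the emails script from above.
cat > "$tmp/emails" <<'EOF'
s#^.*no author.*#simeon.pilgrim@example.com#gi
s#user1.*#user1@example.com#gi
s#user2.*#user2@example.com#gi
s#spilgrim.*#simeon.pilgrim@example.com#gi
EOF

# Assumed sample values, not taken from a real repository.
sample_name='user1'
sample_email='user1@1923097a-7eed-ce49-a323-f810e19527ea'

name=$(echo "$sample_name" | sed -f "$tmp/authors")
email=$(echo "$sample_email" | sed -f "$tmp/emails")
anon=$(echo '(no author)' | sed -f "$tmp/authors")

echo "$name <$email>"
echo "$anon"
```

Running each known svn user through the scripts this way is a cheap check that no pattern accidentally matches part of another user's name.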

This post was written by James, thus the /home/james/ paths. I just generalised the grammar and removed work-place specific information.

Getting into Git

With the pending move to the US, I need some way to work detached from the main NZ network. So I was keen to try out Mercurial or Git. I had heard more positive Mercurial stories, versus the “real men use Git” type stories, so I had intended to go with Mercurial.

But our local linux/git hacker was appalled and spent some time showing me how well git and svn can co-exist.

So I started playing with Git, trying to clone my subversion repository, but was not having the best of luck, due to how the repository was formed when I moved and merged two VSS repositories.

Anyway, showing James (yes, he has a name) my odd clone behaviours got him intrigued. Half an hour later he was back, telling me all about how the SVN repository history was all messed up (aka Trunk was not always there, etc). Then he offered to work some Git voodoo and fix it all for me.

A day later, he excitedly told me how he had manually extracted the blocks of history and, using Git filters that allow per-entry rewrites (of paths), was rebuilding my Trunk and Branches in Git, so they looked like they had always been located where they are in the current SVN tree.

So it sounds like Git is quite powerful. Sounds like the type of stuff that would make great blog fodder. But I’ve not done any of it myself, so I can’t tell you how to do it, other than that it can be done. The tools seem to have lots of flexibility, and when things go wrong, James will have to drop everything and fix it.

See, I am becoming more manager-like as the days roll by…

Also, to make up for stealing my blog post, he invited me to Google Wave.

Subversion upgrade missing UUID

We recently upgraded our subversion server at work because it was having performance problems, yet the new server also performed poorly, so our dev-svn-admin guy did a dump reload of all the repositories, to gain the benefits of the new Subversion file layouts.

There were a couple of hung transactions on some repositories, and two repositories (one was ours) didn’t have the UUID set, thus wouldn’t reload. No problem, he just set a UUID, and ta-da it loaded, but then our local working copies would not work, giving this error:

Repository UUID 'new UUID' doesn't match expected UUID '????????-????-????-????-???????????'

The first team just re-checked-out all the working copies, then manually merged their local changes.

I pulled out Visual Studio and did a file-based Find-and-Replace across the working copy to swap the old UUID for the new one.

One of my other team members came up with the following Cygwin command instead:

find . -name "entries" -exec sed -i 's/????????-????-????-????-????????????/1923097a-7eed-ce49-a323-f810e19527ea/' {} \;
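
To see what the command does without touching a real working copy, here is a sketch on a fake one. The directory layout and the old UUID (all zeros) are made up for illustration, and each entries file is reduced to two lines; real .svn/entries files hold much more. It assumes GNU sed for the -i flag, which is what Cygwin ships.

```shell
set -e

# Fake working copy with two .svn/entries files.
wc_root=$(mktemp -d)
mkdir -p "$wc_root/.svn" "$wc_root/src/.svn"

# Hypothetical old UUID; substitute the one from your error message.
old_uuid=00000000-0000-0000-0000-000000000000
new_uuid=1923097a-7eed-ce49-a323-f810e19527ea

printf 'svn://svn-server-url/repository\n%s\n' "$old_uuid" > "$wc_root/.svn/entries"
printf 'svn://svn-server-url/repository\n%s\n' "$old_uuid" > "$wc_root/src/.svn/entries"

# Rewrite the UUID in every entries file under the working copy.
find "$wc_root" -name "entries" -exec sed -i "s/$old_uuid/$new_uuid/" {} \;

# Count files that still mention the old UUID, and files that have the new one.
remaining=$(grep -rl "$old_uuid" "$wc_root" | wc -l | tr -d ' ')
updated=$(grep -rl "$new_uuid" "$wc_root" | wc -l | tr -d ' ')
echo "updated $updated files, $remaining still have the old UUID"
```

On a real working copy it would be sensible to take a backup first, since sed -i edits the files in place.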

DiffMerge 3.1 Released

Eric Sink has announced the SourceGear release of DiffMerge 3.1. Killer feature for me is the correct TortoiseSVN merge command is now documented in the help, copied here for your pleasure:

/m /r=%merged /t1=%yname /t2=%bname /t3=%tname /c=%mname %mine %base %theirs

Fantastic. I’d still love to see all four views at once (theirs, base, mine, merged), but this help makes the product very usable now for conflict merges.

Conversion complete

Well the full conversion ran last night for the other team.

Some stats:

  • The current four Visual Source Safe repositories had a combined size of 11.2 GB
  • The wanted code tree was 490 MB  (mix of text and binary files)
  • The subversion history dump of the above code was 4.1 GB
  • The subversion repository size is 620 MB
  • The copy/dump/load took 12 hours to run

To merge the repositories together, I had to run a few svn commands between svnadmin loads to create/move/delete so the sub-trees were all happy.

One oddity noticed was that some files were different between the old and new repositories. This was due to a file being altered by a developer in the US in his time zone and then, within the time-zone difference, a developer in NZ changing the file as well. So even though the US change was made first, the dump program saw the NZ one as having the earlier time, and thus swapped the order of these edits. (VSS records local time, and it is the local client that alters the repository, so developers in different time zones really should not work on the same repository.)

But this will not happen now, because subversion itself does not have this problem.