Complex SVN repository conversion to Git

When I converted our two Visual SourceSafe (vss from now on) repositories into one Subversion (svn from now on) repository, I did it in an ugly fashion. I dumped the file-based history for the current live trees, then loaded these into svn one after the other. They both had their own root names, so I shuffled things around to make a single Trunk, Branches and Tags arrangement.

This did not matter at the time, as we had the history for the code we cared about, and you could do per-file blame, etc.

But when I tried to import this tree into Git, it was messy: the svn tree did not have a standard layout for its whole history, so the default scripts didn’t know what to do with it.

The history timeline (names changed to protect the innocent, but revisions are not altered, otherwise I’d make editing mistakes):

  • Tree-A imported from vss (single branch) - rev 1 - 3163
  • Tree-B imported from vss (single branch) - rev 3164 - 5710
  • Rename of roots, Trunk, Branch etc - rev 5711 - 5713 (we want to skip this stuff)
  • Current Unified Tree (multiple branches) - rev 5714 - HEAD

1. Set up our repo

Create a new directory

  • Run ‘git init’ - this will create the .git directory

  • Git by default is set up for Unix text files, since most projects using git on Windows are Unix-based projects. Thus we want to turn off the automatic crlf -> lf conversion. We do this by adding the following to the .git/config file:

      [core]
      filemode = false
      symlinks = false
      autocrlf = false
      whitespace = nowarn
    
    If on Windows, also add to the [core] section:

      ignorecase = true

  • Add the authors file (.git/authors). This maps svn users to git users (full name and email address):

      (no author) = Simeon Pilgrim <simeon.pilgrim@example.com>
      user1 = user one <user1@example.com>
      user2 = user two <user2@example.com>
      spilgrim = Simeon Pilgrim <simeon.pilgrim@example.com>
    
  • Set up git svn to use the authors file by adding this to .git/config (otherwise we have to mention it on every fetch with git svn fetch -A .git/authors …, which is a pain):

      [svn]
      authorsfile = .git/authors
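
The same settings can also be applied with git config instead of editing .git/config by hand. This is only a minimal sketch: it assumes git is on the PATH, and the scratch directory from mktemp is a stand-in for your real working directory.

```shell
#!/bin/sh
set -e

# Scratch directory, purely for illustration; use your real directory instead.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Equivalent of the [core] entries listed above.
git config core.filemode false
git config core.symlinks false
git config core.autocrlf false
git config core.whitespace nowarn
# On Windows only:
# git config core.ignorecase true

# Equivalent of the [svn] entry, so -A need not be passed on every fetch.
git config svn.authorsfile .git/authors

# Show what ended up in .git/config.
git config --get core.autocrlf
git config --get svn.authorsfile
```

Either route ends up writing the same keys into .git/config, so use whichever you find less error-prone.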
    

2. Create a remote for each of the three chunks

  • Normally this would be done via git svn init, but doing it manually gives us more control.

  • Edit .git/config to add the three remotes

    Add:

      [svn-remote "Tree-A-Old"]
      url = svn://svn-server-url/repository
      fetch = Tree-A-RootName:refs/remotes/Tree-A-Old
    
      [svn-remote "Tree-B-Old"]
      url = svn://svn-server-url/repository
      fetch = Tree-B-RootName:refs/remotes/Tree-B-Old
    
      [svn-remote "svn"]
      url = svn://svn-server-url/repository
      fetch = Trunk:refs/remotes/svn/trunk
      branches = Branches/*:refs/remotes/svn/*
      tags = Tags/*:refs/remotes/svn/tags/*
    

This defines three svn-remotes, which we can fetch individually. The first two will fetch from svn://svn-server-url…/Tree-A-RootName and svn://svn…/Tree-B-RootName to the remote branches Tree-A-Old and Tree-B-Old. We fetch them by using ‘git svn fetch Tree-A-Old’ and ‘git svn fetch Tree-B-Old’. The last remote will fetch trunk from svn://svn…/Trunk to svn/trunk, and all branches and tags from svn://svn…/Branches and svn://svn…/Tags as git remote branches named svn/* and svn/tags/* respectively.

The first two can just be fetched straight, without specifying a revision range (e.g. ‘git svn fetch Tree-A-Old’), but the last requires a range to be specified. This is for two reasons:

  • git svn fetch with branches and tags seems to check every revision up to the start point
  • We actually don’t want to grab the first few revisions after the svn merge, as in those I was purely renaming/moving directories. If we kept them, files would randomly disappear and then reappear a commit later in the middle of the history.

Thus to grab the recent svn history we run fetch as ‘git svn fetch -r 5714:HEAD svn’
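
For reference, the three svn-remote sections can also be written with git config rather than by hand. The sketch below assumes git is on the PATH and uses a scratch repository from mktemp so that it can be run anywhere; the URL is the same placeholder used above. The fetch commands are left as comments because they need the real svn server.

```shell
#!/bin/sh
set -e

# Scratch repository for illustration only.
repo=$(mktemp -d)
cd "$repo"
git init -q

url=svn://svn-server-url/repository   # placeholder URL from the text

# The two old vss trees, one svn-remote each.
git config svn-remote.Tree-A-Old.url "$url"
git config svn-remote.Tree-A-Old.fetch Tree-A-RootName:refs/remotes/Tree-A-Old
git config svn-remote.Tree-B-Old.url "$url"
git config svn-remote.Tree-B-Old.fetch Tree-B-RootName:refs/remotes/Tree-B-Old

# The unified tree with trunk, branches and tags.
git config svn-remote.svn.url "$url"
git config svn-remote.svn.fetch Trunk:refs/remotes/svn/trunk
git config svn-remote.svn.branches 'Branches/*:refs/remotes/svn/*'
git config svn-remote.svn.tags 'Tags/*:refs/remotes/svn/tags/*'

# List everything that was written.
git config --get-regexp '^svn-remote\.'

# Against the real server (not run here):
#   git svn fetch Tree-A-Old
#   git svn fetch Tree-B-Old
#   git svn fetch -r 5714:HEAD svn
```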

3. Merge the two old vss branches

We want to merge the two old vss histories together (in correct time order) so we replicate the svn merge in a nicer fashion.

  • First we check out one of the branches: ‘git checkout -b merged Tree-B-Old’. This creates a new local branch ‘merged’ which is a copy of the remote ‘Tree-B-Old’, since we can’t edit remote branches directly.

  • We then merge the Tree-A-Old remote branch into ‘merged’ by running ‘git merge Tree-A-Old’. This creates a merge commit (i.e. a commit with more than one parent). The message can be edited by using ‘git commit --amend’, but the existing one is probably fine.
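
One caveat if you try this today: Git 2.9 and later refuse to merge two histories that share no common ancestor unless --allow-unrelated-histories is passed. The toy sketch below shows the idea on a throwaway repository. The branch names tree-a and tree-b, the file names and the identity are made-up stand-ins for the real Tree-A-Old and Tree-B-Old remotes.

```shell
#!/bin/sh
set -e

# Throwaway repository and identity, for illustration only.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# Stand-in for Tree-A-Old: a first, independent root commit.
git symbolic-ref HEAD refs/heads/tree-a
echo a > a.txt
git add a.txt
git commit -q -m "Tree-A history"

# Stand-in for Tree-B-Old: a second root commit with no shared ancestor.
git checkout -q --orphan tree-b
git rm -rfq .
echo b > b.txt
git add b.txt
git commit -q -m "Tree-B history"

# Same shape as the steps above: branch from one tree, merge the other in.
git checkout -q -b merged tree-b
git merge -q --allow-unrelated-histories -m "Merge Tree-A and Tree-B" tree-a

# A merge commit lists itself followed by two parent ids.
git rev-list --parents -n 1 merged
```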

4. Grafting the branches together

The next step is to graft the branches. Essentially we take the start point and force it to have the merge commit as a parent.

  • Note that since git stores the state of the tree at each commit rather than the diff from the parents, we can essentially tell it that a given commit has a different parent commit and it will handle it fine.

  • Add a graft entry to .git/info/grafts. The format of this file is a series of lines of the form $child_id $parent_id (the commit first, followed by the parent or parents it should appear to have). We thus add an entry with $trunk_base_id $merged_head_id. To get the ids we can run ‘git log merged’ and take the head/top commit id, and run ‘git log svn/trunk’ and take the root/bottom commit id.

  • This is a temporary graft (i.e. the link will only remain whilst the grafts file is there), which most tools should honour, but we want to make it permanent. We do this by running git filter-branch over each of the svn branches: ‘git filter-branch svn/trunk svn/v3.0 …’
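
Rather than reading the ids out of git log by eye, they can be looked up with git rev-parse and git rev-list. The sketch below does that on a throwaway repository, where the branch called trunk is a made-up stand-in for svn/trunk and the commits are dummies. Note that a grafts line names the commit first and its new parent second. As far as I know, recent Git versions still read this file but print a hint that info/grafts is deprecated in favour of git replace --graft.

```shell
#!/bin/sh
set -e

# Throwaway repository and identity, for illustration only.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# Stand-in for the merged vss history.
git symbolic-ref HEAD refs/heads/merged
echo old > old.txt
git add old.txt
git commit -q -m "merged vss history"

# Stand-in for svn/trunk, which starts from its own root commit (r5714).
git checkout -q --orphan trunk
git rm -rfq .
echo new > new.txt
git add new.txt
git commit -q -m "r5714"

# Head of the merged branch, and the root (parentless) commit of trunk.
merged_head=$(git rev-parse merged)
trunk_base=$(git rev-list --max-parents=0 trunk)

# Graft line: the commit first, then the parent it should appear to have.
mkdir -p .git/info
echo "$trunk_base $merged_head" >> .git/info/grafts
cat .git/info/grafts
```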

5. Get git-svn to use the new remote branches

  • git-svn adds an entry of the form git-svn-id: svn://svn.../Trunk@40403 <UUID of svn serv> to each commit. It uses this to rebuild its cache in the event of corruption.

  • Forcing git-svn to use the new branches is quite easy. Remove the folder .git/svn and rerun ‘git svn fetch svn’. git-svn will then rebuild its cache and start from where we left off.

B1. Cleaning history

The history can also be imported directly from a local copy of the svn data. This makes the svn import much faster, but the commits need some post cleanup for git-svn to sync up with the real svn server. Specifically the git-svn-id entries will come out as follows:

git-svn-id: file:///home/james/src/repo/Tree-A-Rootname@3161 1923097a-7eed-ce49-a323-f810e19527ea

but we need it in the form of:

git-svn-id: svn://svn-server-url/repository/Tree-A-Rootname@3161 1923097a-7eed-ce49-a323-f810e19527ea

This is where we use the git filter-branch command again, in particular its --msg-filter argument, which takes a chunk of shell script that is run for every commit, with the current commit message given on stdin and the new message read from stdout. In this case we use it as follows:

git filter-branch --msg-filter "sed -e 's#file:///home/james/src/repo/#svn://svn-server-url/repository/#g'" -f svn-trunk/trunk svn-trunk/v3.1 svn-trunk/v3.2 svn-trunk/v3.3 svn-trunk/v4.0

The other arguments given here are -f, which forces filter-branch to run even if there are backups from previous filter runs (it also seems to be needed on Windows), and the names of the branches to be updated.
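
Before rewriting every commit, it is worth trying the sed expression on a single sample line. The sketch below feeds the example git-svn-id line from above through a sed expression that maps the local prefix onto the server one. Mapping the directory name repo to repository is an assumption taken from the two example lines, so adjust it to your own paths.

```shell
#!/bin/sh
# Sample trailer line as produced by the local file:// import.
old='git-svn-id: file:///home/james/src/repo/Tree-A-Rootname@3161 1923097a-7eed-ce49-a323-f810e19527ea'

# Same kind of substitution that --msg-filter would apply to each message.
new=$(printf '%s\n' "$old" |
  sed -e 's#file:///home/james/src/repo/#svn://svn-server-url/repository/#g')

printf '%s\n' "$new"
```

If the printed line matches the form git-svn expects for the real server, the expression is safe to hand to filter-branch.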

In my particular case I was also running a name/email fixup, with the entire command being:

git filter-branch --msg-filter "sed -e 's#file:///home/james/src/repo/#svn://svn-server-url/repository/#g'" --env-filter 'export GIT_AUTHOR_NAME="$(echo "$GIT_AUTHOR_NAME" | sed -f /c/SG/authors)";export GIT_AUTHOR_EMAIL="$(echo "$GIT_AUTHOR_EMAIL" | sed -f /c/SG/emails)"' -f svn-trunk/trunk svn-trunk/v3.1 svn-trunk/v3.2 svn-trunk/v3.3 svn-trunk/v4.0

The --env-filter lets you change environment variables, in this case GIT_AUTHOR_NAME and GIT_AUTHOR_EMAIL, which I updated using two sed scripts.

The contents of /c/SG/authors were:

s#^.*no author.*#Simeon Pilgrim#gi
s#user1#User One#gi
s#user2#User Two#gi
s#spilgrim#Simeon Pilgrim#gi

And /c/SG/emails:

s#^.*no author.*#simeon.pilgrim@example.com#gi
s#user1.*#user1@example.com#gi
s#user2.*#user2@example.com#gi
s#spilgrim.*#simeon.pilgrim@example.com#gi
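
The two sed scripts can be checked on their own before running filter-branch. The sketch below writes them to a scratch directory (with the .* patterns written unescaped) and pipes a few sample values through them. It assumes GNU sed, since the i flag on s is a GNU extension, and the sample email of the form user@repository-UUID is my understanding of what git-svn generates when no authors file is used.

```shell
#!/bin/sh
set -e

# Scratch copies of the two scripts; the real ones lived in /c/SG.
dir=$(mktemp -d)

cat > "$dir/authors" <<'EOF'
s#^.*no author.*#Simeon Pilgrim#gi
s#user1#User One#gi
s#user2#User Two#gi
s#spilgrim#Simeon Pilgrim#gi
EOF

cat > "$dir/emails" <<'EOF'
s#^.*no author.*#simeon.pilgrim@example.com#gi
s#user1.*#user1@example.com#gi
s#user2.*#user2@example.com#gi
s#spilgrim.*#simeon.pilgrim@example.com#gi
EOF

# Sample inputs, shaped like the values --env-filter would see.
name=$(echo 'spilgrim' | sed -f "$dir/authors")
noauthor=$(echo '(no author)' | sed -f "$dir/authors")
email=$(echo 'spilgrim@1923097a-7eed-ce49-a323-f810e19527ea' | sed -f "$dir/emails")

printf '%s\n%s\n%s\n' "$name" "$noauthor" "$email"
```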

This post was written by James, hence the /home/james/ paths. I just tidied up the grammar and removed workplace-specific information.