migrating coderanch from svn to git

CodeRanch has been using SVN for a long time for the forum software. It’s high time to switch. We have just over 20 projects in our one SVN repository. Most are small/inactive so it wouldn’t be terrible to have the history in only the biggest project. However, I wanted to try to do it “right” and migrate the history of each project into a separate git repository.

Choosing GitLab as a provider

We chose to use GitLab instead of GitHub for a few reasons:

  1. GitLab has free organizations (security grouping) for private repositories. It also allows multiple admins, which you can’t do in a free GitHub repo.
  2. GitLab has a built-in continuous integration tool. (We aren’t using that yet, but want to leave the option open.)
  3. A couple moderators use GitLab professionally and have had good experiences with it.


Getting a local dump of the SVN database

Migrating a remote repository involves a large number of network roundtrips. It’s far faster to export the SVN repository/database to a local dump file. Starting with SVN 1.7, this is a completely client side operation.

However, I’ve been using SVN through Eclipse so I didn’t actually have a SVN 1.7 command line installed. The first thing I did was install a command line of SVN 1.7:

brew install subversion

Then I got a dump of the whole repository. I used svnrdump so I didn’t need to sign on to the machine with the repo. I ran:

svnrdump dump '<url>' > full.dmp

We have just under ten thousand revisions so the dump took about 20 minutes. The full dump was just under 800MB. (A lot of this is duplication in tags. The GitLab repository is about a quarter the size.)

Create authors.txt

I decided to do it by hand. I looked at the conf/htaccess-projects.acl SVN file since I have admin access to the server. There were 90 users. I wound up migrating them all (which was a poor decision.)

This hit or miss approach works (slowly) because “git svn clone” complains if it encounters an author that isn’t defined. This let me go back and add that person without having to manually do a lot of analysis. Luckily, “git svn clone” does let you resume from where you left off.

Author: xxx not defined in authors.txt file
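
For reference, each line of authors.txt maps an SVN username to a Git identity in the form "svnuser = Full Name <email>". Rather than building the file by hand, a first draft can probably be generated from the SVN log. The following is only a minimal sketch: the function name authors_from_log and the @example.com addresses are placeholders I made up, and the real names and emails still need to be filled in afterwards.

```shell
# Read `svn log -q` output on stdin and print one authors.txt line per
# distinct committer. The name and email are placeholders (assumption):
# edit the generated file before using it for a real migration.
authors_from_log() {
  awk -F'|' '/^r[0-9]+ [|]/ {
      name = $2
      gsub(/^ +| +$/, "", name)   # trim the spaces around the username
      if (name != "") print name " = " name " <" name "@example.com>"
  }' | sort -u
}

# Hypothetical usage against a local copy of the repository:
#   svn log -q file:///path/to/local-svn | authors_from_log > authors.txt
```

Running svn log against the repository URL (or a local file:// copy) avoids needing a checkout at all.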

Use git-svn bridge

I used the git-svn bridge for the actual migration. This was only a few steps for the small projects that didn’t have branches or tags:

  1. Use the git-svn bridge to clone just the part of the repo we want: git svn clone <url>/svn/project/ -A authors.txt new-git-repo
  2. Add an origin on my machine: git remote add origin git@gitlab.com:coderanch/new-repo.git (fun fact, you don’t need to create the repository on gitlab. When you push, it gets created for you.)
  3. Push all: git push -u origin --all
  4. Push tags (we don’t have any for most projects so skipped this): git push --tags
  5. Repeat for all our projects. (We have about 20)
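
Since the same steps repeat for every project, they could be wrapped in a small helper. This is a sketch rather than what I actually ran: the function names run and migrate_project, the DRY_RUN switch, and the example URL are all made up for illustration, and it assumes each Git repository is named after its SVN project.

```shell
# Print the command instead of running it when DRY_RUN is set,
# which makes it easy to review what would happen first.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# migrate_project SVN_BASE_URL PROJECT GITLAB_GROUP
# Clones one project with git svn, then pushes branches and tags to GitLab.
migrate_project() {
  svn_base=$1
  project=$2
  group=$3
  run git svn clone "$svn_base/$project/" -A authors.txt "$project" &&
  run git -C "$project" remote add origin "git@gitlab.com:$group/$project.git" &&
  run git -C "$project" push -u origin --all &&
  run git -C "$project" push origin --tags
}

# Hypothetical usage (project names and URL are placeholders):
#   for p in ProjectA ProjectB; do
#     migrate_project 'https://svn.example.com/svn' "$p" coderanch
#   done
```

Setting DRY_RUN=1 first prints the four git commands per project without touching anything.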

It turned out some of the projects in our repo have trunk/branches/tags inside. So I needed the command to use the standard layout (the -s flag):

git svn clone -s https://svn.javaranch.com/svn/project -A authors.txt new-repo

For branches

We hardly have any branches so I didn’t search for the optimal way. I used a three step procedure.

  1. I used the UNIX script from sailmaker to convert the SVN branches to Git branches.
    for branch in `git branch -r | grep "branches/" | sed 's/ branches\///'`; 
    do   
       git branch $branch refs/remotes/$branch 
    done
  2. Checked out each branch
    git checkout -b <branch_name>
  3. Sent it on to the remote repo: git push --all

This approach leaves stray remote branches. They are visible using git branch -a, but not in the GitLab UI. They seem to go away if you clone the repository to another directory so I’m thinking they are local branches and were never pushed to GitLab.

For the big repo with tags

We have 396 tags for the forum software and zero tags for all our other projects. I tried migrating our entire repository including tags using “git svn clone.” It took too long. Migrating 9K+ commits, 800MB and 300+ tags is really slow. After running (against my local SVN) for over 24 hours, it was less than half done.

I realized the bulk of the time was going to migrating tags. So I did the “git svn clone” migration for just the trunk and branches. Then I dealt with the tags using a script.

First I created a local repo since this was the big one. This took about 20 minutes to import:

  • svnadmin create jforum-local-svn
  • svnadmin load jforum-local-svn < full.dmp

Then I did the clone:

git svn clone file:///<local repo> -A authors.txt \
  --trunk=JForum --branches=branches \
  jforum-from-local-dump-with-trunk-and-branches

Finally, I was ready to add gitlab as a remote. Again, I migrated the branches manually since there weren’t a lot.

Then I dealt with the tags. But that warrants a separate blog post.

Another possibility

There are a number of tools called svn2git. The most promising one looks like this one per this post.

I didn’t try it because I was almost done at that point. Also, it requires you to build from source. Wasn’t worth the effort.

References

  • https://john.albin.net/git/convert-subversion-to-git
  • https://www.mugo.ca/Blog/Splitting-a-Subversion-repository-into-multiple-repositories
  • https://daneomatic.com/2010/11/01/svn-to-multiple-git-repos/
  • https://git-scm.com/docs/git-svn
  • https://www.getdonedone.com/converting-5-year-old-repository-subversion-git/ – uses master/trunk/branches structure – why want that?
  • https://github.com/nirvdrum/svn2git – uses git-svn bridge and does cleanup after. so presumably the same performance issue

how not to migrate from subversion to git

You know how blog posts typically cover what worked, and not all the things people tried that didn’t work? This post is dedicated to what didn’t work.

Don’t do this #1 – Migrate from a remote repository

Migrating from SVN to Git requires a large number of network roundtrips (for a large repository.) This slows things down greatly. It’s better to export/dump the repository and run everything locally.

See the main blog post for how to create a local dump/repo.

Don’t do this #2 – Split the dump by project

I had the idea to split the SVN full dump file into smaller SVN dump files by project. I chose to preserve revision numbers and not use “renumber-revs”. We used the revision numbers in our release notes. Here’s a sample command:

svndumpfilter include "IntegrationTests" --drop-empty-revs < full.dmp \
  > project_IntegrationTests.dmp

We had one project that consists of the majority of the SVN code base (the forum software.) All of the tags were for this project. I thought to import this one as “full.dmp” and just delete the “trunk” projects afterwards for this one. That way I’d only be filtering the smaller/safer ones.

None of this was necessary! You can just point the migration at the same full SVN dump with different paths to migrate projects into their own repositories.

Don’t do this #3 – Check out the entire repository including tags

Migrating using “git svn clone” requires an authors.txt to map SVN users to Git names/emails. I had the idea to check out the entire repository including tags and run svn log on it to get the committers. After 90 minutes, I gave up on this idea.

Don’t do this #4 – Assume that all authors/committers are people

There were a couple commits from Jenkins which seems reasonable. There were also a couple commits as “root”, “test” and other random users. Looking at the readme.txt from one of those commits, it looks like a command line import.

Don’t do this #5 – Guess at what should be in the authors.txt file

We have about 90 users in our authors.txt file. I thought I would save time by only putting the people I thought were committers in the authors.txt. This was a problem for a few reasons:

  • About 30 people committed to the main project
  • A few people committed who no longer have access to the code base.
  • We had some “funky” committers including “root” and “test”

This meant I kept running the “git svn clone” command, having it fail on missing users, adding them to authors.txt and resuming the run (re-running automatically resumes).

It would have been better to use svn log on trunk to get all the authors or the --authors-prog flag to specify a command to fill in any defaults. This would have let me write “Unknown” for the funky ones and be done with it.
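
For the --authors-prog route, git svn runs the given program with the SVN username as its first argument whenever that user isn’t in the authors file, and expects a single “Name <email>” line back. A minimal sketch of such a fallback is below. I didn’t run this during the migration; the function name fallback_author and the @example.com addresses are placeholders.

```shell
# fallback_author SVN_USERNAME
# Prints a "Name <email>" line, the format git svn expects from the
# program given to --authors-prog. Addresses are placeholders (assumption).
fallback_author() {
  case "$1" in
    root|test)
      # The "funky" non-person accounts all become Unknown.
      echo "Unknown <unknown@example.com>" ;;
    *)
      # Anyone else missing from authors.txt keeps their SVN username.
      echo "$1 <$1@example.com>" ;;
  esac
}

# To use it, save the function in an executable script that ends with
#   fallback_author "$1"
# and pass that script to git svn, for example:
#   git svn clone <url> -A authors.txt --authors-prog=/path/to/script new-repo
```

With a fallback like this in place, the clone no longer stops on the first undefined author.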

Don’t do this #6 – Make assumptions about project structure

At the top level, the repository had:

  • about 20 projects (directly at the root level, not under trunk)
  • a branches directory
  • a tags directory

I foolishly assumed that meant that the 20 projects had the code directly inside them. And sometimes that was true. However, for about 5 projects, there was a nested trunk/branches/tags structure under that project.
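
A quick check up front would have avoided the guessing. As a rough sketch (the function names has_standard_layout and layout_flag are made up, and this only looks for a trunk/ directory directly under the project), the output of svn ls can decide whether the clone needs the -s flag:

```shell
# Reads `svn ls` output on stdin and succeeds if a trunk/ entry is listed.
# svn ls prints directories with a trailing slash.
has_standard_layout() {
  grep -qx 'trunk/'
}

# Prints "-s" when the listing on stdin shows a standard layout,
# and prints nothing otherwise.
layout_flag() {
  if has_standard_layout; then
    echo "-s"
  fi
}

# Hypothetical usage (url and project are placeholders):
#   flag=$(svn ls "$url/$project" | layout_flag)
#   git svn clone $flag "$url/$project" -A authors.txt "$project"
```

The unquoted $flag in the usage line is deliberate so that an empty value adds no argument.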

We all know that thing about standards. There are so many….

Don’t do this #7 – Migrate 300 large tags

This project uses Ant (and not Ivy) so there are a lot of jar files in the repository. This means tags are large. With just under ten thousand commits and just under 400 tags, this proved to be just too much.

Watching the “git svn clone” procedure, it goes through commit 1-n as it goes. This means the later commits/tags need to go through a large amount of work to make progress. Despite that, it was surprisingly linear.

After 12 hours, it had migrated 2700 commits and after 26 hours, it was up to commit 5446. At the 18 hour mark, it was up to commit 6926. (At the 24 hour mark, I decided to abandon this approach. I let it run until I needed to shut down my computer to see what would happen.)

Most of the wasted time was for the tags. In SVN, a tag is a copy. In Git, it is just a label, so this is a lot of unnecessary duplication in a migration.

See another approach for migrating tags

sonarqube and the scm plugin

I tried enabling the SCM integration from our CodeRanch Sonar install to our SVN repository. It didn’t quite work out the way I was hoping. But I learned stuff so it makes for a good blog entry.

Wait? The SCM Activity Plugin is deprecated?

If you search for Sonar SCM plugin, you get to the plugin documentation page. Which appears to be useful. Except that it says the plugin has been deprecated since Sonar 5.0. I stopped reading at that point and puzzled over it. If you continue to read, it says “This plugin is deprecated since SonarQube 5.0 which has built-in support for SCM information and which relies on independent plugins to cover SCM providers.” Ah. So it is deprecated because it is now a core feature. That explains why the functionality shows up on our Sonar.

Time to enable

I generated another SVN user and set the username/password in Sonar. I then set “disable the SCM sensor” to false and ran our Jenkins build. And I got this. Well, it went on for pages, but you get the idea:

[sonar:sonar] Sensor SCM Sensor
[sonar:sonar] SCM provider for this project is: svn
[sonar:sonar] 1404 files to be analyzed
[sonar:sonar] 0/1404 files analyzed
[sonar:sonar] Missing blame information for the following files:
[sonar:sonar] * /data/vhost1/ciengine.javaranch.com/data/jobs/JForum/workspace/src/main/java/net/jforum/util/bbcode/BBCodeHandler.java
[sonar:sonar] * /data/vhost1/ciengine.javaranch.com/data/jobs/JForum/workspace/src/main/java/net/jforum/util/concurrent/Executor.java

Ok. So the credentials are good as Sonar can connect to see how many files there are. Google and the SVN plugin page both suggested the problem was being on Subversion 1.5 or lower. I looked around and that didn’t appear to be the case. svnadmin --version returned 1.7.8 and the repository was on 1.6. So I thought maybe the mismatch was the problem and tried to run svnadmin upgrade svn (svn is the name of the folder containing our repository on disk). Then this happened and I had to recover from it.

Then it worked

After I upgraded the Subversion repository (or whatever the hell I did by accident), I was able to commit and trigger another Jenkins/Sonar build. I saw the person who committed lines of code in each file. And what’s really cool is that it WENT BACK IN TIME. I see the committers that aren’t active on CodeRanch anymore and haven’t contributed code in years. That is really cool.

What I was hoping it would do vs what it actually does

I wanted to see the “scm blame” output when reviewing files in Sonar. That works and I see it as being useful. I can see if something came from me (or another current developer.) Or if it came from someone no longer involved in the code base.

I wanted to see coverage on new code. This works depending on how you define new code. If you define it as “touched” code, it works well. You also have to remember to use a differential view so Sonar knows how far back to consider “new” code. This makes sense. I typically use the last 7 days as my view so “new” means “touched in the last 7 days”. Unfortunately, we don’t currently run our integration tests in Jenkins so this isn’t useful for the back end layer. I had started dealing with that. Maybe I should finish :).

I had read about the “Developer Cockpit” and seeing a view of your own contributions. I hadn’t realized that was a paid enterprise feature so not helpful to us at CodeRanch.  Without the cockpit, you can still go to “My account” to see a table with “leaks” (issues you created in the past week) and a link to all of your issues. Unfortunately Sonar thinks half the issues are mine and the others aren’t assigned:

  1. I was the one who committed the initial JForum fork we made which means every single open source issue is treated as mine.  (I don’t remember if I did or not, but I led that project so I wouldn’t be surprised.)
  2. Most of the developers don’t have accounts on Sonar. Only the admins do. Everyone else has been using it anonymously. (It’s behind an Apache password wall so not public.)

On the bright side, you *can* filter issues by author now. Which is the last committer. But still helpful. If the code is new, the last committer is someone who might look at it. If the committer is someone who isn’t around anymore, that’s informative.