19 May 2011

Importing SourceForge code into Apache for Jena

The Jena project now has the legal paperwork done for the vast majority of the codebase. It's now time to move the code from SourceForge, where it's been for almost 10 years (the project was registered in November 2001).

During that time, the SourceForge infrastructure has been excellent. We're not moving because of dissatisfaction but because we want to put the post-HP project on a solid legal basis, where the license and IP situation is well understood and completely clear. We now have committers in 3 different organisations, and contributions from yet more, so it's slowly getting more and more complicated.

The way Apache works is that software is granted to Apache, which gives Apache the right to re-license it. Any software you use from Apache comes with a license (with IP guarantees etc.) from Apache to you, not between you and the original contributor, so when you use the software commercially you only need to check one Apache license.

Until now, we have had a setup where any contributions are simply incorporated with the license and conditions of the contributor. It so happens that all the licensed code in the codebase is under the same BSD-type license, but in using Jena you don't get a single license: you get one from every contributor. For some people who are going to depend on Jena for commercial use or large, long-term deployments, this matters. We've had a user crawl the codebase to check each of the licenses (as they should but it's just work). With Apache it's different - one license, well-understood legal situation.

Contributors grant software in two ways: either with a software grant document, or when they upload code to a mailing list or to Jira. When you add something to Jira there's a tick box to say you are making the grant to Apache; without it, while the attachment may illustrate some issue, we can't use it in the codebase.

Apache use subversion so Jena needs to import the code base to svn.

Subversion or git or Mercurial ...

Aside: one question I've been asked is why not a DVCS like git or Mercurial. Apache use Subversion. As I understand it, there are legal matters to consider. Suppose A pushes code to B and B pushes to Apache. A has not necessarily granted the software to Apache - B could check but it's a new burden for B, and pushing to Apache is B's responsibility but B does not own A's contribution. Maybe this will change sometime but at the moment, DVCS works for direct contributor-to-user licensing (and the user "should" then check every license) but not for the consolidation offered by Apache.


Jena has three repositories: Jena in CVS, Jena in SVN and Joseki in CVS. There are active projects in all of them but there is also a lot of history and legacy. We want to import everything as a record of ownership, not just copy the latest working copy.

This is the process I have put together:

1. Grab the repositories

SourceForge offer rsync access for backup, with history (the tarballs are just the current state).
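
As a rough sketch of this step, the mirroring can be wrapped in a small helper. The rsync host and module names below are assumptions based on the general pattern SourceForge used for project backups, not something taken from this write-up, so check them against the SourceForge documentation before relying on them. `mirror_url`, `mirror` and `DRY_RUN` are names made up for illustration.

```shell
#!/bin/sh
# Mirror a SourceForge repository, history included, into a local directory.
# ASSUMPTION: the rsync URL patterns are illustrative; verify them with SourceForge.

# mirror_url KIND PROJECT -> prints the rsync source for a "cvs" or "svn" repository
mirror_url() {
    case "$1" in
        cvs) echo "rsync://$2.cvs.sourceforge.net/cvsroot/$2/" ;;
        svn) echo "rsync://$2.svn.sourceforge.net/svn/$2/" ;;
        *)   echo "unknown repository kind: $1" >&2; return 1 ;;
    esac
}

# mirror KIND PROJECT DEST -> copy the repository into DEST,
# or only print the rsync command when DRY_RUN=1
mirror() {
    src=$(mirror_url "$1" "$2") || return 1
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo rsync -av "$src" "$3/"
    else
        mkdir -p "$3" && rsync -av "$src" "$3/"
    fi
}

# Example (prints the command without contacting SourceForge):
#   DRY_RUN=1 mirror cvs jena Jena-CVS
```

The trailing slash on the source makes rsync copy the contents of the remote directory into the destination rather than nesting another directory inside it.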

2. Convert CVS to SVN

We have a multi-project layout so cvs2svn needs some arguments.

MODS="ARQ BRQL DataGenerator Eyeball EyeballAcceptance Scratch extras grddl gvs iri jena jena-perf jena2 modeljeb owlsyntax rdf-html sparql2sql"

SVN=ASF-Jena-CVS   # Destination
CVS=../Jena-CVS    # Local rsync backup

# --existing-svnrepos loads into a repository that already exists at $SVN
for m in $MODS
do
    echo "==== $m"
    # Reset ARGS for each module so options do not pile up across iterations
    ARGS="--encoding=utf8 --encoding=iso-8859-1"
    # Create trunk/branches/tags structure per project
    ARGS="$ARGS --trunk=$m/trunk --branches=$m/branches --tags=$m/tags"
    cvs2svn $ARGS --existing-svnrepos --svnrepos "$SVN" $CVS/$m
done

and much the same for Joseki except the modules list is just "Joseki1 Joseki3 Joseki3" and it is much faster.

Dry-run this first: it showed up two problems.

The "--encoding=utf8 --encoding=iso-8859-1" is there to get the translation of some people's names right (non-ASCII characters).

A name clash in Joseki couldn't be resolved. Fortunately, it was with some old intermediate binaries, so deleting them from CVS (the joy of CVS using the filesystem layout) was simplest.
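
A dry run along these lines might look like the sketch below. It reuses the options from the conversion loop and adds cvs2svn's `--dry-run` flag, which reports what would be done without writing to the repository. The helper names `module_args` and `dry_run` are invented here, and the assumption is that the target repository options can stay the same as in the real run.

```shell
#!/bin/sh
# Dry-run the per-module conversion to surface problems before converting for real.
SVN=${SVN:-ASF-Jena-CVS}     # destination repository, as in the conversion script
CVS=${CVS:-../Jena-CVS}      # local rsync backup

# module_args MODULE -> prints the cvs2svn options for one module
module_args() {
    echo "--encoding=utf8 --encoding=iso-8859-1" \
         "--trunk=$1/trunk --branches=$1/branches --tags=$1/tags"
}

# dry_run MODULE... -> runs cvs2svn --dry-run per module and
# reports the modules for which cvs2svn exited with an error
dry_run() {
    failed=""
    for m in "$@"; do
        echo "==== $m"
        # $(module_args ...) is deliberately unquoted so it splits into options
        cvs2svn --dry-run $(module_args "$m") \
            --existing-svnrepos --svnrepos "$SVN" "$CVS/$m" || failed="$failed $m"
    done
    [ -z "$failed" ] || { echo "problems in:$failed" >&2; return 1; }
}

# Example:
#   dry_run jena ARQ Eyeball
```

Collecting the failing modules instead of stopping at the first one gives a complete list of problems in a single pass.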

3. Dump the repositories

Use "svnadmin dump" and gzip the files. They are going to be uploaded to an Apache machine and they are quite large - 3.1G to upload from my home cable connection (1.5Mbit up).
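
One way to script the dump, as a sketch only: the output names follow the `Imports/<repository>.svn.gz` pattern used by the load commands later on, while the helper names `dump_name` and `dump_repo` are made up for illustration.

```shell
#!/bin/sh
# Dump a Subversion repository to a gzipped file ready for upload.

# dump_name REPO -> name of the compressed dump file for a repository directory
dump_name() {
    echo "Imports/$(basename "$1").svn.gz"
}

# dump_repo REPO -> verify the repository, dump it, then test the archive
dump_repo() {
    out=$(dump_name "$1")
    mkdir -p Imports || return 1
    # Verify first: the exit status of a pipeline is gzip's, so a failing
    # "svnadmin dump" could otherwise go unnoticed.
    svnadmin verify --quiet "$1" || return 1
    # --quiet suppresses the per-revision progress messages on stderr
    svnadmin dump --quiet "$1" | gzip > "$out" || return 1
    gzip -t "$out"           # check the archive is readable before uploading
}

# Example:
#   for r in ASF-Jena-CVS ASF-Jena-SVN ASF-Joseki-CVS; do dump_repo "$r"; done
```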

4. Import to subversion

This step has been done by the Apache Infrastructure team as it requires svnadmin access to the repository. See INFRA-3628 for the details.

It's good to check it's going to do the right thing first. We now have the files for three repositories. We want the imported svn to look like:

incubator/jena/Import/Jena-CVS
incubator/jena/Import/Jena-SVN
incubator/jena/Import/Joseki-CVS

so we have a permanent record of the code state at the start of the Apache svn. After import, active projects can be "svn copy"ed out to give the working versions going forward.

To test it's going to work when the Apache Infrastructure team do the actual import, I built a local repo in the same layout.

# ---- Create the layout in Apache repository
mkdir -p Layout/incubator/jena/Import/Joseki-CVS
mkdir -p Layout/incubator/jena/Import/Jena-CVS
mkdir -p Layout/incubator/jena/Import/Jena-SVN
svnadmin create ApacheRepo
svn import Layout/ file://$PWD/ApacheRepo -m "Set layout"
rm -rf Layout

then it's just a matter of inserting the code in the right place:

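# --- Imports
# REPO is the local test repository created above (svnadmin takes a path, not a URL)
REPO="$PWD/ApacheRepo"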

# Joseki-CVS
gzip -d < Imports/ASF-Joseki-CVS.svn.gz | \
     svnadmin load --parent-dir incubator/jena/Import/Joseki-CVS $REPO

# Jena-CVS
gzip -d < Imports/ASF-Jena-CVS.svn.gz | \
     svnadmin load --parent-dir incubator/jena/Import/Jena-CVS $REPO

# Jena-SVN
gzip -d < Imports/ASF-Jena-SVN.svn.gz | \
     svnadmin load --parent-dir incubator/jena/Import/Jena-SVN $REPO

The slow bits were cvs2svn (it's not bad but it's not instant: an hour or so), the upload to Apache (a couple of hours) and checking the "svnadmin load" (another couple of hours).
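
Checking the local load can be partly automated. The following is a minimal sketch, assuming the local test repository is at `$PWD/ApacheRepo` as above; `expected_dirs` and `check_layout` are illustrative names, and the check only covers the top-level import directories, not their content.

```shell
#!/bin/sh
# Check that the loaded test repository is intact and has the expected layout.
REPO=${REPO:-$PWD/ApacheRepo}    # absolute path to the local test repository
BASE=incubator/jena/Import

# expected_dirs -> the import directories, in the form "svn ls" prints them
expected_dirs() {
    printf '%s\n' Jena-CVS/ Jena-SVN/ Joseki-CVS/
}

# check_layout -> succeeds if the repository verifies and the listing matches
check_layout() {
    svnadmin verify --quiet "$REPO" || return 1
    actual=$(svn ls "file://$REPO/$BASE" | LC_ALL=C sort) || return 1
    [ "$actual" = "$(expected_dirs)" ]
}

# Example:
#   check_layout && echo "layout OK"
```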

5. Extract working copies

We're keeping the imports unchanged as a record of the starting point at Apache (revision 1124118).
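
Extracting a working version is a server-side "svn copy" out of the import area, which keeps the history. The sketch below only illustrates the shape of the command: the repository root URL and the destination path `Jena2/jena/trunk` are assumptions for the example, not the locations actually chosen, and `extract` is a made-up helper. It prints the command by default and only runs it when `DRY_RUN=0`.

```shell
#!/bin/sh
# Copy a project out of the untouched import area to a working location.
# ASSUMPTION: ROOT and the destination paths are hypothetical examples.
ROOT=${ROOT:-https://svn.apache.org/repos/asf/incubator/jena}
REV=1124118      # starting-point revision mentioned above

# extract SRC DST -> copy Import/SRC (as of REV) to DST
extract() {
    src="$ROOT/Import/$1"
    dst="$ROOT/$2"
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "svn copy --parents -r $REV $src $dst"
    else
        # --parents creates any missing intermediate directories at the destination
        svn copy --parents -r "$REV" "$src" "$dst" \
            -m "Copy $1 from the import area as a working version"
    fi
}

# Example (prints the command only):
#   extract Jena-CVS/jena/trunk Jena2/jena/trunk
```

Pinning the source to the import revision means the working copy starts from exactly the recorded state, even if something is later committed under the import area by mistake.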

The whole process has been done now - Jena code at Apache