data transformation: an example with jforumApril 22nd, 2012 by Jeanne Boyarsky
Rafael sent me a copy of his jforum database so we could use the data on jforum. It’s not a trivial import so I thought it would be interesting to walk through what I did. I’ve ommitted the details of the dead ends but tried to at least allude to them for those learning to develop.
Step 1: Importing what Rafael sent to make it easy to look at and be able to run queries
It took me a little while to notice I that when I asked for a “data dump”, I wasn’t specific enough. I got a mysql specific format dump.
If you were importing a postgres dump, it would be simple:
- In pgadmin, create another database named raf_dump.
- Import data from the command line using: “psql -h localhost -U postgres -f jforum.sql -d raf_import” or via pgadmin open file.
The “right” way to deal with a mysql dump would have been to import it into mysql and then export a dump in postgres compatibility mode. I stupidly tried to avoid installing mysql and instead:
- Copy the create table statement from the dump and convert the types to postgres types.
- Run the insert that inserts all the data.
- Download dmg image
- Install both dmg files
- Then click MySql.prefpane to add to system preferences. This lets you click start or start always on machine starup. You can get back to it by going to system preferences and seeing the “other” section at the bottom.
- sudo su – root
- cd /usr/local/mysql/bin/
- Create database: ./mysqladmin create raf_jforum
- ./mysql -u root mysql
- create user ‘jforum’@'localhost’identified by ‘xxx’;
- create user ‘jforum’@’127.0.0.1 by ‘xxx’; (I don’t think this was needed but I typed it in so including it here to be thorough.) Obviously pick a better password than xxx.
- grant all on *.* to ‘jforum’@'localhost’
- grant all on *.* to ‘jforum’@’127.0.0.1′ (same deal, I don’t think this was necessary)
- Install Sequel pro to have a nice GUI for querying. (It also uses Mac shortcuts for newline and allows you to copy/paste directly. Much better than the postgres client.)
- In Sequel pro, connect to the database.
- In Sequel pro, File > Import > point to sql file. It was really fast The 34 meg import took less than 30 seconds.
- Download the JDBC driver for mysql.
Step 2: Analysis
There are just under 5000 topics and 20,000 posts. While we have a one to one mapping with the table names, we have a number of adjustments that need to be made. I’m listing the significant ones here along with my approach for handling.
|We’ve added columns to two of the tables.||Read the source data into a new file format and then use the java JForum API to create the topics/posts for consistency with our schema changes.|
|The topic/post id numbers the source data uses are already in use by our forum.||Using our Java API solves this as well.|
|The user ids of the source data are already in use by our forum and some of the members duplicate.||Have all posts show as being from the same “Migrated from jforum” user. The original user can be specified in the post as a footer.|
|The topics come from multiple forums in the source data and we want them to end up in the same forum.||Have the conversion program ignore the source data forum and choose our JForum forum id.|
|Some forums aren’t appropriate to migrate (off topic, jforum team relax, test forum)||Have the conversion program filter on which posts it migrates.|
After doing this analysis, I realize I don’t need the data in postgres at all. I need a flat file to run through a program on the server. I can generate that flat file from msql just as easily.
Summary of migration process:
- Delete data that don’t want to migrate
- Write out a flat file containing topic/post/post data/user info that we do want to migrate
- Write a servlet to read from that flat file and call the JForum APIs to add data on server (we used the servlet approach for the initial UBB to JForum migration so I can reuse parts of that.) The advantage is that it is using the actual JForum codebase to insert.
Step 3 – Code and test locally
For the flat file, I used the Java CSV library for generating my flat file since posts contain spaces, quotes, line breaks and other special characters. It’s good library. Easy to use and works exactly as you’d expect. The flat file generation code took about an hour to write and less than ten seconds to run.
For the servlet, coding was easier since I had my “starter program” from the UBB migration. I still hit a couple of bugs (that would have been avoided with unit testing). My analysis was good though. Few surprises when coding/testing. I did create helper objects for IntermediatePost and IntermediateTopic so I didn’t have to keep thinking about the fields.
Step 4 – run on test server
I had a network connection error while the program was running disconnecting my browser from the output. It’s still going in the background so I’m looking at the database counts increasing and the server error logs. I should have run it in smaller batches.
The servlet was also much faster on my machine than on the server. I’m guessing I have too many database trips or the fact that there is more existing data. It’s not horribly slow though and we don’t need to do this again so just letting it finish at it’s pace.
Step 5 – run on the real server
Smooth! The forum is live.
Thanks to Rafael for the data!
Posted: 22 April, 2012 in Java/Java EE.