data transformation: an example with jforum

Rafael sent me a copy of his jforum database so we could use the data on jforum.  It’s not a trivial import so I thought it would be interesting to walk through what I did.  I’ve ommitted the details of the dead ends but tried to at least allude to them for those learning to develop.

Step 1: Importing what Rafael sent to make it easy to look at and be able to run queries

It took me a little while to notice I that when I asked for a “data dump”, I wasn’t specific enough.  I got a mysql specific format dump.

If you were importing a postgres dump, it would be simple:

  • In pgadmin, create another database named raf_dump.
  • Import data from the command line using: “psql -h localhost -U postgres -f jforum.sql -d raf_import” or via pgadmin open file.

The “right” way to deal with a mysql dump would have been to import it into mysql and then export a dump in postgres compatibility mode.  I stupidly tried to avoid installing mysql and instead:

  1. Copy the create table statement from the dump and convert the types to postgres types.
  2. Run the insert that inserts all the data.
Don’t do this!  It doesn’t work.  Aside from type mismatches, I had encoding errors and escape character problems.  Two hours into this, I gave up and did what I should have done in the first place – install mysql.
To install mysql:
  1. Download dmg image
  2. Install both dmg files
  3. Then click MySql.prefpane to add to system preferences. This lets you click start or start always on machine starup.  You can get back to it by going to system preferences and seeing the “other” section at the bottom.
  4. sudo su – root
  5. cd /usr/local/mysql/bin/
  6. Create database: ./mysqladmin create raf_jforum
  7. ./mysql -u root mysql
  8. create user ‘jforum’@’localhost’identified by ‘xxx’;
  9. create user ‘jforum’@’127.0.0.1 by ‘xxx’;  (I don’t think this was needed but I typed it in so including it here to be thorough.)  Obviously pick a better password than xxx.
  10. grant all on *.* to ‘jforum’@’localhost’
  11. grant all on *.* to ‘jforum’@’127.0.0.1’ (same deal, I don’t think this was necessary)
  12. Install Sequel pro to have a nice GUI for querying.  (It also uses Mac shortcuts for newline and allows you to copy/paste directly.  Much better than the postgres client.)
  13. In Sequel pro, connect to the database.
  14. In Sequel pro, File > Import > point to sql file.  It was really fast  The 34 meg import took less than 30 seconds.
  15. Download the JDBC driver for mysql.

Step 2: Analysis

There are just under 5000 topics and 20,000 posts.  While we have a one to one mapping with the table names, we have a number of adjustments that need to be made.  I’m listing the significant ones here along with my approach for handling.

Mismatch Approach
We’ve added columns to two of the tables. Read the source data into a new file format and then use the java JForum API to create the topics/posts for consistency with our schema changes.
The topic/post id numbers the source data uses are already in use by our forum. Using our Java API solves this as well.
The user ids of the source data are already in use by our forum and some of the members duplicate. Have all posts show as being from the same “Migrated from jforum” user.  The original user can be specified in the post as a footer.
The topics come from multiple forums in the source data and we want them to end up in the same forum. Have the conversion program ignore the source data forum and choose our JForum forum id.
Some forums aren’t appropriate to migrate (off topic, jforum team relax, test forum) Have the conversion program filter on which posts it migrates.

After doing this analysis, I realize I don’t need the data in postgres at all.  I need a flat file to run through a program on the server.  I can generate that flat file from msql just as easily.

Summary of migration process:

  • Delete data that don’t want to migrate
  • Write out a flat file containing topic/post/post data/user info that we do want to migrate
  • Write a servlet to read from that flat file and call the JForum APIs to add data on server (we used the servlet approach for the initial UBB to JForum migration so I can reuse parts of that.)  The advantage is that it is using the actual JForum codebase to insert.

Step 3 – Code and test locally

For the flat file,  I used the Java CSV library for generating my flat file since posts contain spaces, quotes, line breaks and other special characters.  It’s good library.  Easy to use and works exactly as you’d expect.  The flat file generation code took about an hour to write and less than ten seconds to run.

For the servlet, coding was easier since I had my “starter program” from the UBB migration.  I still hit a couple of bugs (that would have been avoided with unit testing).  My analysis was good though.  Few surprises when coding/testing.  I did create helper objects for IntermediatePost and IntermediateTopic so I didn’t have to keep thinking about the fields.

Step 4 – run on test server

I had a network connection error while the program was running disconnecting my browser from the output.  It’s still going in the background so I’m looking at the database counts increasing and the server error logs.  I should have run it in smaller batches.

The servlet was also much faster on my machine than on the server.  I’m guessing I have too many database trips or the fact that there is more existing data.  It’s not horribly slow though and we don’t need to do this again so just letting it finish at it’s pace.

Step 5 – run on the real server

Smooth!  The forum is live.

Thanks to Rafael for the data!

coursera saas-class and mitx circuits class feedback

After AI-Class, I blogged with some feedback.  Over the past month or two, I took saas-class and MITx 6.0002x.

How I found out

I found about saas-class from comments at aiqus – the AI-class forum.  I found out about the MITx course via an e-mail from a friend.

Why I took it

Software as a Service is a hot topic and I was curious what they would teach about designing for scalability.  Even before the class started, it was apparent the class was really about software engineering practices using Ruby.  It seemed like a cool way to learn a little Ruby and tools like Cucumber.  It turned out I had a real world friend along with two fellow coderanch moderators in this class.

The MITx circuits class was less work related.  I took a hands on circuits class and was disappointed on the lack of coverage of concepts/how things work.

Pre-requisites

For SAAS, the main pre-req was knowledge of at least one object oriented programming language.  Check.  I’m a Java developer.  This pre-req was important in the course as it moved too fast for someone who has never programmed before.

For MITx the pre-reqs were calculus, linear algebra and first year college physics.  I did take all these classes.  However, I haven’t used much of them since college so I didn’t really have the pre-reqs in my mind.

How it worked

Area Coursera – SAAS MITx – 6.002 Circuits
What I liked What could be better next time What I liked What could be better next time
Lectures 5-15 minute course snippets.  The videos were a bit jumpy but still possible to follow. I felt like there was too much overlap between the lectures and the book.  I read the book and this caused me to gloss over some of the later videos. Tended to be longer, but still a reasonable range.  Good snippets as well and broken up as needed.  I liked the switch between lecture (powerpoint), human face and demo. Bookmark your place more clearly.  Your position in the lecture sequences was marked but it was hard to see and you had to know what to look for.
In lecture quizzes Quizzes were short multiple choices to make sure you understood the lecture. The AI-class and MITx quizzes tested deeper knowledge.  As did the saas-class exam type quizzes.  Bringing some of that rigor to the lectures would have made it more interesting. These were well thought out exercises.  You could submit as many times as you wanted or see the answers and backfigure from there.  There were excellent forum discussions on the exercises as students posted worked answers. Provide a built in walkthru for the early ones?
Homework I think the homeworks were by far the best part of course.  They were well crafted to reinforce/try/play with the material.  They gave you a sandbox that was big enough to play in but not so big as to flounder around in.  The auto grader which allowed multiple attempts help avoid any perceived ambiguity. The only thing I would change is that the auto-grader was up late for the last assignment. The format was similar to the quizzes in that there were unlimited retries.  The big difference is that the answer wasn’t available until afterwards. The content was more challenging and checked your understanding.  There were also virtual labs where you could build circuits which was really cool.  The homeworks/labs also generated different sets of numbers making it harder for people to cheat. There was some reports that the virtual labs were too sensitive.  I didn’t run into this problem personally though.
Exams The exams were timed multiple choice quizzes.  They were a good way of seeing how much you retained in a quick manner. On the last quiz, I got a longer quiz and then when I submitted “it was gone.”  Going back I was presented with a shorter quiz as if I had never been there.  I would have liked the answers to the first one. While the exams haven’t actually occurred yet, the announcements say they will be like the homeworks except you can only submit three wrong answers (as opposed to trial an error around) and you only have 24 hours from when you start to complete it.  I like this idea.  It’s a natural progression. n/a
Book Choice of e-book or printed book.  On a very beta book.  The book was good though so I don’t mind. The book was only available in certain countries. The book went well with the course and let you go into more detail on concepts/understanding. Aside from being expensive, this book was difficult to find.  I ordered it several weeks before the course and barely got it in time.
Forums The forum was built into coursera.  It was a bare bones forum and organized by “general”, “assignments”, etc.  The forum moderators did create a new forum for study groups when it became apparent that was causing clutter. I didn’t like the forum.  Part of it was the large number of stickied threads for long periods of time.  Part of it was the lack of contrast between read and unread posts.  And I suspect part of it is was the subconscious missing of features I reply on in other forums. I really like the forum.  It is powered by askbot and has a lot of the features I like in stackoverflow – tags, showing tags, hiding tags and voting.  It also has badges, karma and an easy way to see your topics/posts. At the beginning of the course, hardly anyone had enough karma to delete posts including those that contained answers to the homeworks.
Progress Bar If you click on assignments or quizzes, you can see your scores. There was a lot of “click to expand” needed to see all the information at once. A page clearly showed how “done” you were.  It gave a percentage for homeworks, labs, exams and non-credit quizzes in the lectures. Minor, but it would be nice if the homework/lab scores linked to the homework/lab.
Announcements The home page had announcements.  Kind of.  Most things were “announced” in the forums or by editing a paragraphs long text area.  Only a few announcements had a date. The announcements had a feel of the plans were “on display” in the basement of the planning office, where both the lights and stairs had been removed, in an old filing cabinet locked in a disused lavatory with a sign on the door saying “Beware of the Leopard.”Yes.  They were there.  But it took a lot of looking and re-looking to ascertain what was new.Source: Hitchhiker Guide Useful, timely and ordered by date. n/a

My measure of success

For SAAS, I know more than I did when I started the class.  I got to do some hands on development with Ruby and Ruby testing tools.  I got to participate in an online forum with fellow learner.

For MITx, I did learn two and a half weeks of material.  And it was great to see how they did it.  Between being shakey on the pre-reqs and not being at home at all for three of the first five weekends, I had to declare defeat.  This is not the course’s fault and I imagined it would happen.  I wanted to try it out anyway.

How I did

I’d like to repeat the part about learning being the important part here.  When one says that someone often chimes in “oh, that’s just because you didn’t do well.”  So I’ll share.  For SAAS,  I don’t know my exact score because I didn’t calculate it.  (I’m not clear on exactly how they count the quizzes.)  It think it is somewhere between 85 and 90 though. For MITx, my score is 7%.   Seven percent is two weeks of 100%, one week of partially done and then all zeros.

What’s next?

SAAS will be running a part 2 later in the year.  And a lot more courses are being offered.  See the full list at class central.  I still want to take Human Computer Interaction.  Which conveniently didn’t conflict with SAAS!  I’m also thinking about Udacity 253 – how to build a blog with the creator of reddit.  It seems to cover some of the scalability concerns I was hoping for in SAAS.  That will also let me take a look at Udacity – the third major provider.