javaranch & jforum – how we solved a threading issue

Don’t you just hate threading errors?  They are hard or impossible to reproduce.  It’s hard to figure out exactly what’s going on.  It’s hard to tell if you fixed it.

The problem

We’ve had a problem at JavaRanch for about a month where some posts mysteriously didn’t appear.  The problem was regularly at its worst early in the business day in India, when unfortunately pretty much everyone who works on the system is sound asleep.  It did occur at other times as well.  We did come up with a workaround pretty quickly – reboot the server.  I know – not much of a solution, but it let operations continue in some fashion.

We had a few symptoms:

  • It often took 1-4 minutes to post.
  • The page refreshed without actually storing the post although the index page indicated the post was there.
  • Our e-mail server was acting up, causing it to take hours to send out certain types of e-mails even after many retries.  (We didn’t realize this was a symptom until the problem was solved.  At the time, it was just another incredibly time-consuming problem that needed addressing.)

What we did

We added metrics.  We monitored.  We did code reviews.

Attempt #1 – A week after the problem started occurring, one moderator thought he had found it.  There was an Executor class in JForum that added outgoing e-mails to a queue.  If the queue was full, it blocked, causing the user to wait minutes for a post to go through, which in turn tied up all the resources.  Eventually the database transaction would time out waiting and give up.  Since the cache was already updated, it looked like the post partially went through, even though the database didn’t know about it.  So he added a call to Doug Lea’s concurrency library and deployed executor.discardOldestWhenBlocked();
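To make the "full queue blocks the poster" idea concrete, here is a minimal sketch using java.util.concurrent rather than the library JForum actually used.  The class name, the capacity of 2 and the "mail-N" strings are all made up for illustration; this is not JForum code.  It shows three ways a producer can behave when a bounded queue is full: get refused immediately, wait, or throw away the oldest entry to make room.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of a bounded outgoing-mail queue; not JForum code.
final class BoundedMailQueueSketch {
    // Tiny capacity (an assumption) so the "full" state is easy to reach.
    static final int CAPACITY = 2;

    private static BlockingQueue<String> fullQueue() {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(CAPACITY);
        queue.offer("mail-1");
        queue.offer("mail-2");
        return queue;
    }

    /** Non-blocking add to a full queue: returns whether the item was accepted. */
    static boolean offerWhenFull() {
        return fullQueue().offer("mail-3");
    }

    /**
     * Waiting add to a full queue that nobody drains: returns the milliseconds
     * the caller lost, or -1 if the item was unexpectedly accepted or the wait
     * was interrupted.
     */
    static long millisSpentWaiting(long timeoutMillis) {
        BlockingQueue<String> queue = fullQueue();
        long start = System.nanoTime();
        try {
            boolean accepted = queue.offer("mail-3", timeoutMillis, TimeUnit.MILLISECONDS);
            long elapsed = (System.nanoTime() - start) / 1_000_000;
            return accepted ? -1 : elapsed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
    }

    /** Discard-oldest: drop the head to make room, then report the new head. */
    static String discardOldestThenAdd() {
        BlockingQueue<String> queue = fullQueue();
        queue.poll();            // throw away the oldest pending mail
        queue.offer("mail-3");   // now there is room for the new one
        return queue.peek();
    }
}
```

The waiting variant is the one that hurts a forum: the thread handling the user's post is the one that sits there until room appears.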

The problem didn’t go away!  But we were so sure it was that.  Oh, no!  Now we have to find another root cause.  More metrics, more monitoring.  More trying to figure out what was going on with the mail server.  Another moderator started work on moving to a different mail server.

Attempt #2a – We moved to a different mail server.  It’s slower than the old one, so it’s still possible to back up the outgoing queue, just less likely.

Attempt #2b – A third moderator (me) wrote a standalone test against the executor to see if we could reproduce the problem.  We replaced the e-mail logic with Thread.sleep() for 10 seconds and then threw a bunch of threads at it.  Lo and behold, the later ones all blocked.  At this point there were two possible solutions – upgrade to Java’s built-in concurrency package, which doesn’t have this race condition, or call executor.discardWhenBlocked().  After consulting with Henry Wong of “Java Threads”, we decided to make the smallest possible change and call a different executor method.
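A standalone test along those lines might look something like the sketch below.  It is not the test we actually ran: it uses java.util.concurrent's ThreadPoolExecutor instead of the library inside JForum, shortens the sleep to 200 ms, and submits from a single thread in a loop rather than from many threads.  All the names and sizes are invented for the example.  The WAIT_FOR_ROOM handler is my assumption of what "wait when blocked" behavior amounts to, and ThreadPoolExecutor.DiscardPolicy is only roughly comparable to discardWhenBlocked().

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionHandler;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical reproduction harness; names and sizes are assumptions.
final class ExecutorBlockingTest {
    static final int SUBMISSIONS = 5;
    static final long FAKE_MAIL_MILLIS = 200; // stands in for the 10 second sleep

    /** Imitates wait-when-blocked: the submitting thread waits for queue room. */
    static final RejectedExecutionHandler WAIT_FOR_ROOM = (task, pool) -> {
        try {
            pool.getQueue().put(task);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    };

    /** Pretend to send an e-mail by sleeping. */
    static void fakeSendMail() {
        try {
            Thread.sleep(FAKE_MAIL_MILLIS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Submits slow tasks and returns the longest any single submit took (ms). */
    static long longestSubmitMillis(RejectedExecutionHandler handler) {
        // One worker and a one-slot queue, so the third submit finds it full.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1), handler);
        long worst = 0;
        try {
            for (int i = 0; i < SUBMISSIONS; i++) {
                long start = System.nanoTime();
                pool.execute(ExecutorBlockingTest::fakeSendMail);
                worst = Math.max(worst, (System.nanoTime() - start) / 1_000_000);
            }
        } finally {
            pool.shutdownNow(); // interrupt the worker and drop queued tasks
        }
        return worst;
    }
}
```

Calling longestSubmitMillis(ExecutorBlockingTest.WAIT_FOR_ROOM) should report a wait close to the fake send time, while passing new ThreadPoolExecutor.DiscardPolicy() should return almost immediately, at the cost of silently dropping the e-mails that didn't fit.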

Both attempts 2a and 2b were deployed over the same weekend so we don’t know for sure what fixed the problem.  We do know that 2a helped and 2b worked in simulated tests.  And most importantly, we know JavaRanch hasn’t lost posts or had long posting delays in a week.  That’s the important thing!

Lessons learned

  • Listing all the symptoms in one place helps draw connections.
  • Test that a potential fix works in an isolated test case.  We were *so* close three weeks ago.
  • Having a threading expert on hand saves a lot of time in making sure a fix doesn’t have negative side effects.  Just in explaining it to Henry, we thought about the problem on a deeper level.

JForum 3 (in development) does not have this problem as it does not use Doug Lea’s concurrency library.  I will report the problem for JForum 2.1.8.  While it’s only a problem for large installations with e-mail or resource problems, it should be in their forums too.