javaranch & jforum – how we solved a threading issue

Don’t you just hate threading errors?  They are hard or impossible to reproduce.  It’s hard to figure out exactly what’s going on.  It’s hard to tell if you fixed it.

The problem

We’ve had a problem at JavaRanch for about a month where some posts mysteriously didn’t appear.  The problem was at its worst early in the business day in India pretty regularly.  When unfortunately pretty much everyone who worked on the system is sound asleep.  It did occur at other times as well.  We did come up with a workaround pretty quickly – reboot the server.  I know – not much of a solution, but it let operations continue in some fashion.

We had a few symptoms:

  • It often took 1-4 minutes to post.
  • The page refreshed without actually storing the post although the index page indicated the post was there.
  • Our e-mail server was acting up causing it to take hours to send out certain types of e-mails even after many retries.  (We didn’t realize this was a symptom until the problem was solved.  At the time, it was just another incredibly time consuming problem that needed addressing.)

What we did

We added metrics.  We monitored.  We did code reviews.

Attempt #1 – A week after the problem started occurring one moderator thought he found it.  There was an Executor class in JForum that added e-mails to send out to a queue.  If the queue was full it blocked causing the user to have to wait minutes for a post.  Which in turn tied up all the resources.  Eventually the database transaction would timeout waiting and give up.  Since the cache was already updated, it looked like the post partially went through and the database didn’t know about it.  So he added a call to Doug Lei’s concurrency library and deployed executor.discardOldestWhenBlocked();

The problem didn’t go away!  But we were so sure it was that.  Oh, no!  Now we have to find another root cause.  More metrics, more monitoring.  More trying to figure out what was going on with the mail server.  Another moderator started work on moving to a different mail server.

Attempt #2a -We moved to a different mail server.  It’s slower than the old one so still possible to back up the outgoing queue.  Just less likely.

Attempt #2b – A third moderator (me) wrote a standalone test against the executor to see if we could reproduce the problem.  We replaced the e-mail logic with Thread.sleep() for 10 seconds and then threw a bunch of threads at it.  Lo and behold the later ones all blocked.  At this point there were two possible solutions – upgrade to Java’s built in concurrency package  which doesn’t have this race condition – or call executor.discardWhenBlocked().  After consulting with Henry Wong of “Java Threads“, we decided to make the smallest possible change and call a different executor method.

Both attempts 2a and 2b were deployed over the same weekend so we don’t know for sure what fixed the problem.  We do know that 2a helped and 2b worked in simulated tests.  And most importantly, we know JavaRanch hasn’t lost posts or had long posting delays in a week.  That’s the important thing!

Lessons learned

  • Listing all the symptoms in one place helps draw connections.
  • Test a potential fix works in an isolated test case.  We were *so* close three weeks ago.
  • Having a threading expert on hand saves a lot of time in making sure a fix doesn’t have negative side effects.  Just in explaining to Henry we thought about the problem on a deeper level.

JForum 3 (in development) does not have this problem as it does not use Doug Lei’s concurrency library.  I will report the problem for JFOrum 2.1.8.  While it’s only a problem for large installation with e-mail or resource problems, it should be in their forums too.

javaranch – the web based pick winners program

JavaRanch uses a Java program to pick the weekly winners for book promotions.  It previously used a serious of classes that went the URLs, parsed the data, went to more URLs, picked some random winners and then output them to a file.  These contortions were done because the old software was hard to change.  With the new Java based software, we have much more active development.  Time for a new approach.

Designing the new pick winners program.  (It’s the 3rd iteration of the program and the 2nd I’ve done so I’m familiar with the domain.)

  1. Decide to make a web based version (servlet)
  2. Think about what I need from the database.
  3. Write three DAO methods to get post, topic and user info.  While I wrote the integration tests first, I did write the unit tests after the code.
  4. Start the pick winner class  Realize there is a lot of date validation logic (and determining the default week) and rename class to WinnerPickingWeek to encapsulate the date range.
  5. Start the pick winner class again.  Call the three DAO methods tying them together.
  6. Now add the randomness.  My test with 1 post will give me enough determinism to keep the tests passing and useful.
  7. Added a test for excluding ineligible winners (like Henry and I – the winner pickers)
  8. Now on to the front end.  My servlet needs to make sure you are logged in as admin and then delegate to the processing logic.

This got a useable program that runs much faster.  After that I added some jQuery logic to make the page dynamic and even more useful.  But that’s another topic – possibly a more interesting one.  I’ll post it later in the week.

how testing can improve legacy code design

There’s no shortage of articles on how TDD improves the design of new code.  That’s all well and good.  But what about legacy code?

How it came up

This weekend, I had occasion to make a few enhancements to the email sending project at JavaRanch.  The one that got me thinking about the design was when I needed to add some logic to filter the e-mail list.  I was trying to allow specifiying the start and end index so we could resend to just part of the list if the process failed.  After all, you don’t want people getting two copies.

What I did

As the current filtering logic was in the BulkMailerProcess class, I decided to start there.  This class has a bunch of dependencies and isn’t currently tested.  Hello legacy code!  My first thought was that I would make my new filtering method package private and test that.  As I set out to write the test, I realized I needed to get access to the instance variable containing the list of e-mails.  Ugh.  This made me cringe enough to think about an alternate direction.

My next thought was that I really have a separate concept here.  I’m filtering e-mails.  So it one of the existing methods in BulkMailerProcess.  (It makes sure the e-mails are properly formed.)  Time to create a new class.  At this point it was easy.  My new class EmailFilter takes a list of e-mails and runs both filtering/cleaning operations on them.  It’s very focused and gets all that logic out of the main processing class.  I feel like I left the code cleaner than I found it.

The result

It certainly is more tested.  The method to clean the e-mails wasn’t originally tested since it was so embedded in everything else.  Now it is.  In fact now it is tested at 100%.  Without looking at the implementation of the method I copied in, I did write one test to verify the method gets called and at least does what it sounds like.  With this system I got to 96% coverage on the new EmailFilter class.  I could easily get to 100%, but that’s for another day.  The goal here isn’t to be perfect.  It’s to leave things better than they started out.

I left a comment in the code so nobody thinks it is more tested than it is.

/**
* Check valid format for email. This logic was moved verbatim from
* BulkMailerProcess class. Do not change without adding more detailed unit
* tests to verify behavior.
*/

What’s next

The only thing I’m less than happy with is the name of the class.  I started with EmailListUtil because I didn’t have a better name.  Then I changed it to EmailFilter.  But it’s not just a filter.  It’s doing validation too for cleaning the invalid e-mails from the list before filtering by index.  I’ll have to think about the name some more.

The other next step is to improve things a little more next time I touch the code.  I already have some ideas.  The key is to not attempt too much at once.  That would be overwhelming.  A little at a time makes it doable.