javaranch/jforum – nailing down a threading problem

Similar to the thought process for debugging a performance problem, here’s one for debugging a threading issue:

  1. In production, we have a problem where the threads in a forum listing disappear, leaving only one or two threads in their place.  We found a workaround pretty quickly (clear the cache), but we want to stop the problem from occurring.
  2. For the better part of a week, we found it puzzling.  A couple of moderators noted that the problem seemed to crop up after moving a thread from one forum to another.
  3. We still haven’t been able to reproduce the bug on demand.  It sounds like something threading-related.  These tend to be the mysterious-looking bugs.
  4. All right.  I’m going to sit down and look at the relevant code now.
    • I know the problem has something to do with the TopicRepository class, which is where this data is stored.
    • I see the clearCache method that empties the cache is in fact called when we move threads.
    • I also see addAll resets the forum list.  This doesn’t seem to be called, since we only see a post or two.  Hmm.  Let’s see where this is called.  I see; it’s called when someone requests to view the list of topics in a forum.  The code gets the topics from the cache.  If there aren’t any topics cached, or the cache reports as “not loaded” for the forum, it gets them from the database and loads the cache.  Clearly this isn’t happening.
    • I also see a method that adds a single topic to the cache.  It is called when we post announcements.  This looks promising.
    • I tried locally to reproduce the bug and got it!  We weren’t posting announcements though.  Then I searched the code for other calls to this method.  There is another call when we “update board status” which happens after someone replies to an existing thread.  Eureka!  This must be what’s happening.
  5. Now I can reproduce on our sandbox server on demand.  (See the path below.)
  6. How to fix?  Well, the topics being non-empty doesn’t seem like something I can control.  Looking at the other part – the loaded flag – seems more promising.
  7. I see the flag isn’t cleared in the clearCache method.  I’ll add that in and see what happens.  (A sketch of the fix follows this list.)
  8. The bug goes away!  Great news!
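
Here’s a minimal sketch of the caching logic involved and the one-line fix, in the spirit of the description above.  The names and shapes are my approximations, not the exact JForum source:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TopicRepositorySketch {

    // cached topic lists, keyed by forum id (Object stands in for JForum's Topic)
    private static final Map<Integer, List<Object>> topicsByForum =
            new HashMap<Integer, List<Object>>();
    // which forums have had their full topic list read from the database
    private static final Map<Integer, Boolean> loadedForums =
            new HashMap<Integer, Boolean>();

    // Viewing a forum: only goes to the database when nothing is cached
    // or the forum reports as "not loaded".
    public static synchronized List<Object> getTopics(int forumId) {
        List<Object> topics = topicsByForum.get(forumId);
        if (topics == null || topics.isEmpty() || !isLoaded(forumId)) {
            topics = loadFromDatabase(forumId);
            topicsByForum.put(forumId, topics);
            loadedForums.put(forumId, Boolean.TRUE);
        }
        return topics;
    }

    // "Update board status" after a reply: adds a single topic to the cache.
    public static synchronized void addTopic(int forumId, Object topic) {
        List<Object> topics = topicsByForum.get(forumId);
        if (topics == null) {
            topics = new ArrayList<Object>();
            topicsByForum.put(forumId, topics);
        }
        topics.add(topic);
    }

    // Moving a thread clears the affected forum's cache.
    public static synchronized void clearCache(int forumId) {
        topicsByForum.remove(forumId);
        // The fix: without this line, a reply after a move re-populates the
        // cache with a single topic while the forum still looks "loaded",
        // so the full list is never re-read.
        loadedForums.put(forumId, Boolean.FALSE);
    }

    public static synchronized boolean isLoaded(int forumId) {
        return Boolean.TRUE.equals(loadedForums.get(forumId));
    }

    // Stand-in for the real DAO call that reads the full topic list.
    private static List<Object> loadFromDatabase(int forumId) {
        return new ArrayList<Object>();
    }
}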

The path that causes the bug to manifest (a code rendition follows the list):

  1. open two threads in the Test Forum
  2. move a thread from Test Forum to Another Test – this clears the cache
  3. in Test Forum, add a post to an existing thread – this adds one topic to the list
  4. reload list of threads for Test Forum – now it thinks there is one topic in the list and doesn’t look for more
  5. you only see the one edited thread in the list
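
Against the sketch above, the same path in code (the forum id and output are illustrative):

public class BugPathDemo {
    public static void main(String[] args) {
        int testForum = 1;
        TopicRepositorySketch.getTopics(testForum);               // view forum; cache loads
        TopicRepositorySketch.clearCache(testForum);              // move a thread; cache cleared
        TopicRepositorySketch.addTopic(testForum, new Object());  // reply to an existing thread
        // Reload the list: before the fix this step returned just the one
        // topic; with the loaded flag reset, the full list is re-read.
        System.out.println(TopicRepositorySketch.getTopics(testForum).size());
    }
}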

production is slow – tuning jforum for javaranch – part 2

A moderator reported that moving topics took a long time in production and on our sandbox.  I just tried it and got 70 seconds through the GUI!  It didn’t use to take this long.  I’m comparing our development sandbox a few months ago to our development sandbox today.  Clearly something has changed.  The question is what!

I heard a couple of people liked reading about the thought process for troubleshooting slow queries.  If you aren’t familiar with what a database explain is used for, see that entry before reading this one.

This time the path to success was a little more meandering.  I’ve left the dead ends in here so you get a feel for the actual thought process and not a sanitized version.  If you are a student, this is a decent feel for what problems look like in the “real” world.

  1. First step – find the piece of code in JForum that handles moving threads.
  2. It’s GenericForumDAO.moveTopics.  I see this method was modified by us (JavaRanch) to support leaving a copy of the thread in the original forum, or creating a new copy and locking the original (for use in moving to our private “Trash Can” forum).  We made that change a long time ago, though, and it wasn’t slow then.
  3. Looking at the method, I see a number of database queries.  I’m adding logging (nothing fancy – just System.out’s; see the timing sketch after this list) to see which query is using up so much time.
  4. The method I added logging to takes 5-8 milliseconds (across two trials).  Yet the browser plugs away for almost a full minute after that.
  5. Ah!  The web/action layer class ModerationHelper makes a bunch of calls to clear/reload the cache.  I bet that’s where all the time is going.  Adding logging at that level and redeploying.
  6. Now we are getting somewhere.  Everything goes very quickly until the part where the hottest topics get reloaded, and then that takes almost a minute.
  7. Yuck.  Table scans left and right.  No wonder it takes so long.  Three table scans, joining on the result and THEN discarding almost all the records.  (See “original query” and “original explain” in the supplemental materials section below.)
  8. If I were doing this by brute force, I would add an index on topic_views and refactor the query to search on the topics table only, joining with the others on the result.  This is similar to the last post, where I walked through an optimization.  However, I’m not sure that makes sense here.
  9. This definition of hottest posts uses the most page views ever.  This is OK for now because all the migrated posts started with zero views against them, making the hottest threads really “the hottest threads since our Saturday January 3rd migration.”  But what happens down the road?  When a thread stops being viewed altogether, it’s not hot anymore.  I’m thinking it would be better to put a time-based constraint on things.
  10. I’ll post in our Moderators Only forum to get some feedback on my proposed definition change and go have dinner.
  11. Unanimous agreement on my proposed definition from everyone who saw the question while I was eating, including the site owner.  Good news.
  12. Modify the query to add a where clause on the thread’s last modified date.
  13. This modified query takes 80 thousand units instead of 253 thousand units.  Better but still way too high.
  14. Next I’ll try adding an index on (topic_last_post_id, topic_views).
  15. Running explain again gives the exact same output.  It isn’t using the index.
  16. Dropping the index since it didn’t work.  I may re-add it later but I like to test only one variable at a time.
  17. The database seems to be favoring the table scan of the forum table first.  Probably because this table is orders of magnitude smaller than the other two in the join.
  18. Now I’ll try removing the forum table from the query and run an explain to see what happens.  It’s still 78 thousand units, but at least it is using an index (post_time) as the first step.
  19. Time to try re-adding the index and running explain again.  Nothing.  Still 78 thousand units.
  20. All right.  This tweaking thing isn’t getting me anywhere.  New approach: how would I build this query if I were writing it from scratch?
  21. The base query that I want to happen first is explain select p.post_id from jforum_posts p WHERE p.post_time >= '2008-01-01'.  Running an explain on just this says it takes 1 thousand cost units.  That sounds high – I’ll query some details.  Over 100 thousand posts in a few days?  Oh yeah!  It’s 2009 – a new year, so that date covers a full year of posts, not a few days.  I should know that – I celebrated recently.
  22. Re-running the original query with the post_time clause gives me under two thousand cost units.  Whew.  That’s much better.
  23. Now I’m uploading the changed query to our sandbox environment to see how the move performs.  Three to five seconds.  Good.
  24. Another moderator suggested using replies and not views.  This is a good idea, and since it has no effect on performance, I made the switch.  (See “Final query” in the supplemental materials.)
  25. And a good techie verifies everything, so I ran explain one last time to confirm this last change didn’t affect performance.  It didn’t.  Performance is still OK.  (See “Final explain” in the supplemental materials.)
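
For reference, the logging in steps 3 and 5 was nothing more sophisticated than timestamps printed around each suspect call.  A sketch of the idea – the method being timed here is a stand-in, not the real JForum signature:

public class TimingSketch {
    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        moveTopics(); // stand-in for the real call being timed
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("moveTopics took " + elapsed + " ms");
    }

    // Placeholder for the database and cache work done by the real
    // GenericForumDAO.moveTopics / ModerationHelper code paths.
    private static void moveTopics() {
    }
}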

Summary

Two orders of magnitude performance improvement on the database query and a web action that isn’t painful to wait for.  I’m happy with that.  Not only is it faster, but we uncovered a better business definition for the query.

I also learned that the last time we tested a move on our sandbox server was before we tested migrating large quantities of data, making it an invalid test.

Supplemental materials

Original query

SELECT t.*, p.user_id AS last_user_id, p.post_time, p.attach AS attach, f.forum_name \
FROM jforum_topics t, jforum_posts p, jforum_forums f \
WHERE p.post_id = t.topic_last_post_id \
AND p.need_moderate = 0 \
and p.forum_id = f.forum_id \
ORDER BY topic_views DESC \
LIMIT ?

Original explain

Limit  (cost=253457.29..253457.39 rows=40 width=132)
  ->  Sort  (cost=253457.29..254467.05 rows=403905 width=132)
        Sort Key: t.topic_views
        ->  Hash Join  (cost=26807.80..136006.35 rows=403905 width=132)
              Hash Cond: ("outer".forum_id = "inner".forum_id)
              ->  Hash Join  (cost=26707.81..129847.79 rows=403905 width=113)
                    Hash Cond: ("outer".post_id = "inner".topic_last_post_id)
                    ->  Seq Scan on jforum_posts p  (cost=0.00..50770.74 rows=1775019 width=24)
                          Filter: (need_moderate = 0)
                    ->  Hash  (cost=19386.05..19386.05 rows=403905 width=93)
                          ->  Seq Scan on jforum_topics t  (cost=0.00..19386.05 rows=403905 width=93)
              ->  Hash  (cost=99.79..99.79 rows=79 width=27)
                    ->  Seq Scan on jforum_forums f  (cost=0.00..99.79 rows=79 width=27)

Final query

SELECT t.*, p.user_id AS last_user_id, p.post_time, p.attach AS attach, f.forum_name \
FROM jforum_topics t, jforum_posts p, jforum_forums f \
WHERE p.post_time >= ? \
and p.post_id = t.topic_last_post_id \
AND p.need_moderate = 0 \
and p.forum_id = f.forum_id \
ORDER BY topic_replies DESC \
LIMIT ?

Final explain

Limit  (cost=1774.18..1774.23 rows=20 width=132)
  ->  Sort  (cost=1774.18..1774.28 rows=40 width=132)
        Sort Key: t.topic_replies
        ->  Hash Join  (cost=99.99..1773.12 rows=40 width=132)
              Hash Cond: ("outer".forum_id = "inner".forum_id)
              ->  Nested Loop  (cost=0.00..1672.53 rows=40 width=113)
                    ->  Index Scan using idx_posts_time on jforum_posts p  (cost=0.00..678.22 rows=177 width=24)
                          Index Cond: (post_time >= '2009-01-05 00:00:00'::timestamp without time zone)
                          Filter: (need_moderate = 0)
                    ->  Index Scan using idx_topics_lp on jforum_topics t  (cost=0.00..5.61 rows=1 width=93)
                          Index Cond: ("outer".post_id = t.topic_last_post_id)
              ->  Hash  (cost=99.79..99.79 rows=79 width=27)
                    ->  Seq Scan on jforum_forums f  (cost=0.00..99.79 rows=79 width=27)

Making MySQL Use More Memory

Unlike a lot of database servers, MySQL is strangely conservative (by default) about how much memory it will allocate.  If you’re not careful, you can have 16GB of RAM on your machine with MySQL only using 50MB, leading to extremely poor performance under heavy load.  I know firsthand that navigating MySQL configuration guides can be a daunting task, so I’ve prepared this post for those looking for a ‘quick fix’ to encourage MySQL to use a healthier amount of memory.

Which database storage engine do you use primarily?

Many beginner users may not realize this, but with MySQL you have a choice of which storage engine implementation your database runs on.  This is where performance tuning begins to get complicated, as you have to set the configuration variables that correspond to the storage engine you are using!  You can see which engine you are relying on by opening MySQL Administrator and viewing your schema under the Catalogs tab.  Each table should have an engine associated with it, which likely says either MyISAM or InnoDB.
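
If you’d rather check without the GUI, the same information is available from MySQL’s standard information_schema tables.  A quick sketch in Java – the connection URL, credentials, and schema name are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ShowEngines {
    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver"); // Connector/J driver class
        Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/information_schema", "user", "password");
        Statement st = con.createStatement();
        // Lists each table in the schema along with its storage engine.
        ResultSet rs = st.executeQuery(
                "SELECT TABLE_NAME, ENGINE FROM TABLES WHERE TABLE_SCHEMA = 'your_schema'");
        while (rs.next()) {
            System.out.println(rs.getString("TABLE_NAME") + " uses " + rs.getString("ENGINE"));
        }
        rs.close();
        st.close();
        con.close();
    }
}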

In this post, I’ll cover how to increase general memory usage for InnoDB:

Set “innodb_buffer_pool_size” to be up to 80% of RAM, or at least a few hundred megabytes.  The default is 8MB, and I imagine anyone running a MySQL server these days can at least spare 200 megabytes of memory, so it should at least look like this:

innodb_buffer_pool_size=200M

Again, the default is 8M! So, if you’re not setting this variable, you’re choking your database. Feel free to give it 1-2GB if you have the available memory, but the most gain will be made by just going above the default.
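
For context, the setting lives under the [mysqld] section of your MySQL option file (my.cnf on Linux, my.ini on Windows), and the server needs a restart for it to take effect.  A sketch, with the 1GB figure purely as an example for a machine with memory to spare:

[mysqld]
# default is a mere 8M; size this to the memory you can spare
innodb_buffer_pool_size=1G

You can confirm the running value afterwards with SHOW VARIABLES LIKE 'innodb_buffer_pool_size'; (the value is reported in bytes).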

There are more InnoDB settings you could set, but their benefits pale in comparison with the value you’ll gain by increasing this from the default of 8M.