tuning jforum for javaranch

JavaRanch has been updating our forum software to an extension of JForum.  While testing, Ulf Dittmer noticed that the RSS feeds were incredibly slow – to the point of timing out.  He found that the database query itself was taking so long that the browser gave up.

JavaRanch Moose
Some of the queries built into JForum are written in a way that is fine for forums with a relatively small number of posts – even half a million posts total.  JavaRanch is much larger than that.  We imagine that we are the largest forum to use JForum.  (We have over 400 thousand threads/topics and over 1.7 million posts.)

I enjoy tuning queries, so I started looking at it.  When someone asked me about what I did, I thought it would be nice to blog about my thought process.  I’ve included the before/after queries and execution plans at the bottom so as not to break up the thought process.

1. Run original query at command line in our test environment (which is slightly smaller than the real data set, but close to realistic.)
2. Give up waiting after 5 minutes.
3. Run explain on the original query.
4. Cringe at the heap scan with a relative cost of 16628 – and worse within a nested loop.
5. Wonder why the query needs to do so much processing to get a limited number of rows.  The execution plan shows all that joining occurring on the entire table rather than just the 50 or so rows we care about (via the limit.)
6. Refactor the query to use a subquery to avoid this.  I had used this technique on another similar query the day before and gotten the query down to under 1000 cost units.
7. Run explain again.
8. See cost is 3807.  This is better, but there’s no logical reason it should be four times the cost of the query from yesterday.  Better keep tuning.
9. Hmm.  It’s using the index idx_topics_fp which is on topic_first_post_id.  We have almost 100 forums.  Wouldn’t it be more efficient to include the forum_id in the index.
10. Added index idx_topics_fp_forum on both topic_first_post_id and forum_id.
11. Run explain again.
12. See cost of 978.  Good.  This is about what I got on yesterday’s query.  I guess the built in indexes better matched what I needed on yesterday’s query.
13. Go back to GUI and try URL.  Takes a couple seconds.
14. Commit my changes.  I got a two order of magnitude increase on the query.  Very satisfying.


Original query:

PostModel.selectLatestByForumForRSS = SELECT p.topic_id, p.topic_id, p.post_id, p.forum_id, pt.post_subject AS subject, pt.post_text, p.post_time, p.user_id, u.username, u.user_first_name, u.user_last_name
FROM  jforum_topics t, jforum_posts p, jforum_posts_text pt, jforum_users u
WHERE p.post_id = t.topic_first_post_id
AND p.topic_id = t.topic_id
AND p.user_id = u.user_id
AND p.post_id = pt.post_id
AND p.need_moderate = 0
ORDER BY t.topic_first_post_id DESC

Modified query

PostModel.selectLatestByForumForRSS = SELECT p.topic_id, p.topic_id, p.post_id, p.forum_id, pt.post_subject AS subject, pt.post_text, p.post_time, p.user_id, u.username, u.user_first_name, u.user_last_name
FROM (select topic_first_post_id from jforum_topics where forum_id = ? order by topic_first_post_id desc limit ?) as nested,
jforum_topics t, jforum_posts p, jforum_posts_text pt, jforum_users u
WHERE p.post_id = nested.topic_first_post_id
AND p.topic_id = t.topic_id
AND p.user_id = u.user_id
AND p.post_id = pt.post_id
AND p.need_moderate = 0
ORDER BY t.topic_first_post_id DESC

Original explain

Row #         QUERY PLAN
1       Limit (cost=97566.69..97566.70 rows=1 width=504)
2       -> Sort (cost=97566.69..97566.70 rows=1 width=504)
3       Sort Key: t.topic_id
4       -> Nested Loop (cost=275.61..97566.68 rows=1 width=504)
5       -> Nested Loop (cost=275.61..97560.76 rows=1 width=473)
6       -> Nested Loop (cost=275.61..97556.05 rows=1 width=32)
7       Join Filter: (“inner”.topic_id = “outer”.topic_id)
8       -> Bitmap Heap Scan on jforum_topics t (cost=275.61..16628.72 rows=18174 width=8)
9       Recheck Cond: (forum_id = 1)
10       -> Bitmap Index Scan on idx_topics_forum (cost=0.00..275.61 rows=18174 width=0)
11       Index Cond: (forum_id = 1)
12       -> Index Scan using jforum_posts_pkey on jforum_posts p (cost=0.00..4.44 rows=1 width=24)
13       Index Cond: (p.post_id = “outer”.topic_first_post_id)
14       Filter: (need_moderate = 0)
15       -> Index Scan using jforum_posts_text_pkey on jforum_posts_text pt (cost=0.00..4.70 rows=1 width=449)
16       Index Cond: (“outer”.post_id = pt.post_id)
17       -> Index Scan using jforum_users_pkey on jforum_users u (cost=0.00..5.91 rows=1 width=35)
18       Index Cond: (“outer”.user_id = u.user_id)

Explain on modified query
Row #         QUERY PLAN
1       Sort (cost=3807.84..3807.94 rows=40 width=504)
2       Sort Key: t.topic_id
3       -> Nested Loop (cost=0.00..3806.78 rows=40 width=504)
4       -> Nested Loop (cost=0.00..3579.24 rows=40 width=500)
5       -> Nested Loop (cost=0.00..3342.41 rows=40 width=469)
6       -> Nested Loop (cost=0.00..3164.39 rows=40 width=453)
7       -> Limit (cost=0.00..2975.39 rows=40 width=4)
8       -> Index Scan Backward using idx_topics_fp on jforum_topics (cost=0.00..1351866.51 rows=18174 width=4)
9       Filter: (forum_id = 1)
10       -> Index Scan using jforum_posts_text_pkey on jforum_posts_text pt (cost=0.00..4.70 rows=1 width=449)
11       Index Cond: (pt.post_id = “outer”.topic_first_post_id)
12       -> Index Scan using jforum_posts_pkey on jforum_posts p (cost=0.00..4.44 rows=1 width=24)
13       Index Cond: (p.post_id = “outer”.topic_first_post_id)
14       Filter: (need_moderate = 0)
15       -> Index Scan using jforum_users_pkey on jforum_users u (cost=0.00..5.91 rows=1 width=35)
16       Index Cond: (“outer”.user_id = u.user_id)
17       -> Index Scan using jforum_topics_pkey on jforum_topics t (cost=0.00..5.68 rows=1 width=4)
18       Index Cond: (“outer”.topic_id = t.topic_id)

Explain after index
Row #         QUERY PLAN
1       Sort (cost=978.89..978.99 rows=40 width=504)
2       Sort Key: t.topic_id
3       -> Nested Loop (cost=0.00..977.83 rows=40 width=504)
4       -> Nested Loop (cost=0.00..741.01 rows=40 width=473)
5       -> Nested Loop (cost=0.00..513.47 rows=40 width=469)
6       -> Nested Loop (cost=0.00..335.45 rows=40 width=453)
7       -> Limit (cost=0.00..146.44 rows=40 width=4)
8       -> Index Scan Backward using idx_topics_fp_forum on jforum_topics (cost=0.00..66535.36 rows=18174 width=4)
9       Index Cond: (forum_id = 1)
10       -> Index Scan using jforum_posts_text_pkey on jforum_posts_text pt (cost=0.00..4.70 rows=1 width=449)
11       Index Cond: (pt.post_id = “outer”.topic_first_post_id)
12       -> Index Scan using jforum_posts_pkey on jforum_posts p (cost=0.00..4.44 rows=1 width=24)
13       Index Cond: (p.post_id = “outer”.topic_first_post_id)
14       Filter: (need_moderate = 0)
15       -> Index Scan using jforum_topics_pkey on jforum_topics t (cost=0.00..5.68 rows=1 width=4)
16       Index Cond: (“outer”.topic_id = t.topic_id)
17       -> Index Scan using jforum_users_pkey on jforum_users u (cost=0.00..5.91 rows=1 width=35)
18       Index Cond: (“outer”.user_id = u.user_id)

postgresql explain

I had an opportunity to do some tuning on postgresql and was pleasantly surprised at how smoothly it went.

The first thing I did was try to run an “explain” on the query under discussion.  (Explain is a tabular or graphical view of the detailed steps the database uses to execute your query.  By knowing what path it will take and what tables/indexes it will look at, you can tune your query appropriately.) Knowing this works differently in different databases, I looked up what to do.  Here are the steps:

To run explain at the command line:
1) Type “explain” followed by your query.  For example “explain select * from table”.

That’s right – one step!

To run explain graphically:
1) Install pgadmin if you haven’t already
2) Type query into editor
3) Choose query –> explain
This shows the graphical view of the query.  Clicking on the data output tab shws the text view generated by the command line.

Now it may have changed since then, but I needed to create a separate table the last time I ran an explain in Oracle.  This was extra steps that I have to look up each time.  db2 had a good graphical explain built into the tool.

What surprised me here was that I figured out postgresql’s explain much faster than Oracle’s.  Namely because it was so simple!  For the command line version, there is only one step – and it’s not one I am likely to forget.

It’s always nice when software works in such an intutive manner.

Memo: Avoid Nested Queries in MySQL at all costs

Some of my readers may be aware that nested subqueries such as “SELECT * FROM widgets WHERE id IN (SELECT …)”, don’t work all that well in MYSQL. While the syntax is usually correct, the performance issues in practice can be horrendous. This article delves deeper into this issue, and why MySQL performs so poorly with nested subqueries, but not so deep as to drive us all crazy.

Nested Queries

Background

The first complex query I learned how to write was a nested subquery, or just nested query for short. At the time I was learning SQL and databases, it was the simple and most obvious of all the complex queries/joins: Find a set of records whose Id is in a list of outputs of another query.

SELECT id,name,price 
FROM widgets 
WHERE id IN (SELECT DISTINCT widgetId FROM widgetOrders)

In the example above, we first query the widgetOrders table for all unique widgets that have been sold based on widgetId (the DISTINCT doesn’t change the output of the query, but can help performance). After we have such a list, we select the id, name, and price of those widgets using data from the widget table.

First off, why do people like nested queries? As I said, they are pretty easy to understand, ESPECIALLY for non-programmers. But what are the alternatives to nested queries? Joins! Non-programmers (and even some programmers) find joins to be mystifying and scary things. Words like INNER JOIN, RIGHT OUTER JOIN, or even the shortcut symbols *= scare a lot of people. For example, the previous query can be rewritten as an INNER JOIN as follows:

SELECT DISTINCT w.id,w.name,w.price 
FROM widgets w 
INNER JOIN widgetOrders o ON w.id=o.widgetId

So why is this scarier? First, while aliases like ‘w’ and ‘o’ for table names were previously optional, they become almost required with complex joins, since we’re essentially mixing a two level query into a single level. Also, we have to add new syntax such as INNER JOIN … ON. There’s a lot more going on, and a lot more for beginners to pick up and/or be scared off by.

Why nested joins are bad in theory

The first big question of this article revolves around how query optimizers work. You can write a query a thousand different ways that would all output the same information and might seem equivalent to you, but query optimizers are just not that smart. The search space they have to cover to effectively optimize a query is massive, longer than could ever be searched in a reasonable amount of time. Therefore, query optimizes are often collections of greedy algorithims. Sure, they will do intelligent things when they can figure them out in time, but often they just look for the ‘quick path out’ using some simple heuristics. If a query optimizer thinks a particular plan may be the fastest, it won’t necessarily spend time verifying it; it will just act. Because of this, it is very easy to trip up or hinder a query optimizer.

This brings us to nested queries. Even the best query optimizers in the best database software available have trouble with nested queries. This is because they often cannot optimize them in any reasonable manner. As we saw in the example, I took two separate queries and merged them into one. Most query optimizers are not smart enough to do this since finding such a conversion would take too long, or in computing terms would require too large a search space and near-infinite time. In fact, many query optimizers will flat out refuse to optimize nested queries if it sees them. Therefore, a general rule of thumb is to avoid nested queries as much as possible since you are essentially blocking the query optimizer from touching that part of the query. You should stick with more traditional joins as much as possible since this encourages the query optimizer to find better query paths.

Why nested joins are really bad in MySQL

While nested queries may have been the first type of complex query I worked with, I never had serious problems with them and never spent hours reworking them to non-nested queries, until I started working in MySQL. Many nested queries you might easily write are capable of completely grinding your MySQL database to a halt under certain data conditions. MySQL has posted a list of excuses and tips (to fix your queries instead of their code) and there’s numerous forums posts, blogs, bug reports, and articles discussing the issue, but I’ll streamline it for you: MySQL does terrible things when handling nested subqueries; therefore, if you are using MySQL they should be avoided at all costs.

Note: This does not mean you should avoid the IN or NOT IN syntax, for example “WHERE id IN (1,2,3)” is just fine. The problems is when “1,2,3” is replaced with a subquery such as “SELECT …”.

But Scott, I need a nested query!

If you absolutely need a nested query, you can always perform two distinct queries in your application as such:

Set X = CallDatabase("SELECT DISTINCT widgetId FROM widgetOrders");
CallDatabase("SELECT id,name,price FROM widgets WHERE id IN ("+X+")");

As strange as it sounds to recommend two database calls over one, there are many real cases in MySQL where this will perform better than nested queries.

The Future

The problem with nested queries is that in many circumstances they will perform just fine, but change the data slightly and they can seriously harm database performance in MySQL. For example, strange things can happen if the subquery returns no records so that you end up with “WHERE id IN ()”. Many of the issues with subqueries have been logged as bug on MySQL’s support site, so it’s possible in future versions they will be safer to use. For now though, avoid them as long as you program with MySQL, lest you want to create headaches for yourself down the road.