more postgres tuning in jforum

A teammate installed a new feature on CodeRanch JForum that uses a 4,515,409 row table. When dealing with over a million rows, scans become a huge performance drain. To the point where one query was slow but real usages with many at the same time brought down the app. The reason why the query was slow was interesting so I asked him if I could blog about it.

The original query

select countryCode, countryName, region, city
     from ip2location
     where 540815125 >= low and 540815125 <= high;

Running this through explain says it uses the index with a cost of:

Bitmap Heap Scan on ip2location (cost=5949.66..54170.71 rows=219870 width=32)

That’s a really high cost explain plan.

My first thought was to change it to:

explain  select countryCode, countryName, region, city
     from ip2location
     where 540815125 >= low and 540815125 <= high;

Which has a much better explain plan of

Index Scan using ip2l_low_asc_idx on ip2location (cost=0.00..8.77 rows=1 width=32)

The reason is that in the first query, postgres needs to scan the large index from the beginning until it hits the low value. In the second, I gave it permission to start really close to the target row. I subtracted 1000 but that was arbitrary. It just needs to high enough to be in the vicinity of the row without missing out on any data.

My approach also makes the lookup time consistent. It is always looking through 1000 worth of index. (Which is always less than 1000 rows given the bunching of low to high.) The original is immediate through a full index scan.

Then the original teammate switched it to:

select countryCode, countryName, region, city
     from ip2location
     where  low =
           (select max(low)
            from   ip2location
            where 540815125 >= low)

This has the same explain cost as the hacky one and is clearer.  Same logic though – it doesn’t require scanning/returning extra results.

sql query optimization – why temp tables can help

Last month, I migrated the jforum.net forum data to a coderanch jforum forum.  I had a requirement/goal to update the links in our forum so they work rather than point to the jforum broken links.

First I created the mapping.  (lesson: store this data as you migrate so you don’t have to do it later.)  I wound up mapping on subject lines and dates.  Luckily the threads were migrated in numeric order so I could fill in the gaps.  But that wasn’t interesting enough to blog about.  What was interesting enough to blog about was the SQL queries to update the database.

My goal

I wanted to make the changes entirely through the database – no Java code.  I also wanted to avoid postgres stored procedures because I encountered some time sinks last time I wrote a postgres stored procedure.  I am happy to say I achieved my goal.

Step 1 – Analysis

Noted that I need to update 375 rows.  Too many to do by hand.  There are just under 5000 posts in the jforum forum.  Which means therea rea t most 5000 search/replace strings to check.  This doesn’t seem bad for a computer.  Once I know the SQL, I can write code to generate 5000 of them using my local mappings and then run the SQL script on the server.

select * from jforum_posts_text where post_text like '%http://www.jforum.net/posts/list/%'

Step 2 – Horrible performing but functional query

It’s got to work before you can tune it.  My first attempt was:

explain update jforum_posts_text set post_text=replace(post_text, 'http://www.jforum.net/posts/list/5.page','http://www.coderanch.com/t/574339 ')  where post_text like '%http://www.jforum.net/posts/list/5.page%';

The query plan was:

Seq Scan on jforum_posts_text (cost=0.00..132971.82 rows=1 width=459) 
Filter: (post_text ~~ '%http://www.jforum.net/posts/list/5.page%'::text)

5000 of those is going to take over an hour of hitting the database hard. Not good. I could run it over the weekend when volumes are need be, but I can do better than that.

Step 3 – Trying to use an index

I know that most (95% maybe) of the updates are in the JForum forum.  We had a few “legacy” links to jforum.net in other forums, but not a lot.  I then tried adding a condition on forum id

explain update jforum_posts_text set post_text=replace(post_text, 'http://www.jforum.net/posts/list/5.page','http://www.coderanch.com/t/574339 ')  where post_text like '%http://www.jforum.net/posts/list/5.page%' and post_id in (select post_id from jforum_posts where forum_id = 95);

The query plan was:

Nested Loop (cost=22133.48..97281.40 rows=1 width=459)
HashAggregate (cost=22133.48..22263.46 rows=12998 width=4)
Bitmap Heap Scan on jforum_posts (cost=245.18..22100.99 rows=12998 width=4)
Recheck Cond: (forum_id = 95)
Bitmap Index Scan on idx_posts_forum (cost=0.00..241.93 rows=12998 width=0)
Index Cond: (forum_id = 95)
Index Scan using jforum_posts_text_pkey on jforum_posts_text (cost=0.00..5.76 rows=1 width=459)
Index Cond: (jforum_posts_text.post_id = jforum_posts.post_id)
Filter: (jforum_posts_text.post_text ~~ '%http://www.jforum.net/posts/list/5.page%'::text)

Well, it is using the indexes now.  But it is only about a 30% drop in cost for the worst case. On an untuned complex query, I usually see at least an order of magnitude performance jump on my initial tuning.

Step 4 – time for a temp table

Most of the work is finding the 375 rows that need updating in a table with 1,793,111 rows.  And it has to happen for each of the 5000 times I run the query.

I decided to use a temporary table so I could run the expensive part once.

create table jeanne_test as select * from jforum_posts_text where post_text like '%http://www.jforum.net/posts/list/%';

explain update jforum_posts_text set post_text=replace(post_text, 'http://www.jforum.net/posts/list/5.page','http://www.coderanch.com/t/574339 ')  where post_id in (select post_id from jeanne_test where post_text like '%http://www.jforum.net/posts/list/5.page%');

Now I’m doing the expensive part once.  It still takes a couple seconds to do the first part.  But the second part – the update I’m running 5000 times – drops the query plan to

Nested Loop (cost=13.50..21.99 rows=1 width=459)
HashAggregate (cost=13.50..13.51 rows=1 width=4)
Seq Scan on jeanne_test (cost=0.00..13.50 rows=1 width=4)
Filter: (post_text ~~ '%http://www.jforum.net/posts/list/5.page%'::text)
Index Scan using jforum_posts_text_pkey on jforum_posts_text (cost=0.00..8.46 rows=1 width=459)
Index Cond: (jforum_posts_text.post_id = jeanne_test.post_id

Nice.  Running the script with the 5000 update statements only took a few seconds.

Conclusion

Database tuning is fun.  Explain is your friend.  As are different approaches.  And for those who aren’t doing match, the performance jump was 3-4 orders of magnitude.

TSS Symposium Preview – Throw Away All The Rules. Now What Process Do You Follow?

I'm Speaking at TheServerSide Java Symposium As previously mentioned, Scott and I are both be presenting talks at TheServerSide Java Symposium in March. In preparation for the conference, we are providing sneak peaks of talks this week on the blog.  Scott gave a sneak preview of his GWT lecture.  He has so much information it didn’t all fit in his slides!

I’m giving a talk titled Throw Away All The Rules. Now What Process Do You Follow?” Here are just a few points from my upcoming conference talk:

  1. Testing – for anyone who knows me it isn’t a surprise that I consider regression critical.  But it isn’t always easy to get started.  And yet, the pain of not having it is large.  See how a testing strategy can evolve.
  2. Deployment – a repeatable deployment sounds like a corporate bureaucratic thing.  Yet knowing what is in production and how to get there quickly is an agile concept.
  3. Leader – Note I said leader and not project manager.  Even if you don’t have a project manager, someone needs to be running the show.  Or in the case of Scrum – the team is the leader.

Hope you enjoyed this sneak peak of my conference talk.  For more, consider coming to my session at the conference. If you register with the coderanch discount code, you can save $200.