sql query optimization – why temp tables can help

Last month, I migrated the jforum.net forum data to a CodeRanch JForum forum.  I had a requirement/goal to update the links in our forum so they work rather than point to broken jforum.net links.

First I created the mapping.  (lesson: store this data as you migrate so you don’t have to do it later.)  I wound up mapping on subject lines and dates.  Luckily the threads were migrated in numeric order so I could fill in the gaps.  But that wasn’t interesting enough to blog about.  What was interesting enough to blog about was the SQL queries to update the database.
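As a sketch, the stored mapping needs nothing fancier than a two-column table (the names here are hypothetical, not the actual ones from my migration):

create table jeanne_topic_mapping (old_topic_id integer primary key, new_topic_id integer not null);

Each row says “jforum.net thread old_topic_id is now coderanch thread new_topic_id,” which is all the later steps need.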

My goal

I wanted to make the changes entirely through the database – no Java code.  I also wanted to avoid postgres stored procedures because I encountered some time sinks last time I wrote a postgres stored procedure.  I am happy to say I achieved my goal.

Step 1 – Analysis

I noted that I needed to update 375 rows.  Too many to do by hand.  There are just under 5000 posts in the jforum forum, which means there are at most 5000 search/replace strings to check.  This doesn’t seem bad for a computer.  Once I know the SQL, I can write code to generate the 5000 statements using my local mappings and then run the SQL script on the server.

select * from jforum_posts_text where post_text like '%http://www.jforum.net/posts/list/%';
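A count against the same filter gives the number of rows in play for the analysis:

select count(*) from jforum_posts_text where post_text like '%http://www.jforum.net/posts/list/%';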

Step 2 – Horribly performing but functional query

It’s got to work before you can tune it.  My first attempt was:

explain update jforum_posts_text set post_text=replace(post_text, 'http://www.jforum.net/posts/list/5.page','http://www.coderanch.com/t/574339 ')  where post_text like '%http://www.jforum.net/posts/list/5.page%';

The query plan was:

Seq Scan on jforum_posts_text (cost=0.00..132971.82 rows=1 width=459)
  Filter: (post_text ~~ '%http://www.jforum.net/posts/list/5.page%'::text)

5000 of those is going to take over an hour of hitting the database hard.  Not good.  I could run it over the weekend when volumes are lower if need be, but I can do better than that.

Step 3 – Trying to use an index

I know that most (95% maybe) of the updates are in the JForum forum.  We had a few “legacy” links to jforum.net in other forums, but not a lot.  I then tried adding a condition on the forum ID:

explain update jforum_posts_text set post_text=replace(post_text, 'http://www.jforum.net/posts/list/5.page','http://www.coderanch.com/t/574339 ')  where post_text like '%http://www.jforum.net/posts/list/5.page%' and post_id in (select post_id from jforum_posts where forum_id = 95);

The query plan was:

Nested Loop (cost=22133.48..97281.40 rows=1 width=459)
  HashAggregate (cost=22133.48..22263.46 rows=12998 width=4)
    Bitmap Heap Scan on jforum_posts (cost=245.18..22100.99 rows=12998 width=4)
      Recheck Cond: (forum_id = 95)
      Bitmap Index Scan on idx_posts_forum (cost=0.00..241.93 rows=12998 width=0)
        Index Cond: (forum_id = 95)
  Index Scan using jforum_posts_text_pkey on jforum_posts_text (cost=0.00..5.76 rows=1 width=459)
    Index Cond: (jforum_posts_text.post_id = jforum_posts.post_id)
    Filter: (jforum_posts_text.post_text ~~ '%http://www.jforum.net/posts/list/5.page%'::text)

Well, it is using the indexes now.  But it is only about a 30% drop in cost for the worst case. On an untuned complex query, I usually see at least an order of magnitude performance jump on my initial tuning.

Step 4 – time for a temp table

Most of the work is finding the 375 rows that need updating in a table with 1,793,111 rows.  And it has to happen for each of the 5000 times I run the query.

I decided to use a temporary table so I could run the expensive part once.

create table jeanne_test as select * from jforum_posts_text where post_text like '%http://www.jforum.net/posts/list/%';

explain update jforum_posts_text set post_text=replace(post_text, 'http://www.jforum.net/posts/list/5.page','http://www.coderanch.com/t/574339 ')  where post_id in (select post_id from jeanne_test where post_text like '%http://www.jforum.net/posts/list/5.page%');

Now I’m doing the expensive part once.  It still takes a couple seconds to do the first part.  But the second part – the update I’m running 5000 times – drops the query plan to

Nested Loop (cost=13.50..21.99 rows=1 width=459)
  HashAggregate (cost=13.50..13.51 rows=1 width=4)
    Seq Scan on jeanne_test (cost=0.00..13.50 rows=1 width=4)
      Filter: (post_text ~~ '%http://www.jforum.net/posts/list/5.page%'::text)
  Index Scan using jforum_posts_text_pkey on jforum_posts_text (cost=0.00..8.46 rows=1 width=459)
    Index Cond: (jforum_posts_text.post_id = jeanne_test.post_id)

Nice.  Running the script with the 5000 update statements only took a few seconds.
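In case you’re wondering what generating the 5000 statements can look like: SQL can generate the SQL.  This is just a sketch, assuming the hypothetical jeanne_topic_mapping table from earlier:

select 'update jforum_posts_text set post_text=replace(post_text, ''http://www.jforum.net/posts/list/'
  || old_topic_id || '.page'',''http://www.coderanch.com/t/' || new_topic_id
  || ''') where post_id in (select post_id from jeanne_test where post_text like ''%http://www.jforum.net/posts/list/'
  || old_topic_id || '.page%'');'
from jeanne_topic_mapping;

Redirect the output to a file and that file is the script.  A final drop table jeanne_test; cleans up, and rerunning the step 1 query should come back empty.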

Conclusion

Database tuning is fun.  Explain is your friend.  As are different approaches.  And for those who aren’t doing the math, the performance jump was 3-4 orders of magnitude.

a sql quiz + is a lower hourly rate cheaper?

Suppose you have a task to write some JDBC code and you need to do so in the cheapest way possible.  Having your people grow should not be considered here; just the rate.

Person A

Charges 3X per hour and can write working code on the first shot, test it, and complete it within an hour.

Person B

Charges X per hour, but needs multiple cycles to fix the code once it is typed.  (This includes the time to identify each problem, plus build cycles and fix time.)

For the SQL quiz, see how many errors you can find in the following code.  I’ll post the answers as a comment.  See if you can spot any I didn’t insert on purpose.

PreparedStatement stmt = null;
ResultSet rs = null;
try {
  String sql = "select count(*) from table" + "where column = ?";
  stmt = conn.prepareStatement(sql);
  stmt.setString(1, "test");
  rs = stmt.executeQuery();
  System.out.println(rs.getInt(1));
  sql = "select count(*) from table2" + "where column = ?";
  stmt = conn.prepareStatement(sql2);
  stmt.setString(1, "test");
  rs = stmt.executeQuery();
  System.out.println(rs.getInt(1));
} finally {
  stmt.close();
  rs.close();
}

Which is better

In today’s economy, driving down costs is a hot topic.  One question service providers ask is “how can we lower the rate?”  This is the wrong question.  The question should be how to lower costs overall.  Person A is going to be cheaper overall even though the hourly rate is three times as high.  We’ve all heard the comment that a good developer is many times more productive than an average one.

Granted, my example is extreme.  It shows the difference between an experienced person and someone new to JDBC.  The point is to use the extreme to emphasize that it’s not all about the hourly rate.  In the real world, training the entry-level person has value too, of course.

Another view

This reminds me of the methodologies that ask you to estimate how many hours/days/weeks a task will take without saying who will do the work.  Well, if I do it, the work will take a week.  If an entry-level person is doing it, it may take four.  If someone experienced is doing it who doesn’t know the system, it may take two.  How do we balance the differences in people when estimating?

Memo: Avoid Functions in Database Queries

The most common type of question in the JavaRanch JDBC forum tends to be about improving performance in a database (that, and “Where can I download a JDBC Driver?”).  While troubleshooting performance issues remotely can be tricky, we often spot issues with the presented queries and offer alternatives to improve performance.  One issue I see from time to time is the use of database built-in functions in queries.  In general, these should be avoided at all costs in commonly executed queries, especially ones that could trigger the function to be applied to every record in a table.  This article will address the issue of using functions in database queries in more detail and provide explanations and tips for avoiding them in the future.

1. Throwing away indexes

By default, databases search an entire table for records.  The purpose of indexes, created automatically by some DBMSs, is to allow records to be retrieved faster.  In Big O notation, a sorted set of records supports finding an item within a range in O(log(n)) time, while an unsorted set has to be searched in its entirety, or O(n).  For hash searches on a specific key, the search time is near-constant, O(1), for properly balanced data structures.
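As a quick illustration (using the Widgets table that appears later in this article, with a hypothetical index name), creating a tree-based index is a one-liner, and an equality search on the indexed column can then use it:

CREATE INDEX idx_widgets_orderdate ON Widgets (orderDate);
SELECT * FROM Widgets WHERE orderDate = '2010-02-15 00:00:00';

With the index in place, the lookup is O(log(n)) instead of a full-table scan at O(n).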

Developers can create indexes on multiple columns in a table, but they have no direct control over how and when they are used.  It is the job of the DBMS query optimizer to apply what it thinks are the best indexes at the time the query is requested.  Queries that apply a function to a column will likely throw away any index that the query optimizer could have taken advantage of to search the table, resulting in a full-table scan every time the table is queried.  If you are lucky and your table only contains a few dozen records, this issue might not be noticeable, but for the rest of us, it could pose a serious problem.  There are some DBMSs, Oracle being one of them, that support function-based indexes, but they are far from standard practice.
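As a sketch (hypothetical table and index names), the first query below defeats a plain index on lastName because the optimizer only sees the function result, while the Oracle-style function-based index in the second statement makes the same predicate indexable:

SELECT * FROM Customers WHERE UPPER(lastName) = 'SMITH';
CREATE INDEX idx_customers_upper_name ON Customers (UPPER(lastName));

Without the function-based index, that query falls back to a full-table scan no matter what plain indexes exist on lastName.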

2. Goodbye Portability

My biggest issue with functions, like any database-specific feature, is the fact that they are database-specific features!  Once you write a query that takes advantage of a specific function, porting your application to a different database system becomes much more difficult.  Oftentimes, functions are used because there is a muddling of the data, application, and presentation layers.  I have seen developers use database-specific functions to format data (most commonly, dates) from SELECT queries that are transmitted from JDBC directly to the user.  If you have a strong mid-tier platform like Java, it is better to leverage the formatting functions within the language, not database-specific ones, to present data to the user.
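For example, instead of SELECTing DATE_FORMAT(orderDate, '%Y-%m-%d') and shipping a formatted string to the user, the mid-tier can format the raw value.  A sketch, assuming an open Connection conn, a hypothetical widgetId column, and an import of java.text.SimpleDateFormat:

PreparedStatement stmt = conn.prepareStatement(
    "select orderDate from Widgets where widgetId = ?");
stmt.setInt(1, 42);
ResultSet rs = stmt.executeQuery();
if (rs.next()) {
  // format in the application layer, not with a database-specific function
  SimpleDateFormat formatter = new SimpleDateFormat("yyyy-MM-dd");
  System.out.println(formatter.format(rs.getTimestamp("orderDate")));
}

The query stays portable across databases, and the presentation logic lives in the presentation layer where it belongs.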

Porting an application to a different database is non-trivial at best, but the use of database-specific functions will make the job much more difficult. And to those developers out there who often comment that switching databases never happens, it does. I’ve done it. And it’s not for the faint of heart.

3. Slow if used incorrectly

There are correct ways to use functions in database queries that are often overlooked when writing the query. For example, compare the following two MySQL-based queries, both of which search for orders placed on February 15, 2010 and use built-in functions:

SELECT * FROM Widgets WHERE DATE_FORMAT(orderDate,"%Y-%m-%d") = '2010-02-15';
SELECT * FROM Widgets WHERE orderDate = CAST('2010-02-15' AS DATETIME);

The first question is: are they equivalent?  The answer is no, but let’s skip that for now and discuss performance.  Which is likely to perform better on a database that contains a tree-based index on Widgets.orderDate?

Give up?  The second query, of course!  The second query applies a function to a constant, and most database query analyzers are intelligent enough to only apply this function once.  It then uses the range index to find the records in O(log(n)) time.  On the other hand, the first query performs a function call on every record in the table and therefore ignores the index – resulting in a slow table scan.

As for accuracy, if orderDate is a DATETIME, the second query will return only the results placed exactly at midnight on February 15th (2010-02-15 00:00:00), while the first query will return all results placed during the entire day.  No worries – there is an easy fix that still uses the index on orderDate for optimal performance:

SELECT * FROM Widgets WHERE orderDate >= CAST('2010-02-15' AS DATETIME)
     AND orderDate < CAST('2010-02-16' AS DATETIME);

Even though this adds a second comparison to the search, the sorted index can be applied to both.

Final Thoughts

Like many things in the world, database functions are a necessary evil that are required to solve certain problems.  The goal shouldn’t be to never use them, but to keep their use to a minimum.  One common solution, if the function is being applied to a column repeatedly, is to denormalize the data and add a new column containing the value of the original column with the function applied to it.  This is equivalent to how a function-based index works, for databases that do not have this feature built in.  The only overhead is the cost of maintaining the value in the new column each time the original value is updated; since reads are usually far more common than writes, this trade-off is often worthwhile in practice.  I am aware some developers have issues with denormalized data, but it is often necessary on large tables with queries that can run for minutes or even hours.

In the previous example, if searching on specific single dates is extremely common, the developer could add an orderDateSTR column that contains strings such as ‘2010-02-15’.  A hash index could then be built on the column, which would allow searches for single dates to be accomplished in near-constant time.  Granted, this is not useful for range queries, but it would be useful in situations where single-date searching needs to be fast on large data sets.
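A sketch of that denormalization in MySQL (column and index names hypothetical; the application or a trigger has to keep orderDateSTR in sync whenever orderDate changes):

ALTER TABLE Widgets ADD COLUMN orderDateSTR CHAR(10);
UPDATE Widgets SET orderDateSTR = DATE_FORMAT(orderDate, '%Y-%m-%d');
CREATE INDEX idx_widgets_orderdatestr ON Widgets (orderDateSTR) USING HASH;
SELECT * FROM Widgets WHERE orderDateSTR = '2010-02-15';

Note that the one-time DATE_FORMAT in the backfill is fine – it runs once, not on every query.  Whether the index is truly hash-based depends on the storage engine; engines that don’t support hash indexes fall back to a B-tree, which still turns the search into an O(log(n)) lookup.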