Memo: Avoid Functions in Database Queries

The most common type of questions in the JavaRanch JDBC forum tends to be about improving performance in a database (that, and “Where can I download a JDBC Driver?”). While remote trouble-shooting performance issues can be tricky, we often spot issues with the presented queries and offer alternatives to improve performance. One issue I see from time to time is the use of database built-in functions in database queries. In general, these should be avoided at all costs for commonly-executed queries, especially ones that could trigger the function to be applied to every record in a table. This article will address this issue of using functions in database queries in more detail and provide explanations and tips for avoiding them in the future.

1. Throwing away indexes

By default, databases search an entire table for records. The purpose of indexes, created automatically by some DBMS systems, is to allow records to be retrieved faster. Using Big O notation, a sorted set of records can find an item within a range in O(log(n)) while an unsorted set would have to search the entire table, or O(n). For hash searches on a specific key, the search time is near O(constant) for properly balanced data structures.

Developers can create indexes on multiple columns in a table, but they have no direct control over how and when they are used. It is the job of the DBMS query optimizer to apply what it thinks are the best indexes at the time the query is requested. Queries that apply a function to a column in a database will likely throw away any index that the query optimizer could take advantage of to search the table, resulting in a full-table scan every time the table is queried. If you are lucky and your table only contains a few dozen records, this issue might not be noticeable, but for the rest of us, this could pose a serious problem. While there are some DBMSs, Oracle being one of them, that support function-based indexes, but they are far from standard practice.

2. Goodbye Portability

My biggest issue with functions, like any database-specific feature, is the fact that they are database specific features! Once you write a query that takes advantage of a specific function, porting your application to a different database system becomes much more difficult. Oftentimes, functions are used because there is a muddling of the data, application, and presentation layer. I have seen developers use database-specific functions to format data (most commonly, dates) from SELECT queries that are transmitted from JDBC directly to the user. If you have a strong mid-tier platform like Java, it is better to leverage the formatting functions within the language, not database-specific ones, to present data to the user.

Porting an application to a different database is non-trivial at best, but the use of database-specific functions will make the job much more difficult. And to those developers out there who often comment that switching databases never happens, it does. I’ve done it. And it’s not for the faint of heart.

3. Slow if used incorrectly

There are correct ways to use functions in database queries that are often overlooked when writing the query. For example, compare the following two MySQL-based queries, both of which search for orders placed on February 15, 2010 and use built-in functions:

SELECT * FROM Widgets WHERE DATE_FORMAT(orderDate,"%Y-%m-%d") = '2010-02-15';
SELECT * FROM Widgets WHERE orderDate = (CAST '2010-02-15' as DATETIME);

First, question is, are they equivalent? The answer is no, but let’s skip that for now and discuss performance. Which is likely to perform better on a database that contains a tree-based index on Widgets.orderDate?

Give up? The second query, of course! The second query applies a function to a constant, and most database query analyzers are intelligent enough to only apply this function once. It then uses the range index to find the records in Log(n) time. On the other hand, the first query performs a function call on every record in the table and therefore ignores the index – resulting in a slow table scan.

As for accuracy, if orderDate is a DATETIME, the second query will return all the results placed only at midnight on February 2nd (2010-02-15 00:00:00) while the second query will return all results placed during the entire day. No worries – there is an easy fix that still uses the index on orderDate for optimal performance:

SELECT * FROM Widgets WHERE orderDate >= (CAST '2010-02-15' as DATETIME)
     AND orderDate < (CAST '2010-02-16' as DATETIME);

Even though this adds a second parameter to the search, the sorted index can be applied to both.

Final Thoughts

Like many things in the world, database functions are necessary evil that are required to solve certain problems. The goal shouldn’t be to never use them, but to keep using them to a minimum. One common solution, if the function is being applied to a column repeatedly, is to denormalize the data and add a new column containing the value of the original column with the function applied to it. This is equivalent to how a function-based index works, for databases that do not have this feature built-in. The only overhead is the cost of maintaining the value in the new column each time the original value is updated, although since reads are more common than writes, this is often easy in practice. I am aware some developers have issues with denormalized data, but it is often necessary on large tables with queries that can run for minutes or even hours.

In the previous example, if searching on specific single dates is extremely common, the developer could add an orderDateSTR column that contains strings such as ‘2010-02-15’. A hash index could then be built on the column which would allow searches for single dates to be accomplished in near constant time. Granted, this is not useful for range queries, but would be useful in situations where single date searching needed to be fast on large data sets.

Memo: Avoid Nested Queries in MySQL at all costs

Some of my readers may be aware that nested subqueries such as “SELECT * FROM widgets WHERE id IN (SELECT …)”, don’t work all that well in MYSQL. While the syntax is usually correct, the performance issues in practice can be horrendous. This article delves deeper into this issue, and why MySQL performs so poorly with nested subqueries, but not so deep as to drive us all crazy.

Nested Queries

Background

The first complex query I learned how to write was a nested subquery, or just nested query for short. At the time I was learning SQL and databases, it was the simple and most obvious of all the complex queries/joins: Find a set of records whose Id is in a list of outputs of another query.

SELECT id,name,price 
FROM widgets 
WHERE id IN (SELECT DISTINCT widgetId FROM widgetOrders)

In the example above, we first query the widgetOrders table for all unique widgets that have been sold based on widgetId (the DISTINCT doesn’t change the output of the query, but can help performance). After we have such a list, we select the id, name, and price of those widgets using data from the widget table.

First off, why do people like nested queries? As I said, they are pretty easy to understand, ESPECIALLY for non-programmers. But what are the alternatives to nested queries? Joins! Non-programmers (and even some programmers) find joins to be mystifying and scary things. Words like INNER JOIN, RIGHT OUTER JOIN, or even the shortcut symbols *= scare a lot of people. For example, the previous query can be rewritten as an INNER JOIN as follows:

SELECT DISTINCT w.id,w.name,w.price 
FROM widgets w 
INNER JOIN widgetOrders o ON w.id=o.widgetId

So why is this scarier? First, while aliases like ‘w’ and ‘o’ for table names were previously optional, they become almost required with complex joins, since we’re essentially mixing a two level query into a single level. Also, we have to add new syntax such as INNER JOIN … ON. There’s a lot more going on, and a lot more for beginners to pick up and/or be scared off by.

Why nested joins are bad in theory

The first big question of this article revolves around how query optimizers work. You can write a query a thousand different ways that would all output the same information and might seem equivalent to you, but query optimizers are just not that smart. The search space they have to cover to effectively optimize a query is massive, longer than could ever be searched in a reasonable amount of time. Therefore, query optimizes are often collections of greedy algorithims. Sure, they will do intelligent things when they can figure them out in time, but often they just look for the ‘quick path out’ using some simple heuristics. If a query optimizer thinks a particular plan may be the fastest, it won’t necessarily spend time verifying it; it will just act. Because of this, it is very easy to trip up or hinder a query optimizer.

This brings us to nested queries. Even the best query optimizers in the best database software available have trouble with nested queries. This is because they often cannot optimize them in any reasonable manner. As we saw in the example, I took two separate queries and merged them into one. Most query optimizers are not smart enough to do this since finding such a conversion would take too long, or in computing terms would require too large a search space and near-infinite time. In fact, many query optimizers will flat out refuse to optimize nested queries if it sees them. Therefore, a general rule of thumb is to avoid nested queries as much as possible since you are essentially blocking the query optimizer from touching that part of the query. You should stick with more traditional joins as much as possible since this encourages the query optimizer to find better query paths.

Why nested joins are really bad in MySQL

While nested queries may have been the first type of complex query I worked with, I never had serious problems with them and never spent hours reworking them to non-nested queries, until I started working in MySQL. Many nested queries you might easily write are capable of completely grinding your MySQL database to a halt under certain data conditions. MySQL has posted a list of excuses and tips (to fix your queries instead of their code) and there’s numerous forums posts, blogs, bug reports, and articles discussing the issue, but I’ll streamline it for you: MySQL does terrible things when handling nested subqueries; therefore, if you are using MySQL they should be avoided at all costs.

Note: This does not mean you should avoid the IN or NOT IN syntax, for example “WHERE id IN (1,2,3)” is just fine. The problems is when “1,2,3” is replaced with a subquery such as “SELECT …”.

But Scott, I need a nested query!

If you absolutely need a nested query, you can always perform two distinct queries in your application as such:

Set X = CallDatabase("SELECT DISTINCT widgetId FROM widgetOrders");
CallDatabase("SELECT id,name,price FROM widgets WHERE id IN ("+X+")");

As strange as it sounds to recommend two database calls over one, there are many real cases in MySQL where this will perform better than nested queries.

The Future

The problem with nested queries is that in many circumstances they will perform just fine, but change the data slightly and they can seriously harm database performance in MySQL. For example, strange things can happen if the subquery returns no records so that you end up with “WHERE id IN ()”. Many of the issues with subqueries have been logged as bug on MySQL’s support site, so it’s possible in future versions they will be safer to use. For now though, avoid them as long as you program with MySQL, lest you want to create headaches for yourself down the road.