Memo: Avoid Functions in Database Queries

The most common type of questions in the JavaRanch JDBC forum tends to be about improving performance in a database (that, and “Where can I download a JDBC Driver?”). While remote trouble-shooting performance issues can be tricky, we often spot issues with the presented queries and offer alternatives to improve performance. One issue I see from time to time is the use of database built-in functions in database queries. In general, these should be avoided at all costs for commonly-executed queries, especially ones that could trigger the function to be applied to every record in a table. This article will address this issue of using functions in database queries in more detail and provide explanations and tips for avoiding them in the future.

1. Throwing away indexes

By default, databases search an entire table for records. The purpose of indexes, which some DBMSs create automatically, is to allow records to be retrieved faster. Using Big O notation, a lookup against a sorted set of records can find an item or a range in O(log(n)), while an unsorted set requires searching the entire table, or O(n). For hash searches on a specific key, the search time is near constant, O(1), for properly balanced data structures.
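To make the difference concrete, here is a minimal sketch in Python that counts comparisons for a full scan versus a binary search over sorted data. The list of integers is only a stand-in for a table and its sorted index; the function names are made up for illustration, and a real DBMS uses on-disk tree structures rather than an in-memory list.

```python
def linear_search(rows, target):
    """Check every row in order, the way a full-table scan would."""
    steps = 0
    for position, value in enumerate(rows):
        steps += 1
        if value == target:
            return position, steps
    return -1, steps


def sorted_search(rows, target):
    """Binary search over sorted rows, halving the candidate range each step."""
    lo, hi, steps = 0, len(rows), 0
    while lo < hi:
        mid = (lo + hi) // 2
        steps += 1
        if rows[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    found = lo < len(rows) and rows[lo] == target
    return (lo if found else -1), steps


rows = list(range(1_000_000))  # already sorted, like an index on the column
scan_position, scan_steps = linear_search(rows, 999_999)
index_position, index_steps = sorted_search(rows, 999_999)
print(scan_steps, index_steps)  # one million comparisons versus at most 20
```

Both searches find the same record; only the amount of work differs, and the gap widens as the table grows.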

Developers can create indexes on multiple columns in a table, but they have no direct control over how and when they are used. It is the job of the DBMS query optimizer to apply what it thinks are the best indexes at the time the query is requested. Queries that apply a function to a column will likely throw away any index that the query optimizer could take advantage of to search the table, resulting in a full-table scan every time the table is queried. If you are lucky and your table only contains a few dozen records, this issue might not be noticeable, but for the rest of us, this could pose a serious problem. Some DBMSs, Oracle being one of them, support function-based indexes, but they are far from standard practice.

2. Goodbye Portability

My biggest issue with functions, like any database-specific feature, is the fact that they are database-specific features! Once you write a query that takes advantage of a specific function, porting your application to a different database system becomes much more difficult. Oftentimes, functions are used because there is a muddling of the data, application, and presentation layers. I have seen developers use database-specific functions to format data (most commonly, dates) from SELECT queries that are transmitted from JDBC directly to the user. If you have a strong mid-tier platform like Java, it is better to leverage the formatting functions within the language, not database-specific ones, to present data to the user.
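As a rough illustration of formatting in the application tier, the sketch below selects the raw timestamp and formats it in code. Python stands in here for whatever mid-tier language you use (in Java the same job would fall to java.time.format.DateTimeFormatter); the function name and the sample value are assumptions made for the example.

```python
from datetime import datetime


def format_order_date(order_date: datetime) -> str:
    """Format a raw timestamp for display, independent of any database."""
    return order_date.strftime("%Y-%m-%d")


# A value as it might come back from a plain "SELECT orderDate FROM Widgets",
# with no DATE_FORMAT or other vendor-specific function in the query.
raw_value = datetime(2010, 2, 15, 13, 45, 0)
label = format_order_date(raw_value)
print(label)
```

Because the query itself contains no formatting function, it runs unchanged on any database, and the display format can change without touching SQL.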

Porting an application to a different database is non-trivial at best, but the use of database-specific functions will make the job much more difficult. And to those developers out there who often comment that switching databases never happens, it does. I’ve done it. And it’s not for the faint of heart.

3. Slow if used incorrectly

There are correct ways to use functions in database queries that are often overlooked when writing the query. For example, compare the following two MySQL-based queries, both of which search for orders placed on February 15, 2010 and use built-in functions:

SELECT * FROM Widgets WHERE DATE_FORMAT(orderDate,'%Y-%m-%d') = '2010-02-15';
SELECT * FROM Widgets WHERE orderDate = CAST('2010-02-15' AS DATETIME);

First question: are they equivalent? The answer is no, but let’s skip that for now and discuss performance. Which is likely to perform better on a database that contains a tree-based index on Widgets.orderDate?

Give up? The second query, of course! The second query applies a function to a constant, and most database query optimizers are intelligent enough to apply this function only once. It then uses the range index to find the records in O(log(n)) time. On the other hand, the first query performs a function call on every record in the table and therefore ignores the index, resulting in a slow table scan.

As for accuracy, if orderDate is a DATETIME, the second query will return only the orders placed exactly at midnight on February 15th (2010-02-15 00:00:00), while the first query will return all orders placed during the entire day. No worries: there is an easy fix that still uses the index on orderDate for optimal performance:

SELECT * FROM Widgets WHERE orderDate >= CAST('2010-02-15' AS DATETIME)
     AND orderDate < CAST('2010-02-16' AS DATETIME);

Even though this adds a second parameter to the search, the sorted index can be applied to both.
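The behavior described above can be checked on a small scale. The sketch below uses SQLite through Python's built-in sqlite3 module purely as a convenient stand-in for MySQL, so the details are assumptions: dates are stored as text, SQLite's strftime plays the role of DATE_FORMAT, and the customer column and the index name are invented for the example. Other optimizers may report their plans differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Widgets (id INTEGER PRIMARY KEY, customer TEXT, orderDate TEXT)"
)
conn.execute("CREATE INDEX idx_widgets_order_date ON Widgets (orderDate)")
conn.executemany(
    "INSERT INTO Widgets (customer, orderDate) VALUES (?, ?)",
    [
        ("a", "2010-02-14 23:59:59"),
        ("b", "2010-02-15 00:00:00"),
        ("c", "2010-02-15 13:45:00"),
        ("d", "2010-02-16 00:00:00"),
    ],
)

# Function applied to the column: evaluated for every row.
FUNCTION_ON_COLUMN = (
    "SELECT * FROM Widgets WHERE strftime('%Y-%m-%d', orderDate) = '2010-02-15'"
)
# Equality against a constant: fast, but matches midnight only.
EQUALITY = "SELECT * FROM Widgets WHERE orderDate = '2010-02-15 00:00:00'"
# Half-open range: covers the whole day and still compares the bare column.
HALF_OPEN_RANGE = (
    "SELECT * FROM Widgets WHERE orderDate >= '2010-02-15 00:00:00' "
    "AND orderDate < '2010-02-16 00:00:00'"
)


def plan(sql):
    """Return the planner's description of how it will read the table."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


def count(sql):
    """Return how many rows the query matches."""
    return len(conn.execute(sql).fetchall())


for name, sql in [
    ("function on column", FUNCTION_ON_COLUMN),
    ("equality", EQUALITY),
    ("half-open range", HALF_OPEN_RANGE),
]:
    print(name, "->", count(sql), "rows;", plan(sql))
```

In this setup the function-based query and the range query return the same two orders, but only the range query is planned as an index search; the equality query uses the index yet finds just the midnight order.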

Final Thoughts

Like many things in the world, database functions are a necessary evil, required to solve certain problems. The goal shouldn’t be to never use them, but to keep their use to a minimum. One common solution, if the function is being applied to a column repeatedly, is to denormalize the data and add a new column containing the value of the original column with the function applied to it. This approximates what a function-based index does, for databases that do not have that feature built in. The only overhead is the cost of maintaining the value in the new column each time the original value is updated, and since reads are usually more common than writes, this cost is often acceptable in practice. I am aware some developers have issues with denormalized data, but it is often necessary on large tables with queries that can run for minutes or even hours.

In the previous example, if searching on specific single dates is extremely common, the developer could add an orderDateSTR column that contains strings such as ‘2010-02-15’. A hash index could then be built on the column which would allow searches for single dates to be accomplished in near constant time. Granted, this is not useful for range queries, but would be useful in situations where single date searching needed to be fast on large data sets.
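A minimal sketch of this denormalization follows, again using SQLite via Python's sqlite3 module as a stand-in. Several things here are assumptions: SQLite only offers tree-based indexes, so the hash index mentioned above is not reproduced; the derived column is maintained by hypothetical application helpers rather than a trigger; and the customer column and index name are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Widgets ("
    " id INTEGER PRIMARY KEY,"
    " customer TEXT,"
    " orderDate TEXT,"
    " orderDateSTR TEXT)"
)
conn.execute("CREATE INDEX idx_widgets_order_date_str ON Widgets (orderDateSTR)")


def insert_order(customer, order_date):
    """Store the derived date string in the same statement as the original."""
    conn.execute(
        "INSERT INTO Widgets (customer, orderDate, orderDateSTR) VALUES (?, ?, ?)",
        (customer, order_date, order_date[:10]),
    )


def update_order_date(widget_id, order_date):
    """Keep the derived column in step whenever the original changes."""
    conn.execute(
        "UPDATE Widgets SET orderDate = ?, orderDateSTR = ? WHERE id = ?",
        (order_date, order_date[:10], widget_id),
    )


insert_order("a", "2010-02-14 23:59:59")
insert_order("b", "2010-02-15 00:00:00")
insert_order("c", "2010-02-15 13:45:00")
update_order_date(1, "2010-02-15 08:00:00")  # moves order 1 onto the 15th

SINGLE_DATE = "SELECT * FROM Widgets WHERE orderDateSTR = '2010-02-15'"
matches = conn.execute(SINGLE_DATE).fetchall()
plan_rows = conn.execute("EXPLAIN QUERY PLAN " + SINGLE_DATE).fetchall()
plan_text = " | ".join(row[-1] for row in plan_rows)
print(len(matches), plan_text)
```

The query compares the bare orderDateSTR column to a constant, so no function runs per row at read time; the cost has moved to the two write helpers.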

Clone a PostgreSQL Database for Testing Cleanly

I’m looking at writing integration tests for the back end of JavaRanch’s JForum install.

A few “pesky” requirements/constraints

  • Multiple developers all over the world have their own local test databases filled with data in different states.  The tests must work for everyone.  Ideally they won’t leave data floating around either.
  • The tests must use PostgreSQL.  While the original JForum supported multiple databases, the JavaRanch version has been scaled down to just run with the one we need.  We do have some PostgreSQL specific SQL which rules out using an embedded database like HSQLDB or Derby.
  • Developers are using both Eclipse and IntelliJ.  Tests shouldn’t care about the IDE anyway, so this isn’t a big constraint.
  • Developers are using a variety of operating systems and languages on their operating systems.  While code is in English, there can’t be assumptions as to the OS state.

Strategy

I think the best strategy is to create a second database just for testing.  The JForum database would remain untouched and a jforum_integration_test database can be created for the tests.  dbUnit can control the state of that special database.

The problem

Before I even started thinking about dbUnit, I did a proof of concept to ensure I could create a new database from scratch using the command line.  Creating a database is the easy part.  The “hard” part is that JForum doesn’t come with a schema.  It comes with an installation servlet that creates the schema.  While few people will be creating a schema for JForum, the technique I used applies elsewhere.

The procedure “before”

  1. Start up the JForum war
  2. Go to the JForum install URL and enter some information which creates the tables
  3. Run the JavaRanch customizations.

How to clone a database for which you only have a partial script

  1. Create an empty database
    createdb jforum_integration_test
  2. Arrive at the base schema
    1. Go to the JForum installation URL
    2. Enter the information to create the tables
  3. Export the schema thus far
    pg_dump -U postgres jforum_integration_test > c:\temp\postgres.sql
  4. Provide instructions for running the rest of the SQL, which was created by our developers.

How to import

Now for the easy part!

Importing this dump is a matter of a single command:

psql -U postgres jforum < "pathToWorkspace\JForum\javaranch-docs\deployment\file.ddl"
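Since developers run different operating systems, one option is to script these steps instead of typing them. The sketch below only assembles the commands as argument lists, which sidesteps shell quoting differences. It makes several assumptions that are not part of the procedure above: the dropdb step is added so the rebuild is repeatable, psql reads the file with -f rather than input redirection, the target is the jforum_integration_test database from the Strategy section, and the function names and the ddl_path parameter are hypothetical.

```python
import subprocess

TEST_DB = "jforum_integration_test"  # name taken from the Strategy section


def rebuild_commands(ddl_path, user="postgres", database=TEST_DB):
    """Return the argv lists that drop, recreate, and load the test database."""
    return [
        ["dropdb", "--if-exists", "-U", user, database],
        ["createdb", "-U", user, database],
        ["psql", "-U", user, "-d", database, "-f", ddl_path],
    ]


def rebuild(ddl_path, **options):
    """Run each command in order, stopping at the first failure."""
    for argv in rebuild_commands(ddl_path, **options):
        subprocess.run(argv, check=True)


# Inspect the commands without running them; "file.ddl" is a placeholder path.
commands = rebuild_commands("file.ddl")
for argv in commands:
    print(" ".join(argv))
```

Calling rebuild() requires the PostgreSQL client tools on the PATH and will prompt for a password if the server asks for one.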

Lessons learned after

The next day I learned that this wasn’t enough.  We also needed some test data from the server.  I ran this a few times to get the relevant test data.

pg_dump --data-only --inserts -U user -W database --file roles  --table tableName

Conclusion

My next step will be to actually configure dbUnit against this new database and start writing tests.

Which Database to Start With?

When people ask me how to learn to use a database or how to write SQL queries, I tell them to pick a database system and immerse themselves in it. In fact, that advice goes for a lot of software technologies: just immerse yourself in a language, as programming tutorials are easy to come by these days. On the other hand, when people ask me which database software to use, I tend to pause. Most of the time, I recommend MySQL for beginners since it tends to be the most lightweight system to install and use, but I know it’s not always the easiest to understand. With the advent of new lightweight editions of often heavier products, perhaps it’s time I reconsider the issue.

1. MySQL: Free, lightweight, and readily available

MySQL stands out as the easiest for users to start with, in part because most people can get access to a MySQL database without having to set up anything. Most, if not all, hosting companies that offer database support do so in the form of a MySQL database. The only disadvantage with hosting solutions is that users lose the ability to run local applications on the database, often relying on phpMyAdmin for all database changes. I recommend that anyone serious about learning MySQL download and install it themselves, as there are plenty of installation platforms supported.

The good: Free. Easy to download and/or find an existing database to work with. Somewhat easy to install. Lots of free tools available. Good documentation.
The bad: If the installation or auto-configuration breaks, the user is left spending hours diagnosing the problems. The MySQL GUI tools, while nice, have to be downloaded separately from the server. Limited support. Clustering and support of large transaction systems are uncommon. Also, it can be buggy and unpredictable at times, as I’ve seen in practice.

2. Oracle: Heavy and Powerful

Oracle is one of the oldest database systems and stands out as a powerhouse among databases given its vast support for advanced clustering, memory management, and query optimization. If you need something robust, powerful, and able to support millions or billions of transactions a day, it’s the best there is. Oracle needs to be licensed for a production environment, although developers can download a free limited-use version which is good for building an application.

The good: Powerful. Can do some really cool things for those that appreciate it. Extremely scalable.
The bad: Often large and time-consuming installation. Least user friendly of all the database systems, although it’s gotten better over the last few years. Not free. Not a wide variety of tools, free or otherwise, to manipulate the database.

3. Microsoft SQL Server: Easy to use administration interface, often powerful

Microsoft SQL Server has matured greatly over the last 10 years into a decent rival of Oracle. I like MS SQL Server in that it hides a lot of the underlying configuration information from the user. On the other hand, I dislike MS SQL Server in that it hides a lot of the underlying configuration information from the user. Double-edged sword, I know. Like Oracle, you need a license if you want to use it in a production environment.

The good: Easy to set up new databases and administer them. Best for those who have no idea how to administer a database. New express editions can be used for free.
The bad: Over-simplifies a lot for advanced users, making it harder to optimize. Not free. Developer edition has nominal cost, although it probably should be free.

Other Databases

This article is not meant to be the end-all for database software discussion, but a beginning guide of the big three database systems for those who are not well-versed in the area. To cover every possible database software, such as PostgreSQL or DB2, as well as countless others, would take a book or two. Most students starting out just need to find a single database and start ‘playing’ with it until they get the hang of it, rather than an exhaustive discussion of which database is best.

Non-standard Databases

Some of you may be more familiar with embedded databases such as HSQLDB, SQLite, or Derby than the ones I have mentioned. Rarely do I see beginners using embedded databases, so perhaps I’ll write an article about such systems down the road. Also, I have purposely not mentioned Microsoft Access as a learning database, simply because I don’t consider it standard database software, but rather a glorified Excel spreadsheet. Most of teaching someone how to use a regular database after using Access is convincing them that not all databases are like Access.

My favorite database? If I’m teaching or writing a relatively simple web-application, MySQL. If someone else is paying for the license and the application is large enough, Oracle.