Memo: Avoid Functions in Database Queries

The most common type of questions in the JavaRanch JDBC forum tends to be about improving performance in a database (that, and “Where can I download a JDBC Driver?”). While remote trouble-shooting performance issues can be tricky, we often spot issues with the presented queries and offer alternatives to improve performance. One issue I see from time to time is the use of database built-in functions in database queries. In general, these should be avoided at all costs for commonly-executed queries, especially ones that could trigger the function to be applied to every record in a table. This article will address this issue of using functions in database queries in more detail and provide explanations and tips for avoiding them in the future.

1. Throwing away indexes

By default, databases search an entire table for records. The purpose of indexes, created automatically by some DBMS systems, is to allow records to be retrieved faster. Using Big O notation, a sorted set of records can find an item within a range in O(log(n)) while an unsorted set would have to search the entire table, or O(n). For hash searches on a specific key, the search time is near O(constant) for properly balanced data structures.

Developers can create indexes on multiple columns in a table, but they have no direct control over how and when they are used. It is the job of the DBMS query optimizer to apply what it thinks are the best indexes at the time the query is requested. Queries that apply a function to a column in a database will likely throw away any index that the query optimizer could take advantage of to search the table, resulting in a full-table scan every time the table is queried. If you are lucky and your table only contains a few dozen records, this issue might not be noticeable, but for the rest of us, this could pose a serious problem. While there are some DBMSs, Oracle being one of them, that support function-based indexes, but they are far from standard practice.

2. Goodbye Portability

My biggest issue with functions, like any database-specific feature, is the fact that they are database specific features! Once you write a query that takes advantage of a specific function, porting your application to a different database system becomes much more difficult. Oftentimes, functions are used because there is a muddling of the data, application, and presentation layer. I have seen developers use database-specific functions to format data (most commonly, dates) from SELECT queries that are transmitted from JDBC directly to the user. If you have a strong mid-tier platform like Java, it is better to leverage the formatting functions within the language, not database-specific ones, to present data to the user.

Porting an application to a different database is non-trivial at best, but the use of database-specific functions will make the job much more difficult. And to those developers out there who often comment that switching databases never happens, it does. I’ve done it. And it’s not for the faint of heart.

3. Slow if used incorrectly

There are correct ways to use functions in database queries that are often overlooked when writing the query. For example, compare the following two MySQL-based queries, both of which search for orders placed on February 15, 2010 and use built-in functions:

SELECT * FROM Widgets WHERE DATE_FORMAT(orderDate,"%Y-%m-%d") = '2010-02-15';
SELECT * FROM Widgets WHERE orderDate = (CAST '2010-02-15' as DATETIME);

First, question is, are they equivalent? The answer is no, but let’s skip that for now and discuss performance. Which is likely to perform better on a database that contains a tree-based index on Widgets.orderDate?

Give up? The second query, of course! The second query applies a function to a constant, and most database query analyzers are intelligent enough to only apply this function once. It then uses the range index to find the records in Log(n) time. On the other hand, the first query performs a function call on every record in the table and therefore ignores the index – resulting in a slow table scan.

As for accuracy, if orderDate is a DATETIME, the second query will return all the results placed only at midnight on February 2nd (2010-02-15 00:00:00) while the second query will return all results placed during the entire day. No worries – there is an easy fix that still uses the index on orderDate for optimal performance:

SELECT * FROM Widgets WHERE orderDate >= (CAST '2010-02-15' as DATETIME)
     AND orderDate < (CAST '2010-02-16' as DATETIME);

Even though this adds a second parameter to the search, the sorted index can be applied to both.

Final Thoughts

Like many things in the world, database functions are necessary evil that are required to solve certain problems. The goal shouldn’t be to never use them, but to keep using them to a minimum. One common solution, if the function is being applied to a column repeatedly, is to denormalize the data and add a new column containing the value of the original column with the function applied to it. This is equivalent to how a function-based index works, for databases that do not have this feature built-in. The only overhead is the cost of maintaining the value in the new column each time the original value is updated, although since reads are more common than writes, this is often easy in practice. I am aware some developers have issues with denormalized data, but it is often necessary on large tables with queries that can run for minutes or even hours.

In the previous example, if searching on specific single dates is extremely common, the developer could add an orderDateSTR column that contains strings such as ‘2010-02-15’. A hash index could then be built on the column which would allow searches for single dates to be accomplished in near constant time. Granted, this is not useful for range queries, but would be useful in situations where single date searching needed to be fast on large data sets.

Starting a voluntary discussion group

I co-founded a lunchtime discussion group at work five years ago.  As we approach the fifth anniversary, I was reflecting on what went into it.  What worked; what didn’t; what would I do differently.  Our purpose was to have a place to discuss technology that we weren’t necessarily working on and to share knowledge/ideas across teams/departments.

Background

Technically the group started more than five years ago as a technical book club.  That didn’t work because everyone had to read from the same book.  Which requires preparation each meeting.  It also makes it hard to come back after you’ve missed a meeting.  The organizer of that group and I had an informal conversation near the cafeteria hallway about how to improve things and the new group was born!

The new group

We started meeting twice a month and switched to once a month over time.  We have a mix of moderated discussion topics and presentations with heavy Q & A.  At the beginning, we said we would cancel the meeting if there weren’t 4 people in the room.  This hasn’t been a problem in years.

What I consider key to the group’s success

  • We always meet the same day and time – this makes it easy for people to remember when the group meets.  For example, the first Tuesday of the month; the first and third Thursday of the month, etc.
  • Advertising – We advertise through an opt-in mailing list (don’t spam people), telling people who might be interested and putting up physical signs the day of the meeting.  Speaking of which these signs are always the same color/font/logo to encourage branding/memory.  People have even said “you are the blue sign people”.
  • Voluntary – This one is important to me.  Attending the group is voluntary.  Running the group is voluntary.  This means it is not a work thing and does not get billed to work projects.  It also allows the group to remain independent topic wise and FUN.
  • Informal – While we do have presentations, they are usually of an informal type. (Sometimes someone is previewing a formal presentation.)  Again, this is in keeping with the goal of things being FUN.
  • Size of group – We haven’t had problems with the size of the group per se.  We get between 6 and 20 people most months which is a good size.  We’ve been asked about including other offices in other states in the group.  We do have a phone bridge for people within the company and have invited people from one or two other offices on occasion recently.  And this is fine if it is small.  If there are too many from too many offices, it changes the spirit of the group.  And there are already company wide information exchange type groups in place.  This particular item is something we are going to have to monitor so we don’t become victims of our own success and lose the benefits of a New York based group.
  • Varying formats – Having a mix of presentations and moderated discussion has served us well.  Presentations are good when someone has a lot of experience with a technology.  (or can throw something together on it.)  Moderated discussion is good when multiple people have lots of small thoughts.  And the moderator doesn’t have to be an expert.  Just like at JavaRanch where I moderate the BlackBerry forum without knowing anything about BlackBerry development.  Whenever I moderate at the discussion group, I come with a list of questions/talking points.
  • Hook to real work topics – Most of the time topics involve technologies or emerging technologies someone in the company is working on.  Or planning to work on.  Or a cross cutting concern like performance.
  • Range of topics – You’ll notice I said most of the time.  Some of our topics have nothing to do with work.  There’s that FUN thing again.  We also select topics based on what attendees want to hear.
  • Emphasis on interactiveness – While we do have people who just sit and listen, most participants are actively engaged.  This keeps a good dynamic, keeps in informal and keeps it FUN.
  • Different moderators – Trying to have as wide a range of people moderate as possible is helpful.  It gives the group a changing voice.  It helps more people feel invested in the group.  And it gives people a chance to practice leading.
  • Group leaders changing over time – We have had 5 different people co-lead the group at different points it time.  (Co-leaders are good because it is less work and you don’t have to worry about vacations.)  One person left the company.  And two others (including myself) stepped down from actively running the group after getting job responsibilities that would have compromised the informal appearance of the group.  The two of us still attend/moderate and are plenty involved.  But we don’t run it or give the opening/closing remarks anymore.
  • Topic control – I mentioned we are an informal group.  This gives a stronger ability to decline topics.  At times, people from outside the group come to us with ideas.  Some are good ideas and we use them.  Some are “good ideas that don’t fit.”  (There’s no such thing as a bad idea; we like ideas; that’s how brainstorming happens.)  Being able to decline these ideas keeps the group FUN and interesting.

What to think about if starting your own discussion group?

  1. What is your goal?– Knowing what your goal/vision is for the group helps you set it up in a way conclusive to what you want to accomplish.  It also tells you who to advertise your group to.
  2. Do you want it to be official or unofficial?– Both are valid options.  They have different tones/topics/styles and deciding which you have is important.  (Incidentally, the reason I stepped down from running this group is because I became involved with an official one.)
  3. Do you want it to be voluntary or mandatory? – Even an official group is hard to make mandatory.  Beware of this option.
  4. When is good to meet?– Think about when meetings are generally scheduled in your company.  It is ok to change this until you stabilize.  But then try to stick to the same date/time.
  5. What is a good group size?– What is your target?  Is there a number that is too high or too low?
  6. How do you gain critical mass? – It’s much easier to run a group than to start one.  Think about what you can do to get people to come to your first few meetings.

Summary

I used the word FUN a lot in this post.  That’s really the vision for this group.  Good techies are passionate about technology.  While that doesn’t always come out amidst business activities, our group is an opportunity to channel that pure FUN for what we do.

time warner’s cutomer service

I’ve had the “pleasure” of dealing with Time Warner customer support two or three times in the last month.  (One time the queue was too long and I gave up.)

The one thing I liked

Time Warner lets you contact customer service via online chat rather than sitting on the phone.  It still takes forever, but at least one can do something else during that time.  The first time I was #41.  The second time I was #37 (and decided not to wait.)  Today I was #15.  The pace crawled today compared to the first time.   Now that we’ve gotten the good stuff out of the way, let’s go through what happened.

Contact #1

Saturday September 18th I contacted customer service to make sure they knew the cable was out in my building.  This was about a day and a half after a tornado hit my area.  Both Verizon DSL/Time Warner Cable were out in my building and DSL had just come back.  (luckily power was not affected.)  Now I figured someone else had already reported the cable being out, but was just checking in case.

Result: The rep said he would credit my account for the inconvenience.  That’s nice, but I have to complaint to get a credit?  They know the whole building didn’t have cable.  It should be automatic.  I’m a bit peeved you have to complain to get a credit.

Contact #1.5

My DVR was ok so I watched some recorded shows.  On Monday, I thought “surely it will have been fixed by now” and decided to reboot the cable box.  After all, it helps with computers.  After rebooting, my cable box couldn’t tell what time it was, what was on the DVR or that I was even a customer.  Not good.  I tried to contact customer service to ask about this, but the queue was too long and I gave up.

The high tech notification system

I found out the cable was back Tuesday when I threw out the garbage and noticed the super of our building put a sign on the elevator saying the cable was back.  Thank you!  I had tried when I got home and the cable was still out.  I wasn’t going to try again that night.

Contact #2

I got my bill e-mailed and no credit.  Sigh.  I thought we went through this on September 18th.  The rep I chatted with this time (Pablo) says there will be a credit of $19.42 on next month’s bill.  If  it isn’t done then, I’m giving up the time and calling on the telephone to speak to the supervisor.  That would be time to ask for *another* inconvenience credit for the time I’ve wasted with their lousy support.

False promises

The chat form asks if you want to be e-mailed a transcript.  I clicked yes and never received anything.  Why offer it if you aren’t going to send one?

And finally

And finally, I’d like to commend Con Edison,Verizon and the city/parks/etc for their prompt work in fixing things, removing trees/debris and overall making the area.

Pattern.MULTILINE