Live from TSSJS – Tips for Hibernate with Patrycja

This morning I’m live blogging from TheServerSide Java Symposium 2011 at the breakout “Anti-Patterns and Best Practices for Hibernate” presented by Patrycja Wegrzynowicz.

Making Good Software
Patrycja’s primary goal is to “understand what makes good software good and bad software bad” and she is working to formalize these concepts.  She would like to develop automated code tools that can detect ahead of time whether or not code contains anti-patterns and potential performance issues.

For those not familiar with the concept, an anti-pattern is an obvious or common strategy which leads to degraded or counter-intuitive performance.

Is Java Fat
Patrycja asks the question “Is Java fat/slow?”.  She claims the language is not, in fact, slow, but that some of the ways in which we program objects is slow, which I agree with strongly.  For example, I have seen computational entensive code that improved in memory/speed by large degrees by relying on primitive variables, rather than more complex objects, in part because “new” is considered expensive in Java.

Automated Code Analysis
Requires large knowledge base.  Patrycja has been researching semantic code query system which focusses on behavior, not just structure.

Conclusion
Patrycja says that proper usage of hibernate is tricky.  Transactional issues and correctness of code under various scenarios.  Best solution may be combination of Hibernate and JDBC.

I liked that fact that she stepped back and indicate Hibernate, by itself, isn’t perfect.  I agree with her take that the best systems are often hybrid systems that take advantage of the ‘low hanging fruit’ of both worlds, and avoid the pitfalls of each.

Live from TSSJS – PostgreSQL with Tom

Live blogging from TheServerSide Java Symposium with Tom Kincaid at his vendor presentation “Introduction to PostgreSQL for Development and Deployment”.  Tom spends a lot of time contrasing Postgres with MySQL, and commenting how since Oracle’s aquisition of MySQL, the licensing of MySQL is now much more restrictive. Tom says the licensing of Postgres is basically “Do what you want with the code but don’t bother us”.

1. Why has it not the most adopted open source DB?
Tom talks about some of the limitation of Postgres that prevented it from becoming one of the most adopted DBMS software including:

  • Early versions were Linux only with Windows support coming later
  • Installation was difficult
  • Default configuration was not neccessarily the best, and was designed for widest platform adoption
  • Lacked bundled distribution tools

Tom points out that originally, developers “first 20 mins” of using the product was often frustrating and would turn people off to the software.

2. Today’s Postgres
Postgres has made a number of changes in the last few years to grow Postgres has a more developer-friendly and easier-to-use product.  It also has been extended to work with all major IDEs, object-relational mapping tools such as JDBC, ODBC drivers, and stored procedures based in Java.  They have also spent time improving the GUI tools to compete with other major DBMS providers.  The GUI also shows performance tools and query plans.

Conclusion
The vendor sessions were abridged compared to the normal sessions so there was a lot Tom did not get to.  He did peak my interest in learning more about Postgres, now that it has grown to a more substational platform.  Tom’s belief is that everyone should use Postgres given its power and hands-off licensing, especially compared to the direction MySQL may be going.

Memo: Avoid Functions in Database Queries

The most common type of questions in the JavaRanch JDBC forum tends to be about improving performance in a database (that, and “Where can I download a JDBC Driver?”). While remote trouble-shooting performance issues can be tricky, we often spot issues with the presented queries and offer alternatives to improve performance. One issue I see from time to time is the use of database built-in functions in database queries. In general, these should be avoided at all costs for commonly-executed queries, especially ones that could trigger the function to be applied to every record in a table. This article will address this issue of using functions in database queries in more detail and provide explanations and tips for avoiding them in the future.

1. Throwing away indexes

By default, databases search an entire table for records. The purpose of indexes, created automatically by some DBMS systems, is to allow records to be retrieved faster. Using Big O notation, a sorted set of records can find an item within a range in O(log(n)) while an unsorted set would have to search the entire table, or O(n). For hash searches on a specific key, the search time is near O(constant) for properly balanced data structures.

Developers can create indexes on multiple columns in a table, but they have no direct control over how and when they are used. It is the job of the DBMS query optimizer to apply what it thinks are the best indexes at the time the query is requested. Queries that apply a function to a column in a database will likely throw away any index that the query optimizer could take advantage of to search the table, resulting in a full-table scan every time the table is queried. If you are lucky and your table only contains a few dozen records, this issue might not be noticeable, but for the rest of us, this could pose a serious problem. While there are some DBMSs, Oracle being one of them, that support function-based indexes, but they are far from standard practice.

2. Goodbye Portability

My biggest issue with functions, like any database-specific feature, is the fact that they are database specific features! Once you write a query that takes advantage of a specific function, porting your application to a different database system becomes much more difficult. Oftentimes, functions are used because there is a muddling of the data, application, and presentation layer. I have seen developers use database-specific functions to format data (most commonly, dates) from SELECT queries that are transmitted from JDBC directly to the user. If you have a strong mid-tier platform like Java, it is better to leverage the formatting functions within the language, not database-specific ones, to present data to the user.

Porting an application to a different database is non-trivial at best, but the use of database-specific functions will make the job much more difficult. And to those developers out there who often comment that switching databases never happens, it does. I’ve done it. And it’s not for the faint of heart.

3. Slow if used incorrectly

There are correct ways to use functions in database queries that are often overlooked when writing the query. For example, compare the following two MySQL-based queries, both of which search for orders placed on February 15, 2010 and use built-in functions:

SELECT * FROM Widgets WHERE DATE_FORMAT(orderDate,"%Y-%m-%d") = '2010-02-15';
SELECT * FROM Widgets WHERE orderDate = (CAST '2010-02-15' as DATETIME);

First, question is, are they equivalent? The answer is no, but let’s skip that for now and discuss performance. Which is likely to perform better on a database that contains a tree-based index on Widgets.orderDate?

Give up? The second query, of course! The second query applies a function to a constant, and most database query analyzers are intelligent enough to only apply this function once. It then uses the range index to find the records in Log(n) time. On the other hand, the first query performs a function call on every record in the table and therefore ignores the index – resulting in a slow table scan.

As for accuracy, if orderDate is a DATETIME, the second query will return all the results placed only at midnight on February 2nd (2010-02-15 00:00:00) while the second query will return all results placed during the entire day. No worries – there is an easy fix that still uses the index on orderDate for optimal performance:

SELECT * FROM Widgets WHERE orderDate >= (CAST '2010-02-15' as DATETIME)
     AND orderDate < (CAST '2010-02-16' as DATETIME);

Even though this adds a second parameter to the search, the sorted index can be applied to both.

Final Thoughts

Like many things in the world, database functions are necessary evil that are required to solve certain problems. The goal shouldn’t be to never use them, but to keep using them to a minimum. One common solution, if the function is being applied to a column repeatedly, is to denormalize the data and add a new column containing the value of the original column with the function applied to it. This is equivalent to how a function-based index works, for databases that do not have this feature built-in. The only overhead is the cost of maintaining the value in the new column each time the original value is updated, although since reads are more common than writes, this is often easy in practice. I am aware some developers have issues with denormalized data, but it is often necessary on large tables with queries that can run for minutes or even hours.

In the previous example, if searching on specific single dates is extremely common, the developer could add an orderDateSTR column that contains strings such as ‘2010-02-15’. A hash index could then be built on the column which would allow searches for single dates to be accomplished in near constant time. Granted, this is not useful for range queries, but would be useful in situations where single date searching needed to be fast on large data sets.