Database Performance – Block Updates over the Internet

When you run a website in a hosted web environment, you often no longer have the luxury of direct or intranet access to your database. With remote tools such as phpMyAdmin becoming increasingly popular among web hosting providers, a remote connection over the internet may be the only access a user has to their database. Every once in a while a user may have an application that needs to perform a large number of inserts into their database, and in many situations this can become a daunting task. The purpose of this article is to bring the problem of large updates to light and discuss some solutions.

  • The Problem

Let’s say I’m writing a script within my application to initialize a new table in my database, and this table requires the insertion of 100,000 items. Anyone who has ever tried something like this over the internet is probably aware of some of the limitations. Web hosting providers tend to impose a number of timeouts throughout the system, such as limits on how long a web connection can stay open or how long a database connection can be held. Both of these timeouts, often set at 60 seconds, can easily be reached if the 100,000 records are being transmitted over the internet.

For example, you may have tried to import a large file in phpMyAdmin and seen the screen go white and the transfer stop. This is often because you have reached one of the server’s predefined timeouts. Other times, the page may explicitly throw a timeout exception.

The core of the problem is that the server is unwilling to keep a connection open for the length of time required to perform the update. So what is a developer to do? Well, let’s address three potential solutions for dealing with the problem:

  • 1. Naive Approach: One Update at a Time

It’s a reasonable guess that 99% of web applications perform updates one at a time, with each command sent to the database as soon as it executes. Most, if not all, developers will stick to this pattern until they have a reason to do otherwise.

Going back to our example of inserting 100,000 items, how would this approach handle it? It would create 100,000 connections to the database, one at a time, of course. The problem is that the overhead of creating each connection makes this script, while correct, the most time- and resource-consuming approach of any we will discuss in this article. While establishing a single connection to a database is normally a trivial thing, doing it 100,000 times is not.

Note: By connection, I’m referring to total round trips to the database, not necessarily the individual connection objects you create within your application.

If done as part of a transaction, this script will execute perhaps 20,000 items before throwing a timeout exception, at which point all previous inserts will be rolled back. If the inserts are not in a transaction, some of them do go through, and it can be frustrating from an application perspective to modify the application to pick up where it left off. Even when this approach is capable of completing successfully, the overhead of 100,000 individual round trips to the database will often make the script run very slowly in the real world.
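To make the pattern concrete, here is a minimal JDBC sketch of the one-at-a-time approach. The items table, its name column, and the connection details are hypothetical, and real code would add proper error handling:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

public class OneAtATimeInsert {

    // Naive approach: one round trip to the database per record.
    // The table, column, and connection details below are made up.
    public static void insertItems(List<String> items) throws Exception {
        String url = "jdbc:mysql://dbhost.example.com/mydb";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO items (name) VALUES (?)")) {
            for (String item : items) {
                ps.setString(1, item);
                ps.executeUpdate(); // each call is a separate round trip over the internet
            }
        }
    }
}
```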

  • 2. Risky Approach: All Updates at Once

One potentially good solution is to upload the entire set of records in a single database connection. If the total size of the records is not too big (5 megabytes, for example), the update will likely succeed, and at an extremely fast rate compared to the first approach.

So where does this solution go wrong? Well, let’s say the total size of the 100,000 records is 100 megabytes. It is often the case that the file can never finish uploading to the server before the timeout is reached. As with the phpMyAdmin example of the white screen, the server won’t maintain a connection long enough to transfer the target file to the database.

Keep in mind that uploading the large set of records to the application server first may not solve this problem. I’ve seen cases where files local to the application server still failed, because the connection between the application server and the database within a hosted environment was simply not fast enough to transfer the file and perform the update.
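For comparison, one way to realize the all-at-once approach is a single multi-row INSERT sent in one round trip. This is only a sketch against the same hypothetical items table; real code would escape or parameterize values properly rather than concatenating strings:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.List;
import java.util.StringJoiner;

public class AllAtOnceInsert {

    // Risky approach: build one giant multi-row INSERT and send it in a single round trip.
    public static void insertItems(List<String> items) throws Exception {
        StringJoiner values = new StringJoiner(", ");
        for (String item : items) {
            // Crude escaping for illustration only; real code should use parameters.
            values.add("('" + item.replace("'", "''") + "')");
        }
        String sql = "INSERT INTO items (name) VALUES " + values;

        String url = "jdbc:mysql://dbhost.example.com/mydb";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement stmt = conn.createStatement()) {
            // One huge statement; it may hit the timeouts described above or
            // server-side limits such as MySQL's max_allowed_packet setting.
            stmt.executeUpdate(sql);
        }
    }
}
```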

  • 3. Powerful Approach: Block Updates

In the first solution we saw the overhead of creating thousands of database connections cause the time required to perform the update to grow drastically, whereas in the second solution the raw speed was great but the transfer could not finish within the boundaries of most database connection timeouts. The third and last solution I’ll discuss is to perform the update as a set of blocks.

For example, let’s say we took the 100,000 records and divided them into 20 blocks of 5,000 records. A quick comparison of performance yields:

Solution                 | Database Connection Count | Largest File Size Per Database Connection
1 Update at a Time       | 100,000                   | 1 kilobyte
All Updates at Once      | 1                         | 100 megabytes
Block Updates (n=5000)   | 20                        | 5 megabytes

From this table we see that the block solution retains the main performance advantage of the second solution, namely few database connections (the difference between 1 and 20 connections is quite negligible), yet it never sends a file bigger than 5 megabytes, so it is far less likely to fail while transferring a large file. Furthermore, we can double the number of blocks to 40 blocks of 2,500 records and still have great performance (40 connections versus 1) with files half the size. In general, you would implement such a solution with the block size, n, determined at runtime or in a properties file so that it can be easily changed. Also keep in mind that the last block will usually not be full. For example, if I had 99,995 records, the last block of 5,000 would only contain 4,995 items, and it is important to make sure the code does not mistakenly assume a full block.
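Here is a minimal sketch of the block approach, again against the hypothetical items table, with the block size passed in so it can come from a properties file or runtime configuration:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.List;
import java.util.StringJoiner;

public class BlockInsert {

    // Block approach: send the records in blocks of blockSize rows, one round trip per block.
    public static void insertItems(List<String> items, int blockSize) throws Exception {
        String url = "jdbc:mysql://dbhost.example.com/mydb";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement stmt = conn.createStatement()) {
            for (int start = 0; start < items.size(); start += blockSize) {
                // The last block may hold fewer than blockSize items; subList handles that.
                int end = Math.min(start + blockSize, items.size());
                StringJoiner values = new StringJoiner(", ");
                for (String item : items.subList(start, end)) {
                    // Crude escaping for illustration only; real code should use parameters.
                    values.add("('" + item.replace("'", "''") + "')");
                }
                stmt.executeUpdate("INSERT INTO items (name) VALUES " + values);
            }
        }
    }
}
```

Calling insertItems(records, 5000) reproduces the 20-block scenario from the table above, and halving the block size to 2,500 reproduces the 40-block variant.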

  • Real World Application: Does this work in practice?

Yes, most definitely. I can recall a situation where I was faced with exactly this issue. The first solution, which inserted one record at a time, took over an hour to run. The second solution, inserting everything at once, almost never completed because it would time out long before it was finished. The third solution, inserting the records as blocks, always completed, and often within 5 minutes.

While this article describes one type of scenario where block updates are the best approach, there are a lot of factors that could affect what you do in your own application, such as the ability to increase the timeout values within your hosted environment. There are also more advanced options, such as the batch updates provided by JDBC, as well as the ability to run SQL scripts locally if you have shell access to your database. Overall, this article is meant to remind you that while it’s common to ignore the performance considerations of large updates in the real world (we’ll just get faster internet!), there are some excellent gains to be made if you spend some time considering how your application handles them.
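As an aside, a minimal sketch of the JDBC batch updates mentioned above might look like the following, again with a hypothetical table and column:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;

public class JdbcBatchInsert {

    // JDBC batch updates: queue statements with addBatch() and flush them
    // to the database in groups with executeBatch().
    public static void insertItems(Connection conn, List<String> items, int batchSize)
            throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO items (name) VALUES (?)")) {
            int count = 0;
            for (String item : items) {
                ps.setString(1, item);
                ps.addBatch();
                if (++count % batchSize == 0) {
                    ps.executeBatch(); // send the accumulated batch in one go
                }
            }
            ps.executeBatch(); // flush any remaining statements
        }
    }
}
```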
