“it’s your job to defend the code”

As I read “Clean Code” by Robert C. Martin, I was particularly drawn to the passage about why unrealistic commitments should not be an excuse for writing bad code.  He points out that while it is a manager’s job to “defend the schedule and requirements”, it is a developer’s “job to defend the code with equal passion.”

What management says vs what management wants

One way this becomes a problem is when a developer hears management say “I want it by Tuesday.”  They may want it by Tuesday, but does that mean they are getting it by Tuesday?  The manager also wants it to work correctly and meet all the requirements.  A manager isn’t going to be happy if we claim the task is done on Tuesday and then a whole pile of rework needs to be done afterwards.  That really means the task wasn’t done by Tuesday and we just claimed it was.

“Done”

The manager also doesn’t want this one date to be met at the cost of meeting other dates.  One way this comes up is by saying the task is done while silently thinking that parts of it will be done “later.”  Regardless of whether “later” ever comes about, the task isn’t REALLY done.  When I disagree with someone on whether a task is done, I’ll ask if they need any more time to work on things related to it.  This may surface parts of the task that aren’t really done.  Sometimes it brings up additional tasks that were discovered along the way.  Either way, the answer is often productive.

Trust

When unrealistic estimates come up, there are often two responses.  One is to silently accept the date knowing it will not be met (or will be met at the cost of other things) because it is “what the manager wants.”  The other is to discuss with management why the estimate will not be met.  Of course, the latter requires much more energy at first.  However, it pays off by giving everyone a realistic understanding of the project.

Schedule gambling

These other desires tend to be implicit, so all we hear is the deadline.  That doesn’t mean the manager wants it done by Tuesday at the expense of all else.  I’m fond of saying that management would rather meet the production date than a development date.  So if we are meeting the development date at the expense of something that affects the production date, management will ultimately be less happy.  That something may be sloppiness that makes other tasks take longer, or leaving parts of the task incomplete.

What all of these things have in common is that it is our responsibility as developers to keep management up to date and to give them accurate visibility into what is going on within the project.

---

This coming week we will be hosting Uncle Bob (Robert C Martin) at JavaRanch in the Agile and Other Processes forum.  Come on in, ask him a question or join the discussion.  You might even win a free copy of the book Clean Code: A Handbook of Agile Software Craftsmanship.

Database Performance – Block Updates over the Internet

When you run a website within a hosted web environment, you often no longer have the luxury of direct or intranet access to your database. With remote administration tools such as phpMyAdmin becoming increasingly popular among web hosting providers, remote connections over the internet may be the only access a user has to their database. Every once in a while a user may have an application that needs to perform a large number of inserts into their database, and in many situations this can become a daunting task. The purpose of this article is to bring the problem of large updates to light and discuss some solutions.


  • The Problem

Let’s say I’m writing a script within my application to initialize a new table in my database, and this table requires the insertion of 100,000 items. Anyone who has ever tried something like this over the internet is probably aware of some of the limitations. Web hosting providers tend to impose a number of timeouts throughout the system, such as on how long a user can keep a web connection open or how long a user can hold a database connection. Both of these timeouts, often set at 60 seconds, can easily be reached if the 100,000 records are being transmitted over the internet.

For example, you may have tried to insert a large file in phpMyAdmin and seen the screen go white and the transfer stop. This is often because you have reached one of the server’s predefined timeouts. Other times, the page may explicitly throw a timeout exception.

The core of the problem is that the server is unwilling to keep a connection open for the length of time required to perform the update. So what is a developer to do? Well, let’s address three potential solutions for dealing with the problem:

  • 1. Naive Approach: One Update at a Time

It’s a reasonable guess that 99% of web applications perform updates one at a time, with commands sent to the database immediately upon execution. Most, if not all, users will stick to this pattern until they have reason to do otherwise.

Going back to our example of inserting 100,000 items, how would this approach handle it? Well, it would create 100,000 connections to the database, one at a time, of course. The problem is that the overhead of creating each connection makes this script, while it executes correctly, the most time- and resource-consuming approach of any we will discuss in this article. While establishing a connection to a database is normally a trivial thing, doing it 100,000 times is not.

Note: By connection, I’m referring to total round trips to the database, not necessarily individual connection objects you create within your application.

If done as part of a transaction, this script will execute perhaps 20,000 inserts before throwing a timeout exception, at which point all previous inserts will be rolled back. Furthermore, if some of the inserts do go through, it can be frustrating to modify the application to pick up where it left off. Even when this approach is capable of completing successfully, the overhead of connecting to the database 100,000 individual times will often make the script run very slowly in the real world.
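As a sketch of the shape of this approach, consider the following; the insertRecord callback is a hypothetical stand-in for the real JDBC work of opening a connection, running a single INSERT, and closing the connection again:

```java
import java.util.List;
import java.util.function.Consumer;

public class OneAtATime {

    // Each call to insertRecord stands in for a full round trip:
    // open a connection, run one INSERT, close the connection.
    // Returns the number of round trips paid, which equals the record count.
    static int insertAll(List<String> records, Consumer<String> insertRecord) {
        int roundTrips = 0;
        for (String record : records) {
            insertRecord.accept(record); // connection overhead paid here, every time
            roundTrips++;
        }
        return roundTrips;
    }
}
```

With 100,000 records, that loop pays the connection overhead 100,000 times, which is exactly where the time goes.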

  • 2. Risky Approach: All Updates at Once

One potentially good solution is to upload the entire set of records in a single database connection. If the total size of the records is not too big (5 megabytes, for example), the update will likely succeed, and at an extremely fast rate compared to the first approach.

So where does this solution go wrong? Well, let’s say the total size of the 100,000 records is 100 megabytes. It is often the case that the file can never finish uploading to the server before the timeout is reached. As with the phpMyAdmin example of the screen going white, the server won’t maintain a connection long enough to transfer the target file to the database.

Keep in mind that uploading the large set of records to the application server first may not solve this problem. I’ve seen cases where files local to the application server still failed, because the connection between the application server and the database within a hosted environment was simply not fast enough to transfer the file and perform the update.
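A minimal sketch of the all-at-once idea is to build one multi-row INSERT and send it over a single connection. The items table and name column here are illustrative, and the quote-doubling is only sketch-level escaping:

```java
import java.util.List;
import java.util.StringJoiner;

public class AllAtOnce {

    // Build one multi-row INSERT so every record travels in a single
    // statement over a single connection.
    static String buildInsert(List<String> names) {
        StringJoiner values = new StringJoiner(", ",
                "INSERT INTO items (name) VALUES ", "");
        for (String name : names) {
            // sketch-level escaping: double any embedded single quotes
            values.add("('" + name.replace("'", "''") + "')");
        }
        return values.toString();
    }
}
```

The resulting statement executes in one round trip; the risk, as described above, is the sheer size of that single transfer.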

  • 3. Powerful Approach: Block Updates

In the first solution, the overhead of creating thousands of database connections caused the time required to perform the update to grow drastically, whereas in the second solution the update itself was fast but the single large transfer could not fit within the boundaries of most database connection timeouts. The third and last solution I’ll discuss is to perform the updates as a set of blocks.

For example, let’s say we took the 100,000 records and divided them into 20 blocks of 5,000 records. A quick comparison of performance yields:

Solution                  Database Connection Count   Largest File Size Per Connection
1 Update at a Time        100,000                     1 kilobyte
All Updates at Once       1                           100 megabytes
Block Updates (n=5,000)   20                          5 megabytes

From this table we see the block solution keeps most of the performance advantage of the second solution, since 20 connections versus 1 is quite negligible compared to 100,000, but it never transfers a file bigger than 5 megabytes and so is far less likely to fail on a large transfer. Furthermore, we could halve the block size to 2,500 records, giving 40 blocks, and still have good performance (40 connections versus 1) with half the file size. In general, you would implement such a solution with the block size, n, determined at runtime or in a properties file so that it can be easily changed. Also keep in mind that the last block is usually not full. For example, if I had 99,995 records, the last of the 20 blocks would contain only 4,995 items, and it would be important to make sure the code does not mistakenly assume a full block.
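One way to sketch the blocking logic in Java is below; the items table, name column, and JDBC URL are illustrative, and the real insert loop would live behind whatever data access layer you already have:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class BlockUpdater {

    // Split the records into blocks of at most blockSize; the last block is
    // usually smaller (99,995 records with n=5,000 gives a final block of 4,995).
    static <T> List<List<T>> toBlocks(List<T> records, int blockSize) {
        List<List<T>> blocks = new ArrayList<>();
        for (int start = 0; start < records.size(); start += blockSize) {
            blocks.add(records.subList(start,
                    Math.min(start + blockSize, records.size())));
        }
        return blocks;
    }

    // One connection (round trip) per block instead of one per record.
    static void insertInBlocks(String jdbcUrl, List<String> names, int blockSize)
            throws SQLException {
        for (List<String> block : toBlocks(names, blockSize)) {
            try (Connection conn = DriverManager.getConnection(jdbcUrl);
                 PreparedStatement ps =
                         conn.prepareStatement("INSERT INTO items (name) VALUES (?)")) {
                for (String name : block) {
                    ps.setString(1, name);
                    ps.addBatch();
                }
                ps.executeBatch(); // the whole block in one shot
            }
        }
    }
}
```

Because toBlocks caps the final slice at the list’s size, a partial last block falls out naturally rather than being a special case.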

  • Real World Application: Does this work in practice?

Yes, most definitely. I can recall a situation where I was faced with exactly this issue. The first solution, which inserted one record at a time, took over an hour to run. The second solution, inserting everything at once, almost never completed because it would time out long before it was finished. The third solution, inserting the records as blocks, always completed, often within 5 minutes.

While this describes one type of scenario where block updates are the best approach, there are a lot of factors that could affect what you do in your own application, such as whether you have the ability to increase the timeout values within your hosted environment. There are also more advanced solutions, such as the batch updates provided by JDBC, as well as the ability to run SQL scripts locally if you have shell access to your database. Overall, this article is meant to remind you that while it’s common to ignore the performance considerations of large updates in the real world (we’ll just get faster internet!), there are some excellent gains to be made if you spend some time considering how your application handles them.

premature optimization

Donald Knuth is often quoted as saying “premature optimization is the root of all evil.”

Often in the JavaRanch forums, we see questions like:
which is faster: stringBuilder.append(‘a’) or “a” + “b”?
(The answer is that these two expressions are supposed to generate the same bytecode if used on the same line.)
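For what it’s worth, the compiler handles both cases on its own. A constant expression like “a” + “b” is folded into the single literal “ab” at compile time, and concatenation of non-constant values is compiled to StringBuilder.append calls anyway (at least by the javac of this era), so hand-writing the builder on a single line buys nothing. A small sketch:

```java
public class ConcatDemo {

    // "a" + "b" is a compile-time constant expression, so the compiler folds
    // it into the single literal "ab"; no StringBuilder is created at all.
    static String constantConcat() {
        return "a" + "b";
    }

    // Concatenating non-constant values compiles to StringBuilder.append
    // calls under the covers, so the result is the same either way.
    static String runtimeConcat(String left, String right) {
        return left + right;
    }
}
```

Constant folding even means constantConcat() returns the interned literal “ab” itself, not a freshly built string.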

My first thought when I see this is: who cares?  I’d be very surprised if the bottleneck in an application was caused by this line.  I’d be even more surprised if the poster had done the analysis to prove this line is the performance bottleneck before posting.

What got me thinking about this is that I think I just wrote some premature optimization on a pet project.

/* Only do the search and replace if one or more images appear in the value.
   This is likely premature optimization, but it's faster to add the
   if statement than to find out. */
private String convert(String value) {
  String result = value;
  if (result.contains("<img src=")) {
    for (String imageName : setOfTwentyElements()) {
      // regular expression search and replace for this image's tag
      result = result.replaceAll(imageTag(imageName), strippedOutImage(imageName));
    }
  }
  return result;
}

The fact that I commented it means I’m aware of doing it.  That’s a step ahead of what usually happens, where we write the optimization assuming it will be important and only later (if ever) find out otherwise.

In this case, I expect to run the method over a million times, with the value not containing an image over half the time.  After that run, the containing code will not live on.  I also know the surrounding code performs a database update, which means this line is not the bottleneck.  To me, the extra if statement seems like a low cost to avoid doing some work at all.  It also feels like a code smell for premature optimization.  If this were a real application and not a pet project, I would invest some time in finding out how long the code actually takes to run for my data patterns and lengths.

Here I’m more interested in whether it is an example of premature optimization.  I think I might have to leave that as a philosophical question.  Comments one way or the other are certainly welcome.  My leaning is that it is still premature optimization and this post is a rationalization.