“total cost of owning a mess”

When I wrote about defending the code, I described how we don’t want to meet a date at the expense of our ability to meet future dates.  This applies to future projects as well.  If we meet the production date for the first project by producing a shaky pile of code, what is going to happen to the next project using that code base?  Some refactoring can be scheduled between projects, but this approach requires trust on both sides for it to actually get done, and it only works for selected and specific types of refactorings.

I think this is similar to Robert C Martin’s discussion of the “total cost of owning a mess”.  (Also from “Clean Code”.)  I like this expression because I think it accurately reflects what happens.  As the mess grows, changes get harder and harder to make.  They also become more and more risky, because messy code is more likely to introduce bugs when touched.  And bugs make the total cost of ownership skyrocket, which leads to an unfortunate feedback loop where things spiral out of control.  So it’s our job as professionals to make sure we do not introduce a mess and to encourage our teammates to do the same.

On the other hand, we often have a mess as the status quo.  It might have been inherited from the prior author of the code, or it might have just snuck up on us.  When this happens, we can still try to make things better, even if it’s a tiny bit at a time, or at least avoid making things worse.  It’s so easy to say “well, it’s already a mess, so I’ll just add this hack here.”  That’s how it got to be a mess in the first place!

—-

Starting in a few hours, we will be hosting Uncle Bob (Robert C Martin) at JavaRanch in the Agile and Other Processes forum.  Come on in, ask him a question or join the discussion.  You might even win a free copy of the book Clean Code: A Handbook of Agile Software Craftsmanship.

“it’s your job to defend the code”

As I read “Clean Code” by Robert C Martin, I was particularly drawn to the passage about why unrealistic commitments should not be an excuse for writing bad code.  He points out that while it is a manager’s job to “defend the schedule and requirements”, it is a developer’s “job to defend the code with equal passion.”

What management says vs what management wants

One way this becomes a problem is when a developer hears management say “I want it by Tuesday.”  They may want it by Tuesday, but does that mean they are getting it by Tuesday?  The manager also wants it to work correctly and meet all the requirements.  A manager isn’t going to be happy if we claim it is done on Tuesday and then a whole pile of rework needs to be done afterwards.  That really means the task wasn’t done by Tuesday and we just claimed it was.

“Done”

The manager also doesn’t want this one date to be met at the cost of meeting other dates.  One way this comes up is by saying the task is done while silently thinking that parts of it will be done “later.”  Regardless of whether “later” ever comes about, the task isn’t REALLY done.  When I disagree with someone on whether a task is done, I’ll ask if they need any more time to work on things related to it.  This may bring up parts of the task that aren’t really done.  Sometimes it brings up additional tasks that were discovered during this task.  Either way, the answer is often a productive one.

Trust

When unrealistic estimates come up, there are often two responses.  One is to silently accept the date knowing it will not be met (or will be met at the cost of other things) because it is “what the manager wants.”  The other is to discuss with management why the estimate will not be met.  Of course, the latter requires much more energy at first.  However, it pays off by giving everyone a realistic understanding of the project.

Schedule gambling

These other desires tend to be implicit, so we only hear the deadline.  That doesn’t mean the manager wants it done by Tuesday at the expense of all else.  I’m fond of saying that management would rather meet the production date than a development date.  So if we are meeting the development date at the expense of something that affects the production date, management will be even less happy.  That something may be sloppiness that makes other tasks take longer, or leaving parts of the task incomplete.

What all of these things have in common is that it is our responsibility as developers to keep management up to date and provide them with accurate visibility into what is going on within the project.

—-

This coming week we will be hosting Uncle Bob (Robert C Martin) at JavaRanch in the Agile and Other Processes forum.  Come on in, ask him a question or join the discussion.  You might even win a free copy of the book Clean Code: A Handbook of Agile Software Craftsmanship.

Database Performance – Block Updates over the Internet

When you run a website within a hosted web environment, you often no longer have the luxury of direct or intranet access to your database. With web-based tools such as phpMyAdmin becoming increasingly popular among web hosting providers, remote connections over the internet may be the only access a user has to their database. Every once in a while a user may have an application that needs to perform a large number of inserts into their database, and in many situations this can become a daunting task. The purpose of this article is to bring the problem of large updates to light and discuss some solutions.


  • The Problem

Let’s say I’m writing a script within my application to initialize a new table in my database, and this table requires the insertion of 100,000 items. Anyone who’s ever tried something like this over the internet is probably aware of some of the limitations. Web hosting providers tend to insert a number of timeouts throughout the system, such as how long a user can keep a web connection open or how long a user can maintain a database connection. Both of these timeouts, often set at 60 seconds, can easily be reached if the 100,000 records are being transmitted over the internet.

For example, you may have tried to import a large file in phpMyAdmin and seen the screen go white and the transfer stop. This is often because you have reached one of the server’s predefined timeouts. Other times, the page may explicitly throw a timeout exception.

The core of the problem is that the server is unwilling to open a connection for the length of time required to perform the update. So what is a developer to do? Well, let’s address 3 potential solutions for dealing with the problem:

  • 1. Naive Approach: One Update at a Time

It’s a reasonable guess that 99% of web applications perform updates one at a time, with commands immediately sent to the database upon execution. Most, if not all, users will stick to this pattern until they have a reason to do otherwise.

Going back to our example of inserting 100,000 items, how would this approach handle it? Well, it would create 100,000 connections to the database, one at a time, of course. The problem is that the overhead of creating each connection makes this script, while it executes correctly, the most time- and resource-consuming approach of any we will discuss in this article. While establishing a connection to a database is normally a trivial thing, doing it 100,000 times is not.

Note: By connection, I’m referring to total round trips to the database, not necessarily individual connection objects you create within your application.

If done as part of a transaction, this script will execute perhaps 20,000 items before throwing a timeout exception, at which point all previous inserts will be rolled back. If it is not transactional, some of the inserts will go through before the failure, and it can be frustrating to modify the application to pick up where it left off. Even when this approach is capable of completing successfully, the overhead of connecting to the database 100,000 individual times will often make this script run very slowly in the real world.
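To make the pattern concrete, here is a minimal JDBC sketch of the one-at-a-time approach. The connection URL, credentials, and the items table are placeholders I made up for illustration; in line with the note above, the 100,000 “connections” here are round trips rather than connection objects:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

public class OneAtATimeInsert {
    // Placeholder URL/credentials for a hypothetical hosted MySQL database
    private static final String URL = "jdbc:mysql://db.example.com/mydb";

    public static void insertAll(List<String> names) throws Exception {
        try (Connection con = DriverManager.getConnection(URL, "user", "pass");
             PreparedStatement ps =
                     con.prepareStatement("INSERT INTO items (name) VALUES (?)")) {
            for (String name : names) {
                ps.setString(1, name);
                // One round trip per record: for 100,000 records the
                // per-trip overhead dominates and the script runs very slowly
                ps.executeUpdate();
            }
        }
    }
}
```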

  • 2. Risky Approach: All Updates at Once

One potentially good solution is to upload the entire set of records in a single database connection. If the total size of the records is not too big (5 megabytes, for example), the update will likely succeed, and at an extremely fast rate compared to the first approach.

So where does this solution go wrong? Well, let’s say the total record size of the 100,000 records is 100 megabytes. It is often the case that the file can never finish uploading to the server before the timeout is reached. As with the phpMyAdmin example going to a white screen, the server won’t maintain a connection long enough to transfer the target file to the database.

Keep in mind, uploading the large set of records to the application server may not solve this problem. I’ve seen cases where files local to the application server failed, because the connection between the application server and the database within a hosted environment was simply not fast enough to transfer the file and perform the update.
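Here is what the all-at-once approach might look like in JDBC, again with a made-up items table and connection details. The entire data set travels in a single statement over a single connection, which is exactly what makes it fast for small payloads and fragile for large ones:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

public class AllAtOnceInsert {
    // Placeholder URL/credentials for a hypothetical hosted MySQL database
    private static final String URL = "jdbc:mysql://db.example.com/mydb";

    public static void insertAll(List<String> names) throws Exception {
        // One connection and one statement for the entire data set.
        // Fast when the payload is small, but a 100 MB payload may hit
        // the server's timeout before the transfer ever completes.
        // (Real servers also cap packet sizes and placeholder counts,
        // another reason this approach is risky for very large data sets.)
        try (Connection con = DriverManager.getConnection(URL, "user", "pass");
             PreparedStatement ps =
                     con.prepareStatement(buildInsertSql(names.size()))) {
            for (int i = 0; i < names.size(); i++) {
                ps.setString(i + 1, names.get(i));
            }
            ps.executeUpdate(); // everything goes over the wire here
        }
    }

    private static String buildInsertSql(int rows) {
        // "INSERT INTO items (name) VALUES (?), (?), ..." -- one
        // placeholder group per record, so it all fits in a single statement
        StringBuilder sql = new StringBuilder("INSERT INTO items (name) VALUES (?)");
        for (int i = 1; i < rows; i++) {
            sql.append(", (?)");
        }
        return sql.toString();
    }
}
```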

  • 3. Powerful Approach: Block Updates

In the first solution, we saw the overhead of creating thousands of database connections cause the time required to perform the update to grow drastically, whereas in the second solution the update itself was fast but the transfer could not complete within the boundaries of most database connection timeouts. The third and last solution I’ll discuss is to perform updates as a set of blocks.

For example, let’s say we took the 100,000 records and divided them into 20 blocks of 5,000 records. A quick comparison of performance yields:

Solution                  Database Connection Count   Largest File Size Per Database Connection
1 Update at a Time        100,000                     1 kilobyte
All Updates at Once       1                           100 megabytes
Block Updates (n=5,000)   20                          5 megabytes

From this table we see the block solution has the performance advantage of the second solution, namely few database connections (20 connections versus 1 is quite negligible), but it never transfers a file bigger than 5 megabytes, so it is less likely to fail transferring a large file. Furthermore, we can double the number of blocks to 40 blocks of 2,500 records and still have great performance (40 connections versus 1) with half the file size per connection. In general, you would implement such a solution with the block size, n, determined at runtime or in a properties file so that it can be easily changed. Also keep in mind the last block is most likely not full. For example, if I had 99,995 records, the last block of 5,000 records would only have 4,995 items, and it would be important to make sure the code did not make a mistake and assume a full block.
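A sketch of the block approach, using the same hypothetical items table and connection details as the earlier examples. Note how the block size is isolated in one constant (in practice it would be externalized as described above) and how the partial last block falls out naturally:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

public class BlockInsert {
    // Placeholder URL/credentials for a hypothetical hosted MySQL database
    private static final String URL = "jdbc:mysql://db.example.com/mydb";

    // In practice this would come from a properties file or a runtime
    // parameter so it can be tuned without a code change
    private static final int BLOCK_SIZE = 5_000;

    public static void insertAll(List<String> names) throws Exception {
        // Math.min() handles the partial last block: 99,995 records yield
        // 19 full blocks plus a final block of 4,995
        for (int start = 0; start < names.size(); start += BLOCK_SIZE) {
            int end = Math.min(start + BLOCK_SIZE, names.size());
            insertBlock(names.subList(start, end));
        }
    }

    private static void insertBlock(List<String> block) throws Exception {
        // One connection and one statement per block keeps each transfer
        // small enough to finish inside the server's timeout
        try (Connection con = DriverManager.getConnection(URL, "user", "pass");
             PreparedStatement ps =
                     con.prepareStatement(buildInsertSql(block.size()))) {
            for (int i = 0; i < block.size(); i++) {
                ps.setString(i + 1, block.get(i));
            }
            ps.executeUpdate();
        }
    }

    private static String buildInsertSql(int rows) {
        // "INSERT INTO items (name) VALUES (?), (?), ..." sized to the
        // actual block, so the partial last block is handled correctly
        StringBuilder sql = new StringBuilder("INSERT INTO items (name) VALUES (?)");
        for (int i = 1; i < rows; i++) {
            sql.append(", (?)");
        }
        return sql.toString();
    }
}
```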

  • Real World Application: Does this work in practice?

Yes, most definitely. I can recall a situation where I was faced with such an issue. The first solution, which inserted one record at a time, took over an hour to run. The second solution of inserting everything at once almost never completed, because it would time out long before it was finished. The third solution of inserting things as blocks always completed, often within 5 minutes.

While this article describes one type of scenario where block updates are the best approach, there are a lot of factors that could affect what you do in your application, such as having the ability to increase the timeout values within your hosted application. There are also more advanced solutions, such as the batch updates provided by JDBC, as well as the ability to run SQL scripts locally if you have shell access to your database. Overall, this article is meant to remind you that while it’s common to ignore the performance considerations of large files in the real world (we’ll just get faster internet!), there are some excellent gains to be made if you spend some time considering how to handle large files in your application.