Why too much Database Normalization can be a Bad Thing

As someone with a long history in database optimization and who even did their Master’s project on Database normalization, I’m probably the last person in the world to argue against database normalization. From a theoretical standpoint, database normalization is a wonderful thing, helping to organize your data into easy-to-manage and understandable parts. For this article, I will play devil’s advocate and argue why too much normalization can be a bad thing.

Database

The years of working in the professional software industry has taught me the practical implications of a fully normalized database system: it’s often slow. It’s a non-trivial issue to design a good database schema that is both fully normalized and performs well on a variety of circumstances. I think back to some systems that I have seen in which a single common query joins dozens or more very large tables. Keep in mind, the more normalized your data is, the more joins that are required. For example, if you normalize a user’s address into a separate table (so the user can have a set of addresses) you have to join the user table with the address table every time you want to display their address. Then, you often have to join with a city, state, and country tables for full normalization.

Let’s take some of the advantages of database normalization and look at why they might not be so great:

Normalization saves space, but space is cheap!
The most obvious advantage of normalization is spacing saving. For example, instead of listing a “United States of America” for 10,000 records in a Users table, I can create a country table that lists the text once, then create a reference with an integer. Clearly, 10,000 integers take less space than 10,000 24-digit text fields. But these days, space is cheap! Terabyte drives are now common, so would normalizing some of the fields of a table really save much space? Probably not. I’m not saying denormalize all fields, but there’s some where the advantages to space are negligible.

Normalization simplifies updates, but reads are more common!
Another common reason for normalization is to simplify updates and reduce anomalies. For example, in the case of “United States of America” text in the Users table, its a lot easier to update/change the text in a single record than it is to update 10,000 records. But that brings up an interesting point, how often does the text “United States of America” change? I would argue almost never. There are examples where data does change more frequently, such as a user’s address, but its common knowledge in database systems that in most tables reads are far more frequent than writes. Therefore, if you have a table with relatively stable data that changes infrequently, normalization isn’t buying you a lot.

Performance, performance, Performance
Database administrators spend the bulk of their time worrying about performance, how to optimize queries and table structure, and in that regard normalization rarely helps. Those calls people tend to get at 2AM about the system crashing over a specific query? They often related to normalization in some way. If you have a case where you can eliminate a normalized table with minimal impact on functionality, it is better to do so. For example, in the case of address you might need users to be able to have multiple addresses (1-to-many relationship), therefore there’s no way to avoid normalizing the data into a separate table. But there are other cases such as with States and Countries, where constantly joining to two completely static tables is not helping. You can still have the tables present in order for a user to select a state or country from a drop-down list, but it may be better to save the text of the state or country in the user’s table, instead of a reference.

Conclusion and Next Article
In this article I played devil’s advocate arguing that too much normalization can be a bad thing. On the other end of the spectrum, too little normalization can also be a bad thing. The mark of a good database designer and software developer is the ability to find a good balance between the two that matches the database structure against the actual or expected use of the system. In the next article, I will discuss some of the tools available in most databases to help combat the performance issues of normalization such as indexes, triggers, and materialized views.

why do we violate law of demeter?

I was watching a Google Tech Talk on Dependency Injection.  Sixteen minutes in the speaker gave an interesting example of the Law of Demeter: If buying item for $25 in store, do you hand clerk $25 or give clerk your wallet and have him/her retrieve the $25.

This got me thinking about why we so frequently violate the Law of Demeter.  I can think of three reasons that come up frequently.

Familiarity with code/process

When we think about what needs to be done, we think “I’ll get out my wallet and get the money.”  This does translate literally to handing the clerk the $25. Since the wallet is an inanimate object, we think more about the wallet than the cash.

Wallet wallet = customer.getWallet();
Cash cash = wallet.getMoney(25);
customer.payCashier(cash);

If it were a person, we would think of the separate step.  If a six year old was buying something, he might think of the transaction as “ask mommy or daddy for $25 and then hand that to the clerk $25.”  This translates better to:

Cash cash = parent.getMoney(25);
customer.payCashier(cash);

In the real life scenario of an intermediary person, the requestor (child) never even rifles through the wallet.  (Granted in the real life scenario, the child doesn’t really ask the wallet – rather the person holding the wallet.)

Access data we shouldn’t

This is really a variant of being overly familiar with all the objects/data.  We tend to think about what we are doing rather than from the object’s point of view.

Can’t touch other code

To me this is the most interesting reason.  We have an object that we think really should have a new method but can’t touch that object.  One solution to this is to create a new object that holds the original and does contain the new method.  This solution is useful when the object belongs to a third party and really can’t be changed.  Sometimes it’s more of a mental block.

I recently came across a case where two numbers where compared via a string comparison (and some other logic) because the author didn’t want to touch the object that owned the data (and was exposing it via a string.)  After further discussion we concluded the owning object really needed a method that told us if this number was positive or negative.  And the complex logic was a “code smell” pointing us to that fact.

Personally, I find the reasons behind things to be more interesting than the actual instance.  It’s easy to say “don’t violate the Law of Demeter.”  Discussing why is plenty interesting (as in the Google Tech Talk.)  It’s thinking about what drives us to such things that helps actually prevent them.

the different levels of a question

People with different levels of experience approach tend to approach a question in different ways.  This is something I see frequently on JavaRanch.

An example

The ideas apply to many questions, but I’ll use “how do I disable the browser back button” as a common example.  Consider some answers to this:

1) In Internet Explorer, you do something the onunload event. This isn’t an ideal answer because it teaches people a hack as if it were the generally accepted approach.  It also doesn’t work universally. [While writing this, I learned this even is implemented in Firefox 3.  Hopefully that doesn’t mean people will start using it inappropriately.]

2) You can’t. – Accurate, but not very helpful as it doesn’t give the person with a problem any ideas on where to go from here.

3) Explanation of why bad practice with follow up question asking what trying to accomplish. – This is better as it tells the person something new and allows for an alternate approach to be suggested.

Answer 1 is a very developer centric way of looking at things.  They literally answer the question.  As such, this tends to be an approach given by someone who has less experience in development or is accustomed to receiving and implementing detailed specs.

Answer 3 is a higher level type answer.  It hones right in on what the person is trying to accomplish.  (maybe not submitting a form again or not seeing stale data.)  This is the type of answer more likely to be given by a senior developer or architect.

You’ve undoubtedly noticed I skipped answer 2 here.  That one could apply to any level of experience.  It is a correct answer if taken literally.  It could just as easily be given by an architect in a hurry – it is factually correct but less helpful than a full answer would be.

A more specific example:

The reason I was thinking about this is I came across a question asking how to identify the first column in a Collection in a JSP.  As of right now, there are two answers in the thread. One is a literal answer involving scriptlets.  The other is a suggestion that maybe the model should be represented.

The interesting thing is that the literal answer better answers the actual question that was asked.  By contrast, my answer (the second one), speculates about what the underlying cause might be without actually touching on the question.  I don’t think this thread particular highlights different levels of experience as much as it does different ways of approaching a problem.

On some level I think this is related to Defending the Code.  But only in that it is about speaking up and finding out what is realy wanted.