As someone with a long history in database optimization, who even did their Master’s project on database normalization, I’m probably the last person in the world you’d expect to argue against it. From a theoretical standpoint, normalization is a wonderful thing, helping to organize your data into easy-to-manage and understandable parts. For this article, though, I will play devil’s advocate and argue why too much normalization can be a bad thing.
Years of working in the professional software industry have taught me the practical implication of a fully normalized database system: it’s often slow. It’s a non-trivial task to design a database schema that is both fully normalized and performs well under a variety of circumstances. I think back to some systems I have seen in which a single common query joins dozens of very large tables. Keep in mind: the more normalized your data is, the more joins are required. For example, if you normalize a user’s address into a separate table (so the user can have a set of addresses), you have to join the user table with the address table every time you want to display an address. Then, for full normalization, you often have to join with the city, state, and country tables as well.
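That multi-way join can be sketched with SQLite through Python’s sqlite3 module. The schema and names here are purely illustrative (not from any real system), but they show how a fully normalized address costs four joins per lookup:

```python
import sqlite3

# Illustrative fully normalized schema: addresses live in their own table
# (one user, many addresses), and city/state/country are lookup tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE country (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE state   (id INTEGER PRIMARY KEY, name TEXT,
                          country_id INTEGER REFERENCES country(id));
    CREATE TABLE city    (id INTEGER PRIMARY KEY, name TEXT,
                          state_id INTEGER REFERENCES state(id));
    CREATE TABLE user_account (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE address (id INTEGER PRIMARY KEY, street TEXT,
                          user_id INTEGER REFERENCES user_account(id),
                          city_id INTEGER REFERENCES city(id));

    INSERT INTO country VALUES (1, 'United States of America');
    INSERT INTO state   VALUES (1, 'New York', 1);
    INSERT INTO city    VALUES (1, 'New York City', 1);
    INSERT INTO user_account VALUES (1, 'Alice');
    INSERT INTO address VALUES (1, '123 Main St', 1, 1);
""")

# Four joins just to render one address line for one user.
row = conn.execute("""
    SELECT u.name, a.street, ci.name, s.name, co.name
    FROM user_account u
    JOIN address a  ON a.user_id = u.id
    JOIN city    ci ON ci.id = a.city_id
    JOIN state   s  ON s.id  = ci.state_id
    JOIN country co ON co.id = s.country_id
    WHERE u.id = 1
""").fetchone()
print(row)
```

Every page that shows an address pays this join cost, which is exactly where the performance complaints below come from.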
Let’s take some of the advantages of database normalization and look at why they might not be so great:
Normalization saves space, but space is cheap!
The most obvious advantage of normalization is space savings. For example, instead of storing the text “United States of America” in 10,000 records of a Users table, I can create a country table that stores the text once and reference it with an integer. Clearly, 10,000 integers take less space than 10,000 copies of a 24-character text field. But these days, space is cheap! Terabyte drives are now common, so would normalizing some of the fields of a table really save much space? Probably not. I’m not saying denormalize all fields, but for some of them the space advantage is negligible.
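A quick back-of-the-envelope calculation makes the point concrete. (The 4-byte integer size is an assumption for illustration; actual on-disk sizes vary by database engine and encoding.)

```python
# Rough estimate: repeating the country text on every row vs. a lookup table.
rows = 10_000
text = "United States of America"        # 24 bytes as ASCII

denormalized_bytes = rows * len(text)    # every row repeats the full text
normalized_bytes = rows * 4 + len(text)  # 4-byte integer FK plus one copy of the text

print(denormalized_bytes)  # 240000
print(normalized_bytes)    # 40024
```

Roughly 200 KB saved across 10,000 users, on a terabyte drive: negligible, as the article argues.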
Normalization simplifies updates, but reads are more common!
Another common reason for normalization is to simplify updates and reduce anomalies. For example, in the case of the “United States of America” text in the Users table, it’s a lot easier to update the text in a single record than in 10,000 records. But that brings up an interesting point: how often does the text “United States of America” actually change? I would argue almost never. There are cases where data changes more frequently, such as a user’s address, but it’s common knowledge in database systems that in most tables reads are far more frequent than writes. Therefore, if you have a table of relatively stable data that changes infrequently, normalization isn’t buying you much.
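The update argument can be sketched the same way (again SQLite via Python, with illustrative names): with the lookup table, fixing a typo in the country name touches one row no matter how many users reference it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE country (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE users   (id INTEGER PRIMARY KEY,
                          country_id INTEGER REFERENCES country(id));
    -- Deliberate typo that will need correcting later.
    INSERT INTO country VALUES (1, 'Untied States of America');
""")
conn.executemany("INSERT INTO users (country_id) VALUES (?)", [(1,)] * 10_000)

# Normalized: the fix touches exactly one row, not 10,000.
cur = conn.execute(
    "UPDATE country SET name = 'United States of America' WHERE id = 1")
print(cur.rowcount)  # 1
```

The catch, as the article notes, is that this benefit only matters if the value ever changes.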
Performance, Performance, Performance
Database administrators spend the bulk of their time worrying about performance: how to optimize queries and table structure. In that regard, normalization rarely helps. Those calls people tend to get at 2 AM about the system crashing over a specific query? They are often related to normalization in some way. If you have a case where you can eliminate a normalized table with minimal impact on functionality, it is better to do so. For example, in the case of addresses, you might need users to be able to have multiple addresses (a one-to-many relationship), so there’s no way to avoid normalizing the data into a separate table. But there are other cases, such as states and countries, where constantly joining to two completely static tables is not helping. You can still keep those tables so a user can select a state or country from a drop-down list, but it may be better to save the text of the state or country in the users table instead of a reference.
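The drop-down suggestion can be sketched as follows (SQLite via Python again; the names are illustrative): the country table survives purely to feed the selection list, while the chosen text is copied onto the user row so reads need no join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Static lookup table, kept only to populate the drop-down list.
    CREATE TABLE country (name TEXT PRIMARY KEY);
    INSERT INTO country VALUES ('Canada'), ('United States of America');

    -- Denormalized: the selected text is copied onto the user row,
    -- so displaying a user requires no join against the static table.
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
    INSERT INTO users VALUES (1, 'Alice', 'United States of America');
""")

# A single-table read, no join needed.
row = conn.execute("SELECT name, country FROM users WHERE id = 1").fetchone()
print(row)
```

The trade-off is the one discussed above: if a country is ever renamed, every user row that copied the old text must be updated.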
Conclusion and Next Article
In this article I played devil’s advocate, arguing that too much normalization can be a bad thing. On the other end of the spectrum, too little normalization can also be a bad thing. The mark of a good database designer and software developer is the ability to strike a balance between the two that matches the database structure to the actual or expected use of the system. In the next article, I will discuss some of the tools available in most databases to help combat the performance issues of normalization, such as indexes, triggers, and materialized views.
Nice article. I’m no DB expert, but I agree with your point about too much normalization, though your example with the country table and the USA isn’t that good. For instance, I am from Serbia, and this country has changed its name several times: from ‘Yugoslavia’ to ‘Serbia and Montenegro’ to just ‘Serbia’. So even a country table isn’t ‘static’ as you wrote.
I agree that too much normalization can have negative impacts for read performance. But it helps to reduce redundancy and maintain integrity. Quite often it also depends on the database you are using. Almost every MySQL developer will tell you that joins suck and you better use denormalized tables for best speed. Things are different for PostgreSQL or even Oracle, though. And things like Materialized Views can help to improve read performance as well.
A few days ago I had to “fix” a table from someone who thought that normalization is bad. The table has 80 columns, with names like cost_age30, cost_age31, cost_age32 …
Now imagine what queries on those tables look like *sigh*
It is true that joins have a cost, but the cost is primarily in CPU. The cost of denormalizing is in disk. Now it is true that disk is cheap, but CPU performance has improved more dramatically in recent years, more so than disk speed. (In fact the increase of disk capacity may have reduced the performance of many disk systems in practice, because a wider RAID array with more spindles is no longer necessary from a space point of view). From a performance point of view, disk is still orders of magnitude slower than memory.
I tested the countries example in MS SQL Server. At 10,000 rows there was not much difference, but by 160,000 rows there were clearly fewer logical reads in the normalized model.
Great. So true!
This is one of those areas where the conclusion is so clear and easy to reach – yet most developers don’t ask the question. 3NF is holy; if you don’t normalize till it hurts, you’re a bad developer and you’ll contract an awful disease soon 😉 .
Good to read that someone has the ba… to have their own opinion.
I once eliminated all FK relationships from the database and managed them in the code. It was more error-prone, yes; it took more time to implement, yes; but it was a huge gain in performance. I went that way because denormalisation was not an option: some of those tables were shared between many entities, and updates on them were frequent. The key to success down this road was to test, test, test…
How about a performance chart comparing three databases that store the same information: the first normalized too much, the second moderately normalized, and the third not normalized at all?
I think a lot of you are taking me too literally… I never said to get rid of normalization, but it’s clear that too much of it can be a bad thing. In general, reads in a denormalized system are almost always faster than in a normalized one; the catch is that updates and data storage suffer.
Great idea, Raditya. Why not create the chart as a project, and I’ll post the results.
An old topic, but still ranks in Google, so I’ll still comment.
I think you need to also consider disk latency when assessing the cost of disk space. It might be cheap per GB of storage, but if the database manager needs to bring in four times as many pages in order to perform a join or any other operation on the table, then that could have a very negative impact on performance. There’s usually more disk storage than memory allocated to a database.
Also, the sophistication of the database manager in generating access paths, and the accuracy of the database statistics, will play a large part in determining whether it’s faster to denormalise to plain text or to use lookup tables (as per your ‘country’ field example).
Fabulous article. I’m dealing with overnormalization right now, in an IT computer database. Normalizing, in this case, is just more work than it’s worth, especially for a small company.
I mean, I’ve got a table for CPUType and three entries for a P5. I will NEVER buy another P5. They don’t make them. I don’t want one… you get the picture, I hope.
I am in the process of “denormalizing,” to a degree… but I will admit it HURTS, because normalizing has been so drilled into my brain. 🙂
Pingback: Memo: Avoid Functions in Database Queries | Down Home Country Coding With Scott Selikoff and Jeanne Boyarsky
I’m poking around SQL Server 2012 with the Adventureworks database. So sick of all these numeric ids!!!
It seems some people didn’t read the words “too much”.
SQL Server 2016’s temporal table functionality can really eat into disk space if tables are not normalised. The convenience of history logging across multiple tables has led me to over-normalise, thus avoiding history tables that duplicate wide rows in their entirety for each minor data update. I realise that this functionality was not available at the time of the original post (for SQL Server, at least), but the requirement to keep a change history was; if the native functionality didn’t exist on a given platform, it would have been written for the purpose. Change history is easier when over-normalising.
I think that people who are overly normalization-minded can overdo it and design worse databases: less “browsable” and harder to code against.
Space isn’t an issue in most cases, and I think this: if you have values that can naturally serve as primary keys, use them! DO NOT create another id just because it is “shorter”. I believe the PK should be the actual value, not a surrogate that merely “stands for” it just to spare a few bytes. E.g.: if you have a customer table, let the PK be the actual customer_id, not some sequence-generated id that you then carry around for FKs.