Denormalized Databases: A better example

There were a number of comments about my recent article on the negative effects of too much database normalization so allow me to expand the topic a little more. The most consistent comment I saw was that while many of you agreed with me in principle that too much normalization can lead to poor performance, the country name example was a bad choice. As a former teacher, I tend to choose illustrative examples that are easy to understand and explain. As a blog writer, I also tend to forget the web is full of nitpickers that will argue with you unless presented with a concrete example from the real world.

My favorite comment was about how in Europe a poster’s country name has changed 3-4 times in the last few years. I could argue most Americans would be shocked to see their country name or state name change, but I have to think… how hard is it to update the name of a country in a database even if it does change once a year? The trick with database denormalization is you have to write code that updates two separate sets of items, and in a lot of cases this is less difficult than many of you imagine. If updates are uncommon (static) as I mentioned, the performance gain from reading can far outweigh the rare cost of performing an update.

Perhaps its better to move on to a better example of where denormalization of data can play an important part: managing user permissions. And no, this is not to be confused with database security (which is completely useless for user/application-level permission management). Let’s say you have a tree-based system of widgets. Widgets can have sub-widgets and so on such that each widget has only one parent but can have many children. Now let’s say users within the application (again not to be confused with database users) are assigned read/write roles on individual widgets as well as to entire subtrees of widgets. For example, Bob may be able to read the entire tree, but he can only write to select widgets A, B, and C or a select sub-tree of widgets D which contains X, Y and Z. Now, how would you design a database system that allowed for quick determination of a user’s permissions on a node, considering the user may be accessing dozens or hundreds of nodes at a time, such as in a tree viewer?

Since the height of the tree is unbounded, we would need a database schema that allowed for arbitrary height. A common solution is to have a single table, let’s call it WidgetTable, with a field for the parent reference, let’s call it ParentId. Obviously, the root node would have a null ParentId. Given a node X, how do I quickly determine if X is writable? In this case, the user may have write access to X directly or through a parent node somewhere up the tree.

The most common answer, since SQL doesn’t support unlimited-recursive queries of this nature, is to query the parent and check if its writable, then query the parent’s parent and check if that’s writable, and so on until you find a writable permission for the group or you hit the root. Well, let’s do worst case asymptotic analysis on this! The worst case would be if you had one really long chain of single nodes with X being the leaf and the writable flag being on the root. In this case you need to perform O(n) number of queries where n is the number of nodes in the database. The performance of this solution in worst case could be awful.

Amortized (average case) analysis would find better bounds, but remember the earlier stipulation that you may be viewing dozens or hundreds of widgets at once? In worst case, that would be O(n*m) where n is the number of nodes in the table, and m is the number of nodes you are looking up. One of the best solutions in a situation like this is to maintain a denormalized lookup table, let’s call it UserToWidget that maps users to nodes they currently have permissions on, taking all parent relationships into account. With such a table in place (and hopefully good indexes on that table), determining a user’s permissions on a large set of nodes can be done in O(m) time, or near constant depending on the number of nodes you are looking up.

What’s the cost here? Maintaining the UserToWidget table of course! In Oracle, there are options such as materialized views which can help. You could also use database triggers to maintain the relationships anytime the higher-level permissions change. But let’s say you want a solution that isn’t database specific, which most of us do, then you could write application code that acts anytime a user’s permissions are updated, to seek out and update changes to affected nodes below the node you updated. If database reads are more common than database writes (which I suspect they are if you’re viewing hundreds of nodes at a time), the cost of performing the complex update will be far less than the cost of quickly being able to read a widget and determine its permission for a given user.

I hope this example helps to illustrate cases where pure normalization can be too costly for practical use. If not, try reading it again 😉