Denormalized Databases: A better example

There were a number of comments about my recent article on the negative effects of too much database normalization so allow me to expand the topic a little more. The most consistent comment I saw was that while many of you agreed with me in principle that too much normalization can lead to poor performance, the country name example was a bad choice. As a former teacher, I tend to choose illustrative examples that are easy to understand and explain. As a blog writer, I also tend to forget the web is full of nitpickers that will argue with you unless presented with a concrete example from the real world.

My favorite comment was about how in Europe a poster’s country name has changed 3-4 times in the last few years. I could argue most Americans would be shocked to see their country name or state name change, but I have to think… how hard is it to update the name of a country in a database even if it does change once a year? The trick with database denormalization is you have to write code that updates two separate sets of items, and in a lot of cases this is less difficult than many of you imagine. If updates are uncommon (static) as I mentioned, the performance gain from reading can far outweigh the rare cost of performing an update.

Perhaps its better to move on to a better example of where denormalization of data can play an important part: managing user permissions. And no, this is not to be confused with database security (which is completely useless for user/application-level permission management). Let’s say you have a tree-based system of widgets. Widgets can have sub-widgets and so on such that each widget has only one parent but can have many children. Now let’s say users within the application (again not to be confused with database users) are assigned read/write roles on individual widgets as well as to entire subtrees of widgets. For example, Bob may be able to read the entire tree, but he can only write to select widgets A, B, and C or a select sub-tree of widgets D which contains X, Y and Z. Now, how would you design a database system that allowed for quick determination of a user’s permissions on a node, considering the user may be accessing dozens or hundreds of nodes at a time, such as in a tree viewer?

Since the height of the tree is unbounded, we would need a database schema that allowed for arbitrary height. A common solution is to have a single table, let’s call it WidgetTable, with a field for the parent reference, let’s call it ParentId. Obviously, the root node would have a null ParentId. Given a node X, how do I quickly determine if X is writable? In this case, the user may have write access to X directly or through a parent node somewhere up the tree.

The most common answer, since SQL doesn’t support unlimited-recursive queries of this nature, is to query the parent and check if its writable, then query the parent’s parent and check if that’s writable, and so on until you find a writable permission for the group or you hit the root. Well, let’s do worst case asymptotic analysis on this! The worst case would be if you had one really long chain of single nodes with X being the leaf and the writable flag being on the root. In this case you need to perform O(n) number of queries where n is the number of nodes in the database. The performance of this solution in worst case could be awful.

Amortized (average case) analysis would find better bounds, but remember the earlier stipulation that you may be viewing dozens or hundreds of widgets at once? In worst case, that would be O(n*m) where n is the number of nodes in the table, and m is the number of nodes you are looking up. One of the best solutions in a situation like this is to maintain a denormalized lookup table, let’s call it UserToWidget that maps users to nodes they currently have permissions on, taking all parent relationships into account. With such a table in place (and hopefully good indexes on that table), determining a user’s permissions on a large set of nodes can be done in O(m) time, or near constant depending on the number of nodes you are looking up.

What’s the cost here? Maintaining the UserToWidget table of course! In Oracle, there are options such as materialized views which can help. You could also use database triggers to maintain the relationships anytime the higher-level permissions change. But let’s say you want a solution that isn’t database specific, which most of us do, then you could write application code that acts anytime a user’s permissions are updated, to seek out and update changes to affected nodes below the node you updated. If database reads are more common than database writes (which I suspect they are if you’re viewing hundreds of nodes at a time), the cost of performing the complex update will be far less than the cost of quickly being able to read a widget and determine its permission for a given user.

I hope this example helps to illustrate cases where pure normalization can be too costly for practical use. If not, try reading it again 😉

4 thoughts on “Denormalized Databases: A better example

  1. I wanted to chime in on the last article too, but for lack of time I couldn’t finish my point, then lost what I had written to a cache flush. Oy.

    My two cents on this one: delete this apology! Simple examples are meant to be taken apart, but for the sake of clarifying the problem, not criticizing the writer. The only nitpick that is worth reading to the end is a better example! If someon truly didn’t get what you were saying because of your example, that’s one thing.

    As I see it, whiteboard examples are limiting in proportion to how literally it is understood — and how quickly the reader would hope to apply it, rather than mediate on the meaning. If I’ve learned anything from my almost-15 years in technical training, it’s that people don’t want to think if they don’t have to, they want to plug in. Unless you’re offering that service, stick to examples that demonstrate your point without proving it. Let the impatient reader meet you halfway. If they don’t want to do even that, don’t worry about it! You do well enough to put your points out there for free.

    Keep ’em coming, Scott!

  2. Denormalization is beneficial especially if you have a normalized hierarchical table.An example could be parts having subparts resulting in a hierarchy of multiple levels. I did a time check on with an SQL query on this kind of a table(with 3 levels ) and the denormalized table gave a substantial performance improvement (with respect to time)

  3. You shouldn’t assume that you know how any particular database engine will perform a given task. They will all be different. For example, some may allocate space to empty fields, some may not; joins may sometimes be performed with table accesses and sometimes with index lookups. Frequently used small tables may be kept in memory and be very quick to access. The advantage of a normalised design is as a consistent start point from which to optimise for performance, if that is what’s needed. Optimisation may be by tuning the database, creating indexes or denormalising as appropriate.

  4. “SQL doesn’t support unlimited-recursive queries of this nature”

    That’s true for parent-linked trees.

    One denormalization to work around this is your user-to-widget table. But that can grow fairly quickly when several users are given privileges over the head of a large category tree.

    Another denormalization is to cache of the category tree’s preorder traversal as a “nested set” model. With a nested set, you can quickly look up all parents of a node or all descendants of a node. Then to see if a user has privileges on a node or any of its ancestors, you can do something like this to find all ancestors of category 1233 where Chester has privileges:

    SELECT couter.catid, p.privs
    FROM users u
    INNER JOIN user_subcat_privs p ON p.userid = u.userid
    INNER JOIN subcats couter ON p.catid = couter.catid
    INNER JOIN subcats cinner
    ON (couter.lft <= cinner.lft AND cinner.lft <= couter.rgt)
    WHERE users.username = ? AND cinner.catid = ?
    –args: ('Chester', 1233)

    This is O(n) for a node of depth n, as cinner.lft will be between couter.lft and couter.rgt only for a particular category's ancestors. It doesn't grow as quickly as more users are added or when a user's privileges on a fairly large category are changed.

    Nested sets can be regenerated periodically, or they can even be updated in place.
    http://mikehillyer.com/articles/managing-hierarchical-data-in-mysql/

Leave a Reply

Your email address will not be published. Required fields are marked *