CodeBetter.Com
CodeBetter.Com
RSS 2.0 via Feedburner
           Do you Twitter? Follow us @CodeBetter

Raymond Lewallen

Framework Design, Agile Coach, President Oklahoma City Developers Group, Microsoft MVP C#, TDD, Continuous Integration, Patterns and Practices, Domain Driven Design, Speaker, VB.Net, C# and Sql Server

Database Basics Part Two - Normalization

In an effort to provide information for those of you who are newcomers to databases, especially with the wonderul new Sql Server Express, I have committed myself to about half a dozen posts or so targeted directly at you to help you understand and build better databases.  Stay tuned to this blog for weekly updates to this series, or subscribe to the rss feed.

Normalization: organizing data in your database is an efficient way for both you and the database to use and understand.  What this really means is that normalization helps us to elimate redundant data and to store the data we have in a manner that makes sense.  Last week we talked about ACID, and you will recall the “atomicity” of a database.  Atomicity also must be maintained in tables via normalization.  Normalization helps us to create atomic tables by removing redundant data within rows of a table, thus creating atomic tables.

Guidelines exist for normalizing data (referred to as normal forms) and exists as 5 sets of rules, labled first normal form through fifth normal form.  In most practical applications, you rarely see 4th or 5th normal forms (herein referred to as 1NF-5NF), so we’re going to look at 1NF, 2NF and 3NF, along with denormalized data.

Denormalized Data

Let’s take a look at some denormalized data.  I’m going to use the simplest example that comes to mind, and that is a auto repair shop that specializes in domestic automobiles (domestic to the USA, that is).  Our table lists automobile manufactuers and the makes of automobiles they manufactuer that the shop services.

Fig. 1
Manufactuer Make 1 Make 2 Make 3
Ford Mercury Lincoln Ford
GM Chevrolet Pontiac Saturn, Buick
Chrysler Dodge      

Fig. 1 shows us denormalized data.  For GM, we didn’t have enough columns to store all the data we wanted, so we put Buick in with Saturn.

First Normal Form (1NF)

We want to take fig. 1 and normalize it a bit.  Before we can do that, we need to know what the rules of 1NF are.  There are 2 rules to 1NF.

  • Remove duplicate columns.
  • Each column by row position must have a unique value.

As you can see, 1NF is already enforcing the atomic nature of a table.  So first we need to remove duplicate columns from the table, at the same time creating column by row positions that have unique values, which means we also have to seperate Saturn, Buick into individual values.

Fig. 2
Manufactuer Make
Ford Mercury
Ford Lincoln
Ford Ford
GM Chevrolet
GM Pontiac
GM Saturn
GM Buick
Chrysler Dodge

Fig. 2 meets the 2 rules of 1NF.  We have normalized our Fig. 1 data into First Normal Form.

Second Normal Form (2NF)

For each level of normal form, the next level must adhere to the rules of the previous level, in addition to adding its own rules.  This means in order to create a data structure that meets the requirements of 2NF, we must first meet the requirements of 1NF.  We have already discussed the rules for 1NF, so let’s look at the rules required to meet 2NF.

  • Remove duplicate data in a single column and place the data in seperate tables.
  • Create relationships between the sets of data.

In Fig. 2, if GM gets bought by another manufactuer, say Honda, then we’d have to replace each row in the table with Honda everywhere we find GM.  That is a bit inefficient.  By meeting the 1st rule of 2NF, we can replace GM with Honda in our manufactuer table in Fig. 3, and this is must more efficient and a logical storage of data.  In order to meet the 2nd rule of 2NF, we must have keys and relationships.  The ManufactuerId in the manufactuer table is a Primary Key.  It is a unique identifier used to reference the description for the manufactuer.  We create a relationship beween it and the Make table using Manufactuer column in the Make table as the foreign key.  Fig. 4 shows the database diagram for this.  Now we are starting to adhere to consistency of our ACID model.  Fig. 3 satisfies all the rules of 1NF and 2NF.  We have normalized our Fig.2 to Second Normal Form.

Fig. 3Manufactuer Table
ManufactuerId Manufactuer
1 Ford
2 GM
3 Chrysler

Make table
Manufactuer Make
1 Mercury
1 Lincoln
1 Ford
2 Chevrolet
2 Pontiac
2 Saturn
2 Buick
3 Dodge

Fig. 4

 

Third Normal Form (3NF)

Again, in order to meet the rules of 3NF, we must first meet the rules of 1NF and 2NF.  In addition to meeting these requirements, 3NF introduces 1 other rule:

  • All columns that are not dependent on the primary key must be removed.

Sounds simple enough.  First, lets take our Make table from Fig. 3 and add a column to it that shows how much the auto repair shop charges per hour for each different make, and also a column that shows if the shop works on cars or trucks for each particular make.

Fig. 5Make table
Manufactuer Make Charge Style
1 Mercury $60 Cars
1 Lincoln $60 Cars
1 Ford $75 Trucks
2 Chevrolet $55 Both
2 Pontiac $60 Cars
2 Saturn $60 Cars
2 Buick $60 Cars
3 Dodge $75 Trucks

Not let’s examine the data and see if they meet the requirements of 3NF.  First, we must consider that Make is the primary key in this table.  It is a unique value that can be used as an identifier.  Ideally, that would be an integer column called MakeId and Make would be a description column.  We’ll look at that in a moment.  Knowing that Make is the primary key, are all columns in the Make table dependent on the primary key?  Manufactuer: yes.  The manufactuer depends on the make of the automobile.  Style?  Yes, the style the shop works on is dependent on the make of automobile.  Charge?  Hrm.  If we look closely, it appears as though the charge is dependent on the style, not the make.  This table does not meet 3NF.  We must remove the charge column and relate it to the Style, not to the make.  We do this by creating another table called charge.

Fig. 6
Style Charge
Cars $60
Trucks $75
Both $55

Now we can remove charge from the Make table in Fig. 5, because they style that is dependent on the make, gives us the charge.  The complete 3NF structure when all is said and done looks like Fig. 7 below.

Fig. 7

The Final DataManufactuer Table
ManufactuerId (PK) Manufactuer
1 Ford
2 GM
3 Chrysler

Make table
MakeId (PK) Manufactuer (FK) Make Style (FK)
1 1 Mercury 1
2 1 Lincoln 1
3 1 Ford 2
4 2 Chevrolet 3
5 2 Pontiac 1
6 2 Saturn 1
7 2 Buick 1
8 3 Dodge 2

Style Table
StyleId (PK) Style Charge
1 Cars $60
2 Trucks $75
3 Both $55

Next topic in this series: Isolation



Comments

Brendan Tompkins said:

Excellent Explanation... Can I request a topic? I've always heard that many database optimizations involve some "De-Normalization" for performance or other purposes... I'm personally kind of a freak about normalization, and hate to see denormalized data, but perhaps you could talk about why someone would deviate from 3rd normal form?
# January 4, 2006 4:44 PM

David Hayden said:

I was just about to make that point, Brendan.

Normalization is often sacrificed in ways to make applications more performant or maintainable. You have to look for the balance that makes sense, but there are times I will forgo normalization best practices to simplify and speed up my data access.

A lot of the best practices for individual tiers or layers in an application are contrary to best practices in other layers or tiers. Development is often choosing which best practice is more important than the other for your individual application.

Keep up the great posts, Raymond!
# January 4, 2006 5:34 PM

Raymond Lewallen said:

Brendan,

Absolutely. I will post about that. One of the most common scenarios you see surrounding denormalizing your data is for read-only, high speed applications i.e. reporting. When you don't have the need to worry so much about adding and changing data, this allows you to remove constraints and merge data back together (denormalize) for better read access speed and less work for the database overall.
# January 4, 2006 9:46 PM

Warren Connors said:

Another common reason for denormalization even within OLTP databases is for historical accuracy purposes. For example, in an OrderDetail table one might have OrderID, OrderDetailLineID, ProductID, ProductPrice, Quantity. Looking at the table strictly in pure database normalization terms, the ProductPrice column would not be included as it is not dependent on the primary key of the table (in this case a composite primary key of OrderID and OrderDetailLineID); it is dependent on the ProductID. So one might say do not include it to satisfy normalization and because one can always just go get the ProductPrice from the Product table for that ProductID. Only problem is that over time the price of that product might change quite frequently. So, the ProductPrice is stored with the order detail in order to be able to preserve accurately into the future what the price (and the calculated extended price of ProductPrice X Quantity as well) were at the time of the order.
# January 7, 2006 12:50 AM

Raymond Lewallen said:

Warren,

Absolutely. That is one of the scenarios I have put into a blog post I'm going to put up on denormalization. Keeping historical data accurate often requires denormalizing data.
# January 7, 2006 1:06 AM

Sam said:

Warren:

I know what you're trying to explain, but just to clear up any confusion I think you picked a poor example. When modelling your entities you would find out the LineItemPrice is in fact a member of a LineItem and not a Product. As you said, a ProductPrice may change over time, while the LineItemPrice stays fixed.

That's because the LineItemPrice isn't modelling the Price of a Product. It's modelling the price of a LineItem, which has a product (usually). This LineItem price is something completely different in modelling terms since it's not mean to give detail about the product at all. They aren't related entities. It's meant to give the price of the LineItem, and the LineItem only. This may include premiums for special-orders, or discounts for an overstock condition or special customer relation. Your margin on the Product may generally dictate a certain profitable LineItem price, but you might also take a huge hit on a particular LineItem knowing the sum of the Order more than makes up for it in another LineItem.

So sorry if this seems petty. Just trying to get people to think about (in my opinion) something that often isn't given enough thought: What is your Model? In this case you aren't really denormalizing at all. You're representing exactly what your Model requires.

Raymond:

Great series!

If I could make a suggestion though: Maybe going into a bit of detail about common misconceptions might be helpful. 99% of the "denormalization" I've seen for example hasn't helped performance, because it was never profiled. The cost of larger indexes was never taken into account or measured. The cost of violating the fundamentals of the Relational model was never considered because the person doing the work didn't understand that a database was more than just a QueryLanguage for what looked like CSV files.

I've been guilty of that myself in the past, so no offense intended of course. In my experience though *de*normalization rarely is though since the database was never created with normalization in mind. Either Joins are thought too complex or tedious, or it's just assumed that denormalization equals greater performance when that's generally not the case.

Another scenario you touch on, Reporting, is often used to justify it, but again, in my experience it's usually premature (meaning that a normalized version was never even attempted), or the wrong tool is being used (lots of aggregates, which would be better served by something like Analysis Server).

Of course there are exceptions to every rule, but everyone thinks they're a database whiz it seems without actually understanding the Relational model (I only have a loose grasp on it myself), or more importantly, coming up with the hard data to prove one way or another that they're approaching the problem with the right perspective. So unless your a "real" DBA, I would _always_ start out with a normalized design, tune your indexes, default cluster, use the hardware you have available to you by doing things like using different file clusters and fill-factors, make sure you're caching effectively in your applications, and then, and only then, would I look into denormalization if you still have performance issues. Even then, I'd suggest looking at trying denormalizing different entities instead of just trying to flatten everything with a big stick. Look at your entities and see what has the least impact on your model if it were to be denormalized. Look at what will get you to that make Join count of 3. Don't assume that one really large table with several covering indexes is always going to be faster if you've basically bloated your index size to the point where it duplicates most of your row data.

I could ramble more but hopefully there's a point in there somewhere. :-)
# January 7, 2006 1:50 AM

Raymond Lewallen said:

Sam,

Good comments. The are absolutely needs for denormalizing data, and the biggest need is reporting. Hopefully we'll be able to continue these conversations on a "denormalizing" post instead of this "normalizing" post.

In my current project, we have a need to build numerous reports, each of which contains 250+ fields. This was originally built using mutliple views, nested queries and 15+ joins. That is the ideal situation to denormalize data into a seperate entity to store that data and pull it for the report, thus executing the query once per primary key and storing denormalized data, rather than executing the massive query every time the report needs to be viewed.

Sure, understanding the relational model is important, as is understanding the engine which executes your queries. I would estimate that smaller sets of data that get denormalized don't provide any performance increase, but they aren't providing decresed performance either, if used for what they are intended for, which is typically read-only data.

Its the really big sets of data, such as a view, where you have many, many joins and nested queries that provide benefit, but then, like I said, generally for read-only data because its harder on the indexes if you are updating or inserting data into large, denormalized tables.
# January 7, 2006 8:51 AM

Raymond Lewallen said:

Sam,

Also, I forgot to mention, I could care less about minor performance losses when denormalizing data for certain purposes. There are very real scenarios like Warren talked about where storing historical data requires data denormalization. You mention "everyone thinks they're a database whiz it seems without actually understanding the Relational model ", and it seems like you are saying the people who want to denormalize data typically don't know what they are doing. Denormalization is just as much part of the relational model as normalizing. The more common problem with understanding the relational model is that too many people think 3NF is the only way to design a database, and its not. Hopefully my upcoming denormalization post will help clear that up.
# January 7, 2006 11:01 AM

Sam said:

I look forward to your denormalizing post!

I will take it to my grave that Warren's example was deceptively easy to mischaracterize as denormalization, when in fact he was speaking of a unique entity. :-)

>> "it seems like you are saying the people who want to denormalize data typically don't know what they are doing."

I think it's fair to say most programmers (including myself) are often just trying to get the job done, without really coming to a deep understanding of their tools or problem domain.

I don't think 3NF is the only way to _implement_ a database (and as you pointed out, there are _obvious_ situations where it's the *wrong* way to implement one). I do think it's the right way to _design_ a database though. At least until you have metrics, or the experience, as you obviously do, to determine if something is an obvious candidate for denormalization.

If you skip the whole step of prototyping with 3NF, then I would _strongly_ suspect that you've probably also skipped modelling your problem domain at all. While Domain Models aren't the _only_ way to create an application, I suppose you could call me a true believer since I think they're the _right_ way in _most_ (but not all of course) scenarios.

I understand I'm probably in the minority on this one in the .NET community of course, so I'll try not to preach too much. :-)

Regardless, great article, and I'm really looking forward to more!
# January 7, 2006 11:53 PM

chuck said:

If you have a normalized database and find you always are joining tables for a report(s) can't you just create a View table just for those report(s) that take alot of joins.

If i remember correctly the database will update the view if any of the tables get modified.
# January 25, 2006 9:55 AM

Raymond Lewallen said:

Author <a href="blogs/raymond.lewallen/">Raymond Lewallen</a><br />In an effort to provide information for those of you who are newcomers to databases, these Database Basics series of posts are targeted directly at you to help you understand and build better databases.
# February 10, 2006 11:29 AM

Development said:

# March 4, 2006 11:35 PM

Harpreet said:

Your 1NF to 2NF definition and example are flat out incorrect.  Look up the definition.  Substituting 1 for Ford doesn't change your normalization at all.    I think people are too eager to publish stuff for web stats and google ads.
# April 15, 2006 2:09 AM

Raymond Lewallen said:

Harpreet, I'm not sure where you are getting your definations from, but the definitions I have provided are exactly what 1NF, 2NF and 3NF are defined as.  I think people are too eager to criticize what they don't understand.
# April 15, 2006 11:34 AM

Harpreet said:

I've taught this stuff many times so I'm pretty sure that your 2NF example is wrong.  One good way to look at this is to see if the number of rows/columns in your table have changed.  If they have not then you probably haven't changed normal forms.  A simple substitution doesn't change that.  Just check a book on database theory if you don't believe me.  

Here's a textbook definition:
A Relation schema R is in 2NF if every nonprime attribute A in R is not partially dependent on any key of R.

Your example can't be put into 2NF because it already is.  That's why your step is incorrect.  To even test for 2NF issues you have to have a primary key with multiple attributes.  

I realize you're confident that I'm wrong but you might explore it a bit more anyway. I'm not just making this up.
# April 22, 2006 12:39 PM

Raymond Lewallen said:

Harpreet, perhaps I should have used larger sets of data and columns to explain the normalizations and extraced larger subsets of data, then their wouldn't be any confusion.  Perhaps I'll go back later and create larger sets of data to better explain the normalizations.
# April 26, 2006 4:54 PM

prasad said:

hi
# June 22, 2006 4:40 AM

karthik scjp said:

Your explanation is nice. Dont mind those stupid commenters

# March 18, 2008 10:12 AM

Leave a Comment

(required)  
(optional)
(required)  

Enter the numbers above:
Add

About Raymond Lewallen

Working primarily in the public sector during his career, Raymond has designed and built several high profile enterprise level applications for all levels of the government. Raymond now works as a solutions architect for EMC. Raymond is an agile coach, Microsoft MVP C# and also president of the Oklahoma City Developers Group and Oklahoma Agile Developers Group. Raymond spends a lot of his time learning and teaching such things as Test Driven Development, Domain Driven Design, Design Patterns and Extreme Programming practices and principles, to name a few. Raymond is also an advocate of Alt.Net. Raymond is primarily a framework guy, so don't ask him anything about UI :) Check out Devlicio.us!