Database Normalization…. ?
Hanging out in the #mysql channel on Freenode the other night, some members were talking about database normalization. I wasn’t sure what that term meant, so I followed the links that were pasted to find out more. What I discovered was that these ‘normalization rules’ are very practical guidelines for schema creation that I’ve been using for years without knowing what they were called!
In the following article I start looking at the actual rules, and why they can be helpful!
What is normalization?
The concept of normalization is about structuring your databases in such a way as to prevent update anomalies and data inconsistencies. Normalization has been ‘normalized’ into five distinct levels, each level adding new rules to your schema design. Below I’ve tried to summarize the rules about normalization… not necessarily how to apply these rules to your own schemas. For that help you will want to check the further reading links at the end of the article.
First Normal Form (1NF) – No Repeating Groups
All occurrences of a record type must contain the same number of fields.
This one may seem obvious… and impossible when you first think about it. Defining a table schema sets up the number of fields and you can’t change it on a record by record basis…. or can you? Check out the example below.
+--------+------------------+--------------------+ | BookID | Title | Category | +--------+------------------+--------------------+ | 1 | Harry Potter | Fantasy | | 2 | Green Eggs & Ham | Childrens, Fantasy | | 3 | Twilight | Fiction, Vampire | +--------+------------------+--------------------+
We’ve defined a pretty simple table to store information about books. So how does this not meet the guidelines of first normal form? Well, look at the Category field. In this instance several categories are being referenced in the same column, delimited by commas.
This might seem like an acceptable practice to some of you, but imagine what happens when each book is linked to ten or twenty different categories, and you had to run a query to find say, all books in the following two categories… Your application would either need to pull out that category field and do its own processing on it, or your query becomes a mess of LIKE ‘%category%’ type statements. Not an ideal situation.
Second Normal Form (2NF) – No Repeated Data
All non keyed columns in a table must be dependent on the entire key.
This rule is a little trickier than the first, but essentially boils down to this: Any fields in the table must relate directly to the key. Not doing so results in data duplication. Perhaps an example can illustrate this rule better:
+---------+----------------+----------+-----------------------+ | ColorID | ManufacturerID | Quantity | ColorDescription | +---------+----------------+----------------------------------+ | 112 | 12 | 1000 | A pinky-red color | | 113 | 96 | 1895 | Sky blue | | 114 | 3 | 10 | Orange with red tints | +---------+----------------+----------+-----------------------+
The above table has a multi-column key using ColorID and ManufacturerID.
So what violates the second normal form? That’d be the ColorDescription field. This field is referring only to the ColorID, and not the ManufacturerID. Why is this bad? Well imagine there was a second manufacturer making the same color… you’d have a duplicate ColorDescription!
Third Normal Form (3NF) – All Columns Dependent on Key
All non key columns must be dependent on the key, the whole key, and nothing but the key.
Similar to the rule for 2NF, Third Normal Form looks to ensure that everything in a table adds something to the key.
+---------+-----------+--------------+ | FruitID | FruitName | Supplier | +---------+-----------+--------------+ | 77 | Apple | Joes Fruits | | 78 | Orange | Happy Farms | | 79 | Banana | Del Monte | +---------+-----------+--------------+
Here we have a table about Fruit. The key is FruitID. Can you see the problem?
The Supplier column isn’t strictly about fruit, but instead defining a relationship between the fruit and a supplier. A case may arise where we’d want to have multiple suppliers for the same fruit. Tricky with this schema. Further, if you wanted to update the name of a supplier, you’d be looking at modifying many records potentially instead of just one.
Supplier should be broken out into its own table to satisfy Third Normal Form.
Conclusion
These first three rules of database normalization may seem complicated at first, but understanding them can help lead you to happy healthy databases!
Two more normal forms exist, and beyond normal forms other theories exist. We haven’t even touched on performance implications either, which could be an important deciding factor in how normalized you want your database to be. I hope to address these issues in a future article.
Let me know if you have any questions or comments! I’ll be happy to get a discussion going.
Further Reading
Wikipedia article on normalization
Guide to Five Normal Forms in Relational Database Theory
| BookID | Title | Category |
+——–+——————+——————–+
| 1 | Harry Potter | Fantasy |
| 2 | Green Eggs & Ham | Childrens, Fantasy |
| 3 | Twilight | Fiction, Vampire |
+——–+——————+——————–+
