Database Normalization Overview

Developed in the 1970's by E.F. Codd, database normalization is standard requirement of many database designs. Normalization is a technique that can help you avoid data anomalies and other problems with managing your data. It consists of transforming a table through various stages: 1st normal form, 2nd normal form, 3rd normal form, and beyond.

It aims to:

  • Eliminate data redundancies (and therefore use less space)
  • Make it easier to make changes to data, and avoid anomalies when doing so
  • Make referential integrity constraints easier to enforce
  • Produce an easily comprehensible structure that closely resembles the situation the data represents, and allows for growth

Let's begin by creating a sample set of data. You'll walk through the process of normalization first without worrying about the theory, to get an understanding of the reasons you'd want to normalize. Once you've done that, we'll introduce the theory and the various stages of normalization, which will make the whole process described below much simpler the next time you do it.

Imagine you are working on a system that records plants placed in certain locations, and the soil descriptions associated with them.

The location:

  • Location Code: 11
  • Location name: Kirstenbosch Gardens

contains the following three plants:

  • Plant code: 431
  • Plant name: Leucadendron
  • Soil category: A
  • Soil description: Sandstone
  • Plant code: 446
  • Plant name: Protea
  • Soil category: B
  • Soil description: Sandstone/Limestone
  • Plant code: 482
  • Plant name: Erica
  • Soil category: C
  • Soil description: Limestone

The location:

  • Location Code: 12
  • Location name: Karbonkelberg Mountains

contains the following two plants:

  • Plant code: 431
  • Plant name: Leucadendron
  • Soil category: A
  • Soil description: Sandstone
  • Plant code: 449
  • Plant name: Restio
  • Soil category: B
  • Soil description: Sandstone/Limestone

Tables in a relational database are in a grid, or table format (MariaDB, like most modern DBMSs is a relational database), so let's rearrange this data in the form of a tabular report:

Plant data displayed as a tabular report

Location codeLocation namePlant codePlant nameSoil categorySoil description
11Kirstenbosch Gardens431LeaucadendronASandstone
446ProteaBSandstone/limestone
482EricaCLimestone
12Karbonkelberg Mountains431LeucadendronASandstone
449RestioBSandstone/limestone

How are you to enter this data into a table in the database? You could try to copy the layout you see above, resulting in a table something like the below. The null fields reflect the fields where no data was entered.

Trying to create a table with plant data

Location codeLocation namePlant codePlant nameSoil categorySoil description
11Kirstenbosch Gardens431LeucadendronASandstone
NULLNULL446ProteaBSandstone/limestone
NULLNULL482EricaCLimestone
1 2Karbonkelberg Mountains431LeucadendronASandstone
NULLNULL449RestioBSandstone/limestone

This table is not much use, though. The first three rows are actually a group, all belonging to the same location. If you take the third row by itself, the data is incomplete, as you cannot tell the location the Erica is to be found. Also, with the table as it stands, you cannot use the location code, or any other fields, as a primary key (remember, a primary key is a field, or list of fields, that uniquely identify one record). There is not much use in having a table if you can't uniquely identify each record in it.

So, the solution is to make sure each table row can stand alone, and is not part of a group, or set. To achieve this, remove the groups, or sets of data, and make each row a complete record in its own right, which results in the table below.

Each record stands alone

Location codeLocation namePlant codePlant nameSoil categorySoil description
11Kirstenbosch Gardens431LeucadendronASandstone
11Kirstenbosch Gardens446ProteaBSandstone/limestone
11Kirstenbosch Gardens482EricaCLimestone
12Karbonkelberg Mountains431LeucadendronASandstone
12Karbonkelberg Mountains449RestioBSandstone/limestone

Notice that the location code cannot be a primary key on its own. It does not uniquely identify a row of data. So, the primary key must be a combination of location code and plant code. Together these two fields uniquely identify a row of data. Think about it. You would never add the same plant type more than once to a particular location. Once you have the fact that it occurs in that location, that's enough. If you want to record quantities of plants at a location - for this example, you're just interested in the spread of plants - you don't need to add an entire new record for each plant; rather, just add a quantity field. If for some reason you would be adding more than one instance of a plant/location combination, you'd need to add something else to the key to make it unique.

So, now the data can go in table format, but there are still problems with it. The table stores the information that code 11 refers to Kirstenbosch Gardens three times! Besides the waste of space, there is another serious problem. Look carefully at the data below.

Data anomaly

Location codeLocation namePlant codePlant nameSoil categorySoil description
11Kirstenbosch Gardens431LeucadendronASandstone
11Kirstenbosh Gardens446ProteaBSandstone/limestone
11Kirstenbosch Gardens482EricaCLimestone
12Karbonkelberg Mountains431LeucadendronASandstone
12Karbonkelberg Mountains449RestioBSandstone/limestone

Did you notice anything strange? Congratulations if you did! Kirstenbosch is misspelled in the second record. Now imagine trying to spot this error in a table with thousands of records! By using the structure in the table above, the chances of data anomalies increases dramatically.

The solution is simple. You remove the duplication. What you are doing is looking for partial dependencies - in other words, fields that are dependent on a part of a key and not the entire key. Because both the location code and the plant code make up the key, you look for fields that are dependent only on location code or on plant name.

There are quite a few fields where this is the case. Location name is dependent on location code (plant code is irrelevant in determining project name), and plant name, soil code, and soil name are all dependent on plant number. So, take out all these fields, as shown in the table below:

Removing the fields not dependent on the entire key

Location codePlant code
11431
11446
11482
12431
12449

Clearly you can't remove the data and leave it out of your database completely. You take it out, and put it into a new table, consisting of the fields that have the partial dependency and the fields on which they are dependent. For each of the key fields in the partial dependency, you create a new table (in this case, both are already part of the primary key, but this doesn't always have to be the case). So, you identified plant name, soil description and soil category as being dependent on plant code. The new table will consist of plant code, as a key, as well as plant name, soil category and soil description, as shown below:

Creating a new table with location data

Plant codePlant nameSoil categorySoil description
431LeucadendronASandstone
446ProteaBSandstone/limestone
482EricaCLimestone
431LeucadendronASandstone
449RestioBSandstone/limestone

You do the same process with the location data, shown below:

Creating a new table with location data

Location codeLocation name
11Kirstenbosch Gardens
12Karbonkelberg Mountains

See how these tables remove the earlier duplication problem? There is only one record that contains Kirstenbosch Gardens, so the chances of noticing a misspelling are much higher. And you aren't wasting space storing the name in many different records. Notice that the location code and plant code fields are repeated in two tables. These are the fields that create the relation, allowing you to associate the various plants with the various locations. Obviously there is no way to remove the duplication of these fields without losing the relation altogether, but it is far more efficient storing a small code repeatedly than a large piece of text.

But the table is still not perfect. There is still a chance for anomalies to slip in. Examine the table below carefully:

Another anomaly

Plant codePlant nameSoil categorySoil description
431LeucadendronASandstone
446ProteaBSandstone/limestone
482EricaCLimestone
431LeucadendronASandstone
449RestioBSandstone

The problem in the table above is that the Restio has been associated with Sandstone, when in fact, having a soil category of B, it should be a mix of sandstone and limestone (the soil category determines the soil description in this example). Once again you're storing data redundantly. The soil category to soil description relationship is being stored in its entirety for each plant. As before, the solution is to take out this excess data and place it in its own table. What you are in fact doing at this stage is looking for transitive relationships, or relationships where a nonkey field is dependent on another nonkey field. Soil description, although in one sense dependent on plant code (it did seem to be a partial dependency when we looked at it in the previous step), is actually dependent on soil category. So, soil description must be removed. Once again, take it out and place it in a new table, along with its actual key (soil category) as shown in the tables below:

Plant data after removing the soil description

Plant codePlant nameSoil category
431LeucadendronA
446ProteaB
482EricaC
449RestioB

Creating a new table with the soil description

Soil categorySoil description
ASandstone
BSandstone/limestone
CLimestone

You've cut down on the chances of anomalies once again. It is now impossible to mistakenly assume soil category B is associated with anything but a mix of sandstone and limestone. The soil description to soil category relationships are stored in only one place - the new soil table, where you can be much more certain they are accurate.

Often, when you're designing a system you don't yet have a complete set of test data available, and it's not necessary if you understand how the data relates. This article has used the tables and their data to demonstrate the consequences of storing data in tables that were not normalized, but without them you have to rely on dependencies between fields, which is the key to database normalization.

The following articles will describe the process more formally, starting with moving from unnormalized data (or zero normal form) to first normal form.

Comments

Comments loading...
Content reproduced on this site is the property of its respective owners, and this content is not reviewed in advance by MariaDB. The views, information and opinions expressed by this content do not necessarily represent those of MariaDB or any other party.