Using MySQL, Normalisation

Workshop Requirements

You should have experience of using databases and be familiar with retrieving data from multiple tables using SQL joins. It is also advised that you have undertaken parts 1 -5 of this MySQL series so that you are familiar with the examples used so far.

Introduction

In the previous workshops in the MySQL series, I've hinted at elements of database theory (such as normalisation and foreign keys) without fully explaining these concepts and why people use them when building databases. This workshop aims to fill this gap, but from previous experience in trying to teach this material, I know that a lot of people when learning about normalisation have problems relating the theory to practice. This workshop aims to bridge this gap by providing numerous examples that illustrate both why we need to normalise at each step and also what effect is has on the data. It is not intended to be an indepth theoretical discussion of database theory, however that said we cannot normalise by studying the data alone and thus we will need to touch on a few other database concepts to fully understand what is going on.

Unusually for these workshops I'm going to give figure numbers to the examples used and store them in external files. This is due to the size of some of the tables, and it may be worth printing these figures before continuing.

What is normalisation?

Normalisation is a process whereby the tables in a database are optimised to remove the potential for redundancy. Two main problems may arise if this is not done:

I'm not going to develop the idea of specific anomalies much here, but to briefly illustrate an update anomaly imagine the problem of updating a customer address. If there are multiple instances of that customer in a database, any query designed to update the address will have to update them all. That may seem easy - as we've seen in previous workshops, SQL is quite capable of updating information based on field values. But if one of those field values (say an instance of customer name) has been input incorrectly, then the query would fail to update the address in that row.

Normalisation aims to remove this redundancy by applying rules in a series of stages, splitting tables and creating relationships between unique identifiers (keys), to ensure that the database table structure is efficient, but data can still be accurately manipulated. It can also be used in conjunction with other database modeling techniques, such as Entity-Relationship diagrams, but for simple databases normalisation can sometimes be enough.

Before we begin looking at the stages of normalisation it is worth having a brief look at some of the relational database concepts that inform the process.

Relational Databases

Relational databases, such as MySQL, are so called as they rely on a series of relationships to connect data stored in different tables. For example in Workshop 5 we looked at joining tables based on a common ID number. This ID number was stored as the uniquely valued field (Primary Key) in one table (for example in the artists table), but was also present in another table (the cds table) acting as a reference (Foreign Key) that allowed data to be retrieved from both tables when the values matched. For ease of understanding we used unique numbers, but a key can have any value provided it conforms to a few rules.

Terminology

When were we were looking at the physical database in the other workshops we used certain terms such as TABLE, COLUMN, ROW and FIELD. Discussion of database theory uses different terms:

Relation/ Entity - These are the same as a table.

Attributes - These are similar to columns in that they describe the type of data stored.

Domain - This is values within the same attribute (a collection of fields that exist in the same column).

Tuple - This is a record similar to a row.

As I'm trying to tie the theoretical discussion to the practical database experience that the first Virtual Workshops provide, I will use both sets of terms in this workshops for clarity and to reinforce their meanings.

Different Keys

The Primary Key is frequently used to identify a specific row (although other keys may exist) and all the other values are dependent on that value to be meaningfully identified. A primary key is usually one attribute (column) in a relation (table), but can be a combination of attributes (Composite Key). If we consider the following attributes some cds may have.

Catalogue No. Artist Title

The obvious Primary Key for a CDs table would be the Catalogue No. as each cd has already been given a unique identifier. But consider if we knew for a FACT that each artist could only release one CD per year (perhaps something a cold war communist five year plan would envisage ;-).

CDs
Catalogue No. Artist Year Title

By adding the Year attribute we could also use the combination of Artist and Year to identify the row, as together they form a Composite Key. Thus we have two potential or Candidate Keys. We would probably still choose the Catalogue No. as it is simpler to use one column. Consider a Radio station wishing to record how many times a day they played a CD. The combination of Catalogue No. and the date would be sufficient to identify that record in a new Play History table.

 
Play History
Catalogue No. Date Times Played

In this situation the Catalogue No. number is a Primary Key in the CDs table but only part of the Composite Key in the Play History table as the date is required to form the whole key. An alternative could have been to create a new unique single-attribute Primary Key for the table. As we still need to know the catalogue number in order to work out how many times a DJ has played Britney, Christina or Pink (BBC Radio One is just rubbish isn't it? ;-). The catalogue number would become the Foreign Key in the Play History table, i.e. a Foreign Key is a attribute (column) in a table that refers to a Primary Key in another table.

Play History
HistoryID Catalogue No. Date Times Played

During the Normalisation process we will look at the rules that determine how tables (and thus Keys) are formed.

NULLs

NULLs are used in databases to signify that no value exists, but are not really values themselves. Instead they are just an indictor that a particular field has no value. Some argue this is better that leaving a blank value or having a default value. I personally see little value in a blank field, but default values (such as inserting the current date if no date is offered) can be useful. The interesting point about NULLs is that no Primary Key can contain a NULL and thus it's useful when comparing Candidate Keys to see if one could potentially contain a NULL, thus disqualifying that key.

Functional Dependency

This is another important concept and describes the relationship that columns have to the Primary Key. If the value of a column can be determined by knowing the value of the Primary Key and no other, then that column is said to be functionally dependent. Or more simply, if we know the value of Primary Key we can find out the value of any other dependent column. This dependency is often written as follows (where A is the Primary Key and B is the dependent column):

A → B

So for the CDs table above we could express the dependency of the title column as either:

Catalogue No → Title         or

(Artist, Year) → Title

This is important as we will see later as having a column that is NOT wholly dependent on the Primary Key causes redundancy. Having introduced some of the terminology of Normalisation we can begin to look at the stages.

Describing relations (tables)

When describing tables there is a way of expressing their structure that we will follow here. This includes the name of the relation, the attributes within the relation and which attributes are key. The format of this is:

RELATION (attribute one, attribute two, attribute three etc)

NOTE: The key attribute is underlined to signify its status and the relation is always UPPERCASE.

For example consider how we might express the CDs table above:

CDS (Catalogue No, Artist, Year, Title)

Stages of Normalisation

There are several stages of normalisation that a database structure can be subject to, each with rules that constrain the database further and each creating what is called a Normal Form. These are, in order:

As the database goes through each stage of normalisation it is said to be in the normal form of that stage, i.e. my database is 2NF and I need it to be 3NF. We are not going to take a detailed look at all of these Normal Forms as BCNF, 4NF and 5NF are probably overkill for small to medium databases, with the first 3 normal forms usually being sufficient. There is also one other stage - that of the Un-normalised Normal Form (UNF) which is the starting point for Normalisation. To begin our examination we must first create this Un-normalised data.

Our Scenario

In order to look at database normalisation we are going to use a fictional CD library (quaint in the post-napster internet age I know). Monitoring who has which CD is done by logging the CDs in and out of a loans logbook that contains the following information (with sample data):

Date Borrowed Due Back Borrower Artist Title
1st Dec 8th Dec David Findlay Britney Spears
Pink
Christina Aguilera
Britney
Cant Take Me Home
Stripped
8th Dec 15th Dec Iain Brown Darius
Christina Aguilera
Will Young
Dive In
Stripped
From Now On

Loans Log

Information is also stored about each borrower in a separate register:

Borrower Address Telephone No. E-mail
David Findlay Davies Lodge
23 Driven St
Edinburgh
EH5 APE
0131 555 5579 david@keithjbrown.co.uk
Iain Brown 54 Home St,
Edinburgh,
EH23 8TF
0131 555 5580 iain@keithjbrown.co.uk

Borrower Log

Finally there is a ledger that stores information about the CDs:

Artist Title Year Label No. of Tracks Copy
Britney Spears Britney 2001 Jive 14 1
Pink Cant Take Me Home 2002 LaFace 15 3
Christina Aguilera Stripped 2002 RCA 20 1
Darius Dive In 2002 Mercury 13 2
Will Young From Now On 2002 RCA 13 2

CDs Log

We can see that by keeping the information in different places there has been some common sense organisation to avoid repeatedly entering the same data. This is quite common, but doesn't really help up as we need to use the normalisation rules to determine our structure. Thus to begin with we'll merge the fields (attributes) into a one-table structure along with the same sample data (figure one).

We have also created a column called 'Transaction No' which we use as a primary key to uniquely identify each lending transaction. We do this as there is no other simple method of identifying the row, as the obvious Composite Key is pretty complex (date borrowed, name, artist, title ), and thus assigning a unique 'Transaction No' is sensible at this time.

Note: As I mentioned above, a modeling tool like Entity Relationship diagramming is good to use before normalisation. This would be essential on larger database projects as it would be pretty impossible to create one big flat database to begin normalising. It is also worth noting that although we have included sample data, we are normalising the database structure and not the data itself. It can be easier however to spot potential problems when using example data.

So to begin the normalisation process, we use the Un-normalised data and apply our first rules.

First Normal Form (1NF)

For a relation (table) to conform be 1NF, it must obey the following rules:

To ensure that all attributes (columns) are atomic - what this means is that any single field should only have a value for ONE thing. For example in a 'cars' database a field in a car may have a value 'Blue Ford Focus'. This field is not atomic as it contains more than one piece of information (colour, manufacturer and model). To demostrate the problem with this, try to imagine constructing a query to select all Ford cars. This would be difficult as the value 'Ford' is trapped in the middle of the field. We must therefore split this column into three (colour, manufacturer, model).

There must be no repeating groups. This is perhaps most easily understood as there cannot be repeating columns (that which contain the same type of information). For example a cars field may contain a list of Ford cars (Focus, Mondeo, Puma). As we know we must also make this field atomic, but splitting the column into three (cars1, cars2, cars3) to hold these values isn't an option this time as that would create three domains that contain similar data. In practice this would make querying the databases difficult as we would have to create a statement to match potential values in 3 columns (or four, or five or a hundred if there was a list of a hundred). Splitting the field so that each value was in a separate row would require the other values in the rows to be repeated. The solution is to change the structure of the database by extracting the repeating groups into a new relation.

1NF in practice

Looking at the Un-normalised data we can see that there are non-atomic attributes in the following columns:

All of these attributes will have to be normalised so the domain contains only atomic values. To begin, let's look at the Borrower attribute. If you look at a section of the table in figure one, the borrower contains a forename and a surname expressed as a single borrower. Thus we should split this attribute in two. The following excerpt from the table illustrates how this would be implemented:

TransactionNo. Forename Surname Address TeleNo
1 David Findlay Davies Lodge
23 Driven St
Edinburgh
EH5 APE
0131 555 5579
2 Iain Brown 54 Home St
Edinburgh
EH23 8TF
0131 555 5580

 

Similarly the address should be broken down into different attributes. Anyone that has filled out an application form will be familiar with this structure.

Once more we can see the effect that this has on an excerpt of the table:

TransactionNo. Forename Surname Address_One Address_Two City Postcode
1 David Findlay Davies Lodge 23 Driven St Edinburgh EH5 APE
2 Iain Brown 54 Home St   Edinburgh EH23 8TF

The Repeating Group

If we have a look at the remaining non-atomic attributes we can see that they are similar informational attributes about a CD. In order to make these attributes atomic we cannot repeat the trick of creating additional columns as each repeated column would hold values of a similar type. If we take the artist field for example:

TransactionNo. (Other Attributes) Artist_One Artist_Two Artist_Three
1 ... Britney Spears Pink Christina Aguilera
2 ... Darius Christina Aguilera Will Young

If we split each original field into 3 we can see the value 'Christina Aguilera' appears twice, once in Artist_Two and once in Artist_Three. If we were searching to see who had borrowed CDs by Christina we'd have to create a query to search ALL the artist fields. Also if someone borrowed a 4th CD as part of a transaction we would have to create a fourth Artist column and so on. The other reason that repeating groups are not allowed to be split into columns is that we MUST know the number of columns required in order to create the database.

The alternative, for someone who knew nothing about normalisation, would be to just create extra rows and copy the values in non-repeating attributes - that way we could make sure that each field contained atomic values as seen in figure two.

This is a trap that we could fall into by looking at only the data. We have not actually changed the structure of the database and thus the problem of repeating groups still exists. This can be seen by the fact that the Primary Key is no longer unique. The correct solution is to remove the repeating groups (Artist, Title, Year, Label, No_of_Tracks and Copy) into a new 'Borrowed CDs' relation .

TransactionNo Artist Title Year Label No_of_Tracks Copy
1 Britney Spears Britney 2001 Jive 14 1
1 Pink Cant Take Me Home 2002 LaFace 15 3
1 Christina Aguilera Stripped 2002 RCA 20 1
2 Darius Dive In 2002 Mercury 13 2
2 Will Young From Now On 2002 RCA 13 2
2 Christina Aguilera Stripped 2002 RCA 20 1

Because we need to be able to join this data back with that left in the original table we also need to include the TransactionNo in this Borrowed CDs relation which will act (along with a qualifying value) as Key for this relation. The qualifying value, allowing us to identify each CD in the library, combined with the transaction number will be able to tell us the specific CD borrowed per transaction. Thus we next need to define the qualifying value which will be the least number of attributes required to uniquely identify the CD. This could be (Artist, Title) if we were sure that this was a unique indentifer for a CD, but we already can see that there are different copies of the same CD. We could try adding Copy to the Key (Artist, Title, Copy), but in practice having a single unique value for each CD is preferable. Thus we will also create a cdID attribute in this relation and use that to represent the uniqueness of a CD.

After achieving First Normal Form our database has two relations as can be seen in figure three or can be expressed as follows:

LOANS (TransactionNo, Forename, Surname, AddressOne, AddressTwo, City, Postcode, TeleNo, E-mail, Date_Borrowed, Due_Back)
BORROWED CDS (TransactionNo, cdID, Artist, Title, Year, Label, No_of_Tracks, Copy)

Second Normal Form (2NF)

The second normal form has one rule that must be followed:

All non-key attributes must be dependant ON THE WHOLE KEY and not just one attribute of the key. This obviously only applies to Composite Keys and means that attributes in the relation that only contain information (and have no role in the structure of the database) must be functionally dependent on all parts of the Key. Another way of thinking about this is that if there are some attributes that are not dependent on all parts of the Key this means that they are in the wrong relation.

For example: Imagine we have a cars_bought table where each row (record) contains information about the car that a customer bought, which has a Primary Key of (CustomerID, Model). Information about the car (such as engine size, year etc) could be derived from just one part of the Key (Model) and thus isn't dependent on the whole key. Whereas if we wanted to find out the colour of the car the customer bought then we could search for the customer and this would give us the colour - assuming he has only bought one car. BUT what if the customer has ordered two cars, of different models and different colours. In order to find the colour of a specific car we would need to know the CustomerID AND the model number. Thus colour is dependent on the WHOLE Primary Key and not just a part of it. What this tells us is that colour is in the correct relation as it is specific to the customer and the model, whereas information about the car, which is dependent upon the Model alone, should be in a different table.

2NF in practice

When removing the partial dependencies from our database we can only look at tables that have a Composite Key and not a sole attribute as the Primary Key. As the loans Relation has a Primary Key of 'TransactionNo' this table doesn't require to be changed in this Normal Form. The second relation (Borrowed CDs) DOES have a Composite Key so we must examine the attributes of this table to see if they are all dependent on the whole Key.

We can see that ALL the CD information attributes are dependent on the cdID attribute rather than the whole Key. This is because they aren't really about the CDs borrowed and more about just the CDs bought and thus belong on their own relation - CDs Bought.

cdID Artist Title Year Label No_of_Tracks Copy
1 Britney Spears Britney 2001 Jive 14 1
2 Pink Cant Take Me Home 2002 LaFace 15 3
3 Christina Aguilera Stripped 2002 RCA 20 1
4 Darius Dive In 2002 Mercury 13 2
5 Will Young From Now On 2002 RCA 13 2

CDs Bought

And that means that we can remove all the rest of CD attributes from the Borrowed CDs table.

TransactionNo cdID
1 1
1 2
1 3
2 4
2 5
2 3

Borrowed CDs

The database in the second normal form will now have the relations as shown in figure four and is expressed as:

LOANS (TransactionNo, Forename, Surname, AddressOne, AddressTwo, City, Postcode, TeleNo, E-mail, Date_Borrowed, Due_Back)
BORROWED CDS(TransactionNo, cdID, Artist, Title, Year, Label, No_of_Tracks, Copy)
CDS BOUGHT(cdID, Artist, Title, Year, Label, No_of_Tracks, Copy)

Third Normal Form (3NF)

To achieve Third Normal Form no attribute must be dependent on a non-key attribute. This means that every informational attribute must be DIRECTLY dependent on the Primary Key and not on another column. If we again look at an imagined cars database, a customer table contains information such as address, city, postcode and also a column called shipping cost. The value of shipping cost changes in relation to where the car should be delivered, and thus is not directly dependent on the customer, but the city. Thus we would need to create another separate relation to hold the information about cities and shipping costs.

3NF in practice

As is good practice we should examine every table in turn to see if the non-key attributes are indeed directly dependent on the Primary Key.

If we look at the loans relation we can see that the columns AddressOne, AddressTwo, City, Postcode, and E-mail are not dependent on the TransactionNo, but rather a Composite Key of Forename and Surname. Thus they should be removed from this table into a new relation Borrowers. As there is a possibility that there could be more that one David Findlay or Iain Brown, this Composite Key isn't very effective so once more we will replace it with a unique numeric reference (BorrowerID). This will become the new Primary Key of Borrowers and a foreign Key in the loans relation.

The effect that this has on the tables can be seen in figure five.

The Borrowed CDs relation only has two attributes, which together make up the Primary Key and thus is not required to be normalised. The final table CDs Bought can however be nornailised further. Year, Label and No of Tracks are dependent on a Composite Key of Artist + Title. Copy is dependent on the cdID i.e. that which relates a particular physical CD that has been purchased, rather than album details such as artist or label. Thus we can create another relation CD Releases with a releaseID that identifies an album of which there may be millions of copies.

releaseID Artist Title Year Label No_of_Tracks
1 Britney Spears Britney 2001 Jive 14
2 Pink Cant Take Me Home 2002 LaFace 15
3 Christina Aguilera Stripped 2002 RCA 20
4 Darius Dive In 2002 Mercury 13
5 Will Young From Now On 2002 RCA 13

CD Releases Relation

We of course have to replace these dependent attributes in the CDs Bought relation with the releaseID.

cdID releaseID Copy
1 1 1
2 2 3
3 3 1
4 4 2
5 5 2

CDs Bought Relation

To further illustrate the difference between the CDs and the releases consider that we have bought two new CDs that will be available to borrow. One is new to our database (Robbie Williams - Escapology) whereas the Christina Aguilera - Stripped release has been so popular we decided to buy another copy.

Adding the Robbie album would mean adding data to two relations, CDs Bought and CD Releases, like so.

releaseID Artist Title Year Label No_of_Tracks
6 Robbie Williams Escapology 2002 Chrysalis
14

CD Releases Relation

cdID releaseID Copy
6 6 12/12/2002

CDs Bought Relation

Whereas we would only need to add information about the Christina Aguilera album to the CDs Bought relation, as its release information has already been entered into the CD Releases relation.

cdID releaseID Copy
7 3 12/12/2002

CDs Bought Relation

Thus after three levels of normalisation we have a table structure as shown in figure six and is expressed as so:

LOANS (TransactionNo, BorrowerID, Date_Borrowed, Due_Back)
BORROWER (BorrowerID, Forename, Surname, AddressOne, AddressTwo, City, Postcode, TeleNo, E-mail)
BORROWED CDS (TransactionNo, cdID, Artist, Title, Year, Label, No_of_Tracks, Copy)
CDS BOUGHT (cdID, releaseID, Copy)
CD RELEASES (releaseID, Artist, Title, Year, Label, No_of_Tracks)

Other Normal Forms

As mentioned above there are more Normal Forms that we could use to normalise our database further. We are not going to do this, because in practice the anomalies that they are designed to remove are rare. We will however have a brief look at BCNF so that there is at least some understanding of that norm should you come across it. If you are confused or struggling with the above, it would perhaps be worth skipping past this section until you are more confident about your understanding of the first three Normal Forms.

Boyce-Codd Normal Form (BCNF)

This often referred to as a strong 3NF and states that each determinant must be a Candidate Key. In a relation there may be more than just the Primary Key from which we can derive the values of other attributes, i.e. from another attribute or combination of attributes. These are known as Candidate Keys and in 3NF it is possible to have only some of the attributes functionally dependent on them. In BCNF all attributes must depend on all Candidate Keys.

Variations

While the rules of Normalisation stay the same the practical implementation of them differs. The method I've used above is the one that I think is easiest to understand for the beginner, but there are a couple of others worth mentioning in case you see them discussed elsewhere. It doesn't really matter which you use as the derived data structure is the same.

Composite Key in the First Relation

In the above example we had 1NF of :

LOANS (TransactionNo, Forename, Surname, AddressOne, AddressTwo, City, Postcode, TeleNo, E-mail, Date_Borrowed, Due_Back)
BORROWED CDS(TransactionNo, cdID, Artist, Title, Year, Label, No_of_Tracks, Copy)

We are reminded that there was a Composite Key in the second relation (Borrowed CDs). We could have just as easily created the Composite Key in the first relation, where we remove (Artist, Title, Year, Label, No_of_Tracks, Copy) to a new relation called CDs Bought, create the Primary Key cdID for that relation and create a new attribute in the Loans relation that is a Foreign Key and the qualifying value for the TransactionNo. This would have duplicated the data in the non-repeating groups, which doesn't matter as we're normalising the database structure not the actual data, and that repetition is removed by the later Normal Forms. This would leave a 1NF of:

LOANS (cdID, TransactionNo, Forename, Surname, AddressOne, AddressTwo, City, Postcode, TeleNo, E-mail, Date_Borrowed, Due_Back)
CDS BOUGHT (cdID, Artist, Title, Year, Label, No_of_Tracks, Copy)

This means that in to achieve 2NF we no longer look at the second relation, but instead the first relation to determine if the attributes are functionally dependent on the whole Composite Key. All the attributes are functionally dependent on the TransactionNo. not the whole Key, so we remove them to a new relation called Transaction Info. 2NF is thus expressed:

LOANS (cdID, TransactionNo)
TRANSACTION INFO (TransactionNo, Forename, Surname, AddressOne, AddressTwo, City, Postcode, TeleNo, E-mail, Date_Borrowed, Due_Back)
CDS BOUGHT (cdID, Artist, Title, Year, Label, No_of_Tracks, Copy)

We can see that this is the same data structure as the original method at 2NF, but we've just chosen different names due to the order in which a relation was first established.

1NF with only one relation

This is another method that is quite popular. If we recall the rules for 1NF they are:

This says nothing about not repeating data or having to create new relations to resolve the problems. If we look at figure two, we can see that the data in the non-repeating attributes has indeed been repeated. The problem we had with this is that there was no longer a Primary Key. We can however make a Composite Key by using the same cdID as we did above to represent the uniqueness, except this time in just one relation. So 1NF would be:

LOANS (cdID, TransactionNo, Forename, Surname, AddressOne, AddressTwo, City, Postcode, TeleNo, E-mail, Date_Borrowed, Due_Back, Artist, Title, Year, Label, No_of_Tracks, Copy)

The consequence of this is that to achieve 2NF would be a longer stage. Initially we can see (similar to previous examples) that the attributes (Artist, Title, Year, Label, No_of_Tracks, Copy) are dependant on only the cdID part of the key and thus can be removed to a new relation:

LOANS (cdID, TransactionNo, Forename, Surname, AddressOne, AddressTwo, City, Postcode, TeleNo, E-mail, Date_Borrowed, Due_Back)
CDS BOUGHT (cdID, Artist, Title, Year, Label, No_of_Tracks, Copy)

This creates a structure identical to that of the 'Composite Key in the First Relation' example in 1NF, where the loan information attributes are only dependent on the Transaction No, and thus we have to split the relation again:

LOANS (cdID, TransactionNo)
TRANSACTION INFO (TransactionNo, Forename, Surname, AddressOne, AddressTwo, City, Postcode, TeleNo, E-mail, Date_Borrowed, Due_Back)
CDS BOUGHT (cdID, Artist, Title, Year, Label, No_of_Tracks, Copy)

The database structure is now 2NF.

Without A Joining Table and cdID

This produces the only truly normalised database schema in this workshop - and I'll explain why in a moment. But first what is a 'Joining Table'? This is a table that is an entirely Composite Key of attributes which are Foreign Keys that contain no informational data and thus its function is to join the Primary Keys of other tables together.

For example: in each of the database structures we have produced so far there has been a relation with two attributes (cdID, TransactionNo) so that if we look up a Transaction No we can retrieve the cds by the corresponding cdIDs. In strict normalisation this table should not exist in this form, it has come into existence because we created an artificial key (cdID) to represent the Composite Key (Artist, Title, Copy). This is something that is done quite frequently in practice (why I've done so above), but is theoretically wrong. To explain let's normalise the structure again.

Because we cannot replace the Composite Key elements with cdID in 1NF, we are left with a structure like so:

LOANS (TransactionNo, Forename, Surname, AddressOne, AddressTwo, City, Postcode, TeleNo, E-mail, Date_Borrowed, Due_Back)
BORROWED CDS (TransactionNo, Artist, Title, Copy, Year, Label, No_of_Tracks)

As we can see, the Composite Key for the Borrowed CDs relation is now a lot more complicated. When we achieve 2NF we would need to create a structure like so:

LOANS (TransactionNo, Forename, Surname, AddressOne, AddressTwo, City, Postcode, TeleNo, E-mail, Date_Borrowed, Due_Back)
BORROWED CDS (TransactionNo, Artist, Title, Copy)
CDS BOUGHT (Artist, Title, Copy, Year, Label, No_of_Tracks)

This has meant that any join we perform using SQL becomes much more complicated, as we are having to match more fields. Consider the following sample MySQL Inner Join to retrieve all the information about a bought cd.

With cdID:

SELECT * 
FROM cds_bought, cds_released 
WHERE (cds_bought.cdID = cds_released.cdID);

Without cdID:

SELECT * 
FROM cds_bought, cds_released 
WHERE (cds_bought.Artist = cds_released.Artist) 
AND (cds_bought.Title = cds_released.Title) 
AND (cds_bought.Copy = cds_released.Copy);

As we observe, joining tables on three fields is more complicated than on one field.

NOTE: The names of tables and columns are slightly different as the physical implementation of any database is best done without spaces in the names, whereas in the theoretical design stage spaces aid readability.

Thus to summarise, we have replaced the 'true' Composite Key with an artificial Primary Key, in order to make things easier in the physical implementation, BUT THERE IS NO RULE IN NORMALISATION THAT SAYS YOU SHOULD.

Conclusion

This has hopefully been a pretty straightforward guide to a complicated (for the novice anyway) subject. I first got into this whole web design game from a media standpoint rather than a technical computer standpoint. When I started using databases on the web, I had to become familiar with concepts that are second nature to those from a formal technical background, which was tricky at first due to the target audience of the literature. Thus I hope that this VW can be of some use to those coming to normalisation from other backgrounds as well as those in IT seeking a little clarification.

In this section

Related Reading

Related Books

Relational Database Principles

Related Ads