Steve Tillman's Home Page

 
Home

About

Courses

Schedule

Vitae

Sports



Membership

Login

 
 

325 notes-6

Database Design

As mentioned earlier in the course, the authors of the text believe that the most important aspect of a database course is database design. I agree that design is extremely critical in real life applications, but I firmly believe that trying to design a database with little or no experience in using one is apt to be a frustrating and futile exercise. The baby examples used in texts do not scale to real life situations. Therefore I have held off any discussion of design until now. Even if you do not actually do design, however, it is important to understand the "language" of database design. For that reason, we are returning to chapters 4, 5, and 6 in the text.
arrow2: Read chapter 4
Entity - Relationship modeling is a form of data modeling which is commonly used in the early phases of database design. It is independent of any particular hardware or software configuration, and there are CASE tools to aid the process for designing large complex databases. When it is done correctly an entity relationship diagram (ERD) models the essential structure of the data, and is the first step toward developing the conceptual schema, which is the beginning process in database design. By structuring the data in a manner that is independent of specific processing and output requirements, i.e. structuring the data based on its inherent nature, you create a data model that is stable when requirements change. The basic data model (data structure) is much less likely to change than the processing requirements, i.e. what is done with the data. A good ERD is an important step in developing a data structure which will provide data independence (in the database sense).

An entity can be anything, real or abstract, about which you might want to store data. It is an object of interest to the end user.
arrow2: Hand out entity examples
Eventually we will see that an entity (or technically an entity type or entity set) will correspond to a table. A specific row of a table corresponds to an entity instance or entity occurrence. The term "entity" is used interchangeably to represent either an "entity set" or an "entity instance". It is up to the reader to determine from the context what is meant in any specific situation.

An entity is generally described by data elements or attributes that pertain to that entity. Usually an attribute (or a combination of attributes) form a primary key which uniquely identifies a corresponding entity instance. The notation we will be using to describe an entity, its attributes, and the primary key is consistent with the notation we have been using for relational tables.
arrow2: Page 107 overhead (top, bottom for composite key)

	EMPLOYEE (SSN, EMP_NAME, ADDRESS, SKILL, PAY_RATE)
	INVENTORY_ITEM (ITEM_NO, WAREHOUSE_NO, NO_IN_STOCK, ...)
In the initial phases of database design it may not be completely obvious what the entities will be, or even which entity a specific attribute will be assigned to. For example a database for a shipping company will have an attribute called SHIPPING_DATE. Should it be assigned to a customer order entity? to a ship cargo entity? to some other entity? perhaps to several different entities? Partly for this reason, CASE tools often use "bubble charts" to aid in design. These are consistent with the Chen model of ERDs. (The main alternative to the Chen model is called the Crow's Foot Model. The text shows both, but emphasizes the Crow's Foot model, which is somewhat newer. I personally prefer the Chen model, but that could be mostly because I am more used to it.) Attributes (at least with the Chen model) are isolated in "bubbles", which are connected to the appropriate entity via a line. This assignment of attribute to entity can then be easily changed.
arrow2: Overheads from pages 106, 108, 110
Note that figure 4.3 from page 108 has a multi-valued attribute, indicated by a double line. In this example, this is the result of the fact that a car can have more than one color. This will require some type of design compromise when actual DBMS implementation is reached. Normalization, which we will get to in the next chapter, will be a major aid in this process, while keeping an independent design. Figure 4.6, from page 108, has a derived attribute. A derived attribute is an attribute that is calculated from other attributes at the time that its value is requested, rather than just a stored value. Age can be calculated from system functions using the current date and the date of birth. The actual implementation of a derived attribute goes beyond what is available in a simple relational database. It is a simple example of one of the extensions available in an object relational database.

A relationship is a natural association between entities. The association can involve one or more entity types. A relationship is an abstract concept, and would have no actual existence other than that which is inherited from the associated entities. A relationship has a degree, which is the number of entity types involved. By far the most common is degree 2, or binary relationship. In ERD's (at least with the Chen model) entities are indicated by rectangles, and relationships by diamonds, usually with an associated verb so that the relationship is described in a sentence. Each entity in a relationship has a connectivity. For a binary relationship this would indicate whether it is 1:1, 1:n, or m:n, which we have already discussed. Some simple examples:

erd 1-1:
erd 1-n:
erd m-n:
Sometimes (often) with an m:n relationship, the relationship will have some attributes of its own, leading to "relationship entities". This is basically the same as the intersection data that we discussed earlier in the course.

Entities can be involved in any number of relationships, and it is not uncommon to have more than one relationship between the same entities.

erd multiple:

Below are examples of relationships of degrees other than 2:

Unary (degree 1)
unary:

Ternary (degree 3)
ternary:
The above example is many-to-many-to-many. It assumes that multiple employees will be working on multiple projects using multiple skills. We can get other connectivity factors by making other assumptions. N-ary relationships, with n > 3, are very rare in actual cases.

As stated earlier, m:n relationships usually have intersection data associated with them, so the relationships themselves define entities. These "relationship entities" are also referred to as associative entities or composite entities. This is sometimes depicted in the Chen model with a rectangle over the diamond. A bridge or composite entity is depicted in the Crow's Foot model with a solid relationship line.
relationship-entity:
arrow2: Overhead from page 126

Sometimes the existence of an entity, say A, depends on the existence of one or more other entities. In such a case entity A is said to be existence dependent.
existance a:
A purchase order entity cannot exist without a corresponding supplier.
existance b:
In the second example, however, PUR_ORDER is not existence dependent on employee because once an order is created, it could exist without a corresponding employee.

An entity is considered a weak-entity if

  1. It is existence dependent on a parent entity, and
  2. It has a primary key that is partially or totally derived from that parent.
PUR_ORDER, in the SUPPLIER case above, is not a weak-entity because of the second condition. Consider the case below:
weak entity:
Weak entities are frequently indicated by a double rectangle. arrow2: overhead page 117

The actual process of developing an ERD (or any part of database design) is not trivial. It is best learned by experience in an actual design environment. ERD design tends to be an iterative process, i.e. it will be done and then done over again several times, with each succeeding step refining and correcting what was done previously. The initial and subsequent steps should not be done in a vacuum. Rather they should be done while keeping close contact with relevant users.

arrow2: Hand out erd constructions steps

The steps can be summarized as shown on the handout.

Steps 2 - 5 comprise the actual ERD construction, and those are the parts of the iterative process, done and redone, mentioned earlier. There are many potential pitfalls. Even things which appear to be "obvious" sometimes turn out to be not so obvious.
arrow2: Hand out example of order as entity, relationship, and attribute.
As you read chapter 4 you will see a discussion of details and concepts of database design (e.g. cardinality, mandatory vs. optional relationships, etc.) that I do not want to discuss at this time. Small book problems can never really illustrate the difficulties that can arise in actual design applications. Nevertheless, I think it is instructive to see at least a simple example carried out in some detail. The text does this with the Tiny College example. As a somewhat briefer example we will consider problem 4, which is taken from pages 142 and 143 in the text.
arrow2: Problem 4 overhead
To keep things simple we will ignore cardinality and optional-mandatory considerations. We will also only consider entities and relationships, and (mostly) ignore attributes and keys. (You should never ignore attributes and keys in a real-life example.) The text already tells us what the basic entities are. (Something else which never happens in real life. Identification of the entities is an arduous process.) We will give initial "user views" and then use them to construct the ERD. Each of the "bullets" on page 142 can be considered a user view. In a real life example these would be derived during the requirements phase (probably with a great deal of pain). The first bullet identifies an entity, namely CANDIDATE. Bullet 2 produces the following:
user view 1:
I am not actually going to worry about weak entities in general, but for this example I will stay with that notation. The next bullet produces
user view 2:
The many to many relationship has to be resolved, so we will do that now.
user view 2 prime:
The fourth and fifth bullets together produce
user view 3:
The sixth bullet gives us
user view 4:
Note the "1". Without the sixth bullet we would have an n. The third from the last along with the second from the last bullet give us the last two user views.
user view 5:
This is again a many-to-many relationship, and so it must be resolved
user view 5 prime:
And the last user view is
user view 6:
The user views are combined (merged) into the initial ERD.
arrow2: Hand out the final version of the ERD. One side of the handout shows my solution which merged the user views created in class, using the Chen model. The other side shows the book's crow's foot solution.

Before getting to the next chapter, bear in mind that getting the user views, which were so clearly stated in the text, can be very painful in real life, and merging them into a coherent overall design can be a nightmare, as different users may view the same concept quite differently. You will be dealing with users (people) and that can be very frustrating. They can be suspicious, resentful, uncooperative, and perhaps even openly hostile. Even when they are cooperative, you have to overcome their tendency to start in the middle, to describe processes in the wrong order, to get sidetracked on irrelevant matters, to exaggerate the importance of insignificant details, to exaggerate their personal importance, to overlook or minimize critical features, to assume that you are familiar with aspects of their work that you are not familiar with, to use the same term for different things, to use different terms for the same thing, to give different explanations at different times for the same concept, and in general to misrepresent things in some fashion or other. All of the above can happen when they are trying to cooperate. If they have a hidden agenda and are trying to sabotage you (as will happen), things can be a lot worse.

arrow2: Read chapter 5
After several iterations of the preceding type, a working ERD is created. Simplistically, to convert the ERD to a relational database schema, you would have a table for every entity. That might not always produce the best design. At this time a careful data analysis of each of the proposed tables is called for. Let's recall the definition of a relation:

arrow2: Overhead on relation definition

  1. Each entry in the table represents one data item. (No repeating groups, no variable length records.)
  2. All the items in a given column are the same type.
  3. Each column is assigned a specific name.
  4. All rows are distinct.
  5. Commands should exist which allow the viewing of rows and columns in any sequence, at any time, without affecting the information content.

The last requirement is more of a requirement for the DML of the corresponding relational database than a requirement of the table. A relation which satisfies these five conditions is said to be in First Normal Form (1NF).

The first step of the data analysis is to make sure that all the tables are in 1NF. Careful selection of primary keys will ensure most of the conditions. The main thing to look out for is repeating groups, i.e. internal arrays of records.
arrow2: Repeating group handout.
[Note: There can be situations where an internal array or collection would be desirable. They can be implemented using the object relational features of oracle, but those are not considered to be part of an ordinary relational database.]

1NF tables can be entered into oracle (or any other relational database) and can be used by any of the standard DML commands. However, there are potential problems.

Definition:

An Attribute, A, is functionally dependent on an attribute, B, if, at every instant of time, each value in B has no more than one value of A associated with it, i.e. the value of B determines the value of A.
We write B determine: A.

EMPLOYEE (EMP#, EMP_NAME, SALARY, PROJECT#, COMPLETION_DATE)

EMP# determine: EMP_NAME, EMP_NAME not determine: EMP#

EMP# determine: PROJECT#, PROJECT# not determine: EMP#, PROJECT# determine: COMPLETION_DATE

Definition:

A collection of attributes, C, is fully functionally dependent on a collection of attributes, D, if C is dependent on D, but not on any subset of D.

STU_CLASS (SNUM, SNAME, MAJOR, CNAME, TIME, ROOM, GRADE)

Definition:

A relation is in second normal form (2NF) if it is in 1NF and every non-key attribute is fully functionally dependent on every candidate key.

STU_CLASS is not in 2NF. MAJOR is not fully functionally dependent on
(SNUM, CNAME).

SNUM determine: MAJOR, CNAME determine: TIME, CNAME determine: ROOM,

(SNUM, CNAME) determine: GRADE.

The normalization rules, which we will get to shortly, are designed to prevent update anomalies and data inconsistencies. With respect to performance trade offs, normalization assumes that non-key fields will be updated frequently, and key fields will be updated rarely. It tends to penalize retrieval time, since data which may have been retrievable from one record in a 1NF design will often have to be retrieved from several records joined together in a normalized design. The join operations can be slow, and in extreme cases can actually cause system crashes. However, a decision to leave a table in 1NF rather than in a higher normal form should be a deliberate decision made for performance reasons rather than just left to circumstance.

Informally, 2NF and 3NF deal with the relationship between key and non-key fields. You might say that a non-key field must provide a fact about the key, the whole key, and nothing but the key.

In the STU_CLASS example, MAJOR provides a fact about SNUM, not the whole key. Similarly TIME and ROOM provide facts about CNAME. 2NF is violated when a non-key field provides a fact about a subset of the key. Why is this bad (or potentially bad)? Basically there are problems or anomalies (irregularities) associated with the standard modification operations of insert, delete, and update. Specifically, in this example, assuming that this is the database's main information about students and classes, we are led to the following potential problems:

  1. Until a student actually signs up for a class, there is no record of the class information. This is called an insert anomaly. You need a student to sign up for a class before you can store information about the class.
  2. If a student drops all classes (e.g. because of illness) the information about the student is dropped from the database. This is called a delete anomaly, and is potentially the most disastrous of the anomaly types for an organization because you might lose information without being aware of it, and not be able to easily get the information back.
  3. TIME and ROOM are repeated for every student in the class. If there is a change, it must be done to numerous records, leading to potential data inconsistency. This is called an update anomaly.

A relation in 1NF but not in 2NF can be decomposed into multiple relations, all of which are in at least 2NF, and which together have exactly the same information content as the original.

	STUDENT (SNUM, SNAME, MAJOR)
	CLASS (CNAME, TIME, ROOM)
	STUDENT_CLASS (SNUM, CNAME, GRADE)
The anomalies go away (try it), so modification is easier. However, retrieval is likely to take longer if you frequently want student and class information together because in the normalized version you must first join three tables to get it.

2NF does not solve all problems. Consider

STU_MAJOR (SNUM, GPA, MAJOR_DEPT, DEPT_HEAD)

This relation is in 2NF because it has an atomic key, i.e. a key with only one attribute. (By the definition of 2NF, a relation in 1NF with an atomic key is automatically in 2NF.) Yet the above relation suffers from the same types of modification anomalies as the STU_CLASS example (though probably less likely to occur). Specifically:

  1. If a new department is formed, you cannot store information about it until it has at least one major. Similarly, if a department is formed which would normally not have any majors (e.g. physical education), you cannot store information about it. (Insert anomaly)
  2. If a department has many majors, and the department head is changed, multiple records have to be updated. (Update anomaly)
  3. If a major has only a few students (e.g. physics), and they all leave (graduate, switch majors, drop out, etc.) then the information about the department is lost. (Delete anomaly)

The problem is that DEPT_HEAD is really a fact about MAJOR_DEPT rather than SNUM. This is an example of a transitive dependency. More formally, attribute C is transitively dependent on attribute A if there is an attribute B, such that
A determine: B, B determine: C, B not determine: A.

A relation is in third normal form (3NF) if it is in second normal form and there are no transitive dependencies. Informally, 3NF is violated when a non-key field provides a fact about another non-key field. 2NF refers to the "whole key" and 3NF refers to "nothing but the key". 1NF and 2NF relations can have modification anomalies. 3NF relations rarely do. 2NF relations can be decomposed into relations in 3NF. In our example, we can have

STU_MAJOR (SNUM, GPA, MAJOR_DEPT)
DEPARTMENT (MAJOR_DEPT, DEPT_HEAD)

For practical purposes 3NF is sufficient to take care of virtually all modification anomalies, and there is little need to normalize beyond that level. However a potential problem can occur if there is more than one candidate key.

STU_ADVISOR (SNUM, MAJOR_DEPT, ADVISOR)

This is in 3NF, however note that ADVISOR determine: MAJOR_DEPT.

  1. If a faculty member has numerous advisees, that information would be stored many times. If the name is changed (fairly unlikely) then numerous records need to be updated. (Update anomaly)
  2. If a faculty member has no current advisees, you cannot store the information that he/she advises in a particular department. (Insert anomaly)
  3. If a faculty member had few advisees, and they all graduated, dropped out of school, or changed majors, you would lose the information that the faculty member is an advisor in a particular department. (Delete anomaly)

Granted that we are reaching somewhat here, in another example the anomalies might not be so far fetched. We do away with most of the remaining anomalies by mildly extending 3NF tables to what is known as Boyce-Codd Normal Form (BCNF). A relation is in BCNF if no non-key field can determine a sub-key field. We can decompose the above table into BCNF tables as follows:

STU_ADVISOR (SNUM, ADVISOR)
ADVISOR_SUBJECT (ADVISOR, MAJOR)

In this example we would still have the update anomaly (which is unlikely to occur anyway) but the others go away.

4NF and 5NF deal mainly with multi-valued facts (or more correctly multi-valued dependencies). This means a value of attribute A determines a possible collection of values of attribute B. We write A double arrow: B. Multiple-valued facts often result from many-to-many relationships, for example between employee and skill.

EMPLOYEE double arrow: SKILL.

Consider the relation CTX (COURSE, TEACHER, TEXT). (c, t, x) is a row in CTX iff course c can be taught by teacher t using text x. We will assume that text is independent of teacher, i.e. any teacher teaching a course will use the same group of textbooks.
arrow2: Hand out sample CTX. (4NF-example)
In the original design there is clearly a lot of redundancy. The decomposition does not look as if it helps much, but that is because there are very few rows. To add the information that White can teach math would mean adding 3 records in the original, but only one in the decomposition. If Basic Mechanics, 2nd Ed. replaces Basic Mechanics 3 records need to be changed in the original, but only 2 in the new.

The problem is COURSE double arrow: TEACHER and COURSE double arrow: TEXT, and TEACHER and TEXT are independent. We say a table is in 4NF if it is in BCNF and it does not contain 2 (or more) multi-valued facts.

5NF (and beyond) delves further into multi-valued dependencies. It has limited practical use (in my opinion), so we will skip it. The normalization definitions are summarized in table 5.2, page 157.

arrow2: Read section 6.1, skim the remainder of chapter 6. In particular look at table 6.6, page 212.
Chapter 6 is titled Advanced Data Modeling. Since database design is not the primary focus of this database course, I do not plan to cover the chapter, except for section 6.1, in any detail.

It is fairly common for an entity to be broken up into distinct subtypes. Sometimes this is called a generalization hierarchy. The idea is that an entity type (called the super type) would be broken up into subtypes. Each subtype entity would automatically inherit all the attributes of the super type. In addition, a subtype would normally have other attributes of its own. Each subtype instance would, in fact, also be an instance of the super type.
arrow2: See examples from pages 195 and 198 (overhead).
The notation used in the text is not standard, but in order to avoid problems I will use the same notation. The letter inside a circle is called a category symbol. A "d" means the subtypes are distinct. An "O" means that they can overlap. A single line under the category symbol means that there could be instances of the super type which do not fall into any of the listed subtype categories. A double line under the category symbol means that every instance of the super type must fall into one of the listed subtype categories.
arrow2: Hand out generalization example.
This is sometimes called an ISA relationship because each instance of a subtype ISA super type instance as well. It is one of the key ideas in the object oriented paradigm. The tables which correspond to the entities can be modeled with a standard relational database such as oracle as we have used it, but the relationship has to be contrived. (Using the object features of oracle, generalization hierarchies can be modeled directly, but the cost is adding additional complexity.) One procedure which can be used with standard relational databases is the following: The super type entity and each subtype entity would be modeled with their own tables. There would be a 1:1 relationship between the super type and each subtype. This relationship would not be in the strict sense, i.e. 1 would mean 1 or zero. The super type table would contain the primary key and all the common attributes. Each subtype entity would contain the specialized attributes and the super type≠s primary key as both a foreign key and a primary key for the subtype. Features from the super type would be accessible, but would not be automatically included in the subtype as they would if the generalization hierarchy had been modeled using the object features of oracle.
arrow2: Hand out generalization SQL

Following the development of the ERD and the normalization process, the next step would be converting your software independent design into a database design appropriate for your specific installation. Appendix D of the text, available as a download from the publisher's website, briefly discusses this when the software is a standard relational database, a primary assumption of this course. I do not intend to spend class time on this material, but I believe it is worth reading. If you wish to read it (it is only 7 pages) and have trouble downloading it, let me know and I will print a copy for you.

arrow2: Skim chapter 9
Chapters 4, 5, and 6 show the basics of database design. Database design is an integral part of the development of an application system. This is material that is covered extensively in CS 324, so I will not spend much time on it here. However, there are a couple aspects which are worth noting. The first is the Systems Development Life Cycle (SDLC),
arrow2: Overhead of SDLC from page 375,
and the second is the database life cycle, (DBLC) usually a part of the SDLC,
arrow2: Overhead of DBLC from page 379.
For those of you who already have some database experience, or who might use something like this in a senior project, it might be worth reading chapter 9 in detail, where the authors expand on the topics in the diagram.

For the previous set of notes click on 325 notes-5a

For the next set of notes click on 325 notes-7.




Last update: Monday, August 10, 2009 at 1:52:42 PM.