| Home Membership
|
325 notes-6Database Design
As mentioned earlier in the course, the authors of the text believe that the most important aspect of a database course is database design. I agree that design is extremely critical in real life applications, but I firmly believe that trying to design a database with little or no experience in using one is apt to be a frustrating and futile exercise. The baby examples used in texts do not scale to real life situations. Therefore I have held off any discussion of design until now. Even if you do not actually do design, however, it is important to understand the "language" of database design. For that reason, we are returning to chapters 4, 5, and 6 in the text.
An entity can be anything, real or abstract, about which you might want to store data. It is an object of interest to the end user.
An entity is generally described by data elements or attributes that pertain to that entity. Usually an attribute (or a combination of attributes) form a primary key which uniquely identifies a corresponding entity instance. The notation we will be using to describe an entity, its attributes, and the primary key is consistent with the notation we have been using for relational tables. EMPLOYEE (SSN, EMP_NAME, ADDRESS, SKILL, PAY_RATE) INVENTORY_ITEM (ITEM_NO, WAREHOUSE_NO, NO_IN_STOCK, ...)In the initial phases of database design it may not be completely obvious what the entities will be, or even which entity a specific attribute will be assigned to. For example a database for a shipping company will have an attribute called SHIPPING_DATE. Should it be assigned to a customer order entity? to a ship cargo entity? to some other entity? perhaps to several different entities? Partly for this reason, CASE tools often use "bubble charts" to aid in design. These are consistent with the Chen model of ERDs. (The main alternative to the Chen model is called the Crow's Foot Model. The text shows both, but emphasizes the Crow's Foot model, which is somewhat newer. I personally prefer the Chen model, but that could be mostly because I am more used to it.) Attributes (at least with the Chen model) are isolated in "bubbles", which are connected to the appropriate entity via a line. This assignment of attribute to entity can then be easily changed. Note that figure 4.3 from page 108 has a multi-valued attribute, indicated by a double line. In this example, this is the result of the fact that a car can have more than one color. This will require some type of design compromise when actual DBMS implementation is reached. Normalization, which we will get to in the next chapter, will be a major aid in this process, while keeping an independent design. Figure 4.6, from page 108, has a derived attribute. A derived attribute is an attribute that is calculated from other attributes at the time that its value is requested, rather than just a stored value. Age can be calculated from system functions using the current date and the date of birth. The actual implementation of a derived attribute goes beyond what is available in a simple relational database. It is a simple example of one of the extensions available in an object relational database. A relationship is a natural association between entities. The association can involve one or more entity types. A relationship is an abstract concept, and would have no actual existence other than that which is inherited from the associated entities. A relationship has a degree, which is the number of entity types involved. By far the most common is degree 2, or binary relationship. In ERD's (at least with the Chen model) entities are indicated by rectangles, and relationships by diamonds, usually with an associated verb so that the relationship is described in a sentence. Each entity in a relationship has a connectivity. For a binary relationship this would indicate whether it is 1:1, 1:n, or m:n, which we have already discussed. Some simple examples:
Entities can be involved in any number of relationships, and it is not uncommon to have more than one relationship between the same entities.
Below are examples of relationships of degrees other than 2:
Unary (degree 1)
Ternary (degree 3)
As stated earlier, m:n relationships usually have intersection data associated with them, so the relationships themselves define entities. These "relationship entities" are also referred to as associative entities or composite entities. This is sometimes depicted in the Chen model with a rectangle over the diamond. A bridge or composite entity is depicted in the Crow's Foot model with a solid relationship line.
Sometimes the existence of an entity, say A, depends on the existence of one or more other entities. In such a case entity A is said to be existence dependent. An entity is considered a weak-entity if
![]() Weak entities are frequently indicated by a double rectangle. The actual process of developing an ERD (or any part of database design) is not trivial. It is best learned by experience in an actual design environment. ERD design tends to be an iterative process, i.e. it will be done and then done over again several times, with each succeeding step refining and correcting what was done previously. The initial and subsequent steps should not be done in a vacuum. Rather they should be done while keeping close contact with relevant users.
Steps 2 - 5 comprise the actual ERD construction, and those are the parts of the iterative process, done and redone, mentioned earlier. There are many potential pitfalls. Even things which appear to be "obvious" sometimes turn out to be not so obvious. Before getting to the next chapter, bear in mind that getting the user views, which were so clearly stated in the text, can be very painful in real life, and merging them into a coherent overall design can be a nightmare, as different users may view the same concept quite differently. You will be dealing with users (people) and that can be very frustrating. They can be suspicious, resentful, uncooperative, and perhaps even openly hostile. Even when they are cooperative, you have to overcome their tendency to start in the middle, to describe processes in the wrong order, to get sidetracked on irrelevant matters, to exaggerate the importance of insignificant details, to exaggerate their personal importance, to overlook or minimize critical features, to assume that you are familiar with aspects of their work that you are not familiar with, to use the same term for different things, to use different terms for the same thing, to give different explanations at different times for the same concept, and in general to misrepresent things in some fashion or other. All of the above can happen when they are trying to cooperate. If they have a hidden agenda and are trying to sabotage you (as will happen), things can be a lot worse.
The last requirement is more of a requirement for the DML of the corresponding relational database than a requirement of the table. A relation which satisfies these five conditions is said to be in First Normal Form (1NF).
The first step of the data analysis is to make sure that all the tables are in 1NF. Careful selection of primary keys will ensure most of the conditions. The main thing to look out for is repeating groups, i.e. internal arrays of records. 1NF tables can be entered into oracle (or any other relational database) and can be used by any of the standard DML commands. However, there are potential problems. Definition: An Attribute, A, is functionally dependent on an attribute, B, if, at every instant of time, each value in B has no more than one value of A associated with it, i.e. the value of B determines the value of A.
EMPLOYEE (EMP#, EMP_NAME, SALARY, PROJECT#, COMPLETION_DATE) A collection of attributes, C, is fully functionally dependent on a collection of attributes, D, if C is dependent on D, but not on any subset of D. STU_CLASS (SNUM, SNAME, MAJOR, CNAME, TIME, ROOM, GRADE) Definition: A relation is in second normal form (2NF) if it is in 1NF and every non-key attribute is fully functionally dependent on every candidate key.
STU_CLASS is not in 2NF. MAJOR is not fully functionally dependent on The normalization rules, which we will get to shortly, are designed to prevent update anomalies and data inconsistencies. With respect to performance trade offs, normalization assumes that non-key fields will be updated frequently, and key fields will be updated rarely. It tends to penalize retrieval time, since data which may have been retrievable from one record in a 1NF design will often have to be retrieved from several records joined together in a normalized design. The join operations can be slow, and in extreme cases can actually cause system crashes. However, a decision to leave a table in 1NF rather than in a higher normal form should be a deliberate decision made for performance reasons rather than just left to circumstance. Informally, 2NF and 3NF deal with the relationship between key and non-key fields. You might say that a non-key field must provide a fact about the key, the whole key, and nothing but the key. In the STU_CLASS example, MAJOR provides a fact about SNUM, not the whole key. Similarly TIME and ROOM provide facts about CNAME. 2NF is violated when a non-key field provides a fact about a subset of the key. Why is this bad (or potentially bad)? Basically there are problems or anomalies (irregularities) associated with the standard modification operations of insert, delete, and update. Specifically, in this example, assuming that this is the database's main information about students and classes, we are led to the following potential problems:
A relation in 1NF but not in 2NF can be decomposed into multiple relations, all of which are in at least 2NF, and which together have exactly the same information content as the original. STUDENT (SNUM, SNAME, MAJOR) CLASS (CNAME, TIME, ROOM) STUDENT_CLASS (SNUM, CNAME, GRADE)The anomalies go away (try it), so modification is easier. However, retrieval is likely to take longer if you frequently want student and class information together because in the normalized version you must first join three tables to get it.
2NF does not solve all problems. Consider
The problem is that DEPT_HEAD is really a fact about MAJOR_DEPT rather than SNUM. This is an example of a transitive dependency. More formally, attribute C is transitively dependent on attribute A if there is an attribute B, such that STU_MAJOR (SNUM, GPA, MAJOR_DEPT) DEPARTMENT (MAJOR_DEPT, DEPT_HEAD)
For practical purposes 3NF is sufficient to take care of virtually all modification anomalies, and there is little need to normalize beyond that level. However a potential problem can occur if there is more than one candidate key.
Granted that we are reaching somewhat here, in another example the anomalies might not be so far fetched. We do away with most of the remaining anomalies by mildly extending 3NF tables to what is known as Boyce-Codd Normal Form (BCNF). A relation is in BCNF if no non-key field can determine a sub-key field. We can decompose the above table into BCNF tables as follows: STU_ADVISOR (SNUM, ADVISOR) ADVISOR_SUBJECT (ADVISOR, MAJOR) In this example we would still have the update anomaly (which is unlikely to occur anyway) but the others go away.
4NF and 5NF deal mainly with multi-valued facts (or more correctly multi-valued dependencies). This means a value of attribute A determines a possible collection of values of attribute B. We write A
EMPLOYEE
Consider the relation CTX (COURSE, TEACHER, TEXT). (c, t, x) is a row in CTX iff course c can be taught by teacher t using text x. We will assume that text is independent of teacher, i.e. any teacher teaching a course will use the same group of textbooks.
The problem is COURSE 5NF (and beyond) delves further into multi-valued dependencies. It has limited practical use (in my opinion), so we will skip it. The normalization definitions are summarized in table 5.2, page 157.
It is fairly common for an entity to be broken up into distinct subtypes. Sometimes this is called a generalization hierarchy. The idea is that an entity type (called the super type) would be broken up into subtypes. Each subtype entity would automatically inherit all the attributes of the super type. In addition, a subtype would normally have other attributes of its own. Each subtype instance would, in fact, also be an instance of the super type. Following the development of the ERD and the normalization process, the next step would be converting your software independent design into a database design appropriate for your specific installation. Appendix D of the text, available as a download from the publisher's website, briefly discusses this when the software is a standard relational database, a primary assumption of this course. I do not intend to spend class time on this material, but I believe it is worth reading. If you wish to read it (it is only 7 pages) and have trouble downloading it, let me know and I will print a copy for you.
|
|||
|
Last update: Monday, August 10, 2009 at 1:52:42 PM. |
||||