Steve Tillman's Home Page

 
Home

About

Courses

Schedule

Vitae

Sports



Membership

Login

 
 

325 notes-1

The use of a database management system (DBMS) has become so pervasive in the field of information technology (IT) that most people within that field, including students, have already had some experience with a DBMS. That experience might be with a system such as Microsoft's Access, the database from a Works package, or perhaps something more powerful. This is an advanced level course. It will be assumed that you have experience in IT, including programming experience, but not necessarily any database experience. There will be two major thrusts to this course:

  1. Understanding the underlying principles and problems of database management systems (DBMS).
  2. Getting as much "hands-on" experience as possible with a "large scale" database system (e.g. mainframe, distributed, client/server, etc.)

In a real-life large scale environment database design is a critical issue. For that reason the authors make database design the major focus of their text. On page XVII of the preface, they give several excellent reasons why this should be the case.
arrow2: Overhead on design (preface page XVII)
The first sentence in bullet 2 in particular stands out. "Most of the really vexing database system management problems seem to be triggered by poorly designed databases."

I regard this as a true statement. Surprisingly, the situation is worse today with all the available automated design tools than it was in "primitive" times, 30 or more years ago. In those early days altering the structure of an existing database tended to be a nightmare. All the existing data had to be unloaded. The database structure was then changed. The data was reloaded into the new structure (not always an easy task). All the application programs (frequently hundreds at least) had to be recompiled, and many of them had to be rewritten. Database professionals tended to be very careful with design because they wanted to delay structure changes as long as possible.

As indicated in the first bullet of the overhead, these days there exist tools which make the creation of databases relatively easy. (Note the second sentence of bullet 1.) Modern desktop software is relatively easy to use, and inexperienced people can create databases and database applications quickly. It is also possible to alter a database structure fairly easily, though that is a bit more of a problem. With any project there is a tendency to want to "get going", so even professionals will often fall into the trap of "quick and dirty" design, figuring they can always make any necessary changes later. "....such 'create before you design' activity can be compared to building a house without consulting a blueprint." It will tend to work for small projects (a doghouse) but it does not scale upward very well.

Given the above, I cannot quarrel with the authors' decision to emphasize design, but (to use another analogy) that seems a little like designing a bicycle without knowing how to ride one. In my opinion design is much easier to accomplish once you understand what a database is and how to use one. Hence the two thrusts mentioned above. We will discuss some of the basic design concepts (chapters 4, 5, and 6), but the concentration will be on principles and application development. As a consequence all computer projects will be with a previously designed database. (This is similar to real-life. If you start a job related to databases, in your first assignments you are much more likely to be asked to program on [or make use of in some fashion] an existing database than to design a new one.)

arrow2: Read Chapter 1.
For the next three weeks or so we will be discussing database history and fundamentals. Some of this material is covered in chapters 1 and 2, but much of it is not. I consider this material, background material on file systems and data structures, a necessity. It will not be for a while that we get to chapter 3, the actual start of "modern" database systems.

Database systems started to be used in the late 60's and early 70's. Before that all application systems were based on file processing. Today most application systems are based on database systems, but there are still some that remain file systems, and many more that use independent files along with a database system. A file system design overview is the following:

  1. Need for the system is recognized.
  2. System is designed.
  3. Application programs and files are developed.
  4. System is implemented.

Typical large scale system development time averaged 6 - 12 months. Smaller systems could be done in 2 - 3 months sometimes, but some larger ones have been known to take 2 - 3 years. Statistically an average such system consisted of

	55 programs (23,000 source statements),
	6 master files, and
	26 predefined reports.
Essentially each such system was designed in a vacuum, i.e. files and programs would be customized for the particular application, without regard to other existing or planned applications. The only coordination would be operational, i.e. planning when individual programs would be run to avoid bottlenecks and to avoid periods of time when the CPU was under utilized. Even that minimal coordination would probably be less important today, with high speed multiprocessor systems, then it was in "ancient" times.

There is a discussion and critique of file systems in the text (sections 1.4 and 1.5). To complement the discussion in the text, and to see the limitations and problems with file systems, I think it might be instructive to consider a specific (hypothetical) example. Suppose we work for a business which uses file systems for IT applications. Two existing application systems are a payroll system (for accounting) and a personnel system (for human resources). Both systems deal with employees. Associated files would have much overlap in content, but the structure and use of the files, e.g. file format, field sizes, method of storage, method of access, etc. may be totally dissimilar. We will shortly see that this file content redundancy leads to some major problems. For now, assuming that file redundancy is bad, we want to answer the following question: Should an effort be made to coordinate applications so that they share files? In our example case, keeping it as simple as possible, that would mean payroll and personnel would share a common employee file.

Problem 1: Politics. Suppose the human resources director and the comptroller hate each other. Chances of successful coordination diminish. Even if there is no actual animosity, "turfism" is difficult to overcome. (It's mine, and you can't have it!) Neither may be willing to give up control. In real life examples, coordination often (usually) involves numerous parties, not just two.
arrow2: interrelationship diagram

Problem 2: Security. If many different functional areas of an organization share a common file, how do you keep unauthorized people from reading (or changing!) sensitive data? Security features such as effective password protection, locking out parts of files, etc. are fairly difficult to program. Frequently, for that reason, they are omitted, but even if included making them part of programs rather than the underlying data structure is much less stable and is generally an inferior approach. New programs may or may not have the security features needed.

Problem 3: Content maintenance. Who is responsible for updating the file? Reports are usually generated immediately after update (so they are as current as possible). When are updates scheduled? Different applications would want different reports at different times, hence updates at different times.

Problems 1, 2, and 3 are nontrivial, but they can be overcome. The big problem is

Problem 4: File structure maintenance. Using our example, suppose payroll, personnel, and perhaps other areas share a common employee file. Personnel notes that the common file has no next-of-kin field, and decides it wants a 30-character next-of-kin field added to the file (a relatively minor change).

Worst case: All of payroll's programs die or produce gibberish. The reason would not be obvious. Last week the programs worked, this week they do not. When the reason finally comes out, if there was no animosity before, there could very well be some now.

Best case: All of the other application areas which use the file are informed of the change well in advance. It may be a while before they actually do anything about it. (Many companies did not complete work on the year 2000 problem until several years after 2000.) But since this the best case, we will assume that they start work immediately. They have to alter all the programs which use the file. Suppose 50 programs throughout the organization (outside of personnel) use the file. They all have to be recompiled. That might take a day. But before you recompile them, you have to recode them to account for a 30-byte filler. That would take about another day (or 2 or 3). Since the file size is different, blocking and buffering have to be recalculated, which could take another couple of days. [Digression on blocking and buffering, if necessary.] Since the new field cannot be assumed to be at the end of the record, any record size moves and receiving field sizes have to be redone, which means that you have to carefully exam every line of active code looking for references to that record. Now we are talking about 2 - 3 months work (or more). But even before you can do any of that, you have to find which 50 of the 1500+ production line programs actually use the file (you know what your documentation is like). Bottom line: The human resources director, who happens to be the executive vice-president's son-in-law, says to IS, "Add a next-of-kin field to the employee file." IS responds by saying "OK, but it will take 6 months and cost $50,000." The human resources director then says, "Get me my own employee file, and I≠m not sharing it with anyone!"

At this point you should see why applications which use traditional file systems rarely share files. So we are back to customized, isolated files for each application. I stated earlier that this is bad. Now I will go over the reasons for that statement.

Problem 1: Redundancy. The same information is stored in numerous locations within the system's secondary storage. At the very least this wastes space. While this can never be good, with disk space so cheap now it does not have to be a disaster. (Although it might slow down searches, which is not good.) The real issue with redundancy is the problems it leads to.

Problem 2: Updating. If the same piece of information is stored in 23 different locations (possibly only a slight exaggeration), how many transaction records have to be generated to update the information? How much extra CPU time (a more valuable asset than disk space) is spent updating the same information? How much extra people time (and money; we are now talking actual cash) are spent generating the same transaction records? Some groups may not even be aware that the same information is stored elsewhere.

Problem 3: Data inconsistency. When data is stored redundantly, there are inevitably inconsistencies in the data. Transaction records are not always identical.

	JOHN B. BROWN CO.
	JOHN B. BROWN INC.
	JOHN B BROWN CO
Do these refer to the same entity? Not to a computer. How many of you have at one time or another received several different pieces of mail, all the same, but with mild variations in the name or address? Even if you can avoid update errors (and you can't) you will still get data inconsistencies. Data will be stored in different field lengths and with different formats. Different methods of rounding off with numeric data will, long term, lead to inconsistent results. But even if you could avoid these problems (and you can't) you cannot avoid different update cycles. Suppose, for example, that file A and file B have a 50% overlap in information content. File A is updated nightly. File B is updated monthly. At best the information is consistent one day each month. Other consistency problems can arise from a lack of standards; e.g. purchasing, accounting, and manufacturing all have their own item number for "small left-handed widgets".

Problem 4: New requests. When files are created with only one application system in mind, related data is often scattered over many different files. Suppose, for example, that the vice-president for sales asks for a list of all backordered items which have been ordered by customers from two or more different states over the past three years. That information is almost certainly contained within different accounting, inventory, shipping, and possibly other files, but unless there have been similar requests in the past, it is probably not in any one file. Extracting information which was not anticipated is difficult. The relevant files may not be organized for convenient searching. Often the only way to satisfy such requests is by creating a new application system with new files, a process which, at best, will take weeks and the VP wants the information now. Also, the new system further exacerbates the redundancy problem.

Problem 5: System maintenance. Uncontrolled redundancy and inconsistency are an immediate financial burden on an organization, but the most serious problem, both financially and psychologically, is system maintenance. We have seen that there is extra routine maintenance in just maintaining the information normally. In addition to that there is a significant amount of maintenance in trying to reconcile inconsistent information. This is more difficult than routine maintenance because there are no obvious errors. In several organizations 80% (or more) of IS effort, money, and people power were spent on maintenance. Not only was this expensive in its own right, but it made it very difficult to develop new applications. Proposed new systems were backlogged for years. Some companies actually folded because they lost control of their information systems. They did not know who owed money to them, who they owed money to, when it was due, what is in stock, what is ordered, etc. Machines (and workers) could be sitting idle in one location waiting for parts for which the company was paying excess storage costs in another location.

We will close this discussion with an actual case history: A manager received two profit and loss statements from the same IS department, but generated from two different standard cost files belonging to two different accounting applications. One showed a previous year profit of $18,000,000, the other showed a previous year loss of $20,000,000.

Summarizing, the limitations of file system data management are as follows:

  • Extensive programming.
  • Difficult to perform ad hoc queries.
  • System administration is usually complex and difficult.
  • Difficult to make changes to existing structures.
  • Security is difficult.

DBMS

A database is a collection of data logically organized to meet the information and time requirements of a collection of users with little or no duplication of data. A DBMS is a generalized software system designed to manage the database, providing facilities for organization, access, security, and control. The ideal DBMS will provide the following:

  1. Integrated database for the entire organization.
  2. A data structure independent of the applications which use the data. The independence is in the following sense: A change in the data structure for a given application should have no effect on any other applications, even if they use the same data. (Data independence perhaps the most critical feature)
  3. Data should be able to be accessed in any order using any kind of processing (batch, on-line, primary key, secondary key, all data associated some other data, etc.) The system should keep track of data associations. Small example: small example:

    What parts are supplied by Robotek? What suppliers offer the best price for green widget wings?

  4. Ensure privacy and security of sensitive data.
  5. Maintain data integrity. Data should be accurate. Prevent problems if possible, and recover easily when problems occur.
  6. Should have the ability to access certain data without the need for complicated programming. (A query facility.)

Most organizations have databases somewhat less than the ideal, often having several databases and sometimes several DBMS's. A summary of the major advantages of a DBMS is as follows:

  1. Elimination of uncontrolled data duplication and its associated problems (wasted space, data inconsistencies, lack of standard terminology, etc.)
  2. New requests and one-of-a-kind requests are much more easily implemented (especially with relational databases).
  3. Program/data independence, which speeds up application development and reduces maintenance.
  4. Better data management. (You know what you have, where it is, and how it relates to other items.)
  5. Better security.
  6. Better control of concurrent access.
  7. Centralized backup and recovery, so better control of the data asset.
  8. Easier to enforce standards of use.

You should be aware of the fact that a DBMS is not a panacea for all information processing problems. We will close this part of our discussion by considering some of the disadvantages of a DBMS.

  1. It is not cheap. Large scale software will cost well into 5 figures, and, depending on the system, can easily go into 6. (The ORACLE database we have at Wilkes listed for over $1,000,000, though of course no one pays the list price.) Desktop systems are much cheaper (some can be had for approximately $100) but that is per machine, and large organizations have thousands of machines. Furthermore, they often need to be networked, and good networking software is not easy to manage, and is itself not cheap. Organizations generally find that the "client-server" approach has lower startup costs, but higher training and maintenance costs. Control can be a problem and security a nightmare with decentralized databases.
  2. With any kind of database environment, you may need hardware upgrades (expensive) and you will certainly incur training costs.
  3. Conversion from existing systems will be long, involved, and expensive. You will likely run into some internal resistance from your own people.
  4. Operating costs for some applications will be higher. For example, sequential payroll processing will not be as fast with a DBMS as it would be with a dedicated payroll sequential file. The DBMS simply has too much overhead.
  5. A DBMS requires a nontrivial sophistication to design effective databases because of the interrelationship of large amounts of data. If that skill is not present in your organization, you may have to bring in new, highly paid people, which could cause resentment among your current staff.
  6. Backup and recovery will be more difficult, especially in a distributed environment.
  7. Because of the complexity, systems will be more vulnerable to failure, and a failure could affect numerous application systems.
  8. Potential internal conflict over the use of a general corporate asset.

In spite of the above, the advantages of a DBMS outweigh the disadvantages, usually by a wide margin.

Digression into some data structures

Understanding the underlying principles of database management is one of the key components of this course. The actual methods used by the DBMS depend heavily on the data structures we will be looking at in this section. A DBMS provides a layer of insulation between the user and the actual physical data. To some extent (in many cases to a large extent) the user may be unaware of the methods used to store the data, how it is accessed, what physical order it is in, and precisely where it is stored on secondary storage. Since the same data is used for many different applications, physical storage will generally be unrelated to the order of usage in a given application. Consider the following example: Within the database is a file (table) of customer records.

Application 1: Access customers served by a particular sales representative.
Application 2: Access those customers in a particular part of the country.
Application 3: Access those customers purchasing a particular product.
Application 4: Access one specific customer for a credit check.
Application 5: Access all customers whose billing date is the 10th business day of the month.

Unless the file is small, examining every record to pick out those which satisfy the current condition is not a desirable alternative. Application 4 requires direct access into the file, and the others require accessing a number of records satisfying a particular condition or relationship. The computer keeps track of relationships between records by means of indices and pointers. A pointer is a field within a record whose contents is the machine (disk) address (location) of the entity being pointed to. A pointer is most likely to be one of the following forms:

  1. Actual disk address (e.g. 057 06 234) (cylinder surface segment)
    arrow2: (physical disk picture)
  2. A relative file address (e.g. 2043rd record in the file)
  3. Record identifier (symbolic pointer) e.g. social security number

A quick overview of the pros and cons of the three methods is as follows:

  1. This is the fastest since no other manipulation is needed, and it is compact (does not take much file space). It is very inflexible since the record being pointed to cannot be moved without a major maintenance problem. It is not trivial to tell how many records, and where they are, would be "pointing to" a particular record.

  2. The value of the pointer contains the relative position (offset) of the record with respect to the beginning of the file. To find the actual address of record R, the machine would do the following calculation:
    	(address of R)  =  Base   +   (R  -  1) * (record length)
    This requires marginally more computer time than method a (a little more if the file is not contiguous), but has the advantage of allowing you to change the position of the entire file (say during garbage collection) or even to move the entire file to a different disk (to distribute disk activity or to upgrade hardware).

  3. Many DBMS's have an "optimizing" factor which causes individual records to be moved. For example if a hashing technique is used for direct access, and a "home location" record is deleted, the first overflow (if any) is often moved to the home location. Some systems allow you to cluster records that are normally accessed together so that they can be near each other to minimize disk arm movement. (Some systems will even do this automatically based on usage history.) When a new record is added which should be part of a cluster, but can't be because of a space problem, it is often moved to the cluster later when space opens up. Methods a and b cannot handle this type of activity, so symbolic pointers, which are independent of anything except the specific record, may be preferable. Also a symbolic pointer may contain useful information itself. For example, an order record should "point" to the customer record of the customer who placed the order. If a symbolic pointer is used, it may be the customer account number. In some applications, e.g. a sales status report, the account number alone may be all the information needed about the customer, saving a disk access to the actual customer record. If more information is needed, say name and address of customer, the account number can be used as a key for direct access to the customer record. The downside of c is that it is much slower than a or b, particularly if the direct access method requires an index lookup. (Index lookup is the default direct access method in Oracle. It is possible to set up a hashing scheme, but that would require an extra overt step while defining the database.) Also the pointer itself may be quite large. While a relative file address may be 2 - 4 bytes long, a symbolic pointer could be 30 bytes (or more), and we will see soon that multiple pointers within a record is not unusual.


click on 325 notes-2 to go to the next set of notes.




Last update: Monday, July 13, 2009 at 12:11:37 PM.