Steve Tillman's Home Page

 
Home

About

Courses

Schedule

Vitae

Sports



Membership

Login

 
 

325 notes-3

DBMS architecture overview

DBMS's can be broken up into several distinct architectural classes. The first three are well-defined. After that things are a little fuzzy. The first three classes are

  1. Hierarchical
  2. Network
  3. Relational

Database management systems evolved in the 60's and early 70's primarily to handle standard business type applications such as various accounting functions (accounts payable, general ledger, payroll, etc.), inventory control, automated banking, reservation systems, and the like. These types of applications have four requirements that characterize database systems:

  1. Efficiency in the access to and modification of very large amounts of data.
  2. Resilience, or the ability of the data to survive hardware crashes and software errors without sustaining loss or becoming inconsistent.
  3. Access control, including simultaneous access of data by multiple users in a consistent manner and assuring only authorized access to information.
  4. Persistence, i.e. the maintenance of data over long periods of time, independent of any programs that access the data.

Hierarchical and network systems, considered first generation database systems, were largely successful in these respects. They essentially solved the file problems that plagued data processing systems before the advent of DBMS's. However first generation systems had some fundamental disadvantages. They were very efficient for processing applications that used predetermined access paths (large scale transaction processing) but could not easily handle ad-hoc type queries, e.g. "How many employees in the widget department are due to retire in the next three years?" To answer a query such as this, an application programmer skilled in performing disk-oriented optimization would have to write a procedural program to navigate through the database. The time frame for such an activity might be anywhere from 1 day to 6 months. (In contrast, a skilled SQL user could likely answer the question within minutes, using a relational database.) First generation systems did not have as much data independence as most would find desirable. That is, structural changes to the database to accommodate one application generally required all application programs to be recompiled at the very least, and might well cause some of them to be rewritten if access paths were altered in any way. Essentially for these reasons (among others, including the ease of use of relational systems), there are no longer very many implementations of first generation systems. Let's briefly look at the first generation systems in a little more detail. The downloadable appendices of the text have a more complete description of hierarchical and network databases.

Hierarchical databases: This approach is based on representing data in a tree structure. Information Management Systems (IMS) from IBM was the most widely used such system. It was developed in the 60's, and there may still be a few IMS installations in use today. At one time it was the most widely used DBMS in the world.
tree: tree info:
physical positioning, link fields, or both, along with secondary key index organization.
arrow2: See figure page 36
A database could be designed so that certain types of processing would be very efficient, especially if the database was primarily designed as a large number of one-to-many relationships, with little if any need for multiple parent types. On the other hand, general database design often calls for relationships which allow a record type to have more than one parent. Implementation of such structures is possible in IMS, but at best it would be awkward, possibly causing either excess redundancy, slower processing through extra disk accesses, or both. What the hierarchical approach had going for it was that IMS became so well established during the late 60's and early 70's, and the cost of conversion was so high, that it was years before it was mostly replaced. Given a choice, few database people would start fresh with a hierarchical model.

Network databases: The term "Network" should not be confused with the distributed databases in use today, many (if not most) of which use the internet to communicate among geographically distributed sites. In this context, the term "Network" referred to a data structure similar to a tree structure, except that a node could have any number of "parents".
arrow2: See figure page 37
The best known network model was the CODASYL (Conference On DAta SYstems Languages) model, which was first developed in the early 70's. It was not a specific piece of software. Rather it was a specification which was supposed to be the industry standard. The idea was that all specific software systems were to be based on that standard. In that sense it was an improvement on the hierarchical model, which had no set of standards. In practice it did not work out that way. Many database systems were based on CODASYL, but there were highly successful non-CODASYL network databases which did not comply with the standard specification. The CODASYL model was designed specifically to be used with COBOL (although it could be used with other languages), and the CODASYL DML (in many implementations) basically consisted of new verbs introduced into the COBOL language. It was used for several commercial DBMS's, the most successful being IDMS by Cullinane Software. DEC, Data General, Honeywell, Burroughs, and other hardware vendors at one time marketed CODASYL compliant databases. (For six years we had a Data General computer and database here at Wilkes, and the database mostly followed the CODASYL model.) The most successful non-CODASYL network databases were TOTAL, which rivaled IMS for market share during the mid 70's, and IMAGE, by H.P. (very similar to TOTAL), which was rated the number 1 DBMS in Datamation's annual software survey for several years in a row around 1980. (Datamation, which has since shut down as a print journal,, was a leading trade journal at the time. It is now available on line.) Both TOTAL and IMAGE were limited network systems, allowing a record type to be either a parent or a child, but not both. There were relatively simple methods to get around this limitation.

Basically, a network database allowed any record type to be associated with any other record type. In a full network model a record type can have any number of different parent types and any number of different child types. This meant that a conceptual schema could be directly modeled with very few design compromises. The main difficulty with the network approach was its complexity. It required programming users to be aware of and understand the underlying data structure, which was likely based on a sophisticated, complex model. Many of the programmers, perhaps even a majority, were not up to it. A network database solved file problems, but application development was slow in part because the typical programmer could not cope with the complexity of network databases. In addition, query facilities were primitive at best, so the only way a non-programmer could use the database was through application programs written by the programmers. Relationships between record types were almost exclusively implemented via linked lists. To get information from a first generation database it was frequently necessary to "navigate" (hence the term "navigational databases") through a complex series of linked lists, being explicitly aware of pointer fields and which record types were "pointing to" which other record types. It took very careful programming to avoid "getting lost", i.e. being unable to get back to a specific point in the database. (I speak from experience.)

Relational database systems (RDBMS) are generally considered second generation systems. They were pioneered by E. F. Codd in the early 70's, and offered a fundamentally different approach to data storage, at least conceptually, if not in fact. (I strongly suggest you read the vignette on page 3 of the text. It gives a very brief capsule of the development of relational databases in general and Oracle in particular. Oracle is the most widely used database software in the world and the system we will be using in this course.)
arrow2: Overhead of page 39
Data is represented by simple tabular data structures called relations and all user access is through a high-level, nonprocedural query language. (SQL has become the de facto such language for RDBMS.) In popular usage the term table has become a synonym for relation. For this approach to work the system needs a query optimizer that translates the nonprocedural statements into an algorithm to perform database access. Algorithms were needed to allocate rows of tables to pages (blocks of records) in files on secondary storage to minimize the average cost (in terms of time) of accessing those rows. Buffer management algorithms which exploit knowledge of access patterns for moving pages back and forth between disk and memory were needed. Indexing techniques were needed to provide fast associative access to random single records and/or sets of records specified by values or value ranges for one or more attributes. It was not until the 80's that these tasks were accomplished for large scale database systems. Today RDBMS's are "state-of-the-art" for business type applications. All of your programming projects will be based on relational databases, so we will discuss the concept in some detail shortly, but before we do so, let's briefly consider third generation systems.

RDBMS's have some limitations. For example, data types are limited to standard types (integer, floating point, character, date, and Boolean). There are no composite types, no collection types, and in general no user defined data types. There are a broader class of applications, however, for which RDBMS's are inadequate. These applications include computer aided design (CAD), computer aided software engineering (CASE), and applications which can make use of hypertext, graphics, video, sound, and perhaps other complex data types. Consider, for example, a publishing application for a newspaper layout. This requires storing text segments, graphics, icons, and many other kinds of data elements found in most hypertext environments. Supporting such data elements is usually difficult in second generation systems. A new type, a third generation system (or more appropriately a "next-generation" system, since there is more than one candidate for whatever the "third generation" would be), is needed for this type of application. The third generation system needs to be able to handle routine business data processing also. In the newspaper example these would include handling classified and other advertisements. The text of the ad would have to be stored, along with the rate, the number of days the ad will run, the billing address of the customer, etc. Handling of the ads requires normal business transaction processing. In addition, rules might be needed. For example, a rule might say that competing businesses do not have their ads placed on the same page or facing pages. Of course the rule could be built into the application program which lays out the paper, but it would be better to have the rules stored within the database so that they would not be lost if the application program is changed or modified. Generally speaking features that are built into the database are more stable than features which must be supplied by application programs.

The most likely candidates for a third generation DBMS are Object-Oriented Database Management Systems (OODBMS) and/or Object-Relational Database Management Systems (ORDBMS). The latter are sometimes referred to as Extended Relational Database Systems. Chapter 2 section 2.5.5 briefly discusses the OO model, and appendix G goes into more detail. I am not sure if we will ever get to the point of having "state-of-the-art" databases which fall into the category of OODBMS. At the present time OODBMS's do exist, but they are immature and do not follow any set standard.

The Object Oriented database model (OODM) seemed to be poised (at least by theorists) to dislodge the Relational Data Model (RDM) in the face of increasingly complex data that included video and audio , yet the OODM fell short in the database arena. However, the OODM≠s basic concepts have become the basis of a wide variety of database systems analysis and design procedures. In addition, the basic OO approach has been adopted by many application generators and other development tools.

The OODM's inability to replace the RDM is due to several factors. First, the large installed base of RDM-based databases is difficult to overcome. Change is often complex and expensive, so the prime requisite for change is an overwhelming advantage of the change agent. The OODM advantages were simply not accepted as overwhelming and were, therefore, not accepted as cost-effective. Second, compared to the RDM, the OODM's design, implementation, and management learning curves are much steeper than the RDM's. (Recall that ease of use was a strong factor in having relational systems take over from first generation systems.) Third, the RDM preempted the OODM in some important respects by adopting many of the OODM's best features, thus becoming the extended relational data model (ERDM). Because the ERDM retains the basic modeling simplicity of the RDM while being able to handle the complex data environment that was supposed to be the OODM's forte, you can have the proverbial cake and eat it, too.
arrow2: See the chart on page 35. It is interesting to note that Oracle 10 is listed as an extended relational system. We will actually use Oracle 11 which goes even further in this area.

The OODM-ERDM battle for dominance in the database marketplace seems remarkably similar to the one waged by the hierarchical and network models against the relational model almost three decades ago. The OODM and ERDM are similar in the sense that each attempts to address the demand for more semantic information to be incorporated into the model. However, the OODM and the ERDM differ substantially both in underlying philosophy and in the nature of the problem to be addressed. Although the ERDM includes a strong semantic component, it is primarily based on the relational data model's concepts. In contrast, the OODM is wholly based on the OO and semantic data model concepts. The ERDM is primarily geared to business applications, while the OODM tends to focus on very specialized engineering and scientific applications. In the database arena, the most likely scenario appears to be an ever-increasing merging of OO and relational data model concepts and procedures.

Oracle 11 is an ERDM. It is not truly OO (at least to the purist), but it has many OO features, including the two most important, abstract data types and type inheritance. Future versions will undoubtedly have more. In terms of data types Oracle 11 has large object types (CLOB, character large object, and BLOB, binary large object), reference types so that a field within a record can explicitly reference another record (note that this puts some navigational aspects,which were a major problem with first generation systems, back into the database arena), XML types, various media types (sound, video, etc.), nested table types, variable array types, and user defined data types which allow the user to define composite types and/or include methods (built in procedures) within a user defined data type. We will be using Oracle as if it is a straight relational database, but the OO features are there for "power users".

As was just noted, the OO model tends to be complex, which was one of the problems with first generation systems. There is no question that the object features of Oracle 11, while very powerful, are much more complex and difficult to use than if the DBMS is treated as a strictly relational database. Perhaps today's users are better educated and more technologically sophisticated, and are better able to deal with the complexity than users were 30 years ago and are ready for the complexities of the OO model.
"arrpw2" See chart on page 47, which gives the pros and cons of the various models.

Relational databases: In chapter 3 we will get to the details of the relational model, but at this point we will expand a little more on the overview. RDBMS's are the current state of the art in the sense that most of the time when an organization gets new DBMS software, the software would be based on the relational model (or these days an extended-relational model, but with Oracle at least, it can be treated as if it is purely a relational database). The products are mature, and because they are based on a single standard they tend to be very similar. That makes it easy to share data among systems with different software, and to move from one piece of software to another. The name, relational, comes from the fact that the underlying theory is based upon the mathematical theory of relations (which may have made practitioners think it is more complicated than it is). Think of a relation as a file in the form of a 2-dimensional table, where each row represents a record. The number of columns (fields) is fixed, but the number of rows is indefinite. No two rows should be precisely the same. (See examples on page 39.) The driving force of the relational model is simplicity. To that end there are two crucial features:

  1. Associations between rows of different tables are logically represented solely by the data values drawn from a common attribute (field). Thus, logically, the relationships are contained in the data itself, rather than with artificial constructs such as link fields or indices (see p. 39). The actual physical implementation is transparent to the users.

  2. Relational databases are designed to use nonprocedural languages. With a procedural language (COBOL, Java, C++, C, etc.) we specify step-by-step the procedures we wish the computer to use to carry out the processing. For example, using the tables on page 39 suppose we wanted the names of all customers serviced by sales agent 501, along with the agent's name.
    arrow2: procedural nonprocedural example
    In the procedural version, the system will keep track of where you are and whether or not you get to the end of the chain, but you still must tell it what to do, step-by-step. In the SQL version, you tell the system what you want, and let it decide how. This makes life (relatively) easy for the programmer, but difficult for the system.

Relational databases were proposed in theory in the early 70's, but did not become commercially viable until the mid 80's. On a large mainframe with huge files, processing could still be fairly slow. The optimizer (part of every large scale relational database system) may choose to do an exhaustive search of a particular table if there were no other way. A designer can design in indices, hash fields, in some cases link fields, etc. which the optimizer would take advantage of. The SQL program would be the same no matter what. It is not necessary for the programmer to understand the underlying data structure. Perhaps the biggest advantage of the relational approach is its simplicity (at least at the user level). To some extent this is a disadvantage as well because it makes it easy to create poorly designed databases, which can be very frustrating to use.

For the previous set of notes, click on 325 notes-2.
For the next set of notes click on 325 notes-4




Last update: Monday, July 13, 2009 at 2:39:56 PM.