Steve Tillman's Home Page

 
Home

About

Courses

Schedule

Vitae

Sports



Membership

Login

 
 

325 notes-7

Transaction Management

arrow2: Read chapter 10
Chapter 10 is about transaction management and control. This is a critical factor and a fact of life in any real world system, because the contents of virtually any real life database will be in a constant state of flux. A transaction is the primary mechanism by which database content is changed.

A transaction is a real world event which will cause the database contents (not the structure) to be altered. As far as the database is concerned, a transaction must be a logical unit of work, i.e. it must take the database from one consistent state to another. Consider the following classic example (which happens thousands of times daily in real life): $100 is transferred from account A to account B. This is a very simple situation. It requires two update commands to be sent to the database.

  1. Subtract $100 from the balance in A.
  2. Add $100 to the balance in B.
For the database to be consistent, i.e. accurately reflect the total amount of money, either both operations must be done or neither. We cannot have just one of the them done. Thus a and b comprise a logical unit of work.

Let's consider a slightly more extensive example. The real world event (which happens millions of times a day) is that a customer buys a product and charges it to his/her account. The salesman making the sale gets a commission. Even in the simplest possible case we must update at least four tables: The customer table to change the balance due field, the inventory table to reflect the number and value of the items in stock, the accounts receivable table to keep the assets of the organization current, and the employee table to add in the commission for the salesman. Other updates which might or might not happen now (but will certainly happen at some point because of an event such as this one) could involve shipping, billing, reordering for inventory, etc. For the database to accurately reflect the assets of the organization all the updates related to the event must be done, or none of them. A transaction basically consists of all the commands which update all the relevant tables. If only some of them are executed, the database would be in an inconsistent state.

There is a specific example in the text, very similar to the one described above, on page 415. It uses the database described on page 413.
arrow2: Overhead of chap 10 sale database
arrow2: Overhead of commands from page 415
This is the sequence of commands needed to update the database and keep it in a consistent state. Note the last command, COMMIT.

The standard SQL commands which determine the end of a transaction are COMMIT and ROLLBACK. A new transaction is started automatically when a program is initiated (which could be the start of an sqlplus session), or when the previous transaction ends. There is also an optional command called SET TRANSACTION (which we will discuss shortly) which allows you to establish certain parameters about the transaction. ROLLBACK ends the current transaction and returns the database to the state it was in when the transaction started. It rolls back the updates performed within the transaction. COMMIT ends the transaction and makes permanent any updates that were performed within the transaction. There is an implicit commit if the program (which could be an embedded SQL program, an sqlplus session, the underlying code for a user front end, etc.) ends smoothly or if there is a DDL command within the program.
arrow2: ROLLBACK-COMMIT examples (roll-09.lst and commit-09.lst)
If you do not use ROLLBACK and COMMIT commands then your entire log-on session is counted as one big transaction. That would not be problem if you are only retrieving data, but may be one if you are updating. Should the machine crash during a transaction, the transaction is automatically rolled back, i.e. the database is reset to the state it was in before the transaction started. The system uses a transaction log to keep track of all transactions that update the database. The information stored in this log is used by the DBMS for a recovery operation triggered by a ROLLBACK command, an abnormal program termination, or a system crash. On page 419 there is a simplified version of a transaction log.
arrow2: Overhead from page 419
The system will generate a transaction number. The log keeps track of the before and after images of fields that are changed. Basically, in a recovery situation, the system will examine the log to see what transactions were not committed, and then roll them back to the state they were in before the transaction.

If you are making numerous changes to the database, you do not want to have to redo them all in case of a problem. You can avoid this by issuing explicit COMMIT commands whenever you have a logically complete transaction. Implicit COMMIT commands are generated by a smooth exit, or by a DDL command. Any transaction can be killed by the user with an explicit ROLLBACK command. You might want to do this if you noted an earlier mistake, or if the situation changed in the middle of a transaction. For example, suppose you are taking on-line orders. A customer ordered something, you updated the inventory, and the customer changed his/her mind.

If a transaction is particularly long, and only some of the commands need to be redone (e.g. a customer order with numerous possible products), oracle has a feature which allows rollback to a previously defined part of the transaction called a savepoint. The syntax would be

ROLLBACK TO SAVEPOINT sav_1
sav_1 is a user defined name placed in the midst of the transaction.

Transaction management and control is relatively straightforward with a single user database. The real problems arise with concurrent users. Consider the following scenario:
arrow2: hand out concurrent update

Two transactions are called serializable if running them concurrently produces the same results as if they run in some order where one is completed before the other starts. That way the database will be in a consistent state. One of the main goals of transaction management is to have all transactions have the property of serializability.

arrow2: Hand out two session update
We can see from this that if one user updates a table, another user will only see the pre-updated data until the update is made permanent via a COMMIT command. On a second attempt, the second session tried to update the EMPLOYEE table. The command hung until the first session transaction ended (with a rollback command).

Simultaneous update is a real problem, but it can be (and usually is) solved. The most common way to guard against it is to use file locking. The general idea of file locking is the following: You need a file for a potential update situation, and there are other users who might also access the file. You place a lock on the file, i.e. prevent any other user from accessing the file. Then when you are done with the file, you release the lock. This will prevent the simultaneous update problem, but it leads to some potential problems of its own.

  1. It has happened that a user locks a file and then goes to lunch (or some other activity) without releasing the lock. This particular problem can be avoided by having the OS release the lock and roll back any updates if there is no activity for a certain amount of time. That might inconvenience someone, but will not cause the system to grind to a halt.
  2. If a resource requested is locked, the computer repeats the request periodically until the resource becomes available. Normally, this usually happens within a few seconds. Because of this, there is activity from the user≠s log-on session. Consider the following:
    1. User A locks file X in preparation for a potential update. At approximately the same time user B locks file Y.
    2. User A finds that his transaction also updates file Y, so he puts in a lock request for Y. At about the same time B finds his transaction updates X so he puts in a lock request for X.
    3. A has X, waits for Y. B has Y, waits for X. DEADLOCK.

Deadlock is a serious and difficult problem. A full discussion of it is well beyond the scope of this course (take CS 326, Operating Systems), but we will consider it a little. One relatively common way to handle potential deadlock situations is for the OS to maintain a "wait for" table to see which processes are waiting for which resources. Examination of the table looking for "cycles" (e.g. A holds X waiting for Y, B holds Y waiting for Z, C holds Z waiting for X) will usually reveal potential or actual deadlocks. Since a process can hold several resources and be waiting for several, the actual cycles are not that easy to pick up. When the OS discovers a potential deadlock situation, it will rollback one of the processes, freeing the resources held by that process. There is often a priority scheme to determine which process gets rolled back.

The deadlock problem can be greatly alleviated by locking at the record level rather than the file level. It is likely that two processes will contend for the same file at the same time, but unlikely that they would contend for the same record. The difficulty is that the overhead for record locking is very high, and response time could easily become totally unacceptable. A more practical alternative, which is a compromise between the two, is to lock at the page (block) level. A page, by definition, is the amount of data read into memory with an ordinary read operation. It will usually contain several records, perhaps 20 - 200, but nowhere near the number of records in a file.

In addition to the granularity (record, page, file level, etc.) of the locks, there are also different types of locks. The most common are the following:

  1. Exclusive Update: One user, updating.
  2. Retrieval Only: Multiple users, but reading only.
  3. Single user update, multiple readers.
  4. Shared Update.
What you use depends on authorization and system requirements.

Oracle does have a specific LOCK TABLE command.
arrow2: Hand out LOCK TABLE syntax
arrow2: Hand out two session lock.

arrow2: Hand out lock quote
The manual says that if an explicit LOCK command is not issued, oracle automatically does exclusive row level (record level) locking when an INSERT, UPDATE, or DELETE command is issued. (My experience is that the lock is actually table level, but perhaps I am misinterpreting the manual.) A problem with oracle's automatic locking is that the lock is in effect only for the duration of the command which caused the lock. In general a transaction would have numerous such commands, so it is possible to interleave operations from more than one transaction. One way to get around this is to do a SELECT .... FOR UPDATE ... command before doing any of the modification commands. For example the command

SELECT EMP_NAME, SALARY, ISSUE_DATE, ORDER_VALUE
FROM WH_EMPLOYEE E, WH_PUR_ORDER P
WHERE E.EMPNUMB = P.EMPNUMB AND 
WHNUMB = 'wh80'
FOR UPDATE;
will lock the relevant rows of the employee and purchase order tables until the transaction is ended. If you only wish to update the salary field, you could make the last line FOR UPDATE OF SALARY. Then only the rows of the employee table would be locked.
arrow2: hand out two session for update
As this handout shows, when the SELECT....FOR UPDATE is used, we apparently do get row level locking.

Another mechanism a user has for transaction management is the SET TRANSACTION command. Such a command must be the first command within its transaction. One example of such is the command:

SET TRANSACTION READ ONLY;

This causes all subsequent queries within the transaction to see only changes that were committed before the transaction started. Therefore you would see only consistent data (which might, however, be somewhat obsolete). You would be prohibited from doing updates within that transaction.

Another example is the following command:

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

Suppose a DML command within transaction T1 attempts to update an item that was updated by another transaction, say T2, and T2 was uncommitted at the start of T1, then the DML command will fail. One advantage to this is that without this SET TRANSACTION statement, an attempt to update an item that has been locked will cause your session to wait until the lock is released, so that in effect the transaction will hang. In many cases it is better to have a command fail and to know why than to have it hang without an obvious reason.

Although locking is certainly a widely used method of transaction control, it is not the only one. One methodology for avoiding concurrent updates which does not use locking is illustrated by the following handout:
arrow2: Hand out DG concurrent update
The primary method here is basically a timestamp method. Timestamp methods are discussed on pages 431 - 433 of the text. Although this is actually for a CODASYL database, the principle is independent of the specific database architecture. It has serializability; namely that the results of concurrent transactions should be the same as though the transactions were executed in serial order.

Real-time Transaction Processing

In some applications transactions are associated with real-time constraints, typically in the form of deadlines, i.e. the transaction must be completed by some specified time, or its value decreases. There are actually two types of real-time transactions. The application system is referred to as a hard real-time system if the value of the transaction becomes 0 if it is not completed on time. Examples might be a radar surveillance system for an airport or a patient monitoring system for a hospital. The system is referred to as a soft real-time system if the value of the transaction diminishes if it is not completed on time, but does not necessarily become zero. A computerized stock buying system might fall into such a category. Another example might be taking customer orders online. If the customer has to wait too long, he/she might hang up. We will not distinguish between the two for this discussion.

In a real-time system transaction scheduling is almost always based on a priority system. Each transaction is assigned a priority. The priority can be assigned in a number of ways. The simplest way is to give the highest priority to the transaction with the closest completion deadline. In many applications it is possible to predict the execution time of a transaction. In that case a priority procedure could be to assign the highest priority to the transaction with the least amount of slack with respect to its completion time. Priority is used for CPU scheduling. These days many systems, even desktop systems, have multiple CPU's but the number of potential transactions usually far exceeds the number of CPU's, so the problems we are about to discuss remain the same, though more complicated in such situations. For simplicity, therefore, we will assume a single CPU. If transaction T1 arrives while T2 is executing, and T1 has a higher priority than T2 does, T2 is suspended and T1 assumes control of the CPU. A problem, known as the priority inversion problem, occurs when a higher priority item must wait for a lower priority item. This would occur if T2 holds a lock on a resource that T1 needs. In that case T1 must wait for T2 to complete.

There are two ways that this delay can be long enough to cause T1 to miss its deadline. The first way is called data resource blocking. Suppose T1 needs an exclusive update lock on item A, an item for which T2 holds a shared read lock. T1 must wait for T2 to complete. But before T2 completes T3 comes along and requests a shared read lock on A. T2 completes, but T3 is still going, so T1 must now wait for T3. This could continue until it is too late for T1 to complete on time.

The second way is called CPU resource blocking. In this case suppose T4 comes along with a higher priority than T2, but lower than T1. T4 has no lock conflicts with T2. Therefore T2 is suspended and T4 executes. T1 cannot preempt T4 because it still cannot get the resource held by T2. T1 must wait for both T2 and T4 to finish before it can start. Since this can happen more than once, it is possible for T1 to be blocked until it is too late to beat its deadline.

There are potential solutions to the priority inversion problem, but they can cause problems themselves. One "solution" is called Priority Inheritance (PI). Under PI, when priority inversion occurs, the low priority transaction holding the lock (T2 in our example) would inherit the priority of the highest priority transaction waiting for the locked resource. In this case T2 would inherit T1's priority. With a higher priority, T2 might run faster and release the lock quicker. Also, when T4 comes along, it would not be able to preempt T2. When T2 completed and released the lock, T1 would have priority over T4 and could begin processing. PI eliminates the CPU resource blocking problem. PI does not affect data resource blocking, however. Also, raising T2's priority may affect other concurrent transactions, slowing down their completion, and possibly causing them to miss their deadlines.

A second possible solution to the priority inversion problem is called Priority Abort (PA). In this case when a high priority transaction conflicts with a low priority item over a lock, if the low priority item has already started, it is aborted (rolled back). With this scheme the priority inversion problem goes away completely. A low priority item will never be able to block a high priority item. This is desirable for real-time transaction scheduling, but it has a negative side effect for the system. There could be a high system abort rate, which wastes system resources. Also consider the following scenario: Suppose that priority is based solely on the deadline, i.e. the transactions with the closest deadline get the highest priority. Suppose T1 takes 5 units to complete and has a deadline of 18. T2 takes 9 units to complete, and has a deadline of 20. Suppose T2 starts executing at time 0, and T1 arrives at time 8. Then under PA, since T1 has a higher priority, T2 is aborted. T1 starts at 8 and finishes at 13. T2 can start at time 13, but cannot finish until time 22, so misses its deadline.

The trade off between PI and PA is the potential long blocking of PI versus the high conflict abort rate of PA. A potential compromise, advocated by some, is called Conditional Priority Inheritance. A threshold value of h is determined. If the difference between T2's total execution time and the remaining time is greater than h, abort T2. If it is not T2 inherits T1's priority. The value of an appropriate h is not clear, and research is still trying to determine it. In the above scenario, if h = 2, T2 would be allowed to go to completion at time 9, and T1 would start at 9 and finish at 14, allowing both transactions to meet their deadline, and not wasting system resources.
arrow2: Hand out real time example

Distributed Database Management Systems

arrow2: Read chapter 12
Everyone is aware of the term "information superhighway" as epitomized by the internet in general and the world wide web in particular. Network technology is advancing rapidly and remote access is now an everyday fact of life. Many organizations are spread over remote geographical sites, and the organizational information needs to be accessible (to some extent) at each site. Much of this is managed through distributed database management systems (DDBMS).

A DDBMS is different from the situation of a centralized database with remote access from different locations. A DDBMS consists of a collection of sites, connected via some kind of a communications network. There are two basic conditions which must be satisfied for a setup to be considered a DDBMS:

  1. Each site has a DBMS in its own right.
  2. The sites agree to work together so that a user at any one site can access data anywhere in the network.

The DDBMS can be regarded as a kind of partnership among the local DBMS's. Each local DBMS would have software, possibly an extension of the local DBMS, to provide the partnership functions. The combination of this new component, together with the existing DBMS's constitutes the DDBMS.

Advantages of a DDBMS:

  1. Data can be located near the greatest demand site.
  2. If users (mostly) work with a subset of the database, by storing it (or a copy of the subset) locally, access is much faster than with a remote site.
  3. With data at several different sites rather than one, the workload can be spread out, possibly improving processing performance.
  4. Ideally, new sites can be added at will, thus making growth easier. (In practice, adding new sites to an existing network of databases is often not that easy.)
  5. Local hardware frequently consists of workstations or desktop systems, which are considerably easier and cheaper to run than mainframe systems.
  6. Workstations and desktop systems generally have a friendlier user interface.
  7. If a local site goes down, the other sites can still work.

Disadvantages of a DDBMS:

  1. Management and control are much more complex than with a centralized system. (We will address these problems in more detail later.)
  2. Security is a major problem. Not only are workstations and desktop systems more vulnerable than mainframe systems, but the network itself is particularly vulnerable to intruders.
  3. Although TCP/IP is a de facto standard protocol for network communication, the fact remains that at the present time there is no official standard communication protocol at the application level.
  4. Training costs tend to be more expensive and more ongoing. They can easily (and usually do) exceed the savings in operating costs, and frequently even offset any savings in hardware.
  5. A lot of remote access can really bog down a network, slowing down processing to a crawl, and even causing system crashes.
  6. To minimize network traffic it is fairly common to duplicate data at multiple locations. Not only does this increase storage costs, but routine updates become more complicated.
  7. Routine activities such as transaction management, query optimization, etc. are much more complex than with a centralized database.

C. J. Date, one of the big names in database theory, gives what is generally regarded as the fundamental principle of DDBMS: To the user, a distributed database should look exactly like a non-distributed system. This principle led to Date's 12 objectives of DDBMS.
arrow2: Overhead of Date's 12 rules

These are available in your text on pages 506 - 507. Date proposed these more than 20 years ago, but most database people still buy into them. The following quote in the book I find interesting. "Although no current DDBMS conforms to all of them, they constitute a useful target."

  1. Local Autonomy: All operations at a given site are controlled at that site. Local data is locally owned and managed. Security, integrity, and storage representation of local data remain under the control of the local site. This objective is not completely achievable. There are some situations, e.g. remote updates, in which a given site, X, must relinquish some control to a remote site, Y. However sites should be autonomous as much as possible.
  2. No reliance on a central site: There are two reasons for this objective.
    1. A central site might be bottleneck if there is very heavy traffic.
    2. If the central site goes down, so does most of the network.
  3. Continuous operation: The network should never be required to shut down in order to perform some function such as adding a new site or updating the DBMS at an existing site.
  4. Location independence: (This is sometimes known as location transparency.) Users should not have to know where data is physically stored. Data should be allowed to migrate from site to site (perhaps for performance reasons) without affecting existing programs (again except, perhaps, for performance).
  5. Fragmentation independence: A relation is fragmented if it can be divided up into pieces or "fragments" for physical storage purposes. This is usually done for performance reasons, e.g. to store data where it is most frequently used.
    arrow2: Hand out fragmentation example.
    The top example shows horizontal fragmentation and the bottom shows vertical. There can be combinations. Any fragment should be derivable from the original with appropriate selection and projection operations. The original table should be able to be recovered from the fragments by JOINs and UNIONs. User requests for data should be the same whether or not the table is fragmented.
  6. Replication independence: Data replication means that a relation (or a fragment) is stored at more than one site. This is done primarily to improve retrieval performance. It will tend to penalize update. (We will look more closely at that shortly.) Again from the user's point of view, access commands should be the same whether or not the data is replicated.
  7. Distributed query processing: Obviously you would need distributed query processing in a distributed database, but in fact this can be a major stumbling block to efficient running of the network. C. J. Date≠s book, An Introduction to Database Systems, eighth edition, which has a 2003 copy write date, has an interesting example. It can be found on pages 661-662. Suppose we have the following tables:
    		SUPPLIER (S#, CITY) stored at A  (10,000 records)
    		PART( P#, COLOR) stored at B (100,000 records)
    		SP (S#, P#, PRICE) stored at A (1,000,000 records)
    
    Using these tables consider the query, "What London suppliers are shipping red parts?" An SQL command to answer the query is
    		SELECT S#
    		FROM SUPPLIER, SP, PART
    		WHERE CITY = ŒLONDON≠ AND COLOR = ŒRED≠ AND
    			SUPPLIER.S# = SP.S# AND
    			SP.P# = PART.P#
    
    Date makes some reasonable assumptions about record length, data transmission rate, access delay, etc. in order to examine various strategies for carrying out the query. His absolute time estimates could possibly be improved with better (newer, more expensive) technology, but the relative times for comparing the different strategies would still be approximately correct.
    	STRATEGY                           COMMUNICATION TIME
    i.	Move PART to A                         6.67 minutes
    ii.	Move SUPPLIER and SP to B	      1.12 hours
    iii.	For each London shipment,	        5.56 hours
    	check if the part is red
    iv.	For each red part, check if London	2 seconds
    	supplier exists
    v.	Move London shipments to B	        6.67 minutes
    vi.	Move red parts to A	               .1 seconds
    
    The DDBMS's query optimizer will make the strategy selection. As you can see, the need for a global query optimizer is crucial for a DDBMS to operate properly.
  8. Distributed transaction management: There are two major considerations -- concurrency and recovery. An agent is a process performed at a given site. A transaction will generally consist of multiple agents. For recovery the system must ensure that all agents of a given transaction either commit together or rollback together. [The usual strategy for this is the 2-phase commit. We will discuss that shortly.] Concurrency is likely to be based on locking, and the main problem with locking is global deadlock. This is much harder to detect and fix than a local deadlock because there is no single operating system in control.
  9. Hardware independence: When Date first proposed his 12 objectives, this was thought to be an insurmountable obstacle. We are close to achieving it now.
  10. Operating system independence: Widespread use of the internet has helped in establishing some de facto standards for this objective and the next one. We still have a ways to go.
  11. Network independence: See objective 11.
  12. DBMS independence: We still have a ways to go in this respect, though we have made considerable progress since Date's proposal was first made.


For the previous set of notes click on 325 notes-6.

For the next set of notes click on 325 notes-8.




Last update: Wednesday, August 12, 2009 at 11:23:46 AM.