REDACT Project on Preventing Database Deadlocks in Software Applications

Mark Grechanik Ph.D., University of Texas at Austin

Summary Many organizations deploy applications that use databases by sending Structured Query Language (SQL) statements to them and obtaining data that result from executions of these statements. Since applications often share the same databases concurrently, database deadlocks routinely occur in these databases resulting in major performance degradation in these applications. To prevent database deadlocks automatically, we created a novel approach and we rigorously evaluated it. For a realistic case of over 1,000 SQL statements, all hold-and-wait cycles are detected in less than 15 seconds. We build a tool that implements our approach and we experimented with three applications. Our tool prevented all existing database deadlocks in these applications and increased their throughputs by approximately up to three orders of magnitude. The entire code, results, and video can be obtained from here. The Problem of Database Deadlocks Many organizations and companies deploy Database-centric applications (DCAs), which use databases by sending transactions to them - atomic units of work that contain SQL statements - and obtaining data that result from execution of these SQL statements. When DCAs use the same database at the same time, concurrency errors are observed frequently, and these errors are known as database deadlocks, which is one of the reasons for major performance degradation in these applications. The responsibility of database engines is to provide layers of abstractions to guarantee Atomicity, Consistency, Isolation, and Durability (ACID) properties; however, these guarantees do not include freedom from database deadlocks. In general, deadlocks occur when two or more threads of execution lock some resources and wait on other resources in a circular chain, i.e., in a hold-and-wait cycle. Database deadlocks occur within database engines and not within DCAs that use these databases. A condition for observing database deadlock is that a database should simultaneously service two or more transactions that come from one or more DCAs, and these transactions contain SQL statements that share the same resources (e.g., tables or rows). In enterprise systems, database deadlocks may appear when a new transaction is issued by a DCA to a database that is already used by some other legacy DCA, thus making the process of software evolution error-prone, expensive, and difficult. Currently, database deadlocks are typically detected within database engines using special algorithms that analyze whether transactions hold resources in cyclic dependencies, and these database engines resolve database deadlocks by forcibly breaking the hold-and-wait cycle. That is, once a deadlock occurs, the database rolls back one of the transactions that is involved in the circular wait. Doing so effectively resolves the database deadlock, but this resolution causes performance degradation, since DCAs should repeat the rolled back transactions to ensure functional correctness. Unfortunately, this solution is only partially effective even though it is widely used as part of the defensive programming practice, where programmers should write special database deadlock exception-handling code that should repeat aborted transactions -- searching for ``database deadlock exception'' on the Web yields close to 2,500 web pages, many of which instruct programmers how to handle database deadlock exceptions for different databases. By the time that a database deadlock is resolved, the damage to the performance of the DCA is done, since rolling back transactions, issuing exceptions inside the DCA, and executing defensive code within these exception handlers to retry these transactions incur a significant performance penalty. Our experiments show that database deadlocks result in up to three orders of magnitude of worsening performance of client/server DCAs when scaling the load up to only 1,000 clients! To make things worse, when transactions are discarded, the results of valuables and long-running computations are lost, as it is especially evident in case of multi-level and long-lived transactions. Our interviews with different Fortune 100 companies confirmed that database deadlocks occur on average every three weeks for large-scale enterprise DCAs, some of which have been around for over 20 years, with an estimated annual cost of DCA support close to $500K. For instance, database deadlocks still occur every ten days on average in a commercial large-scale DCA that handles over 70% of cargo flight reservations in the USA, and this challenge persists and it is getting worse as this DCA evolves. Interestingly, even if the cause of a database deadlock is understood, it is often not possible to fix it, since it would involve drastic redesign by changing the logic of the DCA to avoid certain interleavings of SQL statements among different transactions. In addition, fixing database deadlocks may introduce new concurrency problems, and frequently these fixes reduce the occurrences of database deadlocks instead of eliminating them. Developers need approaches for preventing database deadlocks in order to achieve better performance of software, however, there are currently no tools that prevent database deadlocks. We created a novel approach for pReventing databasE Deadlocks from AppliCation-based Transactions (REDACT) that detects all hold-and-wait cycles among resources in SQL statements and prevents database deadlocks automatically using the information about these cycles. We makes the following contributions. • We create a novel model using our abstraction that hides the complexity of database engines, instead concentrating on abstract operations (i.e., read and write) on resources (e.g., database tables) and on how these operations lock and release these resources. We created a tool that automatically extracts models from transactions. • Based on our model, we designed a novel efficient algorithm to detect all hold-and-wait cycles and we showed its correctness.% and we proved the correctness of this algorithm. • We implemented our algorithm in a simulator with which users can enter SQL statements from DCAs and detect all hold- and-wait cycles that potentially lead to database deadlocks. This detection is done statically, and its results are used to prevent database deadlocks at runtime. We evaluated our algorithm using this simulator with a random SQL generator, and we showed that for an extreme case of 200 transactions, each of which containing 200 SQL statements (i.e., a total of 40,000 SQL statements), it takes a little over 311 hours for our algorithm to detect all hold-and-wait cycles. For a realistic case of a large-scale DCA that contains 50 transactions, each of which includes a dozen of SQL statements our algorithm finds all cycles in less than 15 seconds. • Using this information about hold-and-wait cycles, we designed and built a generator for a supervisory control program that prevents database deadlocks by intercepting SQL statements sent to databases by DCAs, detecting a potential deadlock, and delaying an SQL statement thus effectively breaking the deadlock cycle. • We implemented our approach in a tool and experimented using three client/server DCAs. Our tool prevented all existing database deadlocks in these DCAs and increased their throughputs by approximately up to three orders of magnitude for 1,000 clients. Example of a Database Deadlock Consider an example of database deadlock that is shown in the table below. Transactions T 1 and T 2 are independently sent by DCA(s) to the same database at the same time. When the first DCA executes the statement UPDATE that is shown in Step 1, the database locks rows in the table authors that contains values of the attribute paperid equal to 1. Next, the second DCA executes the statement UPDATE that is shown in Step 2, the database locks rows in the table titles that contains values of the attribute titleid equal to 2. When the statement SELECT is executed as part of the transaction T 1 as shown in Step 3, the database attempts to obtain a read lock on the rows of the table \texttt{titles}, which are exclusively locked by the transaction T 2 of the second DCA. Since these locks cannot be imposed simultaneously on the same resource (i.e., these locks are not compatible), T 1 is put on hold. Finally, the statement SELECT is executed as part of the transaction T 2 as shown in Step 4, the database attempts to obtain a read lock on the rows of the table authors, which are exclusively locked by the transaction T 1 of the first DCA. At this point both T 1 and T 2 are put on hold resulting in a database deadlock. The same reasoning applies if the granularity of locks is coarser, on the table level. A lock graph is shown in the above figure for the these transactions that are depicted in rectangles and resources (i.e., tables) are shown in ovals. Thick arrows designate locks that transactions obtain on resources and dashed arrows designate that transactions are waiting to obtain locks on resources. The lock graph shows the cycle T 1 ->authors->T 2 ->titles->T 1 . This hold-and- wait cycle may not always materialize in a database deadlock; however, when interleaving steps occur as shown in the table, a database deadlock is highly likely. One exception is when these tables contain no data; in this case, locks may be released by the database engine almost instantly or not imposed at all. This situation illustrates how difficult it is to reproduce database deadlocks even in very simple cases. Downloads and Experimental Results To reproduce results of our experiments with REDACT, you need to download the following components. • REDACT system with instructions inside the rar file on how to put the system together - available here. • Subject applications for experimentation are available from Sourceforge here. • The Cycle Detector is available here with the documentation inside the package. You can watch a movie that shows it at work by following this link. • Random SQL statement and database generator is available here. • PNML model generator is available here. • Experimental results are avalable in Excel spreadsheets bundled together from here. People REDACT was created at the Advanced Research In Software Engineering (ARISE) lab at the Department of Computer Science of the University of Illinois at Chicago. Mark Grechanik, Project Lead Email: drmark[at]uic.edu B. M. Mainul Hossain Email: bhossa2[at]uic.edu Ugo Buy Email: buy[at]uic.edu Haisheng Wang worked on this project first as a Ph.D. student under prof.Buy and then as a post-doc with prof.Grechanik at UIC. He is now with Oracle Corp. Email: haisheng.wang[at]oracle.com There were many graduate students from UIC who worked on different components of this project at different times. Akila Subramanian, Naga Keerthi Gummadi, Ashwini Vishwanath, Priyanka Dawani, and Kasthuri Parthi worked on extracting the SQL parser from the Apache Derby database and creating a wrapper that made it easier to use. Thomas Marrinan helped us fix bugs in the extracted SQL parser and he wrote a translator that converted transactions into PNML models. Kumar Bisht wrote the simulator with which users can enter SQL statements from DCAs and detect all hold-and-wait cycles that potentially lead to database deadlocks. He and Sugi Venugeethan worked on visualizing hold-and-wait cycles in the simulator. Arthi Vijayakumar developed a random generator for SQL statements and databases. Dr. Chen Fu from Accenture discussed related work on deadlocks and Petri nets at the very early stage of this project. Tathagata Dasgupta wrote python scripts to generate various kind of inputs for the hold-and-wait cycle detection algorithm. Sumaiya Sayed extracted transactions from the subject DCAs. Prof. Don Batory made a number of useful suggestions during our peripatetic discussion in the Botanical Garden when we attended ICSE in Honolulu, Hawaii. Finally, we want to thank Jega Aravandy from Accenture and a number of anonymous managers from Fortune 100 companies for letting us use some empirical data related to database deadlocks from large-scale applications.