Transaction management in the R* distributed database management system

SUNDAY, FEBRUARY 14, 2021 •

Note: I use the words "transaction" and abbreviation "Xact" interchangeably in this article.

Happy Valentine's Day! This seminal paper extends the two phase commit (2PC) protocol with the optimized Presumed Abort (PA) and Presumed Commit (PC) protocols.

The main goal for 2PC is to guarantee the atomicity (all or nothing execution) of transactions processed by multiple sites. 2PC was designed with several ideals in mind. In addition to guaranteeing transaction atomicity, the outcomes from processing a commit should be "forgotten" (a.k.a. little to no memory spent holding onto old, never-to-be-used-again values) within a short period of time. The coordinator should minimize total overhead in the forms of log writes and message traffic. For instance, if the coordinator crashes in the middle of a transaction, not having to repeat completed operations while still maintaining correctness on each machine might require more logging, and visa versa.

The purpose of this paper is to uphold the pre-existing conditions for atomicity while making optimizations to the performance around different edge, failure, and non-failure cases. In situations where byzantine failures occur and transactions must be aborted, the revised protocols attempt to maximize the ability to perform unilateral aborts across all machines.

Here, we suggest that complicated protocols developed for dealing with rare kinds of failures during commit coordination are not worth the costs that they impose on the processing of distributed transactions during normal times (i.e., when no failures occur).

2PC Review

For context, the authors spend the first half of the paper describing the existing 2PC protocol. To facilitate distributed updates, machines are organized in a setting consisting of a single coordinator and multiple subordinates. You can think of both the coordinator and any subordinate as a process. The coordinator serves as a single point of contact for syncing subordinates' states. The normal 2PC protocol can be visualized as follows:

There is a line of communication for each subordinate where this protocol is being carried out. In the above diagram, the * means that the record is forced to stable storage. Each action is logged as a record with key information including:

Record Type (Prepare, Commit, Abort, Collecting, End, Undo, Redo) Process ID Xact Name Coordinator ID Exclusive Locks Subordinate ID

From the above diagram, the "Prepare" phase refers to the first two lines of communication (Prepare + Vote Yes/No). Upon receiving the PREPARE message from the coordinator, each subordinate will force write a log record documenting its vote or actions.

The coordinator's actions and decisions are summarized by the following: Coordinator Normal 2PC Note that the ABORT/COMMIT messages in the "Commit" phase are only sent to the subordinates that voted YES in the "Prepare" phase.

The subordinate's actions and decisions are summarized by the following: Subordinate Normal 2PC If one subordinate votes NO, the resulting action will always be ABORT. Therefore, a NO voting subordinate can preemptively stop the transaction without having to wait for the voting results from the coordinator.

2PC Recovery

Now, we have a good idea of what basic 2PC looks like. So what kinds of issues can arise from failures for either the coordinator or the subordinate? Let's investigate how the existing DDBMS recovers from failures.

At the time of the failure, a recovery process running on each site (coordinator and subordinates) restores the machine and each Xact by determining each Xact's status based on the write-ahead-logging records generated by the 2PC protocol. If an Xact is in the middle of committing or aborting, the coordinator picks up where it left off, sending the corresponding COMMIT or ABORT message to subordinates until they respond with an ACK. If there's no logs about an Xact, it's aborted, undone, and forgotten about.

Out of the possible phases a subordinate processing an Xact could be in, we've covered committing, aborting, and no information. That leaves the prepared state. Upon revival from failure, the subordinate will ping the coordinator asking for the vote result - in other words, whether the Xact should be committed or aborted. Upon receiving the inquiry, the coordinator looks it up and sends the response accordingly.

This works great if the coordinator actually knowns the Xact state. What if there's no information? The authors discuss:

Since both COMMITS and ABORTS are being acknowledged, the fact that the inquiry is being made means that the inquirer had not received and processed a COMMIT/ABORT before the inquiree “forgot” the transaction. Such a situation comes about when (1) the inquiree sends out PREPARES, (2) it crashes before receiving all the votes and deciding to commit/abort, and (3) on restart, it aborts the transaction and does not inform any of the subordinates.

It's a tricky situation. On restart, the participants technically do not know whether they themselves are a subordinate or a coordinator (hence the "inquirer" and "inquiree" designations). So what's the correct response here?

Given this fact, the correct response to an inquiry in the no information case is an ABORT.

The reason this discussion is important is that 2P as described would not be able to facilitate such a procedure. The authors highlight in the very next section:

2P as described above is inadequate for use in systems where the transaction execution model is such that multilevel (>2) trees of processes are possible... Each process communicates directly with only its immediate neighbors in the tree, that is, parent and children. In fact, a process would not even know about the existence of its non-neighbor processes.

Through this overview of 2PC and the complementary recovery process, the authors aim to point out the inefficient message traffic and log writes of the non-failure cases, along with the inability to recover from the aforementioned failure case.

New Protocols

With the stage set, the authors introduce three new protocols to address the alleged shortcomings.

Solving 2PC's inoperability in contexts with multiple trees of processes is fairly simple - make 2PC multi-leveled, too, with intermediate nodes that...

...act as both coordinators (for their child processes) and subordinates (for their parent process).

In normal 2PC, subordinates only communicate with the coordinator, and not with one another. However, let's say that a certain transaction running on a subordinate is split across even more processes (a.k.a. more subordinates). To allow the processes to talk to one another without using the root coordinator, we simply add an intermediate "non-leaf, non-root" process that acts similarly to the coordinator root itself. Its value comes from the fact that this allows for 2PC to be carried out on a tree of processes, as opposed to a single layer of processes. This fixes the prepared phase bug mentioned above.

Presumed Abort and Presumed Commit are new protocols driven by workload-specific optimizations. Let's start with Presumed Abort.

Remember how, as discussed previously, an Xact is aborted if the recovery process can't find any information about it? As it stands, the coordinator would send an ABORT message, wait for subordinates' ACKs, and then remove the Xact from its memory store.

However, the authors point out, if it always aborts, then there's no reason for the coordinator to hold onto the aborted Xact. A log record's main purpose is for a recovery process to restore an Xact to its former state after failure. However, if the Xact was aborted (and its operations were never executed), there's no point in having a log record to ensure that nothing happened. After failure, if the subordinate inquires the coordinator about such an Xact, the coordinator's logs will turn up empty, and the subordinate will still be told to ABORT, without all the extra ACKs and abort/end log records.

Presumed commit builds on top of presumed commit while flipping around some of PA's guidelines. In a DDBMS with few failures, we can reasonably assume that the majority of Xacts will commit, with rare failures that incur an ABORT. Therefore, why don't we reduce ACKs and logs by requiring ACKs and forcing execution for aborts, but not for commits? This is presumed commit in a nutshell. The modifications to presumed abort as follows:

  • Require ACKs, end records for aborts
  • Force write collecting record before sending PREPARE
  • Force write abort records

Finally, in the assessment, it's shown that for different workloads, PA > 2PC and PC > 2PC, but for PC vs. PA, the performance depends very much on whether the Xacts are more write or read heavy.