Databases: More Atomic Cluster Commits

I'm currently reading the book Scalable Internet Architectures, and I'm enjoying it.

On page 141, he describes two-phase commits:
The basic idea is that the node attempting the transaction will notify its peers that it is about to commit, and they will react by preparing the transaction and notifying the originating node that they are ready to commit. The second phase is noticing any aborts in the process and then possibly following through on the commit (hopefully everywhere)...The two-phase commit is not perfect, but it is considered sufficient for applications such as stock trading and banking...Despite its shortcomings, the only common "correct" multimaster replication technique used today is the 2PC, and, due to its overhead costs, it is not widely used for geographic replications.
I didn't catch the phrase "not perfect" the first time I read that section, so I spent the day wondering how the hell they implemented fully atomic transactions across a cluster of machines. What happens if all the machines agree that they're right about to finish the commit and then one of the machines goes down right before writing the final bit to the transaction log and the others don't? I'll readily admit I'm no database expert, so I figured I must be missing something. I fought with it for a while, and I was very pleased to re-read that section and see the phrase "not perfect".

Well, how could you do a perfect cluster-wide commit with absolutely no race conditions? Let's assume the machines are all near each other since 2PC doesn't work across the country anyway without a ton of lag. I have a silly brute force solution that may not be perfect, but is a bit closer to atomic (i.e. it relies on the hardware in the same way mutexes do):

Start with battery-backed RAID cards. However, design the RAID cards so that they have one input wire and one output wire specifically to address this problem. Now, put all the machines in a circuit using this wire:
| |
On each of the RAID cards is a switch. If all the machines open their switches, current can flow. If any one of them closes their switch, current doesn't flow. It's the hardware equivalent of an AND operation; did I mention I'm not a hardware guy? Anyway, my commit system proceeds pretty much the same as the 2PC. When the machines have agreed that they're ready to do the final commit, the kernel setups up the sector to be written to disk and lets the RAID card take over. Then, they each open their switches. When the current starts flowing, the RAID card recognizes this signal to write the sector containing the data for the transaction log saying that the commit took place. If the RAID card is, for some reason, unable to perform its duties, it shuts down and declares that it's broken. I.e. the machine will die, but there won't be inconsistency.

Well, there may be holes in this scheme. Perhaps it was written about 30 years ago just like all the other interesting CS problems. It does require specialized hardware. All those reservations aside, I think this leads to a more atomic cluster commit.


Anonymous said…
The trick is that there's a master node with a log. If a node goes down before writing that final bit, when it gets back up the master will replay its log and all will be fine.
jjinux said…
What happens if:

The master writes its final log message.

Then the master manages to tell half the nodes that it wrote the final log message.

Then it dies.

The other half of the nodes think that the master did not succeed, so they roll back.

Sorry if this is a silly, basic question. This is all new stuff to me.
Anonymous said…
You should look up the 3 phase commit
bong said…
You might find the discussion about two-phase commit in Java Transaction Processing interesting especially the parts that discuss what can happen when the commit phase fails. Don't let the Java portions throw you off, they cover transaction processing very well.
jjinux said…
> Java Transaction Processing

I wish I had a copy to read about it. Is there any way you can summarize in a paragraph or two?