Skip to main content

Databases: More Atomic Cluster Commits

I'm currently reading the book Scalable Internet Architectures, and I'm enjoying it.

On page 141, he describes two-phase commits:
The basic idea is that the node attempting the transaction will notify its peers that it is about to commit, and they will react by preparing the transaction and notifying the originating node that they are ready to commit. The second phase is noticing any aborts in the process and then possibly following through on the commit (hopefully everywhere)...The two-phase commit is not perfect, but it is considered sufficient for applications such as stock trading and banking...Despite its shortcomings, the only common "correct" multimaster replication technique used today is the 2PC, and, due to its overhead costs, it is not widely used for geographic replications.
I didn't catch the phrase "not perfect" the first time I read that section, so I spent the day wondering how the hell they implemented fully atomic transactions across a cluster of machines. What happens if all the machines agree that they're right about to finish the commit and then one of the machines goes down right before writing the final bit to the transaction log and the others don't? I'll readily admit I'm no database expert, so I figured I must be missing something. I fought with it for a while, and I was very pleased to re-read that section and see the phrase "not perfect".

Well, how could you do a perfect cluster-wide commit with absolutely no race conditions? Let's assume the machines are all near each other since 2PC doesn't work across the country anyway without a ton of lag. I have a silly brute force solution that may not be perfect, but is a bit closer to atomic (i.e. it relies on the hardware in the same way mutexes do):

Start with battery-backed RAID cards. However, design the RAID cards so that they have one input wire and one output wire specifically to address this problem. Now, put all the machines in a circuit using this wire:
_______________
| |
|_N1_N2_N3_N4_|
On each of the RAID cards is a switch. If all the machines open their switches, current can flow. If any one of them closes their switch, current doesn't flow. It's the hardware equivalent of an AND operation; did I mention I'm not a hardware guy? Anyway, my commit system proceeds pretty much the same as the 2PC. When the machines have agreed that they're ready to do the final commit, the kernel setups up the sector to be written to disk and lets the RAID card take over. Then, they each open their switches. When the current starts flowing, the RAID card recognizes this signal to write the sector containing the data for the transaction log saying that the commit took place. If the RAID card is, for some reason, unable to perform its duties, it shuts down and declares that it's broken. I.e. the machine will die, but there won't be inconsistency.

Well, there may be holes in this scheme. Perhaps it was written about 30 years ago just like all the other interesting CS problems. It does require specialized hardware. All those reservations aside, I think this leads to a more atomic cluster commit.

Comments

Anonymous said…
The trick is that there's a master node with a log. If a node goes down before writing that final bit, when it gets back up the master will replay its log and all will be fine.
jjinux said…
What happens if:

The master writes its final log message.

Then the master manages to tell half the nodes that it wrote the final log message.

Then it dies.

The other half of the nodes think that the master did not succeed, so they roll back.

Sorry if this is a silly, basic question. This is all new stuff to me.
Anonymous said…
You should look up the 3 phase commit http://en.wikipedia.org/wiki/Three-phase-commit_protocol
bong said…
You might find the discussion about two-phase commit in Java Transaction Processing interesting especially the parts that discuss what can happen when the commit phase fails. Don't let the Java portions throw you off, they cover transaction processing very well.
jjinux said…
> Java Transaction Processing

I wish I had a copy to read about it. Is there any way you can summarize in a paragraph or two?

Popular posts from this blog

Ubuntu 20.04 on a 2015 15" MacBook Pro

I decided to give Ubuntu 20.04 a try on my 2015 15" MacBook Pro. I didn't actually install it; I just live booted from a USB thumb drive which was enough to try out everything I wanted. In summary, it's not perfect, and issues with my camera would prevent me from switching, but given the right hardware, I think it's a really viable option. The first thing I wanted to try was what would happen if I plugged in a non-HiDPI screen given that my laptop has a HiDPI screen. Without sub-pixel scaling, whatever scale rate I picked for one screen would apply to the other. However, once I turned on sub-pixel scaling, I was able to pick different scale rates for the internal and external displays. That looked ok. I tried plugging in and unplugging multiple times, and it didn't crash. I doubt it'd work with my Thunderbolt display at work, but it worked fine for my HDMI displays at home. I even plugged it into my TV, and it stuck to the 100% scaling I picked for the othe

ERNOS: Erlang Networked Operating System

I've been reading Dreaming in Code lately, and I really like it. If you're not a dreamer, you may safely skip the rest of this post ;) In Chapter 10, "Engineers and Artists", Alan Kay, John Backus, and Jaron Lanier really got me thinking. I've also been thinking a lot about Minix 3 , Erlang , and the original Lisp machine . The ideas are beginning to synthesize into something cohesive--more than just the sum of their parts. Now, I'm sure that many of these ideas have already been envisioned within Tunes.org , LLVM , Microsoft's Singularity project, or in some other place that I haven't managed to discover or fully read, but I'm going to blog them anyway. Rather than wax philosophical, let me just dump out some ideas: Start with Minix 3. It's a new microkernel, and it's meant for real use, unlike the original Minix. "This new OS is extremely small, with the part that runs in kernel mode under 4000 lines of executable code.&quo

Drawing Sierpinski's Triangle in Minecraft Using Python

In his keynote at PyCon, Eben Upton, the Executive Director of the Rasberry Pi Foundation, mentioned that not only has Minecraft been ported to the Rasberry Pi, but you can even control it with Python . Since four of my kids are avid Minecraft fans, I figured this might be a good time to teach them to program using Python. So I started yesterday with the goal of programming something cool for Minecraft and then showing it off at the San Francisco Python Meetup in the evening. The first problem that I faced was that I didn't have a Rasberry Pi. You can't hack Minecraft by just installing the Minecraft client. Speaking of which, I didn't have the Minecraft client installed either ;) My kids always play it on their Nexus 7s. I found an open source Minecraft server called Bukkit that "provides the means to extend the popular Minecraft multiplayer server." Then I found a plugin called RaspberryJuice that implements a subset of the Minecraft Pi modding API for B