Skip to main content

Databases: More Atomic Cluster Commits

I'm currently reading the book Scalable Internet Architectures, and I'm enjoying it.

On page 141, he describes two-phase commits:
The basic idea is that the node attempting the transaction will notify its peers that it is about to commit, and they will react by preparing the transaction and notifying the originating node that they are ready to commit. The second phase is noticing any aborts in the process and then possibly following through on the commit (hopefully everywhere)...The two-phase commit is not perfect, but it is considered sufficient for applications such as stock trading and banking...Despite its shortcomings, the only common "correct" multimaster replication technique used today is the 2PC, and, due to its overhead costs, it is not widely used for geographic replications.
I didn't catch the phrase "not perfect" the first time I read that section, so I spent the day wondering how the hell they implemented fully atomic transactions across a cluster of machines. What happens if all the machines agree that they're right about to finish the commit and then one of the machines goes down right before writing the final bit to the transaction log and the others don't? I'll readily admit I'm no database expert, so I figured I must be missing something. I fought with it for a while, and I was very pleased to re-read that section and see the phrase "not perfect".

Well, how could you do a perfect cluster-wide commit with absolutely no race conditions? Let's assume the machines are all near each other since 2PC doesn't work across the country anyway without a ton of lag. I have a silly brute force solution that may not be perfect, but is a bit closer to atomic (i.e. it relies on the hardware in the same way mutexes do):

Start with battery-backed RAID cards. However, design the RAID cards so that they have one input wire and one output wire specifically to address this problem. Now, put all the machines in a circuit using this wire:
_______________
| |
|_N1_N2_N3_N4_|
On each of the RAID cards is a switch. If all the machines open their switches, current can flow. If any one of them closes their switch, current doesn't flow. It's the hardware equivalent of an AND operation; did I mention I'm not a hardware guy? Anyway, my commit system proceeds pretty much the same as the 2PC. When the machines have agreed that they're ready to do the final commit, the kernel setups up the sector to be written to disk and lets the RAID card take over. Then, they each open their switches. When the current starts flowing, the RAID card recognizes this signal to write the sector containing the data for the transaction log saying that the commit took place. If the RAID card is, for some reason, unable to perform its duties, it shuts down and declares that it's broken. I.e. the machine will die, but there won't be inconsistency.

Well, there may be holes in this scheme. Perhaps it was written about 30 years ago just like all the other interesting CS problems. It does require specialized hardware. All those reservations aside, I think this leads to a more atomic cluster commit.

Comments

Anonymous said…
The trick is that there's a master node with a log. If a node goes down before writing that final bit, when it gets back up the master will replay its log and all will be fine.
jjinux said…
What happens if:

The master writes its final log message.

Then the master manages to tell half the nodes that it wrote the final log message.

Then it dies.

The other half of the nodes think that the master did not succeed, so they roll back.

Sorry if this is a silly, basic question. This is all new stuff to me.
Anonymous said…
You should look up the 3 phase commit http://en.wikipedia.org/wiki/Three-phase-commit_protocol
bong said…
You might find the discussion about two-phase commit in Java Transaction Processing interesting especially the parts that discuss what can happen when the commit phase fails. Don't let the Java portions throw you off, they cover transaction processing very well.
jjinux said…
> Java Transaction Processing

I wish I had a copy to read about it. Is there any way you can summarize in a paragraph or two?

Popular posts from this blog

Drawing Sierpinski's Triangle in Minecraft Using Python

In his keynote at PyCon, Eben Upton, the Executive Director of the Rasberry Pi Foundation, mentioned that not only has Minecraft been ported to the Rasberry Pi, but you can even control it with Python . Since four of my kids are avid Minecraft fans, I figured this might be a good time to teach them to program using Python. So I started yesterday with the goal of programming something cool for Minecraft and then showing it off at the San Francisco Python Meetup in the evening. The first problem that I faced was that I didn't have a Rasberry Pi. You can't hack Minecraft by just installing the Minecraft client. Speaking of which, I didn't have the Minecraft client installed either ;) My kids always play it on their Nexus 7s. I found an open source Minecraft server called Bukkit that "provides the means to extend the popular Minecraft multiplayer server." Then I found a plugin called RaspberryJuice that implements a subset of the Minecraft Pi modding API for B

Ubuntu 20.04 on a 2015 15" MacBook Pro

I decided to give Ubuntu 20.04 a try on my 2015 15" MacBook Pro. I didn't actually install it; I just live booted from a USB thumb drive which was enough to try out everything I wanted. In summary, it's not perfect, and issues with my camera would prevent me from switching, but given the right hardware, I think it's a really viable option. The first thing I wanted to try was what would happen if I plugged in a non-HiDPI screen given that my laptop has a HiDPI screen. Without sub-pixel scaling, whatever scale rate I picked for one screen would apply to the other. However, once I turned on sub-pixel scaling, I was able to pick different scale rates for the internal and external displays. That looked ok. I tried plugging in and unplugging multiple times, and it didn't crash. I doubt it'd work with my Thunderbolt display at work, but it worked fine for my HDMI displays at home. I even plugged it into my TV, and it stuck to the 100% scaling I picked for the othe

Creating Windows 10 Boot Media for a Lenovo Thinkpad T410 Using Only a Mac and a Linux Machine

TL;DR: Giovanni and I struggled trying to get Windows 10 installed on the Lenovo Thinkpad T410. We struggled a lot trying to create the installation media because we only had a Mac and a Linux machine to work with. Everytime we tried to boot the USB thumb drive, it just showed us a blinking cursor. At the end, we finally realized that Windows 10 wasn't supported on this laptop :-/ I've heard that it took Thomas Edison 100 tries to figure out the right material to use as a lightbulb filament. Well, I'm no Thomas Edison, but I thought it might be noteworthy to document our attempts at getting it to boot off a USB thumb drive: Download the ISO. Attempt 1: Use Etcher. Etcher says it doesn't work for Windows. Attempt 2: Use Boot Camp Assistant. It doesn't have that feature anymore. Attempt 3: Use Disk Utility on a Mac. Erase a USB thumb drive: Format: ExFAT Scheme: GUID Partition Map Mount the ISO. Copy everything from