I was wrong: the talk follow-up.
I criticize myself more harshly than any other. I advise my bosses to be direct and honest with their feedback whenever possible, so there is no chance I could misinterpret it and fall down a rabbit hole of self-loathing. Maybe this is what Xanax is supposed to help with, I’m not entirely sure.
Last week at HashiConf I gave the most intense talk I’ve given yet, “You. Must. Build. A. Raft!”. It was fun and lighthearted, meant to explain concepts such as Byzantine Fault Tolerance and Paxos in a way that even my dog could understand. I wanted to give the talk some type of storyline, so I framed it around a discussion I had with my lead a while back.
I drool over redundancy. I don’t know why, but the more reliable and redundant a system is, the more I want to find out how I could break it. I was discussing with my lead how I wanted to implement a multi-master, multi-region setup for an application we were working on. He advised me that because the application uses Raft, the latency between regions might cause issues, and that I shouldn’t do it. I heeded his advice, but a voice in the back of my head wondered if it was Raft’s fault or the application’s. So my talk was born.
For the sake of everyone reading, I’ll only focus on the Raft portion of my talk. Each node in Raft has a randomized election timer. When a node’s timer runs out without hearing from the leader, that node becomes a candidate and starts a leader election. The way we prevent these timers from running out is by having the leader send heartbeats to each node, which resets their timers.
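Here is a rough sketch of that timer-and-heartbeat dance in Python. This is a toy, not how any real Raft library does it: the Follower class, the simulate helper, and the logical millisecond clock are all names I made up for illustration. The 150 to 300 ms timeout range is just the example range from the Raft paper, and real candidates do a lot more than flip a state field.

```python
import random


class Follower:
    """Toy Raft node that tracks only its election timer (logical ms)."""

    def __init__(self, name, rng, timeout_range=(150, 300)):
        self.name = name
        self.rng = rng
        self.timeout_range = timeout_range
        self.state = "follower"
        self.deadline = 0
        self.reset_timer(now=0)

    def reset_timer(self, now):
        # Pick a fresh random timeout each time so nodes rarely expire together.
        self.deadline = now + self.rng.randint(*self.timeout_range)

    def on_heartbeat(self, now):
        # A heartbeat from the leader pushes the deadline out again.
        if self.state == "follower":
            self.reset_timer(now)

    def tick(self, now):
        # No heartbeat arrived before the deadline: stand for election.
        if self.state == "follower" and now >= self.deadline:
            self.state = "candidate"
        return self.state


def simulate(heartbeat_interval, duration=2000, seed=1):
    """Run three followers against a leader that heartbeats on a fixed interval."""
    rng = random.Random(seed)
    nodes = [Follower(f"n{i}", rng) for i in range(3)]
    for now in range(1, duration + 1):
        if now % heartbeat_interval == 0:
            for node in nodes:
                node.on_heartbeat(now)
        for node in nodes:
            node.tick(now)
    return [node.state for node in nodes]


print(simulate(heartbeat_interval=50))   # heartbeats well inside the timeout
print(simulate(heartbeat_interval=400))  # heartbeats slower than any timeout
```

With heartbeats every 50 ms, no timer can ever expire, so everyone stays a follower. With heartbeats every 400 ms, every timer runs out before the first heartbeat shows up and every node tries to become a candidate.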
Since the timers are randomized, I believed that if we used Raft across multiple regions, any latency issues would be solved by the leader election process.
If latency spikes and a node times out, a new term begins and the leader election process kicks off. I figured this leader election process was supposed to be flexible and happen a lot.
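To make that reasoning concrete, here is a minimal sketch of how a latency spike turns into a new term. The function name, the fixed 150 ms timeout, and the 50 ms heartbeat interval are all assumptions for illustration, and it cheats by assuming every timeout bumps the term by exactly one and that a leader is back right away. Real Raft elections can split votes and retry.

```python
def term_after_heartbeats(delays_ms, interval_ms=50,
                          election_timeout_ms=150, start_term=1):
    """Return the term a follower ends in, given each heartbeat's network delay.

    Heartbeat i is sent at i * interval_ms and arrives delays_ms[i - 1] later.
    """
    # Sort by arrival time, since a delayed heartbeat can land after later ones.
    arrivals = sorted(
        i * interval_ms + delay
        for i, delay in enumerate(delays_ms, start=1)
    )
    term = start_term
    last_heard = 0
    for arrival in arrivals:
        # Silence longer than the election timeout means the follower gave up
        # on the leader and started an election, which begins a new term.
        if arrival - last_heard > election_timeout_ms:
            term += 1
        last_heard = arrival
    return term


steady = [5, 5, 5, 5, 5, 5]        # low, steady latency
spike = [5, 5, 200, 200, 200, 5]   # three heartbeats caught in a latency spike
print(term_after_heartbeats(steady))
print(term_after_heartbeats(spike))
```

The steady case never leaves term 1. The spike leaves a 200 ms silence between arrivals, which is longer than the timeout, so the follower ends up in term 2.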
I was wrong. A week later, the GitHub outage occurred. GitHub’s service was degraded for 24 hours and 11 minutes.
To summarize: increased latency between regions caused a leader to be promoted in a new region, leading to inconsistencies in data and a degradation in user experience.
I ended my talk last week saying my lead and I were both correct, but leaning towards myself being more correct due to how Raft should handle leader election. However, how something should work and how something actually works are two completely different monsters. From what GitHub was saying in their outage report, I was not correct. Having a multi-region, multi-master setup could cause downtime whenever a leader is promoted across regions.
Were there things that could have prevented this from happening? Maybe, but with replication at this scale across regions, that’s a hell of a problem. Having a private network with more controllable and predictable latency could have helped, but you're still replicating large amounts of data and shifting traffic between regions. This could lead to poor user experience (cue customer saying: "Sometimes my site is slow!") and data loss.
I’m sending this out so that before my talk is released on YouTube, and three people watch it in full, and one of those three calls me out on this, I’ll have a retort. For those of you who do watch my talk, if what happened to GitHub is as they described, this could be seen as a Byzantine failure due to the loss of consensus and the inconsistency in reports. Which is pretty frickin’ cool.