Will Portnoy: Lessons Learned from Implementing Paxos

My group at Microsoft uses Azure for many of its projects. We have a shared data store that we run as a service in Azure, and some of the goals for this data store included:

1. replication to scale out horizontally for distributing read query load
2. fault tolerance for machine failures, network failures, Azure upgrades, etc.

The distributed state machine approach is one option for fault tolerance across replicas (vector clocks are another, for example).

It's a state machine because all replicas start from the same state (e.g. an empty data store) and perform the same sequence of state transitions (e.g. updates to the data store). All replicas will (eventually) have the same state. The idea is that if all replicas have exactly the same state, it doesn't matter on which replica read queries are performed. Machines can come and go and network messages can be dropped, but a read query returns the same result no matter which replica serves it.

In this article, I'd like to cover some of the lessons learned from implementing Paxos. Some of the points may seem obvious (I think High-Performance Server Architecture is another article where being explicit about the basics is valuable), but I think they're worth stating as I haven't seen them written down before (Paxos Made Live is a notable exception).

I'd recommend the Wikipedia article on Paxos for an introduction, but here's a capsule summary. A common application of Paxos is to build a distributed transaction log of operations to be applied in sequence to every state machine replica. Paxos reaches consensus on each operation across all replicas - this means no replica will disagree with any other replica as to the value selected, even in the presence of network or machine failures at any point in the algorithm.

Because Paxos is usually explained in terms of reaching consensus on a single value, each position in the log is called an instance of Paxos. In reality, people will use a variant of Multi-Paxos to run many instances of Paxos, one for each specific position in the log.

A client of Paxos submits operations to the log, and Paxos goes through two phases to reach consensus across replicas regarding the single indexed position that each operation occupies in the log. A leader is elected for a round in phase 1, and that leader runs a value for a particular instance of Paxos through an election in phase 2. One of the details that allows Paxos to reach consensus is that the leader can be constrained to complete phase 2 with a specific value in a specific instance position in the log.

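Paxos tolerates multiple simultaneous leaders, at the cost of possibly never reaching consensus when competing leaders keep preempting each other. This is in line with the FLP Impossibility Result, which says that no deterministic consensus algorithm can guarantee termination in a fully asynchronous system where even one process may fail. In practice it is more of a theoretical problem, avoided with persistent leader elections, described below.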

The state machines can execute every operation from the start of the log up to the first instance that has not yet reached consensus.
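
To make the "execute up to the first gap" rule concrete, here is a minimal sketch in Python (not the .NET code from my implementation). The `ReplicatedLog` class and its `learn` method are hypothetical names invented for illustration; the only idea taken from the text is that decided instances are applied strictly in order and execution stalls at the first missing instance.

```python
from typing import Callable, Dict


class ReplicatedLog:
    """Hypothetical helper: stores decided values and applies them in log order."""

    def __init__(self, apply: Callable[[int, str], None]) -> None:
        self._decided: Dict[int, str] = {}
        self._next_to_apply = 0
        self._apply = apply

    def learn(self, instance: int, value: str) -> None:
        """Record a decided value, then execute as far as the log is contiguous."""
        # A decided instance never changes, so a repeated notification is ignored.
        self._decided.setdefault(instance, value)
        while self._next_to_apply in self._decided:
            self._apply(self._next_to_apply, self._decided[self._next_to_apply])
            self._next_to_apply += 1


applied = []
log = ReplicatedLog(lambda instance, value: applied.append((instance, value)))
log.learn(0, "a")
log.learn(2, "c")  # instance 1 is still undecided, so instance 2 must wait
log.learn(1, "b")  # fills the gap; instances 1 and 2 are both executed now
```

After the three calls, `applied` holds the operations in log order, even though instance 2 was learned before instance 1.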

Some of the lessons follow.

Computer science papers can be funny. See the original paper on Paxos.
The description of Paxos in Paxos Made Simple is in terms of three agents: the proposer, the acceptor, and the learner. This expression of the Paxos algorithm means it's natural to implement the algorithm using the Actor Model of concurrency.

I chose the Reactive Extensions. F#'s MailboxProcessor and Erlang's processes may have been good options too, but there were reasons the reactive framework felt like the right tool for the task.

Paxos has a reputation for being difficult to implement. I think the difficulty lies less in the algorithm and more in getting the supporting details correct (many examples follow). But simplicity in implementation is important, and composability allowed me to effectively combine simpler primitives to build a more complicated system.

So what are some of the layers I'd like to compose?
1. a communication channel between replicas. This could be UDP, a custom TCP mesh, or through WCF to leverage its service hosting and activation functionality.
2. a multiplexer for multiple logs onto a single communication channel.
3. a multiplexer for all three Paxos agents to share the same communication channel for a particular log.
4. a method for the leader proposer agent to run an election for a round with all of the acceptor agents voting.
5. a method for the Paxos proposer agent to run elections with the specific messages for phase 1 and 2.
I needed a mechanism for composable filters over a shared IP port, so that I could execute multiple copies of Paxos on the same port. The reactive extensions met that need, but then it was very natural to express the Paxos agents (e.g. proposer, acceptor, learner) in terms of an actor model using Observable.Create, and then the Paxos agents were also composable out of smaller parts as well (e.g. WhenMajorityHasQuorum, Phase1Lead, Phase2Propose).
Servers designed to use more than one cpu core have more than one operating system thread (even if hidden under a concurrent programming abstraction). This can lead to nondeterministic program executions, where the exact output of a program depends on the particular thread schedule selected by the operating system.

You may even choose to introduce nondeterminism: for example, I use randomized exponential backoff as a response to timeouts to allow a set of Paxos replicas to settle on a persistent leader allowing for faster consensus.

And the network itself introduces randomness; for example, UDP packets can be dropped.

To properly test your distributed system, you need to be able to introduce these timeouts and network failures in a controlled and repeatable manner - and nondeterminism is a non-starter for debugging and reproducing problems.

To make code deterministic for repeatable tests, it must have the same inputs. And the same inputs means the same network i/o, thread scheduling, random number generation - effectively all interaction with the outside world.

The reactive framework has made these ideas explicit through the virtual scheduler. The test virtual scheduler was very useful for running deterministic unit tests, and through the composable nature of the message handling, it was trivial to add a “noisy observable” to simulate packet loss and duplication to stress test the implementation. Weaving a pseudo-random number generator with a specific seed through the code allowed for randomized exponential backoff behavior to be deterministic as well.
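
As a rough illustration of the seeded-randomness idea, here is a small Python sketch (my implementation used the Reactive Extensions in .NET, so this is not that code). The names `backoff_delays` and `noisy` are hypothetical, and the "full jitter" backoff formula is just one common choice. The point is only that when every random decision comes from an injected, seeded generator, a "random" run repeats exactly.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def backoff_delays(rng: random.Random, base: float, cap: float, attempts: int) -> List[float]:
    """Randomized exponential backoff: delay n is uniform in [0, min(cap, base * 2**n)]."""
    return [rng.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]


def noisy(messages: Iterable[T], rng: random.Random, drop: float, dup: float) -> Iterator[T]:
    """Pass messages through, dropping or duplicating each with the given probabilities."""
    for message in messages:
        roll = rng.random()
        if roll < drop:
            continue  # simulate a lost packet
        yield message
        if roll < drop + dup:
            yield message  # simulate a duplicated packet


# The same seed produces the same losses, duplicates, and delays on every run.
run_a = list(noisy(range(20), random.Random(42), drop=0.2, dup=0.1))
run_b = list(noisy(range(20), random.Random(42), drop=0.2, dup=0.1))
delays_a = backoff_delays(random.Random(7), base=0.1, cap=2.0, attempts=8)
delays_b = backoff_delays(random.Random(7), base=0.1, cap=2.0, attempts=8)
```

A failing test can then be reproduced by rerunning it with the seed it reported.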
Components like an implementation of Paxos can find reuse in many situations if they're designed for reuse. I've heard that a framework should not be deemed reusable until three separate clients have reused it. Designing for testability (e.g. policy versus mechanism and dependency injection) can help provide a second client, in addition to your main application, to enhance the reusability of your code.
The Paxos algorithm is described in terms of three agents, passing messages around. But message passing using event-driven programming isn't the most friendly concurrency pattern, mostly due to stack ripping.

So it's important to abstract Paxos for clients. The method I chose is to implement a base class StateMachine that was designed for inheritance - clients maintain the state they wish to replicate in their StateMachine-derived class, and keep all mutations of that state in a single method called only by the Paxos algorithm when consensus has been reached on the next position in the transaction log.
```
public override Task ExecuteAsync(int instance, Proposal command)
```
Many theoretical algorithms from distributed systems use the Asynchronous Model of time, where there is no shared notion of time across different nodes. One technique used to reduce these methods to practice is to introduce timeouts on certain operations. There would be a timeout for the replica trying to become leader, a timeout for the non-leader replicas watching for a leader replica that has gone offline, and a timeout for the phase 2 rounds of voting. There is an important detail in the last timeout though - the better implementations of Paxos allow multiple instances to reach consensus in parallel (without picking a fixed factor of concurrency, as Stoppable Paxos describes). But it simply takes longer, even if purely by network latency, to drive more instances to consensus. So in the end, you either vary your timeout in proportion to the number of requests or you limit the number of concurrent requests so that the timeout is sufficient.
If you're writing a server that spends most of its time waiting for i/o, it's worth considering asynchronous code. Even having a thread wait for a timeout can be a waste of that thread's resources. It's easier to write asynchronous code using a coroutine style of control flow; for example, async and await in the Async CTP.

But if you don't have special support in the debugger for stacks ripped apart across asynchronous i/o, it can be perplexing to break in the debugger and find no user code running and all threads waiting on i/o completion ports. The call stacks are useless for explaining the current state of the program.
Operations that take long periods of wall clock time (even if the cpu is efficiently waiting) need to be logged. The most obvious example is timeouts and exponential backoffs - this is code that does nothing and has no visible effects other than the passage of time.
When I was writing my implementation of Paxos, printf-style logging of the messages exchanged among the agents in my unit tests was my most valuable tool for tracking down violations of consensus. But logging every message was an overload of information, and it made for a tedious experience. It helped tremendously to have perfectly repeatable tests, allowing me to set conditional breakpoints and follow a specific request as it passed through the system. Having a request id passed end-to-end through the system made it even easier to follow the messages exchanged across different replicas.
A section of code can take far longer to execute than you would ever expect. Of course, you should run a profiler before optimizing the code (though I'll bet serialization and deserialization is at the top of your profile), but running a profiler in production is not practical. Production isn't usually set up with development tools, and collecting massive amounts of data affects performance too.

We can easily time small sections of code. But reporting the mean and standard deviation of a normal distribution forced onto the samples collected may mislead, mostly because the distribution of execution times may not fit a normal distribution.

I'm going to suggest something even cheaper, that I've used with success: time sections of your code that you expect to take only tens of milliseconds, and log a message if they take longer than one second. I think you would be surprised at the variability of the execution time of a single section of code.

The rationale and justification for this approach is covered in Amazon's paper on Dynamo. You care much more about the 99th percentile response time for a web service than the mean response time. True real time systems even introduce a hard deadline for completion of tasks, not even allowing the variability in execution time that really gets ignored by most profilers.
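
A cheap threshold timer of this kind might look like the following Python sketch (again, not my .NET code). `warn_if_slow` is a hypothetical helper name, and the injectable `clock` parameter is an assumption added so the helper itself can be tested deterministically.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("slow_sections")


@contextmanager
def warn_if_slow(name: str, threshold_seconds: float = 1.0, clock=time.monotonic):
    """Time the enclosed block and log only when it exceeds the threshold."""
    start = clock()
    try:
        yield
    finally:
        elapsed = clock() - start
        if elapsed > threshold_seconds:
            logger.warning("%s took %.3f s (threshold %.3f s)", name, elapsed, threshold_seconds)


# Usage: wrap a section that is expected to finish in tens of milliseconds.
with warn_if_slow("serialize proposal"):
    payload = repr({"instance": 7, "command": "put"})
```

Nothing is recorded on the fast path, so the cost in production is two clock reads per section.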
HTTP has proven to be a very useful transport protocol on top of TCP, and most HTTP client libraries (even through RPC mechanisms like WCF) provide http method invocation timeouts. But that's just the client-side of the operation - the server-side of the operation isn't always aware that the client has disconnected, and so you need to manually cancel the server-side code after some timeout. .NET provides a convenient structure for cooperative cancellation through the CancellationToken.
All server-side operations need to have their own timeout logic - this is not taken care of automatically by the typical server framework. In some situations, you may be querying an underlying sql database, and it will have its own timeout, which will throw an exception and unwind that server call stack and release resources. But when you're implementing something like Paxos, there are times where the server's logical thread of control is waiting for an event that may never occur (for example, a state machine's request to replicate a command in the log). When the client has already given up, you must ensure the server isn't continuing for no reason.
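
Here is a minimal sketch of a server-side timeout, written with Python's `asyncio` rather than the .NET `CancellationToken` I actually used. `replicate` and `handle_request` are hypothetical stand-ins; the `asyncio.Event` plays the role of "consensus was reached," which in the first request never happens.

```python
import asyncio
from typing import List


async def replicate(command: str, decided: asyncio.Event) -> str:
    """Hypothetical stand-in: waits until consensus is reached on the command."""
    await decided.wait()
    return f"committed {command}"


async def handle_request(command: str, decided: asyncio.Event, timeout: float) -> str:
    """Bound the wait on the server; wait_for cancels the inner wait on timeout."""
    try:
        return await asyncio.wait_for(replicate(command, decided), timeout)
    except asyncio.TimeoutError:
        return "timed out"


async def main() -> List[str]:
    never = asyncio.Event()  # consensus never arrives for this request
    ready = asyncio.Event()
    ready.set()              # consensus has already been reached for this one
    return [
        await handle_request("put x", never, timeout=0.05),
        await handle_request("put y", ready, timeout=0.05),
    ]


results = asyncio.run(main())
```

The first request gives up after the timeout and releases its resources instead of waiting forever; the second completes normally.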
In Paxos Made Simple, Lamport mentions executing phase 1 for "infinitely many instances" (the 0-1-infinity rule comes to mind).

In practice, that means a proposer that wishes to lead values to consensus in phase 2 needs to ask each acceptor for the list of values it has previously accepted for all [infinitely many] instances (positions in the transaction log). "Infinitely many" is too large for a network message, which brings me to the lesson: don't allow completely variable resource usage.

You can see examples of throttles in most systems: for example, IIS has configurable http request size limits and WCF has configurable message size quotas. But Paxos brings up some issues in protocol design that I think are worth mentioning.
1. A leader replica may have a very large number of instances they would like to submit to the log. So the leader may try to execute phase 2 with so many instances that timeouts are hit.
  
  To avoid these variable size messages, we simply cap the number of phase 2 submissions that may execute concurrently within a timeout period.
2. If a new replica joins a set of replicas that have a very long shared transaction log and tries to become leader without any knowledge of the transaction log state, if the acceptors intend to promise their allegiance to that leader, they must send all of the previous instances for which they have accepted values.
  
  But if a replica is added to the cluster, it's likely to be missing all of the log data from any snapshots to the latest position in the log, and one of the first things it may choose to do is to try to become leader. The protocol can handle that situation, at the cost of needing to update that replica to the latest position in the log - that can be too much data to send at once.
  
  To avoid these large messages, we can reject leadership requests if the requesting replica is lagging by too many instances, and count on the gossiping protocol to catch that replica up.
3. It's pretty inefficient to have a replica send all of their accepted instances to a prospective leader when it's likely that consensus has been reached for many of those instances.
  
  To avoid these large messages, when a replica tries to become leader, it can efficiently encode the instance positions into a run-length encoded string, and the accepting replica need only send the instance positions of which the prospective leader is unaware. Here, we rely on a property of Paxos: that an instance which reaches consensus will remain the same value for all time across all replicas.
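
The third point can be sketched as follows, in Python rather than the .NET of my implementation. The representation is an assumption: I use a list of `(start, length)` pairs where the text above mentions a run-length encoded string, and the function names are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Run = Tuple[int, int]  # (first instance, run length)


def encode_runs(instances: Iterable[int]) -> List[Run]:
    """Collapse instance numbers into sorted runs of consecutive positions."""
    runs: List[Run] = []
    for i in sorted(set(instances)):
        if runs and i == runs[-1][0] + runs[-1][1]:
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((i, 1))  # start a new run
    return runs


def decode_runs(runs: Iterable[Run]) -> Set[int]:
    """Expand runs back into the set of instance numbers."""
    return {i for start, length in runs for i in range(start, start + length)}


def instances_to_send(accepted: Iterable[int], leader_known: Iterable[Run]) -> List[int]:
    """Acceptor side: report only the instances the prospective leader lacks."""
    known = decode_runs(leader_known)
    return sorted(i for i in set(accepted) if i not in known)


known = encode_runs([0, 1, 2, 3, 5, 6, 9])       # [(0, 4), (5, 2), (9, 1)]
to_send = instances_to_send(range(0, 12), known)  # [4, 7, 8, 10, 11]
```

A long log with few gaps compresses to a handful of pairs, so the phase 1 request stays small even when the log is large.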
Flow control is a complicated but necessary aspect of any server API design. "Fire and forget" APIs are practically asking for a client sending requests to overwhelm a server, and bring to mind TCP congestion control and senders overwhelming receivers.
1. Randomized exponential backoff is the classic method for flow and congestion control in networks.
2. When there is cost to making a network request (always), batching is another classic method to help minimize the resources needed for the server to respond to client requests.
3. It's pretty common to want to perform an action on a long list of items, and we know it's too costly to spin up a thread per item. So we use thread and task pools to queue up the work and expect if the work is cpu-bound and has no serialized portions (Amdahl's law talks about serial versus parallel phases of work), approximately the same number of threads as cpu cores will be active - and we're maximizing the throughput of the cpu resources of the machine. But what if the work involves i/o? If we were writing a web crawler, do we really want to initiate as many http requests as our cpus will allow all at once? No, we want to limit the concurrency of the work including the i/o operations. So instead of using a low-level concurrency limited scheduler, I've written helper functions that allow me to run an entire operation involving asynchronous i/o, but limit the concurrency of these operations to a specific number.
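
A helper of the kind described in the last item can be sketched with a semaphore. This is a Python `asyncio` sketch, not the .NET helper I wrote; `map_limited` and `fake_fetch` are hypothetical names, and `fake_fetch` only sleeps in place of real i/o.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def map_limited(func: Callable[[T], Awaitable[R]], items: Iterable[T], limit: int) -> List[R]:
    """Run func over all items, with at most `limit` operations in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item: T) -> R:
        async with semaphore:  # the whole operation, including its i/o, holds a slot
            return await func(item)

    return await asyncio.gather(*(run_one(item) for item in items))


in_flight = 0
peak = 0


async def fake_fetch(n: int) -> int:
    """Stand-in for an http request; tracks how many calls overlap."""
    global in_flight, peak
    in_flight += 1
    peak = max(peak, in_flight)
    await asyncio.sleep(0.01)
    in_flight -= 1
    return n * 2


results = asyncio.run(map_limited(fake_fetch, range(10), limit=3))
```

Results come back in input order, and `peak` never rises above the limit regardless of how many items are queued.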
Paxos implementations tend to amortize the cost of phase 1 (i.e. leader election) across many phase 2 executions (i.e. proposal consensus). We can have the current leader refresh its mandate (through repeating phase 1) at some small interval, and the other replicas can watch for the current leader's refresh with a timeout of a larger interval - if the current leader is "stale," another replica can try to become leader. It's also the case that a replica would want to become leader if it has proposals it would like to drive to consensus.

But part of the Paxos protocol is that when a replica wants to become leader, it must be made aware of all proposals accepted by other replicas in previous rounds so that the new leader can drive those proposals to consensus - and in a practical sense, that's an unbounded amount of data. So what can be done? Well, when a replica wants to become a leader, it can also state which instances it already knows have been driven to consensus, so that responding replicas need not include those instances. And more importantly, replicas that are too far behind in their knowledge of the consensus of instances in the log need to have their bid for leadership rejected until the gossip protocol can bring them up to date.

Instead of amortizing the cost of leader election across many instances, we could bundle the leader election into phase 2 for the previous instance. Unfortunately, that serializes consensus for a sequence of instances.
Paxos does not really fit in the family of distributed system approaches called "eventually consistent". Reads of the underlying state are scheduled through the log, and because they execute on a quorum of replicas, you know that you are getting the actual state of the replicated data at that point in the transaction log. But coming to consensus just for a read involves at minimum an execution of phase 2 and network communications. So there are some knobs you can turn to trade off consistency for performance - for example, you can perform dirty reads at whichever replica your load balancer happens to direct the client's request. The nice thing about paxos is that you can easily monitor the lag in both driving proposals to consensus and the lag in the execution of those proposals once they have reached consensus in the transaction log. And so we find dirty reads to be less of an issue than might be expected.
Mike Acton asks a question that's worth considering in API design. For all of the parameters X of type T to a method, when will you ever have only one T? In many cases, you will have a collection of X that all need to be processed in the same way.

Game developers, optimizing for cpu performance, are thinking about hitting the memory wall of performance where it doesn't matter how many instructions your cpu can retire because you can't interact with memory fast enough.

But at a larger scale, calls to networked services are the wall against which single-machine performance suffers. Imagine the code for a dynamic HTML page that runs many queries against a database, and let's ignore caching layers like memcached for now. Connection pooling can remove the latency of connection setup, but the ideal design is that a single HTML page executes a single network request of a batch of queries against the database.

But your system isn't designed that way - in fact, even though page construction leverages multiple cpus through concurrent code, most of the interfaces are constructed to execute a single operation at a time. Given the constraint of those single-item interfaces, one solution is to introduce a bit of latency to make batching operations possible.

As each request comes in, it's added to a queue. If the queue was empty, a timer is set to send off a batch of requests together, and if the queue is full, the batch is sent immediately. It's a simple workaround for interfaces that aren't designed with batching in mind.
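
The queue-plus-timer workaround can be sketched like this, in Python `asyncio` rather than the .NET I used. `Batcher` is a hypothetical class, and `send` stands in for whatever issues the single batched network request.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional


class Batcher:
    """Hypothetical helper: groups single requests into batches by size or delay."""

    def __init__(self, send: Callable[[List[str]], Awaitable[None]],
                 max_size: int, max_delay: float) -> None:
        self._send = send
        self._max_size = max_size
        self._max_delay = max_delay
        self._queue: List[str] = []
        self._timer: Optional[asyncio.Task] = None

    async def add(self, request: str) -> None:
        self._queue.append(request)
        if len(self._queue) >= self._max_size:
            await self._flush()  # queue is full: send immediately
        elif self._timer is None:
            # first request of a new batch: start the timer
            self._timer = asyncio.create_task(self._flush_later())

    async def _flush_later(self) -> None:
        await asyncio.sleep(self._max_delay)
        self._timer = None  # clear first so _flush does not cancel this task
        await self._flush()

    async def _flush(self) -> None:
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None
        batch, self._queue = self._queue, []
        if batch:
            await self._send(batch)


async def demo() -> List[List[str]]:
    sent: List[List[str]] = []

    async def send(batch: List[str]) -> None:
        sent.append(batch)

    batcher = Batcher(send, max_size=3, max_delay=0.02)
    for request in ["a", "b", "c", "d"]:
        await batcher.add(request)  # "c" fills a batch; "d" starts a new timer
    await asyncio.sleep(0.1)        # give the timer time to flush "d"
    return sent


batches = asyncio.run(demo())
```

The first three requests go out together as soon as the batch is full, and the straggler is sent when its timer expires, so no request waits longer than `max_delay`.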
I implemented Stoppable Paxos to manage group membership, hooking it up to the RoleEnvironment.Changed event to submit state machine configuration changes to the replica set. A few details are worth mentioning (other than the large subject of how to handle log truncation and snapshots in the face of configuration changes).
1. First, you need an initial configuration. If a replica starts off with empty Paxos storage, we can initialize instance 0 to some configuration "change" that represents the initial state of the cluster - something needs to start the cluster, and the cluster itself certainly cannot come to a quorum on membership changes before it has any members.
2. Second, the notion of replica node identity in the cluster is tied to the stable storage of promises made in phase 1 and 2 by its acceptor. So if the Paxos log is lost, that replica has left the cluster and needs to rejoin. This is a bit of a pain. One alternative that seems to work well: since Stoppable Paxos won't let a replica's acceptor vote until it's a member of the cluster, you know it can't go back on previous promises from phase 1 and 2. So you may be able to skip changing the node's identity when the Paxos log is "lost".
I read about the End-to-end principle in my networks graduate course, and I continually see applications in my work. For example, I had to generate a unique GUID for each Paxos proposal so that a replica could determine whether one of its proposals had been driven to consensus by another replica temporarily becoming the leader (since the proposal commands may have been equal). But initially that proposal GUID was assigned within the Paxos layers, and it would have been better to assign the GUID at the client constructing the proposal to submit to Paxos. Why? Because it may be that the client needs to retry the whole initiation of the proposal to a remote Paxos cluster, and having that GUID allows for the prevention of duplicate proposal submissions in the face of retries.
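
As a small illustration of why the id belongs at the client, here is a Python sketch (not my .NET code). `DedupingLog` is a hypothetical, single-process stand-in for the replicated log; the only point is that a retry carrying the same client-assigned id is recognized, while an equal command with a different id is a genuinely new proposal.

```python
import uuid
from typing import Dict, List, Tuple


class DedupingLog:
    """Hypothetical stand-in for the replicated log: ignores ids it has already seen."""

    def __init__(self) -> None:
        self.entries: List[Tuple[str, str]] = []
        self._index_by_id: Dict[str, int] = {}

    def submit(self, proposal_id: str, command: str) -> int:
        """Append the command once per proposal id and return its log position."""
        if proposal_id in self._index_by_id:
            return self._index_by_id[proposal_id]  # a retry: nothing new is appended
        self.entries.append((proposal_id, command))
        self._index_by_id[proposal_id] = len(self.entries) - 1
        return self._index_by_id[proposal_id]


log = DedupingLog()
proposal_id = str(uuid.uuid4())  # assigned by the client, before any submission
first = log.submit(proposal_id, "increment counter")
retry = log.submit(proposal_id, "increment counter")        # client retried after a timeout
other = log.submit(str(uuid.uuid4()), "increment counter")  # an equal but distinct command
```

The retry maps to the same log position as the first attempt, so the counter is incremented twice in total rather than three times.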
The papers on Paxos don't really speak about how replicas can be "caught up" after restarting or restoring a snapshot without the preservation of the Paxos stable storage, but there are some important details. In a similar way to how a prospective leader can include a compact representation of the instances it knows have reached consensus in its phase 1 request message (with the benefit that replicas promising to follow that leader and not accept proposals for smaller rounds need only send a much smaller subset of their previously accepted proposals as constraints for the new leader), we can design a gossip anti-entropy protocol to fill in the gaps of the replica's view of the transaction log. Each replica simply sends a compact form of the instance positions that it knows have reached consensus to each other learner replica in the current configuration, and if a replica receives a gossip message including an instance number that it is not aware of, it sends a request for that data to the sending replica. The twist is that these requests can also be unbounded and need to have some reasonable limit of concurrency placed on them.
Paxos is for building a replicated transaction log, usually composed of operations that are applied to a data store of some type. Generally you want to make periodic "snapshot" backups of this data store, because rebuilding the entire store from the entire transaction log is too expensive in terms of time and space. These backups can be used to quickly bring up new replicas as needed for load or reliability. Once the backup has been performed (and usually moved to some longer term durable storage), the paxos log can be truncated. There are tons of choices here: do you store the current configuration in the snapshot? How do you perform a backup that doesn't "stop the world" preventing new writes to your store? How do you deal with corruption of the underlying store? What if the binary format of the store changes?
Paxos is just the start of a replicated fault tolerant system - I hope I've covered some interesting details here, but the logistics of building a full system will lead to many more challenges.

9 comments:

Unknown, July 5, 2012 at 2:40 PM
Thanks for this post. Very interesting read.
Unknown, October 2, 2013 at 9:04 AM
If each entry in the paxos log is a full transaction, wouldn't said log file end up unwieldy?
Will Portnoy, November 1, 2013 at 10:10 PM
You can truncate the paxos log (containing full transactions) when you periodically snapshot your state to some external durable store (which you want to do anyway, instead of transmitting and replaying the entire paxos log from the initial position when starting new replicas).
Unknown, June 16, 2014 at 11:14 AM
Hi, Thanks for your nice write-up. I was wondering if you can please share your
thoughts on the following.

Assume a scenario where the state change was successfully added to the log.
However, when it was time to apply the change to the state machine, there was
persistent failure. Did you run into any situations as above and if so I would be
interested in your thoughts about how you handled them. If there is persistent
failure, to me one option appears to be to exit the cluster and then resync the entire
data from a surviving node.

I was trying to figure out how to use paxos if I want to implement replication of
transactions. Lets say we treat each transaction as a single operation from
paxos stand-point (a logical operation such as an insert, delete). Each paxos
operation when applied to the state machine, could result in multiple operations
on the state machine (insert into a table, update an index, etc.). This transaction
could fail for any number of reasons. Hence, I was trying to figure out how
would one ensure that the state machine is completely identical even in the midst
of persistent failure. The only option I see is that the "faulty" nodes exit from the cluster
and resyncs the entire state. Thanks for your time.
Unknown, June 16, 2014 at 4:32 PM
Thanks for your reply
Anonymous, March 8, 2015 at 5:25 PM
Thanks for the writeup - now studying all the referenced things.

P.S. Code bit rendering looks weird in Chrome/Safari
https://www.dropbox.com/s/4gidr3lc94jjz0f/Screenshot%202015-03-09%2002.20.24.png?dl=0
Anonymous, March 20, 2015 at 11:37 PM
Why didn't you use Windows Fabric (I believe Lync and DocumentDB use it). It also provides high availability and fault tolerance with a Paxos implementation. Or is this a case of Microsoft internal politics? ;)
Anonymous, March 20, 2015 at 11:39 PM
Also: "The distributed state machine approach is one option for fault tolerance across replicas (vector clocks are another, for example)."

What? What has one got to do with the other? Paxos provides consensus and the other provides partial ordering of events.

Thursday, June 14, 2012
