Comments on Will Portnoy: Lessons Learned from Implementing Paxos

Also: "The distributed state machine approach...

2015-03-20T23:39:02.137-07:00

Also: "The distributed state machine approach is one option for fault tolerance across replicas (vector clocks are another, for example)."

What? What has one got to do with the other? Paxos provides consensus; vector clocks provide a partial ordering of events.
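To make the "partial ordering" point concrete, here is a minimal sketch of comparing two vector clocks. The `compare` function and the replica names are hypothetical, chosen only for illustration. It shows that some pairs of events are simply unordered, which is why vector clocks alone do not make replicas agree on a single value the way consensus does.

```python
from typing import Dict

# A vector clock maps a node name to that node's event counter.
# Nodes missing from a clock are treated as counter 0.
VectorClock = Dict[str, int]


def compare(a: VectorClock, b: VectorClock) -> str:
    """Return 'before', 'after', 'equal', or 'concurrent' for clocks a and b."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a causally precedes b
    if b_le_a:
        return "after"       # b causally precedes a
    return "concurrent"      # neither precedes the other: no order defined
```

Two updates made independently on different replicas come out as "concurrent", and something outside the clocks still has to decide how to reconcile them.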

Why didn't you use Windows Fabric (I believe L...

2015-03-20T23:37:50.399-07:00

Why didn't you use Windows Fabric (I believe Lync and DocumentDB use it)? It also provides high availability and fault tolerance with a Paxos implementation. Or is this a case of Microsoft internal politics? ;)

Thanks for the writeup - now studying all the refe...

2015-03-08T17:25:18.975-07:00

Thanks for the writeup - now studying all the referenced things.

P.S. Code bit rendering looks weird in Chrome/Safari
https://www.dropbox.com/s/4gidr3lc94jjz0f/Screenshot%202015-03-09%2002.20.24.png?dl=0

Thanks for your reply

2014-06-16T16:32:06.670-07:00

Thanks for your reply

If you're using a managed language, you can ob...

2014-06-16T15:57:05.837-07:00

If you're using a managed language, you can observe a failure to execute as an uncaught exception. If your language will take down the process when executing a command (e.g., on an access violation), you might consider a watchdog process and a poison message queue to keep track of the failure to execute. Of course, that comes with costs.

Generally, you pass enough of the transaction through the state machine so that all replicas will pass or fail together - there shouldn't be per-replica execution failures beyond those considered to be replica failures.
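As a rough sketch of "all replicas pass or fail together": the `Replica` class and the operation format below are hypothetical, not taken from the post. Each transaction is staged against a copy of the state and committed only if every operation succeeds, and success depends only on the transaction and the prior state, so every replica that applies the same log reaches the same outcome.

```python
from typing import Dict, List, Tuple

# Hypothetical operation format: (kind, key, value), kind is "insert" or "delete".
Op = Tuple[str, str, str]


class Replica:
    def __init__(self) -> None:
        self.state: Dict[str, str] = {}
        self.results: List[bool] = []   # per-transaction outcome, in log order

    def apply(self, txn: List[Op]) -> bool:
        """Apply a whole transaction atomically; the outcome is deterministic."""
        staged = dict(self.state)       # work on a copy so failure leaves no trace
        ok = True
        for kind, key, value in txn:
            if kind == "insert" and key not in staged:
                staged[key] = value
            elif kind == "delete" and key in staged:
                del staged[key]
            else:
                ok = False              # logical failure: same on every replica
                break
        if ok:
            self.state = staged         # commit only if every operation succeeded
        self.results.append(ok)
        return ok
```

A failed transaction here is just another deterministic result recorded by every replica. Failures that are not deterministic (disk errors, crashes) fall outside this sketch and are treated as replica failures.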

Hi, Thanks for your nice write-up. I was wonderi...

2014-06-16T11:14:55.054-07:00

Hi, Thanks for your nice write-up. I was wondering if you can please share your
thoughts on the following.

Assume a scenario where the state change was successfully added to the log.
However, when it was time to apply the change to the state machine, there was
a persistent failure. Did you run into any situations like this? If so, I would be
interested in your thoughts about how you handled them. If there is a persistent
failure, one option appears to be to exit the cluster and then resync the entire
data set from a surviving node.

I was trying to figure out how to use Paxos if I want to implement replication of
transactions. Let's say we treat each transaction as a single operation from the
Paxos standpoint (a logical operation such as an insert or delete). Each Paxos
operation, when applied to the state machine, could result in multiple operations
on the state machine (insert into a table, update an index, etc.). This transaction
could fail for any number of reasons. Hence, I was trying to figure out how
one would ensure that the state machines are completely identical even in the midst
of persistent failure. The only option I see is that the "faulty" nodes exit the cluster
and resync their entire state. Thanks for your time.

You can truncate the paxos log (containing full tr...

2013-11-01T22:10:22.326-07:00

You can truncate the paxos log (containing full transactions) when you periodically snapshot your state to some external durable store (which you want to do anyway, instead of transmitting and replaying the entire paxos log from the initial position when starting new replicas).
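A minimal sketch of that snapshot-and-truncate idea, with everything held in memory: the `SnapshottingLog` class is hypothetical, and a real system would write the snapshot to an external durable store before dropping any log entries.

```python
from typing import Dict, List, Tuple

# Hypothetical log entry: set `key` to `value`.
Entry = Tuple[str, str]


class SnapshottingLog:
    def __init__(self) -> None:
        self.snapshot: Dict[str, str] = {}   # state as of snapshot_index
        self.snapshot_index = 0              # number of entries folded into the snapshot
        self.entries: List[Entry] = []       # entries at log index >= snapshot_index

    def append(self, entry: Entry) -> None:
        self.entries.append(entry)

    def snapshot_and_truncate(self, state: Dict[str, str], applied_count: int) -> None:
        """Record `state` (reflecting the first `applied_count` entries) and drop them."""
        upper = self.snapshot_index + len(self.entries)
        if not self.snapshot_index <= applied_count <= upper:
            raise ValueError("applied_count is outside the retained log range")
        drop = applied_count - self.snapshot_index
        self.snapshot = dict(state)          # assumption: stands in for a durable write
        self.entries = self.entries[drop:]
        self.snapshot_index = applied_count

    def recover(self) -> Dict[str, str]:
        """Rebuild state for a new replica: load the snapshot, replay the tail."""
        state = dict(self.snapshot)
        for key, value in self.entries:
            state[key] = value
        return state
```

A new replica then needs only the latest snapshot plus the short tail of entries after it, rather than the whole log from the initial position.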

If each entry in the paxos log is a full transacti...

2013-10-02T09:04:36.148-07:00

If each entry in the paxos log is a full transaction, wouldn't said log file end up unwieldy?

Thanks for this post. Very interesting read.

2012-07-05T14:40:01.512-07:00

Thanks for this post. Very interesting read.