For this discussion we are interested in a database system which can provide a availability guarantee for data reads and writing such that a reader must be able obtain the information supplied by the last successful write with some probability n and a writer must be able to record its data to the database with some other probability n. To achieve this we must guarantee that the probability of critical component failure is less then n for either reads or writes. This can be done either by acquiring components which have a failure rate less then n or by by combining components in a redundant configuration such that the probability of all redundant components failing is less then n. This article concerns it self with the second approach and the issues that arise from a system where one redundantly stores data on more then one distinct node in a cluster with to goal of achieving a low probability of failure.
With a data replication approach to availability we are bought the the primary concern of this work, detecting and handling error states which cause data to become inconsistent across nodes replicating data. Inconsistencies in replication create ambiguity in the definition of a successful wright and obfuscate the value of the last wright.
There are three different errors which can endanger consistency of writes:
When voting is used to decide on a consistent version of a datum the client must query replants until a majority of the total number of replicants agree on the value of datum. This value will then be considered the value for the query. This method can correct n/2 - 1 inconsistencies where n is the number of replicas of the data. If more errors have occurred then a consistent read is impossible.
Write failures can largely be ignored in this approach, as they will merely cause one out of n reads to fail which is acceptable as long as it is still possible for n/2 - 1 reads to succeed.
This outlines the minimal requirements for the consistency model. This article will be followed by more with some enhancements to this scheme which improve on it by providing methods for reducing the number of states where data can be inconsistent ( useful for high concurrency datums ), providing methods for repairing datums where state has become inconsistent, and describing optimizations the the performance of a database using this method of replication consistency.
With a data replication approach to availability we are bought the the primary concern of this work, detecting and handling error states which cause data to become inconsistent across nodes replicating data. Inconsistencies in replication create ambiguity in the definition of a successful wright and obfuscate the value of the last wright.
There are three different errors which can endanger consistency of writes:
- Partial write: The write fails on one or more nodes.
- Corrupt write : The data becomes corrupt on one or more nodes.
- Out of order write: Two or more writes are applied to the same datum but in a inconsistent order across two or more nodes.
- Corrupt read : The data becomes corrupt on transmission to the reader.
- Write read race: A reader performs a read while a write is occurring.
When voting is used to decide on a consistent version of a datum the client must query replants until a majority of the total number of replicants agree on the value of datum. This value will then be considered the value for the query. This method can correct n/2 - 1 inconsistencies where n is the number of replicas of the data. If more errors have occurred then a consistent read is impossible.
Write failures can largely be ignored in this approach, as they will merely cause one out of n reads to fail which is acceptable as long as it is still possible for n/2 - 1 reads to succeed.
This outlines the minimal requirements for the consistency model. This article will be followed by more with some enhancements to this scheme which improve on it by providing methods for reducing the number of states where data can be inconsistent ( useful for high concurrency datums ), providing methods for repairing datums where state has become inconsistent, and describing optimizations the the performance of a database using this method of replication consistency.

Comments (2)
I assume this method would still involve value+checksum.
If two nodes had identical corruption you'd have a bad failure situation as clients would have corrupt data with no way to detect failure.
Kevin
Posted by Kevin Burton | December 29, 2007 5:27 AM
Posted on December 29, 2007 05:27
In it's base form no checksum is needed. Adding them is a good idea but not needed. I am working on more posts that extend the base implementation to improve it but it will function as is.
On two corrupt nodes the system can correct n/2 -1 failures. If you want to handle two errors you need to set n to at least 5.
Posted by Jonathan Mooe | December 29, 2007 11:13 AM
Posted on December 29, 2007 11:13