« SAFE Act considered harmful | Main | Vectorized Classic Video Games »

Simple Consistency

For this discussion we are interested in a database system which can provide a availability guarantee for data reads and writing such that a reader must be able obtain the information supplied by the last successful write with some probability n and a writer must be able to record its data to the database with some other probability n. To achieve this we must guarantee that the probability of critical component failure is less then n for either reads or writes. This can be done either by acquiring components which have a failure rate less then n or by by combining components in a redundant configuration such that the probability of all redundant components failing is less then  n. This article concerns it self with the second approach and the issues that arise from a system where one redundantly stores data on more then one distinct node in a cluster with to goal of achieving a low probability of failure.

With a data replication approach to availability we are bought the the primary concern of this work, detecting and handling error states which cause data to become inconsistent across nodes replicating data. Inconsistencies in replication create ambiguity in the definition of a successful wright and obfuscate the value of the last wright.

There are three different errors which can endanger consistency of writes:
  1. Partial write: The write fails on one or more nodes.
  2. Corrupt write : The data becomes corrupt on one or more nodes.
  3. Out of order write: Two or more writes are applied to the same datum but in a inconsistent order across two or more nodes.
Similarly there are a related but distinct set of errors that can occur during reads:
  1. Corrupt read : The data becomes corrupt on transmission to the reader.
  2. Write read race: A reader performs a read while a write is occurring.
All these errors are indifferentiable to the reader and all present the error state where the values read are inconsistent. In the case that any of these errors occur if one were to perform two reads of the same datum but from different replicants it is possible to get different results. Similarly, if one were to perform a read for a single datum across all replicants one would get back a inconsistent set of answers. In response we can either attempt to avoid these error states or we can define write and read success so that it is consistent under all these cases. In this article we chose the later approach and introduce idea of voting which will allow us to correct inconsistencies up to a configurable limit.

When voting is used to decide on a consistent version of a datum the client must query replants until a majority of the total number of replicants agree on the value of datum. This value will then be considered the value for the query.  This method can correct n/2 - 1 inconsistencies where n is the number of replicas of the data. If more errors have occurred then a consistent read is impossible.

Write failures can largely be ignored in this approach, as they will merely cause one out of n reads to fail which is acceptable as long as it is still possible for n/2 - 1 reads to succeed.

This outlines the minimal requirements for the consistency model. This article will be followed by more with some enhancements to this scheme which improve on it by providing methods for reducing the number of states where data can be inconsistent ( useful for high concurrency datums ), providing  methods for repairing datums where state has become inconsistent, and describing  optimizations the the performance of a database using this method of replication consistency. 

TrackBack

TrackBack URL for this entry:
http://0x0000.org/cgi-bin/mt/mt-tb.cgi/22

Comments (2)

I assume this method would still involve value+checksum.

If two nodes had identical corruption you'd have a bad failure situation as clients would have corrupt data with no way to detect failure.

Kevin

In it's base form no checksum is needed. Adding them is a good idea but not needed. I am working on more posts that extend the base implementation to improve it but it will function as is.

On two corrupt nodes the system can correct n/2 -1 failures. If you want to handle two errors you need to set n to at least 5.

Post a comment

(If you haven't left a comment here before, you may need to be approved by the site owner before your comment will appear. Until then, it won't appear on the entry. Thanks for waiting.)

About

This page contains a single entry from the blog posted on December 24, 2007 6:13 PM.

The previous post in this blog was SAFE Act considered harmful.

The next post in this blog is Vectorized Classic Video Games.

Many more can be found on the main index page or by looking through the archives.

Creative Commons License
This weblog is licensed under a Creative Commons License.
Powered by
Movable Type 3.33