The common method of read scaling is to copy read heavy data to more nodes.
This increases the available read throughput of a datum by increasing
the total amount of resources dedicated to serving it. This will always be
necessary when the read load for a single datum surpasses the resources
of a single node (IO, CPU, Network, etc). This method scales well but
has the down side that it is often not efficient with respect to
storage space. This is not so often a issue for data on disk but is a often issue for data cached in memory. The following
example demonstrates the difference:
Let us posit that there two nodes { node1, node2 } and a data set having four datums, { A, B, C, D } with the load distribution <A:40%, B:40%, C:10%, D:10%>. To scale reads one could fully replicate the datums across both nodes, node1:{ A, B, C, D }, node2:{
A, B, C, D }. We would then distribute the query load evenly between the two nodes so they were each handling one half of the total load for the
data. A second option would be to share the load
between the two nodes by partitioning the data across the nodes. One such repartitioning is node1:{ A, C },
node2:{ B, D } resulting in the load distribution of node1:<A:40%, C:10%>, node2:<B:40%, D:10%>. With this partitioning the read load is still evenly distributed across the nodes but only half as much data is stored at each node.
Even though this example appears to assumes that reliable load data is available for each datum it is possible to achieve a even distribution of load by methods such as repartitioning with only node level knowledge. One such method would periodically randomly select a datum from node a with above average load and move it to a node with a below average load. Such a scheme should lead to a balanced cluster over time. ( This scheme will likely be discussed in more detail in a later post. )
The cost of using partitioning as a read scaling method is the added complexity of creating the partitions on the data and maintaining the routing information required to deliver queries to the appropriate node.
Tags: Databases