We have struck a middle ground between strong consistency (such as ACID transactions) with its scalability limitations, and weaker forms of consistency (such as best effort or eventual consistency) with their anomalies. Our approach is timeline consistency: all rep-licas will go through the same timeline of updates, and the order of updates is equivalent to the order in which they were made to the database. This timeline is shown in
Figure 4-1. Thus, the database will converge to the same value at all replicas, and that value will be the latest update made by the application.
T A B L E 4 - 4.The two updates cross during propagation
West coast East coast
Username Status Username Status
Alice Busy Alice Off the phone
T A B L E 4 - 5.Inconsistent replicas
West coast East coast
Username Status Username Status
Alice Off the phone Alice Busy
F I G U R E 4 - 1.Timeline of updates to Alice’s status.
“Busy”
“Off the phone”
Status: none Status: busy Status: off the phone
Record timeline
60 C H A P T E R F O U R
Timeline consistency is implemented by having a master copy where all the updates are made, with the changes later propagated to other copies asynchronously. This master copy serializes the updates and ensures that each update is assigned a sequence number. The order of sequence numbers is the order in which updates should be applied at all replicas, even if there are transient failures or misorderings in the asynchronous propagation of updates. We have chosen to have a master copy per record since many Yahoo! applica-tions rely on a single table in which different records correspond to different users, each with distinct usage patterns. It is possible, of course, to choose other granularities for mas-tership, such as a master per partition (e.g., based on a key) of records.
Even in a single table, different records may have master copies located in different servers. In our example, Alice, who lives on the west coast, has a record that is mastered there, whereas her friend Bob, who lives in Singapore, has his record mastered in the Asian replica. The mas-tership of the record is stored as a metadata field in the record itself, as shown in Table 4-6.
Of course, a master copy seems at odds with our principle that only cheap operations should be done synchronously. If Alice travels to New York and updates her status from there, she must wait for her update operation to be forwarded to the west coast, since her profile record is mastered there; such high-latency cross-continental operations are what we are trying to minimize. Such cross-colo writes do occur occasionally, because of shift-ing usage patterns (e.g., Alice’s travel), but they are rare. We analyzed updates to Yahoo!’s user database and found that 85% of the time, record updates were made to the colo con-taining the master copy. Of course, Alice may move to the east coast or to Europe, and then her writes will no longer be local, as the master copy for her record is still on the west coast. Our system tracks where the updates for a record are originating, and moves mas-tership to reflect such long-standing shifts in access patterns, in order to ensure that most writes continue to be local. (We discuss mastership in more detail in the next section.) When an application reads a record, it typically reads the local replica. Unless that replica is marked as the master copy, it may be stale. The application knows that the record instance is some consistent version from the timeline, but there is no way for the applica-tion to know from the record itself whether it is the most recent version. If the applicaapplica-tion absolutely must have the most recent version, we allow it to request an up-to-date read; this request is forwarded to the master to get the latest copy of the record. An up-to-date read is expensive, but the common case of reading the local (possibly stale) replica is cheap, again in line with our design principles. Luckily, web applications are often tolerant of stale data. If Alice updates her status and her friend Bob does not see the new status right away, it is acceptable, as long as Bob sees the new status shortly thereafter.
T A B L E 4 - 6.Profile table with mastership and version metadata
Username _MASTER _VERSION FullName …
Alice West 32 Alice Smith …
Bob Asia 18 Bob Jones …
Charles East 15 Charles Adams …
…
C L O U D S T O R A G E D E S I G N I N A P N U T S H E L L 61 Another kind of read that the application can perform is a critical read, to make sure that
data only moves forward in time from the user’s perspective. Consider a case where Alice changes her avatar (a picture representing the user). Bob may look at Alice’s profile page (resulting in a read from the database) and see the new avatar. Then, Bob may refresh the page, and due to a network problem, be redirected to a replica that has not yet seen Alice’s avatar update. The result is that Bob will see an older version of the data than the version he just saw. To avoid these anomalies for applications that want to do so, the database returns a version number along with the record for a read call. This version number can be stored in Bob’s session state or in a cookie in his browser. If he refreshes Alice’s profile page, the previously read version number can be sent along with his request, and the database will ensure that a record that is no older than that version is returned. This may require forwarding to the master copy. A read that specifies the version number is called a
“critical read,” and any replica with that version, or a newer version, is an acceptable result. This technique is especially helpful for users that update and then read the data-base. Consider Alice herself: after she updates her avatar, she will become confused if we show her any page with her old avatar. Therefore, when she takes an action that updates the database (like changing her avatar), the application can use the critical read mecha-nism to ensure that we never show her older data.
We also support a test-and-set operation that makes a write conditional upon the read ver-sion being the same as some previously seen verver-sion (whose verver-sion number is passed in as a parameter to the test-and-set request). In terms of conventional database systems, this provides a special case of ACID transactions, limited to a single record, using optimistic concurrency control.
More on mastership
We employ various techniques to ensure that read and write operations go on smoothly and with low latency, even in the presence of workload changes and failures.
For example, as we mentioned earlier, the system implements record-level mastership. If too many writes to the record are originating from a data center other than the current master, the mastership of the record is promptly transferred to that data center, and sub-sequent writes are done locally there. Moreover, transferring mastership is a cheap opera-tion and happens automatically, thereby allowing the system to adapt quickly to workload changes.
We also implement a mechanism that allows reads and writes to continue without inter-ruption, even during storage unit failures. When a storage unit fails, an override is issued (manually or automatically) for that storage unit, signifying that another data center can now accept writes on behalf of the failed storage unit (for records previously mastered at the failed storage unit). We take steps (details omitted here) to ensure that this override is properly sequenced with respect to the updates done at the failed storage unit. This is done to guarantee that timeline consistency is still preserved when the other data center starts accepting updates on behalf of the failed storage unit.
62 C H A P T E R F O U R
In PNUTS, all read and write requests go through a routing layer that directs them to the appropriate copy (possibly the master) of the record. This level of indirection is a key to how we provide uninterrupted system availability. Even when a storage unit has failed and its data is recovered on to another storage unit, or record masters are moved to reflect usage patterns, these changes are transparent to applications, which still continue to con-nect to routers and enjoy uninterrupted system availability, with requests seamlessly routed to the appropriate location.
Supporting ordered data
Our system is architected to support both hash-partitioned and range-partitioned data. We call the hash version of our database YDHT, for Yahoo! Distributed Hash Table, and the ordered version is called YDOT, for Yahoo! Distributed Ordered Table. Most of the system is agnostic to how the data is organized. However, there is one important issue that is sen-sitive to physical data organization. In particular, hash-organized data tends to spread load out among servers very evenly. If data is ordered, portions of the key space that are more popular will cause hotspots. For example, if status updates are ordered by time, the most recent updates will be of most interest to users, and the server with the data partition at the end of the time range will be the most loaded. We cannot allow hotspots to persist without compromising system scale-out.
Logically ordered data is actually stored in partitions of physically contiguous records, but with partitions arranged without regard to order, possibly across physical servers. We can address the hotspot issue by moving partitions dynamically in response to load. If a few hot partitions are on the same server, we can move them to servers that are less loaded.
Moreover, we can also dynamically split partitions, so that the load on a particularly hot single partition can be divided amongst several servers.* This movement and splitting of partitions across storage units is distinct from the mechanism mentioned previously for changing the location of the master copy of a record: in this case, changing the record master affects the latency of updates that originate at a server, but does not in general reduce the cumulative read and write workload on a given partition of records. A particu-lar special case that requires splitting and moving partitions is when we want to update or insert a large number of records. In that case, if we are not careful we can create a sever load imbalance by sending large batches of updates to the same few servers. Thus, it is necessary to understand something about how the updates are distributed in the key space, and if necessary, preemptively split and move partitions to prepare for the upcom-ing onslaught of updates (Silberstein et al. 2008).
We insulate applications from the details of the physical data organization. For single record reads and writes, the use of a routing layer shields applications from the effects of partition movement and splitting. For range scans, we need to provide a further abstraction: imagine
* The observant reader may have noticed that if all updates affect the partition containing the end of the time range, splitting this partition will not solve the problem, and some measure such as sorting by a composite key, e.g., user and time, is required.
C L O U D S T O R A G E D E S I G N I N A P N U T S H E L L 63 that we want to scan all registered users whose age is between 21 and 30. Answering this
query may mean scanning a partition with several thousand records on one server, then a second partition on another server, and so on. Each partition of several thousand records can be scanned quickly, since they are sequentially ordered on disk. We do not want the application to know that we might be moving or splitting partitions behind the scenes. A good way to do this is to extend the iterator concept: when an application is scanning, we return a group of records, and then allow the application to come back when it is ready to ask for the next group. Thus, when the application has completed one batch and has asked for more, we can switch them to a new storage server that has the partition with the next group of records.
Trading off consistency for availability
Timeline consistency handles the common case efficiently and with clean semantics, but it is not perfect. Occasionally, an entire datacenter will go down (e.g., if the power is cut) or become unreachable (e.g., if the network cable is cut), and then any records mastered in that datacenter will become unwriteable. This scenario exposes the known trade-off between consistency, availability, and partition tolerance: only two of those three proper-ties can be guaranteed at all times. Since our database is global, partitions will happen and cannot cause an outage, and thus in reality we only have a choice between consistency and availability. If a datacenter goes offline, possibly with some new updates that have not yet been propagated to other replicas, we can either preserve consistency by disallowing updates until the datacenter comes back, or we can preserve availability by violating time-line consistency and allowing some updates to be applied to a nonmaster record.
Our system gives the application the ability to make this choice on a per-table basis. If the application has chosen availability over consistency for a particular table, and a datacenter goes offline, the system temporarily transfers mastership of any unreachable records in that table. This decision effectively forks the timeline to favor availability. An example is shown in Figure 4-2. After the lost colo is restored, the system automatically reconciles any records that have had conflicting updates, and notifies the application of these con-flicts. The reconciliation ensures that the database converges to the same value every-where, even if the timeline is not preserved. On the other hand, if the application has chosen consistency over availability, mastership is not transferred and the timeline is pre-served, but some writes will fail.
64 C H A P T E R F O U R
For certain operations, this trade-off between consistency and availability can be easier to manage. For example, imagine that an application wants to include polls, where users vote on various questions (like “What is your favorite color?”) and the poll results are stored as counters in our database. Counter operations (like increment) are commutative, and can therefore be applied even to the nonmaster copy without breaking timeline con-sistency. Normally our replication mechanism transfers the new version of the record between replicas, but for commutative operations we would actually have to transmit the operation (e.g., increment). Then, whenever the master received the operation (either during normal operation or after a datacenter failure), it could apply it without worrying about whether it is out of order. The one restriction in this scheme is that we cannot mix commutative and noncommutative operations: setting the value of the counter at any time after the record inserted is forbidden, since we do not know how to properly order an increment and an overwrite of the value.
Another extension to our approach is to allow updates to multiple records. Many web workloads involve updates to a single record at a time, which is why we focused on time-line consistency at a per-record basis. However, it is occasionally desirable to update mul-tiple records. For example, in our social networking application we might have binary friend links: if Alice and Bob are friends, then Alice appears in Bob’s friend list and Bob appears in Alice’s. When Alice and Bob become friends, we thus need to update two records. Because we do not provide ACID transactions, we cannot guarantee this update is atomic. However, we can provide bundled writes: with one call to the database, the applica-tion can request both writes, and the database will ensure that both writes eventually occur. To accomplish this, we log the requested writes, and the system retries the writes until they succeed. This approach preserves per-record timeline consistency, and since the retries can be asynchronous, preserves our performance goals.
In summary, timeline consistency provides a simple semantics for how record updates are propagated, and flexibility in how applications can trade-off read latency for currency.
However, it does not support general ACID transactions—in particular, transactions that read and write multiple records.