One thing you’ll quickly notice is that the semantics to set the replication factor depend on the replication strategy you choose. The replication strategy tells Cassandra exactly how you want replicas to be placed in the cluster.
There are two strategies available:
SimpleStrategy: This strategy is used for single data center deployments. It is fine to use this for testing, development, or simple clusters, but discouraged if you ever intend to expand to multiple data centers (including virtual data centers such as those used to separate analysis workloads).
NetworkTopologyStrategy: This strategy is to be used when you have multiple data centers, or if you think you might have multiple data centers in the future. In other words, you should use this strategy for your production cluster.
SimpleStrategy
As a way of introducing this concept, we’ll start with an example using SimpleStrategy. The following Cassandra Query Language (CQL) block creates a keyspace called AddressBook with three replicas:
CREATE KEYSPACE AddressBook WITH REPLICATION = {
'class' : 'SimpleStrategy', 'replication_factor' : 3 };
You will recall from the previous chapter’s section on token assignment that data is assigned to a node via a hash algorithm, resulting in each node owning a range of data.
Let’s take another look at the placement of our example data on the cluster. Remember the keys are first names, and we determined the hash using the Murmur3 hash algorithm.
The primary replica for each key is assigned to a node based on its hashed value. Each node is responsible for the region of the ring between itself (inclusive) and its predecessor (exclusive).
While using SimpleStrategy, Cassandra will locate the first replica on the owner node (the one determined by the hash algorithm), then walk the ring in a clockwise direction to place each additional replica, as follows:
[Figure: Additional replicas are placed in adjacent nodes when using manually assigned tokens]
In the preceding diagram, the keys in bold represent the primary replicas (the ones placed on the owner nodes), with subsequent replicas placed in adjacent nodes, moving clockwise from the primary.
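The clockwise walk described above can be sketched in a few lines of Python. This is an illustrative model only, not Cassandra's implementation: the function name simple_strategy_replicas, the node names, and the hand-picked tokens are all assumptions made for the example, and the tokens are not real Murmur3 output.

```python
from bisect import bisect_left

def simple_strategy_replicas(ring, key_token, replication_factor):
    """Return the nodes holding replicas of a key under a SimpleStrategy-style walk.

    ring: list of (token, node_name) pairs; each node owns the range between
          its predecessor's token (exclusive) and its own token (inclusive).
    """
    ring = sorted(ring)
    tokens = [token for token, _ in ring]
    # The owner is the first node whose token is >= the key's token,
    # wrapping around to the start of the ring if there is none.
    start = bisect_left(tokens, key_token) % len(ring)
    count = min(replication_factor, len(ring))
    # Walk clockwise from the owner to collect the remaining replicas.
    return [ring[(start + i) % len(ring)][1] for i in range(count)]

# Hypothetical four-node ring with hand-picked tokens.
ring = [(0, "node_a"), (25, "node_b"), (50, "node_c"), (75, "node_d")]

print(simple_strategy_replicas(ring, 30, 3))  # ['node_c', 'node_d', 'node_a']
print(simple_strategy_replicas(ring, 90, 3))  # ['node_a', 'node_b', 'node_c']
```

The second call shows the wrap-around case: a key whose token is larger than every node token is owned by the first node on the ring.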
Although each node owns a set of keys based on its token range(s), there is no concept of a master replica. In Cassandra, unlike many other database designs, every replica is equal.
This means reads and writes can be made to any node that holds a replica of the requested key.
If you have a small cluster where all nodes reside in a single rack inside one data center, SimpleStrategy will do the job. Because there is no need to configure a snitch (which will be covered later in this section), it is a reasonable choice for local installations, development clusters, and other similar simple environments where expansion is unlikely.
For production clusters, however, it is highly recommended that you use NetworkTopologyStrategy instead. This strategy provides a number of important features for more complex installations where availability and performance are paramount.
NetworkTopologyStrategy
When it’s time to deploy your live cluster, NetworkTopologyStrategy offers two additional properties that make it more suitable for this purpose:
Rack awareness: Unlike SimpleStrategy, which places replicas naively, this feature attempts to ensure that replicas are placed in different racks, thus preventing service interruption or data loss due to failures of switches, power, cooling, and other similar events that tend to affect single racks of machines.
Configurable snitches: A snitch helps Cassandra to understand the topology of the cluster. There are a number of snitch options for any type of network configuration.
We’ll cover snitches in detail later in this chapter.
Here’s a basic example of a keyspace using NetworkTopologyStrategy:
CREATE KEYSPACE AddressBook WITH REPLICATION = {
'class' : 'NetworkTopologyStrategy',
'dc1' : 3, 'dc2' : 2 };
In this example, we’re telling Cassandra to place three replicas in a data center called dc1 and two replicas in a second data center called dc2. We’ll spend more time discussing data centers in Chapter 4, Data Centers, but for now it is sufficient to point out that the data center names must match those configured in the snitch.
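To make the combination of per-data-center replica counts and rack awareness more concrete, here is a simplified Python sketch. It is an approximation for illustration rather than Cassandra's actual placement code: the function network_topology_replicas, the node and rack names, and the tokens are hypothetical, and the rule used (walk the ring, prefer racks not yet used, then fall back to skipped nodes) is a deliberately reduced version of the real behavior.

```python
def network_topology_replicas(ring, key_token, replication):
    """Pick replicas per data center, preferring distinct racks.

    ring: list of (token, node, dc, rack) tuples.
    replication: {dc_name: replica_count}, as in the keyspace definition.
    """
    ring = sorted(ring)
    # Owner position: first token >= the key's token, wrapping to index 0.
    start = next((i for i, entry in enumerate(ring) if entry[0] >= key_token), 0)
    walk = [ring[(start + i) % len(ring)] for i in range(len(ring))]

    placement = {}
    for dc, wanted in replication.items():
        chosen, skipped, racks_used = [], [], set()
        for _, node, node_dc, rack in walk:
            if node_dc != dc:
                continue  # only nodes in this data center are candidates
            if len(chosen) == wanted:
                break
            if rack in racks_used:
                skipped.append(node)  # rack already holds a replica; keep in reserve
            else:
                chosen.append(node)
                racks_used.add(rack)
        # If there were not enough distinct racks, fall back to skipped nodes.
        chosen.extend(skipped[: wanted - len(chosen)])
        placement[dc] = chosen
    return placement

# Hypothetical two-data-center ring; names match the keyspace example.
ring = [
    (0, "a1", "dc1", "rack1"), (10, "b1", "dc2", "rack1"),
    (20, "a2", "dc1", "rack1"), (30, "b2", "dc2", "rack1"),
    (40, "a3", "dc1", "rack2"), (50, "b3", "dc2", "rack2"),
    (60, "a4", "dc1", "rack3"),
]

placement = network_topology_replicas(ring, 65, {"dc1": 3, "dc2": 2})
print(placement)  # {'dc1': ['a1', 'a3', 'a4'], 'dc2': ['b1', 'b3']}
```

Note how a2 and b2 are passed over: each shares a rack with a replica already chosen in its data center, so the walk continues until a node in a different rack is found.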
Snitches
As discussed earlier, Cassandra is able to intelligently place replicas across the cluster if you provide it with enough information about your topology. You give this insight to Cassandra through a snitch, which is set using the endpoint_snitch property in cassandra.yaml. The snitch is also used to help Cassandra route client requests to the closest nodes to reduce network latency.
As of version 2.0, there are eight available snitch options (and you can write your own as well):
SimpleSnitch: This snitch is a companion to the SimpleStrategy replication strategy. It is designed for simple single data center configurations.
RackInferringSnitch: As the name implies, this snitch attempts to infer your network topology. Using this snitch is discouraged because it assumes that your IP addressing scheme reflects your data center and rack configuration. For this to work properly, your addresses must follow a fixed convention: the second octet identifies the data center and the third octet identifies the rack.
PropertyFileSnitch: Using this snitch allows the administrator to define which nodes belong in certain racks and data centers. You can configure this using cassandra-topology.properties. Each node in the cluster must be configured identically. You should generally prefer GossipingPropertyFileSnitch because it handles the addition or removal of nodes without the need to update every node’s properties file.
GossipingPropertyFileSnitch: Unlike PropertyFileSnitch, where the entire topology must be defined on every node, this snitch allows you to configure each node with its own rack and data center, and then Cassandra gossips this information to the other nodes.
CloudstackSnitch: This snitch sets data centers and racks using Cloudstack’s country, location, and availability zone.
GoogleCloudSnitch: For Google Cloud deployments, this snitch automatically sets the region as the data center and the availability zone as the rack.
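To see why RackInferringSnitch ties your topology so tightly to your addressing scheme, consider this small Python sketch of the inference it relies on: the second octet of a node's IPv4 address is read as the data center and the third octet as the rack. The helper infer_topology and the sample addresses are hypothetical and only model the idea; they are not part of Cassandra.

```python
import ipaddress

def infer_topology(address):
    """Infer (data_center, rack) from an IPv4 address, RackInferringSnitch-style."""
    # IPv4Address validates the input and raises ValueError if it is malformed.
    octets = str(ipaddress.IPv4Address(address)).split(".")
    # Second octet is taken as the data center, third octet as the rack.
    return octets[1], octets[2]

for address in ["10.1.7.21", "10.1.8.34", "10.2.7.5"]:
    data_center, rack = infer_topology(address)
    print(address, "-> data center", data_center, "rack", rack)
```

Under this convention, the first two addresses land in the same data center but different racks, while the third lands in a different data center. Any node whose address does not follow the scheme would be silently misplaced, which is one reason an explicitly configured snitch is usually the safer choice.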
For production installations, it is almost always best to choose GossipingPropertyFileSnitch in physical data center environments and the appropriate cloud snitch in cloud environments.
Since much of the configuration related to snitches pertains to the topology of our data center, we will save our detailed treatment of this topic for Chapter 4, Data Centers, which will cover Cassandra’s multiple data center features in detail.