What is Sharding?
Database systems with large data sets or high-throughput applications can challenge the capacity of a single server. For example, high query rates can exhaust the CPU capacity of the server; likewise, working sets larger than a system's RAM stress the I/O capacity of its disks.
There are two methods for addressing system growth: vertical and horizontal scaling.
Vertical Scaling (scale up) involves increasing the capacity of a single server, such as using a more powerful CPU, adding more RAM, or increasing the amount of storage space. Limitations in available hardware and technology may restrict a single machine from being sufficiently powerful for a given workload. Additionally, cloud-based providers have hard limits for available hardware configurations. As a result, there is a practical maximum for vertical scaling.
Horizontal Scaling (scale out) involves dividing the system dataset and load over multiple servers, adding servers to increase capacity as required. While the overall speed or capacity of a single machine may not be high, each machine handles a subset of the overall workload, which, depending on the workload, can be more efficient. It may also be cheaper to add servers as needed than to purchase high-end hardware for a single server. The trade-off is increased management complexity.
MongoDB supports horizontal scaling through sharding.
Sharding is a method to distribute data across multiple machines. MongoDB uses sharding to support deployments with very large data sets and high throughput operations.
Sharded Cluster Components
A MongoDB sharded cluster consists of the following components:
- Shard – each shard contains a subset of the sharded data. These are the MongoDB instances that actually store collection data. Shards can be deployed as standalone instances or as replica sets.
- Mongos – the mongos acts as a query router, providing an interface between client applications and the sharded cluster. These servers are individual instances that do not store data locally; instead, they route operations to individual shards as needed, using metadata cached from the config servers (see the connection sketch after this list).
- Config servers – config servers store metadata and configuration settings for the cluster. In essence, they track which shards contain which parts of each sharded collection.
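As a minimal sketch, the snippet below uses pymongo to connect through a mongos router; the address, database name, and collection name are placeholder assumptions, not part of the original text.

```python
from pymongo import MongoClient

# Applications connect to mongos, never to an individual shard.
# "localhost:27017" is a placeholder for a real mongos address.
client = MongoClient("mongodb://localhost:27017")

db = client["appdb"]    # hypothetical database
users = db["users"]     # hypothetical collection

# The query is identical to one against an unsharded deployment;
# mongos routes it to whichever shard(s) hold matching documents.
doc = users.find_one({"state": "Alabama"})
```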
The following image demonstrates the interaction of components within a sharded cluster:
MongoDB stores data as documents in a binary representation called BSON (Binary JSON). Documents that share a similar structure are typically organized as collections. You can think of collections as being analogous to a table in a relational database: documents are similar to rows, and fields are similar to columns.
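For illustration only, here is what such a document might look like expressed as a Python dictionary; the field names are assumptions chosen to match the examples below.

```python
# A document is a set of field/value pairs: documents play the role of
# rows, fields the role of columns. This example document is hypothetical.
user = {
    "_id": 1,                  # every document carries a unique _id
    "name": "Ada Lovelace",    # stored on disk as BSON, not text JSON
    "state": "California",
}
```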
When a MongoDB collection is sharded, data is distributed based on an identifying property called a shard key. A shard key is a field that exists in every document in the collection and can be used to organize those documents, for example by range. If you wanted to organize users based on the states in which they live, you could begin by assigning all users whose state starts with the letters A–C to shard 1 and all users whose state starts with the letters D–F to shard 2.
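A hedged sketch of how such a collection might be sharded by range on that field, using pymongo to issue the standard admin commands; the mongos address and the appdb.users namespace are assumptions carried over from the earlier example.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # a mongos, not a shard

# enableSharding and shardCollection are standard admin commands;
# {"state": 1} declares "state" as a range-based shard key.
client.admin.command("enableSharding", "appdb")
client.admin.command("shardCollection", "appdb.users", key={"state": 1})
```

Once the collection is sharded, the balancer splits and migrates ranges (chunks) of shard key values across the available shards.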
Sharding Policies
Sharding is automatic and built into the database. This means that developers don’t face the added complexity of building sharding logic into their application code. Operations teams won’t need to deploy additional clustering software to manage process and data distribution.
Multiple sharding policies are available that enable the distribution of data across a cluster according to query patterns or data locality (a command sketch follows the list):
- Range sharding – documents are partitioned across shards according to the shard key value. Documents with shard key values close to one another are likely to be co-located on the same shard. This approach is well suited for applications that need to optimize range based queries.
- Hash sharding – documents are distributed according to an MD5 hash of the shard key value. This approach yields an even distribution of writes across shards, but is less suitable for range-based queries.
- Zone sharding – provides the ability for DBAs and operations teams to define specific rules governing data placement in a sharded cluster. Zones accommodate a variety of deployment scenarios – for example, locating data by geographic region or by hardware configuration for tiered storage architectures. Data placement rules can be continuously refined by modifying shard key ranges, and MongoDB will automatically migrate the data to its new zone.
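As a rough sketch under the same assumptions as before (a mongos at localhost, hypothetical namespaces), the commands below contrast a hashed shard key with a zone definition; the shard name "shard0000" and the zone "US-EAST" are placeholders.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed mongos address

client.admin.command("enableSharding", "appdb")

# Hash sharding: the literal string "hashed" selects a hashed shard key.
client.admin.command("shardCollection", "appdb.events", key={"user_id": "hashed"})

# Zone sharding: tag a shard with a zone, then pin a shard key range to it.
client.admin.command("addShardToZone", "shard0000", zone="US-EAST")
client.admin.command(
    "updateZoneKeyRange",
    "appdb.users",
    min={"state": "A"},
    max={"state": "D"},
    zone="US-EAST",
)
```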
Why Shard?
Sharding provides several advantages, a few of which are listed here:
Reads / Writes
MongoDB distributes the read and write workload across the shards in the sharded cluster, allowing each shard to process a subset of cluster operations. Both read and write workloads can be scaled horizontally across the cluster by adding more shards.
Storage Capacity
Sharding distributes data across the shards in the cluster, allowing each shard to contain a subset of the total cluster data. As the data set grows, additional shards increase the storage capacity of the cluster.
High Availability
A sharded cluster can continue to perform partial read / write operations even if one or more shards are unavailable. While the subset of data on the unavailable shards cannot be accessed during the downtime, reads or writes directed at the available shards can still succeed.
In production environments, individual shards should be deployed as replica sets, providing increased redundancy and availability.
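A brief sketch of registering a replica set as a shard; the replica set name "rs0" and the host names are illustrative, not prescribed by the original.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed mongos address

# addShard takes a replica set seed list in "<setName>/<host:port>,..." form.
client.admin.command("addShard", "rs0/db1.example.net:27017,db2.example.net:27017")

# listShards reports the shards the cluster currently knows about.
print(client.admin.command("listShards"))
```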