Let’s say an application is popular and have heavy traffic from across the world. This leads to a situation where your database gets loaded with thousands of new records every day. Now the questions arises about managing a data of users and for the users. Database Sharding is the best way to proceed with in this scenario.
Sharding is partitioning of data. Sharding is a method to split a huge dataset into multiple databases through specific technique of partitioning. In general, sharding can be done in two ways Vertical & Horizontal.
Vertical sharding : A technique of storing a data by into multiple sets by diving properties. Let’s say we have user data with around 25 different properties into single table/dataset. We can divide these properties to multiple tables/sets like separate tables/set for personal information, login credentials, health details, biographic details etc.
Horizontal sharding divides the data based on rows into multiple tables or location. i.e., We have 4 million users’ data. We have applied sharding and divided data into 4 different sets of 1 million users based on let’s say regions.
Shard or Partition Key is a portion of primary key which help us to plan how the data can be distributed. Plan on how to work with read-writes over the data.
Each database type has a different way to handle sharding process but it can be divided into two types in general scenarios. A logical shard is a technique to store the data with same partition keys shared between multiple data. A physical shard is something with multiple logical shards. To know more about what MciroSoft says about this please Click Here
It takes data as input and apply hash function over it to generate hash output and based on the hash it stores that record to appropriate shard. As per my knowledge generally modulus operator is more popular in taking as subset of data as input and generate number of shards.
As we would be having partition key along with us. Reads will be performed over single database. In absence of partition key , then every database needs to be be searched to get the records. Generally, this kind of sharding is suitable for key-value databases as sharding function uniformly distributes data according the logic specified in function.
In dynamic sharding, there will be a locator service which interacts for each read-write entries to locate the right database. As per different partition keys available, locator service will handle the queries for read and write efficiently. Same as algorithmic sharding, here all queries will have the partition key available. Queries without that have to search the records in all databases.
It is not as uniformed as algorithmic one, has multiple challenges in implementation. Here locator service is point where all queries comes first, so if that fails, entire structure fails. In case of routing failures it is hard to identify those records in database and is a time consuming process.
Here, another concept has also become popular called entity groups which means storing all kind of data for single user with same partition key. This will allow us to read the things very easily and efficiently where as cross partitioned queries might take place but frequency of the same would be lower so it would give better results.
As partitioned databases with partition key will be specified with each query. Query is dedicated to single database only. Results into a quick and optimized response.
The core of Data Shrading is distributed architecture. If any of the partition failed, only specific partitioned data will be impacted, rest will work perfectly. This can be resolved through replica or database backups of that partition only.
Appropriate strategy to handle sharding process should be selected. Changing strategy mid way will be too complex.
Complex Problem with Simple Solution is not what implement sharding is all about. As database and application architecture interaction will become more complicated.
The level of Complexity in Database Sharding is high. Reverting back to non-sharding architecture after implementation will be for sure not easy.
Microsoft also has given few pointers to be taken care, the same can be checked over here
Database Sharding definitely helps when the large amount of data are to be managed. Needs concrete efforts in devising a sharding strategy and implementation. It is much complex to manage data with sharding. I recommend to go for Database sharding process only when required and Opt for horizontal sharding.
If you’re planning for the Effectively managing your data, get in touch with Ansi ByteCode LLP team.
Author : Naishadh R. Patel