StorReduce does deduplication by breaking incoming data into blocks and storing only the blocks it hasn’t seen before. We identify unique blocks by comparing the hash of each new block with the hashes of all existing stored blocks, so it’s important that we never have a hash collision. A hash collision could cause us to throw away data that was actually unique, and it goes without saying that we never want this to happen for any workload.
The answer is to store enough bits of the hash to ensure that there is a vanishingly small probability of a collision, for any conceivable workload.
Too much hash isn’t necessarily good either
The size of the stored hash for each block is a significant contributor to the size of the StorReduce index data, so we don’t want to store more bits of hash than we need.
We use the SHA-2 hash algorithm, which is cryptographically secure and therefore gives an excellent distribution of hashes. Specifically, we use SHA-224, which provides us with 224 bits (28 bytes) of hash, and we then throw away some of that to reduce the size of our index data.
We hash each block of the original data before deduplication and compression and we aim for a block size of 8KB (kilobytes) in normal cases.
After performing calculations for various hash sizes we’ve settled on using 18 bytes of hash - see the calculations below.
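In code terms, the hash-then-truncate step can be sketched as follows. This is a minimal illustration using Python’s standard `hashlib`; the function name and structure are ours, not StorReduce’s actual implementation:

```python
import hashlib

HASH_BYTES = 18  # 144 bits of hash kept per stored block

def block_key(block: bytes) -> bytes:
    """Illustrative sketch: compute the SHA-224 digest of a block
    (28 bytes) and keep only the first 18 bytes as the index key."""
    return hashlib.sha224(block).digest()[:HASH_BYTES]

key = block_key(b"example block contents")
assert len(key) == HASH_BYTES
```

The same input block always produces the same truncated key, which is what lets two identical blocks deduplicate against each other.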
The definitive article about how to calculate the probability of hash collisions is the Birthday Attack article on Wikipedia. Also, here’s another very readable article on Hash Collision Probabilities with a great explanation of the maths.
We’ve done some calculations with a Python script that uses the decimal library for high-precision arithmetic, and we’ve independently checked the calculations using the approximation formula from the articles mentioned above. Here are the results:
| |Scenario 1|Scenario 2|
|---|---|---|
|Amount of raw data|50 petabytes (5*10^16 bytes)|50 petabytes (5*10^16 bytes)|
|Average block size|8 kilobytes (expected size)|100 bytes (unrealistically small)|
|Number of blocks / hashes (k)|6,100,000,000,000 (6.1*10^12)|500,000,000,000,000 (5*10^14)|
|Size of hash used|18 bytes (144 bits)|18 bytes (144 bits)|
|Number of unique hash values (N)|22,300,... 39 more zeroes (2.23*10^43 values, or 2^144)|22,300,... 39 more zeroes (2.23*10^43 values, or 2^144)|
|Probability of collision anywhere in the data (using approximation formula: k^2 / 2N)|less than one in 1,000,000,000,000,000,000|about one in 200,000,000,000,000|
So for normal workloads with 50PB of data (scenario 1), the chance of a collision is very, very, very close to zero. Even with a million 50PB customers the chances are still only one in a trillion that any of them would ever experience an error due to a hash collision.
For this reason we decided that 18 byte hashes are a suitable size. In reality we could have gone smaller and still been safe, but we wanted to be conservative.
Workloads with lots of small files
If an object (file) being deduplicated is less than 8KB, then the data block we are hashing will consist of the entire file and will be less than 8KB. This means more blocks for the same volume of data, so we should consider the impact of an average block size smaller than 8KB. Although a workload consisting of tens of petabytes of very small files seems hard to imagine, technically it would be possible, so we’ve made sure we’re still OK in that circumstance.
In Scenario 2 in the table, even if we reduce the average block size down to 100 bytes and leave the data at 50PB (that’s a lot of 100 byte files!), the probability of a hash collision for 18-byte hashes is still only one in 200 trillion for any given customer. Just for comparison, you are likely to die in a shark attack long before that happens (odds 1 in 300 million!).
Effect of deduplication ratio
The better the deduplication ratio, the fewer blocks we need to store and the fewer hashes we use. For these calculations we’re being hyper-conservative and assuming no deduplication (maximum number of hashes). This isn’t realistic: in practice the number of hashes will be smaller, since customers wouldn’t be using StorReduce if they weren’t getting any deduplication benefit.
Effect of a hash collision
In the vanishingly unlikely event of a hash collision, when StorReduce later reassembled the file, the server would notice that the overall checksum for the original file didn’t match. This would force the StorReduce server to return an error rather than the wrong data. The mistake is always noticed, so incorrect data is never returned.
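The safety net can be sketched like this. It’s an illustration only: the checksum algorithm (SHA-256 here) and function names are our assumptions, not StorReduce’s actual code:

```python
import hashlib

def reassemble(blocks, stored_checksum):
    """Illustrative sketch: rebuild a file from its blocks and verify
    it against the whole-file checksum recorded at write time.
    On a mismatch, raise rather than return wrong data."""
    data = b"".join(blocks)
    if hashlib.sha256(data).hexdigest() != stored_checksum:
        raise IOError("checksum mismatch: refusing to return corrupt data")
    return data
```

Because the whole-file checksum is computed over the original bytes, any block substituted by a hash collision would change the reassembled bytes and trip this check.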
StorReduce uses 18-byte hashes to determine the uniqueness of the 8-kilobyte blocks it stores, providing enough uniqueness and safety for any conceivable workload.