One of StorReduce’s claims to fame is its ability to sustain throughput of up to 1,100 megabytes per second for both reads and writes on a single StorReduce server. We occasionally get some raised eyebrows when we tell people this; it’s an impressive number and it does break new ground. It’s also significant in a business sense. It means that we can cut a large on-premises-to-cloud migration (petabytes) from years to weeks, that much larger volumes of data can be migrated to the cloud than ever before, and that StorReduce’s inline deduplication is fast enough for big data companies that want to run Hadoop or search in real time.
But what does it really take to achieve this sort of throughput?
We recently listed StorReduce on Amazon’s AWS marketplace. As part of that process I performed some throughput testing on the AMI virtual machine image that’s available on the marketplace. I was able to fairly closely replicate our previous throughput numbers, achieving 1,100 MB/s for writes, but along the way I found some things that can make a big difference to the speed and throughput.
How StorReduce works
StorReduce sits in front of an existing cloud storage mechanism like Amazon S3, looking for data that it’s seen before. Only new ‘unique’ blocks are stored to the underlying cloud storage.
To make this work, StorReduce needs to compare incoming data to all previously uploaded data in real-time, and this needs to work even when petabytes of data have been uploaded. StorReduce keeps an index of each block of data it has already seen, and this index needs fast random access.
As a side note, it’s easy to forget just how much data a petabyte is - it’s a million gigabytes. Comparing incoming data to petabytes of existing data in real time isn’t easy!
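To make the idea in this section concrete, here is a minimal sketch of inline deduplication in Python. It is not StorReduce’s actual algorithm: the fixed 4 KiB block size, the SHA-256 fingerprints, the in-memory set used as the index and the dictionary standing in for cloud storage are all assumptions chosen for illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical fixed block size, for illustration only


class DedupStore:
    """Toy inline deduplicator sitting in front of a 'backend' dictionary."""

    def __init__(self):
        self.index = set()     # fingerprints of every block seen so far
        self.backend = {}      # stands in for cloud storage such as S3
        self.bytes_in = 0      # bytes received from clients
        self.bytes_stored = 0  # bytes actually written to the backend

    def write(self, data: bytes) -> list:
        """Store data and return the list of block fingerprints needed to read it back."""
        recipe = []
        for start in range(0, len(data), BLOCK_SIZE):
            block = data[start:start + BLOCK_SIZE]
            fingerprint = hashlib.sha256(block).hexdigest()
            self.bytes_in += len(block)
            if fingerprint not in self.index:  # the fast random-access lookup
                self.index.add(fingerprint)
                self.backend[fingerprint] = block
                self.bytes_stored += len(block)
            recipe.append(fingerprint)
        return recipe

    def read(self, recipe: list) -> bytes:
        """Reassemble the original data from its block fingerprints."""
        return b"".join(self.backend[fp] for fp in recipe)


if __name__ == "__main__":
    store = DedupStore()
    # Four distinct blocks, uploaded twice.
    payload = b"".join(bytes([i]) * BLOCK_SIZE for i in range(4))
    store.write(payload)
    recipe = store.write(payload)
    print(store.bytes_in, store.bytes_stored)
    assert store.read(recipe) == payload
```

The second upload adds nothing to the backend because every block is already in the index. The sketch also shows why the index needs fast random access: every incoming block triggers one lookup, regardless of how much data has been stored before.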
Deduplication Ratio is the Key
The deduplication ratio measures how well your data deduplicates by comparing the data size before and after deduplication. Higher deduplication ratios make a huge difference - not only do you require less storage, but StorReduce processes data faster because duplicate data blocks don’t need to be written to cloud storage.
The deduplication ratio is directly related to the proportion of data that the server has seen before, and so it’s important to get the best possible ratio.
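As a small worked example, the two usual ways of expressing this measurement can be computed as follows. The article does not say which convention StorReduce reports, so both function names and the "ratio" versus "savings" split are assumptions made for illustration.

```python
def dedup_ratio(bytes_before: int, bytes_after: int) -> float:
    """Size before deduplication divided by size after, e.g. 20.0 for '20:1'."""
    if bytes_after <= 0:
        raise ValueError("bytes_after must be positive")
    return bytes_before / bytes_after


def dedup_savings(bytes_before: int, bytes_after: int) -> float:
    """Fraction of the incoming data that did not need to be stored."""
    if bytes_before <= 0:
        raise ValueError("bytes_before must be positive")
    return 1 - bytes_after / bytes_before


if __name__ == "__main__":
    TB = 10 ** 12
    # Hypothetical figures: 100 TB uploaded, 3 TB of unique blocks stored.
    print(f"ratio   {dedup_ratio(100 * TB, 3 * TB):.1f}:1")
    print(f"savings {dedup_savings(100 * TB, 3 * TB):.0%}")
```

Under this definition, "97% deduplication" and a ratio of roughly 33:1 describe the same data set.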
Top 8 Tips for Great Throughput
Tip #1: Don’t feed in compressed or encrypted data
Compression and/or encryption make the data look random, so there aren’t any duplicate blocks to find! Be sure to decrypt and decompress data prior to uploading it through StorReduce. (StorReduce uses transport layer encryption and policy-based access control to ensure your data is safe.)
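The effect is easy to reproduce. The sketch below builds two versions of a made-up file that differ only in the first line, then counts how many fixed-size blocks the two versions share, first as raw data and then after compressing each version with zlib. The 4 KiB block size, the file contents and the use of zlib are assumptions for illustration; they are not taken from StorReduce.

```python
import hashlib
import zlib

BLOCK = 4096  # hypothetical fixed block size for the comparison


def fingerprints(data: bytes) -> set:
    """Return the set of SHA-256 fingerprints of each fixed-size block."""
    return {
        hashlib.sha256(data[i:i + BLOCK]).digest()
        for i in range(0, len(data), BLOCK)
    }


def shared_blocks(a: bytes, b: bytes) -> int:
    """Count distinct blocks that appear in both inputs."""
    return len(fingerprints(a) & fingerprints(b))


# Two versions of a made-up 480,000-byte file; only the first line differs.
lines = [f"record {i:06d} value {i * 7919 % 1000:03d}\n".encode() for i in range(20000)]
v1 = b"".join(lines)
v2 = b"".join([b"RECORD 000000 value 999\n"] + lines[1:])

raw_shared = shared_blocks(v1, v2)
compressed_shared = shared_blocks(zlib.compress(v1), zlib.compress(v2))

if __name__ == "__main__":
    print("blocks shared between raw versions:       ", raw_shared)
    print("blocks shared between compressed versions:", compressed_shared)
```

Uncompressed, all but the first block are shared. Once each version is compressed separately, the small change near the start alters the compressed stream that follows, so far fewer blocks (often none) line up between the two versions.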
Tip #2: Use representative test data
Picking data that is representative of your complete data set is vital. In particular, don’t pick a small amount of sample data that has no duplicates.
Backups are a good example of data that has a lot of duplicate information. For full daily backups, after the initial upload we can expect an average daily delta (change rate) of 1% or less. Backup services will commonly see only 0.3% delta each day. Over time the proportion of uploaded data that actually needs to be stored approaches this delta rate, so 97% or more deduplication on full daily backups is realistic. Ensure your test data includes some sequential backups rather than a few randomly selected tapes or drives.
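The arithmetic behind the backup figures can be sketched as follows. This is a simplified model, not a measurement: it assumes the first full backup is stored in its entirety, every later backup stores only its daily delta, and unchanged data deduplicates perfectly. The function names are hypothetical.

```python
def savings_after(days: int, daily_delta: float) -> float:
    """Fraction of uploaded data not stored after `days` full daily backups.

    Sizes are measured in units of one full backup.
    """
    if days < 1:
        raise ValueError("days must be at least 1")
    uploaded = days                         # one full backup per day
    stored = 1 + (days - 1) * daily_delta   # first backup plus daily deltas
    return 1 - stored / uploaded


def days_to_reach(target: float, daily_delta: float) -> int:
    """Number of daily backups needed before savings reach `target`."""
    if target >= 1 - daily_delta:
        raise ValueError("target is at or above the long-run limit of 1 - delta")
    days = 1
    while savings_after(days, daily_delta) < target:
        days += 1
    return days


if __name__ == "__main__":
    for delta in (0.01, 0.003):
        print(f"delta {delta:.1%}: 97% reached after "
              f"{days_to_reach(0.97, delta)} daily backups, "
              f"long-run limit {1 - delta:.1%}")
```

In this model a 1% daily delta reaches 97% after 50 daily backups and a 0.3% delta after 37, which is why a handful of unrelated tapes or a single backup will understate the deduplication you would see in production.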
For Big Data, if you regularly make copies of data for developers or testing, ensure your test data includes some of those copies.
Tip #3: Use multiple clients in parallel
One of the key factors in getting good throughput for the larger EC2 instances is to have enough concurrent clients to keep the server fed with upload data.
For clients that upload multiple file parts in parallel you may need far fewer instances. My testing showed that a single StorReduce server could cope with a large number of simultaneous clients, and that running more clients increased the total throughput.
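The effect of client concurrency can be illustrated with a small sketch. It uses threads inside one process and a fake upload function that simply sleeps, so it only models latency-bound uploads; the testing described here used separate client processes against a real server. The names upload_all and fake_upload are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def upload_all(items, upload_one, workers: int) -> list:
    """Call upload_one(item) for every item using `workers` concurrent threads.

    Results are returned in the same order as `items`.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(upload_one, items))


def fake_upload(name: str):
    """Stand-in for a real upload request; sleeps to imitate network latency."""
    time.sleep(0.05)
    return name, "ok"


def timed(workers: int, items) -> float:
    """Return the wall-clock seconds taken to upload all items."""
    start = time.perf_counter()
    upload_all(items, fake_upload, workers)
    return time.perf_counter() - start


if __name__ == "__main__":
    files = [f"backup-{n}.tar" for n in range(8)]
    print(f"1 client : {timed(1, files):.2f}s")
    print(f"8 clients: {timed(8, files):.2f}s")
```

With a real client, upload_one would wrap whatever upload call your tool provides. The point of the sketch is only that a single sequential client leaves the server idle while each request is in flight.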
Tip #4: Use a large part size for multipart uploads
If your client uses multipart uploads then configure the client to use a relatively large part size, e.g. 100MB rather than the 5MB minimum. This reduces the overhead of making TCP connections, which take time to set up and to ramp up to full speed. Larger part sizes made a big difference to both speed and total throughput in testing.
Using single-part uploads is another option; these are inherently faster than multipart uploads since they don’t require the ‘setup’ and ‘complete’ operations of a multipart upload. However, for files larger than 100MB to 200MB, single-part uploads may be less reliable.
For my testing I used multi-part uploads for everything, but with a 100MB part size. (Note that the default of 5MB makes sense for WANs, but 100MB or more is good for LANs.)
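A quick calculation shows how much the part size changes the number of requests a client has to make. The sketch assumes decimal megabytes and counts one ‘setup’ request, one request per part and one ‘complete’ request for each multipart upload; the function name is hypothetical.

```python
MB = 1000 * 1000  # assumption: decimal megabytes


def request_count(file_size: int, part_size: int) -> int:
    """Requests needed for one multipart upload: setup + parts + complete."""
    if file_size <= 0 or part_size <= 0:
        raise ValueError("sizes must be positive")
    parts = -(-file_size // part_size)  # ceiling division without floats
    return parts + 2


if __name__ == "__main__":
    ten_gb = 10_000 * MB
    for part_mb in (5, 100):
        print(f"{part_mb:>3} MB parts: {request_count(ten_gb, part_mb * MB)} requests")
```

For a 10 GB file this gives 2,002 requests with 5MB parts against 102 requests with 100MB parts, so the per-request connection and ramp-up cost is paid about twenty times less often.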
Tip #5: Make sure your network can keep up
For my testing I used two separate EC2 instances: one for the clients and one for the StorReduce server. I placed both instances in the same availability zone and subnet so that network communication would be fast.
Also note that the larger EC2 instances have much better network bandwidth than the smaller instances - I did my testing on a c3.8xlarge server with a 10Gb/s network connection.
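To see why the network matters, it helps to convert link speed into the same units as the throughput figure. The sketch below assumes decimal units (1 Gb/s = 1,000 megabits per second, 8 bits per byte) and ignores protocol overhead, so real-world capacity will be somewhat lower than the numbers it prints.

```python
def link_capacity_mb_per_s(gigabits_per_s: float) -> float:
    """Theoretical link capacity in megabytes per second (decimal units)."""
    return gigabits_per_s * 1000 / 8


def utilisation(throughput_mb_per_s: float, gigabits_per_s: float) -> float:
    """Fraction of the theoretical link capacity a given throughput uses."""
    return throughput_mb_per_s / link_capacity_mb_per_s(gigabits_per_s)


if __name__ == "__main__":
    for gbps in (1, 10):
        print(f"{gbps:>2} Gb/s link: {link_capacity_mb_per_s(gbps):.0f} MB/s, "
              f"1,100 MB/s would need {utilisation(1100, gbps):.0%} of it")
```

A 10Gb/s link tops out at about 1,250 MB/s, so 1,100 MB/s is already close to line rate, and a 1Gb/s link (about 125 MB/s) cannot get near it no matter how the server is tuned.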
Tip #6: Keep the index on SSD storage
SSDs are much, much faster than hard disks so the index data should be kept on SSD. AWS makes this easy by providing SSD storage for some types of EC2 instance (the c3 and i2 series are particularly good for StorReduce).
The StorReduce AMI on AWS Marketplace automatically ensures the index data is stored on SSD.
Tip #7: Use RAID-0 on your index storage
If you have more than one SSD, set them up as a single volume using RAID-0. This means all the SSDs are used at once, increasing the total reads and writes per second. For example an AWS i2.8xlarge instance has 8 SSDs and can sustain 8 times the IO operations per second if configured using RAID-0.
The StorReduce AMI on AWS Marketplace automatically configures all available SSDs as a single RAID-0 volume.
(Note that we don’t need higher RAID levels to protect the index data, since the entire index can be rebuilt automatically from cloud storage.)
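The reason RAID-0 helps can be seen from how striping maps logical blocks to disks. The following is a simplified model of striping, not a description of how the AMI or any particular RAID implementation lays out data; the function name and the stripe size parameter are assumptions for illustration.

```python
from collections import Counter


def raid0_locate(logical_block: int, num_disks: int, stripe_blocks: int = 1):
    """Map a logical block number to (disk index, block offset on that disk)."""
    stripe, within = divmod(logical_block, stripe_blocks)
    disk = stripe % num_disks                            # stripes rotate across disks
    offset = (stripe // num_disks) * stripe_blocks + within
    return disk, offset


if __name__ == "__main__":
    # Spread 8,000 consecutive logical blocks over 8 disks and count the load.
    load = Counter(raid0_locate(block, 8)[0] for block in range(8000))
    print(sorted(load.items()))
```

Because consecutive stripes land on different disks, index reads and writes are spread across all the SSDs, which is where the increase in total IO operations per second comes from.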
Tip #8: Avoid HTTPS on the back end
If you are running StorReduce on EC2 and storing your data on S3, let StorReduce talk HTTP rather than HTTPS when storing its data. The connections from EC2 to S3 are internal to Amazon and therefore HTTP is an option, removing the setup and CPU cost of HTTPS. (This can be done on the StorReduce Settings page by using an explicit S3 endpoint such as “http://s3.amazonaws.com”.)