As the volume of data that organisations hold continues to grow, backing it all up to tape takes longer and longer. Retrieving data from tape is likewise time-consuming and can be unreliable.
As hard disk capacities increase and the cost per GB of storage continues to fall, organisations are looking to disk-based storage for their operational backups and relegating tape to archival purposes or dropping it altogether. Writing data to disk is significantly faster than writing to tape: transfer speeds for Fibre Channel disks are at least three times faster than tape, which makes it feasible to back up from disk to disk. In fact, when you factor in the initial investment in the tape drive, the tapes themselves, the lifespan of each type of media and the administrative overhead, you might find that disk-to-disk (D2D) backup is more cost effective than tape.
With the rapid adoption of virtualisation, the amount of data stored by organisations has grown, and will continue to grow, more rapidly than ever. With each virtualised server reduced to a set of files stored centrally, additional techniques and technologies are required to manage and control this data growth. One such technology currently finding favour within the industry is de-duplication.
Data de-duplication (often called "intelligent compression" or "single-instance storage") is a method of reducing storage needs by eliminating redundant data. Only one unique instance of the data is actually retained on storage media. Redundant data is replaced with a pointer to the unique data copy.
For example, a typical VMware cluster might contain 40 virtual machines, all running the same version of Windows 2008. Each of those C: drive instances, containing largely the same data, is approximately 40-60GB in size. If the virtual infrastructure is backed up or archived, all 40 instances are saved, requiring roughly 1.6-2.4TB of storage space. With data de-duplication, only one instance of each OS file is actually stored, and each subsequent instance is just a reference back to the one saved copy. In this example, the storage demand could be reduced to only 50-70GB.
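The arithmetic behind an example like this can be sketched in a few lines. This is only a rough illustration: the function name `dedup_estimate`, the 50GB midpoint image size and the 0.25GB of unique data per virtual machine are assumptions chosen for the sketch, not measured figures.

```python
def dedup_estimate(vm_count, image_gb, unique_gb_per_vm):
    """Return (raw_gb, deduped_gb, ratio) for a set of near-identical VM images.

    All inputs are hypothetical planning figures, not measurements.
    """
    # Without de-duplication every image is stored in full.
    raw_gb = vm_count * image_gb
    # With de-duplication the shared data is stored once...
    shared_gb = image_gb - unique_gb_per_vm
    # ...plus whatever is unique to each individual VM.
    deduped_gb = shared_gb + vm_count * unique_gb_per_vm
    return raw_gb, deduped_gb, raw_gb / deduped_gb


# Assumed values: 40 VMs, 50GB per image, 0.25GB unique to each VM.
raw_gb, deduped_gb, ratio = dedup_estimate(40, 50, 0.25)
print(f"raw: {raw_gb}GB, de-duplicated: {deduped_gb}GB, ratio: {ratio:.1f}:1")
```

Under these assumed values, 2,000GB of raw images shrinks to just under 60GB. The result is very sensitive to how much data is genuinely unique to each machine, so real ratios will vary.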
Data de-duplication offers other benefits. Lower storage space requirements save money on disk expenditure. The more efficient use of disk space also allows longer disk retention periods, improves recovery time objectives (RTO) and reduces the need for tape backups. Data de-duplication also reduces the amount of data that must be sent across a WAN for remote backups, replication and disaster recovery, thereby reducing traffic on the network.
Data de-duplication can generally operate at the file, block, and even the bit level. File de-duplication eliminates duplicate files, but this is not a very efficient means of de-duplication, because two files that differ by even a single byte are both stored in full. Block and bit de-duplication look within a file and save unique iterations of each block or bit. Each chunk of data is processed using a hash algorithm such as MD5 or SHA-1. This generates a number that is, for practical purposes, unique to each piece, and that number is stored in an index. When a new chunk produces a hash that is already in the index, the chunk is treated as a duplicate and only a pointer to the existing copy is stored. If a file is updated, only the changed data is saved. That is, if only a few bytes of a document or presentation are changed, only the changed blocks or bytes are saved; the changes don't constitute an entirely new file. This makes block and bit de-duplication far more efficient. However, block and bit de-duplication take more processing power and use a much larger index to track the individual pieces.
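The block-level approach described above can be illustrated with a minimal sketch. It is not how any particular product works: the `BlockStore` class, the fixed 4096-byte block size and the in-memory dictionary used as the index are all assumptions made for illustration. It uses SHA-1 from Python's standard `hashlib` module because that is one of the algorithms named above.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical fixed block size, in bytes


class BlockStore:
    """Toy block-level de-duplicating store (illustrative only)."""

    def __init__(self):
        # Index: SHA-1 digest -> the single stored copy of that block.
        self.blocks = {}

    def write(self, data):
        """Store data and return its 'recipe': the ordered list of block digests."""
        recipe = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            digest = hashlib.sha1(block).hexdigest()
            # Keep the block only if this digest has not been seen before.
            self.blocks.setdefault(digest, block)
            recipe.append(digest)
        return recipe

    def read(self, recipe):
        """Rebuild the original data by following the pointers in the recipe."""
        return b"".join(self.blocks[digest] for digest in recipe)


store = BlockStore()

# Version 1 of a file: three distinct blocks.
original = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE + b"C" * BLOCK_SIZE
recipe_v1 = store.write(original)

# Version 2: only the middle block has changed.
edited = b"A" * BLOCK_SIZE + b"X" * BLOCK_SIZE + b"C" * BLOCK_SIZE
recipe_v2 = store.write(edited)

# Six blocks were written in total, but only four unique blocks are stored.
print(len(store.blocks))
```

Saving the edited version adds only one new block to the store, while both versions can still be rebuilt in full from their recipes. The price is the index itself, which grows with every unique block.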
Ease of deployment is vendor dependent. Some vendors, such as EMC with its recent purchase of Data Domain, have made it very easy by creating a fast, application-independent storage system (attachable as a file server over Ethernet or as a virtual tape library (VTL) over Fibre Channel). No client software or other configuration is required. As a result, de-duplication can be almost invisible to backup and recovery and to other near-line storage processes. It works easily with various data movers and workloads, including non-backup data such as e-mail archives, reference data and engineering revision libraries. More flexibility means more consolidation is possible using less physical infrastructure.
In any storage system, the disk drives are the slowest component. To get greater performance, it is common practice to stripe data across a large number of drives so that they work in parallel to handle I/O. If the system relies on this method to meet its performance requirements, you need to ensure the right balance between performance and capacity. This is important because the essence of data de-duplication is to reduce the number of disk drives.
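One way to see this tension is to compare how many drives are needed for throughput with how many are needed for capacity. The sketch below is a simplification that assumes throughput scales linearly with the number of striped drives; the function name `drives_required` and all of the figures are hypothetical.

```python
import math


def drives_required(target_mb_s, per_drive_mb_s, capacity_tb, per_drive_tb):
    """Return (drives_for_throughput, drives_for_capacity, drives_needed).

    Assumes throughput scales linearly with the number of striped drives,
    which is a simplification.
    """
    for_throughput = math.ceil(target_mb_s / per_drive_mb_s)
    for_capacity = math.ceil(capacity_tb / per_drive_tb)
    # The system must satisfy whichever requirement is larger.
    return for_throughput, for_capacity, max(for_throughput, for_capacity)


# Hypothetical figures: 800MB/s target, 50MB/s per drive,
# 4TB of de-duplicated data, 1TB per drive.
for_throughput, for_capacity, needed = drives_required(800, 50, 4, 1)
print(for_throughput, for_capacity, needed)
```

With these assumed figures, capacity alone calls for only 4 drives but the throughput target calls for 16. De-duplication has reduced the capacity requirement, yet the drive count is still dictated by performance.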
Additional disk capacity is not a concern for in-line de-duplication systems, but it is a requirement for post-process systems. Post-process methods need extra capacity to temporarily store duplicate backup data. How much disk capacity is required depends on the size of the backup data sets, how many backup jobs you run each day and how long the de-duplication technology "holds on" to the capacity before releasing it. Post-process solutions that wait for the backup to complete before beginning to de-duplicate will require larger disk caches than those that start de-duplicating while the backup is still running.
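The three factors above can be combined into a rough first estimate of the temporary capacity a post-process system needs. This is a back-of-the-envelope sketch, not a vendor sizing formula: the function name `staging_capacity_gb` is hypothetical, and it assumes that jobs are spread evenly through the day and that capacity is released a fixed number of hours after the data arrives.

```python
def staging_capacity_gb(backup_set_gb, jobs_per_day, hold_hours):
    """Rough estimate of temporary disk needed by a post-process system.

    Assumes jobs are spread evenly across the day and that space is
    released hold_hours after the data lands (both are simplifications).
    """
    daily_ingest_gb = backup_set_gb * jobs_per_day
    # Data still awaiting de-duplication at any one time.
    held_gb = daily_ingest_gb * hold_hours / 24
    # At least one complete backup set must always fit.
    return max(backup_set_gb, held_gb)


# Hypothetical figures: 500GB backup sets, 4 jobs a day.
long_hold = staging_capacity_gb(500, 4, hold_hours=12)
short_hold = staging_capacity_gb(500, 4, hold_hours=2)
print(long_hold, short_hold)
```

Under these assumptions, a system that holds data for 12 hours needs about 1,000GB of staging space, while one that releases it within 2 hours needs only enough for a single 500GB backup set.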
