Operational Recovery Replication vs. CDP (Continuous Data Protection)

 What are the Big Differences?

 

Well… There are a lot of differences!  Both operational recovery replication and CDP serve a similar purpose, but they achieve their goals very differently. Let’s first start off with a high-level primer of each.  

I want to focus on replication vs. CDP as it pertains to a virtual environment. There is a long history of these technologies being used within the physical world going back decades. Technologies like SRDF, which was used on the EMC Symmetrix storage platform, have been used and are still being used by many Fortune 500 organizations and companies around the world. That is not what I am talking about. I am focusing on ways to replicate data from one hypervisor to another, whether that be the same or different hypervisors.  

 

First, Let’s Explore Replication 

What does “replication” really mean? Simply put, replication is a point-in-time copy of data. This means you have a copy of the source at some specific point in time that is sent to a target. When it arrives, it is an image of what that VM looked like at the time the replication was initiated.

First let’s define the three types of replication as they pertain to this article: storage replication, agent-based replication, and hypervisor-level replication using native hypervisor snapshots. 

 

Storage Replication 

With storage replication, you have a LUN that is presented up to the host or cluster as a volume and formatted in whatever file system your hypervisor uses. I will use VMFS in this example, so the volume holds a datastore with VMs residing on it. Typically, a policy is set up within the array to take “snapshots” on a specific schedule. Let’s say you perform snapshots once every 4 hours and you keep 16 on a first-in, first-out (“FIFO”) basis. You will have roughly two and a half days of rolling protection. These snapshots are sent from the source storage array to the destination array, so you must use the same storage technology on both ends. There are some very rare exceptions to that rule, but it’s correct for the most part. Let’s not get nitpicky!
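The rolling-window arithmetic above can be sketched in a few lines of Python. The interval and retention count are the hypothetical policy values from the example, not defaults from any real array:

```python
from collections import deque

# Sketch of FIFO snapshot retention (hypothetical policy values).
INTERVAL_HOURS = 4   # take a snapshot every 4 hours
KEEP = 16            # retain 16 snapshots, first in, first out

snapshots = deque(maxlen=KEEP)  # deque with maxlen drops the oldest automatically
for hour in range(0, 30 * 24, INTERVAL_HOURS):  # simulate 30 days of snapshots
    snapshots.append(hour)

# Rolling protection window = age of the oldest retained snapshot.
window_hours = snapshots[-1] - snapshots[0]
print(f"Rolling window: {window_hours} hours (~{window_hours / 24:.1f} days)")
# 16 snapshots at 4-hour spacing span 15 intervals, i.e. 60 hours (~2.5 days)
```

The same arithmetic tells you how many snapshots to keep for any target window: desired hours divided by the interval, plus one.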

The snapshot happens within the storage subsystem and, unless there is specific integration between the storage vendor and the hypervisor, the host and the VMs never know this process is happening. Only the blocks that have changed since the last “snapshot” are sent. In some cases the data is deduplicated, or other WAN optimization technologies may be layered on top. For the most part, this is a very efficient way to send data from source to target.

 

A Few Things to Consider About Storage Replication

What I am talking about here is “asynchronous replication,” not “synchronous.” Synchronous replication is really used for active-active configurations where you don’t want any data loss at all; I can discuss this in another post. Async replication works at the storage level. The changes are updated and available for recovery at the target system by “mounting” the replicated point-in-time copy. Depending on the storage vendor, mounting may be a manual process, or it may be an “orchestrated” process that leverages software like VMware Site Recovery Manager.

Another consideration is whether or not you want the VM to be in a consistent state. If so, then you may need to quiesce the VM. Quiescing can be done with VMware Tools or Microsoft VSS. When quiescing occurs, things like memory are flushed and database logs may be committed. The point is that the VM and OS are being communicated with. It is important to note that this process can only be done every so often. Microsoft does not recommend using VSS more than once per hour on SQL Server, for instance, because of the pause in IO. At the same time, it is the only way to have a fully committed image for recovery purposes.

The last point I will make relates to the transactional database. SAN replication will capture data and give you a crash-consistent copy at the target system. That crash-consistent copy will come online just fine nine times out of ten. But if you didn’t commit logs and the database was shut down in a “dirty” state, then there is always some inherent risk. The risk is the same for all of the technologies I am discussing if you don’t have application-aware integration enabled. 

 

Downsides to Application-Aware Replication

The problem with application-aware replication is the overhead on the application, so it’s a balancing act. There is an answer, but that answer lies further up the application stack, within the application vendor’s solution. Both Microsoft SQL Server and Oracle provide clustering technologies for their database platforms to solve this problem. The issue, at least in the SMB space, is typically that they are cost-prohibitive from a licensing and a configuration standpoint. This is why you should look at all of the alternatives. Again, these are just things to consider when weighing your options.

 

Agent-Based Replication

The next option is making replicas via some sort of agent-based solution. Most major backup vendors, from Veeam to Commvault, have agents that will replicate data from source to target on a predefined schedule, just like in the storage snapshot example above. Although I am focusing on VMs in this blog, this is the one option that provides very similar capabilities for physical servers, and even physical-to-virtual replication.

The agent-based method uses a software agent to take point-in-time copies and capture the changes since the last copy. It then writes these changes to another storage location. Agent-based solutions can also interact with technologies like Microsoft VSS to make sure your copy is consistent. In many cases, they do not put a lot of overhead on the VM itself; therefore, agent-based solutions are good for systems like databases. Microsoft SQL Server availability groups are another example of software-based replication technology. They can provide the necessary offsite copies for business continuity with little impact on the actual database itself. This again relates to what I said about crash-consistent vs. application-aware. It all comes down to what your databases and applications can handle as far as a pause in IO.

 

A Few Things to Consider about Agent-Based Solutions

Agent-based solutions can replicate data from one storage vendor to another as well as one hypervisor to another. This is because they are not tied to the storage or the hypervisor, so they have a lot of flexibility. So what’s the downside?   

First, there is software running on every OS, which means there is some overhead. If you add encryption, that process also uses CPU on the source system. The agents also have to be managed and patched individually; if you have a lot of VMs, this can become a time-consuming process, and I think it creates unnecessary management overhead in a virtual world. On the other hand, a large benefit is that if one machine has a problem, only that one system is affected, instead of the larger swath of systems that can be affected with hypervisor-based replication.

Again, these are just points to consider. Agent-based replication is still a great option in smaller environments. Recovery of these systems can depend on how well the software vendor is integrated with the hypervisor, or it may just be a manual process of powering systems off and back on. Again, this is usually less of an issue for smaller environments.

 

Hypervisor-Based Replication

The last point-in-time replication method I will discuss is hypervisor-based replication. This is one of the most popular ways to back up and replicate data in a virtual world today, for many good reasons. The backup vendor writes software against the APIs that the hypervisor vendor provides and takes a snapshot using the native snapshot process within the hypervisor.

In the case of VMware, the vStorage API (VADP) integration makes that call, creates a VMware snapshot, and labels it as one that was created by the software vendor. The software then scans the image file (VMDK) to see what has changed since the last snapshot using CBT (changed block tracking). These changes are sent to the target system’s image file (VMDK), which is updated based upon your policy. Just like with the storage replication option, there is a schedule for when snapshots are taken and how many you want to keep. This gives you the ability to roll back to a number of different images at the disaster recovery site.
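The changed-block idea can be sketched in Python. This is an illustration only; the block map, the `replicate_changes` helper, and the block numbers are invented for the example and are not part of any real CBT API:

```python
# Sketch of changed-block tracking (CBT): only blocks modified since the
# last snapshot are sent to the target, instead of the whole image.
def replicate_changes(source_blocks, target_blocks, changed):
    """Copy only the changed block numbers from source to target."""
    sent = 0
    for block_id in changed:
        target_blocks[block_id] = source_blocks[block_id]
        sent += 1
    return sent

source = {i: f"data-{i}" for i in range(1000)}   # pretend 1000-block VMDK
target = dict(source)                            # target in sync after last cycle
source[42] = "data-42-v2"                        # guest writes touch two blocks
source[99] = "data-99-v2"
changed_since_last_snapshot = {42, 99}           # what CBT would report

sent = replicate_changes(source, target, changed_since_last_snapshot)
print(f"Sent {sent} of {len(source)} blocks")    # far less than a full copy
```

The payoff is that transfer size tracks the change rate of the VM, not its total size, which is what makes frequent replication cycles affordable over a WAN.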

I also want to point out that when using a certified storage vendor with your replication software of choice, you most likely can use “storage-integrated” snapshots. Instead of using a VMware snapshot to read and update the VMDK file, you can use the storage vendor’s snapshot. Let’s say you have Nimble Storage, Veeam Replication, and Veeam Enterprise Edition: you can leverage the storage-integrated snapshot functionality. This takes a lot of the overhead away from the host and uses the data path from the data mover to capture the changes. I will get into why this may be important later in this blog.

 

Failover Plans

VM recovery can also be orchestrated within a software vendor’s solution. As an example, Veeam has “failover plans.” Failover plans allow you to build out a full plan of exactly what order the virtual machines must come up in, as well as the networking requirements, connectivity options, etc. This “orchestration” is what makes hypervisor-based replication a very attractive option.
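Conceptually, a failover plan is an ordered list of VMs with boot delays between tiers. The sketch below is a generic illustration of that idea; the VM names, delays, and the `run_failover` helper are invented and do not reflect Veeam’s actual data model or API:

```python
import time

# Sketch of a failover plan: power VMs on in a fixed order with boot delays.
failover_plan = [
    {"vm": "dc01",  "boot_delay_s": 0},   # domain controller first
    {"vm": "sql01", "boot_delay_s": 120}, # database after directory services
    {"vm": "app01", "boot_delay_s": 60},  # app tier after the database
    {"vm": "web01", "boot_delay_s": 30},  # web tier last
]

def run_failover(plan, power_on, wait=time.sleep):
    """Bring VMs up in plan order, waiting the configured delay before each."""
    started = []
    for step in plan:
        wait(step["boot_delay_s"])     # give the previous tier time to boot
        power_on(step["vm"])           # in real life: an API call to the hypervisor
        started.append(step["vm"])
    return started

# Dry run: no real power-on calls, no real sleeping.
order = run_failover(failover_plan, power_on=lambda vm: None, wait=lambda s: None)
print(order)
```

The value of encoding this as a plan rather than a runbook is that the same ordering runs identically in a DR test and in a real failover.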

Again, like agent-based replication, one big advantage over storage replication is the ability to use heterogeneous storage and hypervisors on the source and target. This is why Storcom can provide these capabilities for so many of our customers: we don’t care what our customers use on their production systems for storage. It also allows us to use a cloud gateway for our clients to connect to Storcom instead of having VPNs in place. Storage replication almost always requires a VPN or a direct circuit.

 

So What are the Challenges with Hypervisor-Based Replication?  

It really comes down to a few factors. The first issue is that you still have to communicate with the VM even when you use storage-integrated snapshots. VMware Tools then allows the process to run, flushes memory, and releases the VMDK file. This all happens very quickly, but it can cause IO contention on the VM. As I mentioned before, this can be a problem for applications like OLTP databases. This is why we are sometimes forced to look at other options like CDP.

 

Issue #2: The Recovery Point is Not as Low as CDP

With snapshot-based replication, the best recovery point you are ever really going to see is one hour, and that’s pushing it. Having a VM snapped that often may cause issues even on a file server. Again, most organizations are fine with RPOs of 4 or 8 hours on file, web, and app servers, so a technology that provides this level of protection might work for your organization. The database is usually where you want a lower RPO, and that is why we sometimes look to other options.

 

Issue #3: The BC/DR Testing Process May have Limitations

I don’t want to make a blanket statement, but some solutions like Veeam and VMware won’t allow updates to the VMDK files while a DR test is running. That means your disaster recovery protection is effectively not updating during the time of your test. There are some workarounds so you can still test and get updates, but they are not the easiest processes to implement.

All of these options do have limitations when it comes to providing business continuity and disaster recovery capabilities. This is why CDP exists and is a good alternative or addition. So let’s now explore CDP.

 

What is Continuous Data Protection (CDP)?

CDP is pretty much what it says it is: continuous data protection. Unlike point-in-time “snapshots,” which happen on a predefined schedule, CDP sends changes as they happen. It lives in the IO path itself. A hypervisor-based CDP solution uses a splitter driver: APIs from the hypervisor vendor take the IO that is written to disk and write a copy of it to storage, typically in a second location. That IO can then be used to update the VMDK or VHD file. This is NOT “synchronous replication.” That is different.

Synchronous replication means that IO written at the source is also sent to the target, and an acknowledgement must come back to the source before the write completes. I will discuss that in my next blog when I talk about metro/stretch clustering. With CDP, the storage location at the target is typically called a “journal” or “IO catalog.” It provides the ability to go back in time to a specific point, or at least a timestamped set of IO, and recover the VMDK file at the target location to that point.

Let me explain the full process. VM1 gets “seeded,” meaning you take a copy of the VMDK file and copy it over to the target site. You now have a copy of that VM. The journal then starts catching up with the changes from the point when the “seed” occurred. When it fills up, it applies the oldest changes to the VMDK file. It’s a first-in, first-out process, so the most recent changes are always preserved in the journal. You then have the ability to go back in time within some defined period. However, it’s not usually very long, maybe 24-72 hours.
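The seed-plus-journal flow can be modeled with a bounded queue. This is a toy sketch; the journal size, block numbers, and timestamps are all made up, and real CDP journals are sized in gigabytes or by retention time, not entry count:

```python
from collections import deque

# Sketch of a CDP journal: after seeding, every IO is appended with a
# timestamp; when the journal exceeds its retention limit, the oldest
# entries are applied to the replica image (FIFO).
JOURNAL_LIMIT = 5

replica = {}                 # the seeded VMDK copy at the target
journal = deque()            # timestamped IO waiting in the journal

def record_io(ts, block, data):
    journal.append((ts, block, data))
    while len(journal) > JOURNAL_LIMIT:          # journal full: flush oldest IO
        _, old_block, old_data = journal.popleft()
        replica[old_block] = old_data            # apply it to the replica image

for t in range(8):                               # eight writes arrive over time
    record_io(t, block=t % 3, data=f"write-{t}")

# The replica holds the oldest flushed writes; the journal holds the newest.
# Rewinding to any timestamp still inside the journal window is possible
# because those IOs have not yet been folded into the replica.
print(len(journal), sorted(replica))
```

This is why the rewind window is short (24-72 hours): it is bounded by how much IO the journal can hold before it must flush to the replica.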

 

Advantages of CDP

Next to synchronous replication, CDP is the closest thing to a like-for-like copy of data from one site to another. As with hypervisor-based replication, most solutions like Zerto and Veeam provide an orchestration component that allows you to pre-plan the recovery of critical application groups. Additionally, you can set SLAs at the application level and specify which systems take priority if there is bandwidth contention. Maybe you want the web, app, and DB servers for your ERP system to be number one. Solutions like Zerto will ensure that this group has the best chance of being as close to the specified SLA as possible. This is done by setting SLAs at the application-group level.
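The priority behavior under contention can be sketched as a simple greedy allocator. The group names, bandwidth figures, and the `allocate` function are invented for illustration; real products use their own scheduling logic:

```python
# Sketch of SLA-based bandwidth allocation: when replication traffic exceeds
# the available link, higher-priority application groups are served first.
groups = [
    {"name": "erp",   "priority": 1, "needed_mbps": 300},
    {"name": "file",  "priority": 3, "needed_mbps": 200},
    {"name": "email", "priority": 2, "needed_mbps": 150},
]

def allocate(groups, link_mbps):
    """Grant bandwidth in priority order until the link is exhausted."""
    grants = {}
    remaining = link_mbps
    for g in sorted(groups, key=lambda g: g["priority"]):
        grants[g["name"]] = min(g["needed_mbps"], remaining)
        remaining -= grants[g["name"]]
    return grants

print(allocate(groups, link_mbps=400))
# ERP gets its full 300 Mbps; email gets the remaining 100; file waits
# until contention clears, mirroring how an SLA-priority engine behaves.
```

The point of the sketch is the ordering: lower-priority groups simply fall behind on their RPO during contention rather than dragging the critical group down with them.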

Another big advantage is that CDP does not touch the VM from a hypervisor standpoint. CDP does not use hypervisor or storage “snapshots,” so there is none of the overhead or “stunning” that may occur with some of the other methods mentioned above. It also provides zero-contact protection, which is perfect for high-transaction databases.

 

So What are the Disadvantages of CDP?  

Well, until Veeam’s version 11 release, you had to either support two products or choose an all-or-nothing scenario with CDP or replication. Solutions like Zerto are great, but they only do CDP, and Veeam only did replication prior to v11. Now Veeam can do both! We tested Veeam Backup and Replication v11 and its new CDP functionality; click here to learn more about the key new features and functionality. However, I will say that it does not have the same track record as Zerto since it is so new.

Storcom will continue watching CDP technology grow. CDP should become more viable as a few more features become available, like Cloud Connect for CDP. Right now, you can only use a VPN with Veeam CDP, unlike Veeam replication, which supports their Cloud Connect gateway. Zerto offers a cloud-connect option where you don’t need a point-to-point VPN between your site and your MSP.

 

Conclusion

Hopefully you have a better idea of the differences between all of these technologies and what your options might be for replication after reading this blog. There is still a lot more to cover. I did not get very far into the differences as they pertain to data security, but that is another blog topic… 

Until next time,

Dave Kluger

Storcom CTO

 

 

 
