SAN versus Local Storage and High Availability

I have two problems with buying a SAN. An entry-level SAN has a single point of failure – even if you have redundant power and dual controllers, the backplane can fail. If all your applications are running off that SAN, you’re screwed. But having two SANs for full redundancy is twice the price. I’m looking at £20k for a P4300 versus £10k for the entry-level HP SAN (P2000). That’s a lot for a small business like ours. My supplier tells me I’d be crazy to go for a single SAN.

But are there any benefits of virtualisation without shared storage? Well, a lot of people seem to go for two physical hosts with local storage only and replicate them using Veeam replication. This doesn’t provide automatic failover, but it seems a lot better than waiting for a dead server to be repaired and then possibly having to rebuild it from scratch, which could take a couple of days. If one host dies, you spend a few minutes getting everything back up and running on the other host. And it’s cheap. So I’m attracted to this. But my supplier, once again, tells me I’d be crazy.

I followed a debate on a forum that went:

“Yes, shared storage is good, but going with cheap shared storage is risky. You are putting all eggs in one basket, and if such storage dies, HA will not work. In fact, nothing will work. So unless you can afford good fault-tolerant shared storage (read “expensive”), 2 ESX hosts with local storage and replication between them is significantly more fault tolerant solution. Secondly, HA is much worse than replication, because replication provides transactionally-consistent images, while HA does cold restart of crash-consistent image, so such recovery rarely does any good for transactional applications and databases. With Microsoft Exchange for example, out of 2 times I had this situation in production, MDB got corrupted in both cases, so I still had to manually restore from earlier backup.”

“Veeam Backup & Replication product provides functionality of replicating virtual machines between ESX hosts with local storages.”

“HA only works with shared storage and there is no replication involved. HA will power up guests on the second host when the first host dies. It is the equivalent of pulling power to a server and restarting it. Any decent SAN will have dual controllers and dual PS. So the only single point of failure is a complete failure of the backplane. That’s a risk that most will take. Let’s face it, if that small risk is too great, you are not currently investing enough in DR.”

“HA is pretty simplistic one, it requires shared storage, and does not guarantee successful recovery due to performing simple crash-consistent restart. Replication is much more advanced than that, does not require shared storage, and provides guaranteed application recovery. As for automatic failover with replication, actually there is such capability. With Veeam Essential suite, you also get Veeam Monitor product, that has built-in alerts for VM heartbeat, and ability to automatically trigger response action based on alert. This response action can be simple script that automatically starts up the corresponding replica VM on the standby host.”
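To get my head around what such a scripted failover would actually do, here is a minimal sketch in Python. It is not Veeam's API or anything the poster supplied: `watch_heartbeat`, `start_replica` and the threshold of three missed heartbeats are all names and numbers I have made up for illustration. The idea is simply that you only power on the replica after several consecutive missed heartbeats, so a single dropped check doesn't start a second copy of a VM that is still running.

```python
from typing import Callable, Iterable


def watch_heartbeat(
    heartbeats: Iterable[bool],
    start_replica: Callable[[], None],
    missed_threshold: int = 3,
) -> bool:
    """Call start_replica once after `missed_threshold` consecutive misses.

    `heartbeats` is a hypothetical stream of check results (True = VM alive).
    `start_replica` is a hypothetical callback that would power on the
    replica VM on the standby host. Returns True if failover was triggered.
    """
    missed = 0
    for alive in heartbeats:
        if alive:
            missed = 0  # any successful check resets the counter
            continue
        missed += 1
        if missed >= missed_threshold:
            start_replica()
            return True
    return False


if __name__ == "__main__":
    # One isolated miss is ignored; three misses in a row trigger failover.
    checks = [True, False, True, False, False, False]
    triggered = watch_heartbeat(checks, lambda: print("starting replica VM"))
    print("failover triggered:", triggered)
```

In real life the callback would have to call whatever the monitoring product or hypervisor tooling provides, and you would want to be sure the original VM really is dead before starting its replica.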

“Scenario 1. Production ESX host fails, HA does its job restarting VM with Exchange server on another ESX host in the cluster. VM restarts fine, but Exchange is failing due to MDB corruption caused by improper shutdown. I had personally suffered twice from exact same situation. No fun.

Scenario 2. Shared storage goes down. You have to perform full VM restore to local ESX storage now.

Both of these scenarios will result in:

1. A few hours of down time while you are restoring Exchange VM from backup.

2. Up to 24 hours data loss because you have to roll back to your nightly Exchange backup.

Now, compare this with replication between local storage (very popular scenario with our customers). Whether your production ESX host or shared storage fail,

1. Down-time will be less than a few minutes (time it takes for replica VM to start up on standby ESX host).

2. Maximum loss of data will be less than your chosen replication period.

I am not saying HA does not have its uses, most applications will be fine after crash-consistent image restart. However, for some applications this often causes data corruption. On the other hand, applications like Exchange or databases are also the most often used apps (and always mission-critical too).”
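Putting rough numbers on the two outcomes described in that quote helps me compare them. This is only a back-of-envelope sketch: the 3-hour restore, 5-minute replica boot and 60-minute replication period are my own assumed figures standing in for "a few hours", "a few minutes" and "your chosen replication period", and the function names are made up.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    downtime_minutes: float
    max_data_loss_minutes: float


def restore_from_backup(restore_hours: float, backup_interval_hours: float = 24) -> Outcome:
    """Down for the length of the restore; lose up to one backup interval."""
    return Outcome(restore_hours * 60, backup_interval_hours * 60)


def fail_over_to_replica(boot_minutes: float, replication_interval_minutes: float) -> Outcome:
    """Down while the replica boots; lose up to one replication interval."""
    return Outcome(boot_minutes, replication_interval_minutes)


# Assumed figures, not measurements.
backup = restore_from_backup(restore_hours=3)
replica = fail_over_to_replica(boot_minutes=5, replication_interval_minutes=60)

print("restore from nightly backup:", backup)
print("fail over to replica:       ", replica)
```

This only covers the bad cases from the quote (corrupted database or dead shared storage). When HA restarts a VM cleanly, its downtime is just the reboot and nothing has to be rolled back.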

The guy above pushing replication over high availability is a Director at Veeam, so not exactly unbiased. Then again, in the sales meetings I’ve had, no-one has mentioned that a crash is likely to corrupt your Exchange database (and presumably our ERP SQL Server database). It may not be likely, but I’d be really annoyed if I spent over fifty grand on a High Availability solution only for it to fail because of database corruption. And replication is probably cheaper.

As usual, it comes down to who to believe. Everyone is selling a product to me.
