Testing Peer Persistence Failover
Hello friends, long time lurker, first time poster.
Has anyone ever tested a new Peer Persistence pair by simulating an array failure on the source side? We have new arrays (9450s) and a short window for this kind of disruptive testing. In the past we have tested by unplugging the fiber and network connections from the source array. We are using VMware as the cluster technology, and before the failover scenario we start Iometer and play a video on a guest in that cluster to generate load during the failover.
RC Type: RCFC
Cluster Technology: ESX
Peer Persistence with QW configured at 3rd site and auto-failover enabled on the RC group.
Source and Target LUNs exported to the cluster hosts.
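For reference, these are the pre-checks I was planning to run from the source array CLI before the window (the group name RCG_VMW01 is a placeholder, and exact syntax and output may vary by 3PAR OS version):

    # Confirm Remote Copy links, targets, and the Quorum Witness are healthy
    showrcopy
    # Confirm the group is synced and has the Peer Persistence policies set
    showrcopy groups RCG_VMW01
    # If auto_failover/path_management are missing, something like this should set them
    setrcopygroup pol auto_failover,path_management RCG_VMW01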
Just looking for others' thoughts; maybe there is a better way of doing this.
Thanks for reading!
Re: Testing Peer Persistence Failover
Sounds like a good plan.
Just be aware that all ports must fail at the same time (within a very few seconds) for the 3PAR to handle this as a failover. If you spend five seconds or so disabling each port, the 3PAR is too smart: it understands that this is "human made" and will not trigger the failover.
Also keep in mind: the SCSI standard allows for 60 seconds of "latency" before a command times out. Peer Persistence in a supported configuration should always complete the automatic failover within half of that (30 seconds). My experience is that it completes in far less when I've tested, and I've never been close to the limits. Any application and operating system following the SCSI standard will have no problems with that (which is a polite way of saying that, in my experience, not all applications do).
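If you want to sanity-check the guests before the test, the per-device SCSI timeout is easy to inspect; a rough sketch for a Linux VM (the device name sda is just an example, and VMware Tools normally raises this value to 180 seconds on VMware):

    # Per-device SCSI command timeout in seconds (Linux guest)
    cat /sys/block/sda/device/timeout
    # Windows guests keep the equivalent setting in the registry:
    #   HKLM\SYSTEM\CurrentControlSet\Services\Disk\TimeOutValue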
The views and opinions expressed are my own and do not necessarily reflect those of my current or previous employers.
Re: Testing Peer Persistence Failover
Yes, I've done this when we installed a new 8400 pair and a 9450 pair.
We got a test scenario from HPE via our reseller/installer which seems to work and can be done from the office (rather than pulling wires in a data center).
As mentioned, it's a timing thing: this method needs you to hit return on a command on the source array CLI and on the Witness VM CLI as close together as possible (I had one session open on a PC and the other on a laptop so I could hit the keys at the same time).
Basically you use controlport on the source array to down all the IP/FC Remote Copy ports, and an iptables rule on the Witness to block traffic from the source array.
We've only done this in the early test phases, with a selection of VMware/Windows/Solaris test hosts connected. If you wanted to be really brutal, I guess you could down all the host ports on the source array at the same time (the command takes a list of ports, so you can do them all if wanted). A rough sketch of the two commands is below.
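The shape of it was roughly the following; the port numbers and the source array IP are placeholders from my notes, and the exact controlport subcommand depends on the port type and OS version (check clihelp controlport first):

    # On the source array CLI: take the Remote Copy ports down in one command
    # (this example is for RCIP ports; RCFC ports use a different subcommand)
    controlport rcip state down -f 0:3:1 1:3:1

    # On the Quorum Witness VM, at as close to the same moment as possible:
    # drop all traffic from the source array so it loses quorum too
    iptables -A INPUT -s 10.0.0.50 -j DROP

    # Remove the Witness rule again once the test is complete
    iptables -D INPUT -s 10.0.0.50 -j DROP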
I too would expect some of our 'well written enterprise' apps to trip up on SCSI timeouts.
In a live environment we've only done a manual switchover of PP groups, when preparing for a DC power-down, and that went smoothly.
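For completeness, the planned switchover before the power-down was just the standard one-liner, run on the primary array (the group name is a placeholder again):

    # Graceful, planned role swap of a Peer Persistence group (non-disruptive to hosts)
    setrcopygroup switchover RCG_VMW01
    # Verify the roles have swapped
    showrcopy groups RCG_VMW01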