After a bit of healthy debate in the office about the merits and
implications of using adaptive copy mode (ACp) with EMC's SRDF, I wanted to clarify my own
thoughts on how it operates, and the benefits of using it.
Firstly, for the uninitiated, what exactly is it? Well, EMC has SRDF (Symmetrix Remote Data Facility), a data replication protocol that allows data to be replicated over distance from one array to another.
Normally, wherever possible, when running replication at the hardware layer like this (where the hardware has no concept of application transactional consistency), you want the replication to perform synchronously with the I/Os. This essentially means that when a write operation is performed on the source array, it is also sent to and committed on the destination (i.e. remote) array before the operation is ack'd to the server as being completed.
Clearly there is a time overhead with this kind of arrangement, dictated by two factors - the (relatively constant, dependent on array loading) "rippling" effect of the write operation having to pass through two arrays before getting a final "commit", and the transmission and acknowledgement time (as dictated by the speed of light and the distance between the arrays, which can be many kilometres apart).
From this (the aggregate of two times the transmission latency and the write latency for the remote array), it is easy to see that distance imposes an immediate bottleneck (not necessarily in bandwidth, but certainly in transaction time - round-trip times of 20-30ms are not unusual).
As the long-haul links between the arrays become congested, performance will rapidly degrade, and because of the synchronous nature of the replication your application will run slowly (I've seen write operations in excess of 300-400ms reported by the OS in bad cases).
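To put rough numbers on the synchronous overhead described above, here is a minimal sketch. It assumes light travels through optical fibre at about 200,000 km/s (roughly 5 microseconds per kilometre, one way); the function name and parameters are made up for illustration and are not part of SRDF or any EMC tooling.

```python
# Assumption: signal speed in optical fibre is about 200,000 km/s,
# which works out to 200 km per millisecond.
FIBRE_KM_PER_MS = 200.0


def sync_overhead_ms(distance_km, remote_write_ms):
    """Estimate the extra latency a synchronous remote write adds, in ms.

    This is the aggregate from the text: two times the one-way
    transmission latency plus the write latency of the remote array.
    """
    one_way_ms = distance_km / FIBRE_KM_PER_MS
    return 2 * one_way_ms + remote_write_ms


# Arrays 100 km apart, remote array taking 1 ms to commit the write.
print(sync_overhead_ms(100, 1.0))  # 2.0
```

This ignores queueing on a congested link, protocol overhead and delay in intermediate equipment, so treat the result as a lower bound rather than a prediction; real-world figures like the 20-30ms mentioned above come from those extra factors.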
In contrast, an async transfer mode will commit the write operation to local storage while transmitting the write to the remote array in the background. This gives performance comparable to that of a non-SRDF setup, at the risk of losing I/O transactions in the event of a link or primary site failure. Because of the asynchronous nature of the SRDF transaction, even when the long-haul links between the two arrays are congested, your application will perform well and the SRDF updates will complete at the next available opportunity.
And so, back to ACp - a compromise between the two.
The trouble with synchronous mode is that you don't necessarily want your application slowing down at every busy spot during the day. Perhaps you want to make more efficient use of your bandwidth by playing the odds. You might, for example, feel that running your long-haul link at an extremely high utilisation is more cost effective than an upgrade in bandwidth. ACp will allow you to do this, with the caveat that during peak load you will be slightly out of sync.
ACp introduces the concept of the skew value: essentially a threshold, counted in the number of write operations still waiting to reach the remote array. Below the skew value, the device pairing operates in async mode, and above the threshold it switches to synchronous mode (the skew value normally defaults to 65536 operations).
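The threshold behaviour can be sketched as a toy model. This is only an illustration of the idea as described here, not how the array implements it: the function names, the per-tick drain rate, and the choice to treat a backlog exactly equal to the skew as "sync" are all assumptions of mine.

```python
def acp_mode(pending_writes, skew=65536):
    """Return the mode a toy ACp device pair would be in.

    pending_writes: writes committed locally but not yet on the remote array.
    skew: the threshold; 65536 is the default quoted in this post.
    Assumption: a backlog exactly equal to the skew counts as "sync".
    """
    return "async" if pending_writes < skew else "sync"


def simulate(arrivals, drain_per_tick, skew):
    """Toy simulation: return the mode in effect during each tick.

    arrivals: number of new writes arriving in each tick.
    drain_per_tick: writes the long-haul link can ship per tick (hypothetical).
    """
    backlog = 0
    modes = []
    for incoming in arrivals:
        mode = acp_mode(backlog, skew)
        modes.append(mode)
        if mode == "async":
            # Writes are ack'd locally; whatever the link can't ship piles up.
            backlog = max(0, backlog + incoming - drain_per_tick)
        else:
            # Sync: new writes wait for the remote commit, so the backlog
            # cannot grow; only spare link capacity reduces it.
            spare = max(0, drain_per_tick - incoming)
            backlog = max(0, backlog - spare)
    return modes


# A burst of writes followed by a quiet period, with a tiny skew of 6.
print(simulate([5, 5, 5, 1, 1, 1], drain_per_tick=2, skew=6))
# ['async', 'async', 'sync', 'sync', 'async', 'async']
```

In this model the application only feels the link latency during the two "sync" ticks, when the backlog has reached the skew; the rest of the time it runs at local speed while the remote copy lags slightly behind.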
So, for example, playing the odds: if running your long-haul link at a higher utilisation means that you can't keep your local app running quickly enough due to link contention and latency, ACp may help. You might not be fully consistent one hundred percent of the day, but 98% might be good enough, especially when coupled with the local recovery abilities of journalled filesystems and modern DBMSs.
