I am still going through some of the announcements, especially on the storage side of things, and EVO:RAIL, which seems pretty compelling for what has lately been called “hyper-converged infrastructure”.
I’ll write a post on the topic once I have completed all my reviews. It will surely give us a chance to exchange a little on the new technology enhancements in the datacenter operating system, a.k.a. virtualization.
For now, I’ll complete the three-part series I started a few weeks ago with “What to get ready in your vSphere cluster design – Part 1” and “What to get ready in your vSphere cluster design – Part 2”. This third part covers Permanent Device Loss prevention, HA heartbeat, DPM & WoL, and vMotion bandwidth.
As we build stronger, more reliable vSphere clusters, many aspects need to be deeply understood. And when I say deeply understood, I don’t mean only knowing what a feature does and how it does it, but knowing how it will sustain the objectives that were set, because that is really why acquisitions are made in the first place, right? Right!
Permanent Device Loss Prevention
vSphere 5 introduced Permanent Device Loss (PDL), which improved on how the loss of an individual storage device used to be handled as an All Paths Down (APD) condition, by providing a much more granular understanding of the condition and a reaction that best fits what is actually being experienced. Wow, that was a mouthful… LOL
When a storage device becomes permanently unavailable to your hosts, a PDL condition is triggered through a SCSI sense code (“Target Unavailable”). This can happen when the device is unintentionally removed, when its unique ID changes, or when it experiences an unrecoverable hardware error.
When deploying a cluster, make sure to enable PDL detection on all hosts in the cluster so that a VM is killed when a PDL condition occurs. You may also want to enable the HA advanced setting “das.maskCleanShutdownEnabled” so that a killed VM is restarted on another host after the PDL condition. Think about transactional VMs, for example…
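To make the interplay between the two settings concrete, here is a minimal Python sketch of what happens to a VM whose device enters PDL. It is a simplified model under my own assumptions, not VMware code: the function `pdl_outcome` and its two boolean parameters are hypothetical, with `pdl_detection_enabled` standing in for the host-level PDL detection setting and `mask_clean_shutdown` for `das.maskCleanShutdownEnabled`.

```python
def pdl_outcome(pdl_detection_enabled: bool, mask_clean_shutdown: bool) -> str:
    """Simplified, hypothetical model of a VM's fate when its device enters PDL.

    pdl_detection_enabled: stands in for the host-level PDL detection setting.
    mask_clean_shutdown: stands in for the HA option das.maskCleanShutdownEnabled.
    """
    if not pdl_detection_enabled:
        # Nothing kills the VM: it keeps running against a device
        # that will never come back.
        return "VM keeps running but cannot reach its storage"
    if not mask_clean_shutdown:
        # The VM is killed, but HA cannot read its state on the lost
        # datastore and may treat the kill as a clean shutdown.
        return "VM killed, restart not guaranteed"
    # HA assumes the killed VM did not shut down cleanly and restarts it.
    return "VM killed and restarted on another host"


if __name__ == "__main__":
    # Walk through every combination of the two settings.
    for detection in (False, True):
        for mask in (False, True):
            print(detection, mask, "->", pdl_outcome(detection, mask))
```

In this model, only the combination with both settings enabled brings the VM back automatically, which is what you want for the transactional VMs mentioned above.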
HA Heartbeat … in Metro Clusters
One of the fundamentals of a cluster is the heartbeat, as it keeps the state of the nodes up to date. When a new host joins an existing cluster, heartbeating is enabled by uploading an agent onto the host, which allows the agents on all hosts to communicate with each other every second. After 15 seconds of missed heartbeats, the host is considered down.
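As a rough illustration of that timing, the following Python sketch declares a host down once 15 seconds pass without a heartbeat, with heartbeats expected every second. It only models the numbers quoted above; the real HA agents also perform isolation checks and use datastore heartbeats, and the constant and function names here are hypothetical.

```python
HEARTBEAT_INTERVAL_S = 1  # agents exchange heartbeats every second
FAILURE_TIMEOUT_S = 15    # missed-heartbeat window before a host is declared down


def host_is_down(last_heartbeat_s: float, now_s: float,
                 timeout_s: float = FAILURE_TIMEOUT_S) -> bool:
    """Return True once no heartbeat has been seen for timeout_s seconds."""
    return now_s - last_heartbeat_s >= timeout_s


def missed_heartbeats(last_heartbeat_s: float, now_s: float,
                      interval_s: float = HEARTBEAT_INTERVAL_S) -> int:
    """Number of full heartbeat intervals elapsed since the last heartbeat."""
    return int((now_s - last_heartbeat_s) // interval_s)


if __name__ == "__main__":
    last_seen = 10.0  # the host's final heartbeat arrived at t = 10 s
    for now in (11.0, 24.0, 25.0):
        print(f"t={now}s missed={missed_heartbeats(last_seen, now)} "
              f"down={host_is_down(last_seen, now)}")
```

With a 1-second interval, "15 seconds of silence" and "15 missed heartbeats" are the same threshold, which is why the sketch can express the rule either way.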
Locally, it’s easy to manage. For dispersed architectures (from a VMware standpoint, of course, e.g. VPLEX…), it is a little more challenging. Latency works against some of the best practices, so datastore heartbeating is often leveraged to compensate for the challenging network conditions.
For metro clusters (geographically dispersed clusters), I have always made sure that the number of vSphere HA heartbeat datastores is set to a minimum of four, as some of the most influential bloggers have evangelized over and over. Manually select site-local datastores, two for each site, to maintain heartbeating even when the sites are isolated from each other.
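The "two per site" rule is easy to check mechanically. Below is a hedged Python sketch that groups a datastore inventory by site and picks two site-local datastores for each one, failing loudly when a site cannot supply two. The `(name, site)` inventory format, the datastore names and the alphabetical pick are all hypothetical; in a real cluster you would choose based on which arrays and paths are truly local. To my knowledge, raising the count above the default of two is done with the HA advanced option `das.heartbeatDsPerHost`, and the datastores themselves are then chosen in the cluster's datastore heartbeating settings.

```python
from collections import defaultdict


def pick_heartbeat_datastores(datastores, per_site=2):
    """Choose `per_site` site-local datastores for each site.

    datastores: iterable of (name, site) tuples (hypothetical inventory format).
    Returns a dict mapping site -> list of chosen datastore names.
    Raises ValueError if a site does not have enough local datastores.
    """
    by_site = defaultdict(list)
    for name, site in datastores:
        by_site[site].append(name)

    selection = {}
    for site, names in sorted(by_site.items()):
        if len(names) < per_site:
            raise ValueError(
                f"site {site!r} has only {len(names)} local datastore(s), need {per_site}")
        # Alphabetical pick is arbitrary; replace with your own locality criteria.
        selection[site] = sorted(names)[:per_site]
    return selection


if __name__ == "__main__":
    inventory = [
        ("DS-A-01", "SiteA"), ("DS-A-02", "SiteA"), ("DS-A-03", "SiteA"),
        ("DS-B-01", "SiteB"), ("DS-B-02", "SiteB"),
    ]
    chosen = pick_heartbeat_datastores(inventory)
    print(chosen, "total:", sum(len(v) for v in chosen.values()))
```

With two sites, two per site gives the minimum of four heartbeat datastores, and each site keeps two of its own even if the inter-site link goes away.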
DPM & WoL (Wake-on-LAN)
We all know DPM (Distributed Power Management) (http://www.vmware.com/files/pdf/Distributed-Power-Management-vSphere.pdf), yet few are using it, primarily because the majority of today’s datacenters in Canada (and I am not talking about cloud providers here) are not yet concerned by the latest legislation around power consumption (but that will change).
A few conditions are required, though. First, each host’s vMotion networking link must be working correctly (obvious, no?). The vMotion network should also be a single IP subnet, not multiple subnets separated by routers (Layer 2, guys, Layer 2)…
Maybe obvious but too often overlooked: the vMotion NIC on each host must support WoL. Just as important, the switch port that each WoL-supporting vMotion NIC is plugged into should be set to auto-negotiate the link speed rather than to a fixed speed (for example, 1000 Mb/s). Many NICs support WoL only if they can drop to 100 Mb/s or less when the host is powered off… Keep it in mind.
Overall, if you use DPM with WoL, remember that hosts are contacted on their **vMotion interfaces**, so the NICs associated with vMotion must support WoL and must be part of the same Layer 2 domain.
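The checklist above lends itself to a quick pre-flight check. Here is a minimal Python sketch, assuming you have already collected each host's vMotion NIC details into a hypothetical `VMotionNic` record (the class, its fields and the sample hosts are illustrative, not a vSphere API). Note that a shared IP subnet is only a proxy for "same Layer 2 domain"; it does not prove the VLANs are actually stretched.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class VMotionNic:
    """Hypothetical description of a host's vMotion NIC and its switch port."""
    host: str
    ip: str                    # vMotion VMkernel address, e.g. "10.0.50.11"
    prefix_len: int            # subnet prefix length, e.g. 24
    supports_wol: bool
    switch_port_autoneg: bool  # False if the port is hard-set (e.g. 1000 Mb/s)


def dpm_wol_issues(nics):
    """Return human-readable problems found against the DPM/WoL checklist."""
    issues = []
    # All vMotion interfaces should live in one IP subnet (no routers in between).
    subnets = {ipaddress.ip_interface(f"{n.ip}/{n.prefix_len}").network for n in nics}
    if len(subnets) > 1:
        issues.append(f"vMotion NICs span {len(subnets)} subnets; use a single Layer 2 subnet")
    for n in nics:
        if not n.supports_wol:
            issues.append(f"{n.host}: vMotion NIC does not support WoL")
        if not n.switch_port_autoneg:
            issues.append(f"{n.host}: switch port uses a fixed speed instead of auto-negotiation")
    return issues


if __name__ == "__main__":
    nics = [
        VMotionNic("esx01", "10.0.50.11", 24, True, True),
        VMotionNic("esx02", "10.0.50.12", 24, True, False),
        VMotionNic("esx03", "10.0.60.13", 24, False, True),
    ]
    for issue in dpm_wol_issues(nics):
        print(issue)
```

An empty list means the hosts pass these particular checks; it does not replace actually testing that each host can be put in standby and woken up again.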
vMotion Bandwidth
I get this question often, from many, many, many people: what should I consider for my vMotion bandwidth? We tend to assume that bigger is better. When you have that option, it surely won’t hurt, but the requirements behind 10 Gbps go far beyond the speed itself: the cabling, the NICs involved and the L2 switches all have to be considered, which usually drives costs up when the requirements are not actually there.
Yes, bandwidth is highly important and we should pay close attention to it. With enough bandwidth, the cluster can reach a balanced state more quickly, resulting in better resource allocation (performance) for the VMs and therefore better VM density, which returns a higher ROI on hardware investments.
However, before jumping on the big boy, consider link aggregation and the overall cluster design. My take on the topic is that if you adequately forecast the cluster at the hardware level, ensure that all hosts are well architected, and give every VM the *appropriate amount of resources it requires to function*, vMotion traffic will fit smoothly within a 1 Gbps network.
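To put rough numbers on that trade-off, here is a back-of-the-envelope Python sketch estimating the minimum time to copy VM memory once across the vMotion link. It is a lower bound under stated assumptions, not how vMotion actually schedules its transfer: it ignores pages dirtied during the copy, concurrent migrations and protocol overhead beyond a hypothetical `efficiency` factor.

```python
def min_transfer_seconds(memory_gb: float, link_gbps: float,
                         efficiency: float = 1.0) -> float:
    """Lower bound on the time to copy memory once over a vMotion link.

    memory_gb: VM memory to move, in gigabytes (8 bits per byte).
    link_gbps: link speed in gigabits per second.
    efficiency: assumed usable fraction of the link (hypothetical; 1.0 = line rate).
    Re-copying of pages dirtied during the migration is ignored,
    so real migrations take longer than this estimate.
    """
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be > 0 and 0 < efficiency <= 1")
    return memory_gb * 8 / (link_gbps * efficiency)


if __name__ == "__main__":
    # Example figure: evacuating a host that carries 128 GB of VM memory.
    for link in (1, 10):
        print(f"{link:>2} Gbps: at least {min_transfer_seconds(128, link):.0f} s")
```

At line rate, evacuating 128 GB takes at least 1024 s (about 17 minutes) at 1 Gbps versus roughly 102 s at 10 Gbps. Whether 17 minutes for a maintenance-mode evacuation is acceptable is exactly the kind of design question to answer before paying for the bigger pipe.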
So it is important, but it should not be used to compensate for an architectural dysfunction. vMotion is cool, but moving transactional VMs around too often is not recommended. The target should be a properly planned cluster, with vMotion leveraged IF REQUIRED for maintenance, manual cluster balancing and DRS, rather than treated as an operational tool that VMs depend on. It is a “feature”, not a “function”.
Don’t forget: a VM performs best when it is well configured, runs on a well-equipped host, and is “moved around” as little as possible.
This post concludes the three-part series “What to get ready in your vSphere Cluster Design”. Little details here and there will make the overall difference in the budget you plan for your next fiscal year. I believe in providing the right information to decision makers to help them maximize the return on the investment originally made when acquiring the virtualization technology.
I keep stressing that vCOps should ALWAYS be considered and implemented quickly. If you’re looking for insights into your environment, it’s a must; rather than planning new capex, elevate the conversation and think like an investor.
If you paid $X in the past for something you wanted, chances are you would want to squeeze every bit of value out of it before considering an upgrade or a change. I see too many unbalanced clusters because some of the fundamentals are poorly implemented or wrongly leveraged.
Rely on someone you trust to help you design and brainstorm your environment; the more people involved the better, since the winner in the end will be your organization. I strongly believe it is a fundamental exercise we should ALL do when it comes time to evaluate where we should invest next.
I hope this helped a little, and I am looking forward to all your feedback.
Happy weekend friends!