In PART 1 https://florenttastet.wordpress.com/2014/08/10/what-to-get-ready-in-your-vsphere-cluster-design-part-1/ of this series of blogs we reviewed some vSphere best practices, and I want to thank everyone who shared feedback.
All feedback is greatly appreciated and welcome; it is a source of inspiration and gives a wider view of VMware's agility.
Last week's blog revealed how crucial a function DRS is. It's definitely a "must have" for organizations that need a complete, organic, self-adjusting, agile and somewhat "self-healing" option. Through some of its underlying features, DRS enables automated resource consumption and VM placement that best meet a cluster's needs, going beyond static resource assignments.
I like to call it "having an eye inside": being able to move virtual workloads to wherever best serves their needs, no matter when, how or why, without disrupting daily operations.
Today, in this PART 2, we'll cover SRM over vMSC, AutoDeploy, Restart Priority and the Management Network in under 1000 words (or so, LOL).
SRM over vMSC
First, vMSC and SRM are not the same and were not designed the same way. Both are concerned with disasters, but in different ways: disaster avoidance (vMSC) versus disaster recovery (SRM).
vSphere Metro Storage Cluster (vMSC) allows you to vMotion workloads between two locations to proactively prevent downtime. When you're aware of a coming service interruption in your primary datacenter, vMSC lets you move your virtual workloads from the primary site to the secondary site. Be aware of the distance limitations of such an activity: vMSC will only permit this vMotion in the following context:
- Some form of supported synchronous active/active storage architecture
- Stretched Layer 2 connectivity
- 622Mbps bandwidth (minimum) between sites
- Less than 5 ms latency between sites (10 ms with vSphere 5 Enterprise Plus/Metro vMotion)
- A single vCenter Server instance
Some of the pros of vMSC: non-disruptive workload migration (disaster avoidance), no need to deal with changing IP addresses, the potential to run active/active datacenters and balance workloads between them more easily, typically a near-zero RPO with an RTO of minutes, and it only requires a single vCenter Server instance (for cost-conscious decisions).
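To make the prerequisites above concrete, here is a minimal sketch that sanity-checks an inter-site link against the vMSC requirements listed above. The thresholds come straight from the list; the function name and inputs are hypothetical, not a VMware API.

```python
VMSC_MAX_LATENCY_MS = 5        # 10 ms with vSphere 5 Enterprise Plus / Metro vMotion
VMSC_MIN_BANDWIDTH_MBPS = 622  # minimum inter-site bandwidth

def vmsc_link_ok(latency_ms, bandwidth_mbps, stretched_l2, sync_storage,
                 metro_vmotion=False):
    """Return True if the inter-site link meets the vMSC requirements."""
    max_latency = 10 if metro_vmotion else VMSC_MAX_LATENCY_MS
    return (latency_ms < max_latency
            and bandwidth_mbps >= VMSC_MIN_BANDWIDTH_MBPS
            and stretched_l2
            and sync_storage)

# Example: 4 ms RTT, 1 Gbps link, stretched L2, synchronous replication
print(vmsc_link_ok(4, 1000, True, True))  # True
```

If any of these checks fails, vMSC is off the table and the conversation naturally moves to SRM.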
SRM focuses on automating the recovery of workloads that unexpectedly fail. Once you inject SRM into your infrastructure, replication occurs between a storage source and destination, and a piece of software holds the restart order of the VMs. Note that I said "restart": it means that at a certain moment there will be a loss of service, the time for SRM to restart the affected VMs (those protected by SRM). The requirements for an SRM architecture are:
- Some form of supported storage replication (synchronous or asynchronous)
- Layer 3 connectivity
- No minimum inter-site bandwidth requirements (driven by SLA/RPO/RTO)
- No maximum latency between sites (driven by SLA/RPO/RTO)
- At least two vCenter Server instances
SRM has advantages that should not be ignored: it can define startup orders (with prerequisites), there is no need for stretched Layer 2 connectivity (though it is supported), it can simulate workload mobility without affecting production, and it supports multiple vCenter Server instances (including in Linked Mode).
Choosing carefully will lead to success, as always; considering what each product can't do, instead of what it can do, alleviates surprises. First and foremost, make sure you understand the business objectives behind the requirement, because the two fundamentally do not address the same needs. Look at your RTO and RPO http://en.wikipedia.org/wiki/Recovery_point_objective (or ask the questions) and the answers will provide strong guidance.
In short, if very high availability and/or non-disruptive VM migration between datacenters is required, use vMSC; otherwise leverage SRM. Involve your storage team in the discussions: vMSC relies heavily on the storage array manufacturers, so you'll need to consider their capabilities and limitations.
SRM has no such storage-capability requirement.
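The rule of thumb above can be written down as a tiny decision helper. This is purely illustrative, my own encoding of the guidance in this post, with invented names and a simplistic view of RPO:

```python
def pick_solution(need_nondisruptive_migration, rpo_minutes):
    """Rough chooser: live inter-site migration or near-zero RPO -> vMSC,
    everything else -> SRM (driven by SLA/RPO/RTO)."""
    if need_nondisruptive_migration or rpo_minutes == 0:
        return "vMSC"
    return "SRM"

print(pick_solution(False, 60))  # SRM: an hour of data loss is acceptable
print(pick_solution(True, 15))   # vMSC: workloads must move without downtime
```

Real decisions involve budget, distance and array support, of course; this just captures the first-order question.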
AutoDeploy
Know Norton Ghost? Then welcome to the 21st century of automated image deployment. Typically used in large environments to cut overhead, vSphere Auto Deploy can provision multiple physical hosts with ESXi.
Centrally store a standardized image for your hosts and apply it to newly added servers. Leveraging Host Profiles ensures the deployed image complies with all the functions and features configured on the rest of the hosts in the cluster, providing a standardized cluster.
When a physical host set up for Auto Deploy is turned on, Auto Deploy uses a PXE boot infrastructure along with vSphere host profiles to provision and customize that host.
Now equipped with a GUI, Auto Deploy stores the information for the ESXi hosts to be provisioned in different locations. Information about the location of image profiles and host profiles is initially specified in the rules that map machines to image profiles and host profiles. When a host boots for the first time, vCenter creates a corresponding host object and stores the information in the database. https://labs.vmware.com/flings/autodeploygui
Unless you are passionate about manual installation, Auto Deploy cuts the time it takes to manually set up your hosts, and it can apply a policy ensuring each host complies with the rest of the cluster. When you don't perform a provisioning task daily, errors can occur; Auto Deploy stores the configurations and applies them on demand.
Remember to always update the "referenced" image if you change something on the hosts in the cluster; it will minimize troubleshooting. Also consider using the hardware Asset Tag to group the ESXi hosts, to limit the number of rule-set patterns and ease administration.
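The asset-tag grouping idea can be sketched in a few lines. This is a toy model of a rule set, not the Auto Deploy PowerCLI syntax; the tags, image names and profile names are invented for illustration:

```python
# One rule per asset-tag prefix covers a whole group of hosts,
# keeping the number of rule-set patterns small.
RULES = [
    # (asset-tag prefix, image profile, host profile)
    ("CLUSTER-A", "ESXi-std-image", "hp-cluster-a"),
    ("CLUSTER-B", "ESXi-std-image", "hp-cluster-b"),
]

def match_rule(asset_tag):
    """Return the (image, host profile) pair for a booting host's tag."""
    for prefix, image, host_profile in RULES:
        if asset_tag.startswith(prefix):
            return image, host_profile
    return None  # no matching rule: host is not auto-deployed

print(match_rule("CLUSTER-A-0042"))  # ('ESXi-std-image', 'hp-cluster-a')
```

Two rules here cover any number of hosts in the two clusters; adding a host means setting its asset tag, not writing a new rule.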
Restart Priority
Bringing the back end online first, then the front end, is an art. When the back ends are databases, they need to be back online before the front end can communicate with them adequately.
Part of the HA architecture, the restart priority of a VM or service ensures that the targeted services or VMs come online in an orchestrated manner.
Configuring restart priority of a VM is not a guarantee that VMs will actually be restarted in this order. You’ll need to ensure proper operational procedures are in place for restarting services or VMs in the appropriate order in the event of a failure.
In case of host failure, virtual machines are restarted sequentially on new hosts, with the highest priority virtual machines first and continuing to those with lower priority until all virtual machines are restarted or no more cluster resources are available.
The values for this setting are: Disabled, Low, Medium (the default), and High.
If Disabled is selected, VMware HA is disabled for the virtual machine. The Disabled setting does not affect virtual machine monitoring, which means that if a virtual machine fails on a host that is functioning properly, that virtual machine is reset on that same host.
From the book:
The restart priority settings for virtual machines vary depending on user needs. VMware recommends assigning higher restart priority to the virtual machines that provide the most important services.
■ High. Database servers that will provide data for applications.
■ Medium. Application servers that consume data in the database and provide results on web pages.
■ Low. Web servers that receive user requests, pass queries to application servers, and return results to users.
Beware that if the number of host failures exceeds what admission control permits, the virtual machines with lower priority might not be restarted until more resources become available.
Management Network
To start, for the "Management Network" portgroup it is a best practice to combine different physical NICs connected to different physical switches, simply to increase resiliency. It seems simple and obvious, and yet too often I see it not done properly.
If the Management Network is lost, any kind of traffic outside VM traffic and vMotion is affected. Traffic between an ESXi host and any external management software travels through an Ethernet network adapter on the host; that Ethernet adapter becomes your main concern and should be handled with much caution.
Examples of external management software include the vSphere Client, vCenter Server, and SNMP clients. Surely we don't want to lose the vSphere Client, especially in stressful moments such as a disaster situation.
Yes, of course you can still jump into the command line and interact with each host, but remember that in environments with more than 5 hosts this becomes a struggle, and most of the policies are managed by vCenter anyway; so what good is the command line in emergencies where RTOs are tiny?
During the autoconfiguration phase, the ESXi host chooses vmnic0 for management traffic. You can override the default choice by manually choosing the network adapter that carries management traffic for the host. In some cases, you might want to use a Gigabit Ethernet network adapter for your management traffic. Another way to help ensure availability is to select multiple network adapters. Using multiple network adapters enables load balancing and failover capabilities
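As a quick self-check for the "double NICs, double switches" advice, here is a toy validator. The vmnic names and switch labels are invented; in practice you would read this from your vSwitch teaming configuration:

```python
def mgmt_network_resilient(uplinks):
    """uplinks: list of (vmnic, physical_switch) pairs for the
    Management Network portgroup. Resilient means at least two
    uplinks landing on at least two different physical switches."""
    switches = {switch for _, switch in uplinks}
    return len(uplinks) >= 2 and len(switches) >= 2

print(mgmt_network_resilient([("vmnic0", "sw-A"), ("vmnic4", "sw-B")]))  # True
print(mgmt_network_resilient([("vmnic0", "sw-A"), ("vmnic1", "sw-A")]))  # False
```

Two uplinks on the same switch pass a naive "redundancy" glance but fail the check: the switch itself is still a single point of failure.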
My advice on this point: do not underestimate the Management Network. Carefully plan for the worst and ensure continuity of service for yourself and the cluster in general.
From the field, SRM, vMSC, AutoDeploy, Restart Priority and the Management Network are critical points to consider at the same level as DRS, host affinity or Resource Pools, but they sit at the lower layer of the infrastructure and support the entire architecture.
In general, simplifying your operations will require some form of image automation for the hosts. Add it and let Auto Deploy push the image; make sure your network supports PXE boot, of course. Once deployed, the image needs to be managed, and for such management to be safe and reliable I strongly suggest you ensure the management network will not diminish the work you've put into the cluster architecture. Double up the connectivity… just in case.
Next blog will cover Permanent Device Loss Prevention, HA HeartBeat, DPM & WoL and vMotion bandwidth.
Happy weekend folks!