Alternative ways to provide High Availability for VMware Platform Services Controller (PSC)?

Over the past few months I have been looking at whether there is any value in designing a highly available Platform Services Controller (PSC) setup for our new vSphere 6 platform. One of the functions of a PSC is to provide authentication for VMware's product suites, so without a working PSC I would not be able to log in to my vCenter until the PSC is fixed or restored. I would lose management of the environment, but VMs that are already running would not be affected.

To set up the PSC in a high availability configuration for your vCenter, you need to use one of the supported third-party load balancers that VMware recommends. Below are the pros and cons for us at the time of design:

Note: At the time of writing, the only supported load balancing policy is Active/Passive. You will not be able to use an active/active configuration.

Pros:
  • Provides an "always on" PSC service
  • Provides automatic failover in the event of a failure or during PSC upgrades, so that the management of vCenter is not affected

Cons:
  • Could be costly if you do not already have a load balancer infrastructure in place
  • Our virtual infrastructure team would need a good understanding of how load balancers work so that they could troubleshoot in the event of a problem
  • Could complicate troubleshooting, as the load balancers are not managed by our team and we would rely on other teams' availability to help us
  • VMware would provide limited support with the configuration of your load balancer should problems arise
  • The pair of PSCs in a high availability setup would need to be of the same type, i.e. you cannot have one appliance and one Windows-based PSC in a pair
  • Only LTM (Local Traffic Manager) is supported and not cross-site GTM (Global Traffic Manager), so if you had two sites you would need another pair of load balancers


Could we do a Standby PSC?

With the release of vSphere 6 Update 1 there is a command that lets us repoint an instance of vCenter to a working PSC (within a site or across sites), which gives us the option of keeping a standby PSC that we fail over to manually. Because PSCs use multi-master replication (like Active Directory), data and configuration changes are technically replicated between them as soon as they are made.

So now, in the event of a failure, we could manually repoint our vCenter to our standby PSC.

Two KB articles describe how to repoint your vCenter: KB2113917 for repointing within the same site, and KB2131191 for repointing across sites.

So there is an official way to repoint your vCenter to another Platform Services Controller (PSC). If you have monitoring software in place, you could monitor the PSC VM and, if it goes down, run a script to repoint your vCenter to a healthy PSC.
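As a rough illustration of that monitoring idea, here is a minimal Python sketch. It assumes the script runs on the vCenter Server itself, that a TCP connection to port 443 is an acceptable (and fairly crude) health signal for a PSC, and that the same-site repoint is done with `cmsso-util repoint --repoint-psc`, the command covered by KB2113917. Please confirm the exact syntax against the KB for your build. The hostnames and function names are hypothetical, and cross-site repointing (KB2131191) involves extra steps that are not covered here.

```python
import socket
import subprocess
from typing import Callable, List, Optional, Sequence


def psc_is_reachable(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Crude health check (assumption): can we open a TCP connection to the PSC?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_failover_target(
    current: str,
    candidates: Sequence[str],
    is_up: Callable[[str], bool] = psc_is_reachable,
) -> Optional[str]:
    """Return the first healthy candidate, but only if the current PSC is down.

    Returns None when the current PSC is healthy or no candidate is reachable.
    """
    if is_up(current):
        return None
    for candidate in candidates:
        if candidate != current and is_up(candidate):
            return candidate
    return None


def build_repoint_command(target: str) -> List[str]:
    """Same-site repoint command; verify the syntax against KB2113917."""
    return ["cmsso-util", "repoint", "--repoint-psc", target]


def failover_if_needed(
    current: str,
    candidates: Sequence[str],
    is_up: Callable[[str], bool] = psc_is_reachable,
    dry_run: bool = True,
) -> Optional[List[str]]:
    """Build the repoint command if failover is needed; run it unless dry_run."""
    target = pick_failover_target(current, candidates, is_up)
    if target is None:
        return None
    command = build_repoint_command(target)
    if dry_run:
        print("Would run:", " ".join(command))
    else:
        subprocess.run(command, check=True)
    return command
```

Calling `failover_if_needed("psc01.example.local", ["psc02.example.local"])` with the default `dry_run=True` only prints the command it would run, which is probably the safer way to start. In practice you would likely want the check to fail several times in a row before repointing, so that a brief network blip does not trigger a failover.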

Below are two possible designs for my environment, which we will test in our lab to make sure they work:

Option 1: At each site I have two PSCs, one active and the other on standby
  • The local vCenter is connected to one of the local Platform Services Controllers. If the connected one fails then I have a standby one I can fail over to
  • If my links between the two sites fail then I can still run my vCenters independently and have local resiliency for my Platform Services Controllers
  • If both of my local Platform Services Controllers fail then I can repoint to one at my remote site and still get my vCenter back up and running, although I wouldn't recommend it because slow or high-latency links could cause issues
  • The downside is that I would need to manage and maintain four PSCs and ensure the standby ones are working (maybe switch over once a month?)
  • We would need to apply anti-affinity rules to keep the two PSCs at each site on separate hosts

Option 2: At each site just one PSC, each active for its respective site
  • If my links between the two sites fail then I can still run my vCenters independently
  • If my local Platform Services Controller fails then I can repoint to the one at my remote site and still get my vCenter back up and running, although I wouldn't recommend it because slow or high-latency links could cause issues
  • Simple design (fewer PSCs) and each PSC is always in use, so I don't need to do switchovers to test
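To make the difference between the two options concrete, the failover order each one implies can be sketched as a small Python function. This is only an illustration of the designs above: the site and PSC names are hypothetical, and the "same-site" and "cross-site" labels simply indicate which of the two KB procedures would apply.

```python
from typing import Dict, List, Tuple


def repoint_candidates(
    topology: Dict[str, List[str]], site: str, current: str
) -> List[Tuple[str, str]]:
    """List (psc, scope) pairs in preference order: same-site first, then cross-site."""
    local = [(psc, "same-site") for psc in topology[site] if psc != current]
    remote = [
        (psc, "cross-site")
        for other_site, pscs in topology.items()
        if other_site != site
        for psc in pscs
    ]
    return local + remote


# Hypothetical names for the two designs described above.
option1 = {"site-a": ["psc-a1", "psc-a2"], "site-b": ["psc-b1", "psc-b2"]}
option2 = {"site-a": ["psc-a1"], "site-b": ["psc-b1"]}

print(repoint_candidates(option1, "site-a", "psc-a1"))
print(repoint_candidates(option2, "site-a", "psc-a1"))
```

With Option 1 the first candidate is always the local standby, so a cross-site repoint is only a last resort. With Option 2 the only candidate is at the remote site, which is the trade-off for the simpler design.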

We decided to use appliances for the Platform Services Controllers instead of Windows installs because we expect supportability, security and upgrades to be easier when everything comes from one vendor, VMware. When we have issues we don't need to involve other teams such as Windows and Security.

Hopefully this gives you some ideas on the possible solutions and workarounds for your very important Platform Services Controller. We are now going to test our possible designs in our lab and verify that they are workable. I will update this once we have finished our testing and provide you with the scenarios that we tested against. Once again, we are looking at this from a point of view where we are only using vCenter with the PSC. If you have other VMware products using the PSC as well, such as vROPS or SRM, then you will need to investigate further to make sure all the other products support this method.

