With Microsoft Windows Fabric and its algorithms playing a key part in the Lync Server 2013 Front End pool infrastructure, understanding the underlying processes below will help you reason about the user experience and the considerations for Lync Server 2013 maintenance and patching.
First of all, there are two types of quorum in play here: the Pool quorum and the Replica Set quorum. The two should never be confused with each other.
Pool quorum
- For a pool that is already running, at least 50% of the servers must be online to maintain quorum.
- For a pool that is starting from a cold boot, 85% of the FE servers must be online for quorum to be achieved and for the FE services to start.
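The two thresholds above can be expressed as a quick check. This is an illustrative sketch in Python, not Lync code, and it assumes a simple "round up" reading of the 50% and 85% figures:

```python
# Illustrative sketch (not Lync code) of the pool-quorum rules described above,
# assuming a simple ceiling interpretation of the 50% and 85% thresholds.
import math

def has_pool_quorum(total_servers: int, online_servers: int, cold_boot: bool = False) -> bool:
    """Return True if the Front End pool would have quorum under these rules."""
    fraction = 0.85 if cold_boot else 0.50
    required = math.ceil(total_servers * fraction)
    return online_servers >= required

# A 10-server pool: a running pool needs 5 servers online,
# while a cold boot needs 9 (ceil of 8.5).
print(has_pool_quorum(10, 5))                  # running pool, 5 of 10 online
print(has_pool_quorum(10, 8, cold_boot=True))  # cold boot, 8 of 10 online
```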
Replica set quorum
What is a Replica Set?
To put it plainly, a Replica Set is the set of replicas of user data – with Lync Server 2013, there are up to three copies of user data, which together form the Replica Set. As we discussed in my earlier post, each Front End server will be associated with a Routing Group ID, and as users are added they are assigned to a primary, a secondary, and a backup secondary (tertiary) routing group replica. This can be found using
Get-CsUserPoolInfo sip:&lt;user SIP address&gt;
PS C:\> Get-CsUserPoolInfo sip:N1001001@contoso.com
Identity : N1001001@contoso.com
PrimaryPoolFQDN : Lync-Pool01.contoso.com
BackupPoolFQDN : Lync-Pool02.contoso.com
UserServicesPoolFQDN : Lync-Pool01.contoso.com
PrimaryPoolMachinesInPreferredOrder : {Site1-Server01.contoso.com, Site1-Server02.contoso.com, Site1-Server03.contoso.com}
BackupPoolMachinesInPreferredOrder : {Site2-Server01.contoso.com, Site2-Server02.contoso.com}
PrimaryPoolPrimaryRegistrar : Site1-Server01.contoso.com
PrimaryPoolBackupRegistrars : {Site1-Server02.contoso.com, Site1-Server03.contoso.com}
PrimaryPoolPrimaryUserService : Site1-Server01.contoso.com
PrimaryPoolBackupUserServices : {Site1-Server02.contoso.com, Site1-Server03.contoso.com}
BackupPoolPrimaryRegistrar : Site2-Server01.contoso.com
BackupPoolBackupRegistrars : {Site2-Server02.contoso.com}
BackupPoolPrimaryUserService : Site2-Server01.contoso.com
BackupPoolReplicaUserServices : {Site2-Server02.contoso.com}
So the normal process is that if Server01 goes down or is taken offline, Server02 will take over as primary. How quickly a new third copy is generated depends on how Server01 went down in the first place (unexpected or graceful shutdown), the load on the server, and the amount of resources the copy will take. Points to ponder:
- For example, if the number of users homed on the FE server in question is low, a new copy may be generated quickly.
- The initial wait before another copy is invoked is 15–30 minutes.
- If the server was taken down gracefully, the algorithm will wait even longer (perhaps to satisfy the 'coffee' situation: what if you shut down the server and have gone for a long coffee break?).
Now consider the situation where the primary and secondary replica servers (Server01 and Server02) are both offline. In this situation we have lost two of the three copies of the corresponding users' data, and as a result the Replica Set is now in quorum loss. What would happen here? Even though those users still have an online backup secondary (tertiary) copy, since Replica Set quorum is lost (2 of 3 copies lost), we have a partial outage: all users who have Server01 and Server02 as two of the three servers in their Replica Set will not be able to log in.
So, taking all the above into account, consider a Front End pool with 10 servers. If we lose two servers, the pool quorum is still intact (8 of 10 servers online), but the Replica Set is in quorum loss for every user who has those two servers as two of the three servers in their Replica Set.
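The partial-outage reasoning above can be sketched as a small check. This is an illustrative sketch in Python, not Lync code; the user-to-replica-set mapping is a hypothetical stand-in for the placement Windows Fabric actually manages:

```python
# Illustrative sketch (not Lync code): determine which users fall into
# Replica Set quorum loss when a given set of servers goes offline.
# Each user's routing group has three replica servers; losing two of the
# three copies means Replica Set quorum loss for that user.
def users_in_quorum_loss(replica_sets: dict, offline: set) -> list:
    """replica_sets maps a user (or routing group) to its three replica servers."""
    affected = []
    for user, servers in replica_sets.items():
        lost = sum(1 for s in servers if s in offline)
        if lost >= 2:  # 2 of 3 copies lost -> quorum loss, user cannot log in
            affected.append(user)
    return affected

# Hypothetical placement: alice loses quorum when Server01 and Server02 go down,
# bob is unaffected because only one of his three replicas is offline.
replica_sets = {
    "alice": ["Server01", "Server02", "Server03"],
    "bob":   ["Server03", "Server04", "Server05"],
}
print(users_in_quorum_loss(replica_sets, offline={"Server01", "Server02"}))
```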
How can we avoid such a situation?
That's where Upgrade Domains come into play. Upgrade domain is just a fancy name for a logical grouping of servers that can be taken offline at the same time for activities such as maintenance and patching. This is particularly important whilst patching: make sure that no more than one upgrade domain is taken offline at a time. For example, if we take servers belonging to two upgrade domains offline, the probability of having more than one replica of a Replica Set offline is extremely high, and that would result in the partial outage mentioned above.
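The "never more than one upgrade domain offline" rule lends itself to a pre-flight check. This is an illustrative sketch in Python, not Lync code; the domain map below is a hypothetical example:

```python
# Illustrative sketch (not Lync code): verify that a planned set of servers
# to take offline stays within a single upgrade domain, per the rule above.
def is_safe_to_patch(upgrade_domains: dict, planned: set) -> bool:
    """True if every server in `planned` belongs to the same upgrade domain."""
    touched = {name for name, servers in upgrade_domains.items()
               if any(s in planned for s in servers)}
    return len(touched) <= 1

# Hypothetical mapping mirroring the cmdlet output shown later in the post.
domains = {
    "UpgradeDomain1": ["Site1-Server01"],
    "UpgradeDomain2": ["Site1-Server02"],
}
print(is_safe_to_patch(domains, {"Site1-Server01"}))                    # one domain
print(is_safe_to_patch(domains, {"Site1-Server01", "Site1-Server02"}))  # two domains
```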
To find out how your member servers are classified into upgrade domains, run the following cmdlet:
Get-CsPoolUpgradeReadinessState | Select-Object -ExpandProperty UpgradeDomains
It will show results like the following.
PS C:\> Get-CsPoolUpgradeReadinessState | Select-Object -ExpandProperty UpgradeDomains
TotalFrontends : 1
TotalActiveFrontends : 1
TotalVoters : 1
TotalActiveVoters : 1
Name : UpgradeDomain1
Frontends : {Site1-Server01.Contoso.com}
IsReadyForUpgrade : True
TotalFrontends : 1
TotalActiveFrontends : 1
TotalVoters : 1
TotalActiveVoters : 1
Name : UpgradeDomain2
Frontends : {Site1-Server02.Contoso.com}
IsReadyForUpgrade : True
TotalFrontends : 1
TotalActiveFrontends : 1
TotalVoters : 1
TotalActiveVoters : 1
Name : UpgradeDomain4
Frontends : {Site1-Server03.Contoso.com}
IsReadyForUpgrade : True
In the above case, as you can see, no Upgrade Domain contains more than one server, so you can only take down one server at a time: patch it, bring it back up, and make sure the services have started before taking the next one offline.
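The one-server-at-a-time workflow can be outlined as a loop. This is an illustrative sketch in Python, not Lync code; drain, patch, and services_started are hypothetical callbacks standing in for the real maintenance steps (stopping services, installing updates, checking service state):

```python
# Illustrative sketch (not Lync code) of the serial patching workflow
# described above. The three callbacks are hypothetical placeholders.
import time

def patch_pool_serially(servers, drain, patch, services_started):
    for server in servers:
        drain(server)   # take the server offline gracefully
        patch(server)   # apply updates (and reboot if needed)
        while not services_started(server):
            time.sleep(30)  # wait for FE services before touching the next server
    return True
```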
I got the below screenshot from one of the TechEd videos to give you an idea of the number of upgrade domains in relation to the size of the Front End Pool.
A couple of things to keep in mind:
- Upgrade domains are assigned at the time the pool is defined in Topology Builder, and the assignment is controlled entirely by the Windows Fabric algorithm.
- Each Survivable Branch Server (SBS) and Survivable Branch Appliance (SBA) gets its own routing group, and all users assigned to that SBS/SBA are in that same routing group, which means all of its users are serviced by one FE server.
Hope this helps…
The post Replica Sets and user experience (considerations) for Lync Server 2013 maintenance or when applying patches appeared first on UnifiedMe.