Enhance Sysplex Distributor to recover more quickly from dead distributing LPAR

Enhance Sysplex Distributor so that it recognises more quickly that the distributing LPAR is dead, allowing a backup distributing LPAR to take over the function sooner.
TCPIP sysplex autonomics allows a TCPIP stack to monitor itself and leave the sysplex when it is unhealthy. This works fine for many conditions including normal shutdown of the z/OS system or TCPIP stack. However, if the LPAR is completely dead, TCPIP cannot and does not do this.
Instead, depending on z/OS settings such as INTERVAL and ISOLATETIME, it can take 2-3 minutes for a backup sysplex distributor to take over. This causes loss of service, e.g. HTTP 404 errors, during this period. We are attempting to build a highly available environment where this 2-3 minute loss of service does not occur.
We have simulated this condition by deactivating the sysplex distributor LPAR through the hardware console.
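One way to put a number on the loss of service during such a test is to probe the application at a fixed interval while the LPAR is deactivated, then compute the longest run of failed probes. The following is only a minimal sketch: the function name `longest_outage` and the probe format are assumptions for illustration, and collecting the probes (for example, by sending HTTP requests to an application behind the distributed DVIPA) is not shown.

```python
from typing import Iterable, Optional, Tuple


def longest_outage(probes: Iterable[Tuple[float, bool]]) -> float:
    """Return the longest run of failed probes, in seconds.

    Each probe is (timestamp_seconds, succeeded). An outage starts at the
    first failed probe and ends at the next successful probe. If the last
    probe still fails, the outage is measured up to that last probe.
    """
    longest = 0.0
    outage_start: Optional[float] = None
    last_timestamp: Optional[float] = None

    for timestamp, ok in sorted(probes):
        last_timestamp = timestamp
        if not ok and outage_start is None:
            # First failure after a healthy period: the outage begins here.
            outage_start = timestamp
        elif ok and outage_start is not None:
            # Service is back: close the outage window.
            longest = max(longest, timestamp - outage_start)
            outage_start = None

    # Outage still open at the end of the recording.
    if outage_start is not None and last_timestamp is not None:
        longest = max(longest, last_timestamp - outage_start)

    return longest
```

The measured value is only as precise as the probe interval, so a 5-second interval can misjudge the true outage by up to roughly 5 seconds at each end.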

Idea priority

Medium

Guest | Nov 19, 2015

Due to processing by IBM, this request was reassigned to have the following updated attributes:
Brand - Servers and Systems Software
Product family - z Systems Software
Product - z/OS Communications Server

For record keeping, the previous attributes were:
Brand - WebSphere
Product family - Enterprise Networking
Product - z/OS Communications Server

Guest | Oct 12, 2012

This RFE is being closed because an alternative solution is available. The new SFM BCPii feature that was introduced in z/OS V1R11 can address this requirement by reducing the amount of time it takes for SFM to detect a dead LPAR from minutes to less than 10 seconds. Once enabled, this feature should address the Sysplex Distributor failure scenario described in this RFE. With its BCPii support enabled, SFM can promptly detect an LPAR that is down and drive the XCF exits for the TCP/IP XCF group on the remaining LPARs, triggering the necessary recovery operations (such as moving DVIPAs to designated backup LPARs). For more information on this feature, refer to the MVS Setting Up a Sysplex documentation:
http://publibz.boulder.ibm.com/cgi-bin/bookmgr_OS390/BOOKS/iea2f1c2/COVER?SHELF=all13be9&DT=20120814144655

SFM technology is key to detecting failed LPARs in a sysplex in a standard way, partitioning the failed system out of the sysplex, and notifying all components that exploit sysplex services of this action. Providing this type of functionality in every sysplex-exploiting function, such as Sysplex Distributor, would require significant duplication of effort and may lead to inconsistent and potentially incompatible implementations.

Guest | Aug 3, 2012

These are our settings for detecting a dead LPAR in the sysplex:

SYS1.PARMLIB(EXSPAT00)
SPINRCVY ABEND,TERM,ACR
SPINTIME=20

SYSS.PARMLIB(COUPLEBB)
INTERVAL(85)

SFM Policy
ISOLATETIME(0) SSUMLIMIT(2)

We settled on them as most suitable for GDPS with PPRC (and XRC).

Guest | Jun 8, 2012

I wasn't aware of the SFM enhancements. I'll need to work with my z/OS colleagues to understand if they can help us. I know we've reduced the detection time from around 3 minutes to around 2 minutes. However, there was some reluctance to reduce it further in case of false positives removing a healthy system from the sysplex. Maybe we can revise SFM further with the enhancements. I'm away for 2 weeks now so it could be some time before I can send a further update.

Guest | May 2, 2012

Thanks for taking the time to submit this requirement. As mentioned in the requirement, the focus of the Sysplex Autonomics support is indeed on self-health checks to determine if the local system is encountering health issues that prevent it from being a productive member of the TCP/IP sysplex group. For catastrophic errors on a given system, TCP/IP and other z/OS sysplex exploiters rely on the Sysplex Failure Management (SFM) component of the system to partition the failing system out of the sysplex. Once that occurs, all components exploiting XCF services are notified that the system is no longer part of the sysplex and initiate any appropriate recovery actions. How quickly SFM can partition the system out of the sysplex depends on the SFM policy that is in effect. There have been several enhancements to SFM in recent releases that significantly reduce the amount of time it takes for SFM to remove a system from the sysplex. When these enhancements are exploited, you should be able to get the time interval down to 5-10 seconds, versus the 2-3 minutes mentioned above. These SFM enhancements are described in a presentation from the recent SHARE conference (Atlanta, Winter 2012), "10850: Sysplex Failure Management (SFM): History and Proven Practice Setting". Here is a link to the presentation material:

https://share.confex.com/share/118/webprogram/Session10850.html

Question: Have you explored the latest SFM enhancements? If these SFM enhancements can dramatically reduce the outage time mentioned above, does that satisfy this requirement? If not, can you provide some additional rationale for why it does not? Thanks in advance for your time and feedback.
