Due to processing by IBM, this request was reassigned to have the following updated attributes:
Brand - Servers and Systems Software
Product family - z Systems Software
Product - z/OS Communications Server
For record keeping, the previous attributes were:
Brand - WebSphere
Product family - Enterprise Networking
Product - z/OS Communications Server
This RFE is being closed because an alternative solution is available. The SFM BCPii feature introduced in z/OS V1R11 can address this requirement by reducing the time it takes SFM to detect a dead LPAR from minutes to less than 10 seconds. Once enabled, this feature should address the Sysplex Distributor failure scenario described in this RFE. With BCPii support enabled, SFM can promptly detect an LPAR that is down and drive the XCF exits for the TCP/IP XCF group on the remaining LPARs, triggering the necessary recovery operations (such as moving DVIPAs to designated backup LPARs). For more information on this feature, refer to the MVS Setting Up a Sysplex documentation:
http://publibz.boulder.ibm.com/cgi-bin/bookmgr_OS390/BOOKS/iea2f1c2/COVER?SHELF=all13be9&DT=20120814144655
SFM technology is key to detecting failed LPARs in a sysplex in a standard way, partitioning the failed system out of the sysplex, and notifying all components that exploit sysplex services of this action. Providing this type of functionality in every sysplex-exploiting function, such as Sysplex Distributor, would require significant duplication of effort and could lead to inconsistent and potentially incompatible implementations.
These are our settings for detecting a dead LPAR in the sysplex:
SYS1.PARMLIB(EXSPAT00)
SPINRCVY ABEND,TERM,ACR
SPINTIME=20
SYSS.PARMLIB(COUPLEBB)
INTERVAL(85)
SFM Policy
ISOLATETIME(0) SSUMLIMIT(2)
We settled on them as most suitable for GDPS with PPRC (and XRC).
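The INTERVAL(85) value above lines up with the excessive-spin settings if one assumes the commonly cited rule that the spin-derived failure detection interval is (N + 1) * SPINTIME + 5, where N is the number of SPINRCVY actions. That rule is not stated in this thread and should be checked against the z/OS documentation for the release in use; the function name below is hypothetical and only illustrates the arithmetic.

```python
# Sketch only: assumes the rule (N + 1) * SPINTIME + 5 for the
# spin-derived failure detection interval. The rule itself is an
# assumption here and is not quoted from this thread.

def spin_failure_detection_interval(spintime_s, spinrcvy_actions):
    """Return the assumed spin-derived failure detection interval in seconds.

    spintime_s       -- SPINTIME value from EXSPATxx, in seconds
    spinrcvy_actions -- actions listed on SPINRCVY; the extra +1 stands for
                        the initial spin period before the first listed action
    """
    return (len(spinrcvy_actions) + 1) * spintime_s + 5


# Values taken from the settings quoted above.
settings = {
    "SPINTIME": 20,
    "SPINRCVY": ["ABEND", "TERM", "ACR"],
    "INTERVAL": 85,
}

derived = spin_failure_detection_interval(
    settings["SPINTIME"], settings["SPINRCVY"]
)
print(derived)                          # 85
print(derived == settings["INTERVAL"])  # True
```

Under that assumption, the configured INTERVAL cannot be lowered much without also revisiting SPINTIME or SPINRCVY, which may be part of why detection time has stayed in the range of minutes.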
I wasn't aware of the SFM enhancements. I'll need to work with my z/OS colleagues to understand if they can help us. I know we've reduced the detection time from around 3 minutes to around 2 minutes. However, there was some reluctance to reduce it further in case of false positives removing a healthy system from the sysplex. Maybe we can revise SFM further with the enhancements. I'm away for 2 weeks now so it could be some time before I can send a further update.
Thanks for taking the time to submit this requirement. As mentioned in the requirement, the focus of the Sysplex Autonomics support is indeed on self-health checks that determine whether the local system is encountering health issues that prevent it from being a productive member of the TCP/IP sysplex group. For catastrophic errors on a given system, TCP/IP and other z/OS sysplex exploiters rely on the Sysplex Failure Management (SFM) component of the system to partition the failing system out of the sysplex. Once that occurs, all components exploiting XCF services are notified that the system is no longer part of the sysplex and initiate any appropriate recovery actions. How quickly SFM can partition the system out of the sysplex depends on the SFM policy that is in effect. Several SFM enhancements in recent releases significantly reduce the amount of time it takes for SFM to remove a system from the sysplex. When these enhancements are exploited, you should be able to get the time interval down to 5-10 seconds, versus the 2-3 minutes mentioned above. These SFM enhancements are described in the following presentation from the recent SHARE conference (Atlanta, Winter 2012), "10850: Sysplex Failure Management (SFM): History and Proven Practice Setting". Here is a link to the presentation material:
https://share.confex.com/share/118/webprogram/Session10850.html
Question: Have you explored the latest SFM enhancements? If these SFM enhancements can dramatically reduce the outage time mentioned above, does that satisfy this requirement? If not, can you provide some additional rationale on why it does not? Thanks in advance for your time and feedback.