IBM Z Hardware and Operating Systems Ideas Portal


Status Not under consideration
Workspace z/OS
Created by Guest
Created on Apr 20, 2012

Enhance Sysplex Distributor to recover more quickly from dead distributing LPAR

Enhance Sysplex Distributor so that it recognises more quickly that the distributing LPAR is dead, allowing a backup distributing LPAR to take over the function sooner.
TCPIP sysplex autonomics allows a TCPIP stack to monitor itself and leave the sysplex when it is unhealthy. This works fine for many conditions including normal shutdown of the z/OS system or TCPIP stack. However, if the LPAR is completely dead, TCPIP cannot and does not do this.
Instead, depending on z/OS settings such as INTERVAL and ISOLATETIME, it can take 2-3 minutes for a backup sysplex distributor to take over. This causes a loss of service (e.g. HTTP 404 errors) during this period. We are attempting to build a highly available environment in which this 2-3 minute loss of service does not occur.
We have simulated this condition by deactivating the sysplex distributor LPAR through the hardware console.
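
The length of the service gap during such a test can be measured from the client side. The following is a minimal sketch, not part of the original request: it polls a TCP endpoint once per interval and reports the longest run of failed probes. The function names, the address `192.0.2.10`, and port 80 are hypothetical placeholders for a distributed DVIPA and its service port.

```python
import socket
import time
from typing import Callable, List, Tuple


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def monitor(probe: Callable[[], bool], duration: float,
            interval: float = 1.0) -> List[Tuple[float, bool]]:
    """Call probe() repeatedly for `duration` seconds; record (timestamp, ok)."""
    samples = []
    end = time.monotonic() + duration
    while time.monotonic() < end:
        samples.append((time.monotonic(), probe()))
        time.sleep(interval)
    return samples


def longest_outage(samples: List[Tuple[float, bool]]) -> float:
    """Longest time from the first failed probe of a run to the next success.

    An outage that is still in progress at the last sample is not counted.
    """
    longest = 0.0
    outage_start = None
    for timestamp, ok in samples:
        if not ok and outage_start is None:
            outage_start = timestamp          # outage begins
        elif ok and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None               # service restored
    return longest


# Hypothetical usage during a failover test (address and port are assumptions):
#   samples = monitor(lambda: tcp_probe("192.0.2.10", 80), duration=300)
#   print(f"longest outage: {longest_outage(samples):.1f} s")
```

The result is only as precise as the probe interval and connect timeout, so a one-second interval is enough to tell a 2-3 minute takeover from one that completes in seconds.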

Idea priority Medium
  • Guest
    Nov 19, 2015

    Due to processing by IBM, this request was reassigned to have the following updated attributes:
    Brand - Servers and Systems Software
    Product family - z Systems Software
    Product - z/OS Communications Server

    For record keeping, the previous attributes were:
    Brand - WebSphere
    Product family - Enterprise Networking
    Product - z/OS Communications Server

  • Guest
    Oct 12, 2012

    This RFE is being closed because an alternative solution is available. The new SFM BCPii feature that was introduced in z/OS V1R11 can address this requirement by reducing the amount of time it takes for SFM to detect a dead LPAR from minutes to less than 10 seconds. Once enabled, this feature should address the Sysplex Distributor failure scenario described in this RFE. With its BCPii support enabled, SFM can promptly detect an LPAR that is down and drive the XCF exits for the TCP/IP XCF group on the remaining LPARs, triggering the necessary recovery operations (such as moving DVIPAs to designated backup LPARs, etc.). For more information on this feature, refer to the MVS Setting Up a Sysplex documentation:
    http://publibz.boulder.ibm.com/cgi-bin/bookmgr_OS390/BOOKS/iea2f1c2/COVER?SHELF=all13be9&DT=20120814144655

    SFM technology is key to detecting failed LPARs in a sysplex in a standard way, partitioning the failed system out of the sysplex, and notifying all components that exploit sysplex services of this action. Providing this type of functionality in every sysplex-exploiting function, such as Sysplex Distributor, would require significant duplication of effort and may lead to inconsistent and potentially incompatible implementations.

  • Guest
    Aug 3, 2012

    These are our settings for detecting a dead LPAR in the sysplex:

    SYS1.PARMLIB(EXSPAT00)
    SPINRCVY ABEND,TERM,ACR
    SPINTIME=20

    SYSS.PARMLIB(COUPLEBB)
    INTERVAL(85)

    SFM Policy
    ISOLATETIME(0) SSUMLIMIT(2)

    We settled on them as most suitable for GDPS with PPRC (and XRC).
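
As a rough cross-check of the settings above, the arithmetic can be sketched as follows. This assumes the spin failure detection interval described in MVS Setting Up a Sysplex, (N + 1) * SPINTIME + 5 with N the number of SPINRCVY actions, and it assumes that SFM starts isolating a system no earlier than INTERVAL + ISOLATETIME seconds after its last status update. The function names are illustrative only, and the result is a lower bound: the partitioning itself and the DVIPA takeover that follows add further time.

```python
def spin_failure_detection_interval(spintime: int, spinrcvy_actions: int) -> int:
    """(N + 1) * SPINTIME + 5; the +1 accounts for the implicit initial SPIN action."""
    return (spinrcvy_actions + 1) * spintime + 5


def earliest_isolation(interval: int, isolatetime: int) -> int:
    """Assumed lower bound, in seconds, before SFM begins isolating a silent system."""
    return interval + isolatetime


# Values quoted in the comment above:
# SPINTIME=20, SPINRCVY ABEND,TERM,ACR (3 actions), INTERVAL(85), ISOLATETIME(0)
spin_fdi = spin_failure_detection_interval(spintime=20, spinrcvy_actions=3)
bound = earliest_isolation(interval=85, isolatetime=0)

print(spin_fdi)  # 85, matching INTERVAL(85)
print(bound)     # 85 seconds before isolation can start
```

Under these assumptions, INTERVAL(85) is the smallest value consistent with the spin settings, so roughly 85 seconds of the takeover delay comes from heartbeat-based detection alone.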

  • Guest
    Jun 8, 2012

    I wasn't aware of the SFM enhancements. I'll need to work with my z/OS colleagues to understand if they can help us. I know we've reduced the detection time from around 3 minutes to around 2 minutes. However, there was some reluctance to reduce it further in case of false positives removing a healthy system from the sysplex. Maybe we can revise SFM further with the enhancements. I'm away for 2 weeks now so it could be some time before I can send a further update.

  • Guest
    May 2, 2012

    Thanks for taking the time to submit this requirement. As mentioned in the requirement, the focus of the Sysplex Autonomics support is indeed on self-health checks to determine whether the local system is encountering health issues that prevent it from being a productive member of the TCP/IP sysplex group. For catastrophic errors on a given system, TCP/IP and other z/OS sysplex exploiters rely on the Sysplex Failure Management (SFM) component of the system to partition the failing system out of the sysplex. Once that occurs, all components exploiting XCF services are notified that the system is no longer part of the sysplex and initiate any appropriate recovery actions. How quickly SFM can partition the system out of the sysplex depends on the SFM policy that is in effect. There have been several enhancements to SFM in recent releases that significantly reduce the amount of time it takes for SFM to remove a system from the sysplex. When these enhancements are exploited, you should be able to get the time interval down to 5-10 seconds vs. the 2-3 minutes mentioned above. These SFM enhancements are described in the following presentation from the recent SHARE conference (Atlanta, Winter 2012), "10850: Sysplex Failure Management (SFM): History and Proven Practice Setting". Here is a link to the presentation material:

    https://share.confex.com/share/118/webprogram/Session10850.html

    Question: Have you explored the latest SFM enhancements? If these SFM enhancements can dramatically reduce the outage time mentioned above, does that satisfy this requirement? If not, can you provide some additional rationale on why it does not? Thanks in advance for your time and feedback.