

Troubleshooting: Workload Management Problems

Technote (troubleshooting)


Problem (Abstract)

This document helps you troubleshoot problems with the Workload Management component of IBM WebSphere Application Server. Working through it should resolve common issues with this component and save you time before calling IBM support.

Resolving the problem

Information about the Workload Management Service
The Workload Management (WLM) component in IBM WebSphere Application Server provides routing services for incoming application requests so that they can be distributed to application server resources such as Enterprise JavaBeans (EJB), servlets, and other server-side resources capable of processing them. WLM also provides failover capabilities when the servers hosting the applications are members of a cluster.
More details about Workload Management are available in the WebSphere Application Server V8.0 Information Center.

What symptoms are you experiencing?
  1. Application requests fail to get serviced and runtime exceptions are seen in the SystemOut logs

    • org.omg.CORBA.TRANSIENT: SIGNAL_RETRY: This is a transient exception raised while the workload management routing service is attempting to route a request to a target server.

      This is the exception thrown to the client when a request is sent out to a target server and no reply ever comes back. The exception informs the application that another target might be available to satisfy the request, but that the request could not be failed over transparently by WLM because the completion status was not determined to be “no”. In this case the client application needs to decide whether to resend the request (see the client sketch after this list of exceptions).

    • org.omg.CORBA.NO_IMPLEMENT: This exception is thrown when none of the servers participating in the workload management of the Enterprise JavaBeans (EJB) are available and the routing service cannot locate a suitable target for the request.

      The exception is raised, for example, if the cluster is stopped or if the application does not have a path to any of the cluster members. There are several kinds of NO_IMPLEMENT, which can be distinguished by the message or minor code associated with the exception.

    • NO_IMPLEMENT: Retry Limit Reached & NO_IMPLEMENT: Forward Limit Reached: Each of these exceptions is thrown when WLM, while attempting to route a request to a server, repeatedly receives an exception that is considered retryable, or the request keeps being forwarded.

      In order to avoid an infinite selection loop, these exceptions will be thrown if errors are received or forwarding is done on the same server ten consecutive times.

    • NO_IMPLEMENT: No Cluster Data: This exception, often seen in the Node Agent, is thrown when WLM is asked to make a selection over a cluster, but no data has been found or gathered for that particular cluster.

      This error is often seen for a short time when the first requests for a cluster are made after startup of a cell; in that case it can usually be resolved by setting the appropriate WLM custom properties to true. Exceptions of this type that persist should be reported to IBM support.

    • NO_IMPLEMENT: No Available Target: This is a more general exception meaning that WLM may have some cluster data (perhaps not all of it), but cannot find a valid target for the request from the data currently available.

      It is possible that members have been marked unusable, or simply that WLM does not yet have the current data necessary to route the request to the intended resource.

    • NoAvailableTargetException: This exception is internal to the IBM WLM code. You may see it printed in traces with the WLM trace specification enabled, but it is caught and handled inside WLM.

      This exception is often expected, especially in failover and startup scenarios. If a real problem exists, it manifests itself as one of the NO_IMPLEMENT exceptions above.
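
    The following minimal sketch shows how a client might react to the exceptions above. The JNDI name, the Account interface, and the debit method are hypothetical placeholders, not part of the product. Note also that, depending on the stub, a CORBA system exception may instead arrive wrapped in a java.rmi.RemoteException; unwrap it with getCause() in that case.

      import javax.naming.InitialContext;
      import javax.rmi.PortableRemoteObject;

      public class WlmAwareClient {

          // Hypothetical remote business interface; in a real client this
          // comes from the application's EJB client JAR.
          public interface Account extends java.rmi.Remote {
              void debit(double amount) throws java.rmi.RemoteException;
          }

          public static void main(String[] args) throws Exception {
              InitialContext ctx = new InitialContext();
              Account account = (Account) PortableRemoteObject.narrow(
                      ctx.lookup("ejb/com/example/AccountRemote"), // hypothetical JNDI name
                      Account.class);

              final int maxAttempts = 3;
              for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                  try {
                      account.debit(25.00); // hypothetical business method
                      break;                // request serviced; stop retrying
                  } catch (org.omg.CORBA.TRANSIENT e) {
                      // SIGNAL_RETRY: no reply came back and the completion
                      // status is not "no", so WLM could not fail over
                      // transparently. Resend only if the operation is safe
                      // to repeat from the application's point of view.
                      if (attempt == maxAttempts) throw e;
                  } catch (org.omg.CORBA.NO_IMPLEMENT e) {
                      // No routable target (for example, the cluster is
                      // stopped). Retrying in a tight loop rarely helps;
                      // surface the failure instead.
                      throw e;
                  }
              }
          }
      }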

  2. Enterprise JavaBeans requests are not distributed to all servers

    1. Make sure the target servers are started. Use the Administrative console to try starting them, or if a target server that is failing to service requests is already started, try restarting it.

    2. Try accessing the enterprise bean directly on the problem server; perhaps the issue is not related to workload management. A lookup sketch that targets a single server directly follows this list. If direct access fails, review the topic Enterprise bean cannot be accessed from a servlet, a JSP file, a stand-alone program, or other client in the WebSphere Application Server V8.0 Information Center.

    3. Check your configuration. Review the Troubleshooting the Workload Management component topics in the WebSphere Application Server V8.0 Information Center.

    4. Make sure the server is "in view" with respect to the HAManager and the core group the server belongs to. If the server is not in view, it may be islanded from the rest of the cell and not seen as available by the other servers, and consequently by the client.

    5. Make sure static routing is not enabled; this can cause the same islanding problem as above.

    6. Make sure the HAManager is enabled; disabling it also causes WLM to function improperly.
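
    For step 2, a direct lookup such as the following sketch can take cluster routing out of the picture. The host, bootstrap port, and JNDI name are hypothetical; use the problem server's BOOTSTRAP_ADDRESS end point and your application's actual binding.

      import java.util.Hashtable;
      import javax.naming.Context;
      import javax.naming.InitialContext;

      public class DirectLookupTest {
          public static void main(String[] args) throws Exception {
              Hashtable<String, String> env = new Hashtable<String, String>();
              env.put(Context.INITIAL_CONTEXT_FACTORY,
                      "com.ibm.websphere.naming.WsnInitialContextFactory");
              // Point straight at one cluster member's name server instead
              // of the cell-scoped name space, bypassing cluster routing.
              env.put(Context.PROVIDER_URL, "corbaloc:iiop:myhost:2810"); // hypothetical host:port

              InitialContext ctx = new InitialContext(env);
              Object ref = ctx.lookup("ejb/com/example/AccountRemote"); // hypothetical JNDI name
              System.out.println("Lookup on the problem server succeeded: " + ref);
          }
      }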

  3. Enterprise JavaBeans requests are not distributed evenly

    • Possible reasons for this behavior are:
      • Improper configuration
      • Environment issues, such as the availability of servers or applications
      • A large number of requests that involve transactional affinity
      • A small number of clients

    • Things to consider:
      • WLM sprays a variety of different requests (referred to as WLMable requests).

        If you are only tracking the spraying of a particular application request and it is unbalanced, that does not mean that WLM is spraying improperly. The classic example is a cluster of two members with the same server weights and a client application that loops on two WLMable requests, operationA and operationB. If a tracking system only looks at how operationA requests are being distributed, it will see them all go to one server. This is not a bug or a problem, because all operationB requests are being sent to the other server. This “pattern problem” is often seen in small test environments with only a few servers and is rarely seen in production systems with more cluster members.
      • The Workload Management service uses a weighted proportional scheme to distribute Enterprise JavaBeans requests.

        The WLM selection logic has certain feedback mechanisms that can change routing behavior on the fly. WLM reacts to various scenarios, and even to server load, when making routing decisions, so it is entirely possible for WLM to function perfectly while requests are not balanced exactly according to the configured server weights. For example, in a cluster of two machines where one is a powerful 8-way with lots of RAM and the other is a single-processor desktop machine, even with the server weights set to 2 for both servers you could see 80% or more of the requests go to the 8-way machine, simply because the desktop machine cannot keep up with it. A simplified sketch of weighted selection follows.
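
    The sketch below is a simplified illustration of weighted proportional selection, not the actual WLM algorithm: the real selection logic also folds in affinity and runtime feedback such as server load. The member names and weights are hypothetical.

      import java.util.LinkedHashMap;
      import java.util.Map;

      public class WeightedSelection {
          public static void main(String[] args) {
              // Hypothetical cluster members and their configured weights.
              Map<String, Integer> weights = new LinkedHashMap<String, Integer>();
              weights.put("member1", 2);
              weights.put("member2", 3);

              Map<String, Integer> remaining = new LinkedHashMap<String, Integer>(weights);
              for (int request = 1; request <= 10; request++) {
                  // Start a new cycle once every member's weight is used up.
                  if (remaining.values().stream().allMatch(w -> w == 0)) {
                      remaining.putAll(weights);
                  }
                  // Pick the member with the most remaining weight this cycle.
                  String target = remaining.entrySet().stream()
                          .max(Map.Entry.comparingByValue())
                          .get().getKey();
                  remaining.merge(target, -1, Integer::sum);
                  System.out.println("request " + request + " -> " + target);
              }
              // Over each cycle of five requests, member1 serves two and
              // member2 serves three, matching the 2:3 weight ratio.
          }
      }

    In a balanced, affinity-free steady state the proportions approach the weight ratio, which is why uneven hardware or affinity, rather than a WLM defect, usually explains skewed counts.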

  4. A failing server still receives Enterprise JavaBeans requests (failover fails)


    Some possible causes are:

    • The client might have been in a running transaction with an enterprise bean on the server that went down.

      This might be working as designed: the exception is allowed to flow back to the client because the transaction might have completed, and failing the request over to another server could result in it being serviced twice.

      The related function is referred to as “Quiesce mode”. Quiesce mode is entered when a server is asked to shut down. While in Quiesce mode, the server rejects all incoming requests that it determines to be new work, but still allows in-flight requests to complete. This is primarily designed to let transactional work finish, as above, and prevent unnecessary TRANSACTION_ROLLBACK exceptions, although requests other than transactional ones can also be allowed into the server. By default, Quiesce mode lasts for a maximum of 3 minutes (this is configurable), although a server can exit quiesce earlier if all registered components agree, based on their own criteria, that it is okay to do so. If a request is rejected by a server in quiesce mode, WLM receives an org.omg.CORBA.COMM_FAILURE with a completion status of “no”, and the request is automatically retried by WLM (see the completion-status sketch after this list).

    • If the requests sent to the servers consistently come back to the client with any other exceptions, it might be that no servers are available.

      In this case, follow the resolution steps as outlined in Troubleshooting the Workload Management component in the WebSphere Application Server V8.0 Information Center.
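
    The completion status mentioned above is what separates requests WLM can safely resend from those it cannot. A minimal sketch of that check, using the standard org.omg.CORBA types:

      import org.omg.CORBA.CompletionStatus;
      import org.omg.CORBA.SystemException;

      public class RetrySafety {
          // True only when the completion status proves the request never
          // executed on the server, which is the one case where a transparent
          // resend cannot cause the operation to run twice.
          static boolean safeToResend(SystemException se) {
              return se.completed == CompletionStatus.COMPLETED_NO;
          }
      }

    With COMPLETED_YES or COMPLETED_MAYBE the request may already have run, so the exception is returned to the application instead of being retried.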

  5. Stopped or hung servers do not share the workload after being restored


    This error occurs when the servers that were unavailable are not recognized by the Workload Management component after they are restored.

    There is an unusable interval, determined by the property com.ibm.websphere.wlm.unusable.interval, during which the workload manager does not send requests to a server that has been marked unusable. By default this is 5 minutes. You can confirm that this is the problem by ensuring that the server is up and running and then waiting for the unusable interval to elapse before checking whether failover occurs. If the server still does not participate in the workload, see symptom 2 above for additional reasons why this could occur.
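
    If the default is too long for your environment, the interval can be shortened by setting the property named above as a generic JVM argument on the JVM that performs the routing. The value is in seconds; 120 below is only an illustrative choice:

      -Dcom.ibm.websphere.wlm.unusable.interval=120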


  6. What to do next?

    If the above scenarios and steps did not resolve your problem, you can access the information available from the WebSphere Application Server Support site and browse current known problems and their resolutions.

    If no useful hints are found on the support site, see the MustGather for WLM problems and collect the requested information before opening a PMR.
