Wednesday, 12 August 2009

Load Balancing Strategies for SOA Infrastructures

When designing SOA infrastructures or enterprise architectures, most architects overlook one component that can dramatically impact overall system performance: the hardware load balancer.

Typically, a hardware load balancer such as an F5 BigIP exists as a shared component for many applications and services. It distributes load across multiple servers or applications and ensures failover in case of hardware or software failure.

However, the typical setup of a load balancer is rarely suited to SOA services, as the following example shows:

A real-life sample SOA service is a large order system that handles the backend processing of customer orders. It was implemented with Oracle SOA Suite 10g and is based mainly on BPEL processes, which contain the business logic and orchestrate backend systems such as Product Catalog, Fulfillment and Billing. All requests (orders) come from a CRM portal.

This SOA implementation runs on an 8-node Linux-based BPEL cluster running Oracle Fusion Middleware 10g. In front of these eight nodes, an F5 BigIP distributes the load.

Initially, the load balancer was set up as follows:

- static round-robin to all 8 nodes
- session affinity (persistence) for a duration of one hour

The corresponding rule in the BigIP configuration looks like this:

persist source_addr_1h

Let’s look at the impact of these settings:

Session affinity (persistence) means that all requests coming from the same originating IP address are routed by the load balancer to the same cluster node. This rule is well suited to typical web applications, where it is desirable to keep one HTTP session on the same cluster node.

However, this setup has a severely negative impact on backend SOA services, as we will see shortly:

The SOA order service consists of one invocation endpoint and several partner links for calling the backend systems. Most of this communication is asynchronous; that is, the backend systems are called via one-way requests and use callbacks to return messages to the order system.

The first request to the order system comes from the CRM system and is routed to one node of the cluster (say "Node A"). Since all requests pass through the load balancer, every subsequent request coming from the CRM system will also be routed to the same cluster node, because the originating IP address is identical! The result is very poor overall load distribution: only one of the eight nodes receives most of the requests. When this node reaches 100% resource utilization, the system slows down dramatically even though the other seven nodes are almost idle.
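The skew can be illustrated with a small simulation (Python; the node names and the CRM IP address are made up for illustration): with source-address persistence, every request from the single CRM IP sticks to the node chosen for the very first request.

```python
NODES = ["node-a", "node-b", "node-c", "node-d",
         "node-e", "node-f", "node-g", "node-h"]

def route_with_affinity(requests, persistence):
    """Round-robin, but a source IP sticks to its first node."""
    counts = {n: 0 for n in NODES}
    rr = 0
    for src_ip in requests:
        if src_ip in persistence:
            node = persistence[src_ip]       # affinity hit: same node again
        else:
            node = NODES[rr % len(NODES)]    # round-robin pick for new IPs
            persistence[src_ip] = node
            rr += 1
        counts[node] += 1
    return counts

# 10,000 orders, all originating from the one CRM server IP
counts = route_with_affinity(["10.0.0.5"] * 10_000, {})
print(counts)  # one node gets all 10,000 requests, seven stay idle
```

With many distinct client IPs (a typical web application), the same persistence rule would still spread load evenly — the problem is specific to a single backend caller behind one IP.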

But it gets even worse.

First, the same effect applies to every callback from a backend system. If a callback (return message) is received from the Billing System, the same node will receive these answers repeatedly within one hour (though this need not be Node A, of course).

Second, the static round-robin algorithm does not take into account the state of each cluster node. If, for example, one cluster node is under heavy load because it is processing some complex orders and its CPU is at 100%, the load balancer will not recognize this but will keep routing requests to that node, causing overload and saturation.
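The load-blindness of static round-robin can be sketched in the same way (Python; the node names and CPU figures are hypothetical): the rotation keeps handing the saturated node its full share, while a load-aware pick would simply skip it.

```python
# Hypothetical CPU utilisation per node; node-a is saturated.
cpu_load = {"node-a": 100, "node-b": 10, "node-c": 15, "node-d": 5,
            "node-e": 20, "node-f": 10, "node-g": 5, "node-h": 10}

nodes = list(cpu_load)

# Static round-robin: the rotation ignores node state, so the
# saturated node still receives 1/8 of all traffic.
rr_picks = [nodes[i % len(nodes)] for i in range(16)]
print(rr_picks.count("node-a"))  # 2 of 16 requests go to the 100%-CPU node

# A load-aware picker would route to the least-loaded node instead:
best = min(cpu_load, key=cpu_load.get)
print(best)  # node-d (lowest CPU)
```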

In summary, a small misconfiguration of the load balancer leads to a system that does not use the hardware of the eight nodes effectively, cannot handle a large number of requests and does not scale well.

So what are the recommendations?

1. There should be no session affinity (persistence) at all for a runtime SOA system. There may be exceptions, for example at deployment time: when deploying SOA services with multiple artifacts (such as multiple BPEL processes, WSDLs and XSDs), the deployment should generally target the same cluster node first, to prevent excessive replication and inconsistencies in the cluster. This can be configured, for example, by using a dedicated deployment server or by setting up a virtual host as a deployment target.

2. BigIP offers more sophisticated load balancing algorithms than plain static round-robin. For example, you can use dynamic ratio load balancing (described in the chapter "Configuring servers for SNMP Dynamic Ratio load balancing" of the BigIP Reference Guide). This algorithm uses metrics that are calculated dynamically by SNMP agents running on each node. These agents are typically included in the Linux distribution and just need to be started on each node. The load balancer regularly queries the agents for their values, dynamically calculates the metrics and routes requests accordingly. As a result, the distribution of requests in each time frame is proportional to the metrics of each node.
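The effect of such a policy can be approximated in a few lines (Python; the per-node ratio values are made up, and this weighted random pick is a simplification, not BigIP's actual algorithm): requests are distributed proportionally to each node's ratio, so a busy node with a low ratio receives correspondingly few requests.

```python
import random

# Hypothetical per-node ratio derived from SNMP metrics
# (higher ratio = more spare capacity); node-a is busy.
ratios = {"node-a": 1, "node-b": 5, "node-c": 5, "node-d": 5,
          "node-e": 5, "node-f": 5, "node-g": 5, "node-h": 5}

def pick_node(ratios, rng=random):
    """Weighted random choice proportional to each node's ratio."""
    nodes, weights = zip(*ratios.items())
    return rng.choices(nodes, weights=weights, k=1)[0]

random.seed(42)
counts = {n: 0 for n in ratios}
for _ in range(3600):             # one hour of requests, one per second
    counts[pick_node(ratios)] += 1

# node-a (ratio 1) gets roughly 1/36 of the traffic,
# each healthy node roughly 5/36.
print(counts)
```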

An example might look like this:

monitor type real_server {
    interval 5
    timeout 16
    dest *.12345
    method "GET"
    cmd "GetServerStats"
    metrics "ServerBandwidth:1.5, CPUPercentUsage,
        MemoryUsage, TotalClientCount"
    agent "Mozilla/4.0 (compatible: MSIE 5.0; Windows NT)"
}

Configuring dynamic ratio load balancing requires the following settings on the BigIP side (example):

Interval:  1 minute
Timeout:  10 seconds
Community Strings:  "public"
SNMP Version: V2

The overall effect is that only nodes that are not under load will receive requests, and the overall distribution of load will be much more effective. All cluster nodes will be utilized, and the system will scale well up to the limit of all available nodes. Of course, once all eight cluster nodes are at 100%, even this algorithm cannot help anymore.

References:

- BigIP Reference Guide

- Oracle SOA Suite 11g Enterprise Deployment Guide

- Oracle SOA Suite 10g Best Practices

- Oracle Fusion Middleware 11g High Availability Guide
