Performance Testing a New CRM

This article summarizes the approach and lessons learned during the performance testing of a web-services implementation of a new CRM system.
Touch point applications connect to an integration layer using various data formats and protocols, and the integration layer converts each request to a normalized web-service call to the Customer Master Repository.


This integration had very strict requirements for performance: the average response time for read operations (as perceived by the touch points) must be below 120ms at a load of 10 transactions per second. There were similar, though less strict, requirements for update operations.

Instrumentation

The first step is to instrument the web-service and the integration layer for performance measurement. Each component logs the time required to process a request and how much of that time was spent in the component itself versus in the downstream components. A typical log file in the Integration Layer contains entries like the following:

06:38:22.158 |CMDR.getCustomer|start[ID, 51002799]|0|0|catalina-exec-178|1409830702158||
06:38:22.159 |CMDR.getCustomer|sendCmdrRequest|1|1|catalina-exec-178|1409830702158||
06:38:22.177 |CMDR.getCustomer|receivedCmdrResult[14]|19|18|catalina-exec-178|1409830702158||
06:38:22.177 |CMDR.getCustomerCarismaView|stop[ID, 51002799, 0, 0]|19|0|catalina-exec-178|1409830702158||

These log entries describe a transaction that started at 06:38:22: a getCustomer operation for the customer with Id 51002799. 1ms after the start, the integration layer (CIL) sent a request to CMDR; CMDR responded after 18ms, and CIL sent the response back immediately. The total time spent in CIL is 19ms.
Each “transaction” is identified in the log by its start time and the thread that executes it. Transactions can interleave because multiple threads are active at the same time. Inside a transaction there is a “start” event, a “stop” event, and possibly several intermediate events. These intermediate entries allow breaking down the time spent in the operation and help in troubleshooting. For example, we can identify whether the issue is in the preparation of the request, the access to the database, or the parsing of the response.
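Each entry follows the same pipe-delimited layout, so it is straightforward to parse. Below is a minimal sketch of such a parser (the field layout, assumed from the sample entries above, is: timestamp, operation, event, total elapsed ms, delta ms, thread, transaction start millis):

public class PerfLogEntry {
    final String timestamp;   // e.g. 06:38:22.158
    final String operation;   // e.g. CMDR.getCustomer
    final String event;       // e.g. start[ID, 51002799]
    final long totalMillis;   // elapsed since the transaction started
    final long deltaMillis;   // elapsed since the previous event
    final String thread;      // e.g. catalina-exec-178
    final long txStartMillis; // transaction start time in epoch millis

    PerfLogEntry(String line) {
        // naive split; trailing empty fields are dropped, which is fine for this sketch
        String[] f = line.split("\\|");
        timestamp     = f[0].trim();
        operation     = f[1].trim();
        event         = f[2].trim();
        totalMillis   = Long.parseLong(f[3].trim());
        deltaMillis   = Long.parseLong(f[4].trim());
        thread        = f[5].trim();
        txStartMillis = Long.parseLong(f[6].trim());
    }

    // transactions are identified by thread + start time, as described above
    String transactionKey() {
        return thread + "/" + txStartMillis;
    }
}

Entries sharing the same transaction key can then be grouped and their deltas charted.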
We have created tools that parse these log files (there can be multiple files across multiple servers) and chart the response times using JFreeChart. Below is an example of such a graph, showing outliers in the response time:



Another point of instrumentation is in the web-service itself, which returns the time spent in the service as part of the response:

<IdentifyCustomerResponse xmlns="urn:example.com:crm:v1"
  Version="1.6.1"
  Timestamp="2014-09-04T07:20:29-05:00"
  SystemType="CMDR"
  SystemId="rclnpmdr03"
  Duration="14"
  TransactionId="[sGQEndFaTH2AVBneqS6Hjw]">
    <Customer CustomerId="515732187" CustomerType="IND">
        <Profile OrgCd="ORG1" LastModifiedBy="CRM">
            <Person FirstName="NATASHA" LastName="SMITH" BirthDate="1990-08-27"/>
            …
        </Profile>

    </Customer>
</IdentifyCustomerResponse>

If the client has a log entry like the following for this response:

06:38:22.177 |CMDR.getCustomer|receivedCmdrResult[14]|19|18|catalina-exec-178|1409830702158||

we can determine that 4ms (the 18ms measured by the client minus the 14ms reported by the server) is spent in the network, in the TCP/IP stack of the OS, and inside the application container serializing and deserializing messages.
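As an illustration, here is a minimal sketch of how a client could compute this overhead by subtracting the server-reported time from its own measured elapsed time (the helper below is hypothetical; only the Duration attribute comes from the sample response above):

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class DurationCheck {
    // clientElapsedMillis is the time measured by the client around the web-service call
    public static long networkAndSerializationOverhead(String responseXml, long clientElapsedMillis)
            throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(responseXml.getBytes(StandardCharsets.UTF_8)));
        // Duration attribute on the response root element, in milliseconds
        long serverMillis = Long.parseLong(doc.getDocumentElement().getAttribute("Duration"));
        // client time minus server time = network + TCP/IP stack + (de)serialization
        return clientElapsedMillis - serverMillis;
    }
}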
Sometimes there is a discrepancy between the server-reported and client-reported times, which indicates network issues. Below is a sample spreadsheet showing that, for the transactions that took more than 3s on the client, the server time was minimal; we can infer that the problem is in the network.



Another interesting fact from this spreadsheet is that the issue happens periodically, on the hour (2AM, 3AM, …). We can infer that some activity on the network, possibly a backup, runs at the top of the hour and impacts the performance of the service.

Generating Load Programmatically


Once the code is instrumented for performance analysis we can use the data to troubleshoot performance issues in the case of an incident. We can also use it during performance testing. This section covers methods for generating load in the environment for the purpose of performance testing and troubleshooting.
One way to generate load is by using Java performance test libraries. For example, we successfully used databene.org’s ContiPerf library to convert JUnit tests into performance tests:

@Test
@PerfTest(threads = 5, invocations = 1000)
public void test3_GetAccountSummary() {
    logger.debug("Working in thread: " + Thread.currentThread().getName());
    GetAccountSummaryRequest rq = createRequest(GetAccountSummaryRequest.class);

    // each invocation targets a different member
    int memberNum = generateNextMemberNum();
    rq.setMemberNum(memberNum);

    GetAccountSummaryResponse rs = service.getAccountSummary(rq);

    if (rs.getFailure() != null) {
        logger.error("Received error: " + rs.getFailure().getError().get(0).getMessage());
    }
}

In this example we are using the library to run the test 1000 times across 5 threads. There are many other options to configure the performance test: a warm-up period, checking the results against required response times, and waiting a random time before launching the next test.
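For example, a required average response time and a ramp-up period could be expressed roughly as below. This is only a sketch based on ContiPerf's @PerfTest and @Required annotations (parameter names may vary between versions); note that the test class also needs a ContiPerfRule JUnit @Rule, omitted from the excerpt above, for these annotations to take effect:

@Test
@PerfTest(threads = 5, invocations = 1000, rampUp = 1000) // start threads 1 second apart
@Required(average = 120, max = 2000)                      // fail the run if these limits (in ms) are exceeded
public void test3_GetAccountSummary_withRequirements() {
    // same test body as above
}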

We also successfully used JMeter to create load on a web-service. The advantage of JMeter is that it allows simulating a realistic load made of a combination of requests. The following screenshot shows a JMeter configuration that runs 8 types of transactions with ratios of 0.8%, 1.2%, 58.78%, 0.94%, 0.19%, 26.48%, 10.77% and 0.89%.



Each of the transaction types takes its parameters from a CSV file – these appear in the test plan as CSV Data Set Config elements.
The output of the test runs is saved in CSV files and can be analyzed after the test. For example, the following graph shows that there were several cases when the Client Time was much larger than the Server Time during the execution of a JMeter test:

Another useful feature of JMeter is that it can execute tests that connect directly to the database, which helps isolate the performance issue even further.
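To produce comparisons like the one above, the CSV results can be post-processed after the run. Below is a minimal sketch that computes the average elapsed time per sampler label; it assumes the results were saved with a header row and the default label and elapsed columns, and it ignores the complications of quoted fields:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class JMeterCsvStats {
    public static void main(String[] args) throws Exception {
        Map<String, long[]> perLabel = new HashMap<>(); // label -> {sample count, total elapsed ms}
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            List<String> header = Arrays.asList(in.readLine().split(","));
            int labelCol = header.indexOf("label");
            int elapsedCol = header.indexOf("elapsed");
            String line;
            while ((line = in.readLine()) != null) {
                String[] f = line.split(","); // naive split, no quoted commas expected
                long[] acc = perLabel.computeIfAbsent(f[labelCol], k -> new long[2]);
                acc[0]++;
                acc[1] += Long.parseLong(f[elapsedCol]);
            }
        }
        perLabel.forEach((label, acc) ->
            System.out.printf("%s: %d samples, average %.1f ms%n", label, acc[0], (double) acc[1] / acc[0]));
    }
}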

Troubleshooting Performance Issues

Here are the steps we used to find the root cause of performance issues and then implement corrective actions.

Break down the response time by analyzing the logs

By looking at the client logs and the response times we try to identify in which of the following layers the time is spent: touch point client, network, integration layer, network, web-service, or database. We found that most often the issue is in the database, and sometimes in the network.

Correlate performance numbers with other variables in the environment
Most performance issues can be correlated with other things happening in the environments:
  • Database issues are correlated with other queries running on the same database
  • Network issues tend to be correlated with network congestion
  • Web-service issues are correlated with CPU utilization, memory utilization, or garbage collection
For this step, it is important to have fine-grained monitoring enabled in the environment. We worked with the system administrators and the operational support teams to gather these metrics and make them available during the troubleshooting process.

Operationalize performance monitoring

We have implemented processes to operationalize performance analysis. Performance data is collected at the transaction level and kept for the last 30 days, allowing us to go back in the case of an incident and identify the issue. We also ensured that enough data is available to perform root-cause analysis (CPU and memory logs, database logs).
Another method is to calculate aggregate metrics for the most important performance indicators: the moving average of the response time for the most common operations and the number of requests per second. These are exposed as JMX attributes, which can be displayed in a console and collected by typical monitoring tools.
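As a minimal sketch of exposing such a metric through JMX (the class and object names below are hypothetical, and the moving average is a simple exponential one):

// ResponseTimeStatsMBean.java: standard MBean interface, named implementation class + "MBean"
public interface ResponseTimeStatsMBean {
    double getAverageResponseTimeMillis();
    long getRequestCount();
}

// ResponseTimeStats.java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class ResponseTimeStats implements ResponseTimeStatsMBean {
    private static final double ALPHA = 0.1; // weight of the newest sample
    private final AtomicLong count = new AtomicLong();
    private volatile double movingAverage = 0.0;

    // called by the instrumented code after each request completes
    public synchronized void record(long elapsedMillis) {
        count.incrementAndGet();
        movingAverage = ALPHA * elapsedMillis + (1 - ALPHA) * movingAverage;
    }

    @Override public double getAverageResponseTimeMillis() { return movingAverage; }
    @Override public long getRequestCount() { return count.get(); }

    public static ResponseTimeStats register(String operation) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ResponseTimeStats stats = new ResponseTimeStats();
        // hypothetical object name; any domain/key scheme works
        server.registerMBean(stats, new ObjectName("crm.performance:type=ResponseTimeStats,operation=" + operation));
        return stats;
    }
}

Monitoring tools (or jconsole) can then read AverageResponseTimeMillis and RequestCount for each operation.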

Conclusions

Three activities have helped us with the performance testing and troubleshooting:
  • instrumentation: individual application events with timestamp and duration, plus aggregate metrics exposed through JMX
  • automated load generation with Java test functions (ContiPerf) and JMeter
  • data analytics and visualization using Excel and JFreeChart





