
Performance Testing a New CRM

Performance testing is challenging, frustrating, and often underestimated, typically getting attention only after an incident. How did we performance test, and what did we learn, during the development and implementation of web-services for a new CRM system?
Touch point applications connect to an integration layer using various data formats and protocols, and the integration layer converts each request into a normalized web-service call to the Customer Master Repository.




This integration had very strict requirements for performance: the average response time for read operations (as perceived by the touch points) must be below 120ms at a load of 10 transactions per second. There were similar, though less strict, requirements for update operations.

Instrumentation

The first step is to instrument the web-service and the integration layer for performance measurement. Each component logs the time required to process a request and how much of that time was spent in the component itself versus in the downstream (from left to right) components. A typical log file in the Integration Layer contains the following entries:

06:38:22.158 |CMDR.getCustomer|start[ID, 51002799]|0|0|catalina-exec-178|1409830702158||
06:38:22.159 |CMDR.getCustomer|sendCmdrRequest|1|1|catalina-exec-178|1409830702158||
06:38:22.177 |CMDR.getCustomer|receivedCmdrResult[14]|19|18|catalina-exec-178|1409830702158||
06:38:22.177 |CMDR.getCustomerView|stop[ID, 51002799, 0, 0]|19|0|catalina-exec-178|1409830702158||

These log entries are for a transaction that started at 06:38:22, for a GetCustomer operation for the customer with the Id 51002799. 1ms after the start, the integration layer (CIL) sent a request to CMDR; CMDR responded after 18ms, and CIL returned the response immediately. The total time spent in CIL is 19ms.
Each “transaction” is identified in the log by the time when it started and the thread that executed it. Transactions can interleave because multiple threads are active at the same time. Inside a transaction there is a “start”, a “stop” and possibly several intermediate events. These intermediate entries allow breaking down the time spent in the operation and help in troubleshooting. For example, we can identify whether an issue is due to the preparation of the request, the access to the database, or the parsing of the response.
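The instrumentation itself can be very small. Below is a minimal sketch of a timer that emits entries in this pipe-delimited format (class and method names are illustrative, not our actual implementation):

    // Illustrative sketch: emits timing entries in the format shown above.
    public class TransactionTimer {
        private final String operation;
        private final long startMillis = System.currentTimeMillis();
        private long lastEventMillis = startMillis;

        public TransactionTimer(String operation) {
            this.operation = operation;
        }

        // Logs an event with the total elapsed time and the delta since the previous event
        public void event(String name) {
            long now = System.currentTimeMillis();
            System.out.printf("%tT.%tL |%s|%s|%d|%d|%s|%d||%n",
                    now, now, operation, name,
                    now - startMillis, now - lastEventMillis,
                    Thread.currentThread().getName(), startMillis);
            lastEventMillis = now;
        }
    }

A component creates one timer per request and calls event("start[ID, 51002799]"), event("sendCmdrRequest"), and so on, ending with the "stop" entry.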
We have created tools that parse these log files (there can be multiple files across multiple servers) and create a chart of the response times using JFreeChart.
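The tools themselves are not shown here; a simplified sketch of such a parser, assuming JFreeChart 1.5 and the field layout of the sample entries above, could look like this:

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Date;
    import org.jfree.chart.ChartFactory;
    import org.jfree.chart.ChartUtils;
    import org.jfree.chart.JFreeChart;
    import org.jfree.data.time.Millisecond;
    import org.jfree.data.time.TimeSeries;
    import org.jfree.data.time.TimeSeriesCollection;

    public class ResponseTimeChart {
        public static void main(String[] args) throws Exception {
            TimeSeries series = new TimeSeries("Response time (ms)");
            for (String line : Files.readAllLines(Paths.get(args[0]))) {
                String[] fields = line.split("\\|");
                // Only the "stop" entries carry the total time for the transaction
                if (fields.length < 7 || !fields[2].startsWith("stop")) continue;
                long startEpochMs = Long.parseLong(fields[6].trim()); // transaction start
                long totalMs = Long.parseLong(fields[3].trim());      // total elapsed time
                // addOrUpdate keeps the last value if two transactions share a millisecond
                series.addOrUpdate(new Millisecond(new Date(startEpochMs)), totalMs);
            }
            JFreeChart chart = ChartFactory.createTimeSeriesChart(
                    "Response Times", "Time", "ms", new TimeSeriesCollection(series));
            ChartUtils.saveChartAsPNG(new java.io.File("response-times.png"), chart, 1200, 600);
        }
    }

Below is an example of such a graph showing outliers in the response time: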



Another point of instrumentation is in the web-service itself, which returns the time spent in the service as part of the response:

<IdentifyCustomerResponse xmlns="urn:example.com:crm:v1"
  Version="1.6.1"
  Timestamp="2014-09-04T07:20:29-05:00"
  SystemType="CMDR"
  SystemId="rclnpmdr03"
  Duration="14"
  TransactionId="[sGQEndFaTH2AVBneqS6Hjw]">
    <Customer CustomerId="515732187" CustomerType="IND">
        <Profile OrgCd="ORG1" LastModifiedBy="CRM">
            <Person FirstName="NATASHA" LastName="SMITH" BirthDate="1990-08-27"/>
            …
        </Profile>

    </Customer>
</IdentifyCustomerResponse>

If the client has a log entry for this response like this:

06:38:22.177 |CMDR.getCustomer|receivedCmdrResult[14]|19|18|catalina-exec-178|1409830702158||

we can determine that 4ms (the 18ms measured by the client minus the 14ms reported by the server) was spent in the network, in the TCP/IP stack of the OS, and inside the application container serializing and deserializing messages.
Sometimes there is a discrepancy between the server-reported and client-reported times, which indicates network issues. Below is a sample spreadsheet showing that, for transactions that took more than 3s on the client, the server time was minimal, so we can infer that the problem is in the network.



Another interesting fact from this spreadsheet is that the issue happens periodically, every hour (2AM, 3AM, …). We can infer that there is some activity on the network that runs on the hour, possibly a backup, which impacts the performance of the service.

Generating Load Programmatically


Once the code is instrumented for performance analysis, we can use the data to troubleshoot performance issues in the case of an incident. We can also use it during performance testing. This section covers methods for generating load in the environment for the purpose of performance testing and troubleshooting.
One way to generate load is to use Java performance test libraries. For example, we successfully used databene.org’s ContiPerf library to convert JUnit tests into performance tests:

    // ContiPerf rule that applies the @PerfTest settings to the JUnit tests
    @Rule
    public ContiPerfRule perfTestRule = new ContiPerfRule();

    @Test
    @PerfTest(threads = 5, invocations = 1000)
    public void test3_GetAccountSummary() {
        logger.debug("Working in thread: " + Thread.currentThread().getName());
        GetAccountSummaryRequest rq = createRequest(GetAccountSummaryRequest.class);

        // Use a different member number for each invocation
        int memberNum = generateNextMemberNum();
        rq.setMemberNum(memberNum);

        GetAccountSummaryResponse rs = service.getAccountSummary(rq);

        if (rs.getFailure() != null) {
            logger.error("Received error: " + rs.getFailure().getError().get(0).getMessage());
        }
    }

In this example we are using the library to run the test 1000 times across 5 threads. There are many other options to configure the performance test: a warm-up period, checking the results, waiting for a random time before launching the next invocation.
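For example, checking the results can be expressed declaratively with ContiPerf’s @Required annotation; here is a sketch tied to the 120ms requirement above (the exact thresholds are illustrative):

    @Test
    @PerfTest(invocations = 1000, threads = 5, rampUp = 200, warmUp = 2000)
    @Required(average = 120, max = 2000)
    public void test3_GetAccountSummary_withLimits() {
        // same body as above; ContiPerf fails the test if the average
        // response time exceeds 120ms or any single invocation exceeds 2s
    }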

We also successfully used JMeter to create load on a web-service. The advantage of JMeter is that it allows simulating a realistic load made of a combination of requests. The following screenshot shows a JMeter configuration that runs 8 types of transactions with a ratio of 0.8%, 1.2%, 58.78%, 0.94%, 0.19%, 26.48%, 10.77% and 0.89%.



Each of the transaction types takes its parameters from a CSV file – these appear in the configuration as CSV Data Set Config elements.
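For example, a hypothetical data file feeding member numbers to the GetAccountSummary sampler could be as simple as:

    memberNum
    51002799
    51002800
    51002801

with each JMeter thread reading the next row before issuing its request.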
The output of the test runs is saved in CSV files and can be analyzed after the test. For example, the following graph shows several cases where the Client Time is much larger than the Server Time during the execution of a JMeter test:

Another useful feature of JMeter is that it can execute tests by connecting directly to the database, which can isolate the performance issue even further.
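Outside JMeter, the same isolation can be done with a small program that times a query directly against the database. A sketch (the driver URL, credentials, and query are placeholders, not our actual configuration):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class DirectDbTiming {
        public static void main(String[] args) throws Exception {
            // Placeholder connection settings - replace with the actual database
            try (Connection con = DriverManager.getConnection(
                    "jdbc:vendor://dbhost:1521/crm", "user", "password");
                 PreparedStatement ps = con.prepareStatement(
                    "SELECT * FROM customer WHERE customer_id = ?")) {
                for (int i = 0; i < 100; i++) {
                    long t0 = System.nanoTime();
                    ps.setLong(1, 51002799L);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) { /* consume the result */ }
                    }
                    long ms = (System.nanoTime() - t0) / 1_000_000;
                    System.out.println("query " + i + ": " + ms + " ms");
                }
            }
        }
    }

If the times measured here match the server-reported times, the database is the bottleneck; if not, the overhead is in the layers above it.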

Troubleshooting Performance Issues

Here are the steps we used to find the root cause of performance issues and then implement corrective actions.

Break down the response time by analyzing the logs

By looking at the client logs and the response times, we try to identify whether the time is spent in one of the following layers: touch point client, network, integration layer, network again, web-service, or database. We found that most often the issue is in the database, and sometimes in the network.
Correlate performance numbers with other variables in the environment
Most performance issues can be correlated with other things happening in the environment:
  • Database issues are correlated with other queries running at the same time
  • Network issues tend to be correlated with congestion
  • Web-service issues are correlated with CPU utilization, memory utilization or garbage collection
For this step, it is important to have fine-grained monitoring enabled in the environment. We worked with the systems administrators and the operational support teams to gather these metrics and make them available during the troubleshooting process.

Operationalize performance monitoring

We have implemented processes to operationalize performance analysis. Performance data is collected at the transaction level and retained for 30 days, allowing us to go back in the case of an incident and identify the issue. We also ensured that enough data is available to perform root-cause analysis (CPU and memory logs, database logs).
Another method is to calculate aggregate metrics for the most important performance indicators: the moving average of the response time for the most common operations, and the number of requests per second. These are collected as JMX attributes, which can be displayed in a console and collected by typical monitoring tools.
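A minimal sketch of such a JMX attribute, assuming an exponential moving average and illustrative names (the real metrics and object names differ):

    // PerformanceStatsMBean.java - the management interface (must be public)
    public interface PerformanceStatsMBean {
        double getMovingAverageResponseTimeMs();
        long getRequestCount();
    }

    // PerformanceStats.java - records measurements and exposes them over JMX
    import java.util.concurrent.atomic.AtomicLong;

    public class PerformanceStats implements PerformanceStatsMBean {
        private final AtomicLong requestCount = new AtomicLong();
        private volatile double movingAverageMs;

        // Called by the instrumentation code after each request completes
        public void record(long responseTimeMs) {
            requestCount.incrementAndGet();
            // Exponential moving average: the latest sample weighs 5%
            movingAverageMs = 0.95 * movingAverageMs + 0.05 * responseTimeMs;
        }

        public double getMovingAverageResponseTimeMs() { return movingAverageMs; }
        public long getRequestCount() { return requestCount.get(); }
    }

    // Registered once at startup, e.g.:
    //   ManagementFactory.getPlatformMBeanServer().registerMBean(
    //       stats, new ObjectName("com.example.crm:type=PerformanceStats"));

Once registered, jconsole or any JMX-capable monitoring tool can read these attributes.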

Conclusions

Three activities have helped us with the performance testing and troubleshooting:
  • instrumentation: individual application events with timestamp and duration and aggregate metrics in JMX
  • automated test generation with Java test functions and JMeter
  • data analytics and visualization using Excel and JFreeChart





