Skip to main content

Performance Testing a New CRM

Performance testing  is challenging, frustrating, often underestimated typically getting attention only after an incident. How did we the performance test and what did we learn during the development and implementation of  web-services for a new CRM system?
Touch point applications connect to an integration layer using various data formats and protocols and the integration layer converts the request to a normalized web-service call to the Customer Master Repository.




This integration had very strict requirements for performance: the average response time for read operations (as perceived by the touch points) must be below 120ms at a load of 10 transactions per second. There were similar requirements, not as strict, for update operations.

Instrumentation

First step is to the instrumentation of the web-service and integration layer for the performance measuring. All the components log the time it required to process the request and how much of the time was spent in the component and in the downstream (from left to right) components. A typical log file in the Integration Layer contains the following entries:

06:38:22.158 |CMDR.getCustomer|start[ID, 51002799]|0|0|catalina-exec-178|1409830702158||
06:38:22.159 |CMDR.getCustomer|sendCmdrRequest|1|1|catalina-exec-178|1409830702158||
06:38:22.177 |CMDR.getCustomer|receivedCmdrResult[14]|19|18|catalina-exec-178|1409830702158||
06:38:22.177 |CMDR.getCustomerView|stop[ID, 51002799, 0, 0]|19|0|catalina-exec-178|1409830702158||

These log entries are for a transaction that started at 06:38:22, for a GetCustomer operation for the Customer with the Id 51002799. 1ms after start, CIL sent a request to CMDR; CMDR responded after 18ms, and CIL back the response immediately. The total time spent in CIL is 19ms.
Each “transaction” is identified in the log by the time when it started and the thread that is executing it. There could be interleaving transactions because multiple threads are active at the same time. Inside a transaction there is a “start”, a “stop” and can be several intermediate events. These intermediate entries allow to breaking down the time spent in the operation and help in troubleshooting. For example, we can identify if the issue is due to the preparation of the request, or the access to the database, or parsing the response.
We have created tools that parse these log files (there can be multiple across multiple servers) and create a chart with the response times (using JFreeChart). Below is an example of such a graph showing outliers in the response time:



Another point of instrumentation is in the web-service to return the time spent in the service as part of the response header:

<IdentifyCustomerResponse xmlns="urn:example.com:crm:v1"
  Version="1.6.1"
  Timestamp="2014-09-04T07:20:29-05:00"
  SystemType="CMDR"
  SystemId="rclnpmdr03"
  Duration="14"
  TransactionId=”[sGQEndFaTH2AVBneqS6Hjw]">
    <Customer CustomerId="515732187" CustomerType="IND">
        <Profile OrgCd="ORG1" LastModifiedBy="CRM">
            <Person FirstName="NATASHA" LastName="SMITH" BirthDate="1990-08-27"/>
            …
        </Profile>

    </Customer>
</IdentifyCustomerResponse>

If the client has a log entry for this response like this:

06:38:22.177 |CMDR.getCustomer|receivedCmdrResult[14]|19|18|catalina-exec-178|1409830702158||

we can determine that 4ms is spent in the network, in the TCP/IP stack of the OS and inside the application container serializing and deserializing messages.
Sometimes there is a discrepancy between the sever and client reported times which indicate network issues. Below is a sample spreadsheet that shows that for the transactions that took more than 3s in the client, the server time was minimal and we can infer that the problem is in the network.



Another interesting fact from this spreadsheet is that this issue happens periodically, every hour (2AM, 3AM, …). We can infer there is an activity on the network that happens at the hour, can be backup, which is impacting the performance of the service.

Generating Load Programatically


Once the code is instrumented for performance analysis we can use the data to troubleshoot performance issues in the case of an incident. We can also use it during performance testing. This section covers methods for generating the load in the environment for the purpose of testing and troubleshooting performance testing.
One way to generate the load is using Java performance test libraries. For example, we used successfully databene.org’s contiperf library to convert JUnit tests into performance tests:

        @Test
@PerfTest(threads = 5, invocations = 1000)
public void test3_GetAccountSummary() {
logger.debug("Working in thread: " + Thread.currentThread().getName());
GetAccountSummaryRequest rq = createRequest(GetAccountSummaryRequest.class);

int memberNum = generateNextMemberNum();
rq.setMemberNum(m);

GetAccountSummaryResponse rs = service.getAccountSummary(rq);

if(rs.getFailure() != null)
{
logger.error("Received error: " + rs.getFailure().getError().get(0).getMessage());
}
}

In this example we are using the library to run running the test 100 times in 5 threads. There are many other options to configure the performance test: warm-up period, checking the results, wait for a random time before launching the next test.

We also used successfully JMeter to create load on a web-service. The advantage of JMeter is that it allows simulating a realistic load made of a combination of requests. The following screenshot  right shows a JMeter configuration that will run 8 types of transactions execute with a ratio of 0.8%, 1.2%, 58.78%, 0.94%, 0.19%, 26.48%, 10.77% and 0.89%.



Each of the transaction types uses as parameters data from a CSV files – the entries are shown as CSV Data Set Config.
The output of the test runs is saved in CVS files and can be analyzed after the test. For example, the following graph shows that there were several cases when the Client Time is much larger than the Server Time during the execution of a JMeter test:

Another useful feature of JMeter is that it can execute tests connecting directly to the database which can isolate the performance issue even further.

Troubleshooting Performance Issues

Here are steps used in finding root cause for performance issues and then implementing corrective actions.

Breakdown response time breakdown by analyzing the logs

 By looking at the client logs and the response times we will try to identify if the time spent is in one of the following layers: touch point client, network, integration layer, network, web-service or database. We found that most often the issue is in the database, sometimes is in the network.
Correlate performance numbers with other variables in the environment
Most performance issues can be correlated with other things happening in the environments:
  • Database issues are correlated with other queries
  • Network issues tend to be correlated with congestions
  • Web-services issues are correlated with CPU utilization, memory utilizations or garbage collections
For this step, it is important to have fine-grained monitoring to be enabled in the environment. We worked with the systems administrators and the operational support teams to gather these metrics and make them available during the troubleshooting process.

Operationalize performance monitoring

We have implemented processes to operationalize performance analysis. Performance data is collected at the transaction level for the last 30 days allowing going back in the case of an incident and identify the issues. We also ensured that enough data is available to perform root-cause analysis (CPU and memory logs, database logs).
Another method is to calculate aggregate metrics on the most important performance indicators: moving average of response time for the most common operations, number of requests/sec. These are collected as JMX attributes which can both be displayed in a console and can be collected by typical monitoring tools.

Conclusions

Three activities have helped us with the performance testing and troubleshooting:
  • instrumentation: individual application events with timestamp and duration and aggregate metrics in JMX
  • automated test generation with Java test functions and JMeter
  • data analytics and visualization using Excel and JFreeChart






Popular posts from this blog

View - A Functor for Web App Design

This blog is about practical applications of Category Theory to the development of Java + Spring applications. I am looking at a design approach to simplify the development of web applications. Traditionally, this kind of back-office application is based on the Web 1.0 technology stack, using Spring Boot and Thymeleaf. My approach is to keep using Spring Boot but replace the generation of HTML with J2HTML and higher-order views. From a Category Theory point of view, we can look at web applications as mappings from the Category of Business Entities and the Category of UI Widgets. If we go one step further, both business entities and UI widgets are mapped to Java classes. Thus, we can view a web application (or a part of it) as an endofunctor in the Category of Java Classes. We define the View-functor as follows: domain(V) - Java classes representing business entities - e.g., Invoice, User - and, codomain(V) - Java functions that render the business entity as a DomContent object (DomCont

Reading J2HTML

J2HTML (j2html) is a Java library used to generate HTML I have been using it to create Web 1.0 applications in Java. Web 1.0 is server-side rendering pages with minimal Javascript. As I got deeper into using the library I started to read the actual source code of this library with an eye on following Java best practices and to my pleasant surprise, this code follows many of them. I am going to show here some examples of using interesting Java features, beyond the basics. 1. Functional Interface People are aware that Java supports some form of Functional Programming, and here is an example of using it: @FunctionalInterface public interface Indenter {     String indent( int level , String text ); } public static Indenter indenter = ( level , text ) -> String. join ( "" , Collections. nCopies ( level , FOUR_SPACES )) + text ; Things that I noticed: String has a method called join. I used before StringUtils.join, but now that is in the standard library I don’t need to us