
Performance Testing a New CRM

Performance testing is challenging, frustrating, and often underestimated, typically getting attention only after an incident. How did we performance test, and what did we learn, during the development and implementation of web-services for a new CRM system?
Touch point applications connect to an integration layer using various data formats and protocols, and the integration layer converts each request into a normalized web-service call to the Customer Master Repository.




This integration had very strict requirements for performance: the average response time for read operations (as perceived by the touch points) must be below 120ms at a load of 10 transactions per second. There were similar, though less strict, requirements for update operations.

Instrumentation

The first step is to instrument the web-service and the integration layer for performance measurement. Each component logs the time required to process a request and how much of that time was spent in the component itself versus in the downstream (from left to right) components. A typical log file in the Integration Layer contains the following entries:

06:38:22.158 |CMDR.getCustomer|start[ID, 51002799]|0|0|catalina-exec-178|1409830702158||
06:38:22.159 |CMDR.getCustomer|sendCmdrRequest|1|1|catalina-exec-178|1409830702158||
06:38:22.177 |CMDR.getCustomer|receivedCmdrResult[14]|19|18|catalina-exec-178|1409830702158||
06:38:22.177 |CMDR.getCustomerView|stop[ID, 51002799, 0, 0]|19|0|catalina-exec-178|1409830702158||

These log entries are for a transaction that started at 06:38:22, for a GetCustomer operation for the customer with the Id 51002799. 1ms after the start, the integration layer (CIL) sent a request to CMDR; CMDR responded after 18ms, and CIL returned the response immediately. The total time spent in CIL is 19ms.
Each “transaction” is identified in the log by the time when it started and the thread that executed it. Transactions can interleave because multiple threads are active at the same time. Inside a transaction there is a “start”, a “stop” and possibly several intermediate events. These intermediate entries allow breaking down the time spent in the operation and help in troubleshooting. For example, we can identify whether an issue is due to the preparation of the request, the access to the database, or the parsing of the response.
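The instrumentation itself can be very small. Below is a minimal sketch of a timer that emits entries in this pipe-delimited format (class and method names are illustrative, not our actual implementation):

    // Illustrative sketch: emits timing entries in the format shown above.
    public class TransactionTimer {
        private final String operation;
        private final long startMillis = System.currentTimeMillis();
        private long lastEventMillis = startMillis;

        public TransactionTimer(String operation) {
            this.operation = operation;
        }

        // Logs an event with the total elapsed time and the delta since the previous event
        public void event(String name) {
            long now = System.currentTimeMillis();
            System.out.printf("%tT.%tL |%s|%s|%d|%d|%s|%d||%n",
                    now, now, operation, name,
                    now - startMillis, now - lastEventMillis,
                    Thread.currentThread().getName(), startMillis);
            lastEventMillis = now;
        }
    }

A component creates one timer per request and calls event("start[ID, 51002799]"), event("sendCmdrRequest"), and so on, ending with the "stop" entry.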
We have created tools that parse these log files (there can be multiple files across multiple servers) and create a chart of the response times using JFreeChart.
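The tools themselves are not shown here; a simplified sketch of such a parser, assuming JFreeChart 1.5 and the field layout of the sample entries above, could look like this:

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Date;
    import org.jfree.chart.ChartFactory;
    import org.jfree.chart.ChartUtils;
    import org.jfree.chart.JFreeChart;
    import org.jfree.data.time.Millisecond;
    import org.jfree.data.time.TimeSeries;
    import org.jfree.data.time.TimeSeriesCollection;

    public class ResponseTimeChart {
        public static void main(String[] args) throws Exception {
            TimeSeries series = new TimeSeries("Response time (ms)");
            for (String line : Files.readAllLines(Paths.get(args[0]))) {
                String[] fields = line.split("\\|");
                // Only the "stop" entries carry the total time for the transaction
                if (fields.length < 7 || !fields[2].startsWith("stop")) continue;
                long startEpochMs = Long.parseLong(fields[6].trim()); // transaction start
                long totalMs = Long.parseLong(fields[3].trim());      // total elapsed time
                // addOrUpdate keeps the last value if two transactions share a millisecond
                series.addOrUpdate(new Millisecond(new Date(startEpochMs)), totalMs);
            }
            JFreeChart chart = ChartFactory.createTimeSeriesChart(
                    "Response Times", "Time", "ms", new TimeSeriesCollection(series));
            ChartUtils.saveChartAsPNG(new java.io.File("response-times.png"), chart, 1200, 600);
        }
    }

Below is an example of such a graph showing outliers in the response time: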



Another point of instrumentation is in the web-service itself, which returns the time spent in the service as part of the response:

<IdentifyCustomerResponse xmlns="urn:example.com:crm:v1"
  Version="1.6.1"
  Timestamp="2014-09-04T07:20:29-05:00"
  SystemType="CMDR"
  SystemId="rclnpmdr03"
  Duration="14"
  TransactionId="[sGQEndFaTH2AVBneqS6Hjw]">
    <Customer CustomerId="515732187" CustomerType="IND">
        <Profile OrgCd="ORG1" LastModifiedBy="CRM">
            <Person FirstName="NATASHA" LastName="SMITH" BirthDate="1990-08-27"/>
            …
        </Profile>

    </Customer>
</IdentifyCustomerResponse>

If the client has a log entry for this response like this:

06:38:22.177 |CMDR.getCustomer|receivedCmdrResult[14]|19|18|catalina-exec-178|1409830702158||

we can determine that 4ms (the 18ms measured by the client minus the 14ms reported by the server) was spent in the network, in the TCP/IP stack of the OS, and inside the application container serializing and deserializing messages.
Sometimes there is a discrepancy between the server-reported and client-reported times, which indicates network issues. Below is a sample spreadsheet showing that, for transactions that took more than 3s on the client, the server time was minimal, so we can infer that the problem is in the network.



Another interesting fact from this spreadsheet is that the issue happens periodically, every hour (2AM, 3AM, …). We can infer that there is some activity on the network that runs on the hour, possibly a backup, which impacts the performance of the service.

Generating Load Programmatically


Once the code is instrumented for performance analysis, we can use the data to troubleshoot performance issues in the case of an incident. We can also use it during performance testing. This section covers methods for generating load in the environment for the purpose of performance testing and troubleshooting.
One way to generate load is to use Java performance test libraries. For example, we successfully used databene.org’s ContiPerf library to convert JUnit tests into performance tests:

    // ContiPerf rule that applies the @PerfTest settings to the JUnit tests
    @Rule
    public ContiPerfRule perfTestRule = new ContiPerfRule();

    @Test
    @PerfTest(threads = 5, invocations = 1000)
    public void test3_GetAccountSummary() {
        logger.debug("Working in thread: " + Thread.currentThread().getName());
        GetAccountSummaryRequest rq = createRequest(GetAccountSummaryRequest.class);

        // Use a different member number for each invocation
        int memberNum = generateNextMemberNum();
        rq.setMemberNum(memberNum);

        GetAccountSummaryResponse rs = service.getAccountSummary(rq);

        if (rs.getFailure() != null) {
            logger.error("Received error: " + rs.getFailure().getError().get(0).getMessage());
        }
    }

In this example we are using the library to run the test 1000 times across 5 threads. There are many other options to configure the performance test: a warm-up period, checking the results, waiting for a random time before launching the next invocation.
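For example, checking the results can be expressed declaratively with ContiPerf’s @Required annotation; here is a sketch tied to the 120ms requirement above (the exact thresholds are illustrative):

    @Test
    @PerfTest(invocations = 1000, threads = 5, rampUp = 200, warmUp = 2000)
    @Required(average = 120, max = 2000)
    public void test3_GetAccountSummary_withLimits() {
        // same body as above; ContiPerf fails the test if the average
        // response time exceeds 120ms or any single invocation exceeds 2s
    }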

We also successfully used JMeter to create load on a web-service. The advantage of JMeter is that it allows simulating a realistic load made of a combination of requests. The following screenshot shows a JMeter configuration that runs 8 types of transactions with a ratio of 0.8%, 1.2%, 58.78%, 0.94%, 0.19%, 26.48%, 10.77% and 0.89%.



Each of the transaction types takes its parameters from a CSV file – these appear in the configuration as CSV Data Set Config elements.
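For example, a hypothetical data file feeding member numbers to the GetAccountSummary sampler could be as simple as:

    memberNum
    51002799
    51002800
    51002801

with each JMeter thread reading the next row before issuing its request.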
The output of the test runs is saved in CSV files and can be analyzed after the test. For example, the following graph shows several cases where the Client Time is much larger than the Server Time during the execution of a JMeter test:

Another useful feature of JMeter is that it can execute tests by connecting directly to the database, which can isolate the performance issue even further.
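Outside JMeter, the same isolation can be done with a small program that times a query directly against the database. A sketch (the driver URL, credentials, and query are placeholders, not our actual configuration):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class DirectDbTiming {
        public static void main(String[] args) throws Exception {
            // Placeholder connection settings - replace with the actual database
            try (Connection con = DriverManager.getConnection(
                    "jdbc:vendor://dbhost:1521/crm", "user", "password");
                 PreparedStatement ps = con.prepareStatement(
                    "SELECT * FROM customer WHERE customer_id = ?")) {
                for (int i = 0; i < 100; i++) {
                    long t0 = System.nanoTime();
                    ps.setLong(1, 51002799L);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) { /* consume the result */ }
                    }
                    long ms = (System.nanoTime() - t0) / 1_000_000;
                    System.out.println("query " + i + ": " + ms + " ms");
                }
            }
        }
    }

If the times measured here match the server-reported times, the database is the bottleneck; if not, the overhead is in the layers above it.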

Troubleshooting Performance Issues

Here are the steps we used to find the root cause of performance issues and then implement corrective actions.

Break down the response time by analyzing the logs

By looking at the client logs and the response times, we try to identify whether the time is spent in one of the following layers: touch point client, network, integration layer, network again, web-service, or database. We found that most often the issue is in the database, and sometimes in the network.
Correlate performance numbers with other variables in the environment
Most performance issues can be correlated with other things happening in the environment:
  • Database issues are correlated with other queries running at the same time
  • Network issues tend to be correlated with congestion
  • Web-service issues are correlated with CPU utilization, memory utilization or garbage collection
For this step, it is important to have fine-grained monitoring enabled in the environment. We worked with the systems administrators and the operational support teams to gather these metrics and make them available during the troubleshooting process.

Operationalize performance monitoring

We have implemented processes to operationalize performance analysis. Performance data is collected at the transaction level and retained for 30 days, allowing us to go back in the case of an incident and identify the issue. We also ensured that enough data is available to perform root-cause analysis (CPU and memory logs, database logs).
Another method is to calculate aggregate metrics for the most important performance indicators: the moving average of the response time for the most common operations, and the number of requests per second. These are collected as JMX attributes, which can be displayed in a console and collected by typical monitoring tools.
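A minimal sketch of such a JMX attribute, assuming an exponential moving average and illustrative names (the real metrics and object names differ):

    // PerformanceStatsMBean.java - the management interface (must be public)
    public interface PerformanceStatsMBean {
        double getMovingAverageResponseTimeMs();
        long getRequestCount();
    }

    // PerformanceStats.java - records measurements and exposes them over JMX
    import java.util.concurrent.atomic.AtomicLong;

    public class PerformanceStats implements PerformanceStatsMBean {
        private final AtomicLong requestCount = new AtomicLong();
        private volatile double movingAverageMs;

        // Called by the instrumentation code after each request completes
        public void record(long responseTimeMs) {
            requestCount.incrementAndGet();
            // Exponential moving average: the latest sample weighs 5%
            movingAverageMs = 0.95 * movingAverageMs + 0.05 * responseTimeMs;
        }

        public double getMovingAverageResponseTimeMs() { return movingAverageMs; }
        public long getRequestCount() { return requestCount.get(); }
    }

    // Registered once at startup, e.g.:
    //   ManagementFactory.getPlatformMBeanServer().registerMBean(
    //       stats, new ObjectName("com.example.crm:type=PerformanceStats"));

Once registered, jconsole or any JMX-capable monitoring tool can read these attributes.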

Conclusions

Three activities have helped us with the performance testing and troubleshooting:
  • instrumentation: individual application events with timestamp and duration and aggregate metrics in JMX
  • automated test generation with Java test functions and JMeter
  • data analytics and visualization using Excel and JFreeChart





