Hinjewadi, Pune

Event Hosts
Event Supporter
Event Sponsors
Table of Contents
Foreword: CMG India's 1st Annual Conference .................. i

Architecture & Design for Performance
Optimal Design Principles for Better Performance of Next Generation Systems, Maheshgopinath Mariappan et al .................. 1
Architecture & Design for Performance for a Large European Bank, R Harikumar, Nityan Gulati .................. 7
Designing for Performance Management in Mission Critical Software Systems, Raghu Ramakrishnan et al .................. 19

Low Latency Multicore Systems
Incremental Risk Calculation: A Case Study of Performance Optimization on Multi Core, Amit Kalele et al .................. 31
Performance Benchmarking of Open Source Messaging Products, Yogesh Bhate et al

Advances in Performance Testing and Profiling
Measuring Wait and Service Times in Java Using Byte Code Instrumentation, Amol Khanapurkar, Chetan Phalak .................. 69
Cloud Performance Testing Key Considerations, Abhijeet Padwal .................. 78
Automatically Determining Load Test Duration Using Confidence Intervals, Rajesh Mansharamani et al

Reliability
Building Reliability into IT Systems, K. Velivela
Foreword: CMG India's 1st Annual Conference
Rajesh Mansharamani
President, CMG India
When we founded CMG India in Sep 2013, I expected this community of IT system
performance engineers and capacity planners to grow to 200 members over time. A
year and a quarter since then, I am happy to see my initial estimate proven wrong. Not only
do we have more than 1500 CMG India members today, we also have more than 200
attending our 1st Annual Conference this December!
CMG Inc is very popular worldwide thanks to its annual conference, which attracts the
best from the industry to present papers in performance engineering and capacity
planning. Having this precedent in front of us, we wanted to set the bar high for CMG
India's 1st Annual Conference. Given that the majority of IT system professionals in
India have never submitted a paper for a conference publication, we were delighted to
see 29 high quality submissions in response to our call for papers. The conference
technical programme committee, drawn from the best across industry and academia,
accepted 10 of these submissions for publication and presentation. We hope to see these
numbers grow over time, thus giving opportunities for more and more professionals
across India to step forth and present their contributions.
Fortunately, the paper submissions were in diverse areas spanning architecture and
design for performance, advances in performance testing and profiling, reliability, and
cutting edge work in low latency systems. When complemented with our keynote
addresses in big data, capacity management, database query processing, and real life
stock exchange evolution, we truly have a great technical programme lined up for our
audience. Thanks to all our keynote speakers (Adam Grummitt, N. Murali, Anand
Deshpande, and Jayant Haritsa) for their readiness to speak at this inaugural event.
Given that the majority of our audience is engaged in billable client projects, we decided to restrict
the conference to a Friday and Saturday, and hence to run tutorials and vendor talks in
parallel. Tutorials too went through a call for contributions process and we were
delighted to see fierce competition in this area as well. Finally, we could shortlist only
four tutorials and we added another two invited tutorials from academia and industry
stalwarts. At the same time we lined up one session on startups and five vendor talks
from our hosts and sponsors.
Our 1st conference would not have been possible without the eagerness shown by
Persistent Systems and Infosys, Pune, to host the sessions in their campuses in
Hinjewadi, which is today the heart of the IT sector in Pune. We would also not have been able
to make our conference affordable to one and all without contributions from our
sponsors: Tata Consultancy Services, Dynatrace, VMware, Intel and HP. Given that CMG
India exists as a community and not a company, we were extremely glad when
Computer Society of India stepped in as the event supporter to handle all financial
transactions on our behalf. CMG India is extremely thankful to the hosts, sponsors, and
supporter, not just because of their deeds but also because of the terrific attitude they
have demonstrated in making this conference a success.
None of the CMG India board members has hosted or organized a conference of this
nature before. While 16 regional events were organized by CMG India since its inception,
there was no need for an organizing committee for these events, given that each
event lasted just two to three hours and was free to participants. As the annual
conference dates started approaching we realized the enormity of the task at hand in
managing a relatively mega event of this nature. For that reason I am extremely grateful
to Abhay Pendse, head of this conference's organising committee, and all the volunteers
who have worked with him in the planning and implementation.
Given that all of the organising committee members are working professionals with little
spare time outside their work, it was heartening to see all of them spend late evening hours
ensuring that the conference planning and implementation was as meticulous as possible.
It's been a joy working with such people and I would like to thank them again and again
for stepping forward and carrying out their responsibilities till the very end. I am
equally impressed with the technical programme committee (TPC) wherein nearly all of
the 25 members reviewed papers and tutorials well ahead of their deadlines. All TPC
members are expert professionals very busy in their own work. Hats off to such
commitment to the field of performance engineering and capacity management.
We have hit a full house in our 1st Annual Conference, and we look forward to tasting
the same success in the years to come. I sincerely hope this community in India
continues to grow and shows the same spirit of contribution as we move forward in time.
Technical Programme Committee
Rajesh Mansharamani, Freelancer (Chair)
Amol Khanapurkar, TCS
Babu Mahadevan, Cisco
Bala Prasad, TCS
Benny Mathew, TCS
Balakrishnan Gopal, Freelancer
Kishor Gujarathi, TCS
Manoj Nambiar, TCS
Mayank Mishra, IIT-B
Milind Hanchinmani, Intel
Mustafa Batterywala, Impetus
Prabu D, NetApp
Prajakta Bhatt, Infosys
Prashant Ramakumar, TCS
Rashmi Singh, Persistent
Rekha Singhal, TCS
Sandeep Joshi, Freelancer
Santosh Kangane, Persistent
Subhasri Duttagupta, TCS
Sundarraj Kaushik, TCS
Trilok Khairnar, Persistent
Umesh Bellur, IIT-B
Varsha Apte, IIT-B
Vijay Jain, Accenture
Vinod Kumar, Microsoft
Organising Committee
Abhay Pendse, Persistent (Chair)
Dundappa Kamate, Persistent
Prajakta Bhatt, Infosys
Rachna Bafna, Persistent
Rajesh Mansharamani, Freelancer
Sandeep Joshi, Freelancer
Santosh Kangane, Persistent
Shekhar Sahasrabuddhe, CSI
Optimal Design Principles for Better Performance of Next Generation Systems
Maheshgopinath Mariappan, Balachandar Gurusamy, Indranil Dharap
Energy, Communications and Services, Infosys Limited, India
{Maheshgopinath_M,Balachandar_gurusamy,Indranil_Dharap}@infosys.com
Abstract
Design plays a vital role in the software engineering methodology. Proper
design ensures that the software will serve its intended functionality. The
design of a system should cover both functional and non-functional
requirements. Designing for the non-functional requirements is very difficult
in the early stages of the SDLC, because the actual requirements are not yet
clear and the primary focus is on functional requirements. Design-related
errors are difficult to address and can cost millions to fix at a later
stage. This paper describes various real-life performance issues and the
design aspects to be taken care of for better performance.
1. INTRODUCTION
There has been tremendous growth in the field of social
networking and internet-based applications over the
last few years. Across the globe there is exponential
growth in the number of people using these applications.
Companies are deploying many strategies to increase the
availability and reliability of their applications and to make
them less error prone. Any drop in these parameters has a
significant impact on revenue and user base. But developing
a 100% reliable and error-free application is not possible.
Some types of application errors are easy to fix and recover
from, whereas others are not.
Design-related issues are critical and have a huge impact
on the functionality of the application. It takes a lot of time
to redesign and rebuild an application, so enough attention
has to be given in the early stages to ensure that all aspects
of the application are covered during design. Designing a
next generation system is even more complex, as it introduces
new complications: most of the software used is open source,
many stakeholders are involved, and requirements change
dynamically.
Sections 3 to 16 of this paper explain the different design
aspects that should be considered for a better quality design.
2. RELATED WORK
Our paper covers efficient logging ideas to
achieve better application performance. Different best
practices and algorithms for logging are available in the
market; some of the latest logging algorithms are explained
by Kshemkalyani, A.D. (1). Our design suggestions are
generic in nature and applicable to any of the commonly
used programming languages. A list of the common
languages used for software development is provided
by IEEE (2). Different patterns of IO operations are
explained by Microsoft (4). The significance of cache size
is also explained in this paper. Different caching techniques
are explained by L. Han and team (5).
The copyright of this paper is owned by the author(s). The author(s) hereby grants Computer Measurement
Group Inc a royalty free right to publish this paper in CMG India Annual Conference Proceedings.
3. Logging:
Traditionally, logging was considered a way to
store all the information related to application requests and
responses. This information was used by the operations and
development teams while debugging the application. Over a
period of time there has been a major shift in this trend.
Nowadays businesses rely on this data to generate
business metrics and reports. They also use the log
data to identify usage patterns and customer churn.
Advances in cloud computing and big data, as well as the
availability of efficient tools like Splunk, have made this
analysis possible. So there is a push from the different
stakeholders of a product, such as sales, business and care, to
log as much information as possible for each user
request. The following are common problems
caused by poor logging design:
1. Slow response time
2. Application performance degradation
Case study:
A real-time web application was hanging after
running in the production environment for 5 hours. After
analysis we were able to identify the root cause of the
issue: the application was logging the entire request and
response, with headers and all the metadata, into log files.
After filtering out the unnecessary logging, the system
was able to run without any issues. Extra measures
need to be taken if the system is asynchronous in
nature (for example Node.js) and single threaded. We should not
block the master process with logging. If the main thread
gets locked, all the requests into the system pile
up until the main thread is released and available.
Aspects that need to be considered during the
design are:
1. A proper logging level
2. Logging only the critical details of the session, such as
session id, user id and the type of operation performed,
instead of logging all the details
3. Storing logs on a file system or local disk
instead of trying to write across the network
4. Setting a proper rollover size and policy for the log files
5. Enabling auto archival of logs
6. If possible, making logging an asynchronous process (a sketch follows below)
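The following is a minimal, illustrative Java sketch of asynchronous logging, not the implementation used in the case study above: request threads only enqueue messages, and a single background daemon thread performs the file IO, so the main processing path never blocks on disk writes. The class and file names are hypothetical.

    import java.io.BufferedWriter;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class AsyncLogger {
        private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);

        public AsyncLogger(String logFile) throws IOException {
            BufferedWriter out = new BufferedWriter(new FileWriter(logFile, true));
            Thread writer = new Thread(() -> {
                try {
                    while (true) {
                        out.write(queue.take());   // blocks only this background thread
                        out.newLine();
                        out.flush();
                    }
                } catch (InterruptedException | IOException e) {
                    Thread.currentThread().interrupt();
                }
            });
            writer.setDaemon(true);                // logging must not keep the JVM alive
            writer.start();
        }

        // Called from request threads: if the queue is full the message is dropped
        // rather than blocking the caller -- a deliberate trade-off for a sketch like this.
        public void log(String message) {
            queue.offer(message);
        }
    }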
The relationship between logging level and the improvement in throughput is illustrated in the following figure.
Figure 1: Log size vs throughput and response time
4. Programming language:
In the software industry there are many programming
language options available to develop a system.
Selecting a proper language for application
development makes a major difference to system
performance. Some developers tend to write all
applications using the same language, but
consideration should be given to selecting the
proper programming language.
Case study:
An e-commerce application was designed using
the J2EE framework. This was a stateless web
application, and it was not able to serve enough user
requests using a single server. The same application
was redesigned using Node.js, and the system is now
able to handle more requests using the same application
server. The development time was also reduced
considerably.
However, the same Node.js stack, when employed for a
back-end application, had the following issues. During the
trial run everything seemed normal, but the application
did not scale easily in production without Node clustering,
since Node.js is a single-threaded framework. Another
major issue was observed with error handling: any error
in the input caused the single thread to fail and
crashed the application completely. So the following
has to be considered:
1. If the application is going to be multithreaded,
then frameworks like Akka, Play or J2EE
should be used
2. If the application needs quick response times,
use languages like Scala rather than Java
3. If the application is a web application with no
state information stored, then select a platform
like Node.js
Figure 2: Response time vs programming language (Node vs J2EE, for I/O-intensive and CPU-intensive applications)

3.3 Reducing IO calls:
IO-related calls such as DB and file operations are generally costly. These should be limited as much as possible.

Case study:
A mobile back-end application's goal was to collect user preferences and debug logs from mobile clients and store them in the DB. During the initial testing everything seemed fine, but at the production launch the response time increased drastically. Our analysis showed that the client was calling the DB multiple times to store and retrieve the user preferences and logs, so a caching solution was added between the client and the DB layer to serve the most frequently requested data. The aspects to consider are:
1. Employ a caching mechanism
2. Group the DB write calls
3. Use the async driver if available

5. Selection of DB
Selection of an appropriate DB is also a critical factor. Nowadays there is a trend in the technical community to go for a NoSQL DB for each and every application, but that is not the right approach. The following are common problems if the DB selection is not correct:
1. Queries become very complex if the data is not stored in a proper format
2. Queries take more time to execute and compute results than the expected interval
3. Overall slow response to the end user

Case study:
An application was designed to store various trouble tickets and their status. The development team decided to develop this entire application with new software such as NoSQL databases. But during the POC it was observed that an RDBMS is better than NoSQL for a relational, transaction-based system.
The best design practices for the selection of a database are:
1. Use relational DBs, such as SQL databases, if there are relations among the data being stored and the system needs frequent reads
2. NoSQL DBs are useful for storing voluminous data without relations, with more writes and fewer reads

6. Replication strategies:
Applications are deployed across different data centers and clusters. In such cases, data replication is sometimes required across the servers to serve users without any functional glitch. Two strategies are followed for data replication:
1. Synchronous data replication
2. Asynchronous data replication
Synchronous data replication should be used only when the data needs to be updated for each transaction; it generally comes with the trade-off of slower response times for the end users. Asynchronous replication provides good response times but is not suitable for frequent updates.

Figure 3: Replication type vs response time
7. Avoid too many hops:
It is no surprise that the number of hops directly
affects the performance of the system, especially
hops through legacy systems.
Case study:
When a provisioning engine was deployed in
production, it took more than an hour to provision a
single user account. A lot of analysis was done to
identify the root cause of the problem. During
provisioning, the system was sourcing data from more
than 50 legacy services to fetch the information and
create the user entry. Each system took some
milliseconds to process the request. Further analysis
showed that all these legacy systems were built on top
of the original source of truth, and each one of them
added some small extra function that was not required
by the provisioning engine.
After a series of discussions with all our stakeholders,
we retired the unwanted systems and rewrote the
original source of truth in such a way that it provides
all the required data directly. The response time then
improved from an hour to less than 10 seconds. So
always remember to avoid unnecessary hops in larger
systems.
Figure 4: Number of hops vs response time
8. Caching:
Caching is mainly used to hold user data
temporarily for a certain period of time so that it can
reduce I/O-intensive calls such as DB reads and writes.
The following thumb rules need to be followed for
caching:
1. Store only static data in the cache
2. Never store dynamic data in the cache
3. Store only a small volume of data
4. Put proper cache eviction policies in place
5. If possible, use an in-memory cache rather than
a secondary cache
The following are common issues faced across
applications if caching is not done properly:
1. A lot of DB calls
2. Out of memory errors due to unbounded cache growth
Figure 5: Data type vs cache effectiveness
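As an illustration of the thumb rules above (small data volume, proper eviction), the following is a minimal bounded in-memory cache sketch in Java based on LinkedHashMap; it is not the caching product used in the case studies, and the capacity is an example only.

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Evicts the least recently used entry once maxEntries is exceeded.
    // Wrap with Collections.synchronizedMap(...) if shared across threads.
    public class BoundedCache<K, V> extends LinkedHashMap<K, V> {
        private final int maxEntries;

        public BoundedCache(int maxEntries) {
            super(16, 0.75f, true);            // accessOrder = true gives LRU ordering
            this.maxEntries = maxEntries;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > maxEntries;        // cap the cache so it cannot grow unbounded
        }
    }

A cache like this should hold only static reference data, keeping the volume small enough to avoid out-of-memory errors.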
9. Retry mechanism:
Generally a retry mechanism is employed for
database calls and third-party calls to cover failures
caused by the target application being unreachable. This
also helps to enhance the overall user experience.
Case study:
A real-time communication project was deployed
with a NoSQL DB in the back end. During a weekend
the NoSQL DB went down due to system issues,
and all the clients retried indefinitely, eventually
bringing down the entire infrastructure. After thorough
analysis it was identified that the number of retries was
not limited on the client side, so the clients continuously
retried reaching the DB. Once a maximum number of
retries was configured in the application, it worked
without issues. So the optimal number of retries should
be decided during the initial stages of system design.
According to the CAP theorem (6), we cannot achieve
consistency, availability and partition tolerance at the
same time; there is always a trade-off between these
parameters.
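A bounded retry with exponential backoff can be sketched as follows (illustrative Java, not the client library from the case study); the maximum number of attempts and the initial backoff are the parameters to fix at design time.

    import java.util.concurrent.Callable;

    public final class RetryPolicy {
        // Attempts the call at most maxAttempts times, doubling the wait between attempts.
        public static <T> T callWithRetry(Callable<T> call, int maxAttempts,
                                          long initialBackoffMillis) throws Exception {
            long backoff = initialBackoffMillis;
            Exception last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    return call.call();
                } catch (Exception e) {
                    last = e;                          // remember the failure
                    if (attempt < maxAttempts) {
                        Thread.sleep(backoff);         // back off before the next attempt
                        backoff *= 2;
                    }
                }
            }
            throw last;                                // give up instead of retrying forever
        }
    }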
10. Garbage collection:
Case study:
One of our queuing applications had an issue of
dropping customer requests. There were no issues
reported in the system logs and everything seemed
normal. During the root-cause analysis it was identified
that garbage collection was not configured
properly. Garbage collection was attempted at very
frequent intervals, which resulted in the system hanging
during those periods and in loss of transactions.
So the design should capture the required garbage
collection parameters.
Figure 6: Full GC cycles vs dropped connections
11. Session Management:
Case study:
An e-commerce site was not able to scale beyond a
certain range of customer transactions. After analysis it
was identified that a lot of customer sessions were
maintained in system memory, and each session held a lot
of data. The whole process was consuming a lot of
memory and the system was not able to scale very
much. So care should be taken over proper
session management:
1. Remove the session from memory after a certain
time limit
2. Remove the session immediately if the client is
disconnected or unreachable
3. Limit the number of items stored as part
of the session
4. Move data from the session to some sort of
temporary cache or journal so that the memory
is free to be allocated most of the time
12. Async Transactions
During the design, determine the transactions that
can be performed in an asynchronous manner. Performing
transactions asynchronously helps to reuse the
expensive resources effectively. The following
should be done in an async way:
1. Database reads and writes
2. IO-intensive operations
3. Calls to third-party systems
4. Enabling NIO connections in the application
servers

13. Choice of Data Structures
Data structures are used internally by computers to
store and retrieve data in an organized manner.
The commonly used types are List, Map, Set and
Queue. Each type of data structure comes with its own
inherent features; Hashtable, for example, is
synchronized by design, so using Hashtable in
multithreaded applications will degrade their
performance. Instead we can use ConcurrentHashMap to
achieve better results. Similarly, StringBuilder is preferred
over StringBuffer as it is not synchronized. If you want to
retrieve items using a key, the ideal choice is a
Map implementation. To retrieve items in input order,
any List implementation can be used. When you want to
store only a unique set of data items, any Set implementation
can be used. Queues can be used to implement any
worker-thread model of application. Not selecting the
appropriate data structure also results in inefficient use of
heap memory.
By default, each data structure implementation has its own
default storage capacity. The default capacity of a HashMap
or StringBuffer is 16, whereas the ArrayList default is 10.
Most of these data structures expand their capacity
automatically (often doubling it) when the initial storage is
exhausted, so selecting a HashMap instead of an ArrayList
will consume more heap memory. Utmost care must therefore
be taken over the selection of the proper data structure
during the design phase. A small sketch of these choices
follows.
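The sketch below is illustrative Java only (sizes and names are examples, not from the systems described above), summarising the choices discussed in this section.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class DataStructureChoices {
        // Shared, keyed lookups from many threads: ConcurrentHashMap instead of Hashtable,
        // because it does not serialise every access behind a single lock.
        private final Map<String, String> sessionsById = new ConcurrentHashMap<>(1_024);

        // Single-threaded string assembly: StringBuilder instead of StringBuffer,
        // avoiding unnecessary synchronisation.
        public String buildKey(String userId, String operation) {
            return new StringBuilder(64).append(userId).append(':').append(operation).toString();
        }

        // Sizing a collection up front avoids repeated resizing as it grows.
        public List<String> preSizedList(int expectedSize) {
            return new ArrayList<>(expectedSize);
        }

        public Map<String, String> sessions() {
            return sessionsById;
        }
    }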
14. Client Side Computing vs Server Side Computing
There is a long-standing disagreement between back-end
developers and web designers about whether a client-side
or a server-side computation model is better. Most
client-side computations are generally related to the user
interface. The client-side implementation is done using
JavaScript, AJAX and Flash, which use the client system's
resources to complete the requests. Server-side computing
is implemented using PHP, ASP.NET, JSP, Python or Ruby
on Rails. There are advantages and disadvantages to each
approach. Server-side computing provides quick business
rule computation, efficient caching and better overall data
security, whereas client-side computing helps to develop
more interactive web applications that use less network
bandwidth and achieve a quicker initial load time. So
server-side computation is preferred for validating user
choices, developing structured web applications and
persisting user data, while client-side computation is useful
for dynamic loading of content, animations and data storage
in local temporary storage.
15. Parallelism
Most modern computers contain multiple cores by
default. During the design phase we need to identify all
the tasks that can be executed in parallel in order to fully
utilize this hardware. Parallelism is the concept of
executing tasks in parallel to achieve high throughput,
good performance and easy scalability. To achieve
parallelism in an application we need to identify the set
of tasks that can be executed independently, without
waiting for others. Large pieces of work or transactions
should be broken into small work units or tasks. The
dependencies between these work units, and the
communication overhead between them, should be
identified during the design phase. The work units can
then be assigned to a central command unit for execution,
and finally the results are combined and sent back to the
user. One good example of this design is the MapReduce
programming technique. According to the design rule
hierarchies paper (7), software modules located within the
same layer of the hierarchy suggest independent, hence
parallelizable, tasks, while dependencies between layers or
within a module suggest the need for coordination during
concurrent work. So use as much parallelism as possible
during the design to achieve better performance.
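A minimal illustrative Java sketch of this split-work/combine-results pattern follows; the summing task stands in for any independent work unit and is not taken from the paper's case studies.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ParallelSum {
        // Splits a large piece of work into independent tasks, runs them on all
        // available cores, and combines the partial results at the end.
        public static long sum(long[] values) throws Exception {
            int cores = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(cores);
            int chunk = (values.length + cores - 1) / cores;
            List<Callable<Long>> tasks = new ArrayList<>();
            for (int start = 0; start < values.length; start += chunk) {
                final int from = start, to = Math.min(start + chunk, values.length);
                tasks.add(() -> {
                    long partial = 0;
                    for (int i = from; i < to; i++) partial += values[i];
                    return partial;                       // independent work unit
                });
            }
            long total = 0;
            for (Future<Long> f : pool.invokeAll(tasks)) total += f.get();  // combine results
            pool.shutdown();
            return total;
        }
    }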
16. Choice of Design Patterns
Design patterns provide a solution approach to
commonly recurring problems in a particular context.
The concept of design patterns started with the initial set of
patterns described by the Gang of Four in their design
patterns book. Currently around 200 design patterns are
available to resolve different software problems, so the
cumbersome task is identifying the suitable design
pattern for the application. One good approach
is the design pattern intent ontology proposed by
Kampffmeyer H. and Zschaler S. in their paper (8).
They have also developed a tool to identify the suitable
design pattern for a problem. Once a particular
pattern is identified, it has to be checked to ensure that
it does not belong to the anti-patterns.

17. CONCLUSION:
As next generation systems become more complex
and pose challenges of their own, it is imperative that
while designing these systems we follow the points
discussed in this paper. Based on our experience over
the years, a system proves to be efficient and cost-effective
only when more weightage is given and adequate time is
spent in designing the system. In brief, the design topics are:
1. Proper logging configuration
2. Appropriate selection of software language
3. Reduced number of IO operations
4. Appropriate selection of databases
5. Suitable replication strategy
6. Retire/merge unwanted legacy systems
7. Implement a proper caching mechanism
8. Proper retry interval at the client side
9. Proper garbage collection configuration
10. Keep less data in session memory
11. Priority to async transactions
12. Proper choice of data structures
13. Client side computing vs server side computing
14. Parallelism
15. Choice of design patterns
Each one of the above mentioned design principles
will surely help in achieving better performance and a
good customer experience.

18. REFERENCES:
1. A. Kshemkalyani, "A Symmetric O(n log n) Message Distributed Algorithm for Large-Scale Systems", Proc. IEEE Int'l Cluster Computing Conf., 2009.
2. http://spectrum.ieee.org/at-work/techcareers/the-top-10-programming-languages
3. N. Ali, P. Carns, K. Iskra, D. Kimpe, S. Lang, R. Latham, R. Ross, L. Ward, and P. Sadayappan, "Scalable I/O forwarding framework for high-performance computing systems", in IEEE International Conference on Cluster Computing (Cluster 2009), New Orleans, LA, September 2009.
4. http://msdn.microsoft.com/enus/library/windows/desktop/aa365683(v=vs.85).aspx
5. L. Han, M. Punceva, B. Nath, S. Muthukrishnan, and L. Iftode, "SocialCDN: Caching techniques for distributed social networks", in Proceedings of the 12th IEEE International Conference on Peer-to-Peer Computing (P2P), 2012.
6. http://en.wikipedia.org/wiki/CAP_theorem
7. Sunny Wong, Yuanfang Cai, Giuseppe Valetto, Georgi Simeonov, and Kanwarpreet Sethi, "Design Rule Hierarchies and Parallelism in Software Development Tasks".
8. Kampffmeyer H. and Zschaler S., "Finding the Pattern You Need: The Design Pattern Intent Ontology", in MoDELS, Springer, 2007, volume 4735, pages 211-225.
ARCHITECTURE AND DESIGN FOR PERFORMANCE OF A LARGE
EUROPEAN BANK PAYMENT SYSTEM
Nityan Gulati
Principal Consultant
Tata Consultancy Services Ltd
[email protected]
Gurgaon
R. Hari Kumar
Senior Consultant
Tata Consultancy Services
[email protected]
Bangalore
A large software system is typically characterized by a large volume of
transactions to be processed, considerable infrastructure and a high
number of concurrent users. Additionally, it usually involves integration
with a large number of upstream and downstream interfacing systems
with varying processing requirements and constraints. These
parameters on their own may not pose a challenge when they are static in
nature, but it gets tricky if the inputs keep changing and continuously
evolving. In such conditions, how do we keep the system performance
and resilience under control? This paper explains the key design
aspects that need to be considered across the various architectural
layers to ensure smooth post-production performance.
1. INTRODUCTION
In a typical implementation, often due attention is not paid to system performance during the initial stages of
design and development. Performance testing happens at a later stage, sometimes just a few weeks before the
application goes live. As a result, only very limited performance tuning options are available at this stage: we
can do a bit of SQL tuning and some tweaking of the system configuration. Due to the lack of a systematic and
timely approach to addressing performance issues, these steps mostly result in little gain.
While a large system involves several design aspects, we shall discuss some key application design areas and
guidelines that need to be borne in mind during the design stage for robust and performing implementations.
Specifically, the paper illustrates key aspects to be considered in tuning web page response time, aspects in
tuning straight through processing (STP) throughput, factors for improving the batch throughput and a number of
other parameters to be considered for tuning.
The document is based on the experiences from tuning the architecture, design and code of several product
based implementations of financial applications.
The examples and statistics quoted are derived from the actual experience from managing the design and
architecture of payment platform for a large European Bank.
The system has gone live successfully and is in production for around two years now.
The paper is organized as follows:
Section 2 provides the context of the payment system, the SLA requirements and a brief overview of the architecture
of the system; Section 3 presents the key design considerations and parameters discussed in the paper;
Sections 4 to 7 discuss in detail the tuning activity carried out on the selected parameters; Section 8
summarizes the key performance benefits realized after the various parameters were tuned; and finally Section 9
enumerates the key lessons learnt from the project, followed by references.
The copyright of this paper is owned by the author(s). The author(s) hereby grants Computer Measurement
Group Inc a royalty free right to publish this paper in CMG India Annual Conference Proceedings.
2. SYSTEM CONTEXT AND ARCHITECTURE
The following picture depicts a high level view of the architecture of a corporate banking application. The
application supports payment transactions in terms of deposits, transfers, collections, mandates and a host of
other typical banking transactions.
Figure 1 Application architecture of the payment processing system
The following are the key metrics on the volume of transactions that the system is expected to process:
two million transactions per day, with a peak load of 300k transactions per hour and around
2,000 branch users. The end of day (EOD) / end of month (EOM) process has to complete within 2 hours. The
system has to generate around one million statements post EOD, over a million accounts. The system has around
twenty upstream interfaces sending payments as files and messages, and over 30 downstream
systems, which include regulatory, reporting, security and business interfaces.
The following picture provides the technical architecture of the banking application
Figure 2 Technical architecture of the payment processing system
This is a typical n-tier architecture that has the following layers
Web Tier: It is browser based and uses https as the protocol with extensive use of AJAX to render the content.
Presentation Layer: This provides the presentation logic, mainly appearance and localization of data.
Controller Layer: This is realized using the standard Model View Controller (MVC) design pattern.
Application Layer: This consists of Business Services and Shared services. A business service covers the
functional area of a specific application, whereas a shared service provides infrastructure functionality such as
exception handling, logging, audit trail, authorization, and access control.
Database Tier: This layer encompasses the data access objects that encapsulate data manipulation logic.
Batch Processing: The batch framework is multi-threaded and scalable. It provides restart and
recovery features. It also allows jobs to be scheduled using an internal scheduler or any third-party
scheduler such as Control-M.
Integrator: This layer provides the integration capabilities to interface with external systems and
gateways. The integrator combines simplicity with a rich layer of protocol adapters, including the popular SWIFT
adapters, and acts as a powerful transformation and rule-based routing engine providing the standard features of
an Enterprise Application Integration layer.
3. DESIGN CONSIDERATIONS
This section covers the key challenges that we faced in the various architectural layers of the system and the
thought process adopted for resolution.
The following are the challenges that are discussed in this paper
3.1 Tuning web page parameters
The performance of search screens is a challenge, considering the variety of search options and parameters
available to the user and the huge transaction volumes added to the system on a daily basis. Search capability is
fundamental to the business users carrying out their daily tasks. The SLA mandated by the customer is a
screen response of 2 seconds or less.
3.2 Tuning STP throughput parameters (straight through processing)
Design for maintaining the STP throughput considers the following parameters: the number (count) of payment
files received from the upstream systems at any given point in time, and the non-uniformity in the size of the
files received. The files could be single-transaction files or bulk sets of transaction files.
The SLA is to process a load of 300,000 transactions, received as messages and files (single and bulk) of varying
sizes, in under 60 minutes.
3.3 Tuning batch parameters
The performance of batch programs largely depends on effective management of database contention and
optimal commit sizes. Considering that we could receive more than 40% of the transactions for a single
account, this could result in hot spots. The batch process can be quite intensive, pumping in a large number of
transactions; we had to tune the system where the commit rates were in excess of 10,000 per second.
The SLA is to complete the end of day (EOD) and end of month batch profiles in less than 2
hours. The peak volume of a business day was about two million transactions. Additionally, the following
parameter tunings are considered:
Database parameter tuning
Oracle-specific considerations
4. TUNING WEB PAGE PARAMETERS
Keeping the response time SLA for the search page under 2 seconds posed a challenge, considering the variety of
search options and combinations available to the user and the huge number of transactions added to the system
on a daily basis. The flexibility provided to the user allowed a large number of combinations of search
parameters, and the generic SQL, with null-value functions and "OR" conditions, led to full table scans.
Web-based search is one of the key operations that business users exercise frequently. Hence it is
imperative that the operation is designed as efficiently as possible. The following techniques are recommended
for efficient search operations.
Identifying the popular search criteria – Considering the huge number of permutations of search conditions that are
possible, it is challenging to have indexes available for every possible combination. Moreover, there can be fields
such as names and places where indexes may not be useful, since several thousand records may
qualify for a given name or place. Hence it is essential that we understand the most frequently used search
parameters through detailed discussions with the business or operational users of the current systems. We
included dedicated queries for these popular combinations, serving the day-to-day needs of most of the
operational requirements. For the remaining combinations, we included a generic query with a short date range
as the default. For example, in a payment system, the order date will have a default date range of 30 days; this
is the period most searches would fit into.
If it is a legacy system which is being upgraded, logging using software probes can be used to tap the parameter
values passed for scientific analysis. A query on Oracle internal view DBA_HIST_SQLBIND can also be used to
capture the parameter (host variables for the SQL involved) usage by the end users.
Gracefully stopping long-running SQLs – A long-running online query in the database is not under the
control of the application server. It can block an application server thread for a considerable amount of time, thus
impacting concurrency. In Oracle RDBMS such a query can be interrupted gracefully if we create a dedicated
Oracle user id for use in the application server data source. This user id is assigned an Oracle profile which
has a limit set for CPU_PER_CALL (in units of 1/100 of a CPU second). Whenever an SQL statement running under that user
id exceeds the specified limit, Oracle error ORA-02393 is returned and the query is terminated. The application code
traps it and a meaningful message is sent to the end user.
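The application-side trap can be illustrated with the following Java sketch; the class, SQL and user-facing message are hypothetical, and only the vendor error code 2393 (ORA-02393) comes from the approach described above.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class SearchDao {
        // ORA-02393 (call limit on CPU usage exceeded) surfaces as vendor error code 2393.
        private static final int ORA_CPU_LIMIT_EXCEEDED = 2393;

        public int countMatches(Connection con, String sql, String param) throws SQLException {
            try (PreparedStatement ps = con.prepareStatement(sql)) {
                ps.setString(1, param);
                try (ResultSet rs = ps.executeQuery()) {
                    int rows = 0;
                    while (rs.next()) rows++;
                    return rows;
                }
            } catch (SQLException e) {
                if (e.getErrorCode() == ORA_CPU_LIMIT_EXCEEDED) {
                    // Translate the terminated query into a meaningful message for the end user.
                    throw new SQLException("Your search is too broad; please narrow the criteria.", e);
                }
                throw e;
            }
        }
    }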
Tuning the blank search – This is a special case of open-ended search where the end user does not pass any
parameters at all. This means all data is part of the result set, and as the data is invariably sorted, it means sorting
millions of rows and then presenting the first few. The only way to address this is to guide the query plan
(at times via hints) so that an index whose column order matches the sort order is picked up, avoiding the costly
sort operation. We also decided to suppress this open search feature wherever it is not required.
Additionally, we added a data filter which restricts the result set to a smaller size. This makes sense,
as a blank search would otherwise bring up multiple pages of data that are not really useful to the user.
4.1 Quick Tuning Steps that lead to further benefits
Most of the time we do not have frequent changes in the static content. In such cases we can use the option of
long-term caching in the browser. However, the browser still generates 304 requests [PERF001] to verify the
validity of the static content, causing response delays. This problem is pronounced in high-latency locations. The
following entry in the httpd configuration file of the web server prevents the 304 requests.
Table 1: Configuration required in httpd to prevent 304 calls
However, when a software upgrade contains modified static content, the browser would still continue to use
the old content. This was managed by appending a release number to the context parameter of the application,
whereby the browser pulls the content once again during the first access. To avoid the URL
changing from the user's perspective, the base URL was made to forward the user request to the upgraded URL.
This is by far the best approach as a quick win for deploying the application in a high-latency WAN environment.
Significant gains in response times were realized due to the reduction in the number of network calls
that the page rendering process performed to retrieve the static content. A gain of 15-20% was realized on
high-latency networks where the latencies were between 200 and 300 ms.
Browser Upgrade
It has to be noted that IE 8 / IE9 give superior performance as compared to IE 7 on account of efficient
rendering APIs in these versions. The performance of the web pages improved 10 to 15% without requiring
any code changes.
5. STP THROUGHPUT PARAMETER TUNING
Design for maintaining the STP throughput considers the following parameters:
the number of files received at any point of time and the non-uniformity in the size of the files received by the system.
Random combinations of small, medium and large files are received from the upstream interfaces. A file
considered small varies from ~1 to 100 KB, based on the record length of the transaction. A medium file varies
from 100 to 1000 KB. A file is considered large when it is in excess of 1 MB; we were expected to receive files of
size up to 60 MB.
The following is the design adopted for achieving scalability and load balancing across the number of files and their
sizes.
Figure 3 Design view of the file processing component for scalability
The files are received by the file interface through a push or pull mechanism, based on the upstream source.
Based on the number of files received, file processor threads are spawned dynamically, scaling up the
processing capability.
The file content is parsed and the contents are grouped based on the type of transaction, e.g. single payments
and bulk payments.
Each of the batches is handled by batch adapters, which divide the load amongst a pool of threads that can be
varied based on the load.
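The following is a simplified Java sketch of this kind of dynamically scaled file processing; it is not the product's actual batch framework, and the pool sizes, queue capacity and parsing step are placeholders.

    import java.io.File;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class FileProcessorPool {
        private final ExecutorService pool;

        public FileProcessorPool(int coreThreads, int maxThreads) {
            // Threads above the core size are created only when the bounded queue fills up,
            // and they are released again after 60 seconds of idleness.
            pool = new ThreadPoolExecutor(coreThreads, maxThreads,
                    60, TimeUnit.SECONDS, new LinkedBlockingQueue<>(1_000));
        }

        public void submit(File paymentFile) {
            pool.submit(() -> parseAndGroup(paymentFile));  // one task per received file
        }

        private void parseAndGroup(File paymentFile) {
            // Parse the file and group its transactions (single vs bulk)
            // before handing them to the batch adapters.
        }
    }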
6. TUNING BATCH PARAMETERS
Batch jobs are a standard way of processing transactions at the end of the day for interest calculations,
account management and other risk management jobs. The volume of transactions at the end of the day can
be quite high; in our case, the peak volume expected was about two million transactions.
The application software should be able to scale up and fully exploit the available resources, viz. CPU, memory,
etc., in a manner such that there is minimal contention amongst parallel paths of execution.
Multithreading support is provided by Java, and the batch framework should exploit it. However, the various
hotspots which cause concurrency issues degrade the gains from multi-threading.
The guidelines below were used to mitigate the contention.
Sequence Caching – Batch processing needs a large number of transaction IDs to be generated. Using Oracle-generated
sequences enables them to be easily generated and assigned to the transactions. Caching them
up front reduces the overhead of ID generation and improves the performance of the batch processing.
Oracle sequences typically used in primary keys should be cached as much as possible (the default cache is 20) if there is
no business constraint, and the NOORDER clause should be preferred. This technique improved the SQL performance
of our application, thereby improving the batch throughput considerably.
Allocation of Transactions to Threads – The allocation of transactions to threads should avoid the situation where
transactions processed in parallel over multiple threads enter into contention (Oracle row lock contention
waits). An example is the case where transactions of the same account are processed in parallel across
threads, each doing a 'select for update' on the account row. The contention can be reduced if the model is
changed so that the threads pick up transactions based on the modulus of the account id. The modulus is a
mathematical function that finds the remainder of the division of one number by another. This technique helped us to
route the transactions evenly across the threads, thereby reducing the hot spots.
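A small illustrative Java sketch of this allocation rule follows; the class and names are hypothetical.

    public class TransactionRouter {
        private final int threadCount;

        public TransactionRouter(int threadCount) {
            this.threadCount = threadCount;
        }

        // All transactions of a given account map to the same thread, so two threads
        // never contend for the same account row ('select for update') at the same time.
        public int threadFor(long accountId) {
            return (int) Math.floorMod(accountId, (long) threadCount);
        }
    }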
Usage of the Right Data Structures – Using better container collection classes promotes better concurrency. We
adopted ConcurrentHashMap, which reduced the cases where transactions entered into deadlock
situations.
Parallelism in the batch profile – This can bring much-needed time reduction in the overall EOD cycle. We
engaged the functional SMEs to carefully redesign the EOD profile so that non-conflicting batch programs can be
run in parallel.
Deadlock Prevention – Deadlocks need to be avoided by planning and proper design. For this, updates
should be done at the end, just before the commit, as far as possible. Tables should be updated in a consistent
order in all programs running in parallel, e.g. Table A, followed by Table B, followed by Table C. Details of
deadlocks, including the SQLs and rows involved, can be seen in the alert log and trace files generated by Oracle. This
technique helped us to reduce deadlocks when files were received with several transactions on a small
set of accounts.
7. TUNING DATABASE PARAMETERS
Managing the REDO logging – Oracle redo logs can become a huge bottleneck when several threads are
writing large amounts of redo data in parallel. The redo volumes were reduced by avoiding repeated updates in the
same commit cycle, by placing the redo logs on faster storage and by separating the log members/groups onto
different disks [PERF003]. More threads mean a higher redo generation load.
By using prepared statements having bind variables - This promotes statement caching, reduces memory
footprint in the shared pool and reduces the parsing overhead.
Bulking of Inserts/Updates – Inserts and updates were clubbed together using the addBatch and executeBatch JDBC
methods. This is highly useful in an IO-bound application. It saves on network round trips and is especially useful
where the latency between the application and database servers is high.
Disabling Auto Commit mode – As a default, the JDBC connection commits at every Insert/Update/Delete
statement. This can not only lead to too many commits resulting in the infamous ‘log file sync wait’ [PERF002] but
can also lead to integrity problems as it breaks the atomic unit of work principle.
Ensuring closure of Prepared statement and Connection – This will conserve the JDBC resources and
prevent database connection leakage. This is done in the ‘finally’ block in Java so that even in exceptions we
close the resources before exiting.
Connection Pool libraries – This saves on the JDBC connection count and promotes better management and
control. Industry-standard application servers use this as a standard practice, and the same may now be used in the
batch frameworks. In this context the right configuration of the pool is very important, especially the minimum/maximum
pool sizes, or else the application threads will wait for connections. BoneCP, a third-party connection pool
management library, was used in our application, and it does a good job of managing the pool effectively.
Controlling the JDBC Fetch Size – Result set rows are retrieved from the database server in lots, as per the
JDBC fetch size. Based on this size, the JDBC driver allocates memory structures in the heap to accommodate
the arriving data. The fetch size determines the number of row-set chunks that Oracle returns for a given query.
The default value is 10, which means that a 100-row result set is returned in 10 round trips of 10 rows each. Too small
a value means more network traffic between the client and the database; too high a value impacts the heap
memory available on the client. The fetch size selected should balance the network traffic and the
heap available on the client side. A large fetch size can lead to out-of-memory errors, as can code that
tries to fetch data with a fetch size equal to the number of rows to be shown on screen.
We selected a fetch size of 100, based on the maximum number of records that a user would normally want to
view for a given search criterion.
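The JDBC-level points above (bind variables, batched execution, explicit commit boundaries and a controlled fetch size) can be condensed into one illustrative Java sketch; the table, columns and commit size are examples only, not the actual payment schema.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class PaymentBatchWriter {
        public void write(Connection con, Iterable<long[]> payments) throws SQLException {
            con.setAutoCommit(false);                      // commit per batch, not per row
            String sql = "INSERT INTO payments (txn_id, account_id, amount) VALUES (?, ?, ?)";
            try (PreparedStatement ps = con.prepareStatement(sql)) {   // bind variables
                int pending = 0;
                for (long[] p : payments) {
                    ps.setLong(1, p[0]);
                    ps.setLong(2, p[1]);
                    ps.setLong(3, p[2]);
                    ps.addBatch();                         // club inserts into one round trip
                    if (++pending == 500) {                // commit size is a tuning choice
                        ps.executeBatch();
                        con.commit();
                        pending = 0;
                    }
                }
                ps.executeBatch();
                con.commit();
            }
        }

        // For reads, the fetch size balances network round trips against client heap.
        public void tuneReads(PreparedStatement query) throws SQLException {
            query.setFetchSize(100);
        }
    }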
Database Clustering – Implementing Oracle Real Application Clusters (RAC) can help scale the database layer greatly. The
database is one of the most important layers; an issue there normally cascades across all the other layers. RAC is a special case,
and unless the application is designed for RAC, performance is going to degrade by 30% or more on account of RAC-related waits
caused by the transfer of Oracle data blocks over the high-speed interconnect, across the SGAs of the different RAC
nodes. While index hash partitioning, sequence caching and the like can give some relief, real improvement comes from
partitioning the application workload in sync with the database table partitioning.
For example, in a two-entity scenario, JVMs processing Entity 1 connect to RAC node A as the preferred instance
(using Oracle Services) and JVMs processing Entity 2 connect to RAC node B as the preferred instance. The
required tables are partitioned on entity id and the partitions are mapped onto separate tablespaces to isolate
the blocking scenarios. All this helps mitigate the RAC-related overheads and promotes improved performance,
scalability and availability, but it implies that enough thought has been given to RAC enablement at the design stage.
Implementation of RAC in this programme was postponed as there was no infrastructure support available in the
client environment.
Oracle specific considerations [PERF003]
Physical Design: The physical design of database storage has an important impact on the overall performance of
the database. The following factors were considered for optimizing the database design:
The physical arrangement of tablespaces on storage volumes (hard disks), along with the mapping of database
objects (tables, indexes, etc.) onto the tablespaces.
The number of redo log groups, their member count, and log file sizes and placement (e.g. log groups and their
members should be placed on separate and very fast disks).
The storage options of database objects within a tablespace (e.g. in Oracle, options to be set in the storage
clause, such as PCTFREE and extents, should be exercised).
The definition of indexes – which table columns should be indexed, and the type of index (e.g. B-tree, or bitmapped
indexes, recommended for status-type columns).
Other performance tuning options, depending on RDBMS features (e.g. partitioned tables and materialized views
should be exploited).
Table Partitioning – If volumes are going to be high, table partitioning has to be thought through in advance rather than
later, when performance issues have already appeared. For databases above 1 TB, partitioning must be an essential
consideration. It reduces hot spots such as the index last block, where we have sequentially increasing keys for
the index. We considered both table and index partitioning to get the best gains.
Partitioning helped us greatly in the large-scale removal of data during the archival and purging process. For example,
removal of the January 2012 data from a database table can be done very quickly by dropping the monthly partition, as
compared to a conventional delete of millions of rows. Additionally, it helped us to improve the maintenance of the
database: partitions were backed up independently, and partitions were marked read-only to save backup time.
8. KEY BENEFITS REALIZED
8.1 Web performance
The design criteria adopted for web page performance helped us to meet the SLA expectations of the users.
While the logged-in users were expected to be around 2,000, the concurrent usage was 200 users.
Figure 4: Average response time of the top 10 frequent use cases
(X-axis: average response range; Y-axis: actual graph value)
The search page performance was manageable even when a portion of the users fired blank searches. Queries
were adjusted to include a shortened date range to manage the data volume.
The configuration introduced in the web server to stop 304 calls originating from the browser boosted the
performance of the application in regions where the network latency was upwards of 200 ms.
The graph above shows the response times of the key pages that the user community uses most
frequently.
8.2 STP performance
The design adopted for processing files of varying sizes increased the throughput significantly. The dynamic
spawning of file processors and the splitting of huge files into manageable batches improved the scalability of the
system dramatically.
The following matrix provides the split of the various files and messages, totaling 300,000 transactions per hour,
that the system was able to process in 60 minutes.
Table 2: STP volumes to be supported on each of the key interfaces
8.3 Batch performance
This was one of the key components in the payment processing system. Parallel processing of the various batch
components that are functionally independent helped us to reduce the time window of the batches significantly.
The following picture depicts the various groups of batches grouped together and executed in parallel.
Figure 5: Groups of batches executed in parallel
The right sizing of the Oracle database's redo log volumes helped a lot in improving the IO of the system and thereby the
processing performance. An appropriate commit size configuration, coupled with a very efficient third-party connection
pool library, helped us to increase the batch throughput significantly.
With these improvements, the batch component was able to complete the EOD profile on a transaction volume of
two million transactions, which was the peak volume test mandated by the customer.
9. KEY LESSONS LEARNT FROM THE PROJECT
The following are the key lessons learnt from this project.
Performance is a key component to be planned and worked upon from the requirements stage through to the performance
testing stage. It cannot be done as an exercise only when issues are found. The performance of large systems hinges
upon the ability to minimize IO by maximizing efficiency. Exploit caching features thoroughly, as they contribute
immensely to the scalability of the system.
Quick results on web page performance can be realized using static content caching and by reducing 304 calls,
especially in a WAN environment. It has to be noted that IE 8 / IE 9 give better results as compared to IE 7 on
account of more efficient rendering APIs in these versions; the performance of the pages is expected to improve by 10
to 15% without requiring any code changes.
Keeping the number of SQLs fired per unit of output to a minimum is one of the key design objectives
to be borne in mind. Once this is done, the rest of the tuning can be achieved easily with appropriate indexing.
Keeping the primary database size to a minimum through an archival policy is essential to contain the ever-growing
transaction tables. Without a good archival policy, the performance of the system is bound to deteriorate with
the passage of time. Implementation of RAC requires careful thought about the application design, without which we
could experience significant performance degradation.
REFERENCES
[PERF001] Best practices for speeding up your web site. Yahoo developer network.
[PERF002] Oracle Tuning: The Definitive Reference: By Donald Burleson.
[PERF003] Oracle® Database Performance Tuning Guide.
Designing for Performance Management of Mission Critical Software
Systems in Production
Raghu Ramakrishnan, TCS, A61-A, Sector 63, Noida, Uttar Pradesh, India 201301, 91-9810607820, [email protected]
Arvinder Kaur, USICT, GGSIPU, Sector 16C, Dwarka, Delhi, India 110078, 91-9810434395, [email protected]
Gopal Sharma, TCS, A61-A, Sector 63, Noida, Uttar Pradesh, India 201301, 91-9958444833, [email protected]
Traditionally, the performance management of software systems in production has been a
reactive exercise, often carried out after a performance bottleneck has surfaced or a severe
disruption of service has occurred. In many such scenarios, the reason behind the behavior
is never correctly identified, primarily due to the absence of accurate information. The absence
of historical records of system performance also limits the development of models and
baselines for the proactive identification of trends which indicate degradation. This paper seeks to
change the way performance management is carried out. It identifies five best practices,
classified as requirements, to be included as part of software system design
and construction to improve the overall quality of performance management of these systems
in production. These practices were successfully implemented at design time in a mission critical software
system, resulting in effective and efficient performance management of this
system from the time it was operationalized.
1 Introduction
The business and technology landscape of today is characterized by the increasing presence of mission critical
web applications. These web applications have progressed from simple applications serving static content
to applications allowing all kinds of business transactions. The responsiveness of websites under the concurrent load
of a large number of users is an important performance indicator for the end users and the underlying business.
The growing focus on high performance and resilience has necessitated including comprehensive performance
management as an integral part of software systems. However, this is an area which has received limited
focus and relies on a fix-it-later approach from a project execution perspective. The key to successful
performance management of critical systems in production is the timely availability and accuracy of data, which may
then be analyzed for the proactive identification of performance incidents. The inclusion of performance management
requirements is therefore essential in the design and construction phase of a software system.
Our experience in handling of performance related incidents in critical web application over the last few years has
shown that the focus on inclusion of such techniques starts when performance incidents get reported after the
start of operations of the software system in production. This may be too late since there may be little room left for
any significant design change at that point of time or may require a major rework in the application. This approach
The copyright of this paper is owned by the author(s). The author(s) hereby grants Computer Measurement
Group Inc a royalty free right to publish this paper in CMG India Annual Conference Proceedings.
is risk prone and expensive, as making changes in the implementation phase of the software development lifecycle is
difficult and incurs rework effort and up to a 100 times increase in cost [YOGE2009].
A number of studies have shown that responsive websites impact the productivity, profits and brand image of an
organization in a positive manner, while slow websites result in loss of brand value due to negative publicity
and decreased productivity. A survey by the Aberdeen Group of 160 organizations reported an impact of up to
16% on customer satisfaction and 7% on conversions (i.e. the loss of a potential customer) due to a one second delay in
response time. The survey also reported that the best in class organizations improved their average application
response time by 273 percent [SIMI2008].
Satoshi Iwata et al. demonstrate the usage of statistical process control charts based on the median statistic for
detecting performance anomalies related to processing time in RuBiS, a web based prototype of an
auction site [SATO2010]. This performance anomaly detection technique requires the timely availability of the
measured value using appropriate instrumentation techniques (e.g. response time from the web application).
This paper tries to bring about a paradigm shift from the prevalent reactive and silo-based approach in the domain
of performance monitoring of mission critical software systems to an analytics based engineering approach by
including certain proven requirements as part of the design and construction process. The objective is to
know about a performance issue before a complaint is received from the end users. The silo-based approach
analyzes dimensions such as web applications, web servers, application servers, database servers,
storage, servers and network components in isolation. The reactive approach involves adding logs in a
makeshift manner only when a performance incident occurs. This in turn results in the required information not
being available at the right time for effective detection, root cause analysis and resolution of the problem.
This paper suggests including practices like instrumentation, a system performance archive, controlled monitoring,
simulation and an integrated operations console as part of the design and construction process of software systems.
These requirements are not aimed at improving or optimizing the performance of the software system but at
enhancing the effectiveness and efficiency of performance monitoring in production.

These requirements were successfully included as part of the design and construction process of a mission critical
e-government domain web application built using J2EE technology. This has helped the production support team to
recognize early warning signs which may lead to a possible performance incident and to take corrective actions
quickly.
The rest of this paper is organized as follows. Section 2 describes the application in which the proposed best
practices were implemented. Section 3 describes the best practices in detail. Section 4 presents the results and
findings of our work. Section 5 provides the summary, conclusion, limitations of our work and suggestions for
future work.
2 Background
These requirements were successfully included as part of the design and construction process of a mission critical
e-government domain web application built using J2EE technology, servicing more than 40,000 customers every day.
The web application in this e-governance program is used by both external and department users for carrying out
business transactions. The external users access this application on the Internet and the department users
access it on the Intranet. The technical architecture has five tiers for the external users and three tiers for the
department users. The presentation components are JSPs and the business logic components are POJOs
developed using the Spring framework.
External Users: The information flow from external users passes through five tiers. Tiers 1, 2A and 3 host the
web application server. Tiers 2B and 4 host the database server. Tier 1 provides the presentation services.
Tiers 2A and 2B carry out the request routing from tier 1 to tier 3 after carrying out the necessary authentication
and authorization checks. The business logic is deployed in tier 3. Tier 4 holds all transactional and master data of
the application.

Department Users: The information flow from department users passes through three tiers. Tiers 5 and 6 host
the web application server. Tier 5 provides the presentation services. The business logic is deployed in tier 6. Tier 4
holds all transactional and master data of the application.
Figure 2 shows the logical architecture of the e-government domain web application.
Figure 2: Logical architecture of the e-governance web application
3 Building Performance Management in Software Systems
The existing approach to performance management of software systems in production is reactive and silo
based. The silo based approach involves measurements at the IT infrastructure component level, i.e.
server, storage, web servers, application servers, database servers, network components and application server
garbage collection health. The reactive approach involves adding log entries whenever a performance incident
occurs. This section describes in detail five mandatory requirements that a software system needs to incorporate
as part of the design and construction process to ensure effective, proactive and holistic performance management
after the system is in production. These requirements were successfully implemented as part of the design and
construction of a mission critical e-government domain web application with excellent results. This web application
provides a number of critical business transactions to end users and must meet stringent performance and
availability service level agreement requirements.
3.1 Instrumentation
The instrumentation principle of Software Performance Engineering states the need to instrument software
systems when they are designed and constructed [CONN2002]. Instrumentation is the inclusion of log entries in
components for generating data to be used for analysis. These log entries do not change the application behavior
or state. Correct and sufficient instrumentation helps in the quick isolation of performance hotspots and in determining
the components contributing to these hotspots. The logs from various tiers and sources form an important input
to the performance management of software systems in production. These logs can be from the application or the
infrastructure tier. The logs from the application tier include the web server logs and custom logs of the web application.
The logs from the infrastructure tier include processor utilization, disk utilization, application server garbage
collection, etc. Techniques for implementing instrumentation include the use of filters, interceptors and base classes.
Figure 3 shows the usage of a base class for implementing instrumentation.
The software system requirements in the area of performance and scalability traditionally do not mention an
instrumentation requirement. The experience of the authors in managing large scale software systems showed
instrumentation being introduced as a reactive practice towards the end of the software development lifecycle for
identification of performance incidents reported from the end users or performance tests. This reactive approach
results in rework and schedule slippage due to the code changes needed for instrumentation and the regression
testing required following these changes. This paper recommends including this practice as a key requirement in
the software requirements specification rather than leaving it as merely a best practice.
PRACTICE: Include sufficient instrumentation in all tiers for quick isolation of performance
problems and identification of the component(s) contributing to performance problems.
public abstract class TestBaseAction extends ActionSupport
        implements PrincipalAware, ServletRequestAware, SessionAware {

    private static final Logger logger = Logger.getLogger(TestBaseAction.class); // e.g. a log4j logger

    /* Template method: every action inherits this timing instrumentation. */
    public final String execute() throws Exception {
        Date begin = new Date();
        String res = execute2();                  /* action-specific business logic */
        Date end = new Date();
        logger.info("... " + (end.getTime() - begin.getTime()) + " ms ...");
        return res;
    }

    /* Implemented by each concrete action class. */
    protected abstract String execute2() throws Exception;
}
Figure 3: A base class is a place to add instrumentation log entries.
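A servlet filter, mentioned above as an alternative technique, can provide the same timing instrumentation for every request without touching the individual actions. The sketch below is illustrative only; the filter and logger names are ours and not taken from the application.

import java.io.IOException;
import java.util.logging.Logger;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;

// Hypothetical timing filter: logs the URI and execution time of every request.
public class TimingFilter implements Filter {

    private static final Logger LOGGER = Logger.getLogger(TimingFilter.class.getName());

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        long begin = System.currentTimeMillis();
        try {
            chain.doFilter(request, response);             // continue normal request processing
        } finally {
            long elapsed = System.currentTimeMillis() - begin;
            String uri = ((HttpServletRequest) request).getRequestURI();
            LOGGER.info(uri + " took " + elapsed + " ms"); // instrumentation log entry
        }
    }

    @Override
    public void init(FilterConfig config) { }

    @Override
    public void destroy() { }
}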
Figure 4 shows entries from a standard web server log. Each entry includes a timestamp, the request information,
execution time, response size and status. The software requirements specification can explicitly state that the web
server log needs to be enabled for recording specific attributes. These entries can be aggregated over a time interval
(e.g. two minutes) to arrive at statistics like count and mean response time, or used for steady state analysis of the
software system. In a stable system, the rate at which requests arrive is equal to the rate at which they leave the
system.
XXX.00.777.1XX - - [14/Aug/2014:22:06:13 +0530] "GET /OnlineApp/news/ticker.jsp HTTP/1.1" 200 1078 0
YYY.99.010.1XX - - [14/Aug/2014:22:06:13 +0530] "POST /OnlineApp/secure/AddressAction HTTP/1.1" 200
13495 0
YYY.99.010.1XX - - [14/Aug/2014:22:06:13 +0530] "GET /OnlineApp/images/bt_red.gif HTTP/1.1" 200 157
0
XXX.00.777.1XX - - [14/Aug/2014:22:06:13 +0530] "GET /OnlineApp/images/bullet_gray.gif HTTP/1.1" 200
45 0
XXX.00.777.1XX - - [14/Aug/2014:22:06:13 +0530] "GET /OnlineApp/status/tracking HTTP/1.1" 200 10055 0
ZZZ.77.99.1XX - - [14/Aug/2014:22:06:13 +0530] "GET /OnlineApp/css/doctextsizer.css HTTP/1.1" 200 73 0
YYY.99.010.1XX - - [14/Aug/2014:22:06:13 +0530] "GET /OnlineApp/news/ticker.jsp HTTP/1.1" 200 1078 0
YYY.99.010.1XX - - [14/Aug/2014:22:06:14 +0530] "GET /OnlineApp/CaptchaRxs?x=1483d7f9s71990
HTTP/1.1" 200 4508 0
ZZZ.77.99.1XX - - [14/Aug/2014:22:06:13 +0530] "GET /OnlineApp/secure/ServiceNeeded HTTP/1.1" 200
7346 0
ZZZ.77.99.1XX - - [14/Aug/2014:22:06:14 +0530] "POST /OnlineApp/user/loginValidate HTTP/1.1" 302 - 0
YYY.99.010.1XX - - [14/Aug/2014:22:06:14 +0530] "GET /OnlineApp/user/uservalidation HTTP/1.1" 200
8717 0
YYY.99.010.1XX - - [14/Aug/2014:22:06:14 +0530] "GET
/OnlineApp/user/loginAction?request_locale=en&[email protected] HTTP/1.1" 302 - 0
YYY.99.010.1XX - - [14/Aug/2014:22:06:14 +0530] "POST /OnlineApp/secure/logmeAction HTTP/1.1" 200
7657 0
Figure 4: Using Web Server Logs as an Instrumentation Tool
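To make the aggregation step concrete, the following minimal sketch (not part of the original implementation) groups web server log entries of the kind shown in Figure 4 into two-minute buckets and reports the request count and mean execution time per bucket. The field positions assumed here, the timestamp in square brackets and the execution time as the last token, follow the sample entries above.

import java.io.BufferedReader;
import java.io.FileReader;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical aggregator: buckets access-log lines into 2-minute intervals and
// prints the request count and mean execution time for each interval.
public class AccessLogAggregator {

    private static final long BUCKET_MS = 2 * 60 * 1000L;            // two-minute interval
    private static final SimpleDateFormat TS =
            new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH); // e.g. 14/Aug/2014:22:06:13 +0530

    public static void main(String[] args) throws Exception {
        Map<Long, long[]> buckets = new TreeMap<>();                  // bucket start -> {count, total time}
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Timestamp sits between '[' and ']'; execution time is the last field.
                String ts = line.substring(line.indexOf('[') + 1, line.indexOf(']'));
                String[] fields = line.trim().split("\\s+");
                long execTime = Long.parseLong(fields[fields.length - 1]);
                long bucket = (TS.parse(ts).getTime() / BUCKET_MS) * BUCKET_MS;
                long[] stats = buckets.computeIfAbsent(bucket, k -> new long[2]);
                stats[0]++;                                           // request count
                stats[1] += execTime;                                 // accumulated execution time
            }
        }
        for (Map.Entry<Long, long[]> e : buckets.entrySet()) {
            long count = e.getValue()[0];
            System.out.printf("%s  requests=%d  meanTime=%.1f%n",
                    new Date(e.getKey()), count, (double) e.getValue()[1] / count);
        }
    }
}

The same interval statistics feed the Integrated Operations Console graphs described later in the paper.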
Figure 5 shows entries from a custom web application log. Each entry includes an entry and exit timestamp, the web
container thread identifier, the request information, execution time, status and a correlation identifier. The software
requirements specification can explicitly state that the custom application log needs to record specific attributes.
The custom logs provide application specific information which may not be correctly reflected in the
web server log (e.g. a web server log may report an HTTP status 200 while the business transaction may have
encountered a logical error). These entries can be aggregated over a time interval (e.g. two minutes) to arrive at
statistics like count and mean response time, or used for steady state analysis of the software system.
2014-08-14 21:01:13,836 | WebContainer : 760 | -|DCBANKONL|class .secure.action.uploadform|-|Mon Aug 14 21:01:00 GMT+05:30 2014|13176|-|-|20140818210100000001ABBBA26d1ds668|
2014-08-14 21:01:41,507 | WebContainer : 755 | -|[email protected]|class online.secure.action.viewFormAction|-|Mon Aug 14 21:01:34 GMT+05:30 2014|7030|-|-|201408182101340s0ad0af00164515616|
2014-08-14 21:01:52,798 | WebContainer : 730 | -|[email protected]|class online.secure.action.payment.PaymentVerificationAction|-|Mon Aug 14 21:01:46 GMT+05:30 2014|5805|-|-|20140818210146000001AAAA65590404|
2014-08-14 21:02:34,466 | WebContainer : 699 | -|CCC0990|class online.secure.action.CreditCardPaymentAction|-|Mon Aug 14 21:02:26 GMT+05:30 2014|7733|-|-|20140818210226000002AAAA23518695|
2014-08-14 21:02:34,498 | WebContainer : 655 | -|[email protected]|class online.secure.action.ApplicationSubmitAction|-|Mon Aug 14 21:02:19 GMT+05:30 2014|15050|-|-|2014081820aa000d02AAaA5a118s7|
Figure 5: Using Custom Logs as an Instrumentation Tool
3.2 System Performance Archive
This practice involves keeping a record of the history of the software system by storing the values of various
metrics related to the performance of the system. These metrics provide a strong mechanism for reviewing past
performance and identifying emerging trends. Brewster Kahle founded the Internet Archive to keep a record of
the history of the Internet [BREW1996]. The HTTP Archive is a similar permanent record of information related to web
page performance, like page size, total requests per page etc. [HTPA2011].
The system performance archive must capture a minimum of three important attributes namely the measured
value, the metric and the applicable domain for each measurement (e.g. the metric is the response time,
applicable domain is the application home page and the measured value is 4.2 seconds). The measured metric
can be explicit or implicit. An example of an explicit metric can be derived from the requirement that “the software
system shall be designed to process 99% of the online home page requests to complete within 5 seconds”. This
archive is used as input to in-house and third party analytical tools to carry out statistical analysis (e.g. mean,
median, standard deviation, percentile) and modeling (e.g. capacity planning). This paper recommends inclusion
of this practice as a key requirement in the software requirements specification. Critical software systems need to
provision the required infrastructure for creating this archive, in terms of compute, storage and in-house or third
party analytical tools, at the time of design, capacity planning and construction. This compute and storage
provisioning can easily be done in a cloud environment.
PRACTICE: Design and construct a system performance archive for critical software systems to
keep a record of performance related information.
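As an illustration of the three attributes described above, the sketch below (our own simplification, not the paper's implementation) models an archive record and computes a percentile over the measured values of one metric/domain pair, e.g. to verify a requirement such as "99% of home page requests complete within 5 seconds".

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical archive record: metric, applicable domain, measured value and timestamp.
public class ArchiveRecord {
    final String metric;        // e.g. "response time"
    final String domain;        // e.g. "application home page"
    final double value;         // e.g. 4.2 (seconds)
    final long timestamp;       // epoch millis of the measurement

    ArchiveRecord(String metric, String domain, double value, long timestamp) {
        this.metric = metric;
        this.domain = domain;
        this.value = value;
        this.timestamp = timestamp;
    }

    // Returns the p-th percentile (0 < p <= 100, nearest-rank method) of the values
    // recorded for one metric/domain pair.
    static double percentile(List<ArchiveRecord> records, String metric, String domain, double p) {
        List<Double> values = new ArrayList<>();
        for (ArchiveRecord r : records) {
            if (r.metric.equals(metric) && r.domain.equals(domain)) {
                values.add(r.value);
            }
        }
        Collections.sort(values);
        int index = (int) Math.ceil(p / 100.0 * values.size()) - 1;
        return values.get(Math.max(index, 0));
    }
}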
3.3 Controlled Monitoring
This practice involves executing synthetic read only business transactions using a real browser, connection speed
and latency. These synthetic business transactions can be executed from one or more regions. There are a
number of incidents in which an end user reports experiencing slowness but the server health appears normal.
The practice of controlled monitoring helps quickly determine whether the incident is specific to the user reporting
the problem. The software system requirements in the area of performance and scalability traditionally do not mention
executing synthetic read only business transactions using a real browser, at real connection speed and latency.
This requirement can be implemented using frameworks like a private instance of WebPageTest
(https://sites.google.com/a/webpagetest.org/docs/private-instances). Critical software systems need to provision
the required infrastructure for carrying out this monitoring and upload the results into the System Performance
Archive.
PRACTICE: Use synthetic read only business transactions using a real browser, connection speed
and latency to measure performance.
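One possible way to execute such a synthetic read only transaction, if a private WebPageTest instance is not available, is to drive a real browser and record the page load time. The sketch below assumes Selenium WebDriver and a placeholder home page URL; it simply times the navigation call.

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

// Hypothetical synthetic monitor: loads a read-only page in a real browser and
// records how long the navigation took. The URL below is a placeholder.
public class SyntheticMonitor {
    public static void main(String[] args) {
        WebDriver driver = new FirefoxDriver();
        try {
            long begin = System.currentTimeMillis();
            driver.get("https://example.org/OnlineApp/home");   // synthetic read-only transaction
            long elapsed = System.currentTimeMillis() - begin;
            System.out.println("Home page loaded in " + elapsed + " ms");
            // The measurement would then be uploaded to the System Performance Archive.
        } finally {
            driver.quit();
        }
    }
}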
3.4 Simulation Environment
This practice is based on the belief that most events happening in a system should be reproducible under similar
conditions. The causal analysis of certain incidents may remain inconclusive during the initial analysis. Recreating
the symptoms leading to such an incident under similar conditions may lead to deeper insight and help in finding
the actual root cause. Since such simulation may not be feasible in the actual production environment in the majority
of cases, a similar Simulation Test environment needs to be used. The prevalent practice in the industry
appears to be to treat performance testing as a single, one-time activity prior to implementation, resulting in the
provisioning of a simulation environment only for a limited duration. As a result, reproduction of a complex problem
that occurred in production becomes extremely difficult.
PRACTICE: Provide a simulation environment to reproduce the performance incident in production
like conditions to ensure completeness and correctness of the causal analysis of that incident.
3.5 Integrated Operations Console
In order to manage the performance of a production system effectively, it is essential that production support teams
have the ability to visualize anomalies and resolve exceptions without delay. The prevalent silo based approach
involves measurements at the IT infrastructure component level, i.e. server, storage, web servers, application
servers, database servers, network components and application server garbage collection health. The concept of
an Integrated Operations Console can be very effective in such scenarios. This console not only monitors the
system performance, but also records exception conditions and provides the ability to take actions to resolve these
conditions. Typical actions may range from killing a process or query to restarting a service. The console
also needs to provide a component level checklist which can be executed automatically prior to the start of
operations every day. Table 1 shows an extract of a database checklist.

This console empowers the teams to take quicker action once an exception is observed. Besides allowing
actions, the console may automatically gather relevant data such as heap dumps and database snapshots to aid
further investigation.
Host accessible?
Instance available?
Able to connect to the database?
….
Table 1: Extract of a database checklist
PRACTICE: Provide an Integrated Operations Console for monitoring the system performance
parameters and mechanism to resolve anomalies for which resolution processes are known.
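The database portion of such a checklist can be automated with plain JDBC. The following sketch (ours, with placeholder connection details) runs the "able to connect" check from Table 1 and reports the result.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Hypothetical automated checklist item: "Able to connect to the database?"
// The JDBC URL and credentials are placeholders.
public class DatabaseChecklist {
    public static void main(String[] args) {
        String url = "jdbc:oracle:thin:@dbhost:1521:PROD";   // placeholder connection string
        try (Connection conn = DriverManager.getConnection(url, "monitor", "secret")) {
            System.out.println("Able to connect to the database? " + conn.isValid(5));
        } catch (SQLException e) {
            System.out.println("Able to connect to the database? NO (" + e.getMessage() + ")");
            // The console could raise an exception condition and trigger a known resolution here.
        }
    }
}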
4 Results & Findings
The above five practices were successfully implemented as part of the design and construction of a mission critical
e-government domain web application.
4.1 Implementation 1
The instrumentation was implemented as an integral part of the design and construction activity of the e-governance
application. Figure 6 shows the instrumentation implemented in that application. Tiers 1, 2A, 3, 5 and 6
implemented instrumentation in the form of web server logs and custom logs. Tiers 2B and 4 implemented
instrumentation in the form of database snapshots. The relevant logs from all tiers are collected at a shared
location. The information from these logs is processed and used as input to the System Performance Archive
and the Integrated Operations Console.
Figure 6: Instrumentation implemented in web server, application server and database
The first example shows how simple instrumentation helped find the cause of a performance incident where
end users experienced high response times. Figure 7 and Figure 8 show the request and response time graphs of tiers
1 and 3 respectively, calculated from the web server logs and depicted on the Integrated Operations Console.
The request count is the number of requests serviced in a given time interval and the response time is the mean
execution time of all the requests serviced in that time interval. Visual inspection of Figure 7 clearly shows a
high mean response time in tier 1. The spike in the mean response time is not visible in tier 3 for the same
duration, but there is a drop in the number of requests serviced by this tier. This helps us conclude that the origin
of the performance incident may be tier 1 or 2A.
Figure 7: The request and response time graph of tier 1 from the Integrated Operations Console. The request is the count of all
requests serviced in a given time interval and the response time is the mean execution time of all requests serviced in that time
interval.
Figure 8: The request and response time graph of tier 3 from the Integrated Operations Console. The request is the count of all
requests serviced in a given time interval and the response time is the mean execution time of all requests serviced in that time
interval.
The second example shows how the same instrumentation helps in determining whether the system is in a steady
state. Figure 9 and Figure 10 show that the system is in a stable state, or equilibrium, as the number of arrivals is
equal to the number of exits. The arrivals and exits graphs are depicted on the Integrated Operations Console.
Figure 9: The count of arrivals in a given time interval from the Integrated Operations Console
Figure 10: The count of exits in a given time interval from the Integrated Operations Console
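Such a steady-state check reduces to comparing the arrival and exit counts for each interval. A minimal sketch (ours) that flags intervals where the imbalance exceeds a tolerance is shown below.

// Hypothetical steady-state check: arrivals[i] and exits[i] hold the counts for the
// i-th time interval; intervals where the imbalance exceeds the tolerance are flagged.
public class SteadyStateCheck {
    static void report(long[] arrivals, long[] exits, double tolerance) {
        for (int i = 0; i < arrivals.length; i++) {
            long diff = arrivals[i] - exits[i];
            double imbalance = arrivals[i] == 0 ? 0.0 : Math.abs(diff) / (double) arrivals[i];
            if (imbalance > tolerance) {
                System.out.printf("Interval %d not in steady state: arrivals=%d exits=%d%n",
                        i, arrivals[i], exits[i]);
            }
        }
    }
}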
4.2 Implementation 2
The provisioning of a simulation environment even after implementation helped resolve serious, long stop-the-world
garbage collection pauses in the application server on tiers 1 and 3. The simulation and confirmation of the
resolution of the issue required multiple cycles of test execution. Identifying the reason for these pauses
as a class unload problem required adding debug logs in a Java runtime class. Executing multiple test cycles to
reproduce the problem is not feasible in the development or production environment. Figure 11 shows the
garbage collector log with more than 91,000 classes getting unloaded.
Figure 11: The garbage collection log from the Simulation Environment
4.3 Implementation 3
The System Performance Archive provides an insight into the historical as well as emerging performance related
trends for the software system. These trends are crucial to assess the capability of the software system to render
services while meeting the required performance objectives. This information is also used in capacity planning
exercises. Figure 12 shows the implementation of System Performance Archives in the e-government domain
application.
Figure 12: Implementation of the System Performance Archive
Certain trends may be cyclic and may appear only after a particular period. Other trends may be more permanent
in nature and tend to grow or decline. The implementation used descriptive statistics like count, mean, median,
minimum, maximum, percentile and standard deviation. Figure 13 shows the mean response time trend of a
business transaction (BT1) for a period of a month. The response time ranges between 800 and 900 milliseconds
and, as can be clearly seen, remained constant for the complete month. Figure 14 shows the mean response time
trend of a second business transaction (BT2) for the same month. It can be seen that there is a change in the trend
on the 7th (1919 milliseconds to 2004 milliseconds) and on the 29th (2176 milliseconds to 2813 milliseconds). The
changes in the trends were investigated further to find the reason for the same. The change in the trend on the 7th
was due to additional business logic being added to the business transaction as part of the deployment on the 6th.
The increase in the mean response time on the 29th was due to a network issue resulting in transactions getting
executed slowly.
Figure 13: Daily mean response time trend of a business transaction (BT1) for a month
Figure 14: Daily mean response time trend of a business transaction (BT2) for a month
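The day-over-day changes discussed above can be flagged automatically by comparing consecutive daily means against a threshold. The sketch below (ours, not part of the archive implementation) prints the days on which the mean response time of a business transaction shifted by more than a chosen percentage.

// Hypothetical trend check: dailyMeans[d] is the mean response time (ms) of a business
// transaction on day d of the month; shifts larger than thresholdPct are reported.
public class TrendChangeDetector {
    static void report(double[] dailyMeans, double thresholdPct) {
        for (int day = 1; day < dailyMeans.length; day++) {
            double change = (dailyMeans[day] - dailyMeans[day - 1]) / dailyMeans[day - 1] * 100.0;
            if (Math.abs(change) > thresholdPct) {
                System.out.printf("Day %d: mean moved from %.0f ms to %.0f ms (%.1f%%)%n",
                        day + 1, dailyMeans[day - 1], dailyMeans[day], change);
            }
        }
    }
}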
5 Threats to Validity
The practices described in this paper are based on the authors' experience in working on and managing mission
critical web applications. These practices may need to be augmented with additional practices which the software
performance engineering community may share.
6 Conclusions
In this paper we have introduced a few mandatory requirements to be included as part of the design and
construction of a mission critical software system. These bring about a paradigm shift from the prevalent situation
in which production support teams are not adequately equipped to quickly detect a performance incident, gather enough
information during an incident, or reproduce the incident for a more accurate closure. When included as part of the
design, these practices provided significant benefits in the production support of a mission critical e-government
domain system in the areas of timely detection of performance incidents allowing corrective action, visualization of
emerging trends and a more correct closure of incidents.
7 Future Work
The future work in this area includes creating baselines and statistical models for metrics like response time and
throughput. There is also a need to devise proactive anomaly detection models using techniques like steady state
analysis.
References
[YOGE2009] K. K. Aggarwal and Yogesh Singh, "Software Engineering", New Age International Publishers, p. 470, 2009.
[SIMI2008] B. Simic, "The Performance of Web Applications: Customers Are Won or Lost in One Second", Technical Report, Aberdeen Group. Accessed on 31 Jan 2014 at http://www.aberdeen.com/aberdeen-library/5136/RA-performance-web-application.aspx
[SATO2010] S. Iwata and K. Kono, "Narrowing Down Possible Causes of Performance Anomaly in Web Applications", European Dependable Computing Conference, pp. 185-190, 2010.
[CONN2002] Connie U. Smith, "Performance Solutions – A Practical Guide to Creating Responsive, Scalable Software", Addison Wesley, p. 243, 2002.
[BREW1996] Brewster Kahle, Internet Archive. Accessed on 17 Aug 2014 at http://en.wikipedia.org/wiki/Brewster_Kahle
[HTPA2011] HTTP Archive. Accessed on 31 Jan 2014 at http://httparchive.org/
[WPGT] WebPageTest. Accessed on 31 Oct 2014 at https://sites.google.com/a/webpagetest.org/docs/private-instances
Incremental Risk Charge Calculation: A case study of
performance optimization on many/multi core
platforms
Amit Kalele*, Manoj Nambiar# and Mahesh Barve*#
Center of Excellence for Optimization and Parallelization
Tata Consultancy Services Limited, Pune India
*
[email protected], #[email protected], *#[email protected]
Incremental Risk Charge calculation is a crucial part of credit risk estimation. This data intensive
calculation requires huge compute resources. A large grid of workstations was deployed at a large
European bank to carry out these computations. In this paper we show that, with the availability of many
core coprocessors like GPU and MIC and parallel computing paradigms, a speed-up of an order of
magnitude can be achieved for the same workload with just a single server. This proof of concept
demonstrates that, with the help of performance analysis and tuning, coprocessors can be made to
deliver high performance with low energy consumption, making them a "must-have" for financial
institutions.
1. Introduction
Incremental Risk Charge (IRC) is a regulatory charge for default and migration risk for trading
book position. Inclusion of IRC is made mandatory under the new Basel III reforms in banking
regulations for minimum trading book capital. The calculation of IRC is a compute intensive
task, especially the methods involving Monte-Carlo simulations. A large European bank
approached us to analyze and resolve performance bottlenecks in IRC calculations. The timing
reported on a grid of 50 workstations at their datacenter was approximately 45 min. Risk
estimation and Monte Carlo techniques is well studied topic and details can be found in [1], [2],
[3] and [4]. In this paper we focus on the performance optimization of the IRC calculations on
modern day many/multi core platforms.
Modern day CPUs and GPUs (Graphic Processing Units) are extremely powerful machines.
Equipped with many compute cores, they are capable of performing multiple tasks in parallel.
Exploiting their parallel processing capabilities, along with several other optimization techniques,
can result in a manyfold improvement in performance. In this paper we present our approach to the
performance optimization of IRC calculations. We show that multifold gains, in terms of
reduction in compute time, hardware footprint and energy required, can be achieved. We report
that speed-ups of 13.5x and 5.2x were achieved on Nvidia's K40 GPU and the Intel KNC coprocessor
respectively.
In this paper we present performance optimization of IRC calculations on Nvidia’s K40 [11] and
Intel’s Xeon Phi or KNC coprocessors. We also present benchmarks on Intel’s latest available
platforms, namely the Sandy Bridge and Ivy Bridge processors. The paper is organized as follows. In
the next section (2), we briefly describe incremental risk charge in relation to credit risk. In
sections (3) and (4), we introduce a method for the IRC calculation along with the experimental
setup and procedure. The performance optimization of the IRC calculation on Nvidia's K40 and
Intel’s KNC coprocessors are presented in section (6) & (7). We present our final experimental
results and achievements in section (8).
2. Credit Risk & Incremental Risk Charge
Basel-III, a comprehensive set of reforms in banking prudential regulation, provides clear
guidelines on strengthening the capital requirements through:
 Re-definition of capital ratios and the capital tiers.
 Inclusion of additional parameters into the Credit and Market Risk framework like IRC, CVA (Credit Valuation Adjustment) etc.
 Stress testing, wrong way risk and liquidity risk.
The regulatory reforms, the ongoing change in the derivatives market landscape and the
changing behavior of clients are moving the risk function from a traditional back office role to a real
time function. This redefinition of capital adequacy and the requirements for efficient internal risk
management have increased the amount of model calculation required within the Credit
and Market Risk world, and thus there is a need for large scale computing. Incremental Risk Charge
is one such problem and is the focus of this paper.
IRC calculation is crucial for any financial institution in estimating credit risk. The IRC
calculation involves various attributes like Loss Given Default (LGD), credit rating, ultimate
issuer, product type etc. Standard as well as proprietary algorithms are used to
calculate IRC, and methods involving Monte Carlo simulations are extremely compute
intensive. In the next section we present one such algorithm and discuss its computational
bottlenecks.
3. Fast Fourier Transforms in IRC
IRC is a regulatory charge for default and migration risk for trading book position. One of the
approaches to IRC calculation, based on Monte Carlo simulations, is described in Figure 15 (IRC Calculation Flow).

The data involved in the default loss distribution is huge. In our case, the FFT computation for this data was
offloaded to a grid of 50 workstations. A typical IRC calculation for a single scenario involves the computation of
FFTs for 160,000 arrays, each consisting of 32,768 random numbers arising out of random credit movement paths.
This translates to approximately 37 GB of data to be processed for FFT computation. In all we have to process 133
such scenarios, which makes it a huge data and compute intensive problem. To summarize the overall complexity
of the problem:
 1 scenario of IRC calculation: 37 GB of data
 Total scenarios: 133
 Total data to be processed: 133 * 37 GB ≈ 4.9 TB
To simulate the above computations, we carried out the following procedure:
 For each IRC scenario:
o Create 160,000 arrays, each of 32,768 elements
o Fill each array with random numbers between 0 and 1
o Transfer the data in batches from the host (server) to the coprocessors (Nvidia's K40 GPU and Intel Xeon Phi or MIC) over the PCIe bus
o Compute the FFT and copy the results back
4. Experiment Environment
Following hardware setup and software libraries were used to carry out the above defined
procedure. Performance analysis was done using Nvidia’s “nvvp” visual profiler tool.
The GPU benchmarks reported in this paper were carried out on the following system. This was
enabled by Boston-Supermicro HPC labs, UK.
 Host: Intel Xeon E5 2670V2, 2 socket (10 core x 2), 64GB of RAM
 GPU: K40 x 4 (in x16 slot), 12GB RAM
 Freely available cuFFT library from Nvidia is used for FFT calculations [7], [8], [12].
Access to the Intel Xeon system was enabled by Intel India. All the experiments were carried out
on the following setup.
 Intel X5647 (Westmere), 4 core, 2.93 GHz, 24GB RAM
 Host: Intel Xeon E5 2670, 2 socket (8 core x2), 2.7 GHz processor, 64 GB of RAM
 Coprocessor: KNC 1.238 GHz speed, 16GB of RAM, 61 cores
 Intel MKL library is used for FFT calculations on host as well as coprocessor
In the following sections we discuss performance tuning of IRC calculations on various
platforms.
5. IRC Calculations on Intel Westmere
We implemented the procedure explained in the previous section on the Intel Westmere platform
with Intel's MKL math library.
The MKL is a collection of several libraries spanning linear algebra to FFT computations. The
MKL provides various APIs for creating plans and performing different FFTs. Following APIs
were used to perform 1D FFT in our exercise.
 DftiCreateDescriptor(); DftiSetValue(); DftiCommitDescriptor();
 DftiComputeForward(); DftiFreeDescriptor();
Since 4 cores were available for computation, a multi-process application was developed using
Message Passing Interface (MPI) [5], [6]. The overall computations were equally divided among
all the 4 cores. A code snippet of the main compute loop is given below in Figure 2.
/* Main compute loop: the 160,000 FFTs are divided equally across the MPI ranks.
   buffer (32768 elements) and desc (DFTI_DESCRIPTOR_HANDLE) are declared elsewhere;
   the descriptor settings shown are illustrative. As in the baseline, the descriptor
   is created and freed for every array (see Section 7 for the reuse optimization). */
int num_arrays = 160000, nprocs, myrank;
int mystart, myend, range, i;

MPI_Comm_size(MPI_COMM_WORLD, &nprocs);   /* number of MPI ranks (one per core) */
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
range   = num_arrays / nprocs;            /* arrays handled by this rank        */
mystart = myrank * range;
myend   = mystart + range;

for (i = mystart; i < myend; i++)
{
    load_data(buffer);                                         /* i-th input array */
    DftiCreateDescriptor(&desc, DFTI_DOUBLE, DFTI_REAL, 1, 32768);
    DftiSetValue(desc, DFTI_PLACEMENT, DFTI_INPLACE);
    DftiCommitDescriptor(desc);
    DftiComputeForward(desc, buffer);                          /* in-place 1D FFT  */
    DftiFreeDescriptor(&desc);
}

(Figure 2)
Since the FFT computation for each array is independent, no communication was required among the ranks over MPI.

It took 194 minutes to complete the 133 IRC scenarios. It would require ~40 Westmere servers to complete these
calculations under the 5 minute mark, which adds too much cost in terms of hardware and power requirements. We
hoped to achieve better performance with coprocessors and reduce the hardware and power requirements. We
consider Nvidia's K40 and Intel's KNC coprocessor in the following sections.
6. IRC on Nvidia K40 GPU
The K40 GPU is Nvidia's latest Tesla series coprocessor. It has 2880 lightweight GPU compute cores and is rated
at around 1 TF of peak performance. Such platforms are extremely suitable for data parallel workloads. The cuFFT
library was used for the FFT computations. In this section, we describe the performance optimization in a
step-by-step manner starting with a baseline implementation. Each step includes the measures taken in the earlier
steps.
o Baseline Implementation
Using the above mentioned procedure, a baseline implementation of the FFT calculations was carried out. This
involved creating the appropriate arrays, calling the cuFFT functions cufftPlan1d for creating a 1D plan and
cufftExecR2C for computing the transform, and finally copying the data back to the host using cudaMemcpy.
It took ~67 min to compute the 133 scenarios. We observed that the majority of the time (~61 min) was
spent in data transfer between the host and the device. This data transfer happens over the PCIe bus.
Profiling the application using nvvp revealed that the data transfer over the PCIe bus was happening at
only 2 – 3 GBps. Figure 3 (Data throughput with pageable memory) is a snapshot of the nvvp output.

Data is always transferred between pinned memory on the host and device memory. Since normal
allocation (using malloc()) gives page-able memory, there is an extra step, which happens internally, of
allocating pinned memory and copying data between the pinned memory and the page-able memory.
o Performance Optimization
The major performance issue observed was the data transfer speed. We carried out a couple of
optimizations to resolve this issue, which we discuss below.
 Usage of Pinned Memory: The data for 1 IRC scenario is approximately 37 GB. The data
transfer rate achieved was poor since page-able memory was used. CUDA provides
separate APIs to allocate pinned memory (cudaHostAlloc and cudaMallocHost). With
pinned memory usage, we achieved a throughput of 5 – 6 GBps (Figure 4: Data transfers
with pinned memory), a speed-up of ~2.5x for the transfers. The data transfer time for the
133 IRC scenarios reduced to around 25 min and the overall time was ~31 min.
 Multi Stream Computation: In the current scheme, the data transfers and computations were happening
sequentially in a single stream as shown in Figure 5 (Computation in 1 stream). By enabling multi stream
computation, we could achieve a two way overlap.
o Computations with data transfer: GPUs have different engines for computations (i.e. launching kernels)
and data transfer (i.e. cudaMemcpy). The computations were arranged in such a way that the computations
for one set and the data transfer for the next set happened simultaneously, as shown in Figure 6
(Computation in 4 streams with overlaps).
o Data transfer overlap: GPUs are capable of transferring data from host to device (H2D) and from device
to host (D2H) simultaneously. With 4 streams, we could achieve complete overlap between the H2D and
D2H transfers.
With these overlaps, a further speed-up of approximately 2.67x was achieved. The time for the 133 IRC scenarios
was reduced to ~11 min.
A single server can host multiple coprocessor cards, so within a single box we could still enhance the
performance by using multiple GPUs. This however is limited by the data transfer bandwidth.
Our experimental setup had 2 GPUs in x16 PCIe slots. The above optimized implementation was
extended to use two GPUs. The final execution time obtained was 5.6 min. The plot below (Figure 7)
highlights the step-by-step performance improvement.
Figure 7: Step-by-step performance improvement on K40 for the IRC problem (execution time in minutes: Baseline 67, Pinned Memory 31, Pinned Memory + Multi Streams 11, 2 GPUs 5.6)
A marginal dip in per-GPU performance is observed, which is attributed to the sharing of bandwidth for
data transfer between the host and multiple devices. The overall scale-up achieved was close to 2x
with 2 devices.
7. IRC on Intel KNC
Like NVidia’s K40 GPU, we also carried out the above exercise on Intel KNC coprocessor. The
KNC was Intel’s first coprocessor with 61 cores and it also supports 512 bit registers for vector
processing. These two feature together provides tremendous computing possibilities similar to
NVidia GPUs. Intel also offers a highly optimized math library (MKL). The MKL is a collection
of several libraries spanning linear algebra to FFT computations. However unlike cuFFT, MKL
is not freely distributed. The MKL provides various APIs for creating plans and performing
different FFTs. Following APIs were used to perform 1D FFT in our exercise.
 DftiCreateDescriptor(); DftiSetValue(); DftiCommitDescriptor();
 DftiComputeForward(); DftiFreeDescriptor();
Unlike GPUs, which work only in offload mode, the KNC coprocessor can be used for
computation in native mode, symmetric mode and offload mode. In offload mode, the main
application runs on the host and only the compute intensive sections of the application are offloaded to
the coprocessor. In native mode, the full application runs on the coprocessor, and in symmetric mode
both the host and the coprocessor run parts of the application. In this exercise all the readings reported
on KNC were taken in native mode; only the final reading of the optimized code was taken in
symmetric mode.
The biggest advantage of using the KNC coprocessor in native mode is that no code level changes
were required. The implementation done for the Westmere platform was simply recompiled for the KNC
platform. The overall computations were equally divided among the 60 cores.
Each rank or core computed the FFT for the arrays in its range. This baseline code took 120 min for the 133
IRC scenarios. Though the compute time was reduced compared to the Westmere platform (from
194 min to 120 min), the gain was not as large as expected. We discuss below the changes made to enhance
the performance.
o Performance Optimizations
Since we were operating in native mode, no data transfer between the host and the coprocessor was
involved. The MKL library used for the FFT computation is a highly efficient one. To identify
performance issues, we referred to Intel's best practice guides for KNC and MKL [9], [10].
We exploited some of these techniques, which resulted in improved performance. We present
them below:
 Thread binding: Multi-core coprocessors achieve the best performance when threads
do not migrate from core to core during execution. This can be achieved by setting an
affinity mask to bind the threads to the coprocessor cores. We observed around 5 – 7 %
improvement by setting the proper affinity. The affinity can be set using the KMP_AFFINITY
environment variable with the command:
export KMP_AFFINITY=scatter,granularity=fine
 Memory Alignment for input/output data: To improve the performance of data access,
Intel recommends that the memory addresses for input and output data be aligned to 64 bytes.
This can be done by using the MKL function mkl_malloc() to allocate the input and output
memory buffers. This provided a further boost of 7 – 9 % in the performance.
 Re-using DFTI structures: Intel recommends reuse of the MKL descriptors if the
FFT configuration remains constant. This reduces the overhead of initializing the various DFTI
structures. The MKL functions DftiCreateDescriptor and DftiCommitDescriptor allocate the
necessary internal memory and perform the initialization to facilitate the FFT computation. This
may also involve exploring different factorizations of the input length and searching for the most
efficient computation method. For the problem under consideration the array sizes, type of data
and type of FFT remain unchanged for the full application. Hence these descriptors can be
initialized only once and then reused for all the data. Initializing these descriptors only
once, outside the main compute loop, gave the desired ~3.6x performance gain.
With all the above changes in place, we observed a significant improvement in the performance of the
IRC calculations. The time for the 133 IRC scenarios was reduced to approximately 32 min from 120
min.
Similar to GPUs, a single server can host multiple KNC coprocessors. Since such a setup was not
available, we expect that it would take around 16 – 17 min for the IRC calculations on 2 KNCs.
8. Final Results
In the earlier sections we discussed the performance optimization of IRC calculations on Nvidia's
K40 and Intel's KNC coprocessor. Both platforms are capable compute resources with
their own pros and cons. In this section we summarize the overall achievements and other benefits
enabled by this optimization exercise.
o Execution Time with Hybrid Computing
Several fold performance improvements were achieved on both coprocessors, with all the workload
taken by the coprocessors. However, the host machine could also be utilized to share part of the
workload. In the case of the Intel KNC, only a recompilation of the code was required to facilitate this;
for the K40, we had to rework the code by combining MPI and CUDA C.
 A further 35 – 40 % speed-up was achieved on both KNC and K40 by enabling the
workload sharing.
 Out of the 160,000 arrays per IRC scenario, 60,000 were processed on each of the K40s and
40,000 on the host. In the case of the KNC the split was 30,000 on each KNC and 100,000
arrays on the host.
The figure (8) summarizes the best results achieved with hybrid computing on coprocessors
along with other Intel platforms.
Figure 8: IRC performance comparison across all platforms. Execution time in minutes for the original 50 workstations, 2 Westmere, 2 Sandy Bridge, 2 Ivy Bridge, 2 K40 Hybrid and 2 KNC Hybrid: 45, 97, 8.95, 8.15, 3.38 and 8.645 min respectively.
Clearly the K40 GPU performs better than the KNC coprocessor. However, KNC offers ease of
programming: any x86 application only requires a recompilation to work on KNC. On the
other hand, porting an application to the K40 requires a lot of programming effort in terms of CUDA C.
o Energy Consumption
The energy required to carry out the computations directly affects the cost of the computation. In
our experiment, the K40 performed best. Taking this as the benchmark, we rationalize the
hardware requirement of the other platforms to achieve the same performance and in turn calculate
the energy required to carry out the computation. The energy consumption is computed
considering the rated wattage of each Intel and Nvidia platform.
Figure 9 shows that a drastic reduction in compute time and cost of computation is achieved
by optimizing the IRC calculations on both platforms. But the gains are not limited to these
factors. This exercise also enabled a huge reduction in hardware footprint and data center floor space,
and resulted in a more compact, easier to maintain system.
Figure 9: Energy requirement for best performance for all platforms. The chart compares the energy required (in kWh) for the original 50 workstations, Westmere, Sandy Bridge, Ivy Bridge, KNCs with Hybrid and K40s with Hybrid; the reported values range from 5.79 kWh down to 0.032 kWh.
9. Conclusion
This paper highlights the importance of optimizing an application for a given platform. The
baseline results suggest that simply using new hardware with libraries will almost always
result in suboptimal performance. Modern day many core GPUs and coprocessors have
tremendous computing capabilities; however, new and legacy applications can achieve huge
gains only through optimization driven by detailed analysis and measurement with proper profiling
tools. In this paper we highlighted this fact with the example of IRC calculations. Though
the chosen application was from financial risk computation, compute intensive applications from
various domains can benefit from performance optimization with many core parallel computing.
The highlights of the work are as follows:
 With the optimizations, we achieved approximately 13.5x and 5.2x speed-ups on K40 and
KNC respectively for the IRC calculations, and a ~150x reduction in energy consumption.
 Hybrid computing utilizes both the host and the coprocessor and in turn gives the best
performance.
 These high performance setups (coprocessor hardware + optimized applications) would allow
banks or financial institutions to simulate many more risk scenarios in real time and enable
better investment decisions.
We conclude this paper by noting that, with proper performance optimization, many/multi
core parallel computing with coprocessors enables multi-dimensional gains in terms of reduced
compute time, cost of computation and hardware footprint.
Acknowledgement
The authors would like to thank Jack Watts from Boston Limited (www.boston.co.uk) and
Vineet Tyagi from Supermicro (www.supermicro.com) for enabling access to their HPC labs for
K40 benchmarks. We are also thankful to Mukesh Gangadhar from Intel India for enabling
access to Intel KNC coprocessor.
References
[1] T. Wood, "Applications of GPUs in Computational Finance", M.Sc. Thesis, Faculty of Science, Universiteit van Amsterdam, 2010.
[2] P. Jorion, "The New Benchmark for Managing Financial Risk", 3rd ed., McGraw-Hill, New York, 2007.
[3] J. C. Hull, "Risk Management and Financial Institutions", Prentice Hall, Upper Saddle River, NJ, 2006.
[4] P. Glasserman, "Monte Carlo Methods in Financial Engineering", Applications of Mathematics 53, Springer, 2003.
[5] Web tutorial on Message Passing Interface, www.computing.llnl.gov/tutorials/mpi
[6] Peter Pacheco, "Parallel Programming with MPI", www.cs.usfca.edu/peter/ppmpi/
[7] Kenneth Moreland and Edward Angel, "The FFT on GPU", http://www.sandia.gov/~kmorel/documents/fftgpu/fftgpu.pdf
[8] Xiang Cui, Yifeng Chen and Hong Mei, "Improving Performance of Matrix Multiplication and FFT on GPU", 15th International Conference on Parallel and Distributed Systems, 2009.
[9] Intel guide for Xeon Phi: https://software.intel.com/sites/default/files/article/335818/intel-xeon-phi-coprocessor-quick-start-developers-guide.pdf
[10] Tuning FFT on Xeon Phi: https://software.intel.com/en-us/articles/tuning-the-intel-mkl-dft-functions-performance-on-intel-xeon-phi-coprocessors
[11] Nvidia K40 GPU: http://www.nvidia.com/object/tesla-servers.html
[12] Nvidia cuFFT library: https://developer.nvidia.com/cuFFT
Performance Benchmarking of Open Source Messaging Products
Yogesh Bhate, Performance Engg. Group, Persistent Systems, [email protected]
Abhay Pendse, Performance Engg. Group, Persistent Systems, [email protected]
Deepti Nagarkar, Performance Engg. Group, Persistent Systems, [email protected]

Abstract –
This paper shares the experiences and findings collected during a 6-month
performance benchmarking activity carried out on multiple open source messaging products. The primary
aim of the activity was to identify the best performing messaging product from around 5 shortlisted
messaging products available in the open source community. There were specific requirements provided
against which the benchmarking activity was carried out. This paper covers the objective, the plan and
the execution methodology followed, and also shares the detailed numbers that were captured during the
tests.
1. Introduction
A large scale telescope system is being built by a consortium of 4-5 countries. The telescope system
consists of the actual manufacturing, installation and operation of a 30 meter telescope and its related
software subsystems. All software subsystems that control the telescope or use the output provided by
the telescope need to communicate with each other through a backbone set of services providing multiple
common functionalities like logging, security, monitoring and messaging. Messaging, or the Event Service as
it is called, is one of the primary services in the backbone infrastructure of the telescope software
system. Software subcomponents talk to one another using a set of events, and those events need to
be propagated to the correct target in real time.
The event service backbone had stringent performance requirements, which are listed in subsequent
sections. The event service was planned to be a thin API layer over a well-known open source messaging
product. This allowed the software planners to keep open the option of changing the middleware during the
lifecycle of the event service. The software lifecycle was required to be a minimum of 30 years
from the date of commissioning of the telescope systems. Benchmarking open source messaging
platforms for use in the Event Service development was the primary goal of this project.
2. Benchmarking Details
2.1. Functional Requirements for Benchmarking
The customer provided some very specific requirements which were to be considered during the
benchmarking activity. Below is a summary of those requirements:
 No Persistence: The messages or events sent via the event service are not expected to
be persisted, nor are they expected to be durable.
 Unreliable Delivery: Message delivery may not be reliable. This means that it is acceptable if
the messaging system has some message loss.
 No Batching: No batching should be used to send messages or events. As soon as an
event is generated it has to be sent on the wire to the listeners/subscribers.
 Distributed Setup: The products should work in a distributed fashion, i.e. the publisher,
subscriber and broker should all be on different machines.
 Java API: A Java API should be designed and developed for the benchmarking tests.
2.2. Benchmarking Plan
To ensure that all stakeholders understood the exact process and expectations of the project, a
benchmarking plan was created before the work was started. The purpose of the benchmarking
plan was to explain the process of benchmarking in detail. Some important areas that the
benchmarking plan covered were:
 The environment that was planned to be used for testing
 The methodology that was to be used
 The software tools and libraries that would be used
 The workload models that would be simulated
This benchmarking plan was circulated to and reviewed by everyone on the customer technical
team and was used as the basis for all the activities of this benchmarking project. During
multiple rounds of reviews the benchmarking plan went through numerous changes to ensure that
we only looked at what the customer needed.
This paper does not go into the details of the benchmarking plan, but Table 1 summarizes the
workload models which were mutually agreed upon and considered important.
Table 1
2.3. Environment Setup
The benchmarking was carried out on physical high end servers. The configuration of the servers
and other details were part of the benchmarking plan:
Hardware
 Three physical servers
 Each server with 2 Intel Xeon processor chips. Each chip with 6 cores.
 32GB of RAM on each server.
 1G and 10G connectivity between these servers connected via a NetGear switch.
 Each server with one 1G NIC and one 10G NIC.
Software
 64 Bit Java 1.6
 64 Bit Cent OS
 MySQL for storing counters
The following two topologies (ref. Figure 1) were used for the tests.
Figure 1
2.4. Benchmarking Suite
A custom benchmarking suite was used for this benchmarking activity, which allowed us to
execute multiple iterations of tests with different workload configurations, capture counters and
generate appropriate charts for the tests. The following diagram (ref. Figure 2) gives a quick design
view of the benchmarking suite.
Figure 2
Other tools
Apart from the custom benchmarking suite, some open source utilities were also used. Below is a
summary of these utilities:
 Standard Linux utilities – pidstat, vmstat, top etc. to capture CPU, memory and disk activity
 nicstat – a 3rd party utility to monitor network usage on the NIC card
 jstat – a standard JDK utility to capture Java heap usage
 JFreeChart – used to plot graphs from the data collected by the test, as part of the reporting module in the benchmark suite
 MySQL – used to store the captured metrics; the reporting component generates reports based on the data stored in the MySQL db
 Ant – for building the source code
2.5. Tests
Since there were multiple tests that needed to be done, we categorized the tests into high level
types to clearly understand the purpose of each test. The following categories were defined and
every test was marked under one of these categories:
Throughput Tests: Tests executed under this category captured the throughput of the messaging platform. Tests under this category were executed in different combinations to observe how the throughput changes.
Latency Tests: Tests executed under this category will capture the latency of the messaging platform. These tests will determine how latency gets affected by different parameters and load. The tests will also determine the variance in latency (jitter).
Message Loss Tests: Tests executed under this category will capture the message loss, if any, for a messaging platform. During execution these tests could be combined with the throughput tests.
Reliability Tests: Tests executed under this category will discover if the messaging platform degrades when it is up and running for a longer duration. Such tests will make the messaging platform send and receive messages for a longer duration of time (e.g. overnight) and identify if there is any adverse impact on latency, throughput or the overall functioning of the platform.
Table 2
2.6. Products to be benchmarked
This project was preceded by an earlier phase in which almost all available open source
messaging products were subjected to multiple levels of filtering criteria. This phase, called the
Trade Study phase, selected 5 messaging products considered suitable for the customer's
requirements, and in the present phase these 5 products were benchmarked. The products are
RTI DDS, OpenSplice DDS, HornetQ, RedHat MRG and Redis (listed in Table 3).
Table 3
2.7. Reporting
It was decided that the following important quantitative parameters would be reported after
the benchmarking tests. For all 5 products each of these parameters would be compared and
the product which has the best values for the majority of the parameters would be chosen.
 Publisher Throughput – Max Number of messages sent per second
 Subscriber Throughput – Max Number of messages received per second
 Latency – The time taken for the message from point A to point B.
 Jitter – Variation in Latency
 Message Loss – Loss of messages
Important Note: All the tests were to be done on 1G and 10G network. It was decided that for
comparison purposes the numbers observed for a 10G network would be used since the
production network bandwidth was planned to be 10G.
3. Observations
3.1. Aggregate Publisher Throughput
This parameter gives the maximum number of messages that can be published by the
publishers per second, both in isolation and as an aggregate group.
These throughput numbers were captured on the 10G network with only a single subscriber
listening on a topic. In the majority of cases the system was scaled to use multiple publishers
and multiple topics.
Figure 3
3.2. Isolated Publisher Throughput
The chart below shows the throughput possible when a single publisher publishes
messages as fast as possible, as a function of message size, without system failure. HornetQ
was able to publish 111,566 600-byte messages per second.
It is expected that the throughput in msgs/sec will decrease with message size. In a perfect
system the decrease would be linear. As shown, this is mostly true but begins to break down for
larger message sizes.
Figure 4
During the throughput tests we have observed that HornetQ showed the best possible
throughput and was able to utilize the whole bandwidth of the network. All the other products
hit a plateau on the publisher processing side and could not use the network to the full extent.
3.3. Subscriber Throughput
This provides a view of the number of messages the subscribers were able to consume per
second as a group.
Figure 5
The above charts show the aggregate subscriber throughput. In this case one subscriber
listens on one topic and we increase the number of subscribers and topics. This effectively
shows the scalability of the platform from the consumer angle. HornetQ subscriber throughput
is more than two times that of the other products. Comparing the publisher and subscriber
throughput graphs, we should have seen almost the same number of messages consumed by
the subscribers, but due to latency and other factors the subscribers always lag by some
amount. As we can see, the lag is minimal in the case of HornetQ.
3.4. Impact of multiple subscribers on throughput
Some tests were carried out to judge the impact of multiple subscribers listening on the same
topic. In the customer defined scenarios they did not expect their system to have anything
more than 5-10 subscribers listening on an individual topic. Hence these tests were carried
out for a limited number of subscribers
Figure 6
The throughput drops whenever more subscribers join in to listen on a topic. The primary
reason for this drop has to do with the acknowledgements that the platform has to
manage for every message loop. In this case too, HornetQ shows the best possible
results for multiple subscriber scenarios.
3.5. Publisher Throughput v/s Subscriber Throughput Ratio
Our observations of the publisher and subscriber throughput for both 1G and 10G show how
well the platform allows the subscriber to “keep up” with the publisher. The chart below
shows the ratio of this comparison.
Figure 7
The best products will show a flat curve, and the closer the ratio is to 1 the better the
product. Again, HornetQ is clearly the best product, but surprisingly Redis is the second
best product, with 80% of its messages arriving within the measurement period. The
worst product is Redhat MRG, with only 40%-60% arriving within the measurement
window.
3.6. Scalability Range
A significant number of tests were designed to find the upper limit of the platform. This
gave good insight into the way the platform was designed and developed. However, the
customer was also interested in one more non-traditional parameter, which was termed the
Scalability Range. This test was designed so that each publisher publishes messages at
a predefined rate (throttled publishers) of 1000 Hz, i.e. 1000 messages/sec, and with this
configuration we had to determine the maximum number of publishers the platform can
support. In the customer's production scenario the telescope instruments had an upper transmit
rate of 1 kHz, but the number of instruments was not fixed, so this test was deemed
important.
Figure 8
HornetQ and RedHat MRG showed the best scalability in this test and we were able to
stretch the system to almost 350 publishers each publishing at 1000Hz without a
message loss (all messages were received by the subscriber)
3.7. Latency and Jitter
The time a message takes to travel from publisher to subscriber is an important
measure of product performance. During the latency tests a clock synchronization
problem was encountered, and our attempts to use NTP or PTP daemons did not yield the
results we expected. Hence we used the approach of calculating the Round Trip Time (RTT)
and halving it to arrive at the one-way latency.
This method does introduce some uncertainty into the measurements, but it was
considered the best approach at the time, since the customer was mainly interested in
ensuring that the latency numbers hover around the microsecond range and not the millisecond range.
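To make the RTT-halving approach concrete, here is a minimal, illustrative Java sketch (not the benchmark suite's actual code); the echo step is simulated and the timer choice is our assumption:

public class RttLatencyEstimate {

    // One-way latency estimate = RTT / 2, in nanoseconds, assuming a symmetric path.
    static long oneWayLatencyNanos(long sendNanos, long echoReceivedNanos) {
        return (echoReceivedNanos - sendNanos) / 2;
    }

    public static void main(String[] args) throws InterruptedException {
        long sendNanos = System.nanoTime();   // timestamp taken just before publishing
        Thread.sleep(1);                      // stands in for publish + echo from the subscriber
        long recvNanos = System.nanoTime();   // timestamp taken when the echoed message returns
        System.out.printf("estimated one-way latency: %d us%n",
                oneWayLatencyNanos(sendNanos, recvNanos) / 1000);
    }
}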
We reported Latency as average numbers, percentile numbers and standard deviation
numbers. However it was considered best to compare the latency in Percentile terms than
the average values since averages can get skewed with outliers.
Figure 9
The average latency numbers show that OpenSplice DDS and Redis have the lowest
average latency, while HornetQ and Red Hat MRG have the highest.
The percentile charts show a different view of the numbers.
We have used percentile charts in the previous reports and we believe they give a very
good perspective of how the latency varies across messages. Jitter can be accurately
visualized by looking at the percentile charts or values. Average latencies can be
misinterpreted because even a few high values can skew the whole data set. A percentile
shows what percentage of the total messages fall below a particular latency.
So percentiles show the distribution of latencies across the whole message group. In this
report we have picked the 50th, 80th and 90th percentile values for all products. HornetQ has
the lowest latency at the 50th, 80th and 90th percentiles, and this is the primary reason its
subscriber throughput is almost equal to its publisher throughput. Other products have high
latencies at these percentiles even when their average latencies are lower than HornetQ's, and
hence they have lower subscriber throughput.
In a perfect world both the average and the percentile values would be at a minimum, and
such a product could truly be classified as the best from a latency standpoint. In the real world
we rarely see such cases; each platform makes some trade-off between throughput, processing
speed and latency.
HornetQ provides good percentile numbers and, considering its other parameters as well, it
remains the top product in the benchmarking tests.
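As a small illustration of why percentiles were preferred, the following Java sketch (hypothetical latency values; a nearest-rank percentile definition is assumed) shows how a single outlier inflates the average while leaving the 50th/80th/90th percentiles largely unchanged:

import java.util.Arrays;

public class LatencyReport {

    // p in (0,100]; nearest-rank percentile over a copy of the samples.
    static double percentile(double[] samplesMicros, double p) {
        double[] sorted = samplesMicros.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        // made-up one-way latencies in microseconds; one outlier skews the average
        double[] micros = {40, 42, 41, 43, 39, 44, 40, 41, 42, 5000};
        double avg = Arrays.stream(micros).average().orElse(0);
        System.out.printf("avg=%.1f us, p50=%.1f us, p80=%.1f us, p90=%.1f us%n",
                avg, percentile(micros, 50), percentile(micros, 80), percentile(micros, 90));
    }
}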
3.8. Resource Utilization
During all the benchmarking tests we constantly captured how the platform utilized the
system resources. We thought it important to report the system utilization at peak throughput
to give a glimpse of how well the system resources are utilized.
Figure 10
Redis is a single-threaded server and hence never utilizes more than one CPU on the
system. HornetQ and RedHat MRG heavily use the CPUs on the server, and that is how
they are able to scale to very high throughput numbers.
Figure 11
4. Conclusion
Around 10 to 12 man months were spent on this benchmarking activity, and this paper attempts
to provide a glimpse of what it looked like after all the work was done and the data was compared. The
customer was provided with very detailed charts and reports after every product was benchmarked,
which helped us tune the overall process by getting early feedback from the customer. We ran
thousands of iterations to ensure that we had utilized the features as per the documentation before
finally capturing the data. Not all of the effort and work done can be documented in this short paper.
 HornetQ came out as the best performing product based on the customer requirements.
 Redis showed real promise from a performance standpoint.
 The RTI and OpenSplice test results were discouraging. We acknowledged to the customer that RTI and OpenSplice are functionality-rich products with thousands of tuning possibilities, which could not be attempted in the time frame provided to us; we therefore used the most commonly documented settings for testing.
 RedHat MRG was the next best product after HornetQ in throughput terms.
While recommending HornetQ to be used for the event service implementation we also provided a
verbose tabular comparison (see below) of each product to the customer.
Table 4
References
[RTI DDS] http://www.rti.com/products/dds/
[HornetQ] http://hornetq.jboss.org/
[Redhat MRG] https://www.redhat.com/promo/mrg
[Open Splice DDS] http://www.prismtech.com/opensplice
[Redis] http://www.redis.io
AUTOMATICALLY DETERMINING LOAD TEST DURATION USING
CONFIDENCE INTERVALS
Rajesh Mansharamani, Freelance Consultant ([email protected])
Subhasri Duttagupta, Innovation Labs Perf. Engg., Tata Consultancy Services ([email protected])
Anuja Nehete, Performance Engg. Group, Persistent Systems ([email protected])
Load testing has become the de facto standard to evaluate
performance of applications in the IT industry, thanks to the growing
popularity of automated load testing tools. These tools report
performance metrics such as average response time and throughput,
which are sensitive to the test duration specified by the tester. Too
short a duration can lead to inaccurate estimates of performance and
too long a duration leads to reduced number of cycles of load testing.
Currently, no scientific methodology is followed by load testers to
specify run duration. In this paper, we present a simple methodology,
using confidence intervals, such that a load test can automatically
determine when to converge. We demonstrate the methodology using
five lab applications and three real world applications.
1. Introduction
Performance testing (PT) has grown in popularity in the IT industry thanks to a number of commercial and free
load testing tools available in the market. These tools let the load tester script application transactions to
create virtual users, which mimic the behaviour of real users. At load test execution time, the tester can
specify the number of virtual users, the think time (time spent at terminal), and the test duration.
Test duration is specified in these tools either as an absolute time interval or in terms of number of user
iterations that need to be tested. In the absence of statistical knowledge, the common practice in the IT
industry is to specify an ad hoc duration which may range from a few seconds to a few hours. The ad hoc
duration is usually arrived at in consultation with one's test manager or blindly adopted from 'best practices'
followed by the PT team.
Regardless of the duration specified, at the end of the load test the tester gets numeric estimates of
performance metrics such as average response time and throughput. The numeric value is accepted as true
because it has come from a well-known tool. Unfortunately, the sensitivity of the test output to the test duration is
not considered in a regular load test. By a regular load test we mean a test that is used to determine the application
response time under a given load, and not the stability of the application under load for a long duration (such
as to test for memory leaks).
If the test duration is too small the estimate of performance may be erroneous. If the test duration is too long it
will lead to fewer cycles for load testing. This paper proposes a simple methodology based on confidence
intervals for automatically determining load test duration while a load test is in progress.
Confidence intervals are widely used to determine convergence of discrete event simulations [PAWL1990]
[ROBI2007] [EICK2007]. Using confidence intervals one can specify with what probability (say 99%) the
estimate of average response time lies in an interval around the true average. The wider the interval, the less
confidence there is in the estimate. As the run duration increases one expects the interval to become tighter and
converge to a specified limit (for example, 10% of the true mean). We have not come across any study that
specifies how to use confidence intervals to determine load test duration. There is a mention that confidence
intervals should be used in load tests in [MANS2010] but no methodology has been given there.
The rest of this paper is organised as follows. Section 2 provides the state of the art in specifying load test
durations. Section 3 provides an introduction to confidence intervals for the reader who is not well versed with
this topic. Section 4 provides a simple methodology to determine run duration and its application to laboratory
(lab) and real world applications. We show in Section 5 that this methodology also works for page level
response time averages as opposed to overall average response time for an application. Section 6 extends
this methodology to deal with outliers in response time data. Finally, Section 7 provides a summary of the
work and ideas for future work.
2. State of the Art in Determining Load Test Duration
We have seen four types of methodologies to determine load test duration in the IT industry, some of which
are given in [MANS2010] and some discussion of the warm up phase (see Section 2.2) is provided in
[WARMUP]. These are not formal methodologies in the published literature, but over the years majority of IT
performance testing teams have adopted them. We now elaborate on each methodology.
2.1 Ad hoc
The most popular methodology is to simply use test duration without questioning why. More often than not the
current load testing team simply uses what was adopted by the previous load testing team and that becomes
a standard within an IT organisation. We have commonly seen test durations ranging from 30 seconds to 20
minutes.
2.2 Ad hoc duration for steady state
In this methodology the transient state data is manually discarded. The initial part of any load test will have the
system (under test) in a transient state, due to several reasons such as ramp up of user load, and warm up of
application and database server caches. Figure 1 shows the average response time and throughput as a
function of time, of a lab application that was load tested with 300 virtual users. As can be seen in the figure if
the test duration happens to be in the transient state, then the estimates of average response time and
throughput will be highly inaccurate compared to the converged estimates seen at the later part of the graphs.
Figure 1: Average Response Time & Throughput vs. Test Duration
Experienced load testers usually run a pilot test for a long duration and then visually examine the output to
determine the duration of transient state. They then discard transient state data and use only the steady state
data to compute performance metrics. The duration of time used in the steady state is ad hoc, and often
ranges from 5 minutes to 20 minutes. While this methodology results in more accurate results than the one in
Section 2.1, it is not clear how long to run a test in steady state. Moreover, the transient state duration will
vary with change in application and in workload, requiring a pilot to be run for every change.
2.3 Ad hoc Transient Duration, Ad hoc Steady State Duration
As discussed above, it is laborious to run a pilot and visually determine start of steady state for every type of
load test. As a result, some performance testing teams adopt an ad hoc approach to transient and steady
state durations. We have seen instances wherein the PT team simply assumes that the first 20 minutes of run
duration should be discarded and the next 20 minutes data should be retained for analysis.
2.4 Long Duration
The last approach that we have seen in several organisations is to keep the regular load test duration in hours
(to obtain an accurate estimate for performance and not to test for availability, which is a separate type of
test). This way the effects of transient state will not have any major contribution to overall results, since it is
assumed that transient state lasts for a few minutes. We have seen several instances of 2 to 3 hour test
duration in multiple organisations. While there is no doubt on the accuracy of the output, this approach
severely limits the number of performance test cycles.
3. Quick Introduction to Confidence Intervals
We have added this section to give a quick introduction to confidence intervals to the non-statistical load
tester. To understand confidence intervals it helps to first understand the Central Limit Theorem.
The Central Limit Theorem [WALP2002] in statistics states that:
Given N independent random variables X1, ..., XN, each with mean μ and standard deviation σ, the
average of these variables X = (X1 + X2 + ... + XN)/N approaches a normal distribution with mean
μ and standard deviation σ/sqrt(N).
Successive response time samples may not necessarily be independent and hence it is common to see the
method of batch means widely employed in discrete event simulation [FISH2001]. Instead of using successive
response time samples, we use batches of samples and take the average value per batch as the random
variable of interest.
Thus, if we consider response time batch averages in steady state then we can assume that the average
response time (across batch samples) will converge to a Normal distribution. For a Normal distribution with
mean μ and standard deviation σ it is well known that 99% of the values lie in the interval (μ ± 2.576σ)
[WALP2002]. Therefore, if we have n batch average samples in steady state during a load test then we can
say with 99% confidence that our estimate of average response time over n samples is within 2.576*σ/sqrt(n) of
the true mean, where σ is the standard deviation of response time. As the number of samples n increases, the
interval gets tighter and we can specify a convergence criterion, as will be shown in Section 4.
An important point to note is that we do not know the true mean  and standard deviation  of response times
to start with, and hence we need to use the estimated mean and standard deviation computed from n samples
of response time. To account for this correction, statisticians assume a student t-distribution [WALP2002].
This will clearly be a function of the number of samples n (more specifically degrees of freedom n-1) and the
level of confidence required (say 99% or 95%). Tables are widely available for this purpose, such as the one
provided in [TDIST]. For a large number of samples (say n=200), the confidence intervals estimated out of a
student t-distribution converge to that of a normal distribution [WALP2002].
4. Proposed Methodology for Automatically Determining Load Test Duration
4.1 Proposed Algorithm
We propose a simple methodology where we analyse response time samples in steady state until we are
confident that the average response time converges. Upon convergence we stop the load test and output all
the metrics required from the test. While there is no technical definition of when exactly steady state starts, we
know that initially throughput will vary a lot and then gradually converge (see Figure 1). Let Xk denote the
throughput at k minutes since the start of the test (equal to the total number of samples divided by k minutes). We
assume that steady state has started after k minutes if Xk is within 10% of Xk-1, where k > 1.
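A one-line check is enough to implement this rule; the sketch below (illustrative only, the method and parameter names are ours) compares the throughput after k minutes with that after k-1 minutes:

// Steady-state test: Xk within +/-10% of Xk-1, for k > 1.
static boolean steadyStateReached(double throughputAtK, double throughputAtKMinus1) {
    return throughputAtK <= 1.1 * throughputAtKMinus1
        && throughputAtK >= 0.9 * throughputAtKMinus1;
}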
Once we are in steady state we start collecting samples until we reach our desired level of confidence. We
propose using a 99% confidence interval that is within 15% of the estimated average response time. (There is
nothing sacrosanct about 15%; it is just that we empirically found the convergence to be reasonably good with
this interval size.) In other words, if after n batch samples in steady state the estimated average response time is
An, and the estimated standard deviation across batch samples is Sn, then we assume that the average response
time estimate converges if the following relationship holds true:

An + t99,n-1 * Sn/sqrt(n) ≤ 1.15 * An

where t99,n-1 is the critical value of the t-distribution for α = 0.01 (two tailed) and n-1 degrees of freedom. For
example, for n=50, t99,n-1 = 2.68.
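As a hypothetical numeric illustration: if after n = 50 batches the estimated average is An = 100 ms and the estimated standard deviation across batches is Sn = 30 ms, the half-width of the 99% confidence interval is 2.68 * 30 / sqrt(50) ≈ 11.4 ms, i.e. about 11.4% of An, so the test would be declared converged; had Sn been 45 ms, the half-width would be about 17.1 ms (17.1% of An) and the test would continue.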
Suppose we have an application in which the average response time takes a very long time to converge; then
we need to specify a maximum test duration to account for this case. We also need to specify a minimum
duration in steady state to account for (minor) perturbations due to daemon processes running in the
background, activities such as garbage collection, or known events that may occur at fixed intervals (such as
period-specific operations/queries).
Taking the above into account, we propose the following Algorithm 1 for automatically determining load test
duration while a load test is in execution.
Table 1: Algorithm 1 to Determine Load Test Duration
1. Start test for Maximum Duration
2. From the first sample onwards, compute performance metrics of interest as well as
throughput (number of jobs completed/total time).
Let Xk denote throughput after k minutes of the run, where k > 1.
If (Xk ≤ 1.1 Xk-1) and (Xk ≥ 0.9 Xk-1) then
Steady state is reached. Reset computation of all performance metrics.
Else if Maximum Duration of test is reached output all performance metrics computed
3. From steady state, restart all computations of performance metrics. Assume a batch
size of 100 and compute average response time per batch as one sample. Compute
the running average and standard deviation across batches as follows:
Set n = 0, Rbsum = 0, and Rbsumsq = 0 at start of steady state
For completion of every 100 samples (batch size) after steady state do
Let Rb = average response time of batch
n=n+1
Rbsum = Rbsum + Rb
Rbsumsq = Rbsumsq + Rb*Rb
AvgRb = Rbsum / n
StdRb = sqrt(Rbsumsq/n - AvgRb * AvgRb)
If (t99,n-1 * StdRb/sqrt(n) ≤ 0.15 AvgRb) and (MinDuration is over in steady state)
then stop test and output performance metrics
Else If Max Duration is reached then output performance metrics
Endif
End for
We have assumed a batch size of 100. This was chosen empirically after asserting that the autocorrelation of
batch means [AUTOC] was less than 0.1 for the first few values of lag. Typically correlation drops with
increase in lag.
Note that we compute the running variance by taking the difference between the average of the squared batch
response times and the square of the average batch response time. This is very efficient for a running
computation, as opposed to the traditional method, which requires an O(n) pass every time the variance is needed.
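A minimal Java sketch of this running computation (our own illustration, not the tool integration itself; the fixed large-sample value 2.576 is substituted for the t99,n-1 table lookup used in Algorithm 1) is shown below:

public class ConvergenceChecker {
    private int n = 0;          // batches seen since steady state
    private double rbSum = 0;   // running sum of batch averages
    private double rbSumSq = 0; // running sum of squared batch averages

    private static final double T_CRIT_99 = 2.576; // large-sample critical value

    // Feed the average response time of one batch; returns true once the 99% CI
    // half-width is within 15% of the running mean of the batch averages.
    public boolean addBatchAverage(double rb) {
        n++;
        rbSum += rb;
        rbSumSq += rb * rb;
        double avg = rbSum / n;
        double var = rbSumSq / n - avg * avg;        // avg of squares minus square of avg
        double std = Math.sqrt(Math.max(var, 0.0));  // guard against small negative rounding
        return n > 1 && T_CRIT_99 * std / Math.sqrt(n) <= 0.15 * avg;
    }
}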
We need to validate whether the use of 99% confidence intervals that are within 15% of the estimated average
response time is indeed practical for convergence of load tests. And if load tests do converge, then we need to
assess the percentage error versus the true mean, where the true mean is assumed to be the value we get if we
let the test run for 'long enough'.
We also need to specify a value for MinDuration of test after steady state. Technically one might want to
specify both a minimum number of samples as well as a minimum duration, whichever is higher. In reality, it is
easier for the average load tester to simply specify duration in minutes, given that most of the load tests
produce throughputs which are in tens of pages per second or higher thus yielding sufficient samples.
The next section 4.2 validates Algorithm 1 on a set of five lab applications and then section 4.3 does the same
on three real life applications.
4.2 Validation against Lab Applications
Five lab applications were used for validating Algorithm 1. All five were web applications, which were load
tested using an open source tool with 300 concurrent users with 2 second think time. All tests were run for a
total duration of 22 minutes. We asked the team running the tests to send us response time logs in the format
<elapsed time of run, page identifier, response time> where the log contains one entry for each application
web page that has completed.
The five lab applications were:
a. Dell DVD Store (DellDVD) [DELLDVD] which is an open source e-commerce benchmark application. 7
pages were tested in our load test.
b. JPetStore [JPET] which is an open source e-commerce J2EE benchmark. 11 pages were tested in our
load test.
c. RUBiS [RUBIS] which is an auction site benchmark. 7 pages were tested in our load test.
d. eQuiz which is a proprietary online quizzing application. 40 pages were tested in our load test.
e. NextGenTelco (NxGT) which is a proprietary reporting application. 13 pages were tested in our load test.
We present in Table 2 the application of Algorithm 1 to determine convergence in the load tests for these five
lab applications, all of which had a maximum test duration of 22 minutes. We graphically verified that in all cases
the throughput and average response times had converged well before 22 minutes. In all cases we used
a minimum duration of 5 minutes after steady state. We can see from Table 2 that all the applications reached
steady state within 2 to 3 minutes, and after 5 minutes of steady state the 99% confidence intervals are well
within 15% of the mean. When we compare the estimated average response time against the true mean
(assumed to be the average response time in steady state at the end of 22 minutes) we see a very small
deviation between the two, in most cases less than 1% and in one case just 3.4%.
If we did not specify a minimum duration of 5 minutes after steady state, and just waited for the first instant
where the 99% confidence interval size was within 15% of the estimated average response time, we observed
that 'convergence' happened in a matter of a few seconds for three of the applications and within 1 to 2
minutes for two others, as shown in Table 3. As seen from Table 3 the deviation from the true mean can go up
to 20%, which may be acceptable only during the initial stages of load testing.
Table 2: Application of Algorithm 1 to Lab Applications (Min Duration=5 min)

Application | Time to Steady State | Time to Converge after Steady State | 99% CI Size (% of Estimated Mean) | Avg Response Time at Convergence | Avg Response Time at End of Max Duration | % Deviation in Avg Response Time
DellDVD   | 3 min | 5 min | 8.1% | 23.86 ms | 23.79 ms | 0.3%
JPetStore | 2 min | 5 min | 1.9% | 33.80 ms | 34.07 ms | 0.8%
RUBiS     | 2 min | 5 min | 5.5% | 16.75 ms | 16.20 ms | 3.4%
eQuiz     | 2 min | 5 min | 2.9% | 62.48 ms | 63.02 ms | 0.9%
NxGT      | 2 min | 5 min | 0.9% | 31.59 ms | 31.52 ms | 0.2%
Table 3: Application of Algorithm 1 to Lab Applications (Min Duration=0 min)

Application | Time to Steady State | Time to Converge after Steady State | 99% CI Size (% of Estimated Mean) | Avg Response Time at Convergence | Avg Response Time at End of Max Duration | % Deviation in Avg Response Time
DellDVD   | 3 min | 52 sec | 14.9% | 24.54 ms | 23.79 ms | 3.1%
JPetStore | 2 min | 3 sec  | 14.6% | 33.67 ms | 34.07 ms | 1.1%
RUBiS     | 2 min | 14 sec | 14.6% | 18.49 ms | 16.20 ms | 14.1%
eQuiz     | 2 min | 2 sec  | 11.6% | 76.19 ms | 63.02 ms | 20.9%
NxGT      | 2 min | 2 sec  | 5.4%  | 31.40 ms | 31.52 ms | 0.3%
4.3 Validating Algorithm 1 against Real World Applications
The following three real world IT applications were chosen for validation of Algorithm 1:
i. MORT: A mortgage and loan application implemented using web services and a web portal. 26 pages of MORT were load tested with an open source tool, for a total of 20 minutes with 80 concurrent users. MORT has a mix of pages, some of which complete in a few milliseconds and some of which take up to 30 seconds.
ii. VMS: A vendor management system that deals with invoice and purchase order processing. 11 pages were load tested using a commercial tool, for a total duration of 20 minutes, with 25 concurrent users and 5 second think times.
iii. HelpDesk: A service manager application for the help desk management lifecycle. 31 pages were load tested with an open source tool, for a total of 15 minutes with 150 concurrent users, and think times between 0 and 15 seconds.
We see in Table 4 that for all three real world applications Algorithm 1 converged to the average response
time quite fast with less than 5% deviation from the true mean. (In fact for VMS and HelpDesk if we remove
the requirement of 5 minutes steady state duration the convergence occurs in 1.5 minutes with less than 6%
deviation.)
Table 4: Application of Algorithm 1 to Real World Apps (Min Duration=5 min)

Application | Time to Steady State | Time to Converge after Steady State | 99% CI Size (% of Estimated Mean) | Avg Response Time at Convergence | Avg Response Time at End of Max Duration | % Deviation in Avg Response Time
MORT     | 2 min | 5 min    | 11.1% | 908.64 ms | 867.28 ms | 4.8%
VMS      | 2 min | 11.6 min | 14.9% | 579.69 ms | 579.36 ms | 0.1%
HelpDesk | 3 min | 5 min    | 5.1%  | 121.36 ms | 125.26 ms | 3.2%
4.4 Distribution of Average Response Time
We were curious to see if the distribution of average response time converged to a normal distribution. We
used batches of samples to compute average response times in the logs provided and then took their
cumulative distribution function (CDF) [WALP2002], and compared with that for the Normal distribution with
the same overall mean and standard deviation as the response time log. We can see from Figure 2 that the
distribution was indeed close to the Normal distribution for MORT and HelpDesk. In the case of VMS the error
was a bit more since there were fewer samples in the log file.
Figure 2: Distribution of Average Response Time
5. Test Duration for Page Level Response Time Convergence
Section 4 showed how Algorithm 1 works towards convergence of overall average response time, across all
pages of an application. We are now interested in knowing what happens if we want individual page level
response times to converge. Note that we have fewer samples per page compared to total number of
samples. The result of applying Algorithm 1 to the 7 pages of DellDVD is shown in Table 5. We see that
Algorithm 1 correctly predicts convergence of the test and the deviation is within 5% from the true mean per
page. We found the same pattern for the other four lab applications.
Table 5: Algorithm 1 Applied to Pages of DellDVD

Page Number of DellDVD | Time to Steady State | Time to Converge after Steady State | 99% CI Size (% of Estimated Mean) | Avg Response Time at Convergence | Avg Response Time at End of Max Duration | % Deviation in Avg Response Time
Page 1 | 2 min | 11.2 min | 14.9% | 4.98 ms  | 4.88 ms  | 2.1%
Page 2 | 2 min | 5.0 min  | 11.8% | 13.94 ms | 13.48 ms | 3.4%
Page 3 | 2 min | 10.5 min | 15.0% | 4.47 ms  | 4.70 ms  | 4.9%
Page 4 | 3 min | 5.0 min  | 3.7%  | 49.99 ms | 49.73 ms | 0.5%
Page 5 | 3 min | 5.0 min  | 12.9% | 12.28 ms | 11.74 ms | 4.6%
Page 6 | 3 min | 5.0 min  | 11.6% | 12.48 ms | 11.95 ms | 4.4%
Page 7 | 3 min | 5.0 min  | 3.9%  | 71.32 ms | 72.24 ms | 1.3%
In the case of the real world application MORT there were 26 pages in all, but the frequency of page access
was too small in 21 of the pages and there were not enough samples for confidence intervals to converge. For
5 of the pages that had enough samples, we present the results of Algorithm 1 in Table 6. Likewise for
HelpDesk there were 10 pages with enough samples and all converged between 6 to 9 minutes of total run
time with errors less than 5% of the true mean, as shown in Table 7.
Table 6: Algorithm 1 Applied to Pages of Real World Application MORT

Page Number of MORT | Time to Steady State | Time to Converge after Steady State | 99% CI Size (% of Estimated Mean) | Avg Response Time at Convergence | Avg Response Time at End of Max Duration | % Deviation in Avg Response Time
Page 1 | 2 min | 8.9 min  | 14.6% | 32.74 sec | 32.67 sec | 0.2%
Page 2 | 2 min | 5.0 min  | 7.3%  | 47.51 ms  | 45.51 ms  | 4.4%
Page 3 | 3 min | 8.9 min  | 14.9% | 33.40 sec | 33.46 sec | 0.2%
Page 4 | 3 min | 10.1 min | 14.4% | 34.44 sec | 34.31 sec | 0.1%
Page 6 | 2 min | 9.8 min  | 13.7% | 35.59 sec | 35.65 sec | 0.1%
Table 7: Algorithm 1 Applied to Pages of Real World Application Helpdesk

Page Number of Helpdesk | Time to Steady State | Time to Converge after Steady State | 99% CI Size (% of Estimated Mean) | Avg Response Time at Convergence | Avg Response Time at End of Max Duration | % Deviation in Avg Response Time
Page 14 | 2 min | 6.3 min | 4.1%  | 24.75 ms  | 24.29 ms  | 1.9%
Page 15 | 2 min | 6.3 min | 6.7%  | 14.28 ms  | 14.08 ms  | 1.4%
Page 16 | 2 min | 7.4 min | 13.4% | 11.24 ms  | 11.02 ms  | 2.0%
Page 17 | 2 min | 5.2 min | 14.4% | 363.24 ms | 366.92 ms | 1.0%
Page 22 | 2 min | 5.9 min | 2.6%  | 38.89 ms  | 38.87 ms  | 0.1%
Both in the case of MORT and in the case of HelpDesk there were 21 pages that did not converge for lack of
samples and if we had to wait for all pages to converge then we would have reached the max duration without
convergence. This calls for a modification to our Algorithm. We should allow pages to be tagged and check for
convergence of only tagged pages. We assume that the load test team would have knowledge of the
application workload and criticality to decide which pages need to be tagged for accurate estimation of
performance metrics.
When we applied Algorithm 1 to the individual pages of VMS, the number of samples was too few, since our batch size
was 100. (In fact we had just 5 batches per page.) So we reduced the batch size to 10 for the purpose of this
analysis. (This is not recommended in general, but our purpose here is to draw attention to the handling of outliers
through this example.) 8 of the pages converged with a deviation of less than 8% from the true mean, but 3
pages did not converge at all, even though there were enough samples. For these three pages, Page 0, Page
2, and Page 10, Table 8 shows the 99% confidence interval size at the end of the run. The confidence intervals
did not converge for these three pages due to the presence of outliers, since outliers can drastically increase the
variance.
Table 8: Algorithm 1 Applied to Pages of VMS for batch size=10

Page Number of VMS | Time to Steady State | Time to Converge after Steady State | 99% CI Size (% of Estimated Mean) | Avg Response Time at Convergence | Avg Response Time at End of Max Duration | % Deviation in Avg Response Time
Page 0  | 2 min | NA       | 29.2%  |            |            |
Page 1  | 2 min | 5.9 min  | 15.0%  | 2465.31 ms | 2603.48 ms | 5.3%
Page 2  | 2 min | NA       | 18.6%  |            |            |
Page 3  | 2 min | 5.0 min  | 14.4%  | 402.31 ms  | 418.48 ms  | 3.9%
Page 4  | 2 min | 5.3 min  | 14.9%  | 369.14 ms  | 391.47 ms  | 5.7%
Page 5  | 2 min | 5.2 min  | 11.7%  | 364.29 ms  | 383.78 ms  | 5.1%
Page 6  | 2 min | 5.6 min  | 13.9%  | 379.98 ms  | 381.86 ms  | 0.4%
Page 7  | 2 min | 5.0 min  | 9.2%   | 736.03 ms  | 793.71 ms  | 7.2%
Page 8  | 2 min | 5.0 min  | 10.8%  | 372.81 ms  | 346.28 ms  | 7.7%
Page 9  | 2 min | 12.6 min | 15.0%  | 456.22 ms  | 438.85 ms  | 3.9%
Page 10 | 4 min | NA       | 101.9% |            |            |
6. Handling of Outliers in Real World Applications
A closer look at the scatter plot of response times for the three 'non-convergent' pages of VMS revealed the
presence of outliers, as shown in Figure 3.
So our next question was how to remove outliers. The easiest way is to maintain a running histogram of
response time samples, but if our methodology is to be incorporated into any load test tool then it has to be
very efficient. Therefore we adopted the heuristic that if any response time sample is more than 2 times the
current average response time, it goes into an outlier bucket (assuming at least 10 samples have been seen
before this rule kicks in). We do not discard it, because if the number of such samples increases drastically they
need to be reclassified as 'inliers'. Note that while the figure shows actual response time samples, our algorithm
applies to samples of batch means, which is why a factor of 2 is appropriate.
Figure 3: Outliers in Pages 0, 10 and 2 respectively, in VMS Response Times
We adapted Algorithm 1 to compute the running sum of response times and squared response times for both
regular samples and outlier samples. If the number of outliers exceeds 10% of the total samples, we include
them back into the regular samples at the time of determining convergence, by simply adding their sums of
response times and squared response times and their sample counts. This is very efficient, with O(1)
complexity. The only challenge is that if outliers happen to occur very early in the run after steady state, they
are likely to be included and never discarded. For now we have not refined this part of the algorithm, but we
plan to do so in the near future.
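The following rough Java sketch (our own rendering of the heuristic described above; class and method names are illustrative) shows the factor-of-2 outlier bucket and the 10% re-inclusion rule applied to running sums:

public class OutlierAwareStats {
    private long n = 0, nOut = 0;
    private double sum = 0, sumSq = 0;        // regular ("inlier") samples
    private double sumOut = 0, sumSqOut = 0;  // outlier bucket

    public void addSample(double rt) {
        double avg = (n > 0) ? sum / n : rt;
        if (n >= 10 && rt > 2 * avg) {        // factor-of-2 rule after at least 10 samples
            nOut++; sumOut += rt; sumSqOut += rt * rt;
        } else {
            n++; sum += rt; sumSq += rt * rt;
        }
    }

    // Merge the bucket back (O(1)) if the "outliers" are no longer rare, then return the mean.
    public double meanForConvergenceCheck() {
        if (nOut > 0.10 * (n + nOut)) {
            n += nOut; sum += sumOut; sumSq += sumSqOut;
            nOut = 0; sumOut = 0; sumSqOut = 0;
        }
        return (n > 0) ? sum / n : 0;
    }
}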
When we applied this modified algorithm to the VMS pages with outliers, the average response time for Page
0 converged within 14.9 minutes after steady state with just 1.4% deviation from the final result, and that for
Page 2 converged within 17.1 minutes after steady state with 0.2% error. But Page 10 did not converge
despite the outlier handling. If we manually remove the outliers shown in Figure 3 and re-plot the data, we get the revised scatter plot in Figure 4.
We now see a new set of outliers. But there are so many of them that we can no longer call them outliers, and
the algorithm rightly classified them as inliers. Because of the high variance in the response times, the
confidence intervals for this page did not converge. After removing outliers for this page, the 99% confidence
interval at the end of the run had a spread of 24% around the mean. Had this been a tagged page, it would have
required a much longer test duration for convergence, as opposed to the 20 minutes used by the load testing
team.
Figure 4: Scatter Plot for 'Outlier' Page in VMS
7. Summary of Algorithm, Applicability, and Future Work
We have presented an algorithm in this paper to automatically determine test duration during load testing. The
algorithm has two parts. First, it checks if steady state is reached in the k-th minute of test execution by determining if
the throughput at the k-th minute is within 10% of the throughput at the (k-1)-th minute, for k > 1. Second, it checks if the
99% confidence interval of the average response time of batch means is within 15% of the estimated average
(once the runtime exceeds a specified minimum duration after steady state) or the maximum duration is reached.
We have shown that this algorithm works accurately with total average response times having less than 5%
error from the true mean, for five lab applications and three real world applications. In the case of page level
response times, we have proposed an enhancement to take care of outliers. Note that in the case of overall
average response times (across all pages) we do not recommend outlier removal. This is because there may
be infrequent pages that have response times much higher than other frequently accessed pages and these
readings should not be misconstrued as outliers. We have also shown the need to tag pages when applying
this algorithm at the page level so that we check for convergence only for pages that matter.
To speed up load tests, we can drop the minimum duration condition for the first few rounds of load tests,
where we need quick results and where a higher percentage of error is tolerable. As we have seen, across all
applications tested the convergence after steady state is often a matter of seconds, with errors less than 20%.
For the first rounds of load tests we also need not worry about page level convergence and can plan on just
overall convergence.
While the algorithm has been presented around average response times, the question arises whether it can
be applied for percentiles of response time, which are commonly reported in load tests. Note that we used
average response times because of the applicability of the central limit theorem. We cannot do the same with
percentiles of response times. In general, we should use the proposed algorithm for determining when to stop
a test, and during the run time maintain statistics for all performance metrics of interest. Whenever the test
stops we can output the estimates of the performance metrics of interest. Note that outliers should not be
removed when computing percentiles.
One of the items for future work is the fine tuning of outlier handling when outliers occur at the start of steady
state. We also need to assess the applicability of this algorithm when there is a variable number of users in load
tests.
Acknowledgements
We would like to thank the anonymous referees whose suggestions have drastically improved the quality of
this paper. We would like to thank Rajendra Pandya and Yogesh Athavale for providing performance test logs
of VMS and Helpdesk applications, respectively. We would also like to thank Rupinder Virk for running
performance tests of the lab applications.
References
[AUTOC] http://easycalculation.com/statistics/autocorrelation.php
[DELLDVD] Dell DVD Store http://linux.dell.com/dvdstore/
[EICK2007] M. Eickhoff, D. McNickle, K. Pawlikowski, ”Detecting the duration of initial transient in steady
state simulation of arbitrary performance measures”, ValueTools(2007).
[FISH2001] G. Fishman, Discrete Event Simulation: Modelling, Programming, and Analysis, Springer
(2001).
[JPET] iBatis jPetStore http://sourceforge.net/projects/ibatisjpetstore/
[PAWL1990] K. Pawlikowski,”Steady-state simulation of queuing processes: a survey of problems and
solutions.”, ACM Computing Surveys, 22:123–170(1990).
[ROBI2007] S. Robinson,”A statistical process control approach to selecting a warm-up period for a
discrete-event simulation.”, European Journal of Operational Research, 176(1):332–346(2007).
[RUBIS] Rice University Bidding System http://rubis.ow2.org/
[MANS2010] R. Mansharamani, A. Khanapurkar, B. Mathew, R. Subramanyan, "Performance Testing:
Far from Steady State", IEEE COMPSAC, 341-346(2010).
[TDIST] http://easycalculation.com/statistics/t-distribution-critical-value-table.php
[WALP2002] R. Walpole, Probability & Statistics for Engineers & Scientists, 7th Edition, Pearson (2002).
[WARMUP] http://rwwescott.wordpress.com/2014/07/29/when-does-the-warmup-end/
Measuring Wait and Service Times in Java using Bytecode Instrumentation
Amol Khanapurkar, Chetan Phalak
Tata Consultancy Services, Mumbai.
{amol.khanapurkar, chetan1.phalak}@tcs.com
Performance measurement is key to many performance engineering
activities. Today's programs are invariably concurrent programs that try to
optimize usage of resources such as multi-core and power. Concurrent
programs are typically implemented using some sort of Queuing
mechanism. Two key metrics in queuing architecture are Wait Time and
Service Time. Preciseness of these two metrics determines how
accurately and reliably the IT systems can be modeled. Queues are amply
studied and rich literature is available, however there is paucity of tools
available that provide a breakup of Wait Time and Service Time
components of Response Time. In this paper, we demonstrate a
technique that can be used for measuring the actual time spent in
servicing the Synchronized block as well as time spent in waiting to enter
the Synchronized block. A critical-section is implemented in Java using
Synchronized blocks.
1. INTRODUCTION
The Java programming language is one of the most widely adopted programming languages in the world today. It is
present in all kinds of applications: large enterprises, small and medium businesses, as well as mobile
apps. One of the features that has made Java so popular is its built-in support for multi-threading to write
concurrent and parallel programs. The vast majority of today's enterprise applications written in Java are concurrent
programs. By concurrent programs, we mean programs whose threads exchange information through primitives provided by
the native programming language. In Java, such a primitive is provided by the keyword 'synchronized', and the Java
infrastructure for providing concurrency support revolves around this keyword.
Concurrent programs in Java are written using the JDK APIs that support multi-threading. When two or more threads in
Java try to enter a critical-section, Java enforces queuing so that only one thread can get a lock on the
critical-section. Upon completing its work the thread relinquishes the lock and leaves the critical-section. The remaining
threads competing for the lock wait to acquire it. The Java runtime, through its synchronization primitives,
manages the assignment of the lock to the next eligible thread. Java allows the queuing policy to be fair or
random. Fair assignment is performance intensive and is rarely used in real life applications. Random assignment
of the lock has no performance overhead and is hence preferred in most applications.
Queuing is a vastly studied topic, and queuing theory [QUAN 1984] provides the base for analytical modeling. Hence
it is highly desirable to be able to apply queuing theory fundamentals to the actual code. Amongst other things,
queuing theory requires service time and arrival rate as input parameters in order to predict wait times and
response times for jobs to complete. In practice, though, it is easy to measure response time, but exact service
times, and hence wait times, remain a little elusive to measure. There simply aren't enough tools available
which provide queue depth or a breakup of response time into its service and wait time components.
In this paper, we try to address that void. We present techniques that improve measurements and can form inputs
for performance modeling. The problem we address is to find the breakup of response time into its service and
wait time components, without support for this in the Java API itself. More specifically, we provide a technique to
capture service and wait times for concurrent threads that access a shared resource.
We express the problem statement in the form of code. Consider the following code.
public class Test {
    static Object _lockA1;
    static int sharedVal, NUMTHREADS = 10, SLEEPDURATION = 500;
    WorkerThread wt[] = new WorkerThread[NUMTHREADS];
    int max = SLEEPDURATION + SLEEPDURATION/2, min = SLEEPDURATION / 2;

    public Test(){
        _lockA1 = new Object();
    }

    // Starts all worker threads and waits for them to finish
    public void doTest() throws InterruptedException {
        for(int i = 0; i < NUMTHREADS; i++){
            wt[i] = new WorkerThread();
            wt[i].start();
        }
        for(int j = 0; j < NUMTHREADS; j++){
            try {
                wt[j].join();
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    }

    class WorkerThread extends Thread {
        public void run(){
            for(int iter = 0; iter < 1; iter++){
                inc();
            }
        }

        // Critical-section: serializes access to sharedVal on _lockA1
        public void inc(){
            synchronized(_lockA1){
                try{
                    sharedVal++;
                    sleep(new java.util.Random().nextInt(max - min + 1) + min);
                }catch(InterruptedException e){
                    e.printStackTrace();
                }
            }
        }
    }
}
Fig. 1. Sample Concurrent Java Program
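A small driver of the following form (not part of the paper's listing; the class name TestMain is ours, added only to make the example self-contained) could be used to run Fig. 1; it simply starts the workers and prints the final shared counter:

public class TestMain {
    public static void main(String[] args) throws InterruptedException {
        new Test().doTest();   // starts NUMTHREADS worker threads and waits for them to finish
        // every worker increments sharedVal exactly once, so the expected value is NUMTHREADS
        System.out.println("sharedVal = " + Test.sharedVal);
    }
}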
The function inc() is a critical-section and controls access to the variable sharedVal. Different threads access the
function concurrently and try to modify the value of sharedVal. Since sharedVal is incremented in the critical-section,
access to the critical-section is made serial. While one thread is executing the critical-section, other
threads have to wait for their turn to enter it. This kind of code is present in millions of lines of Java code across
industry verticals; a few examples are updating an account balance, booking a ticket, etc.
The longer a thread has to wait on the critical-section, the longer its response time will be. The lower bound is
established by the service time. The rest of the paper focuses on how to find the breakup of response time into
service and wait times.
2. JAVA CONCURRENCY INFRASTRUCTURE
Java provides the following infrastructure for writing multi-threaded, concurrent programs.
1) Synchronized Block
2) Synchronized Objects
3) Synchronized Methods
Java synchronized blocks are the most prevalent and preferred form of implementing concurrency control. Since
the critical-section is localized, it becomes easy to debug multi-threaded programs that use a synchronized block.
Method 2) implements concurrency control by making the object thread-safe: if two threads try to access the
same object simultaneously, one thread gets access to the object while the other blocks. Method 3)
generates an implicit monitor; how a synchronized method is treated is mostly compiler dependent, and it uses a
JVM access flag called ACC_SYNCHRONIZED.
In this paper, we are going to focus only on synchronized blocks. Obtaining wait and service times for
synchronized objects / methods requires a different state machine than the one required by synchronized blocks.
Hence obtaining wait and service times using methods 2) and 3) is out of scope.
Before we get into the specifics of the state machine and Java bytecode, we present an alternate method of obtaining
the same information. This alternate method is to log all accesses before the entry, during, and after the exit
from the critical-section. The information needs to be of the form <tid, time, locationID> where
 tid – Thread identifier
 time – Timestamp
 locationID – a combination of class, method and exact location (say line number or variable on which
synchronization happened)
public void inc(){
    long t1 = System.currentTimeMillis();     // T1: thread arrived at the synchronized block
    synchronized(_lockA1){
        long t2 = System.currentTimeMillis(); // T2: thread entered the synchronized block
        sharedVal++;
        long t3 = System.currentTimeMillis(); // T3: thread about to exit the synchronized block
    }
    long t4 = System.currentTimeMillis();     // T4: thread exited the synchronized block
    // t1..t4 would be logged as <tid, time, locationID> tuples
}
Fig. 2. Alternate Method: Logging
For each such access, four tuples need to be captured, as shown below:
T1 – Time the thread arrived at the synchronized block.
T2 – Time the thread entered the synchronized block.
T3 – Time the thread is about to exit the synchronized block.
T4 – Time the thread exited the synchronized block.
In this case (a small numeric illustration follows this list),
 (T4 - T1) is the Response Time,
 (T3 - T2) is the Service Time and
 (T2 - T1) is the Wait Time.
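As a small illustration with made-up timestamps: if T1 = 1000 ms, T2 = 1040 ms, T3 = 1090 ms and T4 = 1091 ms, then the response time is T4 - T1 = 91 ms, the wait time is T2 - T1 = 40 ms and the service time is T3 - T2 = 50 ms (the remaining 1 ms is roughly the time spent exiting the block).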
This method has the following disadvantages:
1) Logging has its own overheads
2) After the logs have been written, they need to be crunched programmatically to get the desired information
3) Even for simple programs, the crunching program can get complex because it has to tag the appropriate
timestamps to the appropriate threads
4) For complex programs involving nested synchronized blocks (e.g. code that implements 2-phase
commits), the crunching program can quickly become more complex and may require significant
development and testing time
5) This method will fail in cases where source code is not available
To overcome these disadvantages, we chose to implement bytecode instrumentation to capture the information
we require.
3. JAVA BYTECODE
Wikipedia [BYTECODE] defines Java bytecode as the instruction set of the Java virtual machine: an
abstract machine language that is ultimately executed by the JVM. Java bytecode is
generated by language compilers targeting the Java platform, most notably the compiler for the Java programming language.
Synchronized blocks are supported in the language using the bytecode instructions monitorenter and monitorexit:
monitorenter grabs the lock for the synchronized() section and monitorexit releases it.
4. CENTRAL IDEA AND IMPLEMENTATION OF STATE MACHINE
4.1 Central Idea
Our objective in building a state machine is to get the following details about a Synchronized Block.
 Location of the block i.e. which class and which method is synchronized()
 Variable name on which this block is synchronized()
 Breakup of synchronized() block response time into wait and service time components
Ingredients for implementing a critical-section using synchronized blocks are:
1) The synchronized() construct and
2) The variable on which synchronization is happening, either static or non-static
The monitorenter and monitorexit opcodes provide events related to entering and exiting the critical-section. To get
access to the variable name we need to track the opcodes getstatic and getfield, for static and non-static
variables respectively. Ideally, tracking these 4 opcodes should suffice; this is the central idea behind the state
machine. However, we decided to also track another opcode, astore, to make the state machine more
robust. We took this path for two reasons:
1) Typically, the javac compiler generates a bunch of opcodes between the get* and monitorenter opcodes,
so the exact sequence is not known.
2) However, based on our empirical study, we found that the astore instruction always precedes the
monitorenter opcode.
Consider the output of javap [ JAVAP ] utility for inc() function to get a better understanding of the reasons stated
above.
public void inc();
  Code:
     0: getstatic     #4    // Field Test._lockA1:Ljava/lang/Object;
     3: dup
     4: astore_1
     5: monitorenter
     6: getstatic     #5    // Field Test.sharedVal:I
     9: iconst_1
    10: iadd
    11: putstatic     #5    // Field Test.sharedVal:I
    14: getstatic     #6    // Field Test.SLEEPDURATION:I
    17: getstatic     #6    // Field Test.SLEEPDURATION:I
    20: iconst_2
    21: idiv
    22: iadd
    23: istore_2
    24: getstatic     #6    // Field Test.SLEEPDURATION:I
    27: iconst_2
    28: idiv
    29: istore_3
    30: new           #7    // class java/util/Random
    33: dup
    34: invokespecial #8    // Method java/util/Random."<init>":()V
    37: iload_2
    38: iload_3
    39: isub
    40: iconst_1
    41: iadd
    42: invokevirtual #9    // Method java/util/Random.nextInt:(I)I
    45: iload_3
    46: iadd
    47: i2l
    48: invokestatic  #10   // Method sleep:(J)V
    51: goto          59
    54: astore_2
    55: aload_2
    56: invokevirtual #12   // Method java/lang/InterruptedException.printStackTrace:()V
    59: aload_1
    60: monitorexit
    61: goto          71
    64: astore        4
    66: aload_1
    67: monitorexit
    68: aload         4
    70: athrow
    71: return
  Exception table:
     from    to  target  type
        6    51      54  Class java/lang/InterruptedException
        6    61      64  any
       64    68      64  any
Fig. 3. Javap Output
Notice the presence of dup and astore instructions (ignore everything starting from '_' character) between the
getstatic and monitorenter opcodes. For various test programs written in various different ways and compiled with
and without -O options, we found that the set of instructions were not the same. Had these instructions been the
same always, we would be in a position to provide guarantee regarding the sequence of events leading to
entering the critical-section. However, we found that astore opcode always precedes the monitorenter event.
Hence, we made it a part of our pipeline of instructions to keep track of, to detect the presence of a thread which is
about to enter a critical-section.
The monitorexit opcode is fairly straight forward. Upon encountering monitorexit, we just have to flush our data
structures that keep track of the pipeline.
In our study of Java literature, we haven't come across literature that gives strong guarantees regarding sequence
of bytecode generation. Hence our implementation is empirical, based on our understanding of how the Java
concurrency implementation infrastructure works. Since the implementation is based on empirical data, we carried
out exhaustive testing which we will describe later in the paper. We didn't find any test case for which our state
machine breaks.
4.2 Implementation
We used the ASM [ ASM ] bytecode manipulation library to do the instrumentation. ASM is based on the Visitor
pattern: the library generates events which are captured and processed by our own Java code. For the ASM library
to generate events, we needed to register hooks for the events of interest to our state machine. The registration of
hooks can be done statically at compile time, or at runtime, i.e. at class load time, using the Instrumentation API
available since JDK 1.5. Since we anticipated this utility to be small (< 5K LOC), we preferred the static approach in
which instrumentation is done manually. Conversion to runtime instrumentation is straightforward and is just a
matter of using the right APIs provided by Java.
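As an illustration of the runtime (load-time) alternative mentioned above, the following is a minimal sketch built on the standard java.lang.instrument API; the agent class name SyncProfilerAgent and the agent jar name are hypothetical, and the transformer body only indicates where an ASM rewriting pass would go.

// Illustrative sketch only: a load-time alternative to the static instrumentation used in
// this paper, based on the standard java.lang.instrument API (JDK 1.5+). The agent name is
// hypothetical, not taken from the paper.
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

public final class SyncProfilerAgent {

    // Invoked by the JVM when started with -javaagent:sync-profiler.jar (hypothetical jar name)
    public static void premain(String args, Instrumentation inst) {
        inst.addTransformer(new ClassFileTransformer() {
            @Override
            public byte[] transform(ClassLoader loader, String className,
                                    Class<?> classBeingRedefined,
                                    ProtectionDomain protectionDomain,
                                    byte[] classfileBuffer) {
                // Skip JDK classes; instrument only application classes.
                if (className == null || className.startsWith("java/")) {
                    return null;   // null means "leave the class unchanged"
                }
                // Here an ASM ClassReader/ClassWriter pass (as in Section 4.2) would
                // rewrite the bytecode and return the modified bytes.
                return classfileBuffer;
            }
        });
    }
}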
The ASM API provides the following hooks for the events of interest to us:
- visitInsn() : for monitorenter and monitorexit
- visitVarInsn() : for astore
- visitFieldInsn() : for getstatic and getfield
- visitMethodInsn() : for class and method names
During instrumentation, ASM parses the Java bytecode of the classes. After parsing, it generates events for which a
hook is registered, and it is the responsibility of the calling code to consume each event. For performance reasons
we use the streaming API of ASM; with the streaming API, an event is lost if it is not consumed.
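As a minimal, self-contained sketch (not the paper's actual code) of how these hooks are received through ASM's streaming API, the following class simply prints the events of interest; in the real tool they feed the state machine described next. The class name HookDemo is hypothetical; "Test" refers to the class shown in Fig. 3.

// Minimal sketch: receive the four hooks listed above via ASM's streaming API and print them.
import org.objectweb.asm.ClassReader;
import org.objectweb.asm.ClassVisitor;
import org.objectweb.asm.MethodVisitor;
import org.objectweb.asm.Opcodes;

public class HookDemo {
    public static void main(String[] args) throws Exception {
        ClassReader reader = new ClassReader("Test");   // class under test, as in Fig. 3
        reader.accept(new ClassVisitor(Opcodes.ASM5) {
            @Override
            public MethodVisitor visitMethod(int access, String name, String desc,
                                             String signature, String[] exceptions) {
                return new MethodVisitor(Opcodes.ASM5) {
                    @Override
                    public void visitFieldInsn(int opcode, String owner, String fname, String fdesc) {
                        System.out.println(name + ": field insn on " + owner + "." + fname);
                    }
                    @Override
                    public void visitVarInsn(int opcode, int var) {
                        if (opcode == Opcodes.ASTORE) System.out.println(name + ": astore " + var);
                    }
                    @Override
                    public void visitInsn(int opcode) {
                        if (opcode == Opcodes.MONITORENTER) System.out.println(name + ": monitorenter");
                        if (opcode == Opcodes.MONITOREXIT)  System.out.println(name + ": monitorexit");
                    }
                };
            }
        }, 0);
    }
}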
When an event is generated, control returns to our calling code, which consumes the event and takes the
appropriate action based on its type (visitInsn, visitVarInsn, visitFieldInsn or visitMethodInsn).
The event handling is encapsulated in two Visitors that work in lock-step in order to distinguish between the moment
a thread arrives at a synchronized block and the moment it actually enters the monitor. These Visitors are named:
- ResponseTimeMethodVisitor and
- ServiceTimeMethodVisitor
The algorithm that the ResponseTimeMethodVisitor implements is as follows (a simplified code sketch follows this list):
a) Maintain a list of opcodes in the order in which they are called.
b) Obviously, we expect the getfield / getstatic as the first element in our list of opcodes. Maintain that at the
head of our list. Keep track of the latest getfield / getstatic opcodes, overwriting previous occurrences, if
any.
c) Ignore all other events, if any are registered e.g. dup, until an astore is received. Add astore as second
element of our list.
d) Ignore all other events, if any are registered until an astore or getfield / getstatic is received.
e) If astore is received, overwrite it at second position in the list.
f) If getfield / getstatic is received, empty the list and add getfield / getstatic to the head of the list.
g) Continue the same until first two elements are getfield / getstatic and astore respectively and the third
element is monitorenter.
h) Once the list comprises of
1. getfield / getstatic
2. astore
3. monitorenter
in that order, it sets the flag for that thread to true.
i) Once it sets the flag to true, it updates the book-keeping data structures. In one of these data structures, it
records the arrival time (T1) of the current thread against the synchronized block implemented on the variable
pointed to by the getfield / getstatic.
j) Upon encountering a monitorexit event, it updates the book-keeping data structures by entering the time
(T4) at which the thread exited the synchronized block.
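The sketch below is a simplified, single-level version of the pipeline in steps a)-j); it is not the authors' ResponseTimeMethodVisitor, the class name PipelineMethodVisitor is hypothetical, and it prints markers where the real tool would record or inject the T1/T4 book-keeping.

// Simplified detection pipeline: get* followed by astore followed by monitorenter.
import java.util.ArrayList;
import java.util.List;
import org.objectweb.asm.MethodVisitor;
import org.objectweb.asm.Opcodes;

class PipelineMethodVisitor extends MethodVisitor {
    private final List<String> pipeline = new ArrayList<>(); // holds "get", "astore"
    private String lockVariable;                              // name from the latest get* opcode

    PipelineMethodVisitor(MethodVisitor next) {
        super(Opcodes.ASM5, next);
    }

    @Override
    public void visitFieldInsn(int opcode, String owner, String name, String desc) {
        if (opcode == Opcodes.GETSTATIC || opcode == Opcodes.GETFIELD) {
            pipeline.clear();                  // step f): restart with the latest get*
            pipeline.add("get");
            lockVariable = owner.replace('/', '.') + "." + name;
        }
        super.visitFieldInsn(opcode, owner, name, desc);
    }

    @Override
    public void visitVarInsn(int opcode, int var) {
        if (opcode == Opcodes.ASTORE && !pipeline.isEmpty() && "get".equals(pipeline.get(0))) {
            if (pipeline.size() == 1) pipeline.add("astore");   // step c)
            // step e): a repeated astore would simply overwrite the second slot (a no-op here)
        }
        super.visitVarInsn(opcode, var);
    }

    @Override
    public void visitInsn(int opcode) {
        if (opcode == Opcodes.MONITORENTER && pipeline.size() == 2) {
            // steps g)-i): get* + astore + monitorenter complete; the real tool records or
            // injects the arrival (T1) book-keeping for lockVariable at this point.
            System.out.println("arrival detected at lock " + lockVariable);
            pipeline.clear();
        } else if (opcode == Opcodes.MONITOREXIT) {
            // step j): the real tool records the exit time (T4) and flushes its structures.
            System.out.println("exit detected from lock " + lockVariable);
        }
        super.visitInsn(opcode);
    }
}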
The ServiceTimeMethodVisitor simply piggybacks on the work done by the ResponseTimeMethodVisitor. It does only
the following:
a) It checks the status of the flag that the ResponseTimeMethodVisitor has set to true for the current thread. Once
it finds the flag set, it updates the book-keeping data structure by recording the synchronized-block enter time
(T2) against the current thread. Updating the data structure for a thread whose arrival time has not already
been set by the ResponseTimeMethodVisitor is an illegal state.
b) After updating the data structure, it resets the flag to false, so that nested synchronized blocks can be
processed.
c) Since the book-keeping data structures are common to both Visitors, it simply ignores the monitorexit event
and treats the time set by the ResponseTimeMethodVisitor as the timestamp (T3) at which servicing is
completed.
Thus, in our state machine implementation:
- T3 and T4 are the same,
- (T3 – T2) gives us the Service Time, and
- (T4 – T1) gives us the Response Time.
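As a worked example, take the Thread-7 row from the output in Fig. 4 below: T1 = 1408309576245, T2 = 1408309576281 and T3 = T4 = 1408309576301. The Service Time is 1408309576301 − 1408309576281 = 20 ms, the wait time is 1408309576281 − 1408309576245 = 36 ms, and the Response Time is 20 + 36 = 56 ms, matching the serviceTime and waitTime columns reported by the tool.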
The above description is for the simplest case. For nested synchronized blocks, the state machine gets a little more
complex. Technically, however, the same algorithm is followed, since the methods in our Java code that maintain the
state machine are reentrant. It is only the book-keeping, and hence the printing of results, that gets a little trickier to
handle.
4.3 Output of State-machine
For the code snippet depicted in Fig. 1, assume the appropriate main() is called. Our state machine then outputs the
results in the following format:
ThreadName  ArrivalTime    EnterTime      ExitTime       serviceTime  waitTime  LockName      LockLocation
Thread-0    1408309576179  1408309576179  1408309576196  17           0         Test._lockA1  Test$WorkerThread.inc
Thread-1    1408309576205  1408309576205  1408309576221  16           0         Test._lockA1  Test$WorkerThread.inc
Thread-2    1408309576221  1408309576221  1408309576238  17           0         Test._lockA1  Test$WorkerThread.inc
Thread-3    1408309576238  1408309576239  1408309576264  25           1         Test._lockA1  Test$WorkerThread.inc
Thread-6    1408309576243  1408309576334  1408309576352  18           91        Test._lockA1  Test$WorkerThread.inc
Thread-4    1408309576243  1408309576352  1408309576369  17           109       Test._lockA1  Test$WorkerThread.inc
Thread-8    1408309576244  1408309576319  1408309576334  15           75        Test._lockA1  Test$WorkerThread.inc
Thread-5    1408309576245  1408309576302  1408309576319  17           57        Test._lockA1  Test$WorkerThread.inc
Thread-9    1408309576245  1408309576264  1408309576281  17           19        Test._lockA1  Test$WorkerThread.inc
Thread-7    1408309576245  1408309576281  1408309576301  20           36        Test._lockA1  Test$WorkerThread.inc
Fig. 4 Output from State-machine
5. TESTING STATE MACHINE IMPLEMENTATION
The testing was divided into two types, viz.
1) Theory-based programs which demonstrate computer science principles or concepts:
- Producer-Consumer problem [ PRODCONS ]
- Dining Philosopher problem [ DINIPHIL ]
- Cigarette Smokers problem [ CIGARETTE ]
- M/M/1 Queues [ MM1 ]
In computing, the producer–consumer, dining philosopher and cigarette smokers problems are classic examples
of multi-process synchronization problems. We implemented these problems in Java. Each solution was designed
around a common, fixed-size buffer used as a queue and shared among all participants, with access to the shared
queue given to every participant through a synchronized block (a minimal sketch appears at the end of this section).
We tested the proposed state-machine implementation on these problems and observed 100% accurate results with
respect to the service time and lock time for every running thread.
2) Custom programs which are comparable to code that gets written in the IT industry today:
- Database connection pooling code
- Update account balance for a money transfer transaction (2-phase commit)
These programs check the qualitative and quantitative correctness of the code that passes through the state
machine. Other than the M/M/1 Queues, all of them verify the functional correctness of the state machine, while
M/M/1 (which is actually a set of programs) validates the quantitative correctness.
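The following is the kind of minimal producer-consumer test program referred to above: a fixed-size queue shared through a synchronized block on a static lock variable, so that the instrumented getstatic/astore/monitorenter pattern appears. It is representative only, not the authors' actual test code, and the class name ProducerConsumerDemo is hypothetical.

// Minimal producer-consumer sketch: a fixed-size queue guarded by a synchronized block.
import java.util.ArrayDeque;
import java.util.Deque;

public class ProducerConsumerDemo {
    private static final int CAPACITY = 10;
    private static final Deque<Integer> queue = new ArrayDeque<>();
    private static final Object lock = new Object();   // static synchronization variable

    public static void main(String[] args) {
        Thread producer = new Thread(() -> {
            for (int i = 0; i < 100; i++) {
                synchronized (lock) {                   // monitorenter / monitorexit around this block
                    while (queue.size() == CAPACITY) {
                        try { lock.wait(); } catch (InterruptedException e) { return; }
                    }
                    queue.addLast(i);
                    lock.notifyAll();
                }
            }
        });
        Thread consumer = new Thread(() -> {
            for (int i = 0; i < 100; i++) {
                synchronized (lock) {
                    while (queue.isEmpty()) {
                        try { lock.wait(); } catch (InterruptedException e) { return; }
                    }
                    System.out.println("consumed " + queue.removeFirst());
                    lock.notifyAll();
                }
            }
        });
        producer.start();
        consumer.start();
    }
}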
6. APPLICATION OF STATE MACHINE TECHNIQUE
Once response time is accurately broken down into wait and service time components, performance modeling
becomes accurate. Other researchers have built utilities on top of our API to predict performance in the presence
of software and hardware resource bottlenecks [ SUBH 2014 ].
Once packaged as a tool, this technique can help in detecting performance bottlenecks. During the development
phases of the Software Development Life Cycle (SDLC), such a tool can provide immense value in troubleshooting
performance issues. We have not yet quantified the performance overhead of our technique, but we believe it to be
very low. The basis for this assumption is our own past work in developing a Java profiler, named Jensor
[ AMOL 2011 ], using bytecode instrumentation techniques.
7. CONCLUSION
It is possible to derive the Wait and Service Time components of the Response Times of concurrent programs
written in Java using bytecode instrumentation techniques. Our state-machine based approach is capable of
capturing these metrics, which can then be used for other performance engineering activities such as performance
modeling, capacity planning and performance testing.
REFERENCES
[ AMOL 2011 ] Amol Khanapurkar, Suresh Malan, "Performance Engineering of a Java Profiler", NCISE, Feb 2011.
[ ASM ] http://asm.ow2.org/
[ BYTECODE ] http://en.wikipedia.org/wiki/Java_bytecode
[ CIGARETTE ] http://en.wikipedia.org/wiki/Cigarette_smokers_problem
[ DINIPHIL ] http://en.wikipedia.org/wiki/Dining_philosophers_problem
[ JAVAP ] http://docs.oracle.com/javase/7/docs/technotes/tools/windows/javap.html
[ MM1 ] http://en.wikipedia.org/wiki/M/M/1_queue
[ PRODCONS ] http://en.wikipedia.org/wiki/Producer%E2%80%93consumer_problem
[ QUAN 1984 ] Lazowska et al, "Quantitative System Performance: Computer System Analysis Using Queueing Network Models", 1984.
[ SUBH 2014 ] Subhasri Duttagupta, Rupinder Virk and Manoj Nambiar, "Predicting Performance in the Presence of Software and Hardware Resource Bottlenecks", SPECTS, 2014.
CLOUD PERFORMANCE TESTING - KEY CONSIDERATIONS (COMPLETE
ANALYSIS USING RETAIL APPLICATION TEST DATA)
Abhijeet Padwal
Performance engineering group
Persistent Systems, Pune
email: [email protected]
Due to its lower cost and greater flexibility, the cloud has become the preferred deployment option for applications
and products of any size. Through its Platform as a Service (PaaS) and Infrastructure as a Service (IaaS) offerings,
the cloud has also attracted and benefitted application testing services, especially load and performance testing.
Though the cloud provides superior flexibility and scalability at lower cost than traditional on-premises deployments,
it has its own limitations and challenges; if these are not evaluated carefully, they can severely impact projects and
their budgets. It is recommended to take a holistic view when deciding to use the cloud for any purpose, with a
detailed look at its pros and cons.
This paper describes the cloud in brief and presents a detailed case study of load testing a Retail application in the
cloud: how the cloud's pros and cons worked for and against the project during the course of load testing, and what
actions were needed to overcome the problems.
The copyright of this paper is owned by the author(s). The author(s) hereby grants Computer Measurement Group Inc a royalty free right to publish this
paper in CMG India Annual Conference Proceedings.
1. Introduction
In recent years there have been revolutionary technology innovations which have changed the world we live in and
the way we interact and do business. These innovations have resulted in a technology transformation that is
happening at a rapid pace and is serving business and end users better and faster. One of the most talked-about
developments, which has become a reality and established a new type of service delivery arena, is Cloud
Computing. The services offered by the cloud are helping businesses move into an arena of reduced cost and of
highly available, faster, reliable and high-margin services and products, which is why businesses are aggressively
adopting cloud-based services.
Increasingly, businesses are moving the traditional on-premises deployments of their applications or products to
scalable cloud environments, which give the advantages of low cost and high availability at low maintenance. Along
with production deployments, the cloud has also benefitted application testing, especially load and performance
testing, through its Platform as a Service (PaaS) and Infrastructure as a Service (IaaS) offerings. The cloud has
been found useful for hosting load testing environments due to its ability to provide high-end servers, applications
and a number of load injectors with higher flexibility and lower costs. However, like any other service, the cloud has
its own limitations and challenges compared with conventional on-premises deployments. For example, the cloud
does not provide access to the low-level hardware configuration parameters that are important during activities such
as tuning, so tuning or optimization activities cannot be performed effectively on the cloud. Depending on the use
case and type of cloud service, those limitations can be categorized. If one wants to use the cloud for load and
performance testing at its best, one must take a holistic view of the pros and cons of the cloud environment and
define an effective strategy for using it.
2. Cloud Computing
Gartner's definition of cloud computing:
"A style of computing in which scalable and elastic IT-enabled capabilities are delivered as a service using internet
technologies." [Gartner 2014]
This definition describes cloud computing in very simple words. It is computing which is:
o Scalable and elastic – one can do dynamic provisioning of resources (on-demand),
o Accessible over the internet – accessible to end users over the internet on a wide range of devices: PCs,
laptops, mobiles etc.,
o Service-oriented – a service which is a value-add for the end user, for whom it is a black box.
2.1 Types of Cloud Services
Based on these characteristics, cloud services are classified into 3 main categories:
- Infrastructure as a Service (IaaS)
This is the most basic cloud-service model, where physical or virtual machines and other resources are offered
by the provider, and cloud users install operating-system images and their application software on the cloud
infrastructure.
- Platform as a Service (PaaS)
A computing platform, typically including an operating system, programming language execution environment,
database and web server. Application developers and testers can develop, run and test their software solutions
on a cloud platform without the cost and complexity of buying and managing the underlying hardware and
software layers.
- Software as a Service (SaaS)
In the SaaS model, cloud providers install and operate application software in the cloud and cloud users access
the software from cloud clients. Cloud users do not manage the cloud infrastructure and platform where the
application runs. This eliminates the need to install and run the application on the cloud user's own computers,
which simplifies maintenance and support.
2.2 Cloud Service Providers
Amazon, Google, Microsoft Azure, Openstack and many other vendors provide different kinds of service offerings in
the cloud arena.
2.3 Market Current Status and Outlook
Due to the inherent characteristics of the cloud, which are beneficial for business, and the attractive pricing models
offered by the service providers, cloud-based services are in enormous demand. Recent surveys by well-known
agencies show that demand for cloud-based services is getting stronger all the time.
Gartner – Global spending on public cloud services is expected to grow 18.6% in 2012 to $110.3B, achieving a
CAGR of 17.7% from 2011 through 2016. The total market is expected to grow from $76.9B in 2010 to $210B in
2016. The following is an analysis of the public cloud services market size and annual growth rates. [Cloud Market 2013]
Picture 1 – Annual growth for cloud market
3. Case Study
3.1 About the customer
The customer is a leading software company delivering Retail Solutions to market leaders across the globe. These
solutions include POS, CRM, SCM and ERP.
3.2 About the Application
The application is an enterprise-class retail solution for managing the front-end and back-end operations within a
retail store and controlling the stores from the head office through a single application.
Figure1 – Application architecture
The App server (AS) is the core application located at the head office and is responsible for managing all the stores
and for real-time processing and analysis of the data generated by the stores. The AS is also responsible for
transferring software updates to the stores through its 'Update' functionality.
Operations is the core application at every store. It is responsible for store management, maintaining the store-level
master and transactional data and exchanging it between the billing counters and the AS server. Operations takes
care of store operations, from maintaining stock inventory, pricing, promotions and store-level reports to online data
transfer to the AS server through the 'Replication client' component, and it receives the patches from the EAS
server and transfers them to the counters.
The billing counter takes care of item information and billing. All the billing data generated by the counter is stored
in the Store DB, which is finally replicated to the AS server using the 'Replication client' component at Operations.
All the applications were developed in ASP.Net and the database was SQL Server.
3.3 Performance testing requirement
This retail application has been deployed at various customers and is working fine. However, until recently the
maximum number of stores at any customer was 200. Recently the customer got a requirement in which this retail
solution would be deployed across 3000 stores. The customer had never done a deployment at such a scale and
was therefore unaware of whether the application would sustain 3000 stores and, if not, what would need to be
tuned and what kind of hardware would be required. As a first step, the customer decided to put the application
under the load of 3000 stores for various business workflows and see how it behaves. For the load testing activity
the customer came up with 5 real-life business scenarios which are used most frequently and generate a high
volume of transactions.
The customer identified the below 5 scenarios across AS, Operations and Billing counter:
- Scenario 1 – Replication
Replication of billing data from store to AS for 3000 stores.
- Scenario 2 – Billing counter
Multi-user (minimum 25 parallel counters) billing transactions, which include Bill, Sales Return, Bill Cancellation
and Lost Sales (in order of execution priority), with a maximum of 200 line items and a minimum of 20 line
items, with cash and credit card as payment.
- Scenario 3 – AS
Access the reports to be checked while data from stores (minimum 20+ stores) is being updated to AS.
- Scenario 4 – Operations
Access stock management functions with 1000+ line items with 5/10 users.
- Scenario 5 – Updates
Download of a patch for more than 100 stores simultaneously. Various patch sizes to be tested, namely 50MB,
80MB and 100MB.
4. Approach
Scenario 1, i.e. Replication, was the highest priority, as it is the most frequent operation between the stores and the
central server and handles the huge amount of data generated by the stores. The rest of this paper illustrates the
approach taken for load testing this scenario.
4.1 Scenario
Replication of data from store to server for 3000 stores. Each store would have 100 billing counters and each
counter generates bills with 200 line items.
4.2 Scenario Architecture
Figure2 – Replication scenario architecture
This replication scenario has 3 sub activities,
1. Collation of billing data from all the counters and generate the xml message files.
2. Transfer the xml message files from store to server (replication client -> replication server).
3. Extract the xml files and store the extracted billing data on the head office database.
It was decided to take a pragmatic approach to simulating the entire scenario: first simulate each of the above steps
in isolation, and then go for the end-to-end mixed execution. The first candidate was the transfer of the xml files
from the replication clients located in 3000 stores to the replication server at the head office. The rationale behind
selecting this particular step first was that step 1 is a 'within a store' process, which would have at most 100
counters per store, so the maximum load for this step at any given point would be no more than 100. Step 2 is the
point at which the actual load of 3000 stores comes into the picture, so it was decided to start with that particular
step.
4.3 Test Harness setup
To simulate this scenario a test harness was created which had 5 parts,
1. xml messages folders on injector machine
2. Vb based replication client (.exe) on injector machine
3. IIS and sql server based replication server
4. xml message folder on the head-office server and
5. Perfmon setup for monitoring the resource consumption on the AS as well as load injectors.
Folder structure on the store and head office was as below,
Picture2 – Message folder structure on replication client and server
XML messages to be transferred are placed in the 'OutBox' folder on the replication client at the store side, and
messages which have been received are placed in the 'Inbox' folder on the replication server at the head office.
Each store has 100 xml messages of 2 MB each in the outbox folder, each containing billing data for 100 line items.
The replication client was a VB-based .exe file which was executed through the command line / a .bat file, passing
the server IP and the XML message folder name at the client/store end as arguments.
Command:
start prjReplicationUpload20092013-1.exe C:\ReplicationUpload\ReplicationUpload:10.0.0.35:S000701:100:S000701:20130812-235959(1)
prjReplicationUpload20092013-1.exe: application file name for the 1st store
10.0.0.35: server IP
S000701: store folder at the server end
20130812-235959(1): XML message folder at the client end
It was not feasible to set up and manage 3000 actual store machines to inject the load, so multiple stores had to be
simulated from a single load injector box. This was achieved using the Windows batch utility. Multiple copies of the
EXE file were created with different names to represent the number of stores considered for data replication.
Picture 3 – Multiple copies of replication utility
A batch file was created to execute all the exes one after another in a sequence (an illustrative sketch is given below).
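The project itself used a Windows batch file for this; purely as an illustration of the same idea, the following Java sketch launches the per-store replication client copies sequentially and notes the overall start time. The class name, the per-store loop and the argument string (copied from the single-store example above) are illustrative assumptions, not the project's actual script.

// Illustrative only: launch the per-store replication clients one after another, as the
// batch file described above does, and record the start of the transfer window.
import java.io.File;

public class LaunchReplicationClients {
    public static void main(String[] args) throws Exception {
        long startTime = System.currentTimeMillis();   // start of the end-to-end transfer window
        int storesPerInjector = 100;
        for (int store = 1; store <= storesPerInjector; store++) {
            String exe = "prjReplicationUpload20092013-" + store + ".exe";
            // Argument shown for illustration; each store would use its own folders and IDs.
            String arg = "C:\\ReplicationUpload\\ReplicationUpload:10.0.0.35:S000701:100:S000701:20130812-235959(" + store + ")";
            Process p = new ProcessBuilder(exe, arg)
                    .directory(new File("C:\\ReplicationUpload"))
                    .inheritIO()
                    .start();
            p.waitFor();                               // "one after another in a sequence"
        }
        System.out.println("All clients launched; transfer window started at " + startTime);
    }
}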
The next question was how to calculate the time taken for the entire message file upload operation when multiple
copies of the replication client are fired, uploading xml messages to the replication server simultaneously. The best
way to calculate the end-to-end data transfer time was to measure from the moment the first replication exe is
triggered to the moment the last xml message file is uploaded to the replication server.
5. Test Setup
For the server configuration it was decided to go ahead with the same configuration that had been used for existing
customers and, based on the results of these tests, to perform the server sizing and capacity planning activity.

AS Server configuration:
Operating System: Windows Server 2012 DataCenter
Web-Server: IIS 8
Number of Cores: 4
RAM: 28 GB
Network Card Bandwidth: 10 Gbps
Table1 – AS Server configuration

Database Server configuration:
Operating System: Windows Server 2012 DataCenter
Web-Server: IIS 8
Number of Cores: 4
RAM: 7 GB
Network Card Bandwidth: 10 Gbps
Table2 – DB Server configuration

This hardware configuration was not available in-house and needed to be either procured or rented for this activity.
Considering the short span of the test execution phase, it was decided to rent this hardware from the local market.
5.1 Load Injectors
Finding the size and required number of load injectors was tricky. As mentioned above, it was not feasible to set up
and manage 3000 actual store machines to inject the load, so it was necessary to initiate the load of multiple stores
from a single load injector box. With this approach it was essential to make sure that the load injector itself was not
overloaded, and the number of injectors had to be kept to an optimum so that the injector management effort
remained small and feasible.
To arrive at the required number of injectors, sample tests were conducted by running multiple copies of the
replication client from a single injector using the Windows batch file. The number of replication clients was gradually
ramped up until the injector CPU reached 70%. A single injector with an Intel P4 processor and 2 GB RAM
supported 100 instances of the replication client, which means that 30 load injectors are required to initiate the load
of 3000 stores. That many machines were not available for load testing in the local environment, so the option of
reducing the number of injectors by increasing the hardware capacity was evaluated. However, this option was not
commercially or logistically viable, as such high-end machines could not be arranged. Considering this, it was
decided to go ahead with the machine configuration used for the sample test, as it was a normal configuration
whose availability and cost would be affordable.
5.2 Rented vs Cloud-based load injectors
Two options were at our disposal: either rent the load injectors and servers in the local market, or see whether the
test could be performed in a virtual cloud environment. Costing was obtained for the rental option from the local
market, and for the cloud-based virtual environment multiple vendors were evaluated, such as Amazon cloud and
Microsoft Azure.
A total effort of 15 days was originally planned for the execution of this particular scenario. For local renting the
minimum rental duration was 1 month, with costs of:
Client - $50 per month per machine
App Server - $150 per month per server
Database - $50 per day per server
In the case of the cloud, a flexible on-demand costing option was available. For the on-demand cost calculation, a
detailed usage pattern was defined for the load injectors and servers for those 15 days.
Machine              Number of Instances   Number of days required   Usage            Activity
Setup machines       2                     15                        12 hrs per day   Environment setup and sample runs
Load Injectors       30                    5                         12 hrs per day   Execution of 3000 stores
Application Server   1                     15                        12 hrs per day   Sample and actual runs
Database Server      1                     15                        12 hrs per day   Sample and actual runs
Table3 – Usage pattern for machines during design and execution of scenario1
Based on the above usage pattern, the costs of the Amazon and Microsoft Azure setups were calculated and
compared with the local renting option as below:

Virtual Machines   Instances   Microsoft Azure ($)   Amazon ($)   Local Renting ($)
Load injectors     30          648                   858          1500
Setup machines     2           86.4                  547          100
AS Server          1           183.6                 270          150
DB Server          1           442.8                 98           50
Total                          1360                  2055         1800*
Table4 – Cost comparison between Azure, Amazon and local renting
*Cost includes only hardware. OS on clients and servers and SQL Server licenses are charged separately.

Among the clouds, Microsoft Azure was cheaper than Amazon and also had the added benefit of 5 GB of free data
upload and download from the cloud, which was just 1 GB in the case of Amazon.
Microsoft Azure also stood out as the winner in the cost comparison with the local renting option. Apart from the
hardware cost, local renting had the additional cost of OS and SQL Server licenses.
5.3 Microsoft Azure Load Test Environment
Figure3 – Load test setup at Azure and local environment
An isolated environment was set up in the Azure cloud comprising the replication server on the AS, the database
server, 30 load injectors and 2 setup machines. Considering the high volume of transaction traffic, a 10 Gbps LAN
was set up for the load testing environment in Azure. This environment was accessed through the controlling clients
set up in the local environment over RDP connections.
To control and manage the 30 load injectors in the Azure environment, 6 controlling clients had to be set up in the
local environment. From each controlling client, 5 load injectors were accessed to set up and execute the tests and
capture the result data.
6. Test Execution and Results Analysis
6.1 Initial Test Results
After setting up the test environment, test execution was started with a small number of stores. Based on the results
of each test run, the store load was gradually ramped up. The first test was conducted for 100 stores and was
successful. The number of stores was then gradually increased across tests from 100 to 200, 500, 700 and 800. Up
to 700 stores, all xml files from the stores were transferred to the replication server; however, during the 800-store
test, a number of stores started failing. A few more tests with 1000 and 1600 stores were also conducted to analyse
the failures. The table below summarises the results.
Stores #         Successful Stores #   Failed Stores #   Start Time (HH:MM:SS)   End Time (HH:MM:SS)   Total Time (HH:MM:SS)   Status
100              100                   0                 6:41:00                 6:43:00               0:02:00                 Pass
200              200                   0                 11:27:00                11:30:00              0:03:00                 Pass
500              500                   0                 13:28:00                13:35:00              0:07:00                 Pass
700 : Round 1    700                   0                 8:37:00                 8:46:00               0:09:00                 Pass
700 : Round 2    700                   0                 14:03:00                14:15:00              0:12:00                 Pass
800 : Round 1    700                   100               6:38:00                 6:48:00               0:10:00                 Fail
800 : Round 2    702                   98                9:46:28                 10:02:00              0:15:32                 Fail
1000 : Round 1   954                   46                12:57:00                13:12:00              0:15:00                 Fail
1000 : Round 2   906                   94                12:28:00                12:45:00              0:17:00                 Fail
1600             1300                  300               7:49:00                 8:05:00               0:16:00                 Fail
Table5 – Test results summary of scenario1 on Azure
It was observed that beyond 700 stores the replication scenario behaviour was inconsistent. To ascertain the reason
for the failures, the resource consumption data on the replication server was analysed further. For this detailed
analysis, parameters for each hardware resource were identified: %CPU utilization, available memory, % disk
queue length, % processor queue length and network bandwidth.
Table 6 – Resource Utilization Analysis
This analysis highlighted that when all the stores start the replication activity, the server disk becomes saturated and
thus the processor and disk queue lengths build up beyond their threshold values, which results in inconsistent
behaviour and failures.
Based on this analysis, it was decided to upgrade both of these hardware resources if possible, or at least the disk
speed, which was the main culprit for the failures. The current configuration of these two resources was 4 cores and
a 10k RPM disk. To do a stepwise scaling of these resources, it was decided to upgrade to 6 CPU cores and a
15k RPM disk.
These new hardware requirements were checked with Microsoft Azure to see whether more cores and a
higher-speed disk could be made available. It was found that the number of cores could be upgraded to 8, but not
the disk speed. The reason was that all the instances in the disk array had the same speed, and it was impossible
for Microsoft to arrange a higher-speed disk for our testing. This was the show-stopper for further testing and an
important revelation of a limitation of the cloud environment: performance tuning activity, which requires a lot of
configuration changes at the underlying hardware layer, cannot be performed efficiently in a cloud environment
where the resources are shared and cannot be changed.
6.2 Moving to a Physical Server in the local environment
After this revelation of the limitations of Microsoft Azure, it was decided to evaluate other options, even though they
are costly compared with Azure: the Amazon cloud and rented physical servers in the local environment. The
Amazon cloud had the option of faster disks as well as more CPUs. However, based on the experience with
Microsoft Azure, it was decided to rule out the cloud option: even if Amazon provided a higher-speed disk, there
might be further limitations on other resources and their tuning.
Considering the entire situation and the limitations of the cloud environment, it was decided to rent a server from the
local market with the configuration below.
SERVER
Operating System: Windows Server 2012 DataCenter
Operating System Type: 64-bit
Processor: Intel® Xeon® CPU E5-2630 [email protected]
Web-Server: IIS 8
Number of Cores: 6
RAM: 28 GB
Table7 – AS server configuration for local environment
Fortunately, this hardware configuration was available with a local vendor; however, the next challenge was to
arrange 30 load injectors, which were not available in the load test environment and could not be made available by
the vendor at short notice. Given the limited time in hand, it was decided to use machines from other teams, out of
office hours, to carry out the further tests.
Having overcome all the challenges, the tests were carried out in the local environment and, as anticipated,
3000-store replication worked without any hitches.
Store #   Successful Stores #   Failed Stores #   Total Time (in minutes)   Status
100       100                   0                 1                         Pass
200       200                   0                 2                         Pass
500       500                   0                 3                         Pass
700       700                   0                 4                         Pass
1000      1000                  0                 8                         Pass
1500      1500                  0                 15                        Pass
2000      2000                  0                 16                        Pass
2500      2500                  0                 21                        Pass
3000      3000                  0                 29                        Pass
Table8 – Test results summary for scenario1 on local environment
7. Challenges / issues faced in the Cloud during execution
Apart from the limitation on configuration changes in the cloud environment, a few other challenges were faced
during the course of the execution. Most of these were due to the fact that a large number of load injectors had to be
managed and the mode of access, i.e. RDP, was slow over the internet. A few of them are mentioned below.
7.1 Switching between injectors to initiate the test
Due to the nature and design of the replication client, there was no central utility or application available to initiate
the load from all 30 injectors automatically, the way most load testing tools do. Test initiation had to be done
manually by logging in to the individual boxes. To accurately simulate the real-time behaviour of the replication
scenario, it was necessary to maintain high concurrency during execution, and to achieve this the load had to be
initiated from all 30 injectors at the same time, or at least with a very short delay. To facilitate this, switching
between injectors through the controlling machines had to be very fast. This switching between load injectors over
RDP connectivity was tedious; it would have been very easy if there had been 30 controller machines, each
managing a single injector, but that was not the case in this particular scenario.
7.2 Test data setup
Test setup included various tasks, and one of the difficult ones was creating the folder setup for the 100 stores per
injector. During each execution cycle it was necessary to use unique bill numbers for the billing data, so the
message folders had to be updated before each test cycle with a unique bill number in each store folder. Setting up
folder structures on 30 client machines to simulate 3000 stores was a tedious and time-consuming task, and
performing it over RDP added more complexity and time.
7.3 Monitoring
It was also necessary to monitor the health of each load injector during execution to make sure that it was not
overloaded. Keeping an eye on the resource consumption of 5 load injectors from a single controlling machine was
challenging.
7.4 Data transfer
These tests generated a huge amount of result data, including the resource utilization data. In the absence of
applications such as Microsoft Excel, and given the slow speed of the RDP connectivity, it was difficult to perform
the analysis of this data on the cloud machines themselves, so the data had to be downloaded for every test run.
Downloading a large amount of data over the internet connection was a time-consuming process, and costs were
involved beyond the data transfer limit stipulated by Microsoft.
8. Conclusion and outlook
For load and performance testing, the cloud provides an edge over conventional on-premises test setups. One can
take advantage of the cloud to build test environments within a very short period of time, with cost and logistical
flexibility. However, along with all these advantages, there are disadvantages and challenges with the cloud
environment which can severely hamper the purpose. A limitation such as no access to the underlying hardware
configuration parameters does not suit activities such as bottleneck identification, tuning and optimization. Managing
the cloud setup remotely and transferring data over the internet add further complexity and delays to the overall
schedule.
One can consider the cloud for load and performance testing; however, it is recommended that all these pros and
cons be studied in detail, in the context of the load testing requirement, to see what can and cannot be performed on
the cloud, and to define the load testing strategy on that basis.
References
[Gartner 2014] http://www.gartner.com/it-glossary/cloud-computing/
[Cloud Market 2013] Gartner survey, http://www.forbes.com/sites/louiscolumbus/2013/02/19/gartner-predicts-infrastructure-services-will-accelerate-cloud-computing-growth/
Building Reliability in to IT Systems
Kutumba Velivela / Ramapantula Uday Shankar
Tata Consultancy Services
[email protected]/ [email protected]
Information Technology (IT) system reliability is in critical focus for government and
business because of its huge cost and reputation impacts. A well designed application
will not only be failure-free but will also allow failures to be predicted so that preventive
maintenance can take place. It will also have adequate resilience, capacity, security
and data integrity.
IT reliability covers all parts of the system, including hardware, software, interfaces,
support setup, operations and procedures. Due to the complexity in each of these
areas, organisations are giving priority to developing end-to-end reliability-specific
capabilities. These capabilities can be delivered under the headings of assessment,
engineering, design, modelling, assurance and monitoring.
In this paper, we propose formal methods for developing a reliability centre of
excellence, with a customised maturity model, that will guarantee 5-9s availability for
critical business functions. Positive effects of this approach, other than giving peace of
mind to senior managers, include a reduction in the frequent re-design of applications,
a positive culture change within the organisation and an increase in market share.
Keywords: IT Availability Management, Reliability, Centre of Excellence,
Assessment, Engineering, Design, Modelling, Assurance, Monitoring,
Metrics, Error Prevention, Fault Detection, Fault Removal, Service Level
Agreement (SLA), Maintainability.
1. Introduction
Technology Service Disruptions are costly…
“Availability Management is responsible for
optimising and monitoring IT services so that they
function reliably and without interruption, so as to
comply with the SLAs, and all at a reasonable
cost.”[ITIL OSIATIS]
Technology services failures have been making
news headlines for the last few years, causing
extreme impacts in well-established businesses
and government departments. Payment/ATM
failures, travel disruptions, cancelled medical
operations, huge trading losses, reduced defence
security, smart mobile blackouts and unpaid wages
headed the top technology disasters of the last few
years. Affected organisations include the US
Government, NHS, Walmart, Bank of England,
M&S, Natwest, LBG, stock exchanges, airlines,
utilities and car manufacturers [Colin 2013] [Phil
2011] [Phil 2012]. A research summary on reasons
for IT system unavailability is included in
Appendix A.
Figure 1 – Costs and other impacts of service disruptions. The figure reports recovery from downtime taking 1.13 to
27 hours, a maximum tolerable downtime of 52.63 minutes, an average of 3.5 major disaster events per year (not
including medium and minor), and per-hour downtime costs ranging from £5,721 to £457,500 across small, medium
and large companies (Source: Aberdeen Group, May 2013). It also notes that 1 in 4 small companies close down
due to a major IT systems failure and that 70% of small firms go out of business within a year of a major data loss
(Source: HP and SCORE Report).
The copyright of this paper is owned by the author(s). The author(s) hereby grants Computer Measurement Group Inc a royalty free
right to publish this paper in CMG India Annual Conference Proceedings.
Even companies that have had no major failures
are hit by ever-increasing hardware/software
maintenance costs and delayed software
deliveries. Operations teams are not able to cope
when there are unexpected increases in faults.
Essential services are being shut down with no
prior notice due to communication failures and/or
process failures. Backups and switching to
redundant systems often do not work when
needed.
The reliability of technology and IT systems has
been the primary concern of installation designers
and maintenance teams for more than 50 years.
But, due to millions of dissatisfied customers, loss
of data, fraud write-offs, regulatory fines and
criminal/civil penalties, technology reliability has
become a major concern for business/IT account
managers, business analysts, IT strategists /
architects, designers and testers. This is even
more the case with safety-critical applications,
24x7 web sites, systems software, embedded
systems and other "high-availability must"
applications.
This paper presents reliability-specific offerings
that organisations can adopt for preventing errors,
detecting and removing faults, maximising
reliability and reducing the drastic impacts of
failures. By using this paper as a roadmap,
businesses can build IT reliability skills which
provide additional peace of mind to senior
management.
2. Background
Reliability is an important, but hard to achieve,
attribute of IT system quality. These attributes are
normally covered under non-functional
requirements in the early stages of projects.
Reliability analysis methods help identify critical
components and quantify their impact on the
overall system reliability. Employing this sort of
analysis early in the lifecycle saves a large
percentage of the maintenance and production
support budget.
Hardware-specific reliability and related methods
originated in the aerospace industry nearly 50
years ago and subsequently became 'must-use' in
the automotive, oil & gas and various other
manufacturing industries. Arising from this
appreciation of the importance of reliability and
maintainability, a series of US defence standards
(MIL-STDs) were introduced and implemented
around the 1960s. Subsequently the UK Ministry of
Defence also introduced similar standards.
Reliability methods have successfully allowed
hardware products to be built to satisfy high
reliability requirements and the final product
reliability to be evaluated with acceptable
accuracy. In recent years, many of these products
have come to depend on software for their correct
functioning, so the reliability of combined
hardware + software components has become
critically important. Even pure IT applications
depend on the hosting data centre, servers and
other components being reliable. Hence, software
reliability has become an important area of study
for software engineers.
Even though they are still maturing, reliability
methods have been adopted either as a standard
or as a best practice by a few large organizations.
In some of these it is a regulatory requirement for
IT systems to be certified to meet previously
specified availability and reliability requirements.
Many other organisations are yet to take up this
standard and reap the rich rewards of focusing on
the availability criteria that all critical IT
development processes must comply with.
3. Reliability Engineering
5-9s = 99.999% Availability
In order to meet rising customer expectations for
quality software running 24x7, often specified as
5-9s in requirements, there is a need for a
fundamental shift in the way IT applications are
developed and maintained. Detailed hardware and
software reliability requirements need to be
documented, and special focus has to be given to
meeting them from the design through to the
implementation stages.
The reliability skills proposed in this paper offer a
comprehensive approach for addressing all IT
reliability-related issues, including capacity,
redundancy, data integrity, security and
maintainability.
Critical applications developed without a proper
reliability approach lead to frequent partial
re-designs or full re-development because they
become cumbersome to maintain. A well designed
application will either be failure-free or will allow
failures to be predicted so that preventive
maintenance can take place. If a failure has a
safety or environmental impact, the system must
be preventively maintainable, preferably before the
failure starts disrupting production.
New reliability-specific capabilities will help
businesses substantially shift from reacting to
failures when they happen to pro-actively
managing them through approaches like Reliability
Centred Design and Analysis (RCDA), covered in
more detail in the next section.
Setting up a separate reliability Centre of
Excellence (CoE) will not only help directly
enhance business image and customer
satisfaction, but also indirectly contribute to an
increase in market share and cost savings.
Developers who have applied these methods have
described them as "unique, powerful, thorough,
methodical, and focused." The skills developed are
highly correlated with attaining best-in-class levels
4 and 5 of the Capability Maturity Model. Based on
experience from multiple projects, when done
properly, Software Reliability Engineering adds a
maximum of approximately 2-3% to project cost.
Reliability engineering is a continuous process, as
the analysis may have to be repeated as more IT
system releases are delivered. On-going
improvements in fault-tolerant and defensive
programming techniques will be required to meet
the business's expected reliability targets.
3.1 Reliability-Specific Capabilities
A Reliability CoE focusses on the related business
issues and helps customers efficiently meet their
expectations. A combination of offerings can be
provided under the major headings of:
- Reliability Engineering,
- Reliability Assessment,
- Reliability Modelling,
- Reliability Centred Design and Analysis (RCDA),
- Software Reliability Acceptance Testing,
- Reliability Analysis and Monitoring using
  specialist tools.
The methods used under these headings are
fundamentally similar, but reliability offerings often
have to be customised depending on the stage of
development. For example, a reliability
assessment offering will apply mainly to existing
applications and will need some modelling, use of
tools and some testing. Similarly, a reliability
engineering offering applies to new or redesigned
applications and will need some assessment,
modelling, RCDA, testing and tools use.
3.1.1 Reliability Engineering
Reliability Engineering involves defining reliability
objectives and adapting required fault prevention,
fault removal and failure forecasting modelling
techniques to meet the defined objectives all
through the development lifecycle. The emphasis
is on quantifying availability by planning and
guiding software development, test and build
processes to meet the target service levels. A
collaborative culture change is needed in solution
architecture, application development, service
delivery, operational and maintenance teams to
implement this approach.
Fault prevention during the build requires better
development and test methods that reduce error
occurrences. Smart error handling and debugging
techniques are to be adopted during design and
test reviews so that faults are removed at the
earliest possible time. By modelling occurrences of
failures and using statistical methods to predict
and estimate the reliability of IT systems, more
focus can be given to high-risk components and
Single Points of Failure (SPOFs). Refer to Figure 2
for a representation of the engineering
components.
Figure 2 – Reliability Engineering Components: fault prevention during build, fault removal through inspection and
testing, and failure forecasting and modelling, connected by feedback loops and underpinned by fault tolerance and
defensive programming techniques.
3.1.1.1 Reliability Engineering Techniques
Popular hardware techniques include redundancy,
load-sharing, synchronisation, mirroring and
reconciliation at different architecture tiers.
Software techniques include modularity for fault
containment, programming for failures, defensive
programming, N-version programming, auditors,
and transactions to clean up state after failure.
3.1.2 Reliability Assessment
Reliability assessment can be conducted at the
level of multi-location systems, single data centres,
services, servers and/or components. The diagram
below shows three popular assessment methods
and how they can be implemented together in a
continuous improvement scenario. Each of the
approaches can also be implemented on its own
as a one-off exercise, depending on the life cycle
stage the IT system is in.
Figure 3 – Reliability Assessment Methods: metric-based reliability analysis (evaluation based on function points,
complexity, development process and testing methods), architecture-based reliability analysis (evaluation of the IT
component reliabilities and the system architecture) and black-box reliability analysis (estimation of reliability based
on failure observations from testing or operation), connected by feedback loops.
Architecture-based reliability analysis focuses on
understanding the relationships among system
components and their influence on system
reliability. It is based on identifying critical
components/interfaces and concentrating on the
potential problem areas and SPOFs. It assumes
that the reliability and availability of an IT system
can be derived from the corresponding
measurements of its reusable hardware/software
components. Figure 4 gives an example in which
component availabilities of 97.99%, 98.99%,
99.9%, 98.99% and 98% for an intranet, web
server, two application servers and a database
server combine into an end-to-end availability of
97.7%.
Figure 4 – Measuring reliability by components
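The paper does not spell out the arithmetic behind Figure 4; as a standard illustration of how such component figures are usually combined (an assumption, not the authors' own formula), availabilities for components in series and in parallel can be computed as:

% Standard combination rules for component availabilities (illustrative, not from the paper);
% A_i denotes the availability of component i.
\[
A_{\text{series}} = \prod_{i=1}^{n} A_i
\qquad\qquad
A_{\text{parallel}} = 1 - \prod_{i=1}^{n} \left(1 - A_i\right)
\]
% Example: two redundant components with A_1 = 0.999 and A_2 = 0.9899 give a combined
% availability of 1 - (0.001)(0.0101) \approx 0.99999.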
Metric-based reliability analysis is based on static
analysis of the hardware/software complexity and
of the maturity of the design and development
process and conditions. This approach is
particularly useful when no failure data is
available, for example when the new IT system is
still in the design stages. IEEE developed the
standard IEEE Std. 982.2 (1988), and a few other
product metrics are available to support reliability
assessors in achieving optimum reliability levels in
software products. Similar vendor-supplied
reliability data is available for the hardware
components and third-party components used.
The black-box approach ignores information about
the internal structure of the application and the
relationships among system components. It is
based on collecting failure data during testing
and/or operation and using such data to
predict/estimate when the next failure will occur.
Black-box reliability analysis evaluates how
reliability improves during testing and varies after
delivery. As pointed out in Appendix A, not
adopting best practices in the long-term monitoring
of the relevant components is one of the major
reasons for IT unavailability.
A combination of these methods will be required
for IT systems that require high levels of reliability.
3.1.3 Reliability Modelling
Over 200 models have been developed to help IT
project managers deliver reliable software on time
and within budget. A good practical modelling
exercise can be used to initiate enhancements that
improve reliability from the early development
phases. Based on predictive analytics concepts,
different models are used depending on the type
of analysis needed:
- predict reliability at some future time based on
  past historical data, even during the design
  stages,
- estimate reliability at some present or future
  time based on data collected from current tests,
- estimate the number of errors remaining in
  partially tested software and guide the test
  manager as to when to stop testing.
Like performance models, no single reliability
model can be used in every situation, because the
models are based on a number of assumptions,
parameters, mathematical calculations and
probabilities. The modelling field is fast maturing,
and carefully chosen models can be applied in
practical situations and give meaningful results.
3.1.4 Reliability Centred Design and Analysis
(RCDA)
Reliability should be designed in at the IT strategy
level, and a formalized RCDA methodology is
needed to reduce the probability and consequence
of failure. Various published statistics show that a
large percentage of failures can be prevented by
making the needed changes at the design stage.
Successfully implemented RCDA can result in
improved productivity and reduced maintenance
costs.
The focus of RCDA throughout the life cycle is to
ensure services are available whenever business
users need them. For that to happen, IT capacity
has to be aligned to business needs, sufficient
redundancy has to be built in so that critical
services still run during significant failures, and
data integrity/confidentiality has to be maintained
at all times. Below is a high-level flow diagram that
shows the sequence of basic steps to be followed
as part of RCDA:
Figure 5 – Basic Steps in RCDA for an IT system: reliability requirements and specification; key component analysis
using RBDs; failure mode effect analysis for key components; update capacity and performance plan; update
security and data integrity plan; update resiliency/failsafe/backup plans; review error handling/reporting/diagnostic
techniques; review operational profiles, fault tree and event tree diagrams; update production support/maintenance
plans; review validation/verification reports; then either certify the system if the acceptance criteria are met, or
re-design.
3.1.4.1 Load-balancing and Failover
Reliable IT systems should be housed in highly
secure and resilient data centres, and the solutions
should be built around a redundant architecture
able to ensure hardware, network, database and
power availability as needed. The latest
active/active failover, recovery and continuity
mechanisms should be considered to help meet
high business availability requirements.
However, IT architects need to be careful when
employing complex redundant solutions, as these
can often be the source of major failures. Some of
the latest major business IT failures have been
due to the incorrect setup or inadequate testing of
complex redundancy and backup solutions.
3.1.4.2 Other Design Factors
Business will not accept IT systems just because
they are available 24x7. Reliable IT systems must
meet various business-specified requirements,
including performance, capacity to match business
growth, security, data integrity and on-going
maintainability.
[Evan 2003] identified Top-20 Key High Availability
Design Principles that range from removing Single
Points of Failure to keeping things simple. This
kind of analysis will guide reliability designers and
architects in developing customised best practices.
3.1.5 Reliability Acceptance Testing
Like all other non-functional requirements,
reliability and availability of IT systems need good
validation and verification phases. However,
traditional software development and testing often
focus on the success scenarios, whereas
reliability-specific testing focuses on things that
can go wrong. New testing methods focus on
failure modes related to timing, sequence, faulty
data, memory management, algorithms, I/O, DB
issues, schedule, execution and tools.
Figure 6 – Example Assurance Team Structure: an assurance facilitator working with the solution architect,
reliability engineer, service delivery, production support and technology suppliers.
Some of the methods that guide these tests are
Reliability Block Diagrams (RBDs), Failure Mode
Effect Analysis (FMEA), Fault Tree Analysis,
Defect Classification, Operational Profiles and
error handling/reporting functions. These methods
help testers develop reliability-specific test cases
during the integration, user acceptance,
non-functional, regression and deployment test
phases.
Some sectors need their IT systems to be certified
along with the hardware components, and they
need reliability-based acceptance criteria to be
defined and met before releasing any changes into
production. Given a component of an IT system
advertised as having a particular failure rate, the
assurance team can analyse whether it meets that
failure rate to a specific level of confidence.
Figure 7 – Assurance Criteria Example: accept, continue and reject regions plotted as failure number against
normalized failure time.
3.1.6 Reliability Monitoring and Analysis using
Specialist Tools
Reliability is measured by counting the number of
operational failures and assessing their effect on
IT systems at the time of failure and afterwards. A
long-term measurement programme is required to
assess the reliability of critical systems. Some of
the well-known software reliability metrics that can
be used include Probability of Failure on Demand
(POFOD), Rate of Occurrence of Failure (ROCOF),
Mean Time to Failure (MTTF), Mean Time
Between Failures (MTBF) and Mean Time to
Repair (MTTR).
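The paper does not define these metrics formally; the following standard relations (an illustration, not taken from the paper) show how they are usually computed from observed failure data for a repairable system:

% Standard definitions, given n failures observed over a measurement window (illustrative).
\[
\mathrm{ROCOF} = \frac{n}{\text{total operating time}},
\qquad
\mathrm{MTBF} = \mathrm{MTTF} + \mathrm{MTTR},
\qquad
\text{Availability} = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}
\]
% Example: MTTF = 500 h and MTTR = 2 h give an availability of 500/502 \approx 99.6\%.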
Most of the analysis mentioned above can be
performed using office tools by an experienced
analyst. However, a few specialized tools and
workbenches are available that help in completing
different types of analysis, including reliability
modelling and estimation/prediction. A partial list of
these tools is available in the references [Kishor
2013, Goel 1985]. Prediction/estimation using
these tools needs a good understanding of
analytics methods and basic probability theory.
The reliability specialist team has to master the relevant tool skills before recommending any of them to the customer area. Tool-related skills often become a continuous source of budget and revenue for the CoE over prolonged periods.
4. How to Set Up a Reliability CoE?
There is no single fixed method for setting up a Reliability CoE and, whichever way is chosen, it will not be a simple journey. Constructing any niche team requires commitment, hard work and support from all stakeholders. The sample model below shows some of the factors that bring maturity to the CoE organisation.
[Figure 8 – Example CoE Maturity Model, with central headings such as people, quality, process, tools, thought leadership, governance, efficiency, innovation and collaboration, and interfaces to customers, partners, vendors and regulation]
The model above shows nine sample central headings and four interface headings. In a real model, these headings should be chosen in consultation with senior management and other stakeholders.
A maturity model similar to the above can be used as the basis for a 'CoE development plan of action' and as a means of tracking progress against targets.
4.1 Strategy
Most organisations prefer to start with small steps when it comes to new CoEs and to customise the approach as the concept catches on with more partners and customers. Here is a list of generic steps that can be followed:
- Consult with industry sponsors and outside partners,
- Appoint talented leadership with a high level of business knowledge,
- Establish a vision for the reliability practice,
- Identify software reliability champions internally and in customer areas,
- Define the organisation structure and secure funding,
- Start building a knowledge repository and sharing mechanisms,
- Develop an action plan for each of the areas mentioned in the maturity model,
- Develop strict metrics for each area mentioned in the maturity model,
- Evaluate, select, and mandate vendor products and standards,
- Collaborate with other IT consultancy areas to create reusable assets,
- Set up review and approval mechanisms for deliverables,
- Seek feedback and use it for continuous improvement,
- Encourage innovation and allow challenging of the status quo,
- Customise to fit different customer cultures.
4.2 Processes
IT processes are often constrained by resources, a backlog of projects, governance processes and controls, and a lack of focus on security and maintainability, and so fail to deliver the objectives set for them. Beyond generic processes such as project management, software engineering, and marketing, a Reliability CoE needs the following for quick delivery of its availability objectives:
- an agile assessment, modelling, testing and measurement process for reliability,
- techniques that focus on error prevention, fault detection and removal,
- a process that adapts to real-time, online/web and batch applications,
- an early defect/SPOF detection framework supported by a comprehensive error handling process,
- a knowledge repository and reliability governance programme,
- adaptation programmes to find better ways of working with partners, vendors and governmental departments,
- processes to identify areas that need less effort but are likely to have bigger outcomes,
- a review process with the aim of continuous improvement.
4.3 Technologies
Reliability technologies are evolving fast, but there are currently no uniformly recognised, mature ones. Most companies have their own selection of products and methods that fall within their own comfort zone. This means that a thorough assessment with customer engagement, followed by a proof of concept (POC), is needed before adopting these technologies in customer areas. The diagram below shows where customer engagement and the POC fit in an IT technology lifecycle.
[Figure 9 – Reliability Technology Selection Process: Engage (engage business and IT in technology selection), Assess (assess stakeholder requirements; confirm against the business case and standards), POC (build a POC with vendor support, under governance), Architect (architect the solution), Build (build the solution). Notes: choose technologies adaptable to customer scenarios; build solutions that scale for growth.]
4.3.1 Tools
A few suites of tools and workbenches are available that support the reliability analyst in documenting Reliability Block Diagrams (RBDs), Fault Tree Analysis, Markov modelling, Failure Mode and Effect Analysis (FMEA), root cause analysis, Weibull analysis, availability simulation, Reliability Centred Maintenance and life-cycle cost analysis.
The focus of these tools has mostly been hardware reliability, but recently they have been adapted for IT infrastructure, software and process components. A few software-specific tools are available that help with software reliability modelling, statistical modelling and estimation, and software reliability prediction [Kishor 2013], [Allen 1999]. A small hand-worked RBD availability example is sketched after this section.
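To illustrate the kind of calculation an RBD or availability simulation tool automates, the sketch below (added for illustration, not from the paper) computes steady-state availability for a hypothetical system of two application servers in parallel, in series with a single database, using the standard combination rules A_series = A1 x A2 and A_parallel = 1 - (1 - A1)(1 - A2). The component availabilities are made-up values.

from functools import reduce

def series(*avail):
    """Availability of components in series: all must be up."""
    return reduce(lambda acc, a: acc * a, avail, 1.0)

def parallel(*avail):
    """Availability of redundant components: at least one must be up."""
    return 1.0 - reduce(lambda acc, a: acc * (1.0 - a), avail, 1.0)

# Hypothetical component availabilities
APP_SERVER = 0.995   # each of two redundant application servers
DATABASE = 0.999     # single database instance

system = series(parallel(APP_SERVER, APP_SERVER), DATABASE)
downtime_hours = (1.0 - system) * 8760  # expected downtime per year

print(f"System availability: {system:.6f}")
print(f"Expected downtime:   {downtime_hours:.1f} hours/year")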
4.4 People
The supply of people with proven, practical reliability analysis experience is very limited. Because of this, companies need to find people with some of the required skills and train them in the rest. The figure below shows a reasonable proportion of skills needed in a reliability CoE.
[Figure 10 – Proportion of skills in a Reliability CoE]
Beyond generic roles such as project manager, business analyst, architect and operational analyst, a few companies recruit specialist Reliability Managers and Reliability Analysts. Sample position descriptions for these roles are provided in Appendix B.
In general, staff with 6-10 years of experience in 3-4 areas of the list below could be trained into the specialised reliability roles:
- Capacity Management,
- Service Level Management,
- Configuration Management,
- Change Management,
- Test/Release Management,
- Incident Management,
- Production Support and Operations,
- Maintenance Management,
- Product Life Cycle Management,
- Vendor Management,
- Resilience and Disaster Recovery,
- Supply Chain Management,
- Asset Management.
When the focus is on a particular IT application, participation from SMEs in the areas of business functions, hardware, network, process, security, software, tools, data, operations, and maintenance would be needed.
5. Conclusion
IT organisations must focus on what is going on in business areas and customise their practices to help businesses meet their requirements for system availability and reliability efficiently. A good set of reliability practices can halve the reactive fixes needed for IT systems, and the earlier they are adopted in the lifecycle the greater the savings for businesses. Based on experience, up to 30% productivity gains, and roughly the same percentage reduction in maintenance costs, are predicted to be achievable through these practices.
Reliability is one characteristic of IT systems and, with a systematic approach, it is possible to meet business requirements at lower cost and with minimum disruption. Implementation of any chosen reliability method will succeed through seamless integration with current SDLC, Agile and transformation methodologies. Marketed properly, reliability capabilities have good potential for generating regular income and ongoing project work for commercial organisations.
Setting up a separate reliability excellence team in specialist IT departments requires broader effort and participation from the strategy, architecture, assurance, tools and industry vertical solution teams. The key is developing a system for proper data capture, interpreting that data, and acting on it in terms of KPIs such as reliability and availability, while identifying critical failure areas.
Setting up a reliability CoE will not only help give reliability the priority it needs but will also enhance the organisation's image and improve customer satisfaction, greatly reducing the risk of angry customers. In the long term, best reliability practices will result in positive culture change within the team as well as increased market share.
6. References
[ITIL OSIATIS] http://itil.osiatis.es/ITIL_course/it_service_management/availability_management/overview_availability_management/overview_availability_management.php
[HP 2007] Impact on U.S. Small Business of Natural & Man-Made Disasters, HP and SCORE report, 2007.
[Colin 2013] The Top Ten Technology Disasters of 2013, Colin Armitage, Chief Executive, Original Software. http://www.telegraph.co.uk/technology/news/10520015/The-top-ten-technology-disasters-of-2013.html
[Phil 2011] Top 10 Software Failures of 2011, Phil Codd, Managing Director, SQS. http://www.businesscomputingworld.co.uk/top-10-software-failures-of-2011
[Phil 2012] Top 10 Software Failures of 2012, Phil Codd, Managing Director, SQS. http://www.businesscomputingworld.co.uk/top-10-software-failures-of-2012
[Quoram 2013] Quorum Disaster Recovery Report, QuorumLabs, Inc., 2013. http://www.quorum.net/
[JBS 2013] Research by Cambridge MBAs for tech firm Undo finds software bugs cost the industry $316 billion a year. http://www.jbs.cam.ac.uk/media/2013/research-by-cambridge-mbas-for-tech-firm-undo-finds-software-bugs-cost-the-industry-316-billion-a-year/
[NIST 2002] The Economic Impacts of Inadequate Infrastructure for Software Testing, NIST Planning Report 02-3, June 2002.
[Ponemon 2013] 2013 Study on Data Centre Outages, Ponemon Institute LLC, September 2013.
[Aberdeen 2013] Downtime and Data Loss – How Much Can You Afford?, Analyst Insight, Aberdeen Group, August 2013.
[Kishor 2013] Software Reliability and Availability, Kishor Trivedi, Dept. of Electrical & Computer Engineering, Duke University, Durham, NC 27708; presented at TCS Ahmadabad, January 2013.
[Musa 1987] Software Reliability: Measurement, Prediction, and Application, Musa, J. D., Iannino, A., & Okumoto, K., McGraw-Hill, New York, 1987.
[Bonthu 2012] A Survey on Software Reliability Assessment by Using Different Machine Learning Techniques, Bonthu Kotaiah and R. A. Khan, International Journal of Scientific & Engineering Research, Volume 3, Issue 6, June 2012, ISSN 2229-5518.
[Pandey 2013] Early Software Reliability Prediction – A Fuzzy Logic Approach, Pandey A. K. and Goyal N. K., Springer, 2013.
[Pham 2006] System Software Reliability, Reliability Engineering Series, Pham, H., Springer, London, 2006.
[Lyu 1996] Handbook of Software Reliability Engineering, Lyu, M. R., McGraw-Hill/IEEE Computer Society Press, NY, 1996.
[Goel 1985] Software Reliability Models: Assumptions, Limitations, and Applicability, Goel, A. L., IEEE Transactions on Software Engineering, SE-11(12), 1411-1423, 1985.
[Allen 1999] Software Reliability and Risk Management: Techniques and Tools, Allen Nikora and Michael Lyu, tutorial presented at the 1999 International Symposium on Software Reliability Engineering.
[Ulrik 2010] Availability of Enterprise IT Systems – An Expert-Based Bayesian Model, Ulrik Franke, Pontus Johnson, Johan König, Liv Marcks von Würtemberg, Proc. Fourth International Workshop on Software Quality and Maintainability (WSQM 2010), Madrid.
[Evan 2003] Blueprint for High Availability, Evan Marcus and Hal Stern, Wiley, 2003.
7. Acknowledgement
The authors are grateful to Girish Chaudhari, Peter Andrew, Carl Borthwick and Jonathan Wright, who reviewed the material when it was first prepared for an internal team discussion. We would also like to thank Prajakta Vijay Bhatt and the anonymous CMG referees for their comments, which have helped make this paper better.
Appendix A: Surveyed Reasons for Unavailability
A survey among a few academic availability experts in 2010 ranked reasons for unavailability of enterprise IT systems [Ulrik 2010]. They identified a lack of best practices in the following areas as the causes:
• Monitoring of the relevant components
• Requirements and procurement
• Operations
• Avoidance of network failures, internal application failures, and external services that fail
• Network redundancy
• Technical solution of backup, and process solution of backup
• Physical location
• Infrastructure redundancy
• Storage architecture redundancy
• Change control
[Evan 2003] identified that investment in the following areas will help improve the availability of IT systems:
• Good systems and admin procedures
• Reliable backups
• Disk and volume management
• Networking
• Local environment
• Client management
• Services and applications
• Failovers
• Replication
Even though these studies do not apply in all cases, they provide useful guidelines for architects and designers of IT systems. This paper proposes a more structured approach to availability management that applies to most business organisations.
Appendix B – Sample Reliability Engineer Position Descriptions
Senior Reliability Engineer – Technical IT Infrastructure
• This position is located within the xxx Team in the Reliability, Maintainability and Testability Support Discipline.
• The xxx team has a role of increasing the availability and reducing the through-life cost of ownership of IT systems for customers.
Main responsibilities:
• Own end-to-end availability and performance of customer-critical services from an infrastructure point of view,
• Ensure a five-9s reliable experience for IT system users located in the UK and abroad,
• Liaise with customer teams and other partners to obtain reliability data,
• Analyse, model and interpret arising data to forecast the reliability of customer IT systems,
• Utilise reliability data to produce analysis and system performance reports for customers,
• Be capable of technical deep-dives into code, networking, operating systems and storage problem areas,
• Respond to and resolve emergent service problems to prevent problem recurrence,
• Liaise with Design, Support, Maintenance, Procurement and Commercial functions to identify suitable recommendations for improvements,
• Understand and interpret IT maintenance and support information to identify root causes of IT failure,
• Attend customer high-level service reviews and support root cause analysis,
• Perform detailed IT systems analysis to support releases in different production environments,
• Represent the xxx team in internal and external customer meetings,
• Participate in service capacity planning, demand forecasting, software performance analysis and system tuning activities.
Minimum qualifications
• BS degree in Computer Science or related field, or equivalent practical experience,
• Proven experience in a similar role in a commercial organisation, using formal reliability tools and procedures,
• Good understanding of reliability, maintainability and testability practices.
Preferred qualifications
• MS degree in Computer Science or related field,
• Experience with different M/F, server and desktop systems administration and logistics,
• Expertise in data structures, algorithms and basic statistical probability theory,
• Expertise in analysing and troubleshooting large-scale distributed systems,
• Knowledge of network analysis, performance and application issues using standard tools such as BMC Patrol, Teamquest or similar,
• Experience in a high-volume or critical production service environment,
• Sound understanding of IT life-cycle management and maturity gates,
• Strong leadership, communication, report writing, and presentation skills.
Senior Reliability Engineer – Software Engineering
• This position is located within the xxx Team in the Reliability, Maintainability and Testability Support Discipline.
• The xxx team has a role of increasing the availability and reducing the through-life cost of ownership of IT systems for customers.
Main responsibilities:
• Own end-to-end availability and performance of customer-critical services from a software design point of view,
• Manage availability, latency, scalability and efficiency of customer services by engineering reliability into software and systems,
• Review and influence ongoing design, architecture, standards and methods for operating services and systems,
• Work in conjunction with software engineers, systems administrators, network engineers and hardware teams to derive detailed reliability requirements,
• Identify metrics and drive initiatives to improve the quality of design processes,
• Understand fault prevention, fault removal, fault tolerance and defensive programming design techniques,
• Liaise with customer teams and other partners to build five-9s reliability into software delivery procedures,
• Be capable of technical deep-dives into code, networking, operating systems and storage design problem areas,
• Attend customer high-level IT design reviews,
• Represent the xxx team in internal and external customer meetings,
• Participate in capacity planning, demand forecasting, software performance analysis and system tuning activities.
Minimum qualifications
• BS degree in Computer Science or related field, or equivalent practical experience,
• Proven experience in a similar role in a commercial organisation, using formal reliability tools and procedures,
• Good understanding of reliability, maintainability and testability practices.
Preferred qualifications
• MS degree in Computer Science or related field,
• Expertise in complexity analysis and basic statistical probability theory,
• Expertise in designing end-to-end large-scale distributed systems with full resilience,
• Experience in end-to-end infrastructure, data, applications, security and service design,
• Experience in a high-volume or critical production service environment,
• Sound understanding of IT life-cycle management and maturity gates,
• Strong leadership, communication, report writing, and presentation skills.