Rally On-Premise Edition: What is My User Capacity? White Paper

By Jim Campbell, Rally On-Premise Edition Product Owner
On-Premise will support your capacity needs. Rally has had more
than enough capacity for our largest Fortune 100 customers. But it is
important to evaluate your organization’s specific dynamics to make
the call on how many users you can comfortably support in your
hardware environment.
Rally offers an On-Premise version of our Application Lifecycle
Management (ALM) product. It is deployed as a copy of our SaaS
stack, packaged in a VMware appliance. This virtual machine contains
the same components as our operational stack: Oracle, Apache, and
our application (app) servers. It is basically plug and play; install it in a
VMware environment, bring it up, start adding users, and get to work.
Currently, Rally has customers with more than 10,000 individual users,
and many customers with well over 1000 users. We are often asked
how many users the On-Premise appliance can support.
Every customer has a different load profile on Rally, and one of the
benefits of a SaaS deployment is that we can look at the differences
in how customer usage impacts the Rally stack. In this white paper
we’ll present some data on those variations to help you answer “how
many users can I put on my Rally On-Premise instance?” because the
answer really is “it depends.”
www.rallydev.com ©2013 Rally Software Development
As the only Agile multi-tenant solution with thousands of customers worldwide, our SaaS
stack has a rich set of logging and analysis tools to allow our Operations (Ops) team
to monitor customer activity without accessing proprietary content. Our On-Premise
Edition will deploy this same monitoring capability later this year, which will allow Rally
administrators to see data such as transactions per second (TPS), Java virtual machine
(JVM) memory usage, and the like.
Looking at Actual Customer Data
So how do we start to answer the question of capacity for your Rally On-Premise
Edition? We don’t have access to actual On-Premise data, so we started by looking
at the load characteristics of a number of our largest SaaS customers, whose
organizational dynamics are similar to those of our large On-Premise customers. We also
focused on TPS as our limiting factor for capacity, because it is a metric that we can
easily measure on a per-customer basis.
Memory is another limiting factor, but we give access to the current memory utilization
in the On-Premise control panel. We’ve had good success with customers simply
increasing the memory on the VM once the maximum utilization approaches 90%. With our
largest customers (many thousands of users), 12 GB has been more than adequate.
Analyzing the numbers from our large SaaS customers
First of all, there are two types of users. There are human users who interact by clicking
on a screen, and there are “robot” users, which are connectors to other systems,
customer-written data extractors, and so on. Different customers have different mixes of
these two types of users. Some have almost no robot transactions; others have a large
number of automated transactions.
Here is a large customer based in the eastern US with several thousand users, shown by its
hourly maximum TPS over two days.
Some observations:
• The human transactions pretty much follow a typical business day. People use Rally
more in the morning when the coffee is kicking in, take lunch off, and have another
jump mid-afternoon.
• When we drill down into the makeup of the customer’s robot transactions, we
see that they are mostly driven by the Eclipse and Excel plugins, as well as the
Subversion Source Code Management connector. Since these are “people-driven,”
that explains the business day cycle. They also use our Ruby API, which accounts for
the other transactions using custom integrations.
• Their peak load is 6.5 TPS.
Here is a much larger customer based on the US west coast with a different set of
transaction dynamics.
Some observations on this second customer:
• We see a substantially higher ratio of robot TPS to human TPS. Unlike the
first customer’s, these robot transactions don’t follow the same shape as the human
transactions. Their data shows that their use of connectors is much more oriented to
custom integrations using the Rally API, and therefore is probably less “people” driven.
The two matching spikes across the two days at 4am hint strongly at off-hour custom
integrations for data extraction.
• We are seeing a peak at 11.5 TPS at 8am US Western, when there is a high load (4
TPS) of robot transactions.
• When we did further research, we found that this customer has a larger share
of users in other countries than the first customer. We can see some correlation between
human TPS and time zones for this customer from the spikes at 2am. Looking across
a number of other large customers, we found that their peak TPS was lower if they
had users in more time zones (i.e., load was spread over the 24-hour clock).
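The time-zone effect in the last observation can be illustrated with a small model. This is only a sketch, not Rally data or tooling: the flat business-hours profile, the 3 TPS per site figure, and all function names are hypothetical assumptions chosen to keep the arithmetic easy to follow.

```python
# Hypothetical model: each site generates a flat load during local business
# hours and nothing otherwise. Real load profiles are burstier than this.

def business_day_profile(peak_tps, start_hour=8, end_hour=17):
    """Return 24 hourly TPS values for one site, in that site's local time."""
    return [peak_tps if start_hour <= hour < end_hour else 0.0
            for hour in range(24)]

def shift_profile(profile, offset_hours):
    """Rotate a local-time profile onto a common reference clock."""
    return [profile[(hour - offset_hours) % 24] for hour in range(24)]

def combined_peak(sites):
    """Sum the shifted profiles of (profile, offset_hours) pairs; return the peak."""
    total = [0.0] * 24
    for profile, offset_hours in sites:
        shifted = shift_profile(profile, offset_hours)
        total = [a + b for a, b in zip(total, shifted)]
    return max(total)

site = business_day_profile(3.0)  # assumed 3 TPS per site, for illustration only
same_zone_peak = combined_peak([(site, 0), (site, 0)])
spread_peak = combined_peak([(site, 0), (site, 12)])
print(same_zone_peak, spread_peak)
```

With both sites in one time zone the busy hours stack on top of each other. Twelve hours apart they do not overlap at all, so the combined peak is no higher than that of a single site.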
Normalized TPS
To allow an “apples to apples” comparison, we need to normalize the data against the
subscription size. Below are the maximum TPS load graphs from the above two
customers, normalized to TPS per 1000 users.
Customer 1 average: 1.3 (18% higher)
Customer 2 average: 1.1
There is a small (18%) difference in the average TPS over the course of 24 hours,
but the first customer’s maximum TPS is 61% higher because of the much greater gap
between its peak and average load. Below is a representative sample of large customers
and their max TPS during each hour. Each line is a separate customer, with a circle for
each hourly TPS sample. Looking at this set of customers, we see a clear concentration
between 2 and 4 TPS/1000 users. But note that some customers have an equal load
from robot transactions. If these were running at the same time as the peak human use
for Customer 1, we could see a load as high as 7 TPS/1000. There is an important point
to be made here: it is entirely possible to max out the capacity of a Rally instance by
unleashing a set of poorly written robot connectors. Our Ops team monitors our SaaS
stack for this and actively reaches out to customers running high robot TPS to help them
fine-tune their integrations to lower the peak loads. As an On-Premise customer, you’ll
want to work with your users on building and using well-behaved integrations.
We can confidently say that our large SaaS customers average 2-4 TPS per 1000 users,
and we believe that you will see similar results on the Rally On-Premise appliance. This
number can be affected by integrations and by the time zone distribution of users.
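The normalization and the 2-4 TPS per 1000 users range can be turned into a quick estimate. The sketch below is illustrative only: the function names and the sample subscription sizes are hypothetical, and the default rates are simply the range quoted above.

```python
def normalized_tps(tps, users):
    """Convert a raw TPS figure to TPS per 1000 users."""
    return tps * 1000 / users

def expected_peak_tps(users, low_rate=2, high_rate=4):
    """Return the (low, high) peak TPS expected for a subscription,
    using rates expressed in TPS per 1000 users."""
    return (users * low_rate / 1000, users * high_rate / 1000)

# Hypothetical subscription: 3,000 users with a measured peak of 9 TPS.
rate = normalized_tps(9, 3000)

# Hypothetical planning case: what peak might 6,000 users generate?
low, high = expected_peak_tps(6000)
print(rate, low, high)
```

A measured rate near the top of the range is a hint to look at robot traffic before adding hardware.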
On-Premise Capacity
Now that we can estimate the max TPS your users will generate, what is the maximum
TPS capacity of an On-Premise server? Again, “it depends.” This time it depends on
the hardware you are running it on. Our On-Premise edition is shipped as a VMware
appliance, to be installed in a VM infrastructure. We recommend a configuration (listed
on our web page) of two CPU cores and 8-12 GB of memory. For our SaaS stack,
we have a sophisticated load test that we run with every SaaS version before we
release it. Using that same load test against the most recent On-Premise release, on
the recommended hardware configuration, with a typical user data set, we measure a
maximum of 17 TPS before we max out the CPUs. Our experience with our SaaS stack
tells us that increasing CPU cores should improve capacity in a near-linear fashion, and
we are seeing the same behavior with the On-Premise Edition. Doubling the CPU cores
to 4 moves the maximum TPS up to 40. That is as high as we have currently tested,
since that is more than enough for the largest customers and prospects to date. We
see near-linear improvements as we add JVMs to our SaaS stack, so we have every
expectation that we’ll see the same in our On-Premise edition when we need to handle
larger subscriptions.
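As a quick check of the scaling claim, the two measured figures can be compared against a strictly linear projection. This is a sketch; the helper name is hypothetical, and only the 17 TPS and 40 TPS figures come from the text.

```python
def linear_projection(measured_tps, measured_cores, target_cores):
    """Project capacity assuming TPS scales exactly with core count."""
    return measured_tps * target_cores / measured_cores

two_core_tps = 17   # measured maximum on 2 cores, from the text
four_core_tps = 40  # measured maximum on 4 cores, from the text

projected = linear_projection(two_core_tps, 2, 4)
ratio = four_core_tps / projected
print(projected, round(ratio, 2))
```

The measured four-core figure sits somewhat above the linear projection from two cores. Anything beyond four cores is an extrapolation, since four is the largest configuration reported as tested.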
Maximum Users
So what does this mean as far as answering the question of how many users an
On-Premise instance can support? Again, it depends on the dynamics of the specific
customer load, but here are some guidelines:
Normalized          Max Users with            Max Users with
Transaction Load    Standard Configuration    Large Configuration
                    (2 CPU Cores, 8 GB)       (4 CPU Cores, 16 GB)
2 TPS/Thousand      7,500                     17,000
3 TPS/Thousand      5,000                     12,000
4 TPS/Thousand      3,700                     8,500
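The relationship between server capacity, normalized load, and user count can be sketched as a simple division. The paper does not state how the table was derived, so this is not a reconstruction of it: dividing the measured maxima (17 and 40 TPS) directly gives somewhat higher user counts than the published guidelines, which would be consistent with leaving a safety margin. The function name and the `headroom_pct` parameter are hypothetical.

```python
import math

def max_users(capacity_tps, tps_per_thousand, headroom_pct=0):
    """Estimate supportable users from server capacity and normalized load.

    headroom_pct is an assumed safety margin (percent of capacity held in
    reserve); it is not a figure taken from the paper.
    """
    usable_tps = capacity_tps * (100 - headroom_pct) / 100
    return math.floor(usable_tps * 1000 / tps_per_thousand)

# Raw division of the measured maxima, with no margin held back.
for rate in (2, 3, 4):
    print(rate, max_users(17, rate), max_users(40, rate))
```

For planning, the published guideline figures are the safer numbers to use. The sketch is mainly useful for plugging in your own measured rates once monitoring is available.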
We’ve seen about a 2x variability in maximum load on Rally across a set of large
customers. A new On-Premise installation won’t initially be able to judge what its
expected normalized TPS will be, and therefore its maximum number of users. But based
on our experience, we know that it won’t be at the high end of 4 TPS per 1000 users
in the initial period of use. We’re planning on releasing monitoring tools for Rally
On-Premise in the future, and then customers will be able to watch these transaction rates
and upgrade their hardware as needed. It is always appropriate to work with the authors
of robot transaction integrations to either reduce their peak load or move the traffic to
low-usage hours. We are confident that by adding more cores and memory, and looking
at other hardware options such as SSDs, we have lots of runway for user configurations
far larger than the numbers above.
If you have any other questions, please don’t hesitate to contact your sales rep, or your
technical account manager.
Enjoy your Rally On-Premise instance!
Acknowledgements
A number of people provided valuable input into this paper. Thanks to Marc Chipouras,
Brian Dupras, Vikas Shivamurthy, Dara Warde and Ian Whitmore.