Capacity Planning and Performance Management (CPPM)

Introduction

Capacity Planning and Performance Management are tools that enable a manager to make good trade-offs between the cost of computer systems and the performance seen by users. The value of CPPM information is often underappreciated because modern desktop systems are so powerful. With tight budgets and no obvious performance problems, managers are tempted to avoid what they see as an unnecessary expense.

CPPM Terms Defined

Performance is a measure of how fast a system can process a single event or, alternatively, a rate: how many events can be processed per unit time.

In order to characterize the results, careful definition of the hardware and software components is required, plus the details of what actions the event performs.

Capacity is usually specified as a continuous rate of processing for a specific system and event(s) over a long enough time to reach steady state.
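
As a rough illustration of the two views of performance defined above (time per event and events per unit time), the sketch below times a repeated event. The names `measure` and `sample_event` and the warm-up count are hypothetical choices made for this example; the warm-up runs are a simple stand-in for waiting until the system reaches steady state.

```python
import time

def measure(event, n_events, warmup=0):
    """Run `event` repeatedly; return (mean latency in s, throughput in events/s)."""
    for _ in range(warmup):
        event()  # discard warm-up runs so the figures are closer to steady state
    latencies = []
    start = time.perf_counter()
    for _ in range(n_events):
        t0 = time.perf_counter()
        event()
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    mean_latency = sum(latencies) / n_events   # performance as time per event
    throughput = n_events / elapsed            # performance as a rate
    return mean_latency, throughput

def sample_event():
    """Hypothetical stand-in for a real unit of work."""
    sum(range(10_000))

if __name__ == "__main__":
    latency, rate = measure(sample_event, n_events=1_000, warmup=100)
    print(f"mean latency: {latency * 1e6:.1f} us, throughput: {rate:.0f} events/s")
```

The numbers only mean something together with a description of the hardware, the software and what the event actually does, which is why the definitions above insist on that detail.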

How Are They Related?

Performance and capacity may seem to be each other's inverse, but that is not precise usage. Typically, Performance is managed, meaning it is a current and ongoing process. Capacity is something that is planned, meaning it is a future point and is the desired level of performance that will be required.

The two terms are complementary in their usage, and linked by a common element: both are measured as work processed per unit of time. They are related in another way - sequence. Before you can plan for capacity, you need to know the current performance and the additional demands of the planned system.

When Do You Need Performance Management?

Performance Management (PM) is frequently done in response to an unexpected event, such as a system's response time collapsing under a small growth in workload. Under these circumstances, the study is done rapidly and sometimes companies just throw new hardware at the problem because small system enhancements appear inexpensive.

Unfortunately, this approach suffers from a number of problems. First, the upgrades may not solve the problem, or may only add a short extension to the current system. Second, if the system is already at its limit for hardware, the upgrade may be very expensive.

Third and most important, the hardware band-aid approach gives the company no information about the cause of the problem or the actual increase in capacity. This is an almost certain path to a repeat of the original crisis at some time in the future, with unknown risks and costs.

While any one quick and dirty hardware upgrade may work fine, there is no guarantee that the next problem can be solved that way, and no way of knowing what may cause it. The risks of an interruption to business are unpredictable. Since every business now depends on its computers to run efficiently, treating PM as a crisis event could be the last mistake some businesses make.

Alternatives to the Crisis Approach

For any business beyond a few desktops sharing a peer network, PM should not be an event, but a process. The trick to avoiding the crisis is to perform regular monitoring of some kind on the system's load and performance. The kind and amount of monitoring depend on the size of the system, how critical it is for normal operation, and how much money would be lost per hour if it were down.

Basic PM need not require an in-house performance analyst, because this level of PM can be done by moderately experienced people with simple tools and training. There are three steps in basic PM:

  1. The first step is regular monitoring and recording of how the system is performing.
  2. The second step is reviewing the data collected to determine if there are hot spots, and to look for trends in the data that indicate rate of growth (or decline).
  3. The third step is to calculate how much capacity for additional work remains in the system, both at average and peak loads.
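
The three steps above can be sketched in a few lines. This is a minimal illustration, not a monitoring tool: the weekly CPU figures and the 70 percent planning limit are invented for the example, and the function names are hypothetical. A straight-line trend is only a rough guide to growth.

```python
def linear_trend(samples):
    """Step 2: least-squares slope and intercept for equally spaced samples."""
    n = len(samples)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def headroom(samples, limit):
    """Step 3: remaining capacity at average load and at peak load."""
    return limit - sum(samples) / len(samples), limit - max(samples)

def periods_until(samples, limit):
    """Periods after the last sample until the trend line reaches `limit`.

    Returns None when the trend is flat or declining.
    """
    slope, intercept = linear_trend(samples)
    if slope <= 0:
        return None
    return (limit - intercept) / slope - (len(samples) - 1)

# Step 1: hypothetical weekly CPU utilization (percent) recorded by monitoring
weekly_cpu = [40, 42, 44, 46, 48, 50]
PLANNING_LIMIT = 70  # assumed planning threshold, not a universal rule

if __name__ == "__main__":
    slope, _ = linear_trend(weekly_cpu)
    avg_room, peak_room = headroom(weekly_cpu, PLANNING_LIMIT)
    print(f"growth: {slope:.1f} points/week")
    print(f"headroom: {avg_room:.1f} at average load, {peak_room:.1f} at peak")
    print(f"weeks until limit: {periods_until(weekly_cpu, PLANNING_LIMIT):.1f}")
```

With these sample figures the trend is 2 points per week, leaving 25 points of headroom at average load, 20 at peak, and about 10 weeks before the trend line reaches the limit.
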

Beyond basic PM the process gets more complex. There is a range of additional tasks that can be added to refine the information and increase confidence in the calculations.

Companies that want to use the additional techniques typically hire a consultant to set up the process and train an internal person to monitor it, with the consultant backing up the monitoring. At some point of increasing complexity, the job may be outsourced to specialists, or the company may choose to train and use only internal people.

Capacity Planning Approaches

Basic capacity planning is implicit in the basic performance management approach discussed above. Knowing where you are is a necessary first step towards figuring out how to get to another point.

It may be as simple as seeing what difference one additional unit of workload makes, and projecting that as a linear increment (but see below), or as complex as running simulated workloads on simulated hardware that does not yet exist.

Beyond that simple approach, useful CP rapidly becomes a complex project. The two core problems in CP are getting accurate information on programs not yet written, and accurate simulation in a reasonable time.

To perform Capacity Planning (CP), it is necessary to know the system performance requirements (memory, CPU, I/O) for the most used programs being planned. While this may be done by simulation in theory, accurate software performance numbers require test runs with real components on real hardware.

Simulation is difficult for two reasons - information and non-linearity. The amount of information required for an accurate simulation of a typical computer system is large, specialized and rarely all available. Hardware is not usually the problem - it is the complex internals of the OS and applications that have inadequate or no timing information.

The non-linearity arises because computer systems can only be modeled accurately with queueing simulation. Queueing behavior becomes extremely non-linear at unpredictable levels of activity, so linear or algorithmic approximations are worth almost nothing - they rarely give a useful result.
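
To get a feel for this non-linearity, the sketch below uses the textbook M/M/1 single-server queue, for which mean response time is service time divided by (1 - utilization). The M/M/1 model and the 0.1 second service time are assumptions chosen for illustration; a real system has many interacting queues and will not follow this curve exactly.

```python
def mm1_response_time(service_time, utilization):
    """Mean response time of an M/M/1 queue: S / (1 - U)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1.0 - utilization)

SERVICE_TIME = 0.1  # seconds per request, hypothetical

if __name__ == "__main__":
    for u in (0.5, 0.7, 0.9, 0.95):
        r = mm1_response_time(SERVICE_TIME, u)
        print(f"utilization {u:.0%}: response time {r:.2f} s")
```

In this model, going from 50 to 90 percent utilization is less than a doubling of load, yet response time rises fivefold, from 0.2 to 1.0 seconds. A straight line fitted to measurements taken at low load would miss that rise entirely.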

As an example, an accurate simulation of a simple system could take hours, and one of a real system, weeks. Such problems are often loosely called NP-complete; the practical point is that computation time grows roughly exponentially with the complexity involved. The increase in time and cost rapidly outweighs the value of the information because of the uncertainties in the workload information.

Another approach is to use queueing theory directly. Unfortunately, the theory only has solutions for some of the simpler queueing models, and most real systems are far more complex than theory can handle directly.
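
One of the simple models that theory does solve is the M/M/1 queue (one server, random arrivals, exponentially distributed service times). The sketch below compares its closed-form mean waiting time with a small simulation based on the Lindley recursion. The rates, customer count and seed are arbitrary values picked for this illustration.

```python
import random

def mm1_mean_wait_theory(arrival_rate, service_rate):
    """Closed-form mean time in queue (before service) for M/M/1."""
    if arrival_rate >= service_rate:
        raise ValueError("queue is unstable when arrival rate >= service rate")
    rho = arrival_rate / service_rate
    return rho / (service_rate - arrival_rate)

def simulate_mean_wait(arrival_rate, service_rate, customers=200_000, seed=1):
    """Estimate the same quantity by simulating one customer at a time.

    Lindley recursion: next wait = max(0, wait + service - gap to next arrival).
    """
    rng = random.Random(seed)
    wait = 0.0
    total = 0.0
    for _ in range(customers):
        total += wait
        service = rng.expovariate(service_rate)
        gap = rng.expovariate(arrival_rate)
        wait = max(0.0, wait + service - gap)
    return total / customers

if __name__ == "__main__":
    lam, mu = 0.5, 1.0  # hypothetical arrival and service rates
    print(f"theory:     {mm1_mean_wait_theory(lam, mu):.3f}")
    print(f"simulation: {simulate_mean_wait(lam, mu):.3f}")
```

For a model this simple the two figures should agree closely. Once there are several servers with priorities, shared locks, caches or bursty arrivals, there is usually no formula to check against, and only the simulation remains.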

How then is CP done in practice? Capacity planning is where experience and insight are critical, and this means someone with experience on a range of systems and problem types. Obviously this is not inexpensive, but it is likely to be much more expensive to skip this step, or do it with inexperienced staff.

Getting the wrong answer will be expensive. Key to getting the right answer, or one close enough to be useful, is asking the right questions. It is here that experience counts most. Since there are so many possible items to get data for, it becomes essential to use the smallest set of data that gives a useful answer, because compute time grows so steeply with the amount of detail modeled.

The good news is that most sites will not need to go to detailed simulations for CP because most of what is needed can be developed by adding capabilities to the performance management tools. Only in the case of a big new software project or a major change in the environment, such as a business acquisition, should you need to go the full detail route.

The next part of this series will go into more detail on how to start your performance management effort, and some of the tools that can assist you.

All content on this site is Copyright 2001 and 2002 by Bill Nicholls