The following post has been co-authored by Mattia Berlusconi.

When faced with the need to manage the capacity of virtualized environments, IT managers usually first take steps to ensure that the virtual infrastructure never runs out of physical resources. This usually entails collecting hypervisor CPU utilization, memory consumption, storage space allocation and similar data, and then performing some kind of forecasting exercise, possibly taking into account expected future demand, such as new business-led initiatives that will impact resource usage. All these metrics are typically covered by virtualization platforms, such as VMware vSphere, and are kept under control by the Operations teams responsible for managing the virtual piece of the IT stack.

The second, more mature capacity management need concerns the business applications running within the virtual machines (VMs). Suppose your CRM application, which implements your enterprise sales and support workflow, is deployed on virtual machines. As a service owner, it is paramount that you have visibility into the volume of product sales transactions per hour you can support; otherwise you risk costly QoS degradation, or even business interruption and consequent revenue loss. How can you provide this answer to your business?

You need an Application Capacity Model to do this, of course. You need to find the relationship between sales transaction volume and how your infrastructure gets utilized. But wait: how are you going to measure the “infrastructure” metrics if your application is running on VMs?

Traditionally, monitoring agents are deployed within physical servers in order to collect the relevant consumption metrics of the various physical resources (CPU, memory, disks, and so on). With virtualization, it became clear that this approach could no longer be used as before. Virtualization introduced new measurement challenges, with the result that monitoring agents running within guest operating systems might not be able to correctly measure some of the most important capacity metrics. The common answer to this problem was to use virtualization platform metrics: the hypervisor is, for example, in the best position to measure your virtual machine’s real CPU utilization.

This has led to a common practice of discarding metrics coming from guest OS monitoring agents entirely and focusing exclusively on hypervisor metrics for any capacity management need. Why do you need guest-level metrics, after all, when the virtualization platform already gives you a wealth of interesting metrics?

The sad news is that this view is misleading. If you need to understand the capacity of your business application, the hypervisor metrics might not be enough. Not only should guest OS metrics not be thrown away, they are actually essential for capacity management purposes.

Measuring Application Memory Demand

We are going to perform some experiments with the goal of estimating the memory demand of an application running within a VM. We will generate artificial load within the VM, simulating common application activity. Then, memory metrics gathered from the hypervisor will be analyzed. Finally, guest OS metrics will be examined to assess whether they add any value. We will focus on vSphere, the leading x86 virtualization platform.

Memory Metrics

In terms of virtual machine memory consumption, vSphere provides several metrics that are widely used within the industry for performance evaluation activities. The most important ones for Capacity Management are:

  • Memory Active: Amount of guest “physical” memory actively used
  • Memory Swap In Rate: Rate at which the hypervisor swaps VM memory in from disk (Memory Swap Out Rate measures the opposite direction)

The metrics have been gathered using the vSphere Web Client.
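
If you prefer to automate the collection, the same counters can also be captured directly on the ESXi host with esxtop in batch mode. A minimal sketch (the 20-second interval and sample count are our own arbitrary choices, not requirements):

    # Capture all esxtop counters, including per-VM active memory and
    # swap-in/swap-out rates, every 20 seconds for 90 samples (~30 minutes).
    # The resulting CSV can then be filtered for the test VM.
    esxtop -b -d 20 -n 90 > vm_memory_counters.csv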

In order to better understand what’s going on within the VM, memory consumption metrics available from the guest OS will also be gathered during the experiments. Specifically:

  • Memory Used (KB): Total amount of used memory
  • Memory Cached (KB): Amount of memory used for data caching
  • Committed Memory (KB): Estimate of the amount of memory needed for the current workload
  • SwapIn, SwapOut (Pages/sec): Number of pages per second transferred from the swap device back into main memory (swap-in) and from main memory out to the swap device (swap-out)

The metrics have been gathered using the ‘sar -r’ and ‘sar -W’ Linux commands.
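
For reference, here is a sketch of how the guest metrics might be sampled during a test run (the one-minute interval and sample count are illustrative choices):

    # Sample memory utilization (kbmemused, kbcached, kbcommit)
    # every 60 seconds, 30 times, while the workload runs.
    sar -r 60 30 > guest_memory.log

    # Sample swapping activity (pswpin/s, pswpout/s) on the same schedule.
    sar -W 60 30 > guest_swapping.log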

Methodology

To illustrate our point, we perform three experiments:

  • Experiment one: a synthetic application allocates memory within the VM, simulating the memory demand placed by a real running application. The stress tool is used within the VM to allocate a given amount of memory and access it, in order to force real memory allocation by both the guest OS and the hypervisor.
  • Experiment two: an IO workload generator performs file input/output operations within the VM, simulating common application file reads and writes (e.g. data ingestion, XML processing, databases, backups, compression jobs). The fio IO workload generator has been used to inject realistic IO request patterns.
  • Experiment three: similar to experiment one, but the simulated application working set expands to consume almost all of the VM memory, forcing the guest OS to swap. Command sketches for all three experiments are shown after this list.
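
For reproducibility, here are sketches of the load-generation commands. The sizes, durations and file locations below are illustrative values, not the exact parameters of our runs:

    # Experiments one and three: allocate and continuously touch memory, so
    # that both the guest OS and the hypervisor see real page allocations.
    # --vm-keep re-dirties the allocated pages instead of freeing them;
    # experiment three simply raises --vm-bytes close to the 2 GB VM size.
    stress --vm 1 --vm-bytes 1G --vm-keep --timeout 600s

    # Experiment two: generate sustained sequential file reads and writes,
    # letting the guest OS page cache absorb the file data.
    fio --name=cachefill --directory=/tmp --rw=readwrite \
        --bs=64k --size=1G --runtime=600 --time_based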

The experiments were conducted in our lab testbed with the following configuration:

  • Virtualization infrastructure: vSphere 5.0.0, running on an HP DL360 G7 with 2 Xeon E5620 CPUs @ 2.4 GHz and 48 GB of physical memory
  • Test VM: CentOS 6.4 Linux configured with 2 GB memory, 2 vCPUs and 20 GB of storage.

Experiment One – Application Memory Allocation

vSphere active memory increases in response to increased memory demands generated by the application running within the VM. Guest OS metrics show increased utilization due to application committed memory. The guest OS decreases memory used for caching purposes to make room for the application memory requests.

Experiment Two – Application IO Workload Generation

vSphere active memory increases in response to increased memory demands generated by the guest OS. Guest OS metrics show increased utilization due to inflation of the cached memory, while application memory demand (committed memory) remains constant.

I suspect these results are somewhat unexpected to some of you: IO workloads cause increased memory demands too! Although at first this behavior might seem just plain wrong, it is actually a common strategy of modern operating systems to increase performance by using available physical memory to cache file data. By doing so, the OS hopes to avoid costly disk operations: latency-wise, accessing a spinning disk is around five orders of magnitude slower than accessing memory!
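
You can observe this behavior on any Linux machine, hypervisor or not. A quick illustration (file path and size are arbitrary):

    free -m                                   # note the "cached" column
    dd if=/dev/zero of=/tmp/bigfile bs=1M count=1024   # write 1 GB of file data
    free -m                                   # "cached" has grown by ~1 GB; the
                                              # "-/+ buffers/cache" row shows that
                                              # application usage is unchanged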

Experiment Three – Application Memory Allocation leading to Swapping

vSphere active memory increases in response to increased memory demands generated by the application running within the VM. The vSphere swap rate counters do not show any activity. Guest OS metrics show increased utilization due to application committed memory. The guest OS is forced into a memory pressure condition and reacts to it by swapping memory pages in and out.

Again, interesting results here. Excessive application memory demand within the VM caused severe guest OS swapping activity, likely leading to poor application response time and degraded user QoS. Nevertheless, the vSphere swapping counters do not signal any activity. What is going on? Again, this is correct: the vSphere swapping counters are meant to measure hypervisor swapping activity, which takes place when the hypervisor itself comes under memory pressure due to the aggregate memory demands of its VMs. They do not signal guest OS swapping!
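
Guest-level swapping therefore has to be detected from inside the guest. Besides sar -W, a quick check with vmstat works too (the five-second interval is arbitrary):

    # The "si" and "so" columns report memory swapped in from and out to
    # the guest swap device (KB/s). Non-zero values here never show up in
    # the vSphere VM swap rate counters, which track hypervisor-level
    # swapping only.
    vmstat 5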

Conclusion and Key Points

First, application memory demand within virtual machines cannot be reliably estimated just by looking at vSphere memory metrics. VM active memory varies in response to guest OS memory allocation, which can be due to application memory demands or simply to file IO operations.

Second, severe memory over-commitment within virtual machines, with its large impact on application response times and QoS, cannot be reliably detected without looking at guest OS memory metrics. The VM swapping counters available from vSphere are meant for a different purpose (tracking hypervisor swapping).

It seems that in order to perform Business Application Capacity Management in virtual environments, you actually need guest-level metrics!