OProfile Internals


Table of Contents

1. Introduction
    1. Overview
    2. Components of the OProfile system
        2.1. Architecture-specific components
        2.2. oprofilefs
        2.3. Generic kernel driver
        2.4. The OProfile daemon
        2.5. Post-profiling tools
2. Performance counter management
    1. Providing a user interface
    2. Programming the performance counter registers
        2.1. Starting and stopping the counters
        2.2. IA64 and perfmon
3. Collecting and processing samples
    1. Receiving interrupts
    2. Core data structures
    3. Logging a sample
    4. Synchronising the CPU buffers to the event buffer
    5. Identifying binary images
    6. Finding a sample's binary image and offset
4. Generating sample files
    1. Processing the buffer
    2. Locating and writing sample files
5. Generating useful output
    1. Handling the profile specification
    2. Collating the candidate sample files
    3. Generating profile data
    4. Generating output
Glossary of OProfile source concepts and types

Chapter 1. Introduction

This document is current for OProfile version 0.7.1. It provides some details on the internal workings of OProfile for the interested hacker, and assumes strong C, working C++, plus some knowledge of kernel internals and CPU hardware.

Note

Only the "new" implementation associated with kernel 2.6 and above is covered here. 2.4 uses a very different kernel module implementation and daemon to produce the sample files.

1. Overview

OProfile is a statistical continuous profiler. In other words, profiles are generated by regularly sampling the current registers on each CPU (from an interrupt handler, the saved PC value at the time of interrupt is stored), and converting that runtime PC value into something meaningful to the programmer.

OProfile achieves this by taking the stream of sampled PC values, along with the detail of which task was running at the time of the interrupt, and converting it into a file offset against a particular binary file. Because applications mmap() the code they run (be it /bin/bash, /lib/libfoo.so or whatever), it's possible to find the relevant binary file and offset by walking the task's list of mapped memory areas. Each PC value is thus converted into a tuple of (binary image, offset). This is something that the userspace tools can use directly to reconstruct where the code came from, including the particular assembly instructions, symbol, and source line (via the binary's debug information if present).

Regularly sampling the PC value like this approximates what actually was executed and how often - more often than not, this statistical approximation is good enough to reflect reality. In common operation, the time between each sample interrupt is regulated by a fixed number of clock cycles. This implies that the results will reflect where the CPU is spending the most time; this is obviously a very useful information source for performance analysis.

Sometimes though, an application programmer needs different kinds of information: for example, "which of the source routines cause the most cache misses?". The rise in importance of such metrics in recent years has led many CPU manufacturers to provide hardware performance counters capable of measuring these events on the hardware level. Typically, these counters increment once per event, and generate an interrupt on reaching some pre-defined number of events. OProfile can use these interrupts to generate samples: the profile results are then a statistical approximation of which code caused how many occurrences of the given event.

Consider a simplified system that only executes two functions A and B. A takes one cycle to execute, whereas B takes 99 cycles. Imagine we run at 100 cycles a second, and we've set the performance counter to create an interrupt after a set number of "events" (in this case an event is one clock cycle). It should be clear that the chance of the interrupt occurring in function A is 1/100, and 99/100 for function B. Thus, we statistically approximate the actual relative performance features of the two functions over time. This same analysis works for other types of events, providing that the interrupt is tied to the number of events occurring (that is, after N events, an interrupt is generated).

There is typically more than one of these counters, so it's possible to set up profiling for several different event types. Using these counters gives us a powerful, low-overhead way of gaining performance metrics. If OProfile, or the CPU, does not support performance counters, then a simpler method is used: the kernel timer interrupt feeds samples into OProfile itself.

The rest of this document concerns itself with how we get from receiving samples at interrupt time to producing user-readable profile information.

2. Components of the OProfile system

2.1. Architecture-specific components

If OProfile supports the hardware performance counters found on a particular architecture, code for managing the details of setting up and managing these counters can be found in the kernel source tree in the relevant arch/<arch>/oprofile/ directory (for example, arch/i386/oprofile/). The architecture-specific implementation works by filling in the oprofile_operations structure at init time. This provides a set of operations such as setup(), start(), stop(), etc. that manage the hardware-specific details of fiddling with the performance counter registers.

The other important facility available to the architecture code is oprofile_add_sample(). This is where a particular sample taken at interrupt time is fed into the generic OProfile driver code.
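
A minimal sketch of what such an implementation looks like is shown below. The field names follow the oprofile_operations definition in <linux/oprofile.h>, but the exact field set and the way the structure is handed to the core vary between kernel versions, and the arch_* helpers are purely illustrative stand-ins for the real model code:

    /* Illustrative only: the operations an architecture fills in for the
     * OProfile core. */
    static struct oprofile_operations arch_ops = {
        .create_files = arch_create_files, /* add per-counter oprofilefs entries */
        .setup        = arch_setup,        /* program the counter registers */
        .shutdown     = arch_shutdown,     /* undo setup, restore old state */
        .start        = arch_start,        /* enable the configured counters */
        .stop         = arch_stop,         /* disable them again */
        .cpu_type     = "i386/ppro",       /* tells user-space which event set applies */
    };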

2.2. oprofilefs

OProfile implements a pseudo-filesystem known as "oprofilefs", mounted from userspace at /dev/oprofile. This consists of small files for reporting and receiving configuration from userspace, as well as the actual character device that the OProfile userspace receives samples from. At setup() time, the architecture-specific code may add further configuration files related to the details of the performance counters. For example, on x86, one numbered directory for each hardware performance counter is added, with files in each for the event type, reset value, etc.
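
As a rough illustration of how those per-counter files get created, the architecture's create_files() callback uses the oprofilefs helpers (oprofilefs_mkdir(), oprofilefs_create_ulong() and friends from drivers/oprofile/oprofilefs.c; their exact signatures differ between kernel versions). The NUM_COUNTERS constant and counter_config array below are illustrative per-architecture bookkeeping, not part of the core API:

    /* Sketch only: add one numbered directory per hardware counter. */
    static int arch_create_files(struct super_block *sb, struct dentry *root)
    {
        int i;

        for (i = 0; i < NUM_COUNTERS; ++i) {
            struct dentry *dir;
            char name[4];

            snprintf(name, sizeof(name), "%d", i);
            dir = oprofilefs_mkdir(sb, root, name);

            oprofilefs_create_ulong(sb, dir, "enabled",   &counter_config[i].enabled);
            oprofilefs_create_ulong(sb, dir, "event",     &counter_config[i].event);
            oprofilefs_create_ulong(sb, dir, "count",     &counter_config[i].count);
            oprofilefs_create_ulong(sb, dir, "unit_mask", &counter_config[i].unit_mask);
            oprofilefs_create_ulong(sb, dir, "kernel",    &counter_config[i].kernel);
            oprofilefs_create_ulong(sb, dir, "user",      &counter_config[i].user);
        }
        return 0;
    }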

The filesystem also contains a stats directory with a number of useful counters for various OProfile events.

2.3. Generic kernel driver

This lives in drivers/oprofile/, and forms the core of how OProfile works in the kernel. Its job is to take samples delivered from the architecture-specific code (via oprofile_add_sample()), and buffer this data, in a transformed form as described later, until it is released to the userspace daemon via the /dev/oprofile/buffer character device.

2.4. The OProfile daemon

The OProfile userspace daemon's job is to take the raw data provided by the kernel and write it to disk. It takes the single data stream from the kernel and logs sample data against a number of sample files (found in /var/lib/oprofile/samples/current/). For the benefit of the "separate" functionality, the names/paths of these sample files are mangled to reflect where the samples were from: this can include thread IDs, the binary file path, the event type used, and more.

After this final step from interrupt to disk file, the data is now persistent (that is, changes in the running of the system do not invalidate stored data). So the post-profiling tools can run on this data at any time (assuming the original binary files are still available and unchanged, naturally).

2.5. Post-profiling tools

So far, we've collected data, but we've yet to present it in a useful form to the user. This is the job of the post-profiling tools. In general form, they collate a subset of the available sample files, load and process each one correlated against the relevant binary file, and finally produce user-readable information.

Chapter 2. Performance counter management

1. Providing a user interface

The performance counter registers need programming in order to set the type of event to count, etc. OProfile uses a standard model across all CPUs for defining these events as follows:

event
        The event type, e.g. DATA_MEM_REFS
unit mask
        The sub-events to count (a more detailed specification)
counter
        The hardware counter(s) that can count this event
count
        The reset value (how many events before an interrupt)
kernel
        Whether the counter should increment when in kernel space
user
        Whether the counter should increment when in user space

The term "unit mask" is borrowed from the Intel architectures, and can further specify exactly when a counter is incremented (for example, cache-related events can be restricted to particular state transitions of the cache lines).

All of the available hardware events and their details are specified in the textual files in the events directory. The syntax of these files should be fairly obvious. The user specifies the names and configuration details of the chosen counters via opcontrol. These are then written to the kernel module (in numerical form) via /dev/oprofile/N/ where N is the physical hardware counter (some events can only be used on specific counters; OProfile hides these details from the user when possible). On IA64, the perfmon-based interface behaves somewhat differently, as described later.
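
For illustration, an entry in one of these event files looks roughly like the following (this particular line describes DATA_MEM_REFS for PPro-family CPUs; check the shipped events files for the authoritative syntax):

    event:0x43 counters:0,1 um:zero minimum:500 name:DATA_MEM_REFS : all memory references, cachable and non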

2. Programming the performance counter registers

We have described how the user interface fills in the desired configuration of the counters and transmits the information to the kernel. It is the job of the ->setup() method to actually program the performance counter registers. Clearly, the details of how this is done are architecture-specific; they are also model-specific on many architectures. For example, i386 provides methods for each model type that program the counter registers correctly (see the op_model_* files in arch/i386/oprofile for the details). The method reads the values stored in the virtual oprofilefs files and programs the registers appropriately, ready for starting the actual profiling session.
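
As a hedged sketch of what this programming amounts to on a P6-family (PPro/PII/PIII) CPU: the event, unit mask and privilege-level bits are packed into the event-select MSR, and the counter itself is loaded with the negated count so that it overflows after the requested number of events. The MSR numbers and bit positions follow Intel's documentation; the function and variable names are illustrative rather than copied from op_model_ppro.c:

    #define MSR_P6_PERFCTR0 0xc1
    #define MSR_P6_EVNTSEL0 0x186

    /* Sketch only: program one P6-family counter from the user's settings. */
    static void ppro_setup_counter(int i, unsigned int event, unsigned int unit_mask,
                                   unsigned long count, int kernel, int user)
    {
        unsigned int low = 0;

        low |= event & 0xff;               /* bits 0-7: event select */
        low |= (unit_mask & 0xff) << 8;    /* bits 8-15: unit mask */
        if (user)
            low |= 1 << 16;                /* USR: count in user mode */
        if (kernel)
            low |= 1 << 17;                /* OS: count in kernel mode */
        low |= 1 << 20;                    /* INT: raise an APIC interrupt on overflow */
        /* bit 22 (EN) is deliberately left clear here; ->start() sets it later */

        /* the counter counts up, so load -count to overflow after `count' events;
         * only the low 32 bits matter, the CPU sign-extends into the 40-bit counter */
        wrmsr(MSR_P6_PERFCTR0 + i, -(unsigned int)count, -1);
        wrmsr(MSR_P6_EVNTSEL0 + i, low, 0);
    }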

The architecture-specific drivers make sure to save the old register settings before doing OProfile setup. They are restored when OProfile shuts down. This is useful, for example, on i386, where the NMI watchdog uses the same performance counter registers as OProfile; they cannot run concurrently, but OProfile makes sure to restore the setup it found before it was running.

In addition to programming the counter registers themselves, other setup is often necessary. For example, on i386, the local APIC needs programming in order to make the counter's overflow interrupt appear as an NMI (non-maskable interrupt). This allows sampling (and therefore profiling) of regions where "normal" interrupts are masked, enabling more reliable profiles.

2.1. Starting and stopping the counters

Initiating a profiling session is done via writing an ASCII '1' to the file /dev/oprofile/enable. This sets up the core, and calls into the architecture-specific driver to actually enable each configured counter. Again, the details of how this is done are model-specific (for example, the Athlon models can disable or enable on a per-counter basis, unlike the PPro models).

2.2. IA64 and perfmon

The IA64 architecture provides a different interface from the other architectures, using the existing perfmon driver. Register programming is handled entirely in user-space (see daemon/opd_perfmon.c for the details). A process is forked for each CPU, which creates a perfmon context and sets the counter registers appropriately via the sys_perfmonctl interface. In addition, the actual initiation and termination of the profiling session is handled via the same interface using PFM_START and PFM_STOP. On IA64, then, there are no oprofilefs files for the performance counters, as the kernel driver does not program the registers itself.

Instead, the perfmon driver for OProfile simply registers with the OProfile core with an OProfile-specific UUID. During a profiling session, the perfmon core calls into the OProfile perfmon driver and samples are registered with the OProfile core itself as usual (with oprofile_add_sample()).

Chapter 3. Collecting and processing samples

1. Receiving interrupts

Naturally, how the overflow interrupts are received is specific to the hardware architecture, unless we are in "timer" mode, where the logging routine is called directly from the standard kernel timer interrupt handler.

On the i386 architecture, the local APIC is programmed such that when a counter overflows (that is, it receives an event that causes an integer overflow of the register value to zero), an NMI is generated. This calls into the general handler do_nmi(); because OProfile has registered itself as capable of handling NMI interrupts, this will call into the OProfile driver code in arch/i386/oprofile. Here, the saved PC value is extracted (the register set saved by the CPU at the time of the interrupt is available on the stack for inspection), and the counters are examined to find out which one generated the interrupt. Also determined is whether the system was inside kernel or user space at the time of the interrupt. These three pieces of information are then forwarded onto the OProfile core via oprofile_add_sample(). Finally, the counter values are reset to the chosen count value, to ensure another interrupt happens after another N events have occurred. Other architectures behave in a similar manner.
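
The following sketch condenses the per-interrupt work just described. It assumes the P6-family MSR layout used in the earlier setup sketch; the NR_COUNTERS constant and counter_count[] array are illustrative, and the exact oprofile_add_sample() signature differs between kernel versions, so treat the call below as indicative only:

    #define MSR_P6_PERFCTR0 0xc1

    /* Sketch only: handle a performance-counter NMI on a P6-family CPU. */
    static void nmi_check_counters(struct pt_regs *regs)
    {
        unsigned long pc = regs->eip;           /* saved PC at the time of the NMI */
        int is_kernel = !user_mode(regs);
        int i;

        for (i = 0; i < NR_COUNTERS; ++i) {
            unsigned int low, high;

            rdmsr(MSR_P6_PERFCTR0 + i, low, high);
            if (low & (1U << 31))               /* still negative: this one did not overflow */
                continue;

            /* hand the sample to the generic driver (signature is version-dependent) */
            oprofile_add_sample(pc, is_kernel, i, smp_processor_id());

            /* re-arm: overflow again after another counter_count[i] events */
            wrmsr(MSR_P6_PERFCTR0 + i, -(unsigned int)counter_count[i], -1);
        }
    }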

2. Core data structures

Before considering what happens when we log a sample, we shall divert for a moment and look at the general structure of the data collection system.

OProfile maintains a small buffer for storing the logged samples for each CPU on the system. Only this buffer is altered when we actually log a sample (remember, we may still be in an NMI context, so no locking is possible). The buffer is managed by a two-handed system; the "head" iterator dictates where the next sample data should be placed in the buffer. Of course, overflow of the buffer is possible, in which case the sample is discarded.

It is critical to remember that at this point, the PC value is an absolute value, and is therefore only meaningful in the context of which task it was logged against. Thus, these per-CPU buffers also maintain details of which task each logged sample is for, as described in the next section. In addition, we store whether the sample was in kernel space or user space (on some architectures and configurations, the address space is not sub-divided neatly at a specific PC value, so we must store this information).

As well as these small per-CPU buffers, we have a considerably larger single buffer. This holds the data that is eventually copied out into the OProfile daemon. On certain system events, the per-CPU buffers are processed and entered (in mutated form) into the main buffer, known in the source as the "event buffer". The "tail" iterator indicates the point from which the CPU buffer may be read, up to the position of the "head" iterator. This provides an entirely lock-free method for extracting data from the CPU buffers. This process is described in detail later in this chapter.

Figure 3.1. The per-CPU buffers and the event buffer
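
To make the above concrete, the per-CPU buffer looks approximately like the sketch below. The real definitions live in drivers/oprofile/cpu_buffer.h and differ in detail from one kernel version to the next:

    /* Approximate shape of the per-CPU sample buffer. */
    struct op_sample {
        unsigned long eip;                  /* PC value at interrupt time */
        unsigned long event;                /* counter number (or escape-code payload) */
    };

    struct oprofile_cpu_buffer {
        unsigned long head_pos;             /* written only by the interrupt handler */
        unsigned long tail_pos;             /* advanced only by the sync code */
        unsigned long buffer_size;
        struct task_struct *last_task;      /* last task a sample was logged for */
        int last_is_kernel;                 /* last kernel/user state logged */
        struct op_sample *buffer;
        unsigned long sample_lost_overflow; /* samples dropped because the buffer was full */
    };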

3. Logging a sample

As mentioned, the sample is logged into the buffer specific to the current CPU. The CPU buffer is a simple array of pairs of unsigned long values; for a sample, they hold the PC value and the counter number for the sample (the counter number is later translated back into the event type the counter was programmed to count).

In addition to logging the sample itself, we also log task switches. This is simply done by storing the address of the last task to log a sample on that CPU in a data structure, and writing a task switch entry into the buffer if the value of current has changed. Note that later we will directly de-reference this pointer; this imposes certain restrictions on when and how the CPU buffers need to be processed.

Finally, as mentioned, we log whether we have changed between kernel and userspace using a similar method. Both of these variables (last_task and last_is_kernel) are reset when the CPU buffer is read.
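
Putting the last two paragraphs together, the interrupt-time logging path is roughly the following. The add_* helpers are illustrative stand-ins for the escape-code handling in cpu_buffer.c:

    /* Sketch only: log one sample into the current CPU's buffer. */
    static void log_sample(struct oprofile_cpu_buffer *cpu_buf,
                           unsigned long pc, int is_kernel, unsigned long event)
    {
        /* record a kernel<->user transition before the sample it applies to */
        if (cpu_buf->last_is_kernel != is_kernel) {
            cpu_buf->last_is_kernel = is_kernel;
            add_kernel_switch(cpu_buf, is_kernel);
        }

        /* record a task switch; this pointer is de-referenced later at sync
           time, which is why the buffers must be flushed before exit() frees
           the task structure */
        if (cpu_buf->last_task != current) {
            cpu_buf->last_task = current;
            add_task_switch(cpu_buf, (unsigned long)current);
        }

        add_sample(cpu_buf, pc, event);     /* finally, the sample itself */
    }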

4. Synchronising the CPU buffers to the event buffer

At some point, we have to process the data in each CPU buffer and enter it into the main (event) buffer. The file buffer_sync.c contains the relevant code. We periodically (currently every HZ/4 jiffies) start the synchronisation process. In addition, we process the buffers on certain events, such as an application calling munmap(). This is particularly important for exit() - because the CPU buffers contain pointers to the task structure, if we don't process all the buffers before the task is actually destroyed and the task structure freed, then we could end up trying to dereference a bogus pointer in one of the CPU buffers.

We also add a notification when a kernel module is loaded; this is so that user-space can re-read /proc/modules to determine the load addresses of kernel module text sections. Without this notification, samples for a newly-loaded module could get lost or be attributed to the wrong module.

The synchronisation itself works in the following manner: first, mutual exclusion on the event buffer is taken. Remember, we do not need to do that for each CPU buffer, as we only read from the tail iterator (interrupts might be arriving at the same buffer, but they will write at the position of the head iterator, leaving the entries we are reading intact). Then, we process each CPU buffer in turn. A CPU switch notification is added to the buffer first (for --separate=cpu support). Then the processing of the actual data starts.

As mentioned, the CPU buffer consists of task switch entries and the actual samples. When the routine sync_buffer() sees a task switch, the process ID and process group ID are recorded into the event buffer, along with a dcookie (see below) identifying the application binary (e.g. /bin/bash). The mmap_sem for the task is then taken, to allow safe iteration across the task's list of mapped areas. Each sample is then processed as described in the next section.

After a buffer has been read, the tail iterator is updated to reflect how much of the buffer was processed. Note that when we determined how much data there was to read in the CPU buffer, we also called cpu_buffer_reset() to reset last_task and last_is_kernel, as we've already mentioned. During the processing, more samples may have been arriving in the CPU buffer; this is OK because we are careful to only update the tail iterator to how much we actually read - on the next buffer synchronisation, we will start again from that point.
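
A very condensed sketch of this per-CPU pass is given below. The entry classification and the helpers are illustrative (the real sync_buffer() uses escape codes and does rather more work per sample); the point is that only the tail iterator is advanced, and only after the entries have been consumed:

    /* Sketch only: drain one CPU buffer into the event buffer. */
    static void sync_one_cpu_buffer(struct oprofile_cpu_buffer *cpu_buf)
    {
        struct task_struct *task = NULL;
        int in_kernel = 0;
        unsigned long avail, i;

        avail = entries_available(cpu_buf);     /* head - tail; illustrative helper */
        cpu_buffer_reset(cpu_buf);              /* clears last_task / last_is_kernel */

        for (i = 0; i < avail; ++i) {
            struct op_sample *s = entry_at(cpu_buf, i);     /* illustrative helper */

            if (is_task_switch(s))
                task = (struct task_struct *)s->event;
            else if (is_kernel_switch(s))
                in_kernel = s->event;
            else
                add_to_event_buffer(s, task, in_kernel);
        }

        advance_tail(cpu_buf, avail);           /* only now is the space reusable */
    }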

5. Identifying binary images

In order to produce useful profiles, we need to be able to associate a particular PC value sample with an actual ELF binary on the disk. This leaves us with the problem of how to export this information to user-space. We create unique IDs that identify a particular directory entry (dentry), and write those IDs into the event buffer. Later on, the user-space daemon can call the lookup_dcookie system call, which looks up the ID and fills in the full path of the binary image in the buffer user-space passes in. These IDs are maintained by the code in fs/dcookies.c; the cache lasts for as long as the daemon has the event buffer open.
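
On the user-space side, resolving a cookie looks roughly like the sketch below. There is no glibc wrapper for lookup_dcookie, so it is invoked via syscall(); how the 64-bit cookie argument is passed can differ between architectures, and the error handling here is deliberately simplified. The daemon then uses the returned path when deciding which sample file the data belongs to:

    #include <sys/syscall.h>
    #include <unistd.h>

    /* Sketch only: turn a dcookie from the event buffer into a path name. */
    static int resolve_cookie(unsigned long long cookie, char *path, size_t len)
    {
        int ret = syscall(SYS_lookup_dcookie, cookie, path, len);

        if (ret >= 0 && (size_t)ret < len)
            path[ret] = '\0';       /* terminate defensively */
        return ret;
    }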

6. Finding a sample's binary image and offset

We haven't yet described how we process the absolute PC value into something usable by the user-space daemon. When we find a sample entered into the CPU buffer, we traverse the list of mappings for the task (remember, we will have seen a task switch earlier, so we know which task's lists to look at). When a mapping is found that contains the PC value, we look up the mapped file's dentry in the dcookie cache. This gives the dcookie ID that will uniquely identify the mapped file. Then we alter the absolute value such that it is an offset from the start of the file being mapped (the mapping need not start at the start of the actual file, so we have to take the offset of the mapping into account). These two values are then entered into the event buffer. In this manner, we have converted a PC value, which has transitory meaning only, into a static dcookie/offset tuple for later processing by the daemon.

We also attempt to avoid the relatively expensive lookup of the dentry cookie value by storing the cookie value directly into the dentry itself; then we can simply derive the cookie value immediately when we find the correct mapping.
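
The shape of that conversion, as performed during synchronisation, is roughly as follows; the real code is in buffer_sync.c, and the dcookie helper name here is illustrative:

    /* Sketch only: map an absolute PC to a (dcookie, file offset) pair. */
    static int pc_to_dcookie_offset(struct mm_struct *mm, unsigned long pc,
                                    unsigned long *cookie, unsigned long *offset)
    {
        struct vm_area_struct *vma;

        for (vma = mm->mmap; vma; vma = vma->vm_next) {
            if (pc < vma->vm_start || pc >= vma->vm_end)
                continue;

            if (!vma->vm_file)
                return -1;              /* anonymous mapping: no backing image */

            /* unique ID for the mapped file's dentry */
            *cookie = get_dcookie_for(vma->vm_file);    /* illustrative helper */

            /* offset within the file, not within the mapping */
            *offset = (vma->vm_pgoff << PAGE_SHIFT) + pc - vma->vm_start;
            return 0;
        }

        return -1;                      /* PC not covered by any mapping */
    }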

Chapter 4. Generating sample files

1. Processing the buffer

2. Locating and writing sample files

Chapter 5. Generating useful output

1. Handling the profile specification

2. Collating the candidate sample files

3. Generating profile data

4. Generating output

Glossary of OProfile source concepts and types

application image

The primary binary image used by an application. This is derived from the kernel and corresponds to the binary started upon running an application: for example, /bin/bash.

binary image

An ELF file containing executable code: this includes kernel modules, the kernel itself (a.k.a. vmlinux), shared libraries, and application binaries.

dcookie

Short for "dentry cookie". A unique ID that can be looked up to provide the full path name of a binary image.

dependent image

A binary image that is dependent upon an application, used with per-application separation. Most commonly, shared libraries. For example, if /bin/bash is running and we take some samples inside the C library itself due to bash calling library code, then the image /lib/libc.so would be dependent upon /bin/bash.

merging

This refers to the ability to merge several distinct sample files into one set of data at runtime, in the post-profiling tools. For example, per-thread sample files can be merged into one set of data, because they are compatible (i.e. the aggregation of the data is meaningful), but it's not possible to merge sample files for two different events, because there would be no useful meaning to the results.

profile class

A collection of profile data that has been collected under the same class template. For example, if we're using opreport to show results after profiling with two performance counters enabled, counting DATA_MEM_REFS and CPU_CLK_UNHALTED, there would be two profile classes, one for each event. Or if we're on an SMP system and doing per-cpu profiling, and we request opreport to show results for each CPU side-by-side, there would be a profile class for each CPU.

profile specification

The parameters the user passes to the post-profiling tools that limit what sample files are used. This specification is matched against the available sample files to generate a selection of profile data.

profile template

The parameters that define what goes in a particular profile class. This includes a symbolic name (e.g. "cpu:1") and the code-usable equivalent.