An SMP Implementation - A Retrospective

James M. Flemming
Christopher A. Beute
October 31, 1991

The following are trademarks of Digital Equipment Corporation: DEC, VAX, and UNIBUS

ABSTRACT

This paper discusses a multiprocessing implementation of UNIX System V Release 2.2 on VAX processors. It includes a discussion of goals, development constraints, and development order. Debugging tools and techniques are discussed, along with a framework for performance analysis.

INTRODUCTION

This paper discusses an actual implementation of a symmetric multiprocessing (SMP) system utilizing UNIX (® UNIX is a registered trademark of UNIX System Laboratories, Inc.) System V Release 2.2 as a code base. It covers many of the development constraints, their influence on design goals, tradeoffs in favor of constraints versus goals, their impact on design decisions, and the actual implementation including the development strategy, debugging techniques, and performance evaluation and enhancements. The paper itself is organized along the lines of a discussion of many of the points that were considered in the design process, followed by a discussion of the actual System V implementation on Digital Equipment Corporation VAX multiprocessor systems.

DEVELOPMENT CONSTRAINTS

Like most projects done in a vendor's engineering environment, which has as its foremost goal product development, as opposed to a pure research and development activity, there were a number of constraints placed on the design. Because of the geographical location of the expertise most likely to succeed with the project, the development effort was to be distributed. The team was composed of two people located in New Jersey with extensive System V internals experience, and two people located in Massachusetts who had previous experience doing an SMP implementation based on another operating system. For a number of reasons, it was impossible to co-locate these people for the duration of the project.

It was essential from the onset of the project to acquire or develop tools which would enable distributed development. A few of the tools that made this possible included Digital's internal DECnet network, electronic mail, VAXnotes (an online conferencing package), uucp, sccs, telephone conferences, and airlines (as required when debugging code after integration). This aspect of the project also had a major influence on the development strategy chosen, which will be discussed later.

The first and foremost development constraint placed upon the team by management was that time to market was key. Some of the implications of this requirement were to de-emphasize portability, and to take advantage of the VAX architecture [1], particularly those aspects of it that would simplify the implementation of an SMP operating system. This resulted in the following assumptions about the characteristics of the hardware on which the kernel would run:

A single, shared physical memory visible from all processors.
Hardware maintained cache coherency.
A set of hardware interlocked instructions (atomic with respect to multiprocessor access) including bit set/clear/test variants.
An interrupt stack per processor (used to implement one idle loop rather than requiring an idle process context per CPU).
An interprocessor interrupt capability (used to request other CPUs to perform services upon request, e.g., clearing a translation lookaside buffer or TLB)
An I/O structure which allowed all CPUs access to every I/O device on the system.

Some of the above were taken advantage of to reduce the complexities of the implementation, while others were simply considered necessary to meet performance requirements. There were also a set of constraints placed on the design by the engineering group itself to meet various perceived customer needs and to ensure that the product was maintainable and upgradeable. These included:

Changes to the existing code base must be minimized, localized, and modular.
It had to be possible to build both uniprocessor and multiprocessor kernels from the same sources (actually, just by modifying a configuration file and rebuilding a kernel).
The effects of the presence of the code to support SMP in uniprocessor kernels must have a minimal impact on uniprocessor performance.
User written device drivers must work in an SMP environment without modifications (although modifications might be desirable to maximize performance).
All existing user mode applications, if properly coded, must work without modification in an SMP environment (there are VAX architectural considerations implicit in this statement).

Finally, there was a third set of goals which needed to be pervasive throughout the design, implementation, and future enhancements of the project to ensure the system's success and longevity as a product. They were:

Scalability - it was unclear what this meant at the beginning of the project, but one goal chosen was to design support for up to 16 processors.
Performance - the goal was a dual processor system that would deliver 1.6 times the performance of a uniprocessor for interactive workloads.
Easy to port to other VAX models - this turned out to be more or less dictated by the distributed nature of the project, since the same VAX hardware models weren't available at both development sites.

DEVELOPMENT STRATEGY

The approach chosen was to implement master/slave or asymmetric multiprocessing (ASMP) as the first step in the development process. Briefly, a master/slave or ASMP system can be characterized as follows:

In master/slave multiprocessing, there are two or more CPUs. The master is the general-purpose CPU: it does both computation and all system input/output (I/O). The slaves have no I/O devices except perhaps a console terminal and are therefore dedicated to computation.
In this master/slave organization, the slaves do not take orders from the master. All CPUs execute the scheduling routine to find a process to run; there is code to prevent multiple CPUs from selecting the same process. Processes are executed by the master just as in a uniprocessor configuration. The slaves are the same for user-mode computations.
When a process running on a slave executes a system call, however, the slave cannot proceed. It marks the process as needing the master's attention, then enters the scheduler and selects another process to run. The master, in the meantime, is working on other processes, and the next time it makes a scheduling pass, it will find and run processes marked as master-service-required by the slaves.

Thus, slaves are slaves by virtue of the fact that they cannot perform all system duties and must rely on the master for most services. If there are enough compute-bound processes available, a master/slave system can provide acceptable performance. As long as processes make few master service requests, the slaves can be effectively utilized.

There were many advantages to taking this initial approach. The key advantage was that once an ASMP system was working, it became very easy to partition the large task of making it symmetric into many smaller independent tasks (at least for non-I/O system calls), which was very important in a distributed development environment. Some of the other advantages included:

Since non-I/O system calls could be made reentrant and SMP safe one at a time, unit testing is easy.
With the proper discipline, a complete set of system call unit test programs can result (one for each system call, developed as part of the task of making the system call capable of being executed concurrently by multiple CPUs).
It is possible to use/test software on the development system, since the development methodology will tend to reduce major instabilities. This means that more compute power is available sooner for compute bound tasks that go on during development, such as test compiles, kernel builds, etc.
It is possible to stop development at any time, test, package, and have a product with well-defined characteristics and a target market.

For a complete SMP implementation, the entire operating system must be reentrant or single-threaded as required. Going from ASMP to SMP allows re-entrancy to be implemented only as required by the task at hand. Symmetry is then accomplished by identifying critical sections, interlocking them, and just moving the single threading back another layer.

ASMP Implementation

In reality, any multiprocessing (MP) system is never truly symmetric. There are always functions that only make sense to execute on one CPU in the complex. It should, of course, be possible for some other CPU to take over those duties should the one which has taken on these tasks fail. This CPU is called the policy or primary CPU. The CPU on which System V is originally booted becomes the policy CPU automatically, and thus is also referred to as the boot CPU. The non-policy CPUs are sometime called secondary CPUs. Thus, the terms boot, policy, primary, and in the ASMP case, master CPU are used interchangeably. Likewise, non-boot, non-policy, secondary, and in the ASMP case, slave are also used interchangeably.

In SMP, though all CPUs are equal in that functionally they are able to handle both computation and I/O, certain activities are allocated only to the policy CPU. For example, the policy CPU maintains the system date and time-of-day, system up-time, and is responsible for coordinating I/O to the console terminal.

A useful question to be able to ask in the code is "is the CPU executing this code path the policy CPU?" Based on the answer, the code can proceed differently. In an ASMP system, the policy CPU also is the master CPU. When a process enters kernel mode as a result of executing a system call, mark the process as requiring master service. Next, the code can ask, "is the current CPU the policy CPU?" and if the answer is yes, then proceed. If the answer is no, call the scheduler to find another process to run.

MP Database

In a uniprocessor system, there exists both global system data and per-process data. In an MP system, there also exists per-processor data to enable each CPU to keep track of what it is doing. Many of these data items already exist in the uniprocessor database, but in an MP system, there must be a copy per-processor rather than a copy per system. An example of global system data that is relevant only in an MP environment is the cell bootcpu, which keeps track of the identity of the boot CPU.

Per-processor data is maintained in a structure called the CPU data block (cdb, see Figure 1). The cdb contains, among other things, the interrupt stack for the current CPU (with its associated guard pages to detect stack underflow and overflow), uptime, idle time, lock level (see the section on deadlocks), scheduler flags, scheduling variables such as the current process address and its priority, and counters for statistics collected on a per-processor basis. Examples of counters include the number of context switches executed on this CPU and the number of system calls executed on this CPU.

The cdb contains per-processor temporaries used for such things as power failure processing and recovery. The cdb also contains cells used for communicating information to other CPUs or the policy CPU. An example of this are the cells used for coordinating panic processing for failures which occur on non-policy CPUs. Another example of communication cells is the mechanism used for implementing kernel mode printf() calls on non-policy CPUs.

Depending on the physical connection of the console terminal(s) to the system, it might only be possible for one of the CPUs to do actual console I/O. Even if that is not the case, it would not be useful for several CPUs to decide to attempt I/O to the console terminal simultaneously, since the output would be intermixed and garbled.

To avoid this, when non-policy CPUs make calls to the printf() function, instead of outputting the text directly to the console terminal, it is buffered in the cdb and a flag is set to indicate that there exists some printf text to be output. Whenever the boot CPU's clock interrupt routine executes, it checks to see if the flag indicating output pending is set and if true, the routine searches for a cdb that has printf data in its printf buffer and proceeds to output it to the console terminal.

Figure 1: Cpu Data Block (cdb)
This file was created by the VAX DOCUMENT Graphics Rendering Utility with the following commandline DOCUMENT/GRAPHICS=RENDER RCKP_BOX_1.GRA/TYPE=(PS)/OUTPUT=FIGURE_1.SDML/NOBACKGROUND/SDML=FIGURE [FIGURE_FILE](PS\FIGURE_1.PS\44.2)

**Contents of Cpu Data Block**
Symbol	Usage
cdb_cpuid	Unique numeric CPU identifier, i.e., 0, 1, 2...n.
cdb_sysid	System identifier, e.g., CPU serial number.
cdb_cpubit	Bit corresponding to this CPU's number, 1<<cdb_cpuid.
cdb_owned	Bit 0 set if this cdb belongs to a CPU.
cdb_state	Current cdb state, BOOT, RUN, PAUSED, DEAD, PRIMARY.
cdb_istack	Virtual address of CPU's interrupt stack.
cdb_next	Pointer to next cdb in the system.
cdb_lcklevel	Current lock level - deadlock avoidance.
cdb_locks	Linked list of locks owned by process currently executing on this CPU.
cdb_rsched	Scheduler is running on this CPU.
cdb_switching	Context switching on this CPU.
cdb_idleflag	Bit 0 set if idle loop running on this CPU.
cdb_curpri	Current process priority.
cdb_curproc	Pointer to current process' process data block.
cdb_ka_cpu	Pointer to cdb to watch.
cdb_ka_uptime	Number of ticks since this CPU started.
cdb_ka_keepalivecnt	Timer for keepalive flush.
cdb_ka_checkcpu	Boolean, check keepalive every other time.
cdb_timepile	Time pile shavings.
cdb_one_sec	Number of ticks left in this second.
cdb_lticks	Number of ticks left in current process' quantum.
cdb_uptime	Number of ticks since this CPU was started.
cdb_pswitch	Number of context switches on this CPU.
cdb_syscall	Number of system calls executed on this CPU.
cdb_syscallswitch	Number of context switches required to force a system call to run on a specified CPU.
cdb_affinswitch	Number of context switches required by CPU affinity.
cdb_prbuf	Pointer to this CPU's printf buffer.
cdb_pbfptr	Putter index into this CPU's printf buffer.
cdb_pbftkr	Taker index into this CPU's printf buffer.
cdb_pfcnt	Count of the number of power fails seen by this CPU.
cdb_pftime	System uptime since the last power fail occured on this CPU.
cdb_pcbb	Powerfail storage - process control block base.
cdb_sisr	Powerfail storage - software interrupt summary.
cdb_isp	Powerfail storage - interrupt stack pointer.
cdb_dump_fill[CDB_DUMP_SZ]	sp, pcbb, mapen, ipl, r0-r13.
cdb_dump_region	Used to fill in cdb_dump_fill.
cdb_dump_pc	Crash PC.
cdb_fill[4]	Reserved for future growth, patching.

Who Am I

When entering kernel mode due to an interrupt, system call, page fault, or exception, or after an explicit call to the scheduler, it is necessary to figure out the identity of the CPU the code is currently executing on, i.e., find the address of the cdb belonging to the current CPU. This function (in the kernel called whoami) is implemented in a hardware architecture dependent fashion. Even on VAXs, the actual implementation is significantly different between various models. At CPU startup time, it is necessary for a CPU to be able to uniquely identify itself. The whoami() procedure is then implemented using the lowest possible overhead method based upon that identification.

For example, a very bad implementation from a performance point of view, but one that is easy to understand for the purposes of illustration, is to suppose that each CPU has a unique serial number (SN). At initialization time, the CPU would search for a free cdb (cdbs are allocated, initialized, and linked together by the policy CPU when it starts up). When it finds a free one, the cdb is marked as in use and a cell in the cdb is initialized to its SN. Afterwards, anytime the CPU entered kernel mode and needed the address of its cdb, it would read its serial number and chain down the list of cdbs, comparing its SN to the one stored in the cdb until a match is found. Obviously, the overhead of this scheme would be prohibitive, but it illustrates the concept.

In reality, the System V SMP implementation allocates a cdb as described and places the address in a read/write VAX architectural register that System V ignores. The address of the cdb can be acquired by a simple read from processor register. So, an example of the usage of whoami to address a cdb might be:

   if (((c = whoami())->cdb_cpuid) == bootcpu) {
      ...;    /* here to do boot CPU things */
   }

Ordinarily, in an ASMP system, the only distinction that would normally be used in decision making would be whether the thread is executing on either the policy or a non-policy CPU. But, it is certainly possible to envision tasks which might be designated for a particular CPU. An example of this might be timer callout table processing.

A field could be set aside in each entry in this table that would contain either the CPU number of the processor that should process the request or a generic CPU (probably the boot CPU) specified if the function could be processed anywhere. Each CPU could then examine the callout table and process any entries reserved for it, in addition to any entries the could be processed anywhere.

Scheduling

The System V scheduler runs both periodically (at each clock tick) and when the currently executing process is finished or temporarily unable to continue (waiting for I/O completion, etc.). If the scheduler selects the same process to run that was running during the previous interval, the process can be resumed immediately. Otherwise, a context switch is necessary. Context switching entails saving the status (context) of the former process, then setting up and starting the selected process. Context switching can take from several microseconds to several milliseconds, depending on hardware characteristics and the amount of context to save and restore.

In SMP, multiple CPUs execute the scheduling routine to find processes to run, which typically results in the same process being run at different times by different CPUs throughout the course of its existence. It is possible, however, to assign an individual process to execute exclusively on a particular CPU. This is known as CPU affinity, and is useful for running user-mode diagnostics on a suspect CPU, servicing a real-time device connected to a particular CPU, or implementing ASMP.

In general, the scheduling technique of multiple CPUs working on a single queue of processes is efficient. Queuing theory shows that multiple servers working from a single queue give better response than multiple servers and multiple queues, which would be the case with multiple independent systems where each CPU has its own copy of the operating system and thus its own scheduler and scheduler queue. With this architecture, SMP offers automatic and dynamic load balancing.

Idle Loop, Scheduler and Idle Context

When a CPU starts up and finishes initialization, it calls the scheduler; if there are no processes to run, it begins to run the idle process. From the point of view of the rest of the system, there is little difference between the idle process and any other process. At a minimum the idle process consists of a stack, even though it may not have a unique address space or u-area. In uniprocessor System V systems, this was implemented by using the stack of the last process to be run (after its context was saved) as the stack for the idle process and for the scheduler.

This method wouldn't work in an MP environment, since the last process which ran could become computable, scheduled, and run on some other CPU. This would result in the same stack being used for two different purposes, i.e., for the scheduler and the idle process on one CPU, and for the process scheduled and running on another CPU.

One solution to this problem would be to have an idle context (or the stack for the idle process) for each CPU. Since the VAX architecture requires a stack for interrupt context, it was decided to save the space and the potential complexity of having a "real" idle context for the idle process. Instead, the idle process would run in interrupt context using the VAX interrupt stack.

In this implementation, the VAX System V scheduler and idle process runs in interrupt context (at the lowest VAX software priority interrupt level). The idle loop itself is basically a test and branch to self VAX instruction at a well-known location, checking the state of a scheduler flag (see the section on performance for more information).

Starting Slave CPUs

Per-processor initialization and startup activities have been discussed, but a mechanism is necessary to get individual CPUs started at predetermined start addresses. Starting the boot CPU is accomplished in the same fashion as it would be in a uniprocessor system, i.e., starting the operating system at a specified address is the last operation carried out by the bootstrap program.

Starting the slave processors is hardware implementation dependent and varies significantly between the models within the VAX family, due to variations in console subsystems. What the various models do have in common is the ability for the other CPUs (the slaves or secondaries) to be started by actions of the boot CPU once it is functioning. The process of starting the secondary CPUs was implemented as a function of an ioctl() system call issued to a newly developed cpu pseudo-device. The cpu pseudo-device service routine also implemented ioctl() functions to stop secondary CPUs and to return information about all known CPUs. A new management program called cpucntl was created to serve as the interface to the cpu pseudo-device.

To get the secondary CPUs running, the super user would run the cpucntl program (normally as part of the system startup procedure scripts) that issued an ioctl() function specifying the CPU number of the processor to be started. This procedure would be repeated for each of the secondary CPUs. Later, this was simplified by adding a facility to start all known secondary processors. Once a CPU was started successfully, it would issue a message to the console terminal reporting that fact.

It was interesting to note that it took several days to get a secondary CPU from a cold, standing start to the point where it successfully executed the idle loop. This was partly due to poor hardware documentation, and the amount of work to get an uninitialized CPU into a state consistent with one that has gone through a complete system bootstrap.

CPU Affinity

A CPU affinity mechanism is necessary to tie the scheduling of processes to a desired CPU. There are three types of CPU affinity implemented in System V. The first two are hard affinities in the sense that when a process is bound to a CPU, it cannot run on any other CPU until the affinity is broken. The third type is a soft affinity, expressing a preference for a CPU on which the process can run.

The first type is a long term affinity and is established by a process via a system call or shell command. Its purpose is to ensure that a given process runs only on a specified CPU. It would be used, for example, to run CPU diagnostics when the user wants the execution to be locked to a single processor and not permitted to migrate freely among multiple CPUs. Another example of usage would be a system that contains a real-time device accessible from only one CPU in the MP complex. The user would want to bind a real-time program that serviced the device to the CPU from which the device was accessible.
The second type is a short term affinity which is established when a a process begins executing in kernel mode and is later relinquished when the process is about to enter user mode. This type of affinity was the basis for the ASMP implementation and will be discussed further in the section on ASMP Scheduling.
The third type of affinity is advisory only and does not cause a hard binding to a CPU. It can be thought of as signifying a preferred CPU upon which to run a process and is used to enhance the system/process performance. It is almost always preferable to run a process on the last CPU to run it since this increases the possibility that some of the process data still resides in that CPU's cache. In some hardware architectures, some of the process' address translation data might also still be contained in that CPU's translation buffer.

In general, performance is likely to be enhanced if a factor that goes into the process selection criteria used by the scheduler is to pick a process that last ran on the current CPU. Note that factoring this into the selection criteria will almost certainly add to scheduling overhead, and could possibly be unfair if the highest priority process is not chosen to run in favor of another choice made for performance reasons. In fact, even without the above, MP scheduling greatly increases the likelihood that at any given point in time the highest priority eligible process will not be running.

The hard CPU affinities described above were implemented by adding three new fields to the process data structure. Each affinity type was defined as a bit mask, signifying the eligibility of a CPU to run this process. When the scheduler scans the queue of runnable processes, it checks each of the affinity cells in the process structure to see if the process might be eligible to run on the current CPU. The absence of a bit corresponding to the current CPU in any of the affinity cells makes the process ineligible to run on that CPU. The fork() procedure was enhanced to cause a child process to inherit the affinity masks of the parent.

ASMP Scheduling

The most straightforward possible implementation of ASMP was created by modifying the code handling processor traps to the kernel. Whenever a process running on a slave processor entered kernel mode, it would be marked as waiting for master service and the slave would call the scheduler to find another process to run. This was implemented in System V by using the second form of CPU affinity described above.

Two procedures were developed to bind/unbind a process to/from a CPU. They were oncpun(id), where id is the logical CPU number to which the process is to be bound, and resetcpu(), which reverses the action of oncpun(). Thus, whenever a process page faulted or encountered an exception, there would be an immediate call to oncpun(bootcpu), which would mark the process as runnable only on the boot CPU and, if the current CPU was not the boot CPU, would call the scheduler. At the end of page fault or exception processing, a call would be made to resetcpu() making the process runnable on all CPUs before returning to user mode.

System Calls

The above mechanism would work as well for system calls. However, many system calls are so trivial that it would be senseless to incur the potential overhead of two context switches (away from the slave and onto the master) just to return a value such as the current process id (pid).

A new field was added to the data structure used for dispatching system calls. The new field was a boolean value that, if true, allowed the system call to be executable on all CPUs. Thus, as a part of the system call dispatch, if the boolean was false for the specific system call executed by the process, the oncpun(bootcpu) procedure was called to bind the process to the boot CPU until a corresponding resetcpu() call was executed (usually, at the end of the system call processing just before returning to user mode).

This was the mechanism used to go from ASMP to SMP--gradually. As system calls were made reentrant and SMP safe, the boolean was added to the data structure entry for that particular system call. As the development work on a specific system call was in progress, the boolean was added and the oncpun(bootcpu) inserted into the body of the code wherever it was necessary to single-thread the execution of the system call because it was not yet reentrant or SMP safe beyond that point.

ASMP Summary

As part of the boot CPU's initialization, it allocates, links together, and sets up static data in the cdbs (as many as the configuration data specifies as a maximum). When each slave CPU initializes, it locates a free cdb and fills in fields in the cdb such as the CPU serial number. Further work is then done to initialize processor registers so that the whoami() function will properly return the address of that CPU's cdb.

After a CPU is started and has finished its initialization, it calls the scheduler and attempts to find a process to run. If there are none or none are executable on that CPU, the CPU runs the idle process. There is a mechanism which allows processes to be marked as runnable only on particular CPUs. Using this mechanism to mark processes as only runnable on the boot CPU whenever they enter kernel mode allows the implementation of ASMP.

Note that all of the mechanisms described above are not only useful, but necessary to implement ASMP. SMP is then implemented from this base by simply making the code reentrant and SMP safe, and by removing the forced CPU rescheduling and affinities associated with processes executing in kernel mode.

INTERLOCKS

When multiple CPUs are executing the same code, interlocks are necessary to guarantee exclusive access to critical sections of the code. A critical section is any sequence of instructions which must be executed atomically, that is, as an uninterrupted unit. A section of code is critical only if global data are being changed, and could result in an inconsistency if more than one processor tries to change the same data simultaneously.

As an example, assume that the system maintains a linked-list of free memory pages. If two CPUs each need to allocate a page of memory, both will consult the free-list to find an available page. Without an interlock, both CPUs could determine that the first page in the list is free and both allocate the page for themselves. An interlock ensures that such accesses are correctly coordinated (single-threaded).

The lowest level where atomic execution is required is at the machine instruction-set level. Certain instructions that read data from memory, modify it, and write the modified data back to memory must be uninterruptible, that is, the read/modify/write cycle must be atomic. Given such instructions, they can be used to lock and unlock critical sections of code.

Two mechanisms using test-and-set instructions are used in VAX System V for synchronization. The first is called a spin lock and is described in greater detail below. The other SMP lock/unlock mechanism is an implementation of semaphores, with appropriate queuing of requests when a resource is busy.

Granularity of locking is important -- locking large sections of code means that other CPUs will generally have a better chance to collide for access to a resource. Therefore the smaller a section of code that is locked, the more likely that access contention for this section is reduced and thus the better overall system performance. In SMP, the code was instrumented for performance measurement to ensure proper locking granularity.

SMP Locks: A Closer Look

In a uniprocessor configuration, the processor's interrupt system can be used as a synchronization mechanism: interrupts can be turned off -- as a group or selectively -- to prevent routines at other interrupt levels from modifying shared data.

Many of the synchronization mechanisms employed in uniprocessor systems simply require generalization to provide synchronization in a multiprocessor system. However, caution must be exercised since locks that are not a bottleneck in uniprocessor systems may cause serious performance problems in a multiprocessing environment (perhaps due to lock granularity). A few examples of locks that normally exist in a uniprocessor operating system and how they can be generalized for multiprocessing will serve to illustrate SMP synchronization techniques.

Before describing these mechanisms, it is necessary to introduce a type of lock not usually encountered in a uniprocessor system. This lock is known as a spin lock. It is normally obtained and held only for a short time relative to a context switch (or when a context switch is not possible). A spin lock is implemented as follows:

Using a read/modify/write instruction, attempt to test-and-set a memory location (a lock) associated with the data/critical section (see below) to be accessed.
If the test indicates that the set succeeded, proceed to access/modify the data associated with the lock.
If the test indicates that the set failed, that is, the lock is already owned by another processor, loop back and execute the test-and-set until the set succeeds and the lock is obtained.

This lock is known as a spin lock because the processor "spins" in a tight loop waiting to acquire the lock.

Since the processor owning a lock should spend little time in the critical section, a spin lock should not be too expensive in lost time for a CPU that has to wait for the lock. Unfortunately, however, a test-and-set instruction may consume significant memory bandwidth since the read/modify/write operation locks out all other processors (including I/O channels) until it completes. A useful variation of the spin lock mechanism is therefore to first simply test the lock; if it is potentially available (not currently set), then try to obtain it using the read/modify/write test-and-set instruction. Thus the loop becomes:

Test the lock for potential availability. If not available, loop.
If potentially available, try to obtain it using the test-and-set instruction.
If obtained, proceed.
If not obtained, loop back to the potential-availability test.

This modification eliminates the read/modify/write instruction from the main loop and thus substantially reduces the demand on memory bandwidth.

Locking and spl() Functions

The most common, and usually most easily recognized, form of synchronization that must occur in uniprocessor systems is between process level and interrupt level when dealing with a device and its associated data structures. This is normally achieved by disabling interrupts from a device when dealing with device-specific data at process level which could be modified at interrupt level should a device interrupt occur. This usually takes the form of the following sequence of instructions:

block interrupts from the device - usually, s = spln(); where n is the level that the device interrupts at
...
test and perhaps modify device data/state
...
start i/o (if appropriate)
...
ensure device data is in a consistent state
...
unblock interrupts from the device - usually splx(s) where s is the value returned from the spln() call above

In a multiprocessing environment, the above sequence is insufficient since the device must be prevented from interrupting any processor. This can be achieved within the above instruction sequence by introducing a spin lock. The lock must be obtained after interrupts are blocked and released before interrupts are unblocked; furthermore, the interrupt-level processing code must also use the same spin lock before executing interrupt-level code associated with the device.

This was implemented in System V by having the common entry point for each interrupt level acquire the spin lock associated with that interrupt level, and having the common interrupt exit routine associated with the interrupt level relinquish the spin lock. Note that by using these mechanisms (had they worked well), the goal of minimizing source code changes is nicely met, since the only modifications required for the implementation is the code necessary to acquire/relinquish a spin lock in the spl() routines and the common interrupt entry/exit routines.

In summary, blocking interrupts will keep the interrupt-level code from being executed on the current processor while the device data is being modified, and the spin lock will keep the interrupt-level code from being executed on any other processor while the device data is inconsistent.

The introduction of spin-lock synchronization in this situation is simply a generalization of the synchronization that is already necessary in a single processor system. Thus, a first step in the implementation of a multiprocessor system is to identify those places where the interrupt system is used to disallow conflicting execution of code paths, and to augment these with spin locks to prevent concurrent execution of the code paths by multiple CPUs.

Critical Sections

CPU scheduling is another area where spin locks can be applied. Locking could be implemented implicitly rather than explicitly by not allowing a processor to be rescheduled while in privileged mode, except as the direct result of a call to the CPU scheduler. If it happens to be true that the operating system can reschedule at any time, the locks necessary to guarantee data consistency will already exist, and all that is required is that the locks be generalized for multiprocessing. However, as is often the case, and was the case in System V, data consistency on a single CPU system was maintained by not allowing the CPU to be rescheduled at any arbitrary point when it was executing in privileged mode; instead, it required an explicit call to the scheduler when data structures were consistent and rescheduling was proper.

Allowing multiple CPUs concurrent execution in privileged mode requires that those areas where data is inconsistent be identified and locked to bracket those critical sections. Fortunately, this turns out to be simpler than it might initially appear. Since a process cannot be run simultaneously on multiple processors (this implementation did not support user mode threads), internal process data need only be consistent when rescheduling, and thus individual process data need not be interlocked. Also, much data is local to an individual processor, and so such per-processor data also require no interlock. The only data that must be interlocked are true system-global data. An excellent example of such data is the scheduler database - scanning, adding processes to, and removing processes from the run-queues must be single-threaded and is therefore protected in MP by a scheduler spin lock.

Many other data structures fall into this category and usually include system-wide linked-list manipulation, changing process state with respect to the scheduler when the process isn't running, allocation of system-wide resources, etc. Whether these data structures are protected by spin locks or a scheduler semaphore is usually determined by the length of time required to access the particular data structure and, if modifying it, to leave it in a consistent state. If the time is short compared to the time required to do a context switch, a spin lock is probably the best mechanism. If the time is longer but the likelihood of collision on an interlock is low, a spin lock may still be the best choice. Independent of the duration, if nothing can be done without access to the critical section (again, the scheduler database and the execution of an interrupt routine are good examples), a spin lock is required. The best choice can usually only be determined by performance analysis. Therefore, good lock instrumentation is critical to achieving the best possible system performance.

Mechanisms for managing other system resources, which usually require the long-term ownership of a lock (e.g., across I/O), already exist in uniprocessor operating systems and usually don't require modification for an SMP environment -- except to ensure that the locking mechanisms used are atomic with respect to multiple processors. Such mechanisms are implemented by trying to obtain a lock and, if not available, calling the scheduler rather than spinning. Once the lock is obtained data access proceeds normally. When done, in addition to relinquishing the lock, a check is made to see if any processes are waiting for the lock. If so, the scheduler is notified that it should run a particular process, or allow all processes waiting for the lock to run to compete for it. As mentioned, these types of locks don't usually require modification for a multiprocessing environment (although the scheduler interlock may come into play when making these resources visible to the scheduler when such a lock is relinquished), but it may be advantageous to introduce additional locks of this type when implementing symmetric multiprocessing.

This type of lock can be introduced into the system where a spin lock would be acceptable except for the fact that the average waiting time to obtain the lock (if it is not available) is potentially long compared to the time required to do a context switch.

Interlock Mechanics

In System V, locks are defined by a lock data block structure (ldb, see figure 2). A lock data block structure contains elements which are used for acquiring the lock, keeping track of lock ownership, debugging, performance monitoring and analysis, and deadlock avoidance. There were two classes of locks created: system locks and process locks. System locks are acquired in system or interrupt context (e.g., scheduler lock and hardware error locks). Process locks are acquired in kernel mode by a CPU on behalf of the current process, and are considered to be ownable by a process.

The lock itself is implemented using the VAX BBSSI/BBCCI instructions on bit zero of the ldb_lock variable (the getlock()/givelock() routines are written in assembly language). The VAX BBSSI (branch on bit set and set interlocked) and the VAX BBCCI (branch on bit clear and clear interlocked) instructions test the state of the bit(s) specified in the instruction and, if they are in the test state indicated by the instruction, the branch is taken. Regardless of the original state of the bits, they are put into the new state as indicated by the instruction and this is done interlocked (atomically) [1].

The locking algorithm described previously is implemented using these instructions. Effectively, this is accomplished by looping on the BBSSI, with the modification to avoid using an interlocked instruction to test for potential lock availability, until the instruction doesn't branch in which case the lock has been obtained. Complimentarily, the BBCCI instruction is used to relinquish the lock and a crash results if the branch is taken, since this represents relinquishing a lock which wasn't owned.

Since appropriate interrupts must be blocked while spinning, the spin loop itself may also want to poll for the possible occurrence of external events which, in the normal course of things, would be seen as a result of interrupts. A couple of examples might be an interprocessor interrupt generated by some other CPU to request a service or a kernel debugger trying to get the spinning CPU's attention.

Once the lock is obtained, the ldb_owner field is filled in with the cpu-id of the current CPU (useful for debugging and performance analysis). If the lock (semaphore) being obtained is a process lock, i.e., the linkable bit is set in the lock block, then the pid of the process owning the lock (for debugging) is stored in the ldb_pid field in the lock block, and the lock block is linked to the list of locks already owned by the process (through the ldb_link field). This linked list is used to relinquish the locks owned by the process should it block (sleep) at some point in its execution prior to explicitly returning the locks when exiting critical sections. If the scheduler returns the locks in this manner as a result of the process going to sleep, a record of the locks that the process owned is kept and when the process is awakened, the locks are re-acquired in their original order.

Finally, to aid in debugging and potentially useful in doing performance analysis (see below), the return address of the caller to the lock routine is stored in ldb_pc. For initial debugging, an expanded ldb structure was used that stored in a ring the last few return addresses for calls to acquire the lock, and the CPUs (cdb addresses) on which the lock was acquired.

Figure 2: Lock Data Block (ldb)
This file was created by the VAX DOCUMENT Graphics Rendering Utility with the following commandline DOCUMENT/GRAPHICS=RENDER SKUNKWORKS:[FLEMMING]RCKP_GEMETA.GRA/TYPE=(PS)/OUTPUT=FIGURE_2.SDML/NOBACKGROUND/SDML=FIGURE [FIGURE_FILE](PS\FIGURE_2.PS\13.7)

**Contents of Lock Data Block**
Symbol	Usage
ldb_lock	Interlock - bit 0 set/cleared atomically.
ldb_nest	Current lock nesting level.
ldb_linkable	Boolean, true then link this lock block to the end of the list of lock blocks already owned by this process.
ldb_level	Dead lock avoidance level number.
ldb_pid	If linkable, pid of the owner.
ldb_name	Pointer to the name of the lock - for lock usage reporting utility.
ldb_owner	Current cpuid of CPU that the lock is owned on.
ldb_link	Link to locks obtained earlier so they can be given back if sleep is called.
ldb_calls	Number of times lock was obtained.
ldb_spins	Number of spins waiting for the lock.
ldb_collide	Number of collisions with other CPUs when attempting to obtain the lock.

Nesting

It is occasionally useful to try to obtain a lock without worrying about whether the current CPU already owns it (if a CPU tries to get a lock that it already owns and nothing is done to prevent this or allow for it, the system will hang). Arguments can be made that lock ownership should always be known by the programmer on any code path, and in general this should be true (see the section on debugging).

However, since locks are often being inserted in "spaghetti" code, often with a large number of code paths arriving at the same point, or since it may be desirable to have a recursive procedure obtain a lock, or since it may be desirable to call a procedure that must own a lock to be executed either with or without the lock, and since the lock code itself must already test to see if the lock is owned by the caller to avoid the hang described above, it turns out (mostly for performance reasons) to be convenient to allow lock calls to nest.

Nesting is implemented by having the lock code determine if the lock is already owned by the caller and, if so, simply incrementing a count field (ldb_nest) and returning. When the lock is relinquished, the count is decremented and, if it hasn't yet gone to zero, the routine just exits. Only when the count field returns to zero is the actual ownership (i.e., lock bit) cleared and the lock declared free.

If an attempt is made to return the lock more times than it was obtained, this will be detected and a panic will result. Unfortunately, if the nesting doesn't get completely undone, the system will hang, but this is essentially the same as obtaining a lock without nesting and not relinquishing it. Either case indicates a faulty code path that must be corrected.

Deadlocks

A major concern when dealing with locks is the classic ordering problem, which can lead to deadlocks (also known as deadly embraces). These can occur if locks are not all obtained in the same order by all CPUs on all code paths and can lead to one CPU owning lock A and needing lock B, while another CPU owns lock B and needs lock A.

Since it is not always easy for a developer to remember the correct order and also, since the system could potentially run for a very long time with the locks not being obtained in the correct order without deadlocking, the following mechanism was implemented to avoid deadlocks and assure that locks were always obtained in the same order.

A static (defined at compile time) field in the lock block, ldb_level, contains a number defined by the creator of the lock. When the getlock() routine is called to obtain a lock, this field in the lock block is compared with a field in the cdb filled in from the ldb_level field from the lock block of the last lock acquired (or zero if no locks are currently owned). If the field in the current lock block is less than the field in the cdb, a panic results.

Thus, by correctly specifying lock level numbers when creating new locks, lock ordering can be enforced and the potential for deadly embrace will result in a panic (hopefully quickly identifying the offending code path allowing for a relatively painless bug fix) rather than a hung system, which might be significantly more difficult to analyze.

Lock Instrumentation

Since most of the difference between the performance of a uniprocessor system and a multiprocessor is going to be due to the amount of time spent spinning, it was essential that the developers be able to determine the overhead caused by having to acquire locks, the relative frequency of collision between multiple CPUs attempting to obtain the same lock at the same time, and the amount of time spent spinning when a collision does occur.

To give these SMP overheads visibility, there are three fields defined in the lock block to collect statistics about lock usage. These fields are:

ldb_calls, which is incremented each time a call to getlock()is made to obtain a lock
ldb_collide, which is incremented each time an attempt is made to obtain a lock and it is already owned by some other CPU
ldb_spins, which is incremented each time through the spin loop

By looking at the relative size of these numbers, it was possible for a developer to determine if a given locking strategy was more successful than others and makes it possible to fine-tune any given implementation. More on this in the section on performance.

SMP Implementation

Once the ASMP implementation had been achieved, the process of moving to an SMP code base was accomplished subsystem by subsystem. The order in which the systems were made SMP safe was as follows:

Block I/O Subsystem (buffer management)
Inode Handling
Files and File Systems
Virtual Memory (including sched, vhand, and region)
Process Management
Miscellaneous System Calls
IPC (semaphores, shared memory, messages)

Work on these subsystems ran concurrently with work on error handling and device drivers. This was combined with constant regression testing and preliminary performance analysis, along with the development of test programs to exercise a newly threaded subsystem. As previously mentioned, this work was divided up between the two development locations. It became necessary to synchronize the code base between the two development systems almost daily as the project entered this phase.

The work in this phase consisted of identifying global data to be protected for each subsystem, and inserting locking code around critical sections as has already been mentioned. While some of the work became fairly routine, some areas created some interesting problems to resolve, either due to complexity or performance. The next few sections discuss some of these in greater detail.

Buffer Management

The block I/O buffer cache subsystem was the first major subsystem to be threaded. It introduced a number of problems that would be seen again later, but it also had a few unique quirks all its own. Since this subsystem is heavily accessed in both timesharing and production environment, it was important to have this code work as efficiently as possible.

The primary data structure in the block I/O subsystem is the buffer header (or buf) structure, which describes the contents of a kernel data buffer containing file system data [2]. If the desired data is not found in a kernel buffer, the buf structure is handed off to the appropriate mass storage device driver to complete an I/O request.

The buf structure is unique in that it exists on two doubly linked lists. One list is a hash queue, to allow an easy search for data from a particular device and block address. Each buf structure containing valid data always exists on a hash queue. The other list is called a "free list", a linked list of all buf structures not actively in use by a device driver. This FIFO list is the source of buf structures to be used to create new I/O requests.

The logic for processing buf structures contained many loops, as particular data structures are being examined, or as the system attempts to locate free resources. The use of locking code within loops showed up as a severe performance degradation that needed to be corrected before the product could ship. Moving locks outside of loop calls became one of the most important lessons learned -- "lock calls aren't cheap!"

The second lesson learned was a "feature" of the VAX architecture. The buf structure contains state flags, some of which were desired to be set/cleared using interlocked instructions (BBSSI/BBCCI), while others could be set and cleared normally. The operations on a bit field, when mixed between interlocked and non-interlocked access, provide inconsistent results. The result of this experience was to reserve special fields in a data structure if interlocked access was required.

A subsequent problem that appeared was one of scaling. As processors supported larger physical memories, the sizes of tunable data structures were encouraged to grow larger. A natural desire for a system administrator of a system with lots of memory, and indications that a lot of free memory exists, is to expand the size of the block I/O buffer cache. With more memory dedicated to the buffer cache, fewer physical disk I/Os should be required.

A side-effect of growing the size of the buffer cache is the length of time it takes the contents to be written out to disk. The sync command executes an update() system call, causing all "dirty" I/O buffers (data cached, but not yet written to disk) to be flushed back to the disk at once.

This problem existed even in the uniprocessor code base for System V. SMP made the process more painful, since some of the most heavily utilized locks, i.e., those associated with the buffer cache, needed to be held for long periods of time. It became apparent that system performance suffered greatly when a large buffer cache needed to be flushed to disk.

Part of this problem was alleviated in System V Release 3.1, with the introduction of the bdflush daemon. This daemon owned the responsibility of "smoothing" out the I/O request rate of the buffer cache by aging the contents on a clock-driven basis. However, this did not completely solve the problem, since a cache that filled quickly and needed to be flushed still caused a spike in system performance. The problem was later resolved by rewriting the algorithm used by update() to avoid multiple passes through the buffer cache.

Sched and Vhand

The System V Release 2.2 code base for the VAX was the first version to replace a swapping environment with a demand-paging environment. The novelty of the implementation, combined with a lack of documentation, made the process of threading the Virtual Memory (VM) subsystem an interesting challenge. In actuality, SMP helped to shake out a few problems with the implementation mostly by virtue of added sanity checking code.

All prior versions of UNIX System V had been implemented with one memory reclamation process called sched (the swapper). This had traditionally been implemented as a kernel process (pid = 0) that was awakened when memory availability was low. It selected one or more target processes to swap out, and then executed the necessary I/Os to migrate a process to a disk backing store. The swapper executed in a process context to allow the scheduler to interact with it as a normal process for disk I/O completion.

The virtual memory subsystem kept around the swapper for critical memory shortages. However, another mechanism was added to reduce the need for swapping, and keeping the free pool of memory pages above a tunable limit. The page stealing daemon, vhand, is awakened when free memory falls beneath a tunable lower limit. It uses a two pass method of scanning a list of active memory regions, first changing page table entry protections to cause soft page faulting upon reference. On its second pass, vhand reclaims any pages that were candidates for salvaging (pages that had not been referenced since the first pass through memory).

These mechanisms are able to co-exist on a uniprocessor, since only one of the two processes can be executing at any point in time. SMP created a number of interesting problems that hadn't existed before:

How could sched and vhand be allowed to execute simultaneously?
How could sched be kept from swapping a process active on another CPU?
How could vhand be kept from stealing pages from an active process?

The solution to the first problem was handled initially in a simple fashion - each of these kernel processes established hard affinity to the boot CPU, guaranteeing that their execution was serialized. This restriction was lifted in a subsequent release.

The other two problems were handled by adding new state bits to the process data structure. Once a process becomes a candidate for swapping or page stealing, a bit is set in the process header to prevent that process from being scheduled until the VM operation is completed. Both vhand and sched were then prevented from acting upon the memory for a currently executing process. The general approach remains one of mutual exclusion of actions (scheduling, swapping, or page stealing) on an object (a process or memory region).

Device Drivers

One of the design constraints faced was supporting user-written device drivers unchanged. A few strategies were developed to make this feasible, achieving varying amounts of success. Some device drivers made inappropriate references to kernel data structures, or assumed that kernel data structures (such as inodes) could be modified within the scope of the device driver. Although some device drivers did exhibit these problems, the majority of the device drivers tested behaved properly.

The initial strategy to handle devices was to use a forced affinity method. Forced affinity to the boot processor was assumed, unless a bit in a system configuration file (/etc/master) was set, indicating that this driver could execute on any processor. Another bit was added to this configuration file to request that a device driver spinlock be created for the use of the driver, in anticipation of making device drivers SMP safe. Both of these mechanisms are still in use (see section below on the Configuration Utility).

The initial attempt to make device drivers SMP safe involved the use of the previously mentioned modified spl() function calls (global spls), which were implemented as spin locks. It was observed that most of the global data and critical sections of device drivers are already protected by raising processor priority by use of spl() calls. Extending the spl() protection to other areas in the driver, where required, was a simple and straightforward task.

This method met with mixed results. It quickly allowed the disk and tape I/O subsystems to execute on any CPU. Due to hardware characteristics of the development platforms, interrupts were fielded by only the boot CPU. Thus, the results of the exercise worked fairly well if the I/O was predominantly to disk or tape. In a network I/O environment, the boot CPU was still a critical resource.

Attempts to multithread other I/O subsystems met with poorer results. The communications hardware present on the VAX at the time were predominantly UNIBUS based. Due to a hardware restriction, all UNIBUS I/O had to be issued and all UNIBUS device interrupts serviced from the same CPU. The boot CPU was chosen since it is the only one guaranteed to be available before UNIBUS devices are accessed. This greatly limited the ability to distribute I/O across multiple processors, and it was necessary to wait for BI and XMI options to become available to solve these problems.

Once some newer communications options were available, the next problem encountered was a design flaw in the global spl approach. Not all device drivers made use of a single priority level for synchronization. Any code that interacted with the terminal subsystem had to be closely examined, since in the terminal subsystem, priority was raised to spl6() from the more common spl4() or spl5() levels. This was a problem since the semantics of the global spls changed from the uniprocessor case.

On a uniprocessor, code executing at spl5() could be guaranteed that no other code could execute at a priority below that level. With global spls, one processor could be executing at spl6() while another executed at spl5(). If one processor was raising IPL from spl5() to spl6(), the lower priority lock was surrendered. These semantic changes caused a lot of difficulty as code migrated into and out of device drivers.

The final problem with the initial approach was one of granularity. Almost all of the device drivers available for the VAX used spl5() as their synchronization level. While the disk subsystem held the spl5() global lock on one CPU, the tape system was unable to make progress until it was released. The common use of spl5() effectively created a single I/O spin lock shared by almost every device driver.

The resolution to these problems took place in subsequent releases. The global spl locks were abandoned, and the spl() calls resumed their normal function of just raising/lowering processor priority levels. Individual device drivers were made SMP safe, either with locally declared locks, or by use of the lock created by the configuration utility. This rework was completed for all mass storage devices and some of the communications framework.

Error Handling

Some of the most hardware-dependent functionality is required to handle the errors generated by a processor, along with its I/O and memory subsystems. The support for uniprocessor versions of the development systems existed and contained all of the typical support routines required. The task remaining was to enhance those routines, as required, to handle errors from multiple CPUs concurrently, and to enhance other related routines such as powerfail handling. Finally, a watchdog mechanism was needed to check the state of the system.

Machine errors

For VAX architecture systems, error routines are required for the CPU (machine check handler), memory subsystem (corrected read data/read data substitute errors), and I/O busses. Device drivers are responsible for handling errors for the devices they support.

In the hardware environment, it was possible for each processor to signal both CPU and memory errors. One of the first tasks was to take the uniprocessor support code for each of these areas and provide for concurrent execution. Because these code paths are (hopefully) executed infrequently, it was sufficient to use spin locks to single thread access to these routines.

I/O bus errors created a different class of problem. Most of these errors cannot be corrected, since they are not tied to any synchronous processing by the CPU. The exceptions were bus interface adapters (e.g., NMI to BI, XMI to BI), which could be reinitialized for certain classes of errors. Support routines for these devices were also single threaded by use of spin locks.

Error Logging

Another facility which needed to be multithreaded was the error logging subsystem. This consisted of a kernel buffer area for storing data to be logged to a disk file, and the supporting routines to manage the area. Since different errors could be logged by different CPUs simultaneously, this area was protected by spin locks. This lock was one of the very few system context locks created, as this lock could be acquired while no process was currently executing.

A further enhancement to the error logging facility that was planned, but not implemented, was to include information in the standard error log header indicating on which CPU the error was generated. For some classes of errors, this could have provided more meaningful information.

Powerfail Handling

The uniprocessor versions of the development systems supported powerfail restart handling, if optional battery backup hardware units existed on the system. Code enhancements were required to the powerfail subsystem in two areas - saving the state of individual CPUs prior to total power loss, and restoring each processor's state upon resumption of power.

VAX processors provide a high level interrupt once power is removed from the system, and the power system provides some number of milliseconds of time for a processor to save its state. The changes required were to save individual CPU state information to each cdb instead of to one common location.

The actions taken upon power being restored got a bit more complex. The boot CPU is the only processor restarted by the console subsystem, so it inherits the responsibility of restarting itself and any active secondaries. Once the primary CPU's state had been restored and it could resume operation, each of the active secondary processors was restarted, using the same mechanism as during a bootstrap. The secondary processors were responsible for recognizing that this was a restart attempt, rather than a reboot, and to signal the boot processor when they were ready to resume normal operations.

Keep-alive processing

All CPUs within an SMP system need to be checked periodically, so that if an individual CPU fails, clean-up action can be taken. The clock service routine was enhanced to check the health of all CPUs in the complex. Within its cdb, each CPU maintains a keep-alive counter, which is reinitialized at each clock tick. A non-boot CPU sets its keep-alive counter to -n, while the boot CPU initializes its keep-alive count to -maxcpus*n ("maxcpus" is the number of CPUs in the configuration, and n is a heuristically determined parameter large enough to avoid false alarms).

After reinitializing its own counter, a non-boot CPU increments the boot CPU's keep-alive counter; the boot CPU increments the keep-alive counters of all other CPUs. If the resulting value of a keep-alive counter is still negative, the corresponding CPU is considered to be functioning. If the count reaches zero, however, it means the corresponding CPU wasn't able to reinitialize its counter recently, and the CPU is declared dead. Obviously, there must be a method for defeating (disabling) this mechanism while debugging. Otherwise, whenever a CPU encountered a breakpoint, it would be quickly declared dead.

After a CPU is declared dead, its keep-alive word continues to be incremented and is therefore positive. A positive keep-alive word indicates programs set to run only on that CPU (by system call or shell command) can no longer be run. This problem would be reported to the system operator as appropriate.

If the keep-alive counter incremented to zero belongs to the current boot CPU, the CPU that performed the operation could assume the role of boot CPU. The instruction used to increment the keep-alive counter must be interlocked so that it is only possible for one CPU to actually see it transition from negative to zero. Thereafter, CPUs incrementing it will see a positive number.

Thus, only one CPU can attempt to take over the role of the policy CPU and that occurs on the transition of the boot CPU's keep-alive word from negative to zero. This action is called a role switch, and is accomplished by changing a few variables that determine which CPU is currently the policy CPU. These include the variable that contains the CPU number of the current policy CPU (bootcpu). The fact that a role switch occurred would be reported to the operator.

While implementing role switching is conceptually straightforward, a lot of thought must go into how CPUs are designated to make everything work correctly in the event of a failure of the primary CPU. Because of the order in which things occur when the system is booted, initially, the contents of the bootcpu variable will be zero. For a role switch to be possible, a distinction must be made between CPU 0 and the boot CPU.

For example, consider the clock callout table. There may be entries that specify that the processing must be done by physical CPU 0. There also may be entries that specify the generic boot CPU. If no role switch has occurred (the usual case), both refer to the same CPU. On a role switch, things that should be done by the generic boot CPU should be carried out by the CPU that assumed the role of the primary CPU, while those entries that specified that the action must be carried out by CPU 0 cannot be serviced (unless CPU 0 can be restarted). Thus, it is important that a distinction be made between a CPU number (which might be the CPU number of the current boot CPU) and the identity of the generic boot CPU. Otherwise, role switching will result in chaos or hung systems.

In reality, while the keep-alive mechanism described above was implemented in the first version of the SMP system, the role switching mechanism never was. In fact, any CPU being declared dead resulted in a panic, a dump, and a reboot instead. Note however, that whenever a CPU is declared dead (assuming it's not accompanied by any corresponding hardware errors -- determining that might require special hardware or an intelligent console assistance), if the process that the dead CPU was running is killed and the dead CPU did not own any locks, it should be possible to continue the system without the dead CPU. In fact, if the cause for the CPU stopping can be determined and corrected, there is no reason that it cannot be restarted and the system would once again be at full performance potential.

In a later development cycle, it was discovered through performance analysis that the overhead cost of the above keep-alive checking mechanism was higher than desirable, which resulted in a modification of the actual checking algorithm itself. As a result, a similar but more esoteric lower overhead method was implemented to do the checking.

System V Verification Suite

One of the primary requirements was to keep the system compliant to the System V Interface Definition (SVID), since the target customer base for the product demanded it. One means of compliance testing was the System V Verification Suite (SVVS), a test package sold by AT&T allowing vendors to test their releases against a standards-checking tool. After the initial release of the product, each succeeding version is checked against the appropriate version of SVVS before being released for shipment.

SVVS consists of a set of programs, and since programs are known to contain errors, it is not surprising that some of the problems encountered when verifying the implementation turned out to be problems within SVVS. It is suspected that this was one of the first SMP implementations to run the SVVS suite, and some race conditions were identified within the tests and waivered. Most of these had to do with parent/child interactions, discussed later in the section on debugging applications.

Debugging

It was realized in the early stages that a few printf() statements were not going to provide an adequate debugging environment for developing an SMP system. A variety of techniques were used to assist the developers, many built from existing mechanisms. This section will detail some of the most useful ones created.

XDELTA

XDELTA is a machine level debugger included with Digital's VAX/VMS systems. It can be included into the VMS kernel and provides a very low level debugging facility, at the machine instruction level. The code for it was also fairly compact, allowing it to be used on small memory configurations.

A version of XDELTA had been ported by one of the authors of this paper as a development tool for debugging support of new VAX processors. It was used in the early stages of development, until a richer, more powerful tool capable of handling multiple processors was developed. XDELTA was abandoned after ksdb (see below) became functional.

PIDs in structures

There was a class of data structures that was owned by a process for significant periods of time (even across rescheduling events). Frequently, during debugging, the system would hang as many processes eventually tried to access a data structure that was not released properly. Structures where this became an issue included I/O buffers, inodes, regions, and file systems.

The method chosen to isolate these problems was to add a new field to each of the desired structures called an owner field. Once a process acquired exclusive use of the structure, the process PID was copied to the owner field. When a particular structure was determined to be the cause of a system hang, the owning process could be quickly identified, speeding the search for the reason the structure had never been released properly.

SMP_ASSERT

Another technique used quite frequently in the code was to sanity check assumptions made about the state of data structures and the ownership of locks. For example, in some of the inode handling routines, certain functions are only supposed to be called if the inode argument is locked. Adding code to test that assumption (included via conditional compiles) aided in the tracking of faulty code paths.

It is interesting to note that most of this checking code was added as a result of code reviews. In such a review, the reader would often announce "Here we are in routine fred() with an inode argument, assumed to be locked. Say, I wonder if we really always have the inode lock? Easy, lets check." In fact, quite a few problems were discovered that previously existed in the uniprocessor code base.

The mechanism created was called SMP_ASSERT, and was based on the existing ASSERT mechanism available with the version of the C compiler used. The compiler created debugging information about the source file and line number to be used in the run-time check. If the condition checked by the ASSERT was false, the exact source file and line number information were available to be printed on the system console. During most of the development effort, an ASSERT failure also caused the system to enter the kernel debugger immediately.

SMP_ASSERT was useful in tracking down what turned out to be one of the project's "most interesting" bugs. Reviews of several crashes or hangs were leading to the opinion that the whoami() function was broken. Careful code reviews didn't reveal the problem. The next alternative was to use SMP_ASSERT to validate the use of the cdb pointer as used in various routines.

Once the code was liberally sprinkled with SMP_ASSERT calls, it was discovered that some routines were entered with one cdb pointer, but exited with another. Once isolated, the problem was obvious -- establishing a pointer to the current cdb at the entry of a routine is only valid so long as the process cannot be rescheduled. If the code path calls functions that allow the process to sleep, the cdb pointer must be re-established later in the the procedure.

These sanity checks were used constantly during the development effort, internal debug cycle, and first external field tests. When the time came for the first release product to ship, the SMP_ASSERT code was deactivated by a simple change in one header file.

Kernel Symbolic Debugger

The typical debugging environment for the kernel includes kernel mode printf() function calls, crash dumps (if available), and other home-grown techniques to shorten the process of deciphering the cause of system failures. Some attempts have been made to include user-level debugging tools into the kernel such as adb by Zimmerman [3], but these often required access to a second machine and would not scale easily beyond a uniprocessor. The perceived critical need was for a kernel debugging tool which would work in an SMP environment.

One of the developers on this project, who had prior interest in this area, developed ksdb, a kernel symbolic debugger based upon the existing System V user level debugger sdb. The key differentiator for this utility was a co-routine environment designed to handle multiple CPUs, requiring minimal intrusion into the kernel, and requiring only minor modifications to sdb. In fact, the design was made portable, so that another higher level debugger could have been accommodated.

The kernel debugger makes use of the information created by the C language compiler if the -g option is used. This gave the debugger the ability to set and clear breakpoints using line number information from program listings, instead of requiring kernel virtual addresses. Both global and local variable name information is also available in this environment. However, the debugger does not have the ability to display source lines -- this facility would have required access to a dedicated second system (a common shortcoming for this application). Access to a timesharing system containing the current sources was still required via a second terminal.

The co-routine implementation allows the primary CPU to act as the interface to the console terminal. When a secondary CPU encounters a breakpoint, it enters a service routine to send and receive data to the primary processor's co-routine. The primary processor, at well defined locations, checks for data to be serviced from other processors and then acts as the vehicle for message passing. Likewise, if a user enters the debugger and wishes to communicate with a secondary processor (e.g., to examine the state of internal processor registers), the primary processor co-routine leaves a message for the secondary processor to find and interpret.

The debugger can be entered in a variety of ways. In the code, a call to the function kcr_enter() causes the CPU executing that path to enter the debugger; if the debugger is not loaded with the kernel image, the call becomes a no-op. A special control-key sequence entered on the console is interpreted by the console terminal driver as a request to enter the debugger, and a call to kcr_enter() is executed. The panic processing procedure also contained a call to kcr_enter().

There are those instances, however, when the system is not in a state to respond to console terminal input. This could happen, for example, when a deadlock condition occurs and the processor's interrupt priority level is high enough to ignore the console terminal activity. For these occasions, a special "trap-door" entry was created. The CPU is first halted via a console mechanism and the current program counter (PC) value saved in an unused internal register. Then, the CPU is restarted by changing the contents of the program counter to a "well known value" (0x80000402 on the VAX), and then directing the processor to continue. This mechanism can be used to force any CPU, primary or secondary, into the debugger service routine.

smp_debug_tools

Once the kernel debugger became a viable tool, its use was extended by taking advantage of its ability to call functions loaded with the kernel image. Some common information gathering techniques for a hung or crashed system were made simpler by creating routines to provide the information. For example, the debugger makes it possible to display the contents of an inode to see if it is currently locked. With several hundred inodes configured into the system, the process of finding all locked inodes becomes a time-consuming task. By creating a routine to search the inode table and print out the address of each locked inode, the process could be shortened considerably.

The "toolkit" of debugging routines expanded and contracted, based upon the area of the kernel currently under evaluation. A general set of routines was always left available for providing commonly required information. These routines included the following:

Print a list of locked inode addresses
Convert an inode number to an inode address
Print a list of all sleeping process pids
Print a list of process pids on the run queue
Print a list of process pids sleeping on a particular address
Convert a pid to a process address
Print a list of locked memory regions
Determine if a memory region is on the active list
Determine if a memory region is on the free list

A hardcopy list of these routines was kept handy to show the available facilities. These functions greatly reduced the time to determine the cause of a failure.

Application debugging

Once the SMP system was available for test use, the problem of debugging applications began to arise. These were applications that appeared to run correctly on uniprocessor versions of the system, but failed under SMP. These problem applications could be generally categorized into two classes: those that were faulty already on uniprocessors and those that depended upon implicit knowledge of kernel behavior.

A typical example of a faulty application was one that made improper usage of shared memory and semaphores. Careless use of semaphores to protect user data in shared memory could go undiscovered for long periods of time on a uniprocessor. On an SMP system, these problems materialized much more quickly.

An interesting variation of the above was an application that created its own semaphore mechanism (rather than using the system call) using the VAX interlocked instructions. A well known data cell in shared memory was established as the semaphore, and ownership acquired with the VAX BBSSI instruction. However, ownership of the semaphore was cleared using a non-interlocked instruction (a simple zeroing of the data cell). Due to hardware writeback caches, the fact that the cell was cleared was not immediately echoed to all processors, leading to different views of the state of the cell. Once the code was changed to do an interlocked clear of the cell, the application resumed working.

The second class of problem was less obvious than the first. Some applications were written to take advantage of system behavior on a uniprocessor, and failed when moved to a multiprocessor. Typical of this was the application that depended upon the scheduling of a child process to run at a higher priority than its parent after a fork() system call. On an SMP system, both parent and child can be running simultaneously, so depending upon scheduling order tends to be hazardous.

Instrumentation

Since an SMP system behaves in many respects identically to a uniprocessor, there was the need to answer the question about the effectiveness of the additional CPUs. In addition, the developers needed mechanisms to measure the effects of changes made to the performance of the system as a whole.

Two new utility programs were created to give visibility to some of the statistics being kept by the system: spar and cpucntl. In addition, two existing performance measurement techniques were modified to report data on multiple CPUs: the sar and Kernel Profiler utilities.

Spin Lock Activity Reporter

spar is an acronym for spinlock activity reporter, and was created to allow examination of the spin lock database. The utility used the standard method of accessing kernel data (/dev/kmem and the nlist() function) to create a snapshot of the statistics kept by each spin lock. The output of the utility, shown in example 1, contained the lock name, the number of times it had been acquired, the percentage of time there was a collision when multiple CPUs attempted to acquire the lock simultaneously, and the average number of spins for each lock acquisition failure.

This information was invaluable to the developers to understand the dynamics of a workload on a given system. By concentrating on locks that had high collision rates, changes could be made to the system to reduce contention. A high spin-rate for each failure was an indication that a lock was being held for too long a time, and that work was needed to reduce the coverage of one lock or to split it into two. This information offers continual motivation to refine the granularity of critical sections protected by spin locks.
[example](spar Report) [example_file](spar.sdml)

CPU Control Utility

The primary focus of the cpucntl utility, as mentioned before, is to control starting and stopping secondary processors. Another option, a display feature, was also added to enhance visibility of SMP performance. This was used to provide visibility into accounting fields added to the cdbs, instead of overloading another utility (such as spar).

Some of the more useful information supplied by cpucntl help explain the contribution of the secondary CPUs. Absolute counts of system calls executed by each CPU allow a comparison of behavior between the primary and secondaries. A separate counter shows, for each secondary, the number and percentage of system calls not eligible to execute on any other than the primary CPU. This shows how much work must be deferred to the primary only.

Another set of useful information concerns the number of process context switches done by each system. On a balanced system, the distribution should be fairly even among individual processors. Also displayed for the secondaries are the number of context switches due to forced affinity. When the percentage of forced affinity switches becomes too high, the primary CPU becomes the critical resource of the system as well as the chief bottleneck.

System Activity Reporter

The sar (system activity reporter) facility is a standard System V utility that reports on the behavior of system workloads and availability of system resources. It consists of two components: a data collection utility (sadc) normally run by cron, and the reporting utility (sar) which has many options. This tool is a favorite of system administrators to measure system performance.

Both data collection and reporting utilities were enhanced to capture performance data from multiple CPUs where appropriate. For example, the default display of the sar utility shows a daily breakdown of the percentage of CPU time spent in four categories: time spent in user mode, time spent in kernel mode, time spent waiting for I/Os to complete, and idle time. The data gathering for this report takes place during the hardware clock interrupt service routine.

Because each CPU executes its own hardware clock service routine, the data is collected on a per-processor basis. The data collector was then enhanced to gather the data from all CPUs, and the reporting program was modified to report the statistics on a per-processor basis. This allows the system administrator to see the same familiar view of performance information, enhanced by a breakdown for each CPU.

Kernel Profiler

Another standard facility of System V was enhanced primarily as a debugging performance analysis tool. The kernel profiler existed as a pseudo-device driver (called prof) and a set of utilities. When built into a kernel and activated, a PC sampling mechanism at each hardware clock interrupt gathered data about the work being done on the system. This data was collected into a large table in the kernel by means of hashing techniques, and then saved onto disk files by a utility program.

The purpose of the profiler is to help the system developer understand which routines in the kernel are consuming the greatest amounts of time. By using the sampled PC data over a large enough sampling period, kernel routines that are poorly written or are being used with a higher than expected frequency can be determined. By understanding the characteristics of the user workload, the system developer can make a judgment if a routine is behaving as expected or not.

As with sar, the sampling of the data takes place on a per-processor basis. The pseudo-device driver and the utilities were enhanced to handle data from multiple CPUs, on a per-processor basis. This gave the developers another visibility point into what the behavior of secondary CPUs was like in the kernel, and to measure the efficiency of new routines added to support SMP (such as locking overhead and interprocessor interrupts).

Performance Investigations

Adding a CPU to an SMP system doesn't produce an additional CPU's worth of computing (linear scaling). This is because of various forms of contention within the hardware caused by the need to maintain cache coherency, because of additional demands for various bus bandwidths, and so on. However, the largest contributor to SMP overhead and non-linear scalability is the overhead associated with spin locks. Thus, the largest payoff software developers can achieve in terms of enhancing system performance will usually be associated with reducing the scope of critical sections protected by spin locks.

As with all performance enhancement projects, the question that must be asked is how to determine what to investigate? Obviously, the developer's time should be spent in areas that will produce the maximum results. Without tools, one must rely on intuition. One of the hard and fast rules discovered during SMP development was that "if it's intuitive, its probably wrong," along with "if it can happen, it will, only faster." spar was the primary tool used (and still used today) for performance investigations.

Two types of overheads associated with spin locks: these are the overhead associated with acquiring/relinquishing a spin lock and overhead incurred by CPU collisions while attempting to obtain a spin lock. Since the former depends a great deal on the associated hardware architecture and the actual code used to implement the spin lock, this discussion will focus on the latter. However, it is important to note that there is an obvious tradeoff between obtaining/releasing a lock multiple times to reducing the scope of the lock ownership and the time other CPUs wind up spinning. If collisions are rare or the spin wait overhead is low, the lock allocation overhead can dominate what might be gained by potentially less serialization.

There are several possibilities for why spin locks add unnecessary overhead to an SMP operating system. These include:

A bug. For example, a spin lock is inserted in the code, presumably for a well thought out reason, but either the reasoning was faulty, or the need went away due to continuing development.
The size of a critical section. Too much code is executed between the acquisition and release of a particular spin lock.
Poor (SMP) algorithms. These result in a large critical section.
Poor lock granularity.

What follows is a discussion of the these possible problems and some of the investigations that were proposed and/or carried out to discover where such problems might exist in the SMP implementation. It is important to note that for the most part, spar was either the only tool necessary to point out that a problem existed or that an area should be investigated. In general, at least as a first pass, if spar indicated a high collision rate associated with a particular lock, usage of that lock was investigated. It is also important to note that what might be a problem may not be observed. That is, collision rates are highly dependent on the work load being run. Thus, low collision rates observed for a particular benchmark do not necessarily imply low collision rates for all benchmarks.

While doing some load testing during the development, after running a very heavy I/O load for a time, the system would hang because a process or processes were stuck waiting for I/O to complete. The obvious and immediate response was that there was some sort of SMP race condition associated with a setting I/O done bit. It was observed that if an appropriate spin lock was inserted around the code that implemented the actual mechanics of setting the I/O done bit (thus making it an atomic operation with respect to multiple CPUs), the system no longer hung in this situation.

In reality, the problem was that a bit associated with the data structure's I/O wait state was not being set/cleared properly using interlocked instructions. This fact was independently discovered and fixed later on by someone who simply noticed the potential for a bug, but who didn't notice the fact that the spin lock was being acquired and released around the bit manipulation. Once performance analysis began in earnest, it was noted that by far the highest collision rate in the system was associated with the lock that was (among other things) protecting the I/O done bit. Examination of the code immediately lead to the conclusion that this invocation of the lock was not needed and approximately 20% of all lock collision overhead was eliminated by removing one unnecessary occurrence of an interlock.

In a uniprocessor system, "locks" are often required to "keep things from going away" should the scheduler get called, or if kernel mode rescheduling were allowed. These "locks", which usually just amount to marking things as undeletable in a data structure, must become real locks in an SMP system (sometimes associated with a spin lock). Since ownership of these locks in a uniprocessor environment is only important when the scheduler gets called, procedures were often coded so that they acquired the lock upon entry and freed it up upon exit since this made the code somewhat more obvious. To maximize parallelism in an SMP environment, however, locks should only be held when they are essential. So, a great deal of effort was put into obtaining these types of locks as late as possible and releasing them as soon as they were no longer required.

In an SMP environment, algorithms must be examined to ensure that their implementation allows the maximum possible concurrence. For example, most search algorithms (as originally implemented on a uniprocessor system) just start at the beginning of a list and search each item until either the list is exhausted or the item being searched for is found. Realizing that the contents of the list might change while searching it in an SMP environment, however, the developer's first inclination might be to create a lock for the list, acquire it before beginning the search, and releasing it when the search was complete and any necessary modifications to the list have been made.

If this list is searched often, there is obviously significant potential for collisions between processors needing to search/modify the list. The alternative, of course, is to search the list for the item of interest without the lock and, if it is found, acquire the lock and see if it's still there. In fact, if the operations used to insert and remove items from the list are done in the correct order, it's not even necessary to own the lock to carry out those functions.

Finally, poor lock granularity must be considered. This can occur when a single lock is used to synchronize access to multiple (perhaps independent) data structures. Unfortunately, it is not always easy to determine when data structures are independent and usually when the developer is incorrect, the result is rather nasty bugs. When it was determined that collision rates were high based on spar output, it became worthwhile investigating lock granularity.

If a useful lock collision is defined to be a collision that occurs when multiple CPUs attempt to access the same data structure simultaneously, then the most useful possible collision would be when all CPUs attempt to obtain the lock from the same call site. Not only are the CPUs trying to access the same data structure concurrently on multiple CPUs, but in fact, the CPUs are attempting to execute the same code to do that access. A non-useful collision would occur when multiple CPUs collide on a lock to access totally independent data structures. Unfortunately, since multiple developers wrote code that must acquire/relinquish locks across many modules, it was difficult to determine if non-useful collisions occurred and if they were significant simply by examining the code. It's worthwhile to note that if a non-useful collision ever does occur, it can be eliminated by introducing another lock (increasing lock granularity).

A really gross example can be used to illustrate the idea of a non-useful collision. Suppose the same lock were used to lock accesses to both the inode database and the buffer cache. Many of the collisions would be useful, such as a spinner waiting to access the buffer cache, owner accessing the buffer cache. On the other hand, there would also be (non-useful) collisions where the spinner wanted to access the buffer cache while the owner was accessing the inode database. Everything would work but, if the situation were recognized, performance would be greatly enhanced by having two locks: one for the inodes and one for the buffer cache.

By examining the code, it might be possible to discover cases where non-useful collisions might occur, but, due to the physics of the situation, it might be impossible to arrive at the necessary conditions for the collision to happen. Because of this, it would be useful if it were possible to come up with occurrences of non-useful collisions empirically. For example, an experiment could be designed as follows. Suppose that a CPU stored the return address in a location in the lock block upon obtaining a lock and set that location to minus one whenever the lock was relinquished. Suppose that any CPU that attempted to obtain a lock but wound up spinning stored its caller's return address in another location in the lock block and sets that location to minus one whenever the lock was obtained. Now, if an observing program obtains two return addresses whenever it looks at the lock block, the program knows that it has the return addresses for the callers to obtain the lock at the point where a collision occurred. A simple examination of the code (the data structure being synchronized on) at the call sites would let the developer determine if the collision was useful; of course, if it was not, future occurrences of that collision could be eliminated by introducing another lock.

Performance enhancements

Once the initial development and debugging were completed, a number of other techniques were employed to improve the performance of the system. Many of these related to the features of the VAX architecture, while others were used to overcome problems created by that architecture.

whoami

Liberal use had been made of the whoami() function throughout the code when it became necessary to identify the address of the current CPU's cdb. There was an express need to make the whoami() function as efficient as possible.

The initial implementation of whoami() depended upon knowledge that cdb structures were allocated in multiples of 8 512-byte pages, and that the processor's interrupt stack was contained in that range. The cdb address could then be calculated by masking off the appropriate number of bits from the interrupt stack address, resulting in the cdb address.

A further enhancement was next created by utilizing one of two VAX internal processor registers not used by the kernel. The supervisor and executive access modes were not used allowing the cdb address to be stored into the executive mode stack pointer (ESP) at system initialization time. The function then resolved to retrieving one internal processor register which served as a whoami register.

sed scripts

Another technique to enhance performance was used to reduce the overhead of function calls imposed by the VAX architecture. The standard method of resolving function calls and returns involves the use of the VAX CALLS and RET instructions [1]; both carry a high overhead in memory accesses. For frequently used functions, it is desirable to minimize this overhead.

The shell procedures used to build the System V kernel already included an intermediate step where assembly language code (generated by the compiler) runs through an editor script. The editor script was used to both eliminate unnecessary call overhead, and to take advantage of VAX features. For example, the routine bcopy(), which moves a specified amount of data from one kernel location to another, was reduced from a function call to a single, in-line, assembler statement using the MOVC3 instruction [1].

The existing editor script was enhanced to include optimizations for the SMP changes. For example, the whoami() function mentioned before was reduced to a single instruction, copying the data from the ESP register to the target location, all done in-line and with no call overhead.

Other functions processed by the enhanced editor script included functions created to call the VAX interlocked instructions that had been used (e.g., BBSSI, BBCCI, ADAWI). The getlock()/givelock() functions were also modified to use less expensive forms of routine calls (JSB/RSB) to reduce overhead [1].

Idle Loop and Interprocessor Doorbells

Ideally, the idle process for each CPU should do something useful without interfering with useful work being done by other processors. For example, the idle process could execute the scheduler anticipating processes becoming computable. However, this scheme is less than optimal since the scheduler interlock must be owned to execute the scheduler, and this would prevent other processors from running the scheduler to requeue processes going to sleep or find processes to run. Also, if the idle process executes a significant amount of code, it's possible that some significant bus bandwidth (that could more usefully serve other processors doing real work) could be consumed.

To avoid bandwidth consumption, and yet ensure that the idle process scanned the run queues as soon as it is likely to result in finding a process to run, the concept of a software doorbell to interrupt the idle loop was implemented. The basic mechanism allows a CPU to ring another CPU's doorbell, the doorbells of all other CPUs but the current CPU, or the doorbells of all other CPUs on a "significant event", such as a context switch from a runnable process (which is eligible to be run by another CPU), or I/O done (a process can be run because an I/O request has been satisfied). A context switch from a runnable process rings the doorbell of all CPUs except the current one, while I/O done rings all CPU's doorbells including the current one, since the current one may also be running the idle process.

A CPU has to "listen" for a doorbell; a doorbell will not interrupt or otherwise disturb a CPU. The only time a CPU pays attention to doorbells is when it is idle, that is, when it has run the scheduler, found nothing to do, and is running the "idle process" until something happens to make work available.

This software doorbell implementation is useful since a CPU is not taken away from useful work with interruptions. A negative aspect is that a CPU may look for work to do as the result of a doorbell yet find nothing to do. While looking for work to do, it holds interlocks and increases memory contention, and thereby possibly interferes with other CPUs that do have work to do.

wakeup1

One important change turned out to be necessary to improve performance in the area of process scheduling. Some tests were showing a lot of context switch activity, large amounts of activity for the scheduler lock, long queues of runnable processes, but little system throughput. The problem was isolated to a feature of the existing sleep/wakeup mechanism.

If a process attempts to acquire ownership of a resource, such as an inode, and fails due to the resource being locked by another user, the requesting process sets a flag indicating that the resource is desired, and calls the sleep function to allow another process to be scheduled. The sleep function call contains the address of the resource desired (e.g., the inode), and the process is placed on a hashed list of processes waiting for an event to happen [2].

Once the process owning the inode resource is about to release it, a check is made to see if other processes want the resource. If this is true, the wakeup function is called, passing the resource address, and all processes waiting for this resource are made runnable and can compete for the resource.

In a uniprocessor, one of the processes now made computable is scheduled, acquires the resource, and continues. The resource might be released before another process is scheduled, and thus ownership passed along without contending processes being put through further sleep/wakeup cycles, all of which involve locking the scheduler database.

The problem described above is magnified on an SMP system. Once the "winning" process acquires the resource, it is highly likely that the other processes will be scheduled soon on another CPU. Each will fail to acquire the resource, and then be put back to sleep. The scheduler is thus forced to examine each of the "unrunnable" processes, allowing each to run only to be quickly put back asleep.

The wakeup1() function was created to resolve these situations. Whenever there is contention for a resource that can only have a single owner, the wakeup1() function picks the process that has been waiting for the process the longest from the hash queue. Only the selected process is awakened and allowed to be scheduled. This significantly reduced the scheduler thrashing seen under a number of workloads.

AIM Results

The goal of the development, as stated previously, was a dual processor system that would deliver 1.6 times the performance of a uniprocessor for interactive workloads. The benchmark used to establish this value was an early version of the AIM III Test Suite (© Copyright 1986 Aim Technology Incorporated.) which simulates a number of environments. Two test environments, a standard mix and CPU intensive load, were run on an identical hardware configuration using both the uniprocessor and multiprocessor versions of the kernel. The standard mix consists of RAM, floating point, pipe, logic, math, and disk operations. The CPU intensive mix contained an equal combination of RAM, floating point, logic, and math operations.

The results for the original version were encouraging. The dual processor version ran at 1.6 times the efficiency of the original uniprocessor code base for the standard mix, and at 1.85 times the efficiency for the CPU intensive mix. The numbers were measured over a range of 1 to 256 users, increasing the load by 25 users, and then averaging the results.

On the negative side, when comparing the original uniprocessor code to the SMP code base running as a uniprocessor, the SMP version ran only at 85% of the original speed. This indicated that the system was taking a 15% performance hit by having SMP overhead (i.e., spin locks) in the system, for this particular workload. This lead to a need to examine the placement of locks in subsequent releases, to reduce this overhead.

Enhanced Utilities

Some of the utilities that were modified for this effort have already been discussed: sar and the Kernel Profiler. Three additional utilities were modified directly for SMP, only two of which were visible to the end user.

Process Status Utility

A small change was made to the utility ps, which displays process status. When used with certain flags (-efl), one of the fields displayed is the state of the process (runnable, sleeping, zombie, etc.) On a uniprocessor, the process running the ps utility is always the current one, and all other processes marked runnable are waiting for the CPU. On an MP system, however, it should be possible to display the identity of processes which are running on other processors.

The state field of ps was modified such that if the process was currently active (running) on a CPU, the logical CPU id (0, 1, 2, etc.) was displayed in the state field. A runnable process (code R) now signified one that was truly waiting to execute. Running ps and seeing it execute on something other than the primary CPU was a good indication that the secondary CPUs were performing properly.

An apparent inconsistency was also introduced by this change. Due to the nature of the way this version of ps worked, multiple processes could be reported as active on the same CPU. This was explained by understanding the mechanism of the utility. ps does its data collection one process at a time, and, before it can parse the entire realm of processes, multiple context switches can have taken place. The dynamic nature of the system was shown by the number of processes that were scheduled to run before the ps utility display could complete.

Crash Dump Utility

The crash dump examination utility of System V was modified extensively to to give visibility to the changes made within the kernel for SMP. These changes were made in combination with kernel changes during the crash dump procedure to capture state information from each CPU during a system failure.

The primary features added to the crash utility included display functions for spin lock and cdb structures. Being able to display the contents of these structures quickly in an easily readable form was a great time-saver during debugging.

crash also needed to understand and change context from one CPU to another. Just as the kernel was modified to have no single view of current process and current priority, crash needed modification to understand the difference between the context of the "current" process on the primary or secondary CPUs. This included understanding the location of the internal processor register state information for each CPU, its general purpose registers, and the location of its interrupt stack.

Configuration Utility

The config utility for System V converts an input file describing the system hardware configuration and other parameters into several source files. These files are then compiled and linked into the resulting kernel image.

The changes made to this utility are invisible to the user (typically, the system administrator). A new system parameter was created, specifying the number of CPUs on the system (which controlled initial allocation of cdbs). For a uniprocessor, the default value is one.

The larger change concerned how the utility initialized the bdevsw/cdevsw structures. Based upon the entry for a device in the file /etc/master, a device spin lock is optionally created, and its address added to the appropriate device switch entry. Finally, if the device being configured required forced affinity to the boot CPU, a boolean value was added to a new field in the structure(s).

Subsequent Releases

The development effort began in early 1987, and was wrapped up by mid-1988. An extended field test phase ran from mid-1988 until early 1989, when it was released as a product. Concurrent with the field test phase, a port of this work to the code base for System V Release 3.1 was begun. That product was eventually released in late 1989, supporting the VAX 8800 series processors. The current product release is based upon System V Release 3.2.

The platforms supporting SMP were extended to include the VAX 6000 series. Once the base uniprocessor support code was debugged, SMP support was provided by multithreading the processor-specific error handling code, and by providing processor-specific code for booting secondary CPUs and handling interprocessor interrupts. With these few processor-specific extensions, the remainder of the code base required no SMP specific modifications.

As mentioned previously, the concept of global spin locks was abandoned, and device drivers for the mass storage subsystems were multithreaded by adding spin locks where appropriate. The buffer cache flushing algorithm was rewritten to minimize impact on large memory systems, and other performance enhancements were introduced in the VM subsystem.

Summary

It was possible for a distributed development team, with a mixture of UNIX kernel and SMP expertise, to develop an SMP version of System V Release 2.2 and productize it for general use. Out of the experience, there were a number of things that the authors feel were done well, and a number of things that would be done differently a second time around.

Some of the things which worked well included the following:

the kernel debugger -- the project would have been much longer without it
the spin lock activity reporter -- it still provides valuable information
distributed development -- it often provided two different views on the same problem, and almost resembled a "two team" approach
debug instrumentation -- in cdbs, locks, and other kernel data structures, which allowed problems to be quickly located
utilizing the system for development as soon as possible -- helped to shake out problems, and provided an incentive to keep the system stable

Some of the things which would be done differently a second time around:

create the tools needed for a distributed development environment early, and don't count on common tools (diff, vi, etc.) to solve these problems
work out performance criteria and benchmarking techniques beforehand
have a wide range of types of workloads/stress tests available
focus on multithreading device drivers from the beginning -- all heavily used mass storage, networking, and local device access routines should be multithreaded if possible
don't use spl() calls in device routines and take advantage of the fact that devices interrupt at particular priority levels to enhance independent device routine's concurrency
understand the impact of adding locks to uniprocessor performance
have a target for the impact of changes on a uniprocessor, and trade off against SMP performance

Acknowledgments

The authors wish to thank Carlos Baradello and Dave Leonard for their support in preparing and reviewing this paper. More thanks go to Dave Leonard and Tomas Lofgren, who were responsible for enabling the work to take place, and to Mike Cattolico and Tony Dziedzic, who comprised the third and fourth members of the original team. Many people in the VAX System V Development Group contributed time and energy to make the product become real, and expanding it for future releases. Final thanks go to Tim Litt and Tara Scanlon for helpful suggestions and pointing out major abuses of the English language.

References

T. Leonard, ed., VAX Architecture Reference Manual (Bedford: Digital Press, Digital Equipment Corporation, 1987).
M. Bach, The Design of the UNIX Operating System, (Prentice-Hall, 1986).
Zimmerman, USENIX paper, TBD