Guideline: Concurrency

Introduction

The art of good design is that of choosing the "best" way to meet a set of requirements. The art of good concurrent system design is often that of choosing the simplest way to satisfy the needs for concurrency. One of the first rules for designers should be to avoid reinventing the wheel. Good design patterns and design idioms have been developed to solve most problems. Given the complexity of concurrent systems it only makes sense to use well-proven solutions and to strive for simplicity of design.

Concurrency Approaches

Concurrent tasks that take place entirely within a computer, are called threads of execution. Like all concurrent tasks, threads of execution are an abstract concept since they occur in time. The best we can do to physically capture a thread of execution is to represent its state at a particular instant in time.

The most direct way of representing concurrent tasks using computers is to dedicate a separate computer to each task. However, this usually too expensive and is not always conducive to conflict resolution. It is common, therefore, to support multiple tasks on the same physical processor through some form of multi-tasking. In this case, the processor and its associated resources such as memory and busses are shared. (Unfortunately, this sharing of resources may also lead to new conflicts that were not present in the original problem.)

The most common form of multi-tasking is to provide each task with a "virtual" processor. This virtual processor is typically referred to as a process or task. Normally, each process has its own address space that is logically distinct from the address space of other virtual processors. This protects processes on conflicting with each other against accidental overwrites of each other's memory. Unfortunately, the overhead required to switch the physical processor from one process to another is often prohibitive. It involves significant swapping of register sets within the CPU (context switching) that even with modern high-speed processors may take hundreds of microseconds.

To reduce this overhead, many operating systems provide the ability to include multiple lightweight threads within a single process. The threads within a process share the address space of that process. This reduces the overhead involved in context switching, but increases the likelihood of memory conflicts.

For some high-throughput applications, even the overhead of lightweight thread switching may be unacceptably high. In such situations it is common to have an even lighter-weight form of multi-tasking that is achieved by taking advantage of some special features of the application.

The concurrency requirements of the system can have a dramatic impact upon the architecture of the system. The decision to move functionality from a uni-process architecture to a multi-process architecture introduces significant changes to the structure of the system, in many dimensions. Additional mechanisms (e.g. remote procedure calls) may need to be introduced which may substantially change the architecture of the system.

System availability requirements must be considered, as well as the additional overhead of managing the additional processes and threads.

As with most architectural decisions, changing the process architecture effectively trades one set of problems for another:

Approach	Advantages	Disadvantages
Uni-process, no threads	Simplicity Fast intra-process messaging	Hard to balance workload Can't scale to multiple processors
Uni-process, multi-threaded	Fast intra-process messages Multi-tasking without inter-process communication Better multi-tasking without the overhead of 'heavyweight' processes	Application must be 'thread-safe' Operating system must have efficient thread-management Shared memory issues need to be considered
Multi-process	Scales well as processors are added Relatively easy to distribute across nodes	Sensitive to process boundary: using inter-process communication too much hurts performance Swapping and context switches are expensive Harder to design

A typical evolutionary path is to start with a uni-process architecture, adding processes for groups of behaviors that need to occur simultaneously. Within these broader groupings, consider additional needs for concurrency, adding threads within processes to increase concurrency.

The initial starting point is to assign many active objects to a single operating system task or thread, using a purpose-built active object scheduler - this way it is usually possible to achieve a very lightweight simulation of concurrency, although, with a single operating system task or thread, it will not be possible to take advantage of machines with multiple CPUs. The key decision is to isolate blocking behavior in separate threads, so that blocking behavior does not become a bottleneck. This will result in a separation of active objects with blocking behavior into their own operating system threads.

In real-time systems, this reasoning applies equally to capsules - each capsule has a logical thread of control, which may or may not share an operating system thread, task or process with other capsules.

Issues

Unfortunately, like many architectural decisions, there are no easy answers; the right solution involves a carefully balanced approach. Small architectural prototypes can be used to explore the implications of a particular set of choices. In prototyping the process architecture, focus on scaling the number of processes up to the theoretical maximums for the system. Consider the following issues:

Can the number of processes be scaled up to the maximum? How far beyond the maximum can the system be pushed? Is there allowance for potential growth?
What is the impact of changing some of the processes to lightweight threads which operate in a shared process address space?
What happens to response time as the number of processes are added? As the amount of inter-process communication (IPC) is increased? Is there noticeable degradation?
Could the amount of IPC be reduced by combining or reorganizing processes? Would such a change result in large monolithic processes which are difficult to load-balance?
Can shared memory be used to reduce IPC?
Should all processes get "equal time" when time resources are allocated? Is it possible to carry the time allocation? Are there potential draw-backs to changing the scheduling priorities?

Inter-Object Communications

Active objects can communicate with each other synchronously or asynchronously. Synchronous communication is useful because it can simplify complex collaborations through strictly controlled sequencing. That is, while an active object is executing a run-to-completion step that involves synchronous invocations of other active objects, any concurrent interactions initiated by other objects can be ignored until the full sequence is completed.

While this is useful in some cases, it can also be problematic since it can happen that a more important high-priority event may have to wait (priority inversion). This is exacerbated by the possibility that the synchronously invoked object may itself be blocked waiting on a response to a synchronous invocation of its own. This can lead to unbounded priority inversion. In the most extreme case, if there is circularity in the chain of synchronous invocations, it can lead to deadlock.

Asynchronous invocations avoid this problem enabling bounded response times. However, depending on the software architecture, asynchronous communication often leads to more complex code since an active object may have to respond to several asynchronous events (each of which might entail a complex sequence of asynchronous interactions with other active objects) at any time. This can be very difficult and error prone to implement.

The use of an asynchronous messaging technology with assured message delivery can simplify the application programming task. The application can continue operation even if the network connection or remote application is unavailable. Asynchronous messaging does not preclude using it in a synchronous mode. Synchronous technology will require a connection to be available whenever the application is available. Because a connection is known to exist, handling commit processing may be easier.

In the approach recommended in the Rational Unified Process for real-time systems, capsules communicate asynchronously through the use of signals, according to particular protocols. It is possible, nevertheless to achieve synchronous communication through the use of signal pairs, one in each direction.

Pragmatics

Although the context-switching overhead of active objects may be very low, it is possible that some applications may still find that cost unacceptable. This typically occurs in situations where large amounts of data need to be processed at a high rate. In those cases, we may have to fall back to using passive objects and more traditional (but higher risk) concurrency management techniques such as semaphores.

These considerations, however, do not necessarily imply that we must abandon the active object approach altogether. Even in such data-intensive applications, it is often the case that the performance sensitive part is a relatively small portion of the overall system. This implies that the rest of the system can still take advantage of the active object paradigm.

In general, performance is only one of the design criteria when it comes to system design. If the system is complex, then other criteria such as maintainability, ease of change, understandability, etc. are equally if not even more important. The active object approach has a clear advantage since it hides much of the complexity of concurrency and concurrency management while allowing design to be expressed in application-specific terms as opposed to low-level technology-specific mechanisms.

Heuristics

Focus on Interactions between Concurrent Components

Concurrent components with no interactions are an almost trivial problem. Nearly all of the design challenges have to do with interactions among concurrent tasks, so we must first focus our energy on understanding the interactions. Some of the questions to ask are:

Is the interaction one-directional, bi-directional, or multi-directional?
Is there a client-server or master slave relationship?
Is some form of synchronization required?

Once the interaction is understood, we can think about ways to implement it. The implementation should be selected to yield the simplest design consistent with the performance goals of the system. Performance requirements generally include both overall throughput and acceptable latency in the response to externally generated events.

These issues are even more critical for real-time systems, which are often less tolerant of variations in performance, for example 'jitter' in response time, or missed deadlines.

Isolate and Encapsulate External Interfaces

It is bad practice to embed specific assumptions about external interfaces throughout an application, and it is very inefficient to have several threads of control blocked waiting for an event. Instead, assign a single object the dedicated task of detecting the event. When the event occurs, that object can notify any others who need to know about the event. This design is based upon a well-known and proven design pattern, the "Observer" pattern [GAM94]. It can easily be extended for even greater flexibility to the "Publisher-Subscriber Pattern," where a publisher object acts as intermediary between the event detectors and the objects interested in the event ("subscribers") [BUS96].

Isolate and Encapsulate Blocking and Polling Behavior

Actions in a system may be triggered by the occurrence of externally generated events. One very important externally generated event may be simply the passage of time itself, as represented by the tick of a clock. Other external events come from input devices connected to external hardware, including user interface devices, process sensors, and communication links to other systems. This is overwhelmingly true for real-time systems, which typically have high connectivity with the outside world.

In order for software to detect an event, it must be either blocked waiting for an interrupt, or periodically checking hardware to see if the event has occurred. In the latter case, the periodic cycle may need to be short to avoid missing a short lived event or multiple occurrences, or simply to minimize the latency between the event's occurrence and detection.

The interesting thing about this is that no matter how rare an event is, some software must be blocked waiting for it or frequently checking for it. But many (if not most) of the events a system must handle are rare; most of the time, in any given system, nothing of any significance is happening.

The elevator system provides many good examples of this. Important events in the life of an elevator include a call for service, passenger floor selection, a passenger's hand blocking the door, and passing from one floor to the next. Some of these events require very time-critical response, but all are extremely rare compared to the time-scale of the desired response time.

A single event may trigger many actions, and the actions may depend upon the states of various objects. Furthermore, different configurations of a system may use the same event differently. For example, when an elevator passes a floor the display in the elevator cab should be updated and the elevator itself must know where it is so that it knows how to respond to new calls and passenger floor selections. There may or may not be elevator location displays at each floor.

Prefer Reactive Behavior to Polling Behavior

Polling is expensive; it requires some part of the system to periodically stop what it is doing to check to see if an event has occurred. If the event must be responded to quickly, the system will have to check for event arrival quite frequently, further limiting the amount of other work which can be accomplished.

It is far more efficient to allocate an interrupt to the event, with the event-dependent code being activated by the interrupt. Though interrupts are sometimes avoided because they are considered "expensive", using interrupts judiciously can be far more efficient than repeated polling.

Cases where interrupts would be preferred as an event-notification mechanism are those where event arrival is random and infrequent, such that most polling efforts find that the event had not occurred. Cases where polling would be preferred are those in which events arrive in a regular and predictable manner and most polling efforts find that the event has occurred. In the middle, there will be a point at which one is indifferent to either polling or reactive behavior - either will do equally well and the choice matters little. In most cases, however, given the randomness of events in the real world, reactive behavior is preferred.

Prefer Event Notification to Data Broadcasting

Broadcasting data (typically using signals) is expensive, and is typically wasteful - only a few objects may be interested in the data, but everyone (or many) must stop to examine it. A better, less resource consumptive approach is to use notification to inform only those objects who are interested that some event has occurred. Restrict broadcasting to events which require the attention of many objects (typically timing or synchronization events).

Make Heavy Use of Light-weight Mechanisms and Light Use of Heavy-weight Mechanisms

More specifically:

Use passive objects and synchronous method invocations where concurrency is not an issue but instantaneous response is.
Use active objects and asynchronous messages for the vast majority of application-level concurrency concepts.
Use OS threads to isolate blocking elements. An active object can be mapped to an OS thread.
Use OS processes for maximum isolation. Separate processes are needed if programs need to be started up and shut down independently, and for subsystems which may need to be distributed.
Use separate CPUs for physical distribution or for raw horsepower.

Perhaps the most important guideline for developing efficient concurrent applications is to maximize the use of the lightest weight concurrency mechanisms. Both hardware and operating system software play a major part in supporting concurrency, but both provide relatively heavy-weight mechanisms, leaving a great deal of work to the application designer. We are left to bridge a big gap between the available tools and the needs of concurrent applications.

Active objects help to bridge this gap by virtue of two key features:

They unify the design abstractions by encapsulating the basic unit of concurrency (a thread of control) which can be implemented using any of the underlying mechanisms provided by the OS or CPU.
When active objects share a single OS thread, they become a very efficient, light-weight concurrency mechanism which would otherwise have to be implemented directly in the application.

Active objects also make an ideal environment for the passive objects provided by programming languages. Designing a system entirely from a foundation of concurrent objects without procedural artifacts like programs and processes leads to more modular, cohesive, and understandable designs.

Eschew performance bigotry

In most systems less than 10% of the code uses more than 90% of the CPU cycles.

Many system designers act as though every line of code must be optimized. Instead, spend your time optimizing the 10% of the code that runs most often or takes a long time. Design the other 90% with an emphasis on understandability, maintainability, modularity, and ease of implementation.

Choosing Mechanisms

The non-functional requirements and the architecture of the system will affect the choice of mechanisms used to implement remote procedure calls. An overview of the kinds of trade-offs between alternatives is presented below.

Mechanism	Uses	Comments
Messaging	Asynchronous access to enterprise servers	Messaging middleware can simplify the application programming task by handling queuing, timeout and recovery/restart conditions. You can also use messaging middleware in a pseudo-synchronous mode. Typically, messaging technology can support large message sizes. Some RPC approaches may be limited in message sizes, requiring additional programming to handle large messages.
JDBC/ODBC	Database calls	These are database-independent interfaces for Java servlets or application programs to make calls to databases that may be on the same or another server.
Native interfaces	Database calls	Many database vendors have implemented native application program interfaces to their own databases which offer a performance advantage over ODBC at the expense of application portability.
Remote Procedure Call	To call programs on remote servers	You may not need to program at the RPC level if you have an application builder that takes care of this for you.
Conversational	Little used in e-business applications	Typically low-level program-to-program communication using protocols such as APPC or Sockets.

Summary

Many systems require concurrent behavior and distributed components. Most programming languages give us very little help with either of these issues. We have seen that we need good abstractions to understand both the need for concurrency in applications, and the options for implementing it in software. We have also seen that, paradoxically, while concurrent software is inherently more complex than non-concurrent software, it is also capable of vastly simplifying the design of systems which must deal with concurrency in the real world.