To understand why FLASH offers the perfect experimental vehicle for coherence protocol comparisons, it is necessary to first discuss the FLASH architecture as well as the
micro-architecture of the FLASH node controller. This chapter begins with the FLASH
design rationale and follows with a description of the FLASH node. Particularly highlighted are elements of the FLASH architecture that support a flexible communication
protocol environment or aid in the deadlock avoidance strategy of the machine. A crucial
part of this performance study is that the protocols discussed here are full-fledged protocols that run on a real machine and therefore must deal with all the real-world problems
encountered by a general-purpose multiprocessor architecture. In the context of the
FLASH multiprocessor, this chapter discusses the centralized resource usage and deadlock
avoidance issues that each protocol must handle in addition to managing the directory
state of cache lines in the system.
3.1 An Argument for Flexibility
The previous chapter described four vastly different cache coherence protocols, ranging
from the simple bit-vector/coarse-vector protocol to very complex protocols like SCI and
COMA. While the bit-vector protocol is amenable to hardware implementation in finite
state machines, the task of implementing all-hardware SCI and COMA protocols is much
more daunting. Not only would the more complex protocols require more hardware
resources, but they would also be more difficult to debug and verify. In machines with
fixed, hard-wired protocols, the ability to verify the correctness of the protocol is a critical
factor in the choice of protocol. If the cache coherence protocol is incorrect, the machine
will simply be unstable. A machine is either cache-coherent, or it isn't.
For this reason, commercial DSM machines usually employ the bit-vector protocol
because it is the simplest, or the SCI protocol because the IEEE has made it a standard and
assures designers that the protocol is correct as long as they adhere to that standard.
Unfortunately, in practice it is difficult to verify a hardware implementation of either the
bit-vector or the SCI protocol. In addition, there is the larger issue of which protocol is the
best protocol anyway. The scalability issues of the four protocols described in Chapter 2
are simply not well understood. The robustness of the protocols over a wide range of
machine sizes, application characteristics, and architectural parameters is an open question, and the main focus of this research.
Rather than fix the cache coherence protocol in hardware finite state machines at design
time, an alternative approach is to design a flexible, programmable node controller architecture that implements the coherence protocol in software. This was precisely the
approach taken by Sequent when designing their SCI-based NUMA-Q machine.
Figure 3.1 shows the cache state diagram for the Sequent SCI protocol. The details are not
important; note only that it is extraordinarily complicated! The SCI protocol is complicated enough that one could argue the best way to implement SCI is in a programmable
environment where debugging, verification, and future protocol optimizations are easier
than in an all-hardware approach.
The difficulty with designing programmable node controllers is doing so without sacrificing the performance of an all-hardware solution. This was the primary goal of the Stanford FLASH multiprocessor. The FLASH approach takes flexibility one step further and
embeds in the node controller a protocol processor capable of running multiple cache
coherence protocols. This flexibility allows the use of FLASH to explore the performance
impact of multiple cache coherence protocols on the same underlying architecture. The
next sections describe the FLASH node and the node controller architecture in detail.
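
The idea of running more than one coherence protocol on the same controller can be illustrated with a small sketch. It is not FLASH handler code: the message kinds ("GET", "GETX"), the class names, and the handler-table layout are all hypothetical, chosen only to show a controller that dispatches messages through a replaceable table of software handlers. The two toy directories loosely mirror the bit-vector and coarse-vector schemes mentioned above, and all real-world concerns (deadlock avoidance, resource limits, replies) are omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Message:
    kind: str  # hypothetical message kinds: "GET" (read) or "GETX" (write)
    addr: int  # cache line address
    src: int   # id of the requesting node


Handler = Callable[[Message], List[int]]  # returns the nodes to invalidate


@dataclass
class BitVectorDirectory:
    """Toy directory with one presence bit per node for each cache line."""
    num_nodes: int
    sharers: Dict[int, int] = field(default_factory=dict)  # addr -> bitmask

    def handle_get(self, msg: Message) -> List[int]:
        # A read only records the requester as a sharer.
        self.sharers[msg.addr] = self.sharers.get(msg.addr, 0) | (1 << msg.src)
        return []

    def handle_getx(self, msg: Message) -> List[int]:
        # A write invalidates every other recorded sharer.
        mask = self.sharers.get(msg.addr, 0)
        invals = [n for n in range(self.num_nodes)
                  if mask & (1 << n) and n != msg.src]
        self.sharers[msg.addr] = 1 << msg.src
        return invals

    def handlers(self) -> Dict[str, Handler]:
        return {"GET": self.handle_get, "GETX": self.handle_getx}


@dataclass
class CoarseVectorDirectory:
    """Toy directory where each bit stands for a group of nodes."""
    num_nodes: int
    group_size: int
    sharers: Dict[int, int] = field(default_factory=dict)  # addr -> bitmask

    def handle_get(self, msg: Message) -> List[int]:
        bit = 1 << (msg.src // self.group_size)
        self.sharers[msg.addr] = self.sharers.get(msg.addr, 0) | bit
        return []

    def handle_getx(self, msg: Message) -> List[int]:
        # Every node in a marked group is invalidated, sharer or not.
        mask = self.sharers.get(msg.addr, 0)
        invals = [n for n in range(self.num_nodes)
                  if mask & (1 << (n // self.group_size)) and n != msg.src]
        self.sharers[msg.addr] = 1 << (msg.src // self.group_size)
        return invals

    def handlers(self) -> Dict[str, Handler]:
        return {"GET": self.handle_get, "GETX": self.handle_getx}


class NodeController:
    """Dispatches each message to whichever handler table was loaded."""

    def __init__(self, handler_table: Dict[str, Handler]) -> None:
        self.handler_table = handler_table

    def dispatch(self, msg: Message) -> List[int]:
        return self.handler_table[msg.kind](msg)
```

In this sketch, changing protocols means constructing `NodeController` with a different handler table, for example `NodeController(CoarseVectorDirectory(4, 2).handlers())` in place of `NodeController(BitVectorDirectory(4).handlers())`; the dispatch logic itself is unchanged. The coarse-vector variant trades directory precision for space, so a write may send invalidations to nodes that never held the line.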
3.2 FLASH and MAGIC...