Pages: 18 (4442 words) Published: March 3, 2013
Defining and Measuring Supercomputer Reliability,
Availability, and Serviceability (RAS)
Jon Stearley
Sandia National Laboratories
Albuquerque, New Mexico

Abstract. The absence of agreed definitions and metrics for supercomputer RAS
obscures meaningful discussion of the issues involved and hinders their solution.
This paper seeks to foster a common basis for communication about supercomputer RAS, by proposing a system state model, definitions, and measurements.
These are modeled after the SEMI-E10 [1] specification which is widely used in
the semiconductor manufacturing industry.

1 Impetus
The need for standardized terminology and metrics for supercomputer RAS begins
with the procurement process, as the following quotation summarizes well:
“prevailing procurement practices are ... a lengthy and expensive undertaking both for the government and for participating vendors. Thus any technically
valid methodologies that can standardize or streamline this process will result
in greater value to the federally-funded centers, and greater opportunity to focus on the real problems involved in deploying and utilizing one of these large
systems.” [2]
Appendix A provides several examples of “numerous general hardware and software
specifications” [2] from the procurements of several modern systems. While there are
clearly common issues being communicated, the language used is far from consistent.
Sites struggle to describe their reliability needs, and vendors strive to translate these
descriptions into capabilities they can deliver to multiple customers. Another example
is provided by this excerpt:
“The system must be reliable... It is important to define what we mean by
reliable. We do not mean high availability... Reliability in this context means
that a large parallel job running for many hours has a high probability of successfully completing. It is measured by the mean time between job failures.
Note that the system can undergo a failure that does not lead to loss of a
job without affecting reliability - this is important to developing reliability enhancement strategies. A related requirement would be that if the system undergoes a failure that is local, only jobs using that local resource are affected. This
kind of aspect of reliability we also call resiliency. Note that a system can have
very high availability and not be reliable for our purposes. It is, by contrast,
unlikely that a system that has low availability could have high reliability.” [3]

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under Contract DE-AC04-94AL85000.

Standardized terms and measurements for supercomputer RAS will streamline the procurement process.
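
The distinction drawn in the second quotation, between availability and reliability measured by the mean time between job failures, can be made concrete with a small sketch. The definitions used here are assumptions chosen for illustration only: availability is taken as uptime divided by total time, and mean time between job failures as total job runtime divided by the number of failed jobs. The names Job, availability, and mean_time_between_job_failures, and all of the numbers, are hypothetical; they are not the definitions proposed later in this paper or used by any particular site.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    """One entry in a hypothetical job log."""
    runtime_hours: float  # wall-clock hours the job ran before it ended
    failed: bool          # True if the job was lost to a system failure


def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Assumed definition: fraction of total time the system was up."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total time must be positive")
    return uptime_hours / total


def mean_time_between_job_failures(jobs: List[Job]) -> Optional[float]:
    """Assumed definition: total job runtime / number of failed jobs.

    Returns None when no job failed, since the ratio is then undefined.
    """
    failures = sum(1 for job in jobs if job.failed)
    if failures == 0:
        return None
    total_runtime = sum(job.runtime_hours for job in jobs)
    return total_runtime / failures


if __name__ == "__main__":
    # Made-up numbers: the system is up 99 of 100 hours, yet half of the
    # long-running jobs are lost to failures that caused little downtime.
    jobs = [Job(10.0, True), Job(10.0, False), Job(10.0, True), Job(10.0, False)]
    print("availability:", availability(99.0, 1.0))
    print("mean time between job failures (h):", mean_time_between_job_failures(jobs))
```

With these made-up inputs the system looks highly available while long jobs still fail often, which is the case the quotation describes as "very high availability and not ... reliable for our purposes".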
Once a system is operational, even a simple phrase like “the system is up” can
have very different meanings depending on who is speaking, who is listening, and what system is being described. Categorizing the type and impact of undesired system events
is similarly unclear. For example, is an intermittent response from an I/O node an interrupt or a failure, and how can its effect on users be measured? In both operational and
design review, it is difficult to have meaningful discussions due to an inability to agree
on terminology. Making complex supercomputers reliable is difficult even with clear
communication, but unclear communication further complicates the process and delays
progress. Standardized terms and measurements will facilitate practical improvements
in RAS performance.
Not all sites track RAS data for their supercomputer(s), and comparing data from
those sites that do requires careful review of their definitions and calculations. For
example, both NERSC and LLNL do an excellent job of tracking RAS data (NERSC
data is public at http://www.nersc.gov/nusers/status/AvailStats/;
LLNL provided extensive data to me upon request) - matching words and metric names
are used, but...