Wim P. Krijnen November 10, 2009
The purpose of this book is to give an introduction to statistics in order to solve some problems of bioinformatics. Statistics provides procedures to explore and visualize data as well as to test biological hypotheses. The book intends to be introductory in explaining and programming elementary statistical concepts, thereby bridging the gap between high school levels and the specialized statistical literature. After studying this book readers have a sufficient background for Bioconductor Case Studies (Hahne et al., 2008) and Bioinformatics and Computational Biology Solutions Using R and Bioconductor (Gentleman et al., 2005). The theory is kept minimal and is always illustrated by several examples with data from research in bioinformatics. The prerequisites for following the line of reasoning are limited to basic high-school knowledge about functions. It may, however, help to have some knowledge of gene expression values (Pevsner, 2003), statistics (Bain & Engelhardt, 1992; Ewens & Grant, 2005; Rosner, 2000; Samuels & Witmer, 2003), and elementary programming. To support self-study, a sufficient number of challenging exercises is given, together with an appendix with answers.

The programming language R is becoming increasingly important because it is not only very flexible in reading, manipulating, and writing data, but all its outcomes are directly available as objects for further programming. R is a rapidly growing language that makes basic as well as advanced statistical programming easy. From an educational point of view, R provides the possibility to combine the learning of statistical concepts through mathematics, programming, and visualization. The plots and tables produced by R can readily be used in typewriting systems such as Emacs, LaTeX, or Word.

Chapter 1 gives a brief introduction to the basic functionalities of R. Chapter 2 starts with univariate data visualization and the most important descriptive statistics.
Chapter 3 gives commonly used discrete and continuous distributions to model events and the probability by which these occur. These distributions are applied in Chapter 4 to statistically test hypotheses from bioinformatics. For each test the statistics involved are briefly explained and its application is illustrated by examples. In Chapter 5 linear models are explained, following a basic approach, and applied to testing for differences between groups. In Chapter 6 the three phases of analysis of microarray data (preprocessing, analysis, postprocessing) are briefly introduced and illustrated by many examples that bring ideas together with R scripts and the interpretation of results. Chapter 7 starts with an intuitive approach to Euclidean distance
and explains how it can be used in two well-known types of cluster analysis to find groups of genes. It also explains how principal components analysis can be used to explore a large data matrix for the direction of largest variation. Chapter 8 shows how gene expression values can be used to predict the diagnosis of patients. Three such prediction methods are illustrated and compared. Chapter 9 introduces a query language to download sequences efficiently and gives various examples of computing important quantities such as alignment scores. Chapter 10 introduces the concept of a probability transition matrix, which is applied to the estimation of phylogenetic trees and (Hidden) Markov Models.

R commands come after the R prompt >, except when commands are part of the ongoing text. Input and output of R will be given in verbatim typewriting style. To save space, sometimes not all of the original output from R is printed. The end of an example is indicated by a box symbol. In the Portable Document Format (PDF) version of the book there are many links to the Index, Table of Contents, Equations, Tables, and Figures. Readers are encouraged to copy and paste scripts from the PDF into the R system in order to study its...