Chemometrics

Accepted Manuscript
Tree-based ensemble methods and their applications in analytical chemistry
Dong-Sheng Cao, Jian-Hua Huang, Yi-Zeng Liang, Qing-Song Xu, Liang-Xiao Zhang
PII: S0165-9936(12)00233-6
DOI: http://dx.doi.org/10.1016/j.trac.2012.07.012
Reference: TRAC 13930
To appear in: Trends in Analytical Chemistry

Please cite this article as: D-S. Cao, J-H. Huang, Y-Z. Liang, Q-S. Xu, L-X. Zhang, Tree-based ensemble methods and their applications in analytical chemistry, Trends in Analytical Chemistry (2012), doi: http://dx.doi.org/10.1016/j.trac.2012.07.012

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Tree-based ensemble methods and their applications in analytical chemistry
Dong-Sheng Cao, Qing-Song Xu, Liang-Xiao Zhang, Jian-Hua Huang, Yi-Zeng Liang

Data from high-throughput analytical instruments have become increasingly large and complex, bringing a number of challenges to statistical modeling. To understand complex data further, new statistically efficient approaches are urgently needed to: (1) select salient features from the data; (2) discard uninformative data; (3) detect outlying samples in the data; (4) visualize existing patterns in the data; (5) improve the prediction accuracy of models built on the data; and, finally, (6) feed back to the analyst understandable summaries of the information in the data. We review current developments in tree-based ensemble methods for effectively mining the knowledge hidden in chemical and biological data. We report on applications of these algorithms to variable selection, outlier detection, supervised pattern analysis, cluster analysis, tree-based kernels and ensemble learning. Through this report, we wish to inspire chemists to take greater interest in decision trees and to obtain greater benefits from using tree-based ensemble techniques.
Keywords: Chemometrics; Classification and regression tree (CART); Cluster analysis; Complex data; Ensemble algorithm; Kernel method; Outlier detection; Pattern analysis; Tree-based ensemble; Variable selection

Dong-Sheng Cao, Jian-Hua Huang, Yi-Zeng Liang*
Research Center of Modernization of Traditional Chinese Medicines, Central South University, Changsha 410083, P. R. China

Qing-Song Xu
School of Mathematics and Statistics, Central South University, Changsha 410083, P. R. China

Liang-Xiao Zhang
Key Laboratory of Separation Science for Analytical Chemistry, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, Dalian 116023, China

Corresponding author. Tel.: +86 731 88830824; Fax: +86 731 88830831; E-mail: yizeng_liang@263.net

1. Introduction

Traditionally, multivariate statistical techniques, including partial least squares (PLS), principal-component analysis (PCA) and Fisher discriminant analysis (FDA), have played an important role in chemistry [1]. They have been widely used in analytical chemistry for model building and data analysis. However, these approaches have been greatly challenged by the emergence of more complex data from modern high-throughput analytical instruments, which usually show the following characteristics: (1) Outliers are commonly generated by experimental errors or uncontrolled factors; (2) Chemical and biological data from high-throughput analytical experiments have a large number of variables, most of which are irrelevant to the analysis or may even interfere with it, while the sample size is comparatively small. This is the so-called “large p, small n” problem, which has proved to be very challenging in statistical learning; (3) Most chemical and...