Statistical
and Neural Classifiers:
An integrated approach to design.
Springer, London.
Preface
-
Les hommes de chez toi, dit le petit prince,
cultivent cinq mille roses dans un même jardin...
et ils n'y trouvent pas ce qu'ils cherchent...
- Ils ne le trouvent pas, répondis-je...
Et cependant ce qu'ils cherchent pourrait être
trouvé dans une seule rose ou un peu d'eau...
- Bien sûr, répondis-je.
Et le petit prince ajouta:
- Mais les yeux sont aveugles.
Il faut chercher avec le coeur.
Antoine de Saint-Exupery "Le Petit Prince",
Chapitre XXV
In his book Antoine de Saint-Exupery wrote: "... In your country, people
plant five thousand roses in the same garden ... and they do not find
what they are searching for. Meanwhile, they could find everything they
are seeking for in a single rose, or in a drop of water... However,
eyes are blind. You have to seek by your heart." I am fond of Antoine
de Saint Exupery. I like his books and I believe in these words.
When, after 25 years of research work in multivariate statistical analysis
and statistical pattern recognition, I became interested in Artificial
Neural Networks (ANN), I remembered de Saint-Exupery's words about the
single rose. Instead of investigating multilayer perceptrons with thousands
of neurones, I at first began to use statistical methods to analyse
a single neurone ( a single layer perceptron (SLP) and plain back-propagation
(BP) training algorithm. The SLP and BP training are simplified mathematical
models of complex information processing phenomena that take place in
nature. I discovered that a single neurone can explain much about complex
brain-learning behaviour.
After several years of research work, I learned that during the training
phase, the single layer perceptron's weights are increasing and, therefore,
the statistical properties of the cost function that is minimised during
the training process are also changing. In its dynamical evolution,
the SLP classifier can actually become one of several statistical classifiers
that differ in their complexity.
At first the SLP classifier behaves as a simple Euclidean distance classifier.
In this situation each pattern class is characterised by a sample mean
vector. In further training, the SLP classifier begins to evaluate correlations
and variances of features and almost becomes the standard linear Fisher
classifier. Further, the SLP begins to behave as a robust classifier
that ignores atypical training patterns. Later on in the training phase
the SLP behaves as a classifier that minimizes the number of incorrectly
classified training patterns. If there are no training errors, the SLP
maximizes the margin between the decision boundary and the closest training-pattern
vectors. Thus, the decision boundary is halfway between the training
patterns and behaves as a support vector classifier where a fixed number
of training patterns determine a position of the decision boundary.
Thus, the SLP classifier begins as the simplest possible statistical
classifier designed under the assumption of Gaussian pattern classes
and ends as a complex classifier that can perform well with non-Gaussian
pattern classes.
One more interesting peculiarity of the SLP classifier is that the performance
(the generalization error) of the perceptron depends on the initial
conditions. If the starting perceptron weight vector is almost optimal,
the SLP classifier initially contains much useful information. This
information can be utilized to determine the final weight vector. To
preserve and use the information in the initial weights, one must not
overtrain the SLP. This is a very valuable property of adaptively trained
neural networks.
The more I work in the ANN discipline, the more I marvel at the remarkable
qualities of neural networks. Specifically, I am amazed that the SLP
classifier dynamics progress from the simplest algorithms to the most
complex algorithms in a natural manner. Statisticians and engineers
have long understood that in designing decision-making algorithms from
experimental data one needs to progress from simple algorithms to complex
algorithms. The artificial neurone accomplishes this complexity progression
in a natural manner. Statisticians required several decades to develop
a number of statistical classification and regression rules: Fisher
(1936) proposed his linear discriminant function more than six decades
ago and Vapnik devised his support vector machine (Vapnik, 1995) only
recently. The neurone, however, implements this algorithm design progression
in a logical sequence. One can think of this progression as nature's
method.
There is a plethora of useful information currently available in the
field of statistical pattern recognition. The main element of statistical
pattern recognition is the assumption that the pattern vectors are random
vectors that can be modelled as observations from multivariate distributions.
In the parametric approach, the classifier designer assumes he knows
this multivariate distribution precisely. In order to determine the
classification algorithm, one needs to estimate the unknown multivariate
density function parameters using the training data. Researchers have
found that the more complex the multivariate density model assumed,
or equivalently, the greater the number of parameters to be estimated,
the greater the number of training vectors that must be employed to
adequately determine the classifier. Therefore, a large number of parsimonious
multivariate distribution densities have been formulated for this purpose.
One of the most interesting and important facts utilized in parametric
classification is that if some of the pattern-class densities' parameters
are in common, these parameters have a negligible influence on the increase
in the generalization error. This is a very favourable property of the
statistical parametric classifier approach. On the other hand, incorrect
assumptions about the type of the probabilistic distribution assumed
for the pattern vectors lead to an increase in the classification error.
This is one of the main shortcomings of the parametric statistical classifier
approach.
Being interested in both statistical pattern recognition and artificial
neural network theory, I perceived a certain conflict between these
two classification paradigms, and sometimes even a dissonance among
proponents of these two classification methods. Statisticians generally
have good mathematical backgrounds with which to analyse decision-making
algorithms theoretically. They have proven many rigorous results concerning
the optimality of statistical classification algorithms. However, they
often pay little or no attention to the applicability of their own theoretical
results and generally do not heed practical or even theoretical results
obtained by ANN researchers.
ANN scientists advocate that one should make no assumptions concerning
the multivariate densities assumed for the pattern classes. They, instead,
propose that one should assume only the structure of the decision-making
rules, for example a linear discriminant function in the space of original
or transformed features, and then estimate the unknown rule coefficients
(weights) directly from the training data. For this they suggest one
minimize the number of errors incurred while classifying the training
vectors (empirical error). Many such algorithms have been suggested
to solve practical problems. Some of these algorithms have a theoretical
justification and some have no theoretical elucidation yet.
Known properties and deficiencies of both statistical and neural classification
algorithms hints that one should integrate the two classifier design
strategies and utilize their good qualities. There are three key aspects
of this integration. The first key is the fact that the correct initial
weights of the perceptron contain information that can be saved for
use in future training processes. Thus, we can utilize pattern classifiers
based on statistical methods to define the initial perceptron weight
vector. The second key is the fact that, during training, the perceptron
changes its statistical properties and evolves from simple classification
algorithms to more complex classification algorithms. The third key
is the fact that, one can use the diversity of statistical methods and
the multivariate models to perform different whitening data transformations,
where the input variables are decorrelated and scaled in order to have
the same variances. Then while training the perceptron in the transformed
feature space, we can obtain the Euclidean distance classifier after
the very first iteration. In the original feature space, the weight
vector of this classifier is equivalent to the decision making rule
found by utilizing the statistical methods and the multivariate models
just mentioned. Further training can diminish the negative aspects of
approximately correct or incorrect statistical assumptions.
Thus, it is possible to merge the statistical and neural approaches.
Specifically, instead of using statistical methods and the multivariate
models directly to design the classifier, we can use them to whiten
the data. We can then train the perceptron paying special attention
to the optimal stopping time. The data whitening reduces the generalisation
error and simultaneously speeds up the training process. This approach
merges the qualities of both statistical and neural classification algorithm
design strategies. Investigations of the visual cortex in biological
systems, however, have shown that the input decorrelation technique
is already realized in natural systems. This is one more piece of evidence
that the data decorrelation and scaling technique performed prior to
perceptron training is a natural method of information processing.
An objective of this book is to explain the details necessary to understand
and utilise the integrated statistical and neural net approach to design
the classification rules. We, therefore, present a discussion of the
diversity of linear and non-linear statistical pattern classification
algorithms that can be utilised in an advanced neural network analysis.
Special attention is paid to the assumptions used to design the algorithm,
the generalisation error, and the training-set size relationships. Knowledge
of these relationships allows one to analyse and compare the amount
of information obtained from the training data, the assumptions, or
from the educated guesses utilised to construct the decision-making
algorithm. This is perhaps the central question that arises in machine
learning and classifier design theory.
Performance, complexity, and training-set size relationships in the
nonparametric neural net approach have been discussed in a number of
books (Vapnik, 1982, 1995; Wolpert, 1995; Cherkassky and Mulier, 1996;
Vidyasagar, 1997, etc.). According to Wolpert, "the statistical and
neural net approaches have their own jargon, their own mathematical
models, their own concern, and their own results. And, for the most
part, they don't interact". This book primarily takes a statistical
point of view but does not ignore other approaches. Alternative texts
are Raudys (1976), Aivazian et al. (1988), Fukunaga (1990), McLachlan
(1992), Devroye, Gyorfi and Lugosi (1996), Duda, Hart, and Stork (2000).
The present book, however, is more focused on the integration of statistical
and neural approaches to design the classification algorithms. In order
to focus on performance, complexity, and design set size relationships
more deeply in this book, I employ a simple formulation of the pattern
recognition problem. For more general formulations of the pattern recognition
problem and related questions I refer interested readers to broader
texts such as Fukunaga's book. To make the book accessible to more readers,
I adopt Fukunaga's notation.
The book is targeted to graduate students and research workers in data
modelling, pattern recognition, and artificial neural networks. No special
background beyond a good working knowledge of probability and statistics,
elements of linear algebra, and calculus at the undergraduate level
is required. However, one will benefit by having a popular pattern recognition
or neural networks book (e.g., Fukunaga (1990) or Haykin (1998)) close
at hand.
The book is organized somewhat like a reference book. At the same time
I pay particular attention to the ideas used to design and analyse statistical
classification algorithms that can be useful for understanding artificial
neural network classifiers. For analysis of neural networks and statistical
algorithms, the most important aspect is assumptions utilised in the
algorithm design process. Therefore, in order to have a comprehensive
point of view of the problem, I omit a part of the details concerning
the estimation of parameters of well known statistical algorithms that
can be found in the popular literature. To make understanding the main
ideas easier, I provide a number of simple illustrative examples.
In the first chapter, I present the main definitions and terms and review
the effects of finite training-set size on classifiers. In the second
chapter, I review principles of the statistical decision theory, review
important statistical multivariate data models, and give a taxonomy
of pattern classification algorithms that can be obtained or improved
while training ANN classification systems. In the third chapter, I present
known results concerning the performance and generalisation error relationships
for a number of parametric and nonparametric classification algorithms.
In the fourth chapter, I consider training peculiarities and the generalisation
and complexity of neural classifiers. In the fifth chapter, I explain
the integration of the statistical and neural classification approaches.
In the sixth and final chapter, I consider the topic of model selection,
paying special attention to the accuracy of solutions to this important
topic.
Contents
Abbreviations and Notations
1. Quick Overview
1.1 The Classifier Design Problem
1.2 Single Layer and Multilayer Perceptrons
1.3 The SLP as the Euclidean Distance and the Fisher Linear Classifiers
1.4 The Generalisation Error of the EDC and the Fisher DF
1.5 Optimal Complexity - The Scissors Effect
1.6 Overtraining in Neural Networks
1.7 Bibliographical and Historical Remarks
2. Taxonomy of Pattern Classification Algorithms
2.1 Principles of Statistical Decision Theory
2.2 Four Parametric Statistical Classifiers
2.2.1 The Quadratic Discriminant Function
2.2.2 The Standard Fisher Linear Discriminant Function
2.2.3 The Euclidean Distance Classifier
2.2.4 The Anderson-Bahadur Linear DF
2.3 Structures of the Covariance Matrices
2.3.1 A Set of Standard Assumptions
2.3.2 Block Diagonal Matrices
2.3.3 The Tree Type Dependence Models
2.3.4 Temporal Dependence Models
2.4 The Bayes Predictive Approach to Design Optimal Classification Rules
2.4.1 A General Theory
2.4.2 Learning the Mean Vector
2.4.3 Learning the Mean Vector and CM
2.4.4 Qualities and Shortcomings
2.5. Modifications of the Standard Linear and Quadratic DF
2.5.1 A Pseudo-Inversion of the Covariance Matrix
2.5.2 Regularised Discriminant Analysis (RDA)
2.5.3 Scaled Rotation Regularisation
2.5.4 Non-Gausian Densities
2.5.5 Robust Discriminant Analysis
2.6 Nonparametric Local Statistical Classifiers
2.6.1 Methods Based on Mixtures of Densities
2.6.2 Piecewise-Linear Classifiers
2.6.3 The Parzen Window Classifier
2.6.4 The k-NN Rule and a Calculation Speed
2.6.5 Polynomial and Potential Function Classifiers
2.7 Minimum Empirical Error and Maximal Margin Linear Classifiers
2.7.1 The Minimum Empirical Error Classifier
2.7.2 The Maximal Margin Classifier
2.7.3 The Support Vector Machine
2.8 Piecewise-Linear Classifiers
2.8.1 Multimodal Density Based Classifiers
2.8.2 Architectural Approach to Design of the Classifiers
2.8.3 Decision Tree Classifiers
2.9 Classifiers for Categorical Data
2.9.1 Multinomial Classifiers
2.9.2 Estimation of Parameters
2.9.3 Decision Tree and the Multinomial Classifiers
2.9.4 Linear Classifiers
2.9.5 Nonparametric Local Classifiers
2.10 Bibliographical and Historical Remarks
3. Performance and the Generalisation Error
3.1 Bayes, Conditional, Expected, and Asymptotic Probabilities of Misclassification
3.1.1 The Bayes Probability of Misclassification
3.1.2 The Conditional Probability of Misclassification
3.1.3 The Expected Probability of Misclassification
3.1.4 The Asymptotic Probability of Misclassification
3.1.5 Learning Curves: An Overview of Different Analysis Methods
3.1.6 Error Estimation
3.2 Generalisation Error of the Euclidean Distance Classifier
3.2.1 The Classification Algorithm
3.2.2 Double Asymptotics in the Error Analysis
3.2.3 The Spherical Gaussian Case
3.2.3.1 The Case N2 = N1
3.2.3.2 The Case N2 >> N1
3.3 Most Favourable and Least Favourable Distributions of the Data
3.3.1 The Non-Spherical Gaussian Case
3.3.2 The Most Favourable Distributions of the Data
3.3.3 The Least Favourable Distributions of the Data
3.3.4 Intrinsic Dimensionality
3.4 Generalisation Errors for Modifications of the Standard Linear Classifier
3.4.1 The Standard Fisher Linear DF
3.4.2 The Double Asymptotics for the Expected Error
3.4.3 The Conditional Probability of Misclassification
3.4.4 A Standard Deviation of the Conditional Error
3.4.5 Favourable and Unfavourable Distributions
3.4.6 Theory and Real-World Problems
3.4.7 The Linear Classifier D for the Diagonal CM
3.4.8 The Pseudo-Fisher Classifier
3.4.9 The Regularised Discriminant Analysis
3.5 Common Parameters in Different Competing Pattern Classes
3.5.1 The Generalisation Error of the Quadratic DF
3.5.2 The Effect of Common Parameters in Two Competing Classes
3.5.3 Unequal Sample Sizes in Plug-In Classifiers
3.6 Minimum Empirical Error and Maximal Margin Classifiers
3.6.1 Favourable Distributions of the Pattern Classes
3.6.2 VC Bounds for the Conditional Generalisation Error
3.6.3 Unfavourable Distributions for the Euclidean Distance and Minimum
Empirical Error Classifiers
3.6.4 Generalisation Error in the Spherical Gaussian Case
3.6.5 Intrinsic Dimensionality
3.6.6 The Influence of the Margin
3.6.7 Characteristics of the Learning Curves
3.7 Parzen Window Classifier
3.7.1 The Decision Boundary of the PW Classifier with Spherical Kernels
3.7.2 The Generalisation Error
3.7.3 Intrinsic Dimensionality
3.7.4 Optimal Value of the Smoothing Parameter
3.7.5 The k-NN Rule
3.8 Multinomial Classifier
3.9 Bibliographical and Historical Remarks
4. Neural Network Classifiers
4.1 Training Dynamics of the Single Layer Perceptron
4.1.1 The SLP and its Training Rule
4.1.2 The SLP as Statistical Classifier
4.1.2.1 The Euclidean Distance Classifier
4.1.2.2 The Regularised Discriminant Analysis
4.1.2.3 The Standard Linear Fisher Classifier
4.1.2.4 The Pseudo-Fisher Classifier
4.1.2.5 Dynamics of the Magnitudes of the Weights
4.1.2.6 The Robust Discriminant Analysis
4.1.2.7 The Minimum Empirical Error Classifier
4.1.2.8 The Maximum Margin (Support Vector) Classifier
4.1.3 Training Dynamics and Generalisation
4.2 Non-linear Decision Boundaries
4.2.1 The SLP in Transformed Feature Space
4.2.2 The MLP Classifier
4.2.3 Radial Basis-Function Networks
4.2.4 Learning Vector Quantisation Networks
4.3 Training Peculiarities of the Perceptrons
4.3.1 Cost Function Surfaces of the SLP Classifier
4.3.2 Cost Function Surfaces of the MLP Classifier
4.3.3 The Gradient Minimisation of the Cost Function
4.4 Generalisation of the Perceptrons
4.4.1 Single Layer Perceptron
4.4.1.1 Theoretical Background
4.4.1.2 The Experiment Design
4.4.1.3 The SLP and Parametric Classifiers
4.4.1.4 The SLP and Structural (Nonparametric) Classifiers
4.4.2 Multilayer Perceptron
4.4.2.1 Weights of the Hidden Layer Neurones are Common for all Outputs
4.4.2.2 Intrinsic Dimensionality Problems
4.4.2.3 An Effective Capacity of the Network
4.5 Overtraining and Initialisation
4.5.1 Overtraining
4.5.2 Effect of Initial Values
4.6 Tools to Control Complexity
4.6.1 The Number of Iterations
4.6.2 The Weight Decay Term
4.6.3 The Antiregularisation Technique
4.6.4 Noise Injection
4.6.4.1 Noise Injection into Inputs
4.6.4.2 Noise Injection into the Weights and into the Outputs of the
Network
4.6.4.3 "Coloured" Noise Injection into Inputs
4.6.5 Control of Target Values
4.6.6 The Learning Step
4.6.7 Optimal Values of the Training Parameters
4.6.8 Learning Step in the Hidden Layer of MLP
4.6.9 Sigmoid Scaling
4.7 The Co-Operation of the Neural Networks
4.7.1 The Boss Decision Rule
4.7.2 Small Sample Problems and Regularisation
4.8 Bibliographical and Historical Remarks
5. Integration of Statistical and Neural Approaches
5.1 Statistical Methods or Neural Nets?
5.2 Positive and Negative Attributes of Statistical Pattern Recognition
5.3 Positive and Negative Attributes of Artificial Neural Networks
5.4 Merging Statistical Classifiers and Neural Networks
5.4.1 Three Key Points in the Solution
5.4.2 Data Transformation or Statistical Classifier?
5.4.3 The Training Speed and Data Whitening Transformation
5.4.4 Dynamics of the Classifier after the Data Whitening Transformation
5.5 Data Transformations for the Integrated Approach
5.5.1 Linear Transformations
5.5.2 Non-linear Transformations
5.5.3 Performance of the Integrated Classifiers in Solving Real-World
Problems
5.6 The Statistical Approach in Multilayer Feed-forward Networks
5.7 Concluding and Bibliographical Remarks
6. Model Selection
6.1 Classification Errors and their Estimation Methods
6.1.1 Types of Classification Error
6.1.2 Taxonomy of Error Rate Estimation Methods
6.1.2.1 Methods for Splitting the Design Set into Training and Validation
Sets
6.1.2.2 Practical Aspects of using the Leave-One-Out Method
6.1.2.3 Pattern Error Functions
6.2 Simplified Performance Measures
6.2.1 Performance Criteria for Feature Extraction
6.2.1.1 Unsupervised Feature Extraction
6.2.1.2 Supervised Feature Extraction
6.2.2 Performance Criteria for Feature Selection
6.2.3 Feature Selection Strategies
6.3 Accuracy of Performance Estimates
6.3.1 Error Counting Estimates
6.3.1.1 The Hold-Out Method
6.3.1.2 The Resubstitution Estimator
6.3.1.3 The Leaving-One-Out Estimator
6.3.1.4 The Bootstrap Estimator
6.3.2 Parametric Estimators for the Linear Fisher Classifier
6.3.3 Associations Between the Classification Performance Measures
6.4 Feature Ranking and the Optimal Number of Features
6.4.1 The Complexity of the Classifiers
6.4.2 Feature Ranking
6.4.3 Determining the Optimal Number of Features
6.5 The Accuracy of the Model Selection
6.5.1 True, Apparent and Ideal Classification Errors
6.5.2 An Effect of the Number of Variants
6.5.3 Evaluation of the Bias
6.6 Additional Bibliographical Remarks
Appendices
A.1 Elements of Matrix Algebra
A.2 The First Order Tree Type Dependence Model
A.3 Temporal Dependence Models
A.4 Pikelis Algorithm for Evaluating Means and Variances of the True,
Apparent and Ideal Errors in Model Selection
A.5 Matlab Codes (the Non-Linear SLP Training, the First Order Tree
Dependence Model, and Data Whitening Transformation)
References
Index
You can get this book from Amazone
bookshef