Statistical and Neural Classifiers

Statistical and Neural Classifiers:
An integrated approach to design.

Springer, London.

 

 

Preface

 

- Les hommes de chez toi, dit le petit prince,
cultivent cinq mille roses dans un même jardin...
et ils n'y trouvent pas ce qu'ils cherchent...
- Ils ne le trouvent pas, répondis-je...
Et cependant ce qu'ils cherchent pourrait être
trouvé dans une seule rose ou un peu d'eau...
- Bien sûr, répondis-je.
Et le petit prince ajouta:
- Mais les yeux sont aveugles.
Il faut chercher avec le coeur.

Antoine de Saint-Exupery "Le Petit Prince",
Chapitre XXV




In his book Antoine de Saint-Exupery wrote: "... In your country, people plant five thousand roses in the same garden ... and they do not find what they are searching for. Meanwhile, they could find everything they are seeking for in a single rose, or in a drop of water... However, eyes are blind. You have to seek by your heart." I am fond of Antoine de Saint Exupery. I like his books and I believe in these words.
When, after 25 years of research work in multivariate statistical analysis and statistical pattern recognition, I became interested in Artificial Neural Networks (ANN), I remembered de Saint-Exupery's words about the single rose. Instead of investigating multilayer perceptrons with thousands of neurones, I at first began to use statistical methods to analyse a single neurone ( a single layer perceptron (SLP) and plain back-propagation (BP) training algorithm. The SLP and BP training are simplified mathematical models of complex information processing phenomena that take place in nature. I discovered that a single neurone can explain much about complex brain-learning behaviour.
After several years of research work, I learned that during the training phase, the single layer perceptron's weights are increasing and, therefore, the statistical properties of the cost function that is minimised during the training process are also changing. In its dynamical evolution, the SLP classifier can actually become one of several statistical classifiers that differ in their complexity.
At first the SLP classifier behaves as a simple Euclidean distance classifier. In this situation each pattern class is characterised by a sample mean vector. In further training, the SLP classifier begins to evaluate correlations and variances of features and almost becomes the standard linear Fisher classifier. Further, the SLP begins to behave as a robust classifier that ignores atypical training patterns. Later on in the training phase the SLP behaves as a classifier that minimizes the number of incorrectly classified training patterns. If there are no training errors, the SLP maximizes the margin between the decision boundary and the closest training-pattern vectors. Thus, the decision boundary is halfway between the training patterns and behaves as a support vector classifier where a fixed number of training patterns determine a position of the decision boundary. Thus, the SLP classifier begins as the simplest possible statistical classifier designed under the assumption of Gaussian pattern classes and ends as a complex classifier that can perform well with non-Gaussian pattern classes.
One more interesting peculiarity of the SLP classifier is that the performance (the generalization error) of the perceptron depends on the initial conditions. If the starting perceptron weight vector is almost optimal, the SLP classifier initially contains much useful information. This information can be utilized to determine the final weight vector. To preserve and use the information in the initial weights, one must not overtrain the SLP. This is a very valuable property of adaptively trained neural networks.
The more I work in the ANN discipline, the more I marvel at the remarkable qualities of neural networks. Specifically, I am amazed that the SLP classifier dynamics progress from the simplest algorithms to the most complex algorithms in a natural manner. Statisticians and engineers have long understood that in designing decision-making algorithms from experimental data one needs to progress from simple algorithms to complex algorithms. The artificial neurone accomplishes this complexity progression in a natural manner. Statisticians required several decades to develop a number of statistical classification and regression rules: Fisher (1936) proposed his linear discriminant function more than six decades ago and Vapnik devised his support vector machine (Vapnik, 1995) only recently. The neurone, however, implements this algorithm design progression in a logical sequence. One can think of this progression as nature's method.
There is a plethora of useful information currently available in the field of statistical pattern recognition. The main element of statistical pattern recognition is the assumption that the pattern vectors are random vectors that can be modelled as observations from multivariate distributions. In the parametric approach, the classifier designer assumes he knows this multivariate distribution precisely. In order to determine the classification algorithm, one needs to estimate the unknown multivariate density function parameters using the training data. Researchers have found that the more complex the multivariate density model assumed, or equivalently, the greater the number of parameters to be estimated, the greater the number of training vectors that must be employed to adequately determine the classifier. Therefore, a large number of parsimonious multivariate distribution densities have been formulated for this purpose.
One of the most interesting and important facts utilized in parametric classification is that if some of the pattern-class densities' parameters are in common, these parameters have a negligible influence on the increase in the generalization error. This is a very favourable property of the statistical parametric classifier approach. On the other hand, incorrect assumptions about the type of the probabilistic distribution assumed for the pattern vectors lead to an increase in the classification error. This is one of the main shortcomings of the parametric statistical classifier approach.
Being interested in both statistical pattern recognition and artificial neural network theory, I perceived a certain conflict between these two classification paradigms, and sometimes even a dissonance among proponents of these two classification methods. Statisticians generally have good mathematical backgrounds with which to analyse decision-making algorithms theoretically. They have proven many rigorous results concerning the optimality of statistical classification algorithms. However, they often pay little or no attention to the applicability of their own theoretical results and generally do not heed practical or even theoretical results obtained by ANN researchers.
ANN scientists advocate that one should make no assumptions concerning the multivariate densities assumed for the pattern classes. They, instead, propose that one should assume only the structure of the decision-making rules, for example a linear discriminant function in the space of original or transformed features, and then estimate the unknown rule coefficients (weights) directly from the training data. For this they suggest one minimize the number of errors incurred while classifying the training vectors (empirical error). Many such algorithms have been suggested to solve practical problems. Some of these algorithms have a theoretical justification and some have no theoretical elucidation yet.
Known properties and deficiencies of both statistical and neural classification algorithms hints that one should integrate the two classifier design strategies and utilize their good qualities. There are three key aspects of this integration. The first key is the fact that the correct initial weights of the perceptron contain information that can be saved for use in future training processes. Thus, we can utilize pattern classifiers based on statistical methods to define the initial perceptron weight vector. The second key is the fact that, during training, the perceptron changes its statistical properties and evolves from simple classification algorithms to more complex classification algorithms. The third key is the fact that, one can use the diversity of statistical methods and the multivariate models to perform different whitening data transformations, where the input variables are decorrelated and scaled in order to have the same variances. Then while training the perceptron in the transformed feature space, we can obtain the Euclidean distance classifier after the very first iteration. In the original feature space, the weight vector of this classifier is equivalent to the decision making rule found by utilizing the statistical methods and the multivariate models just mentioned. Further training can diminish the negative aspects of approximately correct or incorrect statistical assumptions.
Thus, it is possible to merge the statistical and neural approaches. Specifically, instead of using statistical methods and the multivariate models directly to design the classifier, we can use them to whiten the data. We can then train the perceptron paying special attention to the optimal stopping time. The data whitening reduces the generalisation error and simultaneously speeds up the training process. This approach merges the qualities of both statistical and neural classification algorithm design strategies. Investigations of the visual cortex in biological systems, however, have shown that the input decorrelation technique is already realized in natural systems. This is one more piece of evidence that the data decorrelation and scaling technique performed prior to perceptron training is a natural method of information processing.
An objective of this book is to explain the details necessary to understand and utilise the integrated statistical and neural net approach to design the classification rules. We, therefore, present a discussion of the diversity of linear and non-linear statistical pattern classification algorithms that can be utilised in an advanced neural network analysis. Special attention is paid to the assumptions used to design the algorithm, the generalisation error, and the training-set size relationships. Knowledge of these relationships allows one to analyse and compare the amount of information obtained from the training data, the assumptions, or from the educated guesses utilised to construct the decision-making algorithm. This is perhaps the central question that arises in machine learning and classifier design theory.

Performance, complexity, and training-set size relationships in the nonparametric neural net approach have been discussed in a number of books (Vapnik, 1982, 1995; Wolpert, 1995; Cherkassky and Mulier, 1996; Vidyasagar, 1997, etc.). According to Wolpert, "the statistical and neural net approaches have their own jargon, their own mathematical models, their own concern, and their own results. And, for the most part, they don't interact". This book primarily takes a statistical point of view but does not ignore other approaches. Alternative texts are Raudys (1976), Aivazian et al. (1988), Fukunaga (1990), McLachlan (1992), Devroye, Gyorfi and Lugosi (1996), Duda, Hart, and Stork (2000). The present book, however, is more focused on the integration of statistical and neural approaches to design the classification algorithms. In order to focus on performance, complexity, and design set size relationships more deeply in this book, I employ a simple formulation of the pattern recognition problem. For more general formulations of the pattern recognition problem and related questions I refer interested readers to broader texts such as Fukunaga's book. To make the book accessible to more readers, I adopt Fukunaga's notation.
The book is targeted to graduate students and research workers in data modelling, pattern recognition, and artificial neural networks. No special background beyond a good working knowledge of probability and statistics, elements of linear algebra, and calculus at the undergraduate level is required. However, one will benefit by having a popular pattern recognition or neural networks book (e.g., Fukunaga (1990) or Haykin (1998)) close at hand.
The book is organized somewhat like a reference book. At the same time I pay particular attention to the ideas used to design and analyse statistical classification algorithms that can be useful for understanding artificial neural network classifiers. For analysis of neural networks and statistical algorithms, the most important aspect is assumptions utilised in the algorithm design process. Therefore, in order to have a comprehensive point of view of the problem, I omit a part of the details concerning the estimation of parameters of well known statistical algorithms that can be found in the popular literature. To make understanding the main ideas easier, I provide a number of simple illustrative examples.
In the first chapter, I present the main definitions and terms and review the effects of finite training-set size on classifiers. In the second chapter, I review principles of the statistical decision theory, review important statistical multivariate data models, and give a taxonomy of pattern classification algorithms that can be obtained or improved while training ANN classification systems. In the third chapter, I present known results concerning the performance and generalisation error relationships for a number of parametric and nonparametric classification algorithms. In the fourth chapter, I consider training peculiarities and the generalisation and complexity of neural classifiers. In the fifth chapter, I explain the integration of the statistical and neural classification approaches. In the sixth and final chapter, I consider the topic of model selection, paying special attention to the accuracy of solutions to this important topic.


Contents

Abbreviations and Notations

1. Quick Overview
1.1 The Classifier Design Problem
1.2 Single Layer and Multilayer Perceptrons
1.3 The SLP as the Euclidean Distance and the Fisher Linear Classifiers
1.4 The Generalisation Error of the EDC and the Fisher DF
1.5 Optimal Complexity - The Scissors Effect
1.6 Overtraining in Neural Networks
1.7 Bibliographical and Historical Remarks

2. Taxonomy of Pattern Classification Algorithms
2.1 Principles of Statistical Decision Theory
2.2 Four Parametric Statistical Classifiers
2.2.1 The Quadratic Discriminant Function
2.2.2 The Standard Fisher Linear Discriminant Function
2.2.3 The Euclidean Distance Classifier
2.2.4 The Anderson-Bahadur Linear DF
2.3 Structures of the Covariance Matrices
2.3.1 A Set of Standard Assumptions
2.3.2 Block Diagonal Matrices
2.3.3 The Tree Type Dependence Models
2.3.4 Temporal Dependence Models
2.4 The Bayes Predictive Approach to Design Optimal Classification Rules
2.4.1 A General Theory
2.4.2 Learning the Mean Vector
2.4.3 Learning the Mean Vector and CM
2.4.4 Qualities and Shortcomings
2.5. Modifications of the Standard Linear and Quadratic DF
2.5.1 A Pseudo-Inversion of the Covariance Matrix
2.5.2 Regularised Discriminant Analysis (RDA)
2.5.3 Scaled Rotation Regularisation
2.5.4 Non-Gausian Densities
2.5.5 Robust Discriminant Analysis
2.6 Nonparametric Local Statistical Classifiers
2.6.1 Methods Based on Mixtures of Densities
2.6.2 Piecewise-Linear Classifiers
2.6.3 The Parzen Window Classifier
2.6.4 The k-NN Rule and a Calculation Speed
2.6.5 Polynomial and Potential Function Classifiers
2.7 Minimum Empirical Error and Maximal Margin Linear Classifiers
2.7.1 The Minimum Empirical Error Classifier
2.7.2 The Maximal Margin Classifier
2.7.3 The Support Vector Machine
2.8 Piecewise-Linear Classifiers
2.8.1 Multimodal Density Based Classifiers
2.8.2 Architectural Approach to Design of the Classifiers
2.8.3 Decision Tree Classifiers
2.9 Classifiers for Categorical Data
2.9.1 Multinomial Classifiers
2.9.2 Estimation of Parameters
2.9.3 Decision Tree and the Multinomial Classifiers
2.9.4 Linear Classifiers
2.9.5 Nonparametric Local Classifiers
2.10 Bibliographical and Historical Remarks

3. Performance and the Generalisation Error
3.1 Bayes, Conditional, Expected, and Asymptotic Probabilities of Misclassification
3.1.1 The Bayes Probability of Misclassification
3.1.2 The Conditional Probability of Misclassification
3.1.3 The Expected Probability of Misclassification
3.1.4 The Asymptotic Probability of Misclassification
3.1.5 Learning Curves: An Overview of Different Analysis Methods
3.1.6 Error Estimation
3.2 Generalisation Error of the Euclidean Distance Classifier
3.2.1 The Classification Algorithm
3.2.2 Double Asymptotics in the Error Analysis
3.2.3 The Spherical Gaussian Case
3.2.3.1 The Case N2 = N1
3.2.3.2 The Case N2 >>
N1
3.3 Most Favourable and Least Favourable Distributions of the Data
3.3.1 The Non-Spherical Gaussian Case
3.3.2 The Most Favourable Distributions of the Data
3.3.3 The Least Favourable Distributions of the Data
3.3.4 Intrinsic Dimensionality
3.4 Generalisation Errors for Modifications of the Standard Linear Classifier
3.4.1 The Standard Fisher Linear DF
3.4.2 The Double Asymptotics for the Expected Error
3.4.3 The Conditional Probability of Misclassification
3.4.4 A Standard Deviation of the Conditional Error
3.4.5 Favourable and Unfavourable Distributions
3.4.6 Theory and Real-World Problems
3.4.7 The Linear Classifier D for the Diagonal CM
3.4.8 The Pseudo-Fisher Classifier
3.4.9 The Regularised Discriminant Analysis
3.5 Common Parameters in Different Competing Pattern Classes
3.5.1 The Generalisation Error of the Quadratic DF
3.5.2 The Effect of Common Parameters in Two Competing Classes
3.5.3 Unequal Sample Sizes in Plug-In Classifiers
3.6 Minimum Empirical Error and Maximal Margin Classifiers
3.6.1 Favourable Distributions of the Pattern Classes
3.6.2 VC Bounds for the Conditional Generalisation Error
3.6.3 Unfavourable Distributions for the Euclidean Distance and Minimum Empirical Error Classifiers
3.6.4 Generalisation Error in the Spherical Gaussian Case
3.6.5 Intrinsic Dimensionality
3.6.6 The Influence of the Margin
3.6.7 Characteristics of the Learning Curves
3.7 Parzen Window Classifier
3.7.1 The Decision Boundary of the PW Classifier with Spherical Kernels
3.7.2 The Generalisation Error
3.7.3 Intrinsic Dimensionality
3.7.4 Optimal Value of the Smoothing Parameter
3.7.5 The k-NN Rule
3.8 Multinomial Classifier
3.9 Bibliographical and Historical Remarks

4. Neural Network Classifiers
4.1 Training Dynamics of the Single Layer Perceptron
4.1.1 The SLP and its Training Rule
4.1.2 The SLP as Statistical Classifier
4.1.2.1 The Euclidean Distance Classifier
4.1.2.2 The Regularised Discriminant Analysis
4.1.2.3 The Standard Linear Fisher Classifier
4.1.2.4 The Pseudo-Fisher Classifier
4.1.2.5 Dynamics of the Magnitudes of the Weights
4.1.2.6 The Robust Discriminant Analysis
4.1.2.7 The Minimum Empirical Error Classifier
4.1.2.8 The Maximum Margin (Support Vector) Classifier
4.1.3 Training Dynamics and Generalisation
4.2 Non-linear Decision Boundaries
4.2.1 The SLP in Transformed Feature Space
4.2.2 The MLP Classifier
4.2.3 Radial Basis-Function Networks
4.2.4 Learning Vector Quantisation Networks
4.3 Training Peculiarities of the Perceptrons
4.3.1 Cost Function Surfaces of the SLP Classifier
4.3.2 Cost Function Surfaces of the MLP Classifier
4.3.3 The Gradient Minimisation of the Cost Function
4.4 Generalisation of the Perceptrons
4.4.1 Single Layer Perceptron
4.4.1.1 Theoretical Background
4.4.1.2 The Experiment Design
4.4.1.3 The SLP and Parametric Classifiers
4.4.1.4 The SLP and Structural (Nonparametric) Classifiers
4.4.2 Multilayer Perceptron
4.4.2.1 Weights of the Hidden Layer Neurones are Common for all Outputs
4.4.2.2 Intrinsic Dimensionality Problems
4.4.2.3 An Effective Capacity of the Network
4.5 Overtraining and Initialisation
4.5.1 Overtraining
4.5.2 Effect of Initial Values
4.6 Tools to Control Complexity
4.6.1 The Number of Iterations
4.6.2 The Weight Decay Term
4.6.3 The Antiregularisation Technique
4.6.4 Noise Injection
4.6.4.1 Noise Injection into Inputs
4.6.4.2 Noise Injection into the Weights and into the Outputs of the Network
4.6.4.3 "Coloured" Noise Injection into Inputs
4.6.5 Control of Target Values
4.6.6 The Learning Step
4.6.7 Optimal Values of the Training Parameters
4.6.8 Learning Step in the Hidden Layer of MLP
4.6.9 Sigmoid Scaling
4.7 The Co-Operation of the Neural Networks
4.7.1 The Boss Decision Rule
4.7.2 Small Sample Problems and Regularisation
4.8 Bibliographical and Historical Remarks

5. Integration of Statistical and Neural Approaches
5.1 Statistical Methods or Neural Nets?
5.2 Positive and Negative Attributes of Statistical Pattern Recognition
5.3 Positive and Negative Attributes of Artificial Neural Networks
5.4 Merging Statistical Classifiers and Neural Networks
5.4.1 Three Key Points in the Solution
5.4.2 Data Transformation or Statistical Classifier?
5.4.3 The Training Speed and Data Whitening Transformation
5.4.4 Dynamics of the Classifier after the Data Whitening Transformation
5.5 Data Transformations for the Integrated Approach
5.5.1 Linear Transformations
5.5.2 Non-linear Transformations
5.5.3 Performance of the Integrated Classifiers in Solving Real-World Problems
5.6 The Statistical Approach in Multilayer Feed-forward Networks
5.7 Concluding and Bibliographical Remarks

6. Model Selection
6.1 Classification Errors and their Estimation Methods
6.1.1 Types of Classification Error
6.1.2 Taxonomy of Error Rate Estimation Methods
6.1.2.1 Methods for Splitting the Design Set into Training and Validation Sets
6.1.2.2 Practical Aspects of using the Leave-One-Out Method
6.1.2.3 Pattern Error Functions
6.2 Simplified Performance Measures
6.2.1 Performance Criteria for Feature Extraction
6.2.1.1 Unsupervised Feature Extraction
6.2.1.2 Supervised Feature Extraction
6.2.2 Performance Criteria for Feature Selection
6.2.3 Feature Selection Strategies
6.3 Accuracy of Performance Estimates
6.3.1 Error Counting Estimates
6.3.1.1 The Hold-Out Method
6.3.1.2 The Resubstitution Estimator
6.3.1.3 The Leaving-One-Out Estimator
6.3.1.4 The Bootstrap Estimator
6.3.2 Parametric Estimators for the Linear Fisher Classifier
6.3.3 Associations Between the Classification Performance Measures
6.4 Feature Ranking and the Optimal Number of Features
6.4.1 The Complexity of the Classifiers
6.4.2 Feature Ranking
6.4.3 Determining the Optimal Number of Features
6.5 The Accuracy of the Model Selection
6.5.1 True, Apparent and Ideal Classification Errors
6.5.2 An Effect of the Number of Variants
6.5.3 Evaluation of the Bias
6.6 Additional Bibliographical Remarks

Appendices
A.1 Elements of Matrix Algebra
A.2 The First Order Tree Type Dependence Model
A.3 Temporal Dependence Models
A.4 Pikelis Algorithm for Evaluating Means and Variances of the True, Apparent and Ideal Errors in Model Selection
A.5 Matlab Codes (the Non-Linear SLP Training, the First Order Tree Dependence Model, and Data Whitening Transformation)

References
Index

You can get this book from Amazone bookshef