|
||||||||||||
|
|
During my studies at Kaunas University of Technology I got an Engineers degree with a little knowledge in mathematics. Therefore, while working on my Ph D thesis in a field of statistical pattern recognition I had to study mathematics, probability and statistical methods self dependently. Without knowing about complex mathematical techniques to find a probability that a value of a linear discriminant function of a sample based Euclidean distance classifier is negative I approximated an unknown distribution of the discriminant function by a Gaussian law. In order to ground this postulation I had to suppose that the number of features p (components of a feature vector) is very large. This assumption challenged another one - a learning set size n should increase too. Thus, my ignorance has lead to a simple asymptotic method to analyze distributions of the complex multivariate statistics: we analyze a case when n, the learning set size, and p, a number of dimensions, are increasing simultaneously. This approach
has lead to simple asymptotic generalization error formulae for the Euclidean
distance classifier (Raudys,
1967) as well
as for a number of other more complex classification rules: five linear
and quadratic classifiers differing in assumption about a structure of
the covariance matrix (Raudys 1972), linear regularized discriminant analysis
(Raudys and Skurichina, C1994, Raudys et. al, 1995), a standard
Fisher linear classifier with a pseudoinversion of the covariance matrix
(Raudys and Duin, 1998) and a "primitive", regularized, standard
least squares with a coventional or the pseudo inversion of the covariance
matrix regressions (Raudys, 2000a). Now this asymptotic analysis technique
is called a "thermodinamic limit" approach and is widely used to investigate
the generalization error of the artificial neural networks. In comparison
with a conventional asymptotic analysis where only the sample size n
approaches infinity (see e.g. Fukunaga, 1990) this approach allows
to obtain rather exact estimates when the sample size n is comparable
with number of features p (the dimensionality). Apart the asymptotic method proposed, and a simple asymptotic formula for the generalization error of the Euclidean distance classifier, my first research paper published in Novosibirsk (Raudys, 1967) analyzed a case when the statistical classifier is constructed according to one model of the multivariate density function, however, it is applied to classify a data described by another statistical model. An exact and the asymptotic formulae (when both, n and p, are increasing) for the generalization error of the Euclidean distance classifier show that here we can obtain, both very small and very large generalization errors. A term a "real", "an effective" dimensionality of the data was introduced and it was shown that in a very favorite case, a real dimensionality p*=1. It means that the classifier can be trained with very short training sequences. This conclusion is very important for the ANN analysis: years later (Raudys, 1998a) it was shown that conditions exist where the single layer perceptron after the first training iteration in a batch mode results the Euclidean distance classifier. It means that, regardless the formal dimensionality of the data, in principle, the single layer perceptron also can be trained with very short learning sequences. After 1967 exact or almost exact (non asymptotic) equations for the generalization error of the Euclidean distance classifier (Raudys, 1969), a standard Fisher linear discriminant function (V.Pikelis - published in Raudys and Pikelis, 1975), the linear statistical classifier based on sample mean vectors and variances common for both pattern classes (V.Pikelis, 1973), the quadratic discriminant function (Raudys, 1976b) were obtained. Together with a student of my group Vytautas Grabauskas we have shown that in a case of unequal number of training vectors from both pattern classes in the learning set, the sample based quadratic discriminant function is biased. Then a paradox can be observed - while increasing the number of the learning observations only from one pattern class, we can obtain an increase in the generalization error. A term to reduce a bias of the quadratic discriminant function was proposed (see e.g, Raudys, 1984a, Raudys and Jain, 1991, Raudys, B2001). A most part of our earlier results concerning the performance, complexity and learning set size relationships are reviewed in two research papers Raudys and Pikelis, 1975, 1980. For a more recent review see Raudys and Jain, 1991, Raudys and Young, S2001, Raudys B2001. Beginning
with 1968 for a number years, a scientific discussion about parametric
and nonparametric classification rules was carried out in former Soviet
Union. Vladimir Vapnik from the Institute of Control Theory in Moscow
advocated that it is not wise to make assumptions about the type and parameters
of the multivariate distribution densities, to estimate these parameters from the training set
data and only then to construct the linear classifier. Instead of constructing
the above parametric classifier he recommended that much better way is
to make assumptions about a structure of the classification rule (for
example a linear discriminant function in a space of original or transformed
features) and then - to estimate unknown coefficients (weights) of the
discriminant function directly from the training set data. For this he
suggested to minimize an empirical error (empirical risk). In a case of
a zero empirical error, he suggested to maximize a distance (a margin)
between a discriminant hyperplane and the closest to it learning set vector
(a method of a "generalized portrait"). Me, and some other Soviet researchers
advocated that making assumptions about the type of the distribution density
function is a some sort of introducing a prior information into
the classifier's design process. In a case, when this additional (prior)
information is more or less correct, it can reduce the classification
error (Raudys, 1972). In 1969 while reviewing the both approaches I compared
the generalization error estimates of V.Vapnik's minimum empirical risk
based nonparametric classification rule, the Euclidean distance and the
linear Fisher parametric classifiers and demonstrated that
the gain of utilizing the prior assumptions can be dramatically large
(Raudys, C1970). Apart from parametric classification rules I analyzed nonparametric Parzen window (PW) density estimate based classifiers (Raudys, 1976c, C1977, 1978a, 1991c, 1998c ). To define this classifier we need to fix a kernel's type and a value of a smoothing parameter h. For the spherical Gaussian kernel it was shown that the generalization error depends on p*/n, where p* is an intrinsic dimensionality of the data. It was concluded that nmin, a minimal size of the training set necessary to train this classifier sufficiantely well, should be proportional to ap*, where a constant a depends on requirements for training accuracy, an asymptotic classification error, and the smoothing parameter h. A fast procedure to estimate an optimal value of the smoothing parameter h was suggested where we use a leaving-one-out method to estimate the generalization error simultaneously for a set of the smoothing parameter's values (h1, h2, …. hR ). In 1990 a Ph D student from our group Marina Skurichina analyzed a number of real and artificial data sets and showed that the kernel width, not it's type, is the most important factor while constructing the Parzen window classifier with the smallest generalization error. Only in a couple of cases, the PW classifier with a rectangular window lost against the smooth bell shaped kernels (Skurichina, 1990). While classifying objects characterized by the categorical (discrete) variables, or while constructing a decision rule for making a unifying decision according to decisions of single modules in a cooperative neural network design (a classifiers' fusion rule), we can use a multinomial classifier. Its decisions are based on estimation of probabilities that a certain combination of discrete input variables is observed. Typically, the number of possible combinations is immense. Therefore, often it is thought that in order to construct such classifier we need a tremendous number of training examples. Theoretical equations for the generalization error as well as tables calculated show that for favorable distributions of the pattern classes, as well as in many practical situations, the training set size can be moderate (Griskevicius and Raudys, 1979, Raudys,1984a, S. Putinaite,1990, Raudys B2001). One more important result obtained in our research group, is a generalization error of the piecewise linear classifier. K. Juskevicius (1983) analyzed a version of the classifier where each pattern class is split into subclasses by means of cluster analysis techniques, unknown mean vectors of each subclass are estimated, and an allocation of a unknown vector is performed according to a class membership of its closest mean vector (it is very similar to a classifier, at the present time called a learning vector quantifier neural network). He demonstrated that in order to calculate the generalization error of such classifier one can use the simple equation for the Euclidean distance classifier with a new modified estimate of the learning set's size - n*=n/k, where k is a number of subclasses. One possibly useful for other researchers result is a method to determine an optimal number of features in a finite training-set case. It is supposed the features are ranked apriori according their usefulness. We suggest to use a leaving-one-out or a cross validation error estimate of the generalization error P for each number of features (say pj), to smooth an empirical graph P=f(pj), and according to a minimum of the smoothed graph to determine the optimal dimensionality (Raudys, 1976c, 1979). One of my "hobbies" is a theoretical as well as experimental comparisons of classification rules, classification error estimation, feature extraction methods, e.t.c. A review and theoretical comparison of over 150 of classification rules can be found in my unpublished book (Raudys, 1984a) and in a little bit a narrower extent - in Raudys, 1975, B2001, a comparison of the error estimation techniques - in (Raudys,1973ab, C1978, C1988, B2001), Raudys and Vaitukaitis, 1984. An experimental comparison of 13 statistical classification algorithms on 4 real world data sets is published in (Raudys, Pikelis, Juskevicius, 1975). There, 50 or even 100 times we selected small random learning sets from all the data available (our "general population"), then we trained 13 classifiers and tested them on our "general population". We concluded that one from the two classical statistical classifiers - the standard linear discriminant function or the PW classifier - most often appears to be the best one (or is very close to the best one) in a pool of the statistical classifiers. Later we used these results in order to be convinced that our generalization error theoretical estimates derived for the multivariate Gaussian case can be used to characterize the small learning set properties of the statistical classifiers applied to the real world data. Now, after almost 30 years of experience, I should like to include into this set of "very good classifiers" an optimally stopped single layer perceptron. Another, also possibly useful, result which did not attract attention of other researchers concerns an accuracy of the feature (model) selection (Raudys, 1979a). While comparing different subsets of features or different classification or prediction rules, we have to chose the best variant from a number of competitive ones. Let unbiased, however, approximate estimates of performance of each single variant are used (for example, an independant validation set). Then a real performance of apparently the best variant becomes optimistically biased. The true generalization error of the best variant can be much higher that the apparent error - a minimal value of the estimates. A method and a mathematical model for a joint distribution of a set of estimates and their true errors is proposed in order to calculate a bias between the expectations of the true and apparent errors in the model selection task (Raudys, 1981, C1987, C1992, Raudys and Pikelis, 1982b, C1982c, Raudys, B2001).
|