I worked on

During my studies at Kaunas University of Technology I got  an Engineers degree  with a little knowledge in mathematics. Therefore, while working on  my  Ph D thesis in a field of statistical pattern recognition I had to study mathematics, probability and statistical methods self dependently. Without knowing about complex mathematical techniques to find a probability that a value of a linear discriminant function of a sample based Euclidean distance classifier is negative I approximated an unknown distribution of the discriminant function by a Gaussian law. In order to ground this postulation I had to suppose that the number of features p (components of a feature vector) is very large. This assumption challenged another one - a  learning set size n should increase too. Thus, my ignorance has lead to a simple asymptotic method to analyze  distributions of the complex multivariate statistics: we analyze a case when n, the learning set size, and p, a number of dimensions, are increasing simultaneously.

This approach has lead to simple asymptotic generalization error formulae for the Euclidean distance classifier (Raudys, 1967) as well as for a number of other more complex classification rules: five linear and quadratic classifiers differing in assumption about a structure of the covariance matrix (Raudys 1972), linear regularized discriminant analysis (Raudys and Skurichina, C1994, Raudys et. al, 1995), a standard  Fisher linear classifier with a pseudoinversion of the covariance matrix (Raudys and  Duin, 1998) and a "primitive", regularized, standard least squares with a coventional or the pseudo inversion of the covariance matrix regressions (Raudys, 2000a). Now this asymptotic analysis technique is called a "thermodinamic limit" approach and is widely used to investigate the generalization error of the artificial neural networks. In comparison with a conventional asymptotic analysis where only the sample size n approaches infinity (see e.g.  Fukunaga, 1990) this approach allows to obtain rather exact estimates when the sample size n is comparable with number of features p (the dimensionality).
At the same time A. Kolmogorov, Yu. Blagovesgenskij and his PhD student   A. Deev were working on the same type of the asymptotic analysis. In early 1970 A. Kolmogorov learned about my asymptotic approach and asked me to come to Moscow and to present a paper in his seminar. In February 1970 I came to Moscow and presented a talk about the new double asymptotic approach and about the generalization error expressions derived for five distinct statistical classifiers differing in their complexity. Some of the results for audience were difficult to believe.

Firstly, from my analysis it followed that asymptotically when n and p tend to infinity the generalization error of the linear statistical classifier, based both on sample mean vectors and variances common for both pattern classes, coincides with the expression for  more simple Euclidean distance classifier where only sample means are used to determine the classification rule. It means that estimation of variances common for both classes asymptotically does not affect the generalization error.
Second, it was shown that in the asymptotic discussed, the generalization error of the sample  based  classical quadratic discriminant function  depends on a ration p2/n. For other four classification rules this ratio is linear - p/n. A.Kolmogorov asked A.Deev to verify my analytical derivations, and after the verification, a month later, he acknowledged these results. This recognition opened for me all doors in Moscow statistical community. My results about five classifiers were published two years later (Raudys, 1972 ), and extended and generalized in a number of research papers of Moscow statisticians (Deev, 1974), Meshalkin, 1976, Meshalkin and Serdobolskij, 1978).
Many years later while working in the Artificial neural networks theory I understood a fundamental significance of the Meshalkin and Serdobolskij result. While generalizing my 1972 result they proved a theorem - estimation of r parameters of the multivariate distribution density functions common for all pattern classes asymptotically (when both the dimensionality p and the sample size n are increasing) do not influence the generalization error if r is proportional to n. In multi-layer perceptrons and radial basic function neural networks, the weights of the hidden layer typically are connected with all outputs. It means they are common for all pattern classes. According to the result of Meshalkin and Serdobolskij the hidden layer weights have less influence while determining the generalization error. It means the number of hidden layer weights have less influence while determining a complexity of the neural net (see Raudys, C1994b, Raudys, B2001). This conclusion is supported by a number of experimental results of many authors but contradicts with a great many theoretical results obtained in a connectionist community. Here it is claimed that the complexity of the network is proportional to a number of network's weights. This disagreement indicates that, the question about the networks complexity deserves more attentive analysis.

Apart the asymptotic method proposed, and a simple asymptotic formula for the generalization error of the Euclidean distance classifier, my first research paper published in Novosibirsk (Raudys, 1967) analyzed a case when the statistical classifier is constructed according to one model of the multivariate density function,  however, it is applied to classify a data described by another statistical model. An exact and the asymptotic formulae (when both, n and p, are increasing) for the generalization error of the Euclidean distance classifier show that here we can obtain, both very small and very large generalization errors.  A term a "real", "an effective" dimensionality of the data was introduced and it was shown that in a very favorite case, a real dimensionality p*=1. It means that the classifier can be trained with very short training sequences. This conclusion is very important for the ANN analysis: years later (Raudys, 1998a) it was shown that conditions exist where the single layer perceptron after the first training iteration in a batch mode results the Euclidean distance classifier. It means that, regardless the formal dimensionality of the data,  in principle, the single layer perceptron also can be trained with very short learning sequences.

After 1967 exact or almost exact (non asymptotic) equations for the generalization error of the Euclidean distance classifier (Raudys, 1969), a standard Fisher linear  discriminant function (V.Pikelis - published in Raudys and Pikelis, 1975), the linear statistical classifier based on sample mean vectors and variances common for both pattern classes (V.Pikelis, 1973), the quadratic discriminant function (Raudys, 1976b) were obtained. Together with a student of my group Vytautas Grabauskas we have shown that in a case of unequal number of training vectors from both pattern classes in the learning set, the sample based quadratic discriminant function is biased. Then a paradox can be observed - while increasing the number of the learning observations only from one pattern class, we can obtain an increase in the generalization error. A term to reduce a bias of the quadratic discriminant function was proposed (see e.g, Raudys, 1984a,  Raudys and Jain, 1991, Raudys, B2001). A most part  of our earlier results concerning the performance, complexity and learning set size relationships are reviewed in two research papers Raudys and Pikelis, 1975, 1980. For a more recent review see Raudys and Jain, 1991, Raudys and Young, S2001, Raudys B2001.

Beginning with 1968 for a number years, a scientific discussion about parametric and nonparametric classification rules was carried out in former Soviet Union. Vladimir Vapnik from the Institute of Control Theory in Moscow advocated that it is not wise to make assumptions about the type and parameters of the multivariate distribution densities, to estimate these parameters from the training set data and only then to construct the linear classifier. Instead of constructing the above parametric classifier he recommended that much better way is to make assumptions about a structure of the classification rule (for example a linear discriminant function in a space of original or transformed features) and then - to estimate unknown coefficients (weights) of the discriminant function directly from the training set data. For this he suggested to minimize an empirical error (empirical risk). In a case of a zero empirical error, he suggested to maximize a distance (a margin) between a discriminant hyperplane and the closest to it learning set vector (a method of a "generalized portrait"). Me, and some other Soviet researchers advocated that making assumptions about the type of the distribution density function is a some sort of introducing a prior information into the classifier's design process. In a case, when this additional (prior) information is more or less correct, it can reduce the classification error (Raudys, 1972). In 1969 while reviewing the both approaches I compared the generalization error estimates of V.Vapnik's minimum empirical risk based nonparametric classification rule, the Euclidean distance and the linear Fisher  parametric classifiers and  demonstrated that the gain of utilizing the prior assumptions can be dramatically large (Raudys, C1970).
In this dispute paper
(Raudys, C1970) a "scissors effect" was demonstrated firstly: it was also shown that in a small training set case, it is preferable to utilize simple classification rule (the Euclidean distance classifier which is based on sample estimates of the mean vectors of the pattern classes) and in a large training set case, one needs to use the more complex classification rule (the linear Fisher classifier which is based on sample estimates of the mean vectors and a pooled covariance matrix of the pattern classes) can perform better. In my paper (C1970) I noticed that the Vapnik's generalization error's estimates are obtained for the most unfavorable type of distributions of the pattern classes and result in an extremely pessimistic upper bounds. For many years I was very keen to obtain non-pessimistic estimates and succeeded to do this only recently (Raudys, 1993, 1997, Raudys&Diciunas, C1996, 2000). In these papers, the nonparametric zero empirical error classifier used to classify two spherical Gaussian pattern classes is analyzed. Now this result is generalized for a non-zero empirical error case (V.Diciunas, 20001, unpublished), an exact asymptotic expansion is obtained (Basalykas, Diciunas and Raudys, 1996&997). It was shown that in some cases, the nonparametric linear classifier can be trained with shorter training sequences than the standard linear Fisher discriminant function.

Apart from parametric classification rules I analyzed nonparametric Parzen window (PW) density estimate based classifiers (Raudys, 1976c, C1977, 1978a, 1991c, 1998c ). To define this classifier we need to fix a kernel's type and a value of a smoothing parameter h. For the spherical Gaussian kernel it was shown that the generalization error depends on p*/n, where p* is an intrinsic dimensionality of the data. It was concluded that nmin, a minimal size of the training set necessary to train this classifier sufficiantely well, should be proportional  to ap*, where a constant a depends on requirements for training accuracy, an asymptotic classification error, and the smoothing parameter h. A fast procedure to estimate an optimal value of the smoothing parameter h was suggested where we use a leaving-one-out method to estimate the generalization error simultaneously for a set of the smoothing parameter's values (h1, h2,  …. hR ). In 1990 a Ph D student from our group Marina Skurichina analyzed a number of real and artificial data sets and showed that the kernel width, not it's type, is the most important factor while constructing the Parzen window classifier with the smallest generalization error. Only in a couple of cases, the PW classifier with a rectangular window lost against the smooth bell shaped kernels (Skurichina, 1990).

While classifying objects characterized by the categorical (discrete) variables, or while constructing a decision rule for making a unifying decision  according to decisions of  single modules in a cooperative neural network design (a classifiers' fusion rule), we can use a multinomial classifier. Its decisions are based on estimation of probabilities that a certain combination of discrete input variables is observed. Typically, the number of possible combinations is immense. Therefore, often it is thought that in order to construct  such classifier we need a tremendous number of training examples. Theoretical equations for the generalization error as well as tables calculated show that for favorable distributions of the pattern classes, as well as in many practical situations, the training set size can be moderate (Griskevicius and Raudys, 1979, Raudys,1984a, S. Putinaite,1990, Raudys B2001).

One more important result obtained in our research group, is a generalization error of the piecewise linear classifier. K. Juskevicius (1983) analyzed a version of the classifier where each pattern class is split into subclasses by means of  cluster analysis techniques, unknown mean vectors of each subclass are estimated,  and an allocation of a unknown vector is performed according to a class membership of its closest mean vector (it is very similar to a classifier, at the present time called a learning vector quantifier neural network). He demonstrated that in order to calculate the generalization error of such classifier one can use the simple equation for the Euclidean distance classifier with a new modified estimate of the learning set's size -  n*=n/k, where k is a number of subclasses.

One possibly useful for other researchers result is a method to determine an optimal number of features in a finite training-set case. It is supposed the features are ranked apriori according their usefulness. We suggest to use a leaving-one-out or a cross validation error  estimate of the generalization error P for each number of features (say pj), to smooth an empirical graph P=f(pj), and according to a minimum of the smoothed graph to determine the optimal dimensionality (Raudys, 1976c, 1979).

One of my "hobbies" is a theoretical as well as experimental comparisons of classification rules, classification error estimation, feature extraction methods, e.t.c. A review and theoretical comparison of over 150 of classification rules can be found in my unpublished book (Raudys, 1984a) and in a little bit a narrower extent - in Raudys, 1975, B2001, a comparison of the error estimation techniques - in (Raudys,1973ab, C1978, C1988, B2001), Raudys and Vaitukaitis, 1984. An experimental comparison of 13 statistical classification algorithms on 4 real world data sets is published in (Raudys, Pikelis, Juskevicius, 1975). There, 50 or even 100 times we selected small random learning sets from all the data available (our "general population"), then we trained 13 classifiers and tested them on our "general population". We concluded that one from the two classical statistical classifiers - the standard linear discriminant function or the PW classifier - most often appears to be the best one (or is very close to the best one) in a pool of the statistical classifiers. Later we used these results in order to be convinced that our generalization error theoretical estimates derived for the multivariate Gaussian case can be used to characterize the small learning set properties of the statistical classifiers applied to the real world data. Now, after almost 30 years of experience, I should like to include into this set of  "very good classifiers" an optimally stopped single layer perceptron.

Another, also possibly useful, result which did not attract attention of other researchers concerns an accuracy of the feature (model) selection (Raudys, 1979a). While comparing different subsets of features or different classification or prediction rules, we have to chose the best variant from a number of competitive ones. Let unbiased, however, approximate estimates of performance of each single variant are used (for example, an independant validation set). Then a real performance of apparently the best variant becomes optimistically biased. The true generalization error of the best variant can be much higher that the apparent error - a minimal value of the estimates. A method and a mathematical model for a joint distribution of a set of estimates and their true errors is proposed in order to calculate a bias between the expectations of the true and apparent errors in the model selection task (Raudys, 1981, C1987, C1992, Raudys and Pikelis, 1982b, C1982c, Raudys, B2001).