I am working on

I have been interested in the theory of artificial neural networks since 1990. At first I was attracted by the local minima problem. We used the back-propagation algorithm to train a multilayer perceptron (MLP) classifier hundreds of times, starting from different initial weights, and found that training ends with different weight vectors, different empirical errors and different generalization errors (Schmidt, Raudys et al, C1993). We tried initializing the MLP classifier with the weights of a piecewise-linear classifier and showed that this approach reduces the generalization error (Raudys and Skurichina, 1992). A student from our group, Aistis Raudys, suggested exploiting the human ability to analyze images on a plane. He proposed an original non-parametric algorithm that maps the training set vectors onto a plane, initializes the perceptron in the space of the two most informative directions, then adds a third direction, trains the perceptron, adds a fourth direction, and so on until the desired complexity of the network is reached. A few years later I returned to the initialization problem, and together with Shun-ichi Amari we showed that, in the case of a single minimum, a good starting position can help to reduce the generalization error (Raudys and Amari, 1998). A necessary condition for preserving the information contained in the initial weight vector is to prevent overtraining: training must be stopped early, before the minimum of the cost function is reached. Later I succeeded in showing that in the high-dimensional case (a very large number of inputs) the overtraining effect should always be observed (Raudys, C2000b, B2001).

My main result in ANN theory, however, is the demonstration that the non-linear single-layer perceptron (SLP) evolves during training: in adaptive gradient training one can obtain seven standard statistical classification rules and six prediction rules of different complexity. It was shown that under certain conditions the first iteration of the gradient minimization algorithm, performed in batch mode, yields the well-known Euclidean distance classifier. For this we need to start training from a zero initial weight vector, to move the mean of the training data to the origin of the coordinate system, and, if the training set contains the same number of vectors from both pattern classes, to use "symmetrical" targets (desired outputs). In further iterations we obtain classical regularized discriminant analysis and move towards the standard linear Fisher classifier. If the number of dimensions exceeds the training set size, we approach the Fisher classifier with the pseudo-inverse of the sample covariance matrix. In still later iterations we obtain a robust classifier that is insensitive to atypical training set vectors distant from the discriminant hyperplane (a robust classification rule). Then, when the weights of the perceptron become large, we move towards the minimum empirical error classifier. When there are no empirical errors, we approach the maximal margin classifier (the generalized portrait, or support vector, classifier). The number of types of classification rules can be increased if we train the perceptron in a space of new features obtained by nonlinear transformations of the original p features.
The progressive movement from the Euclidean distance classifier to the maximum margin classifier explains the well-known overtraining (overfitting) phenomenon: on the way from the simplest algorithm to the most complex one, one of the classifiers turns out to be the best for a finite learning set (Raudys, C1996a). Analysis of the generalization error of the statistical classifiers just enumerated helps to understand this effect more deeply from a theoretical point of view (Raudys, 1998b). Similar considerations are valid for an SLP used to solve a prediction (regression) problem. While training the perceptron one can obtain six known regressions: the "primitive" one, the regularized one, standard least squares with the conventional inverse or the pseudo-inverse of the covariance matrix, a robust regression and, at the very end (if the weights are sufficiently large), a minimax regression (Raudys, S1999a).
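
The claim that the first batch iteration yields the Euclidean distance classifier can be checked numerically. The following is only a minimal sketch under assumptions not stated in the text: a tanh activation, a sum-of-squares cost, targets +1 and -1 as the "symmetrical" targets, and synthetic Gaussian data. The function name `first_batch_step` and all parameter values are hypothetical.

```python
import numpy as np

def first_batch_step(X, t, eta=0.5):
    """One batch gradient step for a tanh SLP with cost mean((t - out)^2) / 2,
    started from a zero weight vector and zero bias (assumed setup)."""
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    out = np.tanh(X @ w + b)              # all zeros at the start
    delta = (t - out) * (1.0 - out ** 2)  # error times tanh derivative
    w = w + eta * (X.T @ delta) / n
    b = b + eta * delta.mean()
    return w, b

rng = np.random.default_rng(0)
n_per_class = 50                          # equal class sizes, as the text requires
X1 = rng.normal(loc=[1.0, 0.5, -0.3], scale=1.0, size=(n_per_class, 3))
X2 = rng.normal(loc=[-1.0, 0.0, 0.4], scale=1.0, size=(n_per_class, 3))
X = np.vstack([X1, X2])
t = np.concatenate([np.ones(n_per_class), -np.ones(n_per_class)])

X = X - X.mean(axis=0)                    # move the data mean to the origin
eta = 0.5
w, b = first_batch_step(X, t, eta)

# Euclidean distance classifier: sign((m1 - m2).x - 0.5 * (m1 - m2).(m1 + m2))
m1, m2 = X[t == 1].mean(axis=0), X[t == -1].mean(axis=0)
edc_w = m1 - m2
edc_b = -0.5 * (m1 - m2) @ (m1 + m2)

print("SLP weights after one step:", w, b)
print("EDC weights (unscaled):    ", edc_w, edc_b)
```

Under these assumptions the SLP weight vector after one step equals `(eta / 2) * (m1 - m2)` and both bias terms vanish, so the two decision boundaries coincide; only the scale of the weights differs.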

To obtain the full range of statistical classifiers and regressions available in SLP training, one needs to know how to control the training process. In addition to the known complexity control techniques, it was shown that the network's desired outputs are of primary importance in determining the type of classification rule (Raudys, 1998a, C2000c, B2001). It was shown theoretically that, in addition to conventional spherical zero-mean noise injection, a "colored" noise determined by the k nearest neighbors distorts the training set in a minimal way and helps to reduce the generalization error (Skurichina, Raudys and Duin, 2000). To obtain the minimum empirical error and maximal margin classifiers, the weights of the network must be large. Here an antiregularization term added to the cost function (Raudys, C1995c), as well as an exponentially increasing learning step (Raudys, 1998a), can be very useful. It was also shown that the learning step used to train the hidden-layer neurons of a multilayer perceptron controls the amount of training-process noise added to the inputs of the output layer, and thus acts as a factor that controls the network's complexity (Raudys, C2000c, B2001).
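
As an illustration of how an antiregularization term can drive the weights to large values, here is a small sketch. The form of the term, `lam * (w.w - c^2)^2`, is an assumption made for this example and is not quoted from the cited paper; the tanh activation, the sum-of-squares cost, the function name `train_slp` and all numerical values are likewise hypothetical.

```python
import numpy as np

def train_slp(X, t, steps=500, eta=0.1, lam=0.0, c=0.0):
    """Batch gradient training of a tanh SLP on the assumed cost
    mean((t - out)^2) / 2 + lam * (w.w - c^2)^2.
    With lam > 0 the extra term pulls the weight norm towards c."""
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(steps):
        out = np.tanh(X @ w + b)
        delta = (t - out) * (1.0 - out ** 2)
        # gradient of the data term plus gradient of the antiregularization term
        grad_w = -(X.T @ delta) / n + 4.0 * lam * (w @ w - c ** 2) * w
        grad_b = -delta.mean()
        w = w - eta * grad_w
        b = b - eta * grad_b
    return w, b

rng = np.random.default_rng(1)
n_per_class = 100
X = np.vstack([rng.normal([1.0, 0.0], 1.0, (n_per_class, 2)),
               rng.normal([-1.0, 0.0], 1.0, (n_per_class, 2))])
t = np.concatenate([np.ones(n_per_class), -np.ones(n_per_class)])

w_plain, b_plain = train_slp(X, t)                    # no extra term
w_anti, b_anti = train_slp(X, t, lam=1e-3, c=10.0)    # weights pushed towards norm 10

norm_plain = np.linalg.norm(w_plain)
norm_anti = np.linalg.norm(w_anti)
print("weight norm without antiregularization:", norm_plain)
print("weight norm with antiregularization:   ", norm_anti)
```

The bias is left out of the penalty in this sketch; that is a design choice of the example, not something the text specifies.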

Knowledge about the effect of the initial values and about the evolution of the non-linear SLP during training can be used to integrate the statistical and the neural-network approaches to designing classification and prediction rules. In the new approach, instead of designing parametric statistical classifiers, we use the training data information (sample means and conventional, constrained or regularized estimates of the covariance matrix common to the two pattern classes) to transform the learning set and test set data into spherical data. It is known that for spherical Gaussian data the best sample-based classifier is the Euclidean distance classifier. Thus, after such a transformation and the first gradient descent iteration, we obtain this classifier. The solution is equivalent to the statistical classifier that could be obtained by using the "learning set based information" just mentioned. It can happen that the parametric assumptions used to transform the data are not entirely correct. Then, in further perceptron training, we can improve the decision boundary. If we succeed in stopping training in time, we can obtain an "optimal" solution (Raudys, C1998d, S1999a, Raudys and Saudargiene, 1998, 2001, Saudargiene 1999, Raudys B2001). Thus both approaches, statistical and neural-network based, can be used simultaneously in order to exploit the positive attributes of each. This is my answer today to our 30-year-old discussion with Vladimir Vapnik.
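
The equivalence between "transform, then take one gradient step" and a statistical classifier can be illustrated for the simplest case, the conventional pooled covariance estimate. This is a sketch under assumptions: the whitening matrix is built from an eigen-decomposition of the pooled sample covariance, the SLP uses a tanh activation with a sum-of-squares cost and targets +1 and -1, and the class sizes are equal. The names `whiten` and `first_batch_step` are hypothetical.

```python
import numpy as np

def whiten(X, t):
    """Centre the data and make the pooled sample covariance the identity.
    Returns the transformed data, the matrix T, the pooled covariance S
    and the class means (all in the original space)."""
    X1, X2 = X[t == 1], X[t == -1]
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S = ((X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)) / (len(X) - 2)
    evals, evecs = np.linalg.eigh(S)
    T = evecs / np.sqrt(evals)           # T = Phi * Lambda^(-1/2), so T.T @ S @ T = I
    center = 0.5 * (m1 + m2)
    return (X - center) @ T, T, S, m1, m2

def first_batch_step(Y, t, eta):
    """One batch gradient step of a tanh SLP from zero weights (assumed setup)."""
    w, b = np.zeros(Y.shape[1]), 0.0
    out = np.tanh(Y @ w + b)
    delta = (t - out) * (1.0 - out ** 2)
    return w + eta * (Y.T @ delta) / len(t), b + eta * delta.mean()

rng = np.random.default_rng(2)
n_per_class = 60
cov = np.array([[2.0, 0.8, 0.0], [0.8, 1.0, 0.3], [0.0, 0.3, 0.5]])
X = np.vstack([rng.multivariate_normal([1.0, 0.0, 0.5], cov, n_per_class),
               rng.multivariate_normal([-1.0, 0.5, 0.0], cov, n_per_class)])
t = np.concatenate([np.ones(n_per_class), -np.ones(n_per_class)])

Y, T, S, m1, m2 = whiten(X, t)
eta = 0.5
w_y, b_y = first_batch_step(Y, t, eta)

w_x = T @ w_y                            # weights mapped back to the original space
fisher_w = np.linalg.solve(S, m1 - m2)   # direction of the standard Fisher classifier

print("SLP direction in original space:", w_x)
print("Fisher direction (unscaled):    ", fisher_w)
```

In this setup the mapped-back weight vector equals `(eta / 2) * S^{-1} (m1 - m2)`, the direction of the standard linear Fisher classifier. A constrained or regularized covariance estimate would be substituted for `S` in the same way.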

One more benefit of the perceptron evolution theory concerns weight initialization. Often the data change over time, and old data can no longer be included in the training set. One solution is to use the weight vector found from the old data for initialization and then to train the SLP with the new data. In this case we need to stop training in time. Another approach is to use the old data to transform both the old and the new data sets. The transformation should produce distributions of the multivariate vectors for which the Euclidean distance classifier or the "primitive" regression (i.e. the SLP after the first training iteration) has very good small-learning-set properties. The asymptotic expressions for the generalization errors presented in my papers, from the first (Raudys, 1967, classification) to the latest (Raudys, 1998b, classification; 2000a, regression), constitute the theoretical basis for this transformation.
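
The first solution, initializing with the old weight vector and stopping in time, might be sketched as follows. The stopping rule used here (keep the weights with the lowest error on a held-out validation set) is an assumption of this example; the text only says that training must be stopped in time. The data, the tanh activation, the sum-of-squares cost and the names `make_data`, `mse` and `train_early_stop` are hypothetical.

```python
import numpy as np

def make_data(rng, n_per_class, mean):
    """Two Gaussian classes with means +mean and -mean and unit variance."""
    mean = np.asarray(mean, dtype=float)
    X = np.vstack([rng.normal(mean, 1.0, (n_per_class, mean.size)),
                   rng.normal(-mean, 1.0, (n_per_class, mean.size))])
    t = np.concatenate([np.ones(n_per_class), -np.ones(n_per_class)])
    return X, t

def mse(w, b, X, t):
    return float(np.mean((t - np.tanh(X @ w + b)) ** 2))

def train_early_stop(X, t, X_val, t_val, w0, b0, eta=0.1, max_steps=200):
    """Batch gradient training from (w0, b0); returns the weights with the
    lowest validation error seen so far, the starting point included."""
    w, b = w0.copy(), float(b0)
    best_w, best_b, best_err = w.copy(), b, mse(w, b, X_val, t_val)
    for _ in range(max_steps):
        out = np.tanh(X @ w + b)
        delta = (t - out) * (1.0 - out ** 2)
        w = w + eta * (X.T @ delta) / len(t)
        b = b + eta * delta.mean()
        err = mse(w, b, X_val, t_val)
        if err < best_err:
            best_w, best_b, best_err = w.copy(), b, err
    return best_w, best_b, best_err

rng = np.random.default_rng(3)
# Old data: plentiful, but no longer available once the distribution has drifted.
X_old, t_old = make_data(rng, 200, [1.0, 0.0])
X_old_val, t_old_val = make_data(rng, 200, [1.0, 0.0])
w_old, b_old, _ = train_early_stop(X_old, t_old, X_old_val, t_old_val,
                                   np.zeros(2), 0.0)

# New data: a small training set from a slightly rotated distribution.
X_new, t_new = make_data(rng, 10, [0.8, 0.6])
X_val, t_val = make_data(rng, 100, [0.8, 0.6])

err_old_weights = mse(w_old, b_old, X_val, t_val)
w_new, b_new, err_new = train_early_stop(X_new, t_new, X_val, t_val, w_old, b_old)

print("validation error of the old weights on new data:", err_old_weights)
print("validation error after warm start + early stop:  ", err_new)
```

Because the starting point itself is a candidate, the returned weights are never worse on the validation set than the old weight vector.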