Since 1990 I have been interested in the theory of artificial neural
networks. At first I was attracted by the local minima problem. We used
the back-propagation algorithm to train a multilayer perceptron (MLP)
classifier hundreds of times, starting from different initial positions,
and found that training ends with different weight vectors and with
different empirical and generalization errors (Schmidt, Raudys et al.,
C1993).
We tried initializing the MLP classifier with the weights of a piecewise
linear classifier and showed that this approach reduces the generalization
error (Raudys and Skurichina, 1992). A student from our group, Aistis
Raudys, suggested exploiting the human ability to analyze images on a
plane. He proposed an original non-parametric algorithm that maps the
training set vectors onto a plane, initializes the perceptron in the space
of the two most informative directions, then adds a third direction, trains
the perceptron, adds a fourth direction, and so on until the desired
complexity of the network is reached. A few years later I returned to the
initialization problem, and together with Shun-ichi Amari we showed that in
the single-minimum case a good starting position can help to reduce the
generalization error (Raudys and Amari, 1998). A necessary condition for
preserving the information contained in the initial weight vector is to
prevent overtraining: training must be stopped early, before the minimum
of the cost function is reached.
Later I succeeded in showing that in the high-dimensional case (a very
large number of inputs) the overtraining effect should always be observed
(Raudys, C2000b, B2001). My main result in ANN theory, however, is a
demonstration that the non-linear single-layer perceptron (SLP) evolves
while it is trained: in adaptive gradient training one can obtain seven
standard statistical classification rules and six prediction rules of
different complexity. It was shown that conditions exist under which, after
the first iteration of gradient minimization performed in batch mode, one
obtains the well-known Euclidean distance classifier. For this we need to
start training from a zero initial weight vector, to move the mean of the
training data to the origin of the coordinate system, and, if the training
set contains the same number of vectors from both pattern classes, to use
"symmetrical" targets (desired outputs). In further iterations we obtain
classical regularized discriminant analysis and move towards the standard
linear Fisher classifier. If the number of dimensions exceeds the training
set size, we approach the Fisher classifier with a pseudo-inverse of the
sample covariance matrix. In still further iterations we obtain a kind of
robust classifier that is insensitive to atypical training set vectors
distant from the discriminant hyperplane (robust classification rule).
Then, when the weights of the perceptron become large, we move
towards a minimum empirical error classifier. When there are no empirical
errors, we approach the maximal margin (generalized portrait, support
vector) classifier. The number of types of classification rules can be
increased if we train the perceptron in a space of new features obtained
by nonlinear transformations of the original p features. This progressive
movement from the Euclidean distance classifier to the maximum margin
classifier explains the well-known overtraining (overfitting) phenomenon:
on the way from the simplest algorithm to the most complex one, one of the
classifiers turns out to be the best for a finite learning set (Raudys,
C1996a).
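
The first step of this evolution can be checked numerically. The following is a minimal sketch, assuming a tanh activation, a sum-of-squares cost and synthetic Gaussian data; these choices, the learning step and all variable names are illustrative assumptions and are not taken from the cited papers. It starts from a zero weight vector, centers the data, uses targets +1 and -1 with equal class sizes, and compares the weights after one batch gradient step with the Euclidean distance classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-class Gaussian data with equal class sizes.
n_per_class, p = 50, 5
X1 = rng.normal(loc=+1.0, scale=1.0, size=(n_per_class, p))
X2 = rng.normal(loc=-0.5, scale=1.0, size=(n_per_class, p))
X = np.vstack([X1, X2])
# "Symmetrical" targets: +1 for the first class, -1 for the second.
t = np.concatenate([np.ones(n_per_class), -np.ones(n_per_class)])

# Move the mean of the training data to the origin.
X = X - X.mean(axis=0)


def batch_gradient_step(w, w0, X, t, eta):
    """One batch gradient step on cost = mean((t - tanh(Xw + w0))**2) / 2."""
    out = np.tanh(X @ w + w0)
    delta = (t - out) * (1.0 - out**2)  # error times derivative of tanh
    grad_w = -(delta @ X) / len(t)
    grad_w0 = -delta.mean()
    return w - eta * grad_w, w0 - eta * grad_w0


# Start from a zero weight vector and a zero bias (assumed eta = 0.1).
w, w0 = batch_gradient_step(np.zeros(p), 0.0, X, t, eta=0.1)

# Euclidean distance classifier: direction m1 - m2, threshold at the midpoint.
m1, m2 = X[t > 0].mean(axis=0), X[t < 0].mean(axis=0)
w_edc = m1 - m2
w0_edc = -0.5 * (m1 + m2) @ w_edc

# After one step the SLP weights are a positive multiple of the EDC weights.
print(w / w_edc)      # every entry equals eta / 2
print(w0, w0_edc)     # both are zero up to rounding
```

Under these assumptions the gradient at zero weights reduces to minus one half of the difference between the class means, so the two discriminant functions differ only by the positive factor eta / 2 and give the same decisions.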
Analysis of the generalization error of the statistical classifiers listed
above helps to understand this effect more deeply from a theoretical
point of view (Raudys, 1998b). Similar considerations are valid
for the SLP used to solve prediction (regression) problems. While training
the perceptron one can obtain six known regressions: the "primitive",
the regularized, the standard least squares with a conventional or a
pseudo-inverse of the covariance matrix, a robust one and, at the very end
(if the weights are sufficiently large), a minimax regression (Raudys,
S1999a). To obtain the full range of statistical classifiers and
regressions available in SLP training, one needs to know how to control
the training process. In addition to the known complexity control
techniques, it was shown that the network's desired outputs are of primary
importance in determining the type of classification rule (Raudys, 1998a,
C2000c,
B2001). It was shown theoretically that, besides conventional spherical
zero-mean noise injection, "colored" noise determined by the k nearest
neighbors distorts the training set in a minimal way and helps to reduce
the generalization error (Skurichina, Raudys and Duin, 2000). To obtain
the minimum empirical error and the maximal margin classifiers, the
weights of the network must be large. Here, adding an antiregularization
term to the cost function (Raudys, C1995c), as well as an exponentially
increasing learning step (Raudys, 1998a), can be very useful. It
was also shown that the learning step used to train the hidden layer
neurons of a multilayer perceptron controls the amount of training
process noise added to the inputs of the output layer, and thus acts as a
factor that controls the network's complexity (Raudys, C2000c, B2001).
The knowledge about the effect of the initial values and about the
evolution of the non-linear SLP during training can be used to integrate
the statistical and the neural-net-based approaches to designing
classification and prediction rules.
In the new approach, instead of designing parametric statistical classifiers,
we use the training data information (sample means and conventional,
constrained or regularized estimates of the covariance matrix common
to the two pattern classes) to transform the learning and
test set data into spherical form. It is known that
for spherical Gaussian data the best sample-based classifier is the
Euclidean distance classifier. Thus, after such a transformation and the
first gradient descent iteration, we obtain this classifier. The solution
is equivalent to the statistical classifier that could be obtained by
using "the learning set based information" just mentioned. It can
happen that the parametric assumptions used to transform the data
are not entirely correct. Then, in further perceptron training, we can
improve the decision boundary. If we succeed in stopping training in
time, we can obtain an "optimal" solution (Raudys, C1998d, S1999a;
Raudys and Saudargiene, 1998, 2001; Saudargiene, 1999; Raudys, B2001). Thus,
both approaches, statistical and neural-net-based, can be used
simultaneously in order to exploit the positive attributes of both.
This is my present-day answer to our 30-year-old discussion with Vladimir
Vapnik. One more benefit of the perceptron evolution theory concerns
weight initialization. Often the data change over time, and the old data
can no longer be included in the training set. One solution is to use the
weight vector found from the old data for initialization and then to train
the SLP with the new data. In this case, too, we need to stop training in
time. Another approach is to use the old data to transform both the old
and the new data sets. The transformation should yield distributions of
the multivariate vectors for which the Euclidean distance classifier or
the "primitive" regression (i.e. the SLP after the first training
iteration) has very good small learning set properties. The asymptotic
expressions for the generalization errors presented in my papers, the
first (Raudys, 1967, classification) and the most recent ones (Raudys,
1998b, classification; 2000a, regression), constitute the theoretical
basis for this transformation.
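
The transformation of the data into spherical form, discussed above, can be illustrated with a short sketch. It assumes the conventional pooled sample covariance matrix and a symmetric inverse square root obtained from an eigendecomposition; this particular whitening matrix, the synthetic data and all names are assumptions made for illustration only. The sketch shows that the Euclidean distance classifier built in the transformed space coincides with the standard Fisher linear discriminant in the original space.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical correlated Gaussian data with a common covariance matrix.
n_per_class, p = 200, 4
cov = np.full((p, p), 0.6) + 0.4 * np.eye(p)
X1 = rng.multivariate_normal(np.full(p, +1.0), cov, size=n_per_class)
X2 = rng.multivariate_normal(np.full(p, -1.0), cov, size=n_per_class)

# Training data information: sample means and pooled covariance estimate.
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S = (np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)) / 2.0

# Whitening matrix T = S^(-1/2), so that T S T = I (spherical data).
eigval, eigvec = np.linalg.eigh(S)
T = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T

# Euclidean distance classifier direction in the transformed space y = T x.
w_white = T @ m1 - T @ m2

# The same rule expressed in the original space: w_white . (T x) = (T w_white) . x
w_back = T @ w_white

# Standard Fisher linear discriminant direction S^(-1) (m1 - m2).
w_fisher = np.linalg.solve(S, m1 - m2)

print(np.round(T @ S @ T, 6))   # identity matrix up to rounding
print(w_back - w_fisher)        # zero vector up to rounding
```

A constrained or regularized covariance estimate could be substituted for `S` in the same way; the first gradient iteration in the transformed space would then correspond to the matching constrained or regularized statistical classifier.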