Why Statisticians Should Not FART

                         Warren S. Sarle
                           Aug 1, 1995

        Copyright (C) 1995 by Warren S. Sarle, Cary, NC, USA

               ftp://ftp.sas.com/pub/neural/fart.txt

Introduction
------------

Fuzzy Adaptive Resonance Theory (FART) is one of the more recent and
popular members of the family of neural networks called "Adaptive
Resonance Theory" by Stephen Grossberg and colleagues.  According to
Carpenter, Grossberg, and Rosen (1991), FART is "capable of rapid
stable learning of recognition categories in response to arbitrary
sequences of analog or binary input patterns." In statistical
terminology, "recognition categories" are clusters and "arbitrary
sequences of input patterns" are data.  Hence FART is a method for
clustering data.  FART is a generalization of ART 1, which is
restricted to binary data.  FART handles continuous data in the
interval [0,1].

FART, like ART 1 as discussed by Moore (1988), is basically similar to
many iterative clustering algorithms in which each case is processed
by:

 * finding the "nearest" cluster seed (AKA prototype or template)
   to that case
 * updating that cluster seed to be "closer" to the case

where "nearest" and "closer" can be defined in hundreds of different
ways. In FART, the framework is modified slightly by introducing the
concept of "resonance" so that each case is processed by:

 * finding the "nearest" cluster seed that "resonates" with
   the case
 * updating that cluster seed to be "closer" to the case

Given a case X=(x_i) and a cluster seed W=(w_i), i=1, ..., v, nearness
is assessed by a similarity measure called the "choice function":

   Sum{ min(x_i,w_i) }
   -------------------
   alpha + Sum{ w_i }

where sums are over i and alpha is a user-specified parameter usually
equal to a very small positive number such as 1e-6.

Resonance is based on a slightly different similarity measure called
the "match function". A seed resonates with the case if:

   Sum{ min(x_i,w_i) }
   ------------------- >= rho
       Sum{ x_i }

where rho is a user-specified "vigilance" parameter between 0 and 1.

If no seed resonates with a case, a new cluster is created, usually
with a seed equal to the case (the "fast commit" option as in Hartigan's
(1975) leader algorithm). If a resonant seed is found, the seed is 
updated according to the formula:

   w(new)_i = beta[min(x_i,w(old)_i)] + (1-beta)[w(old)_i]

where beta is a user-specified learning rate, usually 1 for "fast
learning."

Both ART 1 and FART, as presented so far, suffer from a problem
demonstrated in great detail by Moore (1988): with noisy data, the
cluster seeds, being minima, tend to degenerate to zero, and new
clusters must be continually created, resulting in "category
proliferation." Carpenter et al. claim to have solved the category
proliferation problem by "complement coding": each case X is doubled
in length such that x_(i+v)=1-x_i. With complement coding, w_i for
i<=v is still the cluster minimum (assuming fast learning) for the ith
variable, but 1-w_(i+v) is the cluster maximum. Thus each cluster seed
represents both cluster minima and cluster maxima and can be thought
of as a box within the hypercubic data space.  Complement coding does
indeed prevent the number of categories from growing without bound,
but it does nothing to ensure that the number of categories is
appropriate in any practically useful sense.

With complement coding, the vigilance parameter places an upper
bound on the size of each category box. The size of a box is
defined as the sum of its dimensions:

   size = Sum [1-w_(i+v)-w_i] <= v(1-rho)
          i<=v

During training, when a new case is assigned to a category, the box
grows just enough to contain the new case. As the box grows, the set
of points that resonate with the category shrinks. When the box
reaches the size given by the right-hand side of the above formula,
only points within the box resonate with the category, hence the box
cannot grow any larger.

The category boxes can, and frequently do, overlap. Do overlapping
categories make any sense?  They would make sense if the boxes were
estimating the support of the components of a mixture distribution.
But since FART records information only about the minimum and maximum
values in each category, it ignores information on density. If two
boxes overlap, cases within the overlap may be assigned to the
category with lower density. Hence, the way overlapping categories are
treated in FART makes no statistical sense. This phenomenon can be
illustrated on a small data set (the %FART macro, using the SAS/IML
product, is in ftp://ftp.sas.com/pub/neural/fart.sas):

   title 'Cross';
   data d; input x y; cards;
   .1 .5
   .3 .5
   .7 .5
   .9 .5
   .5 .8
   .5 .2
   .5 .5
   ;

   %fart(data=d,input=x y,rho=.5)
   %plot;

                              Cross

           Plot of Y*X.  Symbol is value of CATEGORY.

    Y |
      |
  1.0 +
      |
  0.9 +
      |
  0.8 +                           2
      |
  0.7 +
      |
  0.6 +
      |
  0.5 +       1         1         2         1         1
      |
  0.4 +
      |
  0.3 +
      |
  0.2 +                           2
      |
  0.1 +
      |
  0.0 +
      |
      ---+----+----+----+----+----+----+----+----+----+----+--
        0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0

                                  X

The first four observations form the horizontal arm of a cross.  By
the time these observations have been processed, the first category
has grown to near its maximal size determined by rho=.5.  The fifth
and sixth observations form the vertical bar of the cross. These
observations do not resonate with the first category; hence they are
assigned to a second category that overlaps the first category. The
seventh observation is at the center of the cross and is assigned by
the choice function to the second category, by a narrow margin.

The fact that FART can construct elongated and overlapping categories,
such as those in the previous example, could be a great advantage
provided that the categories were determined by the distribution of
the data. But the categories are determined primarily by the order in
which the data are presented, not the distribution of the data. In the
following example, the data are distributed in three horizontal bars:

   title 'Three Bars';
   data bars;
      do n=1 to 100;
         x=ranuni(1);
         y=ranuni(1)*.2+mod(n,3)*.4;
         u=ranuni(1);
         output;
      end;
   run;

With the data sorted in random order, and the vigilance parameter rho
set to a value corresponding to the size of each of the three bars,
FART obtains four categories forming boxes of various shapes:

   title2 'Sorted in Random Order';
   proc sort data=bars; by u; run;
   %fart( data=bars, input=x y, rho=.4, out=fbarsu)
   %plot(fbarsu)

                           Three Bars
                     Sorted in Random Order

           Plot of Y*X.  Symbol is value of CATEGORY.

    Y |
      |
  1.0 +               3
      |             3                     3       3     3
  0.9 +    3 3    3           3    3 3     33333       3
      |         333           3     3        33 3    3 3
  0.8 +                       3         3           3
      |
  0.7 +
      |
  0.6 +           2          2     2         1
      |   2 2           2 2   2    2     1            4
  0.5 +     2                2 2    1  1         1
      |         2 2  2        2 2        1     11     4   4
  0.4 +                       2      1
      |
  0.3 +
      |
  0.2 +                 2            1     1   11
      |    2       2         2  22                      4
  0.1 +         2 2      2    2     1     1    1      44
      |    2      2     2  2 2     1 11             4 44
  0.0 +                       2
      |
      ---+----+----+----+----+----+----+----+----+----+----+--
        0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0

                                  X

If the data are sorted so that all cases within a bar are together,
then FART successfully identifies the three horizontal bars:

   title2 'Sorted by Y X';
   proc sort data=bars; by y x; run;
   %fart( data=bars, input=x y, rho=.4, out=fbarsh)
   %plot(fbarsh)

                           Three Bars
                          Sorted by Y X

           Plot of Y*X.  Symbol is value of CATEGORY.

    Y |
      |
  1.0 +               3
      |             3                     3       3     3
  0.9 +    3 3    3           3    3 3     33333       3
      |         333           3     3        33 3    3 3
  0.8 +                       3         3           3
      |
  0.7 +
      |
  0.6 +           2          2     2         2
      |   2 2           2 2   2    2     2            2
  0.5 +     2                2 2    2  2         2
      |         2 2  2        2 2        2     22     2   2
  0.4 +                       2      2
      |
  0.3 +
      |
  0.2 +                 1            1     1   11
      |    1       1         1  11                      1
  0.1 +         1 1      1    1     1     1    1      11
      |    1      1     1  1 1     1 11             1 11
  0.0 +                       1
      |
      ---+----+----+----+----+----+----+----+----+----+----+--
        0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0

                                  X

But if the data are sorted so that cases with similar values of X
appear together, then FART finds four vertical bars:

   title2 'Sorted by X Y';
   proc sort data=bars; by x y; run;
   %fart( data=bars, input=x y, rho=.4, out=fbarsv)
   %plot(fbarsv)

                           Three Bars
                          Sorted by X Y

           Plot of Y*X.  Symbol is value of CATEGORY.

    Y |
      |
  1.0 +               1
      |             1                     3       3     4
  0.9 +    1 1    1           2    2 2     33333       4
      |         111           2     2        33 3    3 4
  0.8 +                       2         3           3
      |
  0.7 +
      |
  0.6 +           1          2     2         3
      |   1 1           2 2   2    2     3            4
  0.5 +     1                2 2    2  3         3
      |         1 1  1        2 2        3     33     4   4
  0.4 +                       2      2
      |
  0.3 +
      |
  0.2 +                 2            2     3   33
      |    1       1         2  22                      4
  0.1 +         1 1      2    2     2     3    3      44
      |    1      1     2  2 2     2 23             3 44
  0.0 +                       2
      |
      ---+----+----+----+----+----+----+----+----+----+----+--
        0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0

                                  X

The dependence of the categories on the order of the data can be
reduced by using a slow learning rate. If the learning rate is reduced
to beta=.1, 29 iterations are required to form five categories of
various shapes:

   title3 'With Slow Learning, Beta=.1';
   %fart( data=bars, input=x y, rho=.4, beta=.1, out=fbarss)
   %plot(fbarss)

                           Three Bars
                          Sorted by X Y
                   With Slow Learning, Beta=.1

           Plot of Y*X.  Symbol is value of CATEGORY.

    Y |
      |
  1.0 +               3
      |             3                     3       3     3
  0.9 +    3 3    3           3    3 3     33333       3
      |         333           3     3        33 3    3 3
  0.8 +                       1         3           3
      |
  0.7 +
      |
  0.6 +           1          1     2         2
      |   1 1           1 1   1    2     2            2
  0.5 +     1                1 1    2  2         2
      |         1 1  1        1 1        2     22     2   5
  0.4 +                       1      2
      |
  0.3 +
      |
  0.2 +                 1            2     2   22
      |    1       1         1  12                      2
  0.1 +         4 4      4    4     2     2    2      22
      |    4      4     4  4 4     2 22             2 22
  0.0 +                       4
      |
      ---+----+----+----+----+----+----+----+----+----+----+--
        0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0

                                  X

But reduction of the learning rate is of limited effect and can be
nullified by increasing the sample size. Using 10,000 cases and
beta=.1, with the number of iterations restricted to one to save time,
FART identifies five categories in neat vertical bars:

   title3 'With Slow Learning, Beta=.1, and 10,000 Cases';
   data bigbars;
      do n=1 to 10000;
         x=ranuni(1);
         y=ranuni(1)*.2+mod(n,3)*.4;
         u=ranuni(1);
         output;
      end;
   run;
   proc sort data=bigbars; by x y; run;
   %fart( data=bigbars, input=x y, rho=.4, beta=.1, maxiter=1, out=fbarsb)
   %plot(fbarsb)

                           Three Bars
                          Sorted by X Y
          With Slow Learning, Beta=.1, and 10,000 Cases

           Plot of Y*X.  Symbol is value of CATEGORY.

    Y |
      |
  1.0 +  111111111111222222222233333333334444444444455555555
      |  111111111111222222222233333333334444444444455555555
  0.9 +  111111111112222222222233333333334444444444455555555
      |  111111111111222222222233333333334444444444455555555
  0.8 +  111111111112222222222233333333334444444444555555555
      |
  0.7 +
      |
  0.6 +  111111111111222222222233333333334444444444455555555
      |  111111111111222222222233333333334444444444455555555
  0.5 +  111111111111222222222233333333334444444444455555555
      |  111111111111222222222233333333334444444444455555555
  0.4 +  111111111112222222222233333333334444444444455555555
      |
  0.3 +
      |
  0.2 +  111111111112222222222233333333334444444444455555555
      |  111111111111222222222233333333334444444444455555555
  0.1 +  111111111111222222222233333333334444444444455555555
      |  111111111111222222222233333333334444444444455555555
  0.0 +  111111111112222222222233333333334444444444455555555
      |
      ---+----+----+----+----+----+----+----+----+----+----+--
        0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0

                                  X

One of the most basic properties of statistical estimators is
"consistency." Loosely speaking, a consistent estimator is one that
converges in some sense, as the sample size goes to infinity, to the
true value of the characteristic being estimated. It is apparent from
the previous example that FART does not provide consistent estimates
of categories, no matter how the categories might be defined, because
the results of FART depend critically on the order of the data,
regardless of how large the sample is.  Hence, FART is statistically
unacceptable as a method of learning categories.

FARTMAP is an extension of FART to supervised classification
(Carpenter, Grossberg, Markuzon, Reynolds, and Rosen 1991; Kasuba 1993
is a more accessible reference). In FARTMAP, each category is
associated with one class of the target classification variable in a
many-to-one mapping. The algorithm is very similar to FART. The main
addition is a step that occurs after resonance has been established:
if the output class does not match the target class, the vigilance
parameter is increased to force the creation of a new category, which
is associated with the target class. FARTMAP has no mechanism to avoid
overfitting.  Hence, noisy data can produce a vast proliferation of
categories with severe overfitting, so FARTMAP should never be used
with noisy data (Williamson 1995).

References:

   Carpenter, G.A., Grossberg, S., Markuzon, N., Reynolds, J.H., and 
   Rosen, D.B. (1992). "Fuzzy ARTMAP: A neural network architecture for
   incremental supervised learning of analog multidimensional maps,"
   IEEE Transactions on Neural Networks, 3, 698-713.

   Carpenter, G.A., Grossberg, S., Rosen, D.B. (1991), "Fuzzy ART:
   Fast stable learning and categorization of analog patterns by an
   adaptive resonance system," Neural Networks, 4, 759-771.

   Hartigan, J.A. (1975), _Clustering Algorithms_, NY: Wiley.

   Kasuba, T. (1993), "Simplified Fuzzy ARTMAP," AI Expert, 8, 18-25.

   Moore, B. (1988), "ART 1 and Pattern Clustering", in Touretzky, D.,
   Hinton, G. and Sejnowski, T., eds., _Proceedings of the 1988
   Connectionist Models Summer School_, 174-185,  San Mateo, CA: Morgan
   Kaufmann.

   Williamson, J.R. (1995), "Gaussian ARTMAP: A Neural Network for Fast
   Incremental Learning of Noisy Multidimensional Maps," Technical
   Report CAS/CNS-95-003, Boston University, Center of Adaptive
   Systems and Department of Cognitive and Neural Systems.