Why Statisticians Should Not FART

Warren S. Sarle
Aug 1, 1995

Copyright (C) 1995 by Warren S. Sarle, Cary, NC, USA
ftp://ftp.sas.com/pub/neural/fart.txt


Introduction
------------

Fuzzy Adaptive Resonance Theory (FART) is one of the more recent and
popular members of the family of neural networks called "Adaptive
Resonance Theory" by Stephen Grossberg and colleagues. According to
Carpenter, Grossberg, and Rosen (1991), FART is "capable of rapid
stable learning of recognition categories in response to arbitrary
sequences of analog or binary input patterns." In statistical
terminology, "recognition categories" are clusters and "arbitrary
sequences of input patterns" are data. Hence FART is a method for
clustering data.

FART is a generalization of ART 1, which is restricted to binary data.
FART handles continuous data in the interval [0,1]. FART, like ART 1 as
discussed by Moore (1988), is basically similar to many iterative
clustering algorithms in which each case is processed by:

 * finding the "nearest" cluster seed (AKA prototype or template) to
   that case
 * updating that cluster seed to be "closer" to the case

where "nearest" and "closer" can be defined in hundreds of different
ways. In FART, the framework is modified slightly by introducing the
concept of "resonance", so that each case is processed by:

 * finding the "nearest" cluster seed that "resonates" with the case
 * updating that cluster seed to be "closer" to the case

Given a case X=(x_i) and a cluster seed W=(w_i), i=1, ..., v, nearness
is assessed by a similarity measure called the "choice function":

   Sum{ min(x_i,w_i) }
   -------------------
   alpha + Sum{ w_i }

where the sums are over i and alpha is a user-specified parameter
usually equal to a very small positive number such as 1e-6. Resonance
is based on a slightly different similarity measure called the "match
function".
A seed resonates with the case if:

   Sum{ min(x_i,w_i) }
   ------------------- >= rho
       Sum{ x_i }

where rho is a user-specified "vigilance" parameter between 0 and 1.
If no seed resonates with a case, a new cluster is created, usually
with a seed equal to the case (the "fast commit" option, as in
Hartigan's (1975) leader algorithm). If a resonant seed is found, the
seed is updated according to the formula:

   w(new)_i = beta[min(x_i,w(old)_i)] + (1-beta)[w(old)_i]

where beta is a user-specified learning rate, usually 1 for "fast
learning".

Both ART 1 and FART, as presented so far, suffer from a problem
demonstrated in great detail by Moore (1988): with noisy data, the
cluster seeds, being minima, tend to degenerate to zero, and new
clusters must be continually created, resulting in "category
proliferation". Carpenter et al. claim to have solved the category
proliferation problem by "complement coding": each case X is doubled
in length such that x_(i+v) = 1-x_i. With complement coding, w_i for
i<=v is still the cluster minimum (assuming fast learning) for the ith
variable, but 1-w_(i+v) is the cluster maximum. Thus each cluster seed
represents both cluster minima and cluster maxima and can be thought
of as a box within the hypercubic data space. Complement coding does
indeed prevent the number of categories from growing without bound,
but it does nothing to ensure that the number of categories is
appropriate in any practically useful sense.

With complement coding, the vigilance parameter places an upper bound
on the size of each category box. The size of a box is defined as the
sum of its dimensions:

   size = Sum [1-w_(i+v)-w_i] <= v(1-rho)
          i<=v

During training, when a new case is assigned to a category, the box
grows just enough to contain the new case. As the box grows, the set
of points that resonate with the category shrinks.
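For concreteness, the four ingredients defined so far (choice function, match function, update rule, and complement coding) can be sketched in plain Python. This is an illustrative sketch, not the %FART macro; the function names and all numeric cases below are made up:

```python
def choice(i_vec, w, alpha=1e-6):
    # Choice function: Sum{min(x_i,w_i)} / (alpha + Sum{w_i}).
    return sum(min(a, b) for a, b in zip(i_vec, w)) / (alpha + sum(w))

def match(i_vec, w):
    # Match function: Sum{min(x_i,w_i)} / Sum{x_i}; resonance if >= rho.
    return sum(min(a, b) for a, b in zip(i_vec, w)) / sum(i_vec)

def complement(x):
    # Complement coding: append 1-x_i so a seed can track maxima too.
    return list(x) + [1.0 - xi for xi in x]

def update(i_vec, w, beta=1.0):
    # w(new)_i = beta*min(x_i,w(old)_i) + (1-beta)*w(old)_i;
    # with beta=1 ("fast learning") this is an elementwise minimum.
    return [beta * min(a, b) + (1 - beta) * b for a, b in zip(i_vec, w)]

# Grow one category over three made-up cases under fast learning:
v = 2
w = complement([0.3, 0.6])                # "fast commit" on the first case
for x in [(0.5, 0.4), (0.2, 0.7)]:
    w = update(complement(x), w)

mins = w[:v]                              # per-variable category minima
maxes = [1.0 - wi for wi in w[v:]]        # per-variable category maxima
size = sum(hi - lo for lo, hi in zip(mins, maxes))
print(mins, maxes)                        # -> [0.2, 0.4] [0.5, 0.7]

# The "nearest" of two seeds is the one maximizing the choice function:
seeds = [w, complement([0.9, 0.9])]
case = complement((0.4, 0.5))
nearest = max(range(2), key=lambda j: choice(case, seeds[j]))
print(nearest)  # -> 0
```

Under fast learning the seed ends up holding the componentwise minimum of the cases assigned to it, and the complement half encodes the maxima; together they are the corners of the category box whose size the vigilance bound v(1-rho) limits.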
When the box reaches the size given by the right-hand side of the above
formula, only points within the box resonate with the category, hence
the box cannot grow any larger.

The category boxes can, and frequently do, overlap. Do overlapping
categories make any sense? They would make sense if the boxes were
estimating the support of the components of a mixture distribution.
But since FART records information only about the minimum and maximum
values in each category, it ignores information on density. If two
boxes overlap, cases within the overlap may be assigned to the
category with lower density. Hence, the way overlapping categories are
treated in FART makes no statistical sense.

This phenomenon can be illustrated on a small data set (the %FART
macro, using the SAS/IML product, is in
ftp://ftp.sas.com/pub/neural/fart.sas):

   title 'Cross';
   data d;
      input x y;
      cards;
   .1 .5
   .3 .5
   .7 .5
   .9 .5
   .5 .8
   .5 .2
   .5 .5
   ;
   %fart(data=d,input=x y,rho=.5)
   %plot;

                               Cross

              Plot of Y*X.  Symbol is value of CATEGORY.

   Y |
     |
 1.0 +
     |
 0.9 +
     |
 0.8 +                           2
     |
 0.7 +
     |
 0.6 +
     |
 0.5 +       1         1         2         1         1
     |
 0.4 +
     |
 0.3 +
     |
 0.2 +                           2
     |
 0.1 +
     |
 0.0 +
     |
     ---+----+----+----+----+----+----+----+----+----+----+--
       0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0

The first four observations form the horizontal arm of a cross. By the
time these observations have been processed, the first category has
grown to near its maximal size determined by rho=.5. The fifth and
sixth observations form the vertical bar of the cross. These
observations do not resonate with the first category; hence they are
assigned to a second category that overlaps the first category. The
seventh observation is at the center of the cross and is assigned by
the choice function to the second category, by a narrow margin.

The fact that FART can construct elongated and overlapping categories,
such as those in the previous example, could be a great advantage
provided that the categories were determined by the distribution of
the data.
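The cross example can be reproduced with a bare-bones Python reimplementation of the algorithm described above (complement coding, fast commit, fast learning). This is an illustrative sketch, not the %FART macro, and `fart` is a name introduced here:

```python
def fart(cases, rho, alpha=1e-6):
    """Bare-bones fuzzy ART: complement coding, fast commit, fast learning.

    cases: sequence of tuples with values in [0,1].
    Returns one 1-based category label per case, matching plot symbols.
    """
    seeds = []
    labels = []
    for x in cases:
        i_vec = list(x) + [1.0 - xi for xi in x]     # complement coding
        fuzzy_and = lambda w: [min(a, b) for a, b in zip(i_vec, w)]
        # try seeds in decreasing order of the choice function
        order = sorted(range(len(seeds)),
                       key=lambda j: sum(fuzzy_and(seeds[j]))
                                     / (alpha + sum(seeds[j])),
                       reverse=True)
        for j in order:
            # match function: resonance requires match >= vigilance rho
            if sum(fuzzy_and(seeds[j])) / sum(i_vec) >= rho:
                seeds[j] = fuzzy_and(seeds[j])       # fast learning
                labels.append(j + 1)
                break
        else:
            seeds.append(i_vec)                      # fast commit
            labels.append(len(seeds))
    return labels

cross = [(.1,.5), (.3,.5), (.7,.5), (.9,.5), (.5,.8), (.5,.2), (.5,.5)]
print(fart(cross, rho=.5))  # -> [1, 1, 1, 1, 2, 2, 2]
```

With the cases in the order listed, the first four grow category 1 into the horizontal arm, the vertical-bar points start an overlapping category 2, and the center point goes to category 2 by a narrow margin of the choice function.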
But the categories are determined primarily by the order in which the
data are presented, not by the distribution of the data. In the
following example, the data are distributed in three horizontal bars:

   title 'Three Bars';
   data bars;
      do n=1 to 100;
         x=ranuni(1);
         y=ranuni(1)*.2+mod(n,3)*.4;
         u=ranuni(1);
         output;
      end;
   run;

With the data sorted in random order, and the vigilance parameter rho
set to a value corresponding to the size of each of the three bars,
FART obtains four categories forming boxes of various shapes:

   title2 'Sorted in Random Order';
   proc sort data=bars; by u; run;
   %fart( data=bars, input=x y, rho=.4, out=fbarsu)
   %plot(fbarsu)

                             Three Bars
                        Sorted in Random Order

              Plot of Y*X.  Symbol is value of CATEGORY.

   [Scatter plot not reproducible here: four overlapping category
   boxes of various shapes, none matching the three horizontal bars.]

If the data are sorted so that all cases within a bar are together,
then FART successfully identifies the three horizontal bars:

   title2 'Sorted by Y X';
   proc sort data=bars; by y x; run;
   %fart( data=bars, input=x y, rho=.4, out=fbarsh)
   %plot(fbarsh)

                             Three Bars
                            Sorted by Y X

              Plot of Y*X.  Symbol is value of CATEGORY.
   [Scatter plot not reproducible here: each of the three horizontal
   bars is identified as a single category.]

But if the data are sorted so that cases with similar values of X
appear together, then FART finds four vertical bars:

   title2 'Sorted by X Y';
   proc sort data=bars; by x y; run;
   %fart( data=bars, input=x y, rho=.4, out=fbarsv)
   %plot(fbarsv)

                             Three Bars
                            Sorted by X Y

              Plot of Y*X.  Symbol is value of CATEGORY.

   [Scatter plot not reproducible here: the categories form four
   roughly vertical bars cutting across the three horizontal bars.]

The dependence of the categories on the order of the data can be
reduced by using a slow learning rate. If the learning rate is reduced
to beta=.1, 29 iterations are required to form five categories of
various shapes:

   title3 'With Slow Learning, Beta=.1';
   %fart( data=bars, input=x y, rho=.4, beta=.1, out=fbarss)
   %plot(fbarss)

                             Three Bars
                            Sorted by X Y
                     With Slow Learning, Beta=.1

              Plot of Y*X.  Symbol is value of CATEGORY.

   [Scatter plot not reproducible here: five categories of various
   shapes, still cutting across the three horizontal bars.]

But reduction of the learning rate is of limited effect and can be
nullified by increasing the sample size.
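The order dependence can be demonstrated even on the seven-point cross data from the earlier example: presenting the vertical bar before the horizontal arm yields a different partition of the same seven points at the same vigilance. Again an illustrative Python sketch (complement coding, fast commit, fast learning), not the %FART macro:

```python
def fart_partition(cases, rho, alpha=1e-6):
    # Bare-bones fuzzy ART, returning the clustering as a set of point
    # sets so two runs can be compared regardless of presentation order.
    seeds, labels = [], []
    for x in cases:
        i_vec = list(x) + [1.0 - xi for xi in x]    # complement coding
        fuzzy_and = lambda w: [min(a, b) for a, b in zip(i_vec, w)]
        order = sorted(range(len(seeds)),
                       key=lambda j: sum(fuzzy_and(seeds[j]))
                                     / (alpha + sum(seeds[j])),
                       reverse=True)
        for j in order:
            if sum(fuzzy_and(seeds[j])) / sum(i_vec) >= rho:  # resonance
                seeds[j] = fuzzy_and(seeds[j])      # fast learning
                labels.append(j)
                break
        else:
            seeds.append(i_vec)                     # fast commit
            labels.append(len(seeds) - 1)
    groups = {}
    for case, lab in zip(cases, labels):
        groups.setdefault(lab, set()).add(tuple(case))
    return {frozenset(g) for g in groups.values()}

cross = [(.1,.5), (.3,.5), (.7,.5), (.9,.5), (.5,.8), (.5,.2), (.5,.5)]
horizontal_first = cross                            # arm of the cross first
vertical_first = cross[4:6] + cross[:4] + cross[6:] # bar of the cross first

p1 = fart_partition(horizontal_first, rho=.5)
p2 = fart_partition(vertical_first, rho=.5)
print(p1 == p2)  # -> False: same data, same rho, different clusters
```

Presenting the vertical bar first lets its category box grow sideways to swallow part of the horizontal arm, so the two orderings produce genuinely different clusterings of identical data.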
Using 10,000 cases and beta=.1, with the number of iterations
restricted to one to save time, FART identifies five categories in
neat vertical bars:

   title3 'With Slow Learning, Beta=.1, and 10,000 Cases';
   data bigbars;
      do n=1 to 10000;
         x=ranuni(1);
         y=ranuni(1)*.2+mod(n,3)*.4;
         u=ranuni(1);
         output;
      end;
   run;
   proc sort data=bigbars; by x y; run;
   %fart( data=bigbars, input=x y, rho=.4, beta=.1, maxiter=1, out=fbarsb)
   %plot(fbarsb)

                             Three Bars
                            Sorted by X Y
            With Slow Learning, Beta=.1, and 10,000 Cases

              Plot of Y*X.  Symbol is value of CATEGORY.

   Y |
     |
 1.0 +  111111111111222222222233333333334444444444455555555
     |  111111111111222222222233333333334444444444455555555
 0.9 +  111111111112222222222233333333334444444444455555555
     |  111111111111222222222233333333334444444444455555555
 0.8 +  111111111112222222222233333333334444444444555555555
     |
 0.7 +
     |
 0.6 +  111111111111222222222233333333334444444444455555555
     |  111111111111222222222233333333334444444444455555555
 0.5 +  111111111111222222222233333333334444444444455555555
     |  111111111111222222222233333333334444444444455555555
 0.4 +  111111111112222222222233333333334444444444455555555
     |
 0.3 +
     |
 0.2 +  111111111112222222222233333333334444444444455555555
     |  111111111111222222222233333333334444444444455555555
 0.1 +  111111111111222222222233333333334444444444455555555
     |  111111111111222222222233333333334444444444455555555
 0.0 +  111111111112222222222233333333334444444444555555555
     |
     ---+----+----+----+----+----+----+----+----+----+----+--
       0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0

One of the most basic properties of statistical estimators is
"consistency". Loosely speaking, a consistent estimator is one that
converges in some sense, as the sample size goes to infinity, to the
true value of the characteristic being estimated.
It is apparent from the previous example that FART does not provide
consistent estimates of categories, no matter how the categories might
be defined, because the results of FART depend critically on the order
of the data, regardless of how large the sample is. Hence, FART is
statistically unacceptable as a method of learning categories.

FARTMAP is an extension of FART to supervised classification
(Carpenter, Grossberg, Markuzon, Reynolds, and Rosen 1992; Kasuba 1993
is a more accessible reference). In FARTMAP, each category is
associated with one class of the target classification variable in a
many-to-one mapping. The algorithm is very similar to FART. The main
addition is a step that occurs after resonance has been established:
if the output class does not match the target class, the vigilance
parameter is increased to force the creation of a new category, which
is associated with the target class. FARTMAP has no mechanism to avoid
overfitting. Hence, noisy data can produce a vast proliferation of
categories with severe overfitting, so FARTMAP should never be used
with noisy data (Williamson 1995).


References:

Carpenter, G.A., Grossberg, S., Markuzon, N., Reynolds, J.H., and
   Rosen, D.B. (1992), "Fuzzy ARTMAP: A neural network architecture
   for incremental supervised learning of analog multidimensional
   maps," IEEE Transactions on Neural Networks, 3, 698-713.

Carpenter, G.A., Grossberg, S., and Rosen, D.B. (1991), "Fuzzy ART:
   Fast stable learning and categorization of analog patterns by an
   adaptive resonance system," Neural Networks, 4, 759-771.

Hartigan, J.A. (1975), _Clustering Algorithms_, NY: Wiley.

Kasuba, T. (1993), "Simplified Fuzzy ARTMAP," AI Expert, 8, 18-25.

Moore, B. (1988), "ART 1 and Pattern Clustering," in Touretzky, D.,
   Hinton, G., and Sejnowski, T., eds., _Proceedings of the 1988
   Connectionist Models Summer School_, 174-185, San Mateo, CA:
   Morgan Kaufmann.

Williamson, J.R.
   (1995), "Gaussian ARTMAP: A Neural Network for Fast Incremental
   Learning of Noisy Multidimensional Maps," Technical Report
   CAS/CNS-95-003, Boston University, Center for Adaptive Systems and
   Department of Cognitive and Neural Systems.