The question of which features in the training set are used by a particular feedforward network can be excruciatingly difficult to answer. It is easier to discuss tempting methods that do not work than it is to find methods that do, so that will be done first.

The two most common notions of importance are predictive importance and causal importance. Predictive importance is concerned with the increase in generalization error when an input is omitted from a network. Causal importance is concerned with situations where you can manipulate the values of the inputs and you want to know how much the outputs will change. Predictive importance and causal importance are the same only in very limited circumstances.
The predictive importance and causal importance of a given input both depend on what other inputs are used in the network. Marginal importance considers each input in isolation. There are many measures of marginal importance that are easy to compute without even training a network, such as Pearson correlation, rank correlation, Hoeffding's measure of dependence, and mutual information. But marginal importance is of little practical use other than for a preliminary description of the data. An input with high marginal importance may have low causal and predictive importance, and an input with low marginal importance may have high causal and predictive importance.
Even in a linear model, it is not generally possible to come up with a single number for the importance of each input. But understanding measures of importance for linear models is a big step towards understanding measures of importance for neural nets, so this answer will begin with a discussion of linear models. Measures of importance for linear models are described in more detail by Darlington (1968).
This article does not address the problem of selecting an optimal subset of inputs. Except in special cases, measures of the importance of single inputs cannot be combined to tell you the importance of a subset of inputs.
Consider a linear model:

   Y = b + w1*X1 + w2*X2 + w3*X3 + noise

where b is the bias and w1, w2, and w3 are the connection weights.
For example, with these training data:
    Y    X1   X2   X3
   ------------------
    7    1    2    500
    3    2    1    100
    6    3    4    200
    9    4    3    600
   12    5    6    300
   15    6    5    700
   18    7    8    800
   14    8    7    400

The weights learned by least squares are:
   Input   Weight
   -----   ---------
   X1       0.506250
   X2       1.006250
   X3       0.008750
   bias    -0.243750

The mean squared training error is 0.6125.
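These numbers are easy to reproduce. The following is a minimal NumPy sketch that fits the example data by ordinary least squares (the variable names are mine, not part of the example):

```python
import numpy as np

# Training data from the example above: columns are X1, X2, X3.
X = np.array([[1, 2, 500], [2, 1, 100], [3, 4, 200], [4, 3, 600],
              [5, 6, 300], [6, 5, 700], [7, 8, 800], [8, 7, 400]], float)
y = np.array([7, 3, 6, 9, 12, 15, 18, 14], float)

# Append a column of ones so the bias is fit along with the weights.
A = np.column_stack([X, np.ones(len(X))])
coef = np.linalg.lstsq(A, y, rcond=None)[0]
w1, w2, w3, b = coef

# Mean squared training error.
mse = np.mean((A @ coef - y) ** 2)
```

Running this recovers the weights 0.506250, 1.006250, 0.008750, the bias -0.243750, and the training MSE 0.6125.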
Recall the form of the model:

   Y = b + w1*X1 + w2*X2 + w3*X3 + noise

Suppose X1 is measured in meters, but you want to convert it to millimeters. Since the conversion multiplies X1 by 1000, you have to divide w1 by 1000. Similarly, if you want to convert X1 to kilometers, you have to divide X1 by 1000 and multiply w1 by 1000. Thus the size of w1 depends entirely on the units of measurement of X1. Likewise, the size of w2 depends entirely on the units of measurement of X2. So unless X1 and X2 are measured in comparable units, the comparison of w1 and w2 is meaningless.
For the data in the linear model example above, X3 has by far the smallest weight. But X3 has much larger values, and a larger range of values, than the other inputs. In this example, X1 and X2 were measured in meters, while X3 was measured in centimeters. If you convert X3 to meters, the weights become:
   Input   Weight based on common units
   -----   ----------------------------
   X1       0.506250
   X2       1.006250
   X3       0.875000
   bias    -0.243750

Now X1 has the smallest weight.
You can train the network using standardized inputs (often a good idea), or you can train the network on the raw inputs and then multiply each input weight by the standard deviation of the input. Either way, you get the same standardized weights (barring local minima, convergence problems, etc.). In linear models, it is also customary to standardize targets, but that does not affect comparisons of input weights--for the purposes of this discussion, the crucial thing is to standardize the inputs.
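The claim that both routes give the same standardized weights is easy to verify numerically. Here is a sketch using the example data above, with least squares standing in for network training:

```python
import numpy as np

# Example data from the linear model above.
X = np.array([[1, 2, 500], [2, 1, 100], [3, 4, 200], [4, 3, 600],
              [5, 6, 300], [6, 5, 700], [7, 8, 800], [8, 7, 400]], float)
y = np.array([7, 3, 6, 9, 12, 15, 18, 14], float)

# Route 1: fit on raw inputs, then multiply each weight by the input's std.
A = np.column_stack([X, np.ones(len(X))])
w_raw = np.linalg.lstsq(A, y, rcond=None)[0][:3]
std_w_1 = w_raw * X.std(axis=0)

# Route 2: standardize the inputs first, then fit.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
Az = np.column_stack([Z, np.ones(len(Z))])
std_w_2 = np.linalg.lstsq(Az, y, rcond=None)[0][:3]
```

The two routes agree because standardizing the inputs is an affine reparameterization of the same model.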
Standardized weights can be compared meaningfully if the standard deviations are meaningful. For the standard deviations to be meaningful, it is usually necessary for the input cases to be a representative sample from the set of all cases you want to be able to generalize to.
For example, suppose you want to use a linear model to predict job performance ratings for people who apply for jobs at your company. There are two inputs: a score on an aptitude test and college grade-point average (GPA). Suppose, further, that applicants are hired only if their test score exceeds some cutoff, so the training cases are selected on the basis of the test score.
If job performance really is linearly related to test score and GPA, changing the cutoff score will not affect the true raw (nonstandardized) weights. In fact, the true raw weights will not be affected by any method of selecting cases based solely on the inputs, as long as the distribution of the inputs is nonsingular.
For the example data above, the change in training MSE produced by omitting each input and retraining is:

   Omitted   Change
   Input     in MSE
   -------   -------
   X1        0.24051
   X2        0.95018
   X3        3.06250

X1 and X2 appear much less important than X3 because X1 and X2 are strongly correlated with each other, as shown in the following correlation matrix:
         X1        X2        X3
   X1    1.00000   0.90476   0.47619
   X2    0.90476   1.00000   0.47619
   X3    0.47619   0.47619   1.00000

If you omit X1 from the model and retrain, X2's weight increases to compensate. If you omit X2 from the model and retrain, X1's weight increases to compensate. If you omit X3 from the model, there is no other highly correlated input to compensate. Thus, X3 is more important for prediction than either X1 or X2 considered individually.
However, it would be incorrect to conclude that X1 and X2 are jointly unimportant. If both X1 and X2 are omitted from the model, the MSE increases by 8.77738, which is much greater than the sum of the increases (1.19069) resulting from omitting each input individually.
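The omission experiment can be replicated directly by refitting the model with subsets of the inputs. A sketch in NumPy (again using least squares in place of network training):

```python
import numpy as np

# Training data from the linear model example above.
X = np.array([[1, 2, 500], [2, 1, 100], [3, 4, 200], [4, 3, 600],
              [5, 6, 300], [6, 5, 700], [7, 8, 800], [8, 7, 400]], float)
y = np.array([7, 3, 6, 9, 12, 15, 18, 14], float)

def train_mse(cols):
    """Least-squares training MSE using only the listed input columns."""
    A = np.column_stack([X[:, cols], np.ones(len(X))])
    coef = np.linalg.lstsq(A, y, rcond=None)[0]
    return np.mean((A @ coef - y) ** 2)

full_mse = train_mse([0, 1, 2])
delta = {"X1": train_mse([1, 2]) - full_mse,
         "X2": train_mse([0, 2]) - full_mse,
         "X3": train_mse([0, 1]) - full_mse,
         "X1 and X2": train_mse([2]) - full_mse}
```

The individual changes reproduce the table above, and the joint omission of X1 and X2 produces a much larger increase than the sum of the individual increases.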
If the inputs are uncorrelated, the change in MSE produced by omitting an input is proportional to the square of that input's standardized weight, regardless of which inputs are included in the model. If the inputs are correlated, there is not necessarily a monotonic relation between change in MSE and the squared standardized weights. Hence it is only with uncorrelated inputs that the change in MSE is a measure of both causal and predictive importance.
In statistical applications, it is common to test hypotheses about the "true" values of the weights. "True" means the optimal value for the entire population that you want to generalize to. For example, you could test the "null" hypothesis:
H0: W1 is exactly zero
H1: W1 is nonzero
In real life, you have a single training set from which to learn w1. Hopefully, this training set is a representative sample of the population. In some applications, the training set may have been obtained by taking a random sample of the population. If so, you could imagine what might have happened if you had used a different seed for your pseudo-random number generator, or if the coin you flipped had landed differently--you would have obtained a different training set and probably a different value of w1. Imagine taking 1000 different random samples, numbered s=1,...,1000, to use as training sets. For each of these 1000 samples, you train a linear model and obtain an estimate of W1, say w1[s]. You can collect all of these values, w1[1],....,w1[1000], into a data set and look at their distribution. You could draw a histogram of the w1[s] values, compute their mean and standard deviation, etc. Now consider taking an infinite number of random samples and computing an infinite number of w1[s] values--that infinite collection is the sampling distribution of w1.
Fortunately, in linear models trained by least squares, it is possible to estimate the sampling distribution of w1[s]--or any other weight in the model, or even any linear combination of the weights--without doing an infinite number of calculations. Assuming that the noise distribution is not too pathological, the result is that for large enough training sets, w1[s] has an approximately normal distribution with mean W1 and variance equal to the product of (a) the variance of the noise and (b) the corresponding diagonal element of the inverse Hessian matrix. The square root of the variance of w1[s] is called the "standard error" of w1[s], say SE(w1).
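The agreement between the analytic standard error and the sampling distribution can be checked by simulation. In this sketch, the "population" is a hypothetical linear model whose true weights I made up for illustration; many random training sets are drawn, and the empirical spread of w1[s] is compared with the standard error computed from a single training set:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 200, 1.0
W_true, b_true = np.array([1.5, -2.0]), 0.5   # hypothetical population weights

w1_samples = []
for _ in range(2000):
    X = rng.normal(size=(n, 2))
    y = X @ W_true + b_true + rng.normal(scale=sigma, size=n)
    A = np.column_stack([X, np.ones(n)])
    coef = np.linalg.lstsq(A, y, rcond=None)[0]
    w1_samples.append(coef[0])

# Analytic standard error from a single training set (the last one):
# estimated noise variance times the diagonal of (A'A)^{-1}.
resid = A @ coef - y
s2 = resid @ resid / (n - A.shape[1])
cov = s2 * np.linalg.inv(A.T @ A)
se_w1 = np.sqrt(cov[0, 0])

empirical_sd = np.std(w1_samples)
```

With this seed and sample size, the analytic SE from one training set comes out close to the standard deviation of w1 across the 2000 simulated training sets, and the mean of the w1[s] values is close to the true W1.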
Now we can address the question of how large w1 needs to be to provide evidence against H0. As any elementary statistics textbook will tell you, a normal random variable will fall within one standard deviation of the mean about two-thirds of the time, within two standard deviations of the mean about 95% of the time, and within three standard deviations of the mean about 99.7% of the time. Consider the following decision rules:

   * Reject H0 if |w1| > 2*SE(w1). If H0 is true, this rule will reject it
     about 5% of the time.
   * Reject H0 if |w1| > 3*SE(w1). If H0 is true, this rule will reject it
     only about 0.3% of the time.

The probability of incorrectly rejecting H0 when it is in fact true is called the significance level, alpha; a stricter rule gives a smaller alpha.
But there is a trade-off. If you make alpha small, you increase the probability of "type 2" error--failing to reject H0 when H0 is false. The probability of type 2 error is called beta, and the study of the trade-off between alpha and beta is called "power analysis" (power is 1-beta). It is generally not possible to compute beta because the alternative hypothesis H1, unlike the null hypothesis H0, does not provide a specific value for W1. But you can report beta values corresponding to a range of alpha and W1 values, preferably in the form of plots called "power curves".
Because of the alpha-beta trade-off, different people are likely to have different opinions about the appropriate significance level in any particular application. Therefore it is advisable not just to report decision(s) at one or two significance levels, but to report "p-values." The p-value for H0 is the probability that |w1[s]| >= |w1| given that H0 is true. In other words, the p-value is the probability in repeated sampling of obtaining an absolute weight at least as large as the absolute weight computed from the actual training set. The p-value has the useful property that you can reject H0 at a significance level alpha if and only if the p-value is less than or equal to alpha.
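Under the normal approximation described above, the two-sided p-value is a one-liner; this sketch uses only the standard library:

```python
import math

def p_value_two_sided(w, se):
    """Two-sided p-value for H0: W = 0, using the normal approximation:
    p = P(|w1[s]| >= |w1| given H0) = erfc(|w1|/SE / sqrt(2))."""
    z = abs(w) / se
    return math.erfc(z / math.sqrt(2.0))
```

For example, a weight 1.96 standard errors from zero gives a p-value of about 0.05, matching the familiar 5% significance level.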
The null and alternative hypotheses can be reformulated as follows:
H0: The linear model including X1 as an input generalizes exactly as well as a linear model without X1 but including all the other inputs of interest
H1: The linear model including X1 generalizes better than a linear model without X1
If you are interested in predictive importance, what matters is the change in generalization error when an input is omitted. If an input has a weight with a p-value of 50% or greater, it is safe to say that the input is of little predictive importance. A very small p-value, perhaps 0.1% divided by the number of inputs, indicates that the input is likely to be of some use for generalization, but not necessarily of much use. The reason, again, is that p-values can be very small because of a large number of training cases rather than because of high predictive importance.
In some branches of science, there is a tradition of publishing experimental results only if the null hypothesis can be rejected at some conventional significance level, usually 5%. Many statisticians view such conventions as superstition rather than valid statistical inference because these rituals ignore both the alpha-beta trade-off and the distinction between statistical significance and practical importance.
Consider an MLP with output Y, three inputs (X1, X2, X3), and a single hidden layer with five tanh units (H1 through H5), with weights as given in the following table:
             To:
           H1     H2     H3     H4     H5      Y
   From:
   Bias     25     25    150    150     0    -0.1
   X1      100   -100      0      0     0
   X2        0      0    100   -100     0
   X3        0      0      0      0     1
   H1                                         0.1
   H2                                         0.1
   H3                                         0.1
   H4                                         0.1
   H5                                         1.0

The output function can be written as follows:
   Y = .1*tanh(100*(X1+.25)) - .1*tanh(100*(X1-.25)) - .1
     + .1*tanh(100*(X2+1.5)) - .1*tanh(100*(X2-1.5))
     + tanh(X3)

The output Y is the sum of three functions written on the three lines above. Each of these three functions depends on only one of the inputs. A model such as this in which the output is the sum of univariate (nonlinear) transformations of the inputs is called an "additive" model. It is easier to assess the importance of inputs in an additive model than in the general case because additivity implies that the effect of one input does not depend on the values of the other inputs. Thus, to understand the properties of the output function, you can consider the inputs one at a time, instead of having to visualize a 3-D nonlinear manifold in a 4-D space. Assuming each input is distributed over the interval [-3,3], the three additive functions appear as in the following plot (to the limited resolution of a plain-text file):
   [Plot omitted: the three additive functions over [-3,3], with curves
   labeled 1 (function of X1), 2 (function of X2), and 3 (function of X3).
   Curve 1 is a narrow bump rising from -0.1 to 0.1 on (-0.25,0.25); curve 2
   is a wide bump rising from 0 to 0.2 on (-1.5,1.5); curve 3 is a smooth
   tanh curve ranging from -1 to 1.]

The output has two abrupt changes in response to X1, corresponding to two large weights. The positions of these abrupt changes are close together, so except for a narrow interval containing these two abrupt changes, the output does not depend on X1 at all. The output also has two abrupt changes in response to X2, again corresponding to two large weights. The positions of these abrupt changes for X2 are farther apart than for X1, so it is clear that X1 and X2 have different effects on the output, and for most practical purposes, X2 would be considered more important than X1. It is also obvious that X3 is associated with much larger changes in the output than either X1 or X2, and for most practical purposes, X3 would be considered the most important input.
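It is worth confirming that the weight table really does implement this additive function. The following NumPy sketch runs the forward pass of the 3-5-1 tanh network and compares it with the three-term formula at random points:

```python
import numpy as np

# Weights from the table above: rows of W_ih are X1..X3, columns are H1..H5.
W_ih = np.array([[100., -100.,   0.,    0., 0.],
                 [  0.,    0., 100., -100., 0.],
                 [  0.,    0.,   0.,    0., 1.]])
b_h  = np.array([25., 25., 150., 150., 0.])
w_ho = np.array([0.1, 0.1, 0.1, 0.1, 1.0])
b_o  = -0.1

def mlp(x):
    """Forward pass through the 3-5-1 tanh network."""
    return np.tanh(x @ W_ih + b_h) @ w_ho + b_o

def additive(x):
    """The three-term additive formula given in the text."""
    x1, x2, x3 = x[..., 0], x[..., 1], x[..., 2]
    return (0.1*np.tanh(100*(x1 + 0.25)) - 0.1*np.tanh(100*(x1 - 0.25)) - 0.1
            + 0.1*np.tanh(100*(x2 + 1.5)) - 0.1*np.tanh(100*(x2 - 1.5))
            + np.tanh(x3))

rng = np.random.default_rng(0)
pts = rng.uniform(-3, 3, size=(100, 3))
```

The two functions agree (up to floating-point rounding) everywhere on the input cube.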
To summarize the importance of the inputs in this example, recall that the inputs are assumed to be independent and the output function is additive. Hence the importance of one input does not depend on what other inputs are included in the model or on any interaction between the inputs, so the importance of any one input can be assessed without regard to the other inputs. The inputs can therefore be ordered from least important to most important as follows:

   X1 (least important), X2, X3 (most important)
A huge input-to-hidden weight does not necessarily mean that the input has a huge effect on the output, since the "squashing" functions of the hidden units limit that effect. Huge input-to-hidden weights usually indicate abrupt changes in the output, as would occur if the network were trying to approximate a discontinuity. But the size of the weight is related primarily to the abruptness of the change, not to the size of the change.
A tiny input-to-hidden weight does not necessarily mean that the input has a tiny effect on the output, since that effect can be amplified by the hidden-to-output weights. In fact it is quite common to have tiny input-to-hidden weights and huge hidden-to-output weights; some reasons for this are explained by Cardell, Joerding, and Li (1994).
The main advantage of raw weights over standardized weights in linear models is that the true raw weights (i.e. those that give the best possible generalization) do not depend on what region of the input space you want to generalize to, as long as that region is nonsingular. This invariance results from the fact that every point on a plane has the same slope and therefore the same weights apply. But when you are fitting an MLP to a nonlinear surface, different hidden units may be important in different regions of the input space. Consider a surface produced by the formula:
   Y = tanh(X1) + tanh(X2)

If you consider only cases with X1 > 3, an MLP with one hidden unit depending only on X2 will generalize very well. If you consider only cases with X2 > 3, an MLP with one hidden unit depending only on X1 will generalize very well. The weights for these two MLPs are completely different. For MLPs, the weights that give the best generalization can depend on what region of the input space you want to generalize to.
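The first of these claims is easy to check numerically: in the region X1 > 3, tanh(X1) is nearly constant at 1, so a model that ignores X1 entirely fits almost perfectly. A small sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(3, 6, 1000)     # restrict attention to cases with X1 > 3
x2 = rng.uniform(-3, 3, 1000)
y = np.tanh(x1) + np.tanh(x2)    # the true surface

# A model depending only on X2, valid only in this region:
y_hat = 1.0 + np.tanh(x2)
max_err = np.max(np.abs(y - y_hat))
```

Since 1 - tanh(3) is about 0.005, the worst-case error of the X2-only model in this region is under 0.005.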
For the additive function example, the sum of the absolute input-to-hidden weights is shown for each input (sums of squared weights would produce similar results):
   Input   Sum of absolute input weights
   -----   -----------------------------
   X1      200
   X2      200
   X3        1

Thus, the weights suggest two incorrect conclusions:

   * X1 and X2 are equally important.
   * X1 and X2 are far more important than X3.
A tempting measure that incorporates both layers of weights is the sum, over hidden units, of the absolute products of the weights along each path from an input to the output:

   Importance of Xi = SUM |W(Xi,Hj)*W(Hj,O)|
                       j

where i indexes the inputs, j indexes the hidden units, W(Xi,Hj) indicates the weight connecting input Xi to hidden unit Hj, and W(Hj,O) indicates the weight connecting hidden unit Hj to the output. If the activation functions for the hidden and output units have bounded derivatives, the formula above can be used to obtain an upper bound for the change in output corresponding to a given change in the input. Hence if the formula yields a very small number, you can conclude that the input is not important (for a related discussion of pruning individual weights, see Setiono, 1997). However, the converse does not hold: if the formula yields a large number, you cannot conclude that the input is important.
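Applied to the example network, the formula is a one-liner in NumPy: it gives 20 for X1 and X2 and 1 for X3.

```python
import numpy as np

# Input-to-hidden weights (rows: X1..X3, columns: H1..H5) and
# hidden-to-output weights, taken from the MLP example above.
W_ih = np.array([[100, -100,   0,    0, 0],
                 [  0,    0, 100, -100, 0],
                 [  0,    0,   0,    0, 1]], dtype=float)
w_ho = np.array([0.1, 0.1, 0.1, 0.1, 1.0])

# SUM over j of |W(Xi,Hj) * W(Hj,O)| for each input Xi.
importance = np.abs(W_ih * w_ho).sum(axis=1)   # -> [20., 20., 1.]
```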
For the additive function example, the sum of the absolute products of weights is shown for each input:
   Input   Sum of absolute products of weights
   -----   -----------------------------------
   X1      20
   X2      20
   X3       1

Thus, the sum of the absolute products of weights suggests the same two incorrect conclusions as before:

   * X1 and X2 are equally important.
   * X1 and X2 are far more important than X3.
If the products of weights are summed without taking absolute values, the positive and negative products cancel:

   Input   Sum of products of weights
   -----   --------------------------
   X1      0
   X2      0
   X3      1

Thus, using the sum of the products of weights leads to the incorrect conclusion that X1 and X2 are not important at all.
Incorporating both layers of weights should be an improvement over using only the input-to-hidden layer to measure importance of inputs. But simply taking the product of weights from the two layers ignores the "squashing" effect of the hidden-layer activation function. A given input-to-hidden weight should be adjusted by the amount of squashing, but the amount of squashing depends on the actual value of the corresponding input, as well as the values and weights of the other inputs, and also the bias. If you attempt to take all these complexities into account, you end up using the actual input-output function computed by the network. Measures of importance based on the input-output function are discussed in the following sections.
The gradient is a vector of partial derivatives. Each partial derivative, by definition, gives the local rate of change of the output with respect to the corresponding input, holding the other inputs fixed. Thus a partial derivative has the same interpretation as a weight in a linear model, except for one crucial difference: a weight in a linear model applies to the entire input space, but a partial derivative applies only to a small neighborhood of the input point at which it is computed.
Methods based on these partial derivatives are often referred to as "sensitivity analysis."
It is tempting to compute the partial derivatives at one "typical" point in the input space, such as the centroid, and assume that those derivatives are typical of the entire input space, but this assumption is dangerously false. If the partial derivatives are constant over the input space, then the output function is linear. If you are using a nonlinear neural network, presumably you think it is possible for the output function to have important nonlinearities. If the output function has important nonlinearities, then there will be important variation of the partial derivatives over the input space.
It can also be dangerously misleading to look at the partial derivatives at only a few points in the input space. One method that has been proposed is to vary each input in turn while all the other inputs are fixed at their mean values. But this method can overlook important variation of the partial derivatives. For example, consider a continuous version of the XOR data:
   Y = X1 + X2 - 2*X1*X2

where X1 and X2 vary uniformly over [0,1]. Then the mean of each input is .5, and if you fix one input to .5, you will find that the output is a constant regardless of the value of the other input. In other words, if either input is fixed at its mean value, the partial derivative with respect to the other input is zero.
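A quick numerical illustration of the continuous XOR trap: the derivative with respect to X2 is exactly zero when X1 is clamped at its mean, yet the derivative averaged over the whole input square is substantial. (The step size h and grid resolution here are arbitrary choices.)

```python
import numpy as np

def f(x1, x2):
    return x1 + x2 - 2*x1*x2   # continuous XOR

# Partial derivative wrt X2 with X1 clamped at its mean of 0.5,
# estimated by central differences:
h = 1e-6
d_at_mean = (f(0.5, 0.3 + h) - f(0.5, 0.3 - h)) / (2*h)

# Average of |dY/dX2| = |1 - 2*X1| over the input space:
x1 = np.linspace(0.0, 1.0, 1001)
avg_abs_deriv = np.mean(np.abs(1 - 2*x1))
```

The clamped derivative is (numerically) zero, while the average absolute derivative is about 0.5, so X2 clearly matters despite the zero derivative at the "typical" point.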
For the additive function example, the partial derivative with respect to each input is shown at the mean of the inputs:
   Input   Partial derivative at mean
   -----   --------------------------
   X1      0
   X2      0
   X3      1

Thus, the partial derivatives at the mean suggest the incorrect conclusion that X1 and X2 are completely unimportant.
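These values can be checked with central differences on the additive output function; the step size h = 1e-5 is an arbitrary choice:

```python
import numpy as np

def f(x):
    """The additive output function of the example MLP."""
    x1, x2, x3 = x
    return (0.1*np.tanh(100*(x1 + 0.25)) - 0.1*np.tanh(100*(x1 - 0.25)) - 0.1
            + 0.1*np.tanh(100*(x2 + 1.5)) - 0.1*np.tanh(100*(x2 - 1.5))
            + np.tanh(x3))

def num_grad(f, x, h=1e-5):
    """Numeric gradient by central differences."""
    g = np.zeros(len(x))
    for i in range(len(x)):
        e = np.zeros(len(x))
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2*h)
    return g

g = num_grad(f, np.zeros(3))   # gradient at the mean of the inputs
```

The result is (numerically) 0 for X1 and X2 and 1 for X3, matching the table above.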
For the additive function example, the average partial derivatives with respect to each input are shown:
   Input   Average partial derivative
   -----   --------------------------
   X1      0
   X2      0
   X3      0.33

Thus, the average partial derivatives suggest the incorrect conclusion that X1 and X2 are completely unimportant.
In the additive function example, X1 and X2 have the same mean partial derivative. They also have the same mean absolute derivative and the same mean squared derivative. In fact, the partial derivatives for X1 and X2 have exactly the same frequency distribution, so it is impossible to tell which one is more important based on the partial derivatives alone. The average absolute partial derivatives with respect to each input are shown:
   Input   Average absolute partial derivative
   -----   -----------------------------------
   X1      0.5
   X2      0.5
   X3      0.33

Thus, the average absolute partial derivatives suggest two incorrect conclusions:

   * X1 and X2 are equally important.
   * X1 and X2 are more important than X3.
Instead of partial derivatives, you can compute finite differences such as:

   D1 = f(X1+h, X2, X3) - f(X1, X2, X3)

for a large, representative sample of input points, and then take the average absolute value or square of D1. But how do you choose h? If the output function is periodic, such as Y = sin(X1) + sin(2*X2) + sin(3*X3), D1 will be zero when h is a multiple of the period, but large when h is an odd multiple of half the period. So the safest thing to do is to look at a range of h values. The following plot shows the mean absolute difference in the output as a function of h for the three inputs in the additive function used in the example above:
   [Plot omitted: mean absolute difference in the output as a function of h,
   for h from 0 to 6. Curve 3 (X3) rises steadily to about 2, while curves
   1 (X1) and 2 (X2) stay near the bottom of the plot, with curve 2 slightly
   above curve 1 for most values of h.]

The order of importance of the inputs is evident in this plot, but in general there remains the problem of how to reduce the information to a single value for each input. Presumably this would involve averaging over h, but there are many different ways to do the averaging. When the inputs have different units of measurement and different distributions, it is not obvious how to select appropriate values for h. Suppose X1 is uniformly distributed on [0,1] and X2 is a binary variable with values in {0,1} with equal probability. For X2, the only sensible value of h is 1, but for X1 you would want to use various values of h intermediate between 0 and 1. How do you average over h in a way that fairly represents both X1 and X2?
It can be tempting to try to reduce the magnitude of the task by clamping all inputs but one at "typical" values and looking at differences produced by changing only one input at a time. But this approach is incorrect except for linear models, as discussed under "Why partial derivatives can be misleading."
For the additive function example, if absolute differences are averaged over all pairs of input values, the following results are obtained:
   Input   Average absolute difference
   -----   ---------------------------
   X1      .030
   X2      .099
   X3      .916

Thus, this type of average absolute difference correctly indicates the order of importance of the inputs.
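Because the example is additive, each component function can be handled separately, so this measure is cheap to compute. A NumPy sketch that uses an evenly spaced grid on [-3,3] as the "sample" and averages |f(x) - f(x')| over all pairs:

```python
import numpy as np

# The three additive component functions from the example.
def f1(x): return 0.1*np.tanh(100*(x + 0.25)) - 0.1*np.tanh(100*(x - 0.25)) - 0.1
def f2(x): return 0.1*np.tanh(100*(x + 1.5)) - 0.1*np.tanh(100*(x - 1.5))
def f3(x): return np.tanh(x)

x = np.linspace(-3, 3, 2001)
mad = {}
for name, f in [("X1", f1), ("X2", f2), ("X3", f3)]:
    v = f(x)
    # Mean |f(x) - f(x')| over all pairs of input values.
    mad[name] = np.mean(np.abs(v[:, None] - v[None, :]))
```

The grid approximation lands close to the tabled values (.030, .099, .916) and, more importantly, reproduces the correct ordering of the inputs.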
It is important to retrain the network after removing the input. Simply deleting an input unit from the network without retraining is equivalent to clamping the value of that input to zero for all cases. If zero is not a reasonable value for that input, the network outputs are likely to be nonsense.
Instead of clamping the input to a constant value of zero, it would be better to clamp it to a typical value such as the mean of that input. But you may not get a typical output value from a typical input value. In the example above with an additive output function, the mean value of X1 produces unusually high values of the output, and clamping X1 to its mean value increases the RMSE by 0.19. The mean value of X2 also produces high outputs, but not as unusually high as for X1, so clamping X2 to its mean value increases the RMSE by only 0.16. Thus X1 appears more important than X2.
For the additive function example, the changes in RMSE, with and without retraining, produced by omitting each input, are as follows:
            ........ Change in RMSE ........
   Input    No retraining   With retraining
   -----    -------------   ---------------
   X1       .190            .057
   X2       .140            .100
   X3       .820            .808

Thus, the change in RMSE with no retraining incorrectly suggests that X1 is more important than X2. But the change in RMSE with retraining yields the correct order of importance.
Retraining the network can, of course, be time-consuming. To be safe, you should go through the full training process including such things as multiple random weight initializations. But if the input being omitted is not related to any other inputs, it is probably adequate to clamp that input to a typical value and use the weights from the original network as initial values for retraining.
If you have many more training cases than weights in the network, it may be more efficient to approximate the change in the error function using the Hessian matrix instead of retraining the network (often called "optimal brain surgeon," OBS, in the neural net literature; see Stahlberger and Riedmiller, 1997; Hassibi and Stork, 1993; Hassibi, Stork, Wolff, and Watanabe 1994). This approximation may be poor if the number of training cases does not greatly exceed the number of weights, if the optimal weights are infinite (Cardell, Joerding, and Li 1994), or if the hidden units are not statistically well-identified.
Conventional statistical methods for nonlinear models (Gallant, 1987) can be used to test the null hypothesis that a given input has zero predictive importance. P-values can be computed in two main ways: likelihood ratio tests and Wald tests. Likelihood ratio tests compare the error functions for networks trained with and without the input in question. Wald tests use a quadratic approximation to the error function as in OBS. The interpretation of p-values is subject to the same caveats discussed under "Why statistical p-values can be misleading." Accurate p-values also require regularity conditions, such as those mentioned above regarding OBS, as well as other more technical statistical conditions described by Gallant (1987) and White (1992).
In the additive function example, if the values of X1 and X3 are restricted to differ by no more than 0.2 (causing those inputs to be highly correlated), the changes in RMSE produced by omitting each input are as follows:
            ........ Change in RMSE ........
   Input    No retraining   With retraining
   -----    -------------   ---------------
   X1       .191            .028
   X2       .143            .100
   X3       .799            .061

The results with no retraining are essentially the same as with independent inputs. But with retraining, both X1 and X3 appear less important, especially X3, which by this measure is now less important than X2.
Baxt, W.G. and White, H. (1995) "Bootstrapping confidence intervals for clinical input variable effects in a network trained to identify the presence of acute myocardial infarction", Neural Computation, 7, 624-638.
Boger, Z., and Guterman, H. (1997), "Knowledge extraction from artificial neural network models," IEEE Systems, Man, and Cybernetics Conference, Orlando, FL.
Cardell, N.S., Joerding, W., and Li, Y. (1994), "Why Some Feedforward Networks Cannot Learn Some Polynomials," Neural Computation, 6, 761-766.
Darlington, R.B. (1968), "Multiple Regression in Psychological Research and Practice," Psychological Bulletin, 69, 161-182.
Gallant, A.R. (1987) Nonlinear Statistical Models, NY: Wiley.
Hassibi, B. and Stork, D.G. (1993) "Second order derivatives for network pruning: Optimal Brain Surgeon" in Hanson, S.J., Cowan, J.D. and Giles, C.L., eds., Advances in Neural Information Processing Systems 5, 164-171, San Mateo, CA: Morgan-Kaufmann.
Hassibi, B., Stork, D.G., Wolff, G., and Watanabe, T. (1994), "Optimal Brain Surgeon: Extensions and performance comparisons," in Cowan, J.D., Tesauro, G., and Alspector, J. (eds.) Advances in Neural Information Processing Systems 6, San Mateo, CA: Morgan-Kaufmann, pp. 263-270.
Masters, T. (1994) Practical Neural Network Recipes in C++, San Diego: Academic Press.
Setiono, R. (1997), "A penalty-function approach for pruning feedforward neural networks," Neural Computation, 9, 185-204.
Stahlberger, A., and Riedmiller, M. (1997), "Fast network pruning and feature extraction using the Unit-OBS algorithm," in Mozer, M.C., Jordan, M.I., and Petsche, T., (eds.) Advances in Neural Information Processing Systems 9, Cambridge, MA: The MIT Press, pp. 655-661.
White, H. (1992), Artificial Neural Networks: Approximation and Learning Theory, Blackwell.