How to measure importance of inputs?

Mar 6, 1998; Revised Aug 2, 1999

Warren S. Sarle, SAS Institute Inc., Cary, NC, USA

Copyright 1997, 1998, 1999 by Warren S. Sarle, Cary, NC, USA

Introduction

Terms such as "importance," "saliency," and "sensitivity" do not have precise, widely-accepted meanings. This answer will discuss a variety of methods that have been proposed to measure the importance of inputs, but the list is by no means exhaustive. Different measures of importance are likely to be useful in different applications of neural nets. The main point of this answer is that there is no single measure of importance that is appropriate for all applications. As Masters (1994) says (p. 191),
The question of which features in the training set are used by a particular feedforward network can be excruciatingly difficult to answer. It is easier to discuss tempting methods that do not work than it is to find methods that do, so that will be done first.
The two most common notions of importance are predictive importance and causal importance. Predictive importance is concerned with the increase in generalization error when an input is omitted from a network. Causal importance is concerned with situations where you can manipulate the values of the inputs and you want to know how much the outputs will change. Predictive importance and causal importance are the same only in very limited circumstances.

The predictive importance and causal importance of a given input both depend on what other inputs are used in the network. Marginal importance considers each input in isolation. There are many measures of marginal importance that are easy to compute without even training a network, such as Pearson correlation, rank correlation, Hoeffding's measure of dependence, and mutual information. But marginal importance is of little practical use other than for a preliminary description of the data. An input with high marginal importance may have low causal and predictive importance, and an input with low marginal importance may have high causal and predictive importance.

Even in a linear model, it is not generally possible to come up with a single number for the importance of each input. But understanding measures of importance for linear models is a big step towards understanding measures of importance for neural nets, so this answer will begin with a discussion of linear models. Measures of importance for linear models are described in more detail by Darlington (1968).

This article does not address the problem of selecting an optimal subset of inputs. Except in special cases, measures of the importance of single inputs cannot be combined to tell you the importance of a subset of inputs.

Linear models

A linear model is a feedforward NN with no hidden layer and an identity output activation function. If there is one output Y and three inputs X1, X2, and X3, the model is:
   Y = b + w1*X1 + w2*X2 + w3*X3 + noise
where b is the bias and w1, w2, and w3 are the connection weights.

For example, with these training data:

    Y   X1  X2   X3  
   ----------------
    7    1   2  500  
    3    2   1  100  
    6    3   4  200  
    9    4   3  600  
   12    5   6  300  
   15    6   5  700  
   18    7   8  800  
   14    8   7  400  
The weights learned by least squares are:
   Input  Weight
   -----  ------
   X1     0.506250
   X2     1.006250
   X3     0.008750
   bias  -0.243750
The mean squared training error is 0.6125.
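The fit above can be reproduced with a short script (a sketch, assuming NumPy is available):

```python
# Least-squares fit of the linear model example above.
# np.linalg.lstsq solves for the bias and the three connection weights.
import numpy as np

Y  = np.array([7, 3, 6, 9, 12, 15, 18, 14], dtype=float)
X1 = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
X2 = np.array([2, 1, 4, 3, 6, 5, 8, 7], dtype=float)
X3 = np.array([500, 100, 200, 600, 300, 700, 800, 400], dtype=float)

# Design matrix with a leading column of ones for the bias term.
X = np.column_stack([np.ones(8), X1, X2, X3])
w = np.linalg.lstsq(X, Y, rcond=None)[0]
mse = np.mean((Y - X @ w) ** 2)

print(w)    # bias, w1, w2, w3
print(mse)  # mean squared training error
```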

Weights in linear models

In linear models, the weights have a simple interpretation: each weight is the change in the output associated with a unit change in the corresponding input, assuming all other inputs are held fixed. Whether this interpretation is useful largely depends on whether one input can in fact change independently of the other inputs. For example, if the data are from an industrial process in which all the inputs are controlled by an operator, and the operator can change inputs independently of each other, then the interpretation of the weights is directly applicable, and the weights are directly related to causal importance of the inputs. But if the inputs include characteristics of raw materials that the operator cannot control and that are not independent, the interpretation of the weights is of questionable relevance.

Why comparing weights in linear models can be misleading

Consider the linear model:
   Y = b + w1*X1 + w2*X2 + w3*X3 + noise
Suppose X1 is measured in meters, but you want to convert it to millimeters. Since the conversion multiplies X1 by 1000, you have to divide w1 by 1000. Similarly, if you want to convert X1 to kilometers, you have to divide X1 by 1000 and multiply w1 by 1000. Thus the size of w1 depends entirely on the units of measurement of X1. Likewise, the size of w2 depends entirely on the units of measurement of X2. So unless X1 and X2 are measured in comparable units, the comparison of w1 and w2 is meaningless.

For the data in the linear model example above, X3 has by far the smallest weight. But X3 has much larger values, and a larger range of values, than the other inputs. In this example, X1 and X2 were measured in meters, while X3 was measured in centimeters. If you convert X3 to meters, the weights become:

   Input  Weights based on common units
   -----  -----------------------------
   X1     0.506250
   X2     1.006250
   X3     0.875000
   bias  -0.243750
Now X1 has the smallest weight.
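The units argument can be checked directly (a sketch, assuming NumPy): rescaling an input by a factor c rescales its least-squares weight by 1/c, so converting X3 from centimeters to meters multiplies its weight by 100.

```python
# Refit the linear model example with X3 in centimeters and in meters;
# the X3 weight changes by exactly the conversion factor.
import numpy as np

Y  = np.array([7, 3, 6, 9, 12, 15, 18, 14], dtype=float)
X1 = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
X2 = np.array([2, 1, 4, 3, 6, 5, 8, 7], dtype=float)
X3 = np.array([500, 100, 200, 600, 300, 700, 800, 400], dtype=float)

def fit(*inputs):
    X = np.column_stack([np.ones(len(Y))] + list(inputs))
    return np.linalg.lstsq(X, Y, rcond=None)[0]

w_cm = fit(X1, X2, X3)        # X3 in centimeters
w_m  = fit(X1, X2, X3 / 100)  # X3 in meters

print(w_cm[3], w_m[3])  # 0.00875 vs 0.875: same model, different units
```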

Why comparing standardized weights in linear models can be misleading

If you want to try to interpret weights when the inputs are not measured in comparable units, one thing you can do is standardize the inputs, i.e., divide each input by its standard deviation. Standardization also involves subtracting the mean, but that has no effect on weights other than biases. When the inputs are standardized, each input is measured in units of standard deviations.

You can train the network using standardized inputs (often a good idea), or you can train the network on the raw inputs and then multiply each input weight by the standard deviation of the input. Either way, you get the same standardized weights (barring local minima, convergence problems, etc.). In linear models, it is also customary to standardize targets, but that does not affect comparisons of input weights--for the purposes of this discussion, the crucial thing is to standardize the inputs.
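For the linear model example above, the standardized weights can be computed by the second route (a sketch, assuming NumPy): fit on the raw inputs and multiply each weight by the standard deviation of its input.

```python
# Standardized weights for the linear model example: raw weight times
# the (sample) standard deviation of the corresponding input.
import numpy as np

Y = np.array([7, 3, 6, 9, 12, 15, 18, 14], dtype=float)
X = np.column_stack([
    [1, 2, 3, 4, 5, 6, 7, 8],                  # X1
    [2, 1, 4, 3, 6, 5, 8, 7],                  # X2
    [500, 100, 200, 600, 300, 700, 800, 400],  # X3
]).astype(float)

A = np.column_stack([np.ones(8), X])
w = np.linalg.lstsq(A, Y, rcond=None)[0]
w_std = w[1:] * X.std(axis=0, ddof=1)

print(w_std)  # standardized weights for X1, X2, X3
```

On this common scale X3 no longer looks negligible, and the ordering (X2 > X3 > X1) differs from both raw-weight orderings above.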

Standardized weights can be compared meaningfully if the standard deviations are meaningful. For the standard deviations to be meaningful, it is usually necessary for the input cases to be a representative sample from the set of all cases you want to be able to generalize to.

For example, suppose you want to use a linear model to predict job performance ratings for people who apply for jobs at your company. There are two inputs:

  1. Score on an employment test
  2. Grade point average (GPA)

If you hire everybody who applies for a job and use a representative sample of these people for the training cases, then the standard deviations of the two inputs are meaningful descriptors of the pool of job applicants. Thus it is legitimate to compare standardized input weights. But if you hire only people whose test score exceeds some cutoff score, while ignoring GPA, the standard deviation of test scores in your training set will be artificially reduced. The higher the cutoff score, the smaller the standard deviation of the test scores in the training set. The smaller the standard deviation, the smaller the standardized weight for test score. Which input has the higher standardized weight may depend more on your choice of cutoff score than it does on the importance of the inputs.
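The selection effect can be simulated (a sketch, assuming NumPy; the scores are synthetic normal deviates, purely for illustration): keeping only cases whose score exceeds a cutoff shrinks the standard deviation of the score in the training set.

```python
# Truncation shrinks the standard deviation: compare the sd of all
# synthetic test scores with the sd of the scores above a cutoff.
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(loc=100, scale=15, size=100_000)

sd_all = scores.std()
sd_cut = scores[scores > 115].std()  # hire only above one sd

print(sd_all, sd_cut)  # the truncated sd is much smaller
```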

If job performance really is linearly related to test score and GPA, changing the cutoff score will not affect the true raw (nonstandardized) weights. In fact, the true raw weights will not be affected by any method of selecting cases based solely on the inputs, as long as the distribution of the inputs is nonsingular.

Why comparing changes in the error function in linear models can be misleading

Another way to measure the importance of an input is to omit it from the model, retrain the model, and see how much the error function increases. The change in the error function is a direct measure of predictive importance, but this measure can be misleading when the inputs are correlated. For the data in the linear model example above, the change in MSE produced by omitting each input is shown in the following table:
   Omitted  Change 
   Input    in MSE
   -----    -------
   X1       0.24051
   X2       0.95018
   X3       3.06250
X1 and X2 appear much less important than X3 because X1 and X2 are strongly correlated with each other as shown in the following correlation matrix:
            X1       X2       X3
   X1  1.00000  0.90476  0.47619
   X2  0.90476  1.00000  0.47619
   X3  0.47619  0.47619  1.00000
If you omit X1 from the model and retrain, X2's weight increases to compensate. If you omit X2 from the model and retrain, X1's weight increases to compensate. If you omit X3 from the model, there is no other highly correlated input to compensate. Thus, X3 is more important for prediction than either X1 or X2 considered individually.

However, it would be incorrect to conclude that X1 and X2 are jointly unimportant. If both X1 and X2 are omitted from the model, the MSE increases by 8.77738, which is much greater than the sum of the increases (1.19069) resulting from omitting each input individually.
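The omission table and the joint omission above can be reproduced directly (a sketch, assuming NumPy): refit the linear model with each input, and with both X1 and X2, left out, and record the increase in training MSE.

```python
# Change in training MSE when inputs are omitted from the linear
# model example and the model is refit by least squares.
import numpy as np

Y  = np.array([7, 3, 6, 9, 12, 15, 18, 14], dtype=float)
X1 = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
X2 = np.array([2, 1, 4, 3, 6, 5, 8, 7], dtype=float)
X3 = np.array([500, 100, 200, 600, 300, 700, 800, 400], dtype=float)

def mse(*inputs):
    X = np.column_stack([np.ones(len(Y))] + list(inputs))
    w = np.linalg.lstsq(X, Y, rcond=None)[0]
    return np.mean((Y - X @ w) ** 2)

full = mse(X1, X2, X3)
changes = {
    "X1":    mse(X2, X3) - full,
    "X2":    mse(X1, X3) - full,
    "X3":    mse(X1, X2) - full,
    "X1,X2": mse(X3) - full,
}
print(changes)
```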

If the inputs are uncorrelated, the change in MSE produced by omitting an input is proportional to the square of that input's standardized weight, regardless of which inputs are included in the model. If the inputs are correlated, there is not necessarily a monotonic relation between change in MSE and the squared standardized weights. Hence it is only with uncorrelated inputs that the change in MSE is a measure of both causal and predictive importance.

Why statistical p-values can be misleading: A brief introduction to significance tests

This section is for people who know enough statistics to have heard of a "p-value" but not enough to be sure what p-values are good for. The rest of you can skip this section.

In statistical applications, it is common to test hypotheses about the "true" values of the weights. "True" means the optimal value for the entire population that you want to generalize to. For example, you could test the "null" hypothesis:

H0: W1 is exactly zero

against the alternative hypothesis:

H1: W1 is nonzero

where W1 represents the true weight for the input variable X1, uppercase being used to distinguish the true weight from the learned weight w1 obtained with a given training set (this notation is not conventional but is used here because html does not support conventional statistical formulas). Even if the null hypothesis H0 is true, for most training sets, w1 will not be zero. But obviously, very small values of w1 are consistent with H0, while very large values of w1 provide evidence against H0. The question is, how large should w1 be to reject H0? The traditional frequentist (Bayesians have different ideas) answer to this question involves the "sampling distribution" of w1. To understand sampling distributions, you need to conduct a thought experiment.

In real life, you have a single training set from which to learn w1. Hopefully, this training set is a representative sample of the population. In some applications, the training set may have been obtained by taking a random sample of the population. If so, you could imagine what might have happened if you had used a different seed for your pseudo-random number generator, or if the coin you flipped had landed differently--you would have obtained a different training set and probably a different value of w1. Imagine taking 1000 different random samples, numbered s=1,...,1000, to use as training sets. For each of these 1000 samples, you train a linear model and obtain an estimate of W1, say w1[s]. You can collect all of these values, w1[1],...,w1[1000], into a data set and look at their distribution. You could draw a histogram of the w1[s] values, compute their mean and standard deviation, etc. Now consider taking an infinite number of random samples and computing an infinite number of w1[s] values--that infinite collection is the sampling distribution of w1.

Fortunately, in linear models trained by least squares, it is possible to estimate the sampling distribution of w1[s]--or any other weight in the model, or even any linear combination of the weights--without doing an infinite number of calculations. Assuming that the noise distribution is not too pathological, the result is that for large enough training sets, w1[s] has an approximately normal distribution with mean W1 and variance equal to the product of (a) the variance of the noise and (b) the corresponding diagonal element of the inverse Hessian matrix. The square root of the variance of w1[s] is called the "standard error" of w1[s], say SE(w1).

Now we can address the question of how large w1 needs to be to provide evidence against H0. As any elementary statistics textbook will tell you, a normal random variable will fall within one standard deviation of the mean about two-thirds of the time, within two standard deviations of the mean about 95% of the time, and within three standard deviations of the mean about 99.7% of the time. Consider the following decision rules:

  1. Reject H0 if |w1| > SE(w1)
  2. Reject H0 if |w1| > 2 SE(w1)
  3. Reject H0 if |w1| > 3 SE(w1)
If H0 is true, rule (1) will result in error about one-third of the time, i.e. in one-third of the infinite number of random training sets used in the definition of the sampling distribution of w1. If H0 is true, rule (2) will be in error about 5% of the time, while rule (3) will be in error about 0.3% of the time. This kind of error--rejecting H0 when H0 is true--is called "type 1" error. By adjusting the factor that multiplies SE(w1) in the decision rule, you can obtain any probability of error you want between 0 and 1. In other words, you can control the probability of type 1 error, which statisticians call the "significance level" or alpha.

But there is a trade-off. If you make alpha small, you increase the probability of "type 2" error--failing to reject H0 when H0 is false. The probability of type 2 error is called beta, and the study of the trade-off between alpha and beta is called "power analysis" (power is 1-beta). It is generally not possible to compute beta because the alternative hypothesis H1, unlike the null hypothesis H0, does not provide a specific value for W1. But you can report beta values corresponding to a range of alpha and W1 values, preferably in the form of plots called "power curves".

Because of the alpha-beta trade-off, different people are likely to have different opinions about the appropriate significance level in any particular application. Therefore it is advisable not just to report decision(s) at one or two significance levels, but to report "p-values." The p-value for H0 is the probability that |w1[s]| >= |w1| given that H0 is true. In other words, the p-value is the probability in repeated sampling of obtaining an absolute weight at least as large as the absolute weight computed from the actual training set. The p-value has the useful property that you can reject H0 at a significance level alpha if and only if the p-value is less than or equal to alpha.
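For the linear model example above, standard errors and p-values can be computed with the usual ordinary-least-squares formulas (a minimal sketch, assuming NumPy and SciPy are available; the t distribution with n-p degrees of freedom replaces the normal approximation for this tiny training set):

```python
# OLS standard errors and two-sided p-values for H0: Wi = 0.
import numpy as np
from scipy import stats

Y = np.array([7, 3, 6, 9, 12, 15, 18, 14], dtype=float)
X = np.column_stack([
    np.ones(8),
    [1, 2, 3, 4, 5, 6, 7, 8],
    [2, 1, 4, 3, 6, 5, 8, 7],
    [500, 100, 200, 600, 300, 700, 800, 400],
]).astype(float)

n, p = X.shape
w = np.linalg.lstsq(X, Y, rcond=None)[0]
resid = Y - X @ w
s2 = resid @ resid / (n - p)                        # noise-variance estimate
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))  # standard errors
t = w / se
pvals = 2 * stats.t.sf(np.abs(t), df=n - p)

print(pvals)  # p-values for bias, w1, w2, w3
```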

The null and alternative hypotheses can be reformulated as follows:

H0: The linear model including X1 as an input generalizes exactly as well as a linear model without X1 but including all the other inputs of interest

H1: The linear model including X1 generalizes better than a linear model without X1

Significance tests for these hypotheses can be done by looking at the sampling distribution of the change in MSE when X1 is excluded from the model. Such tests take a different mathematical route to arrive at the same destination as the significance tests based on the sampling distribution of w1[s]. In other words, p-values derived from the sampling distribution of the change in MSE are identical to p-values derived from the sampling distribution of w1[s]. The p-value can be viewed as an inverse measure of evidence against H0--the smaller the p-value, the greater the evidence against H0 (Bayesians don't accept this interpretation). The amount of evidence depends not only on the size of the true weight but also on quantities such as the number of training cases and the noise variance.

If you are interested in causal importance, what matters is the size of the true weight W1. Knowing that W1 is nonzero is of little use without also having some approximation to the actual value. The p-value does not provide information about the size of W1, since, in the computation of the p-value, the size of W1 is inextricably tied in with other quantities--such as the number of training cases--that are irrelevant to causal importance. Hence a small p-value is not evidence in favor of the causal importance of W1, but only evidence that W1 is not completely unimportant.

If you are interested in predictive importance, what matters is the change in generalization error when an input is omitted. If an input has a weight with a p-value of 50% or greater, it is safe to say that the input is of little predictive importance. A very small p-value, perhaps 0.1% divided by the number of inputs, indicates that the input is likely to be of some use for generalization, but not necessarily of much use. The reason, again, is that p-values can be very small because of a large number of training cases rather than because of high predictive importance.

In some branches of science, there is a tradition of publishing experimental results only if the null hypothesis can be rejected at some conventional significance level, usually 5%. Many statisticians view such conventions as superstition rather than valid statistical inference because these rituals ignore both the alpha-beta trade-off and the distinction between statistical significance and practical importance.

MLP example: An additive function

Neural networks such as multilayer perceptrons (MLPs) are capable of fitting complicated nonlinear functions. However, many of the issues involving importance of inputs can be illustrated with relatively simple functions. This section will describe a simple, nonlinear, noise-free, additive function of three inputs to be used as a running example. Unless otherwise noted, the inputs will be assumed to be statistically independent to further simplify the measurement of importance.

Consider an MLP with output Y, three inputs (X1, X2, X3), and a single hidden layer with five tanh units (H1 through H5), with weights as given in the following table:

              To:  H1   H2     H3   H4    H5     Y
         Bias      25   25    150  150     0   -0.1
         X1       100 -100      0    0     0  
         X2         0    0    100 -100     0  
         X3         0    0      0    0     1  
 From:   H1                                     0.1
         H2                                     0.1
         H3                                     0.1
         H4                                     0.1
         H5                                     1.0
The output function can be written as follows:
   Y = .1*tanh(100*(X1+.25))-.1*tanh(100*(X1-.25))-.1
     + .1*tanh(100*(X2+1.5))-.1*tanh(100*(X2-1.5))   
     +    tanh(X3)                                   
The output Y is the sum of three functions written on the three lines above. Each of these three functions depends on only one of the inputs. A model such as this in which the output is the sum of univariate (nonlinear) transformations of the inputs is called an "additive" model. It is easier to assess the importance of inputs in an additive model than in the general case because additivity implies that the effect of one input does not depend on the values of the other inputs. Thus, to understand the properties of the output function, you can consider the inputs one at a time, instead of having to visualize a 3-D nonlinear manifold in a 4-D space. Assuming each input is distributed over the interval [-3,3], the three additive functions appear as in the following plot (to the limited resolution of a plain-text file):
 1.0 +                                                    3333333333333
     |                                              333333
     |                                           333
     |                                        333
     |                                       33
 0.5 +                                     33
     |                                    33
     |                                   33
     |                 2222222222222222222222222222222 
     |                2             11111             2
 0.0 +22222222222222222            1  3  1            22222222222222222
     |111111111111111111111111111111 3   111111111111111111111111111111
     |                             33
     |                            33
     |                           33
-0.5 +                          33
     |                        33
     |                      333
     |                   333
     |             333333
-1.0 +3333333333333
     |
     -+-------+-------+-------+-------+-------+-------+-------+-------+-
    -3.00   -2.25   -1.50   -0.75   0.00    0.75    1.50    2.25    3.00
The output has two abrupt changes in response to X1, corresponding to two large weights. The positions of these abrupt changes are close together, so except for a narrow interval containing these two abrupt changes, the output does not depend on X1 at all. The output also has two abrupt changes in response to X2, again corresponding to two large weights. The positions of these abrupt changes for X2 are farther apart than for X1, so it is clear that X1 and X2 have different effects on the output, and for most practical purposes, X2 would be considered more important than X1. It is also obvious that X3 is associated with much larger changes in the output than either X1 or X2, and for most practical purposes, X3 would be considered the most important input.

To summarize the importance of the inputs in this example, recall that the inputs are assumed to be independent and the output function is additive. Hence the importance of one input does not depend on what other inputs are included in the model or on any interaction between the inputs, so the importance of any one input can be assessed without regard to the other inputs. The inputs can therefore be ordered from least important to most important as follows:

  1. X1 is least important because it has a small effect over a small interval.
  2. X2 is moderately important because it has a small effect over a large interval.
  3. X3 is very important because it has a large effect over a large interval.
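The additive output function above can be written out directly (a sketch, assuming NumPy). At the center of the input space the three components contribute 0.1, 0.2, and 0, so f(0, 0, 0) = 0.3; over [-3, 3], X3 changes the output by nearly 2, while X1 changes it by at most 0.2.

```python
# The additive function of the MLP example, written out explicitly.
import numpy as np

def f(x1, x2, x3):
    return (0.1 * np.tanh(100 * (x1 + 0.25))
            - 0.1 * np.tanh(100 * (x1 - 0.25)) - 0.1   # X1 component
            + 0.1 * np.tanh(100 * (x2 + 1.5))
            - 0.1 * np.tanh(100 * (x2 - 1.5))          # X2 component
            + np.tanh(x3))                             # X3 component

print(f(0.0, 0.0, 0.0))  # 0.3 at the center of the input space
```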

Why comparing weights in MLPs can be misleading

In MLPs, raw input-to-hidden weights depend on the units of measurement of the inputs, just as in linear models. And standardized input-to-hidden weights depend on the selection of training cases, just as in linear models. But comparing weights in MLPs is even more problematic than comparing weights in linear models. This difficulty arises from the fact that the simple interpretation of weights for linear models does not apply to MLPs due to the hidden layer(s).

A huge input-to-hidden weight does not necessarily mean that the input has a huge effect on the output, since the "squashing" functions of the hidden units limit that effect. Huge input-to-hidden weights usually indicate abrupt changes in the output, as would occur if the network were trying to approximate a discontinuity. But the size of the weight is related primarily to the abruptness of the change, not to the size of the change.

A tiny input-to-hidden weight does not necessarily mean that the input has a tiny effect on the output, since that effect can be amplified by the hidden-to-output weights. In fact it is quite common to have tiny input-to-hidden weights and huge hidden-to-output weights; some reasons for this are explained by Cardell, Joerding, and Li (1994).

The main advantage of raw weights over standardized weights in linear models is that the true raw weights (i.e. those that give the best possible generalization) do not depend on what region of the input space you want to generalize to, as long as that region is nonsingular. This invariance results from the fact that every point on a plane has the same slope and therefore the same weights apply. But when you are fitting an MLP to a nonlinear surface, different hidden units may be important in different regions of the input space. Consider a surface produced by the formula:

   Y = tanh(X1) + tanh(X2)
If you consider only cases with X1 > 3, an MLP with one hidden unit depending only on X2 will generalize very well. If you consider only cases with X2 > 3, an MLP with one hidden unit depending only on X1 will generalize very well. The weights for these two MLPs are completely different. For MLPs, the weights that give the best generalization can depend on what region of the input space you want to generalize to.

For the additive function example, the sum of the absolute input-to-hidden weights (squared weights would produce similar results) is shown for each input:

   Input   Sum of absolute input weights    
   -----   -----------------------------
   X1      200
   X2      200
   X3        1 
Thus, the weights suggest two incorrect conclusions:

  1. X1 and X2 are equally important.
  2. X3 is by far the least important input.

Why sums of products of weights in MLPs can be misleading

In MLPs, the output depends on both input-to-hidden weights and hidden-to-output weights, so it is tempting to try to combine these two sets of weights in a measure of importance. One formula that has been suggested is the sum of absolute products of weights:
   Importance of Xi = SUM |W(Xi,Hj)*W(Hj,O)|
                       j
where i indexes the inputs, j indexes the hidden units, W(Xi,Hj) indicates the weight connecting input Xi to hidden unit Hj, and W(Hj,O) indicates the weight connecting hidden unit Hj to the output. If the activation functions for the hidden and output units have bounded derivatives, the formula above can be used to obtain an upper bound for the change in output corresponding to a given change in the input. Hence if the formula yields a very small number, you can conclude that the input is not important (for a related discussion of pruning individual weights, see Setiono, 1997). However, the converse does not hold: if the formula yields a large number, you cannot conclude that the input is important.

For the additive function example, the sum of the absolute products of weights is shown for each input:

   Input   Sum of absolute products of weights    
   -----   -----------------------------------
   X1      20
   X2      20
   X3       1 
Thus, the sum of the absolute products of weights suggests the same two incorrect conclusions as before: that X1 and X2 are equally important, and that X3 is by far the least important input.

If you omit taking absolute values, as in Boger and Guterman's (1997) "causal index," the results are:
   Input   Sum of products of weights    
   -----   --------------------------
   X1       0
   X2       0
   X3       1 
Thus, using the sum of the products of weights leads to the incorrect conclusion that X1 and X2 are not important at all.
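Both tabulated measures can be computed directly from the weight table of the MLP example (a sketch in plain Python):

```python
# Sum of absolute products of weights, and the signed "causal index",
# for the additive function example.
# Rows: inputs X1..X3; columns: hidden units H1..H5.
input_to_hidden = [
    [100, -100, 0, 0, 0],   # X1
    [0, 0, 100, -100, 0],   # X2
    [0, 0, 0, 0, 1],        # X3
]
hidden_to_output = [0.1, 0.1, 0.1, 0.1, 1.0]

abs_products = [sum(abs(wij * wo) for wij, wo in zip(row, hidden_to_output))
                for row in input_to_hidden]
signed_products = [sum(wij * wo for wij, wo in zip(row, hidden_to_output))
                   for row in input_to_hidden]

print(abs_products)     # sums 20, 20, 1 (up to rounding)
print(signed_products)  # causal index: 0, 0, 1
```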

Incorporating both layers of weights should be an improvement over using only the input-to-hidden layer to measure importance of inputs. But simply taking the product of weights from the two layers ignores the "squashing" effect of the hidden-layer activation function. A given input-to-hidden weight should be adjusted by the amount of squashing, but the amount of squashing depends on the actual value of the corresponding input, as well as the values and weights of the other inputs, and also the bias. If you attempt to take all these complexities into account, you end up using the actual input-output function computed by the network. Measures of importance based on the input-output function are discussed in the following sections.

Why partial derivatives are more interpretable than weights

If the weights in an MLP cannot be interpreted like the weights in a linear model, is there something that can be so interpreted? To some degree, yes: the gradient of the output with respect to the inputs. Note that this gradient is not the gradient that is used for training (that gradient is taken with respect to the weights), but it can be computed in a manner similar to the usual backpropagation algorithm.

The gradient is a vector of partial derivatives. Each partial derivative, by definition, gives the local rate of change of the output with respect to the corresponding input, holding the other inputs fixed. Thus a partial derivative has the same interpretation as a weight in a linear model, except for one crucial difference: a weight in a linear model applies to the entire input space, but a partial derivative applies only to a small neighborhood of the input point at which it is computed.

Methods based on these partial derivatives are often referred to as "sensitivity analysis."

Why partial derivatives can be misleading

The interpretation of a derivative involves extremely small changes in the input. If a particular input can take only a discrete set of values, such as small integers or the Boolean values "true" and "false," then its partial derivative is unlikely to have any practical meaning. Partial derivatives are most useful for inputs that are measurements of theoretically continuous attributes (in practice, all measurements are discrete).

It is tempting to compute the partial derivatives at one "typical" point in the input space, such as the centroid, and assume that those derivatives are typical of the entire input space, but this assumption is dangerously false. If the partial derivatives are constant over the input space, then the output function is linear. If you are using a nonlinear neural network, presumably you think it is possible for the output function to have important nonlinearities. If the output function has important nonlinearities, then there will be important variation of the partial derivatives over the input space.

It can also be dangerously misleading to look at the partial derivatives at only a few points in the input space. One method that has been proposed is to vary each input in turn while all the other inputs are fixed at their mean values. But this method can overlook important variation of the partial derivatives. For example, consider a continuous version of the XOR data:

  Y = X1 + X2 - 2*X1*X2 
where X1 and X2 vary uniformly over [0,1]. Then the mean of each input is .5, and if you fix one input to .5, you will find that the output is a constant regardless of the value of the other input. In other words, if either input is fixed at its mean value, the partial derivative with respect to the other input is zero.
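The point is easy to verify symbolically (a sketch in plain Python): the partial derivative of Y with respect to X2 is 1 - 2*X1, which vanishes exactly when X1 is fixed at its mean of 0.5, even though X2 matters everywhere else.

```python
# Continuous XOR: Y = X1 + X2 - 2*X1*X2.
def y(x1, x2):
    return x1 + x2 - 2 * x1 * x2

# dY/dX2 = 1 - 2*X1: zero at X1 = 0.5, but 1 at X1 = 0 and -1 at X1 = 1.
def dy_dx2(x1):
    return 1 - 2 * x1

print(y(0.5, 0.0), y(0.5, 1.0))  # both 0.5: constant in X2
print(dy_dx2(0.5), dy_dx2(0.0))
```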

For the additive function example, the partial derivative with respect to each input is shown at the mean of the inputs:

   Input  Partial derivative at mean
   -----  --------------------------
   X1     0
   X2     0
   X3     1
Thus, the partial derivatives at the mean suggest the incorrect conclusion that X1 and X2 are completely unimportant.
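The table above can be reproduced with central finite differences (a sketch, assuming NumPy): at the mean input (0, 0, 0), the steep X1 and X2 terms of the additive function cancel exactly, leaving only the X3 term.

```python
# Numeric gradient of the additive function at the mean of the inputs.
import numpy as np

def f(x):
    x1, x2, x3 = x
    return (0.1 * np.tanh(100 * (x1 + 0.25))
            - 0.1 * np.tanh(100 * (x1 - 0.25)) - 0.1
            + 0.1 * np.tanh(100 * (x2 + 1.5))
            - 0.1 * np.tanh(100 * (x2 - 1.5))
            + np.tanh(x3))

def grad(f, x, h=1e-5):
    g = np.zeros(len(x))
    for i in range(len(x)):
        e = np.zeros(len(x))
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)  # central difference
    return g

print(grad(f, np.zeros(3)))  # approximately [0, 0, 1]
```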

Why average partial derivatives over the input space can be misleading

Partial derivatives are interpretable, but you have to evaluate them at a large, representative sample of points from the input space. The next question is how to reduce this large collection of numbers to a single measure of importance for each input. One obvious way to summarize the partial derivatives is to report an average value (mean, median, etc.). But the partial derivatives for a given input may take both large positive and large negative values, producing an average near zero. So average partial derivatives are useful but not sufficient for measuring importance of inputs.

For the additive function example, the average partial derivatives with respect to each input are shown:

   Input  Average partial derivative
   -----  --------------------------
   X1     0
   X2     0
   X3     0.33
Thus, the average partial derivatives suggest the incorrect conclusion that X1 and X2 are completely unimportant.

Why the average absolute (or squared) partial derivative can be misleading

To allow for both positive and negative partial derivatives, you can compute the average of the absolute values or squares. This gives you a better measure of importance than the average of the signed values. But the importance of an input depends not only on the size of the partial derivatives, but on the location of points in the input space with large partial derivatives. In fact, it is sometimes impossible to tell which of two inputs is more important even by looking at the complete frequency distribution of the partial derivatives, as is shown by the following example.

In the additive function example, X1 and X2 have the same mean partial derivative. They also have the same mean absolute derivative and the same mean squared derivative. In fact, the partial derivatives for X1 and X2 have exactly the same frequency distribution, so it is impossible to tell which one is more important based on the partial derivatives alone. The average absolute partial derivatives with respect to each input are as follows:

   Input  Average absolute partial derivative
   -----  -----------------------------------
   X1     0.5
   X2     0.5
   X3     0.33
Thus, the average absolute partial derivatives suggest two incorrect conclusions: that X1 and X2 are equally important, and that both X1 and X2 are more important than X3.

Why differences can be more informative than derivatives

Since the partial derivative of the output with respect to each input provides only local information, it might be better to look at the change in the output over an interval. Also, derivatives are not suitable for discrete inputs, while differences of outputs make sense even for binary inputs (e.g., Baxt and White 1995). The difference in output corresponding to a given change in an input is directly related to the causal importance of an input. For example, to assess the importance of X1 given an output function Y = f( X1, X2, X3), you could compute:
   D1 = f( X1+h, X2, X3) - f( X1, X2, X3)
for a large, representative sample of input points, and then take the average absolute value or square of D1. But how do you choose h? If the output function is periodic, such as Y = sin(X1) + sin(2*X2) + sin(3*X3), D1 will be zero when h is a multiple of the period, but large when h is an odd multiple of half the period. So the safest thing to do is to look at a range of h values. The following plot shows the mean absolute difference in the output as a function of h for the three inputs in the additive function used in the example above:
  2.0 +                                                    3    3    3
      |                                          3    3
      |
      |                                     3
      |
  1.5 +                                3
      |
      |
      |                           3
      |
  1.0 +
      |                      3
      |
      |                 3
      |
  0.5 +
      |            3
      |
      |       3                   2    2    2
      |            2    2    2    1              2
  0.0 +  1    1    1    1    1         1    1    1    1    1    1    1
      |
      ---+----+----+----+----+----+----+----+----+----+----+----+----+--
        0.0  0.5  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0
The order of importance of the inputs is evident in this plot, but in general there remains the problem of how to reduce the information to a single value for each input. Presumably this would involve averaging over h, but there are many different ways to do the averaging. When the inputs have different units of measurement and different distributions, it is not obvious how to select appropriate values for h. Suppose X1 is uniformly distributed on [0,1] and X2 is a binary variable with values in {0,1} with equal probability. For X2, the only sensible value of h is 1, but for X1 you would want to use various values of h intermediate between 0 and 1. How do you average over h in a way that fairly represents both X1 and X2?
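The periodic caveat is easy to reproduce numerically. The sketch below uses only the sin(X1) term of the periodic example above (the other terms cancel in the difference D1); the sample size and random seed are arbitrary choices:

```python
import numpy as np

# Mean absolute change in Y when X1 is shifted by h, for the periodic
# example Y = sin(X1) + sin(2*X2) + sin(3*X3).  Only the sin(X1) term
# survives in D1 = f(X1+h, X2, X3) - f(X1, X2, X3).
rng = np.random.default_rng(0)
x1 = rng.uniform(0.0, 2.0 * np.pi, 100000)

def mean_abs_diff(h):
    return np.mean(np.abs(np.sin(x1 + h) - np.sin(x1)))

print(mean_abs_diff(2.0 * np.pi))  # h equals the period: difference vanishes
print(mean_abs_diff(np.pi))        # h is half the period: difference is large
```

A single value of h can therefore make an important input look inert, which is why a range of h values is needed.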

It can be tempting to try to reduce the magnitude of the task by clamping all inputs but one at "typical" values and looking at differences produced by changing only one input at a time. But this approach is incorrect except for linear models, as discussed under "Why partial derivatives can be misleading."

For the additive function example, if absolute differences are averaged over all pairs of input values, the following results are obtained:

   Input  Average absolute difference
   -----  ---------------------------
   X1     .030 
   X2     .099 
   X3     .916
Thus, this type of average absolute difference correctly indicates the order of importance of the inputs.

Change in the error function when an input is removed

Another measure of the importance of an input is the change in the error function when the input is removed from the network. The change in the error function is a direct measure of predictive importance. However, the same pitfalls discussed under "Why comparing changes in the error function in linear models can be misleading" apply to nonlinear models as well.

It is important to retrain the network after removing the input. Simply deleting an input unit from the network without retraining is equivalent to clamping the value of that input to zero for all cases. If zero is not a reasonable value for that input, the network outputs are likely to be nonsense.
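The equivalence between deleting an input unit and clamping that input to zero can be checked directly on a toy network. A sketch, with the weights invented for the demonstration:

```python
import numpy as np

# A tiny one-hidden-layer network; the weights are arbitrary demo values.
W1 = np.array([[0.5, -1.0], [2.0, 0.3], [-0.7, 1.5]])  # 3 inputs -> 2 hidden
b1 = np.array([0.1, -0.2])
w2 = np.array([1.2, -0.8])                              # 2 hidden -> 1 output

def forward(x, W1):
    return np.tanh(x @ W1 + b1) @ w2

x = np.array([0.4, -1.3, 0.9])

# "Deleting" input 0: zero out its outgoing weights, keep the input vector.
W1_deleted = W1.copy()
W1_deleted[0, :] = 0.0
# Clamping input 0 to zero, keeping the original weights.
x_clamped = x.copy()
x_clamped[0] = 0.0

print(forward(x, W1_deleted))  # identical to the clamped-input output below
print(forward(x_clamped, W1))
```

Both routes produce exactly the same hidden-unit activations, hence the same output.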

Instead of clamping the input to a constant value of zero, it would be better to clamp it to a typical value such as the mean of that input. But you may not get a typical output value from a typical input value. In the example above with an additive output function, the mean value of X1 produces unusually high values of the output, and clamping X1 to its mean value increases the RMSE by 0.19. The mean value of X2 also produces high outputs, but not as unusually high as for X1, so clamping X2 to its mean value increases the RMSE by only 0.16. Thus X1 appears more important than X2.

For the additive function example, the changes in RMSE, with and without retraining, produced by omitting each input, are as follows:

           ........ Change in RMSE ........
   Input   No retraining    With retraining
   -----   -------------    ---------------
   X1      .190             .057 
   X2      .140             .100 
   X3      .820             .808
Thus, the change in RMSE with no retraining incorrectly suggests that X1 is more important than X2. But the change in RMSE with retraining yields the correct order of importance.
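The bookkeeping behind such a table can be sketched as follows. A linear least-squares fit stands in for the network so that "retraining" is cheap and deterministic; the target function and its coefficients are invented for illustration and are not the additive example itself:

```python
import numpy as np

# Change in RMSE from omitting each input, with and without refitting.
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(500, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + 3.0 * X[:, 2] + rng.normal(0.0, 0.1, 500)

def fit(A, y):
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def rmse(A, beta, y):
    return np.sqrt(np.mean((A @ beta - y) ** 2))

beta_full = fit(X, y)
base = rmse(X, beta_full, y)

no_retrain, with_retrain = [], []
for i in range(3):
    Xc = X.copy()
    Xc[:, i] = 0.0                   # clamp input i, keep the old weights
    no_retrain.append(rmse(Xc, beta_full, y) - base)
    Xd = np.delete(X, i, axis=1)     # drop input i and refit the model
    with_retrain.append(rmse(Xd, fit(Xd, y), y) - base)

print(no_retrain)
print(with_retrain)
```

With refitting, the RMSE increases line up with the true coefficient sizes; without refitting, the increases are inflated and can reorder the inputs.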

Retraining the network can, of course, be time-consuming. To be safe, you should go through the full training process including such things as multiple random weight initializations. But if the input being omitted is not related to any other inputs, it is probably adequate to clamp that input to a typical value and use the weights from the original network as initial values for retraining.

If you have many more training cases than weights in the network, it may be more efficient to approximate the change in the error function using the Hessian matrix instead of retraining the network (often called "optimal brain surgeon," OBS, in the neural net literature; see Stahlberger and Riedmiller, 1997; Hassibi and Stork, 1993; Hassibi, Stork, Wolff, and Watanabe 1994). This approximation may be poor if the number of training cases does not greatly exceed the number of weights, if the optimal weights are infinite (Cardell, Joerding, and Li 1994), or if the hidden units are not statistically well-identified.

Conventional statistical methods for nonlinear models (Gallant, 1987) can be used to test the null hypothesis that a given input has zero predictive importance. P-values can be computed in two main ways: likelihood ratio tests and Wald tests. Likelihood ratio tests compare the error functions for networks trained with and without the input in question. Wald tests use a quadratic approximation to the error function as in OBS. The interpretation of p-values is subject to the same caveats discussed under "Why statistical p-values can be misleading." Accurate p-values also require regularity conditions, such as those mentioned above regarding OBS, as well as other more technical statistical conditions described by Gallant (1987) and White (1992).
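A likelihood ratio test of this kind can be sketched as follows, again with a linear least-squares model standing in for the network and Gaussian errors assumed. The LR statistic is n*log(SSE_restricted/SSE_full), asymptotically chi-square with degrees of freedom equal to the number of parameters dropped (here one); 3.84 is the 5% critical value for one degree of freedom. The data-generating coefficients are invented for illustration:

```python
import math
import numpy as np

# Likelihood ratio test of "input i has zero predictive importance".
rng = np.random.default_rng(2)
n = 400
X = np.column_stack([np.ones(n), rng.uniform(0.0, 1.0, (n, 2))])
y = 1.0 + 0.8 * X[:, 1] + 0.0 * X[:, 2] + rng.normal(0.0, 0.2, n)

def sse(A, y):
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sum((A @ beta - y) ** 2))

# Refit without each input in turn and compare error functions.
lr_x1 = n * math.log(sse(np.delete(X, 1, axis=1), y) / sse(X, y))
lr_x2 = n * math.log(sse(np.delete(X, 2, axis=1), y) / sse(X, y))
print(lr_x1, lr_x2)  # compare each to the 5% chi-square(1) value, 3.84
```

X1, which truly matters, yields a statistic far above the critical value; X2, which does not, yields a much smaller one.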

Dependent inputs

In the nonlinear examples considered up to this point, the inputs have been statistically independent by construction. If the inputs are statistically dependent, it is even more difficult to measure the importance of inputs, because the effects of different inputs cannot generally be separated. The problem is essentially the same as the problem with correlated inputs in linear models, except that linear correlation is not an adequate indicator of statistical dependence of the inputs for nonlinear models.

In the additive function example, if the values of X1 and X3 are restricted to differ by no more than 0.2 (causing those inputs to be highly correlated), the changes in RMSE produced by omitting each input are as follows:

           ........ Change in RMSE ........
   Input   No retraining    With retraining
   -----   -------------    ---------------
   X1      .191             .028 
   X2      .143             .100 
   X3      .799             .061 
The results with no retraining are essentially the same as with independent inputs. But with retraining, both X1 and X3 appear less important, especially X3, which by this measure is now less important than X2.

Noisy data

For noisy data, all of the measures of importance of inputs are subject to sampling variation. Except for raw weights in linear models, it is difficult to estimate the amount of sampling variation (e.g., the standard errors of the importance measures). One possible way to assess the variability of importance measures is bootstrapping as illustrated by Baxt and White (1995).
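The bootstrap idea can be sketched as follows, in the spirit of Baxt and White (1995). The "importance" measure here is the change in RMSE when an input is dropped and a cheap linear least-squares stand-in model is refit; the data and the number of resamples are arbitrary choices:

```python
import numpy as np

# Bootstrap standard error of an input-importance measure.
rng = np.random.default_rng(3)
X = rng.uniform(0.0, 1.0, size=(300, 3))
y = X[:, 0] + 2.0 * X[:, 2] + rng.normal(0.0, 0.3, 300)

def importance(X, y, i):
    def rmse(A):
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        return np.sqrt(np.mean((A @ beta - y) ** 2))
    return rmse(np.delete(X, i, axis=1)) - rmse(X)

# Resample cases with replacement and recompute the measure each time.
boot = []
for _ in range(200):
    idx = rng.integers(0, len(y), len(y))
    boot.append(importance(X[idx], y[idx], 2))
boot = np.array(boot)

print(boot.mean(), boot.std(ddof=1))  # point estimate and bootstrap SE
```

The spread of the bootstrap replicates gives a rough standard error for the importance measure, and percentiles of the replicates give rough confidence intervals.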

References

Baxt, W.G. and White, H. (1995) "Bootstrapping confidence intervals for clinical input variable effects in a network trained to identify the presence of acute myocardial infarction", Neural Computation, 7, 624-638.

Boger, Z., and Guterman, H. (1997), "Knowledge extraction from artificial neural network models," IEEE Systems, Man, and Cybernetics Conference, Orlando, FL.

Cardell, N.S., Joerding, W., and Li, Y. (1994), "Why Some Feedforward Networks Cannot Learn Some Polynomials," Neural Computation, 6, 761-766.

Darlington, R.B. (1968), "Multiple Regression in Psychological Research and Practice," Psychological Bulletin, 69, 161-182.

Gallant, A.R. (1987) Nonlinear Statistical Models, NY: Wiley.

Hassibi, B. and Stork, D.G. (1993) "Second order derivatives for network pruning: Optimal Brain Surgeon" in Hanson, S.J., Cowan, J.D. and Giles, C.L., eds., Advances in Neural Information Processing Systems 5, 164-171, San Mateo, CA: Morgan-Kaufmann.

Hassibi, B., Stork, D.G., Wolff, G., and Watanabe, T. (1994), "Optimal Brain Surgeon: Extensions and performance comparisons," in Cowan, J.D., Tesauro, G., and Alspector, J. (eds.) Advances in Neural Information Processing Systems 6, San Mateo, CA: Morgan-Kaufmann, pp. 263-270.

Masters, T. (1994) Practical Neural Network Recipes in C++, San Diego: Academic Press.

Setiono, R. (1997), "A penalty-function approach for pruning feedforward neural networks," Neural Computation, 9, 185-204.

Stahlberger, A., and Riedmiller, M. (1997), "Fast network pruning and feature extraction using the Unit-OBS algorithm," in Mozer, M.C., Jordan, M.I., and Petsche, T. (eds.) Advances in Neural Information Processing Systems 9, Cambridge, MA: The MIT Press, pp. 655-661.

White, H. (1992), Artificial Neural Networks: Approximation and Learning Theory, Blackwell.