/******************************************************************** name: jackboot title: Jackknife and Bootstrap Analyses product: stat system: all support: saswss - Warren S. Sarle update: 26Sep96 DISCLAIMER: THIS INFORMATION IS PROVIDED BY SAS INSTITUTE INC. AS A SERVICE TO ITS USERS. IT IS PROVIDED "AS IS". THERE ARE NO WARRANTIES, EXPRESSED OR IMPLIED, AS TO MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE REGARDING THE ACCURACY OF THE MATERIALS OR CODE CONTAINED HEREIN. Introduction ------------ The %JACK macro does jackknife analyses for simple random samples, computing approximate standard errors, bias-corrected estimates, and confidence intervals assuming a normal sampling distribution. The %BOOT macro does elementary nonparametric bootstrap analyses for simple random samples, computing approximate standard errors, bias-corrected estimates, and confidence intervals assuming a normal sampling distribution. Also, for regression models, the %BOOT macro can resample either observations or residuals. The %BOOTCI macro computes several varieties of confidence intervals that are suitable for sampling distributions that are not normal. In order to use the %JACK or %BOOT macros, you need to know enough about the SAS macro language to write simple macros yourself. See _The SAS Guide to Macro Processing_ for information on the SAS macro language. This document does not explain how the jackknife and bootstrap are performed or how the various confidence intervals are computed, but does provide some advice and caveats regarding usage. For an elementary introduction, see Dixon in the bibliography below. There is a thorough exposition in E&T that should be accessible to anyone who has done a year or more of statistical study. There is a widespread myth that bootstrapping is a magical spell to perform valid statistical inference on _anything_. S&T dispell this myth very effectively and very technically. For an elementary demonstration of the dangers of bootstrapping, see the "Cautionary Example" below. The Jackknife ------------- The jackknife works only for statistics that are smooth functions of the data. Statistics that are not smooth functions of the data, such as quantiles, may yield inconsistent jackknife estimates. The best results are obtained with statistics that are linear functions of the data. For highly nonlinear statistics, the jackknife can be inaccurate. See S&T, chapter 2, for a detailed discussion of the validity of the jackknife. The Bootstrap ------------- Bootstrap estimates of standard errors are valid for many commonly-used statistics, generally requiring no major assumptions other than simple random sampling and finite variance. There do exist some statistics for which the standard error estimates will fail, such as the maximum or minimum. The bootstrap standard error is consistent for some nonsmooth statistics such as the median. However, the bootstrap standard error may not be consistent even for very smooth statistics when the population distribution has very heavy tails. Inconsistency of the usual bootstrap estimators can often be remedied by using a resample size m(n) that is smaller than the sample size n, so that m(n)->infinity and m(n)/n->0 as n->infinity. Theoretical results on the consistency of the bootstrap standard error are not extensive. See S&T, chapter 3, for details. The bootstrap estimates of bias provided by the %BOOT macro are valid under simple random sampling for many commonly-used _plug-in_ estimators. A _plug-in_ estimator is one that uses the same formula to compute an estimate from a sample that is used to compute a parameter from the population. For example, if the sample variance is computed with a divisor of n (VARDEF=N), it is a plug-in estimate; if it is computed with a divisor of n-1 (VARDEF=DF, the default), it is _not_ a plug-in estimate. R-squared is a plug-in estimator; adjusted r-squared is not. Estimating the bias of a non-plug-in estimators requires special treatment; see "Bias Estimation" below. If you are using an estimator that is known to be unbiased, use the BIASCORR=0 argument with %BOOT. See E&T, chapter 10, for more discussion of bootstrap estimation of bias. The approximate normal confidence intervals computed by the %BOOT macro are valid if both the bias and standard error estimates are valid and if the sampling distribution is approximately normal. For non-normal sampling distributions, you should use the %BOOTCI macro, which requires a much larger number of resamples for adequate approximation. If you plan to use only %BOOT, 200 resamples will typically be enough. If you plan to use %BOOTCI, 1000 or more resamples are likely to be needed for a 90% confidence interval; greater confidence levels require even more resamples. The proper use of bootstrap confidence intervals is a matter of considerable controversy; see S&T, chapter 4, for a review. The %BOOT macro does balanced resampling when possible. Balanced resampling yields more accurate approximations to the ideal bootstrap estimators of bias and standard errors than does uniform resampling. Of course, both balanced resampling and uniform resampling produce approximations that converge to the same ideal bootstrap estimators as the number of resamples goes to infinity. Balanced resampling is of little benefit with %BOOTCI. See Hall, appendix II, for a discussion of balanced resampling and other methods from improving the computational efficiency of the bootstrap. Using %JACK and %BOOT --------------------- To use the %JACK or %BOOT macros, you must write a macro called %ANALYZE to do the data analysis that you want to bootstrap. The %ANALYZE macro must have two arguments: DATA= the name of the input data set to analyze OUT= the name of the output data set containing the statistics for which you want to compute bootstrap distributions. If possible, you should write the %ANALYZE macro to use BY processing. The BY statement must be specified via the %BYSTMT macro, which generates a BY statement in which the list of BY variables is given by a macro variable &BY. The &BY macro variable is not an argument to %ANALYZE or to %BYSTMT, but is specified by a %LET statement when needed. The %JACK and %BOOT macros run %ANALYZE once without a BY variable and then once with the the BY variable _SAMPLE_. If you do not use the %BYSTMT macro, the computations will be done with a macro loop instead of with BY processing. A macro loop takes much more computer time than BY processing but requires less disk space. If the %ANALYZE macro uses the %BYSTMT macro, two output data sets are created by the %JACK macro: JACKDATA contains the jackknife resamples. The variable _SAMPLE_ gives the resample number, and _OBS_ gives the original observation number. JACKDIST contains the resampling distributions of the statistics in the OUT= data set created by the %ANALYZE macro. The variable _SAMPLE_ gives the resample number. Two similar data sets are also created by the %BOOT macro when the %BYSTMT macro is used: BOOTDATA contains the bootstrap resamples. The variable _SAMPLE_ gives the resample number, and _OBS_ gives the original observation number. BOOTDIST contains the resampling distributions of the statistics in the OUT= data set created by the %ANALYZE macro. The variable _SAMPLE_ gives the resample number. In addition, the %JACK macro creates a data set JACKSTAT and the %BOOT macro creates a data set BOOTSTAT regardless of whether the %BYSTMT macro is used. These data sets contain the approximate standard errors, bias-corrrected estimates, and 95% confidence intervals assuming a normal sampling distribution. The %BOOTCI macro creates a data set BOOTCI containing the confidence intervals. If the OUT= data set contains more than one observation per BY group, you must specify a list of ID= variables when you run the %JACK or %BOOT macros. These ID= variables identify observations that correspond to the same statistic in different BY groups. For many procedures, these ID= variables would naturally be _TYPE_ and _NAME_, but those names are _not_ allowed to be used as ID= variables--you must use the RENAME= data set option to rename them. (Renaming variables can be tricky. You must use the _old_ name with the DROP= and KEEP= data set options, but you must use the _new_ name with the WHERE= data set option.) Consider analyzing the correlation of the LSAT and GPA variables from Efron and Tibshirani (1993): title 'Law School Data from Efron and Tibshirani, p. 19'; data law; input lsat gpa; cards; 576 3.39 635 3.30 558 2.81 578 3.03 666 3.44 580 3.07 555 3.00 661 3.43 651 3.36 605 3.13 653 3.12 575 2.74 545 2.76 572 2.88 594 2.96 ; The following %ANALYZE macro could be used to process all the statistics in the OUT= data set from PROC CORR: %macro analyze(data=,out=); proc corr noprint data=&data out=&out(rename=(_type_=stat _name_=with)); var lsat gpa; %bystmt; run; %mend; title2 'Jacknife Analysis'; %jack(data=law,id=stat with) title2 'Bootstrap Analysis'; %boot(data=law,id=stat with,random=123) However, if you are interested only in the correlation, it is more efficient to extract only the relevant observations and variables. It is also helpful to provide descriptive names and labels, as in this example: %macro analyze(data=,out=); proc corr noprint data=&data out=&out; var lsat gpa; %bystmt; run; %if &syserr=0 %then %do; data &out; set &out; where _type_='CORR' & _name_='LSAT'; corr=gpa; label corr='Correlation'; keep corr &by; run; %end; %mend; title2 'Jacknife Analysis'; %jack(data=law) title2 'Bootstrap Analysis'; %boot(data=law,random=123) It is advisable to make the OUT= data set as small as possible to conserve computer time and disk space. If you are running release 6.11 or later, you can use a WHERE= data set option on an output data set: title2 'Using WHERE= with an output data set--6.11 only'; %macro analyze(data=,out=); proc corr noprint data=&data out=&out(where=(_type_='CORR' & _name_='LSAT') rename=(gpa=corr) keep=gpa _type_ _name_ &by); var lsat gpa; %bystmt; run; %mend; title3 'Jacknife Analysis'; %jack(data=law) title3 'Bootstrap Analysis'; %boot(data=law,random=123) Unfortunately, you may not DROP any variable used in a WHERE= data set option. Bias Estimation --------------- The sample correlation is a plug-in estimator and hence is suitable for the bias estimator in %BOOT. The sample variance computed with a divisor of n-1 is not a plug-in estimator and therefore requires special treatment. In some procedures, you can use the VARDEF= option to obtain a plug-in estimate of the variance. The default value of VARDEF= is DF, which yields the usual adjustment for degrees of freedom, instead of the plug-in estimate. For example: title2 'The unbiased variance estimator is not a plug-in estimator'; proc means data=law var vardef=df; var lsat gpa; run; The following %ANALYZE macro could be used to jackknife the unbiased variance estimator, but the bootstrap over-corrects for the nonexistent bias: title2 'Estimating the bias of the unbiased estimator of variance'; %macro analyze(data=,out=); proc means noprint data=&data vardef=df; output out=&out(drop=_freq_ _type_) var=var_lsat var_gpa; var lsat gpa; %bystmt; run; %mend; title3 'The jackknife computes the correct bias of zero'; %jack(data=law) title3 'The bootstrap over-corrects for bias'; %boot(data=law,random=123) By specifying VARDEF=N instead of VARDEF=DF, you can tell the MEANS procedure to compute a plug-in estimate of the variance: title2 'Estimating the bias of the plug-in estimator of variance'; %macro analyze(data=,out=); proc means noprint data=&data vardef=n; output out=&out(drop=_freq_ _type_) var=var_lsat var_gpa; var lsat gpa; %bystmt; run; %mend; With the above %ANALYZE macro, %JACK yields an exact bias correction, while the bias-corrected estimates from %BOOT are very close to the unbiased estimates: title3 'Jacknife Analysis'; %jack(data=law) title3 'Bootstrap Analysis'; %boot(data=law,random=123) If the procedure you are using supports the VARDEF= option to produce plug-in estimates, you can use the %VARDEF macro to obtain correct bootstrap bias estimates of the corresponding non-plug-in estimates. The %VARDEF macro generates a VARDEF= option with a value of either N or DF as appropriate for use with the %BOOT macro (The %JACK macro ignores the %VARDEF macro). In the %ANALYZE macro, use %VARDEF in the procedure statement where the VARDEF= option would be syntactically correct. For example: title2 'Estimating the bias of the unbiased variance estimator'; %macro analyze(data=,out=); proc means noprint data=&data %vardef; output out=&out(drop=_freq_ _type_) var=var_lsat var_gpa; var lsat gpa; %bystmt; run; %mend; title3 'Bootstrap Analysis'; %boot(data=law,random=123) The variance estimator using VARDEF=DF is unbiased, so the bias correction estimated by bootstrapping is much smaller than in the previous example, in which the biased plug-in estimator was used. Confidence Intervals -------------------- The normal bootstrap confidence interval computed by %BOOT or %BOOTSE is accurate only for statistics with an approximately normal sampling distribution. The %BOOTCI macro provides the most commonly used types of bootstrap confidence intervals that: * are suitable for statistics with nonnormal sampling distributions and * require only a single level of resampling. You must run %BOOT before %BOOTCI, and it is advisable to specify at least 1000 resamples in %BOOT for a 90% confidence interval. For a higher level of confidence or for the BC and BCa methods, even more resamples should be used. The terminology for bootstrap confidence intervals is confused. The keywords used with the %BOOTCI macro follow S&T: Keyword Terms from the references ------- ------------------------- PCTL or "bootstrap percentile" in S&T; PERCENTILE "percentile" in E&T; "other percentile" in Hall; "Efron's `backwards' pecentile" in Hjorth HYBRID "hybrid" in S&T; no term in E&T; "percentile" in Hall; "simple" in Hjorth T "bootstrap-t" in S&T and E&T; "percentile-t" in Hall; "studentized" in Hjorth BC "BC" in all BCA "BCa" in S&T, E&T, and Hjorth; "ABC" in Hall (cannot be used for bootstrapping residuals in regression models) There is considerable controversy concerning the use of bootstrap confidence intervals. To fully appreciate the issues, it is important to read S&T and Hall in addition to E&T. Asymptotically in simple random samples, the T and BCa methods work better than the traditional normal approximation, while the percentile, hybrid, and BC methods have the same accuracy as the traditional normal approximation. In small samples, things get much more complicated: * The percentile method simply uses the alpha/2 and 1-alpha/2 percentiles of the bootstrap distribution to define the interval. This method performs well for quantiles and for statistics that are unbiased and have a symmetric sampling distribution. For a statistic that is biased, the percentile method amplifies the bias. The main virtue of the percentile method and the closely related BC and BCa methods is that the intervals are equivariant under transformation of the parameters. One consequence of this equivariance is that the interval cannot extend beyond the possible range of values of the statistic. In some cases, however, this property can be a vice--see the "Cautionary Example" below. * The BC method corrects the percentile interval for bias--median bias, not mean bias. The correction is performed by adjusting the percentile points to values other than alpha/2 and 1-alpha/2. If a large correction is required, one of the percentile points will be very small; hence a very large number of resamples will be required to approximate the interval accurately. See the "Cautionary Example" below. * The BCa method corrects the percentile interval for bias and skewness. This method requires an estimate of the acceleration, which is related to the skewness of the sampling distribution. The acceleration can be estimated by jackknifing for simple random samples which, of course, requires extra computation. For bootstrapping residuals in regression models, no general method for estimating the acceleration is known. If the acceleration is not estimated accurately, the BCa interval will perform poorly. The length of the BCa interval is not monotonic with respect to alpha (Hall, pp 134-135, 137). For large values of the acceleration and large alpha, the BCa interval is excessively short. The BCa interval is no better than the BC interval for nonsmooth statistics such as the median. * The HYBRID method is the reverse of the percentile method. While the percentile method amplifies bias, the HYBRID method automatically adjusts for bias and skewness. The HYBRID method works well if the standard error of the statistic does not depend on any unknown parameters; otherwise, the T method works better if a good estimate of the standard error is available. Of all the methods in %BOOTCI, the HYBRID method seems to be the least likely to yield spectacularly wrong results, but often suffers from low coverage in relatively easy cases. The HYBRID method and the closely related T method are not equivariant under transformation of the parameters. * The T method requires an estimate of the standard error (or a constant multiple thereof) of each statistic being bootstrapped. This requires more work from the user. If the standard errors are not estimated accurately, the T method may perform poorly. In simulation studies, T intervals are often found to be very long. E&T (p 160) claim that the T method is erratic and sensitive to outliers. Numerous other methods exist for bootstrap confidence intervals that require nested resampling, i.e., each resample of the original sample is itself reresampled multiple times. Since the total number of reresamples required is typically 25,000 or more, these methods are extremely expensive and have not yet been implemented in the %BOOT and %BOOTCI macros. The following example replicates the nonparametric confidence intervals shown in E&T, p 183. This example analyzes the variances of two variables, A and B, while E&T analyze only A. E&T do not show the hybrid interval, the normal ("standard") interval with bias correction, or the jackknife interval. title 'Spatial Test Data from Efron and Tibshirani, pp 180 & 183'; data spatial; input a b @@; cards; 48 42 36 33 20 16 29 39 42 38 42 36 20 15 42 33 22 20 41 43 45 34 14 22 6 7 0 15 33 34 28 29 34 41 4 13 32 38 24 25 47 27 41 41 24 28 26 14 30 28 41 40 ; %macro analyze(data=,out=); proc means noprint data=&data vardef=n; output out=&out(drop=_freq_ _type_) var=var_a var_b; var a b; %bystmt; run; %mend; title2 'Jackknife Interval with Bias Correction'; %jack(data=spatial,alpha=.10); title2 'Normal ("Standard") Confidence Interval with Bias Correction'; %boot(data=spatial,alpha=.10,samples=2000,random=123); title2 'Normal ("Standard") Confidence Interval without Bias Correction'; %bootse(alpha=.10,biascorr=0); title2 'Efron''s Percentile Confidence Interval'; %bootci(percentile,alpha=.10) title2 'Hybrid Confidence Interval'; %bootci(hybrid,alpha=.10) title2 'BC Confidence Interval'; %bootci(bc,alpha=.10) title2 'BCa Confidence Interval'; %bootci(bca,alpha=.10) title2 'Resampling with Computation of Studentizing Statistics'; %macro analyze(data=,out=); proc means noprint data=&data vardef=n; output out=&out(drop=_freq_ _type_) var=var_a var_b kurtosis=kurt_a kurt_b; var a b; %bystmt; run; data &out; set &out; stud_a=var_a*sqrt(kurt_a+2); stud_b=var_b*sqrt(kurt_b+2); drop kurt_a kurt_b; run; %mend; %boot(data=spatial,stat=var_a var_b,samples=2000,random=123); title2 'T Confidence Interval'; %bootci(t,stat=var_a var_b,student=stud_a stud_b,alpha=.10) If you want to compute _all_ the varieties of confidence intervals, you can use the %ALLCI macro: title2 'All Jackknife and Bootstrap Confidence Intervals'; %allci(stat=var_a var_b,student=stud_a stud_b,alpha=.10) Bootstrapping Regression Models ------------------------------- In regression models, there are two main ways to do bootstrap resampling, depending on whether the predictor variables are random or fixed. Stine provides an elementary introduction to bootstrapping regressions, including discussion of outliers, robust estimators, and heteroscedasticity. If the predictors are random, you resample observations just as you would for any simple random sample. This method is usually called "bootstrapping pairs". If the predictors are fixed, the resampling process should keep the same values of the predictors in every resample and change only the values of the response variable by resampling the residuals. To do this with the %BOOT macro, you must do a preliminary analysis in which you fit the regression model using the complete sample and create an output data set containing residuals and predicted values; it is this output data set that is used as input to the %BOOT macro. You must also specify the name of the residual variable and provide an equation for computing the response variable from the residual and predicted values. title 'Cement Hardening Data from Hjorth, p 31'; data cement; input x1-x4 y; label x1='3CaOAl2O3' x2='3CaOSiO2' x3='4CaOAl2O3Fe2O3' x4='2CaOSiO2'; cards; 7 26 6 60 78.5 1 29 15 52 74.3 11 56 8 20 104.3 11 31 8 47 87.6 7 52 6 33 95.9 11 55 9 22 109.2 3 71 17 6 102.7 1 31 22 44 72.5 2 54 18 22 93.1 21 47 4 26 115.9 1 40 23 34 83.8 11 66 9 12 113.3 10 68 8 12 109.4 ; proc reg data=cement; model y=x1-x4; output out=cemout r=resid p=pred; run; %macro analyze(data=,out=); options nonotes; proc reg data=&data noprint outest=&out(drop=Y _IN_ _P_ _EDF_); model y=x1-x4/selection=rsquare start=4; %bystmt; run; options notes; %mend; title2 'Resampling Observations'; title3 '(bias correction for _RMSE_ is wrong)'; %boot(data=cement,random=123) title2 'Resampling Residuals'; title3 '(bias correction for _RMSE_ is wrong)'; %boot(data=cemout,residual=resid,equation=y=pred+resid,random=123) Either method of resampling for regression models (observations or residuals) can be used regardless of the form of the error distribution. However, residuals should be resampled only if the errors are independent and identically distributed and if the functional form of the model is correct to within a reasonable approximation. If these assumptions are questionable, it is safer to resample observations. In the above example, R-squared is a plug-in estimator, so the bias correction is appropriate. The root mean squared error, _RMSE_, is not a plug-in estimator, so the bias correction for _RMSE_ is wrong. Unfortunately, the REG procedure does not support the VARDEF= option. _RMSE_ is not _very_ biased, so you could choose to ignore the bias and run the %BOOTSE macro to compute the standard error without a bias correction: title2 'Resampling Observations'; title3 'Without bias correction'; %bootse(stat=_rmse_,biascorr=0) To get the proper bias correction for _RMSE_, you have to use a DATA step that checks the macro variable &VARDEF and unadjusts for degrees of freedom when &VARDEF=N. You must also invoke the %VARDEF macro, but since you don't want to generate a VARDEF= option in this case, just assign the value returned by %VARDEF to an unused macro variable: %macro analyze(data=,out=); options nonotes; proc reg data=&data noprint outest=&out; model y=x1-x4/selection=rsquare start=4; %bystmt; run; %let junk=%vardef; data &out(drop=y _in_ _p_ _edf_); set &out; _mse_=_rmse_**2; %if &vardef=N %then %do; _mse_=_mse_*_edf_/(_edf_+_p_); _rmse_=sqrt(_mse_); %end; label _mse_='Mean Squared Error'; run; options notes; %mend; title2 'Resampling Observations'; %boot(data=cement,random=123) Note that _MSE_ is an unbiased estimate, so its estimated bias is very small. _RMSE_ is slightly biased and thus has a larger estimated bias. A Cautionary Example -------------------- Jackknifing and bootstrapping are no remedy for an inadequate sample size. For nonparametric resampling methods, the sample distribution must be reasonably close in some sense to the population distribution to obtain accurate inferences. In parametric methods, only the estimated parameters need be reasonably close to the population parameters to obtain accurate inferences. The smaller the sample size, the greater the fluctuations in the distribution of the sample. Nonparametric methods that are sensitive to a wide variety of such fluctuations will suffer more from small sample sizes than will parametric methods _if_ the assumptions of the parametric methods are valid. In this example, the purpose of the analysis is to find a 95% confidence interval for R**2 in a linear regression with 20 observations and 10 predictors. The predictors and response are generated from a multivariate normal distribution, so normal-theory methods are applicable. With real data, if the distribution were not known to be normal, you might be tempted to use the jackknife or bootstrap on the theory that normal approximations could not be trusted in such a small sample size. In fact, most of the jackknife and bootstrap methods cannot be trusted either. This example computes a 95% confidence interval with each of the methods available in %JACK and %BOOT using 1000 resamples. The results are assembled into a single data set called CI for comparison. PROC CANCORR is also used to obtain a normal-theory 95% confidence interval. Two versions of the %ANALYZE macro are shown, one with CANCORR and one with REG; either version can be used for the analysis. title 'A Cautionary Example'; %let n=20; %let p=10; *** generate multivariate normal data with true R**2=0.1; data x; array x x1-x&p; do n=1 to &n; drop n; do over x; x=rannor(123); end; p=sum(of x1-x&p)/sqrt(&p); e=rannor(123); y=p*sqrt(.1)+e*sqrt(.9); output; end; run; *** normal-theory confidence interval for R**2; proc cancorr data=x smc vdep short outstat=stat; var y; with x:; run; *** extract confidence interval from cancorr output; proc transpose data=stat(where=(_type_ in ('LCLRSQ' 'UCLRSQ'))) out=ci(rename=(LCLRSQ=alcl UCLRSQ=aucl)); id _type_; var y; run; data ci; set ci; drop _name_; length method $20; method='Normal theory'; run; %*** macro for bootstrapping using cancorr; %macro analyze(data=,out=); options nonotes; proc cancorr data=&data smc vdep noprint outstat=&out(where=(_type_='RSQUARED') keep=_type_ y &by); var y; with x:; %bystmt; run; options notes; %mend; %*** macro for bootstrapping using reg; %macro analyze(data=,out=); options nonotes; proc reg data=&data noprint outest=&out(keep=_rsq_ &by); model y=x:/selection=rsquare start=&p; %bystmt; run; options notes; %mend; %jack(data=x) proc append data=jackstat(keep=alcl aucl method) base=ci; run; %boot(data=x,samples=1000,random=123) proc append data=bootstat(keep=alcl aucl method) base=ci; run; %bootci(Hybrid) proc append data=bootci(keep=alcl aucl method) base=ci; run; %bootci(PCTL) proc append data=bootci(keep=alcl aucl method) base=ci; run; %bootci(BC) proc append data=bootci(keep=alcl aucl method) base=ci; run; %bootci(BCa) proc append data=bootci(keep=alcl aucl method) base=ci; run; %*** macro for bootstrapping, with a data step to estimate the standard error of R**2; %macro analyze(data=,out=); options nonotes; proc reg data=&data noprint outest=&out(keep=_rsq_ &by); model y=x:/selection=rsquare start=&p; %bystmt; run; options notes; data &out; set &out(rename=(_rsq_=rsquare)); retain n &n p %eval(&p+1); drop n p; * standard error of rsquare--see Kendall & Stuart (1979), The Advanced Theory of Statistics, Vol 2, p 363, London: Charles Griffin & Company Ltd. Surprisingly, the bootstrap t interval seems to be slightly less conservative when rsquare is plugged into the formula for the standard error than when max(0,adjrsq) is plugged in; stud=max(.01,sqrt( (n-p)/(n**2-1)/(n-1) * (1-rsquare)**2 * ( 2*(p-1) + 4*rsquare*((n-p)*(n-1)+4*(p-1))/(n+3) ) )); run; %mend; %boot(data=x,stat=rsquare,samples=1000,random=123) %bootci(t,stat=rsquare,student=stud) proc append data=bootci(keep=alcl aucl method) base=ci; run; proc print data=ci; id method; run; The actual sampling distribution of R**2, based on 10000 simulated data sets, looks like this: Frequency 1000 + * | * * * | * * * * | * * * * * 800 + * * * * * * | * * * * * * * | * * * * * * * * | * * * * * * * * 600 + * * * * * * * * * | * * * * * * * * * * | * * * * * * * * * * | * * * * * * * * * * * 400 + * * * * * * * * * * * * | * * * * * * * * * * * * | * * * * * * * * * * * * * | * * * * * * * * * * * * * 200 + * * * * * * * * * * * * * * * | * * * * * * * * * * * * * * * * | * * * * * * * * * * * * * * * * * | * * * * * * * * * * * * * * * * * * * * ------------------------------------------------------ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 . . . . . . . . . . . . . . . . . . . . . . . . . . 0 0 0 1 1 2 2 2 3 3 4 4 4 5 5 6 6 6 7 7 8 8 8 9 9 0 0 4 8 2 6 0 4 8 2 6 0 4 8 2 6 0 4 8 2 6 0 4 8 2 6 0 The bootstrap distribution computed from the one data set in this example is not even close to the true sampling distribution: Frequency | * 300 + * | * | * | * | * 200 + * | * | * | * * | * * * 100 + * * * * * | * * * * * * | * * * * * * * | * * * * * * * * * | * * * * * * * * * * * ------------------------------------------------------ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 . . . . . . . . . . . . . . . . . . . . . . . . . . 0 0 0 1 1 2 2 2 3 3 4 4 4 5 5 6 6 6 7 7 8 8 8 9 9 0 0 4 8 2 6 0 4 8 2 6 0 4 8 2 6 0 4 8 2 6 0 4 8 2 6 0 The table of confidence intervals printed by the final PROC PRINT step is: METHOD ALCL AUCL Normal theory 0.00000 0.62876 Jackknife -0.44648 0.54393 Bootstrap Normal 0.07400 0.51324 Bootstrap Hybrid 0.18391 0.56566 Bootstrap PCTL 0.61824 1.00000 Bootstrap BC 0.51547 0.57231 Bootstrap BCa 0.51547 0.57231 Bootstrap t -3.11368 0.56556 The true value of R**2 in this example is 0.10, the sample plug-in estimate is 0.59, and the adjusted estimate is 0.14. The normal-theory interval can be considered the "right answer". The jackknife interval has a negative lower limit, and the upper limit is rather low, but the interval covers the true value. The bootstrap interval based on a normal approximation is short but does cover the true value. However, a glance at the chart of the bootstrap distribution shows that a normal approximation is suspect. The bootstrap hybrid interval is even shorter and does not cover the true value. The hybrid interval is poor because the bootstrap distribution is less variable and far more skewed than the true sampling distribution. The plug-in estimate is very biased, so it is no surprise that the bootstrap PCTL method works poorly. However, the PCTL interval lies entirely above the plug-in estimate, a dramatic illustration of Hall's claim that the PCTL interval is "backwards"! The bootstrap BC interval is extremely short and is not even close to the true value. The lower percentile point for computing the BC interval is .00000000010453, so billions of resamples would be required for an accurate approximation. The lower percentile point for the BCa interval is even smaller at 7.3099E-17, and would require an astronomical number of resamples for an accurate approximation. The bootstrap t interval has a wildy negative lower limit, and the upper limit is rather low, but the interval covers the true value. A simulation was performed by repeating the above analysis 2000 times on randomly generated data sets. For each method, the coverage probability (COVERAGE), the average length (LENGTH), and the positive part of the length (POSLEN=AUCL-MAX(0,ALCL)) were computed. Among the jackknife and bootstrap methods, the only acceptable coverage probability is for the bootstrap t interval, which is nevertheless very poor with regard to the length of the interval. Considering only the positive part of the interval, the bootstrap t interval is quite good, but it works well in this example only because we know a lower bound for the parameter and have an analytic expression for the standard error. METHOD COVERAGE LENGTH POSLEN ------------------------------------------------------ Bootstrap BC 0.00426 0.148825 0.148825 Bootstrap BCa 0.00426 0.148575 0.148575 Bootstrap Hybrid 0.42997 0.418778 0.351726 Bootstrap Normal 0.54917 0.485234 0.361531 Bootstrap PCTL 0 0.418778 0.418778 Bootstrap T 0.95828 4.351963 0.542228 Jackknife 0.66050 0.561788 0.333603 Normal theory 0.95360 0.573977 0.573977 References ---------- Dixon Dixon, P.M. (1993), "The bootstrap and the jackknife: Describing the precision of ecological indices," in Scheiner, S.M. and Gurevitch, J., eds., Design and Analysis of Ecological Experiments, New York: Chapman & Hall, pp 290-318. E&T Efron, B. and Tibshirani, R.J. (1993), An Introduction to the Bootstrap, New York: Chapman & Hall. Hall Hall, P. (1992), The Bootstrap and Edgeworth Expansion, New York: Springer-Verlag. Hjorth Hjorth, J.S.U. (1994), Computer Intensive Statistical Methods, London: Chapman & Hall. S&T Shao, J. and Tu, D. (1995), The Jackknife and Bootstrap, New York: Springer-Verlag. Stine Stine, R. (1990), "An introduction to bootstrap methods: Examples and ideas," Sociological Methods & Research, 18, 243-291. ********************************************************************/ %******************************* JACK *******************************; %macro jack( /* Jackknife resampling analysis */ data=, /* Input data set. If the data set does not support direct access via the POINT= option, do NOT use the %BYSTMT macro in the %ANALYZE macro. */ stat=_numeric_,/* Numeric variables in the OUT= data set created by the %ANALYZE macro that contain the values of statistics for which you want to compute jackknife distributions. */ id=, /* One or more numeric or character variables that uniquely identify the observations of the OUT= data set within each BY group. No ID variables are needed if the OUT= data set has only one observation per BY group. The ID variables may not be named _TYPE_, _NAME_, or _STAT_. */ biascorr=1, /* 1 for bias correction; 0 otherwise. */ alpha=.05, /* significance (i.e., one minus confidence) level for confidence intervals; blank to suppress confidence intervals. */ print=1, /* 1 to print the jackknife estimates; 0 otherwise. */ chart=1 /* 1 to chart the jackknife resampling distributions; 0 otherwise. */ ); %if %bquote(&data)= %then %do; %put ERROR in JACK: The DATA= argument must be specified.; %goto exit; %end; %global _jackdat; %let _jackdat=&data; %global vardef; %let vardef=DF; %local jack by useby; %let useby=0; *** compute the actual values of the statistics; %let by=; %analyze(data=&data,out=JACKACT); %if &syserr>4 %then %goto exit; *** find number of observations in the input data set; %local nobs; data _null_; call symput('nobs',trim(left(put(_nobs,12.)))); if 0 then set &data nobs=_nobs; stop; run; %if &syserr>4 %then %goto exit; %if &useby %then %do; %jackby(data=&data,print=0); %if &syserr>4 %then %goto exit; %let by=_sample_; %analyze(data=JACKDATA,out=JACKDIST); %if &syserr>4 %then %goto exit; %end; %else %do; %jackslow(data=&data); %if &syserr>4 %then %goto exit; %end; %if &chart %then %do; %if %bquote(&id)^= %then %do; proc sort data=JACKDIST; by &id; run; proc chart data=JACKDIST(drop=_sample_); vbar &stat; by &id; run; %end; %else %do; proc chart data=JACKDIST(drop=_sample_); vbar &stat; run; %end; %end; %jackse(stat=&stat,id=&id,alpha=&alpha,biascorr=&biascorr,print=&print) %exit:; %mend jack; %macro jackby( /* Jackknife resampling */ data=&_jackdat, print=0 ); data JACKDATA/view=JACKDATA; do _sample_=1 to &nobs; do _i=1 to &nobs; if _i^=_sample_ then do; _obs_=_i; set &data point=_i; output; end; end; end; stop; run; %if &syserr>4 %then %goto exit; %if &print %then %do; proc print data=JACKDATA; id _sample_ _obs_; run; %end; %exit:; %mend jackby; %macro jackslow( /* Uniform jackknife sampling and analysis without BY processing */ data=&_jackdat ); %put %cmpres(WARNING: Jackknife analysis will be slow because the ANALYZE macro did not use the BYSTMT macro.); data JACKDIST; set JACKACT; _sample_=0; delete; run; options nonotes; %local sample; %do sample=1 %to &nobs; %put Jackknife sample &sample; data _TMPD_; drop _i; do _i=1 to &nobs; set &data; if _i^=&sample then output; end; stop; run; %if &syserr>4 %then %goto exit; %analyze(data=_TMPD_,out=_TMPS_); %if &syserr>4 %then %goto exit; data _TMPS_; set _TMPS_; _sample_=&sample; run; %if &syserr>4 %then %goto exit; proc append data=_TMPS_ base=JACKDIST; run; %if &syserr>4 %then %goto exit; %end; %exit:; options notes; %mend jackslow; %******************************* JACKSE *******************************; %macro jackse( /* Jackknife estimates of standard error, bias, and normal confidence intervals */ stat=, id=, alpha=.05, biascorr=1, print=1 ); %global _jackdat; %if %bquote(&_jackdat)= %then %do; %put ERROR in JACKSE: You must run JACK before JACKSE; %goto exit; %end; %if %bquote(&alpha)^= %then %do; *** compute confidence level; %local conf; data _null_; conf=100*(1-&alpha); call symput('conf',trim(left(put(conf,best8.)))); run; %end; %if %bquote(&id)^= %then %do; *** sort the actual statistics; proc sort data=JACKACT; by &id; run; %if &syserr>4 %then %goto exit; %end; *** transpose the actual statistics in each observation; proc transpose data=JACKACT out=JACKACT2 prefix=value; %if %bquote(&stat)^= %then %do; var &stat; %end; %if %bquote(&id)^= %then %do; by &id; %end; run; %if &syserr>4 %then %goto exit; proc sort data=JACKACT2; by %if %bquote(&id)^= %then &id; _name_ ; run; %if &syserr>4 %then %goto exit; %if %bquote(&id)^= %then %do; proc sort data=JACKDIST; by &id; run; %if &syserr>4 %then %goto exit; %end; *** compute mean, std, min, max of resampling distribution; proc means data=JACKDIST(drop=_sample_) noprint vardef=n; %if %bquote(&stat)^= %then %do; var &stat; %end; output out=JACKTMP2(drop=_type_ _freq_); %if %bquote(&id)^= %then %do; by &id; %end; run; %if &syserr>4 %then %goto exit; *** transpose statistics for resampling distribution; proc transpose data=JACKTMP2 out=JACKTMP3; %if %bquote(&stat)^= %then %do; var &stat; %end; id _stat_; %if %bquote(&id)^= %then %do; by &id; %end; run; %if &syserr>4 %then %goto exit; proc sort data=JACKTMP3; by %if %bquote(&id)^= %then &id; _name_ ; run; %if &syserr>4 %then %goto exit; data JACKSTAT; retain &id name value jackmean %if &biascorr %then bias; stderr %if %bquote(&alpha)^= %then alcl; %if &biascorr %then biasco; %if %bquote(&alpha)^= %then aucl confid method; min max n; merge JACKACT2(rename=(_name_=name value1=value)) JACKTMP3(rename=(_name_=name mean=jackmean std=stderr)); by %if %bquote(&id)^= %then &id; name; %if %bquote(&alpha)^= %then %do; length method $20; retain z; drop z; if _n_=1 then do; z=probit(1-&alpha/2); put z=; confid=&conf; method='Jackknife'; end; %end; stderr=stderr*sqrt(&nobs-1); %if &biascorr %then %do; bias=(jackmean-value)*(&nobs-1); biasco=value-bias; %if %bquote(&alpha)^= %then %do; alcl=biasco-z*stderr; aucl=biasco+z*stderr; %end; %end; %else %if %bquote(&alpha)^= %then %do; alcl=value-z*stderr; aucl=value+z*stderr; %end; label name ='Name' value ='Observed Statistic' jackmean='Jackknife Mean' %if &biascorr %then %do; bias ='Estimated Bias' biasco='Bias-Corrected Statistic' %end; stderr='Estimated Standard Error' %if %bquote(&alpha)^= %then %do; alcl ='Estimated Lower Confidence Limit' aucl ='Estimated Upper Confidence Limit' method='Method for Confidence Interval' confid='Confidence Level (%)' %end; min ='Minimum Resampled Estimate' max ='Maximum Resampled Estimate' n ='Number of Resamples' ; run; %if &syserr>4 %then %goto exit; %if &print %then %do; proc print data=JACKSTAT label; id %if %bquote(&id)^= %then &id; name; run; %end; %exit:; %mend jackse; %******************************* BOOT *******************************; %macro boot( /* Bootstrap resampling analysis */ data=, /* Input data set, not a view or a tape file. */ samples=200, /* Number of resamples to generate. */ residual=, /* Name of variable in the input data set that contains residuals; may not be used with SIZE= */ equation=, /* Equation (in the form of an assignment statement) for computing the response variable */ size=, /* Size of each resample; default is size of the input data set. The SIZE= argument may not be used with BALANCED=1 or with a nonblank value for RESIDUAL= */ balanced=, /* 1 for balanced resampling; 0 for uniform resampling. By default, balanced resampling is used unless the SIZE= argument is specified, in which case uniform resampling is used. */ random=0, /* Seed for pseudorandom numbers. */ stat=_numeric_,/* Numeric variables in the OUT= data set created by the %ANALYZE macro that contain the values of statistics for which you want to compute bootstrap distributions. */ id=, /* One or more numeric or character variables that uniquely identify the observations of the OUT= data set within each BY group. No ID variables are needed if the OUT= data set has only one observation per BY group. The ID variables may not be named _TYPE_, _NAME_, or _STAT_ */ biascorr=1, /* 1 for bias correction; 0 otherwise */ alpha=.05, /* significance (i.e., one minus confidence) level for confidence intervals; blank to suppress normal confidence intervals */ print=1, /* 1 to print the bootstrap estimates; 0 otherwise. */ chart=1 /* 1 to chart the bootstrap resampling distributions; 0 otherwise. */ ); %if %bquote(&data)= %then %do; %put ERROR in BOOT: The DATA= argument must be specified.; %goto exit; %end; %global _bootdat; %let _bootdat=&data; %local by useby; %let useby=0; %global usevardf vardef; %let usevardf=0; *** compute the actual values of the statistics; %let vardef=DF; %let by=; %analyze(data=&data,out=_ACTUAL_); %if &syserr>4 %then %goto exit; *** compute plug-in estimates; %if &usevardf %then %do; %let vardef=N; %analyze(data=&data,out=_PLUGIN_); %let vardef=DF; %if &syserr>4 %then %goto exit; %end; %if &useby=0 %then %let balanced=0; %if %bquote(&size)^= %then %do; %if %bquote(&balanced)= %then %let balanced=0; %else %if &balanced %then %do; %put %cmpres(ERROR in BOOT: The SIZE= argument may not be used with BALANCED=1.); %goto exit; %end; %if %bquote(&residual)^= %then %do; %put %cmpres(ERROR in BOOT: The SIZE= argument may not be used with RESIDUAL=.); %goto exit; %end; %end; %else %if %bquote(&balanced)= %then %let balanced=1; *** find number of observations in the input data set; %global _nobs; data _null_; call symput('_nobs',trim(left(put(_nobs,12.)))); if 0 then set &data nobs=_nobs; stop; run; %if &syserr>4 %then %goto exit; %if &balanced %then %bootbal(data=&data,samples=&samples, random=&random,print=0); %else %if &useby %then %bootby(data=&data,samples=&samples, random=&random,size=&size,print=0); %if &syserr>4 %then %goto exit; %if &balanced | &useby %then %do; %let by=_sample_; %analyze(data=BOOTDATA,out=BOOTDIST); %end; %else %bootslow(data=&data,samples=&samples, random=&random,size=&size); %if &syserr>4 %then %goto exit; %if &chart %then %do; %if %bquote(&id)^= %then %do; proc sort data=BOOTDIST; by &id; run; proc chart data=BOOTDIST(drop=_sample_); vbar &stat; by &id; run; %end; %else %do; proc chart data=BOOTDIST(drop=_sample_); vbar &stat; run; %end; %end; %bootse(stat=&stat,id=&id,alpha=&alpha,biascorr=&biascorr,print=&print) %exit:; %mend boot; %macro bootbal( /* Balanced bootstrap resampling */ data=&_bootdat, samples=200, random=0, print=0, ); * Gleason, J.R. (1988) "Algorithms for balanced bootstrap simulations," American Statistician, 42, 263-266; data BOOTDATA/view=BOOTDATA; %bootin; drop _a _cbig _ii _j _jbig _k _s; array _c(&_nobs) _temporary_; /* cell counts */ array _p(&_nobs) _temporary_; /* pointers */ do _j=1 to &_nobs; _c(_j)=&samples; end; do _j=1 to &_nobs; _p(_j)=_j; end; _k=&_nobs; /* number of nonempty cells left */ _jbig=_k; /* index of largest cell */ _cbig=&samples; /* _cbig >= _c(_j) */ do _sample_=1 to &samples; do _i=1 to &_nobs; do until(_s<=_c(_j)); _j=ceil(ranuni(&random)*_k); /* choose a cell */ _s=ceil(ranuni(&random)*_cbig); /* accept cell? */ end; _l=_p(_j); _obs_=_l; _c(_j)+-1; * put _sample_= _i= _k= _l= @30 %do i=1 %to &_nobs; _c(&i) %end;; if _j=_jbig then do; _a=floor((&samples-_sample_-_k)/_k); if _cbig-_c(_j)>_a then do; do _ii=1 to _k; if _c(_ii)>_c(_jbig) then _jbig=_ii; end; _cbig=_c(_jbig); end; end; if _c(_j)=0 then do; if _jbig=_k then _jbig=_j; _p(_j)=_p(_k); _c(_j)=_c(_k); _k+-1; end; %bootout(_l); end; end; stop; run; %if &syserr>4 %then %goto exit; %if &print %then %do; proc print data=BOOTDATA; id _sample_ _obs_; run; %end; %exit:; %mend bootbal; %macro bootby( /* Uniform bootstrap resampling */ data=&_bootdat, samples=200, random=0, size=, print=0 ); %if %bquote(&size)= %then %let size=&_nobs; data BOOTDATA/view=BOOTDATA; %bootin; do _sample_=1 to &samples; do _i=1 to &size; _p=ceil(ranuni(&random)*&_nobs); _obs_=_p; %bootout(_p); end; end; stop; run; %if &syserr>4 %then %goto exit; %if &print %then %do; proc print data=BOOTDATA; id _sample_ _obs_; run; %end; %exit:; %mend bootby; %macro bootslow( /* Uniform bootstrap resampling and analysis without BY processing */ data=&_bootdat, samples=20, random=0, size= ); %put %cmpres(WARNING: Bootstrap analysis will be slow because the ANALYZE macro did not use the BYSTMT macro.); %if %bquote(&size)= %then %let size=&_nobs; data BOOTDIST; set _ACTUAL_; _sample_=0; delete; run; options nonotes; %local sample; %do sample=1 %to &samples; %put Bootstrap sample &sample; data _TMPD_; %bootin; do _i=1 to &size; _p=ceil(ranuni(%eval(&random+&sample))*&_nobs); %bootout(_p); end; stop; run; %if &syserr>4 %then %goto exit; %analyze(data=_TMPD_,out=_TMPS_); %if &syserr>4 %then %goto exit; data _TMPS_; set _TMPS_; _sample_=&sample; run; %if &syserr>4 %then %goto exit; proc append data=_TMPS_ base=BOOTDIST; run; %if &syserr>4 %then %goto exit; %end; %exit:; options notes; %mend bootslow; %******************************* BOOTSE *******************************; %macro bootse( /* Bootstrap estimates of standard error, bias, and normal confidence intervals */ stat=, id=, alpha=.05, biascorr=1, print=1 ); %global _bootdat; %if %bquote(&_bootdat)= %then %do; %put ERROR in BOOTSE: You must run BOOT before BOOTSE; %goto exit; %end; %if %bquote(&alpha)^= %then %do; *** compute confidence level; %local conf; data _null_; conf=100*(1-&alpha); call symput('conf',trim(left(put(conf,best8.)))); run; %end; %if %bquote(&id)^= %then %do; *** sort the actual statistics; proc sort data=_ACTUAL_; by &id; run; %if &syserr>4 %then %goto exit; %if &usevardf %then %do; *** sort the plug-in estimates; proc sort data=_PLUGIN_; by &id; run; %if &syserr>4 %then %goto exit; %end; %end; *** transpose the actual statistics in each observation; proc transpose data=_ACTUAL_ out=_ACTTR_ prefix=value; %if %bquote(&stat)^= %then %do; var &stat; %end; %if %bquote(&id)^= %then %do; by &id; %end; run; %if &syserr>4 %then %goto exit; proc sort data=_ACTTR_; by %if %bquote(&id)^= %then &id; _name_ ; run; %if &syserr>4 %then %goto exit; %if &usevardf %then %do; *** transpose the plug-in estimates in each observation; proc transpose data=_PLUGIN_ out=_PLUGTR_ prefix=value; %if %bquote(&stat)^= %then %do; var &stat; %end; %if %bquote(&id)^= %then %do; by &id; %end; run; %if &syserr>4 %then %goto exit; proc sort data=_PLUGTR_; by %if %bquote(&id)^= %then &id; _name_ ; run; %if &syserr>4 %then %goto exit; %end; %if %bquote(&id)^= %then %do; proc sort data=BOOTDIST; by &id; run; %if &syserr>4 %then %goto exit; %end; *** compute mean, std, min, max of resampling distribution; proc means data=BOOTDIST(drop=_sample_) noprint; %if %bquote(&stat)^= %then %do; var &stat; %end; output out=_TMP2_(drop=_type_ _freq_); %if %bquote(&id)^= %then %do; by &id; %end; run; %if &syserr>4 %then %goto exit; *** transpose statistics for resampling distribution; proc transpose data=_TMP2_ out=_TMP3_; %if %bquote(&stat)^= %then %do; var &stat; %end; id _stat_; %if %bquote(&id)^= %then %do; by &id; %end; run; %if &syserr>4 %then %goto exit; proc sort data=_TMP3_; by %if %bquote(&id)^= %then &id; _name_ ; run; %if &syserr>4 %then %goto exit; data BOOTSTAT; retain &id name value bootmean %if &biascorr %then bias; stderr %if %bquote(&alpha)^= %then alcl; %if &biascorr %then biasco; %if %bquote(&alpha)^= %then aucl confid method; min max n; merge _ACTTR_(rename=(_name_=name value1=value)) %if &usevardf %then _PLUGTR_(rename=(_name_=name value1=plugin)); _TMP3_(rename=(_name_=name mean=bootmean std=stderr)); by %if %bquote(&id)^= %then &id; name; %if %bquote(&alpha)^= %then %do; length method $20; retain z; drop z; if _n_=1 then do; z=probit(1-&alpha/2); put z=; confid=&conf; method='Bootstrap Normal'; end; %end; %if &biascorr %then %do; bias=bootmean-%if &usevardf %then plugin; %else value;; biasco=value-bias; %if %bquote(&alpha)^= %then %do; alcl=biasco-z*stderr; aucl=biasco+z*stderr; %end; %end; %else %if %bquote(&alpha)^= %then %do; alcl=value-z*stderr; aucl=value+z*stderr; %end; label name ='Name' value ='Observed Statistic' bootmean='Bootstrap Mean' %if &usevardf %then %do; plugin='Plug-In Estimate' %end; %if &biascorr %then %do; bias ='Approximate Bias' biasco='Bias-Corrected Statistic' %end; stderr='Approximate Standard Error' %if %bquote(&alpha)^= %then %do; alcl ='Approximate Lower Confidence Limit' aucl ='Approximate Upper Confidence Limit' confid='Confidence Level (%)' method='Method for Confidence Interval' %end; min ='Minimum Resampled Estimate' max ='Maximum Resampled Estimate' n ='Number of Resamples' ; run; %if &syserr>4 %then %goto exit; %if &print %then %do; proc print data=BOOTSTAT label; id %if %bquote(&id)^= %then &id; name; run; %end; %exit:; %mend bootse; %******************************* BOOTCI *******************************; %macro bootci( /* Bootstrap percentile-based confidence intervals. Creates output data set BOOTCI. */ method, /* One of the following methods must be specified: PERCENTILE or PCTL HYBRID T BC BCA Requires the %JACK macro */ stat=, /* Numeric variables in the OUT= data set created by the %ANALYZE macro that contain the values of statistics for which you want to compute bootstrap distributions. */ student=, /* For the T method only, numeric variables in the OUT= data set created by the %ANALYZE macro that contain the standard errors of the statistics for which you want to compute bootstrap distributions. There must be a one-to-one between the VAR= variables and the STUDENT= variables */ id=, /* One or more numeric or character variables that uniquely identify the observations of the OUT= data set within each BY group. No ID variables are needed if the OUT= data set has only one observation per BY group. The ID variables may not be named _TYPE_, _NAME_, or _STAT_ */ alpha=.05, /* significance (i.e., one minus confidence) level for confidence intervals */ print=1); /* 1 to print the bootstrap confidence intervals; 0 otherwise. */ %global _bootdat; %if %bquote(&_bootdat)= %then %do; %put ERROR in BOOTCI: You must run BOOT before BOOTCI; %goto exit; %end; *** check method; data _null_; length method $10; method=upcase(symget('method')); if method=' ' then do; put 'ERROR in BOOTCI: You must specify one of the methods ' 'PCTL, HYBRID, T, BC or BCa'; abort; end; else if method='PERCENTILE' then method='PCTL'; else if method not in ('PCTL' 'HYBRID' 'BC' 'BCA' 'T') then do; put "ERROR in BOOTCI: Unrecognized method '" method "'"; abort; end; call symput('qmethod',method); run; %if &syserr>4 %then %goto exit; %if &qmethod=T %then %do; %if %bquote(&stat)= | %bquote(&student)= %then %do; data _null_; put 'ERROR: VAR= and STUDENT= must be specified with the T method'; run; %goto exit; %end; %end; *** sort resampling distributions; %if %bquote(&id)^= %then %do; proc sort data=BOOTDIST; by &id _sample_; run; %if &syserr>4 %then %goto exit; %end; *** transpose resampling distributions; proc transpose data=BOOTDIST prefix=col out=BOOTTRAN(rename=(col1=value _name_=name)); %if %bquote(&stat)^= %then %do; var &stat; %end; by %if %bquote(&id)^= %then &id; _sample_; run; %if &syserr>4 %then %goto exit; %if &qmethod=T %then %do; *** transpose studentizing statistics; proc transpose data=BOOTDIST prefix=col out=BOOTSTUD(rename=(col1=student _name_=studname)); var &student; by %if %bquote(&id)^= %then &id; _sample_; run; %if &syserr>4 %then %goto exit; data BOOTTRAN; merge BOOTTRAN BOOTSTUD; label student='Value of Studentizing Statistic' studname='Name of Studentizing Statistic'; run; %if &syserr>4 %then %goto exit; %end; proc sort data=BOOTTRAN; by %if %bquote(&id)^= %then &id; name %if &qmethod=BC | &qmethod=BCA %then value; %else %if &qmethod=T %then _sample_; ; run; %if &syserr>4 %then %goto exit; %if &qmethod=T %then %do; *** transpose the actual statistics in each observation must get data set in unsorted order for merge; proc transpose data=_ACTUAL_ out=_ACTTR_ prefix=value; %if %bquote(&stat)^= %then %do; var &stat; %end; %if %bquote(&id)^= %then %do; by &id; %end; run; %if &syserr>4 %then %goto exit; *** transpose the actual studentizing statistics; proc transpose data=_ACTUAL_ prefix=col out=_ACTSTUD(rename=(_name_=studname col1=student)); var &student; %if %bquote(&id)^= %then %do; by &id; %end; run; %if &syserr>4 %then %goto exit; *** merge statistics with studentizing statistics; data _ACT_T_; merge _ACTTR_ _ACTSTUD; label student='Value of Studentizing Statistic' studname='Name of Studentizing Statistic'; run; %if &syserr>4 %then %goto exit; proc sort data=_ACT_T_; by %if %bquote(&id)^= %then &id; _name_ ; run; %if &syserr>4 %then %goto exit; data BOOTTRAN; merge BOOTTRAN _ACT_T_(rename=(_name_=name)); by %if %bquote(&id)^= %then &id; name ; value=(value-value1)/student; run; %if &syserr>4 %then %goto exit; %end; %if &qmethod=BC | &qmethod=BCA %then %do; %if &qmethod=BCA %then %do; %global _jackdat; %if %bquote(&_jackdat)^=%bquote(&_bootdat) %then %do; %jack(data=&_bootdat,stat=&stat,id=&id,alpha=&alpha, chart=0,print=&print); %if &syserr>4 %then %goto exit; %end; *** estimate acceleration for BCa; proc means data=JACKDIST noprint vardef=df; %if %bquote(&stat)^= %then %do; var &stat; %end; output out=JACKSKEW(drop=_type_ _freq_ _sample_) skewness=; %if %bquote(&id)^= %then %do; by &id; %end; run; %if &syserr>4 %then %goto exit; *** transpose skewness; proc transpose data=JACKSKEW prefix=col out=_ACCEL_(rename=(col1=skewness _name_=name)); %if %bquote(&stat)^= %then %do; var &stat; %end; %if %bquote(&id)^= %then %do; by &id; %end; run; %if &syserr>4 %then %goto exit; proc sort data=_ACCEL_; by %if %bquote(&id)^= %then &id; name ; run; %if &syserr>4 %then %goto exit; %end; *** estimate median bias for BC; data _BC_; retain _alpha _conf; drop value value1; if _n_=1 then do; _alpha=α _conf=100*(1-_alpha); call symput('conf',trim(left(put(_conf,best8.)))); end; merge _ACTTR_(rename=(_name_=name)) BOOTTRAN; by %if %bquote(&id)^= %then &id; name; if first.name then do; n=0; _z0=0; end; n+1; _z0+(value<value1)+.5*(value=value1); if last.name then do; _z0=probit(_z0/n); output; end; run; %if &syserr>4 %then %goto exit; *** compute percentiles; data BOOTPCTL; retain _i _lo _up _nplo _jlo _glo _npup _jup _gup alcl aucl; drop _alpha _sample_ _conf _i _nplo _jlo _glo _npup _jup _gup value; merge BOOTTRAN _BC_ %if &qmethod=BCA %then _ACCEL_;; by %if %bquote(&id)^= %then &id; name; label _lo='Lower Percentile Point' _up='Upper Percentile Point' _z0='Bias Correction (Z0)'; if first.name then do; %if &qmethod=BC %then %do; _lo=probnorm(_z0+(_z0+probit(_alpha/2))); _up=probnorm(_z0+(_z0+probit(1-_alpha/2))); %end; %else %if &qmethod=BCA %then %do; drop skewness; retain _accel; label _accel='Acceleration'; _accel=skewness/(-6*sqrt(&_nobs))* (&_nobs-2)/&_nobs/sqrt((&_nobs-1)/&_nobs); _i=_z0+probit(_alpha/2); _lo=probnorm(_z0+_i/(1-_i*_accel)); _i=_z0+probit(1-_alpha/2); _up=probnorm(_z0+_i/(1-_i*_accel)); %end; _nplo=min(n-.5,max(.5,fuzz(n*_lo))); _jlo=floor(_nplo); _glo=_nplo-_jlo; _npup=min(n-.5,max(.5,fuzz(n*_up))); _jup=floor(_npup); _gup=_npup-_jup; _i=0; end; _i+1; if _glo then do; if _i=_jlo+1 then alcl=value; end; else do; if _i=_jlo then alcl=value; else if _i=_jlo+1 then alcl=(alcl+value)/2; end; if _gup then do; if _i=_jup+1 then aucl=value; end; else do; if _i=_jup then aucl=value; else if _i=_jup+1 then aucl=(aucl+value)/2; end; if last.name then do; output; end; run; %if &syserr>4 %then %goto exit; %end; %else %do; %local conf pctlpts pctlpre pctlname; %let pctlpre=a; %let pctlname=lcl ucl; data _null_; _alpha=α _conf=100*(1-_alpha); call symput('conf',trim(left(put(_conf,best8.)))); %if &qmethod=PCTL %then %do; _lo=_alpha/2; _up=1-_lo; %end; %else %if &qmethod=HYBRID | &qmethod=T %then %do; _up=_alpha/2; _lo=1-_up; %end; _lo=100*_lo; _up=100*_up; call symput('pctlpts',trim(left(put(_lo,best8.)))||' '|| trim(left(put(_up,best8.)))); run; %if &syserr>4 %then %goto exit; proc univariate data=BOOTTRAN noprint pctldef=5; var value; output out=BOOTPCTL n=n pctlpts=&pctlpts pctlpre=&pctlpre pctlname=&pctlname; by %if %bquote(&id)^= %then &id; name; run; %if &syserr>4 %then %goto exit; %end; data BOOTCI; retain &id name value alcl aucl confid method n; merge %if &qmethod=T %then _ACT_T_(rename=(_name_=name value1=value)); %else _ACTTR_(rename=(_name_=name value1=value)); BOOTPCTL; by %if %bquote(&id)^= %then &id; name; %if &qmethod=HYBRID %then %do; aucl=2*value-aucl; alcl=2*value-alcl; %end; %else %if &qmethod=T %then %do; aucl=value-aucl*student; alcl=value-alcl*student; %end; confid=&conf; length method $20; method='Bootstrap '||symget('method'); label name ='Name' value ='Observed Statistic' alcl ='Approximate Lower Confidence Limit' aucl ='Approximate Upper Confidence Limit' confid='Confidence Level (%)' method='Method for Confidence Interval' n ='Number of Resamples' ; run; %if &syserr>4 %then %goto exit; %if &print %then %do; proc print data=BOOTCI label; id %if %bquote(&id)^= %then &id; name; run; %end; %exit: %mend bootci; %******************************* ALLCI *******************************; %macro allci( /* Computes all types of confidence intervals available in BOOTCI. Creates output data set ALLCI. */ stat=, /* Numeric variables in the OUT= data set created by the %ANALYZE macro that contain the values of statistics for which you want to compute bootstrap distributions. */ student=, /* For the T method only, numeric variables in the OUT= data set created by the %ANALYZE macro that contain the standard errors of the statistics for which you want to compute bootstrap distributions. There must be a one-to-one between the VAR= variables and the STUDENT= variables */ id=, /* One or more numeric or character variables that uniquely identify the observations of the OUT= data set within each BY group. No ID variables are needed if the OUT= data set has only one observation per BY group. The ID variables may not be named _TYPE_, _NAME_, or _STAT_ */ alpha=.05, /* significance (i.e., one minus confidence) level for confidence intervals */ keep=, /* Variables to keep in the output data set containing the confidence intervals; can be used to avoid warnings from PROC TRANSPOSE */ print=1); /* 1 to print the bootstrap confidence intervals; 0 otherwise. */ %if %bquote(&keep)^= %then %let keep=(keep=&keep); %bootci(bca,stat=&stat,id=&id,alpha=&alpha,print=0) data ALLCI; set bootci&keep; run; %bootci(bc,stat=&stat,id=&id,alpha=&alpha,print=0) proc append data=bootci&keep base=ALLCI force; run; %bootci(pctl,stat=&stat,id=&id,alpha=&alpha,print=0) proc append data=bootci&keep base=ALLCI force; run; %bootci(hybrid,stat=&stat,id=&id,alpha=&alpha,print=0) proc append data=bootci&keep base=ALLCI force; run; %if %bquote(&student)^= %then %do; %bootci(t,stat=&stat,id=&id,student=&student,alpha=&alpha,print=0) proc append data=bootci&keep base=ALLCI force; run; %end; proc append data=bootstat&keep base=ALLCI force; run; proc append data=jackstat&keep base=ALLCI force; run; %if &print %then %do; proc print data=ALLCI label; id %if %bquote(&id)^= %then &id; name; run; %end; %mend allci; %macro bystmt; %let useby=1; by &by; %mend bystmt; %macro vardef; %let usevardf=1; vardef=&vardef %mend vardef; %macro bootin; /* INTERNAL USE ONLY input an observation from the original data set */ %if %bquote(&residual)^= %then %do; array _r(&_nobs) _temporary_; /* residuals */ do _i=1 to &_nobs; set &data point=_i; _r(_i)=&residual; end; %end; %else %do; drop _i; %end; %mend bootin; %macro bootout(obs); /* INTERNAL USE ONLY output an observation to the resampled data set */ %if %bquote(&residual)^= %then %do; set &data point=_i; &residual=_r(&obs); &equation; %end; %else %do; set &data point=&obs; %end; output; %mend bootout;