Info | ||
---|---|---|
| ||
This section provides an overview of the basic area-level model and some extensions.
|
Info | ||
---|---|---|
| ||
Basic backgroundArea-level small area estimation models are popular models in research and for official statistics due to their easy application and interpretation. In the following, basic ideas are described. For a deeper methodological introduction to the model, the reader is referred to standard literature. For this model type, area-level data is assumed, i.e., the design-unbiased direct estimate \(\hat{\theta}_{i}^{Dir}\) for the parameter of interest \(\theta_{i}\), which could be the income mean in domain \(i\) the sampling variance of this direct estimate, and auxiliary information are available at domain level. In the basic area-level model, it is assumed that the population value \(\theta_{i}\) differs from the direct estimate \(\hat{\theta}_{i}^{Dir}\) by a sampling error which leads to a sampling model \(\hat{\theta}^{Dir}_{i} = \theta_{i} + e_{i}, i = 1, ..., D,\) where \(\hat{\theta}_{i}^{Dir}\) is a design-unbiased direct estimator, and \(e_{i}\) are sampling errors with mean 0 and known variance \(\sigma^{2}_{e,i}\). A linking model expresses a linear relationship between the population value \(\theta_{i}\) and domain-level auxiliary information \(x\). The combination of the sampling and the linking model leads to a special linear mixed model \(\hat{\theta}^{Dir}_{i} = x_{i}^{\top} \beta + u_{i} + e_{i}, i = 1, ..., D\) where \(\beta\) is the vector of regression parameters and \(u_{i}\) are domain-specific random effects with mean 0 and variance \(\sigma^{2}_{u}\). The regression parameters and the random effects variance are generally unknown and need to be estimated. Common estimation approaches are empirical best linear unbiased prediction (EBLUP), empirical and hierarchical Bayesian methods. It is often assumed that both error terms are normally or at least symmetrically distributed around 0. The empirical best linear unbiased predictor can also be expressed as a weighted average of the direct estimate and the regression-synthetic part \(\hat{\theta}^{EBLUP}_{i} = \hat{\gamma}_{i} \hat{\theta}^{Dir}_{i} + (1 - \hat{\gamma}_{i}) x_{i}^{\top} \hat{\beta} \) where \(\hat{\beta}\) is the weighted least squares estimator of the regression parameters and \(\hat{\gamma}_{i} = \frac{\hat{\sigma}_{u}}{\hat{\sigma}_{u} + \sigma_{e, i}}\) denotes the shrinkage factor for domain \(i\) which weights the direct estimate \(\hat{\theta}_{d}^{Dir}\) and the regression-synthetic part \(x_{i}^{\top} \hat{\beta}\). With increasing sampling variance \(\sigma^{2}_{e,i}\), the weight on the direct estimate decreases. Since the direct estimates are only available for sampled domains, the non-sampled or out-of-sample domains are predicted with the regression-synthetic part \(\hat{\theta}^{out}_{i} = x_{i}^{\top} \hat{\beta}.\) |
Info | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||||
How to applyBesides a basic understanding of the methodology, the application of approaches with statistical software is of interest for practitioners. While the inputs partly depend on the chosen software packages, some common data inputs are described. Inputs The starting point of the application is area-level data including a design-unbiased direct estimate at domain level, the sampling variance and predictors at domain level. In the presence of non-sampled domains, these domains will not have a direct estimate but the predictors need to be available for all domains for which estimates are required, sampled and non-sampled. While the sampling variance is assumed to be known in the model, it often needs to be estimated in practical applications. To estimate the sampling error variance, unit-level survey data usually needs to be available. For the other necessary arguments as e.g., the estimation method of the model parameters, defaults (predefined values) are often set to simplify the application. Outputs The return of all software packages are domain-specific estimates, Furthermore, all software packages provide an estimation of the mean squared error for each of the domain-specific estimates.. Implementation The table below gives a first overview of modules in standard software that enable the application of the basic area-level model. Additional to the basic area-level model, most modules/packages also have a specific focus like robust parameter estimation or the consideration of spatial-correlation among domains. Some of these extensions are described below. Availability of the basic area-level model in standard statistical software
|
Info | ||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||||||||||||||||||||||||||||||||||
Estimation of sampling error variance Anchor |
In the basic area-level model, the sampling error variance, \(\sigma^{2}_{e,i}\), is assumed to be known. However, in most practical applications this assumption is not fulfilled and the variance needs to be estimated. When the unit-level data is available, one way could be the direct estimation of the variance of the direct estimator, \(var(\hat{\theta}^{Dir}_{i})\), with analytical or replicate methods: \(\hat{\sigma}^{2}_{e,i} = var\left(\hat{\theta}^{Dir}_{i}\right).\) However, the variance estimates may not be stable due to the small sample sizes in areas or domains of interest. Therefore, the smoothing of \(\hat{\sigma}^{2}_{e,i}\) by generalized variance functions (GVF) is used in several applications. The higher efficiency of the estimator is obtained by smoothing models as follows: \(\log\left[var\left(\hat{\theta}^{Dir}_{i}\right)\right] = x_{i}^{v\top} \beta^{v} + \epsilon^{v}_{i} \quad \text{with} \quad \epsilon^{v}_{i} \sim N(0, \sigma_{\epsilon^{v}}^{2}).\) where \(x_{i}^{v\top}\) are auxiliary variables, \(\beta^{v}\) is the vector of regression parameters and \(\epsilon^{v}_{i}\) are independent errors. Based on the estimated parameters, the estimate for the sampling error variance is obtained by \(\hat{\sigma}^{2}_{e,i} = \exp\left[x_{i}^{v\top} \hat{\beta}^{v}\right] \Delta.\) where \(\Delta\) is a bias-correction factor, which could be specified in different ways. Fúquene et al. (2019) use a correction which relies on the assumption that the errors are normally distributed. However, Hidiroglou et al. (2019) observe empirically that the resulting estimator of \(\Delta\) is sensitive to deviations from the normality assumption. Therefore, Hidiroglou et al. (2019) propose a correction factor using a method of moments that does not rely on the normality assumption. Disregarding the correction factor could lead to an underestimation of the domain variances. When transformations are used for the direct estimate, the variance of the transformed direct estimate is often approximated by large sample theories. Two common transformations used in the area-level model are the log and the arcsin or inverse sine transformation. The log transformation may help to counteract missing linearity or normality on the error terms. When the logarithm of the direct estimate is modeled, also the (estimated) sampling error variance needs to be adjusted. One option is the following: \(\hat{\theta}^{Dir, log}_{i} = \log\left(\hat{\theta}^{Dir}_{i}\right) \quad \text{and} \quad \hat{\sigma}^{2, log}_{e,i} = \left(\hat{\theta}^{Dir}_{i} \right)^{-2}\hat{\sigma}^{2}_{e,i}.\) The arcsin transformation is commonly applied in practice to model ratios or proportions. The direct estimate is transformed with the arcsin transformation, while the sampling error variance is approximated with the effective sample size \(\tilde{n}\) as follows: \(\hat{\theta}^{Dir, arcsin}_{i} = \sin^{-1}\left(\sqrt{\hat{\theta}^{Dir}_{i}}\right) \quad \text{and} \quad \hat{\sigma}^{2, arcsin}_{e,i} = 1 / 4\tilde{n}_{i}.\) where \(\tilde{n}_{i}\) is the effective sample size in domain \(i\) calculated by dividing the domain sample size by the design effect. Various approaches to estimate the sampling error variance for the area-level model
All of the above approaches estimate the sampling error variance and take the resulting estimate as known in the model which could lead to an underestimation of the uncertainty. Thus, You and Chapman (2006) propose a Bayesian approach that considers that the sampling error variance is estimated. According to Bell (2008), two general rules can be derived from a sensitivity analysis of the estimation of the sampling error variance:
Whenever one of the above approaches leads to reasonable estimates in most domains but unreasonable results in some domains, these domains could also be excluded from the modelling process and be treated as non-sampled domains. Availability of adjustments regarding the sampling error variance in standard statistical software
Please note that the transformations also apply to the direct estimate. However, it is addressed in this section on how the sampling error variance is estimated since the usage of a transformation in the area-level model always applies to both, the direct estimate and its variance estimator. |
Info | ||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||||||||||||||||||||||||||||||
Estimation of model parameters Anchor |
The model parameters to be estimated in the basic area-level model are the domain random effects variance \(\sigma_{u}^{2}\) and the fixed-effects parameters \(\beta). Common estimation approaches are empirical best linear unbiased prediction (EBLUP), empirical (EB) and hierarchical Bayesian (HB) methods. In practice, EBLUP and HB are usually used. The following table briefly shows the distinctions between the approaches but for detailed information, please see the literature recommendations. EBLUP vs. Empirical Bayes vs. Hierarchical Bayes (based on Ghosh and Rao (1994))
Since the estimation approaches for area-level models implemented in standard statistical software are EBLUP and HB, the following is only focused on these two approaches. For the estimation of the unknown model variance, standard EBLUP approaches are the Fay-Herriot moment estimator (FH), maximum likelihood (ML) and residual maximum likelihood (REML) estimator. In contrast to the likelihood approaches, the Fay-Herriot moment estimator does not rely on a distributional assumption. Following Rao and Molina (2015), the likelihood estimators are more efficient than the Fay-Herriot moment estimator. Thus, REML is often set as a default when fitting the basic model. The estimation with these approaches could, however, lead to a negative variance estimate which would be set to zero. To counteract this possibility, adjusted likelihood methods are developed that ensure a positive variance estimation (Li and Lahiri 2010, Yoshimori and Lahiri 2014). The fixed-effects parameters are obtained by weighted least squares estimates that incorporate the estimated random effects variance. The HB approach requires the specification of a subjective prior on the model parameters, namely the domain random effects variance \(\sigma_{u}^{2}\) and the fixed-effects parameters \(\beta\). The prior can be informative or non-informative. Prior information could be derived from expert knowledge or relevant previous studies. While the HB approach is straightforward, the prior selection should be careful since improper priors could lead to improper posteriors, and thus SAE estimates. Availability of the different estimation approaches in standard statistical software
|
Info | ||||||
---|---|---|---|---|---|---|
| ||||||
Uncertainty measurement
All software packages report estimates for the uncertainty along with the SAE domain estimates. For the EBLUP approaches, estimators of the mean squared error (MSE) have either analytical formulations or are derived by replication methods. The analytical approaches depend on the chosen estimation approach for \(\sigma_{u}^{2}\). The flow charts in the vignette of the R package emdi give a good overview of MSEs selected to the corresponding estimation methods. Returned MSE estimator depending on chosen estimation method (see Harmening et al. 2020) For the HB approach, the uncertainty is obtained by the posterior variance as an estimator for the MSE. |
Info | ||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||||||||||||||||||||||||||||||||||||||||
Extensions Anchor |
Additional to the basic area-level model, some extensions are implemented in standard software. In the following, the ideas behind the extensions are briefly described and the table gives an indication of which packages provide the extension. For detailed information about the methodology, please see the original reference or the package descriptions/vignettes.
Overview of availability of the area-level model in statistical software (not comprehensive)
|
Info | ||
---|---|---|
| ||
ReferencesMarhuenda, Y., Molina, I. and Morales, D. (2013). Small area estimation with spatio-temporal Fay-Herriot models, Computational Statistics and Data Analysis, 58, 308–325. Rao, J.N.K. and Molina, I. (2015). Small Area Estimation. New York: Wiley. |