Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Info
iconfalse

This section provides an overview of the basic area-level model and some extensions. 

Table of Contents

Info
iconfalse

Basic background

Area-level small area estimation models are popular models in research and for official statistics due to their easy application and interpretation. In the following, basic ideas are described. For a deeper methodological introduction to the model, the reader is referred to standard literature

For this model type, area-level data is assumed, i.e., the design-unbiased direct estimate \(\hat{\theta}_{i}^{Dir}\) for the parameter of interest \(\theta_{i}\), which could be the income mean in domain \(i\) the sampling variance of this direct estimate, and auxiliary information are available at domain level. 

In the basic area-level model, it is assumed that the population value \(\theta_{i}\) differs from the direct estimate \(\hat{\theta}_{i}^{Dir}\) by a sampling error which leads to a sampling model

\(\hat{\theta}^{Dir}_{i} = \theta_{i} + e_{i}, i = 1, ..., D,\)

where \(\hat{\theta}_{i}^{Dir}\) is a design-unbiased direct estimator, and \(e_{i}\) are sampling errors with mean 0 and known variance \(\sigma^{2}_{e,i}\).

A linking model expresses a linear relationship between the population value \(\theta_{i}\) and domain-level auxiliary information \(x\).  The combination of the sampling and the linking model leads to a special linear mixed model

\(\hat{\theta}^{Dir}_{i} = x_{i}^{\top} \beta + u_{i} + e_{i}, i = 1, ..., D\)

where \(\beta\) is the vector of regression parameters and \(u_{i}\) are domain-specific random effects with mean 0 and variance \(\sigma^{2}_{u}\). The regression parameters and the random effects variance are generally unknown and need to be estimated. Common estimation approaches are empirical best linear unbiased prediction (EBLUP), empirical and hierarchical Bayesian methods. It is often assumed that both error terms are normally or at least symmetrically distributed around 0.

The empirical best linear unbiased predictor can also be expressed as a weighted average of the direct estimate and the regression-synthetic part

\(\hat{\theta}^{EBLUP}_{i} = \hat{\gamma}_{i} \hat{\theta}^{Dir}_{i} + (1 - \hat{\gamma}_{i}) x_{i}^{\top} \hat{\beta} \)

where \(\hat{\beta}\) is the weighted least squares estimator of the regression parameters and \(\hat{\gamma}_{i} = \frac{\hat{\sigma}_{u}}{\hat{\sigma}_{u} + \sigma_{e, i}}\) denotes the shrinkage factor for domain \(i\) which weights the direct estimate \(\hat{\theta}_{d}^{Dir}\) and the regression-synthetic part \(x_{i}^{\top} \hat{\beta}\). With increasing sampling variance \(\sigma^{2}_{e,i}\), the weight on the direct estimate decreases.

Since the direct estimates are only available for sampled domains, the non-sampled or out-of-sample domains are predicted with the regression-synthetic part

\(\hat{\theta}^{out}_{i} = x_{i}^{\top} \hat{\beta}.\)

Info
iconfalse

How to apply

Besides a basic understanding of the methodology, the application of approaches with statistical software is of interest for practitioners. While the inputs partly depend on the chosen software packages, some common data inputs are described.

Inputs

The starting point of the application is area-level data including a design-unbiased direct estimate at domain level, the sampling variance and predictors at domain level. In the presence of non-sampled domains, these domains will not have a direct estimate but the predictors need to be available for all domains for which estimates are required, sampled and non-sampled. While the sampling variance is assumed to be known in the model, it often needs to be estimated in practical applications. To estimate the sampling error variance, unit-level survey data usually needs to be available. For the other necessary arguments as e.g., the estimation method of the model parameters, defaults (predefined values) are often set to simplify the application.

Outputs

The return of all software packages are domain-specific estimates, Furthermore, all software packages provide an estimation of the mean squared error for each of the domain-specific estimates..

Implementation 

The table below gives a first overview of modules in standard software that enable the application of the basic area-level model. Additional to the basic area-level model, most modules/packages also have a specific focus like robust parameter estimation or the consideration of spatial-correlation among domains. Some of these extensions are described below.

Availability of the basic area-level model in standard statistical software

ModelCommentRStataSAS                    Python
Basic modelDerived for means and totalssae, rsae, emdi, JoSAE, hbsae, BayesSAEfayherriot, FHSAE

Prototype by Statistics Canada, Mukhopadhyay and McDowell

samplics
Info
iconfalse

Estimation of sampling error variance
Anchor
samplingVariance
samplingVariance

In the basic area-level model, the sampling error variance, \(\sigma^{2}_{e,i}\), is assumed to be known. However, in most practical applications this assumption is not fulfilled and the variance needs to be estimated.  When the unit-level data is available, one way could be the direct estimation of the variance of the direct estimator, \(var(\hat{\theta}^{Dir}_{i})\), with analytical or replicate methods:

\(\hat{\sigma}^{2}_{e,i} = var\left(\hat{\theta}^{Dir}_{i}\right).\)

However, the variance estimates may not be stable due to the small sample sizes in areas or domains of interest. Therefore, the smoothing of \(\hat{\sigma}^{2}_{e,i}\) by generalized variance functions (GVF) is used in several applications. The higher efficiency of the estimator is obtained by smoothing models as follows:

\(\log\left[var\left(\hat{\theta}^{Dir}_{i}\right)\right] = x_{i}^{v\top} \beta^{v} + \epsilon^{v}_{i} \quad \text{with} \quad \epsilon^{v}_{i} \sim N(0, \sigma_{\epsilon^{v}}^{2}).\)

where \(x_{i}^{v\top}\) are auxiliary variables, \(\beta^{v}\) is the vector of regression parameters and \(\epsilon^{v}_{i}\) are independent errors. Based on the estimated parameters, the estimate for the sampling error variance is obtained by

\(\hat{\sigma}^{2}_{e,i} = \exp\left[x_{i}^{v\top} \hat{\beta}^{v}\right] \Delta.\)

where \(\Delta\) is a bias-correction factor, which could be specified in different ways. Fúquene et al. (2019) use a correction which relies on the assumption that the errors are normally distributed. However, Hidiroglou et al. (2019) observe empirically that the resulting estimator of \(\Delta\) is sensitive to deviations from the normality assumption. Therefore, Hidiroglou et al. (2019) propose a correction factor using a method of moments that does not rely on the normality assumption. Disregarding the correction factor could lead to an underestimation of the domain variances.

When transformations are used for the direct estimate, the variance of the transformed direct estimate is often approximated by large sample theories. Two common transformations used in the area-level model are the log and the arcsin or inverse sine transformation. The log transformation may help to counteract missing linearity or normality on the error terms. When the logarithm of the direct estimate is modeled, also the (estimated) sampling error variance needs to be adjusted. One option is the following:

\(\hat{\theta}^{Dir, log}_{i} = \log\left(\hat{\theta}^{Dir}_{i}\right) \quad \text{and} \quad \hat{\sigma}^{2, log}_{e,i} = \left(\hat{\theta}^{Dir}_{i} \right)^{-2}\hat{\sigma}^{2}_{e,i}.\)

The arcsin transformation is commonly applied in practice to model ratios or proportions. The direct estimate is transformed with the arcsin transformation, while the sampling error variance is approximated with the effective sample size \(\tilde{n}\) as follows:

\(\hat{\theta}^{Dir, arcsin}_{i} = \sin^{-1}\left(\sqrt{\hat{\theta}^{Dir}_{i}}\right) \quad \text{and} \quad \hat{\sigma}^{2, arcsin}_{e,i} = 1 / 4\tilde{n}_{i}.\)

where \(\tilde{n}_{i}\) is the effective sample size in domain \(i\) calculated by dividing the domain sample size by the design effect.

Various approaches to estimate the sampling error variance for the area-level model

ApproachShort descriptionIssues
Analytical or replication methodDirect estimate of the variance of the direct estimator is obtained and taken as known.The estimate could be unstable and lead to a misleading uncertainty estimation. The approach gets generally more problematic, the smaller the domain sample.
Generalized variance functionsThe relationship between the direct variance and its expectation is described and predicted GVF variances are taken as known.
Suitable predictors are needed for a good prediction of the variance.
ApproximationsDirect variance is approximated and taken as known.Approximations are usually derived for large samples, but used in small domains.

All of the above approaches estimate the sampling error variance and take the resulting estimate as known in the model which could lead to an underestimation of the uncertainty. Thus, You and Chapman (2006) propose a Bayesian approach that considers that the sampling error variance is estimated.

According to Bell (2008), two general rules can be derived from a sensitivity analysis of the estimation of the sampling error variance:

  • Underestimation of the sampling error variance is more problematic when the variance of the domain random effect is relatively small to the estimated sampling error variance, which is the more likely case in an SAE application since the sampling error variance is assumed to be large in small domains.
  • Overestimation of the sampling error variance is more problematic when the domain variance of the random effect is relatively large compared to the estimated sampling error variance, which is less likely in an SAE application.

Whenever one of the above approaches leads to reasonable estimates in most domains but unreasonable results in some domains, these domains could also be excluded from the modelling process and be treated as non-sampled domains.

Availability of adjustments regarding the sampling error variance in standard statistical software

ModelComment                                          RStataSAS                    Python
With log transformationCounteracts non-linearity and non-normality of error terms.emdifayherriot



With arcsin transformationAdjustment for ratios and proportions.emdifayherriot

With unknown sampling varianceIncorporates additional uncertainty of the variance estimation.BayesSAE


Please note that the transformations also apply to the direct estimate. However, it is addressed in this section on how the sampling error variance is estimated since the usage of a transformation in the area-level model always applies to both, the direct estimate and its variance estimator. 

Info
iconfalse

Estimation of model parameters 
Anchor
areaModelParameter
areaModelParameter

The model parameters to be estimated in the basic area-level model are the domain random effects variance \(\sigma_{u}^{2}\) and the fixed-effects parameters \(\beta). Common estimation approaches are empirical best linear unbiased prediction (EBLUP), empirical (EB) and hierarchical Bayesian (HB) methods. In practice, EBLUP and HB are usually used. The following table briefly shows the distinctions between the approaches but for detailed information, please see the literature recommendations.

EBLUP vs. Empirical Bayes vs. Hierarchical Bayes (based on Ghosh and Rao (1994))

TypeComparison
EBLUP
  • Classical frequentist framework
  • Small area parameters can be expressed as linear combinations of fixed and random effects
  • Estimators minimize the mean squared error among the class of linear unbiased estimators of fixed parameters → Best linear unbiased predictor (BLUP)
  • Variance of random effects is assumed to be known but needs to be estimated in practical applications → Empirical best linear unbiased predictor (EBLUP)

Empirical Bayes

  • Posterior distribution of the domain parameters of interest given the data is obtained
  • Model parameters are assumed to be known
  • Parameters are estimated from the marginal distribution
  • Inferences are based on the estimated posterior distribution
Hierarchical Bayes
  • Prior distribution on model parameters is specified
  • Posterior distribution of the domain parameters of interest is obtained
  • Inferences are based on the posterior distribution → parameter of interest is estimated by posterior mean and the uncertainty is estimated by the posterior variance


Since the estimation approaches for area-level models implemented in standard statistical software are EBLUP and HB, the following is only focused on these two approaches.

For the estimation of the unknown model variance, standard EBLUP approaches are the Fay-Herriot moment estimator (FH), maximum likelihood (ML) and residual maximum likelihood (REML) estimator. In contrast to the likelihood approaches, the Fay-Herriot moment estimator does not rely on a distributional assumption. Following Rao and Molina (2015), the likelihood estimators are more efficient than the Fay-Herriot moment estimator. Thus, REML is often set as a default when fitting the basic model. The estimation with these approaches could, however, lead to a negative variance estimate which would be set to zero. To counteract this possibility, adjusted likelihood methods are developed that ensure a positive variance estimation (Li and Lahiri 2010, Yoshimori and Lahiri 2014). The fixed-effects parameters are obtained by weighted least squares estimates that incorporate the estimated random effects variance.

The HB approach requires the specification of a subjective prior on the model parameters, namely the domain random effects variance \(\sigma_{u}^{2}\) and the fixed-effects parameters \(\beta\). The prior can be informative or non-informative. Prior information could be derived from expert knowledge or relevant previous studies. While the HB approach is straightforward, the prior selection should be careful since improper priors could lead to improper posteriors, and thus SAE estimates.

Availability of the different estimation approaches in standard statistical software

ModelComment                                          RStataSAS                    Python
EBLUPE.g., FH, ML, REMLsae, rsae, emdi, JoSAEfayherriot, FHSAE

Prototype by Statistics Canada, Mukhopadhyay and McDowell

samplics
EBLUP (adjusted)Ensure positive variance estimates.emdifayherriotPrototype by Statistics Canada
HBPriors for model parameters need to be selected.hbsae, BayesSAE


Info
iconfalse

Uncertainty measurement
Anchor
areaLevelMSE
areaLevelMSE

All software packages report estimates for the uncertainty along with the SAE domain estimates. For the EBLUP approaches, estimators of the mean squared error (MSE) have either analytical formulations or are derived by replication methods. The analytical approaches depend on the chosen estimation approach for \(\sigma_{u}^{2}\). The flow charts in the vignette of the R package emdi give a good overview of MSEs selected to the corresponding estimation methods.

Returned MSE estimator depending on chosen estimation method (see Harmening et al. 2020)

For the HB approach, the uncertainty is obtained by the posterior variance as an estimator for the MSE.

Info
iconfalse

Extensions
Anchor
areaLevelExtensions
areaLevelExtensions

Additional to the basic area-level model, some extensions are implemented in standard software. In the following, the ideas behind the extensions are briefly described and the table gives an indication of which packages provide the extension. For detailed information about the methodology, please see the original reference or the package descriptions/vignettes.

  • Spatial correlation: The basic area-level model assumes that the domain random effects are independent. A potential spatial correlation between neighboring areas can be considered by a combination of a small area estimator and a simultaneously autoregressive (SAR) model. The method proposed by Petrucci and Salvati (2009) is an EBLUP estimator with spatially correlated random area effects using information from neighboring areas. Additional to the required input for a basic area-level model, a proximity matrix that describes the neighborhood structure of the areas needs to be specified.
  • Spatial and temporal correlation: When information for each domain is available for several time periods, it can be valuable to use the additional information over time. The spatio-temporal model proposed by Marhuenda et al. (2013) adds random effects for the time periods nested within domains to the spatial model. The time correlation can be captured by a first order autoregressive process (AR(1)). For all time periods, data needs to be available within each domain. Differences to the model input are the proximity matrix describing the neighborhood structure of the areas, the number of domains and the number of time periods.
  • Under heteroscedasticity: One common model assumption in the basic area-level models is the constant variance of the area random effects. When this assumption is violated, model-based methods may result in unreliable estimates. Thus, Breidenbach et al. (2018) propose a method that addresses heteroscedasticity in area-level model-based approaches.
  • Robust estimation: In the presence of influential outliers, robust estimation methods may help to reduce the influence of outlying observations on the estimation (see Schoch 2012 and Warhnholz 2016). For the estimation, a tuning constant k that regulates the degree of robustness needs to be specified. With k towards infinity, the predictions equal to the standard EBLUP.
  • With measurement errors: In case the additional data source is assumed to have measurement errors, e.g., when the data comes from a larger survey or big data sources instead of registers or a census, the measurement errors can be considered in the area-level model. Ybarra and Lohr (2008) propose a model that takes the additional variability of the second data source into account. For the usage of this model, variance-covariance matrices per domain have to be specified.
  • Unmatched models: The basic area-level model assumes a linear sampling model and linking model for the direct estimates \(\hat{\theta}\) and the parameters of interest \(\theta\), respectively. The model is unmatched if not \(\theta\) but \(h(\theta)\) is modeled with a linear function (see e.g., You and Rao 2002). This could be the case when the parameter of interest is, e.g., a proportion within the range of 0 and 1 or a count.

Overview of availability of the area-level model in statistical software (not comprehensive)


CommentsRStataSASPython
With spatial correlationConsiders correlation among domains.sae, emdi, BayesSAEFHSAE

With spatial and temporal correlation

Takes into account the spatial correlation between neighboring areas and also incorporates historical data.

saeFHSAE

Under heteroscedicityAddresses the potential heteroscedasticity in residual errors.JoSAE


Robust estimationDownweights influencing outliers.rsae, emdi


With measurement errorsAuxiliary data has known measurement errorsemdi


UnmatchedFor proportions, rates and counts.BayesSAE

Prototype by Statistics Canada,

Mukhopadhyay and McDowell


Info
iconfalse

References

Bell, W.R. (2008). Examining Sensitivity of Small Area Inferences to Uncertainty About Sampling Error Variances, Proceedings of the Survey Research Section, American Statistical Association, pp. 327–334.

Breidenbach, J., Magnussen, S., Rahlfa, J. and Astrupa, R. (2018). Unit-level and area-level small area estimation under heteroscedasticity using digital aerial photogrammetry data, Remote Sensing of Environment, 212, 199–211.

Fúquene, J., Cristancho, C., Ospina, M. and Morales, D. (2019). Prevalence of international migration: an alternative for small area estimation. Working paper.

Ghosh, M. and Rao, J. N. K. (1994). Small Area Estimation: An Appraisal, Statistical Science, 9(1), 55-76.

Harmening, S., Kreutzmann, A., Pannier, S. Salvati, N. and Schmid, T. (2020). A Framework for Producing Small Area Estimates Based on Area-Level Models in R.

Hidiroglou, M. A., Beaumont, J.-F. and Yung, W. (2019). Development of a small area estimation system at Statistics Canada, Survey Methodology, 45(1), 101-126.

Li, H. and Lahiri, P. (2010).  An Adjusted Maximum Likelihood Method for Solving Small Area Estimation Problems, Journal of Multivariate Analysis,101(4), 882–902.

Petrucci, A. and Salvati, N. (2006). Small area estimation for spatial correlation in watershed erosion assessment, Journal of Agricultural, Biological and Environmental Statistics, 11(2), 169–182.

Marhuenda, Y., Molina, I. and Morales, D. (2013). Small area estimation with spatio-temporal Fay-Herriot models, Computational Statistics and Data Analysis, 58, 308–325.

Rao, J.N.K. and Molina, I. (2015). Small Area Estimation. New York: Wiley.

Schoch, T. (2012). Robust Unit-Level Small Area Estimation: A Fast Algorithm for Large Dara Sets, Austrian Journal of Statistics, 41(4), 243-265.

Ybarra, L.M.R. and Lohr, S.L. (2008). Small Area Estimation When Auxiliary Information Is Measured with Error. Biometrika, 95(4), 919–931.

Yoshimori, M. and Lahiri, P. (2014). A New Adjusted Maximum Likelihood Method for the Fay-Herriot Small Area Model, Journal of Multivariate Analysis,124, 281–294.

You, Y. and Rao, J.N.K. (2002b), Small Area Estimation Using Unmatched Sampling and Linking Models, Canadian Journal of Statistics, 30, 3–15.

You, Y. and Chapman, B. (2006). Small Area Estimation Using Area Level Models and Estimated Sampling Variances, Survey Methodology, 32(1), 97­-103.

You, Y. and Rao, J.N.K. (2002). Small area estimation using unmatched sampling and linking models, The Canadian Journal of Statistics, 30(1), 3-15.

Warnholz, S (2016). Small Area Estimation Using Robust Extensions to Area Level Models. Ph.D. thesis, Freie Universität Berlin.