This section describes basic approaches for domain estimation and a typical notational framework.

From direct estimation to small area models

The options for estimating domain indicators are extensive. Traditionally, national statistical offices prefer design-based approaches to estimate population indicators, since valid inference requires no model assumptions and properties with respect to the survey design, such as design-unbiasedness, are known. These approaches include direct estimators, which apply weights to the sampled units belonging to the domains or small areas, and model-assisted estimators such as the generalized regression estimator (see Cassel et al. 1976, Särndal et al. 1992). Both options tend to have large variances when the sample sizes in the domains of interest are not large enough.

If a large domain contains the small domains of interest (subdomains) and all subdomains are assumed to share the same characteristics, a reliable direct estimator for the large domain can be used to obtain indirect estimates for the subdomains. These so-called synthetic estimators produce predictions with low variability but possibly large biases when the assumption of the same distribution of relevant characteristics in all domains is not fulfilled. Composite estimators combine the design-based and synthetic approaches to balance the possible bias against the variability.

Among others, ESSnetSAE (2012) proposes to start with a triplet of estimates:

  • direct,
  • synthetic, and
  • composite estimates.
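The triplet can be illustrated with a minimal Python sketch. It assumes a weighted (Hájek-type) mean as the direct estimator, the weighted mean of the whole large domain as the synthetic estimator, and user-supplied composite weights. The function names, the toy records, and the weights in `phi` are hypothetical and chosen only for illustration; they are not taken from ESSnetSAE (2012).

```python
from collections import defaultdict


def direct_estimates(records):
    """Weighted mean per domain from (domain, weight, value) records."""
    num = defaultdict(float)
    den = defaultdict(float)
    for domain, weight, value in records:
        num[domain] += weight * value
        den[domain] += weight
    return {d: num[d] / den[d] for d in num}


def synthetic_estimate(records):
    """Weighted mean of the large domain, applied to every subdomain."""
    total_weight = sum(w for _, w, _ in records)
    return sum(w * y for _, w, y in records) / total_weight


def composite_estimates(direct, synthetic, phi):
    """Combine both: phi[d] in [0, 1] is the weight on the direct estimate."""
    return {d: phi[d] * direct[d] + (1.0 - phi[d]) * synthetic for d in direct}


# Toy data (hypothetical): binary indicator, two subdomains A and B.
records = [
    ("A", 10.0, 1.0), ("A", 10.0, 0.0), ("A", 20.0, 1.0),
    ("B", 5.0, 0.0), ("B", 15.0, 0.0), ("B", 20.0, 1.0),
]
direct = direct_estimates(records)
synthetic = synthetic_estimate(records)
phi = {"A": 0.8, "B": 0.5}  # assumed weights, not derived from data
composite = composite_estimates(direct, synthetic, phi)
```

In practice the composite weights would be chosen with regard to the precision of the direct estimates; fixing them by hand, as here, only shows the mechanics.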

The simpler approaches can be used if their results are sufficiently precise, e.g., if the coefficient of variation (CV) is below the required threshold. For example, at the Italian National Institute of Statistics (ISTAT), CVs should not exceed 15% for domains and 18% for small domains. Statistics Canada uses three reliability categories for the Labor Force Survey: no release restriction for CV ≤ 16.5%, release with a warning for 16.5% < CV ≤ 33.3%, and data not recommended for release otherwise. The Handbook on precision requirements and variance estimation for ESS household surveys gives a good overview of different practices for defining precision requirements. Even if the obtained estimates are not reliable, they can still be useful for comparison with results from more complex model-based approaches. Other reasons for choosing a simpler approach are the need to produce a large number of indicators in a timely manner and limited overall capacity.
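As a small illustration of such a precision check, the following sketch computes a CV in percent and maps it to the three Statistics Canada categories quoted above. The function names and category labels are hypothetical; only the 16.5% and 33.3% thresholds come from the text.

```python
def coefficient_of_variation(estimate, standard_error):
    """CV in percent: standard error relative to the absolute estimate."""
    if estimate == 0:
        raise ValueError("CV is undefined for an estimate of zero")
    return 100.0 * standard_error / abs(estimate)


def release_category(cv_percent):
    """Map a CV (in percent) to the three categories described in the text."""
    if cv_percent < 0:
        raise ValueError("CV must be non-negative")
    if cv_percent <= 16.5:
        return "no release restriction"
    if cv_percent <= 33.3:
        return "release with warning"
    return "not recommended for release"


# Hypothetical domain estimates of a proportion with their standard errors.
cv_large_domain = coefficient_of_variation(0.40, 0.05)
cv_small_domain = coefficient_of_variation(0.40, 0.10)
category_large = release_category(cv_large_domain)
category_small = release_category(cv_small_domain)
```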

ESTIMATORS: PROS AND CONS

Direct

Pros:
  • Unbiased with respect to the sampling design.
  • The aggregates sum up to the estimate for a larger domain.
Cons:
  • The variance tends to be large for small domains.
  • No information is obtained for non-sampled domains.

Synthetic

Pros:
  • Usually has a smaller variance than the direct estimator.
  • Estimates can be obtained for non-sampled domains.
Cons:
  • It can be heavily biased.
  • The aggregates do not sum up to the estimate for a larger domain without adjustment.
  • The model is only valid for domains where the auxiliary information explains the between-domain heterogeneity.
  • Model diagnostics need to be conducted to prevent a large bias.

Composite

Pros:
  • As a combination of the direct and the synthetic estimator, it typically has a smaller variance than the direct estimator and a smaller bias than the synthetic estimator.
Cons:
  • The weight given to the synthetic estimator is not determined by the fit of the model, i.e., by the explanatory power of the auxiliary information.
  • No information is obtained for non-sampled domains.
  • The aggregates do not sum up to the estimate for a larger domain without adjustment.

Example for SDG17: State level indicators from ICT household surveys in Brazil

The example is described in detail in Bertolini Coelho et al. (2020).

Goal: A department of the Brazilian Network Information Center (NIC.br) called Regional Center for Studies on the Development of the Information Society (Cetic.br) collects data about access and use of information and communication technologies (ICT) in Brazil. Data users are interested in timely publication of main ICT indicators for the 27 Brazilian states.

Indicator of interest: Proportion of households with computers, and proportion of households with Internet access (which is similar to 17.8.1 Proportion of individuals using the Internet).

Disaggregation dimension: Brazilian states.

Data availability: The annual Survey on the Use of ICT in Brazilian Households contains almost 33,000 households. Reliable estimates for the five larger regions North, Northeast, Southeast, South and Center-West can be produced.

SAE methods: Four simple approaches are considered: (i) averaging the estimates of consecutive years; (ii) pooling the samples of consecutive years; (iii) a single-year composite estimator that uses the regional estimates as the synthetic component; and (iv) a composite estimator that pools the samples of two consecutive years and again uses the regional estimates as the synthetic component. These simpler approaches were chosen because a wide range of indicators has to be produced in a timely manner after data collection.


Small area estimation models

In these guidelines, the focus is on small area estimation models. These help to obtain predictors (estimators) with a lower variability at domain level, and with a possible bias that tends to be moderate, if the model is appropriate. The models can be summarized as mixed models with random domain-specific effects accounting for variation between domains that is not explained by auxiliary information.

While SAE models may differ in their specifications, they are built on the same notational framework. A finite population $U$ of size $N$ is partitioned into $D$ domains $U_1, \dots, U_D$ of sizes $N_1, \dots, N_D$, where $i = 1, \dots, D$ refers to the $i$th domain and $j = 1, \dots, N_i$ to the $j$th household/individual. A random sample of size $n$ is drawn from this population, which leads to $n_1, \dots, n_D$ observations in the domains. If $n_i = 0$, the domain is not in the sample.
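The bookkeeping behind this framework can be sketched in a few lines of Python. Writing n_i for the number of sampled units in domain i (a symbol assumed here, following common SAE notation), the sketch counts n_i for every domain of the population and flags the domains with n_i = 0. The function names and the toy labels are hypothetical.

```python
from collections import Counter


def domain_sample_sizes(all_domains, sampled_domain_labels):
    """Return n_i for every population domain, including domains with n_i = 0."""
    counts = Counter(sampled_domain_labels)
    return {domain: counts.get(domain, 0) for domain in all_domains}


def non_sampled_domains(sample_sizes):
    """Domains with n_i = 0, i.e., domains that are not in the sample."""
    return [domain for domain, n_i in sample_sizes.items() if n_i == 0]


# Hypothetical population with three domains; the sample covers only A and C.
all_domains = ["A", "B", "C"]
sampled_domain_labels = ["A", "A", "C"]  # domain label of each sampled unit
sizes = domain_sample_sizes(all_domains, sampled_domain_labels)
missing = non_sampled_domains(sizes)
total_sample_size = sum(sizes.values())  # overall sample size n
```

Taking the list of domains from the population frame rather than from the sample is what makes the non-sampled domains visible.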

The wide range of small area estimation models can roughly be classified into two model types: 

  • Area-level models relate a domain indicator with domain-specific auxiliary information.
  • Unit-level models use the unit-level survey data for fitting a model and unit-level auxiliary information for producing estimates in all domains.

The basic area-level and unit-level models are also known as the Fay-Herriot (Fay and Herriot 1979) and Battese-Harter-Fuller (Battese et al. 1988) models, respectively.
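For orientation, the two basic models are commonly written as follows. The symbols are assumptions introduced here, not notation from this page: $\hat{\theta}_i^{\mathrm{dir}}$ is the direct estimate for domain $i$, $\mathbf{x}_i$ and $\mathbf{x}_{ij}$ are domain-level and unit-level auxiliary vectors, $u_i$ is the random domain effect, and $\sigma_{e_i}^2$ is the sampling variance of the direct estimate, which is treated as known in the area-level model.

```latex
% Area-level (Fay-Herriot) model for domains i = 1, ..., D
\hat{\theta}_i^{\mathrm{dir}} = \mathbf{x}_i^{\top}\boldsymbol{\beta} + u_i + e_i,
\qquad u_i \sim N(0, \sigma_u^2), \quad e_i \sim N(0, \sigma_{e_i}^2)

% Unit-level (Battese-Harter-Fuller) model for sampled units j in domain i
y_{ij} = \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + u_i + e_{ij},
\qquad u_i \sim N(0, \sigma_u^2), \quad e_{ij} \sim N(0, \sigma_e^2)
```

In both forms the random effect $u_i$ captures the variation between domains that the auxiliary information does not explain.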


MODEL CHARACTERISTICS BY MODEL TYPE

Area-level

Pros:
  • Usually has a smaller MSE than the direct estimator, so the estimator is more efficient.
  • The model considers unexplained heterogeneity among the domains.
  • The estimator is a weighted average of the direct estimator and a regression-synthetic part and gives more weight to the direct estimator as its sampling error decreases.
  • The sampling design can be considered by using the weighted direct estimator.
  • Additional information is only needed at the domain level, which reduces confidentiality issues.
  • Estimates can be obtained for non-sampled domains.
Cons:
  • It is a model-based approach, i.e., model diagnostics need to be conducted.
  • The sampling error variance is assumed to be known but usually needs to be estimated in practical applications. The additional uncertainty is often not considered in the MSE estimation.
  • For each indicator, a new model needs to be fitted.
  • The aggregates do not sum up to the estimate for a larger domain without adjustment.

Unit-level

Pros:
  • Usually has a smaller MSE, i.e., is more efficient, than both the direct estimator and an estimate obtained with an area-level model, assuming the model is correct. In the presence of outlying unit values, the area-level model may lead to more efficient estimates.
  • The model considers unexplained heterogeneity among the domains.
  • The estimator is a weighted average of the survey regression estimator and the regression-synthetic part and gives more weight to the survey regression estimator as the sample size increases.
  • Estimates can be obtained for non-sampled domains.
Cons:
  • It is a model-based approach, i.e., model diagnostics need to be conducted.
  • The sampling design is not considered.
  • The data requirements are stricter than for area-level models. The auxiliary information needs to follow the same definitions as in the survey and, ideally, to be available at the unit level, which often raises confidentiality issues.
  • The aggregates do not sum up to the estimate for a larger domain without adjustment.
  • Results can be affected by individual outliers, though robust alternatives are available.
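The weighted-average property of the area-level estimator can be made concrete with a short sketch. It assumes the usual shrinkage weight gamma_i = sigma2_u / (sigma2_u + sampling_var_i), where sigma2_u is the variance of the random domain effects and sampling_var_i is the sampling variance of the direct estimate. Both are treated as known here, whereas in practice they have to be estimated; the regression-synthetic part is passed in rather than fitted. All names and numbers are hypothetical.

```python
def area_level_predictor(direct, regression_synthetic, sampling_var, sigma2_u):
    """Weighted average of direct and regression-synthetic estimates.

    Returns the prediction and the weight gamma given to the direct estimate.
    """
    if sampling_var < 0 or sigma2_u < 0 or sampling_var + sigma2_u == 0:
        raise ValueError("variances must be non-negative and not both zero")
    gamma = sigma2_u / (sigma2_u + sampling_var)
    prediction = gamma * direct + (1.0 - gamma) * regression_synthetic
    return prediction, gamma


# Same direct and synthetic values, but different sampling variances.
sigma2_u = 4.0  # assumed between-domain variance
precise_prediction, precise_gamma = area_level_predictor(10.0, 20.0, 1.0, sigma2_u)
noisy_prediction, noisy_gamma = area_level_predictor(10.0, 20.0, 16.0, sigma2_u)
```

The domain with the small sampling variance stays close to its direct estimate, while the noisy domain is pulled towards the regression-synthetic part. For a non-sampled domain no direct estimate exists, so only the regression-synthetic part is used.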

Terms and definitions

Chapter 2 of the Guidelines on small area estimation for city statistics and other functional geographies provided by Eurostat offers a nice overview of standard terms and definitions in small area estimation.

Pros and Cons

The guidelines provided by Molina give an extensive overview of the advantages and disadvantages of the most common small area estimation models (in Spanish).

Extended/Adjusted models

The Pros and Cons apply to the standard models. There are several extensions and adjustments in the literature of small area estimation that address some of the issues.

Some examples

Auto-benchmarking for the Fay-Herriot area-level model:

You, Y., Rao, J.N.K. and Hidiroglou, M. (2013). On the performance of self benchmarked small area estimators under the Fay-Herriot area level model. Survey Methodology, 39(1), 217-229.

Pseudo-EBLUP to consider sampling weights:

You, Y., and Rao, J.N.K. (2002). A pseudo empirical best linear unbiased prediction approach to small area estimation using survey weights. The Canadian Journal of Statistics, 30, 431-439.
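To show what an adjustment for the aggregation issue can look like, the following sketch applies a simple ratio benchmark: all domain estimates are scaled by one common factor so that their population-weighted sum matches a reliable total for the larger domain. This is a deliberately simple technique and not the self-benchmarking method of You et al. (2013). The function name, the domain sizes, and the benchmark total are hypothetical.

```python
def ratio_benchmark(estimates, sizes, benchmark_total):
    """Scale domain means so that sum(mean * size) equals benchmark_total."""
    model_total = sum(estimates[d] * sizes[d] for d in estimates)
    if model_total == 0:
        raise ValueError("cannot benchmark when the model-based total is zero")
    factor = benchmark_total / model_total
    return {d: factor * estimates[d] for d in estimates}


# Hypothetical model-based domain means and population sizes.
estimates = {"A": 0.50, "B": 0.25}
sizes = {"A": 100, "B": 200}
benchmark_total = 110.0  # assumed reliable total for the larger domain
benchmarked = ratio_benchmark(estimates, sizes, benchmark_total)
adjusted_total = sum(benchmarked[d] * sizes[d] for d in benchmarked)
```

A common scaling factor preserves the ranking of the domains but changes every estimate by the same relative amount, which is one reason why more refined benchmarking approaches exist.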

