Need for robust estimators

Elevation data often contain outliers that can be traced back to instrument acquisition or processing artefacts, and which hamper further analysis.

In order to mitigate their effect, the analysis of elevation data can integrate robust statistics at different levels:

Robust estimators for the central tendency and statistical dispersion used during Coregistration, Bias correction and Uncertainty analysis,
Robust estimators for estimating spatial autocorrelation applied to error propagation in Uncertainty analysis,
Robust optimizers for the fitting of parametric models during Coregistration and Bias correction.

Yet, there is a downside to robust statistical estimators. They can yield less precise estimates for small sample sizes and, in some cases, hide patterns inherent to the data. This is why, when outliers show identifiable patterns, it can be better to first filter the outliers and then perform the analysis using traditional statistical measures.

Important

In xDEM, robust estimators are used everywhere by default.

Measures of central tendency and dispersion#

Central tendency#

The central tendency represents the central value of a sample, and is core to the analysis of sample accuracy (see Grasping accuracy and precision). It is most often measured by the mean. However, the mean is sensitive to outliers. Therefore, in many cases (e.g., when working with unfiltered DEMs), using the median as a measure of central tendency is preferred.

When working with weighted data, the weighted median, which corresponds to the 50^th weighted percentile, can be used as a robust measure of central tendency.

numpy.median() is used by default in the alignment routines of Coregistration and Bias correction.
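The sensitivity of the mean and the robustness of the median can be illustrated with a minimal sketch on synthetic elevation differences. The sample, the outlier value and the helper `weighted_median` are assumptions made for illustration, not part of xDEM. The helper returns the lower weighted median, so for equal weights and an even sample size it can differ slightly from `numpy.median()`, which averages the two middle values.

```python
import numpy as np


def weighted_median(values, weights):
    """Return the 50th weighted percentile (lower weighted median) of a sample."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    cum_weights = np.cumsum(weights[order])
    # First sorted value whose cumulative weight reaches half of the total weight
    idx = np.searchsorted(cum_weights, 0.5 * cum_weights[-1])
    return values[order][idx]


# Synthetic elevation differences: zero-mean noise with a unit standard deviation
rng = np.random.default_rng(42)
dh = rng.normal(loc=0.0, scale=1.0, size=1000)

# Replace 5% of the sample with gross outliers of 100 m
dh_outliers = dh.copy()
dh_outliers[:50] = 100.0

# Shift of each estimator caused by the outliers
mean_shift = abs(np.mean(dh_outliers) - np.mean(dh))
median_shift = abs(np.median(dh_outliers) - np.median(dh))
print(f"Shift of the mean: {mean_shift:.2f}, shift of the median: {median_shift:.2f}")
```

With these assumptions, the mean moves by several metres while the median moves by a small fraction of the noise level.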

Dispersion#

The statistical dispersion represents the spread of a sample, and is core to the analysis of sample precision (see Grasping accuracy and precision). It is typically measured by the standard deviation. However, very much like the mean, the standard deviation is a measure sensitive to outliers.

The median equivalent of a standard deviation is the normalized median absolute deviation (NMAD), which corresponds to the median absolute deviation scaled by a factor of ~1.4826 to match the dispersion of a normal distribution. It is a more robust measure of dispersion in the presence of outliers, defined as:

\[ \textrm{NMAD}(x) = 1.4826 \cdot \textrm{median}_{i} \left ( \mid x_{i} - \textrm{median}(x) \mid \right ) \]

where \(x\) is the sample.

Note

The NMAD estimator has a good synergy with Dowd’s variogram for spatial autocorrelation, as their median-based measure of dispersion is the same.

Half the difference between the 84^th and 16^th percentiles, or the 68^th percentile of the absolute values, can also be used as a robust dispersion measure equivalent to the standard deviation. When working with weighted data, half the difference between the 84^th and 16^th weighted percentiles, or the 68^th weighted percentile of the absolute values, can be used as a robust measure of dispersion.

gu.stats.nmad() is used by default in Coregistration, Bias correction and Uncertainty analysis.
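The dispersion measures above can be sketched directly with NumPy. The functions `nmad`, `half_percentile_range` and `abs_percentile_68` below are illustrative re-implementations of the formulas in this section, not the library's own code, and the synthetic sample (2 m of noise, 2% of outliers at 500 m) is an assumption.

```python
import numpy as np


def nmad(x):
    """Normalized median absolute deviation, ignoring NaNs."""
    x = np.asarray(x, dtype=float)
    return 1.4826 * np.nanmedian(np.abs(x - np.nanmedian(x)))


def half_percentile_range(x):
    """Half the difference between the 84th and 16th percentiles."""
    p16, p84 = np.nanpercentile(x, [16, 84])
    return 0.5 * (p84 - p16)


def abs_percentile_68(x):
    """68th percentile of the absolute values (assumes a sample centered on zero)."""
    return np.nanpercentile(np.abs(x), 68)


# Synthetic elevation differences with a true dispersion of 2 m
rng = np.random.default_rng(0)
dh = rng.normal(loc=0.0, scale=2.0, size=10000)

# Contaminate 2% of the sample with gross outliers
dh_out = dh.copy()
dh_out[:200] = 500.0

print(f"Without outliers: STD={np.std(dh):.2f}, NMAD={nmad(dh):.2f}")
print(f"With outliers:    STD={np.std(dh_out):.2f}, NMAD={nmad(dh_out):.2f}")
```

On the clean sample all three robust measures stay close to the standard deviation. Once outliers are added, the standard deviation grows by more than an order of magnitude while the NMAD barely changes.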

Measures of spatial autocorrelation#

Variogram analysis exploits statistical measures equivalent to the covariance, and is therefore also sensitive to outliers. Based on SciKit-GStat, xDEM allows specifying robust variogram estimators such as Dowd’s variogram, which is based on medians and defined as:

\[ 2\gamma (h) = 2.198 \cdot \textrm{median}_{i} \left ( \mid Z_{x_{i}} - Z_{x_{i+h}} \mid \right )^{2} \]

where \(h\) is the spatial lag and \(Z_{x_{i}}\) is the value of the sample at the location \(x_{i}\).

Note

Dowd’s estimator has a good synergy with the NMAD for estimating the dispersion of the full sample, as their median-based measure of dispersion is the same (2.198 is the square of 1.4826).

Other estimators can be chosen from SciKit-GStat’s list of estimators.

Dowd’s variogram is used by default to estimate the spatial autocorrelation of elevation measurement errors in Uncertainty analysis.
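As a rough illustration of why a median-based variogram estimator is more robust, the sketch below compares Dowd's estimator with the classical mean-of-squares (Matheron) estimator at a single lag. It assumes a regularly sampled 1D profile with the lag expressed in samples. The helper names are hypothetical, and neither xDEM nor SciKit-GStat is called here.

```python
import numpy as np


def dowd_semivariance(z, lag):
    """Dowd's semivariance at a lag (in samples): 0.5 * 2.198 * median(|dz|)^2."""
    z = np.asarray(z, dtype=float)
    diffs = z[lag:] - z[:-lag]
    return 0.5 * 2.198 * np.nanmedian(np.abs(diffs)) ** 2


def matheron_semivariance(z, lag):
    """Classical semivariance at a lag (in samples): 0.5 * mean(dz^2)."""
    z = np.asarray(z, dtype=float)
    diffs = z[lag:] - z[:-lag]
    return 0.5 * np.nanmean(diffs**2)


# Uncorrelated noise with unit variance: the expected semivariance is 1 at any lag
rng = np.random.default_rng(3)
z = rng.normal(loc=0.0, scale=1.0, size=20000)

# Contaminate 1% of the profile with isolated gross outliers
z_out = z.copy()
z_out[::100] = 100.0

dowd_clean = dowd_semivariance(z, lag=1)
dowd_out = dowd_semivariance(z_out, lag=1)
matheron_out = matheron_semivariance(z_out, lag=1)
print(f"Dowd: {dowd_clean:.2f} (clean), {dowd_out:.2f} (with outliers)")
print(f"Matheron with outliers: {matheron_out:.2f}")
```

Under these assumptions, Dowd's estimate stays close to the true variance of 1 with or without outliers, while the classical estimate is inflated by roughly two orders of magnitude.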

Regression analysis#

Least-squares loss functions#

When performing least-squares linear regression, the traditional squared-error (L2) loss is not robust to outliers, because large residuals dominate the sum of squares.

A robust soft L1 loss is used by default to perform least-squares regression through scipy.optimize in Coregistration and Bias correction.
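The effect of the loss function can be sketched with `scipy.optimize.least_squares`, which accepts `loss="soft_l1"`. The linear model, the synthetic data and the value of `f_scale` below are assumptions chosen for illustration, and do not reflect the settings used internally by the library.

```python
import numpy as np
from scipy.optimize import least_squares


def residuals(params, x, y):
    """Residuals of a linear model y = slope * x + intercept."""
    slope, intercept = params
    return slope * x + intercept - y


# Synthetic linear trend (slope 2, intercept 1) with small noise
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 200)
y = 2.0 * x + 1.0 + rng.normal(loc=0.0, scale=0.1, size=x.size)

# Add gross outliers to the last 10% of the points
y[-20:] += 50.0

# Default loss is "linear", i.e. ordinary least squares
fit_l2 = least_squares(residuals, x0=[0.0, 0.0], args=(x, y))

# Soft L1 loss; f_scale sets the residual size beyond which points are down-weighted
fit_soft = least_squares(residuals, x0=[0.0, 0.0], args=(x, y), loss="soft_l1", f_scale=0.5)

print(f"Slope with linear loss:  {fit_l2.x[0]:.2f}")
print(f"Slope with soft L1 loss: {fit_soft.x[0]:.2f}")
```

With this setup the ordinary least-squares slope is pulled far from the true value of 2 by the outliers, while the soft L1 fit remains close to it.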

Robust estimators#

Estimators other than ordinary least-squares can be used for linear estimations. The Coregistration and Bias correction methods encapsulate some of those robust methods provided by sklearn.linear_model:

The Random sample consensus estimator RANSAC,
The Theil-Sen estimator,
The Huber loss estimator.
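The three estimators listed above are available in scikit-learn as `RANSACRegressor`, `TheilSenRegressor` and `HuberRegressor`. The sketch below fits each of them, along with ordinary least squares, to a synthetic line contaminated by outliers. The data and the parameter choices are assumptions for illustration, and this is not how the Coregistration or Bias correction methods call them.

```python
import numpy as np
from sklearn.linear_model import (
    HuberRegressor,
    LinearRegression,
    RANSACRegressor,
    TheilSenRegressor,
)

# Synthetic linear trend (slope 2, intercept 1) with small noise
rng = np.random.default_rng(2)
x = np.linspace(0, 10, 200)
y = 2.0 * x + 1.0 + rng.normal(loc=0.0, scale=0.1, size=x.size)

# Add gross outliers to the last 10% of the points
y[-20:] += 50.0

# scikit-learn expects a 2D array of features
X = x.reshape(-1, 1)

estimators = {
    "ols": LinearRegression(),
    "ransac": RANSACRegressor(random_state=0),
    "theil_sen": TheilSenRegressor(random_state=0),
    "huber": HuberRegressor(max_iter=1000),
}

slopes = {}
for name, est in estimators.items():
    est.fit(X, y)
    # RANSAC stores the final model, refit on the inliers, in `estimator_`
    model = est.estimator_ if name == "ransac" else est
    slopes[name] = float(model.coef_[0])

for name, slope in slopes.items():
    print(f"{name}: slope = {slope:.2f}")
```

In this example the ordinary least-squares slope is strongly biased by the outliers, whereas the three robust estimators recover a slope close to the true value of 2.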

References and more reading

References:

Dowd (1984), The Variogram and Kriging: Robust and Resistant Estimators,
Höhle and Höhle (2009), Accuracy assessment of digital elevation models by means of robust statistical methods.