The need for robust statistics#

Digital elevation models often contain outliers that can be traced back to instrument acquisition or processing artefacts, and which hamper further analysis.

In order to mitigate their effect, xDEM integrates robust statistics at different levels:

Robust optimizers for the fitting of parametric models during Coregistration and Bias correction,
Robust measures for the central tendency (e.g., mean) and dispersion (e.g., standard deviation), to evaluate DEM quality and converge during Coregistration,
Robust measures for estimating spatial autocorrelation for uncertainty analysis in spatialstats.

Yet, there is a downside to robust statistical measures: they can yield less precise estimates for small sample sizes and, in some cases, hide patterns inherent to the data. This is why, when outliers show identifiable patterns, it is better to first resort to outlier filtering (see filters) and then perform the analysis using traditional statistical measures.

Measures of central tendency and dispersion#

Central tendency#

The central tendency represents the central value of a sample, and is core to the analysis of sample accuracy (see Analysis of accuracy and precision). It is most often measured by the mean. However, the mean is sensitive to outliers. Therefore, in many cases (e.g., when working with unfiltered DEMs), using the median as a measure of central tendency is preferred.
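As a minimal illustration of this sensitivity, the sketch below compares the mean and the median of a small sample containing one gross outlier. The elevation differences are made-up values chosen for the example, not xDEM output.

```python
import statistics

# Hypothetical elevation differences (in meters) with one gross outlier.
dh = [0.1, -0.2, 0.0, 0.3, -0.1, 0.2, 50.0]

mean_dh = statistics.mean(dh)      # pulled strongly towards the outlier
median_dh = statistics.median(dh)  # stays within the bulk of the sample

print(f"mean = {mean_dh:.2f} m, median = {median_dh:.2f} m")
```

A single bad value out of seven shifts the mean by several meters, while the median stays at a value typical of the remaining six.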

When working with weighted data, the weighted median, which corresponds to the 50th weighted percentile, can be used as a robust measure of central tendency.
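One common convention for the weighted median is sketched below: sort the values, accumulate their weights, and return the first value at which the cumulative weight reaches half of the total. The function name `weighted_median` is hypothetical and this is not necessarily the convention xDEM uses internally.

```python
def weighted_median(values, weights):
    """Return the smallest value whose cumulative weight reaches half the total weight."""
    pairs = sorted(zip(values, weights))  # sort values, keeping weights attached
    half = 0.5 * sum(weights)
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            return value
    raise ValueError("values must not be empty")


# A heavily down-weighted outlier has almost no influence on the result.
print(weighted_median([1.0, 2.0, 3.0, 100.0], [1.0, 1.0, 1.0, 0.1]))
```

Other conventions interpolate between neighbouring values when the cumulative weight falls exactly on the half-way point; the sketch above simply returns the lower one.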

The median is used by default in the alignment routines of Coregistration and Bias correction.

Dispersion#

The statistical dispersion represents the spread of a sample, and is core to the analysis of sample precision (see Analysis of accuracy and precision). It is typically measured by the standard deviation. However, very much like the mean, the standard deviation is a measure sensitive to outliers.

The median equivalent of a standard deviation is the normalized median absolute deviation (NMAD), which corresponds to the median absolute deviation scaled by a factor of ~1.4826 so that it matches the standard deviation of a normally distributed sample. It has been shown to provide more robust measures of dispersion in the presence of outliers when working with DEMs (e.g., Höhle and Höhle (2009)). It is defined as:

\[ \textrm{NMAD}(x) = 1.4826 \cdot \textrm{median}_{i} \left ( \mid x_{i} - \textrm{median}(x) \mid \right ) \]

where \(x\) is the sample.

import xdem
nmad = xdem.spatialstats.nmad
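To make the definition concrete, here is a minimal standard-library sketch of the formula above. The name `nmad_sketch` is hypothetical and the code is not xDEM's implementation; it only mirrors the equation term by term.

```python
import statistics


def nmad_sketch(x, scale=1.4826):
    """Normalized median absolute deviation of a sequence of numbers."""
    med = statistics.median(x)
    # Median of the absolute deviations from the median, scaled to match
    # the standard deviation of a normal distribution.
    return scale * statistics.median(abs(v - med) for v in x)


# Hypothetical elevation differences (in meters) with one gross outlier.
dh = [0.1, -0.2, 0.0, 0.3, -0.1, 0.2, 50.0]

print(f"NMAD = {nmad_sketch(dh):.3f} m, std = {statistics.stdev(dh):.3f} m")
```

On this sample the standard deviation is dominated by the single outlier, whereas the NMAD reflects the spread of the other six values.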

Note

The NMAD estimator has a good synergy with Dowd’s variogram for spatial autocorrelation, as their median-based measure of dispersion is the same.

Half the difference between the 84th and 16th percentiles, or the 68th percentile of the absolute values, can also be used as a robust dispersion measure equivalent to the standard deviation. When working with weighted data, half the difference between the 84th and 16th weighted percentiles, or the 68th weighted percentile of the absolute values, can be used in the same way.
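The percentile-based measures can be checked on synthetic data. The sketch below draws a zero-centered normal sample with a known standard deviation of 2 (an arbitrary choice for the example) and computes both measures with the standard library. It assumes the sample is centered on zero, which is what makes the percentile of the absolute values comparable to a standard deviation.

```python
import random
import statistics

random.seed(0)
# Synthetic, zero-centered normal sample with standard deviation 2.
sample = [random.gauss(0.0, 2.0) for _ in range(10_000)]

# statistics.quantiles with n=100 returns the 99 cut points (1st to 99th percentile).
percentiles = statistics.quantiles(sample, n=100)
p16, p84 = percentiles[15], percentiles[83]
half_spread = 0.5 * (p84 - p16)

# 68th percentile of the absolute values.
abs_percentiles = statistics.quantiles([abs(v) for v in sample], n=100)
abs_p68 = abs_percentiles[67]

print(f"half (p84 - p16) = {half_spread:.2f}, abs p68 = {abs_p68:.2f}, "
      f"std = {statistics.stdev(sample):.2f}")
```

Both measures should land close to the standard deviation for such a sample, up to sampling noise.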

Important

The NMAD is used by default for estimating elevation measurement errors in spatialstats.

Measures of spatial autocorrelation#

Variogram analysis exploits statistical measures equivalent to the covariance, and is therefore also sensitive to outliers. Based on SciKit-GStat, xDEM allows specifying robust variogram estimators, such as Dowd’s variogram based on medians (Dowd (1984)), defined as:

\[ 2\gamma (h) = 2.198 \cdot \left ( \textrm{median}_{i} \left ( \mid Z_{x_{i}} - Z_{x_{i+h}} \mid \right ) \right )^{2} \]

where \(h\) is the spatial lag and \(Z_{x_{i}}\) is the value of the sample at the location \(x_{i}\).
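As an illustration, the sketch below evaluates Dowd's estimator (the 2.198 factor applied to the squared median of the absolute differences) for a regularly spaced 1D series and an integer lag, next to the classical mean-of-squared-differences (Matheron) estimator. The function names and the synthetic series are hypothetical. In practice xDEM relies on SciKit-GStat rather than on code like this.

```python
import random
import statistics


def dowd_semivariance(z, lag):
    """Dowd's estimate of gamma(h) for a regularly spaced 1D series."""
    abs_diffs = [abs(z[i + lag] - z[i]) for i in range(len(z) - lag)]
    # 2 * gamma(h) = 2.198 * median(|differences|)^2
    return 0.5 * 2.198 * statistics.median(abs_diffs) ** 2


def matheron_semivariance(z, lag):
    """Classical estimate of gamma(h): half the mean squared difference."""
    sq_diffs = [(z[i + lag] - z[i]) ** 2 for i in range(len(z) - lag)]
    return 0.5 * statistics.mean(sq_diffs)


random.seed(0)
# Uncorrelated noise with unit variance, so gamma(h) should be close to 1.
series = [random.gauss(0.0, 1.0) for _ in range(5000)]
# Contaminate 1% of the values with gross outliers.
for i in range(0, len(series), 100):
    series[i] += 50.0

gamma_dowd = dowd_semivariance(series, lag=1)
gamma_matheron = matheron_semivariance(series, lag=1)
print(f"Dowd: {gamma_dowd:.2f}, Matheron: {gamma_matheron:.2f}")
```

The median-based estimate stays near the variance of the uncontaminated noise, while the classical estimate is inflated by the few outlying differences.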

Note

Dowd’s estimator has a good synergy with the NMAD for estimating the dispersion of the full sample, as their median-based measure of dispersion is the same (2.198 is the square of 1.4826).

Other estimators can be chosen from SciKit-GStat’s list of estimators.

Important

Dowd’s variogram is used by default to estimate spatial autocorrelation of elevation measurement errors in spatialstats.

Regression analysis#

Least-squares loss functions#

When performing least-squares linear regression, the traditional squared-error loss is not robust to outliers, because large residuals dominate the fit.

A robust soft L1 loss is used by default when xDEM performs least-squares regression through scipy.optimize.
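To show why a soft L1 loss is more robust, the sketch below compares it with the squared loss for a single residual. It uses the form documented for the `soft_l1` option of `scipy.optimize.least_squares`, namely rho(z) = 2 (sqrt(1 + z) - 1) applied to the squared scaled residual z. The function names are hypothetical, and the default `f_scale=1.0` is an assumption made for the example, not a statement about xDEM's settings.

```python
import math


def squared_loss(residual):
    """Traditional least-squares loss for one residual."""
    return residual ** 2


def soft_l1_loss(residual, f_scale=1.0):
    """Soft L1 loss: quadratic for small residuals, close to linear for large ones."""
    z = (residual / f_scale) ** 2
    return f_scale ** 2 * 2.0 * (math.sqrt(1.0 + z) - 1.0)


for r in (0.01, 1.0, 100.0):
    print(f"residual {r:>6}: squared = {squared_loss(r):.4g}, "
          f"soft L1 = {soft_l1_loss(r):.4g}")
```

For small residuals the two losses nearly coincide, but for a residual of 100 the soft L1 loss grows roughly linearly, so a single outlier weighs far less on the fit.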

Robust estimators#

Estimators other than ordinary least squares can be used for linear estimation. The Coregistration and Bias correction methods encapsulate some of the robust methods provided by sklearn.linear_model:

The Random sample consensus estimator RANSAC,
The Theil-Sen estimator,
The Huber loss estimator.
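To give an idea of how such estimators resist outliers, here is a minimal single-feature sketch of the Theil-Sen idea: the slope is the median of the slopes between all pairs of points, and the intercept is the median of the resulting offsets. The function name `theil_sen_line` is hypothetical. This is not scikit-learn's implementation, which generalizes the approach to several features.

```python
import itertools
import statistics


def theil_sen_line(x, y):
    """Fit y = slope * x + intercept using medians of pairwise slopes."""
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in itertools.combinations(zip(x, y), 2)
        if x2 != x1  # skip vertical pairs
    ]
    slope = statistics.median(slopes)
    intercept = statistics.median(yi - slope * xi for xi, yi in zip(x, y))
    return slope, intercept


# Points on the line y = 2x + 1, with one value replaced by a gross outlier.
x = list(range(10))
y = [2 * xi + 1 for xi in x]
y[7] = 100

slope, intercept = theil_sen_line(x, y)
print(f"slope = {slope}, intercept = {intercept}")
```

Because only 9 of the 45 pairwise slopes involve the outlier, the medians recover the underlying line, whereas an ordinary least-squares fit would be pulled towards the outlying point.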
