DesignOfStatistics
Things to do:
Bugs
A number of bugs must be fixed.
- Datasets with only cell or row data (Kitware)
- Models are now multiblocks of tables (Sandia)
Features
Among the considered features:
Integral
These features are related to making statistics pervasive throughout the user interface.
- Find data (SNL: analysis, KW: consult on GUI). Examples of find strategies include:
- The first selection mechanism for univariate data, arguably the most common one in the eyes of the typical analyst, is to select a subset of the data set based on relative distance from a reference value. Typically, the reference value is the mean and the distance is measured in units of the standard deviation, with these statistics either calculated from the same data set or taken from a training data set. However, these values can also represent other types of reference information, such as expert knowledge or a specification. This relative distance is called the 1-dimensional Mahalanobis distance.
The user will then be interested in selecting only those data points whose Mahalanobis distance from the reference value is greater or smaller than a certain threshold. Note that this threshold can be set to 0, in which case only those data points exactly equal to the reference value (typical use case: extrema) are selected.
Practically speaking, this is implemented by means of vtkDescriptiveStatistics.
- Another use case, also univariate, makes use of arbitrary quantiles: the most commonly used approach here is the inter-quartile range (IQR), outside of which lie the smallest 25% and the largest 25% of the data set. This is an alternative to the mean/standard deviation mechanism, whose main interest is that it relies on more robust statistics: for instance, the median (the 50% quantile, which lies inside the IQR) is statistically much more robust than the mean, which is very sensitive to outliers (a single value can throw it off arbitrarily far).
In this approach the analyst is not limited to quartiles but can use any type of quantile; aside from quartiles (25%-50%-75%), the most commonly used are deciles (increments of 10%) and percentiles (increments of 1%).
From an implementation point of view, this is achieved with vtkOrderStatistics.
- In the case of intrinsically multi-dimensional data (e.g., a velocity field), a natural selection mechanism is to use the n-dimensional Mahalanobis distance, i.e., the generalization of the relative distance to arbitrary dimensions obtained by replacing the variance (the squared standard deviation) with a positive-definite matrix, which can (but does not have to) be the covariance matrix of some training data set, possibly the same set as the one from which the subset of interest must be extracted. Note that this amounts to assuming an underlying linear (Gaussian) model for the data.
In this case, as in the univariate case, the values selected are those that are farther from, or closer to, a certain reference point (now n-dimensional itself) than a given threshold, as measured by this n-dimensional distance.
The implementation is possible with an arbitrary number of dimensions by using vtkMultiCorrelativeStatistics.
Based on correlative statistics (bivariate): find all entries whose 2-D distance (i.e., relative distance with respect to a given covariance matrix) to a reference 2-D point is greater or smaller than a certain threshold. This amounts to comparing the deviation with respect to the underlying linear regression and is useful for picking outliers in this respect. It can also easily be done in the n-D case with the multi-variate correlation engine.
- Calculator (additional statistics buttons for some descriptive statistics) (SNL: analysis, KW: consult on GUI)
- Automatic statistics filter (decides what to do for the user given a dataset and variables of interest on it) (SNL)
- Hashing and caching statistics when file formats don't include them. (SNL + KW) See feature request 11401.
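The find strategies listed under "Find data" above can be illustrated with a minimal sketch in plain Python. It is not the VTK implementation and does not use vtkDescriptiveStatistics, vtkOrderStatistics, or vtkMultiCorrelativeStatistics. The function names, the `outside` flag, and the choice of the "inclusive" quantile method are assumptions made for illustration only. The multi-dimensional case is restricted to 2-D so that the covariance matrix can be inverted by hand.

```python
import math
import statistics


def select_by_deviation(values, threshold, mean=None, std=None, outside=True):
    """Select indices by 1-D Mahalanobis distance |x - mean| / std.

    outside=True keeps points farther than `threshold`; outside=False keeps
    points at most `threshold` away (threshold 0 keeps exact matches only).
    """
    # Reference statistics default to those of the data set itself, but may
    # instead come from training data, expert knowledge, or a specification.
    mean = statistics.fmean(values) if mean is None else mean
    std = statistics.stdev(values) if std is None else std
    if std <= 0:
        raise ValueError("standard deviation must be positive")
    selected = []
    for i, x in enumerate(values):
        distance = abs(x - mean) / std
        if (distance > threshold) == outside:
            selected.append(i)
    return selected


def select_outside_quantiles(values, n=4):
    """Select indices of values below the first or above the last n-quantile.

    n=4 selects outside the inter-quartile range, n=10 outside the 10%-90%
    range, and n=100 outside the 1%-99% range.
    """
    # Assumption: the "inclusive" method treats `values` as the whole population.
    cuts = statistics.quantiles(values, n=n, method="inclusive")
    low, high = cuts[0], cuts[-1]
    return [i for i, x in enumerate(values) if x < low or x > high]


def select_by_mahalanobis_2d(points, threshold, center, cov, outside=True):
    """Select indices by 2-D Mahalanobis distance to `center`.

    `cov` is a symmetric positive-definite matrix ((sxx, sxy), (sxy, syy)).
    """
    (sxx, sxy), (_, syy) = cov
    det = sxx * syy - sxy * sxy
    if sxx <= 0 or det <= 0:
        raise ValueError("cov must be positive definite")
    selected = []
    for i, (x, y) in enumerate(points):
        dx, dy = x - center[0], y - center[1]
        # Quadratic form (dx, dy) * inverse(cov) * (dx, dy)^T, with the
        # inverse of the 2x2 matrix written out explicitly.
        squared = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
        distance = math.sqrt(max(squared, 0.0))
        if (distance > threshold) == outside:
            selected.append(i)
    return selected
```

With a strongly correlated reference covariance such as ((1, 0.9), (0.9, 1)) centered at the origin, the points (1, 1) and (1, -1) are equally far away in the Euclidean sense, but only (1, -1) exceeds a threshold of 2 because it deviates from the underlying linear trend.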
Advanced Filters
This is more or less what is currently in PV, plus the following:
- Hypothesis testing (SNL)
- Linked selection for the model tables (KW, SNL: consult)
- Access to more (or all) parameters in a generic way (SNL, KW: some GUI work may be required)
Design
For 4.0, we are going to stick with a very small subset of statistics functionality and expose that through the Find Data dialog.
The following diagram illustrates a possible configuration for the Find Data dialog's value-entry widget.