Thursday, June 13, 2013

Skill Assessment and Statistical Hypothesis Testing for Model Evaluation

“Skill assessment” for water quality models refers to the application of a set of statistical and graphical techniques to quantify the goodness-of-fit of a water quality model. These techniques are applied to compare observations with model predictions; Stow et al. (2009) list and describe statistics such as the correlation coefficient, root mean square error, and average absolute error as skill assessment options for univariate comparisons.
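As a concrete illustration, the three univariate statistics mentioned above can be computed from paired observations and predictions in a few lines. This is a minimal sketch using the standard textbook definitions; the function name and the chlorophyll-a numbers are hypothetical, not from Stow et al. (2009).

```python
import math

def skill_stats(obs, pred):
    """Univariate skill statistics for paired observations and predictions:
    Pearson correlation, root mean square error, average absolute error."""
    n = len(obs)
    mo = sum(obs) / n
    mp = sum(pred) / n
    # Pearson correlation coefficient
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    so = math.sqrt(sum((o - mo) ** 2 for o in obs))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    r = cov / (so * sp)
    # Root mean square error
    rmse = math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / n)
    # Average absolute error
    aae = sum(abs(p - o) for o, p in zip(obs, pred)) / n
    return r, rmse, aae

# Hypothetical chlorophyll-a observations and model predictions (mg/L)
obs = [2.0, 3.1, 4.5, 3.8, 2.7]
pred = [2.3, 2.9, 4.1, 4.0, 3.0]
r, rmse, aae = skill_stats(obs, pred)
```

Each statistic answers a different question: the correlation coefficient measures covariation (but is blind to bias), while RMSE and average absolute error measure the typical size of the prediction error in the units of the variable itself.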

In earlier work that preceded the “skill assessment” designation, Reckhow et al. (1990) proposed that a statistical test applied to observations and predictions could serve as a model verification hypothesis test. Since statistical hypothesis tests are typically set up with the goal of rejecting the null hypothesis, Reckhow et al. proposed that the test be structured so that rejection of the stated null hypothesis indicates a verified model, given a pre-specified acceptable error level.

For example, consider the null hypothesis H0 that the true mean of the absolute values of the prediction error is 2 mg/L, and the alternative hypothesis H1 that the true mean of the absolute values of the prediction error is less than 2 mg/L. This is a one-sided test in that the rejection region and H1 are on only one side (less than). The hypotheses can be tested in the conventional manner, with rejection of H0 (and acceptance of H1) indicating successful model verification. When the null hypothesis is true, the sampling distribution of the test statistic is centered on 2 mg/L, and the rejection region is located in the left tail of the distribution only. To test model goodness-of-fit with hypotheses of this structure, the model user must select an acceptable error level; in the example given here, the acceptable error level is 2 mg/L. In the paper, Reckhow et al. described applications of the chi-square test, t-test, Kolmogorov-Smirnov test, regression analysis, and the Wilcoxon test using this approach.

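The one-sided test described above can be sketched with a one-sample t-test on the absolute prediction errors. This is an illustrative implementation, not the procedure from Reckhow et al. (1990): the function name, the error values, and the tabulated critical value are all assumptions for the example.

```python
import math
import statistics

def verify_model(abs_errors, acceptable_error, t_crit):
    """One-sided t-test for model verification.
    H0: true mean absolute error equals acceptable_error.
    H1: true mean absolute error is less than acceptable_error.
    Returns True when H0 is rejected (model verified at the chosen alpha)."""
    n = len(abs_errors)
    xbar = statistics.mean(abs_errors)
    s = statistics.stdev(abs_errors)
    t = (xbar - acceptable_error) / (s / math.sqrt(n))
    # Rejection region is in the left tail only (one-sided test)
    return t < -t_crit

# Hypothetical absolute prediction errors (mg/L), n = 20
errors = [1.1, 0.8, 1.4, 0.9, 1.2, 1.0, 1.3, 0.7, 1.5, 1.1,
          0.9, 1.2, 1.0, 1.4, 0.8, 1.1, 1.3, 0.9, 1.2, 1.0]
# t critical value for df = 19, one-tailed alpha = 0.05
verified = verify_model(errors, 2.0, 1.729)
```

Note the asymmetry this structure creates: failing to reject H0 does not verify the model, it simply leaves the question open, which is exactly why the burden of proof falls on demonstrating that the error is smaller than the acceptable level.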
In a previous blog post (“Is Conventional Water Quality Modeling a Charade?” posted on April 30th) I suggested that in most cases the data set aside for verification are not that different from the calibration data. To make users aware of that fact, I proposed a statistic for model verification rigor based on the differences between the calibration and verification data. Ultimately, I think that a verification rigor statistic should be combined with the skill assessment statistics discussed above for an improved assessment of the confidence that a model user should have in model applications. I plan to address that approach in an upcoming blog post.
Reckhow, K.H., J.T. Clements, and R.C. Dodd. 1990. Statistical evaluation of mechanistic water quality models. Journal of Environmental Engineering. 116:250-268.

Stow, C.A., J. Jolliff, D.J. McGillicuddy, S.C. Doney, J.I. Allen, M.A.M. Friedrichs, K.A. Rose, and P. Wallhead. 2009. Skill assessment for coupled biological/physical models of marine systems. Journal of Marine Systems 76:4-15.
