“Skill assessment” for water quality models refers to a set of statistical and graphical techniques used to quantify the goodness-of-fit of a water quality model. These techniques compare observations with model predictions; Stow et al. (2009) list and describe statistics such as the correlation coefficient, root mean square error, and average absolute error as skill assessment options for univariate comparisons.
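To illustrate, here is a minimal sketch of these three univariate skill statistics in Python; the function name and the example concentration values are my own illustration, not from Stow et al.:

import numpy as np

def skill_metrics(observed, predicted):
    # Univariate skill assessment statistics for paired
    # observations and model predictions (after Stow et al. 2009)
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    error = predicted - observed
    return {
        "correlation": np.corrcoef(observed, predicted)[0, 1],
        "rmse": np.sqrt(np.mean(error ** 2)),
        "average_absolute_error": np.mean(np.abs(error)),
    }

# Hypothetical paired concentrations (mg/L)
obs = [3.1, 2.4, 4.0, 3.6, 2.9]
pred = [2.8, 2.7, 3.5, 3.9, 3.1]
print(skill_metrics(obs, pred))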
In earlier work that preceded the “skill assessment” designation, Reckhow et al. (1990) proposed that a statistical test be applied to observations and predictions to serve as a model verification hypothesis test. Since statistical hypothesis tests are typically set up with the goal of rejecting the null hypothesis, Reckhow et al. proposed that the test be structured so that rejection of the stated null hypothesis indicates a verified model, given a pre-specified acceptable error level. For example, consider the null hypothesis H0 that the true mean of the absolute values of the prediction errors is 2 mg/L, and the alternative hypothesis H1 that the true mean of the absolute values of the prediction errors is less than 2 mg/L. This is a one-sided test in that the rejection region and H1 are on only one side (less than). The hypotheses can be tested in the conventional manner, with rejection of H0 (and acceptance of H1) indicating successful model verification. When the null hypothesis is true, the sampling distribution of the test statistic is centered on 2 mg/L, and the rejection region is located in the left tail of the distribution only. To test model goodness-of-fit with hypotheses of this structure, the model user must select an acceptable error level; in the example given here, it is 2 mg/L. In their paper, Reckhow et al. described applications of the chi-square test, t-test, Kolmogorov-Smirnov test, regression analysis, and the Wilcoxon test using this approach.
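To make the mechanics concrete, here is a minimal sketch of the t-test version of this verification test, applying a one-sided one-sample t-test to the absolute prediction errors; the function name, the data values, and the use of scipy are my illustration, not Reckhow et al.'s implementation:

import numpy as np
from scipy import stats

def verify_model(observed, predicted, acceptable_error=2.0, alpha=0.05):
    # One-sided test of H0: mean |error| = acceptable_error
    # against H1: mean |error| < acceptable_error; rejecting H0
    # is taken as evidence of successful model verification
    abs_errors = np.abs(np.asarray(predicted, dtype=float)
                        - np.asarray(observed, dtype=float))
    # alternative='less' places the rejection region in the
    # left tail of the sampling distribution, as described above
    result = stats.ttest_1samp(abs_errors, acceptable_error,
                               alternative='less')
    return result.statistic, result.pvalue, result.pvalue < alpha

# Hypothetical observed and predicted concentrations (mg/L)
obs = [3.1, 2.4, 4.0, 3.6, 2.9, 3.3]
pred = [3.0, 2.6, 3.7, 3.8, 3.0, 3.2]
t_stat, p, verified = verify_model(obs, pred)
print(f"t = {t_stat:.2f}, p = {p:.4f}, verified: {verified}")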
In a previous blog post (“Is Conventional Water Quality Modeling a Charade?”, posted on April 30th), I suggested that in most cases the data set aside for verification differ little from the calibration data. To make users aware of that, I proposed a statistic for model verification rigor based on the differences between the calibration and verification data. Ultimately, I think that a verification rigor statistic should be combined with the skill assessment statistics discussed above for an improved assessment of the confidence that a model user should have in model applications. I plan to address that approach in an upcoming blog post.
Reckhow, K.H., J.T. Clements, and R.C. Dodd. 1990. Statistical evaluation of mechanistic water quality models. Journal of Environmental Engineering 116:250-268.
Stow, C.A., J. Jolliff, D.J. McGillicuddy, S.C. Doney, J.I. Allen,
M.A.M. Friedrichs, K.A. Rose, and P. Wallhead. 2009. Skill assessment for
coupled biological/physical models of marine systems. Journal of Marine
Systems 76:4-15.