In the development and application of water quality models,
it is standard practice to set aside data, not used in calibration, for model
verification purposes. This approach is based on the reasoning that the
set-aside data provide a test of the model under new conditions and thus
reflect how the model will perform when applied for prediction. How plausible
is this reasoning?
Consider the situation where a model is calibrated with data
from 2010-2011, and then data from 2012 are used for verification. What is
likely to be different between these calibration and verification data sets?
Will these differences be sufficient to give us confidence that the calibrated
model can be relied upon for predictions when important forcings/inputs (e.g.,
pollutant loadings to a waterbody) change?
In essentially all cases, the major differences between
the 2010-2011 and 2012 data sets are likely to be natural forcing functions such as
hydrology, temperature, and solar radiation. It is extremely unlikely that the
forcing functions that are the focus of the model application, such as land
use/land cover (LULC) changes in a watershed or point source pollutant
discharges, will change very much. To the extent that pollutant loads to a
waterbody change over this time period, the change will largely be due to
changes in hydrology.
So, conventional water quality model verification has become
basically a charade. This situation is not the fault of modelers; rather, it is
simply the consequence of limited available data. Nonetheless, water quality
modelers who employ this approach need to be more candid about its limited value.
As an alternative, here is the basis for a statistical test
that could provide a measure of the rigor in model verification. To begin,
consider the figure below displaying histograms of dissolved oxygen (DO) data for
model calibration and verification:
The next figure overlays the calibration and verification
histograms for Case 1; notice how similar they are. The lack of difference
between these two data sets indicates that “verification” lacks rigor;
essentially, the model is being re-assessed with calibration-like data.
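To make this kind of comparison concrete, here is a minimal sketch in Python of how such an overlay can be produced; the dissolved oxygen arrays are hypothetical placeholders, not the data behind the figures above.

```python
# Minimal sketch: overlay calibration and verification DO histograms.
# The DO values below are hypothetical placeholders for real monitoring data.
import numpy as np
import matplotlib.pyplot as plt

do_cal = np.array([6.1, 6.8, 7.2, 5.9, 6.5, 7.0, 6.3, 6.9, 7.4, 6.6])  # calibration DO (mg/L)
do_ver = np.array([6.2, 6.7, 7.1, 6.0, 6.4, 7.3, 6.8])                 # verification DO (mg/L)

# Common bin edges so the two histograms are directly comparable
edges = np.histogram_bin_edges(np.concatenate([do_cal, do_ver]), bins=6)
plt.hist(do_cal, bins=edges, alpha=0.5, label="Calibration")
plt.hist(do_ver, bins=edges, alpha=0.5, label="Verification")
plt.xlabel("Dissolved oxygen (mg/L)")
plt.ylabel("Count")
plt.legend()
plt.show()
```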
Now consider Case 2 below:
An overlay of the two histograms, shown below, indicates
that the calibration and verification data sets are different, which suggests
that verification is more rigorous than in Case 1. However, note that the
verification data in Case 2 show lower DO than the calibration data.
Since model applications are quite likely to address improved water quality and
higher dissolved oxygen, the verification test may be rigorous but it does not
reflect conditions expected for model use.
Now consider Case 3 below:
In Case 3, the histogram of the verification data again differs from the
histogram of the calibration data, but this time the verification DO values are
higher than the calibration DO values, which is closer to the conditions
expected in a typical prediction scenario.
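A quick way to see which of these two situations a given data split resembles is to look at the direction of the shift between the two samples. The sketch below uses made-up numbers purely for illustration.

```python
# Quick check of the direction of the difference that separates Case 2 from Case 3.
# The DO values are hypothetical placeholders for real calibration/verification data.
import numpy as np

do_cal = np.array([6.1, 6.8, 7.2, 5.9, 6.5, 7.0, 6.3, 6.9, 7.4, 6.6])
do_ver = np.array([7.8, 8.1, 7.5, 8.4, 7.9, 8.6, 7.7, 8.2])

shift = np.median(do_ver) - np.median(do_cal)
print(f"Median DO shift (verification - calibration): {shift:+.2f} mg/L")
# A positive shift (Case 3) resembles the improved-DO conditions that
# management-oriented predictions typically target; a negative shift (Case 2) does not.
```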
In conclusion, to evaluate the rigor of the verification
exercise, I recommend that modelers apply a two-sample Kolmogorov-Smirnov test,
or a chi-square test, to quantitatively assess the difference between the
calibration and verification data sets. If this becomes routine practice, the
accumulated results will provide a comparative basis for confidence that a water
quality model can reliably predict water quality in response to management
changes.
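As one possible realization of this recommendation, the sketch below applies SciPy's two-sample Kolmogorov-Smirnov test and a chi-square test on binned counts; the dissolved oxygen arrays are again hypothetical placeholders for real calibration and verification data.

```python
# Sketch of the recommended distribution comparison (hypothetical DO data, mg/L).
import numpy as np
from scipy.stats import ks_2samp, chi2_contingency

do_cal = np.array([6.1, 6.8, 7.2, 5.9, 6.5, 7.0, 6.3, 6.9, 7.4, 6.6])  # calibration
do_ver = np.array([7.8, 8.1, 7.5, 8.4, 7.9, 8.6, 7.7, 8.2])            # verification

# Two-sample Kolmogorov-Smirnov test: do the two empirical distributions differ?
ks_stat, ks_p = ks_2samp(do_cal, do_ver)
print(f"KS statistic = {ks_stat:.3f}, p-value = {ks_p:.4f}")

# Chi-square alternative: bin both samples on common edges and compare the counts.
# (Choose bins so that expected counts are not too small.)
edges = np.histogram_bin_edges(np.concatenate([do_cal, do_ver]), bins=4)
cal_counts, _ = np.histogram(do_cal, bins=edges)
ver_counts, _ = np.histogram(do_ver, bins=edges)
chi2, chi2_p, dof, _ = chi2_contingency(np.vstack([cal_counts, ver_counts]))
print(f"Chi-square = {chi2:.3f}, p-value = {chi2_p:.4f} (dof = {dof})")

# A small p-value indicates the verification data genuinely differ from the
# calibration data (the test has some rigor); a large p-value suggests the model
# is simply being re-assessed with calibration-like data.
```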