Error propagation is an important
but under-utilized uncertainty analysis technique that allows a modeler to estimate
the impact of errors from all uncertain factors (e.g., parameters, inputs,
initial conditions, boundary conditions, model equations) on the model
response(s). Traditionally, the two commonly used error propagation techniques
have been first-order error analysis and Monte Carlo simulation. A related
approach, sensitivity analysis, allows the modeler to assess quantitatively the
impact of a subset of model terms (often just one) on model response.
First-order error analysis is
based on the approximation of the randomness in one variable (e.g., a reaction
rate) with its variance (the first nonzero central moment), and the characterization of a
functional relationship with the first-order terms of the Taylor series. This
means that error in an input variable (e.g., x) is assumed to be fully characterized by the variance and that
this error is converted to error in the endogenous variable (e.g., y) through a linearization of the
equation. The usefulness of first-order
error analysis is a function of the validity of these approximations.
Consider a simple functional
relationship (equation 1):
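$$y = f(x) \qquad (1)$$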
If
this relationship is reasonably “well behaved” (e.g., not highly nonlinear) and
if the standard deviation of x is not
too large, then (equation 2):
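$$E[y] \approx f(E[x]) = f(\bar{x}) \qquad (2)$$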
where E is the expectation operator. (The expectation or expected value
is the probabilistic average of a random variable. Under random sampling, the
expected value of a variable is its mean.) Likewise, under the same
assumptions, a Taylor series expansion of f(x)
may be used to approximate the variance (equation 3):
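$$f(x) = f(a) + f'(a)(x - a) + \frac{f''(a)}{2!}(x - a)^2 + \cdots \qquad (3)$$
where a is the point about which the series is expanded.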
Employing only the first two
terms of the Taylor series and taking the expansion about the mean, x̄, equation 3 becomes (equation 4):
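$$f(x) \approx f(\bar{x}) + f'(\bar{x})(x - \bar{x}) \qquad (4)$$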
Taking the variance of equation 4
and noting that the variance of f(x̄) equals
zero, this equation is transformed to the bivariate form of the error propagation
equation (equation 5):
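$$s_y^2 \approx \left(\frac{df}{dx}\right)^2 s_x^2 \qquad (5)$$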
where s is the sample standard deviation and s² is the sample variance.
For a multivariate relationship,
there is a straightforward extension of equation 5, taking into consideration
the covariation between predictor variables (equation 6):
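$$s_y^2 \approx \sum_{i=1}^{n} \left(\frac{\partial f}{\partial x_i}\right)^{2} s_{x_i}^{2} + 2\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \frac{\partial f}{\partial x_i}\,\frac{\partial f}{\partial x_j}\,\rho_{x_i x_j}\, s_{x_i} s_{x_j} \qquad (6)$$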
Equation 6 shows that the error in a model
prediction due to errors in the variables is a function
of the individual variable error ($s_{x_i}$), a sensitivity factor ($\partial f/\partial x_i$), expressing the "importance" of each variable, and the correlation ($\rho$) among variables.
This relationship is sufficiently general so that "parameters"
may be substituted for "variables" in the previous
sentence with no change in meaning.
Equation 6 is the error
propagation equation that is the basis of first-order analysis. The method receives its name from the fact
that only the first-order, or linear, terms of the Taylor series are retained.
The degree to which this approximation is successful may be assessed with the
aid of figure 1 below.
Figure 1. First-order estimation
The model relating x to f(x)
in figure 1 is assumed to be nonlinear for the sake of generality, and the
straight line tangent to this model is the first-order approximation.
First-order error analysis then is graphically portrayed by the dashed lines
which convert the error in x to an
error in f(x). The success of this
error approximation method is determined by the following:
a. The degree of nonlinearity in the model. As the model becomes increasingly nonlinear, f'(x̄)∆x becomes less accurate as a measure of ∆f. This means that highly nonlinear models may not be amenable to first-order treatment of error.
b. The size of the error term. For a nonlinear model, the accuracy of the first-order error estimate is a function of the error in x (represented by ∆x in figure 1). Small errors in x coupled with near linearity for the model are favorable conditions for effective application of first-order analysis.
c. The acceptable level of error (due to inaccuracy in the error analysis) for the issue under study.
d. The extent to which the distribution of errors is represented by the mean and standard deviation of the distribution. For complex or skewed error distributions, the mean and standard deviation may be inadequate, leading to a faulty estimate of error in f(x).
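To make equation 6 concrete, the following Python sketch (the nonlinear function, means, errors, and correlation are all invented for illustration) applies the first-order formula and compares the result with a brute-force sampling check:

```python
import numpy as np

# Invented nonlinear model for illustration: f(x1, x2) = x1 * exp(-x2)
def f(x1, x2):
    return x1 * np.exp(-x2)

# Assumed means, standard deviations, and correlation for the two inputs
x1_bar, x2_bar = 10.0, 0.5
s_x1, s_x2 = 1.0, 0.1
rho = 0.3

# Sensitivity factors: partial derivatives evaluated at the means
df_dx1 = np.exp(-x2_bar)
df_dx2 = -x1_bar * np.exp(-x2_bar)

# Equation 6: first-order (linearized) estimate of the prediction variance
var_f = (df_dx1**2 * s_x1**2
         + df_dx2**2 * s_x2**2
         + 2.0 * df_dx1 * df_dx2 * rho * s_x1 * s_x2)
print("first-order s_f:", np.sqrt(var_f))

# Brute-force check: sample correlated inputs and compute the spread of f
cov = [[s_x1**2, rho * s_x1 * s_x2], [rho * s_x1 * s_x2, s_x2**2]]
rng = np.random.default_rng(1)
x1, x2 = rng.multivariate_normal([x1_bar, x2_bar], cov, size=100_000).T
print("sampled s_f:    ", f(x1, x2).std())
```

For this mildly nonlinear function and these modest input errors, the two estimates should nearly coincide; larger input errors or stronger curvature in f widen the gap, which is exactly the concern raised in items a and b.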
Low-cost, fast computing that supports Monte Carlo simulation has made
applications of first-order error analysis relatively rare in recent years.
Monte Carlo simulation is a conceptually simple alternative to first-order
error analysis; it was so named because it shares the element of randomness
with gambling casinos. Under this technique, probability density functions are
assigned to each characteristic (e.g., variable or parameter), reflecting the
uncertainty in that characteristic. Values are then randomly chosen from each
probability distribution. These values are inserted into the model, and a
prediction is calculated. After this is repeated a large number (several
hundred to several thousand) of times, an empirical distribution of predicted
model response develops, which reflects the combined uncertainties “flowing”
through the model.
As an example, consider the simple model (equation 7), taken here, for illustration, to be linear in a measured variable x with two uncertain parameters and an additive equation error term:
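$$y = \beta_1 + \beta_2 x + \varepsilon \qquad (7)$$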
Figure
2 displays the error distributions for the uncertain parameters (β1 and β2) and
for the uncertain model equation (ε). At each step in the Monte Carlo
simulation, each distribution is randomly sampled to yield a single value that
is inserted into equation 7, and a value for the response variable y is calculated. After several hundred (or several thousand)
runs of the model, the predicted responses, y,
can be tabulated or plotted in a
histogram; this histogram reflects the errors and the model structure.
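A minimal Python sketch of this procedure, assuming the linear form of equation 7 above and illustrative normal error distributions for β1, β2, and ε (all numerical values are invented):

```python
import numpy as np

rng = np.random.default_rng(42)
n_runs = 5_000        # several hundred to several thousand model runs
x = 4.0               # one fixed value of the measured input variable

# Assumed error distributions for the two uncertain parameters and the
# equation error term (all numerical values are invented for illustration)
beta1 = rng.normal(loc=2.0, scale=0.5, size=n_runs)
beta2 = rng.normal(loc=1.5, scale=0.3, size=n_runs)
eps = rng.normal(loc=0.0, scale=1.0, size=n_runs)

# One prediction of y per sampled (beta1, beta2, eps) triple (equation 7)
y = beta1 + beta2 * x + eps

# The empirical distribution of y reflects the combined uncertainties
print(f"mean = {y.mean():.2f}, standard deviation = {y.std():.2f}")
print("5th/95th percentiles:", np.percentile(y, [5, 95]))
```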
If the parameters β1 and β2 are correlated (this is not uncommon in water quality
models), then individual sampling steps in the Monte Carlo procedure cannot be
undertaken independently. Instead, the sampling of values from the correlated
probability distributions must be undertaken sequentially, with the probability
distribution of the second parameter (either parameter may be selected first or
second) conditional on the value of the first parameter selected; this
“conditionality” reflects the correlation between the two parameters.
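A sketch of this sequential, conditional sampling, assuming for illustration that the two parameters are jointly normal with a known correlation (all numerical values are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
n_runs = 5_000

# Assumed means, standard deviations, and correlation for beta1 and beta2
# (all numerical values are invented for illustration)
m1, m2 = 2.0, 1.5
s1, s2 = 0.5, 0.3
rho = -0.7

# Sequential (conditional) sampling: draw beta1, then draw beta2 from its
# normal distribution conditional on the sampled value of beta1
beta1 = rng.normal(m1, s1, size=n_runs)
cond_mean = m2 + rho * (s2 / s1) * (beta1 - m1)
cond_sd = s2 * np.sqrt(1.0 - rho**2)
beta2 = rng.normal(cond_mean, cond_sd)

# Equivalently, the pair could be drawn jointly from a bivariate normal:
# cov = [[s1**2, rho*s1*s2], [rho*s1*s2, s2**2]]
# beta1, beta2 = rng.multivariate_normal([m1, m2], cov, size=n_runs).T
```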
An essential condition for the success of Monte Carlo simulation for error
propagation with water quality models is that the parameter error terms and
covariances be estimated. Estimation of the parameter errors
and covariances is possible with a statistical (e.g., regression) model, but
may be difficult to impossible for large water quality models with many
parameters, as the available data often do not contain sufficient information
to estimate parameter errors and covariances. Note that variances and covariances
among measured water quality variables (e.g., the “x” in equation 7) are not
the same as the variances and covariances among the model parameters (β1 and β2). For example, a model parameter may be
“phytoplankton settling velocity” in a lake, which is typically not measured; a
variable may be phytoplankton density, which is often measured (as chlorophyll
a). With commonly measured water quality
data, it may not be possible to estimate parameter errors and
covariances. Techniques presented below
can partially address this conundrum.
Among experienced
water modelers, it is understood (but generally not acknowledged) that many
"sets" of parameter values will fit a model about equally well; in other
words, similar predictions can be obtained by manipulating several parameter
values in concert. This is plausible in part because all models are
approximations of the real world, and because most model parameters represent
aggregate or “effective” processes (spatially and temporally averaged at some scale)
and are unlikely to be represented by a fixed constant across scales.
Additionally, many mathematical structures produce extreme correlation between
model parameters, even when a model is over-determined. This condition, called
"equifinality," is well documented in the hydrologic sciences, but the
concept has rarely been discussed in the water quality sciences. I believe that
the recognition of equifinality should change the perspective of water quality
modelers from seeking a single "optimal" value for each model parameter, to
seeking a distribution of parameter sets that all meet a predefined fitting criterion.
These acceptable parameter sets may then provide the basis for estimating model
prediction error associated with the model parameters.
The development of methods for identifying plausible
parameter sets for large multi-parameter environmental models with limited
observational data is best understood through regionalized (or generalized)
sensitivity analysis (RSA). RSA is a Monte Carlo sampling approach to assess
model parameter sensitivity; this method was initially proposed as a means to
prioritize future sampling and experimentation for model and parameter improvements.
Regionalized sensitivity analysis is simple in concept, and is a useful way to
use limited information to bound model parameter distributions. Given a
particular model and a system (e.g., water body) being modeled, the modeler
first defines the plausible range of certain key model response variables
(e.g., chlorophyll a, total nitrogen) as the "behavior." Outside the range is
"not the behavior." The modeler then samples from (often uniform)
distributions of each of the model parameters and computes the values for the
key response variables. Each complete sampling of all model parameters, leading
to prediction, results in a "parameter set." All parameter sets that result
in predictions of the key model response variables in the "behavior" range
are termed "behavior generating" and thus become part of the model parameter
distribution. The parameter sets that do not meet this behavior criterion are
termed "nonbehavior generating." The cumulative distribution function (CDF)
of each parameter distribution from these two classes of parameter sets
(behavior generating and nonbehavior generating) can be compared for the
evaluation of model parameter sensitivity. For a particular parameter, if the
behavior generating and nonbehavior generating distributions are substantially
different, then prediction of the key response variables is sensitive to that
parameter. Hence, resources devoted
toward model improvement might be preferentially allocated toward improved estimation
of that parameter. In addition, we can consider the distribution of the
behavior generating parameter sets as reflecting equifinality. Thus, the
empirical distribution characterizes the error (variance and covariance)
structure in the model parameters, conditional on the model and on the fitting
criterion (the defined plausible range of key response variables).
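The following Python sketch illustrates the RSA procedure with an invented two-parameter stand-in for a water quality model, an invented "behavior" range for chlorophyll a, and the maximum separation between empirical CDFs as one convenient sensitivity measure (none of these choices come from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n_sets = 10_000

# Assumed uniform prior ranges for two hypothetical model parameters
# (parameters and values are invented for illustration)
theta1 = rng.uniform(0.1, 2.0, n_sets)    # e.g., a growth-rate parameter
theta2 = rng.uniform(0.01, 1.0, n_sets)   # e.g., a settling-loss parameter

# Stand-in for the water quality model: predicted chlorophyll a (ug/L)
def model(t1, t2):
    return 40.0 * t1 / (1.0 + 10.0 * t2)

chl = model(theta1, theta2)

# "Behavior": the plausible range defined for the key response variable
behavior = (chl >= 5.0) & (chl <= 25.0)

# Behavior-generating parameter sets form the empirical parameter distribution
behavior_sets = np.column_stack([theta1[behavior], theta2[behavior]])

# Compare behavior vs. nonbehavior CDFs for each parameter; a large maximum
# separation between the two empirical CDFs indicates a sensitive parameter
def cdf_separation(a, b):
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return np.abs(cdf_a - cdf_b).max()

for name, theta in (("theta1", theta1), ("theta2", theta2)):
    sep = cdf_separation(theta[behavior], theta[~behavior])
    print(f"{name}: maximum CDF separation = {sep:.2f}")
```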
Generalized Likelihood Uncertainty Estimation (GLUE) is an
extension of RSA; the RSA binary system of acceptance/rejection of
behavioral/nonbehavioral simulations is replaced in GLUE by a "likelihood"
measure that assigns different levels of confidence (weighting) to different
parameter sets. By effectively evaluating
the fit of parameter sets, RSA, GLUE, and Markov Chain Monte Carlo (MCMC)
provide useful information for model parameter error propagation. These
techniques can be used to develop plausible parameter sets, which
collectively express the parameter covariance (parameter error and correlation)
structure to help address equifinality. Each of these techniques can be
used to create a multi-parameter distribution that is “behavior generating” to
characterize parameter sets for a water quality model. This distribution can
then become the basis for Monte Carlo simulation for error propagation; this is
different from standard Monte Carlo simulation in that parameter sets, not
individual parameters, are sampled. By sampling parameter sets to assess
prediction uncertainty, we incorporate the parameter variance-covariance
structure into the simulation results. While this still leaves model (equation)
error unaddressed, it does provide the opportunity to advance our understanding
of the error in water quality model predictions.
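Continuing the invented example from the RSA sketch above, the following Python sketch shows one way this can work: a GLUE-style likelihood weight (here a simple Gaussian-type measure against an invented observation) is attached to each behavior-generating parameter set, and prediction uncertainty is obtained by resampling whole parameter sets:

```python
import numpy as np

rng = np.random.default_rng(1)

def model(t1, t2):                         # same invented stand-in as above
    return 40.0 * t1 / (1.0 + 10.0 * t2)

# Regenerate behavior-generating parameter sets as in the RSA sketch
t1 = rng.uniform(0.1, 2.0, 10_000)
t2 = rng.uniform(0.01, 1.0, 10_000)
chl = model(t1, t2)
keep = (chl >= 5.0) & (chl <= 25.0)
sets = np.column_stack([t1[keep], t2[keep]])

# GLUE-style likelihood weights: sets that better reproduce an observed
# chlorophyll a value get more weight (the Gaussian-type measure and the
# observation are illustrative choices only)
chl_obs, sigma = 15.0, 3.0
w = np.exp(-0.5 * ((model(sets[:, 0], sets[:, 1]) - chl_obs) / sigma) ** 2)
w /= w.sum()

# Propagate error by resampling whole parameter SETS (rows), which carries
# the parameter variance-covariance structure into the predictions
idx = rng.choice(len(sets), size=5_000, p=w)
y = model(sets[idx, 0], sets[idx, 1])
print("5th/95th percentile prediction:", np.percentile(y, [5, 95]))
```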