Monday, June 10, 2013

An Assessment of Techniques for Error Propagation (Uncertainty Analysis) in Water Quality Models

Error propagation is an important but under-utilized uncertainty analysis technique that allows a modeler to estimate the impact of errors from all uncertain factors (e.g., parameters, inputs, initial conditions, boundary conditions, model equations) on the model response(s). The two commonly used error propagation techniques traditionally have been first-order error analysis and Monte Carlo simulation. A related approach, sensitivity analysis, allows the modeler to assess quantitatively the impact of a subset of model terms (often just one) on model response.

First-order error analysis is based on the approximation of the randomness in one variable (e.g., a reaction rate) with the first nonzero central moment (the variance), and the characterization of a functional relationship with the first-order terms of the Taylor series. This means that error in an input variable (e.g., x) is assumed to be fully characterized by the variance and that this error is converted to error in the endogenous variable (e.g., y) through a linearization of the equation. The usefulness of first-order error analysis is a function of the validity of these approximations.

Consider a simple functional relationship (equation 1):

$$y = f(x)$$
If this relationship is reasonably “well behaved” (e.g., not highly nonlinear) and if the standard deviation of x is not too large, then (equation 2):

$$E(y) \approx f\big(E(x)\big)$$

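where E is the expectation operator. (The expectation or expected value is the probabilistic average of a random variable. Under random sampling, the expected value of a variable is its mean.) Likewise, under the same assumptions, a Taylor series expansion of f(x) may be used to approximate the variance (equation 3):

$$f(x) = f(x_0) + \left.\frac{df}{dx}\right|_{x_0} (x - x_0) + \frac{1}{2!} \left.\frac{d^2 f}{dx^2}\right|_{x_0} (x - x_0)^2 + \cdots$$

where $x_0$ is the point about which the series is expanded.
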
Employing only the first two terms of the Taylor series and taking the expansion about the mean, $\bar{x}$, equation 3 becomes (equation 4):

$$f(x) \approx f(\bar{x}) + \left.\frac{df}{dx}\right|_{\bar{x}} (x - \bar{x})$$

Taking the variance of equation 4 and noting that the variance of $f(\bar{x})$ equals zero (it is a constant), this equation is transformed to the bivariate form of the error propagation equation (equation 5):

$$s_y^2 \approx \left(\frac{df}{dx}\right)^2 s_x^2$$

where $s$ is the sample standard deviation and $s^2$ is the sample variance.

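For a multivariate relationship, there is a straightforward extension of equation 5, taking into consideration the covariation between predictor variables (equation 6):

$$s_y^2 \approx \sum_{i=1}^{n} \left(\frac{\partial f}{\partial x_i}\right)^2 s_{x_i}^2 + 2 \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \frac{\partial f}{\partial x_i} \frac{\partial f}{\partial x_j}\, s_{x_i} s_{x_j}\, \rho_{x_i x_j}$$
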
Equation 6 shows that the error in a model prediction due to errors in the variables is a function of the individual variable error ($s_{x_i}$), a sensitivity factor ($\partial f/\partial x_i$) expressing the "importance" of each variable, and the correlation ($\rho$) among variables. This relationship is sufficiently general so that "parameters" may be substituted for "variables" in the previous sentence with no change in meaning.
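
As a minimal sketch of the error propagation equation described above, the following Python function combines variable standard deviations, finite-difference sensitivity factors, and a correlation matrix. The function name `first_order_variance`, the toy model (a product of two variables), and all numeric values are assumptions chosen only for illustration.

```python
import numpy as np

def first_order_variance(f, x_mean, x_sd, corr, rel_step=1e-6):
    """Approximate var[f(x)] with the first-order error propagation equation.

    f      : callable taking a 1-D array of variables
    x_mean : means of the variables (the expansion point)
    x_sd   : standard deviations of the variables
    corr   : correlation matrix among the variables
    """
    x_mean = np.asarray(x_mean, dtype=float)
    x_sd = np.asarray(x_sd, dtype=float)
    corr = np.asarray(corr, dtype=float)

    # Sensitivity factors df/dx_i by central finite differences at the mean.
    grad = np.empty_like(x_mean)
    for i in range(x_mean.size):
        h = rel_step * max(abs(x_mean[i]), 1.0)
        step = np.zeros_like(x_mean)
        step[i] = h
        grad[i] = (f(x_mean + step) - f(x_mean - step)) / (2.0 * h)

    # Covariance matrix: cov_ij = s_i * s_j * rho_ij.
    cov = np.outer(x_sd, x_sd) * corr
    # grad' cov grad expands to the variance terms plus the covariance terms.
    return float(grad @ cov @ grad)

# Hypothetical example: y = x1 * x2 (e.g., a load as concentration times flow).
var_y = first_order_variance(
    lambda x: x[0] * x[1],
    x_mean=[2.0, 10.0],
    x_sd=[0.2, 1.0],
    corr=[[1.0, 0.5], [0.5, 1.0]],
)
print(var_y)
```

For these assumed inputs the sensitivity factors are 10 and 2, so the two variance terms and the covariance term each contribute about 4, and the printed variance is close to 12.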

Equation 6 is the error propagation equation that is the basis of first-order analysis.  The method receives its name from the fact that only the first-order, or linear, terms of the Taylor series are retained. The degree to which this approximation is successful may be assessed with the aid of figure 1 below. 
 

Figure 1.  First order estimation

The model relating x to f(x) in figure 1 is assumed to be nonlinear for the sake of generality, and the straight line tangent to this model is the first-order approximation. First-order error analysis then is graphically portrayed by the dashed lines which convert the error in x to an error in f(x). The success of this error approximation method is determined by the following:
a. The degree of nonlinearity in the model. As the model becomes increasingly nonlinear, f'(x)∆x becomes less accurate as a measure of ∆f. This means that highly nonlinear models may not be amenable to first-order treatment of error.
b. The size of the error term. For a nonlinear model, the accuracy of the first-order error estimate is a function of the error in x (represented by ∆x in figure 1). Small errors in x coupled with near linearity for the model are favorable conditions for effective application of first-order analysis.
c. The acceptable level of error (due to inaccuracy in the error analysis) for the issue under study.
d. The extent to which the distribution of errors is represented by the mean and standard deviation of the distribution. For complex or skewed error distributions, the mean and standard deviation may be inadequate, leading to a faulty estimate of error in f(x).
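
Conditions (a) and (b) can be illustrated numerically. The sketch below compares the first-order estimate f'(x)∆x with the actual change ∆f for a hypothetical first-order decay model; the model, its constants, and the chosen error sizes are assumptions made only for this illustration.

```python
import math

def decay(k, c0=10.0, t=5.0):
    """Hypothetical nonlinear model: concentration after first-order decay."""
    return c0 * math.exp(-k * t)

def decay_slope(k, c0=10.0, t=5.0):
    """Analytical derivative d(decay)/dk, the first-order sensitivity factor."""
    return -t * c0 * math.exp(-k * t)

k_mean = 0.2
results = []
for dx in (0.01, 0.05, 0.2):
    actual = decay(k_mean + dx) - decay(k_mean)   # true change in f
    linear = decay_slope(k_mean) * dx             # first-order estimate f'(x) * dx
    rel_err = abs(linear - actual) / abs(actual)
    results.append((dx, actual, linear, rel_err))
    print(f"dx={dx:<5} actual={actual:8.4f} first-order={linear:8.4f} rel.err={rel_err:6.1%}")
```

In this toy case the relative error of the linear estimate grows steadily as ∆x increases, which is the behavior that figure 1 portrays graphically.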

Low-cost, fast computing has made Monte Carlo simulation practical for most models, and as a result applications of first-order error analysis have become relatively rare in recent years. Monte Carlo simulation is a conceptually simple alternative to first-order error analysis; it was so named because it shares the element of chance with casino gambling. Under this technique, probability density functions are assigned to each characteristic (e.g., variable or parameter), reflecting the uncertainty in that characteristic. Values are then randomly chosen from each probability distribution. These values are inserted into the model, and a prediction is calculated. After this is repeated a large number of times (several hundred to several thousand), an empirical distribution of predicted model response develops, which reflects the combined uncertainties “flowing” through the model.

As an example, consider the simple model (equation 7):

$$y = \beta_1 + \beta_2 x + \varepsilon$$

Figure 2 displays the error distributions for the uncertain parameters (β1 and β2) and for the uncertain model equation (ε). At each step in the Monte Carlo simulation, each distribution is randomly sampled to yield a single value that is inserted into equation 7, and a value for the response variable y is calculated. After several hundred (or several thousand) runs of the model, the predicted responses, y, can be tabulated or plotted in a histogram; this histogram reflects the errors and the model structure.
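
The sampling loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the model is taken to have the linear form y = β1 + β2·x + ε, all three error distributions are assumed normal, and every numeric value (means, standard deviations, the value of x, the number of runs) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n_runs = 5000          # "several thousand" model runs
x = 4.0                # hypothetical value of the predictor variable

# Assumed (hypothetical) error distributions for the parameters and model error.
beta1 = rng.normal(loc=1.0, scale=0.3, size=n_runs)
beta2 = rng.normal(loc=0.5, scale=0.1, size=n_runs)
eps = rng.normal(loc=0.0, scale=0.4, size=n_runs)

# Each element is one Monte Carlo run of the assumed model y = beta1 + beta2*x + eps.
y = beta1 + beta2 * x + eps

counts, bin_edges = np.histogram(y, bins=30)   # tabulated histogram of predictions
print("mean:", y.mean(), "std:", y.std(ddof=1))
print("90% interval:", np.percentile(y, [5, 95]))
```

Because the parameters are sampled independently here, this sketch applies only when the parameter errors are uncorrelated.
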

If the parameters β1 and β2 are correlated (this is not uncommon in water quality models), then individual sampling steps in the Monte Carlo procedure cannot be undertaken independently. Instead, the sampling of values from the correlated probability distributions must be undertaken sequentially, with the probability distribution of the second parameter (either parameter may be selected first or second) conditional on the value of the first parameter selected; this “conditionality” reflects the correlation between the two parameters.

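
A minimal sketch of this sequential, conditional sampling follows. It assumes the two parameters are jointly normal, in which case the conditional distribution of the second parameter given the first is also normal with a known mean and standard deviation; the means, standard deviations, and correlation are hypothetical values.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
n_runs = 5000
mu1, sd1 = 1.0, 0.3      # hypothetical mean and standard deviation of beta1
mu2, sd2 = 0.5, 0.1      # hypothetical mean and standard deviation of beta2
rho = -0.7               # hypothetical correlation between beta1 and beta2

# Step 1: sample the first parameter from its marginal distribution.
beta1 = rng.normal(mu1, sd1, size=n_runs)

# Step 2: sample the second parameter conditional on the first.
# For a bivariate normal, beta2 | beta1 is normal with the moments below.
cond_mean = mu2 + rho * (sd2 / sd1) * (beta1 - mu1)
cond_sd = sd2 * np.sqrt(1.0 - rho**2)
beta2 = rng.normal(cond_mean, cond_sd)

sample_rho = np.corrcoef(beta1, beta2)[0, 1]
print("sample correlation:", sample_rho)
```

The sampled pairs reproduce the assumed correlation, and each pair can then be inserted into the model as one Monte Carlo run. For non-normal distributions the conditional step would need a different form.
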
An essential condition for success of Monte Carlo simulation for error propagation with water quality models is that the error terms and the parameter covariances need to be estimated. Estimation of the parameter errors and covariances is possible with a statistical (e.g., regression) model, but may be difficult to impossible for large water quality models with many parameters, as the available data often do not contain sufficient information to estimate parameter errors and covariances. Note that variances and covariances among measured water quality variables (e.g., the “x” in equation 7) are not the same as the variances and covariances among the model parameters (β1 and β2). For example, a model parameter may be “phytoplankton settling velocity” in a lake, which is typically not measured; a variable may be phytoplankton density, which is often measured (as chlorophyll a).  With commonly-measured water quality data, it may not be possible to estimate parameter errors and covariances.  Techniques presented below can partially address this conundrum.
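
For the statistical (regression) case mentioned above, parameter errors and covariances come directly from the fit. The sketch below uses ordinary least squares on synthetic data; the data-generating values, the noise level, and the straight-line model are assumptions for illustration and do not come from any real water quality data set.

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# Synthetic (hypothetical) data generated from y = 1.0 + 0.5*x + noise.
x = np.linspace(0.0, 10.0, 40)
y = 1.0 + 0.5 * x + rng.normal(0.0, 0.4, size=x.size)

# Design matrix with an intercept column; ordinary least squares fit.
X = np.column_stack([np.ones_like(x), x])
beta_hat, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

# Residual variance and parameter covariance matrix: s^2 * (X'X)^-1.
residuals = y - X @ beta_hat
dof = x.size - X.shape[1]
s2 = residuals @ residuals / dof
param_cov = s2 * np.linalg.inv(X.T @ X)

param_sd = np.sqrt(np.diag(param_cov))
param_corr = param_cov[0, 1] / (param_sd[0] * param_sd[1])
print("estimates:", beta_hat)
print("standard errors:", param_sd)
print("intercept-slope correlation:", param_corr)
```

Even in this two-parameter case the intercept and slope estimates are strongly (negatively) correlated, which is why the parameter covariances, and not just the variances, matter for error propagation.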

Among experienced water modelers, it is understood (but generally not acknowledged) that many “sets” of parameter values will fit a model about equally well; in other words, similar predictions can be obtained by simultaneously manipulating several parameter values in concert. This is plausible in part because all models are approximations of the real world, and because most model parameters represent aggregate or “effective” processes (spatially and temporally averaged at some scale) and are unlikely to be represented by a fixed constant across scales. Additionally, many mathematical structures produce extreme correlation between model parameters, even when a model is over-determined. This condition, called “equifinality,” is well-documented in the hydrologic sciences, but the concept has rarely been discussed in the water quality sciences. I believe that the recognition of equifinality should change the perspective of water quality modelers from seeking a single “optimal” value for each model parameter, to seeking a distribution of parameter sets that all meet a predefined fitting criterion. These acceptable parameter sets may then provide the basis for estimating model prediction error associated with the model parameters.

The development of methods for identifying plausible parameter sets for large multi-parameter environmental models with limited observational data is best understood by starting with regionalized (or generalized) sensitivity analysis (RSA). RSA is a Monte Carlo sampling approach to assess model parameter sensitivity; this method was initially proposed as a means to prioritize future sampling and experimentation for model and parameter improvements. Regionalized sensitivity analysis is simple in concept, and is a useful way to use limited information to bound model parameter distributions. Given a particular model and a system (e.g., water body) being modeled, the modeler first defines the plausible range of certain key model response variables (e.g., chlorophyll a, total nitrogen) as the “behavior.” Outside the range is “not the behavior.” The modeler then samples from (often uniform) distributions of each of the model parameters and computes the values for the key response variables. Each complete sampling of all model parameters, leading to prediction, results in a “parameter set.” All parameter sets that result in predictions of the key model response variables in the “behavior” range are termed “behavior generating” and thus become part of the model parameter distribution. The parameter sets that do not meet this behavior criterion are termed “nonbehavior generating.” The cumulative distribution function (CDF) of each parameter distribution from these two classes of parameter sets (behavior generating and nonbehavior generating) can be compared for the evaluation of model parameter sensitivity. For a particular parameter, if the behavior generating and nonbehavior generating distributions are substantially different, then prediction of the key response variables is sensitive to that parameter. Hence, resources devoted toward model improvement might be preferentially allocated toward improved estimation of that parameter.

In addition, we can consider the distribution of the behavior generating parameter sets as reflecting equifinality. Thus, the empirical distribution characterizes the error (variance and covariance) structure in the model parameters, conditional on the model and on the fitting criterion (the defined plausible range of key response variables).
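
The RSA procedure can be sketched as follows. Everything specific here is an assumption for illustration: the toy steady-state model, the parameter names `settling` and `temp_coef`, their uniform ranges, and the behavior range of 10 to 15. The difference between the behavior generating and nonbehavior generating CDFs is summarized with the maximum vertical distance between them (the Kolmogorov-Smirnov statistic), which is one common way to make the comparison.

```python
import numpy as np

def ks_distance(a, b):
    """Maximum vertical distance between two empirical CDFs."""
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / a.size
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

def toy_model(settling, temp_coef, load=20.0, flushing=0.5):
    """Hypothetical steady-state concentration; stands in for a water quality model."""
    return temp_coef * load / (flushing + settling)

rng = np.random.default_rng(seed=4)
n_sets = 5000

# Sample each parameter from a uniform range (hypothetical ranges).
params = {
    "settling": rng.uniform(0.1, 2.0, size=n_sets),
    "temp_coef": rng.uniform(0.95, 1.05, size=n_sets),
}
response = toy_model(params["settling"], params["temp_coef"])

# The "behavior" is a plausible range for the key response variable.
behavior = (response >= 10.0) & (response <= 15.0)

# Compare behavior generating and nonbehavior generating distributions per parameter.
ks = {name: ks_distance(vals[behavior], vals[~behavior]) for name, vals in params.items()}

# The retained parameter sets carry the empirical variance-covariance structure.
behavior_sets = np.column_stack([vals[behavior] for vals in params.values()])
print(ks)
print(np.corrcoef(behavior_sets, rowvar=False))
```

In this toy setup the response is far more sensitive to `settling` than to `temp_coef`, so the CDF distance is large for the former and small for the latter. The correlation matrix of the behavior generating sets is the empirical error structure discussed above.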

Generalized Likelihood Uncertainty Estimation (GLUE) is an extension of RSA; the RSA binary system of acceptance/rejection of behavioral/nonbehavioral simulations is replaced in GLUE by a “likelihood” measure that assigns different levels of confidence (weighting) to different parameter sets. By effectively evaluating the fit of parameter sets, RSA, GLUE, and Markov Chain Monte Carlo (MCMC) provide useful information for model parameter error propagation. These techniques can be used to develop plausible parameter sets, which collectively express the parameter covariance (parameter error and correlation) structure to help address equifinality. Each of these techniques can be used to create a multi-parameter distribution that is “behavior generating” to characterize parameter sets for a water quality model. This distribution can then become the basis for Monte Carlo simulation for error propagation; this is different from standard Monte Carlo simulation in that parameter sets, not individual parameters, are sampled. By sampling parameter sets to assess prediction uncertainty, we incorporate the parameter variance-covariance structure into the simulation results. While this still leaves model (equation) error unaddressed, it does provide the opportunity to advance our understanding of the error in water quality model predictions.
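
A GLUE-style version of this idea can be sketched as below. It is a simplified illustration and not a full GLUE implementation: the two-parameter model, the observations, the uniform sampling ranges, and the likelihood measure (an exponential function of the sum of squared errors with an assumed error scale of 0.3) are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(seed=5)

def model(b1, b2, x):
    """Hypothetical two-parameter model standing in for a water quality model."""
    return b1 + b2 * x

# Hypothetical observations used to score each parameter set.
x_obs = np.array([1.0, 2.0, 4.0, 6.0, 8.0])
y_obs = np.array([1.6, 2.1, 2.9, 4.2, 4.9])

# Sample candidate parameter sets from uniform ranges (assumed).
n_sets = 20000
sets = np.column_stack([rng.uniform(0.0, 2.0, n_sets), rng.uniform(0.0, 1.0, n_sets)])

# Likelihood measure (one possible choice): based on the sum of squared errors.
pred = model(sets[:, [0]], sets[:, [1]], x_obs)      # shape (n_sets, n_obs)
sse = np.sum((pred - y_obs) ** 2, axis=1)
weights = np.exp(-sse / (2.0 * 0.3**2))              # 0.3 is an assumed error scale
weights /= weights.sum()

# Sample whole parameter sets (rows), not individual parameters, by weight.
idx = rng.choice(n_sets, size=5000, replace=True, p=weights)
sampled = sets[idx]

# Propagate the sampled sets to a prediction at a new x value.
y_new = model(sampled[:, 0], sampled[:, 1], 5.0)
print("parameter correlation:", np.corrcoef(sampled, rowvar=False)[0, 1])
print("90% interval (parameter error only):", np.percentile(y_new, [5, 95]))
```

Because rows are resampled intact, the strong negative correlation between the two parameters carries through to the prediction interval. Sampling each column independently would discard that correlation and give a different, generally wider, interval in this example.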
