This article lays out the fundamentals of data assimilation as used in biogeochemistry. It demonstrates that all of the methods in widespread use within the field are special cases of the underlying Bayesian formalism. Methods differ in the assumptions they make and information they provide on the probability distributions used in Bayesian calculations. It thus provides a basis for comparison and choice among these methods. It also provides a standardised notation for the various quantities used in the field.

The task of improving current knowledge by considering
observed phenomena is a fundamental part of the scientific process.
The mechanics of the process are a matter for historians of science
and a matter of fierce debate, while its fundamental reliability is a
matter for philosophers, literally a metaphysical question. Here we
are concerned with building and using algorithms for the process. The step by
step construction of such algorithms from agreed upon logical premises is
beyond our scope here but is masterfully presented
by

It is important to note at the outset the generality of the machinery we will present, especially because many of the common objections to the methods spring from misunderstandings of this generality. For example, it is a common complaint that data assimilation can improve a model but cannot test it. The methods are, however, capable of considering multiple models and assigning probabilities to them, providing a test of their relative performance.

At the outset we should settle some questions of terminology. The phrases
“data assimilation”, “parameter estimation”, “inverse modelling” and “model–data fusion”
are used with overlapping meanings in the literature. “Inversion” is
most commonly used to describe improving knowledge of the inputs of a
model given observations of its output. The term arises since the
technique reverses the normal direction of causality

We are concerned with algorithms to improve the performance of models through the use of data. The improvement will come via knowledge of the state (quantities the model calculates), boundary conditions (quantities we insert into the model) or structure of the model. Performance will be assessed by the model's ability to reproduce observations, especially those not included in the data assimilation algorithm.

The application of data assimilation to biogeochemistry has proceeded
from informality to formal methods and from small to larger problems.
One can see the study of

The flourishing of the general approach has also led to a great diversification of methods, reflecting repeated borrowings from other fields. This has been extremely fruitful, but can be confusing for a novice. Our aim in this paper is to reintroduce the fundamental theory and demonstrate how these methods are implementations of that theory. We will aim to tread a careful path between generality and simplicity. If we succeed, a reader should be well-placed to understand the relationships among the methods and applications presented later. We will also introduce a consistent notation that is sufficiently general as to capture the range of possible applications.

The structure of the paper is as follows. In Sect.

In what follows we will not be using mathematically precise
language. We think the trade-off of ambiguity for simplicity suits our
introductory task. Some of the key concepts are introduced in

For our purposes, we define an event as a statement about the condition
of a system. This is deliberately general; such statements can take
many forms.
Examples include categorical or discrete statements (e.g. “including
isopycnal mixing improves ocean circulation models”), logical propositions (e.g. “increasing soil temperatures lead to increased soil respiration”)
or quantitative statements like “the Amazon forest is a sink of
between 1 and 2 Pg C yr⁻¹”

This definition of event already takes us down one fork of the major distinction in statistics roughly described as Bayesians vs. frequentists. This is a Bayesian definition; a frequentist will usually refer to an event as the outcome of an experiment.

The concept of probability is so simple and universal that it is hard
to find a definition that is more than a tautology. It is a function
mapping the set of events onto the interval [0, 1].

At this point, any discussion of statistics bifurcates into two
apparently incompatible methodologies, roughly termed frequentist and
Bayesian. Debate between the two schools has been carried on with
surprising vigour. See

Frequentists generally pose problems as the quest for some property of
a population. The property must be

Bayesian statisticians concern themselves with the state of knowledge
of a system. They regard the task of inference as improving this state
of knowledge. We generate knowledge of some property of the system by
applying some mathematical operation to the state of knowledge of the
underlying system. Although we may call this estimating some property
of the system, the calculations involved are quite different from the
estimators of frequentist statistics. As a practical example a
frequentist may estimate a mean by averaging their sample while a
Bayesian may calculate an integral over their probability density.
It is also important to note that for a Bayesian the state of
knowledge does not necessarily come only from the sample itself. This
use of information external or

The notation in use in this field is as varied as the nomenclature.
This is unavoidable where similar techniques have been developed
independently to
answer the needs of many different fields. This diversity complicates
the task for novices and experts alike as they compare literature from
different fields. Producing a notation at once compact
enough to be usable and general enough to cover all cases is a
difficult task.

Table of notation used throughout the issue. Borrows
heavily from

In this section we will sketch the fundamental solution of the inverse problem. This will motivate a more detailed description of the various inputs in the following section.

The recipe for combining these two sources of knowledge is
traditionally introduced via conditional probabilities and expressed
as Bayes' theorem

We postulate that the true state of the system is consistent with our
prior knowledge of it and the measurements we take. The link between the
system's state and the observations is created by an observation
operator (often called a forward operator).

The case for one target variable and one observation is described in
Fig.

The solution of the problem is the intersection or
multiplication of the three PDFs. Adding the requirement that the
system state must take a value, we obtain the solution (panel b) by projecting the joint PDF onto the
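
The multiplication described here can be sketched numerically. The following is a minimal one-dimensional illustration with invented numbers, assuming a Gaussian prior, a Gaussian observation PDF and an identity observation operator; the posterior is formed on a grid and renormalised:

```python
import numpy as np

def gaussian(z, mu, sigma):
    """Gaussian probability density."""
    return np.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

x = np.linspace(-5.0, 10.0, 2001)       # grid of candidate system states
prior = gaussian(x, 1.0, 2.0)           # prior knowledge: N(1, 2^2)
y, sigma_y = 4.0, 1.0                   # one observation and its uncertainty
likelihood = gaussian(y, x, sigma_y)    # identity observation operator

posterior = prior * likelihood          # multiplication of the PDFs
posterior /= posterior.sum() * (x[1] - x[0])  # renormalise to integrate to 1

x_map = x[np.argmax(posterior)]         # mode lies between prior mode and y,
                                        # pulled towards the more certain input
```

For these numbers the posterior mode sits at 3.4, the precision-weighted compromise between the prior mode (1) and the observation (4).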

Illustration of Bayesian inference for a system with one
target variable and one observation. Panel

In this section we describe the various inputs to
Eq. (

It is important at the outset to distinguish between target variables

We often wish to calculate the uncertainty in some predicted quantity. We may inadvertently eliminate a target variable whose uncertainty contributes to that of our quantity of interest.

We are not always capable of guessing in advance which target
variables will be well constrained by the assimilation. We should let
the observations determine which variables can be constrained rather
than prescribing this ourselves

If a target variable that should be important for explaining an
observation is removed, the information from that observation may
contaminate other target variables

Point three above has been the subject of much work in the flux
inversion community in which our target variables are gridded surface
fluxes of some atmospheric tracer. Often we group or aggregate grid
cells into regions and assume the flux pattern inside these regions is
perfectly known. Our target variables comprise multiples or scaling
factors for these patterns.

We use an example to illustrate the process.

Decide on a quantity of interest. This will be driven by the science question we are addressing, e.g. the global mean temperature in 2100 or the biospheric carbon uptake of North America. Often it will not be a target variable but the result of a forecast or diagnostic calculation.

Use sensitivity analysis to find variables to which the quantity
of interest is
sensitive

Generate prior PDFs for the most sensitive variables and, with these,
calculate the prior PDF in the quantity of interest generated by each
potential target variable. This and the previous steps should never be separated; the
sensitivity to a given target variable is best expressed in uncertainty
units for that variable. This provides a common basis for comparing
sensitivities of different variables in different units. For
example,

Target the most important variables for our quantity of
interest with the cut-off being
computationally driven. This led

The PDF for the prior

In practice, prior information is most often taken from a model providing
a mechanistic or process-based representation of the system

Alternatives that reduce the reliance on potentially subjective prior information
do exist, where less restrictive prior assumptions are made. The case of uniform
priors described above is one such example. One practical example in biogeochemical
data assimilation is the
“geostatistical” approach

It is important to note that the various approaches used to construct the prior in data
assimilation problems are not disjoint options, but rather can be thought of as points
along a continuum of options that reflect different choices about the assumptions
that a researcher is willing to make about the system

We also note that often much more attention is paid to the parameters
that define the location of the prior PDF (such as the mode) while
less attention is paid to its spread (e.g. the variance in the case
of Gaussian distributions). The two are equally important as
they both control the intersection of PDFs in Fig.

The PDF for the observation

The PDF for the observation operator

In some cases it may be possible to calculate uncertainty in the
observation operator in advance rather than diagnosing it from the
assimilation. If there is an ensemble of observation operators
available

Although they are not one of the ingredients in
Fig.

The boundary between the observation operator and the dynamical model depends on our choice of target variables. If we consider only the initial state and assimilate observations over a time window (the classic case of four-dimensional variational data assimilation, 4D-Var) then the dynamical model forms part of the mapping between the unknowns and the observations so is properly considered part of the observation operator. If we include the system state from the assimilation window (weak-constraint 4D-Var) then the dynamical model enters as a separate object with its own PDF. We will see shortly how this can be incorporated explicitly into the statistical framework. One confusing outcome of this distinction is that the same model could play the role of an observation operator or a dynamical model depending on how it is used.

The solution to the data assimilation problem depends on many inputs that are
themselves uncertain. These include parameters of the input
PDFs such as variances or choices of observation operators.
Frequently we regard these as fixed, which is likely to underestimate
the uncertainty of the estimates. One solution is to
incorporate these extra variables into the data assimilation process. The
problem is to deal with the complex relationships such as the
dependence of the likelihood

The chain rule of probabilities

In our case we extend our target variables

For our case Eq. (

A prototype of this approach was given by

Finally we should note that the hierarchical approach outlined here
subsumes the description in Sect.

We saw in Sect.

Second, we see that the only physical model involved is the observation operator. All the sophisticated machinery of assimilation is not fundamental to the problem, although we need it to derive most of the summary statistics.

Third, there is no requirement that a solution exists or, if one does,
it may have vanishingly low probability. In Fig.

We thank Ian Enting for pointing this out.

A general approach for characterising the posterior probability distribution of the target variables is to devise an approach for generating a large number of samples, or realisations, from this multivariate posterior distribution. We briefly describe a few example approaches that fall under this umbrella.

One approach to solving Eq. (

A range of more sophisticated Monte Carlo sampling approaches exist for multidimensional problems. The most straightforward of these is direct sampling, feasible when the posterior PDF is simple enough to sample from directly. This would be the case, for example, for multivariate Gaussian or lognormal distributions.
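
As a minimal sketch of direct sampling, assume a hypothetical two-dimensional Gaussian posterior whose mean and covariance are known; realisations are drawn independently and then used to estimate any summary statistic:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior: a 2-D Gaussian with known mean and covariance.
mean = np.array([1.0, 2.0])
cov = np.array([[1.0, 0.6],
                [0.6, 2.0]])

# Direct sampling: each realisation is drawn independently from the posterior.
samples = rng.multivariate_normal(mean, cov, size=50_000)

# Any summary statistic can then be estimated from the ensemble of samples.
sample_mean = samples.mean(axis=0)
sample_cov = np.cov(samples, rowvar=False)
```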

In cases where direct sampling is not feasible, the strategy often
becomes one of sequentially sampling from a surrogate distribution
(often termed the proposal distribution) in such a way that the
ensemble of samples ultimately represents a sample from the more
complex target distribution. Arguably, the simplest of the approaches
falling under this umbrella is rejection sampling. In this approach,
a sample is generated from a proposal distribution that is simple
enough to be sampled directly, and that sample is accepted or rejected
with a probability equal to the ratio between the likelihood of the
sampled value based on the target versus the proposal distribution,
normalised by a constant
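
A minimal sketch of rejection sampling, assuming a hypothetical unnormalised target density z² on [0, 1], a uniform proposal and a bound M on the ratio of target to proposal:

```python
import numpy as np

rng = np.random.default_rng(1)

def target(z):
    """Hypothetical unnormalised target density on [0, 1]."""
    return z ** 2

M = 1.0   # bound: target(z) <= M * proposal(z) for a U(0, 1) proposal

def rejection_sample(n):
    out = []
    while len(out) < n:
        z = rng.uniform(0.0, 1.0)    # draw from the (simple) proposal
        u = rng.uniform(0.0, M)      # accept with probability target(z) / M
        if u < target(z):
            out.append(z)
    return np.array(out)

samples = rejection_sample(20_000)
# The normalised target is 3 z^2, whose mean is 3/4.
```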

A particularly useful class of Monte Carlo sampling strategies are Markov chain Monte Carlo (MCMC) approaches. In these approaches, the samples form a Markov chain, such that each sample is not obtained independently, but rather as a perturbation of the last previously accepted sample.
Rather than sampling the posterior PDF directly, one can also sample each of three PDFs sequentially
(Fig.

Another commonly used MCMC algorithm, the Gibbs sampler, can be seen
as a special case of the Metropolis–Hastings algorithm where the
target variables are sampled individually and sequentially, rather
than sampling the full multivariate PDF of target variables as a
single step

The most important implementation detail for applying MCMC approaches is the choice of the perturbation. Good strategies are adaptive, lengthening the step size to escape from improbable regions of the parameter space and shortening it to sample probable parts (noting the comments on independence above).
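
For concreteness, here is a minimal random-walk Metropolis sampler (a special case of Metropolis–Hastings with a symmetric proposal), applied to an assumed toy log-posterior; the step size is fixed here rather than adaptive:

```python
import numpy as np

rng = np.random.default_rng(2)

def log_post(z):
    """Hypothetical unnormalised log-posterior: standard Gaussian."""
    return -0.5 * z ** 2

def metropolis(n, step=1.0, z0=0.0):
    chain = np.empty(n)
    z, lp = z0, log_post(z0)
    for i in range(n):
        prop = z + rng.normal(0.0, step)      # perturb the last accepted sample
        lp_prop = log_post(prop)
        # Accept with probability min(1, post(prop) / post(z)).
        if np.log(rng.uniform()) < lp_prop - lp:
            z, lp = prop, lp_prop
        chain[i] = z                          # rejected proposals repeat z
    return chain

chain = metropolis(50_000)
```

Because consecutive samples are correlated, the effective sample size is smaller than the chain length; the acceptance rate and autocorrelation are the usual diagnostics guiding the choice of step.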

Note that it takes at least one run of the observation operator for every
proposed point, meaning only inexpensive observation operators can be treated this way. Furthermore
the original algorithm cannot be parallelised because each calculation
requires the previous one. Some related techniques are amenable to
parallel computation

We often describe the solution of the assimilation problem using summary statistics of the posterior PDF. This can be because we assume that the posterior PDF has an analytic form with a few parameters that can describe it (e.g. the mean and standard deviation of the Gaussian) or because we need a numerical representation for later processing or comparison. This nearly universal practice is valuable but requires care. For example, consider a target variable in the form of a time series. We can represent the mode of the posterior PDF as a line graph and the posterior uncertainty as a band around this line. The true time series is a realisation drawn from within this band. Any statements about the temporal standard deviation of the true time series must be based on PDFs generated from these realisations. These estimates will generally yield larger variability than that from our most likely realisation.
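
The point about realisations can be made concrete with a small sketch (invented numbers): the temporal standard deviation computed from posterior realisations generally exceeds that computed from the posterior mode alone:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical posterior for a 100-step time series: mode m and an
# AR(1)-like covariance C chosen purely for illustration.
n = 100
t = np.arange(n)
m = np.sin(2.0 * np.pi * t / n)                      # posterior mode
C = 0.25 * 0.9 ** np.abs(t[:, None] - t[None, :])    # posterior covariance

sd_mode = m.std()                                    # variability of the mode alone

# Variability estimated from realisations drawn within the posterior band.
reals = rng.multivariate_normal(m, C, size=2000)
sd_reals = reals.std(axis=1).mean()
# sd_reals > sd_mode: each realisation adds variability around the mode.
```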

The linear Gaussian case is so common that it deserves explicit treatment, provided the caveat above is borne in mind.

If

The Gaussian form for the inputs allows one important simplification
without approximation. We noted in Sect.

The largest computation in this method is usually the calculation of
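
The analytic solution in the linear Gaussian case can be sketched as follows (notation and numbers invented for illustration: prior mean x_b with covariance B, linear observation operator H, observations y with error covariance R):

```python
import numpy as np

x_b = np.array([1.0, 0.0])              # prior mean
B = np.array([[1.0, 0.0],
              [0.0, 4.0]])              # prior covariance
H = np.array([[1.0, 1.0]])              # observe the sum of the two unknowns
R = np.array([[0.5]])                   # observation error covariance
y = np.array([3.0])

# Kalman gain, posterior mean and posterior covariance, all in closed form.
K = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)
x_a = x_b + K @ (y - H @ x_b)
A = (np.eye(2) - K @ H) @ B
```

The posterior mean moves from the prior towards the observation, and the posterior covariance A is reduced relative to B, most strongly for the less certain unknown.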

Following the introduction of formal Bayesian methods to atmospheric
assimilations by

A common case in physical systems has the state of the system evolving
according to some dynamical model

At any time step

Calculate a probability distribution for the new state of the
system

Use

This yields a posterior PDF

For the simplest case of linear

The original conception of the Kalman filter was for dynamical control
in which the state of the system is continually adjusted in accordance
with arriving observations.
For data assimilation applied to biogeochemistry our motivation is to
hindcast the state of the system and, optionally, infer some
underlying constants

Note that by default, the time series of the maximum a posteriori estimates is not physically consistent; e.g. it does not conserve tracer mass from one assimilation step to another. However, the time series of the full PDFs includes trajectories that are physically consistent.
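
The forecast–analysis cycle above can be sketched for a hypothetical scalar system (all numbers invented): linear dynamics with factor M, model-error variance Q, and direct observations of the state with error variance R:

```python
import numpy as np

rng = np.random.default_rng(4)

M, Q, R = 0.9, 0.05, 0.2     # dynamics, model-error and observation-error terms
x, P = 0.0, 1.0              # initial state estimate and its variance
truth = 2.0                  # synthetic true state

for _ in range(50):
    truth = M * truth + rng.normal(0.0, np.sqrt(Q))   # evolve the true state
    y = truth + rng.normal(0.0, np.sqrt(R))           # simulate an observation

    # Forecast step: propagate the mean and variance through the dynamics.
    x, P = M * x, M * P * M + Q
    # Analysis step: update with the new observation via the Kalman gain.
    K = P / (P + R)
    x, P = x + K * (y - x), (1.0 - K) * P
```

The analysis variance P settles to a steady value below the observation-error variance R, reflecting the accumulated information from the observation series.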

Much of the computational machinery of data assimilation aims to calculate various summary statistics of the posterior PDF. The utility of this approach depends on whether these parameters accurately summarise the posterior PDF. It is best to choose the target summary statistic based on knowledge of the posterior PDF; there is no point estimating two parameters if the posterior distribution is described by only one (e.g. the Poisson distribution). Here we limit ourselves to discussions of the two most common parameters: the maximum a posteriori estimate and the posterior covariance. Solving for these parameters does not itself assume a Gaussian posterior but we often do assume such normality.

The mode of the posterior PDF is the value of the unknowns
that maximises the posterior probability. It is often summarised as
the maximum a posteriori or MAP estimate.
If

We
cannot always write the PDFs this way, e.g. the uniform PDFs in
Fig.

The minimisation of

The quadratic form is sufficiently common to warrant more development.
From Eq. (
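
For the quadratic cost, setting the gradient to zero yields a linear system for the maximum a posteriori estimate. A minimal sketch with invented numbers (prior mean x_b and covariance B, observation operator H, observations y with error covariance R):

```python
import numpy as np

x_b = np.array([0.0, 0.0])
B_inv = np.linalg.inv(np.array([[2.0, 0.0],
                                [0.0, 2.0]]))
H = np.array([[1.0, 0.0]])              # only the first unknown is observed
R_inv = np.linalg.inv(np.array([[1.0]]))
y = np.array([3.0])

# Gradient of J(x) = 0.5 (x-x_b)^T B^-1 (x-x_b) + 0.5 (y-Hx)^T R^-1 (y-Hx)
# vanishes where (B^-1 + H^T R^-1 H) x = B^-1 x_b + H^T R^-1 y.
A = B_inv + H.T @ R_inv @ H             # Hessian of J
b = B_inv @ x_b + H.T @ R_inv @ y
x_map = np.linalg.solve(A, b)
```

The observed unknown is pulled towards the observation while the unobserved one stays at its prior mean; in large problems this direct solve is replaced by iterative minimisation of J.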

Any reasonable summary of a posterior PDF will involve a parameter describing
its spread as well as location. This spread almost always has a
higher dimension than the mode because it must describe the joint
probabilities of parameters. This immediately raises a problem of
storage and computation because even moderate assimilation problems may
include 100 000 unknowns for which representation
of the joint PDF is intractable.
In purely Gaussian problems it can be
shown that the Hessian or second derivative

An alternative approach is to generate an ensemble of modes using
realisations of the prior and data PDFs

The main limitations of the traditional Kalman filter for data
assimilation are its assumption of Gaussian PDFs, the requirement for
linear

Computationally we need to run a dynamical model for each realisation
of the ensemble. This differs from Eq. (
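
A minimal sketch of a single stochastic EnKF forecast–analysis cycle for a scalar state (invented numbers; the forecast covariance is replaced by the ensemble variance and the observation is perturbed for each member):

```python
import numpy as np

rng = np.random.default_rng(5)

n_ens, R = 500, 0.2
M, Q = 0.9, 0.05

# Forecast: run the (here trivial) dynamical model for every member and
# perturb each to represent model error.
ens = rng.normal(0.0, 1.0, n_ens)                 # prior ensemble
ens = M * ens + rng.normal(0.0, np.sqrt(Q), n_ens)

# Analysis: the ensemble variance stands in for the forecast covariance.
y = 1.0
P_f = np.var(ens, ddof=1)
K = P_f / (P_f + R)                               # Kalman gain from the ensemble
y_pert = y + rng.normal(0.0, np.sqrt(R), n_ens)   # perturbed observations
ens = ens + K * (y_pert - ens)

P_a = np.var(ens, ddof=1)                         # analysis spread < forecast spread
```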

Particle filters or sequential Monte Carlo methods relax the
assumptions inherent in the EnKF. Like the EnKF, the particle filter samples the underlying
PDFs with a series of realisations rather than propagating the mean
and variance of the PDF. There are steps in the EnKF that assume
Gaussian probability distributions and the particle filter avoids
these, seeking to fit the true underlying PDF.
We parallel the description of the Kalman
filter algorithm (Sect.

At any time step

Propagate each realisation forward in time
using the dynamical model

Account for the uncertainty in the dynamical model by perturbing

For each

Using some numerical procedure with inputs of

We see that the last three steps mirror the multiplication of PDFs
described in Eq. (
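
The propagate–weight–resample cycle can be sketched for a scalar state with a bootstrap particle filter (invented numbers; the weighting step assumes Gaussian observation error):

```python
import numpy as np

rng = np.random.default_rng(6)

n_part, R = 5000, 0.2
M, Q = 0.9, 0.05

particles = rng.normal(0.0, 1.0, n_part)          # realisations of the prior

# Propagate each realisation through the dynamical model, perturbing it
# to account for model error.
particles = M * particles + rng.normal(0.0, np.sqrt(Q), n_part)

# Weight each particle by the likelihood of the observation y.
y = 1.0
w = np.exp(-0.5 * (y - particles) ** 2 / R)
w /= w.sum()

# Resample: draw a new, equally weighted set with probabilities w.
particles = rng.choice(particles, size=n_part, p=w)
```

No Gaussian assumption is made about the prior or posterior; the price is that many particles receive negligible weight, which is one reason the filter needs far more samples than the EnKF.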

The relaxation of the Gaussian assumption is at once a strength and
weakness of the particle filter. It can handle more complex PDFs than
the EnKF but it also requires many more samples because it samples the
posterior PDF rather blindly while the EnKF is guided by the Kalman
update step into regions of high probability. This difference has
meant that, while the EnKF has been used successfully on some very
large problems (such as numerical weather prediction) the particle
filter has been limited to the same problem sizes as the approaches
described in Sect.

An overview of the
principal ideas of particle filtering and improvements upon the basic
filter is given by

A common feature of biogeochemical systems is a large range of timescales. For example, terrestrial models may include processes ranging from photosynthesis (sub-seconds) to forest succession (centuries). Slow processes often couple weakly to instantaneous observations so they must be constrained by long observational series. This can make the problem computationally intractable. A common solution is to break the problem into shorter assimilation windows. This differs from the case of numerical weather prediction, where the main consideration is the need for a new forecast every few hours and the advantage that the dynamical model can be considered perfect for short enough times.

The choice of assimilation
window for biogeochemical applications is guided by several factors.

Throughout this paper we have adopted a thematic structure. Here we switch views to a narrative description. This is partly to situate the various methods we have described but also to give some signposts for the future. We cannot hope to be exhaustive, nor to trace the rise and fall of each method. Rather, we will identify drivers for these developments and show examples of how they have guided the field. Our intention is to support the rest of the paper rather than provide a comprehensive review.

We can identify four major drivers for the development of
biogeochemical data assimilation:

development of the underlying methods and their percolation into the field,

rise in the scale and complexity of models and data sets,

rise in computational power,

diversifying needs within and beyond science.

One of the first clearly recognisable applications of data
assimilation to biogeochemistry

The addition of ancillary information is frequently termed
regularisation since its practical purpose is to limit implausible
solutions in poorly constrained parts of parameter space. For most
problems, solving inverse problems meant solving a version of
Eq. (

The importation of the explicit Bayesian methods which we have
reviewed here largely supplanted the alternative regularisation
methods, though some studies

Classical analytical methods (Sect.

Once the number of both observations and target variables renders
computation of Jacobians unfeasible, variational methods
(Sect.

Versions of the Kalman filter are some of the natural choices whenever the
dynamical evolution of the model state is explicitly included.
Few atmospheric studies have exploited this capability,
though

Kalman filters were frequently limited by the need to calculate Jacobians for each target variable. This was circumvented by the introduction of the EnKF. Provided one accounts for the problems of ensemble size, the EnKF's combination of efficiency and simplicity has led it to supplant other KF approaches.

The movement towards inclusion of explicit dynamics in biogeochemical
inverse problems has increased the use of EnKF and variational methods
since both handle the nonlinear formulations common in biogeochemical
models. In terrestrial
models,

There have been some attempts to compare the strengths and weaknesses
of the various approaches above.

In previous sections, we have described how many assimilation methods are alternative algorithms for the same underlying problem. Different methods have risen to prominence in response to combinations of the four drivers mentioned above. Discussion of future directions is necessarily speculative but we can suggest how some of the drivers will evolve and their corresponding influence.

Computational constraints are less binding now and likely to remain so. This is true even when we consider the increases in data size and complexity. The increased availability of large numbers of processors, rather than their individual speed, will likely privilege naturally parallel algorithms such as ensemble methods.

The rising complexity and diversity of models will also promote ensemble approaches. The strongest analogy is with climate prediction and seasonal forecasting, where ensembles frequently include different dynamical formulations as well as different parameter settings. As yet there have been few attempts at multi-model ensemble data assimilation in biogeochemistry, with the TransCom project being the highest-profile example. One reason for this is the scarcity of biogeochemical models with data assimilation capability. Another advantage of ensemble methods is a reduced entry barrier.

A more likely brake on the evolution of data assimilation in
biogeochemistry is the quality of the models themselves. A
precondition for the utility of data assimilation is that information
learned in one environment or regime is applicable elsewhere, or that
a model can make coherent inferences from different observations of
the same system. As examples,

Finally, the distinctions we have drawn between methods here represent
a snapshot in time. Methods are likely to be matched and combined
differently in future. At the moment most assimilation studies choose
either external inputs (boundary conditions or parameters) or model
state as target
variables (Sect.

There is a wide range of techniques available for assimilating data into models such as those used in biogeochemistry. Most of these, however, inherit the basic formalism of Bayesian inference in which posterior probabilities are constructed as conjunctions or multiplications of probability distributions describing prior knowledge, measured quantities and models relating the two. The methods differ in the information they assume or adopt on the various components. In general the more restrictive the assumptions the more powerful the statistical apparatus available to analyse the system.

This paper contains some sketches of mathematical development but does not rely on any particular piece of code or data set.

PJR led the writing of the paper. AMM wrote several sections and edited the whole paper. FC wrote several sections, edited the whole paper and was primarily responsible for regularising the notation.

The authors declare that they have no conflict of interest.

This article is part of the special issue “Data assimilation in carbon/biogeochemical cycles: consistent assimilation of multiple data streams (BG/ACP/GMD inter-journal SI)”. It is not associated with a conference.

We acknowledge the support from the International Space Science Institute (ISSI). This publication is an outcome of the ISSI's Working Group on “Carbon Cycle Data Assimilation: How to consistently assimilate multiple data streams”.

This research was supported by the Australian Research Council (grant no. DP1096309).

This paper was edited by Marko Scholze and reviewed by three anonymous referees.