Fitting a line to two measured variables is considered one of the simplest statistical procedures researchers can carry out. However, this simplicity is deceptive as the line-fitting procedure is actually quite a complex problem. Atmospheric measurement data never come without some measurement error. Too often, these errors are neglected when researchers make inferences from their data.

To demonstrate the problem, we simulated datasets with different numbers of
data points and different amounts of error, mimicking the dependence of the atmospheric new particle
formation rate (

Atmospheric measurements always come with some measurement error. Too often, these errors are neglected when researchers make inferences based on their data. Describing the relationship between two variables typically involves making deductions in a more general context than that in which the variables were directly studied. If the relationship is not defined correctly, the inference is also not valid. In some cases, the bias in the analytical method is even given a physical meaning.

When analysing the dependencies of two or more measured variables, regression models are usually applied. Regression models can be linear or non-linear, depending on the relationship between the datasets that are analysed. Standard regression models assume that the independent variables of the model have been measured without error and that the model only accounts for errors in the dependent variables or responses. In cases where the measurements of the predictors contain error, estimating using standard methods (usually ordinary least squares, OLS) does not tend to provide the true parameter values, even when a very high number of data points is used. In linear models, the coefficients are underestimated (e.g. Carroll et al., 2006); however, in non-linear models, the bias is likely to be more complicated (e.g. Schennach, 2004). If predictor variables in regression analyses contain any measurement error, methods that account for errors should be applied – particularly when errors are large. Thus, test variables in this study were chosen such that they included significant uncertainties in both the independent and dependent variables.
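The underestimation of linear coefficients described above can be illustrated with a small simulation. The following is a minimal sketch, not the code used in this study: the function names, the true slope of 2, and the error standard deviations are hypothetical values chosen only for illustration, and only the Python standard library is used. With classical additive error in the predictor, the OLS slope is expected to shrink by roughly the ratio of the true predictor variance to the observed predictor variance.

```python
import random


def ols_slope(x, y):
    """Ordinary least squares slope of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


def simulate_attenuation(n=20000, true_slope=2.0, sd_x_true=1.0,
                         sd_x_err=1.0, sd_y_err=0.5, seed=1):
    """Compare OLS slopes with and without error in the predictor.

    All parameter values are illustrative assumptions, not values
    taken from the study.
    """
    rng = random.Random(seed)
    x_true = [rng.gauss(0.0, sd_x_true) for _ in range(n)]
    # The response contains error in both cases.
    y_obs = [1.0 + true_slope * xt + rng.gauss(0.0, sd_y_err) for xt in x_true]
    # The observed predictor is the true predictor plus measurement error.
    x_obs = [xt + rng.gauss(0.0, sd_x_err) for xt in x_true]

    slope_clean = ols_slope(x_true, y_obs)   # predictor measured without error
    slope_noisy = ols_slope(x_obs, y_obs)    # predictor measured with error
    # Expected attenuated slope: true slope times the variance ratio.
    ratio = sd_x_true ** 2 / (sd_x_true ** 2 + sd_x_err ** 2)
    return slope_clean, slope_noisy, true_slope * ratio


if __name__ == "__main__":
    clean, noisy, expected = simulate_attenuation()
    print(f"OLS slope, error-free predictor: {clean:.3f}")
    print(f"OLS slope, noisy predictor:      {noisy:.3f}")
    print(f"Expected attenuated slope:       {expected:.3f}")
```

Increasing `n` in this sketch does not bring the slope fitted to the noisy predictor any closer to the true slope, which is the point made above about large numbers of data points.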

The sulfuric acid concentration (

Some published results have shown discrepancies in the expected

Multiple attempts have been made to present methods that account for errors in the predictor variables for regression-type analyses, going back to Deming (1943). However, the traditional least squares fitting method is still the de facto line-fitting method due to its simplicity and common availability in frequently used software. In atmospheric sciences, Cantrell (2008) drew attention to the method introduced by York (1966) and York et al. (2004) and listed multiple other methodological papers utilising similar methodology. Pitkänen et al. (2016) raised awareness in the remote-sensing community of the fact that errors in the predictor variables are often not accounted for; this study partly follows their approach and introduces multiple methods to account for the errors in predictors. Cheng and Riu (2006) studied methods involving heteroscedastic errors, whereas Wu and Yu (2018) approached the problem with measurement errors via weighted regression and applied some techniques that are also used in our study.

Measurement errors in each variable must be taken into account using
approaches known as errors-in-variables (EIV) regression. EIV methods simply
mean that errors in both variables are accounted for. In this study, we
compared OLS regression results to six different regression methods (Deming
regression, principal component analysis regression, orthogonal regression,
Bayesian EIV regression and two different bivariate regression methods) that are known to be able to take errors in variables into account and provide (at
least asymptotically) unbiased estimates. In this study, we focus exclusively on linear EIV methods, but it is important to acknowledge that
non-linear methods also exist, e.g. ODRPACK introduced in
Boggs et al. (1987) and
implemented in Python SciPy and

Measurement data contain different types of errors. Usually, the errors are divided into two main classes: systematic error and random error.

Systematic errors,
commonly referred to as bias, in experimental observations usually come from
the measurement instruments. They may occur because something is wrong
with the instrument or its data handling system, or due to operator error. In line fitting, bias cannot be taken into
account and needs to be minimised by way of careful and regular instrument
calibrations and zeros or data preprocessing. The random error, in comparison, may
have different components; the two components discussed here are the natural error
and the measurement error. In addition, one should note the existence of
equation error (discussed in Carroll and Ruppert, 1996), which refers to using an inappropriate form of a fitting equation. Measurement error is more generally understood; it is where measured values
do not fully represent the true values of the variable being measured. This
also contains sampling error (e.g. in the case of

In the analysis of the
measurement data, some of these errors are known or can be estimated,
but some of the error will usually remain unknown; this should be kept in mind
when interpreting fits. Even though the measurement error is taken into
account, the regression fit may be biased due to unknown natural error. In
this study, we assume that the errors of the different variables are
uncorrelated, but in some cases this has to be accounted for, as noted, for example, in Trefall and Nordö (1959) and
Mandel (1984). The correlation between the errors of two
variables, measured with separate instruments, independent of each other,
such as formation rate and

The data used in this study consist of simulated new particle formation
rates at 1.7 nm (

We carried out fits for the linear dependency of the logarithms of the two study
variables, such that the equation for the fit was given by

OLS minimises the squared distance of the
observation and the fit line either in the

In measured data, the variables that are observed are not

We simulated a dataset mimicking the new particle formation rates (

We generated 1000 random noise-free

Simulated observations of the noise-free

These choices regarding generating simulated data reflect what real datasets can often be like: the bulk of the data approximates a log-normal distribution with one of the tails possibly being thinned or cut close to a limit of detection of an instrument or close to a limit of the data filtering criterion. In our simulated data, each negative observation and each negative noise-free value was replaced with a new random simulated value, which only slightly offsets the final distribution from a perfectly symmetric log-normal shape.

Simulating the observations tends to generate infrequent extreme outlier observations from the infinite tails of the normal distribution. We discarded outliers with an absolute error larger than 3 times the combined standard uncertainty of the observation in order to remove the effect of outliers from the regression analysis. This represents the quality control procedure in data analysis and also improves the stability of our results between different simulations.
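The replacement of negative values and the 3-times-uncertainty quality control described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the simulation code of this study: the combined standard uncertainty is assumed to be the quadrature sum of an absolute and a relative component, the noise-free values are drawn directly from a log-normal distribution, and all function names and parameter values are hypothetical.

```python
import math
import random


def combined_uncertainty(value, sd_abs, sd_rel):
    """Assumed form: absolute and relative parts added in quadrature."""
    return math.sqrt(sd_abs ** 2 + (sd_rel * value) ** 2)


def simulate_observations(n, sd_abs, sd_rel, mu_log=0.0, sd_log=1.0, seed=0):
    """Draw log-normal noise-free values and noisy, strictly positive observations."""
    rng = random.Random(seed)
    truths, observations = [], []
    for _ in range(n):
        truth = rng.lognormvariate(mu_log, sd_log)  # noise-free value, always > 0
        sd = combined_uncertainty(truth, sd_abs, sd_rel)
        obs = truth + rng.gauss(0.0, sd)
        # Replace a non-positive observation with a new random draw.
        while obs <= 0.0:
            obs = truth + rng.gauss(0.0, sd)
        truths.append(truth)
        observations.append(obs)
    return truths, observations


def discard_outliers(truths, observations, sd_abs, sd_rel, k=3.0):
    """Keep pairs whose absolute error is at most k combined standard uncertainties."""
    return [
        (t, o)
        for t, o in zip(truths, observations)
        if abs(o - t) <= k * combined_uncertainty(t, sd_abs, sd_rel)
    ]


if __name__ == "__main__":
    truths, obs = simulate_observations(1000, sd_abs=0.1, sd_rel=0.3, seed=42)
    kept = discard_outliers(truths, obs, sd_abs=0.1, sd_rel=0.3)
    print(f"kept {len(kept)} of {len(truths)} simulated observations")
```

Because the filter compares each observation with its noise-free value, it is only possible for simulated data; with real measurements the true error is unknown and a different quality control criterion would be needed.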

In order to show that the results gained with simulated data are also applicable
to real measurement data, we applied our methods to data measured in the
CLOUD chamber and published by Dunne et al. (2016). Fig. 1 in Dunne et al. (2016) shows nucleation rates (

Regression lines fitted to the simulated data comparing all of the respective methods. The whiskers refer to the measurement error used for simulation.

Differences between the regression methods are illustrated in four different ways: firstly, by showing line fits on a scatterplot of simulated data; secondly, by illustrating how the slopes change when the uncertainty in the measured variables increases; thirdly, by showing the sensitivity of the fits to the number of observations; and finally, by showing how the fits are affected by adding outliers to the data. Regression fits using each of the respective methods are shown in Fig. 1.
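To make the contrast between OLS and an errors-in-variables fit concrete, the sketch below compares the OLS slope with the closed-form Deming regression slope on generic simulated data. It is a minimal illustration and not the implementation used for the figures: the function names are hypothetical, the data are simulated on a plain linear scale with an assumed true slope of 2, and `delta` denotes the assumed known ratio of the error variance in the response to the error variance in the predictor.

```python
import math
import random


def _moments(x, y):
    """Sample means, variances and covariance of x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x) / (n - 1)
    syy = sum((yi - mean_y) ** 2 for yi in y) / (n - 1)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
    return mean_x, mean_y, sxx, syy, sxy


def ols_fit(x, y):
    """Ordinary least squares: returns (intercept, slope)."""
    mean_x, mean_y, sxx, _, sxy = _moments(x, y)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def deming_fit(x, y, delta=1.0):
    """Closed-form Deming regression: returns (intercept, slope).

    delta is the ratio var(error in y) / var(error in x), assumed known.
    Requires a non-zero sample covariance between x and y.
    """
    mean_x, mean_y, sxx, syy, sxy = _moments(x, y)
    diff = syy - delta * sxx
    slope = (diff + math.sqrt(diff ** 2 + 4.0 * delta * sxy ** 2)) / (2.0 * sxy)
    return mean_y - slope * mean_x, slope


def compare_slopes(n=20000, true_slope=2.0, sd_err=0.5, seed=7):
    """Simulate equal errors in both variables and return (OLS, Deming) slopes."""
    rng = random.Random(seed)
    x_true = [rng.gauss(0.0, 1.0) for _ in range(n)]
    x_obs = [xt + rng.gauss(0.0, sd_err) for xt in x_true]
    y_obs = [true_slope * xt + rng.gauss(0.0, sd_err) for xt in x_true]
    return ols_fit(x_obs, y_obs)[1], deming_fit(x_obs, y_obs, delta=1.0)[1]


if __name__ == "__main__":
    ols, deming = compare_slopes()
    print(f"OLS slope: {ols:.3f}, Deming slope: {deming:.3f}")
```

With `delta=1.0` the Deming slope coincides with the orthogonal regression slope. The result depends on `delta` being specified correctly, so a poorly known error variance ratio can reintroduce bias.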

As we know that the noise-free slope

The sensitivity of the methods was first tested by varying the uncertainty
in the

The uncertainties used in the simulation for the sensitivity test for increasing uncertainty.

Sensitivity test for increasing uncertainty in simulated data. Black markers show the initial dataset described in Sect. 3. The dashed line indicates the noise-free slope.

The sensitivity of the methods to the decreasing

Effect of sample size on the uncertainty of different fits. Lines show the median and shading illustrates the 1 standard deviation range of slope estimates for 40 repeated random samples. The dashed line indicates the noise-free slope.
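The repeated-sampling summary described in this caption (median and 1 standard deviation of the slope over 40 random samples) can be sketched as follows. This is a hedged illustration rather than the analysis code: the function names are hypothetical, sampling is assumed to be without replacement from the full dataset, and OLS is used as the default fitting function purely as a placeholder for any of the compared methods.

```python
import random
import statistics


def ols_slope(x, y):
    """Ordinary least squares slope of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


def slope_spread(x, y, sample_size, repeats=40, fit_slope=ols_slope, seed=0):
    """Median and standard deviation of slopes over repeated random subsamples.

    fit_slope is any callable taking (x, y) and returning a slope, so the
    same summary can be computed for other regression methods.
    """
    rng = random.Random(seed)
    slopes = []
    for _ in range(repeats):
        # Draw a random subsample of indices without replacement.
        idx = rng.sample(range(len(x)), sample_size)
        slopes.append(fit_slope([x[i] for i in idx], [y[i] for i in idx]))
    return statistics.median(slopes), statistics.stdev(slopes)


if __name__ == "__main__":
    rng = random.Random(3)
    x = [rng.gauss(0.0, 1.0) for _ in range(2000)]
    y = [2.0 * xi + rng.gauss(0.0, 1.0) for xi in x]
    for size in (10, 50, 500):
        med, sd = slope_spread(x, y, size)
        print(f"n={size:4d}  median slope={med:.3f}  sd={sd:.3f}")
```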

The sensitivity of the predictor variable

We also tested the methods on data simulated with an alternative method. The main difference compared with our method was that the
distribution of noise-free

Effect of outliers in the data.

Regression lines fitted to data from Dunne et al. (2016), in a similar fashion to Fig. 1. In

Figure 5 shows the fits of the data from Dunne et al. (2016). As expected,
the fit using OLS is underestimated at both temperatures:

Ordinary least squares regression can be used to answer some simple
questions regarding data, such as “How is

As a test study, we simulated a dataset mimicking the dependence of the
atmospheric new particle formation rate on the sulfuric acid concentration. We
introduced three major sources of uncertainty when establishing inference from
scatterplot data: the increasing measurement error, the number of data points and
the number of outliers. In Fig. 1, we showed that for simulations where
errors are taken from real measurements of

From this, we can recommend that if the uncertainty in the predictor variable is known, York bivariate, or another method able to use known variances,
should be applied. If the errors are not known, and they are estimated from
data, BLS and ODR were found to be the most robust in cases with increasing
uncertainty (relative error,

We also carried out a case study on data measured in the CLOUD chamber and published by
Dunne et al. (2016). In these analyses, we saw
that our above-mentioned recommended methods also performed best for these data. Our tests indicated that the slope

Python code for running the methods can be found on GitHub:

Simulated datasets used in the example analysis are given in the Supplement.

In this appendix, we introduce the minimising criteria (

OLS minimises the sum of squared vertical distances (residuals)
between each point and the fitted line. OLS regression minimises the
following criterion:

ODR (

DR is the ML (maximum likelihood) solution of Eq. 1 when

The DR minimises the criterion

BLS is a generic name but here we refer to the formulation
described in Francq and Govaerts (2014) and references therein. BLS takes
errors and heteroscedasticity in both axes into account and is usually written in matrix notation. BLS minimises the criterion

A second bivariate regression method that was used in this study is an
implementation of the regression method described by York et al. (2004,
Section III). The minimisation criterion is described in York (1968):

PCA can be applied for bivariate and multivariate cases.

For one independent and one dependent variable, the regression line is

This is a standard technique that gives the regression coefficients

Bayes EIV regression applies Bayesian inference using the
popular Stan software tool (

The Stan tool solved regression problems using 1000 iterations, and it
provided a posteriori distributions for the model parameters

The supplement related to this article is available online at:

SM prepared the paper with contributions from all co-authors. SM, MRAP and SI performed the formal analysis. MRAP simulated the data. SM, AA and KEJL formulated the original idea. SM, MRAP and AL developed and implemented the methodology. SM, MRAP, TN and AL were responsible for the investigation and validation of the data and methods.

The authors declare that they have no conflict of interest.

This research has been supported by the Nessling Foundation and the Academy of Finland (grant no. 307331).

This paper was edited by Fangqun Yu and reviewed by three anonymous referees.