Technical note : Effects of Uncertainties and Number of Data points on Inference from Data-a Case Study on New Particle Formation

Fitting a line on a scatterplot of two measured variables is considered as one of the simplest statistical procedures researchers can do. However, this simplicity is deceptive as the line fitting procedure is actually quite a complex problem. 10 Atmospheric measurement data never comes without some measurement error. Too often, these errors are neglected when researchers are making inferences from their data. To demonstrate the problem, we simulated datasets with different amounts of data and error, mimicking the dependence of atmospheric new particle formation rate (J1.7) on sulphuric acid concentration (H2SO4). Both variables have substantial measurement error and thus they are good test variables for our study. We show that ordinary least squares (OLS) regression 15 results in strongly biased slope values compared with six error-in-variables (EIV) regression methods (Deming, Principal component analysis, orthogonal, Bayesian EIV, and two different bivariate regression methods) known to take into account errors in the variables.


Introduction
Atmospheric measurement data never comes without some measurement error.Too often, these errors are neglected when researchers are making inferences based on their data.Describing the relationship between two variables typically involves making inferences in some more general context than was directly studied and if the relationship is ill formulated, the inference is not valid either.In some cases, the bias in analysis method is even given a physical meaning When analysing dependencies of two or more measured variables, regression models are usually applied.A regression model can be linear or nonlinear, depending on the data.Standard regression models assume that the independent variables of the models have been measured without error and the models account only for errors in the dependent variables or responses.In cases where the measurements of the predictors contain error, estimating with standard methods, usually Ordinary Least Squares (OLS), do not tend to the true parameter values, not even asymptotically.In linear models, the coefficients are underestimated but in nonlinear models, the bias is likely to be more complicated (e.g.Schennach 2004).Measurement error needs to be taken into account especially when dealing with parameters with large errors.Thus, we chose such parameters as our test variables in this study.Sulphuric acid (H2SO4) is known to affect strongly the formation rates (J) of aerosol particles (Kirkby et al., 2016;Kuang et al., 2008;Kulmala et al., 2006;Kürten et al., 2016;Metzger et al., 2010;Riccobono et al., 2014;Riipinen et al., 2007;Sihto et al., 2006;Spracklen et al., 2006).The relationship between J and H2SO4 is typically written as log10(J) = β*log10(H2SO4)+α (Seinfeld and Pandis, 2016).In addition, parameterizations based on the results from these fits have been implemented in global models, e.g. in (Dunne et al., 2016;Metzger et al., 2010;Spracklen et al., 2006), to estimate the effect of new particle formation on the global aerosol.Theoretically, the slope of this relationship is related to the amount of sulphuric acid molecules in the nucleating critical cluster in homogeneous nucleation, based on the first nucleation theorem (Vehkamäki, 2006).. Some published results already show discrepancies regarding the J vs H2SO4 dependence.Kuang et al. (2008) used the unconstrained least squares method and obtained β=1.99 for the slope whereas Sihto et al. (2006) ended up with β=1.16 by using OLS from the same measurement campaign when analysing data from Hyytiälä in 2003.They had some differences in pre-treatment of data and used different time window but a notable proportion of this inconsistency is very likely due to different methods for making the fit.The problem in the relationship of H2SO4 and J has been acknowledged already in Paasonen et al. (2010) who noted that bivariate fitting method like presented in York et al. (2004) should be applied but could not be used due to the lack of proper error estimates for each quantity.They were not aware of the methods, which do not need to know the errors beforehand, but are using the estimated variances.We will here introduce appropriate tools for that.
Multiple attempts have been made to introduce methods using errors in predictor variables for the scientists applying regression-type analysis for their data; starting from Deming (1943).However, the traditional lest squares fitting still holds the position as the de facto line fitting method due to its simplicity.In atmospheric sciences, Cantrell (2008) drew attention to the method introduced by York (1966) and York et al. (2004) and listed multiple other methodological papers introducing similar methodology.Pitkänen et al. (2016) raised the problem into knowledge in remote sensing community and this study partly follows them and introduces multiple methods to take account the errors in predictors.Cheng and Riu (2006) studied methods with heteroscedastic errors whereas Wu and Yu (2018) approached the problem with measurement errors via weighted regression and applied some methods also used in our study.
Measurement errors in each variable of the model must be taken into account in the regression analysis by applying some errors-in-variables (EIV) regression.In this study, we compared OLS regression results to six different regression methods (Deming regression, Principal component analysis regression, orthogonal regression, Bayesian EIV regression and two different bivariate regression methods) known to be able to take into account errors in variables and provide (at least asymptotically) unbiased estimates.In this study, we will focus only on linear EIV methods but it is important to acknowledge that there also exist nonlinear methods described e.g.ORDPACK introduced in Boggs, Byrd, and Schnabel (1987) and implemented in Python SciPy and R (Boggs et al., 1989;Spiess, 2015).ORDPACK is somewhat improved version of orthogonal regression, as it minimizes the Mahalanobis distance from the data points to the regression line, instead of minimizing the sum of squares of the perpendicular distances, so that arbitrary covariance structures are acceptable and is specifically set up so that a user can specify measurement error variances and covariance point by point.

2
Materials and Methods

Data illustrating the phenomenon
In line fitting, data usually contain two types of error: natural error and measurement error.Measurement error is more generally understood, it is where measured values do not fully represent the true values of variable being measured.This also contains sampling error, e.g. in the case of H2SO4 measurement the sampled air in the measurement instrument is not representative sample of outside air.Natural error is that the true connection between the two variables is varying by some natural or physical cause e.g.certain amount of H2SO4 does not cause same number of new particles formed.In the analysis of measurement data, some amount of these errors are known or can be estimated, but some of it will usually remain unknown, which should be kept in mind when interpreting data.Even though the measurement error is taken into account, the regression fit may be biased due to natural error (Carroll and Ruppert, 1996).
The data used in this study consist of new particle formation rate at 1.7 nanometre size (J1.7) and sulphuric acid (H2SO4) concentration simulated to mimic observations of pure sulphuric acid nucleation from CLOUD chamber in CERN (Kürten et al. 2016; https://home.cern/about/experiments/cloud)with matching expected value, variance and covariance structure.The chamber data at CERN is the best characterized and controlled set of new particle formation (NPF) experiments in the history of aerosol science so far.The Proton Synchrotron provides an artificial source of "cosmic rays" that simulates natural conditions of ionization between ground level and the stratosphere.The core is a large (volume 26m3) electro-polished stainless steel chamber with precise temperature control at any tropospheric temperature, precise delivery of selected gases and vapours and ultrapure humidified synthetic air.Existing data includes the most suspected candidates for atmospheric NPF, including sulphuric acidammoniawater (Kirkby et al., 2011), sulphuric acidamine (Almeida et al., 2013) and ion induced organic nucleation (Kirkby et al., 2016).The actual nucleation of new particles occurs at slightly smaller size.After formation, they grow by condensation to reach the detection limit (1.7 nm) of the instrument and J1.7 thus refers to the formation rate of particles as the instrument detects them.These variables were chosen because they are both known to have considerable measurement errors and their relationship is frequently under inference (Kirkby et al., 2016;Kürten et al., 2016;Riccobono et al., 2014;Tröstl et al., 2016) which makes them good illustrative variables for this study.

Regression methods
We made fits for the linear dependency of logarithms of the two study variables, such that the equation for the fit was given by where y represents log10(J1.7),x is log10(H2SO4), β:s are the coefficients estimated from the data and ε is the error term.In order to demonstrate the importance of taking into account the measurement errors in the regression analysis, we tested seven Bayes EIV the error can be approximated with a distribution.DR and BLS can be applied with both, errors given by the user and measurement variance based errors.In this study, we applied measurement variance based errors for them.

Data simulation
In measured data the variables that are observed are not X and Y, but (X+ex) and (Y+ey), where ex and ey are the uncertainty in the measurements and the true X and Y cannot be exactly known.Thus, we chose to use simulated data, where we know the true X and Y, to illustrate how the different line fitting methods perform in different situations.
We simulated a dataset mimicking new particle formation rate (J1.7) and sulphuric acid concentration (H2SO4) reported from CLOUD-chamber measurements in CERN.Both variables are known to have substantial measurement error and thus they are good test variables for our study.Additionally, the relationship of logarithms of these variables is quite often described with linear OLS regression and thus the inference may be flawed.
Simulating the observations tends to generate infrequent extreme outlier observations from the infinite tails of the normal distribution.We discarded these outliers with an error larger than three times the combined standard uncertainty of the observation in order to remove the effect of outliers from the regression analysis.This signifies the quality control procedure in data analysis and it also improved the stability of our results between different simulations.Regression fits with all methods in use are shown in Figure 1.As we know that the "true slope" βtrue=3.30we can easily see how the methods perform.The worst performing method was OLS, with βols=1.55,which is roughly half of the βtrue.The best performing methods with equal accuracy were ODR (βODR=3.27),Bayes EIV (βBEIV=3.24)and BLS (βBLS=3.22)whereas York (βYork=3.15),Deming (βDR=2.95) and PCA (βPCA=2.92)slightly underestimated the slope.
The sensitivity of the methods was first tested by varying the uncertainty in H2SO4 observations.We simulated six datasets with 1000 observations and with variating absolute and relative errors, listed in Table 1, and made fits with each method on all datasets separately.The performance of the methods is shown in Figure 2, with the results corresponding to Figure 1 are marked with black colour.It shows that when the uncertainty is smaller, the bias in OLS fit is smaller but the bias increases significantly when more uncertainty is added to data.Decrease in performance can also be seen with ODR, which is overestimating the slope, and PCA, DR and Bayes EIV, which all underestimate the slope.Bivariate methods, BLS and York, seem to be quite robust for increasing uncertainty, as the slopes are not changing considerably.
The sensitivity of methods on decreasing number of observations was tested by picking 100 random samples from the 1000 simulation dataset with n of 3,5,10,20,30,50,70,100,300 and 500 and making fits for all samples with all methods.The average slopes and their standard errors are shown in Figure 3.It is clear that when the number of observations is 10 or less, the variation in estimated slopes is considerably high.When n≥20 the average slopes stabilize close to their characteristic level, except for Bayes EIV, which needs more than 100 observations for that.The most sensitive methods for small n are Bayes EIV, ODR and PCA and thus they should not be applied for data with small n.
The sensitivity for outliers in predictor variable H2SO4 was tested with two different scenarios.First, the outliers were let to be randomly either high or low numbers.In the second scenario, outliers were allowed to be only high numbers, which is often the case in H2SO4 and aerosol concentration measurements as the lowest numbers are cleaned out from the data when they are smaller than the detection limit of the measurement instrument.Five cases with n=1000 were simulated with increasing number of outliers (0, 5, 10, 20, 100) and 10 repetitions of H2SO4 values with different set of outliers.Outliers were defined such that xobs-xtrue>3*combined standard uncertainty.The most sensitive methods for outliers in both scenarios were OLS and Bayes EIV.High number of outliers caused underestimation to PCA and DR, especially in high outlier case, and slight overestimation to BLS in random outlier case.York Bivariate and ODR were not affected at all in either case and BLS had only small variation between the 10 replicates in the estimated slope.We did not test how big number of outliers would break all of the methods as it might not be meaningful to interpret anymore data with more than 10% of outliers.

Conclusions
Simple linear regression can be used to answer some common questions, such as is Y related to X but if we are interested on the strength of the relationship then error-in-variance methods should be applied.There is no single correct method to make the fit, because the methods measure slightly different things about the data.The choice of method has to base on the properties of data and the specific research question.There are usually two types of error in the data: natural and measurement error.
Even if the natural error in the data is not known, taking into account the measurement error improves the fit significantly.In addition, no matter how small the measurement error would be, it should be taken account because taking it into account will never lead to more biased estimator.
As a case study, we simulated a dataset mimicking the dependence of atmospheric new particle formation rate on sulphuric acid concentration.We introduced three major sources of uncertainty when doing inference from scatterplot data: increasing measurement error, number of data points and number of outliers.In Fig Atmos.Chem.Phys.Discuss., https://doi.org/10.5194/acp-2018-1125Manuscript under review for journal Atmos.Chem.Phys.Discussion started: 11 December 2018 c Author(s) 2018.CC BY 4.0 License.
Atmos.Chem.Phys.Discuss., https://doi.org/10.5194/acp-2018-1125Manuscript under review for journal Atmos.Chem.Phys.Discussion started: 11 December 2018 c Author(s) 2018.CC BY 4.0 License.differentline-fitting methods.Ordinary Least Squares (OLS), not taking account the uncertainty in x-variable, and Deming regression (DR,Deming, 1943), Principal component analysis (PCA,Hotelling, 1957) regression, orthogonal regression (ODR,Boggs, Byrd, and Schnabel 1987), Bayesian EIV regression(Kaipio and Somersalo, 2005) and two different bivariate least squares methods byYork et al., (2004), and Francq and Govaerts (BLS, 2014), known to be able to take account errors in variables and provide (at least asymptotically) unbiased estimates.The differences between the methods comes from the criterion they minimize when calculating the coefficients and how they take account the measurement errors.The minimizing criteria for all methods are given in the supplement S1.Two of the methods, PCA and ODR, account only for the measurement variance, whereas Bayes EIV and York bivariate regression require known estimates for measurement errors.Though for Atmos.Chem.Phys.Discuss., https://doi.org/10.5194/acp-2018-1125Manuscript under review for journal Atmos.Chem.Phys.Discussion started: 11 December 2018 c Author(s) 2018.CC BY 4.0 License. 4 Results Differences between the regression methods are illustrated with four different ways.First, by showing line fits on scatterplot of simulated data.Secondly, illustrating how the slopes change when the uncertainty in the measured variables increase, thirdly by showing the sensitivity of the fits on number of observations and finally showing how the fits are affected by adding outliers in the data.

Figure 1 .
Figure 1.Regression lines fitted to the simulated data with all methods in comparison.Whiskers in data points refer to the measurement error used for simulation

Figure 3 .
Figure 3.Effect of sample size on the uncertainty of different fits.Lines show the median and shading illustrates one standard deviation range of slope estimates for 40 repeated random samples.Dashed line indicates the "true slope".

Figure 4 .
Figure 4. Effect of outliers in the data.Random outliers case on left panel and only high positives on right panel.Lines show the median and shading shows one standard deviation of slope estimates in ten repeated studies.Dashed line indicates the "true slope".
1, we showed that in case of errors from real measurements of J1.7 and H2SO4 four of the methods gave slopes close to "true" known value: BLS, York bivariate, Bayes EIV and ODR.Estimates from BLS and York bivariate remained stable even when the uncertainty in simulated H2SO4 was increased drastically in Fig 2.The main message to learn in Fig3is that with small numbers of observations all fit methods are highly uncertain.BLS showed out to be the most accurate with smallest sample sizes of 10 and less, ODR stabilized with 20 observations and York bivariate and Bayes EIV needed 100 or more data points to become accurate.After that, they approach the true value asymptotically, while the OLS slope, in contrast, converges towards an incorrect value.With the increasing number of outliers in Fig4ODR and York bivariate showed out to be the most stable ones, even with 10% of observations classified as outliers in both test cases.BLS remained stable in the case with only high outliers.Bayes EIV was the most sensitive to outliers with OLS.From this, we can give a recommendation that if the uncertainty in predictor is known, York bivariate, or other method able to use known variances, should be applied.If the errors are not known, and they are estimated from data, BLS and ODR showed out to be the most robust in cases of increasing uncertainty and with high number of outliers.If the number of observations is less than 10, and the uncertainties are high, we recommend considering twice if a regression fit is appropriate at all.However, with our simulation tests BLS showed out to be the most robust with small data.Bayes EIV has significant advantages if the number of observations is high enough and there are not too many outliers, as it is able to estimate the errors in data with distributions.