Atmospheric new-particle formation (NPF) is a highly non-linear process that involves the atmospheric chemistry of precursors, clustering physics and subsequent growth before NPF can be observed. Thanks to ongoing efforts, a tremendous amount of atmospheric data now exists, obtained through continuous measurements directly from the atmosphere. This volume of data makes manual analysis difficult but, on the other hand, enables the use of modern data science techniques. Here, we calculate and explore the mutual information (MI) between observed NPF events (measured at Hyytiälä, Finland) and a wide variety of simultaneously monitored ambient variables: trace gas and aerosol particle concentrations, meteorology, radiation and a few derived quantities. The purpose of the investigation is to identify key factors contributing to NPF. The applied mutual information method finds that the formation events are strongly linked to sulfuric acid concentration and water content, ultraviolet radiation, condensation sink (CS) and temperature. These quantities have previously been well established as important players in the phenomenon through dedicated field, laboratory and theoretical research. The novelty of this work is to demonstrate that the same results are now obtained by a data analysis method that operates without supervision and without requiring a deep understanding of the physics. This suggests that the method is suitable for wide use in the atmospheric field to discover other interesting phenomena and their relevant variables.

New-particle formation (NPF) is an important source of aerosol particles and
cloud condensation nuclei (CCN) in a vast number of atmospheric
environments ranging from remote continental areas to heavily polluted urban
centres

Based on data from field measurements, several studies have investigated the
relations between NPF and meteorological conditions

Due to practical limitations, measurement campaigns typically last
from weeks to months and often have a dedicated focus. On the one hand, such
an approach enables a very detailed inspection within a narrow scope, but,
on the other hand, there is a risk of overlooking important processes that fall
outside the chosen, predetermined scope. One way to circumvent this issue is
to have long-term continuous measurements of a wide variety of atmospheric
variables. Continuous observations are receiving increasing attention,
as described by

Previously,

In order to understand the effects of atmospheric variables on NPF in
Hyytiälä, Finland, a comprehensive study was carried out by

The data mining approaches used previously are mostly based on classification methods. Although these methods seem to be suitable tools for finding correlations between variables in complex systems, the implementations used may not always be effective for this case. The first reason concerns the features used, such as means and standard deviations. This practice compresses the measurement data into a single quantity for each day, which may lead to a loss of information. Secondly, the implementation procedure is computationally expensive: it requires exploring all possible models and variable combinations to find the best pairs, and the models need to be run multiple times to ensure their stability.

To overcome the above-mentioned issues, we propose here an alternative method – based on information theory – to be used in atmospheric data analysis. Mutual information (MI), one of the many information quantities, measures the amount of information that can be obtained about one random variable by observing another one. In this paper, MI is first introduced and then used to find the maximal amount of shared information between atmospheric variables and NPF. In other words, the goal is to find the most relevant atmospheric variables in relation to NPF events using a data-driven information theoretic method based on the data set measured at Hyytiälä, Finland.

In this study, we utilize the data measured during the years 1996–2014 at
the Station for Measuring Forest Ecosystem-Atmosphere Relations (SMEAR) II
in Hyytiälä, Finland, operated by the University of Helsinki

The SMEAR II station is located at the Hyytiälä forestry field station in
southern Finland (61

The SMEAR II station was established for multidisciplinary research,
including atmospheric sciences, soil chemistry and forest ecology. The
station consists of a measurement building, a 72 m high mast, a 15 m tall
tower and two mini-watersheds. It is equipped with extensive research
facilities for measuring the concentrations of various gases, various fluxes,
meteorological parameters (e.g. temperature, wind speed and direction,
relative humidity), solar and terrestrial radiation (e.g. ultraviolet radiation),
and atmospheric aerosols (e.g. particle size distribution). Measurements
of forest ecophysiology and productivity, such as photochemical reflectance,
and of soil and water balance also take place there. A
detailed description of the continuous measurements performed at this station
can be found in

In this study, we used four types of continuous measurement data: gas
concentrations, meteorological conditions, radiation variables and aerosol
particle concentrations. The gases include nitrogen monoxide (

Table

Names of the atmospheric variables used in this study and the symbols displayed in the results.

In addition to the directly measured variables, a few derived variables are
included in this study. The aerosol particle condensation sink determines
how rapidly molecules and small particles condense onto preexisting aerosol
particles, and it is strongly related to the shape of the size distribution

Sulfuric acid (

Examples of non-event and event days at Hyytiälä, Finland,
in May 2005. A non-event day

Finally, it is essential to have a database of aerosol particle formation
days: without such a database, the correlation analysis between NPF and
atmospheric variables cannot be performed. We used a database covering the years
1996–2014, generated by atmospheric scientists at the University of Helsinki.
The database has been created by visual inspection of the continuously
measured aerosol size distributions over a size range of 3–1000 nm at
the SMEAR II Hyytiälä forest

Before the raw data can be fed into an analysis model, they need to be preprocessed; the steps are outlined below. After that, the mutual information method is introduced.

The (raw) data used in this paper range from 1 January 1996 to
31 December 2014, totalling 18 years. The first step in preprocessing is to
exclude the undefined days, as the focus is to find the correlation between
aerosol particle formation days and atmospheric variables. In order to reduce
the amount of irrelevant data, we then eliminate nighttime data points from
all atmospheric variables. When atmospheric photochemistry is most
intense (during the daytime), the strongest and longest-lasting
atmospheric NPF events are typically observed
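The two filtering steps described above can be sketched as follows. This is a minimal illustration, not the processing code used in this study: the record layout, the label strings and in particular the daytime window (`DAY_START_HOUR`, `DAY_END_HOUR`) are hypothetical placeholders chosen only to make the example runnable.

```python
from datetime import datetime

# Hypothetical daytime window (local hours); assumed for illustration only.
DAY_START_HOUR = 9
DAY_END_HOUR = 15

# Hypothetical per-day classification and (timestamp, value) samples.
day_labels = {
    "2005-05-01": "event",
    "2005-05-02": "non-event",
    "2005-05-03": "undefined",
}
samples = [
    (datetime(2005, 5, 1, 3, 0), 1.0),    # night-time, dropped
    (datetime(2005, 5, 1, 12, 0), 2.0),   # daytime on an event day, kept
    (datetime(2005, 5, 2, 10, 0), 3.0),   # daytime on a non-event day, kept
    (datetime(2005, 5, 2, 22, 0), 4.0),   # night-time, dropped
    (datetime(2005, 5, 3, 12, 0), 5.0),   # undefined day, dropped
]


def preprocess(samples, day_labels, start=DAY_START_HOUR, end=DAY_END_HOUR):
    """Drop undefined days and night-time points; attach a 0/1 event label."""
    kept = []
    for timestamp, value in samples:
        label = day_labels.get(timestamp.strftime("%Y-%m-%d"))
        if label is None or label == "undefined":
            continue  # step 1: exclude days without a clear classification
        if not (start <= timestamp.hour < end):
            continue  # step 2: exclude night-time data points
        kept.append((timestamp, value, 1 if label == "event" else 0))
    return kept


cleaned = preprocess(samples, day_labels)
```

Each remaining sample carries the binary label of its day, which is the form needed for an MI calculation between a continuous variable and a discrete classification.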

Information theory is a mathematical representation of the conditions and
parameters affecting the transmission and processing of information

This subsection briefly introduces the basic concepts of information
quantities, as well as the definitions and notations of probabilities that
will be used throughout the paper. An in-depth explanation of the
principles of information theory can be found, for example, in

Entropy is a key measure in information theory. It quantifies the amount of
uncertainty involved in the value of a random variable. If

The mutual information (MI) of two random variables is a measure of the
mutual dependence between these two variables. MI is thus a method for
measuring the degree of relatedness between data sets. MI and its relation to
joint and conditional entropies are illustrated visually in
Fig.

Venn diagram of entropy properties. The area covered by red and
orange is the entropy of

More formally, the mutual information of

Mutual information (MI) vs. Pearson (

MI has found its use in modern science and technology, for example in search
engines

The MI implementation is straightforward for discrete distributions because the
required probabilities for calculating MI can be computed precisely by
counting. However, the MI implementation for continuous distributions may be
tricky because the probability distribution function is often unknown. A
binning method can be used for calculating MI involving continuous
distributions. This method makes the data completely discrete by grouping the
data points of the continuous variables into bins. Nevertheless, the choice
of binning size (i.e. the number of data points per bin) is a non-trivial
task, since different choices often lead to different MI results. In addition, the binning
method does not allow MI calculation between two data sets that have
different resolutions, which would be a major obstacle in this study.
Therefore, in the current investigation we will use the so-called nearest-neighbour method
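The sensitivity of the binning approach to the bin size can be illustrated with a small synthetic experiment. The sketch below is an assumption-laden toy example, not part of the analysis in this study: the data are random numbers, the function names are hypothetical, and the MI is the simple plug-in estimate obtained by counting, expressed in nats.

```python
import math
import random
from collections import Counter


def discrete_mi(xs, ys):
    """Plug-in MI (in nats) between two discrete sequences, from counts."""
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))
    # p(x,y) * log( p(x,y) / (p(x) p(y)) ), with probabilities as counts / n
    return sum(
        (c / n) * math.log(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )


def equal_count_bins(values, points_per_bin):
    """Discretise a continuous variable into bins holding a fixed number of points."""
    order = sorted(range(len(values)), key=values.__getitem__)
    bins = [0] * len(values)
    for rank, index in enumerate(order):
        bins[index] = rank // points_per_bin
    return bins


# Synthetic data: a binary label and a continuous value that depends on it.
rng = random.Random(0)
labels = [rng.randint(0, 1) for _ in range(500)]
values = [rng.gauss(1.0 if lab else 0.0, 1.0) for lab in labels]

# The same data give different MI values for different bin sizes.
mi_by_bin_size = {
    size: discrete_mi(labels, equal_count_bins(values, size))
    for size in (250, 50, 10, 5)
}
```

Because each finer binning in this example subdivides the coarser one, the plug-in estimate can only grow as the bins shrink, even though the underlying dependence is unchanged. This is the kind of ambiguity that motivates an estimator without an explicit bin size.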

This subsection explains the nearest-neighbour MI method adopted from

An illustration of the computing process for the mutual information estimator based on the nearest-neighbour method.
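As a rough illustration of how a nearest-neighbour MI estimate between a discrete label and a continuous variable can be computed, the sketch below follows one published form of such an estimator (Ross, 2014). Whether it matches the variant used in this study in every detail is an assumption. The function names, the choice of k = 3 and the synthetic data are hypothetical, and the brute-force neighbour search is only suitable for small samples.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def digamma_int(n):
    """Digamma at a positive integer: psi(n) = -gamma + sum_{j=1}^{n-1} 1/j."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, n))


def knn_mi(labels, values, k=3):
    """Nearest-neighbour MI estimate (nats) for discrete labels vs. continuous values."""
    n = len(labels)
    class_size = {}
    for lab in labels:
        class_size[lab] = class_size.get(lab, 0) + 1
    if min(class_size.values()) <= k:
        raise ValueError("every class needs more than k samples")

    psi_n, psi_k = digamma_int(n), digamma_int(k)
    total = 0.0
    for i in range(n):
        dist = [abs(values[j] - values[i]) for j in range(n)]
        # Distance to the k-th nearest neighbour among points with the same label.
        same = sorted(dist[j] for j in range(n) if j != i and labels[j] == labels[i])
        d = same[k - 1]
        # Number of neighbours within that distance in the full data set.
        m = sum(1 for j in range(n) if j != i and dist[j] <= d)
        total += psi_n - digamma_int(class_size[labels[i]]) + psi_k - digamma_int(m)
    return total / n  # average of the per-point contributions


rng = random.Random(1)
labels = [i % 2 for i in range(400)]
separated = [rng.gauss(100.0 * lab, 1.0) for lab in labels]  # value determined by label
unrelated = [rng.gauss(0.0, 1.0) for _ in labels]            # value independent of label

mi_separated = knn_mi(labels, separated)
mi_unrelated = knn_mi(labels, unrelated)
```

For the well-separated classes the estimate is close to the entropy of the balanced binary label, ln 2 ≈ 0.69 nats, whereas for the unrelated variable it stays near zero (small negative values can occur because this is an estimate).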

After obtaining MI for every point

Figure

Prior to demonstrating the results of applying MI to the atmospheric data in
Sect.

The capability of MI to detect a non-linear relationship between two variables in
an artificial benchmark data set has already been illustrated in
Fig.
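A comparable artificial benchmark can be sketched in a few lines. The example below is hypothetical and is not the benchmark used in this study: a binary label depends on a continuous variable in a symmetric, non-monotonic way, so the linear (Pearson) correlation is close to zero although the label is fully determined by the variable. For brevity, MI is computed here with a simple fixed-bin plug-in estimate rather than with the nearest-neighbour method.

```python
import math
import random
from collections import Counter


def pearson(xs, ys):
    """Pearson linear correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / math.sqrt(var_x * var_y)


def binned_mi(labels, values, lo, hi, n_bins):
    """Plug-in MI (nats) after placing the values into equal-width bins on [lo, hi]."""
    width = (hi - lo) / n_bins
    bins = [min(max(int((v - lo) / width), 0), n_bins - 1) for v in values]
    n = len(labels)
    count_l, count_b = Counter(labels), Counter(bins)
    count_lb = Counter(zip(labels, bins))
    return sum(
        (c / n) * math.log(c * n / (count_l[l] * count_b[b]))
        for (l, b), c in count_lb.items()
    )


rng = random.Random(2)
x = [rng.uniform(-2.0, 2.0) for _ in range(2000)]
label = [1 if abs(v) > 1.0 else 0 for v in x]  # symmetric, non-linear dependence

r = pearson(x, label)
mi = binned_mi(label, x, lo=-2.0, hi=2.0, n_bins=8)
```

In this toy case the Pearson coefficient is near zero while the MI is close to its maximum of ln 2 nats for a balanced binary label, which shows why a linear measure alone can miss a strong dependence.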

We consider the intensity of blackbody radiation. The monochromatic emissive
power of a blackbody

The relationship between solar spectral irradiance and wavelength.
The notations

The results section is divided into two subsections. The first part presents the results of the MI correlation analysis between atmospheric variables and NPF. The second part then discusses scatter plots of several atmospheric variables that are relevant to NPF.

In this study, the atmospheric variables take continuous values, whereas the
classification of aerosol formation days is discrete. Hence, to find the
correlation between these two data sets, we implemented the MI based on the
nearest-neighbour method explained earlier in Sect.

Figure

MI correlation level between NPF and a variety of atmospheric
variables: gas concentrations and aerosols

The top subplot in Fig.

The MI found that ozone (

On the other hand, the result suggests that sulfur dioxide (

The middle subplot presents the MI correlation level for all measured
meteorological variables. Variables with the subscript “ave” are
averages of meteorological variables measured at different heights. As in the
top subplot, we calculated the mean and standard deviation of their MI
correlation levels and display them as a rectangular bar with a whisker. The
middle subplot shows that there is a very strong correlation between NPF and
relative humidity (

In contrast, wind speed (

The bottom subplot shows the MI level of several radiation variables. It can
be seen that most radiation variables have a strong relation with NPF. This
fact was discussed earlier by

The correlation between concentrations of particles with different sizes from
3 to 1000 nm and NPF is illustrated as a coloured panel in
Fig.

The correlation levels obtained through the MI method for the particle number size distribution at the SMEAR II station in Hyytiälä. The colour shows the level of correlation. The first three columns represent three periods between the years 1996 and 2014, where each period shows the correlation level over 6 years. The last column indicates the total correlation level over 18 years. Note that the dark blue in period 1 for particles larger than 500 nm is due to unavailable data for that size range.

The scatter matrix plot between five selected atmospheric variables
and NPF. The red and blue dots represent event and non-event days,
respectively. The notation

In order to understand in depth the results from the aforementioned MI
analysis, a scatter plot matrix was generated, as shown in
Fig.

First, we focus on the histogram plots located on the sub-axes along the
diagonal. It can be seen that the event and non-event days are well separated
in the cases of Glob and RHTd. These histogram plots demonstrate very well
that NPF has a positive (

The next focus is on the variables

The above conclusion would be very challenging to make by using a linear
correlation analysis for variables

This subsection demonstrates the analysis complexity by observing the
histogram and scatter plots for some variables, such as

This paper extends and complements the analysis of a previous data mining
study on atmospheric data, conducted by

The MI method reports findings very similar to those of previous atmospheric studies. The water content and sulfuric acid concentration are found to be strongly correlated with NPF. Furthermore, the results also suggest that NPF is influenced by temperature, relative humidity, CS and radiation. According to the results from the MI analysis, measurements taken at different heights have similar correlations with NPF.

As shown in the previous subsection, this method is more powerful than a linear correlation analysis. Therefore, it should be applied first, before a deeper data analysis such as the inspection of histograms and scatter plots. The method could act as an early correlation detection step for any atmospheric variables.

This work uses the longest available data sets of NPF observations with
simultaneously measured ambient variables. In future work, we will seek to
investigate the use of the method on different atmospheric data sets. For
instance, robust correlation analysis is required for understanding other
variables influencing atmospheric processes, such as volatile organic compounds
(VOCs) and aerosol particles at sizes below

In order to enrich the analysis, databases from other SMEAR stations as
well as from previous research campaigns should be included. The data may contain
more variation because they are measured at different locations. One
anticipated obstacle is the scarcity of NPF day classification databases.
Although an automatic classification algorithm to create such a database has
been called for

Data measured at the SMEAR II station are available on the
following web page:

MAZ, VH, RR, HJ and ASF designed the study. MAZ and RR developed the methodology. MAZ performed data and statistical analysis. VH, PP, VMK and MK contributed to the interpretation of the data and the results. VH and HJ suggested the use of experimental data from the SMEAR II station. All authors contributed to writing the manuscript.

The authors declare that they have no conflict of interest.

This work is supported by the European Research Council (ERC) via ATM-GTP (grant number 742206) and the Academy of Finland Centre of Excellence in Atmospheric Sciences (project number 307331).

Edited by: Fangqun Yu

Reviewed by: two anonymous referees