Intercomparison of midlatitude tropospheric and lower-stratospheric water vapor measurements and comparison to ECMWF humidity data

Abstract. Accurate measurement of water vapor in the climate-sensitive region near the tropopause is very challenging. Unexplained systematic discrepancies between measurements at low water vapor mixing ratios made by different instruments on airborne platforms have limited our ability to adequately address a number of relevant scientific questions on the humidity distribution, cloud formation and climate impact in that region. Therefore, during the past decade, the scientific community has undertaken substantial efforts to understand these discrepancies and improve the quality of water vapor measurements. This study presents a comprehensive intercomparison of airborne state-of-the-art in situ hygrometers deployed on board the DLR (German Aerospace Center) research aircraft HALO (High Altitude and LOng Range Research Aircraft) during the Midlatitude CIRRUS (ML-CIRRUS) campaign conducted in 2014 over central Europe. The instrument intercomparison shows that the hygrometer measurements agree within their combined accuracy (±10 % to 15 %, depending on the humidity regime); total mean values agree within 2.5 %. However, systematic differences on the order of 10 % and up to a maximum of 15 % are found for mixing ratios below 10 parts per million (ppm) H2O. A comparison of relative humidity within cirrus clouds does not indicate a systematic instrument bias in either water vapor or temperature measurements in the upper troposphere. Furthermore, in situ measurements are compared to model data from the European Centre for Medium-Range Weather Forecasts (ECMWF) which are interpolated along the ML-CIRRUS flight tracks. We find a mean agreement within ±10 % throughout the troposphere and a significant wet bias in the model on the order of 100 % to 150 % in the stratosphere close to the tropopause. Consistent with previous studies, this analysis indicates that the model deficit is mainly caused by too weak of a humidity gradient at the tropopause.

Introduction
Water vapor is one of the most important trace gases in Earth's atmosphere due to its large influence on the radiation budget and atmospheric dynamics. It absorbs and emits infrared radiation throughout the entire profile of the atmosphere (Kiehl and Trenberth, 1997). The radiative effect of small changes in water vapor concentration is most pronounced in the upper troposphere and lower stratosphere (UTLS), where absolute H2O mixing ratios are 2-4 orders of magnitude lower than at ground level (e.g., Ramanathan and Inamdar, 2006; Solomon et al., 2010; Riese et al., 2012). Besides the direct radiative effect, water vapor also provides one of the strongest feedback parameters to temperature changes in the atmosphere (Manabe and Wetherald, 1967; Dessler et al., 2008).
Additionally, water vapor is the most important parameter for cloud formation and lifetime. From an energy perspective, clouds not only influence the radiation balance but also redistribute energy through latent heat during condensation and evaporation. Changes in latent heat fluxes influence global dynamics like the Hadley circulation and extratropical storm tracks (Schneider et al., 2010). The radiative effect of clouds is more complex than the effect of greenhouse gases due to very inhomogeneous cloud cover and different microphysical and thus radiative properties of clouds at different altitudes. The opposing effects of the reflection of solar shortwave radiation and the trapping of longwave radiation determine the net radiative effect of clouds, whether cooling or heating, depending on cloud properties, surface albedo, sun elevation etc. (e.g., Liou, 1986; Lynch, 1996; Lee et al., 2009).
The various atmospheric processes related to water vapor impose challenges for its measurement. The measurement accuracy and resolution required to improve our understanding of the atmosphere strongly depend on the research question. Regarding the radiative effect of stratospheric H2O, the main challenge is the absolute accuracy at mixing ratios below 10 parts per million (ppm, equivalent to µmol mol−1) since small changes of less than 1 ppm significantly impact the radiation budget (Solomon et al., 2010). For cloud effects, the challenge is even bigger, especially in very cold ice clouds where ice supersaturation and cloud properties are strongly linked (Jensen et al., 2005; Shilling et al., 2006; Krämer et al., 2009). A 10 % difference in relative humidity with respect to ice (RHi), which falls within the combined uncertainty in water vapor and temperature measurements, can result in substantially different cloud properties.
During the past several decades, a number of H2O measurement intercomparisons during field campaigns including aircraft in situ, balloon-borne and satellite instruments revealed that the relative measurement uncertainty in water vapor mixing ratio was significantly higher than 10 %, occasionally even exceeding 100 % at the lowest mixing ratios in the lower stratosphere (e.g., Oltmans et al., 2000; Vömel et al., 2007; Weinstock et al., 2009). These large discrepancies motivated the comprehensive intercomparison campaign AquaVIT-1 at the AIDA (Aerosol Interaction and Dynamics in the Atmosphere) cloud chamber in Karlsruhe in 2007 (Fahey et al., 2014) and the follow-up but as-yet undocumented campaigns AquaVIT-2 and -3 in 2013 and 2015, respectively. In the controlled environment of the cloud chamber, the agreement between the instruments during AquaVIT-1 was better than for the measurements on the different airborne platforms but still in the 20 % range for mixing ratios between 1 and 10 ppm. As a consequence, novel concepts and instruments (e.g., Thornberry et al., 2013; Kaufmann et al., 2014, 2016; Buchholz et al., 2017) and improved techniques for in-flight (Rollins et al., 2011) and ground calibration (Meyer et al., 2015) were developed to improve the accuracy of H2O measurements.
Since space and measurement time on research aircraft are limited and expensive, intercomparable airborne data sets of water vapor measurements are scarce (e.g., Kiemle et al., 2008; Jensen et al., 2017a). The most recent comprehensive intercomparison was conducted in 2011 on the NASA WB-57 high-altitude aircraft during the MACPEX (Midlatitude Airborne Cirrus Properties EXperiment) campaign (Rollins et al., 2014). Similar to the present study, five different hygrometers using differing water vapor detection techniques were mounted on the aircraft. In the dry regime below 10 ppm, instruments were found to typically agree within their stated combined accuracies. However, the authors argue that the remaining discrepancies are very likely systematic in nature and result from undetermined offsets in flight (Rollins et al., 2014). Measured against the accuracy required to address the questions noted above, significant progress appears to have been made in recent years. However, the current measurement accuracy still limits our ability to appropriately assess questions regarding, for instance, stratospheric water vapor trends.
The aim of this study is to provide another step towards a better understanding of the accuracy of airborne water vapor measurements. We present a comprehensive intercomparison of the primary airborne state-of-the-art hygrometers operated by the German research community. This unique data set is used to assess the performance of the individual instruments and to provide a solid basis for comparison to the Integrated Forecast System (IFS) of the European Centre for Medium-Range Weather Forecasts (ECMWF). Section 2 briefly describes the ML-CIRRUS campaign during which five independent in situ hygrometers were operated simultaneously. Section 3 provides a summary of the different instruments. The methodology of the intercomparison is described in Sect. 4, while the intercomparison itself is discussed in Sect. 5. Section 5 also includes a comparison of relative humidity inside cirrus clouds as well as an intercomparison of in situ measurements with ECMWF IFS model data.

ML-CIRRUS campaign
The ML-CIRRUS campaign with the DLR (German Aerospace Center) research aircraft HALO (High Altitude and LOng Range Research Aircraft) took place in March and April 2014 with the aircraft based in Oberpfaffenhofen, Germany. A detailed summary of the scientific goals, the flight strategy and the instrumentation is given in Voigt et al. (2017). During the campaign period, HALO performed 16 research flights with 88 flight hours in total. The flights were designed for a comprehensive characterization of midlatitude cirrus and contrail cirrus using in situ as well as remote-sensing instruments. ML-CIRRUS aimed for a better understanding of cirrus cloud formation in different meteorological conditions (Luebke et al., 2016; Wernli et al., 2016; Urbanek et al., 2017) to improve our estimation of the radiative impact of cirrus (Krisna et al., 2018) as well as of air traffic impacts on high cloud cover (Grewe et al., 2017). Therefore, the flight plans were mainly designed to obtain a maximum number of flight hours either within cirrus clouds for in situ measurements or approximately 1 km above cirrus for lidar and dropsonde measurements. The implications of the flight strategy on the water vapor intercomparison are discussed in Sect. 4.1. To capture the cirrus cloud life cycle under different meteorological conditions, the flights covered almost the entire region of central Europe from the northern British coast down to Portugal (Fig. 1). To achieve the scientific goals of the mission, the HALO payload for ML-CIRRUS comprised instruments to measure cloud particles, aerosols, trace gases and dynamic parameters. The aircraft cabin was equipped with several novel in situ instruments for trace gases and aerosols, dropsondes and a differential absorption lidar (DIAL) system for water vapor and cloud measurements. Furthermore, cloud particles and aerosols were measured in situ using a set of nine wing probes.
Since this paper focuses on the intercomparison of the in situ water vapor measurements during ML-CIRRUS, only those instruments will be described here in detail. A full list of instruments, their descriptions and references can be found in Voigt et al. (2017).

Instruments
The HALO payload for ML-CIRRUS included five different water vapor instruments, which provides the opportunity to compare different measurement methods as well as gas phase and total water measurements. In particular, three completely independent measurement principles for water vapor were used: mass spectrometry (AIMS-H2O), Lyman-α photofragment fluorescence spectroscopy (FISH) and tunable diode laser absorption spectroscopy (SHARC, HAI and WARAN). While AIMS-H2O and SHARC measured gas phase water vapor via a backward-facing inlet, FISH, HAI and WARAN measured total water (gas phase + evaporated cloud particles) using forward-facing inlets. A summary of key parameters for each instrument is given in Table 1.

AIMS-H2O
The Atmospheric Ionization Mass Spectrometer for water vapor (AIMS) is a linear quadrupole mass spectrometer designed to measure low water vapor mixing ratios typical for the upper troposphere and lower stratosphere (Thornberry et al., 2013) and, in a different configuration, HCl, HNO3 and SO2. The instrument samples gas phase water vapor through a backward-facing heated inlet. After passing a pressure regulation valve, sample air is directly ionized in an electrical discharge ion source. Inside the ion source multiple ion-molecule reactions form H3O+(H2O)n ion clusters with n = 0 to 2. The abundance of these ion clusters is then measured by the mass spectrometer and used to quantify the original water vapor molar mixing ratio in the ambient air. In order to accurately link the ion count rate with the H2O mixing ratio, the instrument is calibrated in flight by regularly adding a water vapor standard generated by the catalytic reaction of hydrogen and oxygen to form H2O on a heated Pt surface (Rollins et al., 2011). AIMS operates at a measurement range between 1 and 500 ppm with an overall accuracy of 7 % to 15 %, mainly depending on the actual water vapor concentration. During ML-CIRRUS ambient air was sampled through 8.5 mm ID Synflex tubing, and a bypass flow was used to reduce the residence time of air in the inlet line to below 0.2 s. This results in a real measurement frequency of ∼4 Hz, corresponding to around 50 m horizontal resolution. In order to achieve the best possible accuracy of the instrument, it was calibrated once or twice during each research flight. The stability of the calibration standard was guaranteed by six ground reference measurements against an MBW 373LX dew point mirror during the campaign period.

FISH
FISH (Fast In situ Stratospheric Hygrometer) is a closed-cell Lyman-α photofragment fluorescence hygrometer which has been operated on various research aircraft for more than 20 years (Meyer et al., 2015; Schiller et al., 2009). The operating principle of the instrument is described in detail by Zöger et al. (1999). It uses the Lyman-α radiation of a UV lamp at 121.6 nm to dissociate water molecules into single H atoms and excited-state OH molecules. Returning to the ground state, the OH molecules emit radiation at a wavelength between 285 and 330 nm. The intensity of this radiation is proportional to the water vapor molar mixing ratio in the measurement cell and is quantified using a photomultiplier tube. FISH is calibrated regularly at ground level to relate the measured signal to the water vapor mixing ratio using an MBW DP30 dew point mirror as a reference instrument. A detailed description of the calibration procedure can be found in Meyer et al. (2015). FISH is able to measure water vapor mixing ratios in a range from 1 to 1000 ppm. The overall uncertainty during ML-CIRRUS was determined to be 6 % relative and ±0.4 ppm absolute offset uncertainty. FISH was connected to a forward-facing inlet to sample total water. The pressure difference between inlet (static + dynamic) and gas exhaust (only static) ensures a flow rate > 10 standard L min−1 and thus allows for fast measurements in UTLS and cirrus conditions.

SHARC
SHARC (Sophisticated Hygrometer for Atmospheric ResearCh) is a tunable diode laser (TDL) hygrometer developed at DLR Flight Experiments. It is a closed-cell hygrometer which uses the absorption line of water vapor at 1.37 µm. To cover a wide humidity range, SHARC uses a dual-path Herriott-type cell with a single-pass absorption length of approximately 0.17 m and a multi-pass absorption length of approximately 8 m. The cell is completely fiber-coupled to minimize parasitic absorption outside the measurement volume and has a very compact volume of 83 cm3. The measurement range is from 10 to 50 000 ppm, constrained by the detection limit of the absorption signal at low water vapor mixing ratios. The overall uncertainty is 5 % relative and ±1 ppm absolute offset uncertainty. During ML-CIRRUS, SHARC was operated with a 6.35 mm backward-facing stainless-steel inlet, sampling gas phase H2O with a total flow of 15 standard L min−1 at ground level, decreasing to 1.5 standard L min−1 at the highest flight levels. The real-time data reduction uses a multi-line Voigt fit at 5 Hz to calculate the water vapor mixing ratio. For the intercomparison, the data were averaged to 1 Hz. The instrument was calibrated on the ground against an MBW 373LX dew point mirror.

HAI
HAI (Hygrometer for Atmospheric Investigations) is a four-channel TDL hygrometer which uses two different absorption wavelengths (1.37 and 2.6 µm) in both closed- and open-cell geometries. HAI uses a complete physical model in combination with spectral water absorption line parameters mostly measured at the Physikalisch-Technische Bundesanstalt Braunschweig (PTB) (Pogány et al., 2015) and monitors pressure, temperature and absorption path length in order to calculate the water vapor concentration for a given absorption spectrum without prior calibration. The accuracy of this approach was verified recently by a side-by-side comparison (Buchholz et al., 2014) of a previous PTB laser absorption spectrometer with the German national primary humidity standard. HAI has 1.5 m optical path length for the closed cell and 4.2 m for the open path. For this work, we use data from the 1.37 µm closed-cell channel of HAI in the range of 20 to 40 000 ppm since only that channel provided data within the required uncertainty margin during ML-CIRRUS. The overall uncertainty for this channel is 4.3 % relative and ±3 ppm absolute offset uncertainty. The closed cell was connected to a 12.7 mm forward-facing stainless-steel inlet and was actively pumped. The effective time resolution of the instrument is 0.7 s, corresponding to a spatial resolution at flight altitude of around 150 m.
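The uncertainties quoted for FISH, SHARC and HAI each consist of a relative term and an absolute offset term, so the total relative uncertainty grows towards low mixing ratios. The following minimal sketch illustrates this behavior. It assumes that the two terms are independent and can be combined in quadrature; the text does not state how the terms are combined, and the names `SPECS` and `total_relative_uncertainty` are purely illustrative.

```python
import math

# (relative uncertainty, absolute offset uncertainty in ppm) as quoted in the text
SPECS = {
    "FISH": (0.06, 0.4),
    "SHARC": (0.05, 1.0),
    "HAI": (0.043, 3.0),
}


def total_relative_uncertainty(instrument, mixing_ratio_ppm):
    """Total relative uncertainty (as a fraction) at a given mixing ratio.

    Assumption: the relative and offset terms are independent and are
    combined in quadrature. This is an illustration, not the authors' method.
    """
    relative, offset_ppm = SPECS[instrument]
    return math.hypot(relative, offset_ppm / mixing_ratio_ppm)


if __name__ == "__main__":
    for name in SPECS:
        for ppm in (20.0, 100.0, 1000.0):
            percent = 100.0 * total_relative_uncertainty(name, ppm)
            print(f"{name:5s} at {ppm:6.0f} ppm: {percent:4.1f} %")
```

Under this assumption the offset term dominates near the lower end of each measurement range, while the relative term dominates at high mixing ratios.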

WARAN
The WARAN (WAter vapoR ANalyzer) instrument consists of a commercial WVSS-II (SpectraSensors Inc., USA) tunable diode laser instrument in combination with a custom inlet and an additional pump for the flow through the measurement cell (Kaufmann et al., 2014; Groß et al., 2014). While the instrument was operated on other campaigns parallel to a frost point hygrometer, during ML-CIRRUS the WARAN was integrated in the AIMS rack and connected to a forward-facing inlet to sample total water. The inlet pylon was the same as used for AIMS-H2O. As for the other instruments operating with a forward-facing inlet, only cloud-free measurement sequences are used for the intercomparison. The instrument was calibrated on the ground after the ML-CIRRUS campaign using an MBW 373LX dew point mirror as a reference. Due to the high detection limit of the instrument (> 50 ppm, stated by the manufacturer), the intercomparison of this instrument is limited to tropospheric conditions. During ML-CIRRUS the WARAN was mainly used to detect cloud water. Due to the enhancement of ice particles in the inlet by a factor between 20 and 35, measured total-water mixing ratios are relatively high (Afchine et al., 2018). Hence the instrument detection limit allows for cloud water quantification for most clouds except for very thin cirrus.

Additional instrumentation
For data evaluation with respect to relative humidity, cloud detection and model intercomparison, we use additional parameters measured on board HALO during ML-CIRRUS. Static pressure and static temperature are measured by the Basis HALO Measurement and Sensor System (BAHAMAS; Krautstrunk and Giez, 2012; Giez et al., 2017). The accuracy of the pressure sensor is 0.3 hPa; the accuracy of the static temperature measurement is 0.5 K. The SHARC hygrometer (see Sect. 3.3) is also part of BAHAMAS. Cloud detection was done using data from the Cloud and Aerosol Spectrometer with Detection of Polarization (CAS-DPOL), which was mounted under the wing of HALO (Baumgardner et al., 2001; Voigt et al., 2017). The cloud probe measures particles in a size range between 0.5 and 50 µm and is thus sensitive to natural cirrus as well as contrail ice particles.
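To illustrate how the water vapor mixing ratio, static pressure and static temperature combine into a relative humidity with respect to ice, the sketch below computes RHi for a hypothetical upper-tropospheric sample and shows its sensitivity to a 0.5 K temperature offset and a 10 % H2O offset. It assumes the ice saturation vapor pressure parameterization of Murphy and Koop (2005), which is widely used but is not confirmed by the text as the one applied here; the function names and sample values are illustrative.

```python
import math


def saturation_pressure_ice_pa(temperature_k):
    """Saturation vapor pressure over ice in Pa.

    Assumption: parameterization of Murphy and Koop (2005), valid above 110 K.
    """
    t = temperature_k
    return math.exp(9.550426 - 5723.265 / t + 3.53068 * math.log(t) - 0.00728332 * t)


def rh_ice_percent(h2o_ppm, pressure_hpa, temperature_k):
    """RHi in percent from molar mixing ratio (ppm), pressure (hPa) and temperature (K)."""
    # Water vapor partial pressure; at ppm levels the mixing ratio and the
    # mole fraction are treated as equal (assumption).
    partial_pressure_pa = h2o_ppm * 1.0e-6 * pressure_hpa * 100.0
    return 100.0 * partial_pressure_pa / saturation_pressure_ice_pa(temperature_k)


if __name__ == "__main__":
    # Hypothetical sample: 100 ppm at 250 hPa and 220 K
    base = rh_ice_percent(100.0, 250.0, 220.0)
    warm = rh_ice_percent(100.0, 250.0, 220.5)  # temperature offset of +0.5 K
    wet = rh_ice_percent(110.0, 250.0, 220.0)   # H2O offset of +10 %
    print(f"RHi base: {base:.1f} %, +0.5 K: {warm:.1f} %, +10 % H2O: {wet:.1f} %")
```

Under these assumptions, a 0.5 K temperature offset at 220 K changes RHi by roughly 6 % of its value, which is comparable in size to the uncertainty contributed by the water vapor measurement itself.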

Methodology and conditions for intercomparison
Similar to previous approaches (e.g., Rollins et al., 2014; Fahey et al., 2014; Meyer et al., 2015) we use the data set from the entire ML-CIRRUS campaign in order to achieve good statistics. This section describes the framework of the intercomparison and the methodology of the data evaluation including the determination of a water vapor reference value.

Flight strategy
A discussion of the flight strategy during ML-CIRRUS is important since the campaign did not aim for a statistically uniform sampling in terms of water vapor but rather the investigation of cirrus clouds. The flight patterns typically consist of three components: (1) sampling inside cirrus clouds in order to obtain in situ information on particle distribution and their interaction with trace gases and aerosols; (2) remote-sensing segments, in which HALO flew ∼1 km above the cirrus for lidar and radiation measurements of the clouds; and (3) transfer flight segments to approach specific weather systems like warm conveyor belts or mountain lee wave regions over western Europe (dark blue and magenta flight tracks in Fig. 1).
In total, we have around 160 000 1 Hz data points in the UTLS with H2O mixing ratios between 3 and 1000 ppm. Of those data points, approximately 22 % are in stratospheric conditions (θ > 350 K), and 33 % are in-cloud measurements. The dedicated search for cirrus conditions leads to a higher detection frequency of both in-cirrus and above-cirrus sampling relative to their natural occurrence. Since we expect the mode value of the RHi distribution to be close to 100 % inside cirrus (e.g., Ovarlez et al., 2002; Jensen et al., 2017b), this allows for an independent check of the absolute values of the gas phase water vapor measurements. However, extensive in situ sampling in cirrus limits the data for intercomparison of total and gas phase instruments. The remote-sensing legs and the transfer segments provide a comprehensive water vapor data set within the lower stratosphere. The lidar requires a certain vertical distance from the cirrus upper edge; hence most of the stratospheric data were sampled roughly 1 km above that level. Directly above cirrus level fewer data points are sampled. During the transfer segments, flight altitude and horizontal position of the aircraft are independent of meteorological conditions; however, due to the typical high flight altitude of HALO, most of these data points are within the lower stratosphere.
Overall, compared to unbiased sampling of the UTLS, the ML-CIRRUS flight strategy shifts the sampling of water vapor towards a higher detection frequency of humid upper-tropospheric air within cirrus clouds, a higher detection frequency of stratospheric measurements at a distance of around 1 to 1.5 km from the tropopause and only a small detection frequency of data in dry tropospheric conditions and directly above the tropopause. However, the measurement strategy should only affect the amount of data in certain water vapor ranges and not the performance of each instrument within its specification.
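The θ > 350 K criterion used above to label stratospheric data relies on the potential temperature, which can be derived from the static temperature and static pressure. The sketch below uses the standard dry-air definition with a 1000 hPa reference pressure; the exact constants used for the campaign data are not given in the text, so the values here are assumptions, and the function names are illustrative.

```python
KAPPA = 0.2857   # R_d / c_p for dry air, approximately 287 / 1004 (assumed value)
P0_HPA = 1000.0  # conventional reference pressure in hPa


def potential_temperature(temperature_k, pressure_hpa):
    """Dry potential temperature in K from static temperature (K) and pressure (hPa)."""
    return temperature_k * (P0_HPA / pressure_hpa) ** KAPPA


def is_stratospheric(temperature_k, pressure_hpa, theta_threshold_k=350.0):
    """Apply the simple theta > 350 K criterion quoted in the text."""
    return potential_temperature(temperature_k, pressure_hpa) > theta_threshold_k


if __name__ == "__main__":
    # Hypothetical samples at two flight levels with the same static temperature
    for pressure_hpa in (250.0, 180.0):
        theta = potential_temperature(220.0, pressure_hpa)
        print(f"{pressure_hpa:.0f} hPa: theta = {theta:.1f} K, "
              f"stratospheric = {is_stratospheric(220.0, pressure_hpa)}")
```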

Data processing and filtering
In order to construct a consistent data set from all five water vapor instruments on board HALO, the specific time resolutions and response characteristics are considered for each instrument. The goal is to retain as much information as possible while minimizing data-processing-related artifacts. Since all instruments reported data either with a non-uniform frequency or in 1 Hz intervals, the latter was used to unify the data. For AIMS, the 1 Hz data are created by averaging over three data points. Data from FISH are on a 1 Hz integer time basis. For SHARC and HAI, the 1 Hz resolution data are interpolated onto integer values. The only instrument with a lower time resolution than 1 Hz is the WARAN with ∼0.4 Hz. Since it is not useful to interpolate this data set onto a 1 Hz interval, each measured value is assigned to the closest integer time value. This processing allows comparison of the H2O measurements directly without imposing any substantial interpolation artifacts in the measured values which could affect the interpretation of the intercomparison.
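The three ways of bringing the instruments onto a common 1 Hz integer time base described above can be sketched as follows. This is a simplified illustration using plain Python lists; the function names are hypothetical, time stamps are assumed to be strictly increasing seconds, and details such as how the averaged AIMS points are time-stamped are not specified in the text.

```python
import math


def average_groups(values, n=3):
    """Mean of consecutive, non-overlapping groups of n samples (AIMS-style averaging)."""
    return [sum(values[i:i + n]) / n for i in range(0, len(values) - n + 1, n)]


def interpolate_to_integer_seconds(times, values):
    """Linear interpolation onto integer seconds (SHARC/HAI-style).

    Assumes at least two samples with strictly increasing time stamps.
    """
    result = {}
    j = 0
    for t in range(math.ceil(times[0]), math.floor(times[-1]) + 1):
        # Advance to the sample interval that contains t
        while times[j + 1] < t:
            j += 1
        weight = (t - times[j]) / (times[j + 1] - times[j])
        result[t] = values[j] + weight * (values[j + 1] - values[j])
    return result


def assign_to_nearest_second(times, values):
    """Assign each sample to the closest integer second without interpolation (WARAN-style)."""
    return {math.floor(t + 0.5): v for t, v in zip(times, values)}


if __name__ == "__main__":
    print(average_groups([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]))
    print(interpolate_to_integer_seconds([0.5, 1.5, 2.5], [10.0, 20.0, 30.0]))
    print(assign_to_nearest_second([0.4, 2.9, 5.4], [60.0, 62.0, 61.0]))
```

Assigning slow samples to the nearest second leaves gaps in the 1 Hz series instead of filling them with interpolated values, which is consistent with the stated aim of avoiding interpolation artifacts.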
Since three instruments (FISH, HAI and WARAN) measured total water, cloud sequences were filtered out for the comparison of gas phase H2O. The cloud filtering was done in a two-step process using both the total water measurements themselves and cloud probe particle measurements by the CAS-DPOL. To make sure that in-cloud data are definitely filtered out, all data with total water concentrations above saturation are flagged as "in-cloud". However, this implies that supersaturated cloud-free conditions are left out as well. As a quality check for the filtering procedure, particle concentrations measured by the CAS-DPOL are used to double-check the cloud mask. In this step, very few additional data points are rejected, which might be due to very thin sublimating clouds or the different positions of cloud probe under the wing and water vapor inlets at the top fuselage.
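The two-step cloud filter described above can be expressed as a simple logical mask. The sketch below is illustrative only: the saturation mixing ratio is taken as a precomputed input, and the particle concentration threshold `particle_threshold_cm3` is a hypothetical parameter, since the text does not state which CAS-DPOL threshold was applied.

```python
def cloud_mask(total_water_ppm, saturation_ppm, particle_conc_cm3,
               particle_threshold_cm3=0.0):
    """Return a list of booleans, True where a 1 Hz sample is flagged as in-cloud.

    Step 1: total water above the saturation mixing ratio.
    Step 2: cloud probe particle concentration above a threshold
            (hypothetical value, not taken from the text).
    """
    mask = []
    for total, saturation, particles in zip(total_water_ppm, saturation_ppm,
                                            particle_conc_cm3):
        above_saturation = total > saturation
        particles_detected = particles > particle_threshold_cm3
        mask.append(above_saturation or particles_detected)
    return mask


if __name__ == "__main__":
    total_water = [50.0, 120.0, 80.0]
    saturation = [100.0, 100.0, 100.0]
    particles = [0.0, 0.0, 0.5]
    flags = cloud_mask(total_water, saturation, particles)
    # Keep only cloud-free samples for the gas phase comparison
    clear_sky = [tw for tw, flag in zip(total_water, flags) if not flag]
    print(flags, clear_sky)
```

As noted in the text, the first step is deliberately conservative: it also removes cloud-free samples that are supersaturated.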
Further data filtering was applied manually in order to clear data that suffer from obvious sampling artifacts. Concerning AIMS, the pressure regulation of the instrument during ML-CIRRUS was not fast enough to compensate for the pressure drop during the fast first ascent on each flight. For this reason, H2O data in that region are not reliable and not included in the archived data set. Furthermore, there are a few ascent and descent sequences where one or more instruments showed a significant time lag of a couple of seconds compared to the other instruments. The causes of these lags and their intermittent occurrence are not clear, and the respective sequences are filtered out.

Reference value
The determination of a reference value for the intercomparison is guided by various considerations. One possibility is the agreement on a common reference instrument. The airborne intercomparison during MACPEX (Rollins et al., 2014), for example, used the Harvard Lyman-α as a single instrument reference. However, this approach is complicated for the instrument combination deployed during ML-CIRRUS since there was no instrument on HALO which measured gas phase H2O and simultaneously covered the complete range of mixing ratios. For that reason, we follow the approach of the AquaVIT campaign described in Fahey et al. (2014), where the mean value of a set of instruments was used as a reference. This allows for a combined intercomparison of data in the lower stratosphere (AIMS, FISH) and in the upper troposphere in cirrus clouds (AIMS, SHARC) and clear sky (AIMS, FISH, SHARC, HAI). We further compare the middle troposphere at higher H2O mixing ratios (SHARC, HAI, WARAN). The reference value for each 1 s step is calculated as the mean of AIMS, FISH, SHARC and HAI data points with the condition that at least two instruments provided valid data for a single time step. For the lower stratosphere, the reference is the mean value of AIMS and FISH measurements. For the troposphere, generally all four instruments are used for the calculation of the reference except for cloud sequences and depending on data availability. Data from the WARAN are not included in the reference calculation since their uncertainty is significantly higher.
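The reference calculation described above amounts to a mean over the available instruments with a minimum-count condition. A minimal sketch for a single 1 s time step follows; missing or filtered data are represented by `None` or NaN, which is an assumption about the data layout, and the function name is illustrative.

```python
import math


def reference_value(aims, fish, sharc, hai, min_valid=2):
    """Mean of the valid AIMS, FISH, SHARC and HAI values for one time step.

    Returns None if fewer than min_valid instruments provide valid data.
    WARAN is deliberately not part of the reference.
    """
    valid = [v for v in (aims, fish, sharc, hai)
             if v is not None and not math.isnan(v)]
    if len(valid) < min_valid:
        return None
    return sum(valid) / len(valid)


if __name__ == "__main__":
    # Lower stratosphere: only AIMS and FISH are within their measurement range
    print(reference_value(5.0, 4.6, None, None))
    # Clear-sky upper troposphere: all four instruments contribute
    print(reference_value(100.0, 104.0, 98.0, 102.0))
    # Only one valid instrument: no reference is defined
    print(reference_value(None, 4.6, None, None))
```

Because the set of contributing instruments changes with humidity regime and cloud conditions, the reference is not tied to any single instrument, which is the motivation given in the text for this choice.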

Intercomparison
The basis for the intercomparison of H2O data during ML-CIRRUS is time series from each instrument, an example sequence of which is shown in Fig. 2c for the flight on 3 April 2014. For all total-water instruments, only cloud-free data are used for the intercomparison. This flight aimed for in situ and remote measurements of thin cirrus over Germany which were potentially influenced by Saharan dust (Weger et al., 2018). Flight altitude and water vapor mixing ratios in Fig. 2 show the alternation of tropospheric in situ legs (H2O ∼30 to 120 ppm) and lidar legs in the stratosphere (H2O ∼5 ppm). Except for the WARAN, which seems to measure too high at the beginning of the flight, all instruments agree reasonably well in both upper troposphere and lower stratosphere. Figure 2a shows a profile for the upper troposphere and lower stratosphere using data from the second descent (indicated by dotted lines in Fig. 2c). The instruments follow the same structures in both regions with a much higher variation in H2O mixing ratios in the upper troposphere. The agreement also holds for the second profile down to 3 km altitude (Fig. 2b); however, mixing ratios there are too high to be measured by AIMS and FISH. The short ascent to 8 km after the profile shows a significant deviation between SHARC, HAI and WARAN. Both total-water instruments (HAI and WARAN) measure higher values than the SHARC, which is most likely due to wet contamination of their measurement cells when encountering liquid clouds during the descent. Sequences with such contamination are identified for the entire data set and filtered out for the intercomparison (less than 1 % of the data).

Correlation of single instruments
To investigate the overall performance of the different measurement systems, 12 ML-CIRRUS flights were combined similar to the one shown in Fig. 2. This complete data set is used to produce the scatterplots in Fig. 3, where four combinations of instrument pairs are displayed. The scatterplot of AIMS and FISH in Fig. 3a shows a very close correlation from below 4 ppm up to ∼600 ppm, corresponding to the upper limit of AIMS. For stratospheric mixing ratios below 10 ppm the correlation broadens, with AIMS exhibiting a tendency to higher humidity values and FISH to lower humidity values. Figure 3b shows the correlation between AIMS and SHARC, the two instruments measuring solely gas phase H2O, and is thus the only correlation plot where in-cloud data are displayed together with clear-sky data. Consistent with Fig. 3a this correlation is very narrow, slightly widening only for the high concentrations at the upper AIMS measurement limit. A similar narrow correlation is found for HAI vs. FISH (Fig. 3c) from 20 ppm up to 1000 ppm. For all three scatterplots (Fig. 3a-c) correlation coefficients are higher than 0.99. In contrast to panels a-c, Fig. 3d spans the range to higher humidity from 10 to 10 000 ppm, displaying data from WARAN and SHARC.
Figure 4. Relative difference of the measurements of AIMS, FISH, SHARC, HAI and WARAN from the mean H2O molar mixing ratio value which is used as a reference (see text for details). The small dots are the single measurement points (1 Hz values). The big squares, triangle and circle are mean values of the relative difference for specific bins of H2O mixing ratio. The broad bars represent the 25th-75th percentile, while the narrow bars stand for the 10th-90th percentile within the bins. All points with a deviation between −1 % and +1 % fall on the ±1 line. Values in the gray box on the left-hand side represent the overall mean values for the different instruments.
Between 100 and 300 ppm, the WARAN shows a slight dry bias, which disappears for higher mixing ratios. Compared to the other instruments, the WARAN exhibits a significantly larger scatter, with complete sequences lying well above the one-to-one line. These sequences are associated with initial ascent during the flights, where the WARAN occasionally shows a wet bias (data points marked orange in Fig. 3d). These data points are omitted from the intercomparison. The dry bias and larger scatter are also reflected in the correla-tion coefficient, which is 0.94 for Fig. 3d. The comparison with WARAN measurements during other campaigns suggests that the deviations are likely caused by systematic offsets in the original calibration of the instrument. Thus, the analysis is probably only valid for this specific instrument during the ML-CIRRUS campaign. Overall, the correlation plots indicate a good agreement for AIMS, FISH, SHARC and HAI throughout the entire campaign.

Deviation with respect to reference value
In order to quantify the performance of each instrument, the deviations of each instrument from the reference value (see Sect. 4.3) are displayed in Fig. 4, similar to previous studies (Rollins et al., 2014). On the x axis, the H 2 O reference value is shown. The y axis denotes the relative difference for each instrument from that reference value. The small dots are the measured 1 Hz values; the big symbols are mean values for logarithmic bins in H 2 O. Additionally, the broad bars represent the interquartile range in each bin, and the narrow bars are the 10th-90th percentiles. In the gray box on the left, mean values and respective percentiles for the entire data set of each instrument are shown. As shown in Table 2, the mean deviations of AIMS, FISH, SHARC and HAI are below 2.5 %, indicating that there is no consistent systematic bias when averaging over the entire data set. The situation looks different for the WARAN instrument, where the dry bias at low H 2 O mixing ratios can be clearly seen in the H 2 O-resolved deviation but not in the overall mean (Fig. 4e).
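The binning behind this kind of plot can be illustrated with a minimal sketch, assuming the relative difference is defined as (instrument − reference) / reference and expressed in percent. The function names, the lower bin edge `lo` and the number of bins per decade are hypothetical choices for illustration, not the values used in this study.

```python
import math
from statistics import mean, quantiles


def log_bin_index(value, lo, bins_per_decade):
    """Index of the logarithmic bin containing value (bin 0 starts at lo)."""
    return math.floor(math.log10(value / lo) * bins_per_decade)


def binned_relative_deviation(instrument, reference, lo=1.0, bins_per_decade=5):
    """Group relative deviations (in %) by logarithmic bins of the reference.

    instrument, reference: equally long sequences of mixing ratios (ppm);
    None marks a gap. lo and bins_per_decade are illustrative assumptions.
    """
    bins = {}
    for x, ref in zip(instrument, reference):
        if x is None or ref is None or ref <= 0:
            continue  # skip gaps and invalid reference values
        rel = 100.0 * (x - ref) / ref
        bins.setdefault(log_bin_index(ref, lo, bins_per_decade), []).append(rel)

    summary = {}
    for idx, values in sorted(bins.items()):
        entry = {"n": len(values), "mean": mean(values)}
        if len(values) >= 2:  # percentiles need at least two points
            deciles = quantiles(values, n=10)
            quartiles = quantiles(values, n=4)
            entry["p10"], entry["p90"] = deciles[0], deciles[-1]
            entry["p25"], entry["p75"] = quartiles[0], quartiles[-1]
        summary[idx] = entry
    return summary


if __name__ == "__main__":
    # Synthetic numbers, not campaign data.
    print(binned_relative_deviation([5.5, 4.5, 110.0], [5.0, 5.0, 100.0]))
```

Because the bin mean is taken over relative differences, opposite deviations within a bin can cancel, which is why the percentile bars are shown alongside the mean.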
When looking at Fig. 4a and b in more detail, the agreement between AIMS and FISH in the lower stratosphere below 10 ppm seems good with single values of both instruments mostly falling within ±15 %. Since these are the only two instruments measuring in the low ppm range, the plot is a direct comparison of both instruments. In fact, there is a systematic difference between both instruments for humidity conditions between 4 and 10 ppm. In that region the mean values of the instruments differ by 4 % to 16 %, with AIMS measuring higher and FISH measuring lower mixing ratios. Interestingly, the difference between the instruments for the driest conditions (3.5 to 4.5 ppm) is smaller than for the next several bins (2.4 % vs. 6.5 %). However, the spread in the data is too large to judge if this difference is significant. Examining all of the time series plots from the campaign (not shown) reveals that there are some distinct stratospheric legs where AIMS is up to 1 ppm higher than FISH (corresponding to a relative deviation of ∼ 20 %). The reason for this deviation is not completely clear; one explanation could be a contamination of the AIMS vacuum system. However, it is unlikely that this is the only cause since the behavior changes occasionally from one leg to another within the same flight. For upper-tropospheric measurements (where more than the two instruments contribute to the reference value), the agreement of the mean values with the reference is better than 5 %. The same holds for the SHARC measurements (Fig. 4c) throughout its complete range with a slight tendency to lower mixing ratios (3 % to 4 %) compared to the reference between 30 and 200 ppm. HAI data (Fig. 4d) also fall in the same range of variation, with mean values being consistently slightly higher by about 3 % than the reference value in the range between 30 and 2000 ppm. For both SHARC and HAI, the single measurement scatter is within ±20 % with respect to the reference. 
Considering the fact that all four instruments contribute to the reference value, one can state that FISH and SHARC tend to consistently report slightly lower mixing ratios than AIMS and HAI. The WARAN measurements (Fig. 4e) fall off compared to the other four instruments, exhibiting a significant low bias for mixing ratios below 300 ppm. However, these data are still within the uncertainty specifications of the instrument (see Table 1).

Figure 5. PDFs of relative humidity with respect to ice calculated from AIMS (black) and SHARC (green) data and the static air temperature measurement on HALO inside cirrus clouds for the entire campaign. Dark green indicates overlap regions. The cloud flag is the same used for filtering the total water measurements. The center of the respective distribution is 94 % for SHARC and 97 % for AIMS.

Comparison of relative humidity in clouds
The comparison of relative humidity measurements in clouds can be considered as a further measure for the quality of the H 2 O measurements which is independent of any kind of reference value. In contrast to measurements in liquid clouds, much stronger deviations of RH i from saturation are possible in ice clouds due to their higher thermodynamic inertia. RH i inside cirrus clouds can be very variable due to advection as well as small-scale turbulence inside the cloud (e.g., Gettelman et al., 2006; Petzold et al., 2017). However, if the measurements include a sufficiently even sampling of meteorological conditions, a distribution of RH i with a mode value close to 100 % would be expected. In order to calculate RH i from the measured H 2 O mixing ratios, we have used the static temperature and static pressure measurements on board HALO to calculate water vapor partial pressure and saturation pressure. The saturation pressure over ice is calculated using Eq. (7) from Murphy and Koop (2005).
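As an illustration of this calculation, the following sketch evaluates Eq. (7) of Murphy and Koop (2005) and converts a mixing ratio to RH i. It assumes that the molar mixing ratio can be treated as the mole fraction of water vapor, which is a close approximation at ppm levels; the function and argument names are illustrative and not taken from the processing code of this study.

```python
import math


def saturation_pressure_ice(temperature_k):
    """Saturation vapor pressure over ice in Pa.

    Murphy and Koop (2005), Eq. (7); stated there as valid above 110 K.
    """
    t = temperature_k
    return math.exp(
        9.550426 - 5723.265 / t + 3.53068 * math.log(t) - 0.00728332 * t
    )


def rh_ice(h2o_ppm, pressure_hpa, temperature_k):
    """Relative humidity with respect to ice in %.

    h2o_ppm: H2O molar mixing ratio in ppm (treated as mole fraction,
             an assumption that is adequate for small mixing ratios)
    pressure_hpa: static pressure in hPa
    temperature_k: static temperature in K
    """
    partial_pressure_pa = h2o_ppm * 1e-6 * pressure_hpa * 100.0
    return 100.0 * partial_pressure_pa / saturation_pressure_ice(temperature_k)


if __name__ == "__main__":
    # Illustrative upper-tropospheric conditions, not campaign data.
    print(rh_ice(100.0, 250.0, 220.0))
```

Since the saturation pressure depends exponentially on temperature, a temperature error propagates strongly into RH i, which is why the same static temperature is used for both instruments in the comparison.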
Here, we compare in-cloud measurements of RH i for the two water vapor instruments with backward-facing inlets, AIMS and SHARC (see also Fig. 3b). In total, more than 50 000 in-cloud data points were acquired during ML-CIRRUS, with numbers varying between 2000 and 11 000 for individual flights. The frequency distribution of RH i for the entire data set of the ML-CIRRUS campaign is shown in Fig. 5. Data from both instruments are almost normally distributed, with mean values slightly below ice saturation. Normal distributions fitted to both data sets peak at RH i = 97 % for AIMS (52 700 data points) and 94 % for SHARC (56 300 data points). The full width at half maximum of the distribution is 26.7 % for AIMS and 19.4 % for SHARC. Both distributions are slightly asymmetric with a tail towards higher supersaturation which is more pronounced in the SHARC measurements. This agrees with results from Ovarlez et al. (2002), who find similar asymmetric distributions for temperatures below −40 °C. Considering the instrumental uncertainties, both distributions appear reasonable. However, the question remains whether the slight shift of the center of the RH i distribution relative to 100 % is caused by systematic instrument biases (H 2 O and temperature), inlet issues (e.g., sucking in and evaporating ice particles) or a sampling bias in the flight strategy. If the sampling were biased toward either forming/growing cirrus or evaporating cirrus, one would expect a positive or negative RH i bias with respect to saturation, respectively. During ML-CIRRUS, individual flights typically targeted specific meteorological conditions, e.g., the updraft region of warm conveyor belts or mountain wave cirrus. Hence, a sampling bias for individual flights is very likely. In order to investigate that, Fig. 6 shows the mean values for the in-cloud RH i distributions of AIMS and SHARC including interquartile ranges and 10th-90th-percentile ranges for each flight. For flight nos. 
1-5, AIMS and SHARC deviate by 4 % to 8 % with one exception on flight no. 3, where the deviation is around 20 %. These data originate from a two-step profile through cirrus clouds with high updraft velocities over the Balearic Islands. During that flight, there is a systematic difference between AIMS and SHARC which is most pronounced during the two cirrus transects (difference of around 20 % compared to 7 % to 10 % during the rest of the flight). From the high updraft velocity, one would rather expect supersaturation inside the cirrus. For flight no. 7, there is not enough in-cloud data from AIMS to produce a reasonable RH i distribution. For flight nos. 8, 9 and 10, the agreement of both instruments is almost perfect, while for the last two flights AIMS tended to measure slightly lower RH i values than SHARC but with a difference of less than 3 %. The spread of the RH i measurements is similar for both instruments (AIMS interquartile range: 10 % to 20 %; SHARC interquartile range: 8 % to 17 %), with the lower values for SHARC arising from a slightly better precision.
The observed trend could be an indication of instrumental drift over the campaign period; however, we cannot state which instrument is subject to a drift. Flights with mean super- or subsaturation are almost evenly distributed for AIMS, while SHARC measurements are slightly subsaturated, especially during the first half of the campaign. From the present data, we do not have clear evidence for an overall sampling bias during the campaign. A possible bias affecting RH i derived from both instruments could be a bias in the static temperature measurement on board HALO since we use the same temperature information for both instruments. However, the median and mean values of the distributions deviate by less than 6 % from saturation for most of the flights, indicating that temperature is not significantly off.

Comparison to the ECMWF numerical weather prediction model
The extensive ML-CIRRUS in situ data set of upper-tropospheric and lower-stratospheric humidity further enables an evaluation of the accuracy of UTLS humidity in the ECMWF (European Centre for Medium-Range Weather Forecasts) numerical weather prediction (NWP) model. A correct representation of water vapor is crucial for weather and climate prediction via various pathways. Besides the troposphere, where water vapor is obviously important for cloud formation and precipitation, the stratospheric mean state also influences the predictability in the troposphere (Douville, 2009). Moreover, biases in modeled stratospheric water vapor can induce a frequently observed cold bias in the extratropics (e.g., Boer et al., 1992; Stenke et al., 2008; Chen and Rasch, 2012). The model data used for analysis of ML-CIRRUS are provided by the Integrated Forecasting System (IFS) of the ECMWF (IFS Version 40r1). For analysis, we use a combination of analysis data with hourly forecasts starting every 12 h from the analysis at 00:00 and 12:00 UTC. The data set covers the region of 20-70° N, 60° W-20° E. The model includes 137 vertical model levels, with pressure intervals of 18 hPa near 7 km altitude and 7 hPa near 15 km height. For typical flight altitudes near 11.5 km (200 hPa) the vertical resolution is around 300 m (10 hPa). The horizontal resolution of the data used is 0.5°. Higher horizontal resolution would be available from IFS but would not provide more information due to the hourly time resolution. The data are interpolated linearly to the HALO position (latitude and longitude) above the WGS84 reference ellipsoid. Vertical interpolation is performed in the logarithm of pressure fields (which varies more smoothly than pressure) based on the static pressure measured by HALO-BAHAMAS (Schumann et al., 2015). The output frequency is 0.1 Hz along the flight track, resulting in a distance of roughly 2 km between adjacent data points. 
The reference H 2 O mixing ratio is averaged accordingly over 10 s intervals. Except for the time resolution, the methodology of the intercomparison of model data and measurements is the same as used in Sect. 5.2, simply treating the interpolated model data as a "new" instrument. In Fig. 7, the relative deviation of the ECMWF data is plotted against the measured reference H 2 O value (same method as used for Fig. 4). The small dots represent the interpolated model data point for each valid reference value (see Sect. 4.3). As in Fig. 4, the black triangles denote bin-wise mean values of the relative difference, while the gray bars and whiskers represent the interquartile range and the 10th-90th-percentile range, respectively. In order to get an idea if the sampled air mass is of stratospheric or tropospheric origin, the individual data points are color-coded with potential temperature averaged from the HALO onboard measurements.
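The vertical interpolation in the logarithm of pressure and the 10 s averaging of the reference described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, only the vertical step of the interpolation is shown, and incomplete 10 s blocks at the end of a series are simply dropped, which may differ from the handling of gaps in the actual processing.

```python
import math


def interp_log_pressure(p_target_hpa, level_pressures_hpa, level_values):
    """Interpolate a model profile linearly in ln(p) to the aircraft pressure."""
    pairs = sorted(zip(level_pressures_hpa, level_values))
    for (p_lo, v_lo), (p_hi, v_hi) in zip(pairs, pairs[1:]):
        if p_lo <= p_target_hpa <= p_hi:
            # Weight of the higher-pressure level, linear in ln(p)
            w = (math.log(p_target_hpa) - math.log(p_lo)) / (
                math.log(p_hi) - math.log(p_lo)
            )
            return v_lo + w * (v_hi - v_lo)
    raise ValueError("target pressure lies outside the model levels")


def block_mean(samples_1hz, block=10):
    """Average consecutive 1 Hz samples over `block` values (10 s -> 0.1 Hz)."""
    return [
        sum(samples_1hz[i:i + block]) / block
        for i in range(0, len(samples_1hz) - block + 1, block)
    ]


if __name__ == "__main__":
    # Synthetic two-level profile (hPa, ppm), not model output.
    print(interp_log_pressure(205.0, [200.0, 210.0], [6.0, 9.0]))
    print(block_mean(list(range(20))))
```

A spacing of roughly 2 km between 0.1 Hz points corresponds to an aircraft speed of about 200 m/s, so the 10 s averages of the reference cover a comparable horizontal distance to one model output point.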
As can be seen in Fig. 7, the comparison between the model and measurements is different in two distinct humidity regimes. At the higher tropospheric mixing ratios above 30 ppm, there is a remarkably good agreement between mean bin values, and the interquartile range is mostly within ±10 %. The single values exhibit a larger scatter, resulting in 10th and 90th percentiles of around −30 % and +20 %, respectively. This could be expected considering the high natural variability in water vapor compared to the model resolution. The distribution of mean relative differences suggests a slight bias in that region, with ECMWF being slightly lower. With a mean value near 3 %, this bias is very small when considering the overall scatter of the data and the interpolation of the model onto the flight path. The interpolation procedure is also the reason for the single data points resembling the shape of a mirrored S. This behavior results from comparing the measurement signal with high spatial variability with the rather smooth model data. When using a logarithmic y scale and the more variable measured mixing ratio as a reference on the x axis, it results in an S-like shape in the individual data points.
The character of the intercomparison differs for lower mixing ratios below 30 ppm found in the tropopause region and the lower stratosphere. In that region, the model significantly overestimates the humidity. The biggest differences between measurement and model occur at mixing ratios between 5 and 8 ppm, typical values for the region directly above the tropopause. The maximum difference is found in the bin between 5.5 and 6.5 ppm, where the mean difference is 115 % (statistics from 382 data points). The difference decreases again for mixing ratios below 5 ppm, indicating a better agreement between measurement and model with increasing distance to the tropopause. The mean difference for the driest bin (3.5 to 4.5 ppm with 2383 data points) of 46 % is less than half of the more humid neighboring bins. However, it still is significant and positive, meaning that ECMWF shows a systematic wet bias for the entire probed region in the lower stratosphere in spring.
The maximal differences close to tropopause mixing ratios indicate that the difference between measurement and model is caused by too weak of a humidity gradient at the tropopause, partially explained by the model grid resolution of about 300 m vertically near the tropopause. Here, narrow inversions may form between subsiding dry stratospheric air and upward mixing of humid cold tropospheric air (Birner et al., 2002) which might not be covered by the coarse resolution of a global model. The difference in humidity gradients is directly evident in the humidity profiles. Figure 8 shows one ascent (Fig. 8a) and one descent (Fig. 8b) through the entire tropopause region on 11 April 2014. Consistent with Fig. 7, we observe a good agreement between model and measurement in the troposphere. Directly above the tropopause, the humidity gradient in the model is weaker compared to the measurements for both profiles, resulting in overestimation of water vapor by the model in that region. This feature is independent from the absolute height of the tropopause (∼ 11.8 km in Fig. 8a, ∼ 10.4 km in Fig. 8b), which is well represented in the model when comparing measured and modeled temperature profiles. With increasing vertical distance to the tropopause, measurement and model approach similar values, which is consistent with the overall intercomparison in Fig. 7. The region above the tropopause where we observe a significant difference between measurements and model varies from around 1 km above the tropopause in Fig. 8a to around 3 km in Fig. 8b. Thus, the  Our results support previous studies (e.g., Kunz et al., 2014; Dyroff et al., 2015). The latter study shows a good agreement between measurement and model for vertical distances to the tropopause of 6 km and higher and model wet bias between 2 and 6 km above the tropopause for the extratropics. 
During ML-CIRRUS, the maximum distance above the tropopause was 3.5 km; hence the measurements are probably not stratospheric enough to leave the wet-bias region. However, the trend towards better agreement deeper in the stratosphere can be seen in the color coding in Fig. 7 as well as in Table 3, where mean differences are binned by potential temperature rather than the mixing ratio. It turns out that the wet bias strongly peaks at potential temperatures between 350 and 360 K (mean difference of 88 %), whereas it decreases from there with increasing altitude in the stratosphere (higher potential temperature) as well as into the troposphere (lower potential temperature). Both single profiles and the overall intercomparison allow the observed differences in lower-stratospheric humidity to be attributed to the too-weak humidity gradient of ECMWF above the tropopause compared to the observations in European spring conditions.
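Binning the model deviation by potential temperature, as described above, can be sketched as follows. The sketch assumes the standard dry-air definition of potential temperature with a reference pressure of 1000 hPa and R/c_p ≈ 0.2857; the 10 K bin width and all function names are illustrative assumptions rather than the settings used for Table 3.

```python
import math


def potential_temperature(temperature_k, pressure_hpa, p0_hpa=1000.0, kappa=0.2857):
    """Dry-air potential temperature in K (kappa = R / c_p for dry air)."""
    return temperature_k * (p0_hpa / pressure_hpa) ** kappa


def mean_difference_by_theta(model_ppm, reference_ppm, theta_k, bin_width_k=10.0):
    """Mean relative model deviation (in %) per potential temperature bin.

    Returns a dict mapping the lower bin edge (K) to the mean of
    100 * (model - reference) / reference within that bin.
    """
    sums = {}
    for m, r, th in zip(model_ppm, reference_ppm, theta_k):
        lower_edge = bin_width_k * math.floor(th / bin_width_k)
        rel = 100.0 * (m - r) / r
        total, count = sums.get(lower_edge, (0.0, 0))
        sums[lower_edge] = (total + rel, count + 1)
    return {edge: total / count for edge, (total, count) in sorted(sums.items())}


if __name__ == "__main__":
    # Synthetic values, not campaign data.
    print(potential_temperature(217.0, 200.0))
    print(mean_difference_by_theta([12.0, 10.0, 4.4], [6.0, 5.0, 4.0],
                                   [352.0, 358.0, 381.0]))
```

Using potential temperature as the binning coordinate separates air masses by their vertical position relative to the tropopause more directly than the mixing ratio does, since a given mixing ratio can occur at different distances from the tropopause.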

Discussion and summary
We intercompare water vapor measurements from different state-of-the-art in situ instruments on board the DLR research aircraft HALO during the midlatitude UTLS field project ML-CIRRUS. It is the first comprehensive intercomparison of all primary airborne hygrometers operated by the German research community including three TDL instruments (HAI, SHARC and WARAN), one mass spectrometer (AIMS) and the established Lyman-α hygrometer FISH. The intercomparison includes a large span of humidity conditions from lower-stratospheric to lower-tropospheric H 2 O molar mixing ratios, with different instruments covering different parts of the mixing ratio spectrum. This work focusses on the intercomparison of gas phase water vapor measurements, meaning that only clear-sky data are used from instruments measuring total water (HAI, FISH and WARAN). The flight strategy of ML-CIRRUS focused on the investigation of midlatitude cirrus clouds with in situ and remote-sensing (lidar) instrumentation. Hence, the majority of data points originate from the midlatitude upper troposphere and lower stratosphere above Europe and the western Atlantic in spring 2014. The agreement between the in situ instruments, expressed by the relative difference to a reference value (mean value of at least two instruments), is generally good and consistent with previous intercomparison studies (Rollins et al., 2014). For all instruments except the WARAN, the overall mean deviation from the reference value is below 2.5 %. This is an indication of the successful efforts to improve the accuracy of UTLS H 2 O measurements during the past decade, motivated by large discrepancies that have been found before. 
Still, systematic discrepancies remain between the instruments in specific regimes which need to be addressed in order to improve our understanding of the humidity budget in the lowermost stratosphere or of cirrus formation under very cold conditions (Gao et al., 2004; Krämer et al., 2009; Jensen et al., 2017a). One major issue is the difference between FISH and AIMS for stratospheric mixing ratios below 10 ppm. The observation that the mass spectrometer AIMS measures systematically higher mixing ratios than FISH is similar to the findings during the MACPEX intercomparison (Rollins et al., 2014). During that campaign, the maximum difference of bin mean values is 13.7 % in the range 5.5-6.5 ppm. Although this difference is still within the combined uncertainty of the instruments, it hampers the detailed investigation of trends in the lower-stratospheric water vapor budget, which are of the same order of magnitude and highly uncertain, even in their sign (e.g., Hegglin et al., 2014; Lossow et al., 2018).
We investigate RH i measurements in cirrus clouds from AIMS and SHARC as an independent metric of the absolute accuracy of the H 2 O measurements. This is not straightforward, as RH i in cirrus clouds is known to differ significantly from saturation depending on the dynamics of the cloud. Still, considering a sufficiently large database, the data can be used as an independent indicator of the absolute accuracy of the measurements under UTLS conditions. Data from both instruments have a mode value close to ice saturation (less than 10 % difference of mean value for all flights). An overall instrumental or sampling bias seems unlikely since flights with mean super- and subsaturation in clouds are almost evenly distributed. The same holds for a possible bias in the aircraft temperature measurement which would similarly propagate into the RH i distribution. However, we do observe a drift between the in-cloud measurements of the two instruments over the course of the measurement campaign. While AIMS measures higher RH i values than SHARC in the beginning of the campaign, mean RH i values agree much better during the second half of the campaign. When considering the entire data set (including clear-sky data), this drift is not apparent, which makes a change in the performance of one instrument unlikely.
A comparison of the measured H 2 O mixing ratios with ECMWF IFS data is accomplished using the same methodology as for the instrument intercomparison. The gridded ECMWF data are interpolated in space and time along the flight path of HALO with a resolution of 0.1 Hz. Measurement and model show generally good agreement throughout the upper troposphere with bin-wise mean values of the difference typically within ±10 % (consistent with, e.g., Flentje et al., 2007) with a slight tendency towards a model dry bias which, however, is not statistically significant. Below mixing ratios of 30 ppm, we observe a significant wet bias in the ECMWF model with highest mean deviation from the measurements around 6 ppm or at a potential temperature of 355 K. In that regime, mean deviations are on the order of 100 % with an interquartile range of 70 % to 140 %. The large wet bias of the model in the tropopause region is consistent with findings in previous studies (e.g., Kunz et al., 2014; Dyroff et al., 2015). The model wet bias decreases substantially at higher potential temperatures, leading to a mean difference of only 17 % at potential temperatures above 370 K. The fact that the model bias shows a clear maximum at the tropopause indicates that this issue is likely caused by too strong numerical smoothing reducing humidity gradients near the tropopause rather than an overall bias of stratospheric mixing ratios. Kunz et al. (2014) found a similar feature with good agreement between FISH measurements and ECMWF reanalysis data at altitudes higher than 6 km above the tropopause. The issue of too-weak gradients at the tropopause is discussed extensively by, for example, Birner et al. (2002), Gray et al. (2014) and Saffin et al. (2017). 
In particular, the lower-stratospheric wet bias is very sensitive to the horizontal interpolation of the specific humidity field in the semi-Lagrangian IFS model (Diamantakis, 2014), leading to a too high diffusivity, which in turn causes a cold bias at the extratropical tropopause (Stenke et al., 2008). However, it is difficult and cost intensive to address the issue in the model since it would require adjusting core dynamical model processes or increasing the model resolution (Saffin et al., 2017; Pope et al., 2001). Additionally, the model suffers from a lack of assimilated information on lower-stratospheric water vapor since specific humidity data from radiosondes are only assimilated below a certain threshold pressure level (depending on the type of sonde; see Andersson et al., 2007). Given the large model uncertainty in H 2 O concentrations close to the tropopause, it is difficult, for example, to correctly evaluate the radiative effects of water vapor in that region where the atmosphere is very sensitive to even small changes in H 2 O (Solomon et al., 2010; Riese et al., 2012).
Despite the limitation to one-dimensional data for the in situ measurements, high-spatial-resolution data as obtained from aircraft can help to point out important small-scale differences which are difficult to assess when comparing model to satellite data due to their limited (especially vertical) resolution (e.g., Lamquin et al., 2009). The intercomparison shows that our approach to comparing in situ data with model data can be particularly useful for investigating model performance around the tropopause. Hence, it could be worthwhile to extend this type of intercomparison to reanalysis data like the new climate reanalysis data set (ERA-5) of the ECMWF or include further NWP models like the Icosahedral Nonhydrostatic (ICON) model from the German Weather Service.
Data availability. Data are accessible via the HALO database (https://halo-db.pa.op.dlr.de/mission/2) (HALO database, 2018). They can be accessed after signing a data agreement. Operational meteorological analyses are archived in the MARS archive at ECMWF (https://www.ecmwf.int/en/forecasts/documentation-and-support/changes-ecmwf-model/ifs-documentation, ECMWF, 2018).
Author contributions. SK, CV, RH and TJW performed the AIMS measurements; MK and CR performed the FISH measurements; MZ and AG performed the SHARC measurements; BB and VE performed the HAI measurements; US provided and prepared the ECMWF data; and SK performed the study and wrote the manuscript with help from CV, TT and US. All authors commented on the manuscript.
Competing interests. The authors declare that there is no conflict of interest.
Special issue statement. This article is part of the special issue "ML-CIRRUS - the airborne experiment on natural cirrus and contrail cirrus in mid-latitudes with the high-altitude long-range research aircraft HALO (ACP/AMT inter-journal SI)". It is not associated with a conference.