Near-surface meteorology during the Arctic Summer Cloud Ocean Study (ASCOS): evaluation of reanalyses and global climate models

Abstract. Atmospheric measurements from the Arctic Summer Cloud Ocean Study (ASCOS) are used to evaluate the performance of three atmospheric reanalyses (European Centre for Medium-Range Weather Forecasts (ECMWF)-Interim reanalysis, National Centers for Environmental Prediction (NCEP)-National Center for Atmospheric Research (NCAR) reanalysis, and NCEP-DOE (Department of Energy) reanalysis) and two global climate models (CAM5 (Community Atmosphere Model 5) and NASA GISS (Goddard Institute for Space Studies) ModelE2) in simulation of the high Arctic environment. Quantities analyzed include near-surface meteorological variables such as temperature, pressure, humidity and winds,


Introduction
Both modeling and observational studies demonstrate that the Arctic is warming at a rate faster than the rest of the globe (e.g., IPCC, 2007; Serreze et al., 2009; Serreze and Francis, 2006; Rigor et al., 2000). While planetary warming trends are thought to be largely the result of elevated greenhouse gas concentrations, the "Arctic amplification" described above is yet to be fully understood. Various ideas have been presented for possible drivers of this Arctic amplification, including feedbacks resulting from changes to snow and ice areal coverage (e.g., Curry et al., 1995), changes to clouds and atmospheric composition (e.g., Kay and Gettelman, 2009; Screen and Simmonds, 2010; Boé et al., 2009), large-scale circulation patterns in both the atmosphere and ocean (e.g., Graversen et al., 2008) and natural, low-frequency variability (e.g., Chylek et al., 2009). To date, there is little consensus on which of these processes is most important to understanding Arctic amplification, or even whether Arctic amplification as described constitutes a robust signal (Polyakov et al., 2002).
Part of the reason for this lack of understanding is a general dearth of observations at Arctic latitudes. Because of this limitation, many of the studies mentioned in the previous paragraph utilize modeling tools such as reanalysis products and global climate models to arrive at their conclusions. Ironically, the reason these modeling tools are needed (limited observational records) also creates a challenging environment for model validation and limits the ability of data assimilation techniques to constrain models. In particular, the representation of cloud and radiation processes has been demonstrated to be problematic at high latitudes. Walsh et al. (2008) used measurements from the Atmospheric Radiation Measurement (ARM) program's North Slope of Alaska site at Barrow (71.3° N, 156.6° W) to evaluate cloud and radiation fields in four different atmospheric reanalyses. These included the National Centers for Environmental Prediction (NCEP) and National Center for Atmospheric Research (NCAR) reanalysis (hereafter R-1, Kalnay et al., 1996), the European Centre for Medium-Range Weather Forecasts (ECMWF) 40 yr reanalysis (ERA-40, Uppala et al., 2005), the NCEP North American Regional Reanalysis (NARR, Mesinger et al., 2006) and the Japan Meteorological Agency and Central Research Institute of Electric Power Industry 25 yr reanalysis (JRA-25, Onogi et al., 2007). This work illustrated that the reanalyses generally underpredicted area-weighted cloud fraction, resulting in a corresponding over-prediction of downwelling shortwave and an under-prediction of downwelling longwave radiative flux densities at the earth's surface. These differences are alarming in that the measurements were obtained at a location that includes routine radiosonde launches, which are subsequently assimilated into the reanalysis products. This direct integration of local measurements should result in simulated large-scale atmospheric conditions similar to those observed, a luxury not shared by more remote portions of the Arctic. More recently, Zib et al. (2012) used measurements from the Baseline Surface Radiation Network (BSRN) stations at Barrow and Ny-Ålesund (78.9° N, 11.9° E) to evaluate cloud and radiative properties in the National Aeronautics and Space Administration (NASA) Modern-Era Retrospective Analysis for Research and Applications (MERRA, Rienecker et al., 2011), the NCEP Climate Forecast System Reanalysis (CFSR, Saha et al., 2010), the National Oceanic and Atmospheric Administration (NOAA) Twentieth Century Reanalysis Project (20CR, Compo et al., 2011), the ECMWF-Interim reanalysis (hereafter ERA-I, Dee et al., 2011), and the NCEP-Department of Energy (DOE) reanalysis (hereafter R-2, Kanamitsu et al., 2002). This next generation of reanalyses demonstrates large differences in cloud occurrence from one product to the next, including inconsistencies between relative cloud amounts from one site to the other. Radiative flux densities, while marginally better at Barrow, are still problematic, with R-2 showing the largest biases. Most recently, Jakobson et al. (2012) evaluated the performance of several reanalysis products over the central Arctic Ocean. Using measurements from the Tara drifting ice station (Gascard et al., 2008; Vihma et al., 2008) and a ranking system, the authors demonstrate that ERA-I outperforms several other products, including R-2 and MERRA. These rankings were calculated through comparison of the analysis results from these products for fields including air temperature, specific humidity, relative humidity and wind speed.
Published by Copernicus Publications on behalf of the European Geosciences Union.
G. de Boer et al.: ASCOS model evaluation
Global climate models have similarly struggled with simulation of Arctic surface meteorology, clouds, and surface radiation. Walsh et al. (2002) demonstrated that while early climate models produced seasonal cloud cycles that were similar to observational estimates, the amount of cloud cover varied dramatically from one model to the next. As may be expected, these discrepancies led to major differences in radiation, with summertime differences of nearly a factor of three in the surface radiative budget. A broader look at models involved with the 3rd Coupled Model Intercomparison Project (CMIP3) by Svensson and Karlsson (2011) illustrated large variability among the models in their representation of net surface energy flux on an Arctic-wide scale (> 66.6° N) for present-day wintertime conditions. These discrepancies make deriving concrete results for the causes behind Arctic amplification challenging, particularly given the limited model validation efforts that have taken place. More recently, de Boer et al. (2012) performed an evaluation of 20th century simulations completed using the Community Climate System Model version 4 (CCSM4). They demonstrated that Arctic clouds, including their phase partitioning and liquid and ice water paths, were grossly misrepresented in CCSM4, with notable impacts on the surface energy budget. Interestingly, these errors appeared to have only small impacts on the simulated surface air temperature, with model cold biases on the order of 1-2 K when compared to ERA-40, which itself has been shown to have 1.5 K warm biases compared to International Arctic Buoy Programme/Polar Exchange at the Sea Surface (IABP/POLES) measurements (Liu et al., 2008). Similar cloud and radiation discrepancies were exposed in work by Inoue et al. (2006) and Tjernström et al. (2008) while evaluating regional climate model performance over the Surface Heat Budget of the Arctic Ocean (SHEBA) Experiment site.
In the current work, we evaluate some of the tools described above using measurements obtained during the Arctic Summer Cloud Ocean Study (ASCOS, Tjernström et al., 2013). The ASCOS data set provides us with a unique opportunity to assess the performance of models at high latitudes. Evaluations of basic surface meteorology and the surface energy budget, as represented in atmospheric reanalyses and global climate models, are completed using this data set. Because this is a multi-model global evaluation, with only limited variables and output formats available, averaging and interpolation of model results is necessary at times to ensure the fairest comparison possible. Not all variables are handled in the same way, and therefore these actions are outlined in the individual result sections as necessary. Section 2 provides a brief overview of ASCOS, Sect. 3 provides some background on the models, Sect. 4 outlines results of the evaluation, and finally Sect. 5 provides discussion of these results and a summary.

ASCOS
During August of 2008, the Swedish icebreaker Oden served as a drifting base camp for the Arctic Summer Cloud Ocean Study (ASCOS, Tjernström et al., 2013). The vessel headed north from Svalbard with a goal to spend as much time as possible in the central Arctic ice pack. Between 12 August and 2 September, Oden was moored to a 3 × 6 km ice floe just north of 87° N latitude (see Fig. 1). Much of the analysis performed in the current work was completed for this drifting time period. During this time, the expedition first experienced a ten-day period where surface air temperatures (T air,sfc) were near or slightly above the freezing point and low cloud cover was abundant. This was followed by a slightly colder period with T air,sfc between 266 and 272 K. Finally, starting around 29 August, temperatures fell further (below 263 K) and open water began to close with the initiation of the fall freeze-up of sea ice.
A wide variety of instrumentation was deployed during ASCOS in order to comprehensively sample the atmosphere and surface. Included were a series of in situ instruments measuring properties such as air temperature, air pressure, relative humidity, and wind speed and direction. This included sensors at the surface as well as sensors used for profiling, with profiles obtained four times daily via radiosonde. In addition to in situ instrumentation, surface-based remote sensors were deployed to measure clouds and radiation. These included a 35 GHz millimeter cloud radar (MMCR, Moran et al., 1998), ceilometers, a scanning 60 GHz radiometer, a dual wavelength (24/31 GHz) microwave radiometer (MWR, Westwater et al., 2001), an S-band cloud and precipitation radar, a 449 MHz wind profiler, a suite of broadband radiometers measuring surface short- and longwave radiation flux densities, and turbulence masts used to estimate turbulent energy fluxes. A description of the surface energy budget measurements, along with error estimates, can be found in Sedlar et al. (2011).
When combined, this suite of instruments provides an in-depth view of atmospheric processes. In addition to derivation of cloud macrophysical properties (height, thickness, fractional coverage, etc.), information on cloud phase is obtainable via methods described in Shupe (2007). Additionally, estimates of cloud liquid water path (LWP) are obtained from the MWR, with an estimated error of approximately 25 g m⁻² (Westwater et al., 2001).
Unlike several other large measurement campaigns over the Arctic Ocean, radiosonde measurements from ASCOS were not submitted to the Global Telecommunications System (GTS, Birch et al., 2012), meaning that they were not included in the data sets used for model initialization and data assimilation. Conversely, 6-hourly surface-based wind and pressure observations from the Oden were submitted to the GTS. While the one-month observation period does not provide us with a long-term record against which to evaluate our models, this data set does represent one of the longest-lasting, comprehensive sets of high-Arctic measurements not included in the GTS, to our knowledge. Therefore, ASCOS measurements represent an independent data set that can be used to directly evaluate reanalysis model performance.

ERA-Interim
The European Centre for Medium-Range Weather Forecasts (ECMWF) has released multiple reanalysis products. Here, we evaluate the performance of the most recent version, the ERA-Interim (hereafter ERA-I, Dee et al., 2011). ERA-I provides global analyses of atmospheric and surface state variables from 1989 to the present. It builds upon the successful ERA-40 (Uppala et al., 2005) product, which covered the years from 1957 to 2002, with notable differences. ERA-I extends the number of atmospheric pressure levels archived from 23 to 37, is run at a higher resolution (T255, approximately 0.7°), and includes a number of additional cloud and state variables in its output. Advances in data assimilation techniques include the introduction of a 12-hourly 4-D variational (4-D-VAR) scheme, improved formulation of background error constraint, improved humidity analysis, improved model physics, and improved quality control, to name a few. Observationally, ERA-I utilizes all of the observational data sets from the ERA-40 project, and adds altimeter wave height information, winds and clear sky radiances from EUMETSAT, ozone profiles and radio occultation measurements. Boundary forcing fields come from a combination of ERA-40 reanalysis output (before 2001) and the ECMWF operational analysis (after 2001).

NCEP-NCAR (R-1)
The National Centers for Environmental Prediction (NCEP) teamed up with the National Center for Atmospheric Research (NCAR) to produce a reanalysis product to support research and climate monitoring communities (NCEP-NCAR, hereafter R-1, Kalnay et al., 1996). Originally planned to cover the 40 yr period from 1957 to 1996, the project was later expanded to include years up to present day. The underpinning model's horizontal resolution (T62, approximately 1.9°) is significantly lower than that of ERA-I, as is the number of vertical levels (28 for R-1). This results in a lower-atmospheric vertical resolution ranging from 80 m (lowest grid box) to 384 m (7th grid box at 850 mb). Examples of assimilated data sets include a variety of satellite-based measurements, radiosondes from the GTS, sea ice characteristics from the ECMWF, and surface ocean data. R-1 utilizes a 3-D variational (3-D-VAR) analysis scheme known as a spectral statistical interpolation. Unlike for ERA-I, forecast products are not archived for R-1 beyond the 6 h forecast time. These 6 h forecast fields are provided in addition to analysis-time values, which are available for a limited set of variables.

NCEP-DOE (R-2)
To improve upon R-1, the NCEP-DOE reanalysis (hereafter R-2) was produced (Kanamitsu et al., 2002). R-2 utilizes the same spatial and temporal resolution as R-1 (T62, 28 levels, 6 h), and makes use of similar raw observational data sets. Differences in the data sets include the removal of data from the Special Sensor Microwave Imager (SSM/I), and the addition of limited additional data after 1993. Additionally, errors pertaining to bogus data in the Southern Hemisphere, snow cover analysis, humidity diffusion, oceanic albedo, relative humidity discontinuities and snowmelt were fixed for the R-2 product. New system components in R-2 include rainfall assimilation over land to improve surface soil moisture, a smoothed orography and an updated treatment of snow. Other differences result from improvements to model physics made between the creation of R-1 and R-2. These include a new planetary boundary layer scheme, new shortwave radiation scheme, a retuned convective parameterization, improved cloud-top radiative cooling, updated cloud-tuning coefficients and further improvements to the radiation scheme. As with R-1, only 6 h forecasts are distributed, along with limited analysis fields.

CAM5
Another set of simulations evaluated in this study were completed using a recent version of the Community Atmosphere Model (CAM5.1, Neale et al., 2010). In order to recreate conditions from the ASCOS period, these simulations were completed using the CCPP (Climate Change Prediction Program)-ARM (Atmospheric Radiation Measurement) Parameterizations Testbed (CAPT, Phillips et al., 2004). CAPT utilizes operational analyses from numerical weather prediction centers to initialize CAM5 and produce short-term forecasts. In this instance, the European Centre for Medium-Range Weather Forecasts (ECMWF) Year of Tropical Convection (YOTC) analysis was used to initialize forecasts within CAPT. The analysis data are interpolated from the finer-resolution analysis grid of 0.150° and 91 levels to the CAM5 grids using procedures outlined in Boyle et al. (2005). These procedures use a slightly different interpolation approach for each of the dynamic state variables (i.e., horizontal winds, temperature, specific humidity and surface pressure), along with careful adjustments to account for the difference in representation of the earth's topography between models. A series of 6-day hindcasts were initialized every day at 00:00 UTC from the ECMWF analysis for the entire YOTC period from 1 May 2008 to 30 April 2010. Only the atmospheric winds, temperature and moisture are initialized, while the rest of the initial variables (land and atmosphere) come from an additional ECMWF-nudged run of the same model. Skin surface temperature and sea ice are prescribed using the NOAA optimum interpolation (OI) sea surface temperature (SST) V2. These data are weekly means on a 1° × 1° grid and are interpolated in time from weekly to the model time step. Since the model has a spin-up period to adjust to ECMWF conditions, the ASCOS time series are created by concatenating hours 24-48 from each hindcast.
CAM5 was run with the finite volume dynamic core at a resolution of 0.90° × 1.25° in the horizontal and utilizes 30 vertical levels. This version of CAM contains a range of significant enhancements and improvements in the representation of physical processes. Except for the deep convection scheme, most other physical parameterizations have been updated from CAM4 to CAM5 (Neale et al., 2010). In addition, a three-mode modal aerosol scheme (MAM3) has been implemented in CAM5 to provide internally mixed representations of number concentrations and mass for Aitken, accumulation and coarse aerosol modes (Liu et al., 2012). These major physics enhancements permit new research capability for assessing the impact of aerosol on cloud properties. In particular, they provide a physically based estimate of the impact of anthropogenic aerosol emissions on the radiative forcing of climate by clouds.
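The construction of a continuous time series from hours 24-48 of successive daily hindcasts can be illustrated with a minimal Python sketch. This is not the CAPT code. The function name `concatenate_hindcasts`, the hourly output interval, the dictionary layout, and the choice to keep lead hours 24 through 47 (so that consecutive hindcasts do not duplicate a valid time) are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def concatenate_hindcasts(hindcasts, start_h=24, end_h=48):
    """Stitch daily hindcasts into one series of (valid_time, value) pairs.

    hindcasts maps each initialization time to a list of hourly values,
    where index k holds the value at a lead time of k hours (assumed layout).
    Lead hours start_h .. end_h - 1 are kept from every hindcast.
    """
    series = []
    for init in sorted(hindcasts):
        values = hindcasts[init]
        for lead in range(start_h, end_h):
            series.append((init + timedelta(hours=lead), values[lead]))
    return series


# Hypothetical input: three 6-day hindcasts started at 00:00 UTC on
# consecutive days, each storing its own lead time as a placeholder value.
inits = [datetime(2008, 8, 11) + timedelta(days=d) for d in range(3)]
hindcasts = {init: [float(k) for k in range(6 * 24)] for init in inits}

series = concatenate_hindcasts(hindcasts)
print(series[0][0], series[-1][0], len(series))
```

Discarding the first 24 h of each hindcast is what removes the spin-up period. Taking a half-open lead-time window keeps the stitched series strictly hourly across hindcast boundaries.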
We also analyze simulations from a development version of CAM5 that includes several changes expected to improve model clouds (CAM5-PF). In particular, this new simulation includes a new parameterization for cloud macrophysics (which combines stratiform liquid cloud fraction, condensation, and evaporation) based on a truncated Gaussian PDF for sub-grid variability in saturation excess (defined as total water mixing ratio minus ice mixing ratio minus liquid saturation mixing ratio). This truncated Gaussian also replaces the gamma sub-grid scale distribution previously used for microphysics. This improves inter-process consistency and slightly reduces microphysical depletion rates. Further improvements include sub-stepping macro- and microphysics to improve coupling between condensational growth and microphysical erosion of cloud and fixing an inconsistency between the liquid water content and droplet number used by microphysics. The latter change distributes liquid across more droplets (further reducing microphysical depletion), while the former keeps parameterizations from bouncing liquid between physically unrealistic states. These collective changes are expected to increase high-latitude cloud fraction and liquid water path, while handling microphysical depletion more appropriately.
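For reference, the verbal definition of saturation excess given above can be written compactly. The symbols are introduced here for illustration only: s for saturation excess, q_t for total water mixing ratio, q_i for ice mixing ratio, q_{s,l} for liquid saturation mixing ratio, and P(s) for the sub-grid PDF. The second relation, which diagnoses liquid cloud fraction C as the probability of positive saturation excess, is a common choice in PDF-based schemes and is an assumption here, not something stated in the text.

```latex
s = q_t - q_i - q_{s,l},
\qquad
C = \int_{0}^{\infty} P(s)\,\mathrm{d}s
```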

NASA GISS-ModelE2
GISS (Goddard Institute for Space Studies)-ModelE2 simulations were completed using an updated version of the NASA GISS GCM ModelE2 specifically developed for the 5th IPCC (Intergovernmental Panel on Climate Change) assessments (CMIP5). The CMIP5 version of the GISS-ModelE2 is improved in a number of respects over that used for CMIP3 and described in Schmidt et al. (2006) and Hansen et al. (2007). Firstly, the model has a higher horizontal and vertical resolution (2° latitude × 2.5° longitude, 40 layers). The vertical layers are distributed on a non-uniform grid, with spacing of roughly 25 mb (250 m) from the surface to 850 mb, and roughly 40-50 mb (400-700 m) from 850 to 415 mb. Secondly, various physics components have been upgraded from the CMIP3 version, namely the convection scheme, stratiform cloud scheme and gravity wave drag. The simulations discussed here further include the aerosol microphysics scheme MATRIX (Bauer et al., 2008).
For this work the model is run continuously for 2008, covering the ASCOS campaign described above. In order to force representative meteorology in the GCM, the model uses prescribed sea surface temperatures and sea ice, and the horizontal wind components of the model are nudged towards the R-2 reanalysis. R-2 winds are available on a 6-hourly time step and are linearly interpolated to the model's 30 min time step. It is important to recognize that this nudging may introduce systematic errors in these runs tied to any biases present in the analysis used for the nudging. However, it is believed that because the nudging only impacts horizontal winds, most biases are connected to the physics of the GISS-ModelE2 itself. The aerosol scheme uses the CMIP5 emissions by Lamarque et al. (2010). This setup has previously been used by Bauer and Menon (2012) and de Boer et al. (2013) to evaluate forecast-mode simulations using ModelE.
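The linear interpolation of 6-hourly winds to a 30 min time step can be sketched as follows. This is only an illustration of the time interpolation for a single wind component at one grid point. The function name `interpolate_to_steps` and its arguments are hypothetical, and the sketch does not attempt to reproduce how ModelE2 applies the nudging itself.

```python
def interpolate_to_steps(u_start, u_end, interval_h=6.0, step_min=30.0):
    """Linearly interpolate between two analysis values.

    Returns the values at every model step from the first analysis time
    to the next one, both end points included.
    """
    n_steps = int(round(interval_h * 60.0 / step_min))
    return [u_start + (u_end - u_start) * k / n_steps for k in range(n_steps + 1)]


# Hypothetical zonal wind (m/s) at two consecutive 6-hourly analysis times.
targets = interpolate_to_steps(2.0, 8.0)
print(len(targets), targets[:3])
```

With the default arguments there are 12 model steps per analysis interval, so each nudging target moves one twelfth of the way from one analysis value to the next.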

Notes on sampling
A major consideration in evaluations such as in this study is how to best analyze available measurements to appropriately represent the scales inherent to model grid boxes (McComiskey and Feingold, 2012). While in situ and surface-based observations such as those obtained during ASCOS have the potential to capture process-level relationships, they do not necessarily capture the spatial variability included within a model grid box without alteration of the measurements. A simple approach is aggregation (averaging) of data over timescales that begin to capture the spatial variability in the model grid. Naively, it may be assumed that it would be appropriate to average over a time period that covers the full scale of the grid box, assuming some advective velocity (e.g., 10 m s⁻¹). At 2°, this requires averaging of periods on the order of 6-7 h. Using this technique is not practical, however, as it blurs the evolution of quantities occurring within the diurnal cycle. An alternative approach entails averaging over shorter periods (e.g., one hour) in order to capture some of the sub-grid scale variability in the measurements, while maintaining signals inherent in an evolving atmosphere. This short aggregation timescale is appropriate for time periods featuring consistent large-scale meteorological conditions and a relatively homogeneous surface, but may fail during frontal passages or at coastal sites. In this work, comparison is additionally complicated by the way in which parameters are presented within the reanalysis products. For example, although most variables are provided on a 6-hourly timescale, some (e.g., liquid water path, cloud fraction) represent instantaneous values for that time (but still averaged across the model grid box), whereas others (e.g., precipitation, radiation) are provided as average values over the 6 h period. Complicating things further, the observational data sets used for this study represent different timescales as well, with most quantities aggregated into hourly averages and others represented as 3-hourly averages. In this evaluation, we have done our best to be consistent in our comparisons. For variables represented in the reanalyses as averages, we have taken the time period over which this average is taken and computed averages from the measurements and GCM output as well. Likewise, for instantaneous values, we have interpolated the measurements and GCM results to most closely match the time of the reanalysis data set. The (at least) hourly averaging used on the measurements ensures that some spatial variability is accounted for in those data sets, even when "instantaneous" evaluations are done, since the highest temporal resolution used is a one-hour average. Further details are provided in the individual sections in Sect. 5.
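The 6-7 h figure quoted for a 2° grid box follows from dividing the box width by the advective velocity. A minimal sketch of that arithmetic is given below, assuming a spherical Earth with a mean radius of 6371 km and measuring the box width along a meridian (so the narrowing of grid boxes toward the pole is ignored). The function name `grid_box_transit_hours` is hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius of a spherical Earth


def grid_box_transit_hours(grid_spacing_deg, wind_speed_ms=10.0):
    """Hours needed for air moving at wind_speed_ms to cross one grid box."""
    km_per_degree = 2.0 * math.pi * EARTH_RADIUS_KM / 360.0
    distance_m = grid_spacing_deg * km_per_degree * 1000.0
    return distance_m / wind_speed_ms / 3600.0


print(round(grid_box_transit_hours(2.0), 1))  # 6.2
```

At an advective velocity of 10 m s⁻¹ this gives roughly 6.2 h for a 2° box, which is consistent with the range quoted in the text and shows why such a window would smear much of the diurnal cycle.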

Surface meteorology
Figure 2 gives an overview of the performance of different modeling products in simulating basic surface meteorological quantities for the time that Oden was drifting with the pack ice north of 87° N latitude. Included are (from top to bottom) 2 m air temperature (T air,sfc, K), surface air pressure (P air,sfc, mb), 2 m air relative humidity (RH air,sfc, %), the 10 m zonal wind component (U, m s⁻¹) and the 10 m meridional wind component (V, m s⁻¹). For all variables, the model results have been linearly interpolated in space to the exact location of the Oden at a given time, reducing (but not eliminating) the influence of resolution that would result from a nearest grid cell comparison, and eliminating the influence of jumping between grid boxes with the movement of Oden during the campaign. The time series represented in Fig. 2 provide a comparison between the observations, reanalyses and GCMs. Biases are calculated and presented as distributions in the right-hand panels of Fig. 2, with black circles representing the median difference, the box representing the interquartile range (IQR) and the whiskers representing the extent of the 5th and 95th percentiles of the biases. Additionally, values for median biases and correlations for all evaluated variables are presented in Table 1. For the variables compared here, the reanalyses provide instantaneous values at 6-hourly intervals. In order to most closely compare these quantities between the GCMs (which provide 1 h averages and 3-hourly instantaneous output for CAM5 and GISS, respectively), the observations (which are provided as one-minute averages) and the reanalyses, observations are averaged on a one-hour scale to account for sub-grid variability within the models, and then interpolated, along with the GCM output, to the 6-hourly reanalysis times.
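The time alignment described above (one-minute observations averaged to hourly values, then interpolated to the 6-hourly reanalysis times) can be sketched in Python as follows. This is an illustration under stated assumptions, not the processing code used for Fig. 2. The function names, the use of whole-hour bins, the choice to time-stamp each bin with the mean time of its samples, and the clamping at the ends of the record are all assumptions, and the input data are synthetic.

```python
import math
from bisect import bisect_right
from statistics import fmean


def hourly_means(times_h, values):
    """Average samples into whole-hour bins.

    Returns (bin_times, bin_means); each bin time is the mean time of the
    samples that fell into that bin.
    """
    bins = {}
    for t, v in zip(times_h, values):
        bins.setdefault(math.floor(t), []).append((t, v))
    hours = sorted(bins)
    bin_times = [fmean(t for t, _ in bins[h]) for h in hours]
    bin_means = [fmean(v for _, v in bins[h]) for h in hours]
    return bin_times, bin_means


def interp_linear(t, times, values):
    """Linearly interpolate (times, values) to time t; times must ascend.

    Requests outside the record return the nearest end value.
    """
    if t <= times[0]:
        return values[0]
    if t >= times[-1]:
        return values[-1]
    i = bisect_right(times, t)
    weight = (t - times[i - 1]) / (times[i] - times[i - 1])
    return (1.0 - weight) * values[i - 1] + weight * values[i]


# Synthetic one-minute record: temperature (K) rising by 0.1 K per hour.
obs_times = [k / 60.0 for k in range(24 * 60)]
obs_temps = [270.0 + 0.1 * t for t in obs_times]

hour_t, hour_temp = hourly_means(obs_times, obs_temps)
reanalysis_times = [6.0, 12.0, 18.0]
aligned = [interp_linear(t, hour_t, hour_temp) for t in reanalysis_times]
print([round(x, 2) for x in aligned])
```

The same `interp_linear` step would be applied to the GCM output so that all products are compared at identical times.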
Looking first at T air,sfc (Fig. 2a), it is apparent that there is quite a bit of variability from one product to the next, and that there does not appear to be a clear "best" model in terms of agreement with observations. Statistically, R-2 outperforms other models in terms of the median bias, though it is clear that this bias is derived from a distribution with significant variability. ERA-I is correlated much more closely with the observations (correlation of 0.82) than the other models are. As noted above, the first portion of this period (roughly 12-17 August) features T air,sfc near the freezing point. This is captured by the models, with most within 1-2 K of the observations. There is a short period of cooler temperatures on 16 August, which only the CAM5 models appear to simulate well. Unlike the original CAM5 version, CAM5-PF T air,sfc does not recover back to the freezing point as quickly as in the observations, taking nearly two days to recover. Beyond 17 August, the observations, ERA-I and R-1 remain in this near-freezing point state for another four days. The ModelE2 has a cooler period from 17-19 August, while CAM5, CAM5-PF, and R-2 all cool off quickly on 18 August. Both versions of CAM5 cool to temperatures around 265 K by 20 August, which is cooler than the observed T air,sfc decrease to 267 K on 21 August. R-2 captures this cool period most successfully, with ERA-I, R-1 and ModelE2 all remaining too warm. The time period from 24 August until 31 August is an interesting one in that the observations, along with R-1, ModelE2 and ERA-I, all feature temperatures that are only slightly below the freezing point, while CAM5 and R-2 generally have much colder (around 265 K) temperatures and CAM5-PF oscillates between these states. Physical mechanisms driving these differences will be explored later in the paper.
All of the model products generally perform well in simulating synoptic scale weather phenomena. Figure 2b illustrates a comparison of P air,sfc at Oden between the different products. With the exception of R-1 and R-2, differences from the observations are generally small. All products capture the general pattern, with P air,sfc generally increasing between 13 August and 2 September. Looking at the right-hand panel of Fig. 2b, the R-2 6 h forecast demonstrates a clear high bias in P air,sfc (4.93 mb), while the R-1 6 h forecast demonstrates a low bias (−3.01 mb). Biases at all forecast scales for ERA-I, GISS ModelE2 and CAM5 are generally at 1 mb or less, which should not be surprising since surface pressure measured aboard Oden was submitted to the GTS. One interesting point to note is that CAM5-PF does feature larger variability in the P air,sfc errors than CAM5. With P air,sfc generally represented well, it is not surprising that simulation of 10 m winds was also quite good.
Both the zonal and meridional winds (Fig. 2d, e) produced by CAM5, GISS ModelE2 and reanalysis products follow the observed winds closely. The only clear exceptions to this are the CAM5-PF errors between 17 August and 20 August and 24 August to 30 August, which correspond with P air,sfc biases in that model. GISS ModelE2 meridional winds also are biased high, likely the result of forcing from R-2, which demonstrates a similar high bias. The right-hand panels of Fig. 2d, e illustrate that, with the exception of CAM5-PF and GISS ModelE2 meridional winds, differences between simulated and observed winds were generally around 1 m s⁻¹ or smaller, and that mean differences fall very close to the zero line. ERA-I featured the highest correlations in both the zonal and meridional winds (0.92 and 0.91, respectively), and also was very well correlated to the surface pressure (0.99).
Finally, looking at the simulation of RH air,sfc (Fig. 2c), the models produce values that are at times very different from observed values. ERA-I produces RH air,sfc values near 100 % for the entire observation period. There are periods of time where this is the correct solution, though the observations feature significantly more variability than ERA-I does. R-1 tends to be drier than the observations, particularly during the period from 21-24 August where the model has RH air,sfc values between 75-90 % and observations hover between 95-100 %. R-2 features a similar drop in RH air,sfc, though it is not as large, and, unlike R-1, is close to observations during the period from 24 August to 2 September. CAM5 and CAM5-PF both feature less variability than the observations, but feature values that fall in the middle of the observed variability. Ultimately, this works to reduce the median bias for these models (Fig. 2c, right-hand panel), and results in some of the lowest biases of all of the models, but with poor correlation to the observations (Table 1).
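The summary statistics referred to in this section (median bias, interquartile range, 5th and 95th percentiles of the differences, and correlation with the observations) could be computed along the following lines. This is a minimal sketch using only the Python standard library. The function names, the "inclusive" percentile definition, and the use of the Pearson correlation coefficient are assumptions, since the text does not specify them, and the input data are synthetic.

```python
import math
from statistics import fmean, median, quantiles


def bias_summary(model, obs):
    """Summarize the distribution of model-minus-observation differences."""
    diffs = [m - o for m, o in zip(model, obs)]
    # quantiles(..., n=100) returns 99 cut points; pct[k - 1] is the k-th percentile.
    pct = quantiles(diffs, n=100, method="inclusive")
    return {
        "median": median(diffs),
        "iqr": (pct[24], pct[74]),
        "whiskers": (pct[4], pct[94]),
    }


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    mean_x, mean_y = fmean(x), fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Synthetic series already aligned in time (placeholder values, no units implied).
obs = [float(k) for k in range(101)]
model = [2.0 * o - 50.0 for o in obs]

summary = bias_summary(model, obs)
r = pearson_r(model, obs)
print(summary, round(r, 3))
```

The synthetic example also illustrates the point made for relative humidity from the opposite direction: the median bias and the correlation measure different things, so a product can score well on one and poorly on the other.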

Clouds and precipitation
ASCOS was in general very cloudy, a fact reflected in the observed cloud fractions (CF) in Fig. 3a. Three-hourly averaged values were at 100 % for most of the campaign, with only a few time periods in early September where values dropped significantly. As with surface meteorology, the models and observations provide different quantities for reporting cloud fraction. ERA-I, R-1 and R-2 all provide an instantaneous value every six hours, while GISS ModelE2 provides a 3-hourly instantaneous value. The observations are provided as a 3-hourly average, while CAM5 provides 1-hourly averages. In order to attempt to make a more consistent comparison, CAM5, GISS and observational values are interpolated to the reanalysis times without averaging further. While CF is only a rough means for evaluating model performance in cloud simulation, the model results do not compare favorably with the observations. ERA-I featured CFs that were closest to observations, with only a couple of periods where the CF fell below 100 %. The fact that ERA-I only reports instantaneous values at the 6-hourly times, while the other models (and observations) report 6-hourly averages, does make it likely that some differences will occur. During times where ERA-I CF did stray from 100 %, it was not by much, falling only to values around 70-80 %. Towards the end of the measurement period, ERA-I retains 100 % CF, while the observations feature less cloud. In comparison, the other models generally feature less cloud than both the observations and ERA-I. Both R-1 and R-2 consistently produce far too few clouds. This is particularly true for the period between 26-29 August, where low, stratiform clouds persisted in the observations but both NCEP products have cloud fractions near 10 %. ModelE2 features significant variability in its representation of cloud fraction, spanning values between 0-100 %.
The two CAM5 versions both underpredict cloud cover, and while the original CAM5 version has fewer very low values, CAM5-PF performs better during the previously mentioned stratiform cloud period during the end of August.As will be explored in more detail in the following section, this increased cloud cover is generally responsible for CAM5-PF's warmer surface temperatures during this time period when compared with CAM5 (Fig. 2a).Biases for the GCMs are shown to be comparable, with all three models underestimating cloud fraction by roughly 20 %, though the timing of these differences becomes quite important in the overall impact of this bias on surface energy balance and meteorology.
Cloud fraction only provides us with a limited perspective on the performance of models in simulating clouds. Perhaps more important, at least from the point of view of surface radiation, is the amount of liquid water contained within these clouds. Unfortunately, not all of the models include the simulated liquid water path in the publicly available output. For those that do, ERA-I provides 6-hourly instantaneous values, GISS ModelE2 provides 3-hourly instantaneous values, CAM5 provides 1-hourly averages and the observations are available as 15 s averaged values. To improve the comparison, observational estimates were averaged on a 1-hourly basis to account for spatial variability within model grid boxes. These averages, along with the CAM5 1-hourly averages and GISS ModelE2 3-hourly instantaneous values, were interpolated to the times available for the reanalyses. Figure 3b presents the time series of liquid water path for the models that provide this quantity (ERA-I, CAM5, CAM5-PF and ModelE2) as well as the observations. The observations presented here represent a lower bound on LWP due to the need to correct data for time periods with liquid precipitation. Here, we replace the unrealistically high values reported during these times with the value reported directly before precipitation starts. This likely results in an observational underestimate of the LWP during these times. As with cloud fraction, ERA-I generally matches the observations closely. While differences still exist, ERA-I and GISS ModelE2 feature liquid water for most of the observation period. Both versions of CAM5 have multiple periods where very little liquid water exists. Improvements made for CAM5-PF present themselves clearly in the representation of LWP from 23-28 August, with CAM5-PF featuring liquid-containing clouds during this stratiform cloud period. As will be shown in the following section, this increase in LWP results in improved simulation of short- and longwave radiation during those dates as well, ultimately resulting in a more accurate depiction of the surface energy budget and near surface temperatures.
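The precipitation correction applied to the observed LWP can be sketched as a hold-last-value fill. The function name, the boolean precipitation flag, and the sample values are hypothetical; how precipitation periods were identified in the actual data set is not described here.

```python
import numpy as np

def hold_lwp_during_precip(lwp, precip_flag):
    """Replace LWP during flagged precipitation with the last value before it began."""
    lwp = np.asarray(lwp, dtype=float).copy()
    flag = np.asarray(precip_flag, dtype=bool)
    last_dry = np.nan  # stays NaN if the record starts during precipitation
    for i in range(lwp.size):
        if flag[i]:
            lwp[i] = last_dry
        else:
            last_dry = lwp[i]
    return lwp

# Made-up LWP samples (g m-2); the two flagged samples are unrealistically high.
raw = [50.0, 60.0, 400.0, 500.0, 70.0]
flag = [False, False, True, True, False]
corrected = hold_lwp_during_precip(raw, flag)
```

Because precipitating clouds tend to hold more liquid water than the cloud present just before precipitation starts, a fill of this kind is consistent with treating the corrected series as a lower bound.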
Precipitation rates are in general challenging to model and observe.Precipitation during the period of observation for this quantity was generally very light, with rates between 0-0.2 mm h −1 .With the exception of GISS ModelE2, which provides 3-hourly instantaneous precipitation rates, all of the other models provide averaged quantities.All of the reanalyses provide 6-hourly mean precipitation, while the observations and CAM5 are both available as 1-hourly means.In order to bring these estimates closer together, the observations and the GCMs are averaged over the same 6-hourly period for which averages are provided in the reanalyses.The models fail, in general, to reproduce the more significant precipitation events observed (e.g., 17 August, 20 August, 23 August, 29 August).There does not appear to be a clearly superior model in terms of reproducing these light precipitation events.The right-hand panels of Fig. 3c demonstrate the challenges models appear to have with accurate representation of precipitation.While all of the mean biases are relatively small, the variability in the biases is quite large.This indicates not only the inability of models to correctly simulate the magnitude of the precipitation, but also the issues occurring with respect to correctly timing the precipitation, as demonstrated by the relatively low (and statistically insignificant) correlations in Table 1.Most of the models feature light precipitation throughout the observation period, which is greater than the observed precipitation for much of that time.Additionally, some of the models (R-1 on 18 August; CAM5-PF on 23 and 25 August) produce more significant precipitation events during times where very little precipitation was observed.
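A minimal sketch of the 6-hourly averaging applied to the hourly series follows. It assumes the hourly record starts at the beginning of a reanalysis averaging window and ignores missing values within a window; both are illustrative choices, and the sample rates are made up.

```python
import numpy as np

def six_hourly_means(hourly):
    """Average consecutive blocks of six hourly values; a trailing partial block is dropped."""
    hourly = np.asarray(hourly, dtype=float)
    n = (hourly.size // 6) * 6
    return np.nanmean(hourly[:n].reshape(-1, 6), axis=1)

# Hypothetical hourly precipitation rates (mm h-1) over 12 hours.
precip_hourly = [0.0, 0.0, 0.2, 0.1, 0.0, 0.0,
                 0.0, 0.0, 0.0, 0.0, 0.0, 0.6]
precip_6h = six_hourly_means(precip_hourly)
```

Averaging over six hours smooths short events, so a brief observed event and a slightly mistimed model event may fall into different windows, which contributes to low correlations.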

Surface energy budget
Figures 4 and 6 provide an overview of the surface energy budget terms as governed by the following equation:

$$Q_{SFC} = F_{LW} + F_{SW} - F_{SH} - F_{LH},$$

where $F_{LW}$ and $F_{SW}$ represent the net longwave and shortwave radiative fluxes, respectively, $F_{SH}$ represents the surface sensible heat flux, $F_{LH}$ represents the surface latent heat flux and $Q_{SFC}$ represents the residual flux, including subtracted conduction terms. The influence of the conduction terms on the overall energy budget can be significant, but because we are evaluating the atmosphere these terms are not discussed here. The signs of radiative terms are in line with their impact on the surface, with positive values acting to heat the surface and negative values acting to cool the surface, while the turbulent latent and sensible heat flux terms follow their traditionally applied sign convention, with negative values warming the surface and positive values cooling the surface. In all panels of Figs. 4 and 6 the ASCOS observations are presented in black, with various model results indicated in colored lines. For the surface energy budget terms, ERA-I, R-1 and R-2 all provide 6-hourly averaged values, while CAM5 and the observations provide 1-hourly averages and GISS ModelE2 provides 3-hourly instantaneous values.
To best compare these different values, the 1-hourly averages from CAM5 and the observations, as well as the 3-hourly instantaneous GISS ModelE2 values, are averaged over the 6 h period represented by the reanalyses. Looking first at surface shortwave radiation (Fig. 4a-c), there are substantial differences between the different models. ERA-I appears to be the only model that comes close to resembling observed values, with CAM5, R-1 and R-2 featuring excessive downwelling shortwave radiation and ModelE2 featuring too little downwelling shortwave radiation, particularly in the first part of the observational period. While some of this may be the result of differences in atmospheric chemical composition and the radiative transfer codes applied, this result should generally not be considered surprising if we look back at the cloud properties contained in the different models. R-1 and R-2, which generally featured the lowest CFs, also demonstrate the largest positive biases in downwelling shortwave radiation. CAM5 CFs, while closer to the observations than those of the NCEP products, were still low and also allow excessive solar radiation to reach the surface. CAM5-PF had improved cloud properties, and generally features better agreement with the observed downwelling shortwave radiation, though it does have larger variability than CAM5. ERA-I CF was generally comparable to that observed, and correspondingly, the downwelling shortwave is also comparable to that observed. While ERA-I LWP appears to be lower at times than observed, these differences occur during periods where the LWP is high enough for most sunlight reaching the surface to be diffuse anyway, resulting in reduced differences in the downwelling shortwave radiation. The GISS ModelE2, which generally slightly underestimates CF and LWP, features downwelling shortwave radiative flux densities that are generally biased slightly high. Upwelling shortwave radiation demonstrates a similar pattern, with ERA-I most closely resembling the observations, and most other models featuring excessive outgoing radiation. ModelE2 has too little upwelling shortwave at the surface, which should not be surprising considering its underestimate of incoming shortwave radiation. The upwelling shortwave radiative flux density is governed in part by the surface albedo produced in each of the models (Fig. 5a), a quantity that varies substantially from one model to the next. These differences can be explained by differences between how the model products handle snow and sea ice. For example, the large difference between R-1 and R-2 is attributed to differences in specified sea ice cover between the two products, which results in a 10 yr zonally-averaged difference at northern high latitudes that is roughly comparable to that detected for the ASCOS period (Kanamitsu et al., 2002). ERA-I does not allow for snow to collect on sea ice surfaces, and therefore precipitation does not directly impact surface albedo as it does in nature. Instead, surface albedo of sea ice is prescribed to vary seasonally, with monthly values based on estimates taken from Ebert and Curry (1993), with a bare ice value used for summer and dry snow values used for winter months. This climatological estimate is responsible for the gradual increase in albedo demonstrated to be present in ERA-I as the result of changes in local sea ice concentration as well as a seasonal shift to higher albedo values with time.

The smaller variability is a result of the impact of clouds and precipitation on the shortwave spectrum, which results in small variations in the broadband shortwave radiation at the surface. Because the albedo is sensitive to these changes in the spectrum of shortwave radiation, there appears to be a correlation between precipitation and the shortwave albedo (Fig. 5b). CAM5 albedos are in general the closest to observed values, with a transition from lower values at the beginning of the observational period to a value close to that of snow for the latter portion of the campaign. Albedo values over ice in CAM5 are temperature dependent, and it becomes obvious when comparing CAM5 and CAM5-PF that the influence of cloud cover on near surface temperatures impacts albedo dramatically, with near surface warming between 26-30 August resulting in lower CAM5-PF albedos during those dates. Interestingly, the CAM5 albedo appears to be negatively correlated to precipitation events, which implies that the model is producing rain rather than snow. Figure 5b illustrates this behavior for the CAM5, R-2 and GISS ModelE2 simulations, with the 6 h period directly after the precipitation in these models negatively correlated with precipitation. The negative correlation is not directly related to the precipitation, but both the GISS ModelE2 and CAM5 simulations produce rain for some of the precipitation events. GISS ModelE2 features significantly lower albedos than observed or produced by other models and also appears to have a weak negative correlation to precipitation events at the time of the precipitation event and the time periods directly thereafter. The surface albedo values used in the GISS ModelE2 simulation are based on observations for this time of year from SHEBA and appear to include a high melt pond fraction that acts to reduce the surface albedo. This underestimation of surface albedo exacerbates the high bias in downwelling shortwave radiation in the GISS ModelE2, resulting in too little solar radiation leaving the earth's surface. Interestingly, the net impact of the issues discussed above is relatively small due to the presence of compensating errors and the resulting general agreement in net solar radiation at the surface. F SW biases for most of the models are smaller than may be expected given the relatively large biases in the upwelling and downwelling terms (see box plots on right-hand side of Fig. 4), with median values of 5.33 (R-2), 7.11 (ERA-I), 9.21 (CAM5), and 9.47 (CAM5-PF) W m −2. The only major exceptions to this are R-1 and GISS ModelE2, which are shown to be biased high by 48.08 and 30.26 W m −2, respectively. This comparison demonstrates the extent to which surface albedo values can result in compensating errors that nearly cancel out in the net shortwave radiative flux density. Both R-1 and GISS ModelE2 have relatively low albedos, resulting in large positive imbalances in shortwave radiation, while R-2 and the two CAM5 versions have higher albedos, which help to cancel out their excessive surface downwelling radiation. ERA-I's albedo, which is also generally too low, helps to make up for the low bias in downwelling shortwave radiative flux density.
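The compensation between downwelling shortwave and albedo errors can be made concrete with a small sketch. It uses the standard relations albedo = upwelling / downwelling and net = downwelling × (1 − albedo); the flux and albedo values are invented for illustration and do not correspond to any model or observation in the study.

```python
def surface_albedo(sw_up, sw_down):
    """Broadband surface albedo from upwelling and downwelling shortwave flux."""
    return sw_up / sw_down

def net_shortwave(sw_down, albedo):
    """Net shortwave flux absorbed at the surface (positive heats the surface)."""
    return sw_down * (1.0 - albedo)

# Illustrative numbers (W m-2), not ASCOS data.
obs_net = net_shortwave(sw_down=100.0, albedo=0.80)    # reference case
model_net = net_shortwave(sw_down=130.0, albedo=0.85)  # too much sunlight, brighter surface
net_bias = model_net - obs_net
```

In this example a 30 W m −2 excess in downwelling shortwave is almost entirely offset by a higher albedo, so the net shortwave bias is well under 1 W m −2 even though both inputs are wrong.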
In the Arctic environment, longwave radiation is a crucial contributor to the surface energy budget due to the reduced influence of solar radiation during all but summer months.Figure 4d-f illustrates the models' performance in simulating surface longwave radiation.Upwelling longwave radiation is governed primarily by surface temperature, and errors in this quantity are generally the result of problems with the lower boundary condition.Conversely, surface downwelling longwave radiation is governed by atmospheric temperature structure and atmospheric optical depth.The CAM5 and CAM5-PF simulations are demonstrated to have outgoing surface longwave radiation values that are slightly higher (less negative) than observed.This would imply a colder surface, possibly a result of having elevated sea ice concentrations.Improvement in the representation of clouds results in a warming of the near-surface environment in CAM5-PF for the second half of the observation period, which in turn results in more upwelling longwave radiation (as well as more downwelling longwave radiation) at the surface when compared to the CAM5 simulation.The insufficient surface downwelling longwave radiation in CAM5 is in part responsible for temperature biases for the same time period demonstrated in Fig. 2. 
The other models generally produce outgoing surface longwave values that are closer to observations.ModelE2, while performing respectably with regard to outgoing longwave radiation, performs rather poorly in its prediction of downwelling radiation, implying that while surface temperatures are close to what they should be, atmospheric temperature structure, clouds and/or aerosols are poorly represented.Both R-1 and R-2 trend towards underprediction (not negative enough) of incoming and outgoing longwave radiation during the second half of the observation period.This, along with underestimation of downwelling longwave radiation in CAM5 and ModelE2, is at least partially a result of errors in the cloud fields, as illustrated in Fig. 8. ERA-I, with the best estimates of LWP behind GISS ModelE2, performs best in simulating net surface longwave radiation, but does not capture the reductions in downwelling longwave during 21-24 August and after 30 August.Both versions of CAM5 feature lower LWP, with more values falling below the 30 g m −2 boundary between black-and gray-body clouds (Shupe and Intrieri, 2004).The reduced emissivity associated with these thinner clouds results in a significant reduction in surface LW NET , as shown in Fig. 8a.
Looking at the net surface longwave radiation, ERA-I easily outperforms other models with a median bias of −1.43 W m −2. All other models are shown to radiate excessively to the atmosphere, with median biases of −54.58 (R-1), −31.07 (R-2), −34.12 (CAM5), −15.50 (CAM5-PF) and −19.30 (ModelE2) W m −2. As discussed above, these biases are in large part due to problems with simulated cloud cover. CAM5-PF shows significant improvement in the simulation of longwave radiation when compared with CAM5, mainly as a result of improved simulation of liquid-containing stratiform clouds during the second half of the observation period. One of the more confusing results comes from GISS ModelE2, which appears to closely match the observations in both LWP and CF (Fig. 8b), yet underrepresents the surface net longwave radiation substantially. This is likely a result of clouds occurring at the wrong altitude, and therefore emitting at the wrong (colder) temperature.
Sensible heat fluxes (Fig. 6a) are generally small compared to the radiative terms. These fluxes represent the turbulent transfer of heat across temperature gradients near the surface, and are directed upward when temperature decreases with height at a rate exceeding the adiabatic lapse rate. ERA-I outperforms the other models with a small median bias of −0.85 W m −2. GISS ModelE2 is the next best performer, with a median bias of 2.60 W m −2. R-1, CAM5 and CAM5-PF all have median biases between 5.8-6.4 W m −2, while R-2 has the largest median error (17.84 W m −2). Given the limited amount of information available, it is challenging to say why exactly R-2 features large biases compared to other models. R-2 near-surface air temperature biases (Fig. 2a) are often smaller than those of other models, and based on upwelling longwave radiation biases, R-2 surface temperatures are not necessarily worse than those for CAM5. R-2 surface winds are perhaps a little bit too high, but the biases here are again not out of line with those detected for CAM5 and GISS ModelE2, for example. This may leave the parameterization of the bulk transfer coefficient as a potential culprit, but unfortunately the models do not provide the detailed output required to correctly diagnose where the issues lie beyond what is discussed above.
Latent heat fluxes, representing heat transfer through turbulent transfer of moisture across specific humidity gradients, are illustrated in Fig. 6d.Again, as with sensible heat, values are generally much smaller than with the radiative terms.The CAM5 simulation generally agrees most closely with observed values, with a median bias of −0.37 W m −2 .R-2 also performs well, with a median bias of −0.47 W m −2 .These are followed by CAM5-PF (median bias −1.61 W m −2 ), ERA-I (2.21 W m −2 ), ModelE2 (6.28 W m −2 ) and R-1 (−7.47 W m −2 ).ERA-I and ModelE2 are the only models that demonstrate positive median biases, likely indicating a smaller decrease of specific humidity with height in the models' near surface atmosphere than was observed.The relative change in performance between the sensible and latent heat components appears to indicate that the biases result from factors beyond those in common between the two terms, such as near surface winds.
Somewhat incredibly, R-1, which was illustrated to be among the worst performing models for many of the individual terms discussed above, seems to most closely portray the net Q SFC term with a bias of only 2.97 W m −2. ERA-I and GISS ModelE2 are the only other products that demonstrate a positive bias in Q SFC (7.24 and 22.38 W m −2, respectively). The other models all excessively lose heat at the surface, with R-2 featuring a median bias of −11.68 W m −2, CAM5 a bias of −18.62 W m −2 and CAM5-PF a bias of −7.57 W m −2. A large portion of these biases is demonstrated to come from longwave radiative biases, with most of those being the result of the downwelling longwave. This illustrates the important role that cloud physics play in regulating this vital component of the climate system. Additionally, this result demonstrates clearly the influence of compensating errors on overall evaluations, and that in general, care must be taken to incorporate all of the budget terms in model evaluations, rather than simply looking at the net product term. If we instead look at the distribution of the absolute values of biases of the budget terms, only including the individual components (SW up, SW down, LW up, LW down, F SH, F LH) (Fig. 7), we see a very different representation of model performance. Here, ERA-I outperforms the other models with significantly smaller biases in the budget terms, with GISS ModelE2 a close second. CAM5-PF demonstrates significant improvement over CAM5, with R-1 and R-2 falling in between these two. This figure can be thought of as a step towards diagnosing the models' abilities to get the right answer for the right reasons, rather than simply getting the right answer.
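The contrast between a small net bias and large component biases can be sketched as follows. The residual is written in a form consistent with the sign conventions stated for this section (radiative terms positive when heating the surface, turbulent terms positive when cooling it); that form, the function names, and the sample biases are assumptions for illustration. The sketch also uses four net terms rather than the six individual components shown in Fig. 7.

```python
def surface_residual(f_sw, f_lw, f_sh, f_lh):
    """Residual surface flux under the assumed sign convention (W m-2)."""
    return f_sw + f_lw - f_sh - f_lh

def net_and_mean_absolute_bias(term_biases):
    """Return (bias in the residual, mean absolute bias of the individual terms)."""
    net = surface_residual(**term_biases)
    mean_abs = sum(abs(v) for v in term_biases.values()) / len(term_biases)
    return net, mean_abs

# Invented mean biases (W m-2) for a hypothetical model with compensating errors.
biases = {"f_sw": 30.0, "f_lw": -28.0, "f_sh": 1.0, "f_lh": -2.0}
net_bias, mean_abs_bias = net_and_mean_absolute_bias(biases)
```

Here the residual bias is only 3 W m −2 while the mean absolute term bias is about 15 W m −2, which is the kind of discrepancy that ranking models by the net term alone would hide.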

Lower tropospheric temperature structure
In order to evaluate the impact of the surface radiative balance terms on atmospheric state, here we assess the models' ability to simulate lower atmospheric temperature structure. Figure 9 illustrates temperature biases relative to radiosonde observations for the lowest 3000 m of the atmosphere for all five model time series. On the right of each figure is a profile demonstrating the mean (bold line) temperature bias profile, along with the interquartile range of these biases for each model grid box level. Biases detected for the lowest model levels follow biases in the surface energy budget, with CAM5, CAM5-PF and R-2 all featuring cold biases close to the earth's surface. CAM5-PF has a reduced cold bias when compared to CAM5, but also features a 1-2° warm bias higher in the atmosphere. R-1 and ERA-I have slight warm biases low in the atmosphere, which is not surprising given the neutral or positive biases they demonstrate in Q SFC. Both R-1 and R-2 feature cold biases in the upper portion of the evaluated domain, while ERA-I and CAM5 have very small net biases at higher altitudes and CAM5-PF has a warm bias. All models except for ERA-I and GISS ModelE2 have warm biases between 500-1000 m, resulting from insufficient low-level cloud cover in those simulations.
Looking at this in a different way, we can evaluate the models' ability to simulate the potential temperature difference between the surface and 850 mb (hereafter LTS 850, Fig. 10). For the ASCOS period, 850 mb fell between 1200-1600 m in altitude. LTS 850 can be thought of as representative of the mean stability of the lower atmosphere, with stability increasing with the difference between these temperatures. All of the models accurately represent LTS 850 during the first days of the observational period (12 August-15 August). After this, the models begin to diverge somewhat, with R-1 and R-2 generally being less stable than observations until 24 August, when R-2 suddenly shifts to generally being more stable than the observations. Looking back at Fig. 9, we can see that the instability (relative to observations) between 15 and 24 August in R-1 and R-2 is the result of a combination of near-surface warm bias and slight 850 mb cold bias. After 24 August, R-2 features a cold near-surface temperature, which results in its shift to a more stable lower atmosphere. When looking at the distributions of LTS 850 (Fig. 10b), the general picture is confirmed, with R-1 having an atmosphere that is generally less stable than observations, and R-2 more closely matching observations. ERA-I generally has lower-atmospheric stabilities that are similar to those observed, with two exceptions. The largest differences between ERA-I and observations occur 21-26 August and 1-2 September, where ERA-I is less stable than observations. During the 21-26 August time period, ERA-I exhibits a near-surface warm bias, resulting mainly from a high bias in the net surface longwave radiation (Fig. 4f). It appears as though a similar bias in surface radiation may have occurred between 1 and 2 September, though we do not have surface radiation measurements for much of that time period. In addition to the surface radiation-induced near-surface temperature bias, ERA-I also has a cold bias around 850 mb for the 1-2 September time period, adding to the reduced stability. Overall, ERA-I is shown to be less stable than observations, and in particular does not capture more stable environments well (right-hand tail of distributions in Fig. 10b). CAM5 and CAM5-PF simulations are generally too stable after 18 August. The exceptions to this include the period 23-24 August where both agree more closely with observed stability, and then the 24-27 August period where CAM5-PF agrees closely with observed stability. The excessive stability is largely the result of near-surface temperatures that are too cold in both versions of this model (Fig. 9). The biases in lower atmospheric stability again closely track the net longwave radiation at the surface (Fig. 4f), which for both versions of CAM5 is generally underpredicted. Time periods mentioned above with improved performance match the time periods in which net longwave radiation is in better agreement with observations. Distributions of lower atmospheric stability for the entire ASCOS period demonstrate the slightly improved simulation of this quantity with CAM5-PF, with more of the values moving towards the less stable (left) side of Fig. 10b. Finally, evaluation of ModelE2 provides us with a somewhat puzzling result. Despite large biases in the net surface energy flux, an evaluation of ModelE2 lower atmospheric stability results in a very favorable comparison with observations. The main exceptions to this occur 27-28 August (too unstable) and 28-31 August (too stable).
The GISS ModelE2 distribution of LTS 850 matches observations more closely than any of the other models. Figure 10c, d demonstrates the relative contributions of near surface air temperature and 850 mb temperature to LTS 850. These values were obtained by using one of the observed values (either θ SFC or θ 850) together with the model value for the other in order to calculate LTS 850, and then calculating the bias in LTS 850 relative to values calculated directly from the observations. As discussed above, most of the discrepancies result from differences in the near surface potential temperature, where variability is somewhat larger. The two main exceptions to this are the event on 27 August, when all of the models appeared to misrepresent the potential temperature at 850 mb, and the general contribution of CAM5-PF's warm bias at 850 mb. CAM5's excessive stability is clearly illustrated to result mainly from the near-surface potential temperature, with LTS 850 biases much larger when using the model θ SFC than when using the model θ 850.
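The stability metric and the attribution used for Fig. 10c, d can be sketched as below. The potential temperature formula is the standard dry-air definition with a 1000 hPa reference pressure and κ ≈ 0.286; the function names and sample temperatures are hypothetical, and the exact constants used in the study are not stated here.

```python
KAPPA = 0.286  # R_d / c_p for dry air (standard approximate value)

def potential_temperature(temp_k, pres_hpa, p0_hpa=1000.0):
    """Dry potential temperature (K) from temperature (K) and pressure (hPa)."""
    return temp_k * (p0_hpa / pres_hpa) ** KAPPA

def lts850(theta_sfc, theta_850):
    """Lower tropospheric stability: theta at 850 mb minus theta at the surface."""
    return theta_850 - theta_sfc

def lts_bias_contributions(th_sfc_mod, th_850_mod, th_sfc_obs, th_850_obs):
    """Swap in one model value at a time and compare with the observed LTS."""
    lts_obs = lts850(th_sfc_obs, th_850_obs)
    from_sfc = lts850(th_sfc_mod, th_850_obs) - lts_obs  # model surface, observed 850 mb
    from_850 = lts850(th_sfc_obs, th_850_mod) - lts_obs  # observed surface, model 850 mb
    return from_sfc, from_850

# Invented potential temperatures (K): a model that is 2 K too cold at the surface
# and 1 K too warm at 850 mb.
from_sfc, from_850 = lts_bias_contributions(270.0, 285.0, 272.0, 284.0)
```

Because LTS 850 is a simple difference, the two contributions add up to the total LTS 850 bias (3 K in this example), with the cold surface accounting for the larger share.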

Summary and discussion
The different models had variable success in simulating the evaluated properties.A brief summary of the results is included here.
-Surface meteorology: with some minor exceptions, all of the models had small biases in wind speed and direction.Surface pressure was generally well simulated, with the exception of R-1 (−3.01 mb bias) and R-2 (4.93 mb bias).Variables more closely tied to clouds and radiation such as near surface temperature and humidity were shown to be more poorly simulated.Median near-surface temperature biases ranged from −3.39 K (CAM5) to 1.15 K (R-1), while median relative humidity biases ranged from −7.80 % (GISS ModelE2) to 2.39 % (ERA-I) with all models demonstrating substantial variability.ERA-I generally featured superior correlation with observations over the other models for these variables.
-Clouds and precipitation: cloud-related processes continue to represent a major stumbling block for accurate simulation of the Arctic environment.With the exception of ERA-I and GISS ModelE2, models struggled to produce the amount of cloud cover observed.Cloud liquid water path was underestimated by three of the four models that provided this quantity, with GISS ModelE2 producing the closest amount of liquid water (median bias of 2.87 g m −2 ), though care needs to be taken since the observations represent a lower limit, and therefore likely an underestimate of the true liquid water present.CAM5-PF did demonstrate improvement over the standard version in simulating liquid water amount.While overall precipitation amounts were low during ASCOS, for what was recorded, median model errors were generally between 1-5 %, which is likely well within the errors associated with measuring precipitation.However, variability in those numbers was large, with models not reproducing three of the four more substantial precipitation events observed during ASCOS.
-Surface energy budget: model biases in cloud-related processes discussed above result in large biases in the radiative components of the surface energy budget. Shortwave radiation biases are generally closely coupled to cloud biases, as may be expected, with models that have too few clouds also featuring too much downwelling shortwave radiation at the surface, and GISS ModelE2 agreeing most directly with measurements (median bias of 3.97 W m −2).
-Lower atmospheric temperature structure: the models vary widely in their representation of lower atmospheric temperature. R-2 and both versions of CAM5 feature near-surface cold biases for much of the ASCOS period, and all models except ERA-I feature a slight warm bias between 500-1200 m. R-1 and R-2 both have significant (around 2 K) cold biases near 3000 m. From a bulk stability perspective, R-1 and ERA-I tend to have lower atmospheres that are slightly less stable than observations, while CAM5 (both versions) tends to be too stable. ModelE2 and R-1 most closely match the observed lower tropospheric stability, as defined by the temperature difference between the surface and 850 mb.
The demonstrated model biases can have wide ranging impacts. For one, reanalysis output is sometimes used to force large-scale or local sea ice and ocean models to evaluate their performance against available observations (e.g., Brodeau et al., 2010; Miller et al., 2007). Curry et al. (2002) evaluated several data sets used to force sea ice models during the SHEBA time period and noted that substantial differences can be found between the different data sets. Our evaluation results in a similar conclusion, with notable differences between products in quantities relevant to ocean processes and sea ice growth and decay, such as winds, precipitation and radiation. Additionally, reanalyses are often used to evaluate the performance of climate models in simulation of the present-day atmosphere (e.g., de Boer et al., 2012; Schmidt et al., 2006). For variables with well-characterized, small biases, this may be acceptable, but for other variables (generally those related to clouds, precipitation and radiation), this sort of evaluation is generally inappropriate, given the relatively comparable magnitudes of errors in the reanalyses and differences between reanalyses and models.
www.atmos-chem-phys.net/14/427/2014/ Atmos. Chem. Phys., 14, 427-445, 2014
Biases in GCMs can cause serious issues with determining equilibrated climatic states. Since small biases in a fully-coupled climate model can run away into new climatic regimes, biases in the models themselves can, over periods of several years, result in large errors in predicted climate. This is particularly true in a sensitive area such as the Arctic where small shifts in climatic equilibrium can result in stark transitions, such as those from ice-covered to melted surfaces. One challenging aspect of a model evaluation such as this one is distinguishing between the errors resulting from the model itself and those resulting from the data set from which forecasts were initialized. In the present study, there are some GCM biases that appear to result from the model itself, such as the lack of low clouds in CAM5 and the associated surface energy balance errors and biases in lower atmospheric stability.
The ASCOS campaign helps to illustrate some of the issues faced by models of the Arctic environment.For the reanalyses, there may be significant advances in product accuracy with the integration of additional observations into the GTS.This would help to improve the analyses used for forecast initialization, giving the model less room to stray from the measured atmospheric state.In terms of model performance, current parameterizations continue to be challenged in correctly simulating Arctic processes.While previous campaigns (e.g., SHEBA, ASCOS) have greatly improved our understanding of some of these processes, others could use additional improved observations to enhance our understanding.In particular, cloud processes are demonstrated here to cause problems for all of the models involved, with errors in the representation of clouds translating to surface energy and near-surface temperature errors.This can have significant impacts on climate simulation as these imbalances will likely drive the climate into an altered state through various feedback processes.The short duration of ASCOS ultimately results in an incomplete evaluation of these models and similar data sets need to be obtained for other times of year and other parts of the Arctic to complete such evaluations.Ultimately, increased observations are necessary to better understand and consequently simulate this environment.This is particularly true over minimally-studied areas such as the marginal ice zone, which continues to expand with the shrinking sea ice pack.

Fig. 1. A map illustrating the path of Oden (red line) during ASCOS. The region of ice drift (black box) is enlarged, and gray-scale points provide the location of the ship on given dates.

Fig. 2. Time series of basic meteorological quantities (from top to bottom: surface air temperature, surface air pressure, surface air relative humidity, 10 m U wind component, and 10 m V wind component). Included are lines for the observations from Oden (bold black line), CAM5 (standard version, dark green; CAM5-PF, light green), the GISS-ModelE2 (orange), and R-1 (dark blue), R-2 (light blue) and ERA-I (red) reanalyses. ERA-I, R-1 and R-2 lines represent instantaneous values at the 6-hourly analysis (0 h) times, while the CAM5 lines depict values interpolated to the analysis time from the model 1-hourly averages. The right-hand side of the figure includes distributions depicting the median (black dot), IQR (wide bar) and 10th-90th percentiles (thin bar) of the differences between simulated and observed values.

Fig. 3. Time series of common cloud and precipitation quantities (from top to bottom: average cloud fraction, average liquid water path, and average precipitation rate). Included are lines for the observations from Oden (bold black line), CAM5 (standard version, dark green; CAM5-PF, light green), the GISS-ModelE2 (orange), and R-1 (dark blue), R-2 (light blue) and ERA-I (red) reanalyses. The inset in (c) modifies the scale of the vertical axis to better distinguish between models where all have low values. All lines represent six-hour average values. Please note that R-1 and R-2 do not provide liquid water path, and are therefore not included in the center panel. The right-hand side of the figure includes distributions depicting the median (black dot), IQR (wide bar) and 10th-90th percentiles (thin bar) of the differences between simulated and observed values.

Fig. 4. Time series of surface energy budget terms (from top to bottom: average downwelling shortwave flux density, average upwelling shortwave flux density, average net shortwave flux density, average downwelling longwave flux density, average upwelling longwave flux density, average net longwave flux density). Included are lines for the observations from Oden (bold black line), CAM5 (standard version, dark green; CAM5-PF, light green), the GISS-ModelE2 (orange), and R-1 (dark blue), R-2 (light blue) and ERA-I (red) reanalyses. All lines represent six-hour average values. The right-hand side of the figure includes distributions depicting the median (black dot), IQR (wide bar) and 10th-90th percentiles (thin bar) of the differences between simulated and observed values.

Fig. 5. Time series of surface albedo as calculated from the 6 h average shortwave radiation terms. Included are lines for the observations from Oden (bold black line), CAM5 (standard version, dark green; CAM5-PF, light green), the GISS-ModelE2 (orange), and R-1 (dark blue), R-2 (light blue) and ERA-I (red) reanalyses. Triangles along the top of the figure represent 6-hourly periods that averaged precipitation of 0.5 mm h⁻¹ or greater.
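A minimal sketch of how an albedo series like the one in Fig. 5 could be derived from 6 h average shortwave terms, assuming albedo is taken as the ratio of upwelling to downwelling shortwave flux density. The function name, the argument names and the `min_sw_down` cutoff (used to mask periods with too little sunlight for a stable ratio) are illustrative assumptions, not details taken from the study.

```python
import numpy as np

def surface_albedo(sw_up, sw_down, min_sw_down=1.0):
    """Albedo as the ratio of mean upwelling to mean downwelling shortwave flux.

    sw_up, sw_down : 6 h average flux densities in W m-2 (same shape).
    min_sw_down    : hypothetical cutoff (W m-2); periods with less
                     downwelling shortwave are returned as NaN.
    """
    sw_up = np.asarray(sw_up, dtype=float)
    sw_down = np.asarray(sw_down, dtype=float)
    albedo = np.full(sw_down.shape, np.nan)
    valid = sw_down > min_sw_down          # avoid dividing by near-zero fluxes
    albedo[valid] = sw_up[valid] / sw_down[valid]
    return albedo
```

Applying the same function to the observed and the simulated flux series keeps the comparison consistent, since each albedo is then computed from its own shortwave terms in the same way.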

Fig. 6. Time series of surface energy budget terms (from top to bottom: average sensible heat flux density, average latent heat flux density, and net surface energy (radiation, latent heat and sensible heat)). Included are lines for the observations from Oden (bold black line), CAM5 (standard version, dark green; CAM5-PF, light green), the GISS-ModelE2 (orange), and R-1 (dark blue), R-2 (light blue) and ERA-I (red) reanalyses. All lines represent six-hour average values. The right-hand side of the figure includes distributions depicting the median (black dot), IQR (wide bar) and 10th-90th percentiles (thin bar) of the differences between simulated and observed values.

Fig. 7. Combined distributions of the absolute values of biases in LW_up, LW_down, SW_up, SW_down, F_SH and F_LH. Distributions are depicted, including the median (black dot), IQR (wide bar) and 10th-90th percentiles (thin bar).

Fig. 8. Range of values of liquid water path for observations and the models that provide it, as well as its relationship to surface net longwave radiation (a) and cloud fraction (b). The circles represent the mean value over the ASCOS observational period, and the whiskers show the extent of the 10th and 90th percentiles of the data sets.

Fig. 9. Time-height cross sections of air temperature bias (model − observations, left). Profiles on the right side illustrate the mean temperature bias with height (bold line) and the interquartile range of the bias (lighter, horizontal lines).

Fig. 10. Time series (a) and histograms (b) of lower tropospheric stability (LTS_850), as defined by the potential temperature difference between 850 mb and the surface. The lower panels (c, d) illustrate the relative contributions of near-surface potential temperature and the potential temperature at 850 mb to biases in LTS_850.
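The definition of LTS_850 given in the caption can be sketched as follows, assuming the standard dry-air potential temperature θ = T (p0/p)^κ with p0 = 1000 hPa and κ = R_d/c_p ≈ 0.2857. The function and argument names are hypothetical, and the study may have computed potential temperature with slightly different constants.

```python
KAPPA = 0.2857   # R_d / c_p for dry air (assumed value)
P0_HPA = 1000.0  # reference pressure in hPa (1 hPa = 1 mb)

def potential_temperature(temp_k, pressure_hpa):
    """Dry potential temperature (K) from temperature (K) and pressure (hPa)."""
    return temp_k * (P0_HPA / pressure_hpa) ** KAPPA

def lts_850(temp_850_k, temp_sfc_k, pressure_sfc_hpa):
    """Lower tropospheric stability: theta at 850 hPa minus theta at the surface."""
    theta_850 = potential_temperature(temp_850_k, 850.0)
    theta_sfc = potential_temperature(temp_sfc_k, pressure_sfc_hpa)
    return theta_850 - theta_sfc
```

Because LTS_850 is a simple difference, a model's LTS_850 bias is its 850 mb potential temperature bias minus its near-surface potential temperature bias, which is the decomposition that panels (c, d) illustrate.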

Table 1. Median biases and correlations (in parentheses) relative to observations for each of the evaluated variables. Italicized correlations are not statistically significant at the 95 % level. The best performing statistics for each variable are shown in bold.
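The statistics named in the table and figure captions (median bias, IQR, 10th-90th percentiles of the simulated-minus-observed differences, and correlation to observations) can be sketched as below. This assumes that the simulated and observed series are already matched in time and that the correlation is a Pearson coefficient; the source does not state which correlation measure or significance test was used, and the function name is hypothetical.

```python
import numpy as np

def bias_summary(simulated, observed):
    """Summarize simulated-minus-observed differences for one variable."""
    sim = np.asarray(simulated, dtype=float)
    obs = np.asarray(observed, dtype=float)
    ok = np.isfinite(sim) & np.isfinite(obs)   # drop pairs with missing data
    diff = sim[ok] - obs[ok]
    p10, p25, p50, p75, p90 = np.percentile(diff, [10, 25, 50, 75, 90])
    r = np.corrcoef(sim[ok], obs[ok])[0, 1]    # Pearson correlation (assumed)
    return {
        "median_bias": p50,
        "iqr": (p25, p75),
        "p10_p90": (p10, p90),
        "correlation": r,
    }
```

The median is less sensitive than the mean to a few large excursions, which may matter for a record as short as the ASCOS ice drift.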