Size-dependence in chord characteristics from simulated and observed continental shallow cumulus

In this study, we compare long-term lidar observations against a full month of large eddy simulations of continental shallow cumulus clouds. The goal is to evaluate whether the simulations can reproduce the mean observed vertical velocity and moisture structure of cumulus clouds, and to establish whether these properties depend on the size of the cloud. We propose methods to compare continuous chords of cloud detected from Doppler and Raman lidars with equivalent chords derived from 1D and 3D model output. While the individual chords are highly variable, composites of thousands of observed and millions of simulated chords contain a clear signal. We find that the simulations underestimate cloud size and fraction, but successfully reproduce the observed structure of vertical velocity and moisture perturbations. There is a clear scaling of vertical velocity and moisture anomalies below the chords with chord size, but the moisture anomalies are only 1-2 % higher than the horizontal mean values. The differences between the observations and simulations are smaller than the differences between sampling the modelled chords in time and in space. The shape of the vertical velocity and moisture anomalies from cloud chords sampled spatially from 3D model snapshots is almost perfectly symmetric. In contrast, the chords sampled temporally from the lidar observations and 1D model output have a marked asymmetry, with stronger updrafts and higher moisture anomalies occurring earlier on.

horizontal domain size of 25.6 × 25.6 km. Between a height of 6 km and the domain top at 9 km, the vertical grid spacing stretches from 25 m to 150 m. Adaptive time stepping with a constant Courant-Friedrichs-Lewy criterion results in a time step typically between 1 and 2 seconds. The simulations were run with periodic boundary conditions, homogeneous and prescribed surface fluxes, and a prescribed radiative tendency profile. Two-moment warm microphysics was used, even though precipitation was negligible.
Average statistics and 3D output of all thermodynamic variables were provided every 1800 s. Raw column data were provided every time step (i.e., about every second) at 4² = 16 locations throughout the domain, with the columns spaced 6.4 km apart in both the x and y direction.

Simulation evaluation
Before beginning our chord analysis we first determine to what degree our simulated cloudy boundary layer matches the observations by comparing cloud base height and cloud fraction. Thanks to the evaluation data provided as part of the LASSO library along with the forcing data, we can easily compare the hourly cloud base height and cloud fraction. We find that the cloud base heights of the simulations and observations agree well, with the average simulated cloud base being only 35 m lower than the observed one (Figure 1). As expected, the boundary layer deepening over the course of the day is clearly visible in the individual simulations. This good agreement in cloud base height was also found for the original LASSO simulations (Gustafson et al., 2020), and can be directly attributed to the forcing data provided by the LASSO project being well calibrated.
Measurements of the cloud fraction in broken cloud fields from point-like measurements struggle to capture a sufficient spatial/temporal sample size to derive a representative value, causing strong variance over time and binary-like behaviour (Rossow, 1989). Due to this behaviour, the comparison of simulated against observed cloud fraction shows very strong scatter with no clear temporal evolution visible (Figure 2). This strong scatter is not evident in the model simulations, which have a smooth daily evolution thanks to the 25.6 × 25.6 km model domain, which provides a large sample from which to calculate a cloud fraction. Furthermore, while the observations contain the full heterogeneity present in reality, the spatially homogeneous forcing applied to the simulations means that there is no mesoscale variability in the simulated cloud fields.
With regard to the mean cloud fraction over all available data points, the simulations underestimate cloud fraction by 20 % compared to the total sky imager (TSI) cloud fraction, and by 40 % compared to the low cloud fraction provided by the Active Remote Sensing of Clouds (ARSCL) value-added product (large dots in Figure 2). Given the large differences between the two observational products, we cannot say for certain if and how strongly the cloud fraction is underestimated in the simulations.
This general underestimation is also present in the original LASSO simulations as shown by Gustafson et al. (2020), and matches the experience of previous continental shallow cumulus LES over the ARM-SGP site (Zhang et al., 2017) and over Cabauw in the Netherlands (Schalkwijk et al., 2015). Recently Fast et al. (2019) showed that using a more realistic surface moisture distribution leads to larger and longer-lived clouds, indicating that our cloud underestimation could be due to the homogeneous surface conditions of the LASSO setup. A striking feature of our cloud fraction comparison (Figure 2) is how often the model contains no clouds when the observed cloud fraction lies between 0.0 and 0.2. According to Gustafson et al. (2020), this is at least partially due to the model spinup causing the simulated clouds to appear roughly 2 hours later than in the observations.
For the rest of the paper we will work with the assumption that the simulated cloud base height is accurate, and that the simulated cloud fraction is too low by an unknown amount between 15 and 45 %.

Detecting cloud chords and interpolating scenes
To avoid confusion we will first explain how we use the terms cloud chord, chord base height, and scene in this paper, as illustrated in Figure 3. By cloud chord we mean a continuous 1D string of cloud. Cloud chords are also referred to as traverses (e.g. Warner, 1970b) or intercepts (e.g. Rodts et al., 2003) in airplane studies, and can theoretically fully describe the cloud field if one assumes a simplified cloud shape (Wood and Field, 2011). As we are comparing with lidar observations, our cloud chords are sampled from the surface, and a single cloud chord can be sampled from multiple overlapping clouds which do not touch each other in the 3D field, as hinted at by the orange chord in Figure 3. Following Lareau et al. (2018), we define the chord base height as the 25th percentile of the bottoms of the cloud chord grid cells. We use the term scene to refer to the 2D vertical slice from the surface through the cloud chord. The height of the scene is regularized from 0 at the surface to 1 at the chord base height, and the scene extends before and behind the cloud chord, as shown by the two scenes in Figure 3.
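The two definitions above (chord base height and regularized scene height) can be illustrated with a minimal sketch. The function names and the example values are hypothetical and are not taken from the scripts in the supplementary material; the sketch only assumes that the bottom height of each cloudy column along a chord is already known.

```python
import numpy as np

def chord_base_height(cell_bottoms_m):
    """25th percentile of the cloud-bottom heights along one chord."""
    return float(np.percentile(cell_bottoms_m, 25))

def regularized_height(z_m, base_m):
    """Rescale heights so the surface maps to 0 and the chord base to 1."""
    return np.asarray(z_m, dtype=float) / base_m

# Hypothetical cloud-bottom heights (m) of the columns making up one chord.
bottoms = np.array([1000.0, 1025.0, 1050.0, 1100.0, 1000.0])
base = chord_base_height(bottoms)
z_reg = regularized_height([0.0, 500.0, 1000.0], base)
```

Using a low percentile rather than the minimum or the mean makes the chord base less sensitive to a few unusually low or high cloudy cells.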

All scenes are regularized to the same 2D grid, which allows us to merge multiple scenes together into composites. Throughout this paper we will only sort chords according to their length, and do not sort them into forced or active (as done by Lamer and Kollias, 2015) or into updrafts and downdrafts (as done by Lareau et al., 2018).

[Figure 3 caption: The filled-in grid boxes are cloudy cells. The green and orange cells show the cloudy cells which would be detected as two chords from the surface. The thick black lines mark the chord length and the chord base height of the two chords. The two lower boxes show the resulting 2D scenes after regularizing the two chords with their respective chord base height and length. For an example using Doppler lidar data see Lareau et al. (2018, Figure 1).]
In this section we detail how we define and detect cloud chords from 1D column output and 3D snapshots, followed by an analysis of how these two approaches differ. We try to be as consistent as possible with the definitions used by Lareau et al. (2018), and the exact technical implementation is described in the scripts used for this paper, which are included in the supplementary material. Since the model output is not identical to the measurements used by Lareau et al. (2018), we must first define some analogous definitions. Lareau et al. (2018) detect cloudiness from a threshold backscatter value, whereas we consider a model grid box to be cloudy when the liquid water mixing ratio is higher than 10⁻⁶ kg/kg. While both we and Lareau et al. (2018) only take cloudy cells into account that are no more than 300 m higher than the convective boundary layer (CBL), we must use a different definition of convective boundary layer height. This is because the vertical velocity variance used by Lareau et al. (2018) is computed from a 2D time-height slice of vertical velocity that contains many gaps where insufficient aerosols are available to retrieve a signal or where liquid water attenuates the lidar too strongly. As we cannot reproduce when and where the lidar would or would not have a sufficient signal to produce a measurement, we instead use the lifting condensation level (LCL) as an additional criterion. When the LCL is lower than the CBL derived through the vertical velocity variance threshold, we use the LCL.
In many ways it is simpler to detect chords from the simulation output than from the observations used by Lareau et al. (2018). Firstly, we know that only convective clouds occur in the LES simulations, so none of the safeguards used by Lareau et al. (2018) to minimize the chance of accidentally sampling stratus clouds are needed. Secondly, the model output contains no gaps in data to work around. Thirdly, in contrast to the observations, the horizontal and vertical wind points in the model are collocated in time and space. And finally, having the full 3D domain makes it possible to accurately split each variable into a horizontal mean and a local anomaly. From the observations only the local anomaly of vertical velocity can be determined easily, as we know that the mean vertical velocity must be very close to zero. To calculate the observed anomalies of water vapor mixing ratio discussed in subsection 6 we use a 90-minute running mean.
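As an illustration of the 90-minute running mean used for the observed moisture anomalies, the following sketch computes an anomaly relative to a running mean. It is not the implementation used for this paper: it assumes a regularly sampled, gap-free time series and a centred window, and the function name and example values are hypothetical.

```python
import numpy as np

def running_mean_anomaly(values, dt_s, window_s=90 * 60):
    """Anomaly relative to a centred running mean over window_s seconds.

    Assumes regular sampling every dt_s seconds and no missing values.
    """
    values = np.asarray(values, dtype=float)
    n = max(1, int(round(window_s / dt_s)))
    n = min(n, values.size)  # keep the window no longer than the series
    kernel = np.ones(n)
    # Near the edges, divide by the number of samples inside the window.
    total = np.convolve(values, kernel, mode="same")
    count = np.convolve(np.ones_like(values), kernel, mode="same")
    return values - total / count

# Hypothetical water vapor series (g/kg) sampled every 10 s with one moist spike.
q = np.full(1000, 12.0)
q[500] += 1.0
anom = running_mean_anomaly(q, dt_s=10.0)
```

Real lidar retrievals contain gaps, so an actual implementation would additionally need to handle missing values inside the window.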

From 1D columns
The 1D column output is created by writing out the model state in specific columns of the model grid at each timestep. Each 1D column output has two dimensions, height z and time t. Using 1D column output allows a direct comparison with the lidar observations as both are stationary 1D measurements, which allows chords to be detected almost identically to the approach illustrated in Figure 1 of Lareau et al. (2018). In contrast, comparing simulations to flight observations is much more difficult because it requires imitating a plane's possible flight path, as discussed by Hoffmann et al. (2014).
Following Lareau et al. (2018), we require the cloud-base height to be within 300 m of the convective boundary layer top and the cloud duration to be greater than 30 s. However, while Lareau et al. (2018) determine the CBL height only from the observed vertical velocity variance, we use a mix of the vertical velocity variance and the lifting condensation level. This is because the observed vertical velocity is full of gaps, due to cloud absorption or lacking aerosols, that we cannot imitate easily. Another difference is that we do not require the chord duration to be less than 20 minutes. This maximum chord duration was one of the safeguards implemented by Lareau et al. (2018) to reduce the chance of accidentally sampling stratus clouds, which are driven by different dynamics.
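The cloudiness criteria described above can be sketched as a boolean mask on a time-height field. This is a simplified illustration under stated assumptions, not the code from the supplementary material: the array layout, function names, and example values are hypothetical, and the 300 m limit is applied per grid cell relative to the lower of the CBL height and the LCL.

```python
import numpy as np

QL_THRESHOLD = 1e-6    # kg/kg, liquid water threshold quoted in the text
MAX_ABOVE_CBL = 300.0  # m, maximum height above the reference level

def reference_height(cbl_m, lcl_m):
    """Use the LCL whenever it lies below the variance-based CBL height."""
    return np.minimum(cbl_m, lcl_m)

def cloudy_mask(ql, z_m, cbl_m, lcl_m):
    """Boolean mask of cells counted as cloudy.

    ql has shape (time, height); cbl_m and lcl_m have shape (time,).
    """
    top = reference_height(cbl_m, lcl_m) + MAX_ABOVE_CBL
    return (ql > QL_THRESHOLD) & (z_m[None, :] <= top[:, None])

# Hypothetical single profile: liquid water at 1000, 1500 and 2000 m.
z = np.array([500.0, 1000.0, 1500.0, 2000.0])
ql = np.array([[0.0, 2e-6, 2e-6, 2e-6]])
mask = cloudy_mask(ql, z, cbl_m=np.array([1400.0]), lcl_m=np.array([1100.0]))
```

In this example the LCL (1100 m) is lower than the CBL height (1400 m), so only liquid water up to 1400 m is counted and the cells at 1500 m and 2000 m are ignored.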
It is noteworthy that MicroHH does not have a constant time step, and accordingly the 1D column output has a constantly varying time resolution. For reference, in one simulation the mean timestep over the day is 1.2 s with a standard deviation of 0.2 s. The smallest timestep was 0.04 s, and the largest 1.6 s. On average the MicroHH timestep is very close to the 1.3 s resolution of the Doppler lidar observations.
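Because the time step varies, chord durations have to be computed from the actual time stamps rather than from a sample count. The following sketch shows one way to do this for a boolean "cloud overhead" time series. It is an assumption-laden illustration: the function name is hypothetical, and the duration is measured from the first to the last cloudy sample of a run.

```python
import numpy as np

def detect_chords_1d(time_s, cloudy, min_duration_s=30.0):
    """Return (start_index, end_index, duration_s) for each cloudy run.

    cloudy is a boolean time series that is True where the column contains
    cloud. end_index is exclusive.
    """
    time_s = np.asarray(time_s, dtype=float)
    padded = np.concatenate(([False], np.asarray(cloudy, dtype=bool), [False]))
    # Indices where the series switches between clear and cloudy.
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    chords = []
    for start, end in zip(edges[::2], edges[1::2]):
        duration = time_s[end - 1] - time_s[start]
        if duration > min_duration_s:
            chords.append((int(start), int(end), float(duration)))
    return chords

# Hypothetical irregular time axis alternating between 1.0 s and 1.5 s steps.
time = np.cumsum(np.tile([1.0, 1.5], 80))
cloudy = np.zeros(time.size, dtype=bool)
cloudy[10:50] = True    # lasts 49 s, kept
cloudy[100:110] = True  # lasts 11.5 s, rejected
chords = detect_chords_1d(time, cloudy)
```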

From 3D snapshots
When calculating the chords and scenes from the 1D column output and the Doppler lidar data, we use time to turn the 1D height measurements into a 2D slice with height and time as the axes. In contrast, to detect chords from the 3D output snapshots we first slice the 3D field into a 2D x-z slice and a 2D y-z slice per snapshot. From these two slices we measure the chord length by moving an imaginary Doppler lidar across the surface. The approach of cutting in x and y direction was used to detect chord length from MODIS data by Wood and Field (2011), and recently by Endo et al. (2019) to determine w across the cloud edge. In contrast to the chords derived from the 1D column output or the observations, we are no longer measuring the duration of the chords, but the distance. We do derive a cloud duration using the mean horizontal wind speed at cloud base, but this approximation is only used as a rough comparison (shown in Figure 6).
We also replace the minimal 20 s threshold of Lareau et al. (2018) used for the 1D/DL chords with a 75 m threshold, which equals 3 model grid cells. Because chords shorter than 250 m have only 10 cells or fewer, we do not allow for gaps in the 3D chords.
An important characteristic of our method of slicing the 3D snapshots in x and y direction is that the detected cloud chords are not aligned with the wind direction. In contrast, the 1D and Doppler lidar chords are by nature affected by the wind direction, which moves the clouds above the static sensor. Wind direction is not commonly taken into account when converting 1D measurements to cloud fractions or similar (e.g. Brooks et al., 2005; Illingworth et al., 2007), but has a noticeable influence on our spatially connected 2D scenes. If the clouds in the domain are consistently oriented in some manner relative to the wind direction, this could lead to differences in chord length/duration between the 3D and 1D chords. However, while we have not systematically analyzed the orientation of the clouds with regard to the wind, we did not notice any clear formation of cloud streets or similar, and are confident that the various other differences in methodology between how the 1D and 3D chords are detected have a far greater effect.
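The spatial chord detection with the 75 m minimum length can be sketched for a single vertical slice as follows. This is a minimal illustration rather than the paper's implementation: the function names are hypothetical, the 25 m grid spacing is inferred from 75 m being equal to 3 grid cells, and the periodic boundaries of the domain are ignored for simplicity.

```python
import numpy as np

DX = 25.0  # m, horizontal grid spacing inferred from "75 m equals 3 cells"

def chord_lengths_from_slice(cloudy_xz, min_cells=3):
    """Lengths (m) of gap-free cloudy runs seen from below along one slice.

    cloudy_xz is a boolean array of shape (x, z). Periodic boundaries are
    not handled in this sketch.
    """
    column_cloudy = cloudy_xz.any(axis=1)
    padded = np.concatenate(([False], column_cloudy, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    n_cells = edges[1::2] - edges[::2]
    return n_cells[n_cells >= min_cells] * DX

def length_to_duration(length_m, wind_speed_ms):
    """Rough equivalent duration (s) of a spatial chord."""
    return length_m / wind_speed_ms

# Hypothetical slice with one 2-cell and one 6-cell cloudy stretch.
cloudy = np.zeros((20, 4), dtype=bool)
cloudy[2:4, 2] = True
cloudy[8:14, 1:3] = True
lengths = chord_lengths_from_slice(cloudy)
duration = length_to_duration(lengths[0], wind_speed_ms=6.0)
```

The 2-cell stretch (50 m) falls below the 3-cell minimum and is discarded, while the 6-cell stretch yields a 150 m chord.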
While we do not take the wind direction into account when detecting chords, we try to take it into account when calculating the corresponding scenes. To do this, we first flip the x-z and y-z slices in accordance with the wind so that the imitation lidar moves along the surface in the opposite direction of the wind in x and y direction. We then calculate and store the regularized scenes from the chords in x and y direction separately for each snapshot. The last step is to weight the scenes from the x and y slices by the strength of the wind in x and y direction before merging them together into one scene. So, for example, if the wind were zero in y direction, then the scene resulting from merging the weighted x and y scenes would be identical to the scene in x direction. However, if the x and y wind components were identical, the weighted and merged scene would consist in equal parts of the scenes in x and y direction. A missing but crucial detail is how we define the horizontal wind direction. For simplicity's sake we use the mean horizontal wind direction in the 3D snapshot at each height. An example of how taking the wind direction into account affects the resulting vertical velocity scene is shown in preparation), or we could have first rotated and interpolated the 2D horizontal fields onto a 2D grid oriented to the wind direction. However, in this paper all scenes from 3D snapshots are derived using our method introduced above because it is both technically simpler and computationally cheaper, while avoiding any interpolation artefacts.
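The weighting step described above can be sketched as follows. The sketch covers only the merging of two scenes that have already been flipped against the wind, and it assumes that the mean wind components have been interpolated to the regularized height levels of the scenes. The function name and array layout are hypothetical.

```python
import numpy as np

def merge_scenes(scene_x, scene_y, u_mean, v_mean):
    """Merge x and y scenes, weighted by |u| and |v| at each height.

    scene_x and scene_y have shape (height, horizontal); u_mean and v_mean
    have shape (height,).
    """
    wx = np.abs(np.asarray(u_mean, dtype=float))[:, None]
    wy = np.abs(np.asarray(v_mean, dtype=float))[:, None]
    total = wx + wy
    if np.any(total == 0):
        raise ValueError("wind vanishes at some height; weights undefined")
    return (wx * scene_x + wy * scene_y) / total

# Hypothetical scenes: the lower level has wind only in x, the upper level
# has equal wind components in x and y.
scene_x = np.ones((2, 3))
scene_y = 3.0 * np.ones((2, 3))
merged = merge_scenes(scene_x, scene_y, u_mean=[5.0, 2.0], v_mean=[0.0, 2.0])
```

The first row of `merged` reproduces the x scene, and the second row is the plain average of both scenes, matching the two limiting cases described in the text.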

Differences 1D vs 3D
The largest fundamental difference is that the chords from the 1D column output and the Doppler lidar data are taken from 2D slices of time and height, while the chords from the 3D snapshots are taken from 2D spatial slices. Accordingly, the 1D and DL chords are sampled from clouds as they are evolving from cloud birth to death, while the 3D chords are taken from clouds frozen in time. When the horizontal wind speeds are sufficiently low, a 1D chord could contain the complete cloud life cycle from birth to death. We cannot deduce from the lidar observations alone at which stage of their life cycle the clouds are when they are sampled. Even neglecting the cloud life cycle, wind shear and rotation will stretch and deform the atmosphere as it passes over the lidar. The way we calculate scenes takes the rotation of the wind direction with height into account, but it is still a crude approximation. No post-processing method can bridge the fundamental difference between the 1D and 3D chords.
Any systematic differences between the simulated 1D and 3D chords that go beyond chord definitions should be due to the difference in sampling over space or time. The final evaluation is a sensitivity test to determine how much chord length and duration depend on the exact definition of cloudiness and on whether cloud gaps are permitted within a chord.
The differences between the 1D model chords and the Doppler lidar chords will reveal how close the geometry of the simulated cloud fields was to that measured at the ARM-SGP site, and the comparison of the 1D to the 3D model chords will reveal the differences between the temporal and spatial sampling discussed in section 3.3. These differences between the observed and modelled cloud chords could be attributed to two sources: either the simulated days are not fully representative of the atmospheric conditions of the included observations, or the simulated days are not representative of the actual conditions at the ARM-SGP site at that time.
With regard to the cloud base, we have already shown that the simulations matched the observations at the ARM-SGP site very well (Figure 1). Despite having the correct cloud base height, the simulations could have too few clouds late in the day when clouds are higher and too many in the morning when clouds are lower. But this is not the case, because the distributions of when during the day the chords were detected show no marked shift (Figure 5 c). Note that when scanning the lidar observations for chords we exclude all clouds before 10 am to avoid stratocumulus conditions. With regard to the horizontal wind speed, we cannot easily check the horizontal wind against observations because horizontal wind is not one of the LASSO evaluation observations, but we find it highly unlikely that the LASSO forcing data would contain such a strong bias. Accordingly, we believe that the discrepancies between the modelled and observed chords are not due to the simulations misrepresenting the conditions at the ARM-SGP site. From this it follows that the differences are predominantly due to the 28 days of simulations
The difference in wind speed distribution could have a substantial effect on our analysis, because the faster a cloud is advected over the lidar, the less time the cloud has to evolve during the ongoing measurement. On the other hand, higher horizontal wind speeds also mean more horizontal shear, and a higher shear will lead to a greater deformation of the scene.

And lastly, higher wind speeds should reduce the measured cloud duration.

Length and duration
In this subsection we examine the differences in chord length and duration between our three methods of chord detection. We discuss which differences can be attributed to the varying definitions used to determine a chord, to the model underestimation of cloud cover noted in subsection 2.1, or to the difference in horizontal wind speed and height noted in the previous subsection 4.1.
We begin by comparing the duration of the Doppler lidar chords (DL) to the 1D output model chords (1D) (Figure 6 a). Both show a strong decrease in chord occurrence with increasing duration, with the most likely chord duration being just above the 30 second minimum applied by Lareau et al. (2018). The 1D chords can last longer than 20 minutes, given that we did not apply any maximum duration to the model data because there was no need to filter out stratus clouds. The distributions of the 1D chords and DL chords are remarkably similar, which at first glance would indicate that the observed and simulated clouds are of the same size and shape. However, since the horizontal wind speeds are higher than in the observations (Figure 5), the chord length, which results from multiplying chord duration with the wind at cloud base, shows that longer chords are substantially more common in the observations than in the simulations (Figure 6 b). From this we can conclude that not only do the simulations have a lower cloud fraction as shown in section 2.1, but the simulated clouds are also smaller than those observed.
This again fits with the results of Fast et al. (2019) that using a homogeneous surface in LES leads to smaller individual clouds.
It is worth mentioning that the most common chord length lies at roughly 300 m because of a combination of the 30 second minimum duration and the mean horizontal wind speed being 5-7 m/s. Chords shorter than 300 m can only be measured when the horizontal wind speed is below average.
When interpreting the 3D chords one wonders if the much larger sample size leads to a higher maximum chord length.
The maximum value is expected to increase with sample size if all samples are drawn from the same infinite distribution. That chord occurrence decreases very strongly with increasing length and duration is hardly surprising, given that it has been very well established that the cloud size density of cumulus clouds roughly follows a power law with an exponent of -3 to -2 (e.g. Raga et al., 1990; Benner and Curry, 1998; Neggers et al., 2003; Zhao and Girolamo, 2007; Dawe and Austin, 2012; van Laar et al., 2019).
Given that randomly cutting through a cloud with an area of A creates many chords with a length smaller than √A but only a few larger than √A (Barron et al., In preparation), it is expected that the chord distribution is heavily dominated by small chords. Our results do not disagree with the constant slope found by Wood and Field (2011) for chords up to 1000 km, as Wood and Field (2011) do not exclusively look at cumulus clouds and the satellite data they used has a maximum pixel resolution of 1 km.

Chord definition sensitivity
In this subsection we explore how sensitive the simulated chord distributions are to two changes made when adapting the chord definition of Lareau et al. (2018). The first aspect we look into here is the influence of allowing no gaps in the 3D chords. We test this by allowing 50 m gaps (2 grid cells) and rerunning the chord detection script for five days. Given that we are now allowing 50 m gaps, we only look at chords which are at least 250 m long.
While it is clear that allowing gaps will lead to more long chords and fewer short chords, we cannot predict how strong that impact will be. A large impact would indicate that the simulated clouds have many small internal gaps or ragged edges, or that the clouds are often only 50 m away from each other. A small impact indicates that the clouds are well isolated from each other, with few internal gaps. Our test reveals that allowing gaps has a slight but noticeable effect on chords from 250-400 m and longer than 3000 m (Figure 7 a), indicating that the chords in between those lengths are sampled from coherent clouds.
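The gap sensitivity test can be illustrated with a small sketch that bridges short clear stretches before measuring chord lengths. It is not the detection script used for this paper: the function name is hypothetical, and it assumes that bridged gap cells count towards the chord length.

```python
import numpy as np

def chord_lengths_with_gaps(column_cloudy, dx=25.0, max_gap_cells=2,
                            min_length_m=250.0):
    """Chord lengths (m) after bridging clear gaps of up to max_gap_cells."""
    cloudy = np.asarray(column_cloudy, dtype=bool).copy()
    padded = np.concatenate(([False], cloudy, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    starts, ends = edges[::2], edges[1::2]
    # Fill clear stretches between neighbouring runs that are short enough.
    for gap_start, gap_end in zip(ends[:-1], starts[1:]):
        if gap_end - gap_start <= max_gap_cells:
            cloudy[gap_start:gap_end] = True
    padded = np.concatenate(([False], cloudy, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    lengths = (edges[1::2] - edges[::2]) * dx
    return lengths[lengths >= min_length_m]

# Hypothetical row of columns: two 6-cell runs split by a 2-cell gap, and
# two 5-cell runs split by a 3-cell gap.
row = np.zeros(40, dtype=bool)
row[0:6] = True
row[8:14] = True
row[20:25] = True
row[28:33] = True
with_gaps = chord_lengths_with_gaps(row)
without_gaps = chord_lengths_with_gaps(row, max_gap_cells=0)
```

With 2-cell gaps allowed, the first pair merges into a single 350 m chord; without gaps, no run reaches the 250 m minimum.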
The second aspect we look into is what happens if we relax our definition of cloudy and treat cells with a relative humidity of 99 or 97 % as cloudy as well. This could in theory help explain why more long chords are observed than are detected from the 1D columns. Using a sub-saturated threshold can be justified by assuming that a 25 × 25 × 25 m volume of air with such a high relative humidity could contain some condensed liquid, and that by requiring liquid water to be present in the model cells to count as cloudy we might be stricter than the backscatter threshold used by Lareau et al. (2018). However, by looking at the changes in the cloud duration distribution from 18 days' worth of 1D column output, we see that using a relative humidity threshold has no noticeable impact on the chord duration (Figure 7). This strengthens our conclusion that more long chords are observed because the simulated clouds are smaller. We did find that using a relative humidity threshold slightly lowers cloud base, but not by an amount which could impact our results.
In summary, we are confident that our results are robust to variations in the precise definition of cloudiness and the absence of gaps in the 3D chords, and that the vast majority of our chords are sampled from coherent clouds.

Chord vertical velocity
Now that we have established how the distributions of the modelled and observed chords compare, we turn to the main ques-

Size dependence
We restrict our analysis to chords shorter than 3000 m for practical and scientific reasons. Given that the cloud base at the ARM-SGP observatory very rarely surpasses 3 km (see Figure 1), we are reasonably sure that most chords longer than 3000 m are either stratus clouds that slipped past the screening, or a chain of smaller cumulus clouds which by chance happen to overlap. We are not particularly interested in either of these groups.

To compare the vertical velocity we use the mean w over the box marked in each subplot of Figure 9. We consciously decided not to use w at chord base, a variable which is often considered (e.g. Endo et al., 2019) because vertical velocity at cloud base is both easily defined and an important property of mass flux parametrizations (Neggers et al., 2006). The first reason we choose to instead look at the mean w over the box in the regularized scene is that by staying 0.2 below cloud base we avoid issues with the exact determination of cloud base height. For example, if the cloud base determined by the lidar were 100 m too low, the vertical velocity determined would be higher. We also avoid the possibility of drops falling into the air below the cloud, contaminating the retrieved signal and leading to lower w. We believe this effect is at least partially responsible for the very uniform w decrease visible just below cloud base (see Figure 9 d and e). We do, however, include an analysis of w at cloud base in the simulations in section 7.
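Averaging over a box in the regularized scene can be sketched as below. The box limits used here are placeholders: only the upper limit of 0.8 (staying 0.2 below chord base) follows from the text, while the lower limit and the horizontal extent (0 to 1 across the chord) are assumptions made for illustration, as is the function name.

```python
import numpy as np

def box_mean(scene, z_reg, x_reg, z_bounds=(0.4, 0.8), x_bounds=(0.0, 1.0)):
    """Mean of a (height, horizontal) scene over a box; NaNs are ignored."""
    z_sel = (z_reg >= z_bounds[0]) & (z_reg <= z_bounds[1])
    x_sel = (x_reg >= x_bounds[0]) & (x_reg <= x_bounds[1])
    return float(np.nanmean(scene[np.ix_(z_sel, x_sel)]))

# Hypothetical scene: w = 1 m/s below chord base and 0 above, with one
# missing value inside the box.
z_reg = np.linspace(0.0, 1.5, 16)
x_reg = np.linspace(-1.0, 2.0, 31)
scene = np.where(z_reg[:, None] < 1.0, 1.0, 0.0) * np.ones((1, x_reg.size))
scene[5, 15] = np.nan
w_box = box_mean(scene, z_reg, x_reg)
```

Because the box ends well below the regularized chord base, a small error in the diagnosed chord base height shifts the box only slightly and does not mix in-cloud values into the mean.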
We also purposefully do not look at percentiles such as the 95th percentile used by Lareau et al. (2018)
While it could be argued that the 1D and DL chords have the same behaviour for chords shorter than 1500 m, the differences between the 1D and 3D chords are much more pronounced (Figure 8). The 3D chords show a stronger scaling until 1000 m, with a weak signal beyond but with a distinct small peak at 1500-1750 m.
So far from our results we conclude that the 1D and DL chords are in good agreement for chords shorter than 1500 m. While a modest bias exists between 1D and DL in that range, it is significantly smaller than the increase of the mean w across that range. In addition, this bias is much smaller than the difference between both 1D columns on the one hand and 3D snapshots

w scene
After assessing the mean vertical velocity, let us take a closer look at the regularized scenes that the mean values were averaged from (Figure 9), and at whether the shape of the updraft scenes is size dependent as well. The subplots of the left column are quite similar to those included in the supplementary material of Lareau et al. (2018), but contain more data and are binned differently. The most striking difference between the observed and modeled scenes is caused by ignoring all areas where more than 30 % of the scenes have insufficient aerosols to enable a retrieval. We chose the 70 % criterion as a trade-off between making the most of the observations available and reducing the chance of biasing the scene strongly towards aerosol presence. As a result, the observed scenes only extend to roughly 0.95 of the chord base. Given that we cannot be sure if the missing data is biased with regard to w, we are not fully confident in the observed scenes above about 0.8 before and after the chords. Within the cloud chord itself, we are also wary of the vertical velocities between 0.9 and 1.0 regularized height. We suspect that the reduction of the vertical velocity to 0 at cloud base is an artifact of the cloud droplets impacting the w retrieval. This is supported by the simulated scenes not having such a feature at all, and by there being no theoretical reason for updrafts to stop at cloud base. While we cannot pin down exactly what causes the differences between the 1D and DL scenes, the following general picture has emerged from our analysis. It seems that on average the simulated clouds are more compact and more closely linked to convective plumes than the observed ones. Given how small the differences are for shorter chords, there are many possible explanations, such as the difference in horizontal wind shown in Figure 5, or that the observations may include some days that had anomalously weak convection. The longer observed chords in particular appear to have a weaker convective character (Figure 9 g), and show no clear link between chord length and updraft strength. This could indicate that some of the longest observed cloud chords are stratus clouds that slipped through our selection criteria intended to filter out stratus clouds. But we cannot check if this occurs at all, let alone sufficiently often to have a noticeable effect on the mean value.
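The 70 % data-availability criterion used for the observed composites can be sketched as follows. This is a simplified illustration with a hypothetical function name, assuming that all scenes are already on the same regularized grid and that missing retrievals are stored as NaN.

```python
import numpy as np

def composite(scenes, min_valid_fraction=0.7):
    """Mean over scenes (axis 0), masked where too few scenes have data."""
    scenes = np.asarray(scenes, dtype=float)
    valid = np.isfinite(scenes)
    valid_fraction = valid.mean(axis=0)
    total = np.where(valid, scenes, 0.0).sum(axis=0)
    count = valid.sum(axis=0)
    mean = np.full(count.shape, np.nan)
    np.divide(total, count, out=mean, where=count > 0)
    # Discard grid points where too many scenes lack a retrieval.
    mean[valid_fraction < min_valid_fraction] = np.nan
    return mean

# Hypothetical stack of 10 scenes with 0 %, 20 % and 40 % missing data
# at three grid points.
scenes = np.ones((10, 1, 3))
scenes[:2, 0, 1] = np.nan
scenes[:4, 0, 2] = np.nan
comp = composite(scenes)
```

Only the third grid point, where 40 % of the scenes are missing, is masked in the composite.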
In contrast to the DL and 1D scenes, the 3D chords are much more symmetric, with only a weak shift of the updraft core towards the start of the chord (i.e. to the right in Figure 9). The symmetry is more pronounced when looking at the two subsiding shells at the edge of the cloud chords, which are equally strong and extend to the same height. There is a slight asymmetry in that the downdraft region behind the chord (left in Figure 9) is a bit wider, but not to the extent Mallaun et al. (2018) detected in airplane measurements. Also, in contrast to the 1D scenes, the updraft is more tightly organised below the cloud chord. This is most clearly visible when looking at how far the updrafts trail behind the chord, i.e. to the left in Figure 9.
To go back to the differences between 1D and 3D chords introduced in subsection 3.3, the difference between 1D and 3D can be attributed to either the life cycle of the cloud evolving as it is sampled, or the wind rotating and stretching the scene as it is advected over the lidar. The shear in horizontal wind speed over the boundary layer can explain why the updrafts in the 1D and DL chords are broader below the chord (Heus and Jonker, 2008). Weaker horizontal winds closer to the surface would stretch the updrafts wide, which would explain the wide weak updrafts close to the surface in the 1D and DL scenes. A wind shear above the chord base should be visible by the scene above the chord being slightly pulled toward the center of the chord. A close look at the trailing subsiding shell in the 1D chords shows that the subsiding shell bends slightly to the right over the cloud chord (Figure 9 mid column). In contrast, the subsiding shells of the 3D chords extend straight upwards. While this is only a very weak signal, it is consistent with what we would expect from a slight horizontal wind increase above the cloud base.
The asymmetry, however, cannot be explained by the horizontal wind shear. Instead, our results suggest that the asymmetry arises from the ongoing cloud life cycle. Given that the average horizontal wind speed is about 6 m/s (see Figure 5), on average we would expect the smallest clouds to age 1.5 minutes from chord beginning to end, the middle chords 3 minutes, and the longest chords 5.5 minutes (upper, middle, and lower row of Figure 9). Given that most shallow cumulus clouds live less than 20 minutes and many far shorter (Dawe and Austin, 2012; Heus and Seifert, 2013), the clouds are substantially younger at the beginning of the chords (i.e. the right in Figure 9) than at the end (the left). We expect younger cumulus clouds to be growing with an active updraft and older clouds to be decaying and sinking. Knowing that the clouds are substantially older at the end of the chord, this ageing provides a consistent explanation for why updrafts both below and above the chord base are stronger at the beginning of the cloud chords.
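The ageing estimate above is a simple advection calculation: the time a cloud spends over the lidar is the chord length divided by the horizontal wind speed. The sketch below inverts the quoted ageing times to show which chord lengths they correspond to at 6 m/s. The function names are hypothetical, and a constant wind speed is assumed.

```python
WIND_SPEED = 6.0  # m/s, average horizontal wind speed quoted in the text

def ageing_minutes(chord_length_m, wind_speed_ms=WIND_SPEED):
    """Time (min) needed to advect a chord of this length over the lidar."""
    return chord_length_m / wind_speed_ms / 60.0

def implied_length_m(age_min, wind_speed_ms=WIND_SPEED):
    """Chord length (m) that corresponds to a given ageing time."""
    return age_min * 60.0 * wind_speed_ms

# Ageing times quoted for the short, middle and long chord bins.
implied = [implied_length_m(a) for a in (1.5, 3.0, 5.5)]
```

At 6 m/s the three ageing times correspond to chord lengths of 540 m, 1080 m and 1980 m.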
Having considered possible explanations for the differences in width and asymmetry between the 1D and DL chords, we have so far not explained why the 3D chords have stronger updrafts. The difference in updraft strength is very likely partially due to the 3D chords not allowing any gaps in the chords. While we have shown that allowing 50 m gaps does not affect our results (subsection 4.3), we know that allowing longer gaps of hundreds of meters does reduce the updraft velocity.
Interestingly, not only are the 3D updrafts stronger, the downdrafts behind and before the chords are also stronger for the biggest chords (Figure 9 i). Given that the long chords have a strong updraft, due to mass conservation in the LES domain there must be downdrafts in between the chords. That these downdrafts are not visible in the 1D chords indicates that while the 3D
For interpreting the Raman lidar results it is important to note that the sample size and frequency of the moisture anomalies are substantially lower than those of the vertical velocity. We have roughly a factor of 10 fewer observed chords (Table 2) and the sample frequency is also lower by roughly a factor of 10 (10 s vs 1.3 s). To make use of all the data we have available for the largest chords we expanded the binning of the scenes shown in Figure 10 compared to Figure 9 (1500-3000 m vs 1500-2500 m). While the Raman lidar observations tend to extend higher than the CBL, they cannot penetrate into the clouds (Lareau, 2020). Since we calculate the anomalies from a 90 minute running mean, we do not plot the anomalies higher than 0.95 of the chord base because the missing observations in the cloud bias the resulting anomalies.
As expected, due to the sharp decrease in water vapor in the atmosphere above the boundary layer, both observed and simulated moisture anomalies have a very strong maximum at cloud chord base and within the cloud cores (Figure 10). In both the observations and simulations the positive moisture anomaly reaches all the way to the surface, with a smooth and monotonic increase from the surface to cloud base. In our simulations the water vapor values in the lower mixed layer lie roughly between 10 and 15 g/kg, so the mean anomalies at 0.5 chord base height are on the order of 1-2 %. To avoid the signal being dominated by possible small shifts in cloud height, we use a larger and lower averaging box (marked in grey in Figure 10) to calculate the moisture anomalies shown in Figure 11 than we did for the vertical velocities (marked in grey in Figure 9).

In contrast to the vertical velocity, which turns negative within the cloud edge (Figure 9), the water vapor anomalies remain slightly positive beyond the cloud at the cloud base height. This is consistent with our general understanding of subsiding shells which, while drier than the cloud cores, are still moister than the surrounding air (Heus and Jonker, 2008; Wang and Geerts, 2010; Katzwinkel et al., 2014).
Comparing the RL and 1D moisture anomaly scenes to the 3D scenes results in the same conclusions we deduced from the  substantially affect the results. To truly determine the size dependence in the observations would likely require substantially more data, preferably with a higher measurement frequency, and a more careful separation of anomalies from the mean than what we use. However, for now we are confident that the decreasing moisture anomalies with length are a direct result of using a running mean to calculate the anomalies.
In summary, the moisture anomaly analysis shows that the simulations match our theoretical expectations while correctly reproducing the observations. The differences between the 1D and 3D scenes are consistent with our findings from the vertical velocity, in that the 1D chords have a weaker signal with asymmetric scenes. Both 1D and 3D scenes show a clear size dependence, while the observed chords have a weak inverse size dependence due to how the anomalies were calculated. In relative terms the anomalies at half the chord base height are in the range of 1-2 %.
7 Sampling uncertainty
In this section we address the question of how the uncertainty in cumulus properties depends on how many days of shallow cumulus lidar observations are available. For this purpose the simulation data are used. Assuming a single 1D column output of a shallow cumulus simulation is equivalent to having one lidar measuring over one day, this analysis will determine how many days of shallow cumulus lidar observations would be needed at the ARM-SGP site to derive the various chord properties to a certain level of certainty. This whole analysis is based on the premise that each individual column output is both independent and equivalent, i.e. we make no distinction between two column outputs from the same day or different days. On average 31 chords are detected from each column, with a standard deviation of 17 and a maximum of 68. For this analysis we exclude all columns with fewer than 10 chords, leaving us with 386 columns of our total 448. The three properties that will be investigated are the chord length distribution, the mean vertical velocity below the chords as a function of chord length, and the mean w scene.
The sampling uncertainty of the chord length distribution for n columns is determined by repeatedly sampling n columns at random and calculating the chord distribution of all the chords contained in those n columns. As expected, uncertainty in the chord length distribution is directly linked to chord occurrence, with the smallest uncertainty being at about 300 m, which is the most common chord length (Figure 12 a). At this point it is informative to compare these results with the DL observations.
The line representing the DL chords roughly lies within the n = 64 shading, suggesting that at least 64 columns are needed to rule out sampling as the cause of the difference between observations and simulations.
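The resampling of the chord length distribution can be sketched as follows. This is a minimal illustration and not the code used for Figure 12: the function names, the bin edges, the number of draws, and the synthetic exponentially distributed chord lengths are all assumptions, and columns are drawn without replacement within each draw, which the text does not specify.

```python
import random


def length_histogram(lengths, bin_edges):
    """Normalised histogram over half-open bins [edge_i, edge_i+1)."""
    counts = [0] * (len(bin_edges) - 1)
    for length in lengths:
        for i in range(len(counts)):
            if bin_edges[i] <= length < bin_edges[i + 1]:
                counts[i] += 1
                break
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


def bootstrap_histograms(columns, n, bin_edges, n_draws=100, seed=0):
    """Repeatedly draw n columns and histogram all chords they contain."""
    rng = random.Random(seed)
    histograms = []
    for _ in range(n_draws):
        drawn = rng.sample(columns, n)  # n distinct columns per draw (assumption)
        pooled = [length for column in drawn for length in column]
        histograms.append(length_histogram(pooled, bin_edges))
    return histograms


def bin_ranges(histograms):
    """Per-bin (min, max) across draws, analogous to a shaded uncertainty band."""
    return [(min(values), max(values)) for values in zip(*histograms)]


# Synthetic stand-in data: 64 columns with 30 chord lengths (m) each
data_rng = random.Random(1)
columns = [[data_rng.expovariate(1 / 400.0) for _ in range(30)] for _ in range(64)]
edges = [0.0, 200.0, 400.0, 800.0, 1600.0, float("inf")]

for n in (4, 16, 64):
    ranges = bin_ranges(bootstrap_histograms(columns, n, edges))
    print(n, [round(high - low, 3) for low, high in ranges])
```

In this sketch the band collapses to zero width once n equals the number of available columns, because every draw then contains the same chords; with a real data set n stays well below the pool size.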
The mean w at cloud base is displayed in the same manner as the chord length distribution (Figure 12 b). In contrast to chord length, the uncertainty regarding the dependence of mean w at chord base on chord length is relatively constant for chords from 100 to 3000 m. We speculate that two compensating effects are behind this behavior: while longer chords are rarer than shorter chords, their mean w also has a much smaller spread. Accordingly, fewer long chords are needed to reach a representative mean value. An interesting result not directly related to sample sizes is that the shortest chords below 200 m in length actually have a negative vertical velocity on average. This negative velocity for the smallest chords was also found by Rodts et al. (2003) in airplane measurements, and indicates the smallest chords are dominated by chords sampled from the subsiding shells of larger clouds or by chords from dying clouds at the end of their life cycle.
The final variable we look at is the 2D w scene. To determine the uncertainty in the w scene we calculate the root mean square error (RMSE) between two randomly sampled scene composites through a bootstrapping approach. The randomly sampled scenes are generated by drawing 2n columns into two separate groups with n columns each from our total pool of 386 1D columns (again we only include columns which contain at least 10 chords). Once a column is drawn it is removed from the pool so that no column is compared against itself. We then form a composite of all w scenes present in the n columns in each group. Finally, we calculate the RMSE between the two composite w scenes from the two groups. This random drawing is repeated 100 times. Note that the number 100 is arbitrary, but results remain unchanged for higher numbers.
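The split-sample procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: each scene is taken to be already interpolated onto a common normalised grid and flattened into a list of equal length, the function names are hypothetical, and the synthetic scenes are pure Gaussian noise rather than model output.

```python
import math
import random


def composite(columns):
    """Grid-point mean over every scene contained in the given columns."""
    scenes = [scene for column in columns for scene in column]
    n_points = len(scenes[0])
    return [sum(scene[i] for scene in scenes) / len(scenes) for i in range(n_points)]


def rmse(a, b):
    """Root mean square error between two equally sized composites."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def split_sample_rmse(columns, n, n_repeats=100, seed=0):
    """Mean RMSE between composites of two disjoint groups of n columns."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_repeats):
        drawn = rng.sample(columns, 2 * n)  # without replacement: groups never share a column
        values.append(rmse(composite(drawn[:n]), composite(drawn[n:])))
    return sum(values) / len(values)


# Synthetic stand-in data: 100 columns, each with 5-15 noise-only scenes of 20 grid points
data_rng = random.Random(42)
columns = [
    [[data_rng.gauss(0.0, 1.0) for _ in range(20)] for _ in range(data_rng.randint(5, 15))]
    for _ in range(100)
]

for n in (4, 16):
    print(n, round(split_sample_rmse(columns, n), 3))
```

Drawing all 2n columns in a single call without replacement is one simple way to guarantee that the two groups are disjoint, matching the requirement that no column is compared against itself.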
As each column contains a varying number of scenes with different lengths and from varying times of day, there is no easy way to predict how the RMSE will behave. Our analysis shows that on average the RMSE behaves as if drawing from normally distributed errors with a standard deviation of 0.3 m/s (Figure 13). Accordingly, quadrupling the columns included in the analysis halves the sampling uncertainty. For example, reducing the expected sampling uncertainty from 0.05 m/s to 0.025 m/s requires measuring 128 days instead of 32.
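One way to read this scaling is that the expected RMSE falls off as the standard deviation divided by the square root of the number of columns. The sketch below assumes exactly that relation with the quoted 0.3 m/s; it is an interpretation of the statement above, not a fit to Figure 13, and the function name is hypothetical.

```python
import math

SIGMA_W = 0.3  # m/s, standard deviation quoted in the text


def expected_rmse(n_days, sigma=SIGMA_W):
    """Expected sampling uncertainty (m/s), assuming RMSE = sigma / sqrt(n_days)."""
    if n_days <= 0:
        raise ValueError("n_days must be positive")
    return sigma / math.sqrt(n_days)


for n_days in (32, 128):
    print(n_days, round(expected_rmse(n_days), 3))
```

Under this assumption the values for 32 and 128 days come out close to the 0.05 m/s and 0.025 m/s quoted in the example, and any fourfold increase in days halves the uncertainty.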

Conclusions and discussion
The two main goals of our study are to evaluate the LES concerning (thermo)dynamic perturbations of continental transient shallow cumuli, and to establish if size dependence exists in these features. To achieve this we compared cloud chords observed by Doppler and Raman lidars against chords derived from 1D and 3D model output.

-Compared to observations, MicroHH in combination with the LASSO forcing leads to lower cloud fraction (similar to Schalkwijk et al., 2015; Zhang et al., 2017; Gustafson et al., 2020) and slightly shorter cloud chords.
-Differences between the observations and simulations are smaller than the differences between the 1D and 3D sampling approaches applied to the simulations.
-The quantitative analysis of the dependence of the observed moisture anomalies on chord size is complicated by the relatively weak anomalies, the need to determine the mean moisture, and the smaller amount of data available from the Raman lidar. Solving this problem is for now considered a future research effort.
-Our results suggest that the asymmetry in the observed Doppler lidar scenes mainly originates from the cloud evolving while it is being sampled.
-Roughly 30 days of Doppler lidar observations from the ARM-SGP site are needed to get a solid grasp of the chord length distribution, vertical velocity at cloud base, and average w scene.

Discussion
Our success at using measurements to evaluate LES in combination with using LES to better understand the measurements at the ARM-SGP site underlines the usefulness of initiatives to run continuous LES at super-sites such as LASSO (Gustafson et al., 2020) and the JOYCE testbed (Neggers et al., 2012; van Laar et al., 2019) to drive scientific progress. Our methodology also highlights again just how crucial it is to imitate the observations as closely as possible when evaluating LES against