Comment on acp-2020-1193


General comments:
The manuscript by Li et al. presents an important evaluation of the NO2 diurnal variation using observations and model results from DISCOVER-AQ 2011. The research topic is important and interesting to the atmospheric modelling and observation communities. The approach used is comprehensive. Some of the findings (e.g., potential spatial distribution bias in the emission inventory, potential bias in ground-based remote sensing instruments) are important not just for modelling groups but also for observation groups. However, the presentation of this work should be improved. I would recommend publishing this work if the following concerns and comments are addressed.

Specific comments:
L87-88. Many previous works are not properly cited. As far as I know, various studies have converted Pandora NO2 VCD to TVCD or surface values to study diurnal variations, e.g., Kollonige et al., 2017; Spinei et al., 2014; Zhao et al., 2019. The authors should update the relevant background accordingly. I believe some of the results in this work could be compared with previous findings and may shed some light on the research topic.
L151-176. These detailed discussions of the wind field and precipitation should not be placed here, as the reader does not yet know anything about the trace gas simulation results or discrepancies. Such detailed discussions of potential causes (the authors used six figures in total, Figs. S2-S7) should be moved to a separate discussion section.
L203 and L213. 36-km REAM profiles were used to calculate AMFs for both OMI and GOME-2A. Do these new AMFs have a higher, lower, or comparable resolution compared to the original AMFs used in the satellite data products? Please provide a brief description of how the model output was smoothed or interpolated to the OMI and GOME-2A grids.
L328-353. I saw at least three names for the Kzz modelling, and I had difficulty understanding which one is which. After reading this section back and forth several times, I think two Kzz modellings were used, i.e., Kzz-WRF and Kzz-modified. However, I am not sure whether Kzz-WRF is the same as Kzz-YSU. I can understand why the authors want to modify Kzz for nighttime, but please improve the descriptions so that the idea is easier for a reader to follow.
L347-351. Some justification for the selected parameters is missing. A sensitivity test or a correlation study is needed to justify this 5 m s^-2. An unexplained "magic number" is not convincing. It is difficult to justify the selection with Figure 4, which shows that even the modified results still have large discrepancies compared to observations.
L364-369. I am worried that the ground observations from various sites should not be studied as a single group. Different local emission patterns should be addressed. E.g., do all 11 NO2 sites show the same concentration peak values at 5:00-6:00 LT? Do we see any differences between rural and urban sites?
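To illustrate the kind of grouped comparison meant in the L364-369 comment, a minimal sketch is given here. The site names, the urban/rural assignment, and all values are hypothetical and synthetic; they are not taken from the manuscript, and the real classification would have to come from the authors' site metadata.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical site classification (assumption, not from the manuscript).
SITE_TYPE = {"site_A": "urban", "site_B": "urban", "site_C": "rural"}

# Synthetic (site, local hour, surface NO2 in ppbv) records, for illustration only.
RECORDS = [
    ("site_A", 5, 20.0), ("site_A", 14, 6.0),
    ("site_B", 5, 16.0), ("site_B", 14, 4.0),
    ("site_C", 5, 4.0), ("site_C", 14, 2.0),
]


def diurnal_mean_by_group(records, site_type):
    """Return {group: {hour: mean value}}, averaging over all sites in each group."""
    buckets = defaultdict(lambda: defaultdict(list))
    for site, hour, value in records:
        buckets[site_type[site]][hour].append(value)
    return {
        group: {hour: mean(values) for hour, values in sorted(hours.items())}
        for group, hours in buckets.items()
    }


def peak_hour(cycle):
    """Return the hour at which the group-mean diurnal cycle is largest."""
    return max(cycle, key=cycle.get)


cycles = diurnal_mean_by_group(RECORDS, SITE_TYPE)
for group, cycle in cycles.items():
    print(group, cycle, "peak hour:", peak_hour(cycle))
```

Plotting the resulting cycles per group (or per site) next to the corresponding model values would show directly whether the early-morning peak is common to all sites or driven by a few urban ones.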
L388-393. The general impression from Figure 5 is that REAM-4km shows a higher bias than REAM-36km compared to observations. However, this might be misleading. For example, in Figure 5b from 00:00 to 5:00 LT, the green line shows better agreement with observations. Please comment on this. The study sites should be grouped into at least two categories, e.g., rural and urban.
Fig. S14. I guess the authors want to show that the Pandora TVCD should be corrected; otherwise, the results could be biased low due to a missing surface layer. I agree with the assumption, but it needs to be studied carefully (Fig. S14 shows some indication, but it is not convincing enough). Fig. S14a shows that for some sites (e.g., SERC), one can expect Pandora to miss up to 20% of the NO2 column. However, this is not reflected in Fig. S14b at all. If this 20% difference is real, it should be easier to verify at these sites than at others. Could you plot Fig. S14b for each Pandora site separately?

L435-442 and L469-488. The findings here are critical for the research community to understand the discrepancies between aircraft, ground-based in situ, ground-based remote sensing, and model results. The synthetic aircraft TVCDs agree better with REAM, especially from 15:00 to 17:00 LT. The agreement between REAM and the aircraft profiles (Figure 6) is very good. So, to me, it looks like the Pandora TVCDs are the ones with a major low bias. However, Figure 5 also shows that REAM has a large positive bias compared to ground-based in situ observations from 15:00 to 17:00 LT (especially for REAM-4km). Can the authors conclude whether the Pandora TVCDs are inaccurate in this period? These results may affect the claim in L218 that the accuracy of Pandora NO2 VCD is 2.7×10^15 molecules cm^-2. Also, from Fig. S9, it is clear that the observed diurnal variations at different sites can be very different. This matches the large error bars on the REAM modelled results in Fig. 7. But why do the Pandora TVCDs from 11 sites show very stable results (small error bars) in Figure 7? The current explanations are not good enough to convince me. Besides the understanding of model resolutions, this could be another highlight of this research work. So, I would suggest the authors provide more investigation, explanation, or discussion.

Technical corrections:
L194. Please modify the description of the estimated uncertainty. "molecules cm^-2 + 25%" does not make sense.
L217-218. The description of the precision of Pandora NO2 VCD is not correct. In Herman et al., 2009, the 0.01 DU (or 2.7×10^15 molecules cm^-2) precision is for the slant column (not VCD). For Pandora NO2 VCD, the estimated precision is about 0.02 DU (e.g., Zhao et al., 2020).
L288-294. The scale ratios look consistent between ECO and C42. The one that needs extra caution is CY42 (Thermo Model 42I-Y). However, if the Thermo Model 42I-Y NOy analyzer measurements are not used in this study at all (see L1175-1176), there is no need to include such detailed discussions (they will only confuse the reader). Or, at least, this information should be moved to the supplement. I would suggest the authors move other figures, such as Fig. S1, to this place instead; they are more important for the reader to understand the model scales/grids and the locations of the observations used in this study.
L377. Figure 6 is used before Figure 5. Please swap the order of the figures.
Figure 8. Please use different symbols for the >400 m and <400 m lines. Also, the caption says there are three bins, but I did not see proper labels for the "400 m - 3.63 km" bin. Do the >400 m lines represent the "400 m - 3.63 km" results? Please make sure the legends match the caption.
Figure 10. A description of the purple circles in panels a-c is needed.
Fig. S1 should be modified. The symbols for different observations are jammed together and are very difficult to distinguish. One should use other means to show the instruments at a single site, e.g., a pie chart.