I reviewed the previous version of this manuscript and had a number of scientific and technical comments. The authors have put a lot of work into this revision, addressing most of my concerns and those of the other reviewers. I appreciate that. The study is much improved as a result.
I do have some remaining comments on the paper, mostly because it has been substantially rewritten from last time. I recommend further revisions to address these. I would be happy to review the next version. I realize there are a lot of comments below, but that should not be taken negatively: this is important work, and the revised version really is a lot stronger than the original submission. I just want to make sure that the final paper is fully correct and as strong as it reasonably could be. So please take these comments positively. I think the next frontier is how best to assess level 3 uncertainty and combine different data records, and this analysis will be an important step in that direction, by helping lay out the current state of many well-used data sets.
Line 18: this says 10 products but line 91 says 11. I count 11 satellite products in e.g. Table 1.
Line 55: Satellites have been launched since before the 1990s and the word ‘continuous’ feels a little strange as we really only get a new aerosol-capable mission every few years. It might be better to say something like “For the last few decades, satellite instruments have been launched with increasing capability for remote sensing of aerosol measurements, which ...”.
Line 63: there are other instruments too so I would add “among others” at the end of this sentence. Otherwise a non-expert reader might think this is a complete list.
Line 80: Tropical, not topical.
Line 97: It would be worth mentioning that while data until 2017 are used, many of these products (MISR, MODIS, VIIRS) are ongoing as the instruments are still returning data.
Lines 117-119: aside from Thomas (2009), the other ORAC references (Sayer 2011, Poulsen 2012) are for ORAC cloud products and not ORAC aerosol products. Sayer AMT 2010 (https://doi.org/10.5194/amt-3-813-2010) and the Thomas (2009) paper might be better references for the aerosol data used here.
Line 142: I would say “dark land” for DT and “bright and dark land” for DB, as the spatial coverage of the two is quite different.
Line 149: The SeaWiFS data base is only for SeaWiFS Deep Blue. MODIS Deep Blue has its own data base from MODIS measurements. I suggest replacing “using the SeaWiFS surface reflectance products” with “using atmospherically-corrected data from the long time series of measurements”. This is more general to the Deep Blue products.
Lines 149, 150: There is a new reference for DB C6.1 which can be used instead of Hsu et al (2004, 2006, 2013) here. That is Hsu et al (2019), https://doi.org/10.1029/2018JD029688. Some of the methodology has changed a lot from the older work. I realise that 2019 paper is very new so it is understandable the authors prepared their revision before seeing it.
Line 150: I would also mention that the C6.1 DT land algorithm has an update to reduce biases in urban areas by using a different surface reflectance model there. This is described in Gupta et al (2016), https://doi.org/10.5194/amt-9-3293-2016.
Line 155: the authors might cite Sayer et al (2019) https://doi.org/10.1029/2018JD029598 here in addition to their own work, as that is the Deep Blue team evaluation of the C6.1 data. Again I realise this is very new.
Lines 158-159: I’d add Hsu et al (2019) in here again since that paper covers VIIRS too.
Line 160: I’d add this Sayer (2018) https://doi.org/10.1002/2017JD027412 paper here as it covers the VIIRS application of SOAR. Note that this is a separate Sayer (2018) paper from the one already cited, so one should be 2018a and the other 2018b.
Line 213: it might be worth noting that here sigma is the one standard deviation uncertainty. This is important because for the significance testing later the focus is on 90% or 95% significance thresholds, which are closer to the two standard deviation uncertainty. So it’s good to make this clear.
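To make the distinction concrete, here is a minimal Python sketch that converts two-sided confidence levels into multiples of sigma. It assumes Gaussian errors, which the manuscript may or may not assume, and the function name two_sided_z is mine, for illustration only.

```python
from statistics import NormalDist


def two_sided_z(confidence: float) -> float:
    """Multiple of sigma bounding a two-sided interval, assuming Gaussian errors."""
    return NormalDist().inv_cdf(0.5 + confidence / 2.0)


# One standard deviation corresponds to roughly 68% two-sided coverage.
z68 = two_sided_z(0.6827)
# The 90% and 95% thresholds used in the significance testing are wider.
z90 = two_sided_z(0.90)
z95 = two_sided_z(0.95)
print(f"68.27%: {z68:.3f} sigma, 90%: {z90:.3f} sigma, 95%: {z95:.3f} sigma")
```

Under that assumption the 90% and 95% thresholds sit at about 1.64 and 1.96 sigma, so a trend that exceeds its one-sigma uncertainty is not necessarily significant at those levels.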
Line 219: I’d add another sentence here because p-values are often interpreted incorrectly, especially by non-specialist readers. Something like: “The p-value represents the probability of obtaining results at least as extreme as those found, under the null hypothesis of there being no relation between AOD and time.”
Line 220: I’d say “decrease” the fraction, not “exclude” the fraction. You can never know for certain if and where you have false positives, only estimate the number you can expect to have across your data set.
Line 227: I’d expand on alpha a bit more here. My understanding is that this means you expect that no more than 5% of your significant results are in fact false positives. But please check and clarify this.
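For reference, this is how I understand alpha to enter a false discovery rate procedure. The sketch below implements the standard Benjamini-Hochberg step-up rule. I am assuming this is close to what the authors applied, which the text does not confirm, and the p-values are made up for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the null hypothesis is rejected.

    Standard Benjamini-Hochberg step-up rule: sort the p-values, find the
    largest rank k with p_(k) <= alpha * k / n, and reject ranks 1..k.
    """
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / n:
            k_max = rank
    rejected = [False] * n
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            rejected[idx] = True
    return rejected


# Illustrative (made-up) p-values for six grid cells or sites.
p_example = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
flags = benjamini_hochberg(p_example, alpha=0.05)
print(flags)
```

In this toy case the cells with p = 0.039 and p = 0.041 would pass a naive p < 0.05 test but are not retained, which is the behavior I think the text should explain to readers.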
Line 228: I’d say “Statistical metrics” instead of “Statistical metric”.
Section 3.3: one criticism from several reviewers of the previous version was the use of expected errors (EE) for level 2 products on the level 3 data. In this version the authors have removed the EEs, which is good. However I think it might be worth mentioning that they exist in case non-expert readers wonder. So somewhere in here I would add a sentence like: “Although several satellite products provide an expected level of uncertainty on AOD, this refers to level 2 products and not the level 3 products used here. As a result these level 2 expected uncertainties are not applicable to studies like this. Level 3 uncertainty estimates have not yet been developed for these AOD products.”
Line 236: this should be clarified. Consistent within one standard deviation or more? And compared by combining trend uncertainties linearly or in quadrature? (I think it should be in quadrature, if one can assume uncertainty estimates from different data sets are independent.)
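A minimal sketch of the check I have in mind follows. It assumes the two trend uncertainties are independent one-sigma estimates. The function name, the factor k, and the numbers are hypothetical and only illustrate why the choice needs to be stated.

```python
import math


def trends_consistent(a1, sigma1, a2, sigma2, k=1.0):
    """Check whether two trends agree within k combined standard deviations.

    Uncertainties are combined in quadrature, which assumes the two
    estimates are independent.
    """
    combined = math.hypot(sigma1, sigma2)
    return abs(a1 - a2) <= k * combined, combined


# Hypothetical trends (AOD per year) and one-sigma uncertainties.
ok_1sigma, combined = trends_consistent(0.0020, 0.0010, 0.0045, 0.0015, k=1.0)
ok_2sigma, _ = trends_consistent(0.0020, 0.0010, 0.0045, 0.0015, k=2.0)
print(f"combined sigma = {combined:.5f}, 1-sigma: {ok_1sigma}, 2-sigma: {ok_2sigma}")
```

With these made-up numbers the same pair of trends is inconsistent at one sigma but consistent at two, so the conclusion depends directly on the criterion chosen.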
Line 249: I would say “differences” rather than “estimation uncertainties”. Also, looking at Table 2, I am not sure exactly which statistics are being referred to here? For example the text says of the AATSRs, SU is best but ADV is worst. But it really depends on which statistic is of most interest. For example ORAC has lowest correlation over land, but has highest N. And some of the differences are tiny: is R of 0.766 (SU) really statistically distinguishable from 0.734 (ADV), for example? All of these summary statistics presented here are only estimates of the true population behavior, and no uncertainty estimates are provided for these correlations, biases, etc. I am concerned that much of early section 4.1 is over-interpreting small and possibly indistinguishable differences between very similar numbers. I do not think it is good to emphasize the global-scale comparison for that reason, particularly when there is such regional diversity (as the authors note at the start of section 4.2). The importance of not over-interpreting small global-scale statistics must at least be acknowledged in the text. I would rather see this shortened (and probably put the table in the Supplement to avoid over-interpretation by readers) and put more attention to the regional analysis (i.e. move the relevant material out of the Supplement and into the main paper). Otherwise the authors have to justify why the global-scale numbers are important for any particular purpose, and why it is relevant to give them to three decimal places when it is doubtful that these estimates are robust to that level of precision. If the authors want to keep Table 2, I’d assess the expected level of uncertainty on these statistics, and round them to that level. My intuition is that this would be two decimal places in most cases.
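As a rough illustration of how the uncertainty on a correlation could be assessed, the sketch below uses the Fisher z-transformation to get an approximate confidence interval. The two R values are those quoted above. The sample size is a placeholder of my own (the actual N should come from Table 2), and the method assumes independent, roughly bivariate-normal matchups, which is optimistic for AERONET comparisons.

```python
import math
from statistics import NormalDist


def correlation_interval(r, n, confidence=0.95):
    """Approximate confidence interval for a Pearson correlation (Fisher z)."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    half = NormalDist().inv_cdf(0.5 + confidence / 2.0) * se
    return math.tanh(z - half), math.tanh(z + half)


# Placeholder sample size, NOT taken from the manuscript.
N_HYPOTHETICAL = 500
su_lo, su_hi = correlation_interval(0.766, N_HYPOTHETICAL)
adv_lo, adv_hi = correlation_interval(0.734, N_HYPOTHETICAL)
intervals_overlap = su_lo <= adv_hi and adv_lo <= su_hi
print(f"SU:  [{su_lo:.3f}, {su_hi:.3f}]")
print(f"ADV: [{adv_lo:.3f}, {adv_hi:.3f}]")
print("intervals overlap:", intervals_overlap)
```

With this placeholder N the two intervals overlap. A much larger N would narrow them, and autocorrelated or shared matchups would widen them again, so the authors would need to run this with their real sample sizes. Overlapping intervals are only a coarse check, not a formal test of the difference.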
Line 261 (and again 299-300): the reason for low SeaWiFS coverage is partly the time period studied. The instrument operated from 1997-2010 but the comparison is only for 2006-2010. It had some temporary failures in 2008, 2009, and 2010 which cause missing monthly data. While only a small fraction of the record is missing out of the 13-year mission, pretty much all the gaps are in the 5-year period the authors study. So the numbers here are not really representative of SeaWiFS as a whole. This is an example of over-interpreting results without looking for the bigger picture, and again should be acknowledged in the text. Using a different time period, SeaWiFS would still have less coverage than some of the other instruments, but the difference would be a lot less stark. Later on (section 5.1) the authors acknowledge this but it’s really a relevant point for this section too.
Figure 4: As this contains 40 panels, it is somewhat tricky to make out the details. I appreciate what the authors are trying to do here: give a rough illustration of the statistical metrics across all of the sites and data sets. But I think it is too hard to see. I am not sure how best to deal with this. One option would be to make it into 4 figures (one for each of N, MAE, RMSE, RMB) and arrange them 2 across by 5 down. That would give each panel 4 times the area it has now, which would make things easier to distinguish. Another option would be to put this plot in the continental-scale comparison, and instead of showing data for each site, color each continent with the continental-average results. That would get across the point of showing regional and inter-data set variability but since each panel would have only a dozen or so larger shaded areas instead of 300 dots, it would be more readable. Then the current Figure 4 could be moved to the SI. I will leave it up to the authors and Editor. My point here is, the goal of this figure is a good one, but 40 panels with about 300 different dots on each is too much for a single figure. We just can’t see the details at that level.
Figure 5 and discussion: how is this spatial coverage and AOD calculated? Is it area-weighted? The retrievals are done on equal-area grids but the L3 aggregates used here are equal-angle. High latitude grid cells will therefore be weighted disproportionately high if the L3 coverage is just averaged. I think area-weighting (i.e. each grid cell is weighted by cosine(central latitude)) is the better way to do it, since it’s more representative of actual surface area on the Earth. Please clarify and replot if this was not what was done. The text does not say. My guess is that it might be a simple equal-angle average rather than area-weighted, as the maximum spatial coverage around 70% seems quite low for monthly products.
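To show the size of the effect I am asking about, here is a small Python sketch comparing a simple cell-count coverage with a cosine-latitude-weighted one on an equal-angle grid. The toy grid and the function name are hypothetical and have nothing to do with the authors' actual data.

```python
import math


def coverage_fractions(valid_mask, lat_centers_deg):
    """Return (simple, area_weighted) coverage for an equal-angle grid.

    valid_mask: one list of booleans per latitude band (True = data present).
    lat_centers_deg: central latitude of each band, in degrees.
    """
    n_valid = n_total = 0
    w_valid = w_total = 0.0
    for row, lat in zip(valid_mask, lat_centers_deg):
        weight = math.cos(math.radians(lat))  # proportional to cell area
        n_valid += sum(row)
        n_total += len(row)
        w_valid += weight * sum(row)
        w_total += weight * len(row)
    return n_valid / n_total, w_valid / w_total


# Toy grid: six 30-degree latitude bands, four cells each.
# The two polar bands have no data; all other cells are filled.
lats = [75, 45, 15, -15, -45, -75]
mask = [[False] * 4, [True] * 4, [True] * 4,
        [True] * 4, [True] * 4, [False] * 4]
simple, weighted = coverage_fractions(mask, lats)
print(f"simple: {simple:.3f}, area-weighted: {weighted:.3f}")
```

In this toy case the simple average gives about 67% coverage while the area-weighted value is about 87%, because the empty high-latitude cells represent little actual surface area. The same weights would apply to the spatially averaged AOD.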
Table 4, Figure 7: same question and suggestion as for Figure 5.
Lines 431-432: I would delete this sentence. Weak autocorrelation doesn’t really imply much about suitability for trend analysis. All it tells you is how strongly the data are autocorrelated. This is something that is part of the uncertainty analysis but by itself isn’t informative for suitability.
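To illustrate where autocorrelation does belong, the sketch below estimates the lag-1 autocorrelation of a series and the corresponding effective sample size under a first-order autoregressive (AR(1)) approximation. This is a common textbook approximation rather than necessarily what the authors used, and the function names and inputs are mine.

```python
def lag1_autocorrelation(x):
    """Lag-1 autocorrelation of a sequence (e.g. deseasonalized monthly AOD)."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    num = sum(dev[i] * dev[i + 1] for i in range(n - 1))
    den = sum(d * d for d in dev)
    return num / den


def effective_sample_size(n, r1):
    """Effective number of independent samples under an AR(1) approximation."""
    return n * (1.0 - r1) / (1.0 + r1)


# Hypothetical case: 120 monthly values with lag-1 autocorrelation of 0.2.
n_eff = effective_sample_size(120, 0.2)
print(f"effective sample size: {n_eff:.1f}")
```

In this hypothetical case 120 months carry the information of about 80 independent samples, which widens the trend uncertainty. That is the role autocorrelation plays, and it says nothing by itself about whether a data set is suitable for trend analysis.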
Line 443: please define “a” explicitly here. I infer from later in the paragraph that it is the AOD trend per year? It needs to be stated clearly.
Line 473: “lack of retrieval over land” is misleading. There are tens of thousands of retrievals over land. I suggest “lower coverage over land”.
Section 6.3: I would delete this section from the paper. I don’t feel that it really fits with the rest, so it’s just adding length. The authors state the results are similar to previous trend studies. So it’s just going to add a page or two to the paper’s length without much new material. I think it’s better to keep the paper focused on the main topic, comparing the data sets.
Lines 530-531: Typically “spatial continuity” refers to unrealistic jumps in the data values rather than gaps in the data (which is what the authors are talking about here). Also, it feels like the authors exaggerate on line 531 when they say MODIS provides “almost full coverage” when earlier figures say the number is more like 70%, which I don’t think is “almost full”. I suggest the authors say about SeaWiFS and ADV that they have “lower spatial coverage”. The comment about MODIS might be ok if the coverage is recalculated as an area-weighted average instead (see earlier comment), but in my view “almost full” implies something like 90% or more. Otherwise I’d just say MODIS gives “highest coverage”. I’d also give the numbers for the lowest and highest cases so the reader has a reminder how much difference we’re talking about.
Lines 543-546: Following from my earlier comment about section 6.3, I suggest deleting these sentences.
Conclusion: I think an extra paragraph or two needs to be added at the end. So we have done all this work but I am missing concrete suggestions for what to do next. In my view we’ve learned that on a monthly scale, coverage differs a lot between data sets. So if we want to get a bigger picture, what do we do? Should we next try averaging products together on a daily or monthly scale? Or should we pick products based on time period? Should we try some bias correction? Where are the most important areas for algorithm teams to focus effort to improve coverage or decrease the diversity? These are the sorts of questions I think the Conclusion should discuss, to guide future research.