CloudViT: exploring cloud type classification with vision transformers in global satellite data

Lenhardt, Julien; Quaas, Johannes; Sejdinovic, Dino; Klocke, Daniel

doi:10.5194/acp-26-5447-2026

Articles | Volume 26, issue 8

https://doi.org/10.5194/acp-26-5447-2026

© Author(s) 2026. This work is distributed under
the Creative Commons Attribution 4.0 License.

Research article | 22 Apr 2026

Final revised paper (published on 22 Apr 2026)
Preprint (discussion started on 02 Oct 2024)

Interactive discussion

Status: closed

CC1:
'Comment on egusphere-2024-2724', Chen Zhou, 30 Oct 2024

This paper presents CloudViT, a novel cloud classification method based on Vision Transformers (ViTs) and cloud properties derived from MODIS satellite data. The authors aim to classify cloud types across global datasets using spatial patterns of cloud properties such as cloud top height (CTH), cloud optical thickness (COT), and cloud water path (CWP). The method is evaluated on co-located ground-based observations and satellite data, producing accurate classifications of different cloud types. The approach is further tested with applications to General Circulation Models (GCMs), notably ICON-Sapphire, showcasing CloudViT's ability to generalize cloud type retrievals at kilometer-scale resolution.
CloudViT leverages self-supervised learning for pretraining and contrastive learning to overcome the limited number of labeled cloud observations. The method is robust, showing competitive performance when compared to traditional methods and CNN-based approaches, and effectively captures global cloud distributions, including complex cloud types like cumuliform and stratiform clouds. I think the paper is suitable for acceptance with minor revisions.

Minor Comments:
L142: Change "retrieved" to the verb form "retrieve."
L177: Replace "requires" with "require" to agree with the plural subject.
L209: In the sentence "this type of model, alongside CNNs, are," replace "are" with the singular verb "is" to agree with the subject "this type of model."
L323: Change "cardinal" to "cardinality" to correctly refer to the size or number of elements in a set.
L587-L593: I believe it would be beneficial to discuss the limitations, such as the following:
Since MODIS data is collected through near-nadir scanning, observations in high-latitude regions become oblique, leading to distortions and errors in cloud property retrievals, such as cloud top height and optical thickness. This could potentially affect the model’s performance in polar regions.

Citation: https://doi.org/10.5194/egusphere-2024-2724-CC1
- AC1: 'Reply on RC1', Julien Lenhardt, 25 Feb 2025
  
  Please find our response to the referees in the supplement.
  
  Citation: https://doi.org/10.5194/egusphere-2024-2724-AC1
RC1:
'Comment on egusphere-2024-2724', Anonymous Referee #1, 17 Nov 2024

My previous comment appears as "CC", so I re-posted my comment as "RC" here.
Overview:
This paper presents CloudViT, a novel cloud classification method based on Vision Transformers (ViTs) and cloud properties derived from MODIS satellite data. The authors aim to classify cloud types across global datasets using spatial patterns of cloud properties such as cloud top height (CTH), cloud optical thickness (COT), and cloud water path (CWP). The method is evaluated on co-located ground-based observations and satellite data, producing accurate classifications of different cloud types. The approach is further tested with applications to General Circulation Models (GCMs), notably ICON-Sapphire, showcasing CloudViT's ability to generalize cloud type retrievals at kilometer-scale resolution.
CloudViT leverages self-supervised learning for pretraining and contrastive learning to overcome the limited number of labeled cloud observations. The method is robust, showing competitive performance when compared to traditional methods and CNN-based approaches, and effectively captures global cloud distributions, including complex cloud types like cumuliform and stratiform clouds. I think the paper is suitable for acceptance with minor revisions.

Minor Comments:
L142: Change "retrieved" to the verb form "retrieve."
L177: Replace "requires" with "require" to agree with the plural subject.
L209: In the sentence "this type of model, alongside CNNs, are," replace "are" with the singular verb "is" to agree with the subject "this type of model."
L323: Change "cardinal" to "cardinality" to correctly refer to the size or number of elements in a set.
L587-L593: I believe it would be beneficial to discuss the limitations, such as the following:
Since MODIS data is collected through near-nadir scanning, observations in high-latitude regions become oblique, leading to distortions and errors in cloud property retrievals, such as cloud top height and optical thickness. This could potentially affect the model’s performance in polar regions.

Citation: https://doi.org/10.5194/egusphere-2024-2724-RC1
- AC1: 'Reply on RC1', Julien Lenhardt, 25 Feb 2025
  
  Please find our response to the referees in the supplement.
  
  Citation: https://doi.org/10.5194/egusphere-2024-2724-AC1
RC2:
'Comment on egusphere-2024-2724', Anonymous Referee #2, 20 Dec 2024

The paper shows results of using a ViT model that is pretrained on MODIS data to classify cloud scenes into 4/10 cloud types as defined by WMO. The authors go into great detail at times on model training choices etc. The paper is, however, light on physics. The classification performance of the models shown in the paper is honestly quite poor. An accuracy of 0.46 and an F1 score of 0.43 for the best model on test data cannot be treated as state-of-the-art. Note that the statistics for the training data are not that much better either. There is something not quite right about this paper, either the choice of training data, the training procedure, or something else, because ViT models are quite capable, as the authors wrote in the intro, yet the resulting performance is so poor. We urge the authors to investigate this glaring mismatch and improve the model's performance. Otherwise, results from application of such a model are highly unreliable, which defeats the purpose. I therefore suggest a major revision.
Technically, I do not see anything wrong with the general approach in terms of engineering. The authors described how they approached the problem, and given enough data and if the approach is sound, the models used in this paper should give us highly performing models.
The authors also did not do a thorough job at reviewing the literature on cloud type classification using machine learning/ deep learning. Their introduction to the subject seems a bit vague. I suggest the authors pay more attention to the actual physics instead of details of engineering and implementation because this is not an applied machine learning journal.
Minor comments:
Line 34-35: references are needed. This sentence is also a bit disconnected from previous ones.
lines 39-41: references are needed
Lines 46-47: please rewrite this sentence because it is confusing to read.
Line 51: the classification is not done pixel-wise as far as I'm aware.
Lines 62-68: incomplete review of cloud type classification
Line 82: 'robust retrievals': this is usually not considered a retrieval, since in the cloud remote sensing community retrievals have a specific meaning.
Line 103: I'm dubious on the point that such classification provides 'a high level of precision'.
Line 112: again, I'm not sure why 'retrievals' are used here. It reads off.
Figure 1: should at least contain panels that show the actual number of training data.
Lines 177-178: not clear what this sentence means.
Line 178-179: present the actual number of training data samples to give readers a clear idea.
Line 190: 128x128 pixels are not the same as 128kmx128km.
Line 197: sloppy language use. Suggest to change.
Line 199: is 4&10 much more detailed than 9 types?
Lines 215-217: what? This sentence is quite confusing. Please rewrite.
Line 243: please use terms consistently. Do not use different terms to refer to the same thing.
Anything after Table 2 and Figure 5 is not worth discussing too much because the performance is just not acceptable. The authors need to dig deeper into their data, approach, or something else to find ways to improve the results before application.
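
For readers less familiar with the two metrics cited in this report, the following is a minimal sketch of how accuracy and a macro-averaged F1 score are computed, and why they can diverge on imbalanced classes. The report does not state which F1 averaging the manuscript uses; macro averaging is an assumption here, and the labels are made-up toy data, not results from the paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (assumed averaging scheme)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels: 8 samples of one class, 2 of another.
y_true = ["Sc"] * 8 + ["Cu"] * 2
# A classifier that always predicts the majority class.
y_pred = ["Sc"] * 10

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(f"accuracy = {acc:.3f}, macro F1 = {f1:.3f}")
```

In this toy case the majority-class predictor reaches a high accuracy while the macro F1 is pulled down by the class it never predicts, which is why both numbers are usually reported together for imbalanced cloud type datasets.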

Citation: https://doi.org/10.5194/egusphere-2024-2724-RC2
- AC1: 'Reply on RC1', Julien Lenhardt, 25 Feb 2025
  
  Please find our response to the referees in the supplement.
  
  Citation: https://doi.org/10.5194/egusphere-2024-2724-AC1

Peer review completion

AR – Author's response | RR – Referee report | ED – Editor decision | EF – Editorial file upload

AR by Julien Lenhardt on behalf of the Authors (25 Feb 2025) Author's response Author's tracked changes Manuscript

ED: Referee Nomination & Report Request started (06 Mar 2025) by Minghuai Wang

RR by Anonymous Referee #1 (25 Mar 2025)

RR by Anonymous Referee #2 (10 Apr 2025)

Suggestions for revision or reasons for rejection

Again, this is ACP and the results should focus much more on the atmospheric physics side of things here. The modeling tool is just a means to an end. Right now, the paper focuses too heavily on the modeling details, but lacks discussion on physics. More critically, my main criticism remains: the performance is not good. It does not seem to be useful if the model constantly confuses one class with another. With so much noise, one cannot obtain meaningful insights using this tool, which defeats the purpose. The revision did not address this concern. I am giving an honest assessment of the paper in its current form. One should not sugarcoat the performance and try to argue that somehow the results are acceptable. I submit that this kind of performance would not pass any conference screening in an average AI/ML meeting. However, if the editor thinks this is acceptable, I am willing to reconsider my decision/recommendation.
1. Cloud type classification has a large body of literature. The authors added a few references that discuss the different cloud type schemes used in this paper. More relevant papers include Wood and Hartmann, 2006; Muhlbauer et al., 2014; Stevens et al., 2020; Yuan et al., 2020; Geiss et al., 2023; McCoy et al., 2023. They deal with cloud types that seem more relevant to what is being studied here, yet none of them are mentioned.
2. The relative occurrence of the cloud types plot is interesting and should be kept, but the absolute number of samples is also very important. It should not be relegated to the appendix.
3. The performance is still poor. It does not matter what method one uses and what the limitations are. If the performance is so poor, what is the point of publishing the results? Certainly, the authors do not consider the results here as negative ones, i.e., failed experiments for the community to learn from. They want to push its application for wider usage. The model should not be applied because of its lack of skill. The authors need to either improve the model skill so that it is useful, or present the results as an interesting but ultimately unsuccessful attempt at using an advanced tool to address a challenging issue. Reporting negative results is meaningful since it shares valuable experience with the community. For this purpose, I suggest the authors include a hypothesis on why the model ultimately failed to produce outstanding performance.

Figure 4: strong patterns of reconstruction error are concerning since they suggest the model has strong performance variations, depending on the location.
Table 2: this shows clearly that the current modeling technique does NOT really change the performance too much.
Figure 6: Why different colors but a single grayscale bar? It is quite confusing to comprehend.
Figure 7: no truth and no error statistics. What is the point?
Table A1: indeed, there is high imbalance in the training set. The authors need to address the challenge and produce performant models despite this.
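
One common way to counter the kind of class imbalance raised in the Table A1 comment is to weight each class inversely to its frequency in the training loss. The sketch below illustrates that general heuristic only; it is not taken from the manuscript, and the class names and sample counts are hypothetical placeholders rather than values from Table A1.

```python
def balanced_class_weights(counts):
    """Inverse-frequency weights: w_c = N / (K * n_c).

    counts: dict mapping class name -> number of training samples.
    N is the total sample count and K the number of classes, so a
    perfectly balanced dataset gives every class a weight of 1.
    """
    total = sum(counts.values())
    n_classes = len(counts)
    return {c: total / (n_classes * n) for c, n in counts.items()}


# Hypothetical sample counts for four coarse cloud classes (illustrative only).
counts = {"low": 600, "middle": 250, "high": 100, "vertical": 50}
weights = balanced_class_weights(counts)

for name, w in weights.items():
    print(f"{name:>8}: n = {counts[name]:4d}, weight = {w:.3f}")
```

With these weights the rare classes contribute as much to a weighted loss as the frequent ones, since the weighted sample counts sum back to the total number of samples. Whether this would actually improve the reported metrics is an open question that the reports do not settle.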

ED: Reconsider after major revisions (21 Apr 2025) by Minghuai Wang

AR by Julien Lenhardt on behalf of the Authors (13 Jul 2025) Author's response Author's tracked changes

EF by Katja Gänger (01 Aug 2025) Manuscript

ED: Referee Nomination & Report Request started (02 Aug 2025) by Minghuai Wang

RR by Anonymous Referee #2 (09 Nov 2025)

Suggestions for revision or reasons for rejection

In my opinion, the abstract is improved by acknowledging the limitations of the current model. The authors also included more relevant studies that deal with closely related subjects. I appreciate the authors’ effort in making these changes and trying to improve the results. However, the abstract does not reflect this and still reads like the current state-of-the-art is schemes using 2-D histograms, which is not true. The approach in this article is very similar to the ones I mentioned in the last round. The specific deep learning technique/data source may differ somewhat but given the poor performance it is hard to justify the discussed implications in this paper.
On the point of applying the trained model to high resolution model simulations, the question I raised last time was not answered satisfactorily. This exercise basically tries to apply a model trained on MODIS data and ground observation data, with poor performance, to high-resolution model simulation whose data may be quite different from MODIS, without any validation or even ground truth to compare with. Surely, it begs the question of how to justify it and what we should make of the results because applying a non-performant model to a different dataset that may be quite different from the MODIS data involves two risky steps. Yet, no validation is provided. This ‘next step’ should be paused because results from it may be highly unreliable and hard to interpret.
A really performant model that can be taken as capable of ‘understanding’ cloud classification could be used in this way assuming the data from the high-resolution model are not wildly different from MODIS. Even then, this step has to be trodden carefully because we know that though high-resolution models have many great features, they can still have shortcomings when compared to actual observations.
I hope that the authors understand that this is a real concern I as a reviewer have. Honestly, I appreciate the authors’ effort in trying this method. I would have expected this framework to work nicely for the same reasons the authors discussed in the manuscript, but the reality is that it did not. However, this paper is best published as a valuable exercise for the community to learn from. I suggest the authors discuss in more detail why the model does not perform well on the data, what they tried in order to improve the performance, and whether those attempts worked. Such information would be really valuable for researchers in our community. Packaging it as a powerful new way of studying high-resolution models is misguided given the current results. A simple thought experiment is that if it turns out that we could not train a performant model with this framework in the manuscript, it would suggest that the current framework is not viable. The implication should then be that researchers should avoid this seemingly viable option. The authors should either make it work, or document what they tried, how the results look, and what the implications are.

ED: Reconsider after major revisions (08 Dec 2025) by Minghuai Wang

AR by Julien Lenhardt on behalf of the Authors (19 Jan 2026) Author's response Author's tracked changes Manuscript

ED: Referee Nomination & Report Request started (20 Jan 2026) by Minghuai Wang

RR by Anonymous Referee #2 (21 Feb 2026)

Suggestions for revision or reasons for rejection

The changes that the authors made are great towards matching the results and significance discussion of the method. I appreciate the effort and like the discussion section. The application to satellite data is also interesting. I do find the discussion section a bit hard to follow. I give a few examples in the following.
‘The method and results highlighted in the previous sections provide useful material to further analyze the developed methodology, but also to be critical of its shortcomings.’
It is not clear what this sentence tries to express. Please make sure to write it with clarity. For example, if you meant to say ‘While our method provides a good foundation to be further developed, it does have its limitations.’ That would be clear and easy to follow.
‘Spatially-resolved cloud properties provide usable context for the CloudViT model to improve the cloud classification, as shown in the comparison to the baseline method with limited spatial information. Introducing this new transformer model architecture additionally improves the classification skill over the CNN backbone mentioned in Lenhardt et al. (2024a). Overall, CloudViT achieves passable performance even on sparsely represented classes for both cases of 4 and 10 cloud types.’
It is true that this new method beats the baseline as shown in the result section. This is indeed progress and has been rightfully highlighted in previous sections. The main point of my previous comments should be highlighted here. The issue is that while the method achieves progress relative to previous methods, its absolute performance is surprisingly low. I say this because previous CNN-based methods (which ViT should usually beat) often achieve metrics that are much better than the current ones. This point should be discussed.
‘The limited colocated dataset proves to be a hurdle for the proper training and evaluation of the method on labelled samples but the generation of an extensive global dataset allows deeper investigation into the cloud types. Improvements could come from a more extensive training dataset which would encompass a larger variety of cloud type samples to certainly enhance the classification’s performance both for the training and testing metrics. The subsequent evaluation exhibits interesting results despite the limited performance on the colocated dataset.’
The authors say that the limited collocated dataset ‘proves’ to be a hurdle. Where is the proof, or is it the authors’ presumed cause? It is important to show this with evidence. Why is ‘but the generation of an extensive global dataset allows deeper investigation into the cloud types’ the case? The authors should spell out why applying the method to a global dataset allows deeper investigation into the cloud types. Applying a model that achieves an F1 score of around 0.43 to a global dataset should produce a dataset that contains a lot of errors and uncertainties.
I feel ‘The subsequent evaluation exhibits interesting results despite the limited performance on the colocated dataset.’ is not warranted. The reason remains that applying a model that is relatively performant (as shown in this study, relative to old methods), but non-performant in absolute terms does not provide additional insights even if the patterns can be seemingly interesting. Again, due to the limited performance of the model, we do NOT know the true distribution for the global data. Simply having the global pattern seems not an asset. That remains the fundamental limitation and the authors should discuss this point. As a reviewer, these comments do not mean to take away the progress the paper makes. The authors did a good job at highlighting the potential power of the method and their achievements in the result section. Here it is important to focus on the current limitations and trying to provide insights how to improve the absolute performance metrics further so that we can obtain global data that is closer to the true distribution.
‘Cloud type diagnostics could be a resourceful addition to the panel of assessment methods for model data (Kuma et al., 2023; Kaps et al., 2023) given improvements to achieve remarkable performance in the classification ability as previously described.’
It is not clear what the authors mean here. Please clarify.
The next paragraph is a great example of a discussion. The emphasis is on increasing training data.
‘To improve the spatial coverage of the CloudViT predictions, the direct application to granules from MODIS TERRA would technically not require much more work as the instruments are similar and provide the same cloud properties. An additional benefit would come after the upcoming decommissioning of the CloudSat mission which was providing cloud type retrievals along its track aligned with MODIS. We would then be able to still offer information about cloud types over the same areas even though no vertical information is available and used from our predictions on MODIS level 2 data. As for other satellite cloud products, the main difference would arise, similarly to climate model data, from the potentially different distributions and ranges in the input cloud properties which would need either retraining of the vision transformer or careful scaling to match the distributions seen in MODIS data. Some limitations due to satellite retrieval shortcomings should be taken into account when applying the described method to certain areas. Indeed, since MODIS data is collected through near-nadir scanning, observations in high-latitude regions become oblique, leading to distortions and potential errors in cloud property retrievals, such as cloud top height and optical thickness.’
This is very speculative and deviates from the main focus of the study. The current results are relatively performant but not super performant in absolute terms, yet the authors start to speculate on a completely different idea here. I suggest removing it. As I stated before, the authors do a great job at introducing the method, outlining its potential strengths, and presenting their study setup. They should also do a great job at identifying the existing weaknesses and discussing potential ways to improve, as was done in the previous paragraph. I would like the authors to continue discussing factors that may cause the low absolute performance (I hope the authors agree with this assessment. If not, please let me know why they think the metrics in the paper are good).
‘As mentioned in more details in Appendix D, the input scaling is crucial to ensure proper portability of a method to this other data source.’ Should be ‘to other data source’.
It appears that the authors are interested in applying this to global model data. I think that would be a nice move. However, it is imperative to acknowledge that the first step remains to achieve great performance before such an exercise. The result is only as meaningful as the capability of the model/method. Please clarify this point in the discussion. Again, if the authors disagree with this point, please help to enlighten me so that I would not insist on it. I want to see the authors succeed and provide a wonderful tool to analyze global simulation data, but at the same time note that the current results can only be described as in-progress.

ED: Publish subject to minor revisions (review by editor) (02 Mar 2026) by Minghuai Wang

AR by Julien Lenhardt on behalf of the Authors (06 Mar 2026) Author's response Author's tracked changes Manuscript

ED: Publish as is (15 Mar 2026) by Minghuai Wang

AR by Julien Lenhardt on behalf of the Authors (20 Mar 2026) Manuscript

Short summary

Clouds come in various shapes and sizes and constitute a fundamental element of the Earth's climate system. Different cloud types show variable impacts on climate change. We present a new cloud type classification method called CloudViT (Cloud Vision Transformer), which uses machine learning and relies on spatial patterns of cloud properties obtained from satellite data. This can help us understand the effects of different cloud types on climate change.