Wei, J; Huang, W; Li, ZQ; Sun, L; Zhu, XL; Yuan, QQ; Liu, L; Cribb, M (2020). Cloud detection for Landsat imagery by combining the random forest and superpixels extracted via energy-driven sampling segmentation approaches. REMOTE SENSING OF ENVIRONMENT, 248, 112005.

A primary challenge in cloud detection is associated with highly mixed scenes that are filled with broken and thin clouds over inhomogeneous land. To tackle this challenge, we developed a new algorithm called the Random-Forest-based cloud mask (RFmask), which can improve the accuracy of cloud identification from Landsat Thematic Mapper (TM), Enhanced Thematic Mapper Plus (ETM+), and Operational Land Imager and Thermal Infrared Sensor (OLI/TIRS) images. For the development and validation of the algorithm, we first use stratified sampling, based on land-use cover around the world, to pre-select cloudy and clear-sky pixels and form a prior-pixel database. Next, we select typical spectral channels and calculate spectral indices from the top-of-atmosphere reflectance and brightness temperature, based on the spectral reflectance characteristics of different land cover types. These are then used as inputs to train the random forest (RF) model and establish a preliminary cloud detection model. Finally, the Superpixels Extracted via Energy-Driven Sampling (SEEDS) segmentation approach is applied to re-process the preliminary classification results and obtain the final cloud detection results. The RFmask detection results are evaluated against the globally distributed United States Geological Survey (USGS) cloud-cover assessment validation products. The average overall accuracy of RFmask cloud detection reaches 93.8% (Kappa coefficient = 0.77), with an omission error of 12.0% and a commission error of 7.4%. The RFmask algorithm is able to identify broken and thin clouds over both dark and bright surfaces. The new model generally outperforms the other methods compared here, especially over these challenging scenes. The RFmask algorithm is not only accurate but also computationally efficient. It is potentially useful for a variety of applications that use Landsat data, especially for monitoring land cover and land-use changes.
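
To make the post-processing and evaluation steps concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the superpixel re-processing amounts to a majority vote of the per-pixel labels within each segment (the abstract does not state the exact rule), and it uses the conventional confusion-matrix definitions of overall accuracy, omission error, commission error, and Cohen's Kappa, which may differ in detail from those used in the paper. The function names `majority_vote_by_segment` and `cloud_mask_scores` are hypothetical, and the segment IDs stand in for the output of a superpixel segmentation such as SEEDS.

```python
from collections import Counter, defaultdict


def majority_vote_by_segment(pixel_labels, segment_ids):
    """Relabel every pixel with the most common label in its segment.

    Assumption: this majority vote is only an illustration of how a
    superpixel segmentation could be used to smooth a per-pixel mask.
    """
    votes = defaultdict(Counter)
    for label, seg in zip(pixel_labels, segment_ids):
        votes[seg][label] += 1
    winner = {seg: counts.most_common(1)[0][0] for seg, counts in votes.items()}
    return [winner[seg] for seg in segment_ids]


def cloud_mask_scores(predicted, reference):
    """Score a binary cloud mask (1 = cloud, 0 = clear) against a reference."""
    tp = fp = fn = tn = 0
    for p, r in zip(predicted, reference):
        if p == 1 and r == 1:
            tp += 1
        elif p == 1 and r == 0:
            fp += 1
        elif p == 0 and r == 1:
            fn += 1
        else:
            tn += 1
    n = tp + fp + fn + tn
    overall = (tp + tn) / n
    # Omission: reference cloud pixels that the mask missed.
    omission = fn / (tp + fn) if (tp + fn) else 0.0
    # Commission: pixels flagged as cloud that are clear in the reference.
    commission = fp / (tp + fp) if (tp + fp) else 0.0
    # Cohen's Kappa: agreement corrected for chance agreement.
    chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (overall - chance) / (1 - chance) if chance != 1 else 0.0
    return {
        "overall_accuracy": overall,
        "omission_error": omission,
        "commission_error": commission,
        "kappa": kappa,
    }


# Toy data: a flattened 8-pixel image with three segments.
preliminary = [1, 1, 0, 0, 0, 1, 1, 1]   # per-pixel classifier output
segments = [0, 0, 0, 1, 1, 1, 2, 2]      # hypothetical superpixel IDs
reference = [1, 1, 1, 0, 0, 0, 1, 0]     # hypothetical reference mask

final_mask = majority_vote_by_segment(preliminary, segments)
scores = cloud_mask_scores(final_mask, reference)
```

In this toy case the vote removes the isolated misclassified pixels inside the first two segments. This is the kind of salt-and-pepper noise that object-level re-processing is meant to suppress.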