Sakamoto, T (2020). Incorporating environmental variables into a MODIS-based crop yield estimation method for United States corn and soybeans through the use of a random forest regression algorithm. ISPRS JOURNAL OF PHOTOGRAMMETRY AND REMOTE SENSING, 160, 208-228.

Satellite-based remote sensing is a powerful technology that can provide food security policy makers with reliable information. This information allows them to estimate final crop yields on a global scale within reasonable time frames and at higher spatial resolution than is possible with purely statistical data. Satellite-based crop yield estimation methods are commonly based on the high correlation between crop yield and a vegetation index (VI) observed at a specific phenological stage. Although VI-based crop yield estimation methods that use a single approximation formula can easily and effectively estimate the spatial distribution of corn and soybean yields in the United States, this approach has drawbacks that result in the underestimation of crop yields, especially in irrigated regions. Furthermore, a fundamental problem with this approach is the difficulty of evaluating environmental stress-related physiological disorders such as sterility, which cannot be evaluated from VIs used as a proxy for biomass. The objective of this study was therefore to overcome the limitations of the conventional approach by incorporating additional environmental variables and applying a random forest regression algorithm to estimate United States (US) corn and soybean yields with higher accuracy.
This study compared three methods: (1) a conventional method based on a linear regression model calibrated using limited past data (LM method), (2) a slight modification of the LM method that uses a polynomial regression model instead (PM method), and (3) the newly proposed method, which applies a random forest regression algorithm and uses the irrigated harvested cropland percentage and reanalysis data for temperature, precipitation, shortwave radiation, and soil moisture (RF method).
As a preliminary investigation, the time-series correlation between the moderate resolution imaging spectroradiometer (MODIS) wide dynamic range vegetation index (WDRVI) and corn and soybean yields was analyzed to determine the best time for recording the MODIS WDRVI as an explanatory variable in the study area. The results revealed that the MODIS WDRVI had the highest correlation with county-level statistical yields 13 days before the silking stage for corn and 6 days before the setting pods stage for soybeans. The regression formulas for the LM and PM methods were developed using the MODIS WDRVI at these phenological timings as the explanatory variable. The advantage of the PM method over the LM method was its adaptability to high-yield counties, an inherent effect of using a polynomial regression equation. The LM method, which used a linear regression equation calibrated using limited past data (2009-2010), could not adapt to the increased yields of recent years without recalibration with the latest data. The RF method learning models were individually optimized for each state and crop. This optimization revealed that the learning model incorporating every available variable did not always perform best, probably due to overfitting. In the major irrigated states of Kansas and Nebraska, the spatial data on the percentage of irrigated harvested cropland improved the estimation accuracy of the RF method for both corn and soybean. In the states of Illinois and Iowa, the RF method, which primarily incorporated the weather-related variables of soil moisture, precipitation, temperature, and shortwave radiation, improved the estimation accuracy owing to the response of rainfed agriculture to environmental stress. This was especially true for soybean.
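The timing search described above (finding the number of days relative to a phenological stage at which WDRVI correlates best with county yields) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the procedure used in the study: the function names `pearson` and `best_offset`, the dictionary layout, and all numbers are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def best_offset(wdrvi_by_offset, yields):
    """Return (offset_days, r) for the offset with the highest correlation.

    wdrvi_by_offset maps a day offset relative to a phenological stage
    (negative = before the stage) to one WDRVI value per county.
    """
    scores = {off: pearson(vals, yields) for off, vals in wdrvi_by_offset.items()}
    offset = max(scores, key=scores.get)
    return offset, scores[offset]


# Made-up illustrative values, NOT data from the study.
yields = [9.0, 10.5, 11.2, 12.0, 8.1]  # county yields in t/ha
wdrvi_by_offset = {
    -20: [0.30, 0.28, 0.35, 0.33, 0.31],
    -13: [0.40, 0.47, 0.50, 0.54, 0.36],
    0: [0.45, 0.44, 0.52, 0.50, 0.43],
}

offset, r = best_offset(wdrvi_by_offset, yields)
print(f"best offset: {offset} days, r = {r:.3f}")
```

In practice the candidate offsets would cover the whole growing season, and the WDRVI values would come from a smoothed MODIS time series sampled at each offset.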
The validation results indicated that the estimation accuracy of the RF method (root mean square error, RMSE: 0.539 t/ha for corn, 0.206 t/ha for soybeans) was higher than that of the PM method (RMSE: 0.897 t/ha for corn, 0.283 t/ha for soybeans) at the state level, particularly due to the effect of bias correction in irrigated regions.
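For readers unfamiliar with the metric, the following minimal sketch shows how RMSE and mean bias could be computed when comparing estimated and observed yields. The helper names `rmse` and `mean_bias` and the sample values are hypothetical and chosen only to illustrate how a systematic underestimate inflates RMSE; they are not results from the study.

```python
import math


def rmse(estimated, observed):
    """Root mean square error between estimated and observed values."""
    if len(estimated) != len(observed) or not estimated:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(e - o) ** 2 for e, o in zip(estimated, observed)]
    return math.sqrt(sum(squared) / len(squared))


def mean_bias(estimated, observed):
    """Mean signed error; negative values indicate underestimation."""
    return sum(e - o for e, o in zip(estimated, observed)) / len(estimated)


# Made-up yields in t/ha, NOT data from the study.
observed = [10.0, 11.0, 12.0]
biased_estimate = [9.0, 10.0, 11.0]       # uniformly 1 t/ha too low
corrected_estimate = [10.5, 10.5, 12.0]   # no systematic offset

print(rmse(biased_estimate, observed), mean_bias(biased_estimate, observed))
print(rmse(corrected_estimate, observed), mean_bias(corrected_estimate, observed))
```

A uniform underestimate contributes its full magnitude to RMSE, which is why removing a systematic bias can lower the error even when the scatter around the observations is unchanged.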