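| Abstract: |
| Ozone (O3) concentration is influenced by both natural factors and human activities and evolves in a complex, nonlinear manner, so accurate prediction of its concentration is essential for environmental management and decision-making. Taking Xi’an as the study area, this paper uses hourly air pollutant data from 2018 to 2020 together with ERA5 meteorological reanalysis data for the same period to build four models, a convolutional neural network (CNN), extreme gradient boosting (XGBoost), random forest (RF), and multiple linear regression (MLR), for 24 h single-step prediction of O3 concentration. The results show that the tree-based XGBoost and RF models perform well overall, particularly for the whole of 2020 and for the summer of 2020, with XGBoost performing best; by contrast, the classic CNN model did not show the expected advantage, and the MLR model performed worst for both 2020 and the summer of 2020. All models overestimate and underestimate O3 concentration to some degree, and in particular they generally underestimate the higher O3 concentrations in the afternoon, but the tree models (XGBoost and RF) keep the magnitude of the prediction bias better under control. SHAP values were then used to interpret the 2020 predictions: historical O3 concentration, solar radiation (SOL), and pressure (PRS) are the three most important features affecting model output, and in the summer 2020 prediction, O3 concentration and radiation-related factors contribute especially strongly to model decisions. The study indicates that tree-ensemble models have an advantage in handling the nonlinear character of O3 concentration prediction, and it provides a useful technical reference for air quality forecasting in comparable regions. |
| Keywords: O3 concentration prediction; machine learning; tree-based model; neural network |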
| DOI:10.7515/JEE2023074 |
| CSTR:32259.14.JEE2023074 |
| CLC number: |
| Document code: A |
| Funding: Strategic Priority Research Program of the Chinese Academy of Sciences (Category B) (XDB40000000) |
| Funding (English): |
|
| Machine learning-based prediction of O3 concentration from 2018 to 2020 in Xi’an |
|
LIU Nanjian1,2, ZHOU Weijian1,3, LI Guohui1,4
|
|
1. State Key Laboratory of Loess Science, Institute of Earth Environment, Chinese Academy of Sciences, Xi’an 710061, China
2. University of Chinese Academy of Sciences, Beijing 100049, China
3. Shaanxi Key Laboratory of Accelerator Mass Spectrometry Technology and Application, Xi’an AMS Center, Xi’an 710061, China
4. Key Laboratory of Aerosol Chemistry and Physics, Chinese Academy of Sciences, Xi’an 710061, China
|
| Abstract: |
| Background, aim, and scope Because ozone (O3) concentration is influenced by both natural factors and human activities, its variation is a complex nonlinear process, and accurate prediction of O3 concentration is of great significance for decision-making and management by environmental protection departments. This study aims to develop and compare four machine learning models for 24 h prediction of O3 concentrations in Xi’an and to identify key influencing factors through model interpretability. The scope is limited to Xi’an, using hourly air pollutant data and ERA5 meteorological data from 2018 to 2020, with performance evaluated for the full year and the summer of 2020 in terms of prediction accuracy and interpretability. Materials and methods O3 in Xi’an was taken as the research object. Using hourly air-quality monitoring data from 2018 to 2020 and ERA5 meteorological reanalysis data, we constructed a convolutional neural network (CNN), an extreme gradient boosting (XGBoost) model, a random forest (RF) model, and a multiple linear regression (MLR) model to perform 24 h single-step prediction of O3 concentration. Results The tree-based models (XGBoost and RF) showed strong prediction performance for 2020 and for the summer of 2020, with XGBoost performing best, whereas the classic CNN did not show the expected advantage, and the MLR model performed worst for both 2020 and the summer of 2020. Both the linear and the nonlinear models overestimated or underestimated O3 concentration in the study area to varying degrees; in particular, high O3 concentrations in the afternoon were generally underestimated. However, the tree-based models kept the magnitude of their prediction deviations better under control. 
Finally, for the 2020 prediction, SHAP plots of the two tree-based models (XGBoost and RF) revealed that O3 concentration, solar radiation (SOL), and pressure (PRS) at the previous 24 h timestep were the three most important factors affecting the model output, while for the summer prediction, O3 concentration and radiation factors at the previous 24 h timestep made a critical contribution to model decisions. Discussion Accurately predicting O3 concentration is challenging because it is influenced by complex human activities and weather conditions. The influencing factors used in this study were mainly dynamic. Future research should therefore consider not only dynamic factors such as meteorological conditions but also static variables such as terrain; adding more variables is expected to improve model prediction performance. In addition, this study focused on time-series prediction of O3 concentration, but air pollutants are generally distributed regionally, so the spatial dimension should be considered alongside temporal prediction. Convolutional neural networks are best known for processing image signals, in particular for extracting abstract features through operations in their hidden layers; in spatial prediction tasks, deep learning models represented by convolutional neural networks may therefore have great application potential, although their computational and time costs must also be considered. Finally, all the machine learning models in this paper underestimate O3 concentration in the afternoon, a time of day when human activity is very high. This may be due partly to the models themselves and partly to the limited set of features used. 
Conclusions Both the tree-based machine learning models and the deep learning model overestimate or underestimate O3 concentration in the study area to different degrees. Overall, the XGBoost model has the best predictive ability, the CNN model does not stand out, and the MLR model has the worst predictive performance. Recommendations and perspectives The results of this study can serve as a scientific basis for the prediction and early warning of O3 concentration in Xi’an. In future work, deep learning models could be applied to prediction in the spatial dimension. In addition, embedding the physically based chemical evolution of air pollutants into machine-learning models is expected to increase decision-makers’ confidence in applying them. |
| Key words: O3 concentration prediction; machine learning; tree-based model; neural network |
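
The 24 h single-step setup described in the abstract, in which features at the previous 24 h timestep are used to predict the current O3 concentration, can be illustrated with a minimal sketch. This is not the authors' code: the synthetic series, the 80/20 chronological split, and the helper names `make_supervised`, `fit_line`, and `rmse` are all assumptions made for illustration, and a one-feature least-squares fit stands in for the MLR baseline only.

```python
import math


def make_supervised(series, horizon=24):
    """Pair each hourly value x[t] with the target x[t + horizon]."""
    features = series[:-horizon]
    targets = series[horizon:]
    return features, targets


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x (one-feature MLR)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


# Synthetic hourly "O3" series with a pure 24 h cycle.
# Illustrative only; these are NOT observations from Xi'an.
hours = range(24 * 30)
o3 = [60 + 40 * math.sin(2 * math.pi * (h - 9) / 24) for h in hours]

# Feature: O3 at time t. Target: O3 at time t + 24 h.
x, y = make_supervised(o3, horizon=24)

# Chronological split, so the test period always follows the training period.
split = int(len(x) * 0.8)
x_train, y_train = x[:split], y[:split]
x_test, y_test = x[split:], y[split:]

intercept, slope = fit_line(x_train, y_train)
pred = [intercept + slope * xi for xi in x_test]
test_rmse = rmse(y_test, pred)
```

Because the synthetic series repeats exactly every 24 h, the lagged value predicts the target almost perfectly here; real O3 data would not behave this way, which is where additional meteorological features and nonlinear models such as XGBoost or RF come in.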