Machine learning-based prediction of O<sub>3</sub> concentration from 2018 to 2020 in Xi’an

LIU Nanjian; ZHOU Weijian; LI Guohui

Cite this article:	LIU Nanjian, ZHOU Weijian, LI Guohui. 2026. Machine learning-based prediction of O₃ concentration from 2018 to 2020 in Xi’an [J]. Journal of Earth Environment, 17(1): 116-127

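LIU Nanjian^1,2, ZHOU Weijian^1,3, LI Guohui^1,4
1. State Key Laboratory of Loess Science, Institute of Earth Environment, Chinese Academy of Sciences, Xi’an 710061
2. University of Chinese Academy of Sciences, Beijing 100049
3. Shaanxi Key Laboratory of Accelerator Mass Spectrometry Technology and Application, Xi’an AMS Center, Xi’an 710061
4. Key Laboratory of Aerosol Chemistry and Physics, Chinese Academy of Sciences, Xi’an 710061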

Abstract:

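Ozone (O₃) concentration is affected by natural factors and human activities and evolves in a complex, nonlinear way; predicting it accurately is essential for environmental management and decision-making. Taking Xi’an as the study area, this paper uses hourly air pollutant data from 2018 to 2020 and ERA5 meteorological reanalysis data for the same period to build four models, a convolutional neural network (CNN), extreme gradient boosting (XGBoost), random forest (RF), and multiple linear regression (MLR), for 24 h single-step prediction of O₃ concentration. The results show that the tree-based XGBoost and RF models perform well overall, particularly in the predictions for the whole of 2020 and for the summer of that year, with XGBoost performing best. By contrast, the classic CNN model did not show the expected advantage, and the MLR model performed worst in the predictions for 2020 and for that summer. All models overestimate and underestimate O₃ concentration to some degree and, in particular, generally underestimate the higher O₃ concentrations in the afternoon, but the tree models (XGBoost and RF) keep the magnitude of the prediction bias under better control. Interpreting the 2020 predictions with SHAP values shows that historical O₃ concentration, solar radiation (SOL), and pressure (PRS) are the three most important features affecting model output; in the summer 2020 prediction, O₃ concentration and radiation-related factors contribute especially strongly to model decisions. The study indicates that tree ensemble models have an advantage in handling the nonlinear characteristics of O₃ concentration prediction, and it provides an effective technical reference for air quality forecasting in related regions.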
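The 24 h single-step setup described above can be illustrated with a minimal sketch. It assumes that each sample pairs the features observed at hour t with the single O₃ value 24 hours later; the abstract does not spell out the exact sample layout, so the function name `make_lagged_samples`, the two extra driver columns, and the toy values are all hypothetical.

```python
from typing import List, Sequence, Tuple


def make_lagged_samples(
    o3: Sequence[float],
    drivers: Sequence[Sequence[float]],
    horizon: int = 24,
) -> Tuple[List[List[float]], List[float]]:
    """Pair the features seen at hour t with the O3 value at hour t + horizon.

    o3      : hourly O3 concentrations
    drivers : one row of additional features per hour (e.g. radiation, pressure)
    horizon : lead time in hours (24 is an assumption taken from the abstract)
    """
    if len(o3) != len(drivers):
        raise ValueError("o3 and drivers must cover the same hours")
    features: List[List[float]] = []
    targets: List[float] = []
    for t in range(len(o3) - horizon):
        # Inputs known at hour t: the O3 value itself plus the other drivers.
        features.append([o3[t], *drivers[t]])
        # Single-step target: exactly one value, `horizon` hours later.
        targets.append(o3[t + horizon])
    return features, targets


# Toy data: 30 hours of made-up values, not measurements from the study.
o3_series = [float(h) for h in range(30)]
driver_rows = [[100.0 + h, 950.0 - h] for h in range(30)]
X, y = make_lagged_samples(o3_series, driver_rows)
```

With 30 hourly values and a 24 h horizon, only 6 samples remain, because the last 24 hours have no target. The resulting `X` and `y` could then be passed to any of the four model types named in the abstract.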

Key words: O₃ concentration prediction; machine learning; tree-based model; neural network

DOI: 10.7515/JEE2023074

CSTR: 32259.14.JEE2023074

Document code: A

Funding: Strategic Priority Research Program (Category B) of the Chinese Academy of Sciences (XDB40000000)

Machine learning-based prediction of O₃ concentration from 2018 to 2020 in Xi’an

LIU Nanjian^1,2, ZHOU Weijian^1,3, LI Guohui^1,4

1. State Key Laboratory of Loess Science, Institute of Earth Environment, Chinese Academy of Sciences, Xi’an 710061, China
2. University of Chinese Academy of Sciences, Beijing 100049, China
3. Shaanxi Key Laboratory of Accelerator Mass Spectrometry Technology and Application, Xi’an AMS Center, Xi’an 710061, China
4. Key Laboratory of Aerosol Chemistry and Physics, Chinese Academy of Sciences, Xi’an 710061, China

Abstract:

Background, aim, and scope Ozone (O₃) concentration is influenced by natural factors and human activities and changes in a complex, nonlinear way, so accurate prediction of O₃ concentration is of great significance for decision-making and management by environmental protection departments. This study aims to develop and compare four machine learning models for 24 h prediction of O₃ concentrations in Xi’an and to identify key influencing factors through model interpretability. The scope is limited to Xi’an, using hourly air pollutant data and ERA5 meteorological data from 2018 to 2020, with performance evaluated for the full year and the summer of 2020 in terms of prediction accuracy and interpretability.

Materials and methods O₃ in Xi’an was taken as the research object. Using hourly air-quality monitoring data from 2018 to 2020 and ERA5 meteorological reanalysis data, we constructed a convolutional neural network model (CNN), an extreme gradient boosting model (XGBoost), a random forest model (RF), and a multiple linear regression model (MLR) to perform single-step prediction of O₃ concentration for the next 24 h.

Results The tree-based models (XGBoost and RF) showed strong prediction performance for 2020 and for the summer of 2020, with XGBoost performing best, whereas the classic convolutional neural network model did not exhibit excellent prediction performance and the MLR model performed worst in both periods. Both the linear and the nonlinear models overestimated or underestimated O₃ concentration in the study area to varying degrees, especially for high O₃ concentrations in the afternoon. However, the tree-based models kept the deviation of their estimates under better control. Finally, in the 2020 prediction, SHAP plots of the two tree-based models (XGBoost and RF) revealed that O₃ concentration, solar radiation (SOL), and pressure (PRS) at the previous 24 h timestep were the three most important factors affecting the model output, while in the summer prediction, O₃ concentration and radiation factors at the previous 24 h timestep made a critical contribution to model decisions.

Discussion Accurately predicting O₃ concentration is challenging because it is influenced by complex human activities and weather conditions. The influencing factors used in this study were mainly dynamic. Future research should therefore consider not only dynamic factors such as meteorological conditions but also static variables such as terrain; adding more variables is expected to improve model prediction performance. In addition, this study focused on time-series prediction of O₃ concentration, but air pollutants are generally distributed regionally, so the spatial dimension should be considered alongside temporal prediction. Convolutional neural networks are well known for processing image signals, in particular for extracting abstract features through hidden-state operations. Deep learning models represented by convolutional neural networks may therefore have great potential in spatial prediction tasks, although computational and time costs must also be considered. Finally, all the machine learning models in this paper underestimated O₃ concentration in the afternoon, a time of day when human activity is very high. This may be due partly to the models themselves and partly to the limited set of features used.

Conclusions Both the tree-based machine learning models and the deep learning model overestimated or underestimated O₃ concentration in the study area to varying degrees. Overall, the XGBoost model had the best predictive ability, the CNN model was not particularly outstanding, and the MLR model had the worst predictive performance.

Recommendations and perspectives The results of the study can serve as a scientific basis for the prediction and early warning of O₃ concentration in Xi’an. In future work, deep learning models could be used for prediction in the spatial dimension. In addition, embedding the physically based chemical evolution of air pollutants into machine-learning models would greatly increase decision-makers’ confidence in applying them.
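The afternoon underestimation reported above could be checked with a diagnostic along the following lines. This is only a sketch: the abstract does not name the error metrics the authors used, so the mean bias (predicted minus observed) grouped by hour of day is an assumption, and the function name `mean_bias_by_hour` and the toy numbers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Sequence


def mean_bias_by_hour(
    hours: Sequence[int],
    observed: Sequence[float],
    predicted: Sequence[float],
) -> Dict[int, float]:
    """Mean of (predicted - observed) per hour of day.

    A negative value means the model underestimates at that hour on average.
    """
    if not (len(hours) == len(observed) == len(predicted)):
        raise ValueError("all inputs must have the same length")
    sums: Dict[int, float] = defaultdict(float)
    counts: Dict[int, int] = defaultdict(int)
    for hour, obs, pred in zip(hours, observed, predicted):
        sums[hour] += pred - obs
        counts[hour] += 1
    return {hour: sums[hour] / counts[hour] for hour in sorted(sums)}


# Toy values only (not results from the study): two days, sampled at 03:00 and 15:00.
hours = [3, 15, 3, 15]
observed = [20.0, 120.0, 30.0, 140.0]
predicted = [25.0, 100.0, 31.0, 130.0]
bias = mean_bias_by_hour(hours, observed, predicted)
```

In this toy case the bias is positive at 03:00 and negative at 15:00, which is the pattern an afternoon underestimation would produce. Comparing such a table across models is one way to see which model keeps the deviation smallest.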

Key words: O₃ concentration prediction; machine learning; tree-based model; neural network