Abstract: Train operation environments are complex, with frequent lighting variations, occlusions, and significant scale differences among targets. As a result, existing models struggle to identify risk objects during train operations, including oncoming trains, pedestrians, falling rocks, and abnormal intrusions. To address these challenges, this study explores the feasibility of applying large vision-language models to risk object identification in train operation scenarios. Based on train operation data, this paper first constructs a risk object dataset for Low-Rank Adaptation (LoRA) fine-tuning by reformatting annotated data into structured natural language descriptions. The DeepSeek-VL2 multimodal large model is then fine-tuned with LoRA to obtain weights optimized for risk object identification in train operation environments. The model's recognition accuracy is evaluated through natural language question-answering tasks. Experiments show that risk objects in train operation environments can be identified using only natural language instructions, achieving an F1-score of 80.5%. This meets the accuracy requirements for detecting risk objects such as oncoming trains and pedestrians in subway operation scenarios, which helps reduce train collision risks. Moreover, the model exhibits strong generalization capability across diverse subway train operation scenarios.
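As an illustration of the two mechanical steps named in the abstract, the sketch below shows one way annotated boxes could be reformatted into a structured natural language question-answer sample, and how an F1-score is computed from detection counts. The record layout, the field names (`image`, `question`, `answer`), the question wording, and the example file name are hypothetical and chosen only for illustration; the abstract does not specify the dataset schema or how true and false positives are counted from the model's answers.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    """One labeled risk object (hypothetical layout, not the paper's schema)."""
    label: str                       # e.g. "pedestrian", "oncoming train"
    box: tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max) in pixels


def to_instruction_sample(image_path: str, annotations: list[Annotation]) -> dict:
    """Turn the box annotations of one image into a question-answer pair."""
    # The question wording is an assumption made for this sketch.
    question = "Which risk objects are visible ahead of the train, and where?"
    if annotations:
        parts = [
            f"{a.label} at [{a.box[0]}, {a.box[1]}, {a.box[2]}, {a.box[3]}]"
            for a in annotations
        ]
        answer = "Risk objects: " + "; ".join(parts) + "."
    else:
        answer = "No risk objects detected."
    return {"image": image_path, "question": question, "answer": answer}


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1: harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    sample = to_instruction_sample(
        "frame_0001.jpg",  # hypothetical file name
        [Annotation("pedestrian", (10, 20, 30, 40))],
    )
    print(sample["answer"])
    print(round(f1_score(tp=8, fp=2, fn=2), 3))
```

Samples of this shape could then serve as supervised text targets during fine-tuning, and the same question could be posed at evaluation time so that the parsed answers yield the counts passed to `f1_score`.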