A Traffic Police Gesture Recognition Method Based on BiLSTM-Transformer Architecture

Published on 11 June 2026 - Electronics

Authors: Xiaoyu Zhang, Baohua Guo, Sen Wang, Anthony Sigama, David Bassir

To address insufficient real-time performance and inadequate temporal feature modeling in traffic police gesture recognition, this paper proposes a method based on skeleton keypoints and hybrid temporal modeling. First, YOLOv11m-Pose detects human skeleton keypoints in video sequences, yielding reliable two-dimensional skeleton features. Second, a temporal modeling network integrates a bidirectional long short-term memory (BiLSTM) network with a Transformer. The BiLSTM models local temporal continuity and action transitions between adjacent frames, capturing short-term dynamics. The Transformer, through its self-attention mechanism, models global temporal dependencies and weights critical time steps to extract long-range discriminative information. Experimental results show that the proposed method achieves 98.91% on both Accuracy and F1-Score. In Accuracy, it outperforms the BiLSTM and Transformer models by 2.43% and 7.67%, respectively, and it also outperforms most methods based on recurrent neural networks and feature fusion. The model's average inference time is 1.3299 s per gesture sequence. The approach therefore strikes a favorable balance between recognition accuracy and real-time performance, demonstrating significant practical value.
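As an illustration of how self-attention can weight time steps in a sequence of per-frame skeleton features, the following is a minimal sketch in plain Python. It is not the authors' implementation. It assumes identity projections (queries, keys, and values are all the raw frame vectors) and omits the learned weights, multiple heads, positional encoding, and the BiLSTM stage that a full model would include. The function names and the toy feature values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Scaled dot-product self-attention with Q = K = V = frames.

    frames: list of T per-frame feature vectors, each of length d.
    Returns (outputs, weights), where weights[t][s] is how strongly
    time step t attends to time step s.
    """
    d = len(frames[0])
    scale = math.sqrt(d)
    outputs, weights = [], []
    for q in frames:
        # Similarity of this frame to every frame in the sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in frames]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of all frames: a globally informed representation.
        outputs.append(
            [sum(w[s] * frames[s][j] for s in range(len(frames))) for j in range(d)]
        )
    return outputs, weights


# Toy sequence (assumed values): 4 frames with 2 features each.
# A real input would be flattened 2D keypoint coordinates per frame.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
outputs, weights = self_attention(frames)
```

Each row of `weights` sums to 1, so every output frame is a convex combination of all frames in the sequence. This is what lets the attention stage relate distant time steps directly, whereas a recurrent layer passes information frame by frame.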
