COWO: towards real-time spatiotemporal action localization in videos

Yang Yi (College of Electrical Engineering and Control Science, Nanjing Tech University, Nanjing, China)
Yang Sun (College of Electrical Engineering and Control Science, Nanjing Tech University, Nanjing, China)
Saimei Yuan (College of Electrical Engineering and Control Science, Nanjing Tech University, Nanjing, China)
Yiji Zhu (College of Electrical Engineering and Control Science, Nanjing Tech University, Nanjing, China)
Mengyi Zhang (College of Electrical Engineering and Control Science, Nanjing Tech University, Nanjing, China)
Wenjun Zhu (College of Electrical Engineering and Control Science, Nanjing Tech University, Nanjing, China)

Assembly Automation

ISSN: 0144-5154

Article publication date: 18 January 2022

Issue publication date: 24 March 2022

Abstract

Purpose

The purpose of this paper is to provide a fast and accurate network for spatiotemporal action localization in videos. The network detects human actions in both space and time simultaneously and in real time, which makes it applicable to real-world scenarios such as safety monitoring and collaborative assembly.

Design/methodology/approach

This paper designs an end-to-end deep learning network called collaborator only watch once (COWO). COWO recognizes ongoing human activities in real time with enhanced accuracy. COWO inherits the architecture of you only watch once (YOWO), the best-performing network for online action localization to date, with three major structural modifications. First, COWO enhances intraclass compactness and enlarges interclass separability at the feature level: a new correlation channel fusion and attention mechanism is designed based on the Pearson correlation coefficient, and a corresponding correction loss function is designed that minimizes the distance between samples of the same class. Second, a probabilistic K-means clustering technique is used to select the initial seed points, the idea being that the initial distance between cluster centers should be as large as possible. Third, the CIoU regression loss function is applied instead of the Smooth L1 loss function to help the model converge stably.
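The abstract does not give the exact formulation of the correlation channel fusion and attention mechanism, so the following is only a minimal sketch of one plausible reading: the Pearson correlation coefficient between channel descriptors is used as the inter-channel similarity that drives a channel attention map (YOWO's fusion module uses a Gram-matrix similarity in the same role). The module name PearsonChannelAttention, the tensor shapes and the residual fusion are illustrative assumptions, not the authors' published design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PearsonChannelAttention(nn.Module):
    """Channel attention in which inter-channel similarity is the
    Pearson correlation coefficient (illustrative sketch only)."""

    def forward(self, x):
        b, c, h, w = x.shape
        feat = x.view(b, c, -1)                       # (B, C, N)
        feat = feat - feat.mean(dim=2, keepdim=True)  # center each channel
        norm = feat.norm(dim=2, keepdim=True).clamp(min=1e-6)
        feat = feat / norm                            # unit-norm channels
        corr = torch.bmm(feat, feat.transpose(1, 2))  # (B, C, C) Pearson matrix
        attn = F.softmax(corr, dim=-1)                # channel attention weights
        out = torch.bmm(attn, x.view(b, c, -1)).view(b, c, h, w)
        return out + x                                # residual fusion
```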
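The probabilistic seeding described above matches the well-known K-means++ initialization strategy: each new cluster center is sampled with probability proportional to its squared distance from the nearest center already chosen, so the initial centers end up spread as far apart as possible. The sketch below is a generic NumPy implementation; applying it to, for example, anchor-box clustering is an assumption about how COWO uses it.

```python
import numpy as np

def kmeanspp_seeds(points, k, rng=None):
    """Pick k initial seeds from points (N, D): each new seed is sampled
    with probability proportional to its squared distance from the
    nearest seed chosen so far."""
    rng = np.random.default_rng(rng)
    seeds = [points[rng.integers(len(points))]]  # first seed: uniform
    for _ in range(k - 1):
        # squared distance of every point to its nearest chosen seed
        d2 = np.min([((points - s) ** 2).sum(axis=1) for s in seeds], axis=0)
        probs = d2 / d2.sum()                    # far points more likely
        seeds.append(points[rng.choice(len(points), p=probs)])
    return np.stack(seeds)
```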
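The CIoU regression loss is a standard published formulation: 1 minus the IoU, penalized by the normalized distance between box centers and by an aspect-ratio consistency term. The sketch below implements that standard formulation for axis-aligned boxes in (x1, y1, x2, y2) format; where exactly it attaches in COWO's detection head is not specified in the abstract.

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """Standard CIoU loss for boxes in (x1, y1, x2, y2) format."""
    # intersection area
    ix1 = torch.max(pred[..., 0], target[..., 0])
    iy1 = torch.max(pred[..., 1], target[..., 1])
    ix2 = torch.min(pred[..., 2], target[..., 2])
    iy2 = torch.min(pred[..., 3], target[..., 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    # union area and IoU
    wp = (pred[..., 2] - pred[..., 0]).clamp(min=0)
    hp = (pred[..., 3] - pred[..., 1]).clamp(min=0)
    wt = (target[..., 2] - target[..., 0]).clamp(min=0)
    ht = (target[..., 3] - target[..., 1]).clamp(min=0)
    union = wp * hp + wt * ht - inter + eps
    iou = inter / union
    # squared distance between box centers
    rho2 = ((pred[..., 0] + pred[..., 2] - target[..., 0] - target[..., 2]) ** 2
            + (pred[..., 1] + pred[..., 3] - target[..., 1] - target[..., 3]) ** 2) / 4
    # squared diagonal of the smallest enclosing box
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])
    c2 = cw ** 2 + ch ** 2 + eps
    # aspect-ratio consistency term
    v = (4 / math.pi ** 2) * (torch.atan(wt / (ht + eps))
                              - torch.atan(wp / (hp + eps))) ** 2
    with torch.no_grad():
        alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v
```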

Findings

COWO outperforms the original YOWO, improving frame mAP by 3% and 2.1% while running at 35.12 fps. Compared with the two-stream, T-CNN and C3D methods, the improvement is about 5% and 14.5% on the J-HMDB-21, UCF101-24 and AGOT data sets.

Originality/value

COWO offers greater flexibility for assembly scenarios, as it perceives spatiotemporal human actions in real time. It contributes to many real-world scenarios such as safety monitoring and collaborative assembly.

Acknowledgements

This work was financially supported by the Natural Science Foundation of Jiangsu Province, China (BK20180693), the National Natural Science Foundation of China (61803198) and the Natural Science Foundation of the Higher Education Institutions of Jiangsu Province, China (21KJB520007).

Citation

Yi, Y., Sun, Y., Yuan, S., Zhu, Y., Zhang, M. and Zhu, W. (2022), "COWO: towards real-time spatiotemporal action localization in videos", Assembly Automation, Vol. 42 No. 2, pp. 202-208. https://doi.org/10.1108/AA-07-2021-0098

Publisher

Emerald Publishing Limited

Copyright © 2021, Emerald Publishing Limited