A Hybrid Framework for Temporal Object Behavior Analysis Using LSTM and Real-Time Detection
DOI: https://doi.org/10.64229/91wd9131
Keywords: Object Detection, Temporal Reasoning, YOLOv8, LSTM, Behavior Analysis, Video Surveillance, Human Activity Recognition
Abstract
Fall detection in surveillance videos is a critical task with widespread applications in healthcare and public safety. In this paper, we propose a hybrid framework that combines YOLOv8 for real-time object detection with an LSTM-based temporal reasoning module to classify human activities across video sequences. Our method captures both spatial appearance and temporal motion patterns, allowing it to distinguish subtle differences between activities such as standing, walking, lying, and falling. We evaluate our model on the UR Fall Detection (URFD) dataset and achieve a validation accuracy of 92.0% and a mean Average Precision (mAP@0.5) of 91.2%. Qualitative results further demonstrate robust predictions even in challenging scenarios with occlusions and lighting variations. An ablation study confirms that integrating temporal reasoning significantly boosts performance over frame-based detection alone. The proposed approach offers a promising solution for real-time fall detection in intelligent surveillance and assistive monitoring systems.
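To make the detector-plus-LSTM pipeline concrete, the sketch below shows one plausible way such a hybrid could be wired together in PyTorch: per-frame bounding-box features (as they might be extracted from YOLOv8 detections of a tracked person) are fed through an LSTM that classifies the whole clip into one of the four activities. This is a minimal illustrative sketch, not the authors' implementation; the class names, feature layout, sequence length, and hyperparameters are all assumptions.

```python
# Hypothetical sketch of a detector + LSTM fall-detection head (not the paper's code).
# Assumes per-frame detections have already been reduced to fixed-size feature
# vectors (normalized box coordinates plus confidence).
import torch
import torch.nn as nn

ACTIVITIES = ["standing", "walking", "lying", "falling"]  # assumed label set


class TemporalActivityClassifier(nn.Module):
    def __init__(self, feat_dim: int = 5, hidden_dim: int = 64, num_classes: int = 4):
        super().__init__()
        # LSTM aggregates per-frame detection features into a clip-level state.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers=1, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, feat_dim) sequence of per-frame detection features.
        _, (h_n, _) = self.lstm(x)   # h_n: (num_layers, batch, hidden_dim)
        return self.head(h_n[-1])    # logits over activity classes


def boxes_to_features(boxes_xywh: torch.Tensor, conf: torch.Tensor) -> torch.Tensor:
    # Turn one tracked person box per frame into a small feature vector
    # (x, y, w, h, confidence); coordinates assumed normalized to [0, 1].
    return torch.cat([boxes_xywh, conf.unsqueeze(-1)], dim=-1)


if __name__ == "__main__":
    torch.manual_seed(0)
    seq_len = 16                               # frames per clip (assumption)
    boxes = torch.rand(1, seq_len, 4)          # stand-in for YOLOv8 box outputs
    conf = torch.rand(1, seq_len)
    feats = boxes_to_features(boxes, conf)     # shape: (1, 16, 5)
    model = TemporalActivityClassifier()
    probs = model(feats).softmax(dim=-1)
    print(dict(zip(ACTIVITIES, probs.squeeze(0).tolist())))
```

Feeding compact box trajectories rather than raw pixels is one way to keep the temporal module lightweight enough to run in real time alongside the detector; richer per-frame embeddings could be substituted without changing the overall structure.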
License
Copyright (c) 2025 Subhan Uddin, Babar Hussain, Noman Ahmad, Adil Hussain, Sidra Fareed (Author)

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.