Enhancing Spatial Awareness via Multi-Modal Fusion of CNN-Based Visual and Depth Features
DOI: https://doi.org/10.64229/gdz8tc37
Keywords: RGB-D Fusion, Semantic Segmentation, Depth-Aware Perception, Spatial Awareness in CNNs, Intermediate Feature Fusion
Abstract
Achieving accurate spatial awareness is a fundamental requirement for intelligent vision systems operating in complex and dynamic environments, with applications in autonomous navigation, robotic manipulation, and augmented reality. While Convolutional Neural Networks (CNNs) have demonstrated exceptional performance in tasks such as image classification and semantic segmentation, their inherently two-dimensional structure limits their ability to model and reason about three-dimensional spatial relationships. Specifically, CNNs are constrained by local receptive fields, a lack of explicit geometric context, and a dependence on appearance-based cues, which often result in an inaccurate understanding of object boundaries, depth discontinuities, and occlusions in real-world scenes. To address these limitations, this paper investigates the fusion of RGB visual data with depth information through a multi-modal intermediate fusion framework. We propose a lightweight experimental prototype that integrates parallel feature extraction pipelines for RGB images and corresponding depth maps, followed by feature-level fusion to enhance semantic and geometric understanding. Experiments are conducted on the NYU Depth V2 dataset, which provides densely labeled indoor scenes with aligned RGB and depth data. A comparative analysis is performed between a baseline CNN model trained solely on RGB input and a modified model utilizing intermediate fusion of RGB and depth features. Experimental results indicate that the inclusion of depth information significantly improves the model’s ability to delineate object boundaries, resolve foreground-background ambiguities, and maintain semantic coherence across varying spatial scales. The depth-enhanced model also demonstrates increased robustness to occlusions and illumination changes, highlighting the practical benefits of integrating geometric cues into visual perception pipelines. These findings provide empirical support for the theoretical premise that multi-modal feature fusion can substantially enhance spatial reasoning in CNN-based architectures. This study contributes both a conceptual understanding and an applied perspective on the design of multi-modal spatial perception systems. The results serve as a foundation for further development of robust, depth-aware visual perception models with applications in real-time robotics, autonomous systems, and immersive AR/VR environments.
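To make the fusion strategy described in the abstract concrete, the following PyTorch sketch shows one possible realization of intermediate (feature-level) RGB-D fusion: two parallel CNN encoders, one for the RGB image and one for the single-channel depth map, whose feature maps are concatenated and passed to a shared segmentation head. This is a minimal illustration under stated assumptions, not the authors' implementation; the layer widths, module names, and the 40-class NYU-style output are assumptions chosen for demonstration only.

# Minimal sketch of intermediate RGB-D fusion for semantic segmentation.
# Assumptions: layer sizes, module names, and the 40-class output are illustrative,
# not taken from the paper.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two 3x3 convolutions with batch normalization and ReLU (a typical encoder stage).
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

class IntermediateFusionSegNet(nn.Module):
    def __init__(self, num_classes=40):           # 40-class NYU-style labeling (assumption)
        super().__init__()
        self.rgb_enc = conv_block(3, 64)           # appearance (RGB) branch
        self.depth_enc = conv_block(1, 64)         # geometric (depth) branch
        self.fuse = conv_block(128, 128)           # fusion after channel-wise concatenation
        self.pool = nn.MaxPool2d(2)
        self.head = nn.Sequential(                 # lightweight decoder / per-pixel classifier
            conv_block(128, 128),
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
            nn.Conv2d(128, num_classes, 1),
        )

    def forward(self, rgb, depth):
        f_rgb = self.rgb_enc(rgb)                  # RGB feature maps
        f_d = self.depth_enc(depth)                # depth feature maps
        fused = self.fuse(torch.cat([f_rgb, f_d], dim=1))   # intermediate feature-level fusion
        return self.head(self.pool(fused))         # per-pixel class logits

# Usage example with an NYU Depth V2-sized RGB-D pair (480x640).
model = IntermediateFusionSegNet()
logits = model(torch.randn(1, 3, 480, 640), torch.randn(1, 1, 480, 640))
print(logits.shape)  # torch.Size([1, 40, 480, 640])

The key design point this sketch tries to capture is that fusion happens after each modality has been encoded separately, so depth contributes learned geometric features rather than raw pixel values, in line with the intermediate-fusion premise of the abstract.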
License
Copyright (c) 2025 Babar Hussain (Author)

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.