Publication Type : Conference Proceedings
Publisher : IEEE
Source : 2022 IEEE 7th International conference for Convergence in Technology (I2CT), 2022, pp. 1-7
Url : https://ieeexplore.ieee.org/document/9824488
Campus : Amritapuri
School : School of Computing
Year : 2022
Abstract : Extracting depth information from a single RGB image is a fundamental and challenging task in computer vision with wide-ranging applications. Because only one view is available, traditional multi-view geometry techniques do not apply, and the problem is instead addressed with deep learning. Existing methods based on convolutional neural networks produce inconsistent and blurry results because they fail to capture long-range dependencies. Motivated by the recent success of Transformer networks in computer vision, which can process information both locally and globally, we propose a novel architecture named Focal-WNet. The architecture consists of two separate encoders and a single decoder, and is designed to learn most monocular depth cues, such as relative scale, contrast differences, and texture gradients. We incorporate focal self-attention in place of vanilla self-attention to reduce the computational complexity of the network. Alongside the focal Transformer layers, we employ a convolutional architecture to learn depth cues that a Transformer alone cannot capture well; cues such as occlusion require a local receptive field and are easier for a convolutional network to learn. Extensive experiments show that the proposed Focal-WNet achieves competitive results on two challenging datasets.
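Illustrative Sketch : The dual-encoder, single-decoder design described in the abstract can be summarized in a few lines of PyTorch. This is a minimal sketch under stated assumptions, not the authors' implementation: all names (FocalWNetSketch, conv_encoder, attn_encoder) are hypothetical, standard multi-head self-attention stands in for the paper's focal self-attention, and the real Focal-WNet is substantially deeper.

    # Minimal sketch of a two-encoder / one-decoder depth network.
    # Hypothetical names; standard self-attention stands in for focal
    # self-attention, which the paper uses to cut computational cost.
    import torch
    import torch.nn as nn

    class FocalWNetSketch(nn.Module):
        def __init__(self, dim=64, patch=16):
            super().__init__()
            # Convolutional encoder: local cues (e.g., occlusion boundaries).
            self.conv_encoder = nn.Sequential(
                nn.Conv2d(3, dim, 3, stride=patch // 4, padding=1), nn.ReLU(),
                nn.Conv2d(dim, dim, 3, stride=4, padding=1), nn.ReLU(),
            )
            # Attention encoder: global cues (e.g., relative scale, contrast).
            self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
            self.attn_encoder = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
                num_layers=2,
            )
            # Single decoder fuses both streams and upsamples to a depth map.
            self.decoder = nn.Sequential(
                nn.Conv2d(2 * dim, dim, 3, padding=1), nn.ReLU(),
                nn.Upsample(scale_factor=patch, mode="bilinear", align_corners=False),
                nn.Conv2d(dim, 1, 3, padding=1),
            )

        def forward(self, x):
            b = x.shape[0]
            conv_feat = self.conv_encoder(x)                  # (B, dim, H/16, W/16)
            tokens = self.patchify(x)                         # (B, dim, H/16, W/16)
            gh, gw = tokens.shape[-2:]
            seq = tokens.flatten(2).transpose(1, 2)           # (B, N, dim)
            attn_feat = self.attn_encoder(seq)
            attn_feat = attn_feat.transpose(1, 2).reshape(b, -1, gh, gw)
            fused = torch.cat([conv_feat, attn_feat], dim=1)  # fuse both streams
            return self.decoder(fused)                        # (B, 1, H, W) depth

    depth = FocalWNetSketch()(torch.randn(1, 3, 224, 224))
    print(depth.shape)  # torch.Size([1, 1, 224, 224])

The two encoders see the same image but at different effective receptive fields: the convolutional stream preserves fine local structure while the attention stream relates distant regions, and the decoder combines both before predicting a per-pixel depth map.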
Cite this Research Publication : G. Manimaran and J. Swaminathan, "Focal-WNet: An Architecture Unifying Convolution and Attention for Depth Estimation," 2022 IEEE 7th International conference for Convergence in Technology (I2CT), 2022, pp. 1-7, doi: 10.1109/I2CT54291.2022.9824488.