Visual representation learning is ubiquitous in various real-world applications, including visual comprehension, video understanding, multi-modal analysis, human-computer interaction, and urban computing. Due to the emergence of huge amounts of multi-modal heterogeneous spatial/temporal/spatial-temporal data in the big data era, the lack of interpretability, robustness, and out-of-distribution generalization are becoming the challenges of the existing visual models. The majority of the existing methods tend to fit the original data/variable distributions and ignore the essential causal relations behind the multi-modal knowledge, which lacks unified guidance and analysis about why modern visual representation learning methods easily collapse into data bias and have limited generalization and cognitive abilities. Inspired by the strong inference ability of human-level agents, recent years have therefore witnessed great effort in developing causal reasoning paradigms to realize robust representation and model learning with good cognitive ability. In this paper, we conduct a comprehensive review of existing causal reasoning methods for visual representation learning, covering fundamental theories, models, and datasets. The limitations of current methods and datasets are also discussed. Moreover, we propose some prospective challenges, opportunities, and future research directions for benchmarking causal reasoning algorithms in visual representation learning. This paper aims to provide a comprehensive overview of this emerging field, attract attention, encourage discussions, bring to the forefront the urgency of developing novel causal reasoning methods, publicly available benchmarks, and consensus-building standards for reliable visual representation learning and related real-world applications more efficiently.
As a fundamental task in computer vision, visual object tracking has received much attention in recent years. Most studies focus on short-term visual tracking which addresses shorter videos and always-visible targets. However, long-term visual tracking is much closer to practical applications with more complicated challenges. There exists a longer duration such as minute-level or even hour-level in the long-term tracking task, and the task also needs to handle more frequent target disappearance and reappearance. In this paper, we provide a thorough review of long-term tracking, summarizing long-term tracking algorithms from two perspectives: framework architectures and utilization of intermediate tracking results. Then we provide a detailed description of existing benchmarks and corresponding evaluation protocols. Furthermore, we conduct extensive experiments and analyse the performance of trackers on six benchmarks: VOTLT2018, VOTLT2019 (2020/2021), OxUvA, LaSOT, TLP and the long-term subset of VTUAV-V. Finally, we discuss the future prospects from multiple perspectives, including algorithm design and benchmark construction. To our knowledge, this is the first comprehensive survey for long-term visual object tracking. The relevant content is available at
We present the first comprehensive video polyp segmentation (VPS) study in the deep learning era. Over the years, developments in VPS are not moving forward with ease due to the lack of a large-scale dataset with fine-grained segmentation annotations. To address this issue, we first introduce a high-quality frame-by-frame annotated VPS dataset, named SUN-SEG, which contains 158690 colonoscopy video frames from the well-known SUN-database. We provide additional annotation covering diverse types, i.e., attribute, object mask, boundary, scribble, and polygon. Second, we design a simple but efficient baseline, named PNS+, which consists of a global encoder, a local encoder, and normalized self-attention (NS) blocks. The global and local encoders receive an anchor frame and multiple successive frames to extract long-term and short-term spatial-temporal representations, which are then progressively refined by two NS blocks. Extensive experiments show that PNS+ achieves the best performance and real-time inference speed (170 fps), making it a promising solution for the VPS task. Third, we extensively evaluate 13 representative polyp/object segmentation models on our SUN-SEG dataset and provide attribute-based comparisons. Finally, we discuss several open issues and suggest possible research directions for the VPS community. Our project and dataset are publicly available at
A panoptic driving perception system is an essential part of autonomous driving. A high-precision and real-time perception system can assist the vehicle in making reasonable decisions while driving. We present a panoptic driving perception network (you only look once for panoptic (YOLOP)) to perform traffic object detection, drivable area segmentation, and lane detection simultaneously. It is composed of one encoder for feature extraction and three decoders to handle the specific tasks. Our model performs extremely well on the challenging BDD100K dataset, achieving state-of-the-art on all three tasks in terms of accuracy and speed. Besides, we verify the effectiveness of our multi-task learning model for joint training via ablative studies. To our best knowledge, this is the first work that can process these three visual perception tasks simultaneously in real-time on an embedded device Jetson TX2(23 FPS), and maintain excellent accuracy. To facilitate further research, the source codes and pre-trained models are released at
Glaucoma is a prevalent cause of blindness worldwide. If not treated promptly, it can cause vision and quality of life to deteriorate. According to statistics, glaucoma affects approximately 65 million individuals globally. Fundus image segmentation depends on the optic disc (OD) and optic cup (OC). This paper proposes a computational model to segment and classify retinal fundus images for glaucoma detection. Different data augmentation techniques were applied to prevent overfitting while employing several data pre-processing approaches to improve the image quality and achieve high accuracy. The segmentation models are based on an attention U-Net with three separate convolutional neural networks (CNNs) backbones: Inception-v3, visual geometry group 19 (VGG19), and residual neural network 50 (ResNet50). The classification models also employ a modified version of the above three CNN architectures. Using the RIM-ONE dataset, the attention U-Net with the ResNet50 model as the encoder backbone, achieved the best accuracy of 99.58% in segmenting OD. The Inception-v3 model had the highest accuracy of 98.79% for glaucoma classification among the evaluated segmentation, followed by the modified classification architectures.