
-
Optimal Transport with Arbitrary Prior for Dynamic Resolution Network Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-05-26
Zhizhong Zhang, Shujun Li, Chenyang Zhang, Lizhuang Ma, Xin Tan, Yuan Xie
Dynamic resolution networks have proven crucial for reducing computational redundancy by automatically assigning a satisfactory resolution to each input image. However, resolution choices often collapse: prior works tend to assign images to the resolution routes whose computational cost is closest to the required FLOPs. In this paper, we propose a novel optimal transport…
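The collapse issue above is what a balanced-assignment view addresses. As a hedged illustration (not the paper's algorithm; all names are hypothetical), entropic optimal transport via Sinkhorn iterations can softly assign images to resolution routes while enforcing a prescribed route-usage marginal:

```python
# A minimal sketch (not the paper's algorithm) of balanced assignment via
# Sinkhorn-style optimal transport: images are softly assigned to resolution
# routes under a marginal constraint that prevents route collapse.
import numpy as np

def sinkhorn_assign(cost, route_capacity, eps=0.05, n_iters=200):
    """cost: (n_images, n_routes) matrix, e.g., negative routing logits.
    route_capacity: target fraction of images per route (sums to 1)."""
    K = np.exp(-cost / eps)                          # Gibbs kernel
    a = np.full(cost.shape[0], 1.0 / cost.shape[0])  # uniform image marginal
    b = np.asarray(route_capacity, dtype=float)
    u = np.ones_like(a)
    for _ in range(n_iters):                         # alternating scaling updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]               # transport plan (soft assignment)

rng = np.random.default_rng(0)
cost = rng.random((8, 3))                            # 8 images, 3 resolution routes
plan = sinkhorn_assign(cost, route_capacity=[0.5, 0.3, 0.2])
print(plan.sum(axis=0))                              # ≈ [0.5, 0.3, 0.2]: no collapse
```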
-
DocScanner: Robust Document Image Rectification with Progressive Learning Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-05-26
Hao Feng, Wengang Zhou, Jiajun Deng, Qi Tian, Houqiang Li
Compared with flatbed scanners, portable smartphones provide more convenience for physical document digitization. However, such digitized documents are often distorted due to uncontrolled physical deformations, camera positions, and illumination variations. To this end, we present DocScanner, a novel framework for document image rectification. Different from existing solutions, DocScanner addresses…
-
AutoViT: Achieving Real-Time Vision Transformers on Mobile via Latency-aware Coarse-to-Fine Search Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-05-26
Zhenglun Kong, Dongkuan Xu, Zhengang Li, Peiyan Dong, Hao Tang, Yanzhi Wang, Subhabrata Mukherjee
Despite their impressive performance on various tasks, vision transformers (ViTs) are heavy for mobile vision applications. Recent works have proposed combining the strengths of ViTs and convolutional neural networks (CNNs) to build lightweight networks. Still, these approaches rely on hand-designed architectures with a pre-determined number of parameters. In this work, we address the challenge of…
-
Lightweight Structure-Aware Attention for Visual Understanding Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-05-26
Heeseung Kwon, Francisco M. Castro, Manuel J. Marin-Jimenez, Nicolas Guil, Karteek Alahari
The attention operator has been widely used as a basic building block in visual understanding since it provides some flexibility through its adjustable kernels. However, this operator suffers from inherent limitations: (1) the attention kernel is not discriminative enough, resulting in high redundancy, and (2) its computation and memory complexity is quadratic in the sequence length. In this paper, we propose…
-
PointOBB-v3: Expanding Performance Boundaries of Single Point-Supervised Oriented Object Detection Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-05-25
Peiyuan Zhang, Junwei Luo, Xue Yang, Yi Yu, Qingyun Li, Yue Zhou, Xiaosong Jia, Xudong Lu, Jingdong Chen, Xiang Li, Junchi Yan, Yansheng Li
With the growing demand for oriented object detection (OOD), recent studies on point-supervised OOD have attracted significant interest. In this paper, we propose PointOBB-v3, a stronger single point-supervised OOD framework. Compared to existing methods, it generates pseudo rotated boxes without additional priors and incorporates support for the end-to-end paradigm. PointOBB-v3 functions by integrating…
-
Modeling Scattering Effect for Under-Display Camera Image Restoration Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-05-25
Binbin Song, Jiantao Zhou, Xiangyu Chen, Shuning Xu
The under-display camera (UDC) technology furnishes users with an uninterrupted full-screen viewing experience, eliminating the need for notches or punch holes. However, the translucent properties of the display lead to substantial degradation in UDC images. This work addresses the challenge of restoring UDC images by specifically targeting the scattering effect induced by the display. We explicitly…
-
Supplementary Prompt Learning for Vision-Language Models Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-05-24
Rongfei Zeng, Zhipeng Yang, Ruiyun Yu, Yonggang Zhang
Pre-trained vision-language models like CLIP have shown remarkable capabilities across various downstream tasks with well-tuned prompts. Advanced methods tune prompts by optimizing context while keeping the class name fixed, implicitly assuming that the class names in prompts are accurate and not missing. However, this assumption may be violated in numerous real-world scenarios, leading to potential…
-
Local Concept Embeddings for Analysis of Concept Distributions in Vision DNN Feature Spaces Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-05-24
Georgii Mikriukov, Gesina Schwalbe, Korinna Bade
Insights into the learned latent representations are imperative for verifying deep neural networks (DNNs) in critical computer vision (CV) tasks. Therefore, state-of-the-art supervised Concept-based eXplainable Artificial Intelligence (C-XAI) methods associate each user-defined concept, such as “car”, with a single vector in the DNN latent space (concept embedding vector). In the case of concept segmentation…
-
MIM4D: Masked Modeling with Multi-View Video for Autonomous Driving Representation Learning Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-05-24
Jialv Zou, Bencheng Liao, Qian Zhang, Wenyu Liu, Xinggang Wang
Learning robust and scalable visual representations from massive multi-view video data remains a challenge in computer vision and autonomous driving. Existing pre-training methods either rely on expensive supervised learning with 3D annotations, limiting the scalability, or focus on single-frame or monocular inputs, neglecting the temporal information, which is fundamental for the ultimate application…
-
RigNet++: Semantic Assisted Repetitive Image Guided Network for Depth Completion Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-05-23
Zhiqiang Yan, Xiang Li, Le Hui, Zhenyu Zhang, Jun Li, Jian Yang
Depth completion aims to recover dense depth maps from sparse ones, where color images are often used to facilitate this task. Recent depth methods primarily focus on image-guided learning frameworks. However, blurry guidance in the image and unclear structure in the depth still impede their performance. To tackle these challenges, we explore a repetitive design in our image-guided network to gradually…
-
A Generalized Contour Vibration Model for Building Extraction Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-05-22
Chunyan Xu, Shuaizhen Yao, Ziqiang Xu, Zhen Cui, Jian Yang
With the recent progress of deep learning, classic active contour models (ACMs) have become a promising solution for contour-based object extraction. Inspired by the wave vibration theory in physics, we propose a Generalized Contour Vibration Model (G-CVM) that inherits the force and motion principle of contour waves to automatically estimate building contours. The contour estimation problems…
-
Simplified Concrete Dropout - Improving the Generation of Attribution Masks for Fine-grained Classification Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-05-22
Dimitri Korsch, Maha Shadaydeh, Joachim Denzler
In fine-grained classification, which classifies images into subcategories within a common broader category, it is crucial to have precise visual explanations of the classification model’s decision. While commonly used attention- or gradient-based methods deliver explanations that are either too coarse or too noisy to reliably highlight subtle visual differences, perturbation-based methods…
-
Spatial-Temporal Transformer for Single RGB-D Camera Synchronous Tracking and Reconstruction of Non-rigid Dynamic Objects Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-05-21
Xiaofei Liu, Zhengkun Yi, Xinyu Wu, Wanfeng Shang
We propose a simple and effective method that views the problem of single RGB-D camera synchronous tracking and reconstruction of non-rigid dynamic objects as an aligned sequential point cloud prediction problem. Our method does not require additional data transformations (truncated signed distance function or deformation graphs, etc.), alignment constraints (handcrafted features or optical flow, etc.)…
-
Semantic-Aligned Learning with Collaborative Refinement for Unsupervised VI-ReID Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-05-19
De Cheng, Lingfeng He, Nannan Wang, Dingwen Zhang, Xinbo Gao
Unsupervised visible-infrared person re-identification (USL-VI-ReID) seeks to match pedestrian images of the same individual across different modalities without human annotations for model learning. Previous methods unify pseudo-labels of cross-modality images through label association algorithms and then design a contrastive learning framework for global feature learning. However, these methods overlook…
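For intuition, the label-association step can be illustrated with a standard Hungarian matching of cluster centroids across modalities (a generic sketch, not the paper's specific algorithm; names and shapes are hypothetical):

```python
# A hedged illustration (not the paper's method) of cross-modality label
# association: match visible and infrared cluster centroids one-to-one by
# minimizing pairwise distances with the Hungarian algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
vis_centroids = rng.normal(size=(6, 128))   # pseudo-label cluster centers (visible)
ir_centroids = rng.normal(size=(6, 128))    # pseudo-label cluster centers (infrared)

cost = cdist(vis_centroids, ir_centroids, metric="cosine")
vis_idx, ir_idx = linear_sum_assignment(cost)   # optimal one-to-one matching
unified = {int(v): int(i) for v, i in zip(vis_idx, ir_idx)}
print(unified)   # visible cluster -> associated infrared cluster
```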
-
Generalized Closed-Form Formulae for Feature-Based Subpixel Alignment in Patch-Based Matching Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-05-19
Laurent Valentin Jospin, Hamid Laga, Farid Boussaid, Mohammed Bennamoun
Patch-based matching measures the disparity between pixels in a source and a target image and is at the core of various methods in computer vision. When the subpixel disparity between the source and target images is required, the cost function or the target image has to be interpolated. While cost-based interpolation is easier to implement, multiple works have shown that image-based…
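For context, the classic cost-based refinement works like this: fit a parabola to the matching cost at the best integer disparity and its two neighbours, and read off the subpixel offset in closed form (a textbook baseline, not the paper's generalized formulae):

```python
# Classic parabola-fit subpixel refinement over a 1D cost curve.
import numpy as np

def subpixel_parabola(costs, d_best):
    """costs: 1D array of matching costs indexed by integer disparity."""
    c_m, c_0, c_p = costs[d_best - 1], costs[d_best], costs[d_best + 1]
    denom = c_m - 2.0 * c_0 + c_p
    offset = 0.0 if denom == 0 else 0.5 * (c_m - c_p) / denom
    return d_best + offset    # subpixel disparity estimate

costs = np.array([9.0, 4.0, 1.2, 2.5, 7.0])   # toy cost curve, minimum near d=2
print(subpixel_parabola(costs, d_best=2))      # ~2.18
```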
-
Learning to Deblur Polarized Images Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-05-19
Chu Zhou, Minggui Teng, Xinyu Zhou, Chao Xu, Imari Sato, Boxin Shi
A polarization camera can capture four linearly polarized images with different polarizer angles in a single shot, which is useful in polarization-based vision applications since the degree of linear polarization (DoLP) and the angle of linear polarization (AoLP) can be computed directly from the captured polarized images. However, since the on-chip micro-polarizers block part of the light so that the…
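The DoLP/AoLP computation mentioned above follows directly from the linear Stokes parameters; a minimal sketch using the textbook relations (illustrative, not paper-specific code):

```python
# DoLP and AoLP from four polarizer-angle captures via linear Stokes parameters.
import numpy as np

def dolp_aolp(i0, i45, i90, i135, eps=1e-8):
    s0 = 0.5 * (i0 + i45 + i90 + i135)   # total intensity
    s1 = i0 - i90                        # 0/90 degree difference
    s2 = i45 - i135                      # 45/135 degree difference
    dolp = np.sqrt(s1**2 + s2**2) / (s0 + eps)
    aolp = 0.5 * np.arctan2(s2, s1)      # in [-pi/2, pi/2]
    return dolp, aolp

rng = np.random.default_rng(0)
imgs = rng.random((4, 64, 64))           # toy polarized captures
dolp, aolp = dolp_aolp(*imgs)
print(dolp.mean(), aolp.mean())
```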
-
SimZSL: Zero-Shot Learning Beyond a Pre-defined Semantic Embedding Space Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-05-17
Mina Ghadimi Atigh, Stephanie Nargang, Martin Keller-Ressel, Pascal Mettes
Zero-shot recognition is centered around learning representations to transfer knowledge from seen to unseen classes. Where foundational approaches perform the transfer with semantic embedding spaces, e.g., from attributes or word vectors, the current state-of-the-art relies on prompting pre-trained vision-language models to obtain class embeddings. Whether zero-shot learning is performed with attributes…
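Either way, inference reduces to nearest-class-embedding search; a minimal illustrative sketch (shapes and names are hypothetical):

```python
# Zero-shot classification sketch: score each image feature against class
# embeddings (from attributes, word vectors, or a VLM text encoder) and pick
# the nearest unseen class.
import numpy as np

def zero_shot_predict(image_feats, class_embeds):
    """image_feats: (n, d); class_embeds: (c, d) for *unseen* classes."""
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    cls = class_embeds / np.linalg.norm(class_embeds, axis=1, keepdims=True)
    return (img @ cls.T).argmax(axis=1)   # cosine similarity -> class index

rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 64))          # projected image features
embeds = rng.normal(size=(10, 64))        # 10 unseen-class embeddings
print(zero_shot_predict(feats, embeds))
```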
-
High-Fidelity Image Inpainting with Multimodal Guided GAN Inversion Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-05-16
Libo Zhang, Yongsheng Yu, Jiali Yao, Heng Fan
Generative Adversarial Network (GAN) inversion has demonstrated excellent performance in image inpainting, which aims to restore lost or damaged image texture using the unmasked content. Previous GAN inversion-based methods usually utilize well-trained GAN models as effective priors to generate realistic regions for missing holes. Despite their excellence, they ignore a hard constraint that the unmasked…
-
HumanLiff: Layer-wise 3D Human Diffusion Model Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-05-16
Shoukang Hu, Fangzhou Hong, Tao Hu, Liang Pan, Haiyi Mei, Weiye Xiao, Lei Yang, Ziwei Liu
3D human generation from 2D images has achieved remarkable progress through the synergistic utilization of neural rendering and generative models. Existing 3D human generative models mainly generate a clothed 3D human as an inseparable 3D model in a single pass, while rarely considering the layer-wise nature of a clothed human body, which often consists of the human body and various clothes such as…
-
Defending Against Adversarial Examples Via Modeling Adversarial Noise Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-05-14
Dawei Zhou, Nannan Wang, Bo Han, Tongliang Liu, Xinbo Gao
Adversarial examples have become a major threat to the reliable application of deep learning models, and this issue has driven the development of adversarial defenses. Adversarial noise contains well-generalizing and misleading features that can maliciously flip predicted labels. Motivated by this, we study modeling adversarial noise for defending against adversarial examples…
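As background for the noise being modeled, a standard FGSM perturbation (Goodfellow et al.) shows how a small, well-crafted noise flips predictions; this sketch illustrates the attack side only, not the defense proposed in the paper:

```python
# FGSM: one-step sign-of-gradient perturbation bounded by eps (background
# illustration of adversarial noise, not the paper's defense).
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=8 / 255):
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad, = torch.autograd.grad(loss, x)
    return (x + eps * grad.sign()).clamp(0, 1).detach()  # label-flipping noise

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x, y = torch.rand(4, 3, 32, 32), torch.randint(0, 10, (4,))
x_adv = fgsm(model, x, y)
print((x_adv - x).abs().max())   # perturbation bounded by eps
```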
-
IPAD: Iterative, Parallel, and Diffusion-Based Network for Scene Text Recognition Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-05-14
Xiaomeng Yang, Zhi Qiao, Yu Zhou
Scene text recognition has attracted increasing attention due to its diverse applications. Most state-of-the-art methods adopt an encoder-decoder framework with an attention mechanism, autoregressively generating text from left to right. Despite the convincing performance, this sequential decoding strategy constrains the inference speed. Conversely, non-autoregressive models provide faster…
-
An Information Theory-Inspired Strategy for Automated Network Pruning Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-05-12
Xiawu Zheng, Yuexiao Ma, Teng Xi, Gang Zhang, Errui Ding, Yuchao Li, Jie Chen, Yonghong Tian, Rongrong Ji
Despite their superior performance on many computer vision tasks, deep neural networks demand high computing power and a large memory footprint. Most existing network pruning methods require laborious human effort and prohibitive computational resources, especially when the constraints change. This practically limits the application of model compression when the model needs to be deployed on a wide…
-
Exploring Bidirectional Bounds for Minimax-Training of Energy-Based Models Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-05-13
Cong Geng, Jia Wang, Li Chen, Zhiyong Gao, Jes Frellsen, Søren Hauberg
Energy-based models (EBMs) estimate unnormalized densities in an elegant framework, but they are generally difficult to train. Recent work has linked EBMs to generative adversarial networks, by noting that they can be trained through a minimax game using a variational lower bound. To avoid the instabilities caused by minimizing a lower bound, we propose to instead work with bidirectional bounds, meaning…
-
Bamboo: Building Mega-Scale Vision Dataset Continually with Human–Machine Synergy Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-05-13
Yuanhan Zhang, Qinghong Sun, Yichun Zhou, Zexin He, Zhenfei Yin, Kun Wang, Lu Sheng, Yu Qiao, Jing Shao, Ziwei Liu
Large-scale datasets play a vital role in computer vision, but current datasets are annotated blindly, without differentiating between samples, making data collection inefficient and unscalable. The open question is how to build a mega-scale dataset actively. Although advanced active learning algorithms might be the answer, we experimentally found that they fall short in the realistic annotation scenario…
-
A Norm Regularization Training Strategy for Robust Image Quality Assessment Models Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-05-12
Yujia Liu, Chenxi Yang, Dingquan Li, Tingting Jiang, Tiejun Huang
Image Quality Assessment (IQA) models predict the quality score of input images. They can be categorized into Full-Reference (FR-) and No-Reference (NR-) IQA models based on the availability of reference images. These models are essential for performance evaluation and optimization guidance in the media industry. However, researchers have observed that introducing imperceptible perturbations to input…
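One common instantiation of norm regularization, sketched below as an assumption rather than the paper's exact strategy, penalizes the input-gradient norm of the predicted quality score so small perturbations cannot swing the prediction:

```python
# Gradient-norm regularized training sketch for a robust IQA model
# (a generic technique; not necessarily the paper's exact strategy).
import torch

def norm_regularized_loss(model, x, mos, lam=0.1):
    x = x.clone().requires_grad_(True)
    pred = model(x).squeeze(-1)
    fit = torch.nn.functional.mse_loss(pred, mos)      # fidelity to MOS labels
    grad, = torch.autograd.grad(pred.sum(), x, create_graph=True)
    smooth = grad.flatten(1).norm(dim=1).mean()        # input-gradient norm penalty
    return fit + lam * smooth

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 1))
x, mos = torch.rand(4, 3, 32, 32), torch.rand(4)
loss = norm_regularized_loss(model, x, mos)
loss.backward()   # trains toward accurate *and* perturbation-stable predictions
```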
-
CLIMS++: Cross Language Image Matching with Automatic Context Discovery for Weakly Supervised Semantic Segmentation Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-05-09
Jinheng Xie, Songhe Deng, Xianxu Hou, Zhaochuan Luo, Linlin Shen, Yawen Huang, Yefeng Zheng, Mike Zheng Shou
While promising results have been achieved in weakly-supervised semantic segmentation (WSSS), limited supervision from image-level tags inevitably induces discriminative reliance and spurious relations between target classes and background regions. Thus, the Class Activation Map (CAM) usually tends to activate discriminative object regions and falsely includes many class-related background regions. Without…
-
Autoregressive Temporal Modeling for Advanced Tracking-by-Diffusion Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-05-09
Pha Nguyen, Rishi Madhok, Bhiksha Raj, Khoa Luu
Object tracking is a widely studied computer vision task with applications in video and instance analysis. While paradigms such as tracking-by-regression, -detection, and -attention have advanced the field, generative modeling offers new potential. Although some studies explore the generative process in instance-based understanding tasks, they rely on prediction refinement in the coordinate space rather than…
-
HiLM-D: Enhancing MLLMs with Multi-scale High-Resolution Details for Autonomous Driving Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-05-07
Xinpeng Ding, Jianhua Han, Hang Xu, Wei Zhang, Xiaomeng Li
Recent efforts to use natural language for interpretable driving focus mainly on planning, neglecting perception tasks. In this paper, we address this gap by introducing ROLISP (Risk Object Localization and Intention and Suggestion Prediction), which targets interpretable risk object detection and suggestion prediction for ego-car motions. Accurate ROLISP implementation requires extensive reasoning to identify…
-
BackdoorBench: A Comprehensive Benchmark and Analysis of Backdoor Learning Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-05-06
Baoyuan Wu, Hongrui Chen, Mingda Zhang, Zihao Zhu, Shaokui Wei, Danni Yuan, Mingli Zhu, Ruotong Wang, Li Liu, Chao Shen
In recent years, backdoor learning has attracted increasing attention due to its effectiveness in investigating the adversarial vulnerability of artificial intelligence (AI) systems. Several seminal backdoor attack and defense algorithms have been developed, forming an increasingly fierce arms race. However, since backdoor learning involves various factors in different stages of an AI system (e.g.…
-
Paragraph-to-Image Generation with Information-Enriched Diffusion Model Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-05-05
Weijia Wu, Zhuang Li, Yefei He, Mike Zheng Shou, Chunhua Shen, Lele Cheng, Yan Li, Tingting Gao, Di Zhang
Text-to-image models have recently experienced rapid development, achieving astonishing performance in terms of fidelity and textual alignment capabilities. However, given a long paragraph (up to 512 words), these generation models still struggle to achieve strong alignment and are unable to generate images depicting complex scenes. In this paper, we introduce an information-enriched diffusion model…
-
P2Object: Single Point Supervised Object Detection and Instance Segmentation Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-05-03
Pengfei Chen, Xuehui Yu, Xumeng Han, Kuiran Wang, Guorong Li, Lingxi Xie, Zhenjun Han, Jianbin Jiao
Object recognition using single-point supervision has attracted increasing attention recently. However, the performance gap compared with fully-supervised algorithms remains large. Previous works generated class-agnostic proposals in an image offline and then treated mixed candidates as a single bag, putting a huge burden on multiple instance learning (MIL). In this paper, we introduce Point-to-Box…
-
Effectively Leveraging CLIP for Generating Situational Summaries of Images and Videos Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-05-03
Dhruv Verma, Debaditya Roy, Basura Fernando
Situation recognition refers to the ability of an agent to identify and understand various situations or contexts based on available information and sensory inputs. It involves the cognitive process of interpreting data from the environment to determine what is happening, what factors are involved, and what actions caused those situations. This interpretation of situations is formulated as a semantic…
-
ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-05-03
Zhiwei Hao, Jianyuan Guo, Li Shen, Yong Luo, Han Hu, Yonggang Wen
Recent advancements in multimodal fusion have witnessed the remarkable success of vision-language (VL) models, which excel in various multimodal applications such as image captioning and visual question answering. However, building VL models requires substantial hardware resources, where efficiency is restricted by two key factors: the extended input sequence of the language model with vision features…
-
Few-Shot Referring Video Single- and Multi-Object Segmentation Via Cross-Modal Affinity with Instance Sequence Matching Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-04-28
Heng Liu, Guanghui Li, Mingqi Gao, Xiantong Zhen, Feng Zheng, Yang Wang
Referring Video Object Segmentation (RVOS) aims to segment specific objects in videos based on provided natural language descriptions. As a new supervised visual learning task, achieving RVOS for a given scene requires a substantial amount of annotated data. However, only minimal annotations are usually available for new scenes in realistic scenarios. Another practical problem is that, apart from…
-
Interaction Confidence Attention for Human–Object Interaction Detection Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-04-28
Hong-Bo Zhang, Wang-Kai Lin, Hang Su, Qing Lei, Jing-Hua Liu, Ji-Xiang Du
In the human–object interaction (HOI) detection task, ensuring that interactive pairs receive higher attention weights while reducing the weight of non-interaction pairs is imperative for enhancing HOI detection accuracy. Guiding attention learning is also a key aspect of existing transformer-based algorithms. To tackle this challenge, this study proposes a novel approach termed Interaction Confidence…
-
A Closer Look at Benchmarking Self-supervised Pre-training with Image Classification Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-04-27
Markus Marks, Manuel Knott, Neehar Kondapaneni, Elijah Cole, Thijs Defraeye, Fernando Perez-Cruz, Pietro Perona
Self-supervised learning (SSL) is a machine learning approach where the data itself provides supervision, eliminating the need for external labels. The model is forced to learn about the data’s inherent structure or context by solving a pretext task. With SSL, models can learn from abundant and cheap unlabeled data, significantly reducing the cost of training models where labels are expensive or inaccessible…
-
Data-Adaptive Weight-Ensembling for Multi-task Model Fusion Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-04-25
Anke Tang, Li Shen, Yong Luo, Shiwei Liu, Han Hu, Bo Du, Dacheng Tao
Creating a multi-task model by merging models for distinct tasks has proven to be an economical and scalable approach. Recent research, like task arithmetic, demonstrates that a static solution for multi-task model fusion can be located within the vector space spanned by task vectors. However, the static nature of these methods limits their ability to adapt to the intricacies of individual instances…
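The static task-arithmetic baseline the abstract builds on can be written in a few lines of state-dict arithmetic: the merged model lives at the pretrained weights plus a weighted sum of task vectors (a minimal sketch; coefficients and model names are illustrative):

```python
# Task-arithmetic model merging: theta = theta_pre + sum_i lam_i * (theta_i - theta_pre).
import torch

def merge_task_vectors(pretrained, finetuned_models, coeffs):
    merged = {k: v.clone() for k, v in pretrained.items()}
    for model_sd, lam in zip(finetuned_models, coeffs):
        for k in merged:
            merged[k] += lam * (model_sd[k] - pretrained[k])  # add scaled task vector
    return merged

base = torch.nn.Linear(8, 2)                       # stand-in for a pretrained model
task_a, task_b = torch.nn.Linear(8, 2), torch.nn.Linear(8, 2)  # per-task finetunes
merged_sd = merge_task_vectors(
    base.state_dict(),
    [task_a.state_dict(), task_b.state_dict()],
    coeffs=[0.5, 0.5],
)
fused = torch.nn.Linear(8, 2)
fused.load_state_dict(merged_sd)                   # one static model for both tasks
```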
-
P2P: Part-to-Part Motion Cues Guide a Strong Tracking Framework for LiDAR Point Clouds Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-04-21
Jiahao Nie, Fei Xie, Sifan Zhou, Xueyi Zhou, Dong-Kyu Chae, Zhiwei He
3D single object tracking (SOT) methods based on appearance matching have long suffered from insufficient appearance information caused by incomplete, textureless, and semantically deficient LiDAR point clouds. While the motion paradigm exploits motion cues instead of appearance matching for tracking, it incurs complex multi-stage processing and a segmentation module. In this paper, we first provide in-depth…
-
SwinTextSpotter v2: Towards Better Synergy for Scene Text Spotting Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-04-15
Mingxin Huang, Dezhi Peng, Hongliang Li, Zhenghao Peng, Chongyu Liu, Dahua Lin, Yuliang Liu, Xiang Bai, Lianwen Jin
End-to-end scene text spotting, which aims to read the text in natural images, has garnered significant attention in recent years. However, recent state-of-the-art methods usually incorporate detection and recognition simply by sharing the backbone, which does not directly take advantage of the feature interaction between the two tasks. In this paper, we propose a new end-to-end scene text spotting…
-
D3T: Dual-Domain Diffusion Transformer in Triplanar Latent Space for 3D Incomplete-View CT Reconstruction Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-04-16
Xuhui Liu, Hong Li, Zhi Qiao, Yawen Huang, Xi Liu, Juan Zhang, Zhen Qian, Xiantong Zhen, Baochang Zhang
Computed tomography (CT) is a cornerstone of clinical imaging, yet its accessibility in certain scenarios is constrained by radiation exposure concerns and operational limitations within surgical environments. CT reconstruction from incomplete views has attracted increasing research attention due to its great potential in medical applications. However, it is inherently an ill-posed problem, which…
-
C2RF: Bridging Multi-modal Image Registration and Fusion via Commonality Mining and Contrastive Learning Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-04-15
Linfeng Tang, Qinglong Yan, Xinyu Xiang, Leyuan Fang, Jiayi Ma
Existing image fusion methods are typically only applicable to strictly aligned source images, and they introduce undesirable artifacts when source images are misaligned, compromising visual perception and downstream applications. In this work, we propose a mutually promoting multi-modal image registration and fusion framework based on commonality mining and contrastive learning, named C2RF. We adaptively…
-
Segment Anything in 3D with Radiance Fields Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-04-09
Jiazhong Cen, Jiemin Fang, Zanwei Zhou, Chen Yang, Lingxi Xie, Xiaopeng Zhang, Wei Shen, Qi Tian
The Segment Anything Model (SAM) emerges as a powerful vision foundation model to generate high-quality 2D segmentation results. This paper aims to generalize SAM to segment 3D objects. Rather than replicating the data acquisition and annotation procedure, which is costly in 3D, we design an efficient solution, leveraging the radiance field as a cheap and off-the-shelf prior that connects multi-view…
-
A Survey of Representation Learning, Optimization Strategies, and Applications for Omnidirectional Vision Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-04-10
Hao Ai, Zidong Cao, Lin Wang
Omnidirectional image (ODI) data is captured with a field-of-view of 360°×180°, which is much wider than that of pinhole cameras and captures richer surrounding environment details than conventional perspective images. In recent years, the availability of consumer-level 360° cameras has made omnidirectional vision more popular, and the advance of deep learning (DL) has…
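A minimal sketch of why ODIs need dedicated treatment: each equirectangular pixel corresponds to a ray on the unit sphere, so a uniform pixel grid is highly non-uniform on the sphere, distorting standard 2D convolutions near the poles (illustrative code, not from the survey):

```python
# Map equirectangular pixel coordinates to unit-sphere viewing directions.
import numpy as np

def equirect_to_rays(width, height):
    u = (np.arange(width) + 0.5) / width          # [0,1) across longitude
    v = (np.arange(height) + 0.5) / height        # [0,1) across latitude
    lon = (u - 0.5) * 2.0 * np.pi                 # [-pi, pi)
    lat = (0.5 - v) * np.pi                       # [+pi/2, -pi/2)
    lon, lat = np.meshgrid(lon, lat)
    return np.stack([np.cos(lat) * np.sin(lon),   # x
                     np.sin(lat),                 # y (up)
                     np.cos(lat) * np.cos(lon)],  # z
                    axis=-1)

rays = equirect_to_rays(512, 256)
print(np.allclose(np.linalg.norm(rays, axis=-1), 1.0))   # unit directions
```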
-
AvatarStudio: High-Fidelity and Animatable 3D Avatar Creation from Text Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-04-07
Xuanmeng Zhang, Jianfeng Zhang, Chenxu Zhang, Jun Hao Liew, Huichao Zhang, Yi Yang, Jiashi Feng
We study the problem of creating high-fidelity and animatable 3D avatars from only textual descriptions. Existing text-to-avatar methods are either limited to static avatars which cannot be animated or struggle to generate animatable avatars with promising quality and precise pose control. To address these limitations, we propose AvatarStudio, a generative model that yields explicit textured 3D meshes…
-
Diffusion-Enhanced Test-Time Adaptation with Text and Image Augmentation Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-04-05
Chun-Mei Feng, Yuanyang He, Jian Zou, Salman Khan, Huan Xiong, Zhen Li, Wangmeng Zuo, Rick Siow Mong Goh, Yong Liu
Existing test-time prompt tuning (TPT) methods focus on single-modality data, primarily enhancing images and using confidence ratings to filter out inaccurate images. However, while image generation models can produce visually diverse images, single-modality data enhancement techniques still fail to capture the comprehensive knowledge provided by different modalities. Additionally, we note that the…
-
NU-AIR: A Neuromorphic Urban Aerial Dataset for Detection and Localization of Pedestrians and Vehicles Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-04-03
Craig Iaboni, Thomas Kelly, Pramod Abichandani
This paper presents an open-source aerial neuromorphic dataset that captures pedestrians and vehicles moving in an urban environment. The dataset, titled NU-AIR, features over 70 min of event footage acquired with a 640×480 resolution neuromorphic sensor mounted on a quadrotor operating in an urban environment. Crowds of pedestrians, different types of vehicles, and street scenes featuring…
-
Free Lunch to Meet the Gap: Intermediate Domain Reconstruction for Cross-Domain Few-Shot Learning Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-04-01
Tong Zhang, Yifan Zhao, Liangyu Wang, Jia Li
Cross-domain few-shot learning (CDFSL) endeavors to transfer generalized knowledge from the source domain to target domains using only a minimal amount of training data, and simultaneously faces a triplet of learning challenges: semantic disjointness, large domain discrepancy, and data scarcity. Different from predominant CDFSL works focused on generalized representations, we make novel attempts…
-
A Fast and Lightweight 3D Keypoint Detector Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-04-01
Chengzhuan Yang, Qian Yu, Hui Wei, Fei Wu, Yunliang Jiang, Zhonglong Zheng, Ming-Hsuan Yang
Keypoint detection is crucial in many visual tasks, such as object recognition, shape retrieval, and 3D reconstruction, as labeling point data is labor-intensive or sometimes implausible. Nevertheless, it is challenging to quickly and accurately locate keypoints in point clouds without supervision. This work proposes a fast and lightweight 3D keypoint detector that can efficiently and accurately detect…
-
Creatively Upscaling Images with Global-Regional Priors Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-03-31
Yurui Qian, Qi Cai, Yingwei Pan, Ting Yao, Tao Mei
Contemporary diffusion models show remarkable capability in text-to-image generation, while still being limited to restricted resolutions (e.g., 1024×1024). Recent advances enable tuning-free higher-resolution image generation by recycling pre-trained diffusion models and extending them via regional denoising or dilated sampling/convolutions. However, these models struggle to simultaneously…
-
I²MD: 3D Action Representation Learning with Inter- and Intra-Modal Mutual Distillation Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-03-27
Yunyao Mao, Jiajun Deng, Wengang Zhou, Zhenbo Lu, Wanli Ouyang, Houqiang Li
Recent progress on self-supervised 3D human action representation learning is largely attributed to contrastive learning. However, in conventional contrastive frameworks, the rich complementarity between different skeleton modalities remains under-explored. Moreover, optimized with distinguishing self-augmented samples, models struggle with numerous similar positive instances in the case of limited…
-
Advances in 3D Neural Stylization: A Survey Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-03-28
Yingshu Chen, Guocheng Shao, Ka Chun Shum, Binh-Son Hua, Sai-Kit Yeung
Modern artificial intelligence offers a novel and transformative approach to creating digital art across diverse styles and modalities like images, videos and 3D data, unleashing the power of creativity and revolutionizing the way that we perceive and interact with visual content. This paper reports on recent advances in stylized 3D asset creation and manipulation with the expressive power of neural…
-
Pre-training for Action Recognition with Automatically Generated Fractal Datasets Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-03-26
Davyd Svyezhentsev, George Retsinas, Petros Maragos
In recent years, interest in synthetic data has grown, particularly in the context of pre-training the image modality to support a range of computer vision tasks, including object classification, medical imaging, etc. Previous work has demonstrated that synthetic samples, automatically produced by various generative processes, can replace real counterparts and yield strong visual representations. This…
-
ScenarioDiff: Text-to-video Generation with Dynamic Transformations of Scene Conditions Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-03-25
Yipeng Zhang, Xin Wang, Hong Chen, Chenyang Qin, Yibo Hao, Hong Mei, Wenwu Zhu
With the development of diffusion models, text-to-video generation has recently received significant attention and achieved remarkable success. However, existing text-to-video approaches suffer from the following weaknesses: i) they fail to control the trajectory of the subject as well as the process of scene transformations; ii) they can only generate videos with limited frames, failing to capture…
-
LaneCorrect: Self-Supervised Lane Detection Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-03-24
Ming Nie, Xinyue Cai, Hang Xu, Li Zhang
Lane detection has evolved to help highly functional autonomous driving systems understand driving scenes even in complex environments. In this paper, we work towards developing a generalized computer vision system able to detect lanes without using any annotation. We make the following contributions: (i) we illustrate how to perform unsupervised 3D lane segmentation by leveraging the distinctive intensity…
-
Camouflaged Object Detection with Adaptive Partition and Background Retrieval Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-03-22
Bowen Yin, Xuying Zhang, Li Liu, Ming-Ming Cheng, Yongxiang Liu, Qibin Hou
Recent works confirm the importance of local details for identifying camouflaged objects. However, how to identify the details around the target objects via background cues lacks in-depth study. In this paper, we take this into account and present a novel learning framework for camouflaged object detection, called AdaptCOD. To be specific, our method decouples the detection process into three parts…
-
Preconditioned Score-Based Generative Models Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-03-21
Hengyuan Ma, Xiatian Zhu, Jianfeng Feng, Li Zhang
Score-based generative models (SGMs) have recently emerged as a promising class of generative models. However, a fundamental limitation is that their sampling process is slow, requiring many (e.g., 2000) iterations of sequential computation. An intuitive acceleration method is to reduce the number of sampling iterations, which however causes severe performance degradation. We attribute this problem to the…
-
FlowSDF: Flow Matching for Medical Image Segmentation Using Distance Transforms Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-03-22
Lea Bogensperger, Dominik Narnhofer, Alexander Falk, Konrad Schindler, Thomas Pock
Medical image segmentation plays an important role in accurately identifying and isolating regions of interest within medical images. Generative approaches are particularly effective in modeling the statistical properties of segmentation masks that are closely related to the respective structures. In this work we introduce FlowSDF, an image-guided conditional flow matching framework, designed to represent…
-
CT3D++: Improving 3D Object Detection with Keypoint-Induced Channel-wise Transformer Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-03-20
Hualian Sheng, Sijia Cai, Na Zhao, Bing Deng, Qiao Liang, Min-Jian Zhao, Jieping Ye
The field of 3D object detection from point clouds is rapidly advancing in computer vision, aiming to accurately and efficiently detect and localize objects in three-dimensional space. Current 3D detectors commonly fall short in terms of flexibility and scalability, with ample room for performance improvements. In this paper, our objective is to address these limitations by introducing two frameworks…
-
LR-ASD: Lightweight and Robust Network for Active Speaker Detection Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-03-19
Junhua Liao, Haihan Duan, Kanghui Feng, Wanbing Zhao, Yanbing Yang, Liangyin Chen, Yanru Chen
Active speaker detection is a challenging task aimed at identifying who is speaking. Due to the critical importance of this task in numerous applications, it has received considerable attention. Existing studies endeavor to enhance performance at any cost by inputting information from multiple candidates and designing complex models. While these methods have achieved excellent performance, their substantial…
-
A Solution to Co-occurrence Bias in Pedestrian Attribute Recognition: Theory, Algorithms, and Improvements Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-03-18
Yibo Zhou, Hai-Miao Hu, Jinzuo Yu, Haotian Wu, Shiliang Pu, Hanzi Wang
For pedestrian attribute recognition, we demonstrate that deep models can memorize the pattern of attribute co-occurrences inherent to the dataset, whether through explicit or implicit means. However, since attribute interdependency is highly variable and unpredictable across different scenarios, the modeled attribute co-occurrences de facto serve as a data selection bias that hardly generalizes…
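The co-occurrence pattern described above can be made concrete by estimating conditional attribute probabilities from the binary label matrix (a toy sketch with synthetic labels, not the paper's method):

```python
# Empirical attribute co-occurrence: cooc[i, j] ≈ P(attr_j = 1 | attr_i = 1).
# Strong off-diagonal entries are exactly the dataset patterns a model can memorize.
import numpy as np

rng = np.random.default_rng(0)
labels = (rng.random((1000, 5)) < 0.3).astype(float)   # toy multi-label matrix
labels[:, 1] = np.maximum(labels[:, 1], labels[:, 0])  # inject a co-occurrence

counts = labels.T @ labels                                 # joint counts
cooc = counts / np.maximum(counts.diagonal()[:, None], 1)  # conditional P(j|i)
print(cooc[0, 1])   # ≈ 1.0: attribute 1 almost always accompanies attribute 0
```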