Main Track Accepted Papers – IJCAI 2024 (2024)

68

Minimizing Weighted Counterfactual Regret with Optimistic Online Mirror Descent

Hang Xu, Kai Li, Bingyun Liu, Haobo Fu, Qiang Fu, Junliang Xing, Jian Cheng

Counterfactual regret minimization (CFR) is a family of algorithms for effectively solving imperfect-information games. It decomposes the total regret into counterfactual regrets, utilizing local regret minimization algorithms, such as Regret Matching (RM) or RM+, to minimize them. Recent research establishes a connection between Online Mirror Descent (OMD) and RM+, paving the way for an optimistic variant PRM+ and its extension PCFR+. However, PCFR+ assigns uniform weights for each iteration when determining regrets, leading to substantial regrets when facing dominated actions. This work explores minimizing weighted counterfactual regret with optimistic OMD, resulting in a novel CFR variant PDCFR+. It integrates PCFR+ and Discounted CFR (DCFR) in a principled manner, swiftly mitigating negative effects of dominated actions and consistently leveraging predictions to accelerate convergence. Theoretical analyses prove that PDCFR+ converges to a Nash equilibrium, particularly under distinct weighting schemes for regrets and average strategies. Experimental results demonstrate PDCFR+’s fast convergence in common imperfect-information games. The code is available at https://github.com/rpSebastian/PDCFRPlus.

List of keywords

Machine Learning -> ML: Game Theory
Game Theory and Economic Paradigms -> GTEP: Noncooperative games
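
The abstract above names Regret Matching+ (RM+) as one of the local regret minimizers used inside CFR. As a rough illustration only, the following Python sketch shows a plain RM+ update at a single decision point. It is not PDCFR+ and is not taken from the authors' repository; the function names and the fixed utility vector in the usage note are assumptions made for this example.

```python
from typing import List


def strategy_from_regrets(cum_regret: List[float]) -> List[float]:
    """Play each action in proportion to its (non-negative) cumulative regret."""
    total = sum(cum_regret)
    n = len(cum_regret)
    if total <= 0.0:
        # No positive regret yet: fall back to the uniform strategy.
        return [1.0 / n] * n
    return [r / total for r in cum_regret]


def rm_plus_update(cum_regret: List[float], utilities: List[float]) -> List[float]:
    """One RM+ step: add instantaneous regrets, then clip at zero."""
    strategy = strategy_from_regrets(cum_regret)
    expected = sum(p * u for p, u in zip(strategy, utilities))
    return [max(q + (u - expected), 0.0) for q, u in zip(cum_regret, utilities)]
```

Starting from zero regrets and repeatedly feeding the hypothetical utility vector `[1.0, 0.0, -1.0]`, the strategy concentrates on the first action, because clipping at zero stops the other actions from accumulating negative regret.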

70

Structure-Preserving Physics-Informed Neural Networks with Energy or Lyapunov Structure

Haoyu Chu, Yuto Miyatake, Wenjun Cui, Shikui Wei, Daisuke Furihata

Recently, there has been growing interest in using physics-informed neural networks (PINNs) to solve differential equations. However, the preservation of structure, such as energy and stability, in a suitable manner has yet to be established. This limitation could be a potential reason why the learning process for PINNs is not always efficient and the numerical results may suggest nonphysical behavior. Besides, there is little research on their applications on downstream tasks. To address these issues, we propose structure-preserving PINNs to improve their performance and broaden their applications for downstream tasks. Firstly, by leveraging prior knowledge about the physical system, a structure‐preserving loss function is designed to assist the PINN in learning the underlying structure. Secondly, a framework that utilizes structure-preserving PINN for robust image recognition is proposed. Here, preserving the Lyapunov structure of the underlying system ensures the stability of the system. Experimental results demonstrate that the proposed method improves the numerical accuracy of PINNs for partial differential equations (PDEs). Furthermore, the robustness of the model against adversarial perturbations in image data is enhanced.

List of keywords

Machine Learning -> ML: Deep learning architectures
Computer Vision -> CV: Adversarial learning, adversarial attack and defense methods
Computer Vision -> CV: Machine learning for vision
Machine Learning -> ML: Supervised Learning

97

Automatic De-Biased Temporal-Relational Modeling for Stock Investment Recommendation

Weijun Chen, Shun Li, Xipu Yu, Heyuan Wang, Wei Chen, Tengjiao Wang

Stock investment recommendation is crucial for guiding investment decisions and managing portfolios. Recent studies have demonstrated the potential of temporal-relational models (TRMs) to yield excess investment returns. However, in the complicated finance ecosystem, current TRMs suffer from both the intrinsic temporal bias from the low signal-to-noise ratio (SNR) and the relational bias caused by utilizing inappropriate relational topologies and propagation mechanisms. Moreover, the distribution shifts behind macro-market scenarios invalidate the underlying i.i.d. assumption and limit the generalization ability of TRMs. In this paper, we pioneer the study of the impact of the above issues on the effective learning of temporal-relational patterns and propose an Automatic De-Biased Temporal-Relational Model (ADB-TRM) for stock recommendation. Specifically, ADB-TRM consists of three main components: (i) a meta-learned architecture that forms a dual-stage training process, with the inner part ameliorating temporal-relational bias and the outer meta-learner counteracting distribution shifts, (ii) automatic adversarial sample generation that guides the model adaptively to alleviate bias and enhance its profiling ability through adversarial training, and (iii) global-local interaction that helps seek relatively invariant stock embeddings from local and global distribution perspectives to mitigate distribution shifts. Experiments on three datasets from distinct stock markets show that ADB-TRM outperforms state-of-the-art methods by over 28.41% and 9.53% in terms of cumulative and risk-adjusted returns, respectively.

List of keywords

Data Mining -> DM: Applications
Data Mining -> DM: Mining spatial and/or temporal data
Machine Learning -> ML: Time series and data streams
Multidisciplinary Topics and Applications -> MTA: Finance

103

TaD: A Plug-and-Play Task-Aware Decoding Method to Better Adapt LLMs on Downstream Tasks

Xinhao Xu, Hui Chen, Zijia Lin, Jungong Han, Lixing Gong, Guoxin Wang, Yongjun Bao, Guiguang Ding

Fine-tuning pre-trained models on downstream tasks is a common practice in leveraging large language models (LLMs) today. A critical issue is how to adapt pre-trained models to downstream tasks better, thereby enhancing their performance. This paper introduces Task-aware Decoding (TaD), a plug-and-play method that exploits the difference in probability distributions before and after fine-tuning to boost the performance of LLMs on downstream tasks. The proposed TaD argues that the difference between the pre-finetuning probability distribution and the post-finetuning one represents the direction from common knowledge towards specific downstream-task knowledge. Aligning the final output probability distribution to that direction can probably result in superior downstream task performance, compared to the original fine-tuned model. Experiments on various datasets across four different task categories well demonstrate TaD’s effectiveness on different LLMs, i.e., GPT, BLOOM, and LLaMA, with different fine-tuning methods. Moreover, further experiments reveal that TaD better enhances model performance in data-scarce scenarios.

List of keywords

Natural Language Processing -> NLP: Language generation
Natural Language Processing -> NLP: Applications
Natural Language Processing -> NLP: Language models
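
The abstract above describes shifting the output distribution along the direction from the pre-fine-tuning distribution to the post-fine-tuning one. The Python sketch below illustrates that general idea for a single next-token distribution. The exact rule used by TaD is not given in the abstract, so the log-ratio extrapolation, the `alpha` strength parameter, and the function name are all assumptions made for illustration, not the authors' method.

```python
import math
from typing import List


def task_aware_rescale(p_pre: List[float], p_ft: List[float],
                       alpha: float = 1.0, eps: float = 1e-12) -> List[float]:
    """Push the fine-tuned distribution further along (log p_ft - log p_pre).

    ASSUMPTION: this extrapolation rule is illustrative only.
    alpha = 0 returns the fine-tuned distribution unchanged.
    """
    scores = []
    for pre, ft in zip(p_pre, p_ft):
        log_ft = math.log(max(ft, eps))
        log_pre = math.log(max(pre, eps))
        scores.append(log_ft + alpha * (log_ft - log_pre))
    # Numerically stable softmax over the adjusted scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]
```

With hypothetical inputs `p_pre = [0.5, 0.3, 0.2]` and `p_ft = [0.4, 0.4, 0.2]`, the token whose probability rose during fine-tuning receives more mass than under `p_ft`, and the token whose probability fell receives less.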

128

Physics-Informed Trajectory Prediction for Autonomous Driving under Missing Observation

Haicheng Liao, Chengyue Wang, Zhenning Li, Yong kang Li, Bonan Wang, Guofa Li, Cheng-Zhong Xu

This paper introduces a novel trajectory prediction approach for autonomous vehicles (AVs), adeptly addressing the challenges of missing observations and the need for adherence to physical laws in real-world driving environments. This study proposes a hierarchical two-stage trajectory prediction model for AVs. In the first stage we propose the Wavelet Reconstruction Network, an innovative tool expertly crafted for reconstructing missing observations, offering optional integration with state-of-the-art models to enhance their robustness. Additionally, the second stage of the model features the Wave Fusion Encoder, a quantum mechanics-inspired innovation for sophisticated vehicle interaction modeling. By incorporating the Kinematic Bicycle Model, we ensure that our predictions align with realistic vehicular kinematics. Complementing our methodological advancements, we introduce MoCAD-missing, a comprehensive real-world traffic dataset, alongside enhanced versions of the NGSIM and HighD datasets, designed to facilitate rigorous testing in environments with missing observations. Extensive evaluations demonstrate that our approach markedly outperforms existing methods, achieving high accuracy even in scenarios with up to 75% missing observations.

List of keywords

Robotics -> ROB: Other
Planning and Scheduling -> PS: Planning under uncertainty
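
The abstract above relies on the Kinematic Bicycle Model to keep predictions physically plausible. As a minimal sketch, the Python code below integrates the common rear-axle form of that model with one explicit Euler step. How the paper couples the model to its network is not stated in the abstract; the `VehicleState` class, the default wheelbase of 2.7 m, and the 0.1 s step are assumptions for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class VehicleState:
    x: float        # position along the x axis, metres
    y: float        # position along the y axis, metres
    heading: float  # yaw angle, radians
    speed: float    # longitudinal speed, metres per second


def bicycle_step(state: VehicleState, accel: float, steer: float,
                 wheelbase: float = 2.7, dt: float = 0.1) -> VehicleState:
    """Advance the rear-axle kinematic bicycle model by one Euler step."""
    x = state.x + state.speed * math.cos(state.heading) * dt
    y = state.y + state.speed * math.sin(state.heading) * dt
    # Yaw rate is v / L * tan(steering angle).
    heading = state.heading + state.speed / wheelbase * math.tan(steer) * dt
    speed = state.speed + accel * dt
    return VehicleState(x, y, heading, speed)
```

Because the next position depends only on speed, heading, and steering angle, any trajectory rolled out this way respects the vehicle's turning geometry by construction.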

129

MFTraj: Map-Free, Behavior-Driven Trajectory Prediction for Autonomous Driving

Haicheng Liao, Zhenning Li, Chengyue Wang, Huanming Shen, Dongping Liao, Bonan Wang, Guofa Li, Cheng-Zhong Xu

This paper introduces a trajectory prediction model tailored for autonomous driving, focusing on capturing complex interactions in dynamic traffic scenarios without reliance on high-definition maps. The model, termed MFTraj, harnesses historical trajectory data combined with a novel dynamic geometric graph-based behavior-aware module. At its core, an adaptive structure-aware interactive graph convolutional network captures both positional and behavioral features of road users, preserving spatial-temporal intricacies. Enhanced by a linear attention mechanism, the model achieves computational efficiency and reduced parameter overhead. Evaluations on the Argoverse, NGSIM, HighD, and MoCAD datasets underscore MFTraj’s robustness and adaptability, outperforming numerous benchmarks even in data-challenged scenarios without the need for additional information such as HD maps or vectorized maps. Importantly, it maintains competitive performance even in scenarios with substantial missing data (12.5%-50%), outperforming most existing state-of-the-art models. The results and methodology suggest a significant advancement in autonomous driving trajectory prediction, paving the way for safer and efficient autonomous systems.

List of keywords

Multidisciplinary Topics and Applications -> MTA: Transportation
Agent-based and Multi-agent Systems -> MAS: Applications
Agent-based and Multi-agent Systems -> MAS: Multi-agent planning
Robotics -> ROB: Other

130

A Cognitive-Driven Trajectory Prediction Model for Autonomous Driving in Mixed Autonomy Environments

Haicheng Liao, Zhenning Li, Chengyue Wang, Bonan Wang, Hanlin Kong, Yanchen Guan, Guofa Li, Zhiyong Cui

As autonomous driving technology progresses, the need for precise trajectory prediction models becomes paramount. This paper introduces an innovative model that infuses cognitive insights into trajectory prediction, focusing on perceived safety and dynamic decision-making. Distinct from traditional approaches, our model excels in analyzing interactions and behavior patterns in mixed autonomy traffic scenarios. We introduce the Macao Connected Autonomous Driving (MoCAD) dataset as part of our contributions, which adds value to its complex urban driving scenarios. Our model represents a significant leap forward, achieving marked performance improvements on several key datasets. Specifically, it surpasses existing benchmarks with gains of 16.2% on the Next Generation Simulation (NGSIM), 27.4% on the Highway Drone (HighD), and 19.8% on the MoCAD dataset. Our proposed model shows exceptional proficiency in handling corner cases, essential for real-world applications. Moreover, its robustness is evident in scenarios with missing or limited data, outperforming most of the state-of-the-art baselines. This adaptability and resilience position our model as a viable tool for real-world autonomous driving systems, heralding a new standard in vehicle trajectory prediction for enhanced safety and efficiency.

List of keywords

Multidisciplinary Topics and Applications -> MTA: Transportation
Agent-based and Multi-agent Systems -> MAS: Human-agent interaction
Planning and Scheduling -> PS: Applications
Robotics -> ROB: Motion and path planning

139

Hyperparameter Optimization Can Even Be Harmful in Off-Policy Learning and How to Deal with It

Yuta Saito, Masahiro Nomura

There has been a growing interest in off-policy evaluation in the literature such as recommender systems and personalized medicine. We have so far seen significant progress in developing estimators aimed at accurately estimating the effectiveness of counterfactual policies based on biased logged data. However, there are many cases where those estimators are used not only to evaluate the value of decision making policies but also to search for the best hyperparameters from a large candidate space. This work explores the latter hyperparameter optimization (HPO) task for off-policy learning. We empirically show that naively applying an unbiased estimator of the generalization performance as a surrogate objective in HPO can cause an unexpected failure, merely pursuing hyperparameters whose generalization performance is greatly overestimated. We then propose simple and computationally efficient corrections to the typical HPO procedure to deal with the aforementioned issues simultaneously. Empirical investigations demonstrate the effectiveness of our proposed HPO algorithm in situations where the typical procedure fails severely.

List of keywords

Machine Learning -> ML: Causality
Machine Learning -> ML: Hyperparameter optimization
Machine Learning -> ML: Multi-armed bandits
Uncertainty in AI -> UAI: Causality, structural causal models and causal inference
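
The abstract above refers to unbiased estimators of a policy's value computed from biased logged data. One standard estimator of this kind is inverse propensity scoring (IPS), sketched below in Python. The abstract does not say which estimator the paper uses, so IPS is shown here only as a representative example, and the function and argument names are assumptions.

```python
from typing import Sequence


def ips_value(rewards: Sequence[float],
              logging_probs: Sequence[float],
              target_probs: Sequence[float]) -> float:
    """Inverse propensity scoring estimate of a target policy's value.

    Each logged reward is reweighted by the ratio between the probability
    that the target policy and the logging policy assign to the logged action.
    """
    if not rewards:
        raise ValueError("need at least one logged interaction")
    if not (len(rewards) == len(logging_probs) == len(target_probs)):
        raise ValueError("all inputs must have the same length")
    total = 0.0
    for reward, p_log, p_tgt in zip(rewards, logging_probs, target_probs):
        if p_log <= 0.0:
            raise ValueError("logging probabilities must be positive")
        total += reward * p_tgt / p_log
    return total / len(rewards)
```

Because the importance weights can be large when the logging probability is small, the estimate can vary widely across samples, which is one reason picking the hyperparameters with the highest estimated value can favour overestimated candidates.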

158

AutoAgents: A Framework for Automatic Agent Generation

Guangyao Chen, Siwei Dong, Yu Shu, Ge Zhang, Jaward Sesay, Börje Karlsson, Jie Fu, Yemin Shi

Large language models (LLMs) have enabled remarkable advances in automated task-solving with multi-agent systems. However, most existing LLM-based multi-agent approaches rely on predefined agents to handle simple tasks, limiting the adaptability of multi-agent collaboration to different scenarios. Therefore, we introduce AutoAgents, an innovative framework that adaptively generates and coordinates multiple specialized agents to build an AI team according to different tasks. Specifically, AutoAgents couples the relationship between tasks and roles by dynamically generating multiple required agents based on task content and planning solutions for the current task based on the generated expert agents. Multiple specialized agents collaborate with each other to efficiently accomplish tasks. Concurrently, an observer role is incorporated into the framework to reflect on the designated plans and agents’ responses and improve upon them. Our experiments on various benchmarks demonstrate that AutoAgents generates more coherent and accurate solutions than the existing multi-agent methods. This underscores the significance of assigning different roles to different tasks and of team cooperation, offering new perspectives for tackling complex tasks. The repository of this project is available at https://github.com/Link-AGI/AutoAgents.

List of keywords

Agent-based and Multi-agent Systems -> MAS: Applications
Natural Language Processing -> NLP: Applications

161

Boosting Model Resilience via Implicit Adversarial Data Augmentation

Xiaoling Zhou, Wei Ye, Zhemg Lee, Rui Xie, Shikun Zhang

Data augmentation plays a pivotal role in enhancing and diversifying training data. Nonetheless, consistently improving model performance in varied learning scenarios, especially those with inherent data biases, remains challenging. To address this, we propose to augment the deep features of samples by incorporating their adversarial and anti-adversarial perturbation distributions, enabling adaptive adjustment in the learning difficulty tailored to each sample’s specific characteristics. We then theoretically reveal that our augmentation process approximates the optimization of a surrogate loss function as the number of augmented copies increases indefinitely. This insight leads us to develop a meta-learning-based framework for optimizing classifiers with this novel loss, introducing the effects of augmentation while bypassing the explicit augmentation process. We conduct extensive experiments across four common biased learning scenarios: long-tail learning, generalized long-tail learning, noisy label learning, and subpopulation shift learning. The empirical results demonstrate that our method consistently achieves state-of-the-art performance, highlighting its broad adaptability.

List of keywords

Machine Learning -> ML: Classification
Data Mining -> DM: Class imbalance and unequal cost
Machine Learning -> ML: Meta-learning
Machine Learning -> ML: Robustness

168

Negative-Binomial Randomized Gamma Dynamical Systems for Heterogeneous Overdispersed Count Time Sequences

Rui Huang, Sikun Yang, Heinz Koeppl

Modeling count-valued time sequences has received growing interest because count time sequences naturally arise in physical and social domains. Poisson gamma dynamical systems (PGDSs) are newly developed methods that capture the expressive latent transition structure and bursty dynamics behind count sequences well. In particular, PGDSs demonstrate superior performance in terms of data imputation and prediction, compared with canonical linear dynamical system (LDS) based methods. Despite these advantages, PGDSs cannot capture the heterogeneous overdispersed behaviours of the underlying dynamic processes. To mitigate this defect, we propose a negative-binomial-randomized gamma Markov process, which not only significantly improves the predictive performance of the proposed dynamical system, but also facilitates the fast convergence of the inference algorithm. Moreover, we develop methods to estimate both factor-structured and graph-structured transition dynamics, which enable us to infer more explainable latent structure, compared with PGDSs. Finally, we demonstrate the explainable latent structure learned by the proposed method, and show its superior performance in imputing missing data and forecasting future observations, compared with the related models.

List of keywords

Machine Learning -> ML: Time series and data streams
Machine Learning -> ML: Bayesian learning
Machine Learning -> ML: Probabilistic machine learning
Uncertainty in AI -> UAI: Tractable probabilistic models

169

Scale and Direction Guided GAN for Inertial Sensor Signal Enhancement

Yifeng Wang, Yi Zhao

Inertial sensors, serving as attitude and motion sensing components, are extensively used in various portable devices spanning consumer electronics, sports health, aerospace, etc. However, the severe intrinsic errors of inertial sensors greatly restrict their capability to implement advanced functions, such as motion tracking and semantic recognition. Although generative models hold significant potential for signal enhancement, unsupervised or weakly-supervised generative methods may not achieve ideal generation results due to the absence of guidance from paired data. To address this, we propose a scale and direction-guided generative adversarial network (SDG-GAN), which provides dual guidance mechanisms for GAN with unpaired data across two practical application scenarios. In the unsupervised scenario where only unpaired signals of varying quality are available, our scale-guided GAN (SG-GAN) forces the generator to learn high-quality signal characteristics at different scales simultaneously via the proposed self-supervised zoom constraint, thereby facilitating multi-scale interactive learning. In the weakly-supervised scenario, where additional experimental equipment can provide some motion information, our direction-guided GAN (DG-GAN) introduces auxiliary tasks to supervise signal generation while avoiding interference from auxiliary tasks on the main generation task. Extensive experiments demonstrate that both the unsupervised SG-GAN and the weakly-supervised DG-GAN significantly outperform all comparison methods, including fully-supervised approaches. The combined SDG-GAN achieves remarkable results, enabling unimaginable tasks based on the original inertial signal, such as 3D motion tracking.

List of keywords

Machine Learning -> ML: Generative models
Machine Learning -> ML: Unsupervised learning
Machine Learning -> ML: Weakly supervised learning
Multidisciplinary Topics and Applications -> MTA: Sensor networks and smart cities

170

Nukplex: An Efficient Local Search Algorithm for Maximum K-Plex Problem

Rui Sun, Yiyuan Wang, Shimao Wang, Hui Li, Ximing Li, Minghao Yin

The maximum k-plex problem (MKPP) is a significant relaxation of the maximum clique problem with extensive applications. Recently, many heuristic algorithms based on various methods have been proposed to solve the MKPP. In this work, to further improve the performance of solving the MKPP, we propose an efficient local search algorithm based on three main ideas. First, we propose a relaxed bounded configuration checking strategy that considers two kinds of historical searching information to relax the restricted strength of configuration checking and the forbidden condition of candidate vertices for the operation Add, respectively. Second, we present a novel solution information-based vertex selection strategy based on two kinds of solution information to select high-quality candidate vertices. Third, we define the solution core and then introduce a core-based perturbation strategy to help the algorithm jump out of local optima. The experimental results show that the proposed algorithm significantly outperforms the state-of-the-art MKPP algorithms in almost all the instances.

List of keywords

Search -> S: Local search
Search -> S: Heuristic search
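
The abstract above calls the k-plex a relaxation of the clique. Under the standard definition, a vertex set S is a k-plex if every vertex in S is adjacent to at least |S| - k other vertices of S, so a 1-plex is exactly a clique. The Python sketch below checks that property; it only illustrates the definition and is not part of the Nukplex algorithm. The adjacency-dictionary representation and the function name are assumptions.

```python
from typing import Dict, Hashable, Set


def is_k_plex(adj: Dict[Hashable, Set[Hashable]],
              subset: Set[Hashable], k: int) -> bool:
    """Return True if `subset` is a k-plex of the undirected graph `adj`.

    `adj` maps each vertex to the set of its neighbours (no self-loops).
    """
    required = len(subset) - k
    # Every member needs at least |S| - k neighbours inside the subset.
    return all(len(adj[v] & subset) >= required for v in subset)
```

For the three-vertex path a-b-c, the full vertex set is a 2-plex but not a 1-plex, since a and c are not adjacent.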

185

Design a Win-Win Strategy That Is Fair to Both Service Providers and Tasks When Rejection Is Not an Option

Yohai Trabelsi, Pan Xu, Sarit Kraus

Assigning tasks to service providers is a frequent procedure across various applications. Often the tasks arrive dynamically while the service providers remain static. Preventing task rejection caused by service provider overload is of utmost significance. To ensure a positive experience in relevant applications for both service providers and tasks, fairness must be considered. To address the issue, we model the problem as an online matching within a bipartite graph and tackle two minimax problems: one focuses on minimizing the highest waiting time of a task, while the other aims to minimize the highest utilization of a service provider. We show that the second problem can be expressed as a linear program and thus solved efficiently while maintaining a reasonable approximation to the objective of the first problem. We developed novel methods that utilize the two minimax problems. We conducted extensive simulation experiments using real data and demonstrated that our novel heuristics, based on the linear program, performed remarkably well.

List of keywords

Agent-based and Multi-agent Systems -> MAS: Resource allocation

190

Bridging Stereo Geometry and BEV Representation with Reliable Mutual Interaction for Semantic Scene Completion

Bohan Li, Yasheng Sun, Zhujin Liang, Dalong Du, Zhuanghui Zhang, Xiaofeng Wang, Yunnan Wang, Xin Jin, Wenjun Zeng

3D semantic scene completion (SSC) is an ill-posed perception task that requires inferring a dense 3D scene from limited observations. Previous camera-based methods struggle to predict accurate semantic scenes due to inherent geometric ambiguity and incomplete observations. In this paper, we resort to stereo matching technique and bird’s-eye-view (BEV) representation learning to address such issues in SSC. Complementary to each other, stereo matching mitigates geometric ambiguity with epipolar constraint while BEV representation enhances the hallucination ability for invisible regions with global semantic context. However, due to the inherent representation gap between stereo geometry and BEV features, it is non-trivial to bridge them for dense prediction task of SSC. Therefore, we further develop a unified occupancy-based framework dubbed BRGScene, which effectively bridges these two representations with dense 3D volumes for reliable semantic scene completion. Specifically, we design a novel Mutual Interactive Ensemble (MIE) block for pixel-level reliable aggregation of stereo geometry and BEV features. Within the MIE block, a Bi-directional Reliable Interaction (BRI) module, enhanced with confidence re-weighting, is employed to encourage fine-grained interaction through mutual guidance. Besides, a Dual Volume Ensemble (DVE) module is introduced to facilitate complementary aggregation through channel-wise recalibration and multi-group voting. Our method outperforms all published camera-based methods on SemanticKITTI for semantic scene completion. Our code is available on https://github.com/Arlo0o/StereoScene.

List of keywords

Computer Vision -> CV: 3D computer vision
Computer Vision -> CV: Recognition (object detection, categorization)
Computer Vision -> CV: Scene analysis and understanding

195

Unbiased Active Semi-supervised Binary Classification Models

JooChul Lee, Weidong Ma, Ziyang Wang

Active learning is known to be a well-motivated algorithm that aims to maximize model performance with relatively small data, but it introduces sampling bias due to active selection. To adjust the bias, current literature utilizes corrective weights in a supervised learning approach. However, those methods consider only a small amount of actively sampled data and thus estimation efficiency can be improved using unsampled data together. In this paper, we develop an actively improved augmented estimation equation (AI-AEE) based on corrective weights as well as imputation models that allow us to leverage unlabeled data. The asymptotic distribution of the proposed estimator as the solution to the AI-AEE is derived, and an optimal sampling scheme to minimize the asymptotic mean squared error of the estimator is proposed. We then propose a general practical algorithm for training prediction models in the active and semi-supervised learning framework. The superiority of our method is demonstrated on synthetic and real data examples.

List of keywords

Machine Learning -> ML: Active learning
Machine Learning -> ML: Regression
Machine Learning -> ML: Semi-supervised learning

200

Proportion-based Sensitivity Analysis of Uncontrolled Confounding Bias in Causal Inference

Haruka Yoshida, Manabu Kuroki

Uncontrolled confounding bias causes a spurious relationship between an exposure variable and an outcome variable and precludes reliable evaluation of the causal effect from observed data. Thus, it is important to observe a sufficient set of confounders to reliably evaluate the causal effect. However, there is no statistical method for judging whether an available set of covariates is sufficient to derive a reliable estimator for the causal effect. To address this problem, we focus on the fact that the mean squared error (MSE) of the outcome variable with respect to the average causal risk can be described as the sum of "the conditional variance of the outcome variable given the exposure variable" and "the square of the uncontrolled confounding bias". We then propose a novel sensitivity analysis, namely, the proportion-based sensitivity analysis of uncontrolled confounding bias in causal effects (PSA) in which the sensitivity parameter is formulated as the proportion of "the square of the uncontrolled confounding bias" to the MSE, and we clarify some properties. We also demonstrate the applicability of the PSA through two case studies.

List of keywords

Uncertainty in AI -> UAI: Causality, structural causal models and causal inference
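
The decomposition described in the abstract above can be written out as follows. This is a sketch in assumed notation, not the paper's own: Y is the outcome, X the exposure, \theta_x the average causal risk under exposure level x, B(x) the uncontrolled confounding bias, and \rho(x) the sensitivity parameter. The symbols and the choice to condition on X = x are assumptions made here for illustration.

```latex
\begin{align}
\mathrm{MSE}(x) &= \mathbb{E}\big[(Y - \theta_x)^2 \mid X = x\big]
                 = \operatorname{Var}(Y \mid X = x) + B(x)^2, \\
B(x)            &= \mathbb{E}[Y \mid X = x] - \theta_x, \\
\rho(x)         &= \frac{B(x)^2}{\mathrm{MSE}(x)}.
\end{align}
```

The first line is the usual bias-variance identity: expanding the square around \mathbb{E}[Y \mid X = x] makes the cross term vanish. Whenever \mathrm{MSE}(x) > 0, the proportion \rho(x) lies between 0 and 1, with 0 corresponding to no uncontrolled confounding bias.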

204

Contrastive General Graph Matching with Adaptive Augmentation Sampling

Jianyuan Bo, Yuan Fang

Graph matching has important applications in pattern recognition and beyond. Current approaches predominantly adopt supervised learning, demanding extensive labeled data which can be limited or costly. Meanwhile, self-supervised learning methods for graph matching often require additional side information such as extra categorical information and input features, limiting their application to the general case. Moreover, designing the optimal graph augmentations for self-supervised graph matching presents another challenge to ensure robustness and efficacy. To address these issues, we introduce a novel Graph-centric Contrastive framework for Graph Matching (GCGM), capitalizing on a vast pool of graph augmentations for contrastive learning, yet without needing any side information. Given the variety of augmentation choices, we further introduce a Boosting-inspired Adaptive Augmentation Sampler (BiAS), which adaptively selects more challenging augmentations tailored for graph matching. Through various experiments, our GCGM surpasses state-of-the-art self-supervised methods across various datasets, marking a significant step toward more effective, efficient and general graph matching.

List of keywords

Machine Learning -> ML: Unsupervised learning
Machine Learning -> ML: Self-supervised Learning
Machine Learning -> ML: Sequence and graph learning

214

FedSSA: Semantic Similarity-based Aggregation for Efficient Model-Heterogeneous Personalized Federated Learning

Liping Yi, Han Yu, Zhuan Shi, Gang Wang, Xiaoguang Liu, Lizhen Cui, Xiaoxiao Li

Federated learning (FL) is a privacy-preserving collaborative machine learning paradigm. Traditional FL requires all data owners (a.k.a. FL clients) to train the same local model. This design is not well-suited for scenarios involving data and/or system heterogeneity. Model-Heterogeneous Personalized FL (MHPFL) has emerged to address this challenge. Existing MHPFL approaches often rely on a public dataset with the same nature as the learning task, or incur high computation and communication costs. To address these limitations, we propose the Federated Semantic Similarity Aggregation (FedSSA) approach for supervised classification tasks, which splits each client’s model into a heterogeneous (structure-different) feature extractor and a homogeneous (structure-same) classification header. It performs local-to-global knowledge transfer via semantic similarity-based header parameter aggregation. In addition, global-to-local knowledge transfer is achieved via an adaptive parameter stabilization strategy which fuses the seen-class parameters of historical local headers with those of the latest global header for each client. FedSSA does not rely on public datasets and requires only partial header parameter transmission, which saves costs. Theoretical analysis proves the convergence of FedSSA. Extensive experiments show that FedSSA achieves up to 3.62% higher accuracy, 15.54 times higher communication efficiency, and 15.52 times higher computational efficiency compared to 7 state-of-the-art MHPFL baselines.

List of keywords

Machine Learning -> ML: Federated learning

219

Empirical Analysis of Dialogue Relation Extraction with Large Language Models

Guozheng Li, Zijie Xu, Ziyu Shang, Jiajun Liu, Ke Ji, Yikai Guo

Dialogue relation extraction (DRE) aims to extract relations between two arguments within a dialogue, which is more challenging than standard RE due to the higher personal pronoun frequency and lower information density in dialogues. However, existing DRE methods still suffer from two serious issues: (1) they find it hard to capture long and sparse multi-turn information, and (2) they struggle to extract golden relations based on partial dialogues. These issues motivate us to look for more effective methods. We notice that the rise of large language models (LLMs) has sparked considerable interest in evaluating their performance across diverse tasks. To this end, we first investigate the capabilities of different LLMs in DRE, considering both proprietary and open-source models. Interestingly, we discover that LLMs significantly alleviate the two issues in existing DRE methods. Overall, we have the following findings: (1) scaling up model size substantially boosts the overall DRE performance and achieves exceptional results, tackling the difficulty of capturing long and sparse multi-turn information; (2) LLMs suffer a much smaller performance drop from the entire-dialogue setting to the partial-dialogue setting than existing methods; (3) LLMs deliver competitive or superior performance under both full-shot and few-shot settings compared to the current state of the art; (4) LLMs show modest performance on inverse relations but much stronger improvements on general relations, and they can handle dialogues of various lengths, especially longer sequences.

List of keywords

Natural Language Processing -> NLP: Information extraction

223

Graph Contrastive Learning with Reinforcement Augmentation

Ziyang Liu, Chaokun Wang, Cheng Wu

Graph contrastive learning (GCL), designing contrastive objectives to learn embeddings from augmented graphs, has become a prevailing method for extracting embeddings from graphs in an unsupervised manner. As an important procedure in GCL, graph data augmentation (GDA) directly affects the model performance on downstream tasks. Currently, the GCL methods typically treat GDA as independent events, neglecting its continuity. In this paper, we regard the GDA in GCL as a Markov decision process and propose a novel graph reinforcement augmentation framework for GCL. Based on this framework, we design a Graph Advantage Actor-Critic (GA2C) model. We conduct extensive experiments to evaluate GA2C on unsupervised learning, transfer learning, and semi-supervised learning. The experimental results demonstrate the performance superiority of GA2C over the state-of-the-art GCL models. Furthermore, we verify that GA2C is more efficient than the other GCL methods with learnable GDA and provide two examples of chemical molecular graphs from ZINC-2M to demonstrate that GA2C generates meaningful augmented views, where the edge weights reflect the importance of chemical bonds in the molecule.

List of keywords

Data Mining -> DM: Mining graphs
Machine Learning -> ML: Representation learning
Machine Learning -> ML: Reinforcement learning
Machine Learning -> ML: Self-supervised Learning

230

Negative Prompt Driven Complementary Parallel Representation for Open-World 3D Object Retrieval

Yang Xu, Yifan Feng, Yue Gao

The limited availability of supervised labels (positive information) poses a notable challenge for open-world retrieval. However, negative information is more easily obtained but remains underexploited in current methods. In this paper, we introduce the Negative Prompt Driven Complementary Parallel Representation (NPCP) framework, which navigates the complexities of open-world retrieval through the lens of Negative Prompts. Specifically, we employ the Parallel Exclusive Embedding (PEE) to effectively utilize the prompt information, bilaterally capturing both explicit negative and implicit positive signals. To address the challenges of embedding unification and generalization, our method leverages high-order correlations among objects through the Complementary Structure Tuning (CST), by constructing a complementary hypergraph based on bi-directional and cross-category correlations. We have developed four multimodal datasets for open-world 3D object retrieval with negative prompts: NPMN, NPAB, NPNT, and NPES. Extensive experiments and ablation studies on these four benchmarks demonstrate the superiority of our method over current state-of-the-art approaches.

List of keywords

Computer Vision -> CV: 3D computer vision
Computer Vision -> CV: Image and video retrieval
Computer Vision -> CV: Representation learning

231

MICRO: Model-Based Offline Reinforcement Learning with a Conservative Bellman Operator

Xiao-Yin Liu, Xiao-Hu Zhou, Guotao Li, Hao Li, Mei-Jiang Gui, Tian-Yu Xiang, De-Xing Huang, Zeng-Guang Hou

Offline reinforcement learning (RL) faces a significant challenge of distribution shift. Model-free offline RL penalizes the Q value for out-of-distribution (OOD) data or constrains the policy to stay close to the behavior policy to tackle this problem, but this inhibits the exploration of the OOD region. Model-based offline RL, which uses the trained environment model to generate more OOD data and performs conservative policy optimization within that model, has become an effective method for this problem. However, the current model-based algorithms rarely consider agent robustness when incorporating conservatism into the policy. Therefore, a new model-based offline algorithm with a conservative Bellman operator (MICRO) is proposed. This method trades off performance and robustness by introducing the robust Bellman operator into the algorithm. Compared with previous model-based algorithms with robust adversarial models, MICRO can significantly reduce the computation cost by only choosing the minimal Q value in the state uncertainty set. Extensive experiments demonstrate that MICRO outperforms prior RL algorithms on offline RL benchmarks and is considerably robust to adversarial perturbations.
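The idea of choosing the minimal Q value in a state uncertainty set can be sketched as a conservative Bellman target. The sketch assumes the uncertainty set is an L-infinity ball of radius `epsilon` around the next state and approximates the minimum by random sampling. The function name, the shape of the set, and the sampling approximation are assumptions for illustration, not the operator defined in the paper.

```python
import numpy as np

def conservative_target(q_fn, reward, next_state, next_action, gamma,
                        epsilon, num_samples, rng):
    """Bellman target that uses the smallest Q value found near next_state.

    q_fn(state, action) returns a scalar Q estimate (hypothetical interface).
    """
    # Sample perturbations inside the assumed L-infinity ball.
    noise = rng.uniform(-epsilon, epsilon, size=(num_samples,) + next_state.shape)
    # Always include the unperturbed next state among the candidates.
    candidates = np.concatenate([next_state[None], next_state[None] + noise], axis=0)
    q_values = np.array([q_fn(s, next_action) for s in candidates])
    return reward + gamma * q_values.min()

# Usage with a toy Q function that sums the state entries.
rng = np.random.default_rng(0)
toy_q = lambda s, a: float(s.sum())
target = conservative_target(toy_q, 1.0, np.array([1.0, 2.0]), 0, 0.9, 0.1, 8, rng)
```

Because the unperturbed state is always a candidate, the target in this sketch is never larger than the standard one-step target, which is the conservative behavior the abstract describes.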

List of keywords

Machine Learning -> ML: Reinforcement learning
Machine Learning -> ML: Model-based and model learning reinforcement learning
Machine Learning -> ML: Offline reinforcement learning
Machine Learning -> ML: Robustness

234

Spear: Evaluate the Adversarial Robustness of Compressed Neural Models

Chong Yu, Tao Chen, Zhongxue Gan, Jiayuan Fan

As Artificial Intelligence evolves, the neural models vulnerable to adversarial attacks may produce fatal results in critical applications. This paper mainly discusses the robustness of the compressed neural models facing adversarial attacks. A few studies discuss the interaction between model compression and adversarial attack. However, they focus on the robustness against the traditional attacks designed for the dense models, not the attacks intended explicitly for the compressed models, using sparsity and quantization techniques. Compressed models often have fewer parameters and smaller sizes that are more friendly to resource-limited devices than dense models, so they are widely deployed in various edge and mobile devices. However, introducing the sparsity and quantization into neural models further imposes higher attack risks. A specific adversarial attack method (Spear) is proposed to generate the particular adversarial attack samples for evaluating the robustness of the compressed models. The Spear attack finds minimal perturbations to create the attack samples to maximize the different behaviors between the compressed and dense reference models. We demonstrate the proposed Spear attack technique can generally be applied to various networks and tasks through quantitative and ablation experiments.

List of keywords

Machine Learning -> ML: Adversarial machine learning
Computer Vision -> CV: Adversarial learning, adversarial attack and defense methods
Machine Learning -> ML: Learning sparse models
Machine Learning -> ML: Robustness

247

RoboFusion: Towards Robust Multi-Modal 3D Object Detection via SAM

Ziying Song, Guoxing Zhang, Lin Liu, Lei Yang, Shaoqing Xu, Caiyan Jia, Feiyang Jia, Li Wang

Multi-modal 3D object detectors are dedicated to exploring secure and reliable perception systems for autonomous driving (AD). Although achieving state-of-the-art (SOTA) performance on clean benchmark datasets, they tend to overlook the complexity and harsh conditions of real-world environments. With the emergence of visual foundation models (VFMs), opportunities and challenges are presented for improving the robustness and generalization of multi-modal 3D object detection in AD. Therefore, we propose RoboFusion, a robust framework that leverages VFMs like SAM to tackle out-of-distribution (OOD) noise scenarios. We first adapt the original SAM for AD scenarios, yielding SAM-AD. To align SAM or SAM-AD with multi-modal methods, we then introduce AD-FPN for upsampling the image features extracted by SAM. We employ wavelet decomposition to denoise the depth-guided images, further reducing noise and weather interference. Finally, we employ self-attention mechanisms to adaptively reweight the fused features, enhancing informative features while suppressing excess noise. In summary, RoboFusion significantly reduces noise by leveraging the generalization and robustness of VFMs, thereby enhancing the resilience of multi-modal 3D object detection. Consequently, RoboFusion achieves SOTA performance in noisy scenarios, as demonstrated by the KITTI-C and nuScenes-C benchmarks. Code is available at https://github.com/adept-thu/RoboFusion.

List of keywords

Computer Vision -> CV: 3D computer vision
Computer Vision -> CV: Recognition (object detection, categorization)
Robotics -> ROB: Perception

260

A Semi-supervised Molecular Learning Framework for Activity Cliff Estimation

Fang Wu

Machine learning (ML) enables accurate and fast molecular property predictions, which is of interest in drug discovery and material design. Its success rests on the principle of similarity, which assumes that similar molecules exhibit close properties. However, activity cliffs challenge this principle, and their presence leads to a sharp decline in the performance of existing ML algorithms, particularly graph-based methods. To overcome this obstacle under a low-data scenario, we propose a novel semi-supervised learning (SSL) method dubbed SemiMol, which employs predictions on numerous unannotated data as pseudo-signals for subsequent training. Specifically, we introduce an additional instructor model to evaluate the accuracy and trustworthiness of proxy labels because existing pseudo-labeling approaches require probabilistic outputs to reveal the model’s confidence and cannot be applied to regression tasks. Moreover, we design a self-adaptive curriculum learning algorithm to progressively move the target model toward hard samples at a controllable pace. Extensive experiments on 30 activity cliff datasets demonstrate that SemiMol significantly enhances graph-based ML architectures and surpasses state-of-the-art pretraining and SSL baselines.

List of keywords

Multidisciplinary Topics and Applications -> MTA: Health and medicine
Multidisciplinary Topics and Applications -> MTA: Life sciences
Humans and AI -> HAI: Applications

264

VSGT: Variational Spatial and Gaussian Temporal Graph Models for EEG-based Emotion Recognition

Chenyu Liu, Xinliang Zhou, Jiaping Xiao, Zhengri Zhu, Liming Zhai, Ziyu Jia, Yang Liu

Electroencephalogram (EEG), which directly reflects the emotional activity of the brain, has been increasingly utilized for emotion recognition. Most works exploit the spatial and temporal dependencies in EEG to learn emotional feature representations, but they still have two limitations to reach their full potential. First, prior knowledge is rarely used to capture the spatial dependency of brain regions. Second, the cross temporal dependency between consecutive time slices for different brain regions is ignored. To address these limitations, in this paper, we propose Variational Spatial and Gaussian Temporal (VSGT) graph models to investigate the spatial and temporal dependencies for EEG-based emotion recognition. The VSGT has two key components: Variational Spatial Encoder (VSE) and Gaussian Temporal Encoder (GTE). The VSE leverages the upper bound theorem to identify the dynamic spatial dependency based on prior knowledge by the variational Bayesian method. Besides, the GTE exploits the conditional Gaussian graph transform that computes comprehensive temporal dependency between consecutive time slices. Finally, the VSGT utilizes a recurrent structure to calculate the spatial and temporal dependencies for all time slices. Extensive experiments show the superiority of VSGT over state-of-the-art methods on multiple EEG datasets.

List of keywords

Humans and AI -> HAI: Cognitive modeling
Humans and AI -> HAI: Brain sciences

281

Probabilistically Robust Watermarking of Neural Networks

Mikhail Pautov, Nikita Bogdanov, Stanislav Pyatkin, Oleg Rogov, Ivan Oseledets

As deep learning (DL) models are widely and effectively used in Machine Learning as a Service (MLaaS) platforms, there is a rapidly growing interest in DL watermarking techniques that can be used to confirm the ownership of a particular model. Unfortunately, these methods usually produce watermarks susceptible to model stealing attacks. In our research, we introduce a novel trigger set-based watermarking approach that demonstrates resilience against functionality stealing attacks, particularly those involving extraction and distillation. Our approach does not require additional model training and can be applied to any model architecture. The key idea of our method is to compute the trigger set, which is transferable between the source model and the set of proxy models with a high probability. In our experimental study, we show that if the probability of the set being transferable is reasonably high, it can be effectively used for ownership verification of the stolen model. We evaluate our method on multiple benchmarks and show that our approach outperforms current state-of-the-art watermarking techniques in all considered experimental setups.

List of keywords

Machine Learning -> ML: Adversarial machine learning
AI Ethics, Trust, Fairness -> ETF: Safety and robustness
AI Ethics, Trust, Fairness -> ETF: Trustworthy AI
Uncertainty in AI -> UAI: Applications

291

CoFInAl: Enhancing Action Quality Assessment with Coarse-to-Fine Instruction Alignment

Kanglei Zhou, Junlin Li, Ruizhi Cai, Liyuan Wang, Xingxing Zhang, Xiaohui Liang

Action Quality Assessment (AQA) is pivotal for quantifying actions across domains like sports and medical care. Existing methods often rely on pre-trained backbones from large-scale action recognition datasets to boost performance on smaller AQA datasets. However, this common strategy yields suboptimal results due to the inherent struggle of these backbones to capture the subtle cues essential for AQA. Moreover, fine-tuning on smaller datasets risks overfitting. To address these issues, we propose Coarse-to-Fine Instruction Alignment (CoFInAl). Inspired by recent advances in large language model tuning, CoFInAl aligns AQA with broader pre-trained tasks by reformulating it as a coarse-to-fine classification task. Initially, it learns grade prototypes for coarse assessment and then utilizes fixed sub-grade prototypes for fine-grained assessment. This hierarchical approach mirrors the judging process, enhancing interpretability within the AQA framework. Experimental results on two long-term AQA datasets demonstrate CoFInAl achieves state-of-the-art performance with significant correlation gains of 5.49% and 3.55% on Rhythmic Gymnastics and Fis-V, respectively. Our Code is available at https://github.com/ZhouKanglei/CoFInAl_AQA.

List of keywords

Computer Vision -> CV: Action and behavior recognition
Computer Vision -> CV: Video analysis and understanding

297

Hacking Task Confounder in Meta-Learning

Jingyao Wang, Yi Ren, Zeen Song, Jianqi Zhang, Changwen Zheng, Wenwen Qiang

Meta-learning enables rapid generalization to new tasks by learning knowledge from various tasks. It is intuitively assumed that as the training progresses, a model will acquire richer knowledge, leading to better generalization performance. However, our experiments reveal an unexpected result: there is negative knowledge transfer between tasks, affecting generalization performance. To explain this phenomenon, we conduct a causal analysis using Structural Causal Models (SCMs). Our investigation uncovers the presence of spurious correlations between task-specific causal factors and labels in meta-learning. Furthermore, the confounding factors differ across different batches. We refer to these confounding factors as “Task Confounders”. Based on these findings, we propose a plug-and-play Meta-learning Causal Representation Learner (MetaCRL) to eliminate task confounders. It encodes decoupled generating factors from multiple tasks and utilizes an invariant-based bi-level optimization mechanism to ensure their causality for meta-learning. Extensive experiments on various benchmark datasets demonstrate that our work achieves state-of-the-art (SOTA) performance. The code is available at https://github.com/WangJingyao07/MetaCRL.

List of keywords

Machine Learning -> ML: Meta-learning
Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning
Machine Learning -> ML: Causality
Machine Learning -> ML: Few-shot learning

313

FLDM-VTON: Faithful Latent Diffusion Model for Virtual Try-on

Chenhui Wang, Tao Chen, Zhihao Chen, Zhizhong Huang, Taoran Jiang, Qi Wang, Hongming Shan

Despite their impressive generative performance, latent diffusion model-based virtual try-on (VTON) methods lack faithfulness to crucial details of the clothes, such as style, pattern, and text. To alleviate these issues caused by the diffusion stochastic nature and latent supervision, we propose a novel Faithful Latent Diffusion Model for VTON, termed FLDM-VTON. FLDM-VTON improves the conventional latent diffusion process in three major aspects. First, we propose incorporating warped clothes as both the starting point and local condition, supplying the model with faithful clothes priors. Second, we introduce a novel clothes flattening network to constrain generated try-on images, providing clothes-consistent faithful supervision. Third, we devise a clothes-posterior sampling for faithful inference, further enhancing the model performance over conventional clothes-agnostic Gaussian sampling. Extensive experimental results on the benchmark VITON-HD and Dress Code datasets demonstrate that our FLDM-VTON outperforms state-of-the-art baselines and is able to generate photo-realistic try-on images with faithful clothing details.

List of keywords

Computer Vision -> CV: Applications
Computer Vision -> CV: Image and video synthesis and generation

322

Bridging the Gap: Learning Pace Synchronization for Open-World Semi-Supervised Learning

Bo Ye, Kai Gan, Tong Wei, Min-Ling Zhang

In open-world semi-supervised learning, a machine learning model is tasked with uncovering novel categories from unlabeled data while maintaining performance on seen categories from labeled data. The central challenge is the substantial learning gap between seen and novel categories, as the model learns the former faster due to accurate supervisory information. Moreover, capturing the semantics of unlabeled novel category samples is also challenging due to the missing label information. To address the above issues, we introduce 1) the adaptive synchronizing marginal loss which imposes class-specific negative margins to alleviate the model bias towards seen classes, and 2) the pseudo-label contrastive clustering which exploits pseudo-labels predicted by the model to group unlabeled data from the same category together in the output space. Extensive experiments on benchmark datasets demonstrate that previous approaches may significantly hinder novel class learning, whereas our method strikingly balances the learning pace between seen and novel classes, achieving a remarkable 3% average accuracy increase on the ImageNet dataset. Importantly, we find that fine-tuning the self-supervised pre-trained model significantly boosts the performance, which is overlooked in prior literature. Our code is available at https://github.com/yebo0216best/LPS-main.

List of keywords

Machine Learning -> ML: Semi-supervised learning
Machine Learning -> ML: Weakly supervised learning

334

Explore Internal and External Similarity for Single Image Deraining with Graph Neural Networks

Cong Wang, Wei Wang, Chengjin Yu, Jie Mu

Patch-level non-local self-similarity is an important property of natural images. However, most existing methods do not incorporate this property into neural networks for image deraining, which limits recovery performance. Motivated by this property, we find that a rainy image exhibits a significant patch recurrence property, that is, similar patches tend to recur many times in one image, its multi-scale images, and external images. To better model this property for image deraining, we develop a multi-scale graph network with exemplars, called MSGNN, that contains two branches: 1) an internal data-based supervised branch is used to model the internal relations of similar patches from the rainy image itself and its multi-scale images and 2) an external data-participated unsupervised branch is used to model the external relations of the similar patches in the rainy image and exemplar. Specifically, we construct a graph model by searching the k-nearest neighboring patches from both the rainy images in a multi-scale framework and the exemplar. After obtaining the corresponding k neighboring patches from the multi-scale images and exemplar, we build a graph and aggregate them in an attentional manner so that the graph can provide more information from similar patches for image deraining. We embed the proposed graph in a deep neural network and train it in an end-to-end manner. Extensive experiments demonstrate that the proposed algorithm performs favorably against eight state-of-the-art methods on five public synthetic datasets and one real-world dataset. The source codes will be available at https://github.com/supersupercong/MSGNN.

List of keywords

Computer Vision -> CV: Applications
Computer Vision -> CV: Computational photography

349

Contrastive and View-Interaction Structure Learning for Multi-view Clustering

Jing Wang, Songhe Feng

[+] More

[-] Less

Existing Deep Multi-view Clustering (DMVC) approaches typically concentrate on capturing consensus semantics from multiple views, where contrastive learning is widely used to align view-specific representations of each view. Unfortunately, view-specific representations are extracted from the content information of the corresponding instance, neglecting the relationships among different instances. Furthermore, existing contrastive loss imports numerous false negative pairs that conflict with the clustering objectives. In response to these challenges, we propose a contraStive and viEw-interaction stRucture learning framework for multI-viEw cluStering (SERIES). Our method takes into account the structural relations among instances and boosts the contrastive loss to improve intra-class compactness. Meanwhile, a cross-view dual relation generation mechanism is introduced to achieve the consensus structural graph across multiple views for clustering. Specifically, we initially acquire view-specific representations using multiple graph autoencoders to exploit both content information and structural information. Furthermore, to pull together the same cluster instances, a soft negative pair aware contrastive loss is employed to distinguish the dissimilar instances while attracting similar instances. Thereafter, the view-specific representations are fed into cross-view dual relation generation layers to generate the affinity matrices of each other, aiming to reveal a consistent structural graph across various views. Extensive experiments conducted on six benchmarks illustrate the superiority of our method compared to other state-of-the-art approaches.

List of keywords

Machine Learning -> ML: Multi-view learning
Machine Learning -> ML: Clustering

362

ELF-UA: Efficient Label-Free User Adaptation in Gaze Estimation

Yong Wu, Yang Wang, Sanqing Qu, Zhijun Li, Guang Chen

We consider the problem of user-adaptive 3D gaze estimation. The performance of person-independent gaze estimation is limited due to interpersonal anatomical differences. Our goal is to provide a personalized gaze estimation model specifically adapted to a target user. Previous work on user-adaptive gaze estimation requires some labeled images of the target person data to fine-tune the model at test time. However, this can be unrealistic in real-world applications, since it is cumbersome for an end-user to provide labeled images. In addition, previous work requires the training data to have both gaze labels and person IDs. This data requirement makes it infeasible to use some of the available data. To tackle these challenges, this paper proposes a new problem called efficient label-free user adaptation in gaze estimation. Our model only needs a few unlabeled images of a target user for the model adaptation. During offline training, we have some labeled source data without person IDs and some unlabeled person-specific data. Our proposed method uses a meta-learning approach to learn how to adapt to a new user with only a few unlabeled images. Our key technical innovation is to use a generalization bound from domain adaptation to define the loss function in meta-learning, so that our method can effectively make use of both the labeled source data and the unlabeled person-specific data during training. Extensive experiments validate the effectiveness of our method on several challenging benchmarks.

List of keywords

Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning
Humans and AI -> HAI: Applications

366

Efficiency Calibration of Implicit Regularization in Deep Networks via Self-paced Curriculum-Driven Singular Value Selection

Zhe Li, Shuo Chen, Jian Yang, Lei Luo

The generalization of neural networks has been a major focus of research in deep learning. It is often interpreted as an implicit bias towards solutions with specific properties. Especially, in practical applications, it has been observed that linear neural networks (LNN) tend to favor low-rank solutions for matrix completion tasks. However, most existing methods rely on increasing the depth of the neural network to enhance the low rank of solutions, resulting in higher complexity. In this paper, we propose a new explicit regularization method that calibrates the implicit bias towards low-rank trends in matrix completion tasks. Our approach automatically incorporates smaller singular values into the training process using a self-paced learning strategy, gradually restoring matrix information. By jointly using both implicit and explicit regularization, we effectively capture the low-rank structure of LNN and accelerate its convergence. We also analyze how our proposed penalty term interacts with implicit regularization and provide theoretical guarantees for our new model. To evaluate the effectiveness of our method, we conduct a series of experiments on both simulated and real-world data. Our experimental results clearly demonstrate that our method has better robustness and generalization ability compared with other methods.
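The notion of gradually admitting smaller singular values can be illustrated with a plain truncated SVD. The sketch below keeps only the `k` largest singular values of a matrix and shows that the reconstruction error shrinks as `k` grows. The function name and the fixed schedule for `k` are assumptions for illustration; the self-paced selection rule and the penalty term used in the paper are not reproduced here.

```python
import numpy as np

def keep_top_singular_values(matrix, k):
    """Reconstruct a matrix from its k largest singular values."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # Scale the first k left singular vectors by their singular values.
    return (u[:, :k] * s[:k]) @ vt[:k]

# Admitting more singular values restores more of the matrix.
target = np.diag([3.0, 2.0, 1.0])
errors = [np.linalg.norm(target - keep_top_singular_values(target, k)) for k in (1, 2, 3)]
```

A self-paced scheme would replace the fixed schedule `(1, 2, 3)` with a rule that decides during training when the next singular value is admitted.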

List of keywords

Machine Learning -> ML: Representation learning
Data Mining -> DM: Recommender systems
Machine Learning -> ML: Theory of deep learning

376

Higher-Order Argumentation Frameworks: Principles and Gradual Semantics

Leila Amgoud, Dragan Doder, Marie-Christine Lagasquie-Schiex

The paper investigates how to evaluate elements in complex argumentation frameworks, where both arguments and attacks are weighted and might be attacked by arguments. We propose the first gradual semantics that assign a numerical value to every argument and attack. The value represents the acceptance (seriousness) degree of an argument (attack). We start by highlighting various technical challenges facing semantics in such complex settings, including how to deal with attacks vs arguments, and how to combine their values. We present principles that describe different strategies offered to semantics to meet such challenges. Then, we introduce various semantics per strategy. For instance, some semantics evaluate attacks and arguments in the same way while others, called hybrid, treat them differently. Finally, the principles are used to compare the plethora of novel semantics. The final result is a catalogue of semantics with different formal guarantees and behaviours.

List of keywords

Knowledge Representation and Reasoning -> KRR: Argumentation
Knowledge Representation and Reasoning -> KRR: Common-sense reasoning

387

InfoMatch: Entropy Neural Estimation for Semi-Supervised Image Classification

Qi Han, Zhibo Tian, Chengwei Xia, Kun Zhan

Semi-supervised image classification, leveraging pseudo supervision and consistency regularization, has demonstrated remarkable success. However, the ongoing challenge lies in fully exploiting the potential of unlabeled data. To address this, we employ information entropy neural estimation to utilize the potential of unlabeled samples. Inspired by contrastive learning, the entropy is estimated by maximizing a lower bound on mutual information across different augmented views. Moreover, we theoretically analyze that the information entropy of the posterior of an image classifier is approximated by maximizing the likelihood function of the softmax predictions. Guided by these insights, we optimize our model from both perspectives to ensure that the predicted probability distribution closely aligns with the ground-truth distribution. Given the theoretical connection to information entropy, we name our method InfoMatch. Through extensive experiments, we show its superior performance. The source code is available at https://github.com/kunzhan/InfoMatch.
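A lower bound on mutual information across augmented views is commonly computed with the InfoNCE estimator from contrastive learning. The sketch below uses that estimator as a stand-in: it assumes row `i` of `z_a` and row `i` of `z_b` are embeddings of two views of the same image, and it returns `log(N)` minus the cross-entropy of matching each row to its partner. The function name and the temperature value are assumptions, and the exact bound used by InfoMatch may differ.

```python
import numpy as np

def info_nce_lower_bound(z_a, z_b, temperature=0.1):
    """InfoNCE estimate of a lower bound on mutual information between two views."""
    # Compare embeddings by cosine similarity.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature
    # Subtract the row maximum for numerical stability before the softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positive pairs sit on the diagonal.
    cross_entropy = -np.mean(np.diag(log_softmax))
    return np.log(len(z_a)) - cross_entropy
```

The estimate can never exceed `log(N)` for a batch of `N` pairs, so larger batches are needed to certify larger amounts of mutual information.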

List of keywords

Machine Learning -> ML: Semi-supervised learning
Machine Learning -> ML: Self-supervised Learning
Machine Learning -> ML: Unsupervised learning
Computer Vision -> CV: Representation learning

395

Separate in the Speech Chain: Cross-Modal Conditional Audio-Visual Target Speech Extraction

Zhaoxi Mu, Xinyu Yang

The integration of visual cues has revitalized the performance of the target speech extraction task, elevating it to the forefront of the field. Nevertheless, this multi-modal learning paradigm often encounters the challenge of modality imbalance. In audio-visual target speech extraction tasks, the audio modality tends to dominate, potentially overshadowing the importance of visual guidance. To tackle this issue, we propose AVSepChain, drawing inspiration from the speech chain concept. Our approach partitions the audio-visual target speech extraction task into two stages: speech perception and speech production. In the speech perception stage, audio serves as the dominant modality, while visual information acts as the conditional modality. Conversely, in the speech production stage, the roles are reversed. This transformation of modality status aims to alleviate the problem of modality imbalance. Additionally, we introduce a contrastive semantic matching loss to ensure that the semantic information conveyed by the generated speech aligns with the semantic information conveyed by lip movements during the speech production stage. Through extensive experiments conducted on multiple benchmark datasets for audio-visual target speech extraction, we showcase the superior performance achieved by our proposed method.

List of keywords

Natural Language Processing -> NLP: Speech
Machine Learning -> ML: Multi-modal learning

404

Scalable Federated Unlearning via Isolated and Coded Sharding

Yijing Lin, Zhipeng Gao, Hongyang Du, Dusit Niyato, Gui Gui, Shuguang Cui, Jinke Ren

Federated unlearning has emerged as a promising paradigm to erase the client-level data effect without affecting the performance of collaborative learning models. However, the federated unlearning process often introduces extensive storage overhead and consumes substantial computational resources, thus hindering its implementation in practice. To address this issue, this paper proposes a scalable federated unlearning framework based on isolated sharding and coded computing. We first divide distributed clients into multiple isolated shards across stages to reduce the number of clients being affected. Then, to reduce the storage overhead of the central server, we develop a coded computing mechanism by compressing the model parameters across different shards. In addition, we provide the theoretical analysis of time efficiency and storage effectiveness for the isolated and coded sharding. Finally, extensive experiments on two typical learning tasks, i.e., classification and generation, demonstrate that our proposed framework can achieve better performance than three state-of-the-art frameworks in terms of accuracy, retraining time, storage overhead, and F1 scores for resisting membership inference attacks.

List of keywords

Machine Learning -> ML: Federated learning
Machine Learning -> ML: Trustworthy machine learning

406

AllMatch: Exploiting All Unlabeled Data for Semi-Supervised Learning

Zhiyu Wu, Jinshi Cui

Existing semi-supervised learning algorithms adopt pseudo-labeling and consistency regulation techniques to introduce supervision signals for unlabeled samples. To overcome the inherent limitation of threshold-based pseudo-labeling, prior studies have attempted to align the confidence threshold with the evolving learning status of the model, which is estimated through the predictions made on the unlabeled data. In this paper, we further reveal that classifier weights can reflect the differentiated learning status across categories and consequently propose a class-specific adaptive threshold mechanism. Additionally, considering that even the optimal threshold scheme cannot resolve the problem of discarding unlabeled samples, a binary classification consistency regulation approach is designed to distinguish candidate classes from negative options for all unlabeled samples. By combining the above strategies, we present a novel SSL algorithm named AllMatch, which achieves improved pseudo-label accuracy and a 100% utilization ratio for the unlabeled data. We extensively evaluate our approach on multiple benchmarks, encompassing both balanced and imbalanced settings. The results demonstrate that AllMatch consistently outperforms existing state-of-the-art methods.

List of keywords

Machine Learning -> ML: Semi-supervised learning

418

IntensPure: Attack Intensity-aware Secondary Domain Adaptive Diffusion for Adversarial Purification

Eun-Gi Lee, Moon Seok Lee, Jae Hyun Yoon, Seok Bong Yoo

Adversarial attacks pose a severe threat to the accuracy of person re-identification (re-ID) systems, a critical security technology. Adversarial purification methods are promising approaches for defending against comprehensive attacks, including unseen ones. However, re-ID testing identities (IDs) are unseen, requiring more sophisticated purification than other classification tasks for adversarial defense. We propose IntensPure, an adversarial purification method in person re-ID that quantifies attack intensity via ID stability and attribute inconsistency to customize purification strength. Based on the estimated attack intensity, IntensPure employs secondary domain adaptive diffusion focused on purifying the low- and mid-frequency coefficients vulnerable to re-ID attacks. This method significantly reduces computational costs compared to the conventional diffusion method. For elaborate purification, IntensPure performs a directional diffusion process and refinements, leveraging the directional characteristics of secondary images. The experimental results on diverse attacks demonstrate that IntensPure outperforms the existing methods in terms of rank-1 accuracy.

List of keywords

Computer Vision -> CV: Adversarial learning, adversarial attack and defense methods
Computer Vision -> CV: Machine learning for vision
Computer Vision -> CV: Recognition (object detection, categorization)

434

Efficient Tuning and Inference for Large Language Models on Textual Graphs

Yun Zhu, Yaoke Wang, Haizhou Shi, Siliang Tang

Rich textual and topological information of textual graphs needs to be modeled in real-world applications such as webpages, e-commerce, and academic articles. Practitioners have long followed the path of adopting a shallow text encoder and a subsequent graph neural network (GNN) to solve this problem. In light of recent advancements in large language models (LLMs), it is apparent that integrating LLMs for enhanced textual encoding can substantially improve performance on textual graphs. Nevertheless, the efficiency of these methods poses a significant challenge. In this paper, we propose ENGINE, a parameter- and memory-efficient fine-tuning method for textual graphs with an LLM encoder. The key insight is to combine the LLMs and GNNs through a tunable side structure, which significantly reduces the training complexity without impairing the joint model’s capacity. Extensive experiments on textual graphs demonstrate our method’s effectiveness by achieving the best model performance while having the lowest training cost compared to previous methods. Moreover, we introduce two variants with caching and dynamic early exit to further enhance training and inference speed. Specifically, caching accelerates ENGINE’s training by 12x, and dynamic early exit achieves up to 5x faster inference with a negligible performance drop (at most a 1.17% relative drop across 7 datasets).

List of keywords

Machine Learning -> ML: Sequence and graph learning
Data Mining -> DM: Mining graphs

439

Exploiting Multi-Label Correlation in Label Distribution Learning

Zhiqiang Kou, Jing Wang, Jiawei Tang, Yuheng Jia, Boyu Shi, Xin Geng

Label Distribution Learning (LDL) is a novel machine learning paradigm that assigns a label distribution to each instance. Numerous LDL methods have been proposed to leverage label correlation in the learning process to solve the exponential-sized output space; among these, many exploit the low-rank structure of the label distribution to capture label correlation. However, recent research has unveiled that label distribution matrices typically maintain full rank, posing a challenge to approaches relying on low-rank label correlation. Notably, low-rank label correlation finds widespread adoption in the multi-label learning (MLL) literature due to the often low-rank nature of multi-label matrices. Inspired by that, we introduce an auxiliary MLL process within the LDL framework, focusing on capturing low-rank label correlation within this auxiliary MLL component rather than the LDL itself. By doing so, we adeptly exploit low-rank label correlation in our LDL methods. We conduct comprehensive experiments and demonstrate that our methods are superior to existing LDL methods. Besides, the ablation studies justify the advantages of exploiting low-rank label correlation in the auxiliary MLL.

List of keywords

Machine Learning -> ML: Multi-label learning
Machine Learning -> ML: Applications
Machine Learning -> ML: Classification

440

Self-Supervised Pre-training with Symmetric Superimposition Modeling for Scene Text Recognition

Zuan Gao, YuXin Wang, Yadong Qu, Boqiang Zhang, Zixiao Wang, Jianjun Xu, Hongtao Xie

In text recognition, self-supervised pre-training emerges as a good solution to reduce dependence on expansive annotated real data. Previous studies primarily focus on local visual representation by leveraging mask image modeling or sequence contrastive learning. However, they omit modeling the linguistic information in text images, which is crucial for recognizing text. To simultaneously capture local character features and linguistic information in visual space, we propose Symmetric Superimposition Modeling (SSM). The objective of SSM is to reconstruct the direction-specific pixel and feature signals from the symmetrically superimposed input. Specifically, we add the original image with its inverted views to create the symmetrically superimposed inputs. At the pixel level, we reconstruct the original and inverted images to capture character shapes and texture-level linguistic context. At the feature level, we reconstruct the feature of the same original image and inverted image with different augmentations to model the semantic-level linguistic context and the local character discrimination. In our design, we disrupt the character shape and linguistic rules. Consequently, the dual-level reconstruction facilitates understanding character shapes and linguistic information from the perspective of visual texture and feature semantics. Experiments on various text recognition benchmarks demonstrate the effectiveness and generality of SSM, with 4.1% average performance gains and 86.6% new state-of-the-art average word accuracy on Union14M benchmarks.

List of keywords

Computer Vision -> CV: Recognition (object detection, categorization)
Computer Vision -> CV: Multimodal learning
Computer Vision -> CV: Representation learning
Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning

465

EvaNet: Elevation-Guided Flood Extent Mapping on Earth Imagery

Mirza Tanzim Sami, Da Yan, Saugat Adhikari, Lyuheng Yuan, Jiao Han, Zhe Jiang, Jalal Khalil, Yang Zhou

Accurate and timely mapping of flood extent from high-resolution satellite imagery plays a crucial role in disaster management such as damage assessment and relief activities. However, current state-of-the-art solutions are based on U-Net, which cannot segment the flood pixels accurately due to the ambiguous pixels (e.g., tree canopies, clouds) that prevent a direct judgement from only the spectral features. Thanks to the digital elevation model (DEM) data readily available from sources such as the United States Geological Survey (USGS), this work explores the use of an elevation map to improve flood extent mapping. We propose EvaNet, an elevation-guided segmentation model based on the encoder-decoder architecture with two novel techniques: (1) a loss function encoding the physical law of gravity that if a location is flooded (resp. dry), then its adjacent locations with a lower (resp. higher) elevation must also be flooded (resp. dry); (2) a new (de)convolution operation that integrates the elevation map by a location-sensitive gating mechanism to regulate how much spectral feature information flows through adjacent layers. Extensive experiments show that EvaNet significantly outperforms the U-Net baselines, and works as a perfect drop-in replacement for U-Net in existing solutions to flood extent mapping. EvaNet is open-sourced at https://github.com/MTSami/EvaNet.
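The gravity rule stated in the abstract above can be illustrated with a small penalty function. This is only a sketch under assumptions, not the loss used in EvaNet: the function name `gravity_violation_penalty`, the 4-neighbor adjacency, and the hinge form are all hypothetical choices made for illustration. It penalizes every adjacent pair in which the higher cell is predicted as more likely flooded than the lower cell.

```python
def gravity_violation_penalty(prob, elev):
    """Hypothetical hinge penalty for violations of the gravity rule.

    prob[i][j]: predicted flood probability in [0, 1]
    elev[i][j]: elevation of the same cell
    For each pair of 4-adjacent cells with different elevations, the lower
    cell should be at least as likely to be flooded as the higher one.
    """
    rows, cols = len(prob), len(prob[0])
    total = 0.0
    for i in range(rows):
        for j in range(cols):
            # Right and down neighbors visit each adjacent pair exactly once.
            for di, dj in ((0, 1), (1, 0)):
                ni, nj = i + di, j + dj
                if ni >= rows or nj >= cols:
                    continue
                if elev[i][j] == elev[ni][nj]:
                    continue  # the rule says nothing about equal elevations
                if elev[i][j] > elev[ni][nj]:
                    p_high, p_low = prob[i][j], prob[ni][nj]
                else:
                    p_high, p_low = prob[ni][nj], prob[i][j]
                # Penalize only when the higher cell looks "more flooded".
                total += max(0.0, p_high - p_low)
    return total
```

A prediction that floods a hilltop but leaves the adjacent valley dry receives a positive penalty, while the reverse prediction receives none. In a trainable model such a term would be written with differentiable tensor operations and added to the segmentation loss with a weight.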

List of keywords

Computer Vision -> CV: 3D computer vision
Computer Vision -> CV: Applications
Computer Vision -> CV: Segmentation
Data Mining -> DM: Mining spatial and/or temporal data

478

Dual Enhancement in ODI Super-Resolution: Adapting Convolution and Upsampling to Projection Distortion

Xiang Ji, Changqiao Xu, Lujie Zhong, Shujie Yang, Han Xiao, Gabriel-Miro Muntean

Omnidirectional images (ODIs) demand considerably higher resolution to ensure high quality across all viewports. Traditional convolutional neural networks (CNN)-based single-image super-resolution (SISR) networks, however, are not effective for spherical ODIs. This is due to the uneven pixel density distribution and varying texture complexity in different regions that arise when projecting from a sphere to a plane. Additionally, the computational and memory costs associated with large-sized ODIs present a challenge for real-world application. To address these issues, we propose an efficient distortion-adaptive super-resolution network (ODA-SRN). Specifically, ODA-SRN employs a series of specially designed Distortion Attention Block Groups (DABG) as its backbone. Our Distortion Attention Blocks (DABs) utilize multi-segment parameterized convolution to generate dynamic filters, which compensate for distortion and texture fading during feature extraction. Moreover, we introduce an upsampling scheme that accounts for the dependence of pixel position and distortion degree to achieve pixel-level distortion offset. A comprehensive set of results demonstrates that our ODA-SRN significantly improves the super-resolution performance for ODIs, both quantitatively and qualitatively, when compared to other state-of-the-art methods.

List of keywords

Computer Vision -> CV: 3D computer vision
Agent-based and Multi-agent Systems -> MAS: Human-agent interaction
Computer Vision -> CV: Applications
Computer Vision -> CV: Structural and model-based approaches, knowledge representation and reasoning

479

Structure-Aware Spatial-Temporal Interaction Network for Video Shadow Detection

Housheng Wei, Guanyu Xing, Jingwei Liao, Yanci Zhang, Yanli Liu

Video shadow detection faces significant challenges due to ambiguous semantics and variable shapes. Existing video shadow detection algorithms typically overlook the fine shadow details, resulting in inconsistent detection between consecutive frames in complex real-world video scenarios. To address this issue, we propose a spatial-temporal feature interaction strategy, which refines and enhances global shadow semantics with local prior features in the modeling of shadow relations between frames. Moreover, a structure-aware shadow prediction module is proposed, which focuses on modeling the distance relation between local shadow edges and regions. Quantitative experimental results demonstrate that our approach significantly outperforms the state-of-the-art methods, providing stable and consistent shadow detection results in complex video shadow scenarios.

List of keywords

Computer Vision -> CV: Recognition (object detection, categorization)
Computer Vision -> CV: Scene analysis and understanding
Computer Vision -> CV: Segmentation
Computer Vision -> CV: Video analysis and understanding

481

Self-Supervised Monocular Depth Estimation in the Dark: Towards Data Distribution Compensation

Haolin Yang, Chaoqiang Zhao, Lu Sheng, Yang Tang

Nighttime self-supervised monocular depth estimation has received increasing attention in recent years. However, using night images for self-supervision is unreliable because the photometric consistency assumption is usually violated in the videos taken under complex lighting conditions. Even with domain adaptation or photometric loss repair, performance is still limited by the poor supervision of night images on trainable networks. In this paper, we propose a self-supervised nighttime monocular depth estimation method that does not use any night images during training. Our framework utilizes day images as a stable source for self-supervision and applies physical priors (e.g., wave optics, reflection model and read-shot noise model) to compensate for some key day-night differences. With day-to-night data distribution compensation, our framework can be trained in an efficient one-stage self-supervised manner. Though no nighttime images are considered during training, qualitative and quantitative results demonstrate that our method achieves SoTA depth estimating results on the challenging nuScenes-Night and RobotCar-Night compared with existing methods.

List of keywords

Computer Vision -> CV: 3D computer vision
Computer Vision -> CV: Other
Computer Vision -> CV: Scene analysis and understanding
Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning

485

SwiftThief: Enhancing Query Efficiency of Model Stealing by Contrastive Learning

Jeonghyun Lee, Sungmin Han, Sangkyun Lee

Model-stealing attacks are emerging as a severe threat to AI-based services because an adversary can create models that duplicate the functionality of the black-box AI models inside the services with regular query-based access. To avoid detection or query costs, the model-stealing adversary must consider minimizing the number of queries to obtain an accurate clone model. To achieve this goal, we propose SwiftThief, a novel model-stealing framework that utilizes both queried and unqueried data to reduce query complexity. In particular, SwiftThief uses contrastive learning, a recent technique for representation learning. We formulate a new objective function for model stealing consisting of self-supervised (for abundant unqueried inputs from public datasets) and soft-supervised (for queried inputs) contrastive losses, jointly optimized with an output matching loss (for queried inputs). In addition, we suggest a new sampling strategy to prioritize rarely queried classes to improve attack performance. Our experiments proved that SwiftThief could significantly enhance the efficiency of model-stealing attacks compared to the existing methods, achieving similar attack performance using only half of the query budgets of the competing approaches. Also, SwiftThief showed high competence even when a defense was activated for the victims.

List of keywords

AI Ethics, Trust, Fairness -> ETF: Safety and robustness
Multidisciplinary Topics and Applications -> MTA: Security and privacy

486

PPTFormer: Pseudo Multi-Perspective Transformer for UAV Segmentation

Deyi Ji, Wenwei Jin, Hongtao Lu, Feng Zhao

The ascension of Unmanned Aerial Vehicles (UAVs) in various fields necessitates effective UAV image segmentation, which faces challenges due to the dynamic perspectives of UAV-captured images. Traditional segmentation algorithms falter as they cannot accurately mimic the complexity of UAV perspectives, and the cost of obtaining multi-perspective labeled datasets is prohibitive. To address these issues, we introduce the PPTFormer, a novel Pseudo Multi-Perspective Transformer network that revolutionizes UAV image segmentation. Our approach circumvents the need for actual multi-perspective data by creating pseudo perspectives for enhanced multi-perspective learning. The PPTFormer network boasts Perspective Decomposition, novel Perspective Prototypes, and a specialized encoder and decoder that together achieve superior segmentation results through Pseudo Multi-Perspective Attention (PMP Attention) and fusion. Our experiments demonstrate that PPTFormer achieves state-of-the-art performance across five UAV segmentation datasets, confirming its capability to effectively simulate UAV flight perspectives and significantly advance segmentation precision. This work presents a pioneering leap in UAV scene understanding and sets a new benchmark for future developments in semantic segmentation.

List of keywords

Computer Vision -> CV: Scene analysis and understanding
Computer Vision -> CV: Segmentation

496

Online Submodular Maximization via Adaptive Thresholds

Zhengchen Yang, Jiping Zheng

Submodular function maximization has been studied extensively in recent years due to its numerous applications in machine learning and artificial intelligence. We study a natural online variant of this problem on massive streaming data in which elements arrive one by one and the algorithm has to maintain a solution under a cardinality constraint k. Upon arrival of an element, the algorithm to maximize a monotone submodular function has to decide whether to accept the element and may replace a previously chosen element. Existing algorithms cannot simultaneously achieve optimal performance in terms of competitive ratio, memory complexity and running time. Also, the algorithm with the best competitive ratio performs poorly in practice. In this paper, we propose a new algorithm OnlineAdaptive with optimal performance by exploiting adaptive thresholds to decide the acceptance of arriving elements by replacement. We prove that the competitive ratio of OnlineAdaptive is at least 1/4, and the ratio is about 0.2959 when k>=4 and approaches 0.3178 when k tends to infinity. In addition, OnlineAdaptive only needs O(k) memory and performs just one oracle call per element. Experiments on diverse datasets confirm that OnlineAdaptive outperforms existing algorithms in both quality and efficiency.
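To make the accept-or-replace setting in the abstract above concrete, here is a minimal sketch of a generic threshold-based swap rule on a set-coverage objective. It is not OnlineAdaptive: the names `coverage` and `stream_with_swaps`, the threshold `c * f(S) / k`, and the choice of coverage as the monotone submodular function are assumptions for illustration. Unlike the algorithm described in the abstract, this sketch evaluates the objective k times per arriving element.

```python
def coverage(sets):
    """Monotone submodular objective: number of distinct items covered."""
    return len(set().union(*sets))


def stream_with_swaps(stream, k, c=1.0):
    """Keep at most k sets; swap in a new set only if the gain is large enough.

    The acceptance threshold c * f(S) / k is a hypothetical choice: it grows
    with the value of the current solution, so later swaps must pay more.
    """
    solution = []
    for item in stream:
        if len(solution) < k:
            solution.append(item)  # fill phase: accept until k sets are held
            continue
        current = coverage(solution)
        best_value, best_idx = current, None
        for idx in range(k):
            # Value of the solution if `item` replaced the set at position idx.
            candidate = solution[:idx] + solution[idx + 1:] + [item]
            value = coverage(candidate)
            if value > best_value:
                best_value, best_idx = value, idx
        threshold = c * current / k
        if best_idx is not None and best_value - current >= threshold:
            solution = solution[:best_idx] + solution[best_idx + 1:] + [item]
    return solution
```

For example, with `k = 2` and the stream `[{1}, {2}, {1, 2, 3, 4}]`, the third set replaces one of the first two because the swap gain of 2 meets the threshold of 1. Memory stays at O(k) sets, matching the memory bound mentioned in the abstract.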

List of keywords

Search -> S: Combinatorial search and optimisation
Search -> S: Heuristic search

509

EVE: Efficient Zero-Shot Text-Based Video Editing With Depth Map Guidance and Temporal Consistency Constraints

Yutao Chen, Xingning Dong, Tian Gan, Chunluan Zhou, Ming Yang, Qingpei Guo

Motivated by the superior performance of image diffusion models, more and more researchers strive to extend these models to the text-based video editing task. Nevertheless, current video editing tasks mainly suffer from the dilemma between the high fine-tuning cost and the limited generation capacity. Compared with images, we conjecture that videos necessitate more constraints to preserve the temporal consistency during editing. Towards this end, we propose EVE, a robust and Efficient zero-shot Video Editing method. Under the guidance of depth maps and temporal consistency constraints, EVE derives satisfactory video editing results with an affordable computational and time cost. Moreover, recognizing the absence of a publicly available video editing dataset for fair comparisons, we construct a new benchmark named ZVE-50 dataset. Through comprehensive experimentation, we validate that EVE achieves a satisfactory trade-off between performance and efficiency. Codebase, datasets, and video editing demos are available at https://github.com/alipay/Ant-Multi-Modal-Framework/blob/main/prj/EVE.

List of keywords

Computer Vision -> CV: Image and video synthesis and generation
Computer Vision -> CV: Applications

524

OD-DETR: Online Distillation for Stabilizing Training of Detection Transformer

Shengjian Wu, Li Sun, Qingli Li

DEtection TRansformer (DETR) has become a dominant paradigm, mainly due to its common architecture with high accuracy and no post-processing. However, DETR suffers from unstable training dynamics. It consumes more data and epochs to converge compared with CNN-based detectors. This paper aims to stabilize DETR training through online distillation. It utilizes a teacher model, accumulated by Exponential Moving Average (EMA), and distills its knowledge into the online model in the following three aspects. First, the matching relation between object queries and ground truth (GT) boxes in the teacher is employed to guide the student, so queries within the student are not only assigned labels based on their own predictions, but also refer to the matching results from the teacher. Second, the teacher’s initial query is given to the online student, and its prediction is directly constrained by the corresponding output from the teacher. Finally, the object queries from the teacher’s different decoding stages are used to build auxiliary groups to accelerate convergence. For each GT, the two queries with the least matching costs are selected into this extra group, and they predict the GT box and participate in the optimization. Extensive experiments show that the proposed OD-DETR successfully stabilizes the training, and significantly increases the performance without bringing in more parameters.
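The EMA-accumulated teacher mentioned in the abstract above follows a standard update rule, sketched below on plain dictionaries of scalar parameters. The function name `ema_update` and the decay value are assumptions for illustration; the abstract does not state the decay used in OD-DETR, and a real implementation would apply the same rule to model tensors.

```python
def ema_update(teacher, student, decay=0.999):
    """Move each teacher parameter a small step toward the student's value.

    teacher, student: dicts mapping parameter names to floats.
    decay: fraction of the old teacher value that is kept (hypothetical default).
    """
    for name, value in student.items():
        teacher[name] = decay * teacher[name] + (1.0 - decay) * value
    return teacher
```

Called once after every optimizer step of the student, this keeps the teacher as a smoothed average of recent student weights, which is why its predictions tend to change more slowly than the student's.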

List of keywords

Computer Vision -> CV: Recognition (object detection, categorization)

549

MLP-DINO: Category Modeling and Query Graphing with Deep MLP for Object Detection

Guiping Cao, Wenjian Huang, Xiangyuan Lan, Jianguo Zhang, Dongmei Jiang, Yaowei Wang

Popular transformer-based detectors detect objects in a one-to-one manner, where both the bounding box and category of each object are predicted by a single query, leading to box-sensitive category predictions. Additionally, the initialization of positional queries solely based on the predicted confidence scores or learnable embeddings neglects the significant spatial interrelation between different queries. This oversight leads to an imbalanced spatial distribution of queries (SDQ). In this paper, we propose a new MLP-DINO model to address these issues. Firstly, we present a new Query-Independent Category Supervision (QICS) approach for modeling category information, decoupling the sensitive bounding box prediction process to improve the detection performance. Secondly, to further improve the category predictions, we introduce a deep MLP model into the transformer-based detection framework to capture the long-range and short-range information simultaneously. Thirdly, to balance the SDQ, we design a novel Graph-based Query Selection (GQS) method that distributes each query point in a discrete manner by graphing the spatial information of queries to cover a broader range of potential objects, significantly enhancing the hit-rate of queries. Experimental results on COCO indicate that our MLP-DINO achieves 54.6% AP with only 44M parameters under the 36-epoch setting, greatly outperforming the original DINO by +3.7% AP with fewer parameters and FLOPs. The source codes will be available at https://github.com/Med-Process/MLP-DINO.

List of keywords

Computer Vision -> CV: Recognition (object detection, categorization)
Computer Vision -> CV: Scene analysis and understanding
Machine Learning -> ML: Multi-label learning

554

Feature Norm Regularized Federated Learning: Utilizing Data Disparities for Model Performance Gains

Ke Hu, Peng Tang, Liyao Xiang, Weidong Qiu

Federated learning (FL) is a machine learning paradigm that aggregates knowledge and utilizes computational power from multiple participants to train a global model. However, a commonplace challenge, non-independent and identically distributed (non-i.i.d.) data across participants, can lead to significant divergence in model updates, thus diminishing training efficacy. In this paper, we propose the Feature Norm Regularized Federated Learning (FNR-FL) algorithm to tackle the non-i.i.d. challenge. FNR-FL incorporates class average feature norms into the loss function by a straightforward yet effective regularization strategy. The core idea of FNR-FL is to penalize the deviations in the update directions of local models caused by the non-i.i.d. data. Theoretically, we provide convergence guarantees for FNR-FL when training under non-i.i.d. scenarios. Practically, our comprehensive experimental evaluations demonstrate that FNR-FL significantly outperforms existing FL algorithms in terms of test accuracy, and maintains a competitive convergence rate with lower communication overhead and shorter duration. Compared to FedAvg, FNR-FL exhibits a 66.24% improvement in accuracy and an 11.40% reduction in training time, underscoring its enhanced effectiveness and efficiency. The code is available on GitHub at: https://github.com/LonelyMoonDesert/FNR-FL.

List of keywords

Multidisciplinary Topics and Applications -> MTA: Social sciences
Machine Learning -> ML: Optimization
Machine Learning -> ML: Robustness

564

Towards Counterfactual Fairness-aware Domain Generalization in Changing Environments

Yujie Lin, Chen Zhao, Minglai Shao, Baoluo Meng, Xujiang Zhao, Haifeng Chen

Domain generalization is a commonplace challenge in machine learning, and in practical scenarios the data distribution might progressively evolve across a continuum of sequential domains. While current methodologies primarily concentrate on bolstering model effectiveness within these new domains, they tend to neglect issues of fairness throughout the learning process. In response, we propose an innovative framework known as Disentanglement for Counterfactual Fairness-aware Domain Generalization (DCFDG). This approach adeptly removes domain-specific information and sensitive information from the embedded representation of classification features. To scrutinize the intricate interplay between semantic information, domain-specific information, and sensitive attributes, we systematically partition the exogenous factors into four latent variables. By incorporating fairness regularization, we utilize semantic information exclusively for classification purposes. Empirical validation on synthetic and authentic datasets substantiates the efficacy of our approach, demonstrating elevated accuracy levels while ensuring the preservation of fairness amidst the evolving landscape of continuous domains.

List of keywords

Machine Learning -> ML: Time series and data streams
AI Ethics, Trust, Fairness -> ETF: Fairness and diversity
Machine Learning -> ML: Causality
Machine Learning -> ML: Generative models

567

Class-Specific Semantic Generation and Reconstruction Learning for Open Set Recognition

Liu Haoyang, Yaojin Lin, Peipei Li, Jun Hu, Xuegang Hu

Open set recognition is a crucial research theme for open-environment machine learning. For this problem, a common solution is to learn compact representations of known classes and identify unknown samples by measuring deviations from these known classes. However, the aforementioned methods (1) lack open training consideration, which is detrimental to the fitting of known classes, and (2) recognize unknown classes on an inadequate basis, which limits the accuracy of recognition. In this study, we propose an open reconstruction learning framework that learns a union boundary region of known classes to characterize unknown space. This facilitates the isolation of known space from unknown space to represent known classes compactly and provides a more reliable recognition basis from the perspective of both known and unknown space. Specifically, an adversarial constraint is used to generate class-specific boundary samples. Then, the known classes and approximate unknown space are fitted with manifolds represented by class-specific auto-encoders. Finally, the auto-encoders output the reconstruction error in terms of known and unknown spaces to recognize samples. Extensive experimental results show that the proposed method outperforms existing advanced methods and achieves new state-of-the-art performance. The code is available at https://github.com/Ashowman98/CSGRL.

List of keywords

Data Mining -> DM: Other
Data Mining -> DM: Anomaly/outlier detection

577

Learning with Posterior Sampling for Revenue Management under Time-varying Demand

Kazuma Shimizu, Junya Honda, Shinji Ito, Shinji Nakadai

This paper discusses the revenue management (RM) problem to maximize revenue by pricing items or services. One challenge in this problem is that the demand distribution is unknown and varies over time in real applications such as airline and retail industries. In particular, the time-varying demand has not been well studied under scenarios of unknown demand due to the difficulty of jointly managing the remaining inventory and estimating the demand. To tackle this challenge, we first introduce an episodic generalization of the RM problem motivated by typical application scenarios. We then propose a computationally efficient algorithm based on posterior sampling, which effectively optimizes prices by solving linear programming. We derive a Bayesian regret upper bound of this algorithm for general models where demand parameters can be correlated between time periods, while also deriving a regret lower bound for generic algorithms. Our empirical study shows that the proposed algorithm performs better than other benchmark algorithms and comparably to the optimal policy in hindsight. We also propose a heuristic modification of the proposed algorithm, which further efficiently learns the pricing policy in the experiments.

List of keywords

Machine Learning -> ML: Online learning
Machine Learning -> ML: Bayesian learning
Machine Learning -> ML: Multi-armed bandits

586

ParsNets: A Parsimonious Composition of Orthogonal and Low-Rank Linear Networks for Zero-Shot Learning

Jingcai Guo, Qihua Zhou, Xiaocheng Lu, Ruibin Li, Ziming Liu, Jie Zhang, Bo Han, Junyang Chen, Xin Xie, Song Guo

This paper provides a novel parsimonious yet efficient design for zero-shot learning (ZSL), dubbed ParsNets, in which we are interested in learning a composition of on-device friendly linear networks, each with orthogonality and low-rankness properties, to achieve equivalent or better performance against deep models. Concretely, we first refactor the core module of ZSL, i.e., the visual-semantics mapping function, into several base linear networks that correspond to diverse components of the semantic space, wherein the complex nonlinearity can be collapsed into simple local linearities. Then, to facilitate the generalization of local linearities, we construct a maximal margin geometry on the learned features by enforcing low-rank constraints on intra-class samples and high-rank constraints on inter-class samples, resulting in orthogonal subspaces for different classes. To enhance the model’s adaptability and counterbalance the over-/under-fittings, a set of sample-wise indicators is employed to select a sparse subset from these base linear networks to form a composite semantic predictor for each sample. Notably, maximal margin geometry can guarantee the diversity of features and, meanwhile, local linearities guarantee efficiency. Thus, our ParsNets can generalize better to unseen classes and can be deployed flexibly on resource-constrained devices.

List of keywords

Machine Learning -> ML: Cost-sensitive learning
Machine Learning -> ML: Ensemble methods
Machine Learning -> ML: Few-shot learning
Machine Learning -> ML: Learning sparse models

603

A Swap Relaxation-Based Local Search for the Latin Square Completion Problem

Zhenxuan Xie, Zhipeng Lü, Zhouxing Su, Chu-Min Li, Junwen Ding, Yuxuan Wang

The Latin square completion (LSC) problem aims to assign n symbols to the empty cells of a partially filled Latin square such that in each row and each column, each symbol appears exactly once. In this paper, we propose a swap relaxation-based fast local search algorithm called SRLS for solving the LSC problem. First, it introduces a novel search space definition that forbids row conflicts, on which a swap-based neighborhood is defined. Second, a color domain relaxation technique is employed in the swap-based neighborhood by temporarily accepting the violation of some constraints to connect high-quality solutions. Third, two effective scoring functions are adopted to select neighborhood moves that minimize the number of conflicting edges as well as the number of color domain violations. Finally, SRLS employs an adaptive restart mechanism to balance the exploitation and exploration of the search. Extensive experiments on 1819 public benchmark instances demonstrate that SRLS outperforms the state-of-the-art algorithms in the literature in terms of both success rate and computational efficiency.

List of keywords

Search -> S: Local search
Search -> S: Heuristic search
Search -> S: Meta-reasoning and meta-heuristics
Search -> S: Combinatorial search and optimisation
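
The search space described in the abstract above (rows kept conflict-free, moves that swap two cells within a row) can be illustrated with a minimal sketch. This is not SRLS itself: the function names are hypothetical, pre-filled cells and the color domain relaxation are ignored, and only column conflicts are scored.

```python
def column_conflicts(square):
    """Count pairs of cells that share a column and hold the same symbol."""
    n = len(square)
    conflicts = 0
    for c in range(n):
        counts = {}
        for r in range(n):
            counts[square[r][c]] = counts.get(square[r][c], 0) + 1
        # k equal symbols in one column produce k*(k-1)/2 conflicting pairs.
        conflicts += sum(k * (k - 1) // 2 for k in counts.values())
    return conflicts


def swap_in_row(square, row, c1, c2):
    """Return a copy of the square with two cells of one row exchanged.

    Each row stays a permutation of the symbols, so no row conflict
    can ever be introduced by this move.
    """
    new_square = [list(r) for r in square]
    new_square[row][c1], new_square[row][c2] = new_square[row][c2], new_square[row][c1]
    return new_square


# Hypothetical 3x3 assignment: every row is a permutation, columns are not.
start = [[0, 1, 2],
         [1, 0, 2],
         [2, 0, 1]]
improved = swap_in_row(start, row=1, c1=1, c2=2)
print(column_conflicts(start), column_conflicts(improved))
```

Because every move preserves the row permutations, a local search over this neighborhood only has to drive the column conflict count to zero.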

605

LLM-based Multi-Level Knowledge Generation for Few-shot Knowledge Graph Completion

Qian Li, Zhuo Chen, Cheng Ji, Shiqi Jiang, Jianxin Li

Knowledge Graphs (KGs) are pivotal in various NLP applications but often grapple with incompleteness, especially due to the long-tail problem where infrequent, unpopular relationships drastically reduce the KG completion performance. In this paper, we focus on Few-shot Knowledge Graph Completion (FKGC), a task addressing these gaps in long-tail scenarios. Amidst the rapid evolution of Large Language Models, we propose a generation-based FKGC paradigm facilitated by LLM distillation. Our MuKDC framework employs multi-level knowledge distillation for few-shot KG completion, generating supplementary knowledge to mitigate data scarcity in few-shot environments. MuKDC comprises two primary components: Multi-level Knowledge Generation, which enriches the KG at various levels, and Consistency Assessment, to ensure the coherence and reliability of the generated knowledge. Most notably, our method achieves SOTA results in both FKGC and multi-modal FKGC benchmarks, significantly advancing KG completion and enhancing the understanding and application of LLMs in structured knowledge generation and assessment.

List of keywords

Data Mining -> DM: Knowledge graphs and knowledge base completion
Natural Language Processing -> NLP: Applications

616

A Multi-Valued Decision Diagram-Based Approach to Constrained Optimal Path Problems over Directed Acyclic Graphs

Mingwei Zhang, Liangda Fang, Zhenhao Gu, Quanlong Guan, Yong Lai

Numerous combinatorial optimization problems can be reduced to the optimal path problem over directed acyclic graphs (DAGs). The constrained version of the optimal path problem requires the solution to satisfy a given logical constraint. Nishino et al. [2015] proposed an efficient algorithm, namely BDD-constrained search (BCS), for the constrained optimal path problem over DAGs. This algorithm considers edges as variables and constraints as Boolean functions and maintains constraints via binary decision diagrams (BDDs), a compact form of Boolean functions. However, BCS involves redundant operations during the search process. To reduce these redundant operations, we use vertices instead of edges as variables and hence represent constraints as multi-valued functions. Given this multi-valued representation of constraints, we propose a novel algorithm, namely MDD-constrained search (MCS), which replaces BDDs with multi-valued decision diagrams (MDDs), an efficient representation of multi-valued functions. In addition, we improve MCS via domain reduction in multi-valued functions. Experimental results show that our proposed algorithm outperforms BCS.

List of keywords

Constraint Satisfaction and Optimization -> CSO: Constraint optimization problems
Constraint Satisfaction and Optimization -> CSO: Applications
Constraint Satisfaction and Optimization -> CSO: Solvers and tools
Knowledge Representation and Reasoning -> KRR: Knowledge compilation

620

GenSeg: On Generating Unified Adversary for Segmentation

Yuxuan Zhang, Zhenbo Shi, Wei Yang, Shuchang Wang, Shaowei Wang, Yinxing Xue

Great advancements in semantic, instance, and panoptic segmentation have been made in recent years, yet the top-performing models remain vulnerable to imperceptible adversarial perturbation. Current attacks on segmentation primarily focus on a single task, and these methods typically rely on iterative instance-specific strategies, resulting in limited attack transferability and low efficiency. In this paper, we propose GenSeg, a Generative paradigm that creates unified adversaries for Segmentation tasks. In particular, we propose an intermediate-level objective to enhance attack transferability, including a mutual agreement loss for feature deviation, and a prototype obfuscating loss to disrupt intra-class and inter-class relationships. Moreover, GenSeg crafts an adversary in a single forward pass, significantly boosting the attack efficiency. In addition, we unify multiple segmentation tasks under GenSeg through a novel category-and-mask view, which makes it possible to attack these segmentation tasks within a single framework and to conduct cross-domain and cross-task attacks as well. Extensive experiments demonstrate the superiority of GenSeg in black-box attacks compared with state-of-the-art attacks. To the best of our knowledge, GenSeg is the first approach capable of conducting cross-domain and cross-task attacks on segmentation tasks, which are closer to real-world scenarios.

List of keywords

Computer Vision -> CV: Segmentation
Computer Vision -> CV: Adversarial learning, adversarial attack and defense methods

634

Enhancing Scalability of Metric Differential Privacy via Secret Dataset Partitioning and Benders Decomposition

Chenxi Qiu

Metric Differential Privacy (mDP) extends the concept of Differential Privacy (DP) to serve as a new paradigm of data perturbation. It is designed to protect secret data represented in general metric space, such as text data encoded as word embeddings or geo-location data on the road network or grid maps. To derive an optimal data perturbation mechanism under mDP, a widely used method is linear programming (LP), which, however, might suffer from a polynomial explosion of decision variables, rendering it impractical in large-scale mDP. In this paper, our objective is to develop a new computation framework to enhance the scalability of the LP-based mDP. Considering the connections established by the mDP constraints among the secret records, we partition the original secret dataset into various subsets. Building upon the partition, we reformulate the LP problem for mDP and solve it via Benders Decomposition, which is composed of two stages: (1) a master program to manage the perturbation calculation across subsets, and (2) a set of subproblems, each managing the perturbation derivation within a subset. Our experimental results on multiple datasets, including geo-location data in the road network/grid maps, text data, and synthetic data, underscore our proposed mechanism’s superior scalability and efficiency.

List of keywords

Constraint Satisfaction and Optimization -> CSO: Constraint optimization problems
Constraint Satisfaction and Optimization -> CSO: Constraint programming
Multidisciplinary Topics and Applications -> MTA: Security and privacy

638

Learning a Spiking Neural Network for Efficient Image Deraining

Tianyu Song, Guiyue Jin, Pengpeng Li, Kui Jiang, Xiang Chen, Jiyu Jin

Recently, spiking neural networks (SNNs) have demonstrated substantial potential in computer vision tasks. In this paper, we present an Efficient Spiking Deraining Network, called ESDNet. Our work is motivated by the observation that rain pixel values will lead to a more pronounced intensity of spike signals in SNNs. However, directly applying deep SNNs to the image deraining task remains a significant challenge. This is attributed to the information loss and training difficulties that arise from discrete binary activation and complex spatiotemporal dynamics. To this end, we develop a spiking residual block to convert the input into spike signals, then adaptively optimize the membrane potential by introducing attention weights to adjust spike responses in a data-driven manner, alleviating information loss caused by discrete binary activation. In this way, our ESDNet can effectively detect and analyze the characteristics of rain streaks by learning their fluctuations. This also enables better guidance for the deraining process and facilitates high-quality image reconstruction. Instead of relying on the ANN-SNN conversion strategy, we introduce a gradient proxy strategy to train the model directly, overcoming the training difficulties. Experimental results show that our approach achieves performance comparable to ANN-based methods while reducing energy consumption by 41%.

List of keywords

Computer Vision -> CV: Image and video synthesis and generation
Computer Vision -> CV: Applications
Computer Vision -> CV: Computational photography

652

Revitalizing Real Image Deraining via a Generic Paradigm towards Multiple Rainy Patterns

Li Xin, Yuxin Feng, Fan Zhou, Yun Liang, Zhuo Su

Synthetic data-driven methods perform well on the image rain removal task, but they still face many challenges in real rainfall scenarios due to the complexity and diversity of rainy patterns. In this paper, we propose a new generic paradigm for real image deraining from the perspective of synthesizing data covering more rainy patterns and constructing image rain removal networks with strong generalization performance. Firstly, instead of simply superimposing rain layers, we integrate various rainy patterns and design a phenomenal pipeline that incorporates multiple degradation types. Secondly, we construct a Patterns-aware Rain Removal Network (PRRN), which learns from both synthetic and real data simultaneously. In addition, to eliminate the inevitable distribution differences between synthetic and real data, we design a new Multi-representation Inter-domain Alignment Module (MIAM) in PRRN. By using multiple parallel submodules, MIAM achieves alignment of data domains in multiple feature subspaces. Through extensive experiments on five challenging real datasets, evaluated with several authoritative objective metrics, we validate the effectiveness and robustness of the proposed method in real scenarios.

List of keywords

Computer Vision -> CV: Computational photography
Computer Vision -> CV: Machine learning for vision
Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning

669

EAT: Self-Supervised Pre-Training with Efficient Audio Transformer

Wenxi Chen, Yuzhe Liang, Ziyang Ma, Zhisheng Zheng, Xie Chen

Audio self-supervised learning (SSL) pre-training, which aims to learn good representations from unlabeled audio, has made remarkable progress. However, the extensive computational demands during pre-training pose a significant barrier to the potential application and optimization of audio SSL models. In this paper, inspired by the success of data2vec 2.0 in the image modality and Audio-MAE in the audio modality, we introduce Efficient Audio Transformer (EAT) to further improve the effectiveness and efficiency of audio SSL. The proposed EAT adopts the bootstrap self-supervised training paradigm to the audio domain. A novel Utterance-Frame Objective (UFO) is designed to enhance the modeling capability of acoustic events. Furthermore, we reveal that the masking strategy is critical in audio SSL pre-training, and superior audio representations can be obtained with large inverse block masks. Experimental results demonstrate that EAT achieves state-of-the-art (SOTA) performance on a range of audio-related tasks, including AudioSet (AS-2M, AS-20K), ESC-50, and SPC-2, along with a significant pre-training speedup of up to ~15x compared to existing audio SSL models.

List of keywords

Machine Learning -> ML: Representation learning
Machine Learning -> ML: Self-supervised Learning
Natural Language Processing -> NLP: Speech

670

Graph Collaborative Expert Finding with Contrastive Learning

Qiyao Peng, Wenjun Wang, Hongtao Liu, Cuiying Huo, Minglai Shao

In Community Question Answering (CQA) websites, most current expert finding methods model expert embeddings from textual features and optimize them with expert-question first-order interactions, i.e., the fact that an expert has answered a question. In this paper, we address the limitation that current models typically neglect the intrinsic high-order connectivity within expert-question interactions, which is pivotal for collaborative effects. We introduce an innovative and simple approach: conceptualizing expert-question interactions as a bipartite graph, we propose a novel graph-based expert finding method based on contrastive learning, named CGEF, to effectively capture both first-order and intricate high-order connectivity. Specifically, we employ a question encoder to model questions from titles and employ the graph attention network to recursively propagate embeddings. Besides, to alleviate the problem of sparse interactions, we devise two auxiliary tasks to enhance expert modeling. First, we generate multiple views of one expert, including: 1) behavior-level augmentation, which drops interaction edges randomly in the graph; and 2) interest-level augmentation, which randomly replaces question titles with tags in the graph. Then we maximize the agreement between one expert and the corresponding augmented expert on a specific view. In this way, the model can effectively inject collaborative signals into expert modeling. Extensive experiments on six CQA datasets demonstrate significant improvements compared with recent methods.

List of keywords

Data Mining -> DM: Mining graphs
Data Mining -> DM: Mining text, web, social media
Data Mining -> DM: Networks
Data Mining -> DM: Recommender systems

671

Boosting Efficiency in Task-Agnostic Exploration through Causal Knowledge

Yupei Yang, Biwei Huang, Shikui Tu, Lei Xu

The effectiveness of model training heavily relies on the quality of available training resources. However, budget constraints often impose limitations on data collection efforts. To tackle this challenge, we introduce causal exploration in this paper, a strategy that leverages the underlying causal knowledge for both data collection and model training. We, in particular, focus on enhancing the sample efficiency and reliability of the world model learning within the domain of task-agnostic reinforcement learning. During the exploration phase, the agent actively selects actions expected to yield causal insights most beneficial for world model training. Concurrently, the causal knowledge is acquired and incrementally refined with the ongoing collection of data. We demonstrate that causal exploration aids in learning accurate world models using fewer data and provide theoretical guarantees for its convergence. Empirical experiments, on both synthetic data and real-world applications, further validate the benefits of causal exploration. The source code is available at https://github.com/CMACH508/CausalExploration.

List of keywords

Machine Learning -> ML: Reinforcement learning
Machine Learning -> ML: Active learning
Machine Learning -> ML: Causality
Uncertainty in AI -> UAI: Causality, structural causal models and causal inference

685

Agentive Permissions in Multiagent Systems

Qi Shi

This paper proposes to distinguish four forms of agentive permissions in multiagent settings. The main technical results are the complexity analysis of model checking, the semantic undefinability of modalities that capture these forms of permissions through each other, and a complete logical system capturing the interplay between these modalities.

List of keywords

Knowledge Representation and Reasoning -> KRR: Reasoning about actions
AI Ethics, Trust, Fairness -> ETF: Moral decision making
AI Ethics, Trust, Fairness -> ETF: AI and law, governance, regulation
AI Ethics, Trust, Fairness -> ETF: Ethical, legal and societal issues

694

Exploring the Inefficiency of Heavy Ball as Momentum Parameter Approaches 1

Xiaoge Deng, Tao Sun, Dongsheng Li, Xicheng Lu

The heavy ball momentum method is a commonly used technique for accelerating training processes in the machine learning community. However, empirical evidence suggests that the convergence of stochastic gradient descent (SGD) with heavy ball may slow down when the momentum hyperparameter approaches 1. Despite this observation, there are no established theories or solutions to explain and address this issue. In this study, we provide the first theoretical result that elucidates why momentum slows down SGD as it tends to 1. To better understand this inefficiency, we focus on the quadratic convex objective in the analysis. Our findings show that momentum accelerates SGD when the scaling parameter is not very close to 1. Conversely, when the scaling parameter approaches 1, momentum impairs SGD and degrades its stability. Based on the theoretical findings, we propose a descending warmup technique for the heavy ball momentum, which exploits the advantages of the heavy ball method and overcomes the inefficiency problem when the momentum tends to 1. Numerical results demonstrate the effectiveness of the proposed SHB-DW algorithm.

List of keywords

Machine Learning -> ML: Optimization
Machine Learning -> ML: Applications
Machine Learning -> ML: Learning theory
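
The slowdown described in the abstract above can be glimpsed on a toy problem. The sketch below is a deterministic illustration only: it runs the standard heavy ball update on a one-dimensional quadratic with no gradient noise, so it is neither the paper's stochastic analysis nor its SHB-DW algorithm. The function name, step size, curvature, and step count are assumptions chosen for the example.

```python
def heavy_ball(beta, lr=0.1, curvature=1.0, x0=1.0, steps=100):
    """Run deterministic heavy ball on f(x) = 0.5 * curvature * x**2.

    Returns the list of iterates x_0, ..., x_steps.
    """
    x_prev, x = x0, x0  # start at rest: zero initial velocity
    trajectory = [x]
    for _ in range(steps):
        grad = curvature * x
        # Gradient step plus a momentum term that reuses the last displacement.
        x_next = x - lr * grad + beta * (x - x_prev)
        x_prev, x = x, x_next
        trajectory.append(x)
    return trajectory


moderate = heavy_ball(beta=0.5)
near_one = heavy_ball(beta=0.999)

# Distance from the minimizer x = 0 at the end of the run, and the largest
# excursion over the final 20 iterates for the momentum close to 1.
print(abs(moderate[-1]))
print(max(abs(v) for v in near_one[-20:]))
```

On this quadratic the iterates contract at a rate of roughly the square root of the momentum parameter once the dynamics become oscillatory, which is why a value of 0.999 is still swinging with near-initial amplitude after 100 steps while 0.5 has essentially converged.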

705

Counterfactual User Sequence Synthesis Augmented with Continuous Time Dynamic Preference Modeling for Sequential POI Recommendation

Lianyong Qi, Yuwen Liu, Weiming Liu, Shichao Pei, Xiaolong Xu, Xuyun Zhang, Yingjie Wang, Wanchun Dou

With the proliferation of Location-based Social Networks (LBSNs), user check-in data at Points-of-Interest (POIs) has surged, offering rich insights into user preferences. However, sequential POI recommendation systems face two pivotal challenges. The first lies in the difficulty of modeling time in a discrete space, which fails to accurately capture the dynamic nature of user preferences. The second is the inherent sparsity and noise in continuous POI recommendation, which hinder the recommendation process. To address these challenges, we propose counterfactual user sequence synthesis with continuous time dynamic preference modeling (CussCtpm). CussCtpm innovatively combines Gated Recurrent Unit (GRU) with neural Ordinary Differential Equations (ODEs) to model user preferences in a continuous time framework. CussCtpm captures user preferences at both the POI-level and interest-level, identifying deterministic and non-deterministic preference concepts. Particularly at the interest-level, we employ GRU and neural ODEs to model users’ dynamic preferences in continuous space, aiming to capture finer-grained shifts in user preferences over time. Furthermore, CussCtpm utilizes counterfactual data augmentation to generate counterfactual positive and negative user sequences. Our extensive experiments on two widely-used public datasets demonstrate that CussCtpm outperforms several advanced baseline models.

List of keywords

Data Mining -> DM: Recommender systems

709

Conflict-Alleviated Gradient Descent for Adaptive Object Detection

Wenxu Shi, Bochuan Zheng

Unsupervised domain adaptive object detection (DAOD) aims to adapt detectors from a labeled source domain to an unlabelled target domain. Existing DAOD works learn feature representations that are both class-discriminative and domain-invariant by jointly minimizing the loss across domain alignment and detection tasks. However, joint resolution of different tasks may lead to conflicts, with one contributing factor being gradient conflicts during optimization. If left untouched, such disagreement may degrade adaptation performance. In this work, we propose an efficient optimization strategy named Conflict-Alleviated Gradient descent (CAGrad) which aims to alleviate the conflict between two tasks (i.e., alignment and classification). Particularly, we alter the gradients by projecting each onto the normal plane of the other. The projection operation changes conflicting gradients from obtuse angles to acute angles, thus alleviating the conflict and achieving gradient harmonization. We further validate our theoretical analysis and methods on several domain adaptive object detection tasks, including cross-camera, weather, scene, and synthetic to real-world adaptation. Extensive experiments on multiple DAOD benchmarks demonstrate the effectiveness and superiority of our CAGrad.

List of keywords

Computer Vision -> CV: Recognition (object detection, categorization)
Machine Learning -> ML: Optimization
Machine Learning -> ML: Unsupervised learning
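
The projection step mentioned in the abstract above (projecting each gradient onto the normal plane of the other when the two conflict) can be sketched on plain vectors. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, gradients are flat Python lists, and each gradient is projected against the other's original value.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_if_conflicting(g, other):
    """Drop from g its component along `other` when the angle is obtuse."""
    d = dot(g, other)
    if d >= 0:
        return list(g)  # no conflict: leave the gradient unchanged
    scale = d / dot(other, other)
    # Subtracting the projection leaves a vector orthogonal to `other`.
    return [a - scale * b for a, b in zip(g, other)]


def harmonize(g_align, g_det):
    """Project both task gradients against each other's original value."""
    return (project_if_conflicting(g_align, g_det),
            project_if_conflicting(g_det, g_align))


# Hypothetical alignment and detection gradients with a negative inner product.
g_align, g_det = [1.0, 0.0], [-1.0, 1.0]
new_align, new_det = harmonize(g_align, g_det)
print(new_align, new_det, dot(new_align, new_det))
```

In this example the original inner product is negative, while the two projected gradients have a positive inner product, matching the obtuse-to-acute change the abstract describes.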

715

Invertible Residual Rescaling Models

Jinmin Li, Tao Dai, Yaohua Zha, Yilu Luo, Longfei Lu, Bin Chen, Zhi Wang, Shu-Tao Xia, Jingyun Zhang

Invertible Rescaling Networks (IRNs) and their variants have witnessed remarkable achievements in various image processing tasks like image rescaling. However, we observe that IRNs with deeper networks are difficult to train, thus hindering the representational ability of IRNs. To address this issue, we propose Invertible Residual Rescaling Models (IRRM) for image rescaling by learning a bijection between a high-resolution image and its low-resolution counterpart with a specific distribution. Specifically, we propose IRRM to build a deep network, which contains several Residual Downscaling Modules (RDMs) with long skip connections. Each RDM consists of several Invertible Residual Blocks (IRBs) with short connections. In this way, RDM allows rich low-frequency information to be bypassed by skip connections and forces models to focus on extracting high-frequency information from the image. Extensive experiments show that our IRRM performs significantly better than other state-of-the-art methods with far fewer parameters and lower complexity. In particular, our IRRM achieves PSNR gains of at least 0.3 dB over both HCFlow and IRN in ×4 rescaling while using only 60% of the parameters and 50% of the FLOPs. The code will be available at https://github.com/THU-Kingmin/IRRM.

List of keywords

Computer Vision -> CV: Image and video synthesis and generation
Computer Vision -> CV: Applications

719

Probabilistic Contrastive Learning for Domain Adaptation

Junjie Li, Yixin Zhang, Zilei Wang, Saihui Hou, Keyu Tu, Man Zhang

Contrastive learning has shown impressive success in enhancing feature discriminability for various visual tasks in a self-supervised manner, but the standard contrastive paradigm (features+l2 normalization) has limited benefits when applied in domain adaptation. We find that this is mainly because the class weights (weights of the final fully connected layer) are ignored in the domain adaptation optimization process, which makes it difficult for features to cluster around the corresponding class weights. To solve this problem, we propose the simple but powerful Probabilistic Contrastive Learning (PCL), which moves beyond the standard paradigm by removing l2 normalization and replacing the features with probabilities. PCL can guide the probability distribution towards a one-hot configuration, thus minimizing the discrepancy between features and class weights. We conduct extensive experiments to validate the effectiveness of PCL and observe consistent performance gains on five tasks, i.e., Unsupervised/Semi-Supervised Domain Adaptation (UDA/SSDA), Semi-Supervised Learning (SSL), UDA Detection and Semantic Segmentation. Notably, for UDA Semantic Segmentation on SYNTHIA, PCL surpasses the sophisticated CPSL-D by 2% in terms of mean IoU with a much lower training cost (PCL: 1*3090, 5 days vs. CPSL-D: 4*V100, 11 days). Code is available at https://github.com/ljjcoder/Probabilistic-Contrastive-Learning.

List of keywords

Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning
Computer Vision -> CV: Recognition (object detection, categorization)

756

WSRFNet: Wavelet-Based Scale-Specific Recurrent Feedback Network for Diabetic Retinopathy Lesion Segmentation

Xuan Li, Xiangqian Wu

Diabetic retinopathy lesion segmentation (DRLS) faces the challenge of significant variation in the size of different lesions. An effective way to address this challenge is to fuse multi-scale features. To boost the performance of this kind of method, most existing DRLS methods work on devising sophisticated multi-scale feature fusion modules. In contrast, we focus on improving the quality of the multi-scale features to enhance the fused multi-scale feature representation. To this end, we design a Wavelet-based Scale-specific Recurrent Feedback Network (WSRFNet), which refines multi-scale features using a recurrent feedback mechanism. Specifically, to avoid information loss when introducing feedback to multi-scale features, we propose a wavelet-based feedback pyramid module (WFPM), which is based on a reversible downsampling operation, i.e., Haar wavelet transform. Unlike the scale-agnostic feedback used in previous feedback methods, we develop a scale-specific refinement module (SRM), which utilizes scale-specific feedback to refine features of different scales in a targeted manner. Experimental results on IDRiD and DDR datasets show that our approach outperforms state-of-the-art models. The code is available at https://github.com/xuanli01/WSRFNet.

List of keywords

Computer Vision -> CV: Biomedical image analysis
Computer Vision -> CV: Segmentation
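
The abstract above relies on the Haar wavelet transform being a reversible downsampling operation. A minimal one-dimensional sketch of that property follows; it is an illustration only, not the WFPM module, which operates on two-dimensional feature maps. The function names and the unnormalized average/difference form are assumptions made for the example.

```python
def haar_down(signal):
    """Halve the length of a signal, keeping averages and differences.

    Returns (low, high): pairwise averages and pairwise half-differences.
    """
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    low = [(a + b) / 2 for a, b in pairs]
    high = [(a - b) / 2 for a, b in pairs]
    return low, high


def haar_up(low, high):
    """Invert haar_down: rebuild each original pair from (average, difference)."""
    signal = []
    for avg, diff in zip(low, high):
        signal.extend([avg + diff, avg - diff])
    return signal


low, high = haar_down([4, 2, 5, 9])
restored = haar_up(low, high)
print(low, high, restored)
```

The low band is a half-resolution copy of the input, and because the high band is kept alongside it, the original can be reconstructed exactly, unlike with pooling or strided subsampling.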

768

Generating More Audios for End-to-End Spoken Language Understanding

Xuxin Cheng, Yuexian Zou

End-to-end spoken language understanding (SLU) aims to directly capture the comprehensive semantics from the given spoken utterance without generating any transcript. Since transcripts might not always be available, Textless SLU, which eliminates the need for transcripts, is attracting increasing attention, but it often does not perform as well as SLU models trained with transcripts. In this paper, we focus on the scenarios where the transcripts are not available and propose a framework, GMA-SLU, to generate more audios according to the labels. In order to alleviate the modality gap between text and audio, two language models are developed and discrete tokens are utilized as a bridge, where the first language model utilizes labels to generate semantic tokens and the second language model adopts these obtained semantic tokens and the acoustic tokens of source audios to generate the synthetic audios. All the experiments are conducted on the monolingual SLU dataset SLURP and the multilingual SLU dataset MINDS-14. Experimental results show that our method outperforms the previous best Textless End-to-end SLU models and achieves performance comparable to that of models trained with the assistance of the corresponding transcripts.

List of keywords

Natural Language Processing -> NLP: Dialogue and interactive systems

775

A New Guaranteed Outlier Removal Method Based on Plane Constraints for Large-Scale LiDAR Point Cloud Registration

Gang Ma, Hui Wei, Runfeng Lin, Jialiang Wu

In this paper, we present a novel registration method based on plane constraints for large-scale LiDAR point clouds, effectively decoupling rotation estimation and translation estimation. For rotation estimation, we propose an outlier removal method that combines coarse filtering with rotation-invariant constraints and refined filtering based on computational geometric consistency checks, effectively pruning outliers and robustly estimating accurate relative rotations from plane normals. In translation estimation, we propose a component-wise method based on plane translation constraints to efficiently estimate relative translations. The robustness and effectiveness of our proposed method are empirically validated on three popular LiDAR point cloud datasets. The experimental results convincingly demonstrate that our approach achieves state-of-the-art performance.

List of keywords

Robotics -> ROB: Robotics and vision
Computer Vision -> CV: 3D computer vision
Computer Vision -> CV: Scene analysis and understanding
Robotics -> ROB: Perception

778

OSIC: A New One-Stage Image Captioner Coined

Bo Wang, Zhao Zhang, Mingbo Zhao, Xiaojie Jin, Mingliang Xu, Meng Wang

Mainstream image captioning models are usually two-stage captioners, i.e., encoding the region features by a pre-trained detector and then feeding them into a language model to generate the captions. However, such a two-stage procedure will lead to a task-based information gap that decreases the performance, because the region features in the detection task are suboptimal representations and cannot provide all the necessary information for subsequent caption generation. Besides, the region features are usually taken from the last layer of the detectors, which loses the local details of images. In this paper, we propose a novel One-Stage Image Captioner (OSIC) with dynamic multi-sight learning, which directly transforms the images into descriptive sentences in one stage to eliminate the information gap. Specifically, to obtain rich features, multi-level features are captured by Swin Transformer, and then fed into a novel dynamic multi-sight embedding module to exploit both the global structure and local texture of input images. To enhance the global modeling capacity of the visual encoder, we propose a new dual-dimensional refining module to non-locally model feature interactions. As a result, OSIC can directly obtain rich semantic information to improve the captioner. Extensive comparisons on the benchmark MS-COCO, Flickr8K and Flickr30K datasets verify the superior performance of our method.

List of keywords

Computer Vision -> CV: Machine learning for vision
Computer Vision -> CV: Scene analysis and understanding
Computer Vision -> CV: Vision, language and reasoning
Natural Language Processing -> NLP: Language generation

805

Strengthening Layer Interaction via Dynamic Layer Attention

Kaishen Wang, Xun Xia, Jian Liu, Zhang Yi, Tao He

In recent years, employing layer attention to enhance interaction among hierarchical layers has proven to be a significant advancement in building network structures. In this paper, we delve into the distinction between layer attention and the general attention mechanism, noting that existing layer attention methods achieve layer interaction on fixed feature maps in a static manner. These static layer attention methods limit the ability for context feature extraction among layers. To restore the dynamic context representation capability of the attention mechanism, we propose a Dynamic Layer Attention (DLA) architecture. The DLA comprises dual paths, where the forward path utilizes an improved recurrent neural network block, named Dynamic Sharing Unit (DSU), for context feature extraction. The backward path updates features using these shared context representations. Finally, the attention mechanism is applied to these dynamically refreshed feature maps among layers. Experimental results demonstrate the effectiveness of the proposed DLA architecture, outperforming other state-of-the-art methods in image recognition and object detection tasks. Additionally, the DSU block has been evaluated as an efficient plugin in the proposed DLA architecture. The code is available at https://github.com/tunantu/Dynamic-Layer-attention.

List of keywords

Machine Learning -> ML: Attention models
Computer Vision -> CV: Recognition (object detection, categorization)
Computer Vision -> CV: Representation learning
Machine Learning -> ML: Theory of deep learning

822

BeyondVision: An EMG-driven Micro Hand Gesture Recognition Based on Dynamic Segmentation

Wang Nana, Jianwei Niu, Xuefeng Liu, Dongqin Yu, Guogang Zhu, Xinghao Wu, Mingliang Xu, Hao Su

Hand gesture recognition (HGR) plays a pivotal role in natural and intuitive human-computer interactions. Recent HGR methods focus on recognizing gestures from vision-based images or videos. However, vision-based methods are limited in recognizing micro hand gestures (MHGs) (e.g., a pinch within 1 cm) and gestures with occluded fingers. To address these issues, we draw on the electromyography (EMG) technique and propose BeyondVision, an EMG-driven MHG recognition system based on deep learning. BeyondVision consists of a wristband-style EMG sampling device and a tailored lightweight neural network, BV-Net, that can accurately translate EMG signals of MHGs to control commands in real time. Moreover, we propose a post-processing mechanism and a weight segmentation algorithm to effectively improve the accuracy rate of MHG recognition. Subjective and objective experimental results show that our approach achieves over 95% average recognition rate, 2000 Hz sampling frequency, and real-time micro gesture recognition. Our technique has been applied in a commercially available product, introduced at: https://github.com/tyc333/NoBarriers.

List of keywords

Multidisciplinary Topics and Applications -> MTA: AI hardware
Computer Vision -> CV: Biometrics, face, gesture and pose recognition
Machine Learning -> ML: Applications
Multidisciplinary Topics and Applications -> MTA: Interactive entertainment

825

Dynamically Anchored Prompting for Task-Imbalanced Continual Learning

Chenxing Hong, Yan Jin, Zhiqi Kang, Yizhou Chen, Mengke Li, Yang Lu, Hanzi Wang

Existing continual learning literature relies heavily on a strong assumption that tasks arrive with a balanced data stream, which is often unrealistic in real-world applications. In this work, we explore task-imbalanced continual learning (TICL) scenarios where the distribution of task data is non-uniform across the whole learning process. We find that imbalanced tasks significantly challenge the capability of models to control the trade-off between stability and plasticity from the perspective of recent prompt-based continual learning methods. On top of the above finding, we propose Dynamically Anchored Prompting (DAP), a prompt-based method that only maintains a single general prompt to adapt to the shifts within a task stream dynamically. This general prompt is regularized in the prompt space with two specifically designed prompt anchors, called boosting anchor and stabilizing anchor, to balance stability and plasticity in TICL. Remarkably, DAP achieves this balance by only storing a prompt across the data stream, therefore offering a substantial advantage in rehearsal-free CL. Extensive experiments demonstrate that the proposed DAP results in 4.5% to 15% absolute improvements over state-of-the-art methods on benchmarks under task-imbalanced settings. Our code is available at https://github.com/chenxing6666/DAP.

List of keywords

Machine Learning -> ML: Incremental learning
Computer Vision -> CV: Recognition (object detection, categorization)
Data Mining -> DM: Class imbalance and unequal cost
Machine Learning -> ML: Classification

883

Efficient Screen Content Image Compression via Superpixel-based Content Aggregation and Dynamic Feature Fusion

Sheng Shen, Huanjing Yue, Jingyu Yang

This paper addresses the challenge of efficiently compressing screen content images (SCIs) – computer generated images with unique attributes such as large uniform regions, sharp edges, and limited color palettes, which pose difficulties for conventional compression algorithms. We propose a Superpixel-based Content Aggregation Block (SCAB) to aggregate local pixels into one super-pixel and aggregate non-local information via super-pixel transformer. Such aggregation enables the dynamic assimilation of non-local information while maintaining manageable complexity. Furthermore, we enhance our channel-wise context entropy model with a Dynamic Feature Fusion (DFF) mechanism. This mechanism integrates decoded slices and side information dynamically based on their global correlation, allowing the network to dynamically learn the optimal weights for global information usage. Extensive experiments on three SCI datasets (SCID, CCT, and SIQAD) show our method’s superior RD performance and inference time, making it the first network comparable with the advanced VVC-SCC standard.

List of keywords

Computer Vision -> CV: Image and video synthesis and generation
Computer Vision -> CV: Computational photography
Computer Vision -> CV: Other

889

Bandits with Concave Aggregated Reward

Yingqi Yu, Sijia Zhang, Shaoang Li, Lan Zhang, Wei Xie, Xiang-Yang Li

Multi-armed bandit is a simple but powerful algorithmic framework, and many effective algorithms have been proposed for various online models. In numerous applications, the decision-maker faces diminishing marginal utility. With non-linear aggregations, those algorithms often have poor regret bounds. Motivated by this, we study a bandit problem with diminishing marginal utility. In each round $t$, the agent will choose an arm $a_t$ from a set of $K$ arms, and the arm $a_t$ will generate a random value $v_{a_t}(t)$, which is unobservable by the agent. The agent’s objective is to maximize an aggregated reward $f(\sum_{t=1}^T v_{a_t}(t))$ with an unknown but fixed concave reward function $f(\cdot)$ where $T$ is the number of rounds. To tackle this problem, we propose two algorithms with different assumptions. Let $OPT$ be the best-arm benchmark and $\mu^*$ be the optimal arm’s mean value. In the fundamental case, our proposed algorithm SW-BCAR achieves a regret of $\tilde{O}(K^{1/3}T^{-1/3})OPT$; if $\mu^*$ is in the range of $[1/\sigma,1]$, where $\sigma>1$, our proposed algorithm SWUCB-BCAR achieves a regret of $\tilde{O}(\sigma K^{1/2}T^{-1/2})OPT$. Extensive simulations demonstrate that our algorithms achieve better results than the most advanced bandits algorithms.

List of keywords

Machine Learning -> ML: Multi-armed bandits

898

Shap-Mix: Shapley Value Guided Mixing for Long-Tailed Skeleton Based Action Recognition

Jiahang Zhang, Lilang Lin, Jiaying Liu

In real-world scenarios, human actions often fall into a long-tailed distribution. It makes the existing skeleton-based action recognition works, which are mostly designed based on balanced datasets, suffer from a sharp performance degradation. Recently, many efforts have been made to image/video long-tailed learning. However, directly applying them to skeleton data can be sub-optimal due to the lack of consideration of the crucial spatial-temporal motion patterns, especially for some modality-specific methodologies such as data augmentation. To this end, considering the crucial role of the body parts in the spatially concentrated human actions, we attend to the mixing augmentations and propose a novel method, Shap-Mix, which improves long-tailed learning by mining representative motion patterns for tail categories. Specifically, we first develop an effective spatial-temporal mixing strategy for the skeleton to boost representation quality. Then, the employed saliency guidance method is presented, consisting of the saliency estimation based on Shapley value and a tail-aware mixing policy. It preserves the salient motion parts of minority classes in mixed data, explicitly establishing the relationships between crucial body structure cues and high-level semantics. Extensive experiments on three large-scale skeleton datasets show our remarkable performance improvement under both long-tailed and balanced settings. Our project is publicly available at: https://jhang2020.github.io/Projects/Shap-Mix/Shap-Mix.html.

List of keywords

Computer Vision -> CV: Action and behavior recognition

899

Class-Consistent Contrastive Learning Driven Cross-Dimensional Transformer for 3D Medical Image Classification

Qikui Zhu, Chuan Fu, Shuo Li

Transformers have emerged as an active research topic in medical image analysis. Yet three substantial challenges limit the effectiveness of both 2D and 3D Transformers in 3D medical image classification: 1) difficulty in capturing spatial structure correlation, due to the unreasonable flattening operation; 2) high computational complexity and memory consumption, both of which grow quadratically for 3D medical data; 3) difficulty in discriminative representation learning, due to data sensitivity. To address these challenges, a novel Cross-dimensional Transformer (CdTransformer) and a Class-consistent Contrastive Learning (CcCL) scheme are proposed. Specifically, CdTransformer consists of two novel modules: 1) the Cross-dimensional Attention Module (CAM), which overcomes the Transformer's inability to reasonably establish spatial structure correlation on 3D medical data while also reducing computational complexity and memory consumption; 2) the Inter-dimensional Feed-forward Network (IdFN), which addresses the inability of traditional feed-forward networks to learn the depth-dimension information that is unique to 3D medical data. CcCL takes full advantage of the inter-class and intra-class features from the slice-distorted samples to boost the Transformer in learning feature representations. CdTransformer and CcCL are validated on six 3D medical image classification tasks. Extensive experimental results demonstrate that CdTransformer outperforms state-of-the-art CNNs and Transformers on 3D medical image classification, and that CcCL significantly improves the Transformer's discriminative representation learning.

List of keywords

Computer Vision -> CV: Biomedical image analysis
Computer Vision -> CV: Applications
Machine Learning -> ML: Adversarial machine learning
Machine Learning -> ML: Classification

901

Hundred-Kilobyte Lookup Tables for Efficient Single-Image Super-Resolution

Binxiao Huang, Jason Chun Lok Li, Jie Ran, Boyu Li, Jiajun Zhou, Dahai Yu, Ngai Wong

Conventional super-resolution (SR) schemes make heavy use of convolutional neural networks (CNNs), which involve intensive multiply-accumulate (MAC) operations, and require specialized hardware such as graphics processing units. This contradicts the regime of edge AI that often runs on devices strained by power, computing, and storage resources. Such a challenge has motivated a series of lookup table (LUT)-based SR schemes that employ simple LUT readout and largely elude CNN computation. Nonetheless, the multi-megabyte LUTs in existing methods still prohibit on-chip storage and necessitate off-chip memory transport. This work tackles this storage hurdle and innovates hundred-kilobyte LUT (HKLUT) models amenable to on-chip cache. Utilizing an asymmetric two-branch multistage network coupled with a suite of specialized kernel patterns, HKLUT demonstrates an uncompromising performance and superior hardware efficiency over existing LUT schemes. Our implementation is publicly available at: https://github.com/jasonli0707/hklut.

List of keywords

Computer Vision -> CV: Image and video synthesis and generation
Computer Vision -> CV: Applications

907

MAS-SAM: Segment Any Marine Animal with Aggregated Features

Tianyu Yan, Zifu Wan, Xinhao Deng, Pingping Zhang, Yang Liu, Huchuan Lu

Recently, the Segment Anything Model (SAM) has shown exceptional performance in generating high-quality object masks and achieving zero-shot image segmentation. However, as a versatile vision model, SAM is primarily trained with large-scale natural light images. In underwater scenes, it exhibits substantial performance degradation due to light scattering and absorption. Meanwhile, the simplicity of SAM’s decoder might lead to the loss of fine-grained object details. To address the above issues, we propose a novel feature learning framework named MAS-SAM for marine animal segmentation, which involves integrating effective adapters into SAM’s encoder and constructing a pyramidal decoder. More specifically, we first build a new SAM encoder with effective adapters for underwater scenes. Then, we introduce a Hypermap Extraction Module (HEM) to generate multi-scale features for comprehensive guidance. Finally, we propose a Progressive Prediction Decoder (PPD) to aggregate the multi-scale features and predict the final segmentation results. When grafted with the Fusion Attention Module (FAM), our method can extract richer marine information, from global contextual cues to fine-grained local details. Extensive experiments on four public MAS datasets demonstrate that our MAS-SAM can obtain better results than other typical segmentation methods. The source code is available at https://github.com/Drchip61/MAS-SAM.

List of keywords

Robotics -> ROB: Applications
Robotics -> ROB: Perception
Robotics -> ROB: Robotics and vision

932

Hierarchical Reinforcement Learning on Multi-Channel Hypergraph Neural Network for Course Recommendation

Lu Jiang, Yanan Xiao, Xinxin Zhao, Yuanbo Xu, Shuli Hu, Pengyang Wang, Minghao Yin

With the widespread popularity of massive open online courses (MOOCs), personalized course recommendation has become increasingly important for enhancing users’ learning efficiency. While achieving promising performance, current works suffer from the variation across users and other MOOC entities. To address this problem, we propose hierarchical reinforcement learning with a multi-channel hypergraph neural network for course recommendation (called HHCoR). Specifically, we first construct an online course hypergraph as the environment to capture the complex relationships and historical information by considering all entities. Then, we design a multi-channel propagation mechanism to aggregate embeddings in the online course hypergraph and extract user interest through an attention layer. Besides, we employ two-level decision-making: the low level focuses on rating courses, while the high level integrates these considerations to finalize the decision. Furthermore, in co-optimization, we design a joint reward function to improve the policy of the two-layer agents. Finally, we conduct extensive experiments on two real-world datasets, and the quantitative results demonstrate the effectiveness of the proposed method.

List of keywords

Data Mining -> DM: Applications
Data Mining -> DM: Mining graphs
Data Mining -> DM: Mining heterogenous data
Data Mining -> DM: Mining spatial and/or temporal data

936

Evolutionary Generalized Zero-Shot Learning

Dubing Chen, Chenyi Jiang, Haofeng Zhang

Zero-Shot Learning (ZSL) empowers models to recognize new classes unseen during training. However, existing ZSL settings have limitations, as inductive ZSL is prone to the domain shift problem, while transductive ZSL is impractical for real-world applications. In this work, we propose a novel Evolutionary Generalized Zero-Shot Learning setting which enables the model to continue learning while predicting during the deployment phase. The proposed setting enables a low-performing zero-shot model to adapt to the test data stream and evolve online. We elaborate on three challenges of this special task, i.e., catastrophic forgetting, initial prediction bias, and evolutionary data class bias. Moreover, we propose targeted solutions for each challenge, resulting in a generic method capable of continuing to evolve on a given initial IGZSL model. Experiments on three popular GZSL benchmark datasets show that our model can learn from the test data stream while other baselines fail.

List of keywords

Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning
Computer Vision -> CV: Vision, language and reasoning

937

Learning Spatial Similarity Distribution for Few-shot Object Counting

Yuanwu Xu, Feifan Song, Haofeng Zhang

Few-shot object counting aims to count the number of objects in a query image that belong to the same class as the given exemplar images. Existing methods compute the similarity between the query image and exemplars in the 2D spatial domain and perform regression to obtain the counting number. However, these methods overlook the rich information about the spatial distribution of similarity on the exemplar images, leading to significant impact on matching accuracy. To address this issue, we propose a network learning Spatial Similarity Distribution (SSD) for few-shot object counting, which preserves the spatial structure of exemplar features and calculates a 4D similarity pyramid point-to-point between the query features and exemplar features, capturing the complete distribution information for each point in the 4D similarity space. We propose a Similarity Learning Module (SLM) which applies the efficient center-pivot 4D convolutions on the similarity pyramid to map different similarity distributions to distinct predicted density values, thereby obtaining accurate count. Furthermore, we also introduce a Feature Cross Enhancement (FCE) module that enhances query and exemplar features mutually to improve the accuracy of feature matching. Our approach outperforms state-of-the-art methods on multiple datasets, including FSC-147 and CARPK. Code is available at https://github.com/CBalance/SSD.

List of keywords

Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning
Computer Vision -> CV: Recognition (object detection, categorization)

965

ENOTO: Improving Offline-to-Online Reinforcement Learning with Q-Ensembles

Kai Zhao, Jianye Hao, Yi Ma, Jinyi Liu, Yan Zheng, Zhaopeng Meng

Offline reinforcement learning (RL) is a learning paradigm where an agent learns from a fixed dataset of experience. However, learning solely from a static dataset can limit the performance due to the lack of exploration. To overcome it, offline-to-online RL combines offline pre-training with online fine-tuning, which enables the agent to further refine its policy by interacting with the environment in real-time. Despite its benefits, existing offline-to-online RL methods suffer from performance degradation and slow improvement during the online phase. To tackle these challenges, we propose a novel framework called ENsemble-based Offline-To-Online (ENOTO) RL. By increasing the number of Q-networks, we seamlessly bridge offline pre-training and online fine-tuning without degrading performance. Moreover, to expedite online performance enhancement, we appropriately loosen the pessimism of Q-value estimation and incorporate ensemble-based exploration mechanisms into our framework. Experimental results demonstrate that ENOTO can substantially improve the training stability, learning efficiency, and final performance of existing offline RL methods during online fine-tuning on a range of locomotion and navigation tasks, significantly outperforming existing offline-to-online RL methods.

List of keywords

Machine Learning -> ML: Reinforcement learning
Machine Learning -> ML: Ensemble methods
Machine Learning -> ML: Offline reinforcement learning
Machine Learning -> ML: Online learning

973

Optimisation and Approximation in Abstract Argumentation: The Case of Stable Semantics

Matthias Thimm

We analyse two soft notions of stable extensions in abstract argumentation, one that weakens the requirement of having full range and one that weakens the requirement of conflict-freeness. We then consider optimisation problems over these two notions that represent optimisation variants of the credulous reasoning problem with stable semantics. We investigate the computational complexity of these two problems in terms of the complexity of solving the optimisation problem exactly and in terms of approximation complexity. We also present some polynomial-time approximation algorithms for these optimisation problems and investigate their approximation quality experimentally.

List of keywords

Knowledge Representation and Reasoning -> KRR: Argumentation
Knowledge Representation and Reasoning -> KRR: Computational complexity of reasoning

980

Delve into Base-Novel Confusion: Redundancy Exploration for Few-Shot Class-Incremental Learning

Haichen Zhou, Yixiong Zou, Ruixuan Li, Yuhua Li, Kui Xiao

Few-shot class-incremental learning (FSCIL) aims to acquire knowledge from novel classes with limited samples while retaining information about base classes. Existing methods address catastrophic forgetting and overfitting by freezing the feature extractor during novel-class learning. However, these methods usually tend to cause confusion between base and novel classes, i.e., classifying novel-class samples into base classes. In this paper, we delve into this phenomenon to study its cause and solution. We first interpret the confusion as the collision between the novel-class and the base-class region in the feature space. Then, we find the collision is caused by the label-irrelevant redundancies within the base-class feature and pixel space. Through qualitative and quantitative experiments, we identify this redundancy as the shortcut in the base-class training, which can be decoupled to alleviate the collision. Based on this analysis, to alleviate the collision between base and novel classes, we propose a method for FSCIL named Redundancy Decoupling and Integration (RDI). RDI first decouples redundancies from base-class space to shrink the intra-base-class feature space. Then, it integrates the redundancies as a dummy class to enlarge the inter-base-class feature space. This process effectively compresses the base-class feature space, creating buffer space for novel classes and alleviating the model’s confusion between the base and novel classes. Extensive experiments across benchmark datasets, including CIFAR-100, miniImageNet, and CUB-200-2011, demonstrate that our method achieves state-of-the-art performance.

List of keywords

Machine Learning -> ML: Incremental learning
Machine Learning -> ML: Few-shot learning

993

MetaISP: Efficient RAW-to-sRGB Mappings with Merely 1M Parameters

Zigeng Chen, Chaowei Liu, Yuan Yuan, Michael Bi Mi, Xinchao Wang

State-of-the-art deep ISP models alleviate the dilemma of limited generalization capabilities across heterogeneous inputs by increasing the size and complexity of the network, which inevitably leads to considerable growth in parameter counts and FLOPs. To address this challenge, this paper presents MetaISP, a streamlined model that achieves superior reconstruction quality by adaptively modulating its parameters and architecture in response to diverse inputs. Our rationale revolves around obtaining corresponding spatial and channel-wise correction matrices for various inputs within distinct feature spaces, which assists in assigning optimal attention. This is achieved by predicting dynamic weights for each input image and combining these weights with multiple learnable basis matrices to construct the correction matrices. The proposed MetaISP makes it possible to obtain the best performance while being computationally efficient. SOTA results are achieved on two large-scale datasets: 23.80 dB PSNR on ZRR, exceeding the previous SOTA by 0.19 dB with only 9.2% of its parameter count and 10.6% of its FLOPs; and 25.06 dB PSNR on MAI21, exceeding the previous SOTA by 0.17 dB with only 0.9% of its parameter count and 2.7% of its FLOPs.

List of keywords

Computer Vision -> CV: Computational photography
Computer Vision -> CV: Applications

996

Tolerating Outliers: Gradient-Based Penalties for Byzantine Robustness and Inclusion

Latifa Errami, El houcine Bergou

This work investigates the interplay between Robustness and Inclusion in the context of poisoning attacks targeting the convergence of Stochastic Gradient Descent (SGD). While robustness has received significant attention, the standard Byzantine defenses rely on the Independent and Identically Distributed (IID) assumption causing their performance to deteriorate on non-IID data distributions, even without any attack. This is largely due to these defenses being excessively cautious and discarding benign outliers. We introduce a penalty-based aggregation that accounts for the discrepancy between trusted clients and outliers. We propose the use of Linear Scalarization (LS) as an enhancing method to enable current defenses to simultaneously circumvent Byzantine attacks while also granting inclusion of outliers. This empowers existing defenses to not only counteract malicious adversaries effectively but also to incorporate outliers into the learning process. We conduct a theoretical analysis to demonstrate the convergence of our approach. Specifically, we establish the robustness and resilience of our method under standard assumptions. Empirical analysis further validates the viability of the proposed approach. Across mild to strong non-IID data splits, our method consistently either matches or surpasses the performance of current approaches in the literature, under state-of-the-art Byzantine attack scenarios.

List of keywords

Machine Learning -> ML: Robustness
AI Ethics, Trust, Fairness -> ETF: Fairness and diversity
Machine Learning -> ML: Trustworthy machine learning

998

LSPAN: Spectrally Localized Augmentation for Graph Consistency Learning

Heng-Kai Zhang, Yi-Ge Zhang, Zhi Zhou, Yu-Feng Li

[+] More

[-] Less

The graph-based consistency principle has been successfully applied to many semi-supervised problems in machine learning. Its performance largely depends on the quality of augmented graphs; it has recently been shown that revealing graph properties and maintaining the invariance of graphs are crucial for good performance. However, existing topology- or feature-based augmentation methods are spectrally non-localized: important spectrums are disturbed throughout the entire frequency range, and their invariance may not be well preserved. Efforts on this issue remain limited. This paper proposes a simple yet effective model called Localized SPectral AugmentatioN (LSPAN), which perturbs a concentrated part of the graph spectrum with equivalent intensity using Fourier orthogonality, so as to enhance graph spectrum preservation as well as model prediction. Moreover, it also avoids the significant training time of the inverse Fourier transform. Extensive empirical evaluation on real-world datasets clearly shows the performance gain of spectrally localized augmentation, as well as its good convergence and efficiency compared to existing graph methods.

List of keywords

Machine Learning -> ML: Semi-supervised learning
Machine Learning -> ML: Active learning
Machine Learning -> ML: Classification
Machine Learning -> ML: Multi-task and transfer learning

1018

Scaling Up Unbiased Search-based Symbolic Regression

Paul Kahlmeyer, Joachim Giesen, Michael Habeck, Henrik Voigt

In a regression task, a function is learned from labeled data to predict the labels at new data points. The goal is to achieve small prediction errors. In symbolic regression, the goal is more ambitious, namely, to learn an interpretable function that makes small prediction errors. This additional goal largely rules out the standard approach used in regression, that is, reducing the learning problem to learning parameters of an expansion of basis functions by optimization. Instead, symbolic regression methods search for a good solution in a space of symbolic expressions. To cope with the typically vast search space, most symbolic regression methods make implicit, or sometimes even explicit, assumptions about its structure. Here, we argue that the only obvious structure of the search space is that it contains small expressions, that is, expressions that can be decomposed into a few subexpressions. We show that systematically searching spaces of small expressions finds solutions that are more accurate and more robust against noise than those obtained by state-of-the-art symbolic regression methods. In particular, systematic search outperforms state-of-the-art symbolic regressors in terms of its ability to recover the true underlying symbolic expressions on established benchmark data sets.
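
As a rough illustration of what "systematically searching spaces of small expressions" can mean, the toy sketch below enumerates every expression tree with up to three leaves and keeps the one with the smallest squared error. The grammar (leaves `x` and `1`, operators `+` and `*`), the function names, and the size bound are all assumptions made for this example; it is not the search procedure from the paper.

```python
import itertools

# Hypothetical toy grammar: two leaves and two binary operators.
LEAVES = {"x": lambda x: x, "1": lambda x: 1.0}
BINARY = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}


def expressions(size):
    """Yield (text, function) for every expression tree with exactly `size` leaves."""
    if size == 1:
        yield from LEAVES.items()
        return
    for left_size in range(1, size):
        lefts = list(expressions(left_size))
        rights = list(expressions(size - left_size))
        for (lt, lf), (rt, rf) in itertools.product(lefts, rights):
            for op, fn in BINARY.items():
                # Default arguments freeze the current sub-functions in the closure.
                yield f"({lt} {op} {rt})", (lambda x, fn=fn, lf=lf, rf=rf: fn(lf(x), rf(x)))


def search(xs, ys, max_size=3):
    """Return (squared_error, text) of the best expression with at most `max_size` leaves."""
    best = None
    for size in range(1, max_size + 1):  # smaller expressions are tried first
        for text, f in expressions(size):
            err = sum((f(x) - y) ** 2 for x, y in zip(xs, ys))
            if best is None or err < best[0]:
                best = (err, text)
    return best


xs = [0.0, 1.0, 2.0, 3.0]
ys = [x * x + x for x in xs]
print(search(xs, ys))
```

Because sizes are tried in increasing order and ties are not replaced, the search favours the smallest expression that fits, which mirrors the preference for interpretable functions described in the abstract.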

List of keywords

Machine Learning -> ML: Symbolic methods
Machine Learning -> ML: Explainable/Interpretable machine learning
Machine Learning -> ML: Regression
Search -> S: Search and machine learning

1019

Bridging LiDAR Gaps: A Multi-LiDARs Domain Adaptation Dataset for 3D Semantic Segmentation

Shaoyang Chen, Bochun Yang, Yan Xia, Ming Cheng, Siqi Shen, Cheng Wang

We focus on the domain adaptation problem for 3D semantic segmentation, addressing the challenge of data variability in point clouds collected by different LiDARs. Existing benchmarks often mix different types of datasets, which blurs and complicates segmentation evaluations. Here, we introduce a Multi-LiDARs Domain Adaptation Segmentation (MLDAS) dataset, which contains point-wise semantically annotated point clouds captured simultaneously by a 128-beam LiDAR, a 64-beam LiDAR, and a 32-beam LiDAR. We select 31,875 scans from 2 representative scenarios: campus and urban street. Furthermore, we evaluate the current 3D segmentation unsupervised domain adaptation methods on the proposed dataset and propose Hierarchical Segmentation Network with Spatial Consistency (HSSC) as a novel knowledge transfer method to mitigate the domain gap significantly using spatial-temporal consistency constraints. Extensive experiments show that HSSC greatly improves the state-of-the-art cross-domain semantic segmentation methods. Our project is available at https://sychen320.github.io/projects/MLDAS.

List of keywords

Computer Vision -> CV: 3D computer vision

1025

A De-singularity Subgradient Approach for the Extended Weber Location Problem

Zhao-Rong Lai, Xiaotian Wu, Liangda Fang, Ziliang Chen

The extended Weber location problem is a classical optimization problem that has recently inspired new work in several machine learning scenarios. However, most existing algorithms, such as the widely used iterative Weiszfeld approach, may get stuck due to the singularity at the data points when the power $q$ of the cost function satisfies $1 \le q < 2$. In this paper, we establish a de-singularity subgradient approach for this problem. We also provide a complete proof of convergence which fixes some incomplete statements in the proofs for some previous Weiszfeld algorithms. Moreover, we deduce a new theoretical result of superlinear convergence for the iteration sequence in a special case where the minimum point is a singular point. We conduct extensive experiments in a real-world machine learning scenario to show that the proposed approach solves the singularity problem, produces the same results as in the non-singularity cases, and shows a reasonable rate of linear convergence. The results also indicate that the $q$-th power case ($1 < q < 2$) is more advantageous than the 1st power case and the 2nd power case in some situations. Hence the de-singularity subgradient approach is beneficial to advancing both theory and practice for the extended Weber location problem.
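
To make the singularity concrete, here is a minimal sketch of the classical Weiszfeld-style fixed-point iteration for the unweighted cost $\sum_i \lVert x - a_i \rVert^q$, assuming Euclidean distances and equal weights on all data points. Setting the gradient to zero gives the update $x \leftarrow \sum_i w_i a_i / \sum_i w_i$ with $w_i = \lVert x - a_i \rVert^{q-2}$, which is undefined when $x$ coincides with a data point and $q < 2$. The function names and the "stop when stuck" behaviour are choices made for this illustration; this is the baseline iteration, not the de-singularity subgradient approach proposed in the paper.

```python
import math


def weber_cost(x, points, q=1.0):
    """Unweighted extended Weber objective: sum of q-th powers of Euclidean distances."""
    return sum(math.dist(x, p) ** q for p in points)


def weiszfeld(points, q=1.0, iters=200, eps=1e-12):
    """Classical Weiszfeld-style iteration for 1 <= q < 2, started at the centroid.

    Returns (x, stuck); `stuck` is True if an iterate landed on a data point,
    where the weight d ** (q - 2) is singular and the update is undefined.
    """
    dim = len(points[0])
    x = [sum(p[k] for p in points) / len(points) for k in range(dim)]
    for _ in range(iters):
        weights = []
        for p in points:
            d = math.dist(x, p)
            if d < eps:
                return x, True  # singular point: the plain iteration cannot proceed
            weights.append(d ** (q - 2))
        total = sum(weights)
        x = [sum(w * p[k] for w, p in zip(weights, points)) / total for k in range(dim)]
    return x, False


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
x, stuck = weiszfeld(points, q=1.0)
print(x, stuck, weber_cost(x, points))
```

For these four points the minimizer for $q = 1$ is the intersection of the diagonals of the convex quadrilateral, $(0.5, 0.5)$, so the iteration never touches a data point. Starting exactly at a data point, or from a configuration whose centroid is a data point, triggers the `stuck` branch instead.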

List of keywords

Machine Learning -> ML: Optimization
Constraint Satisfaction and Optimization -> CSO: Solvers and tools
Constraint Satisfaction and Optimization -> CSO: Other

1036

LLMem: Estimating GPU Memory Usage for Fine-Tuning Pre-Trained LLMs

Taeho Kim, Yanming Wang, Vatshank Chaturvedi, Lokesh Gupta, Seyeon Kim, Yongin Kwon, Sangtae Ha

Fine-tuning pre-trained large language models (LLMs) with limited hardware presents challenges due to GPU memory constraints. Various distributed fine-tuning methods have been proposed to alleviate memory constraints on GPU. However, determining the most effective method for achieving rapid fine-tuning while preventing GPU out-of-memory issues in a given environment remains unclear. To address this challenge, we introduce LLMem, a solution that estimates the GPU memory consumption when applying distributed fine-tuning methods across multiple GPUs and identifies the optimal method. We conduct GPU memory usage estimation prior to fine-tuning, leveraging the fundamental structure of transformer-based decoder models and the memory usage distribution of each method. Experimental results show that LLMem accurately estimates peak GPU memory usage on a single GPU, with an error rate of up to 1.6%. Additionally, it shows an average error rate of 3.0% when applying distributed fine-tuning methods to LLMs with more than a billion parameters on multi-GPU setups.

List of keywords

Natural Language Processing -> NLP: Language models
Machine Learning -> ML: Deep learning architectures

1039

Markov Constraint as Large Language Model Surrogate

Alexandre Bonlarron, Jean-Charles Régin

This paper presents NgramMarkov, a variant of the Markov constraints. It is dedicated to text generation in constraint programming (CP). It involves a set of n-grams (i.e., sequences of n words) associated with probabilities given by a large language model (LLM). It limits the product of the probabilities of the n-grams of a sentence. The propagator of this constraint can be seen as an extension of the ElementaryMarkov constraint propagator, incorporating the LLM distribution instead of the maximum likelihood estimation of n-grams. It uses a gliding threshold, i.e., it rejects n-grams whose local probabilities are too low, to guarantee balanced solutions. It can also be combined with a "look-ahead" approach to remove n-grams that are very unlikely to lead to acceptable sentences for a fixed-length horizon. This idea is based on the MDDMarkovProcess constraint propagator, but without explicitly using an MDD (Multi-Valued Decision Diagram). The experimental results show that the generated text is valued in a similar way to the LLM perplexity function. Using this new constraint dramatically reduces the number of candidate sentences produced, improves computation times, and allows larger corpora or smaller n-grams to be used. A real-world problem has been solved for the first time using 4-grams instead of 5-grams.
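As a rough illustration of the two conditions described in this abstract (a bound on the product of n-gram probabilities, and rejection of n-grams whose local probability is too low), here is a minimal sentence checker. It is a sketch under stated assumptions, not the NgramMarkov propagator: the names `ngrams` and `accept`, the probability table, and the use of a single fixed local cutoff in place of the gliding threshold are all hypothetical.

```python
import math

def ngrams(words, n):
    """Return the consecutive n-grams of a word list as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def accept(words, probs, n, min_product, min_local):
    """Check a complete sentence against two probability conditions.

    probs       : hypothetical table mapping n-gram tuples to probabilities
    min_product : lower bound on the product of all n-gram probabilities
    min_local   : fixed cutoff standing in for the gliding threshold
    """
    log_total = 0.0
    for gram in ngrams(words, n):
        p = probs.get(gram, 0.0)
        if p <= 0.0 or p < min_local:
            return False  # unknown or locally too unlikely n-gram
        log_total += math.log(p)  # sum logs to avoid underflow
    return log_total >= math.log(min_product)

# Hypothetical 2-gram probabilities.
table = {("a", "b"): 0.5, ("b", "c"): 0.4, ("b", "d"): 0.01}
print(accept(["a", "b", "c"], table, 2, min_product=0.1, min_local=0.05))
print(accept(["a", "b", "d"], table, 2, min_product=0.1, min_local=0.05))
```

A real propagator would prune domains incrementally during search rather than test finished sentences; the checker only shows what the constraint requires of an accepted solution.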

List of keywords

Constraint Satisfaction and Optimization -> CSO: Constraint programming
Constraint Satisfaction and Optimization -> CSO: Applications
Constraint Satisfaction and Optimization -> CSO: Modeling

1042

Temporal Inductive Logic Reasoning over Hypergraphs

Yuan Yang, Siheng Xiong, Ali Payani, James C. Kerce, Faramarz Fekri

Inductive logic reasoning is a fundamental task in graph analysis, which aims to generalize patterns from data. This task has been extensively studied for traditional graph representations, such as knowledge graphs (KGs), using techniques like inductive logic programming (ILP). Existing ILP methods assume learning from KGs with static facts and binary relations. Beyond KGs, graph structures are widely present in other applications such as procedural instructions, scene graphs, and program executions. While ILP is beneficial for these applications, applying it to those graphs is nontrivial: they are more complex than KGs and usually involve timestamps and n-ary relations, making them effectively a type of hypergraph with temporal events. In this work, we propose temporal inductive logic reasoning (TILR), an ILP method that reasons on temporal hypergraphs. To enable hypergraph reasoning, we introduce the multi-start random B-walk, a novel graph traversal method for hypergraphs. By combining it with a path-consistency algorithm, TILR learns logic rules by generalizing from both temporal and relational data. To address the lack of hypergraph benchmarks, we create and release two temporal hypergraph datasets: YouCook2-HG and nuScenes-HG. Experiments on these benchmarks demonstrate that TILR achieves superior reasoning capability over various strong baselines.

List of keywords

Knowledge Representation and Reasoning -> KRR: Logic programming
Data Mining -> DM: Knowledge graphs and knowledge base completion

1043

Imperfect-Recall Games: Equilibrium Concepts and Their Complexity

Emanuel Tewolde, Brian Zhang, Caspar Oesterheld, Manolis Zampetakis, Tuomas Sandholm, Paul Goldberg, Vincent Conitzer

We investigate optimal decision making under imperfect recall, that is, when the agent(s) know that they will forget information they once held. Examples include the absentminded driver game and team games in which the members exhibit limited communication capabilities. In the framework of extensive-form games with imperfect recall, we analyze the computational complexities of finding equilibria in multiplayer settings across three different solution concepts: Nash, multiselves based on evidential decision theory (EDT), and multiselves based on causal decision theory (CDT). We are interested in both exact and approximate solution computation. As special cases, we consider (1) single-player games, (2) two-player zero-sum games and relationships to maximin values, and (3) games without exogenous stochasticity (chance nodes). We relate these problems to the complexity classes PPAD, PLS, Σ_2^P, ∃R, and ∃∀R.

List of keywords

Game Theory and Economic Paradigms -> GTEP: Noncooperative games

1062

Self-supervised Weighted Information Bottleneck for Multi-view Clustering

Zhengzheng Lou, Chaoyang Zhang, Hang Xue, Yangdong Ye, Qinglei Zhou, Shizhe Hu

Multi-view clustering (MVC) is a long-standing topic in the machine learning and data mining communities, focusing on investigating and utilizing the relationships among views to discover a final consistent cluster structure in the data. Generally, weighted MVC is one of the popular methods, working by learning and applying a weight (importance) for each view to fully explore the complementary information across views. However, most existing weighted MVC methods only consider the quality of each view, ignoring the vital role of pseudo-label self-supervision information in weight learning. In this work, we propose a novel self-supervised weighted information bottleneck (SWIB) method for solving the multi-view clustering problem. It combines the weighted information from different views based on information bottleneck theory, and the view weight learning mechanism is newly designed by simultaneously taking into account both the quality of view-contained information and the self-supervised information on the data partition of each view. Experimental results on multi-view text, multi-feature image, multi-angle video, and multi-modal text-image datasets as well as large-scale datasets show the superiority of the SWIB method. To our knowledge, this is the first work incorporating self-supervised learning into weighted multi-view clustering.

List of keywords

Machine Learning -> ML: Multi-view learning
Machine Learning -> ML: Clustering
Machine Learning -> ML: Multi-modal learning
Machine Learning -> ML: Unsupervised learning

1074

CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning

Zheqi He, Xinya Wu, Pengfei Zhou, Richeng Xuan, Guang Liu, Xi Yang, Qiannan Zhu, Hua Huang

Multi-modal large language models (MLLMs) have achieved remarkable progress and demonstrated powerful knowledge comprehension and reasoning abilities. However, the mastery of domain-specific knowledge, which is essential for evaluating the intelligence of MLLMs, continues to be a challenge. Current multi-modal benchmarks for domain-specific knowledge concentrate on multiple-choice questions and are predominantly available in English, which imposes limitations on the comprehensiveness of the evaluation. To this end, we introduce CMMU, a novel benchmark for multi-modal and multi-type question understanding and reasoning in Chinese. CMMU consists of 3,603 questions in 7 subjects, covering knowledge from primary to high school. The questions can be categorized into 3 types: multiple-choice, multiple-response, and fill-in-the-blank, bringing greater challenges to MLLMs. In addition, we propose an evaluation strategy called Positional Error Variance for assessing multiple-choice questions. The strategy aims to perform a quantitative analysis of position bias. We evaluate seven open-source MLLMs along with GPT4-V, Gemini-Pro, and Qwen-VL-Plus. The results demonstrate that CMMU poses a significant challenge to the recent MLLMs. The data and code are available at https://github.com/FlagOpen/CMMU.

List of keywords

Computer Vision -> CV: Multimodal learning
Multidisciplinary Topics and Applications -> MTA: Education

1086

IMM: An Imitative Reinforcement Learning Approach with Predictive Representation Learning for Automatic Market Making

Hui Niu, Siyuan Li, Jiahao Zheng, Zhouchi Lin, Bo An, Jian Li, Jian Guo

In recent years, there has been a growing interest in applying reinforcement learning (RL) techniques to order execution owing to RL’s strong sequential decision-making ability. However, realistic order execution tasks usually involve a large fine-grained action space and a long trading duration. The former hinders the RL agents from efficient exploration. The latter increases the task complexity, since the agent must capture price advantages throughout the day as well as micro changes within a few seconds on the limit order books. To address these challenges, we propose MacMic, a novel hierarchical RL-based order execution approach that captures market patterns and executes orders from different temporal scales. MacMic employs a high-level agent to split the parent order into smaller slices at coarse-grained time steps. Then a low-level agent is adopted to execute these slices by placing fixed-size sub-orders at a continuous time. Besides, to balance the multifaceted objectives of the two tasks, MacMic pretrains a causal stacking hidden Markov model (SHMM) to obtain both effective macro-level and micro-level market states. Comprehensive experimental results on 200 stocks across the US and China A-share markets validate the effectiveness of the proposed method.

List of keywords

Multidisciplinary Topics and Applications -> MTA: Finance
Machine Learning -> ML: Reinforcement learning
Machine Learning -> ML: Representation learning

1093

Linear-Time Optimal Deadlock Detection for Efficient Scheduling in Multi-Track Railway Networks

Hastyn Doshi, Ayush Tripathi, Keshav Agarwal, Harshad Khadilkar, Shivaram Kalyanakrishnan

The railway scheduling problem requires the computation of an operable timetable that satisfies constraints involving railway infrastructure and resource occupancy times, while minimising average delay over a set of events. Since this problem is computationally hard, practical solutions typically generate feasible (but suboptimal) schedules one step at a time, by choosing which train to move next in every step. The choices made by such algorithms are necessarily myopic, and incur the risk of driving the system to a deadlock, which is an undesirable state from which no further progress is possible. To escape deadlocks, the predominant approach is to stay away from states flagged as potentially unsafe by some fast-to-compute rule R. While many choices of R guarantee deadlock avoidance, they are suboptimal in the sense of also flagging some safe states as unsafe. In this paper, we revisit the literature on process scheduling and describe a rule R0 that (i) is necessary and sufficient for deadlock detection when the network has at least two tracks in each resource (node), (ii) is computable in linear time, and (iii) yields lower delays when combined with existing scheduling algorithms on both synthetic and real data sets from Indian Railways.

List of keywords

Multidisciplinary Topics and Applications -> MTA: Transportation
Planning and Scheduling -> PS: Applications
Planning and Scheduling -> PS: Markov decisions processes
Planning and Scheduling -> PS: Scheduling

1095

A Lightweight U-like Network Utilizing Neural Memory Ordinary Differential Equations for Slimming the Decoder

Quansong He, Xiaojun Yao, Jun Wu, Zhang Yi, Tao He

In recent years, advanced U-like networks have demonstrated remarkable performance in medical image segmentation tasks. However, their drawbacks, including excessive parameters, high computational complexity, and slow inference speed, pose challenges for practical implementation in scenarios with limited computational resources. Existing lightweight U-like networks have alleviated some problems, but they often have pre-designed structures and consist of non-detachable modules, limiting their application scenarios. In this paper, we propose three plug-and-play decoders by employing different discretization methods of the neural memory Ordinary Differential Equation (nmODE). These decoders integrate features at various levels of abstraction by processing information from skip connections and performing numerical operations on upward paths. Through experiments on the PH2, ISIC2017, and ISIC2018 datasets, we embed these decoders into different U-like networks, demonstrating their effectiveness in significantly reducing the number of parameters and computation while maintaining performance. In summary, the proposed discretized nmODE decoder is capable of reducing the number of parameters by about 20% to 50% and computation by up to 74%, while being adaptive to all U-like networks. Our code is available at https://github.com/nayutayuki/Lightweight-nmODE-Decoders-For-U-like-networks.

List of keywords

Computer Vision -> CV: Segmentation
Computer Vision -> CV: Biomedical image analysis
Computer Vision -> CV: Machine learning for vision
Machine Learning -> ML: Convolutional networks

1098

Common-Individual Semantic Fusion for Multi-View Multi-Label Learning

Gengyu Lyu, Weiqi Kang, Haobo Wang, Zheng Li, Zhen Yang, Songhe Feng

In Multi-View Multi-Label Learning, each instance is described by several heterogeneous features and associated with multiple valid labels simultaneously. Existing methods mainly focus on leveraging feature-level view fusion to capture a common representation for multi-label classifier induction. In this paper, we take a new perspective and propose a new semantic-level fusion model named Common-Individual Semantic Fusion Multi-View Multi-Label Learning Method (CISF). Different from previous feature-level fusion models, our proposed method directly focuses on semantic-level view fusion and simultaneously takes both the common semantics across different views and the individual semantics of each specific view into consideration. Specifically, we first assume that each view involves some common semantic labels while owning a few exclusive semantic labels. Then, the common and exclusive semantic labels are separately forced to be consensus and diverse to excavate the consistencies and complementarities among different views. Afterwards, we introduce the low-rank and sparse constraint to highlight the label co-occurrence relationship of common semantics and the view-specific expression of individual semantics. We provide a theoretical guarantee for the strict convexity of our method by properly setting parameters. Extensive experiments on various data sets have verified the superiority of our method.

List of keywords

Machine Learning -> ML: Multi-label learning
Machine Learning -> ML: Classification
Machine Learning -> ML: Multi-view learning
Machine Learning -> ML: Weakly supervised learning

1106

Multi-Attention Based Visual-Semantic Interaction for Few-Shot Learning

Peng Zhao, Yin Wang, Wei Wang, Jie Mu, Huiting Liu, Cong Wang, Xiaochun Cao

Few-Shot Learning (FSL) aims to train a model that can generalize to recognize new classes, with each new class having only very limited training samples. Since extracting discriminative features for new classes with few samples is challenging, existing FSL methods leverage visual and semantic prior knowledge to guide discriminative feature learning. However, for meta-learning purposes, the semantic knowledge of the query set is unavailable, so their features lack discriminability. To address this problem, we propose a novel Multi-Attention based Visual-Semantic Interaction (MAVSI) approach for FSL. Specifically, we utilize spatial and channel attention mechanisms to effectively select discriminative visual features for the support set based on its ground-truth semantics while using all the support set semantics for each query set sample. Then, a relation module with class prototypes of the support set is employed to supervise and select discriminative visual features for the query set. To further enhance the discriminability of the support set, we introduce a visual-semantic contrastive learning module to promote the similarity between visual features and their corresponding semantic features. Extensive experiments on four benchmark datasets demonstrate that our proposed MAVSI could outperform existing state-of-the-art FSL methods.

List of keywords

Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning
Machine Learning -> ML: Meta-learning

1114

Large Language Models Are Not Strong Abstract Reasoners

Gaël Gendron, Qiming Bao, Michael Witbrock, Gillian Dobbie

Large Language Models have shown tremendous performance on a large variety of natural language processing tasks, ranging from text comprehension to common sense reasoning. However, the mechanisms responsible for this success remain opaque, and it is unclear whether LLMs can achieve human-like cognitive capabilities or whether these models are still fundamentally circumscribed. Abstract reasoning is a fundamental task for cognition, consisting of finding and applying a general pattern from few data. Evaluating deep neural architectures on this task could give insight into their potential limitations regarding reasoning and their broad generalisation abilities, yet this is currently an under-explored area. In this paper, we introduce a new benchmark for evaluating language models beyond memorization on abstract reasoning tasks. We perform extensive evaluations of state-of-the-art LLMs, showing that they currently achieve very limited performance in contrast with other natural language tasks, even when applying techniques that have been shown to improve performance on other NLP tasks. We argue that guiding LLM generation to follow causal paths could help improve the generalisation and reasoning abilities of LLMs.

List of keywords

Natural Language Processing -> NLP: Language models
Machine Learning -> ML: Evaluation
Machine Learning -> ML: Robustness
Natural Language Processing -> NLP: Question answering

1120

Toward a Manifold-Preserving Temporal Graph Network in Hyperbolic Space

Viet Quan Le, Viet Cuong Ta

Hyperbolic geometry provides an ideal setting to represent the scale-free or hierarchical characteristics of an input graph naturally. Utilizing hyperbolic geometry for learning dynamic graph representation has gained a growing interest in recent years. However, the majority of hyperbolic-based approaches rely on tangent spaces to perform graph operations, which could distort the structure of the dynamic graph when the graph grows over time. To avoid the distortion in tangent space, we propose a Hyperbolic Manifold-Preserving Temporal Graph Network that works directly on the hyperbolic manifold. The model includes a graph convolution module for learning the spatial dependencies, an attention architecture for capturing the temporal properties, and a gated recurrent unit for extracting the spatio-temporal relationships. By evaluating on diverse real-world dynamic graphs, our model has achieved significant improvements in link prediction and new link prediction tasks, in comparison with other baselines.

List of keywords

Machine Learning -> ML: Sequence and graph learning
Machine Learning -> ML: Geometric learning
Data Mining -> DM: Mining graphs
Data Mining -> DM: Mining spatial and/or temporal data

1125

Continual Compositional Zero-Shot Learning

Yang Zhang, Songhe Feng, Jiazheng Yuan

Compositional Zero-Shot Learning (CZSL) aims to recognize unseen compositions with the knowledge learned from seen compositions, where each composition is composed of two primitives (attribute and object). However, existing CZSL methods are designed to learn compositions from a fixed primitive set, and cannot handle the continually expanding primitive set in real-world applications. In this paper, we propose a new CZSL setting, named Continual Compositional Zero-Shot Learning (CCZSL), which requires the model to recognize unseen compositions composed of the learned primitive set while continually increasing the size of the learned primitive set. Contextuality and catastrophic forgetting are the main issues to be addressed in this setting. Specifically, we capture similar contextuality in compositions through several learnable Super-Primitives that can modify the invariant primitive embedding to better adapt to the contextuality in the corresponding composition. Then we introduce a dual knowledge distillation loss which aims at maintaining old knowledge learned from previous sessions and avoiding overfitting to the new session. We design the CCZSL evaluation protocol and conduct extensive experiments on widely used benchmarks, demonstrating the superiority of our method compared to the state-of-the-art CZSL methods.

List of keywords

Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning
Machine Learning -> ML: Incremental learning

1129

Causality-enhanced Discreted Physics-informed Neural Networks for Predicting Evolutionary Equations

Ye Li, Siqi Chen, Bin Shan, Sheng-Jun Huang

Physics-informed neural networks (PINNs) have shown promising potential for solving partial differential equations (PDEs) using deep learning. However, PINNs face training difficulties for evolutionary PDEs, particularly for dynamical systems whose solutions exhibit multi-scale or turbulent behavior over time. The reason is that PINNs may violate the temporal causality property since all the temporal features in the PINNs loss are trained simultaneously. This paper proposes to use implicit time differencing schemes to enforce temporal causality, and use transfer learning to sequentially update the PINNs in space as surrogates for PDE solutions in different time frames. The evolving PINNs are better able to capture the varying complexities of the evolutionary equations, while only requiring minor updates between adjacent time frames. Our method is theoretically proven to be convergent if the time step is small and each PINN in different time frames is well-trained. In addition, we provide state-of-the-art (SOTA) numerical results for a variety of benchmarks for which existing PINNs formulations may fail or be inefficient. We demonstrate that the proposed method improves the accuracy of PINNs approximation for evolutionary PDEs and improves efficiency by a factor of 4–40x. The code is available at https://github.com/SiqiChen9/TL-DPINNs.

List of keywords

Machine Learning -> ML: Applications
Machine Learning -> ML: Causality
Machine Learning -> ML: Deep learning architectures
Machine Learning -> ML: Regression

1133

SGDCL: Semantic-Guided Dynamic Correlation Learning for Explainable Autonomous Driving

Chengtai Cao, Xinhong Chen, Jianping Wang, Qun Song, Rui Tan, Yung-Hui Li

By learning expressive representations, deep learning (DL) has revolutionized autonomous driving (AD). Despite significant advancements, the inherent opacity of DL models engenders public distrust, impeding their widespread adoption. For explainable autonomous driving, current studies primarily concentrate on extracting features from input scenes to predict driving actions and their corresponding explanations. However, these methods underutilize semantics and correlation information within actions and explanations (collectively called categories in this work), leading to suboptimal performance. To address this issue, we propose Semantic-Guided Dynamic Correlation Learning (SGDCL), a novel approach that effectively exploits semantic richness and dynamic interactions intrinsic to categories. SGDCL employs a semantic-guided learning module to obtain category-specific representations and a dynamic correlation learning module to adaptively capture intricate correlations among categories. Additionally, we introduce an innovative loss term to leverage fine-grained co-occurrence statistics of categories for refined regularization. We extensively evaluate SGDCL on two well-established benchmarks, demonstrating its superiority over seven state-of-the-art baselines and a large vision-language model. SGDCL significantly promotes explainable autonomous driving with up to 15.3% performance improvement and interpretable attention scores, bolstering public trust in AD.

List of keywords

Computer Vision -> CV: Interpretability and transparency
Machine Learning -> ML: Explainable/Interpretable machine learning
Machine Learning -> ML: Classification
Computer Vision -> CV: Machine learning for vision

1138

Accelerating Diffusion Models for Inverse Problems through Shortcut Sampling

Gongye Liu, Haoze Sun, Jiayi Li, Fei Yin, Yujiu Yang

Diffusion models have recently demonstrated an impressive ability to address inverse problems in an unsupervised manner. While existing methods primarily focus on modifying the posterior sampling process, the potential of the forward process remains largely unexplored. In this work, we propose Shortcut Sampling for Diffusion (SSD), a novel approach for solving inverse problems in a zero-shot manner. Instead of initiating from random noise, the core concept of SSD is to find a specific transitional state that bridges the measurement image y and the restored image x. By utilizing the shortcut path of "input – transitional state – output", SSD can achieve precise restoration with fewer steps. To derive the transitional state during the forward process, we introduce Distortion Adaptive Inversion. Moreover, we apply back projection as additional consistency constraints during the generation process. Experimentally, we demonstrate SSD’s effectiveness on multiple representative IR tasks. Our method achieves competitive results with only 30 NFEs compared to state-of-the-art zero-shot methods (100 NFEs) and outperforms them with 100 NFEs in certain tasks. Code is available at https://github.com/GongyeLiu/SSD.

List of keywords

Computer Vision -> CV: Image and video synthesis and generation
Computer Vision -> CV: Applications

1163

Expressiveness is Effectiveness: Self-supervised Fashion-aware CLIP for Video-to-Shop Retrieval

Likai Tian, Zhengwei Yang, Zechao Hu, Hao Li, Yifang Yin, Zheng Wang

The rise of online shopping and social media has spurred the Video-to-Shop Retrieval (VSR) task, which involves identifying fashion items (e.g., clothing) in videos and matching them with identical products provided by stores. In real-world scenarios, human movement in dynamic video scenes can cause substantial morphological alterations of fashion items through occlusion, shifting viewpoints (parallax), and partial visibility (truncation). As a result, high-quality frames are overwhelmed by a vast number of redundant ones, which makes retrieval less effective. To this end, this paper introduces a framework, named Self-supervised Fashion-aware CLIP (SF-CLIP), for effective VSR. SF-CLIP enables the discovery of salient frames with high fashion expressiveness by generating pseudo-labels from three key aspects of fashion expressiveness to assess occlusion, parallax, and truncation. With such pseudo-labels, the ability of CLIP is expanded to facilitate the discovery of salient frames. Furthermore, to encompass the comprehensive representations among salient frames, a dual-branch graph-based fusion module is proposed to extract and integrate inter-frame features. Extensive experiments demonstrate the superiority of SF-CLIP over state-of-the-art methods.

List of keywords

Computer Vision -> CV: Image and video retrieval
Computer Vision -> CV: Interpretability and transparency

1175

Sub-Adjacent Transformer: Improving Time Series Anomaly Detection with Reconstruction Error from Sub-Adjacent Neighborhoods

Wenzhen Yue, Xianghua Ying, Ruohao Guo, DongDong Chen, Ji Shi, Bowei Xing, Yuqing Zhu, Taiyan Chen

In this paper, we present the Sub-Adjacent Transformer with a novel attention mechanism for unsupervised time series anomaly detection. Unlike previous approaches that rely on all the points within some neighborhood for time point reconstruction, our method restricts the attention to regions not immediately adjacent to the target points, termed sub-adjacent neighborhoods. Our key observation is that owing to the rarity of anomalies, they typically exhibit more pronounced differences from their sub-adjacent neighborhoods than from their immediate vicinities. By focusing the attention on the sub-adjacent areas, we make the reconstruction of anomalies more challenging, thereby enhancing their detectability. Technically, our approach concentrates attention on the non-diagonal areas of the attention matrix by enlarging the corresponding elements in the training stage. To facilitate the implementation of the desired attention matrix pattern, we adopt linear attention because of its flexibility and adaptability. Moreover, a learnable mapping function is proposed to improve the performance of linear attention. Empirically, the Sub-Adjacent Transformer achieves state-of-the-art performance across six real-world anomaly detection benchmarks, covering diverse fields such as server monitoring, space exploration, and water treatment.

List of keywords

Data Mining -> DM: Anomaly/outlier detection
Machine Learning -> ML: Time series and data streams

1177

Meta-Learning via PAC-Bayesian with Data-Dependent Prior: Generalization Bounds from Local Entropy

Shiyu Liu, Wei Shi, Zenglin Xu, Shaogao Lv, Yehong Zhang, Hui Wang

Meta-learning accelerates the learning process on unseen learning tasks by acquiring prior knowledge through previous related tasks. The PAC-Bayesian theory provides a theoretical framework to analyze the generalization of meta-learning to unseen tasks. However, previous works still encounter two notable limitations: (1) they merely focus on data-free priors, which often result in inappropriate regularization and loose generalization bounds; (2) more importantly, their optimization process usually involves nested optimization problems, incurring significant computational costs. To address these issues, we derive new generalization bounds and introduce a novel PAC-Bayesian framework for meta-learning that integrates data-dependent priors. This framework enables the extraction of optimal posteriors for each task in closed form, thereby allowing us to minimize generalization bounds incorporating data-dependent priors with only a simple local entropy. The resulting algorithm, which employs SGLD for sampling from the optimal posteriors, is stable, efficient, and computationally lightweight, eliminating the need for nested optimization. Extensive experimental results demonstrate that our proposed method outperforms the other baselines.

List of keywords

Machine Learning -> ML: Bayesian learning
Machine Learning -> ML: Meta-learning

1185

DTS-TPT: Dual Temporal-Sync Test-time Prompt Tuning for Zero-shot Activity Recognition

Rui Yan, Hongyu Qu, Xiangbo Shu, Wenbin Li, Jinhui Tang, Tieniu Tan

Finetuning the large vision-language models on video data with a set of learnable prompts has shown promising performance on zero-shot activity recognition but still requires extra video data and expensive training costs. Inspired by recent Test-time Prompt Tuning (TPT) on the image domain, this work attempts to extend TPT to video data for zero-shot activity recognition. However, monotonous spatial augmentation and short class names cannot meet the need to capture diverse and complicated semantics of human behavior during prompt tuning. To this end, this work proposes a Dual Temporal-Sync Test-time Prompt Tuning (DTS-TPT) framework for zero-shot activity recognition. DTS-TPT tunes the learnable prompts appended to text inputs on video feature sequences of different temporal scales in multiple steps during test time. In each tuning step, we minimize the semantic consistency among the predictions from video feature sequences randomly augmented via AugMix with both original class names and the corresponding description generated through an LLM. Compared with the state-of-the-art methods, the proposed method improves the zero-shot top-1 accuracy by approximately 2% to 5% on popular benchmarks. The code is available on the project website.

List of keywords

Computer Vision -> CV: Video analysis and understanding
Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning

1194

On the Effects of Fairness to Adversarial Vulnerability

Cuong Tran, Keyu Zhu, Pascal Van Hentenryck, Ferdinando Fioretto

Fairness and robustness are two important notions of learning models. Fairness ensures that models do not disproportionately harm (or benefit) some groups over others, while robustness measures the models’ resilience against small input perturbations. Although both properties are equally important, this paper illustrates a dichotomy between fairness and robustness, and analyzes when striving for fairness decreases the model robustness to adversarial samples. The reported analysis sheds light on the factors causing such contrasting behavior, suggesting that the distance to the decision boundary across groups is a key factor. Experiments on non-linear models and different architectures validate the theoretical findings. In addition to the theoretical analysis, the paper also proposes a simple, yet effective, solution to construct models achieving good tradeoffs between fairness and robustness.

List of keywords

AI Ethics, Trust, Fairness -> ETF: Fairness and diversity
AI Ethics, Trust, Fairness -> ETF: Safety and robustness
AI Ethics, Trust, Fairness -> ETF: Trustworthy AI
Constraint Satisfaction and Optimization -> CSO: Constraint optimization problems

1201

Hybrid Frequency Modulation Network for Image Restoration

Yuning Cui, Mingyu Liu, Wenqi Ren, Alois Knoll

Image restoration involves recovering a high-quality image from its corrupted counterpart. This paper presents an effective and efficient framework for image restoration, termed CSNet, based on “channel + spatial” hybrid frequency modulation. Different feature channels include different degradation patterns and degrees; however, most current networks ignore the importance of channel interactions. To alleviate this issue, we propose a frequency-based channel feature modulation module to facilitate channel interactions through the channel-dimension Fourier transform. Furthermore, based on our observations, we develop a multi-scale frequency-based spatial feature modulation module to refine the direct-current component of features using extremely lightweight learnable parameters. This module contains a densely connected coarse-to-fine learning paradigm for enhancing multi-scale representation learning. In addition, we introduce a frequency-inspired loss function to achieve omni-frequency learning. Extensive experiments on nine datasets demonstrate that the proposed network achieves state-of-the-art performance for three image restoration tasks, including image dehazing, image defocus deblurring, and image desnowing. The code and models are available at https://github.com/c-yn/CSNet.

List of keywords

Computer Vision -> CV: Applications
Computer Vision -> CV: Computational photography
Computer Vision -> CV: Representation learning

1227

Continual Multi-Objective Reinforcement Learning via Reward Model Rehearsal

Lihe Li, Ruotong Chen, Ziqian Zhang, Zhichao Wu, Yi-Chen Li, Cong Guan, Yang Yu, Lei Yuan

Multi-objective reinforcement learning (MORL) approaches address real-world problems with multiple objectives by learning policies maximizing returns weighted by different user preferences. Typical methods assume the objectives remain unchanged throughout the agent’s lifetime. However, in some real-world situations, the agent may encounter dynamically changing learning objectives, i.e., different vector-valued reward functions at different learning stages. This issue has not been considered in problem formulation or algorithm design. To address this issue, we formalize the setting as a continual MORL (CMORL) problem for the first time, accounting for the evolution of objectives throughout the learning process. Subsequently, we propose Continual Multi-Objective Reinforcement Learning via Reward Model Rehearsal (CORe3), incorporating a dynamic agent network for rapid adaptation to new objectives. Moreover, we develop a reward model rehearsal technique to recover the reward signals for previous objectives, thus alleviating catastrophic forgetting. Experiments on four CMORL benchmarks showcase that CORe3 effectively learns policies satisfying different preferences on all encountered objectives, and outperforms the best baseline by 171%, highlighting the capability of CORe3 to handle situations with evolving objectives.

List of keywords

Machine Learning -> ML: Reinforcement learning
Machine Learning -> ML: Incremental learning
Machine Learning -> ML: Optimization

1230

DCDet: Dynamic Cross-based 3D Object Detector

Shuai Liu, Boyang Li, Zhiyu Fang, Kai Huang

Recently, significant progress has been made in the research of 3D object detection. However, most prior studies have focused on the utilization of center-based or anchor-based label assignment schemes. Alternative label assignment strategies remain unexplored in 3D object detection. We find that the center-based label assignment often fails to generate sufficient positive samples for training, while the anchor-based label assignment tends to encounter an imbalance issue when handling objects with different scales. To solve these issues, we introduce a dynamic cross label assignment (DCLA) scheme, which dynamically assigns positive samples for each object from a cross-shaped region, thus providing sufficient and balanced positive samples for training. Furthermore, to address the challenge of accurately regressing objects with varying scales, we put forth a rotation-weighted Intersection over Union (RWIoU) metric to replace the widely used L1 metric in regression loss. Extensive experiments demonstrate the generality and effectiveness of our DCLA and RWIoU-based regression loss. The code is available at https://github.com/Say2L/DCDet.git.

List of keywords

Computer Vision -> CV: 3D computer vision
Computer Vision -> CV: Recognition (object detection, categorization)

1231

Sparse Multi-Relational Graph Convolutional Network for Multi-type Object Trajectory Prediction

Jianhui Zhang, Jun Yao, Liqi Yan, Yanhong Xu, Zheng Wang

Object trajectory prediction is a hot research issue with wide applications in video surveillance and autonomous driving. Previous studies consider interaction sparsity mainly among pedestrians rather than among multiple types of objects, which bring new types of interactions and consequently superfluous ones. This paper proposes a Multi-type Object Trajectory Prediction (MOTP) method with a Sparse Multi-relational Graph Convolutional Network (SMGCN) and a novel multi-round Global Temporal Aggregation (GTA). MOTP introduces a novel adaptive sparsification and multi-scale division method to model interactions among multi-type objects. It further incorporates a Sparse Multi-relational Temporal Graph to capture the temporal division of multi-type trajectories, along with a multi-round Global Temporal Aggregation (GTA) mechanism to mitigate error accumulation and enhance the trajectory prediction accuracy. The extensive evaluation on the ETH, UCY and SDD datasets shows that our method outperforms the typical state-of-the-art works by significant margins. Code will be available at https://github.com/sounio/SMGCN.

List of keywords

Computer Vision -> CV: Video analysis and understanding
Computer Vision -> CV: Action and behavior recognition

1239

DenseKoopman: A Plug-and-Play Framework for Dense Pedestrian Trajectory Prediction

Xianbang Li, Yilong Ren, Han Jiang, Haiyang Yu, Yanlei Cui, Liang Xu

Pedestrian trajectory prediction has emerged as a core component of human-robot interaction and autonomous driving. Fast and accurate prediction of surrounding pedestrians contributes to making decisions and improves safety and efficiency. However, pedestrians’ future trajectories will interact with their surrounding traffic participants. As the density of pedestrians increases, the complexity of such interactions also increases significantly, leading to an inevitable decrease in the accuracy of pedestrian trajectory prediction. To address this issue, we propose DenseKoopman, a plug-and-play framework for dense pedestrian trajectory prediction. Specifically, we introduce the Koopman operator theory to find an embedding space for a global linear approximation of a nonlinear pedestrian motion system. By encoding historical trajectories as linear state embeddings in the Koopman space, we transform nonlinear trajectory data for pedestrians in dense scenes. This linearized representation greatly reduces the complexity of dense pedestrian trajectory prediction. Extensive experiments on pedestrian trajectory prediction benchmarks demonstrate the superiority of the proposed framework. We also conduct an analysis of the data transformation to explore how our DenseKoopman framework works with each validation method and uncovers motion patterns that may be hidden within the trajectory data. Code is available at https://github.com/lixianbang/DenseKoopman.

List of keywords

Computer Vision -> CV: Motion and tracking
Computer Vision -> CV: Machine learning for vision
Computer Vision -> CV: Other

1240

Boundary-aware Decoupled Flow Networks for Realistic Extreme Rescaling

Jinmin Li, Tao Dai, Jingyun Zhang, Kang Liu, Jun Wang, Shaoming Wang, Shu-Tao Xia, Rizen Guo

Recently developed generative methods, including invertible rescaling network (IRN) based and generative adversarial network (GAN) based methods, have demonstrated exceptional performance in image rescaling. However, IRN-based methods tend to produce over-smoothed results, while GAN-based methods easily generate fake details, which thus hinders their real applications. To address this issue, we propose Boundary-aware Decoupled Flow Networks (BDFlow) to generate realistic and visually pleasing results. Unlike previous methods that model high-frequency information as standard Gaussian distribution directly, our BDFlow first decouples the high-frequency information into semantic high-frequency that adheres to a Boundary distribution and non-semantic high-frequency counterpart that adheres to a Gaussian distribution. Specifically, to capture semantic high-frequency parts accurately, we use Boundary-aware Mask (BAM) to constrain the model to produce rich textures, while non-semantic high-frequency part is randomly sampled from a Gaussian distribution. Comprehensive experiments demonstrate that our BDFlow significantly outperforms other state-of-the-art methods while maintaining lower complexity. Notably, our BDFlow improves the PSNR by 4.4 dB and the SSIM by 0.1 on average over GRAIN, utilizing only 74% of the parameters and 20% of the computation. The code will be available at https://github.com/THU-Kingmin/BAFlow.

List of keywords

Computer Vision -> CV: Image and video synthesis and generation
Computer Vision -> CV: Applications

1256

PointTFA: Training-Free Clustering Adaption of Large 3D Point Cloud Models

Jinmeng Wu, Chong Cao, Hao Zhang, Basura Fernando, Yanbin Hao, HanYu Hong

The success of contrastive learning models like CLIP, known for aligning 2D image-text pairs, has inspired the development of triplet alignment for Large 3D Point Cloud Models (3D-PCM). Examples like ULIP integrate images, text, and point clouds into a unified semantic space. However, despite showing impressive zero-shot capabilities, frozen 3D-PCM still falls short compared to fine-tuned methods, especially when downstream 3D datasets are significantly different from upstream data. Addressing this, we propose a Data-Efficient, Training-Free 3D Adaptation method named PointTFA that adjusts ULIP outputs with representative samples. PointTFA comprises the Representative Memory Cache (RMC) for selecting a representative support set, Cloud Query Refactor (CQR) for reconstructing a query cloud using the support set, and Training-Free 3D Adapter (3D-TFA) for inferring query categories from the support set. A key advantage of PointTFA is that it introduces no extra training parameters, yet outperforms vanilla frozen ULIP, closely approaching few-shot fine-tuning training methods in downstream cloud classification tasks like ModelNet10 & 40 and ScanObjectNN. Our project will be open-sourced post peer-review for future research.

List of keywords

Computer Vision -> CV: 3D computer vision
Computer Vision -> CV: Recognition (object detection, categorization)

1258

R2V-MIF: Rule-to-Vector Contrastive Learning and Multi-channel Information Fusion for Therapy Recommendation

Nengjun Zhu, Jieyun Huang, Jian Cao, Liang Hu, Zixuan Yuan, Huanjing Gao

Integrating data-driven and rule-based approaches is crucial for therapy recommendations since they can collaborate to achieve better performance. Medical rules, which are chains of reasoning that can infer therapies, widely exist. However, their symbolic and logical forms make integrating them with data-driven modeling technologies hard. Although rare attempts have indirectly modeled rules using data that supports them, the poor generalization of medical rules leads to inadequate supporting data and thus impairs the benefit of medical rules. To this end, we propose R2V-MIF, which fills the gap by rule-to-vector contrastive learning (R2V) and multi-channel information fusion (MIF). R2V is a data-free module and utilizes a hypergraph, including condition and result nodes, to instantiate the logic of medical rules. Each rule is reflected in the relations between nodes, and their representations are determined through contrastive learning. By taking rule representations as a bridge, MIF integrates the knowledge from medical rules, similar neighbors, and patient contents, and then recommends therapies. Extensive experiments show that R2V-MIF outperforms the baselines in several metrics using real-world medical data. Our code is available at https://github.com/vgeek-z/r2vmif.

List of keywords

Data Mining -> DM: Recommender systems
Data Mining -> DM: Mining heterogenous data
Multidisciplinary Topics and Applications -> MTA: Health and medicine

1301

Constructive Interpolation and Concept-Based Beth Definability for Description Logics via Sequents

Timothy S. Lyon, Jonas Karge

We introduce a constructive method applicable to a large number of description logics (DLs) for establishing the concept-based Beth definability property (CBP) based on sequent systems. Using the highly expressive DL RIQ as a case study, we introduce novel sequent calculi for RIQ-ontologies and show how certain interpolants can be computed from sequent calculus proofs, which permit the extraction of explicit definitions of implicitly definable concepts. To the best of our knowledge, this is the first sequent-based approach to computing interpolants and definitions within the context of DLs, as well as the first proof that RIQ enjoys the CBP. Moreover, due to the modularity of our sequent systems, our results hold for any restriction of RIQ, and are applicable to other DLs by suitable modifications.

List of keywords

Knowledge Representation and Reasoning -> KRR: Description logics and ontologies
Knowledge Representation and Reasoning -> KRR: Automated reasoning and theorem proving

1315

UniM-OV3D: Uni-Modality Open-Vocabulary 3D Scene Understanding with Fine-Grained Feature Representation

Qingdong He, Jinlong Peng, Zhengkai Jiang, Kai Wu, Xiaozhong Ji, Jiangning Zhang, Yabiao Wang, Chengjie Wang, Mingang Chen, Yunsheng Wu

3D open-vocabulary scene understanding aims to recognize arbitrary novel categories beyond the base label space. However, existing works not only fail to fully utilize all the available modal information in the 3D domain but also lack sufficient granularity in representing the features of each modality. In this paper, we propose a unified multimodal 3D open-vocabulary scene understanding network, namely UniM-OV3D, aligning point clouds with image, language and depth. To better integrate global and local features of the point clouds, we design a hierarchical point cloud feature extraction module that learns fine-grained feature representations. Further, to facilitate the learning of coarse-to-fine point-semantic representations from captions, we propose the utilization of hierarchical 3D caption pairs, capitalizing on geometric constraints across various viewpoints of 3D scenes. Extensive experimental results demonstrate the effectiveness and superiority of our method in open-vocabulary semantic and instance segmentation, which achieves state-of-the-art performance on both indoor and outdoor benchmarks such as ScanNet, ScanNet200, S3DIS and nuScenes. Code is available at https://github.com/hithqd/UniM-OV3D.

List of keywords

Computer Vision -> CV: 3D computer vision
Computer Vision -> CV: Applications
Computer Vision -> CV: Scene analysis and understanding

1323

KTCN: Enhancing Open-World Object Detection with Knowledge Transfer and Class-Awareness Neutralization

Xing Xi, Yangyang Huang, Jinhao Lin, Ronghua Luo

Open-World Object Detection (OWOD) has garnered widespread attention due to its ability to recall unannotated objects. Existing works generate pseudo-labels for the model using heuristic priors, which limits the model’s performance. In this paper, we leverage the knowledge of the large-scale visual model to provide supervision for unknown categories. Specifically, we use the Segment Anything Model (SAM) to generate raw pseudo-labels for potential objects and refine them through Intersection over Union (IOU) and the shortest bounding box side length. Nevertheless, the abundance of pseudo-labels still exacerbates the competition issue in the one-to-many label assignment. To address this, we propose the Dual Matching Label Assignment (DMLA) strategy. Furthermore, we propose the Class-Awareness Neutralizer (CAN) to reduce the model’s bias towards known categories. Evaluation results on open-world object detection benchmarks, including MS COCO and Pascal VOC, show that our method achieves nearly 200% the unknown recall rate of previous state-of-the-art (SOTA) methods, reaching 41.5 U-Recall. Additionally, our approach does not add any extra parameters, maintaining the inference speed advantage of Faster R-CNN, leading the SOTA methods based on deformable DETR at a speed of over 10 FPS. Our code is available at https://github.com/xxyzll/KTCN.

List of keywords

Computer Vision -> CV: Recognition (object detection, categorization)

1351

Ansatz-Agnostic Exponential Resource Saving in Variational Quantum Algorithms Using Shallow Shadows

Afrad Basheer, Yuan Feng, Christopher Ferrie, Sanjiang Li

Variational Quantum Algorithms (VQA) have been identified as a promising candidate for the demonstration of near-term quantum advantage in solving optimization tasks in chemical simulation, quantum information, and machine learning. The standard model of training requires a significant amount of quantum resources, which led researchers to use classical shadows to devise an alternative that consumes exponentially fewer quantum resources. However, the approach only works when the observables are local and the ansatz is the shallow Alternating Layered Ansatz (ALA), thus severely limiting its potential in solving problems such as quantum state preparation, where the ideal state might not be approximable with an ALA. In this work, we present a protocol based on shallow shadows that achieves similar levels of savings for almost any shallow ansatz studied in the literature, when combined with observables of low Frobenius norm. We show that two important applications in quantum information for which VQAs can be a powerful option, namely variational quantum state preparation and variational quantum circuit synthesis, are compatible with our protocol. We also experimentally demonstrate orders of magnitude improvement in comparison to the standard VQA model.

List of keywords

Machine Learning -> ML: Other
Machine Learning -> ML: Optimization

1358

Span-based Unified Named Entity Recognition Framework via Contrastive Learning

Hongli Mao, Xian-Ling Mao, Hanlin Tang, Yu-Ming Shang, Xiaoyan Gao, Ao-Jie Ma, Heyan Huang

Traditional Named Entity Recognition (NER) models are typically designed for domain-specific datasets and limited to fixed predefined types, resulting in difficulty generalizing to new domains. Recently, prompt-based generative methods have attempted to mitigate this constraint by training models jointly on diverse datasets and extracting specified entities via prompt instructions. However, due to their autoregressive structure, these methods cannot directly model entity spans and suffer from slow sequential decoding. To address these issues, we propose a novel Span-based Unified NER framework via contrastive learning (SUNER), which aligns text span and entity type representations in a shared semantic space to extract entities in parallel. Specifically, we first extract mention spans without considering entity types to better generalize across datasets. Then, by leveraging the power of contrastive learning and a well-designed entity marker structure, we map candidate spans and their textual type descriptions into the same vector representation space to differentiate entities across domains. Extensive experiments on both supervised and zero/few-shot settings demonstrate that the proposed SUNER model achieves better performance and higher efficiency than previous state-of-the-art unified NER models.

List of keywords

Natural Language Processing -> NLP: Named entities
Natural Language Processing -> NLP: Information extraction

1366

Hundredfold Accelerating for Pathological Images Diagnosis and Prognosis through Self-reform Critical Region Focusing

Xiaotian Yu, Luo Haoming, Jiacong Hu, Xiuming Zhang, Yuexuan Wang, Wenjie Liang, Yijun Bei, Mingli Song, Zunlei Feng

Pathological slides are commonly gigapixel images with abundant information and are therefore significant for clinical diagnosis. However, the ultra-large size makes both training and evaluation extremely time-consuming. Most existing methods need to crop the slide into patches, which also leads to large memory requirements. In this paper, we propose the Self-reform Multilayer Transformer (SMT) to accelerate pathological image diagnosis and prognosis. Inspired by the pathologists’ diagnostic procedure, SMT is designed to achieve layer-by-layer focus on critical regions. In the forward process, the first layer takes thumbnails as inputs and measures the significance of each patch that deserves focusing. Images from focused regions are cropped with a higher magnification and used as the input of the next layer. By analogy, the inputs of the third layer are the focused images of the second layer, which contain abundant cellular features. In addition to the forward focusing, a backward reform strategy is proposed to improve the precision of former layers. This cyclic process achieves iterative interactions for better performance on both classification and focusing. In this way, only a small number of critical patches are required in SMT for diagnosis and prognosis. Extensive experiments demonstrate that SMT is hundreds of times faster than existing SOTA methods, while achieving comparable accuracy and requiring less storage.

List of keywords

Computer Vision -> CV: Biomedical image analysis
Computer Vision -> CV: Recognition (object detection, categorization)
Machine Learning -> ML: Classification

1367

Optimal Graph Learning and Nuclear Norm Maximization for Deep Cross-Domain Robust Label Propagation

Wei Wang, Hanyang Li, Ke Shi, Chao Huang, Yang Cao, Cong Wang, Xiaochun Cao

Domain adaptation aims to achieve label transfer from a labeled source domain to an unlabeled target domain, where the two domains exhibit different distributions. Existing methods primarily concentrate on designing a feature extractor to learn better domain-invariant features, along with developing an effective classifier for reliable predictions. In this paper, we introduce optimal graph learning to generate a cross-domain graph that effectively connects the two domains, and two domain-specific graphs to capture domain-specific structures. On the one hand, we incorporate the three graphs into the label propagation (LP) classifier to enhance its robustness to distribution differences. On the other hand, we leverage the three graphs to introduce graph embedding losses, promoting the learning of locally discriminative and domain-invariant features. Furthermore, we maximize the nuclear norm of predictions in LP to enhance class diversity, thereby improving its robustness to the class imbalance problem. Correspondingly, we develop an efficient algorithm to solve the associated optimization problem. Finally, we integrate the proposed LP and graph embedding losses into a deep neural network, resulting in our proposed deep cross-domain robust LP. Extensive experiments conducted on three cross-domain benchmark datasets demonstrate that our proposed approach can outperform existing state-of-the-art domain adaptation methods.
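To illustrate why maximizing the nuclear norm of a prediction matrix encourages class diversity, here is a minimal sketch that is not taken from the paper. It assumes NumPy, and the function name `nuclear_norm` and the two toy prediction matrices are hypothetical. Rows are samples, columns are class probabilities.

```python
import numpy as np


def nuclear_norm(predictions: np.ndarray) -> float:
    """Sum of singular values of an (n_samples, n_classes) prediction matrix."""
    return float(np.linalg.norm(predictions, ord="nuc"))


# Toy example (assumed data): four samples, two classes.
# "diverse" spreads confident predictions evenly over both classes.
diverse = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
# "collapsed" assigns every sample to the first class.
collapsed = np.array([[1.0, 0.0]] * 4)

print(nuclear_norm(diverse), nuclear_norm(collapsed))
```

In this toy case the balanced predictions have the larger nuclear norm, so an objective that rewards a large nuclear norm penalizes collapse onto a single class.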

List of keywords

Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning
Machine Learning -> ML: Classification
Machine Learning -> ML: Feature extraction, selection and dimensionality reduction

1377

DifTraj: Diffusion Inspired by Intrinsic Intention and Extrinsic Interaction for Multi-Modal Trajectory Prediction

Yanghong Liu, Xingping Dong, Yutian Lin, Mang Ye

Recent years have witnessed the success of generative adversarial networks and diffusion models in multi-modal trajectory prediction. However, prevailing algorithms only explicitly consider human interaction and ignore the modeling of human intention, so the generated results deviate largely from real trajectories in some complex scenes. In this paper, we analyze the conditions of multi-modal trajectory prediction from two objective perspectives and propose a novel end-to-end framework based on the diffusion model to predict more precise and socially-acceptable trajectories for humans. Firstly, a spatial-temporal aggregation module is built to extract the extrinsic interaction features for capturing socially-acceptable behaviors. Secondly, we explicitly construct the intrinsic intention module to obtain intention features for precise prediction. Finally, we estimate a noise trajectory distribution with these two features as the initiation of the diffusion model and leverage the denoising process to obtain the final trajectories. Furthermore, to reduce the noise of the initial trajectory estimation, we present a novel sample consistency loss to constrain multiple predictions. Extensive experiments demonstrate that our method outperforms the state-of-the-art methods on ETH-UCY and SDD benchmarks, specifically achieving 19.0%/24.2% ADE/FDE improvement on ETH-UCY.

List of keywords

Computer Vision -> CV: Motion and tracking
Agent-based and Multi-agent Systems -> MAS: Multi-agent learning
Machine Learning -> ML: Time series and data streams

1383

ClothPPO: A Proximal Policy Optimization Enhancing Framework for Robotic Cloth Manipulation with Observation-Aligned Action Spaces

Libing Yang, Yang Li, Long Chen

Vision-based robotic cloth unfolding has made great progress recently. However, prior works predominantly rely on value learning and have not fully explored policy-based techniques. Recently, the success of reinforcement learning on the large language model has shown that the policy gradient algorithm can enhance policy with huge action space. In this paper, we introduce ClothPPO, a framework that employs a policy gradient algorithm based on actor-critic architecture to enhance a pre-trained model with huge 10^6 action spaces aligned with observation in the task of unfolding clothes. To this end, we redefine the cloth manipulation problem as a partially observable Markov decision process. A supervised pre-training stage is employed to train a baseline model of our policy. In the second stage, the Proximal Policy Optimization (PPO) is utilized to guide the supervised model within the observation-aligned action space. By optimizing and updating the strategy, our proposed method increases the garment’s surface area for cloth unfolding under the soft-body manipulation task. Experimental results show that our proposed framework can further improve the unfolding performance of other state-of-the-art methods.

List of keywords

Robotics -> ROB: Learning in robotics
Robotics -> ROB: Manipulation
Robotics -> ROB: Perception
Machine Learning -> ML: Reinforcement learning

1385

Rethinking Centered Kernel Alignment in Knowledge Distillation

Zikai Zhou, Yunhang Shen, Shitong Shao, Linrui Gong, Shaohui Lin

Knowledge distillation has emerged as a highly effective method for bridging the representation discrepancy between large-scale models and lightweight models. Prevalent approaches involve leveraging appropriate metrics to minimize the divergence or distance between the knowledge extracted from the teacher model and the knowledge learned by the student model. Centered Kernel Alignment (CKA) is widely used to measure representation similarity and has been applied in several knowledge distillation methods. However, these methods are complex and fail to uncover the essence of CKA, thus not answering the question of how to use CKA properly to achieve simple and effective distillation. This paper first provides a theoretical perspective to illustrate the effectiveness of CKA, which decouples CKA into the upper bound of Maximum Mean Discrepancy (MMD) and a constant term. Drawing from this, we propose a novel Relation-Centered Kernel Alignment (RCKA) framework, which practically establishes a connection between CKA and MMD. Furthermore, we dynamically customize the application of CKA based on the characteristics of each task, with fewer computational resources yet performance comparable to the previous methods. Extensive experiments on CIFAR-100, ImageNet-1k, and MS-COCO demonstrate that our method achieves state-of-the-art performance on almost all teacher-student pairs for image classification and object detection, validating the effectiveness of our approaches. Our code is available at https://github.com/Klayand/PCKA.
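For readers unfamiliar with CKA, the following is a minimal sketch of the standard linear CKA similarity between two feature matrices. It is not the RCKA method described above. It assumes NumPy, and the names `linear_cka`, `teacher`, and `student` as well as the random toy features are hypothetical.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between features x (n, d1) and y (n, d2) of the same n samples."""
    # Center each feature dimension over the samples.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


rng = np.random.default_rng(0)
teacher = rng.normal(size=(64, 16))            # assumed toy teacher features
rotation, _ = np.linalg.qr(rng.normal(size=(16, 16)))
student = 3.0 * teacher @ rotation             # rotated and rescaled copy
unrelated = rng.normal(size=(64, 16))          # independent random features

print(linear_cka(teacher, student), linear_cka(teacher, unrelated))
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling, so the rotated and rescaled copy scores essentially 1 while the unrelated features score lower.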

List of keywords

Machine Learning -> ML: Deep learning architectures
Computer Vision -> CV: Recognition (object detection, categorization)
Computer Vision -> CV: Representation learning
Machine Learning -> ML: Representation learning

1399

Massively Parallel Single-Source SimRanks in o(log n) Rounds

Siqiang Luo, Zulun Zhu

SimRank is one of the most fundamental measures for evaluating the structural similarity between two nodes in a graph and has been applied in a plethora of data mining and machine learning tasks. These tasks often involve single-source SimRank computation, which evaluates the SimRank values between a source node $u$ and all other nodes. Due to its high computational complexity, single-source SimRank computation for large graphs is notoriously challenging, and hence recent studies resort to distributed processing. To our surprise, although SimRank has been widely adopted for two decades, theoretical aspects of distributed SimRanks with provable results have rarely been studied. In this paper, we conduct a theoretical study on single-source SimRank computation in the Massively Parallel Computation (MPC) model, which is the standard theoretical framework for modeling distributed systems. Existing distributed SimRank algorithms require either $\Omega(\log n)$ communication round complexity or $\Omega(n)$ machine space for a graph of $n$ nodes. We overcome this barrier. In particular, given a graph of $n$ nodes, for any query node $v$ and constant error $\epsilon>\frac{3}{n}$, we show that $O(\log^2 \log n)$ rounds of communication among machines are enough to compute single-source SimRank values with at most $\epsilon$ absolute error, while each machine only needs space sub-linear in $n$. To the best of our knowledge, this is the first single-source SimRank algorithm in MPC that can overcome the $\Theta(\log n)$ round complexity barrier with provable result accuracy.
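To make the quantity being computed concrete, here is a naive, single-machine sketch of SimRank using the classic fixed-point iteration: a node is fully similar to itself, and two distinct nodes are similar in proportion to the average similarity of their in-neighbors. It is not the MPC algorithm from the paper; the function names, the decay factor of 0.6, and the dictionary-based graph format are assumptions of this example.

```python
def simrank_all_pairs(in_neighbors, decay=0.6, iterations=10):
    """Naive iterative SimRank. in_neighbors maps each node to its in-neighbor list."""
    nodes = list(in_neighbors)
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iterations):
        new = {a: {} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[a][b] = 1.0
                    continue
                ia, ib = in_neighbors[a], in_neighbors[b]
                if not ia or not ib:
                    # Nodes without in-neighbors are similar only to themselves.
                    new[a][b] = 0.0
                    continue
                total = sum(sim[i][j] for i in ia for j in ib)
                new[a][b] = decay * total / (len(ia) * len(ib))
        sim = new
    return sim

def single_source_simrank(in_neighbors, source, **kwargs):
    """SimRank scores between `source` and every node (computed the slow way)."""
    return simrank_all_pairs(in_neighbors, **kwargs)[source]
```

This all-pairs iteration costs far more than a dedicated single-source method, which is exactly the cost that motivates distributed algorithms such as the one studied in the paper.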

List of keywords

Data Mining -> DM: Parallel, distributed and cloud-based high performance mining
Data Mining -> DM: Theoretical foundations of data mining

1423

Truth Table Net: Scalable, Compact & Verifiable Neural Networks with a Dual Convolutional Small Boolean Circuit Networks Form

Adrien Benamira, Thomas Peyrin, Trevor Yap, Tristan Guérand, Bryan Hooi

We introduce Truth Table net (TTnet), a novel Deep Neural Network (DNN) architecture designed to provide excellent scalability/compactness trade-offs among DNNs, which in turn makes it possible to tackle the DNN challenge of fast formal verification. TTnet is constructed using Learning Truth Table (LTT) filters, analogous to how a Deep Convolutional Neural Network (DCNN) is built upon convolutional filters. The differentiable LTT filters are unique in their dual form: they are both a neural network-based function and a small-sized truth table that can be computed within a practical time frame. This characteristic guarantees, by design and independently of the overall architecture, the ability to practically extract an efficient (in terms of the number of logical gates) and functionally equivalent Conjunctive Normal Form (CNF) Boolean logic gate implementation. This CNF circuit is even optimal when the LTT truth table’s input bit size n<13. In particular, the TTnet architecture is the first differentiable DNN whose dual form is a compact logic gate representation that can scale to datasets larger than CIFAR10: we achieve an accuracy of 41% on the ImageNet dataset while ensuring that each LTT filter truth table is fully computable within 2^{16} operations. We further compare the compactness and scalability of the TTnet Boolean logic circuit representation to state-of-the-art differentiable logic DNNs across tabular, MNIST, and CIFAR10 datasets. We emphasize that TTnet is the first solution to the open problem of designing differentiable convolutional neural networks with an exact dual logic gate circuit representation, bridging the gap between symbolic AI and trainable DCNNs. Finally, as improving DNN compactness in Boolean logic circuit form reduces the complexity of formal verification, we demonstrate TTnet’s effectiveness in exact, sound, and complete formal verification. Notably, our model achieves robustness verification in 10ms vs 100s for traditional state-of-the-art DNN solvers.
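The dual-form idea can be illustrated on a toy scale: any function of a few binary inputs can be enumerated into a truth table, and that table can be turned into a functionally equivalent CNF formula. The sketch below builds the canonical (unminimized) CNF with one clause per falsifying row. It is not the extraction or optimization procedure used by TTnet; the helper names and the example threshold filter are hypothetical.

```python
from itertools import product

def truth_table(fn, n_bits):
    """Enumerate fn over all 2**n_bits binary inputs."""
    return {bits: bool(fn(bits)) for bits in product((0, 1), repeat=n_bits)}

def table_to_cnf(table):
    """Canonical CNF: one clause per input row on which the function is False.

    Literals use DIMACS-style integers: k means x_k, -k means NOT x_k (1-based).
    Each clause is False exactly on the row that produced it.
    """
    clauses = []
    for bits, value in table.items():
        if not value:
            clauses.append([-(k + 1) if b else (k + 1) for k, b in enumerate(bits)])
    return clauses

def eval_cnf(clauses, bits):
    """Evaluate a CNF (list of integer-literal clauses) on a bit tuple."""
    return all(
        any((bits[abs(lit) - 1] == 1) == (lit > 0) for lit in clause)
        for clause in clauses
    )

# Hypothetical 3-input "filter": a thresholded linear function of the bits.
example_filter = lambda bits: 2 * bits[0] - bits[1] + bits[2] >= 1
example_cnf = table_to_cnf(truth_table(example_filter, 3))
```

Because the table has 2^n rows, this enumeration is only practical for small n, which is consistent with the abstract's emphasis on keeping each filter's truth table small.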

List of keywords

Agent-based and Multi-agent Systems -> MAS: Formal verification, validation and synthesis
AI Ethics, Trust, Fairness -> ETF: Trustworthy AI
Constraint Satisfaction and Optimization -> CSO: Satisfiability
Machine Learning -> ML: Convolutional networks

1425

SemanticMask: A Contrastive View Design for Anomaly Detection in Tabular Data

Shuting Tao, Tongtian Zhu, Hongwei Wang, Xiangming Meng

Contrastive learning based on data augmentation techniques has recently achieved substantial advancement in learning representations well-suited for anomaly detection in the image domain. However, due to the lack of spatial structure, designing effective data augmentation methods for tabular data remains challenging. Conventional techniques, such as random masking, disregard inter-feature correlations and fail to accurately represent the data. To address this issue, we propose a novel augmentation technique called SemanticMask, which leverages the semantic information from column names to generate better augmented views. SemanticMask aims to ensure that the shared information between views contains sufficient information for anomaly detection without redundancy. We analyze the relationship between shared information and anomaly detection performance and empirically demonstrate that good views for tabular anomaly detection tasks are feature-dependent. Our experimental results validate the superiority of SemanticMask over state-of-the-art anomaly detection methods and existing augmentation techniques for tabular data. In further evaluations on the multi-class novelty detection task, SemanticMask also significantly outperforms the baseline.

List of keywords

Data Mining -> DM: Anomaly/outlier detection
Machine Learning -> ML: Unsupervised learning

1427

Proximal Curriculum with Task Correlations for Deep Reinforcement Learning

Georgios Tzannetos, Parameswaran Kamalaruban, Adish Singla

Curriculum design for reinforcement learning (RL) can speed up an agent’s learning process and help it learn to perform well on complex tasks. However, existing techniques typically require domain-specific hyperparameter tuning, involve expensive optimization procedures for task selection, or are suitable only for specific learning objectives. In this work, we consider curriculum design in contextual multi-task settings where the agent’s final performance is measured w.r.t. a target distribution over complex tasks. We base our curriculum design on the Zone of Proximal Development concept, which has proven to be effective in accelerating the learning process of RL agents for a uniform distribution over all tasks. We propose a novel curriculum, ProCuRL-Target, that effectively balances the need for selecting tasks that are not too difficult for the agent while progressing the agent’s learning toward the target distribution by leveraging task correlations. We theoretically justify the task selection strategy of ProCuRL-Target by analyzing a simple learning setting with a REINFORCE learner model. Our experimental results across various domains with challenging target task distributions affirm the effectiveness of our curriculum strategy over state-of-the-art baselines in accelerating the training process of deep RL agents.

List of keywords

Machine Learning -> ML: Reinforcement learning
Planning and Scheduling -> PS: Markov decision processes

1455

A Meta-Game Evaluation Framework for Deep Multiagent Reinforcement Learning

Zun Li, Michael P. Wellman

Evaluating deep multiagent reinforcement learning (MARL) algorithms is complicated by stochasticity in training and sensitivity of agent performance to the behavior of other agents. We propose a meta-game evaluation framework for deep MARL, by framing each MARL algorithm as a meta-strategy, and repeatedly sampling normal-form empirical games over combinations of meta-strategies resulting from different random seeds. Each empirical game captures both self-play and cross-play factors across seeds. These empirical games provide the basis for constructing a sampling distribution, using bootstrapping, over a variety of game analysis statistics. We use this approach to evaluate state-of-the-art deep MARL algorithms on a class of negotiation games. From statistics on individual payoffs, social welfare, and empirical best-response graphs, we uncover strategic relationships among self-play, population-based, model-free, and model-based MARL methods. We also investigate the effect of run-time search as a meta-strategy operator, and find via meta-game analysis that the search version of a meta-strategy generally leads to improved performance.
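The bootstrapping step can be pictured with a simplified stand-in: given a matrix of cross-play payoffs between training seeds of two algorithms, resample seeds with replacement and recompute a statistic to obtain its sampling distribution. The sketch below uses the mean payoff as the statistic. The paper's framework samples full normal-form empirical games and richer statistics, so the function `bootstrap_mean_payoff` and its inputs are assumptions made purely for illustration.

```python
import numpy as np

def bootstrap_mean_payoff(payoffs, n_resamples=1000, seed=0):
    """Bootstrap distribution of the mean cross-play payoff.

    payoffs[i, j]: payoff to algorithm A when its seed i plays against
    seed j of algorithm B (a hypothetical input format).
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = payoffs.shape
    stats = np.empty(n_resamples)
    for k in range(n_resamples):
        # Resample seeds of each algorithm independently, with replacement.
        rows = rng.integers(0, n_rows, size=n_rows)
        cols = rng.integers(0, n_cols, size=n_cols)
        stats[k] = payoffs[np.ix_(rows, cols)].mean()
    return stats
```

Percentiles of the returned array, for example via `np.percentile(stats, [2.5, 97.5])`, give an interval that reflects seed-to-seed variability rather than a single training run.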

List of keywords

Agent-based and Multi-agent Systems -> MAS: Multi-agent learning
Game Theory and Economic Paradigms -> GTEP: Noncooperative games

1461

Unified Physical-Digital Face Attack Detection

Hao Fang, Ajian Liu, Haocheng Yuan, Junze Zheng, Dingheng Zeng, Yanhong Liu, Jiankang Deng, Sergio Escalera, Xiaoming Liu, Jun Wan, Zhen Lei

Face Recognition (FR) systems can suffer from physical (e.g., print photo) and digital (e.g., DeepFake) attacks. However, previous related work rarely considers both situations at the same time. This implies the deployment of multiple models and thus a greater computational burden. The lack of an integrated model has two main causes: (1) the lack of a dataset that includes both physical and digital attacks and in which the same ID covers the real face and all attack types; (2) given the large intra-class variance between these two kinds of attacks, it is difficult to learn a compact feature space to detect both simultaneously. To address these issues, we collect a Unified physical-digital Attack dataset, called UniAttackData. The dataset consists of 1,800 participations covering 2 physical and 12 digital attack types, resulting in a total of 28,706 videos. Then, we propose a Unified Attack Detection framework based on Vision-Language Models (VLMs), namely UniAttackDetection, which includes three main modules: the Teacher-Student Prompts (TSP) module, focused on acquiring unified and specific knowledge respectively; the Unified Knowledge Mining (UKM) module, designed to capture a comprehensive feature space; and the Sample-Level Prompt Interaction (SLPI) module, aimed at grasping sample-level semantics. These three modules seamlessly form a robust unified attack detection framework. Extensive experiments on UniAttackData and three other datasets demonstrate the superiority of our approach for unified face attack detection. Dataset link: https://sites.google.com/view/face-anti-spoofing-challenge/dataset-download/uniattackdatacvpr2024

List of keywords

Computer Vision -> CV: Biometrics, face, gesture and pose recognition
Machine Learning -> ML: Multi-modal learning

1500

Memorizing Documents with Guidance in Large Language Models

Bumjin Park, Jaesik Choi

As large language models (LLMs) memorize vast amounts of content, tracing the provenance of generated content is essential for safe use. Among several approaches, storing documents in known memories of LLMs reveals the knowledge locations. To store documents in known memories, we tackle the problem of entangling documents and memory locations of LLMs, encouraging separate document memory selections. For this purpose, we propose document guidance loss, which increases the likelihood of document contents with the conditional generation of the document while preventing memorization with other conditions. This work shows how the guidance loss separates document memories and analyzes the relationship between the document and memory selection space with the Lipschitz continuity assumption in metric spaces. Experimental results on Wikitext-103-v1 with Pythia-1B show that document guidance provides separable document memories.

List of keywords

Natural Language Processing -> NLP: Language models
Natural Language Processing -> NLP: Embeddings
Natural Language Processing -> NLP: Interpretability and analysis of models for NLP
Natural Language Processing -> NLP: Language generation

1505

Meta In-Context Learning Makes Large Language Models Better Zero and Few-Shot Relation Extractors

Guozheng Li, Peng Wang, Jiajun Liu, Yikai Guo, Ke Ji, Ziyu Shang, Zijie Xu

Relation extraction (RE) is an important task that aims to identify the relationships between entities in texts. While large language models (LLMs) have revealed remarkable in-context learning (ICL) capability for general zero and few-shot learning, recent studies indicate that current LLMs still struggle with zero and few-shot RE. Previous studies are mainly dedicated to designing prompt formats and selecting good examples for improving ICL-based RE. Although both factors are vital for ICL, if one can fundamentally boost the ICL capability of LLMs in RE, the zero and few-shot RE performance via ICL would be significantly improved. To this end, we introduce Micre (Meta In-Context learning of LLMs for Relation Extraction), a new meta-training framework for zero and few-shot RE where an LLM is tuned to do ICL on a diverse collection of RE datasets (i.e., learning to learn in context for RE). Through meta-training, the model learns a new RE task in context more effectively by conditioning on a few training examples, with no parameter updates or task-specific templates at inference time, enabling better zero and few-shot task generalization. We experiment with Micre on various LLMs with different model scales and 12 public RE datasets, and then evaluate it on unseen RE benchmarks under zero and few-shot settings. Micre delivers comparable or superior performance compared to a range of baselines including supervised fine-tuning and typical in-context learning methods. We find that the gains are particularly significant for larger model scales, and that using a diverse set of meta-training RE datasets is key to improvements. Empirically, we show that Micre can transfer relation semantic knowledge via relation label names during inference on target RE datasets.

List of keywords

Natural Language Processing -> NLP: Information extraction

1525

Efficient and Stable Offline-to-online Reinforcement Learning via Continual Policy Revitalization

Rui Kong, Chenyang Wu, Chen-Xiao Gao, Zongzhang Zhang, Ming Li

In offline-to-online Reinforcement Learning (RL), policies pre-trained offline are utilized for initialization and subsequent online fine-tuning. However, existing methods suffer from instability and low sample efficiency compared to pure online learning. This paper identifies these limitations as stemming from direct policy initialization using offline-trained policy models. We propose Continual Policy Revitalization (CPR), a novel, efficient, and stable fine-tuning method. CPR incorporates a periodic policy revitalization technique, restoring the overtrained policy network to full learning capacity while ensuring stable initial performance. This approach enables fine-tuning without being adversely affected by low-quality pre-trained policies. In contrast to previous research, CPR initializes the new policy with an adaptive policy constraint in policy optimization. Such optimization keeps the new policy close to a behavior policy constructed from historical policies. This contributes to stable policy improvement and optimal converged performance. Practically, CPR can seamlessly integrate into existing offline RL algorithms with minimal modification. We empirically validate the effectiveness of our method through extensive experiments, demonstrating substantial improvements in learning stability and efficiency compared to previous approaches. Our code is available at https://github.com/LAMDA-RL/CPR.

List of keywords

Machine Learning -> ML: Reinforcement learning
Machine Learning -> ML: Offline reinforcement learning

1531

Optimal Auction Design with User Coupons in Advertising Systems

Xiaodong Liu, Zhikang Fan, Yiming Ding, Yuan Guo, Lihua Zhang, Changcheng Li, Dongying Kong, Han Li, Weiran Shen

Online advertising is a major revenue source for most Internet companies. Advertising opportunities are usually sold to advertisers through auctions that take into account the bids of the advertisers as well as the click-through rates (CTRs) and conversion rates (CVRs) of the users. Standard auction design theory treats both the CTRs and the CVRs as constants. We consider a new auction mechanism that offers coupons to users when displaying the ads. Such coupons allow the user to buy the advertisers’ products or services at a lower price, which increases both the CTRs and the CVRs of the ads. In this paper, we formulate the problem mathematically and perform a systematic analysis. We characterize the set of individually rational and incentive compatible mechanisms in our setting. Based on the characterization, we identify the optimal strategy of offering coupons that maximizes the platform’s expected revenue. We also conduct extensive experiments on both synthetic data and industrial data. Our experimental results show that our mechanism significantly improves both the revenue and welfare of the platform, thereby creating a win-win situation for all parties including the platform, the advertisers, and the user.

List of keywords

Game Theory and Economic Paradigms -> GTEP: Auctions and market-based systems
Game Theory and Economic Paradigms -> GTEP: Mechanism design
Game Theory and Economic Paradigms -> GTEP: Noncooperative games

1532

FedPFT: Federated Proxy Fine-Tuning of Foundation Models

Zhaopeng Peng, Xiaoliang Fan, Yufan Chen, Zheng Wang, Shirui Pan, Chenglu Wen, Ruisheng Zhang, Cheng Wang

Adapting Foundation Models (FMs) for downstream tasks through Federated Learning (FL) emerges as a promising strategy for protecting data privacy and valuable FMs. Existing methods fine-tune the FM by allocating a sub-FM to clients in FL, which, however, leads to suboptimal performance due to insufficient tuning and inevitable error accumulation of gradients. In this paper, we propose Federated Proxy Fine-Tuning (FedPFT), a novel method that enhances FM adaptation in downstream tasks through FL by means of two key modules. First, the sub-FM construction module employs a layer-wise compression approach, facilitating comprehensive FM fine-tuning across all layers by emphasizing crucial neurons. Second, the sub-FM alignment module conducts a two-step distillation (layer-level and neuron-level) before and during FL fine-tuning, respectively, to reduce gradient error by accurately aligning the sub-FM with the FM under theoretical guarantees. Experimental results on seven commonly used datasets (i.e., four text and three vision) demonstrate the superiority of FedPFT. Our code is available at https://github.com/pzp-dzd/FedPFT.

List of keywords

Machine Learning -> ML: Federated learning
Machine Learning -> ML: Trustworthy machine learning
Multidisciplinary Topics and Applications -> MTA: Security and privacy

1540

Unsupervised Anomaly Detection via Masked Diffusion Posterior Sampling

Di Wu, Shicai Fan, Xue Zhou, Li Yu, Yuzhong Deng, Jianxiao Zou, Baihong Lin

Reconstruction-based methods have been commonly used for unsupervised anomaly detection, in which a normal image is reconstructed and compared with the given test image to detect and locate anomalies. Recently, diffusion models have shown promising applications for anomaly detection due to their powerful generative ability. However, these models lack strict mathematical support for normal image reconstruction and unexpectedly suffer from low reconstruction quality. To address these issues, this paper proposes a novel and highly interpretable method named Masked Diffusion Posterior Sampling (MDPS). In MDPS, the problem of normal image reconstruction is mathematically modeled as multiple diffusion posterior sampling for normal images, based on the devised masked noisy observation model and the diffusion-based normal image prior under a Bayesian framework. Using a metric designed from pixel-level and perceptual-level perspectives, MDPS can effectively compute the difference map between each normal posterior sample and the given test image. Anomaly scores are obtained by averaging all difference maps over the multiple posterior samples. Extensive experiments on the MVTec and BTAD datasets demonstrate that MDPS can achieve state-of-the-art performance in normal image reconstruction quality as well as anomaly detection and localization.
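The final scoring step described in this abstract, averaging difference maps over several reconstructed samples, can be sketched as follows. The sketch uses a plain per-pixel absolute difference as a stand-in for the paper's combined pixel-level and perceptual-level metric, and it takes the posterior samples as given rather than producing them with a diffusion model; the function name `anomaly_map` is hypothetical.

```python
import numpy as np

def anomaly_map(test_image, posterior_samples):
    """Average per-pixel difference between a test image and S normal samples.

    test_image: array of shape (H, W) or (H, W, C).
    posterior_samples: array of shape (S, H, W) or (S, H, W, C).
    """
    diffs = np.abs(posterior_samples - test_image[None, ...])
    if diffs.ndim == 4:
        # Collapse colour channels so every sample yields one (H, W) map.
        diffs = diffs.mean(axis=-1)
    # Averaging over samples gives the pixel-wise anomaly score.
    return diffs.mean(axis=0)
```

An image-level score could then be taken as, for example, the maximum of the map, although the abstract does not specify how the paper aggregates it.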

List of keywords

Data Mining -> DM: Anomaly/outlier detection
Computer Vision -> CV: Applications
Computer Vision -> CV: Image and video synthesis and generation

1542

Beyond Alignment: Blind Video Face Restoration via Parsing-Guided Temporal-Coherent Transformer

Kepeng Xu, Li Xu, Gang He, Wenxin Yu, Yunsong Li

Multiple complex degradations are coupled in low-quality video faces in the real world. Therefore, blind video face restoration is a highly challenging ill-posed problem, requiring not only hallucinating high-fidelity details but also enhancing temporal coherence across diverse pose variations. Restoring each frame independently in a naive manner inevitably introduces temporal incoherence and artifacts from pose changes and keypoint localization errors. To address this, we propose the first blind video face restoration approach with a novel parsing-guided temporal-coherent transformer (PGTFormer) without pre-alignment. PGTFormer leverages semantic parsing guidance to select optimal face priors for generating temporally coherent artifact-free results. Specifically, we pre-train a temporal-spatial vector quantized auto-encoder on high-quality video face datasets to extract expressive context-rich priors. Then, the temporal parse-guided codebook predictor (TPCP) restores faces in different poses based on face parsing context cues without performing face pre-alignment. This strategy reduces artifacts and mitigates jitter caused by cumulative errors from face pre-alignment. Finally, the temporal fidelity regulator (TFR) enhances fidelity through temporal feature interaction and improves video temporal consistency. Extensive experiments on face videos show that our method outperforms previous face restoration baselines. The code will be released on https://github.com/kepengxu/PGTFormer.

List of keywords

Computer Vision -> CV: Computational photography
Computer Vision -> CV: Image and video synthesis and generation
Computer Vision -> CV: Biometrics, face, gesture and pose recognition
Computer Vision -> CV: Applications

1546

Unified Evidence Enhancement Inference Framework for Fake News Detection

Lianwei Wu, Linyong Wang, Yongqiang Zhao

Current approaches to fake news detection are mainly devoted to extracting candidate evidence from comments (or external articles) and establishing interactive reasoning with the news itself to verify the falsehood of the news. However, they still have several drawbacks: 1) The interaction object is coarse-grained: the entire news item participates in the interaction, while the learning of potentially suspicious segments in the news is ignored; 2) The reasoning modes are relatively limited, making it difficult to explore the various possible correlations between news and candidate evidence. To this end, we propose a Unified Evidence Enhancement Inference framework (UEEI) to discover and infer high-quality evidence that reveals the false parts of news for detection. Specifically, UEEI first promotes interaction fusion between comments and news from the perspectives of semantics and emotion, thereby learning potentially suspicious fragments in news. Then, the model constructs entity-level and relationship-level retrievals to screen sufficient candidate evidence from external sources. Finally, we measure coherence between suspicious fragments and candidate evidence by multi-view reasoning, and further infer explainable evidence that discovers the false parts of news. Experiments on three public datasets confirm the effectiveness and interpretability of our UEEI.

List of keywords

Natural Language Processing -> NLP: Applications
Game Theory and Economic Paradigms -> GTEP: Computational social choice
Multidisciplinary Topics and Applications -> MTA: Social sciences

1551

An NCDE-based Framework for Universal Representation Learning of Time Series

Zihan Liu, Bowen Du, Junchen Ye, Xianqing Wen, Leilei Sun

Exploiting self-supervised learning (SSL) to extract the universal representations of time series could not only capture the natural properties of time series but also offer huge help to the downstream tasks. Nevertheless, existing time series representation learning (TSRL) methods face challenges in attaining universality. Indeed, existing methods relying solely on one SSL strategy (either contrastive learning (CL) or generative) often fall short in capturing rich semantic information for various downstream tasks. Moreover, time series exhibit diverse distributions and inherent characteristics, particularly with the common occurrence of missing values, posing a notable challenge for existing backbones in effectively handling such diverse time series data. To bridge these gaps, we propose CTRL, a framework for universal TSRL. For the first time, we employ Neural Controlled Differential Equation (NCDE) as the backbone for TSRL, which captures the continuous processes and exhibits robustness to missing data. Additionally, a dual-task SSL strategy, integrating both reconstruction and contrasting tasks, is proposed to enrich the semantic information of the learned representations. Furthermore, novel hard negative construction and false negative elimination mechanisms are proposed to improve sampling efficiency and reduce sampling bias in CL. Finally, extensive experiments demonstrate the superiority of CTRL in forecasting, classification, and imputation tasks, particularly its outstanding robustness to missing data.

List of keywords

Machine Learning -> ML: Representation learning
Machine Learning -> ML: Self-supervised Learning
Machine Learning -> ML: Time series and data streams

1564

CausVSR: Causality Inspired Visual Sentiment Recognition

Xinyue Zhang, Zhaoxia Wang, Hailing Wang, Jing Xiang, Chunwei Wu, Guitao Cao

Visual Sentiment Recognition (VSR) is an evolving field that aims to detect emotional tendencies within visual content. Despite its growing significance, detecting emotions depicted in visual content, such as images, faces challenges, notably the emergence of misleading or spurious correlations of the contextual information. In response to these challenges, we propose a causality inspired VSR approach, called CausVSR. CausVSR is rooted in the fundamental principles of Emotional Causality theory, mimicking the human process from receiving emotional stimuli to deriving emotional states. CausVSR takes a deliberate stride toward conquering the VSR challenges. It harnesses the power of a structural causal model, intricately designed to encapsulate the dynamic causal interplay between visual feature representation and their corresponding pseudo sentiment regions. This strategic approach allows for a deep exploration of contextual information, elevating the accuracy of emotional inference. Additionally, CausVSR utilizes a global category elicitation module, strategically employed to execute front-door adjustment techniques, effectively detecting and handling spurious correlations. Experiments, conducted on four widely-used datasets, demonstrate CausVSR’s superiority in enhancing emotion perception within VSR, surpassing existing methods.

List of keywords

Humans and AI -> HAI: Cognitive modeling
Computer Vision -> CV: Machine learning for vision
Computer Vision -> CV: Recognition (object detection, categorization)
Machine Learning -> ML: Deep learning architectures

1566

Modeling Selective Feature Attention for Lightweight Text Matching

Jianxiang Zang, Hui Liu

Representation-based Siamese networks have risen to popularity in lightweight text matching due to their low deployment and inference costs. While word-level attention mechanisms have been implemented within Siamese networks to improve performance, we propose Feature Attention (FA), a novel downstream block designed to enrich the modeling of dependencies among embedding features. Employing "squeeze-and-excitation" techniques, the FA block dynamically adjusts the emphasis on individual features, enabling the network to concentrate more on features that significantly contribute to the final classification. Building upon FA, we introduce a dynamic "selection" mechanism called Selective Feature Attention (SFA), which leverages a stacked BiGRU Inception structure. The SFA block facilitates multi-scale semantic extraction by traversing different stacked BiGRU layers, encouraging the network to selectively concentrate on semantic information and embedding features across varying levels of abstraction. Both the FA and SFA blocks offer a seamless integration capability with various Siamese networks, showcasing a plug-and-play characteristic. Experimental evaluations conducted across diverse text matching baselines and benchmarks underscore the indispensability of modeling feature attention and the superiority of the "selection" mechanism.
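A minimal sketch of a squeeze-and-excitation style gate over embedding features may help picture what "adjusting the emphasis on individual features" means. It is not the FA or SFA block from the paper: the bottleneck weights `w_down` and `w_up`, the mean-pooling squeeze, and the function name are assumptions of this example.

```python
import numpy as np

def feature_attention(x, w_down, w_up):
    """Squeeze-and-excitation style gating of embedding features.

    x: token embeddings of one sentence, shape (T, D).
    w_down: bottleneck weights, shape (D, D_r) with D_r < D.
    w_up: expansion weights, shape (D_r, D).
    Returns the re-weighted embeddings and the per-feature gates.
    """
    squeezed = x.mean(axis=0)                          # squeeze: (D,)
    hidden = np.maximum(squeezed @ w_down, 0.0)        # ReLU bottleneck: (D_r,)
    gates = 1.0 / (1.0 + np.exp(-(hidden @ w_up)))     # sigmoid gates: (D,)
    return x * gates, gates                            # scale every feature
```

In a trained model the two weight matrices would be learned jointly with the Siamese encoder, so that features useful for the matching decision receive gates close to 1.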

List of keywords

Natural Language Processing -> NLP: Natural language semantics
Machine Learning -> ML: Attention models
Machine Learning -> ML: Deep learning architectures
Natural Language Processing -> NLP: Embeddings

1570

By Fair Means or Foul: Quantifying Collusion in a Market Simulation with Deep Reinforcement Learning

Michael Schlechtinger, Damaris Kosack, Franz Krause, Heiko Paulheim

In the rapidly evolving landscape of eCommerce, Artificial Intelligence (AI) based pricing algorithms, particularly those utilizing Reinforcement Learning (RL), are becoming increasingly prevalent. This rise has led to an inextricable pricing situation with the potential for market collusion. Our research employs an experimental oligopoly model of repeated price competition, systematically varying the environment to cover scenarios from basic economic theory to subjective consumer demand preferences. We also introduce a novel demand framework that enables the implementation of various demand models, allowing for a weighted blending of different models. In contrast to existing research in this domain, we aim to investigate the strategies and emerging pricing patterns developed by the agents, which may lead to a collusive outcome. Furthermore, we investigate a scenario where agents cannot observe their competitors’ prices. Finally, we provide a comprehensive legal analysis across all scenarios. Our findings indicate that RL-based AI agents converge to a collusive state characterized by the charging of supracompetitive prices, without necessarily requiring inter-agent communication. Implementing alternative RL algorithms, altering the number of agents or simulation settings, and restricting the scope of the agents’ observation space do not significantly impact the collusive market outcome.

List of keywords

AI Ethics, Trust, Fairness -> ETF: AI and law, governance, regulation
Agent-based and Multi-agent Systems -> MAS: Agent-based simulation and emergence
Machine Learning -> ML: Reinforcement learning
Multidisciplinary Topics and Applications -> MTA: Economics

1578

Primal Grammars Driven Automated Induction

Adel Bouhoula, Miki Hermann

Automated induction is important for many computer science and artificial intelligence applications. However, proof by induction is undecidable and diverges even for small examples, leading to failures in the proving experience. Many techniques have proposed ad-hoc heuristics to speculate on additional lemmas that hopefully stop the divergence. Although these methods have succeeded in proving interesting theorems, they have significant limitations: in particular, they often fail to find appropriate lemmas, and the provided lemmas may not be valid. We present a new technique that allows us to perform inductive proofs in conditional theories by automatically detecting the divergence of proof traces and deriving a primal grammar as well as new lemmas that schematize the divergent sequences and thus allow breaking the divergence and ending the proof. Our new technique is presented as a set of inference rules whose soundness and refutational completeness have been formally proved. Refutational completeness is particularly useful for detecting flaws in critical systems. Moreover, unlike related work, our new technique has no risk of over-generalization. If the initial conjectures are valid, then the lemmas generated by our technique subsume the divergent sequence and are also valid. The cornerstone of our method is the use of primal grammars, which are based on primitive recursive functions and represent the most general decidable schematization, with respect to description power, among all known schematizations. Our technique always succeeds in building a primal grammar when the divergence follows a primitive recursive pattern; this allows us to cover a large class of problems. Our new technique has been fully implemented in C++ and successfully proved several dozen complex examples that fail with well-known theorem provers such as ACL2, Isabelle, PVS, RRL, SPIKE and LEAN as well as related techniques for capturing and schematizing divergence for proof by induction.

List of keywords

Knowledge Representation and Reasoning -> KRR: Automated reasoning and theorem proving

1593

DFMDA-Net: Dense Fusion and Multi-dimension Aggregation Network for Image Restoration

Huibin Yan, Shuoyao Wang

The U-shape (encoder-decoder) architecture, combined with effective blocks, has shown significant success in image restoration. In U-shape models, there is insufficient focus on the feature fusion problem between encoder and decoder features at the same level. Current methods often employ simplistic operations like summation or concatenation, which makes it difficult to strike a balance between performance and complexity. To address this issue, we propose a compression-in-the-middle mechanism, termed Integration-Compression-Integration (ICI), which effectively conducts dense fusion and avoids information loss. From the block design perspective, we design a multi-dimension aggregation (MDA) mechanism, capable of effectively aggregating features from both the channel and spatial dimension. Combining the Integration-Compression-Integration feature fusion and the multi-dimension aggregation, our dense fusion and multi-dimension aggregation network (DFMDA-Net) achieves superior performance over state-of-the-art algorithms on 16 benchmarking datasets for numerous image restoration tasks.

List of keywords

Computer Vision -> CV: Machine learning for vision
Computer Vision -> CV: Representation learning
Machine Learning -> ML: Attention models
Machine Learning -> ML: Convolutional networks

1601

Robust Losses for Decision-Focused Learning

Noah Schutte, Krzysztof Postek, Neil Yorke-Smith

Optimization models used to make discrete decisions often contain uncertain parameters that are context-dependent and estimated through prediction. To account for the quality of the decision made based on the prediction, decision-focused learning (end-to-end predict-then-optimize) aims at training the predictive model to minimize regret, i.e., the loss incurred by making a suboptimal decision. Despite the challenge of the gradient of this loss w.r.t. the predictive model parameters being zero almost everywhere for optimization problems with a linear objective, effective gradient-based learning approaches have been proposed to minimize the expected loss, using the empirical loss as a surrogate. However, empirical regret can be an ineffective surrogate because empirical optimal decisions can vary substantially from expected optimal decisions. To understand the impact of this deficiency, we evaluate the effect of aleatoric and epistemic uncertainty on the accuracy of empirical regret as a surrogate. Next, we propose three novel loss functions that approximate expected regret more robustly. Experimental results show that training two state-of-the-art decision-focused learning approaches using robust regret losses improves test-sample empirical regret in general while keeping computational time equivalent relative to the number of training epochs.
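To make the regret notion in this abstract concrete, here is a minimal sketch, assuming a toy decision problem in which exactly one of several options is chosen and the objective is its cost. The function name `regret` and the cost vectors are illustrative assumptions, not taken from the paper.

```python
def regret(predicted_costs, true_costs):
    """Extra true cost paid by acting on the prediction instead of the truth."""
    # Decision taken: the option that looks cheapest under the predicted costs.
    chosen = min(range(len(predicted_costs)), key=lambda i: predicted_costs[i])
    # Best achievable true cost in hindsight.
    best = min(true_costs)
    return true_costs[chosen] - best


true_costs = [4.0, 2.0, 7.0]
# Inaccurate predictions that still select the same option as the truth.
print(regret([3.9, 2.1, 6.0], true_costs))
# A single bad estimate that flips the decision to a worse option.
print(regret([1.0, 2.0, 7.0], true_costs))
```

In this toy setting the regret changes only when the predicted cheapest option changes, which illustrates why its gradient with respect to the predictions is zero almost everywhere.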

List of keywords

Machine Learning -> ML: Robustness
Constraint Satisfaction and Optimization -> CSO: Constraint optimization problems
Machine Learning -> ML: Regression
Machine Learning -> ML: Optimization

1614

Protecting Object Detection Models from Model Extraction Attack via Feature Space Coverage

Zeyu Li, Yuwen Pu, Xuhong Zhang, Yu Li, Jinbao Li, Shouling Ji

The model extraction attack is an attack pattern aimed at stealing well-trained machine learning models’ functionality or privacy information. With the gradual popularization of AI-related technologies in daily life, various well-trained models are being deployed. As a result, these models are considered valuable assets and attractive to model extraction attackers. Currently, the academic community primarily focuses on defense for model extraction attacks in the context of classification, with little attention to the more commonly used task scenario of object detection. Therefore, we propose a detection framework targeting model extraction attacks against object detection models in this paper. The framework first locates suspicious users based on feature coverage in query traffic and uses an active verification module to confirm whether the identified suspicious users are attackers. Through experiments conducted in multiple task scenarios, we validate the effectiveness and detection efficiency of the proposed method.

List of keywords

AI Ethics, Trust, Fairness -> ETF: Safety and robustness
Computer Vision -> CV: Recognition (object detection, categorization)

1625

Exploring Cross-Domain Few-Shot Classification via Frequency-Aware Prompting

Tiange Zhang, Qing Cai, Feng Gao, Lin Qi, Junyu Dong

Cross-Domain Few-Shot Learning (CD-FSL) has witnessed great strides with the development of meta-learning. However, most existing methods pay more attention to learning domain-adaptive inductive bias (meta-knowledge) through feature-wise manipulation or task diversity improvement while neglecting the phenomenon that deep networks tend to rely more on high-frequency cues to make the classification decision. This reliance degrades the robustness of the learned inductive bias, since high-frequency information is vulnerable and easily disturbed by noise. Hence in this paper, we make one of the first attempts to propose a Frequency-Aware Prompting method with mutual attention for Cross-Domain Few-Shot classification, which can let networks simulate the human visual perception of selecting different frequency cues when facing new recognition tasks. Specifically, a frequency-aware prompting mechanism is first proposed, in which high-frequency components of the decomposed source image are switched either with normal distribution sampling or zeroing to get frequency-aware augment samples. Then, a mutual attention module is designed to learn generalizable inductive bias under CD-FSL settings. More importantly, the proposed method is a plug-and-play module that can be directly applied to most off-the-shelf CD-FSL methods. Experimental results on CD-FSL benchmarks demonstrate the effectiveness of our proposed method and show that it robustly improves the performance of existing CD-FSL methods. Resources at https://github.com/tinkez/FAP_CDFSC.

List of keywords

Machine Learning -> ML: Classification
Machine Learning -> ML: Few-shot learning
Machine Learning -> ML: Meta-learning
Machine Learning -> ML: Multi-task and transfer learning

1627

Bridge to Non-Barrier Communication: Gloss-Prompted Fine-Grained Cued Speech Gesture Generation with Diffusion Model

Wentao Lei, Li Liu, Jun Wang

Cued Speech (CS) is an advanced visual phonetic encoding system that integrates lip reading with hand codings, enabling people with hearing impairments to communicate efficiently. CS video generation aims to produce specific lip and hand gesture movements of CS from audio or text inputs. The main challenge is that given limited CS data, it requires the simultaneous generation of lip-reading, as well as fine-grained hand movements, which are asynchronously aligned with lip movement. Previous work for CS gesture generation used template-based statistical methods and careful hand-crafted pre-processing to fit models. Therefore, existing methods are fragile and prone to poor performance. To solve the above challenge, we propose a novel Gloss-Prompted Diffusion-based CS Gesture generation framework (called GlossDiff), which leverages the power of the large language model and prompt engineering to automatically integrate additional linguistic rules into the model. Specifically, we innovatively introduce a bridging instruction called Gloss, which is a descriptive text to establish a direct semantic connection between spoken language and CS gestures. Besides, we propose for the first time that hand movements in CS should have a rhythm that matches the audio speech. In addition, we design, record and publish the first Chinese CS dataset with six CS cuers, including two hearing-impaired people. Extensive experiments on our datasets demonstrate that our method quantitatively and qualitatively outperforms current state-of-the-art (SOTA) methods. We will release code and data at anonymous.github.io.

List of keywords

Natural Language Processing -> NLP: Speech
Computer Vision -> CV: Applications

1631

Provable Acceleration of Nesterov’s Accelerated Gradient Method over Heavy Ball Method in Training Over-Parameterized Neural Networks

Xin Liu, Wei Tao, Wei Li, Dazhi Zhan, Jun Wang, Zhisong Pan

Due to its simplicity and efficiency, the first-order gradient method has been extensively employed in training neural networks. Although the optimization problem of the neural network is non-convex, recent research has proved that the first-order method is capable of attaining a global minimum when training over-parameterized neural networks, where the number of parameters is significantly larger than that of training instances. Momentum methods, including the heavy ball (HB) method and Nesterov’s accelerated gradient (NAG) method, are the workhorse of first-order gradient methods owing to their accelerated convergence. In practice, NAG often exhibits better performance than HB. However, current theoretical works fail to distinguish their convergence difference in training neural networks. To fill this gap, we consider the training problem of the two-layer ReLU neural network under over-parameterization and random initialization. Leveraging high-resolution dynamical systems and neural tangent kernel (NTK) theory, our result not only establishes tighter upper bounds of the convergence rate for both HB and NAG, but also provides the first theoretical guarantee for the acceleration of NAG over HB in training neural networks. Finally, we validate our theoretical results on three benchmark datasets.
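For readers unfamiliar with the two momentum methods compared here, the following is a minimal sketch of their textbook update rules on a scalar problem. It is not the paper's two-layer ReLU setting or its analysis. The function names, step size, momentum value and quadratic test objective are all illustrative assumptions.

```python
def heavy_ball(grad, x0, lr, beta, steps):
    """Heavy ball: gradient taken at the current iterate, plus a momentum term."""
    x_prev = x = float(x0)
    for _ in range(steps):
        x_next = x - lr * grad(x) + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x


def nesterov(grad, x0, lr, beta, steps):
    """NAG: gradient taken at the look-ahead point y rather than at x."""
    x_prev = x = float(x0)
    for _ in range(steps):
        y = x + beta * (x - x_prev)   # extrapolate along the last step
        x_next = y - lr * grad(y)     # gradient step from the look-ahead point
        x_prev, x = x, x_next
    return x


# Toy objective f(x) = 5 * x**2, so grad f(x) = 10 * x and the minimizer is 0.
grad = lambda x: 10.0 * x
print(heavy_ball(grad, 1.0, lr=0.05, beta=0.8, steps=500))
print(nesterov(grad, 1.0, lr=0.05, beta=0.8, steps=500))
```

The only difference between the two routines is where the gradient is evaluated, which is the distinction the convergence comparison in the abstract rests on.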

List of keywords

Machine Learning -> ML: Theory of deep learning
Machine Learning -> ML: Optimization

1635

Spatial-Temporal-Decoupled Masked Pre-training for Spatiotemporal Forecasting

Haotian Gao, Renhe Jiang, Zheng Dong, Jinliang Deng, Yuxin Ma, Xuan Song

Spatiotemporal forecasting techniques are significant for various domains such as transportation, energy, and weather. Accurate prediction of spatiotemporal series remains challenging due to the complex spatiotemporal heterogeneity. In particular, current end-to-end models are limited by input length and thus often fall into spatiotemporal mirage, i.e., similar input time series followed by dissimilar future values and vice versa. To address these problems, we propose a novel self-supervised pre-training framework Spatial-Temporal-Decoupled Masked Pre-training (STD-MAE) that employs two decoupled masked autoencoders to reconstruct spatiotemporal series along the spatial and temporal dimensions. Rich-context representations learned through such reconstruction could be seamlessly integrated by downstream predictors with arbitrary architectures to augment their performances. A series of quantitative and qualitative evaluations on four widely used benchmarks (PEMS03, PEMS04, PEMS07, and PEMS08) are conducted to validate the state-of-the-art performance of STD-MAE. Codes are available at https://github.com/Jimmy-7664/STD-MAE.

List of keywords

Machine Learning -> ML: Time series and data streams
Data Mining -> DM: Mining spatial and/or temporal data
Knowledge Representation and Reasoning -> KRR: Qualitative, geometric, spatial, and temporal reasoning

1645

Let’s Start Over: Retraining with Selective Samples for Generalized Category Discovery

Zhimao Peng, Enguang Wang, Xialei Liu, Ming-Ming Cheng

Generalized Category Discovery (GCD) presents a realistic and challenging problem in open-world learning. Given a partially labeled dataset, GCD aims to categorize unlabeled data by leveraging visual knowledge from the labeled data, where the unlabeled data includes both known and unknown classes. Existing methods based on parametric/non-parametric classifiers attempt to generate pseudo-labels/relationships for the unlabeled data to enhance representation learning. However, the lack of ground-truth labels for novel classes often leads to noisy pseudo-labels/relationships, resulting in suboptimal representation learning. This paper introduces a novel method using Nearest Neighbor Distance-aware Label Consistency sample selection. It creates class-consistent subsets for novel class sample clusters from the current GCD method, acting as “pseudo-labeled sets” to mitigate representation bias. We propose progressive supervised representation learning with selected samples to optimize the trade-off between quantity and purity in each subset. Our method is versatile and applicable to various GCD methods, whether parametric or non-parametric. We conducted extensive experiments on multiple generic and fine-grained image classification datasets to evaluate the effectiveness of our approach. The results demonstrate the superiority of our method in achieving improved performance in generalized category discovery tasks.

List of keywords

Machine Learning -> ML: Clustering
Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning
Machine Learning -> ML: Classification
Computer Vision -> CV: Recognition (object detection, categorization)

1653

Convexity Certificates for Symbolic Tensor Expressions

Paul G. Rump, Niklas Merk, Julien Klaus, Maurice Wenig, Joachim Giesen

Knowing that a function is convex ensures that any local minimum is also a global minimum. Here, we implement an approach to certify the convexity of twice-differentiable functions by certifying that their second-order derivative is positive semidefinite. Both the computation of the second-order derivative and the certification of positive semidefiniteness are done symbolically. Previous implementations of this approach assume that the function to be minimized takes scalar or vector inputs, meaning that the second-order derivative is at most a matrix. However, the input of many machine learning problems is naturally given in the form of matrices or higher order tensors, in which case the second-order derivative becomes a tensor of at least fourth order. The familiar linear algebra notations and known rules for determining whether a matrix is positive semidefinite are not sufficient to deal with these higher order expressions. Here, we present a formal language for tensor expressions that allows us to generalize semidefiniteness to higher-order tensors and thereby certify the convexity of a broader set of functions.
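As a reading aid, the convexity criterion described above can be written out as follows. This is the standard second-order condition for twice-differentiable functions on an open convex domain, followed by one natural way to extend it to matrix inputs. The symbols $H$ and $V$ are notation assumed here, and the paper's formal language may define the generalization differently.

```latex
% Vector input x \in \mathbb{R}^n: the second derivative is a matrix.
f \text{ convex} \iff v^\top \nabla^2 f(x)\, v \ge 0
\quad \forall x,\ \forall v \in \mathbb{R}^n

% Matrix input X \in \mathbb{R}^{m \times n}: the second derivative is a fourth-order tensor.
H_{ijkl}(X) = \frac{\partial^2 f}{\partial X_{ij}\, \partial X_{kl}}, \qquad
f \text{ convex} \iff \sum_{i,j,k,l} H_{ijkl}(X)\, V_{ij} V_{kl} \ge 0
\quad \forall X,\ \forall V \in \mathbb{R}^{m \times n}

% Worked example: f(X) = \|X\|_F^2 gives H_{ijkl} = 2\,\delta_{ik}\delta_{jl},
% so the quadratic form equals 2\,\|V\|_F^2 \ge 0 and f is convex.
```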

List of keywords

Constraint Satisfaction and Optimization -> CSO: Solvers and tools
Machine Learning -> ML: Optimization
Machine Learning -> ML: Symbolic methods

1688

Attribution Quality Metrics with Magnitude Alignment

Chase Walker, Dominic Simon, Kenny Chen, Rickard Ewetz

Attribution algorithms play an instrumental role in human interpretation of AI models. The methods measure the importance of the input features to the model output decision, which can be displayed as an attribution map for image classifiers. Perturbation tests are the state-of-the-art approach to evaluate the quality of an attribution map. Unfortunately, we observe that perturbation tests fail to consider attribution magnitude, which translates into inconsistent quality scores. In this paper, we propose Magnitude Aligned Scoring (MAS), a new attribution quality metric that measures the alignment between the magnitude of the attributions and the model response. In particular, the metric accounts for both the relative ordering and the magnitude of the pixels within an attribution. In the experimental evaluation, we compare the MAS metric with existing metrics across a wide range of models, datasets, attributions, and evaluations. The results demonstrate that the MAS metric is 4x more sensitive to attribution changes, 2x more consistent, and 1.6x more invariant to baseline modifications. Our code and the referenced appendix are publicly available via https://github.com/chasewalker26/Magnitude-Aligned-Scoring.

List of keywords

AI Ethics, Trust, Fairness -> ETF: Explainability and interpretability
AI Ethics, Trust, Fairness -> ETF: Trustworthy AI
Computer Vision -> CV: Interpretability and transparency
Machine Learning -> ML: Explainable/Interpretable machine learning

1700

MediTab: Scaling Medical Tabular Data Predictors via Data Consolidation, Enrichment, and Refinement

Zifeng Wang, Chufan Gao, Cao Xiao, Jimeng Sun

Tabular data prediction has been employed in medical applications such as patient health risk prediction. However, existing methods usually revolve around the algorithm design while overlooking the significance of data engineering. Medical tabular datasets frequently exhibit significant heterogeneity across different sources, with limited sample sizes per source. As such, previous predictors are often trained on manually curated small datasets that struggle to generalize across different tabular datasets during inference. This paper proposes to scale medical tabular data predictors (MediTab) to various tabular inputs with varying features. The method uses a data engine that leverages large language models (LLMs) to consolidate tabular samples to overcome the barrier across tables with distinct schema. It also aligns out-domain data with the target task using a “learn, annotate, and refinement” pipeline. The expanded training data then enables the pre-trained MediTab to infer for arbitrary tabular input in the domain without fine-tuning, resulting in significant improvements over supervised baselines: it reaches an average ranking of 1.57 and 1.00 on 7 patient outcome prediction datasets and 3 trial outcome prediction datasets, respectively. In addition, MediTab exhibits impressive zero-shot performances: it outperforms supervised XGBoost models by 8.9% and 17.2% on average in two prediction tasks, respectively.

List of keywords

Multidisciplinary Topics and Applications -> MTA: Health and medicine
Multidisciplinary Topics and Applications -> MTA: Bioinformatics
Multidisciplinary Topics and Applications -> MTA: Life sciences

1703

Towards Robust Multi-Label Learning against Dirty Label Noise

Yuhai Zhao, Yejiang Wang, Zhengkui Wang, Wen Shan, Miaomiao Huang, Meixia Wang, Min Huang, Xingwei Wang

In multi-label learning, one of the major challenges is that the data are associated with label noise, including random noisy labels (e.g., data encoding errors) and noisy labels created by annotators (e.g., missing, extra, or erroneous labels), where the noise follows different structures (e.g., Gaussian, sparse or subjective). Existing methods are tailored to handle noise with one specific structure. However, they do not consider the fact that, in real applications, the data always carry dirty noisy labels that are simultaneously Gaussian, sparse and subjective. In this paper, we formalize multi-label learning with dirty noise as a new learning problem, namely Noisy Multi-label Learning (NML). To solve the NML problem, we decompose a corrupted label matrix as the noise matrix plus a true label matrix (possibly high-rank). For the noise matrix, a mixed norm penalty is developed as a regularizer for the dirty noise distribution. Under this norm, the conditions required for exact noise recovery are provided theoretically. For the true label matrix that is not necessarily low-rank, we apply a non-linear mapping to ensure its low-rankness such that the high-order label correlation can be utilized. Experimental results show that the proposed method outperforms the state-of-the-art methods significantly.

List of keywords

Machine Learning -> ML: Multi-label learning
Machine Learning -> ML: Optimization
Machine Learning -> ML: Weakly supervised learning

1708

EMOTE: An Explainable Architecture for Modelling the Other through Empathy

Manisha Senadeera, Thommen Karimpanal George, Stephan Jacobs, Sunil Gupta, Santu Rana

Empathy allows us to assume others are like us and have goals analogous to our own. This can also at times be applied to multi-agent games, e.g. Agent 1’s attraction to green balls is analogous to Agent 2’s attraction to red balls. Drawing inspiration from empathy, we propose EMOTE, a simple and explainable inverse reinforcement learning (IRL) approach designed to model another agent’s action-value function and from it, infer a unique reward function. This is done by referencing the learning agent’s own action-value function, removing the need to maintain independent action-value estimates for the modelled agents whilst simultaneously addressing the ill-posed nature of IRL by inferring a unique reward function. We experiment on minigrid environments showing that EMOTE: (a) produces more consistent reward estimates relative to other IRL baselines, (b) is robust in scenarios with composite reward and action-value functions, and (c) produces human-interpretable states, helping to explain how the agent views other agents.

List of keywords

Machine Learning -> ML: Multiagent Reinforcement Learning
Agent-based and Multi-agent Systems -> MAS: Multi-agent learning
AI Ethics, Trust, Fairness -> ETF: Explainability and interpretability
Machine Learning -> ML: Reinforcement learning

1716

Towards Dynamic Trend Filtering through Trend Point Detection with Reinforcement Learning

Jihyeon Seong, Sekwang Oh, Jaesik Choi

Trend filtering simplifies complex time series data by applying smoothness to filter out noise while emphasizing proximity to the original data. However, existing trend filtering methods fail to reflect abrupt changes in the trend due to ‘approximateness,’ resulting in constant smoothness. This approximateness uniformly filters out the tail distribution of time series data, characterized by extreme values, including both abrupt changes and noise. In this paper, we propose Trend Point Detection formulated as a Markov Decision Process (MDP), a novel approach to identifying essential points that should be reflected in the trend, departing from approximations. We term these essential points Dynamic Trend Points (DTPs) and extract trends by interpolating them. To identify DTPs, we utilize Reinforcement Learning (RL) within a discrete action space and a forecasting sum-of-squares loss function as a reward, referred to as the Dynamic Trend Filtering network (DTF-net). DTF-net integrates flexible noise filtering, preserving critical original sub-sequences while removing noise as required for other sub-sequences. We demonstrate that DTF-net excels at capturing abrupt changes compared to other trend filtering algorithms and enhances forecasting performance, as abrupt changes are predicted rather than smoothed out.
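The idea of extracting a trend by interpolating selected points can be sketched as below. This covers only the interpolation step: the point indices are hand-picked here, whereas the abstract describes choosing them with an RL agent. The function name, the use of piecewise-linear interpolation and the forced inclusion of both endpoints are assumptions made for illustration.

```python
def interpolate_trend(series, point_indices):
    """Piecewise-linear trend passing through the selected points of `series`."""
    if len(series) < 2:
        return [float(v) for v in series]
    # Always keep both endpoints so every time step lies inside some segment.
    idx = sorted(set(point_indices) | {0, len(series) - 1})
    trend = [0.0] * len(series)
    for left, right in zip(idx, idx[1:]):
        span = right - left
        for t in range(left, right + 1):
            w = (t - left) / span
            trend[t] = (1 - w) * series[left] + w * series[right]
    return trend


# A noisy series with an abrupt jump between positions 3 and 4.
series = [0, 1, 0, 1, 10, 11, 10, 11]
# Selecting both sides of the jump keeps it sharp in the extracted trend.
print(interpolate_trend(series, [3, 4]))
```

Because the trend passes exactly through the selected points, an abrupt change is preserved whenever the points on either side of it are selected, while the segments in between are smoothed.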

List of keywords

Data Mining -> DM: Mining spatial and/or temporal data
Machine Learning -> ML: Reinforcement learning

1752

3DBench: A Scalable 3D Benchmark and Instruction-Tuning Dataset

Junjie Zhang, Tianci Hu, Xiaoshui Huang, Yongshun Gong, Dan Zeng

Evaluating the performance of Multi-modal Large Language Models (MLLMs), integrating both point cloud and language, presents significant challenges. The lack of a comprehensive assessment hampers determining whether these models truly represent advancements, thereby impeding further progress in the field. Current evaluations heavily rely on classification and caption tasks, falling short in providing a thorough assessment of MLLMs. A pressing need exists for a more sophisticated evaluation method capable of thoroughly analyzing the spatial understanding and expressive capabilities of these models. To address these issues, we introduce a scalable 3D benchmark, accompanied by a large-scale instruction-tuning dataset known as 3DBench, providing an extensible platform for a comprehensive evaluation of MLLMs. Specifically, we establish the benchmark that spans a wide range of spatial and semantic scales, from object-level to scene-level, addressing both perception and planning tasks. Furthermore, we present a rigorous pipeline for automatically constructing scalable 3D instruction-tuning datasets, covering 10 diverse multi-modal tasks with more than 0.23 million QA pairs generated in total. Thorough experiments evaluating trending MLLMs, comparisons against existing datasets, and variations of training protocols demonstrate the superiority of 3DBench, offering valuable insights into current limitations and potential research directions. Codes are available at https://github.com/Inshsang/3DBench.

List of keywords

Computer Vision -> CV: 3D computer vision
Computer Vision -> CV: Multimodal learning
Computer Vision -> CV: Scene analysis and understanding

1758

Self-adaptive Extreme Penalized Loss for Imbalanced Time Series Prediction

Yiyang Wang, Yuchen Han, Yuhan Guo

Forecasting time series in imbalanced data presents a significant research challenge that requires considerable attention. Although there are specialized techniques available to tackle imbalanced time series prediction, existing approaches tend to prioritize extreme predictions at the expense of the forecasting accuracy of normal samples. In this paper, we propose an extreme penalized loss function that relaxes the constraint on overestimating extreme events, while imposing large penalties on errors for normal events and on underestimated extreme events. In addition, we provide a self-adaptive way for setting the hyperparameters of the loss function. Then, both the proposed loss function and an attention module are integrated with LSTM networks in a decomposition-based framework. Extensive experiments conducted on real-world datasets demonstrate the superiority of our framework compared to other state-of-the-art approaches for both time series prediction and block maxima prediction tasks.

List of keywords

Machine Learning -> ML: Time series and data streams

1762

Enhancing Boundary Segmentation for Topological Accuracy with Skeleton-based Methods

Chuni Liu, Boyuan Ma, Xiaojuan Ban, Yujie Xie, Hao Wang, Weihua Xue, Jingchao Ma, Ke Xu

Topological consistency plays a crucial role in the task of boundary segmentation for reticular images, such as cell membrane segmentation in neuron electron microscopic images, grain boundary segmentation in material microscopic images and road segmentation in aerial images. In these fields, topological changes in segmentation results have a serious impact on the downstream tasks, which can even exceed the impact of misalignment of the boundary itself. To enhance the topological accuracy of segmentation results, we propose the Skea-Topo Aware loss, which is a novel loss function that takes into account the shape of each object and the topological significance of the pixels. It consists of two components. First, the skeleton-aware weighted loss improves the segmentation accuracy by better modeling the object geometry with skeletons. Second, a boundary rectified term effectively identifies and emphasizes topologically critical pixels in the prediction errors using both foreground and background skeletons in the ground truth and predictions. Experiments show that our method improves the topology consistency by 7 points in VI compared with 13 state-of-the-art methods on three different boundary segmentation datasets in objective and subjective assessments.

List of keywords

Computer Vision -> CV: Segmentation
Computer Vision -> CV: Biomedical image analysis

1768

Unified Single-Stage Transformer Network for Efficient RGB-T Tracking

Jianqiang Xia, Dianxi Shi, Ke Song, Linna Song, Xiaolei Wang, Songchang Jin, Chenran Zhao, Yu Cheng, Lei Jin, Zheng Zhu, Jianan Li, Gang Wang, Junliang Xing, Jian Zhao

Most existing RGB-T tracking networks extract modality features in a separate manner, which lacks interaction and mutual guidance between modalities. This limits the network’s ability to adapt to the diverse dual-modality appearances of targets and the dynamic relationships between the modalities. Additionally, the three-stage fusion tracking paradigm followed by these networks significantly restricts the tracking speed. To overcome these problems, we propose a unified single-stage Transformer RGB-T tracking network, namely USTrack, which unifies the above three stages into a single ViT (Vision Transformer) backbone through joint feature extraction, fusion and relation modeling. With this structure, the network can not only extract the fusion features of templates and search regions under the interaction of modalities, but also significantly improve tracking speed through the single-stage fusion tracking paradigm. Furthermore, we introduce a novel feature selection mechanism based on modality reliability to mitigate the influence of invalid modalities on prediction. Extensive experiments on three popular RGB-T tracking benchmarks demonstrate that our method achieves new state-of-the-art performance while maintaining the fastest inference speed of 84.2 FPS. In particular, MPR/MSR on the short-term and long-term subsets of the VTUAV dataset increased by 11.1%/11.7% and 11.3%/9.7%. Code is available at https://github.com/xiajianqiang/USTrack.

List of keywords

Computer Vision -> CV: Motion and tracking
Computer Vision -> CV: Multimodal learning
Computer Vision -> CV: Video analysis and understanding

1790

WPML3CP: Wasserstein Partial Multi-Label Learning with Dual Label Correlation Perspectives

Ximing Li, Yuanchao Dai, Bing Wang, Changchun Li, Renchu Guan, Fangming Gu, Jihong Ouyang

Partial multi-label learning (PMLL) refers to a weakly-supervised classification problem, where each instance is associated with a set of candidate labels that covers its ground-truth labels but also contains irrelevant ones. The current methodology of PMLL is to estimate the ground-truth confidences of candidate labels, i.e., the likelihood of a candidate label being a ground-truth one, and induce the multi-label predictor with them, rather than the candidate labels. In this paper, we aim to estimate precise ground-truth confidences by leveraging precise label correlations, which must themselves be estimated. To this end, we propose to capture label correlations from both measuring and modeling perspectives. Specifically, we measure the loss between ground-truth confidences and predictions by employing the Wasserstein distance involving label correlations, and form a label correlation-aware regularization to constrain predictive parameters. The two techniques are coupled to promote precise estimations of label correlations. Upon these ideas, we propose a novel PMLL method, namely Wasserstein Partial Multi-Label Learning with dual Label Correlation Perspectives (WPML3CP). We conduct extensive experiments on several benchmark datasets. Empirical results demonstrate that WPML3CP can outperform the existing PMLL baselines.

List of keywords

Machine Learning -> ML: Weakly supervised learning
Machine Learning -> ML: Classification
Machine Learning -> ML: Self-supervised Learning

1799

FedTAD: Topology-aware Data-free Knowledge Distillation for Subgraph Federated Learning

Yinlin Zhu, Xunkai Li, Zhengyu Wu, Di Wu, Miao Hu, Ronghua Li

Subgraph federated learning (subgraph-FL) is a new distributed paradigm that facilitates the collaborative training of graph neural networks (GNNs) by multi-client subgraphs. Unfortunately, a significant challenge of subgraph-FL arises from subgraph heterogeneity, which stems from node and topology variation and impairs the performance of the global GNN. Existing studies have not yet thoroughly investigated the impact mechanism of subgraph heterogeneity. To this end, we decouple node and topology variation, revealing that they correspond to differences in label distribution and structure homophily. Remarkably, these variations lead to significant differences in the class-wise knowledge reliability of multiple local GNNs, misguiding the model aggregation to varying degrees. Building on this insight, we propose topology-aware data-free knowledge distillation technology (FedTAD), enhancing reliable knowledge transfer from the local model to the global model. Extensive experiments on six public datasets consistently demonstrate the superiority of FedTAD over state-of-the-art baselines.

List of keywords

Machine Learning -> ML: Sequence and graph learning
Machine Learning -> ML: Classification
Machine Learning -> ML: Federated learning

1803

Enhancing Length Generalization for Attention Based Knowledge Tracing Models with Linear Biases

Xueyi Li, Youheng Bai, Teng Guo, Zitao Liu, Yaying Huang, Xiangyu Zhao, Feng Xia, Weiqi Luo, Jian Weng

Knowledge tracing (KT) is the task of predicting students’ future performance based on their historical learning interaction data. With the rapid advancement of attention mechanisms, many attention based KT models have been developed. However, existing attention based KT models exhibit performance drops as the number of student interactions increases beyond the number of interactions on which the KT models are trained. We refer to this as the length generalization of KT models. In this paper, we propose stableKT to enhance length generalization: it is able to learn from short sequences and maintain high prediction performance when generalizing to long sequences. Furthermore, we design a multi-head aggregation module to capture the complex relationships between questions and the corresponding knowledge components (KCs) by combining dot-product attention and hyperbolic attention. Experimental results on three public educational datasets show that our model exhibits robust capability of length generalization and outperforms all baseline models in terms of AUC. To encourage reproducible research, we make our data and code publicly available at https://pykt.org.

List of keywords

Multidisciplinary Topics and Applications -> MTA: Education
Humans and AI -> HAI: Computer-aided education

1810

InstructME: An Instruction Guided Music Edit Framework with Latent Diffusion Models

Bing Han, Junyu Dai, Weituo Hao, Xinyan He, Dong Guo, Jitong Chen, Yuxuan Wang, Yanmin Qian, Xuchen Song

Music editing primarily entails the modification of instrument tracks or remixing as a whole, which offers a novel reinterpretation of the original piece through a series of operations. These music processing methods hold immense potential across various applications but demand substantial expertise. Prior methodologies, although effective for image and audio modifications, falter when directly applied to music. This is attributed to music’s distinctive data nature, where such methods can inadvertently compromise the intrinsic harmony and coherence of music. In this paper, we develop InstructME, an Instruction guided Music Editing and remixing framework based on latent diffusion models. Our framework fortifies the U-Net with multi-scale aggregation in order to maintain consistency before and after editing. In addition, we introduce a chord progression matrix as condition information and incorporate it in the semantic space to improve melodic harmony while editing. For accommodating extended musical pieces, InstructME employs a chunk transformer, enabling it to discern long-term temporal dependencies within music sequences. We tested InstructME in instrument-editing, remixing, and multi-round editing. Both subjective and objective evaluations indicate that our proposed method significantly surpasses preceding systems in music quality, text relevance and harmony. Demo samples are available at https://musicedit.github.io

List of keywords

Multidisciplinary Topics and Applications -> MTA: Arts and creativity
Multidisciplinary Topics and Applications -> MTA: Other

1819

Improving Adversarial Robustness via Feature Pattern Consistency Constraint

Jiacong Hu, Jingwen Ye, Zunlei Feng, Jiazhen Yang, Shunyu Liu, Xiaotian Yu, Lingxiang Jia, Mingli Song

Convolutional Neural Networks (CNNs) are well-known for their vulnerability to adversarial attacks, posing significant security concerns. In response to these threats, various defense methods have emerged to bolster the model’s robustness. However, most existing methods either focus on learning from adversarial perturbations, leading to overfitting to the adversarial examples, or aim to eliminate such perturbations during inference, inevitably increasing computational burdens. Conversely, clean training, which strengthens the model’s robustness by relying solely on clean examples, can address the aforementioned issues. In this paper, we align with this methodological stream and enhance its generalizability to unknown adversarial examples. This enhancement is achieved by scrutinizing the behavior of latent features within the network. Recognizing that a correct prediction relies on the correctness of the latent feature’s pattern, we introduce a novel and effective Feature Pattern Consistency Constraint (FPCC) method to reinforce the latent feature’s capacity to maintain the correct feature pattern. Specifically, we propose Spatial-wise Feature Modification and Channel-wise Feature Selection to enhance latent features. Subsequently, we employ the Pattern Consistency Loss to constrain the similarity between the feature pattern of the latent features and the correct feature pattern. Our experiments demonstrate that the FPCC method empowers latent features to uphold correct feature patterns even in the face of adversarial examples, resulting in inherent adversarial robustness surpassing state-of-the-art models.

List of keywords

Computer Vision -> CV: Adversarial learning, adversarial attack and defense methods
Computer Vision -> CV: Recognition (object detection, categorization)
Machine Learning -> ML: Classification

1823

AMO-aware Aggregates in Answer Set Programming

Mario Alviano, Carmine Dodaro, Salvatore Fiorentino, Marco Maratea

Aggregates such as sum and count are among the most frequently used linguistic extensions of Answer Set Programming (ASP). At-most-one (AMO) constraints are a specific form of aggregates that excludes the simultaneous truth of multiple elements in a set. This article unleashes a powerful propagation strategy in case groups of elements in an aggregate are also involved in AMO constraints. In fact, the combined knowledge given by aggregates and AMO constraints significantly increases the effectiveness of search space pruning, resulting in considerable performance gains.

List of keywords

Knowledge Representation and Reasoning -> KRR: Logic programming
Knowledge Representation and Reasoning -> KRR: Non-monotonic reasoning

1828

Dynamic Brightness Adaptation for Robust Multi-modal Image Fusion

Yiming Sun, Bing Cao, Pengfei Zhu, Qinghua Hu

Infrared and visible image fusion aims to combine the advantageous features of different modalities to generate visually appealing and informative images. In real-world scenarios, visible imaging is vulnerable to dynamic fluctuations in environmental brightness, which leads to texture degradation caused by over-brightness or darkness. Unfortunately, existing fusion methods struggle to achieve robust fusion under dynamic brightness disturbances, causing the fusion results to be inevitably influenced by fluctuations in brightness. This greatly diminishes the visual fidelity of the fused images. To tackle this challenge, we propose a Brightness Adaptive multimodal dynamic fusion framework (BA-Fusion) that achieves robust image fusion even in the presence of dynamic brightness fluctuations. Specifically, we develop a Brightness Adaptive Gate (BAG) module designed to dynamically select features from brightness-related channels for brightness normalization, while preserving brightness-independent structural feature information in the source images. We also propose a brightness consistency loss function to optimize the BAG module. The whole framework is optimized by alternating training strategies. Extensive experiments demonstrate that our method outperforms state-of-the-art comparison methods in preserving rich multi-modal image information and visual fidelity, and our model also exhibits the most robust performance under varying levels of brightness.

List of keywords

Computer Vision -> CV: Applications
Computer Vision -> CV: Multimodal learning

1862

Concentration Tail-Bound Analysis of Coevolutionary and Bandit Learning Algorithms

Shishen Lin, Per Kristian Lehre

Runtime analysis, as a branch of the theory of AI, studies how the number of iterations an algorithm takes before finding a solution (its runtime) depends on the design of the algorithm and the problem structure. Drift analysis is a state-of-the-art tool for estimating the runtime of randomised algorithms, such as bandit and evolutionary algorithms. Drift refers roughly to the expected progress towards the optimum per iteration. This paper considers the problem of deriving concentration tail-bounds on the runtime of algorithms. It provides a novel drift theorem that gives precise exponential tail-bounds given positive, weak, zero and even negative drift. Previously, such exponential tail-bounds were missing in the case of weak, zero, or negative drift. Our drift theorem can be used to prove a strong concentration of the runtime/regret of algorithms in AI. For example, we prove that the regret of the RWAB bandit algorithm is highly concentrated, while previous analyses only considered the expected regret. This means that the algorithm obtains the optimum within a given time frame with high probability, i.e. a form of algorithm reliability. Moreover, our theorem implies that the time needed by the co-evolutionary algorithm RLS-PD to obtain a Nash equilibrium in a Bilinear max-min-benchmark problem is highly concentrated. However, we also prove that the algorithm forgets the Nash equilibrium, and the time until this occurs is highly concentrated. This highlights a weakness in RLS-PD which should be addressed by future work.

List of keywords

Search -> S: Evolutionary computation
Search -> S: Heuristic search
Search -> S: Other

1872

FactCHD: Benchmarking Fact-Conflicting Hallucination Detection

Xiang Chen, Duanzheng Song, Gui Honghao, Wang Chenxi, Ningyu Zhang, Yong Jiang, Fei Huang, Chengfei Lyu, Zhang Dan, Huajun Chen

Despite their impressive generative capabilities, LLMs are hindered by fact-conflicting hallucinations in real-world applications. The accurate identification of hallucinations in texts generated by LLMs, especially in complex inferential scenarios, is a relatively unexplored area. To address this gap, we present FactCHD, a dedicated benchmark designed for the detection of fact-conflicting hallucinations from LLMs. FactCHD features a diverse dataset that spans various factuality patterns, including vanilla, multi-hop, comparison, and set operation. A distinctive element of FactCHD is its integration of fact-based evidence chains, significantly enhancing the depth of evaluating the detectors’ explanations. Experiments on different LLMs expose the shortcomings of current approaches in detecting factual errors accurately. Furthermore, we introduce TRUTH-TRIANGULATOR which synthesizes reflective considerations by tool-enhanced ChatGPT and LoRA-tuning based on Llama2, aiming to yield more credible detection through the amalgamation of predictive results and evidence.

List of keywords

Natural Language Processing -> NLP: Resources and evaluation
Natural Language Processing -> NLP: Applications

1883

Two-stage Semi-supervised Speaker Recognition with Gated Label Learning

Xingmei Wang, Jiaxiang Meng, Kong Aik Lee, Boquan Li, Jinghan Liu

Speaker recognition technologies have been successfully applied in diverse domains, benefiting from the advance of deep learning. Nevertheless, current efforts are still subject to the lack of labeled data. Such issues have been tackled in computer vision through semi-supervised learning (SSL), which assigns pseudo labels to unlabeled data so that they undertake the role of labeled ones. In our empirical evaluations, the state-of-the-art SSL methods show unsatisfactory performance in speaker recognition tasks, due to the imbalance between the quantity and quality of pseudo labels. Therefore, in this work, we propose a two-stage SSL framework, with the aim to address the data scarcity challenge. We first construct an initial contrastive learning network, where the encoder outputs the embedding representation of utterances. Furthermore, we construct an iterative holistic semi-supervised learning network that involves a clustering strategy to assign pseudo labels, and a gated label learning (GLL) strategy to further select reliable pseudo-label data. Systematic evaluations show that our proposed framework achieves better performance in speaker recognition than the state-of-the-art methods, matching the performance of supervised learning.

List of keywords

Natural Language Processing -> NLP: Speech
Machine Learning -> ML: Semi-supervised learning

1886

A Fast Algorithm for MaxSAT above Half Number of Clauses

Junqiang Peng, Mingyu Xiao

We study the following parameterization of the MaxSAT problem: Given a CNF formula $\mathcal{F}$ with $m$ clauses, decide whether at least $m/2 + \mu$ clauses in $\mathcal{F}$ can be satisfied, where $\mu$ is the excess of the number of satisfied clauses over the trivial lower bound $m/2$ and is taken as the parameter. This perspective is known as the “above guarantee” parameterization. Since its introduction by Mahajan and Raman [1999], the analysis of parameterization above guarantee has become a highly active and fruitful line of research. In this paper, we develop a new algorithm with runtime $O^*(2.1479^\mu)$, significantly improving the previous best upper bound $O^*(5.4064^\mu)$ for this important problem. Here, the $O^*$ notation omits polynomial factors.
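The trivial lower bound $m/2$ mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm; it only shows why half of the clauses can always be satisfied, assuming every clause is non-empty. The signed-integer clause encoding and the function names are illustrative choices, not taken from the paper.

```python
from typing import Dict, List

Clause = List[int]  # signed literals: +v is variable v, -v is its negation


def count_satisfied(clauses: List[Clause], assignment: Dict[int, bool]) -> int:
    """Number of clauses with at least one true literal under `assignment`."""
    return sum(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )


def half_guarantee_assignment(clauses: List[Clause], num_vars: int) -> Dict[int, bool]:
    """Return whichever of all-true / all-false satisfies more clauses.

    A non-empty clause falsified by one assignment has every literal true
    under the complementary assignment, so the two counts sum to at least m
    and the better of the two satisfies at least m/2 clauses.
    """
    all_true = {v: True for v in range(1, num_vars + 1)}
    all_false = {v: False for v in range(1, num_vars + 1)}
    return max((all_true, all_false), key=lambda a: count_satisfied(clauses, a))


# Hypothetical 5-clause formula over variables 1..3.
example = [[1, -2], [-1], [2, 3], [-3, -1], [-2]]
best = half_guarantee_assignment(example, num_vars=3)
# Excess of this particular assignment over m/2 (not necessarily the optimum).
excess = count_satisfied(example, best) - len(example) / 2
```

The parameter $\mu$ measures how far beyond this guarantee a formula can be pushed, which is why the running time is stated in terms of $\mu$ rather than $m$.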

List of keywords

Constraint Satisfaction and Optimization -> CSO: Satisfiability
Constraint Satisfaction and Optimization -> CSO: Constraint optimization problems
Constraint Satisfaction and Optimization -> CSO: Constraint satisfaction
Search -> S: Combinatorial search and optimisation

1888

InstructEdit: Instruction-Based Knowledge Editing for Large Language Models

Ningyu Zhang, Bozhong Tian, Siyuan Cheng, Xiaozhuan Liang, Yi Hu, Kouying Xue, Yanjie Gou, Xi Chen, Huajun Chen

Knowledge editing for large language models can offer an efficient solution to alter a model’s behavior without negatively impacting the overall performance. However, current knowledge editing approaches encounter issues with limited generalizability across tasks, necessitating a distinct editor for each task, which significantly hinders broader applications. To address this, we take the first step to analyze the task generalization issue in knowledge editing. Specifically, we develop an instruction-based editing technique, termed InstructEdit, which facilitates the editor’s adaptation to various task performances simultaneously using simple instructions. With only one unified editor for each LLM, we empirically demonstrate that InstructEdit can improve the editor’s control, leading to an average 14.86% increase in Reliability in the multi-task editing setting. Furthermore, experiments involving unseen tasks show that InstructEdit consistently surpasses previous baselines. To further investigate the underlying mechanisms of instruction-based knowledge editing, we analyze the principal components of the gradient directions, which unveils that instructions can help control optimization direction with stronger OOD generalization. Code and datasets will be released for future research.

List of keywords

Natural Language Processing -> NLP: Language models
Natural Language Processing -> NLP: Applications

1919

Subgraph Pooling: Tackling Negative Transfer on Graphs

Zehong Wang, Zheyuan Zhang, Chuxu Zhang, Yanfang Ye

Transfer learning aims to enhance performance on a target task by using knowledge from related tasks. However, when the source and target tasks are not closely aligned, it can lead to reduced performance, known as negative transfer. Unlike in image or text data, we find that negative transfer could commonly occur in graph-structured data, even when source and target graphs have semantic similarities. Specifically, we identify that structural differences significantly amplify the dissimilarities in the node embeddings across graphs. To mitigate this, we bring a new insight in this paper: for semantically similar graphs, although structural differences lead to significant distribution shift in node embeddings, their impact on subgraph embeddings could be marginal. Building on this insight, we introduce Subgraph Pooling (SP) by aggregating nodes sampled from a k-hop neighborhood and Subgraph Pooling++ (SP++) by a random walk, to mitigate the impact of graph structural differences on knowledge transfer. We theoretically analyze the role of SP in reducing graph discrepancy and conduct extensive experiments to evaluate its superiority under various settings. The proposed SP methods are effective yet elegant, which can be easily applied on top of any backbone Graph Neural Networks (GNNs). Our code and data are available at: https://github.com/Zehong-Wang/Subgraph-Pooling.

List of keywords

Machine Learning -> ML: Sequence and graph learning
Data Mining -> DM: Mining graphs
Machine Learning -> ML: Multi-task and transfer learning
Machine Learning -> ML: Semi-supervised learning

1920

Game Transformations That Preserve Nash Equilibria or Best Response Sets

Emanuel Tewolde, Vincent Conitzer

In this paper, we investigate under which conditions normal-form games are (guaranteed to be) strategically equivalent. First, we show for N-player games (N >= 3) that (A) it is NP-hard to decide whether a strategy constitutes a best response to some strategy profile of the opponents, and that (B) it is co-NP-hard to decide whether two games have the same best response sets. Combining that with known results from the literature, we move our attention to equivalence-preserving game transformations. It is a widely used fact that a positive affine (linear) transformation of the utility payoffs changes neither the best response sets nor the Nash equilibrium set. We investigate which other game transformations also possess either of the two properties when being applied to an arbitrary N-player game (N >= 2): (i) The Nash equilibrium set stays the same. (ii) The best response sets stay the same. For game transformations that operate player-wise and strategy-wise, we prove that (i) implies (ii) and that transformations with property (ii) must be positive affine. The resulting equivalence chain highlights the special status of positive affine transformations among all the transformation procedures that preserve key game-theoretic characteristics.
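The widely used fact cited in the abstract, that a positive affine transformation leaves best response sets unchanged, can be checked on a small example. The sketch below is an illustration only, assuming one player's payoff matrix in a two-player game and a constant shift; the payoff values and function names are hypothetical and do not come from the paper. Exact rational arithmetic is used so that ties between strategies are compared without rounding error.

```python
from fractions import Fraction
from typing import List, Set

Payoffs = List[List[Fraction]]  # payoffs[i][j]: row player's utility for (row i, column j)


def pure_best_responses(payoffs: Payoffs, opponent_mix: List[Fraction]) -> Set[int]:
    """Row indices maximising expected utility against the column player's mix."""
    expected = [sum(p * u for p, u in zip(opponent_mix, row)) for row in payoffs]
    best = max(expected)
    return {i for i, value in enumerate(expected) if value == best}


def affine(payoffs: Payoffs, scale: Fraction, shift: Fraction) -> Payoffs:
    """Apply u -> scale * u + shift to every payoff entry."""
    return [[scale * u + shift for u in row] for row in payoffs]


# Hypothetical 3x2 payoff matrix for the row player.
game = [[Fraction(3), Fraction(0)],
        [Fraction(1), Fraction(2)],
        [Fraction(2), Fraction(1)]]
mix = [Fraction(1, 4), Fraction(3, 4)]

original = pure_best_responses(game, mix)
rescaled = pure_best_responses(affine(game, Fraction(5), Fraction(-7)), mix)  # scale > 0
flipped = pure_best_responses(affine(game, Fraction(-1), Fraction(0)), mix)   # scale < 0
```

Here `original` and `rescaled` coincide, while the negative scale in `flipped` reverses the ranking of the rows, which is why positivity of the scale matters.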

List of keywords

Game Theory and Economic Paradigms -> GTEP: Noncooperative games

1944

Eliminating the Cross-Domain Misalignment in Text-guided Image Inpainting

Muqi Huang, Chaoyue Wang, Yong Luo, Lefei Zhang

Text-guided image inpainting has rapidly garnered prominence as a task in user-directed image synthesis, aiming to complete the occluded image regions following the textual prompt provided. However, current methods usually grapple with issues arising from the disparity between low-level pixel data and high-level semantic descriptions, which results in inpainted sections not harmonizing with the original image (either structurally or texturally). In this study, we introduce a Structure-Aware Inpainting Learning scheme and an Asymmetric Cross Domain Attention to address these cross-domain misalignment challenges. The proposed structure-aware learning scheme employs features of an intermediate modality as structure guidance to bridge the gap between text information and low-level pixels. Meanwhile, asymmetric cross-domain attention enhances the texture consistency between inpainted and unmasked regions. Our experiments show exceptional performance on leading datasets such as MS-COCO and Open Images, surpassing state-of-the-art text-guided image inpainting methods.

List of keywords

Computer Vision -> CV: Image and video synthesis and generation
Computer Vision -> CV: Vision, language and reasoning

1946

Fraud Risk Mitigation in Real-Time Payments: A Strategic Agent-Based Analysis

Katherine Mayo, Nicholas Grabill, Michael P. Wellman

Whereas standard financial mechanisms for payment may take days to finalize, real-time payments (RTPs) provide immediate processing and final receipt of funds. The speed of settlement benefits customers, but raises vulnerability to fraud. We seek to understand how bank nodes may strategically mitigate fraud risk in RTPs, through investment in fraud detection and restricting payments eligible for real-time processing. To study this, we introduce an agent-based model of the payment network supporting both real-time and standard payments, and define a game among banks and fraudsters. Using empirical game-theoretic analysis, we identify Nash equilibria in nine game configurations defined by network attributes. Our analysis finds that as banks become more liable for fraud, they continue to allow RTPs but are more likely to employ both restrictions and a high level of fraud detection. Fraudsters, in response, switch from targeting only RTPs to attempting fraud with any type of payment and tend to exploit banks where they have historically been most successful. We also conduct a strategic feature gains assessment to further understand the benefit offered by each of the bank’s risk mitigation measures, which confirms the importance of selective RTP restrictions. Finally, we find that in equilibrium bank strategic decisions negatively affect fraudsters while minimally impacting customers.

List of keywords

Agent-based and Multi-agent Systems -> MAS: Applications
Multidisciplinary Topics and Applications -> MTA: Economics
Multidisciplinary Topics and Applications -> MTA: Finance

1949

Concept-Level Causal Explanation Method for Brain Function Network Classification

Jinduo Liu, Feipeng Wang, Junzhong Ji

Using deep models to classify brain functional networks (BFNs) for the auxiliary diagnosis and treatment of brain diseases has become increasingly popular. However, the unexplainability of deep models has seriously hindered their applications in computer-aided diagnosis. In addition, current explanation methods mostly focus on natural images, which cannot be directly used to explain the deep model for BFN classification. In this paper, we propose a concept-level causal explanation method for BFN classification called CLCEM. First, CLCEM employs the causal learning method to extract concepts that are meaningful to humans from BFNs. Second, it aggregates the same concepts to obtain the contribution of each concept to the model output. Finally, CLCEM adds the contribution of each concept to make a diagnosis. The experimental results show that our CLCEM can not only accurately identify brain regions related to specific brain diseases but also make decisions based on the concepts of these brain regions, which enables humans to understand the decision-making process without performance degradation.

List of keywords

Humans and AI -> HAI: Brain sciences
AI Ethics, Trust, Fairness -> ETF: Explainability and interpretability
Data Mining -> DM: Networks
Machine Learning -> ML: Causality

1951

Domain-Hierarchy Adaptation via Chain of Iterative Reasoning for Few-shot Hierarchical Text Classification

Ke Ji, Peng Wang, Wenjun Ke, Guozheng Li, Jiajun Liu, Jingsheng Gao, Ziyu Shang

Recently, various pre-trained language models (PLMs) have been proposed and have shown impressive performance on a wide range of few-shot tasks. However, limited by the unstructured prior knowledge in PLMs, it is difficult to maintain consistent performance on complex hierarchically dependent tasks, especially when the downstream data is extremely scarce. The main challenge is how to transfer the unstructured semantic space in PLMs to the downstream domain hierarchy. Unlike previous work on hierarchical text classification (HTC) which directly performs multi-label classification or uses graph neural network (GNN) to inject label hierarchy, in this work, we study the HTC problem under a few-shot setting to adapt knowledge in PLMs from an unstructured manner to the downstream hierarchy. Technically, we design a simple yet effective method named Hierarchical Iterative Conditional Random Field (HierICRF) to search the most domain-challenging directions and craft domain-hierarchy adaptation as a hierarchical iterative language modeling problem. It then encourages the model to make hierarchical consistency self-correction during inference, thereby achieving knowledge transfer with hierarchical consistency preservation. We perform HierICRF on various architectures, and extensive experiments on two popular HTC datasets demonstrate that prompt with HierICRF significantly boosts the few-shot HTC performance with an average Micro-F1 by 28.80% to 1.50% and Macro-F1 by 36.29% to 1.5% over the previous state-of-the-art (SOTA) baselines under few-shot settings (1->16), while retaining SOTA hierarchical consistency performance.

List of keywords

Natural Language Processing -> NLP: Applications
Natural Language Processing -> NLP: Text classification

1958

Feedback-Based Adaptive Crossover-Rate in Evolutionary Computation

Xiaoyuan Guan, Yang TianYi, Chunliang Zhao, Yuren Zhou

We propose a novel approach to improve multi-objective evolutionary algorithms by modifying crossover operations. Our approach uses a modifiable cross distribution and virtual point to rebalance the probability distribution of all crossover options. This design reduces runtime for typical pseudo-Boolean functions. Experiments and analysis show our approach effectively optimizes bi-objective problems COCZ and LOTZ in Θ(n) time during crossover, outperforming conventional crossover multi-objective evolutionary algorithms (C-MOEA) which require O(n log n) steps. For the tri-objective problem Hierarchical-COCZ, our approach guarantees an expected runtime of Θ(n^2 log n), while C-MOEA needs at least Ω(n^2 log n) and at most O(n^2 log^2 n) steps.
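For readers unfamiliar with the benchmarks, the sketch below shows LOTZ (LeadingOnes, TrailingZeros) as it is commonly defined in the runtime-analysis literature; the abstract itself does not spell the definition out, so treat this as an assumption about the standard formulation rather than the paper's exact setup. Both objectives are maximised, and the function names are illustrative.

```python
from typing import List, Sequence, Tuple


def leading_ones(bits: Sequence[int]) -> int:
    """Length of the longest prefix consisting only of 1s."""
    count = 0
    for bit in bits:
        if bit != 1:
            break
        count += 1
    return count


def trailing_zeros(bits: Sequence[int]) -> int:
    """Length of the longest suffix consisting only of 0s."""
    count = 0
    for bit in reversed(bits):
        if bit != 0:
            break
        count += 1
    return count


def lotz(bits: Sequence[int]) -> Tuple[int, int]:
    """Bi-objective value (LeadingOnes, TrailingZeros) of a bit string."""
    return leading_ones(bits), trailing_zeros(bits)


def pareto_front(n: int) -> List[Tuple[int, int]]:
    """Objective vectors of the strings 1^i 0^(n-i), which form the Pareto front."""
    return [lotz([1] * i + [0] * (n - i)) for i in range(n + 1)]
```

The two objectives conflict because extending the prefix of ones shortens the room for trailing zeros, so the front contains n + 1 mutually incomparable points of the form (i, n - i).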

List of keywords

Search -> S: Evolutionary computation
Machine Learning -> ML: Evolutionary learning

1959

Cross-View Diversity Embedded Consensus Learning for Multi-View Clustering

Chong Peng, Kai Zhang, Yongyong Chen, Chenglizhao Chen, Qiang Cheng

Multi-view clustering (MVC) has garnered significant attention in recent studies. In this paper, we propose a novel MVC method, named CCL-MVC. The novel method constructs a cross-order neighbor tensor of multi-view data to recover a low-rank essential tensor, preserving noise-free, comprehensive, and complementary cross-order relationships among the samples. Furthermore, it constructs a consensus representation matrix by fusing the low-rank essential tensor with auto-adjusted cross-view diversity embedding, fully exploiting both consensus and discriminative information of the data. An effective optimization algorithm is developed, which is theoretically guaranteed to converge. Extensive experimental results confirm the effectiveness of the proposed method.

List of keywords

Machine Learning -> ML: Multi-view learning
Machine Learning -> ML: Clustering

1967

On the Logic of Theory Change Iteration of KM-Update, Revised

Liangda Fang, Tong Zhu, Quanlong Guan, Junming Qiu, Zhao-Rong Lai, Weiqi Luo, Hai Wan

Belief revision and update, two significant types of belief change, both focus on how an agent modifies her beliefs in the presence of new information. The most striking difference between them is that the former studies the change of beliefs in a static world while the latter concentrates on a dynamically-changing world. The famous AGM and KM postulates were proposed to capture rational belief revision and update, respectively. However, both of them are too permissive to exclude some unreasonable changes in the iteration. In response to this weakness, the DP postulates and their extensions for iterated belief revision were presented. Furthermore, Ferme and Goncalves integrated these postulates in belief update. Unfortunately, some redundant components are included in the definitions of belief states and the faithful assignments for semantic characterizations. Moreover, their approach does not meet the desired property of iterated belief update. They also do not discuss the rationale of any DP postulate within the update context. This paper is intended to fix these deficiencies of Ferme and Goncalves’s approach. Firstly, we present a modification of the original KM postulates based on belief states, and propose the notion of faithful collective assignments of belief states to partial preorders. Subsequently, we migrate several well-known postulates for iterated belief revision to iterated belief update. Moreover, we provide the exact semantic characterizations based on partial preorders for each of the proposed postulates. Finally, we analyze the compatibility between the above iterated postulates and the KM postulates for belief update.

List of keywords

Knowledge Representation and Reasoning -> KRR: Belief change
Knowledge Representation and Reasoning -> KRR: Reasoning about knowledge and belief

1975

PACIA: Parameter-Efficient Adapter for Few-Shot Molecular Property Prediction

Shiguang Wu, Yaqing Wang, Quanming Yao

Molecular property prediction (MPP) plays a crucial role in biomedical applications, but it often encounters challenges due to a scarcity of labeled data. Existing works commonly adopt a gradient-based strategy to update a large number of parameters for task-level adaptation. However, the increase of adaptive parameters can lead to overfitting and poor performance. Observing that a graph neural network (GNN) performs well as both encoder and predictor, we propose PACIA, a parameter-efficient GNN adapter for few-shot MPP. We design a unified adapter to generate a few adaptive parameters to modulate the message passing process of the GNN. We then adopt a hierarchical adaptation mechanism to adapt the encoder at task-level and the predictor at query-level by the unified GNN adapter. Extensive results show that PACIA obtains state-of-the-art performance in few-shot MPP problems, and our proposed hierarchical adaptation mechanism is rational and effective.

List of keywords

Machine Learning -> ML: Applications
Machine Learning -> ML: Few-shot learning

1982

Instance-Level Metalearning for Outlier Detection

Long Vu, Peter Kirchner, Charu C. Aggarwal, Horst Samulowitz

A machine learning task can be viewed as a sequential pipeline of different algorithmic choices, including data preprocessing, model selection, and hyper-parameter tuning. Automated machine learning selects this sequence in an automated manner. While such approaches are natural in supervised settings, they remain challenging for unsupervised tasks such as outlier detection because of the lack of availability of label-centric feedback. In this paper, we present an instance-level metalearning approach for outlier detection. This approach learns how outlier instances are related to normal points in many labeled data sets to create a supervised meta-model. This meta-model is then used on a new (unlabeled) data set to predict outliers. We show the robustness of our approach on several benchmarks from the OpenML repository.

List of keywords

Data Mining -> DM: Anomaly/outlier detection
Machine Learning -> ML: Automated machine learning
Machine Learning -> ML: Meta-learning

1993

Natural Language-centered Inference Network for Multi-modal Fake News Detection

Qiang Zhang, Jiawei Liu, Fanrui Zhang, Jingyi Xie, Zheng-Jun Zha

The proliferation of fake news with image and text on the internet has triggered widespread concern. Existing research has made important contributions in cross-modal information interaction and fusion, but fails to fundamentally address the modality gap among news image, text, and news-related external knowledge representations. In this paper, we propose a novel Natural Language-centered Inference Network (NLIN) for multi-modal fake news detection by aligning multi-modal news content with the natural language space and introducing an encoder-decoder architecture to fully comprehend the news in-context. Specifically, we first unify multi-modal news content into textual modality by converting news images and news-related external knowledge into plain textual content. Then, we design a multi-modal feature reasoning module, which consists of a multi-modal encoder, a unified-modal context encoder and an inference decoder with prompt phrase. This framework not only fully extracts the latent representation of cross-modal news content, but also utilizes the prompt phrase to stimulate the powerful in-context learning ability of the pre-trained large language model to reason about the truthfulness of the news content. In addition, to support the research in the field of multi-modal fake news detection, we produce a challenging large-scale, multi-platform, multi-domain multi-modal Chinese Fake News Detection (CFND) dataset. Extensive experiments show that our CFND dataset is challenging and the proposed NLIN outperforms state-of-the-art methods.

List of keywords

Data Mining -> DM: Mining text, web, social media
Multidisciplinary Topics and Applications -> MTA: News and media

1997

Kernel Readout for Graph Neural Networks

Jiajun Yu, Zhihao Wu, Jinyu Cai, Adele Lu Jia, Jicong Fan

Graph neural networks (GNNs) for graph classification or representation learning require a pooling operation to convert the nodes’ embeddings of each graph to a vector as the graph-level representation and the operation has a significant impact on model accuracy. The paper presents a novel graph pooling method called kernel readout. Kernel readout maps the node embeddings from the sample space with limited nodes to an augmented sample space with infinite nodes, and then calculates the inner product between some learnable adaptive centers and the augmented node embeddings, which forms a final graph-level feature vector. We apply the proposed strategy to six supervised and two unsupervised graph neural networks such as GCN, GIN, GUNet, InfoGraph, and GraphCL, and the experiments on eight benchmark datasets show that the proposed readout outperforms classical pooling methods such as Sum and seven state-of-the-art pooling methods such as SRead and Janossy GRU.

List of keywords

Data Mining -> DM: Mining graphs
Machine Learning -> ML: Representation learning

2005

When Fairness Meets Privacy: Exploring Privacy Threats in Fair Binary Classifiers via Membership Inference Attacks

Huan Tian, Guangsheng Zhang, Bo Liu, Tianqing Zhu, Ming Ding, Wanlei Zhou

While in-processing fairness approaches show promise in mitigating biased predictions, their potential impact on privacy leakage remains under-explored. We aim to address this gap by assessing the privacy risks of fairness-enhanced binary classifiers with membership inference attacks (MIAs). Surprisingly, our results reveal that these fairness interventions exhibit increased resilience against existing attacks, indicating that enhancing fairness does not necessarily lead to privacy compromises. However, we find current attack methods are ineffective as they typically degrade into simple threshold models with limited attack effectiveness. Following this observation, we discover a novel threat dubbed Fairness Discrepancy Membership Inference Attacks (FD-MIA) that exploits prediction discrepancies between fair and biased models. This attack reveals more potent vulnerabilities and poses significant risks to model privacy. Extensive experiments across multiple datasets, attack methods, and representative fairness approaches confirm our findings and demonstrate the efficacy of the proposed attack method. Our study exposes the overlooked privacy threats in fairness studies, advocating for thorough evaluations of potential security vulnerabilities before model deployments.

List of keywords

AI Ethics, Trust, Fairness -> ETF: Trustworthy AI
AI Ethics, Trust, Fairness -> ETF: Fairness and diversity
AI Ethics, Trust, Fairness -> ETF: Safety and robustness

2006

Ten Words Only Still Help: Improving Black-Box AI-Generated Text Detection via Proxy-Guided Efficient Re-Sampling

Yuhui Shi, Qiang Sheng, Juan Cao, Hao Mi, Beizhe Hu, Danding Wang

With the rapidly increasing application of large language models (LLMs), their abuse has caused many undesirable societal problems such as fake news, academic dishonesty, and information pollution. This makes AI-generated text (AIGT) detection of great importance. Among existing methods, white-box methods are generally superior to black-box methods in terms of performance and generalizability, but they require access to LLMs’ internal states and are not applicable to black-box settings. In this paper, we propose to estimate word generation probabilities as pseudo white-box features via multiple re-sampling to help improve AIGT detection under the black-box setting. Specifically, we design POGER, a proxy-guided efficient re-sampling method, which selects a small subset of representative words (e.g., 10 words) for performing multiple re-sampling in black-box AIGT detection. Experiments on datasets containing texts from humans and seven LLMs show that POGER outperforms all baselines in macro F1 under black-box, partial white-box, and out-of-distribution settings and maintains lower re-sampling costs than its existing counterparts.

List of keywords

AI Ethics, Trust, Fairness -> ETF: Trustworthy AI
Natural Language Processing -> NLP: Applications

2007

SDformer: Transformer with Spectral Filter and Dynamic Attention for Multivariate Time Series Long-term Forecasting

Ziyu Zhou, Gengyu Lyu, Yiming Huang, Zihao Wang, Ziyu Jia, Zhen Yang

Transformer has gained widespread adoption in modeling time series due to the exceptional ability of its self-attention mechanism in capturing long-range dependencies. However, when processing time series data with numerous variates, the vanilla self-attention mechanism tends to distribute attention weights evenly and smoothly, causing row-homogenization in attention maps and further hampering time series forecasting. To tackle this issue, we propose an advanced Transformer architecture entitled SDformer, which designs two novel modules, Spectral-Filter-Transform (SFT) and Dynamic-Directional-Attention (DDA), and integrates them into the encoder of Transformer to achieve more intensive attention allocation. Specifically, the SFT module utilizes the Fast Fourier Transform to select the most prominent frequencies, along with a Hamming Window to smooth and denoise the filtered series data; the DDA module applies a specialized kernel function to the query and key vectors projected from the denoised data, concentrating this innovative attention mechanism more effectively on the most informative variates to obtain a sharper attention distribution. These two modules jointly enable attention weights to be more salient among numerous variates, which in turn enhances the attention’s ability to capture multivariate correlations, improving the performance in forecasting. Extensive experiments on public datasets demonstrate its superior performance over other state-of-the-art models. Code is available at https://github.com/zhouziyu02/SDformer.
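As a rough illustration of the spectral filtering idea in this abstract, the sketch below keeps only the strongest frequency bins of a series and then smooths the result with a Hamming-weighted moving average. It is not taken from the SDformer repository: the function names, the naive O(n^2) transform (used in place of an FFT to stay dependency-free), and the way the Hamming window is applied as a smoothing kernel are all assumptions made for this example.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (stand-in for an FFT; O(n^2))."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def inverse_dft(spectrum):
    """Inverse transform; returns the real part, as the input series is real."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def keep_top_frequencies(x, top_k):
    """Zero every frequency bin except the top_k bins of largest magnitude."""
    spectrum = dft(x)
    order = sorted(range(len(x)), key=lambda k: abs(spectrum[k]), reverse=True)
    keep = set(order[:top_k])
    filtered = [spectrum[k] if k in keep else 0j for k in range(len(x))]
    return inverse_dft(filtered)


def hamming_smooth(x, width):
    """Moving average weighted by a Hamming window (assumed usage; odd width >= 3).

    Weights are renormalised near the edges so a constant series stays constant.
    """
    window = [0.54 - 0.46 * math.cos(2 * math.pi * i / (width - 1)) for i in range(width)]
    half = width // 2
    smoothed = []
    for t in range(len(x)):
        num = den = 0.0
        for i, w in enumerate(window):
            s = t + i - half
            if 0 <= s < len(x):
                num += w * x[s]
                den += w
        smoothed.append(num / den)
    return smoothed


# Hypothetical usage: a dominant component plus a weaker one.
n = 32
signal = [
    math.sin(2 * math.pi * 3 * t / n) + 0.1 * math.sin(2 * math.pi * 7 * t / n)
    for t in range(n)
]
denoised = hamming_smooth(keep_top_frequencies(signal, top_k=2), width=5)
```

For a real-valued series each frequency occupies a conjugate pair of bins, so `top_k=2` retains a single sinusoid in this toy example.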

List of keywords

Machine Learning -> ML: Time series and data streams
Data Mining -> DM: Mining spatial and/or temporal data
Machine Learning -> ML: Attention models

2024

VCC-INFUSE: Towards Accurate and Efficient Selection of Unlabeled Examples in Semi-supervised Learning

Shijie Fang, Qianhan Feng, Tong Lin

Despite the progress of Semi-supervised Learning (SSL), existing methods fail to utilize unlabeled data effectively and efficiently. Many pseudo-label-based methods select unlabeled examples based on inaccurate confidence scores from the classifier. Most prior work also uses all available unlabeled data without pruning, making it difficult to handle large amounts of unlabeled data. To address these issues, we propose two methods: Variational Confidence Calibration (VCC) and Influence-Function-based Unlabeled Sample Elimination (INFUSE). VCC is a universal plugin for SSL confidence calibration, using a variational autoencoder to select more accurate pseudo labels based on three types of consistency scores. INFUSE is a data pruning method that constructs a core dataset of unlabeled examples under SSL. Our methods are effective in multiple datasets and settings, reducing classification error rates and saving training time. Together, VCC-INFUSE reduces the error rate of FlexMatch on the CIFAR-100 dataset by 1.08% while saving nearly half of the training time.

List of keywords

Machine Learning -> ML: Semi-supervised learning

2026

Balancing Multimodal Learning via Online Logit Modulation

Daoming Zong, Chaoyue Ding, Baoxiang Li, Jiakui Li, Ken Zheng

Multimodal learning is provably superior to unimodal learning. However, in practice, the best-performing unimodal networks often outperform jointly trained multimodal networks. This phenomenon can be attributed to the varying convergence and generalization rates across different modalities, leading to the dominance of one modality and causing underfitting of other modalities in simple multimodal joint training. To mitigate this issue, we propose two key ingredients: i) disentangling the learning of unimodal features and multimodal interaction through an intermediate representation fusion block; ii) modulating the logits of different modalities via dynamic coefficients during training to align their magnitudes with the target values, referred to as online logit modulation (OLM). Remarkably, OLM is model-agnostic and can be seamlessly integrated with most existing multimodal training frameworks. Empirical evidence shows that our approach brings significant enhancements over baselines on a wide range of multimodal tasks, covering video, audio, text, image, and depth modalities.

List of keywords

Machine Learning -> ML: Optimization
Computer Vision -> CV: Multimodal learning
Machine Learning -> ML: Applications
Machine Learning -> ML: Attention models

2029

PTDE: Personalized Training with Distilled Execution for Multi-Agent Reinforcement Learning

Yiqun Chen, Hangyu Mao, Jiaxin Mao, Shiguang Wu, Tianle Zhang, Bin Zhang, Wei Yang, Hongxing Chang

Centralized Training with Decentralized Execution (CTDE) has emerged as a widely adopted paradigm in multi-agent reinforcement learning, emphasizing the utilization of global information for learning an enhanced joint Q-function or centralized critic. In contrast, our investigation delves into harnessing global information to directly enhance individual Q-functions or individual actors. Notably, we discover that applying identical global information universally across all agents proves insufficient for optimal performance. Consequently, we advocate for the customization of global information tailored to each agent, creating agent-personalized global information to bolster overall performance. Furthermore, we introduce a novel paradigm named Personalized Training with Distilled Execution (PTDE), wherein agent-personalized global information is distilled into the agent’s local information. This distilled information is then utilized during decentralized execution, resulting in minimal performance degradation. PTDE can be seamlessly integrated with state-of-the-art algorithms, leading to notable performance enhancements across diverse benchmarks, including the SMAC benchmark, Google Research Football (GRF) benchmark, and Learning to Rank (LTR) task.

List of keywords

Agent-based and Multi-agent Systems -> MAS: Multi-agent learning
Agent-based and Multi-agent Systems -> MAS: Coordination and cooperation

2034

HeterGCL: Graph Contrastive Learning Framework on Heterophilic Graph

Chenhao Wang, Yong Liu, Yan Yang, Wei Li

Graph Contrastive Learning (GCL) has attracted significant research attention due to its self-supervised ability to learn robust node representations. Unfortunately, most methods primarily focus on homophilic graphs, rendering them less effective for heterophilic graphs. In addition, the complexity of node interactions in heterophilic graphs poses considerable challenges to augmentation schemes, coding architectures, and contrastive designs for traditional GCL. In this work, we propose HeterGCL, a novel graph contrastive learning framework with structural and semantic learning to explore the true potential of GCL on heterophilic graphs. Specifically, we abandon the random augmentation scheme that leads to the destruction of the graph structure, and instead introduce an adaptive neighbor aggregation strategy (ANA) to extract topology-supervised signals from neighboring nodes at different distances and explore the structural information with an adaptive local-to-global contrastive loss. In the semantic learning module, we jointly consider the original nodes’ features and the similarity between nodes in the latent feature space to explore hidden associations between nodes. Experimental results on homophilic and heterophilic graphs demonstrate that HeterGCL outperforms existing self-supervised and semi-supervised baselines across various downstream tasks.

List of keywords

Data Mining -> DM: Mining graphs
Machine Learning -> ML: Self-supervised Learning

2043

Recall, Retrieve and Reason: Towards Better In-Context Relation Extraction

Guozheng Li, Peng Wang, Wenjun Ke, Yikai Guo, Ke Ji, Ziyu Shang, Jiajun Liu, Zijie Xu

Relation extraction (RE) aims to identify relations between entities mentioned in texts. Although large language models (LLMs) have demonstrated impressive in-context learning (ICL) abilities in various tasks, they still suffer from poor performance compared to most supervised fine-tuned RE methods. Utilizing ICL for RE with LLMs encounters two challenges: (1) retrieving good demonstrations from training examples, and (2) enabling LLMs to exhibit strong ICL abilities in RE. On the one hand, retrieving good demonstrations is a non-trivial process in RE, which easily results in low relevance regarding entities and relations. On the other hand, ICL with an LLM achieves poor performance in RE when RE is different from language modeling in nature or the LLM is not large enough. In this work, we propose a novel recall-retrieve-reason RE framework that synergizes LLMs with retrieval corpora (training examples) to enable relevant retrieving and reliable in-context reasoning. Specifically, we distill the consistent ontological knowledge from training datasets to let LLMs generate relevant entity pairs grounded by retrieval corpora as valid queries. These entity pairs are then used to retrieve relevant training examples from the retrieval corpora as demonstrations for LLMs to conduct better ICL via instruction tuning. Extensive experiments on different LLMs and RE datasets demonstrate that our method generates relevant and valid entity pairs and boosts ICL abilities of LLMs, achieving competitive or new state-of-the-art performance on sentence-level RE compared to previous supervised fine-tuning methods and ICL-based methods.

List of keywords

Natural Language Processing -> NLP: Information extraction

2045

ReliaAvatar: A Robust Real-Time Avatar Animator with Integrated Motion Prediction

Bo Qian, Zhenhuan Wei, Jiashuo Li, Xing Wei

Efficiently estimating full-body pose with minimal wearable devices presents a worthwhile research direction. Despite the significant advancements in this field, most current research neglects to explore full-body avatar estimation under low-quality signal conditions, which are prevalent in practical usage. To bridge this gap, we summarize three scenarios that may be encountered in real-world applications: standard scenario, instantaneous data-loss scenario, and prolonged data-loss scenario, and propose a new evaluation benchmark. The solution we propose to address data-loss scenarios is integrating the full-body avatar pose estimation problem with motion prediction. Specifically, we present ReliaAvatar, a real-time, reliable avatar estimator equipped with predictive modeling capabilities by employing a dual-pathway architecture. ReliaAvatar operates effectively, with an impressive performance rate of 109 frames per second (fps). Extensive comparative evaluations on widely recognized benchmark datasets demonstrate ReliaAvatar’s superior performance in both standard and low data-quality conditions, marking a significant advancement in full-body avatar estimation.

List of keywords

Humans and AI -> HAI: Applications
Humans and AI -> HAI: Human-computer interaction
Humans and AI -> HAI: Personalization and user modeling
Robotics -> ROB: Human robot interaction

2051

Allocating Mixed Goods with Customized Fairness and Indivisibility Ratio

Bo Li, Zihao Li, Shengxin Liu, Zekai Wu

We consider the problem of fairly allocating a combination of divisible and indivisible goods. While fairness criteria like envy-freeness (EF) and proportionality (PROP) can always be achieved for divisible goods, only their relaxed versions, such as the “up to one” relaxations EF1 and PROP1, can be satisfied when the goods are indivisible. The “up to one” relaxations require the fairness conditions to be satisfied provided that one good can be completely eliminated or added in the comparison. In this work, we bridge the gap between the two extremes and propose “up to a fraction” relaxations for the allocation of mixed divisible and indivisible goods. The fraction is determined based on the proportion of indivisible goods, which we call the indivisibility ratio. The new concepts also introduce asymmetric conditions that are customized for individuals with varying indivisibility ratios. We provide both upper and lower bounds on the fractions of the modified item in order to satisfy the fairness criterion. Our results are tight up to a constant for EF and asymptotically tight for PROP.
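The “up to one” relaxation mentioned in this abstract can be made concrete for the purely indivisible case. The sketch below checks envy-freeness and EF1 for a given allocation, assuming additive, non-negative valuations; the function names and the example instance are hypothetical, and the paper's “up to a fraction” notions for mixed goods are not implemented here.

```python
def bundle_value(valuation, bundle):
    """Additive value of a bundle under one agent's valuation (assumed additive)."""
    return sum(valuation[good] for good in bundle)


def is_envy_free(valuations, allocation):
    """EF: no agent values another agent's bundle strictly more than their own."""
    for i, valuation in enumerate(valuations):
        own = bundle_value(valuation, allocation[i])
        for j, bundle in enumerate(allocation):
            if i != j and bundle_value(valuation, bundle) > own:
                return False
    return True


def is_ef1(valuations, allocation):
    """EF1: any envy disappears after removing some single good from the envied bundle."""
    for i, valuation in enumerate(valuations):
        own = bundle_value(valuation, allocation[i])
        for j, bundle in enumerate(allocation):
            if i == j:
                continue
            other = bundle_value(valuation, bundle)
            if own >= other:
                continue  # agent i does not envy agent j
            # Removing the good that agent i values most is the best case for agent i.
            if not bundle or own < other - max(valuation[good] for good in bundle):
                return False
    return True


# Hypothetical instance: two agents with identical valuations over three goods.
valuations = [{"a": 6, "b": 3, "c": 1}, {"a": 6, "b": 3, "c": 1}]
balanced = [{"a"}, {"b", "c"}]       # agent 1 envies agent 0, but not after removing "a"
lopsided = [{"a", "b", "c"}, set()]  # agent 1 still envies after any single removal
```

In this instance `balanced` is EF1 but not envy-free, while `lopsided` fails EF1, which matches the abstract's point that only the relaxed criterion is generally attainable for indivisible goods.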

List of keywords

Game Theory and Economic Paradigms -> GTEP: Fair division

2053

Imperio: Language-Guided Backdoor Attacks for Arbitrary Model Control

Ka-Ho Chow, Wenqi Wei, Lei Yu

Natural language processing (NLP) has received unprecedented attention. While advancements in NLP models have led to extensive research into their backdoor vulnerabilities, the potential for these advancements to introduce new backdoor threats remains unexplored. This paper proposes Imperio, which harnesses the language understanding capabilities of NLP models to enrich backdoor attacks. Imperio provides a new model control experience. Demonstrated through controlling image classifiers, it empowers the adversary to manipulate the victim model with arbitrary output through language-guided instructions. This is achieved using a language model to fuel a conditional trigger generator, with optimizations designed to extend its language understanding capabilities to backdoor instruction interpretation and execution. Our experiments across three datasets, five attacks, and nine defenses confirm Imperio’s effectiveness. It can produce contextually adaptive triggers from text descriptions and control the victim model with desired outputs, even in scenarios not encountered during training. The attack reaches a high success rate without compromising the accuracy of clean inputs and exhibits resilience against representative defenses. Supplementary materials are available at https://khchow.com/Imperio.

List of keywords

Computer Vision -> CV: Adversarial learning, adversarial attack and defense methods
AI Ethics, Trust, Fairness -> ETF: Safety and robustness
AI Ethics, Trust, Fairness -> ETF: Trustworthy AI

2055

Cross-View Contrastive Fusion for Enhanced Molecular Property Prediction

Yan Zheng, Song Wu, JunYu Lin, Yazhou Ren, Jing He, Xiaorong Pu, Lifang He

Molecular Property Prediction (MPP), which leverages machine learning to predict molecular properties, has garnered substantial attention in recent years. However, current MPP methods face two prominent challenges: 1) single-view MPP approaches do not sufficiently exploit the complementary information of molecular data across multiple views, generally producing sub-optimal performance, and 2) most existing multi-view MPP models ignore the disparities in data quality among different views, inadvertently introducing the risk of models being overshadowed by inferior views. We introduce a novel multi-view MPP approach, termed MolFuse. We first extract intricate molecular semantics and structures from both sequence and topological-spatial views, leveraging the complementarity of multi-view data. Notably, MolFuse employs two distinct graphs – the atomic graph and the chemical bond graph – to enhance the representation of the molecular graph. This comprehensive representation integrates both the fundamental backbone attributes and the nuanced shape characteristics. To further refine the initial feature representations within each view, we incorporate a dual learning mechanism. Subsequently, we extract more precise and informative global features by maximizing the coherence among diverse view-specific molecular representations. Finally, the learning processes are combined into a unified optimization problem for iterative training. Experiments on multiple benchmark datasets substantiate the efficacy of our method.

List of keywords

Machine Learning -> ML: Multi-view learning
Data Mining -> DM: Mining graphs
Multidisciplinary Topics and Applications -> MTA: Bioinformatics

2072

Langshaw: Declarative Interaction Protocols Based on Sayso and Conflict

Munindar Singh, Samuel Christie, Amit Chopra

Existing languages for specifying multiagent interaction protocols either over-constrain how protocols are operationalized or limit how application meanings for communications are expressed. We propose a declarative protocol language that unites an information model to express meaning with a social construct to coordinate agents. Thus, a protocol is specified using domain knowledge alone. We give a formal semantics for our language, procedures for determining the safety and liveness of a protocol, and a method to generate a message-oriented protocol (embedding needed coordination) suitable for flexible asynchronous enactment.

List of keywords

Agent-based and Multi-agent Systems -> MAS: Agent communication

2089

Vulnerabilities of Single-Round Incentive Compatibility in Auto-bidding: Theory and Evidence from ROI-Constrained Online Advertising Markets

Juncheng Li, Pingzhong Tang

Most of the work in the auction design literature assumes that bidders behave rationally based on the information available for every individual auction, and the revelation principle enables designers to restrict their efforts to incentive compatible (IC) mechanisms. However, in today’s online advertising markets, one of the most important real-life applications of auction design, the data and computational power required to bid optimally are only available to the platform, and an advertiser can only participate by setting performance objectives and constraints for its proxy auto-bidder provided by the platform. The prevalence of auto-bidding necessitates a review of auction theory. In this paper, we examine the markets through the lens of ROI-constrained value-maximizing campaigns. We show that the second-price auction exhibits many undesirable properties (computational hardness, non-monotonicity, instability of bidders’ utilities, and interference in A/B testing) and loses its dominant theoretical advantages in single-item scenarios. In addition, we make it clear how IC and its runner-up-winner interdependence contribute to each property. We hope that our work could bring new perspectives to the community and help practitioners attain a better grasp of real-world markets.

List of keywords

Game Theory and Economic Paradigms -> GTEP: Auctions and market-based systems
Multidisciplinary Topics and Applications -> MTA: Economics

2093

Large Language Model-Enhanced Algorithm Selection: Towards Comprehensive Algorithm Representation

Xingyu Wu, Yan Zhong, Jibin Wu, Bingbing Jiang, Kay Chen Tan

Algorithm selection, a critical process of automated machine learning, aims to identify the most suitable algorithm for solving a specific problem prior to execution. Mainstream algorithm selection techniques heavily rely on problem features, while the role of algorithm features remains largely unexplored. Due to the intrinsic complexity of algorithms, effective methods for universally extracting algorithm information are lacking. This paper takes a significant step towards bridging this gap by introducing Large Language Models (LLMs) into algorithm selection for the first time. By comprehending the code text, the LLM not only captures the structural and semantic aspects of the algorithm, but also demonstrates contextual awareness and library function understanding. The high-dimensional algorithm representation extracted by the LLM, after undergoing a feature selection module, is combined with the problem representation and passed to the similarity calculation module. The selected algorithm is determined by the matching degree between a given problem and different algorithms. Extensive experiments validate the performance superiority of the proposed model and the efficacy of each key module. Furthermore, we present a theoretical upper bound on model complexity, showcasing the influence of algorithm representation and feature selection modules. This provides valuable theoretical guidance for the practical implementation of our method.

List of keywords

Machine Learning -> ML: Automated machine learning
Search -> S: Algorithm portfolios and configuration
Machine Learning -> ML: Applications
Natural Language Processing -> NLP: Language models

2119

Seed Selection in the Heterogeneous Moran Process

Petros Petsinis, Andreas Pavlogiannis, Josef Tkadlec, Panagiotis Karras

The Moran process is a classic stochastic process that models the rise and takeover of novel traits in network-structured populations. In biological terms, a set of mutants, each with fitness m ∈ (0, ∞), invade a population of residents with fitness 1. Each agent reproduces at a rate proportional to its fitness and each offspring replaces a random network neighbor. The process ends when the mutants either fixate (take over the whole population) or go extinct. The fixation probability measures the success of the invasion. To account for environmental heterogeneity, we study a generalization of the standard process, called the Heterogeneous Moran process. Here, the fitness of each agent is determined both by its type (resident/mutant) and the node it occupies. We study the natural optimization problem of seed selection: given a budget k, which k agents should initiate the mutant invasion to maximize the fixation probability? We show that the problem is strongly inapproximable: it is NP-hard to distinguish between maximum fixation probability 0 and 1. We then focus on mutant-biased networks, where each node exhibits at least as large mutant fitness as resident fitness. We show that the problem remains NP-hard, but the fixation probability becomes submodular, and thus the optimization problem admits a greedy (1 − 1/e)-approximation. An experimental evaluation of the greedy algorithm along with various heuristics on real-world data sets corroborates our results.
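The process and the greedy seed selection described in this abstract can be sketched as a small simulation. The code below is an illustrative assumption rather than the authors' implementation: it uses a discrete-time version of the process (one reproduction event per step), estimates the fixation probability by Monte Carlo sampling, and all function and parameter names are hypothetical. Because the estimates are noisy, the (1 − 1/e) guarantee stated for the exact submodular objective should not be assumed to carry over to this sketch.

```python
import random


def simulate_moran(neighbors, mutant_fitness, resident_fitness, seed_set, rng,
                   max_steps=100_000):
    """Run one trajectory; return True if mutants fixate, False if they go extinct.

    neighbors[v] lists the nodes that an offspring of v may replace.
    Fitness depends on both the node and the type occupying it (heterogeneous setting).
    """
    n = len(neighbors)
    is_mutant = [v in seed_set for v in range(n)]
    mutants = sum(is_mutant)
    for _ in range(max_steps):
        if mutants == n:
            return True
        if mutants == 0:
            return False
        # Pick the reproducing agent with probability proportional to its fitness.
        weights = [mutant_fitness[v] if is_mutant[v] else resident_fitness[v]
                   for v in range(n)]
        parent = rng.choices(range(n), weights=weights, k=1)[0]
        # The offspring replaces a uniformly random neighbor of the parent.
        child = rng.choice(neighbors[parent])
        if is_mutant[child] != is_mutant[parent]:
            mutants += 1 if is_mutant[parent] else -1
            is_mutant[child] = is_mutant[parent]
    raise RuntimeError("process did not reach fixation or extinction within max_steps")


def estimate_fixation(neighbors, mutant_fitness, resident_fitness, seed_set, trials, rng):
    """Monte Carlo estimate of the fixation probability for a given seed set."""
    wins = sum(
        simulate_moran(neighbors, mutant_fitness, resident_fitness, seed_set, rng)
        for _ in range(trials)
    )
    return wins / trials


def greedy_seed_selection(neighbors, mutant_fitness, resident_fitness, budget, trials, rng):
    """Greedily add the node with the best estimated fixation probability (budget <= n)."""
    chosen = set()
    for _ in range(budget):
        candidates = [v for v in range(len(neighbors)) if v not in chosen]
        best = max(
            candidates,
            key=lambda v: estimate_fixation(
                neighbors, mutant_fitness, resident_fitness, chosen | {v}, trials, rng
            ),
        )
        chosen.add(best)
    return chosen


# Hypothetical mutant-biased instance: a star graph with node 0 at the centre.
star = [[1, 2, 3], [0], [0], [0]]
seeds = greedy_seed_selection(star, [2.0] * 4, [1.0] * 4, budget=2, trials=200,
                              rng=random.Random(0))
```

Each greedy round re-estimates every candidate from scratch, which keeps the sketch short but costs on the order of budget × n × trials simulated trajectories.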

List of keywords

Data Mining -> DM: Networks
Agent-based and Multi-agent Systems -> MAS: Resource allocation
Constraint Satisfaction and Optimization -> CSO: Constraint optimization problems
Search -> S: Evolutionary computation

2126

Score-CDM: Score-Weighted Convolutional Diffusion Model for Multivariate Time Series Imputation

Shunyang Zhang, Senzhang Wang, Hao Miao, Hao Chen, Changjun Fan, Jian Zhang

Multivariate time series (MTS) data are usually incomplete in real scenarios, and imputing the incomplete MTS is practically important to facilitate various time series mining tasks. Recently, diffusion model-based MTS imputation methods have achieved promising results by utilizing CNN or attention mechanisms for temporal feature learning. However, it is hard to adaptively trade off the diverse effects of local and global temporal features by simply combining CNN and attention. To address this issue, we propose a Score-weighted Convolutional Diffusion Model (Score-CDM for short), whose backbone consists of a Score-weighted Convolution Module (SCM) and an Adaptive Reception Module (ARM). SCM adopts a score map to capture the global temporal features in the time domain, while ARM uses a Spectral2Time Window Block (S2TWB) to convolve the local time series data in the spectral domain. Benefiting from the time convolution properties of the Fast Fourier Transform, ARM can adaptively change the receptive field of the score map, and thus effectively balance the local and global temporal features. We conduct extensive evaluations on three real MTS datasets of different domains, and the results verify the effectiveness of the proposed Score-CDM.

List of keywords

Data Mining -> DM: Mining spatial and/or temporal data

2127

Cooperation and Control in Delegation Games

Oliver Sourbut, Lewis Hammond, Harriet Wood

Many settings of interest involving humans and machines – from virtual personal assistants to autonomous vehicles – can naturally be modelled as principals (humans) delegating to agents (machines), which then interact with each other on their principals’ behalf. We refer to these multi-principal, multi-agent scenarios as delegation games. In such games, there are two important failure modes: problems of control (where an agent fails to comply with their principal’s wishes) and problems of cooperation (where the agents fail to work well together). In this paper we formalise and analyse these problems, further breaking them down into issues of alignment (do the players have similar preferences?) and capabilities (how competent are the players at satisfying those preferences?). We show – theoretically and empirically – how these measures determine the principals’ welfare, how they can be estimated using limited observations, and thus how they might be used to help us design more aligned and cooperative AI systems.

List of keywords

Agent-based and Multi-agent Systems -> MAS: Coordination and cooperation
AI Ethics, Trust, Fairness -> ETF: Safety and robustness
Game Theory and Economic Paradigms -> GTEP: Other
Humans and AI -> HAI: Human-AI collaboration

2128

RealDex: Towards Human-like Grasping for Robotic Dexterous Hand

Yumeng Liu, Yaxun Yang, Youzhuo Wang, Xiaofei Wu, Jiamin Wang, Yichen Yao, Sören Schwertfeger, Sibei Yang, Wenping Wang, Jingyi Yu, Xuming He, Yuexin Ma

In this paper, we introduce RealDex, a pioneering dataset capturing authentic dexterous hand grasping motions infused with human behavioral patterns, enriched by multi-view and multimodal visual data. Utilizing a teleoperation system, we seamlessly synchronize human-robot hand poses in real time. This collection of human-like motions is crucial for training dexterous hands to mimic human movements more naturally and precisely. RealDex holds immense promise in advancing humanoid robots for automated perception, cognition, and manipulation in real-world scenarios. Moreover, we introduce a cutting-edge dexterous grasping motion generation framework, which aligns with human experience and enhances real-world applicability through effectively utilizing Multimodal Large Language Models. Extensive experiments have demonstrated the superior performance of our method on RealDex and other open datasets. The dataset and associated code are available at https://4dvlab.github.io/RealDex_page/.

List of keywords

Robotics -> ROB: Learning in robotics
Robotics -> ROB: Manipulation
Robotics -> ROB: Robotics and vision

2140

Facility Location Problems with Capacity Constraints: Two Facilities and Beyond

Gennaro Auricchio, Zihe Wang, Jie Zhang

In this paper, we investigate the Mechanism Design aspects of the $m$-Capacitated Facility Location Problem ($m$-CFLP) on a line. We focus on two frameworks. In the first framework, the number of facilities is arbitrary, all facilities have the same capacity, and the number of agents is equal to the total capacity of all facilities. In the second framework, we aim to place two facilities, each with a capacity of at least half of the total agents. For both of these frameworks, we propose truthful mechanisms with bounded approximation ratios with respect to the Social Cost (SC) and the Maximum Cost (MC). When $m>2$, the result sharply contrasts with the impossibility results known for the classic $m$-Facility Location Problem, where capacity constraints are not considered. Furthermore, all our mechanisms are optimal with respect to the MC and optimal or nearly optimal with respect to the SC among anonymous mechanisms. For both frameworks, we provide a lower bound on the approximation ratio that any truthful and deterministic mechanism can achieve with respect to the SC and MC.

List of keywords

Game Theory and Economic Paradigms -> GTEP: Mechanism design
Agent-based and Multi-agent Systems -> MAS: Agent theories and models
Agent-based and Multi-agent Systems -> MAS: Coordination and cooperation
Agent-based and Multi-agent Systems -> MAS: Resource allocation

2144

Robust Reward Placement under Uncertainty

Petros Petsinis, Kaichen Zhang, Andreas Pavlogiannis, Jingbo Zhou, Panagiotis Karras

Risk-averse humans prefer robust decisions that ensure the best feasible rewards in worst-case scenarios. Over network-based planning problems, this risk-aversion leads to a challenge: select a set of network nodes that maximizes a cumulative reward function in expectation dependent on the probabilistic choices of mobile agents. In this paper we formulate and address the ensuing combinatorial optimization problem assuming agents that move in a Markovian manner following any of several network-based Markov Mobility Models (MMMs) and collect rewards when they reach a selected node, or reward state. We call this problem Robust Reward Placement (RRP). We establish the problem’s NP-hardness and inapproximability. We propose a polynomial-time algorithm that approximates the optimal solution while exceeding the budget constraint by a factor logarithmic in the number of MMMs, as well as several heuristics, most prominently one inspired by a dynamic programming algorithm for the max–min 0–1 KNAPSACK problem. We corroborate our theoretical findings with an experimental comparison of our solution vs. a suite of heuristics.

List of keywords

Planning and Scheduling -> PS: Planning under uncertainty
Data Mining -> DM: Networks
Planning and Scheduling -> PS: Markov decisions processes
Search -> S: Combinatorial search and optimisation

2165

A Deep Reinforcement Learning Approach to Balance Viewport Prediction and Video Transmission in 360° Video Streaming

Guanghui Zhang, Jing Guo

360° video streaming has seen tremendous growth in past years. However, our measurement reveals a dilemma that severely limits QoE. On the one hand, viewport prediction requires the shortest possible prediction distance for high prediction accuracy; on the other hand, video transmission requires more buffered data to compensate for bandwidth fluctuations, otherwise substantial playback rebuffering would be incurred. Since no existing method can break this dilemma, QoE optimization is naturally bottlenecked. This work tackles this challenge by developing QUTA – a novel learning-based streaming system. Specifically, our measurement shows that three kinds of internal streaming parameters have significant impacts on the prediction distance, namely, download pause, data rate threshold, and playback rate. On top of this, we design a new long-term-planning (LTP) learning method that tunes the parameters dynamically based on the network and streaming context. Evaluations with large-scale streaming trace data show that QUTA not only improves the prediction accuracy and QoE by up to 68.4% but also exhibits strong robustness.

List of keywords

Multidisciplinary Topics and Applications -> MTA: Transportation

2184

Comparing Ways of Obtaining Candidate Orderings from Approval Ballots

Théo Delemazure, Chris Dong, Dominik Peters, Magdalena Tydrichova

To understand and summarize approval preferences and other binary evaluation data, it is useful to order the items on an axis which explains the data. In a political election using approval voting, this could be an ideological left-right axis such that each voter approves adjacent candidates, an analogue of single-peakedness. In a perfect axis, every approval set would be an interval, which is usually not possible, and so we need to choose an axis that gets closest to this ideal. The literature has developed algorithms for optimizing several objective functions (e.g., minimize the number of added approvals needed to get a perfect axis), but provides little help with choosing among different objectives. In this paper, we take a social choice approach and compare 5 different axis selection rules axiomatically, by studying the properties they satisfy. We establish some impossibility theorems, and characterize (within the class of scoring rules) the rule that chooses the axes that maximize the number of votes that form intervals, using the axioms of ballot monotonicity and resistance to cloning. Finally, we study the behavior of the rules on data from French election surveys, on the votes of justices of the US Supreme Court, and on synthetic data.
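As a rough illustration of the objective mentioned in this abstract (choosing axes that maximize the number of votes that form intervals), the sketch below brute-forces all candidate orderings. It is a minimal sketch, not the authors' implementation: the function names `is_interval` and `best_axes` and the toy ballots are hypothetical, and exhaustive enumeration is only practical for a handful of candidates.

```python
from itertools import permutations


def is_interval(ballot, axis):
    """True if the approved candidates occupy consecutive positions on the axis."""
    positions = sorted(axis.index(c) for c in ballot)
    return not positions or positions[-1] - positions[0] + 1 == len(positions)


def best_axes(candidates, ballots):
    """Try every ordering; return the best interval count and all axes reaching it."""
    best, winners = -1, []
    for axis in permutations(candidates):
        score = sum(is_interval(b, axis) for b in ballots)
        if score > best:
            best, winners = score, [axis]
        elif score == best:
            winners.append(axis)
    return best, winners


# Toy approval profile (hypothetical data): each ballot is a set of approved candidates.
ballots = [{"a", "b"}, {"b", "c"}, {"a", "c"}, {"c", "d"}]
score, axes = best_axes("abcd", ballots)
```

In this toy profile no axis can make all four ballots intervals, because a, b and c cannot all be pairwise adjacent on a line, so the best achievable count is three.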

List of keywords

Game Theory and Economic Paradigms -> GTEP: Computational social choice

2188

Determining Winners in Elections with Absent Votes

Qishen Han, Amelie Marian, Lirong Xia

An important question in elections is determining whether a candidate can be a winner when some votes are absent. We study this determining winner with absent votes (WAV) problem for elections that take top-truncated ballots. We show that the WAV problem is NP-complete for single transferable vote, Maximin, and Copeland, and propose a special case of positional scoring rules for which the problem can be solved in polynomial time. Our results for top-truncated rankings differ from the results for full rankings: there, the hardness results still hold when the number of candidates or the number of missing votes is bounded, while we show that the problem can be solved in polynomial time in either case.
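To make the WAV question concrete, here is a minimal sketch for plurality, chosen purely for illustration as a simple positional scoring rule; it is not necessarily the special case studied in the paper. The function name `can_win_plurality` and the co-winner convention (ties count as winning) are assumptions of this sketch.

```python
from collections import Counter


def can_win_plurality(ballots, candidate, absent):
    """Can `candidate` be a co-winner if `absent` more ballots are cast?

    Under plurality only the top entry of each (possibly truncated) ballot
    counts, so the best case is that every absent vote ranks `candidate` first.
    """
    scores = Counter(b[0] for b in ballots if b)
    best_rival = max((s for c, s in scores.items() if c != candidate), default=0)
    return scores[candidate] + absent >= best_rival


# Top-truncated ballots (hypothetical data): tuples listing only the ranked prefix.
ballots = [("a", "b"), ("a",), ("b", "c"), ("a", "c")]
```

With these ballots, `can_win_plurality(ballots, "c", 2)` is false because candidate a already has three first places, while three absent votes would suffice.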

List of keywords

Game Theory and Economic Paradigms -> GTEP: Computational social choice

2224

NELLIE: A Neuro-Symbolic Inference Engine for Grounded, Compositional, and Explainable Reasoning

Nathaniel Weir, Peter Clark, Benjamin Van Durme

Our goal is a modern approach to answering questions via systematic reasoning where answers are supported by human interpretable proof trees grounded in an NL corpus of authoritative facts. Such a system would help alleviate the challenges of interpretability and hallucination with modern LMs, and the lack of grounding of current explanation methods (e.g., Chain-of-Thought). This paper proposes a new take on Prolog-based inference engines, where we replace handcrafted rules with a combination of neural language modeling, guided generation, and semiparametric dense retrieval. Our implementation, NELLIE, is the first system to demonstrate fully interpretable, end-to-end grounded QA as entailment tree proof search, going beyond earlier work explaining known-to-be-true facts from text. In experiments, NELLIE outperforms a similar-sized state-of-the-art reasoner [Tafjord et al., 2022] while producing knowledge-grounded explanations. We also find NELLIE can exploit both semi-structured and NL text corpora to guide reasoning. Together these suggest a new way to jointly reap the benefits of both modern neural methods and traditional symbolic reasoning.

List of keywords

Knowledge Representation and Reasoning -> KRR: Automated reasoning and theorem proving
Knowledge Representation and Reasoning -> KRR: Reasoning about knowledge and belief
Natural Language Processing -> NLP: Question answering
Search -> S: Other

2226

Hypergraph Self-supervised Learning with Sampling-efficient Signals

Fan Li, Xiaoyang Wang, Dawei Cheng, Wenjie Zhang, Ying Zhang, Xuemin Lin

Self-supervised learning (SSL) provides a promising alternative for representation learning on hypergraphs without costly labels. However, existing hypergraph SSL models are mostly based on contrastive methods with the instance-level discrimination strategy, suffering from two significant limitations: (1) They select negative samples arbitrarily, which is unreliable in deciding similar and dissimilar pairs, causing training bias. (2) They often require a large number of negative samples, resulting in expensive computational costs. To address the above issues, we propose SE-HSSL, a hypergraph SSL framework with three sampling-efficient self-supervised signals. Specifically, we introduce two sampling-free objectives leveraging the canonical correlation analysis as the node-level and group-level self-supervised signals. Additionally, we develop a novel hierarchical membership-level contrast objective motivated by the cascading overlap relationship in hypergraphs, which can further reduce membership sampling bias and improve the efficiency of sample utilization. Through comprehensive experiments on 7 real-world hypergraphs, we demonstrate the superiority of our approach over the state-of-the-art method in terms of both effectiveness and efficiency.

List of keywords

Machine Learning -> ML: Self-supervised Learning
Data Mining -> DM: Mining graphs

2228

Predictive Modeling with Temporal Graphical Representation on Electronic Health Records

Jiayuan Chen, Changchang Yin, Yuanlong Wang, Ping Zhang

Deep learning-based predictive models, leveraging Electronic Health Records (EHR), are receiving increasing attention in healthcare. An effective representation of a patient’s EHR should hierarchically encompass both the temporal relationships between historical visits and medical events, and the inherent structural information within these elements. Existing patient representation methods can be roughly categorized into sequential representation and graphical representation. The sequential representation methods focus only on the temporal relationships among longitudinal visits. On the other hand, the graphical representation approaches, while adept at extracting the graph-structured relationships between various medical events, fall short in effectively integrating temporal information. To capture both types of information, we model a patient’s EHR as a novel temporal heterogeneous graph. This graph includes historical visit nodes and medical event nodes. It propagates structured information from medical event nodes to visit nodes and utilizes time-aware visit nodes to capture changes in the patient’s health status. Furthermore, we introduce a novel temporal graph transformer (TRANS) that integrates temporal edge features, global positional encoding, and local structural encoding into heterogeneous graph convolution, capturing both temporal and structural information. We validate the effectiveness of TRANS through extensive experiments on three real-world datasets. The results show that our proposed approach achieves state-of-the-art performance.

List of keywords

Multidisciplinary Topics and Applications -> MTA: Health and medicine
Data Mining -> DM: Applications

2243

Towards Sharper Generalization Bounds for Adversarial Contrastive Learning

Wen Wen, Han Li, Tieliang Gong, Hong Chen

Recently, enhancing the adversarial robustness of machine learning algorithms has gained significant attention across various application domains. Given the widespread label scarcity issue in real-world data, adversarial contrastive learning (ACL) has been proposed to adversarially train robust models using unlabeled data. Despite the empirical success, its generalization behavior remains poorly understood and far from being well-characterized. This paper aims to address this issue from a learning theory perspective. We establish novel high-probability generalization bounds for general Lipschitz loss functions. The derived bounds scale as O(log(k)) with respect to the number of negative samples k, which improves on the existing bounds with linear dependency. Our results are generally applicable to many prediction models, including linear models and deep neural networks. In particular, we obtain an optimistic generalization bound of O(1/n) in the sample size n under the smoothness assumption on the loss function. To the best of our knowledge, this is the first fast-rate bound valid for ACL. Empirical evaluations on real-world datasets verify our theoretical findings.

List of keywords

Machine Learning -> ML: Adversarial machine learning
Machine Learning -> ML: Learning theory
Machine Learning -> ML: Self-supervised Learning

2249

FastScene: Text-Driven Fast Indoor 3D Scene Generation via Panoramic Gaussian Splatting

Yikun Ma, Dandan Zhan, Zhi Jin

Text-driven 3D indoor scene generation holds broad applications, ranging from gaming and smart home technologies to augmented and virtual reality (AR/VR) applications. Fast and high-fidelity scene generation is paramount for ensuring user-friendly experiences. However, existing methods are characterized by lengthy generation processes or necessitate the intricate manual specification of motion parameters, which introduces inconvenience for users. Furthermore, these methods often rely on narrow-field viewpoint iterative generations, compromising global consistency and overall scene quality. To address these issues, we propose FastScene, a framework for fast and high-quality 3D scene generation that maintains scene consistency. Specifically, given a text prompt, we generate a panorama and estimate its depth, since a panorama encompasses information about the entire scene and exhibits explicit geometric constraints. To obtain high-quality novel views, we introduce the Coarse View Synthesis (CVS) and Progressive Novel View Inpainting (PNVI) strategies, ensuring both scene consistency and view quality. Subsequently, we utilize Multi-View Projection (MVP) to form perspective views, and apply 3D Gaussian Splatting (3DGS) for fast scene generation. Comprehensive experiments demonstrate that FastScene surpasses other methods in both generation speed and quality with better scene consistency. Notably, guided only by a text prompt, FastScene can generate a complete 3D scene within a mere 15 minutes, which is at least one hour faster than state-of-the-art methods, making it a paradigm for user-friendly scene generation.

List of keywords

Computer Vision -> CV: 3D computer vision
Computer Vision -> CV: Image and video synthesis and generation
Computer Vision -> CV: Multimodal learning
Computer Vision -> CV: Scene analysis and understanding

2255

Motion-Aware Heatmap Regression for Human Pose Estimation in Videos

Inpyo Song, Lee Jongmin, Moonwook Ryu, Jangwon Lee

We present an approach to solving 2D human pose estimation in videos. The problem of human pose estimation in videos differs from estimating human poses in static images since videos contain a lot of motion-related information. Thus, we investigate how to utilize information about human body movements across a sequence of video frames for estimating human poses in videos. To do this, we introduce a novel heatmap regression method, which we call motion-aware heatmap regression. Our approach computes motion vectors for joint keypoints from adjacent frames. We then design a new style of heatmap that we call Motion-Aware Heatmaps to reflect the motion uncertainty of each joint point. Unlike traditional heatmaps, our motion-aware heatmaps not only consider the current joint locations but also account for how joints move over time. Furthermore, we introduce a simple yet effective framework designed to incorporate motion information into heatmap regression. We evaluate our motion-aware heatmap regression on PoseTrack(2018, 21) and Sub-JHMDB datasets. Our results validate that the proposed motion-aware heatmaps significantly improve the precision of human pose estimation in videos, particularly in challenging scenarios such as sports game footage with substantial human motions.

List of keywords

Computer Vision -> CV: Biometrics, face, gesture and pose recognition
Computer Vision -> CV: Action and behavior recognition
Computer Vision -> CV: Video analysis and understanding

2260

Continual Multi-View Clustering with Consistent Anchor Guidance

Chao Zhang, Deng Xu, Xiuyi Jia, Chunlin Chen, Huaxiong Li

Multi-view clustering (MVC) has recently attracted much attention. Most existing approaches are designed for fixed multi-view data, and cannot deal with the streaming data common in the real world. In this paper, we address this problem by proposing a consistent Anchor guided Continual MVC (ACMVC) method in a two-stage way. In the initial learning stage, a low-rank anchor graph based model is constructed. In the continual learning stage, to leverage the historical knowledge, the multi-level anchor information is reused to refine the model by adding consistency regularization. It not only provides prior knowledge to enhance the exploration of current data, but also captures the similarity relationship between previous and current data, enabling a comprehensive exploitation of streaming data. The proposed model can be optimized efficiently with linear time and space complexity. Experiments demonstrate the effectiveness and efficiency of our method compared with some state-of-the-art approaches.

List of keywords

Machine Learning -> ML: Multi-view learning
Machine Learning -> ML: Clustering
Machine Learning -> ML: Unsupervised learning
Data Mining -> DM: Mining data streams

2262

Large Language Model as a Policy Teacher for Training Reinforcement Learning Agents

Zihao Zhou, Bin Hu, Chenyang Zhao, Pu Zhang, Bin Liu

Recent studies have uncovered the potential of Large Language Models (LLMs) in addressing complex sequential decision-making tasks through the provision of high-level instructions. However, LLM-based agents lack specialization in tackling specific target problems, particularly in real-time dynamic environments. Additionally, deploying an LLM-based agent in practical scenarios can be both costly and time-consuming. On the other hand, reinforcement learning (RL) approaches train agents that specialize in the target task but often suffer from low sampling efficiency and high exploration costs. In this paper, we introduce a novel framework that addresses these challenges by training a smaller, specialized student RL agent using instructions from an LLM-based teacher agent. By incorporating the guidance from the teacher agent, the student agent can distill the prior knowledge of the LLM into its own model. Consequently, the student agent can be trained with significantly less data. Moreover, through further training with environment feedback, the student agent surpasses the capabilities of its teacher for completing the target task. We conducted experiments on challenging MiniGrid and Habitat environments, specifically designed for embodied AI research, to evaluate the effectiveness of our framework. The results clearly demonstrate that our approach achieves superior performance compared to strong baseline methods. Our code is available at https://github.com/ZJLAB-AMMI/LLM4Teach.

List of keywords

Machine Learning -> ML: Reinforcement learning
Natural Language Processing -> NLP: Language models
Uncertainty in AI -> UAI: Sequential decision making

2264

Learning Hierarchy-Enhanced POI Category Representations Using Disentangled Mobility Sequences

Hongwei Jia, Meng Chen, Weiming Huang, Kai Zhao, Yongshun Gong

Points of interest (POIs) carry a wealth of semantic information of varying locations in cities and thus have been widely used to enable various location-based services. To understand POI semantics, existing methods usually model contextual correlations of POI categories in users’ check-in sequences and embed categories into a latent space based on the word2vec framework. However, such an approach does not fully capture the underlying hierarchical relationship between POI categories and can hardly integrate the category hierarchy into various deep sequential models. To overcome this shortcoming, we propose a Semantically Disentangled POI Category Embedding Model (SD-CEM) to generate hierarchy-enhanced category representations using disentangled mobility sequences. Specifically, first, we construct disentangled mobility sequences using human mobility data based on the semantics of POIs. Then we utilize the POI category hierarchy to initialize a hierarchy-enhanced representation for each category in the disentangled sequences, employing an attention mechanism. Finally, we optimize these category representations by incorporating both the masked category prediction task and the next category prediction task. To evaluate the effectiveness of SD-CEM, we conduct comprehensive experiments using two check-in datasets covering three tasks. Experimental results demonstrate that SD-CEM outperforms several competitive baselines, highlighting its substantial improvement in performance as well as the understanding of learned category representations.

List of keywords

Data Mining -> DM: Mining spatial and/or temporal data

2280

Protecting Split Learning by Potential Energy Loss

Fei Zheng, Chaochao Chen, Lingjuan Lyu, Xinyi Fu, Xing Fu, Weiqiang Wang, Xiaolin Zheng, Jianwei Yin

As a practical privacy-preserving learning method, split learning has drawn much attention in academia and industry. However, its security is constantly being questioned since the intermediate results are shared during training and inference. In this paper, we focus on the privacy leakage from the forward embeddings of split learning. Specifically, since the forward embeddings contain too much information about the label, the attacker can either use a few labeled samples to fine-tune the top model or perform unsupervised attacks such as clustering to infer the true labels from the forward embeddings. To prevent this kind of privacy leakage, we propose the potential energy loss to make the forward embeddings more ‘complicated’, by pushing embeddings of the same class towards the decision boundary. Therefore, it is hard for the attacker to learn from the forward embeddings. Experimental results show that our method significantly lowers the performance of both fine-tuning attacks and clustering attacks.

List of keywords

Machine Learning -> ML: Federated learning
AI Ethics, Trust, Fairness -> ETF: Safety and robustness
Multidisciplinary Topics and Applications -> MTA: Security and privacy

2282

EPIC: Graph Augmentation with Edit Path Interpolation via Learnable Cost

Jaeseung Heo, Seungbeom Lee, Sungsoo Ahn, Dongwoo Kim

Data augmentation plays a critical role in improving model performance across various domains, but it becomes challenging with graph data due to their complex and irregular structure. To address this issue, we propose EPIC (Edit Path Interpolation via learnable Cost), a novel interpolation-based method for augmenting graph datasets. To interpolate between two graphs lying in an irregular domain, EPIC leverages the concept of graph edit distance, constructing an edit path that represents the transformation process between two graphs via edit operations. Moreover, our method introduces a context-sensitive cost model that accounts for the importance of specific edit operations formulated through a learning framework. This allows for a more nuanced transformation process, where the edit distance is not merely count-based but reflects meaningful graph attributes. With randomly sampled graphs from the edit path, we enrich the training set to enhance the generalization capability of classification models. Experimental evaluations across several benchmark datasets demonstrate that our approach outperforms existing augmentation techniques in many tasks.
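The idea of sampling augmented graphs from an edit path can be sketched as follows. This is a simplified illustration under strong assumptions that are not from the abstract: both graphs share node labels (so no node matching is needed), only edge insertions and deletions are used, and every operation has unit cost rather than the learned, context-sensitive cost the abstract describes. The function names are hypothetical.

```python
import random


def edit_path(edges_a, edges_b):
    """Edge operations turning graph A into graph B (shared node labels assumed)."""
    deletions = [("del", e) for e in sorted(edges_a - edges_b)]
    insertions = [("add", e) for e in sorted(edges_b - edges_a)]
    return deletions + insertions


def graph_at(edges_a, path, step):
    """Apply the first `step` operations of the path to graph A."""
    edges = set(edges_a)
    for op, edge in path[:step]:
        if op == "del":
            edges.discard(edge)
        else:
            edges.add(edge)
    return edges


def sample_augmented(edges_a, edges_b, rng=random):
    """Pick a random point on the edit path and return the graph found there."""
    path = edit_path(edges_a, edges_b)
    step = rng.randint(0, len(path))  # inclusive bounds: 0 gives A, len(path) gives B
    return graph_at(edges_a, path, step)


# Two small graphs as edge sets (hypothetical data).
graph_a = {(0, 1), (1, 2)}
graph_b = {(1, 2), (2, 3)}
path = edit_path(graph_a, graph_b)
```

Here the path has two operations, so the intermediate graph after one step contains only the shared edge (1, 2).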

List of keywords

Machine Learning -> ML: Sequence and graph learning

2284

Estimating Conditional Average Treatment Effects via Sufficient Representation Learning

Pengfei Shi, Wei Zhong, Xinyu Zhang, Ningtao Wang, Xing Fu, Weiqiang Wang, Yin Jin

Estimating the conditional average treatment effects (CATE) is very important in causal inference and has a wide range of applications across many fields. In the estimation process of CATE, the unconfoundedness assumption is typically required to ensure the identifiability of the regression problems. For estimating CATE from high-dimensional data, many variable selection methods and neural network approaches based on representation learning have been proposed. However, these methods do not provide a way to verify whether the subset of variables after dimensionality reduction or the learned representations still satisfy the unconfoundedness assumption during the estimation process, which can lead to ineffective estimates of the treatment effects. Additionally, these methods typically use data from only the treatment or control group when estimating the regression functions for each group. This paper proposes a novel neural network approach named CrossNet to learn a sufficient representation for the features, based on which we then estimate the CATE. Here, cross indicates that, in estimating the regression functions, we use data from each group itself as well as cross-utilized data from the other group. Numerical simulations and empirical results demonstrate that our method outperforms the competitive approaches.

List of keywords

Machine Learning -> ML: Regression
Machine Learning -> ML: Causality
Machine Learning -> ML: Representation learning
Machine Learning -> ML: Supervised Learning

2309

STAR: Spatio-Temporal State Compression for Multi-Agent Tasks with Rich Observations

Chao Li, Yujing Hu, Shangdong Yang, Tangjie Lv, Changjie Fan, Wenbin Li, Chongjie Zhang, Yang Gao

This paper focuses on the problem of learning compressed state representations for multi-agent tasks. Under the assumption of rich observation, we point out that the state representations should be compressed both spatially and temporally to enable efficient prioritization of task-relevant features, which existing works typically fail to do. To overcome this limitation, we propose a novel method named Spatio-Temporal stAte compRession (STAR) that explicitly defines both spatial and temporal compression operations on the learned state representations to encode per-agent task-relevant features. Specifically, we first formalize this problem by introducing the Task Informed Partially Observable Stochastic Game (TI-POSG). Then, we identify the spatial representation compression in it as encoding the latent states from the joint observations of all agents, and achieve this by learning representations that approximate the latent states based on the information theoretical principle. After that, we further extract the task-relevant features of each agent from these representations by aligning them based on their reward similarities, which is regarded as the temporal representation compression. Structurally, we implement these two compressions by learning a set of agent-specific decoding functions and incorporate them into a critic shared by agents for scalable learning. We evaluate our method by developing decentralized policies on 12 maps of the StarCraft Multi-Agent Challenge benchmark, and the superior performance demonstrates its effectiveness.

List of keywords

Agent-based and Multi-agent Systems -> MAS: Multi-agent learning
Machine Learning -> ML: Reinforcement learning

2312

Modeling Personalized Retweeting Behaviors for Multi-Stage Cascade Popularity Prediction

Mingyang Zhou, Yanjie Lin, Gang Liu, Li Zuwen, Hao Liao, Rui Mao

Predicting the size of message cascades is critical in various applications, such as online advertising and early detection of rumors. However, most existing deep learning approaches rely on cascade observation, which hinders accurate cascade prediction before message posting. Besides, these approaches overlook personalized retweeting behaviors that reflect users’ inclination to retweet specific types of information. In this study, we propose a universal cascade prediction framework, namely Cascade prediction regarding Multiple Stage (CasMS), that effectively predicts cascade popularity across the message generation stage as well as short-term and long-term stages. Unlike previous methods, our approach not only captures users’ personalized retweeting behaviors but also incorporates temporal cascade features. We perform experiments on datasets we collected ourselves as well as on public datasets. The results show that our method significantly surpasses existing approaches in predicting the cascade during the message generation stage and different time periods in the cascade dynamics.

List of keywords

Data Mining -> DM: Mining text, web, social media
Data Mining -> DM: Recommender systems
Machine Learning -> ML: Applications

2319

Multi-Granularity Graph-Convolution-Based Method for Weakly Supervised Person Search

Haichun Tai, De Cheng, Jie Li, Nannan Wang, Xinbo Gao

One-step Weakly Supervised Person Search (WSPS) jointly performs pedestrian detection and person Re-IDentification (ReID) only with bounding box annotations, which makes the traditional person ReID problem more suitable and efficient for real-world applications. However, this task is very challenging due to the following reasons: 1) large feature gap between person ReID and general object detection tasks when learning shared representations; 2) difficult pseudo identity estimation for each person image with unrefined raw detection and dramatic scale changes. To address the above issues, we propose a multi-granularity graph convolution framework to jointly optimize the aligned task features, as well as to assist the pseudo label estimation. Specifically, the multi-granularity feature alignment module (MFA) in the designed two-branch framework employs cluster-level bi-directional interaction of various granularity information to narrow down the large feature gap. Further, upon the MFA module, we introduce the multi-granularity graph-convolution-based pseudo-label estimation module to enhance feature representations for distinguishing diverse identities. Extensive experimental results demonstrate the effectiveness of the proposed method, and show superior performance to state-of-the-art methods by a large margin on the CUHK-SYSU and PRW datasets. Code is available in the supplementary materials.

List of keywords

Computer Vision -> CV: Representation learning
Computer Vision -> CV: Image and video retrieval
Computer Vision -> CV: Recognition (object detection, categorization)

2320

P2P: Transforming from Point Supervision to Explicit Visual Prompt for Object Detection and Segmentation

Guangqian Guo, Dian Shao, Chenguang Zhu, Sha Meng, Xuan Wang, Shan Gao

Point-supervised vision tasks, including detection and segmentation, aiming to learn a network that transforms from points to pseudo labels, have attracted much attention in recent years. However, the lack of precise object size and boundary annotations in the point-supervised condition results in a large performance gap between point- and fully-supervised methods. In this paper, we propose a novel iterative learning framework, Point to Prompt (P2P), for point-supervised object detection and segmentation, with the key insight of transforming from point supervision to explicit visual prompt of the foundation model. The P2P is formulated as an iterative refinement process of two stages: Semantic Explicit Prompt Generation (SEPG) and Prompt Guided Spatial Refinement (PGSR). Specifically, SEPG serves as a prompt generator for generating semantic-explicit prompts from point input via a group-based learning strategy. In the PGSR stage, prompts guide the visual foundation model to further refine the object regions, by leveraging the outstanding generalization ability of the foundation model. The two stages are iterated multiple times to improve the quality of predictions progressively. Experimental results on multiple datasets demonstrate that P2P achieves SOTA performance in both detection and segmentation tasks, further narrowing the performance gap with fully-supervised methods. The source code and supplementary material can be found at https://github.com/guangqian-guo/P2P.

List of keywords

Machine Learning -> ML: Weakly supervised learning
Computer Vision -> CV: Recognition (object detection, categorization)
Computer Vision -> CV: Segmentation

2322

Bridging Generative and Discriminative Models for Unified Visual Perception with Diffusion Priors

Shiyin Dong, Mingrui Zhu, Kun Cheng, Nannan Wang, Xinbo Gao

The remarkable prowess of diffusion models in image generation has spurred efforts to extend their application beyond generative tasks. However, a persistent challenge exists in lacking a unified approach to apply diffusion models to visual perception tasks with diverse semantic granularity requirements. Our purpose is to establish a unified visual perception framework, capitalizing on the potential synergies between generative and discriminative models. In this paper, we propose Vermouth, a simple yet effective framework comprising a pre-trained Stable Diffusion (SD) model containing rich generative priors, a unified head (U-head) capable of integrating hierarchical representations, and an Adapted-Expert providing discriminative priors. Comprehensive investigations unveil potential characteristics of Vermouth, such as varying granularity of perception concealed in latent variables at distinct time steps and various U-net stages. We emphasize that there is no necessity for incorporating a heavyweight or intricate decoder to transform diffusion models into potent representation learners. Extensive comparative evaluations against tailored discriminative models showcase the efficacy of our approach on zero-shot sketch-based image retrieval (ZS-SBIR), few-shot classification, and open-vocabulary (OV) semantic segmentation tasks. The promising results demonstrate the potential of diffusion models as formidable learners, establishing their significance in furnishing informative and robust visual representations.

List of keywords

Computer Vision -> CV: Representation learning
Computer Vision -> CV: Recognition (object detection, categorization)
Computer Vision -> CV: Segmentation
Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning

2331

A Dataset and Model for Realistic License Plate Deblurring

Haoyan Gong, Yuzheng Feng, Zhenrong Zhang, Xianxu Hou, Jingxin Liu, Siqi Huang, Hongbin Liu

Vehicle license plate recognition is a crucial task in intelligent traffic management systems. However, the challenge of achieving accurate recognition persists due to motion blur from fast-moving vehicles. Despite the widespread use of image synthesis approaches in existing deblurring and recognition algorithms, their effectiveness in real-world scenarios remains unproven. To address this, we introduce the first large-scale license plate deblurring dataset named License Plate Blur (LPBlur), captured by a dual-camera system and processed through a post-processing pipeline to avoid misalignment issues. Then, we propose a License Plate Deblurring Generative Adversarial Network (LPDGAN) to tackle the license plate deblurring: 1) a Feature Fusion Module to integrate multi-scale latent codes; 2) a Text Reconstruction Module to restore structure through textual modality; 3) a Partition Discriminator Module to enhance the model’s perception of details in each letter. Extensive experiments validate the reliability of the LPBlur dataset for both model training and testing, showcasing that our proposed model outperforms other state-of-the-art motion deblurring methods in realistic license plate deblurring scenarios. The dataset and code are available at https://github.com/haoyGONG/LPDGAN.

List of keywords

Computer Vision -> CV: Adversarial learning, adversarial attack and defense methods
Computer Vision -> CV: Applications
Computer Vision -> CV: Image and video synthesis and generation

2348

LeMeViT: Efficient Vision Transformer with Learnable Meta Tokens for Remote Sensing Image Interpretation

Wentao Jiang, Jing Zhang, Di Wang, Qiming Zhang, Zengmao Wang, Bo Du

Due to spatial redundancy in remote sensing images, sparse tokens containing rich information are usually involved in self-attention (SA) to reduce the overall token numbers within the calculation, avoiding the high computational cost issue in Vision Transformers. However, such methods usually obtain sparse tokens by hand-crafted or parallel-unfriendly designs, posing a challenge to reach a better balance between efficiency and performance. Different from them, this paper proposes to use learnable meta tokens to formulate sparse tokens, which effectively learn key information while improving the inference speed. Technically, the meta tokens are first initialized from image tokens via cross-attention. Then, we propose Dual Cross-Attention (DCA) to promote information exchange between image tokens and meta tokens, where they serve as query and key (value) tokens alternately in a dual-branch structure, significantly reducing the computational complexity compared to self-attention. By employing DCA in the early stages with dense visual tokens, we obtain the hierarchical architecture LeMeViT with various sizes. Experimental results in classification and dense prediction tasks show that LeMeViT has a significant $1.7 \times$ speedup, fewer parameters, and competitive performance compared to the baseline models, and achieves a better trade-off between efficiency and performance. The code will be released.

List of keywords

Computer Vision -> CV: Representation learning
Computer Vision -> CV: Recognition (object detection, categorization)
Computer Vision -> CV: Segmentation

2354

Distribution-Independent Cell Type Identification for Single-Cell RNA-seq Data

Yuyao Zhai, Liang Chen, Minghua Deng

Automatic cell type annotation aims to transfer the label knowledge from label-abundant reference data to label-scarce target data, which makes encouraging progress in single-cell RNA-seq data analysis. While previous works have focused on classifying close-set cells and detecting open-set cells during testing, it is still essential to be able to classify unknown cell types as human beings. Additionally, few efforts have been devoted to addressing the challenge of common long-tail dilemma in cell type annotation data. Therefore, in this paper, we propose an innovative distribution-independent universal cell type identification framework called scDET from the perspective of autonomously equilibrated dual-consultative contrastive learning. Our model can generate fine-grained predictions for both close-set and open-set cell types in a long-tailed open-world environment. scDET consists of a contrastive-learning branch and a pseudo-labeling branch, which work collaboratively to provide interactive supervision. Specifically, the contrastive-learning branch provides reliable distribution estimation to regularize the predictions of the pseudo-labeling branch, which in turn guides itself through self-balanced knowledge transfer and a designed novel soft contrastive loss. Extensive experimental results on various evaluation datasets demonstrate the superior performance of scDET over other state-of-the-art single-cell clustering and annotation methods.

List of keywords

Multidisciplinary Topics and Applications -> MTA: Bioinformatics
Multidisciplinary Topics and Applications -> MTA: Other

2360

Efficient Event Stream Super-Resolution with Recursive Multi-Branch Fusion

Quanmin Liang, Zhilin Huang, Xiawu Zheng, Feidiao Yang, Jun Peng, Kai Huang, Yonghong Tian

Current Event Stream Super-Resolution (ESR) methods overlook the redundant and complementary information present in positive and negative events within the event stream, employing a direct mixing approach for super-resolution, which may lead to detail loss and inefficiency. To address these issues, we propose an efficient Recursive Multi-Branch Information Fusion Network (RMFNet) that separates positive and negative events for complementary information extraction, followed by mutual supplementation and refinement. Particularly, we introduce Feature Fusion Modules (FFM) and Feature Exchange Modules (FEM). FFM is designed for the fusion of contextual information within neighboring event streams, leveraging the coupling relationship between positive and negative events to alleviate the misleading of noises in the respective branches. FEM efficiently promotes the fusion and exchange of information between positive and negative branches, enabling superior local information enhancement and global information complementation. Experimental results demonstrate that our approach achieves over 17% and 31% improvement on synthetic and real datasets, accompanied by a 2.3x acceleration. Furthermore, we evaluate our method on two downstream event-driven applications, i.e., object recognition and video reconstruction, achieving remarkable results that outperform existing methods. Our code and Supplementary Material are available at https://github.com/Lqm26/RMFNet.

List of keywords

Computer Vision -> CV: Image and video synthesis and generation
Computer Vision -> CV: Other

2365

Parameterized Complexity of Kidney Exchange Revisited

Vaishali Surianarayanan, Daniel Lokshtanov, Ursula Hebert-Johnson, Chinmay Sonar

As of January 2023, there are more than 90,000 people on the national transplant waiting list in need of a kidney in the United States. These patients often have a friend or family member who is willing to donate, but whose kidney type might not be compatible. To help match these patients to suitable donors, the compatibility between patients and donors can be modeled as a directed graph. Specifically, in the Kidney Exchange problem, the input is a directed graph $G$, a subset $B$ of vertices (altruistic vertices), and two integers $l_p$ and $l_c$. An altruistic vertex is a donor who is not paired with a patient, and the remaining vertices are patient-donor pairs. Whenever a donor is compatible with a patient from a patient-donor pair, we place a directed edge from the donor vertex to the patient-donor pair. Here the donor vertex can be either altruistic or non-altruistic. The goal is to find a collection of vertex-disjoint cycles and paths covering the maximum number of patients such that each cycle has length at most $l_c$ and each path has length at most $l_p$ and begins at a vertex in $B$. The path and cycle lengths are bounded so that the surgeries can be performed simultaneously. Kidney Exchange has received a great deal of attention in recent years [IJCAI ’18, IJCAI ’22, IJCAI ’23, AAAI ’17, NeurIPS ’20, EC ’20]. We contribute to this line of work by closing two open problems from IJCAI ’18 and IJCAI ’22: “Is Kidney Exchange FPT when parameterized by (i) treewidth ($\omega$) of $G$ and (ii) the number of vertex types ($\theta$) in $G$?” Two vertices have the same vertex type if they have the same in- and out-neighborhoods. We show that Kidney Exchange is FPT parameterized by $\theta$ and W[1]-hard with respect to $\omega$. We also design a $4^tn^{\mathcal{O}(1)}$-time algorithm parameterized by $t$, the number of patients helped, significantly improving upon the previous state of the art of $161^tn^{\mathcal{O}(1)}$ [IJCAI ’22].
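
As an illustration of the vertex-type parameter defined in the abstract above, the following minimal sketch groups the vertices of a small directed graph by their (in-neighborhood, out-neighborhood) pair. The function name `vertex_types` and the toy edge list are hypothetical and are not taken from the paper; the sketch only instantiates the definition and is not the authors' algorithm.

```python
from collections import defaultdict


def vertex_types(n, edges):
    """Group vertices 0..n-1 of a directed graph by identical in- and out-neighborhoods."""
    in_nbrs = [set() for _ in range(n)]
    out_nbrs = [set() for _ in range(n)]
    for u, v in edges:
        out_nbrs[u].add(v)
        in_nbrs[v].add(u)

    # Two vertices share a type exactly when both neighborhoods coincide.
    groups = defaultdict(list)
    for v in range(n):
        key = (frozenset(in_nbrs[v]), frozenset(out_nbrs[v]))
        groups[key].append(v)
    return sorted(groups.values())


# Toy instance (hypothetical): vertices 0 and 1 are donors compatible with
# pairs 2 and 3, and both pairs are compatible with pair 4.
edges = [(0, 2), (0, 3), (1, 2), (1, 3), (2, 4), (3, 4)]
types = vertex_types(5, edges)
theta = len(types)  # number of vertex types in this toy graph
```

In this toy graph the five vertices collapse into three types, so the parameter is much smaller than the vertex count whenever many donors or pairs are interchangeable.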

List of keywords

Agent-based and Multi-agent Systems -> MAS: Resource allocation
Constraint Satisfaction and Optimization -> CSO: Constraint optimization problems
Game Theory and Economic Paradigms -> GTEP: Auctions and market-based systems
Game Theory and Economic Paradigms -> GTEP: Computational social choice

2371

FreqFormer: Frequency-aware Transformer for Lightweight Image Super-resolution

Tao Dai, Jianping Wang, Hang Guo, Jinmin Li, Jinbao Wang, Zexuan Zhu

Transformer-based models have been widely and successfully used in various low-level vision tasks, and have achieved remarkable performance in single image super-resolution (SR). Despite the significant progress in SR, Transformer-based SR methods (e.g., SwinIR) still suffer from the problems of heavy computation cost and low-frequency preference, while ignoring the reconstruction of rich high-frequency information, hence hindering the representational power of Transformers. To address these issues, in this paper, we propose a novel Frequency-aware Transformer (FreqFormer) for lightweight image SR. Specifically, a Frequency Division Module (FDM) is first introduced to separately handle high- and low-frequency information in a divide-and-conquer manner. Moreover, we present a Frequency-aware Transformer Block (FTB) to extract both spatial frequency attention and channel transposed attention to recover high-frequency details. Extensive experimental results on public datasets demonstrate the superiority of our FreqFormer over state-of-the-art SR methods in terms of both quantitative metrics and visual quality. Code and models are available at https://github.com/JPWang-CS/FreqFormer.

List of keywords

Computer Vision -> CV: Applications
Computer Vision -> CV: Image and video synthesis and generation
Computer Vision -> CV: Interpretability and transparency
Computer Vision -> CV: Machine learning for vision

2383

Machine Unlearning via Null Space Calibration

Huiqiang Chen, Tianqing Zhu, Xin Yu, Wanlei Zhou

Machine unlearning aims at enabling models to forget specific data instances when receiving deletion requests. Current research centers on efficient unlearning to erase the influence of data from the model and neglects the subsequent impacts on the remaining data. Consequently, existing unlearning algorithms degrade the model’s performance after unlearning, known as over-unlearning. This paper addresses this critical yet under-explored issue by introducing machine Unlearning via Null Space Calibration (UNSC), which can accurately unlearn target samples without over-unlearning. On the contrary, by calibrating the decision space during unlearning, UNSC can significantly improve the model’s performance on the remaining samples. In particular, our approach hinges on confining the unlearning process to a specified null space tailored to the remaining samples, which is augmented by strategically pseudo-labeling the unlearning samples. Comparative analyses against several established baselines affirm the superiority of our approach.

List of keywords

AI Ethics, Trust, Fairness -> ETF: Trustworthy AI
AI Ethics, Trust, Fairness -> ETF: Accountability
AI Ethics, Trust, Fairness -> ETF: Ethical, legal and societal issues
AI Ethics, Trust, Fairness -> ETF: Other

2388

Encoding Auxiliary Information to Restore Compressed Point Cloud Geometry

Gexin Liu, Jiahao Zhu, Dandan Ding, Zhan Ma

The standardized Geometry-based Point Cloud Compression (G-PCC) suffers from limited coding performance and low-quality reconstruction. To address this, we propose AuxGR, a performance-complexity tradeoff solution for point cloud geometry restoration: leveraging auxiliary bitstream to enhance the quality of G-PCC compressed point cloud geometry. This auxiliary bitstream efficiently encapsulates spatio-temporal information. For static coding, we perform paired information embedding (PIE) on the G-PCC decoded frame by employing target convolutions from its original counterpart, producing an auxiliary bitstream containing abundant original information. For dynamic coding, in addition to PIE, we propose temporal information embedding (TIE) to capture motion information between the previously restored and the current G-PCC decoded frames. TIE applies target kNN attention between them, which ensures the temporal neighborhood construction for each point and implicitly represents motions. Due to the similarity across temporal frames, only the residuals between TIE and PIE outputs are compressed as auxiliary bitstream. Experimental results demonstrate that AuxGR notably outperforms existing methods in both static and dynamic coding scenarios. Moreover, our framework enables the flexible incorporation of auxiliary information under computation constraints, which is attractive to real applications.

List of keywords

Data Mining -> DM: Mining spatial and/or temporal data

2397

OTOcc: Optimal Transport for Occupancy Prediction

Pengteng Li, Ying He, F Richard Yu, Pinhao Song, Xingchen Zhou, Guang Zhou

The autonomous driving community is highly interested in 3D occupancy prediction due to its outstanding geometric perception and object recognition capabilities. However, previous methods are limited to existing semantic conversion mechanisms for solving the sparse ground-truth problem, causing excessive computational demands and sub-optimal voxel representation. To tackle the above limitations, we propose OTOcc, a novel 3D occupancy prediction framework that models semantic conversion from 2D pixels to 3D voxels as an Optimal Transport (OT) problem, offering accurate semantic mapping to adapt to sparse scenarios without attention or depth estimation. Specifically, the unit transportation cost between each demander (voxel) and supplier (pixel) pair is defined as the weighted occupancy prediction loss. Then, we utilize the Sinkhorn-Knopp Iteration to find the best mapping matrices with minimal transportation costs. To reduce the computational cost, we propose a block reading technique with multi-perspective feature representation, which also brings fine-grained scene understanding. Extensive experiments show that OTOcc not only has competitive prediction performance but also achieves a reduction of more than 4.58% in computational overhead compared to state-of-the-art methods.
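
For readers unfamiliar with the Sinkhorn-Knopp iteration mentioned in the abstract above, here is a minimal generic sketch on a toy two-supplier, two-demander problem. It assumes the common entropy-regularized formulation with regularization strength `reg`; the function name, the toy cost matrix, and the marginals are hypothetical, and the cost is not the paper's weighted occupancy prediction loss.

```python
import math


def sinkhorn(cost, supply, demand, reg=0.1, iters=500):
    """Approximate an optimal transport plan by alternating row and column scaling."""
    n, m = len(supply), len(demand)
    # Gibbs kernel: cheap supplier-demander pairs get large entries.
    kernel = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Rescale rows so that each supplier ships exactly its supply.
        u = [supply[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        # Rescale columns so that each demander receives exactly its demand.
        v = [demand[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy problem (hypothetical): matching supplier i to demander i is free,
# while the cross pairing costs 1.
cost = [[0.0, 1.0], [1.0, 0.0]]
plan = sinkhorn(cost, supply=[0.5, 0.5], demand=[0.5, 0.5])
```

The returned plan concentrates almost all mass on the zero-cost diagonal while its row and column sums match the prescribed supply and demand.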

List of keywords

Computer Vision -> CV: 3D computer vision
Computer Vision -> CV: Applications
Computer Vision -> CV: Machine learning for vision

2403

SCAT: A Time Series Forecasting with Spectral Central Alternating Transformers

Chengjie Zhou, Chao Che, Pengfei Wang, Qiang Zhang

Time series forecasting has essential applications across various domains. For instance, forecasting power time series can optimize energy usage and bolster grid stability and reliability. Existing models based on transformer architecture are limited to classical design, ignoring the impact of spatial information and noise on model architecture design. Therefore, we propose an atypical design of Transformer-based models for multivariate time series forecasting. This design consists of two critical components: (i) spectral clustering center of time series employed as the focal point for attention computation; (ii) alternating attention mechanism wherein each query transformer is compatible with spectral clustering centers, executing attention at the sequence level instead of the token level. The alternating design has a two-fold benefit: firstly, it eliminates the uncertainty noise present in the dependent variable sequence of the channel input, and secondly, it incorporates the Euclidean distance to mitigate the impact of extreme values on the attention matrix, thereby aligning predictions more closely to the sequence’s natural progression. Experiments on ten real-world datasets, encompassing Wind, Electricity, Weather, and others, demonstrate that our Spectral Central Alternating Transformer (SCAT) outperforms state-of-the-art methods (SOTA) by an average of 42.8% in prediction performance in power time series forecasting.

List of keywords

Machine Learning -> ML: Time series and data streams

2414

Laying the Foundations for Solving FOND HTN Problems: Grounding, Search, Heuristics (and Benchmark Problems)

Mohammad Yousefi, Pascal Bercher

Uncertainty in planning has been an active area of research for many years. However, little effort has been made to develop systems that can deal with uncertain outcomes in the hierarchical setting. Building upon the recent advancements in formalising Fully Observable Non-Deterministic (FOND) Hierarchical Task Network (HTN) planning, we aim to bridge this gap by presenting a search algorithm, along with a compilation that relaxes a FOND HTN problem to a deterministic one. This allows the utilisation of existing heuristics and grounders in the deterministic HTN planning literature. Furthermore, we extend the Hierarchical Domain Description Language (HDDL) to include uncertain effects, and introduce multiple benchmark domains in the extended language for our, as well as future, empirical evaluations.

List of keywords

Planning and Scheduling -> PS: Hierarchical planning
Planning and Scheduling -> PS: Planning algorithms
Planning and Scheduling -> PS: Planning under uncertainty
Planning and Scheduling -> PS: Search in planning and scheduling

2426

AnchorGT: Efficient and Flexible Attention Architecture for Scalable Graph Transformers

Wenhao Zhu, Guojie Song, Liang Wang, Shaoguo Liu

Graph Transformers (GTs) have significantly advanced the field of graph representation learning by overcoming the limitations of message-passing graph neural networks (GNNs) and demonstrating promising performance and expressive power. However, the quadratic complexity of the self-attention mechanism in GTs has limited their scalability, and previous approaches to address this issue often suffer from expressiveness degradation or lack of versatility. To address this issue, we propose AnchorGT, a novel attention architecture for GTs with a global receptive field and almost linear complexity, which serves as a flexible building block to improve the scalability of a wide range of GT models. Inspired by anchor-based GNNs, we employ a structurally important $k$-dominating node set as anchors and design an attention mechanism that focuses on the relationship between individual nodes and anchors, while retaining the global receptive field for all nodes. With its intuitive design, AnchorGT can easily replace the attention module in various GT models with different network architectures and structural encodings, resulting in reduced computational overhead without sacrificing performance. In addition, we theoretically prove that AnchorGT attention can be strictly more expressive than the Weisfeiler-Lehman test, showing its superiority in representing graph structures. Our experiments on three state-of-the-art GT models demonstrate that their AnchorGT variants can achieve similar results while being faster and significantly more memory efficient.

List of keywords

Machine Learning -> ML: Sequence and graph learning

2434

Diffusion Mask-Driven Visual-language Tracking

Guangtong Zhang, Bineng Zhong, Qihua Liang, Zhiyi Mo, Shuxiang Song

Most existing visual-language trackers greatly rely on the initial language descriptions of a target object to extract their multi-modal features. However, the initial language descriptions are often inaccurate in a highly time-varying video sequence and thus greatly deteriorate their tracking performance due to the low quality of extracted multi-modal features. To address this challenge, we propose a Diffusion Mask-Driven Visual-language Tracker (DMTrack) based on a diffusion model. Confronting the issue of low-quality multi-modal features due to inaccurate language descriptions, we leverage the diffusion model to capture high-quality semantic information from multi-modal features and transform it into target mask features. During the training phase, we further enhance the diffusion model’s perception of pixel-level features by calculating the loss between the target mask features and the ground truth masks. Additionally, we perform joint localization of the target using both target mask features and visual features, instead of relying solely on multi-modal features for localization. Through extensive experiments on four tracking benchmarks (i.e., LaSOT, TNL2K, LaSOText, and OTB-Lang), we validate that our proposed Diffusion Mask-Driven Visual-language Tracker can improve the robustness and effectiveness of the model.

List of keywords

Computer Vision -> CV: Motion and tracking

2447

Unified Unsupervised Salient Object Detection via Knowledge Transfer

Yao Yuan, Wutao Liu, Pan Gao, Qun Dai, Jie Qin

Recently, unsupervised salient object detection (USOD) has gained increasing attention due to its annotation-free nature. However, current methods mainly focus on specific tasks such as RGB and RGB-D, neglecting the potential for task migration. In this paper, we propose a unified USOD framework for generic USOD tasks. Firstly, we propose a Progressive Curriculum Learning-based Saliency Distilling (PCL-SD) mechanism to extract saliency cues from a pre-trained deep network. This mechanism starts with easy samples and progressively moves towards harder ones, to avoid initial interference caused by hard samples. Afterwards, the obtained saliency cues are utilized to train a saliency detector, and we employ a Self-rectify Pseudo-label Refinement (SPR) mechanism to improve the quality of pseudo-labels. Finally, an adapter-tuning method is devised to transfer the acquired saliency knowledge, leveraging shared knowledge to attain superior transferring performance on the target tasks. Extensive experiments on five representative SOD tasks confirm the effectiveness and feasibility of our proposed method. Code and supplement materials are available at https://github.com/I2-Multimedia-Lab/A2S-v3.

List of keywords

Computer Vision -> CV: Recognition (object detection, categorization)
Computer Vision -> CV: Scene analysis and understanding
Machine Learning -> ML: Unsupervised learning

2454

How to Learn Domain-Invariant Representations for Visual Reinforcement Learning: An Information-Theoretical Perspective

Shuo Wang, Zhihao Wu, Jinwen Wang, Xiaobo Hu, Youfang Lin, Kai Lv

Despite the impressive success in visual control challenges, Visual Reinforcement Learning (VRL) policies have struggled to generalize to other scenarios. Existing works attempt to empirically improve the generalization capability, lacking theoretical support. In this work, we explore how to learn domain-invariant representations for VRL from an information-theoretical perspective. Specifically, we identify three Mutual Information (MI) terms. These terms highlight that a robust representation should preserve domain invariant information (return and dynamic transition) under significant observation perturbation. Furthermore, we relax the MI terms to derive three components for implementing a practical Mutual Information-based Invariant Representation (MIIR) algorithm for VRL. Extensive experiments demonstrate that MIIR achieves state-of-the-art generalization performance and the best sample efficiency in the DeepMind Control suite, Robotic Manipulation, and Carla.

List of keywords

Computer Vision -> CV: Embodied vision: Active agents, simulation

2462

Improved Parallel Algorithm for Non-Monotone Submodular Maximization under Knapsack Constraint

Tan Tran, Canh Pham, Dung Ha, Phuong Pham

This work proposes an efficient parallel algorithm for the problem of non-monotone submodular maximization under a knapsack constraint over a ground set of size $n$. Our algorithm improves the best approximation factor of the existing parallel one from $8+\epsilon$ to $7+\epsilon$ with $O(\log n)$ adaptive complexity. The key idea of our approach is to create an alternate threshold algorithmic framework. This new strategy alternately constructs two disjoint candidate solutions within a constant number of sequential rounds. Then, the algorithm boosts solution quality without sacrificing the adaptive complexity. Extensive experimental studies on three applications, Revenue Maximization, Image Summarization, and Maximum Weighted Cut, show that our algorithm not only significantly increases solution quality but also requires comparable adaptivity to state-of-the-art algorithms.

List of keywords

Constraint Satisfaction and Optimization -> CSO: Constraint optimization problems
Constraint Satisfaction and Optimization -> CSO: Applications
Constraint Satisfaction and Optimization -> CSO: Constraint learning and acquisition
Data Mining -> DM: Big data and scalability

2476

Revealing the Two Sides of Data Augmentation: An Asymmetric Distillation-based Win-Win Solution for Open-Set Recognition

Yunbing Jia, Xiaoyu Kong, Fan Tang, Yixing Gao, Weiming Dong, Yi Yang

In this paper, we reveal the two sides of data augmentation: enhancements in closed-set recognition correlate with a significant decrease in open-set recognition. Through empirical investigation, we find that multi-sample-based augmentations would contribute to reducing feature discrimination, thereby diminishing the open-set criteria. Although knowledge distillation could impair the feature via imitation, the mixed feature with ambiguous semantics hinders the distillation. To this end, we propose an asymmetric distillation framework by feeding the teacher model extra raw data to enlarge the benefit of the teacher. Moreover, a joint mutual information loss and a selective relabel strategy are utilized to alleviate the influence of hard mixed samples. Our method successfully mitigates the decline in open-set recognition and outperforms SOTAs by 2%~3% AUROC on the Tiny-ImageNet dataset, and experiments on the large-scale dataset ImageNet-21K demonstrate the generalization of our method.

List of keywords

Computer Vision -> CV: Representation learning
Computer Vision -> CV: Structural and model-based approaches, knowledge representation and reasoning

2484

CMACE: CMAES-based Counterfactual Explanations for Black-box Models

Xudong Yin, Yao Yang

Explanatory Artificial Intelligence plays a vital role in machine learning, due to its widespread application in decision-making scenarios, e.g., credit lending. Counterfactual Explanation (CFE) is a new kind of explanatory method that involves asking “what if”, i.e., what would have happened if the model inputs had changed slightly. To answer the question, Counterfactual Explanation aims at finding a minimum perturbation in model inputs leading to a different model decision. Compared with model-agnostic approaches, model-specific CFE approaches designed only for a specific type of model usually have better performance in finding optimal counterfactual perturbations, owing to access to the inner workings of models. To deal with this dilemma, this work first proposes CMAES-based Counterfactual Explanations (CMACE): an effective model-agnostic counterfactual generating approach based on Covariance Matrix Adaptation Evolution Strategy (CMA-ES) and a warm starting scheme that provides good initialization of the counterfactual’s mean and covariance parameters for CMA-ES, taking advantage of prior information of training samples. CMACE significantly outperforms another state-of-the-art (SOTA) model-agnostic approach (Bayesian Counterfactual Generator, BayCon) with various experimental settings. Extensive experiments also demonstrate that CMACE is superior to a SOTA model-specific approach (Flexible Optimizable Counterfactual Explanations for Tree Ensembles, FOCUS) that is designed for tree-based models using gradient-based optimization.

List of keywords

AI Ethics, Trust, Fairness -> ETF: Explainability and interpretability
AI Ethics, Trust, Fairness -> ETF: Trustworthy AI
Machine Learning -> ML: Optimization
Machine Learning -> ML: Trustworthy machine learning

2486

Learning Low-Rank Tensor Cores with Probabilistic ℓ0-Regularized Rank Selection for Model Compression

Tianxiao Cao, Lu Sun, Canh Hao Nguyen, Hiroshi Mamitsuka

Compressing deep neural networks is of great importance for real-world applications on resource-constrained devices. Tensor decomposition is one promising answer that retains the functionality and most of the expressive power of the original deep models by replacing the weights with their decomposed cores. Decomposition with optimal ranks can achieve a good compression-accuracy trade-off, but it is expensive to optimize due to its discrete and combinatorial nature. A common practice is to set all ranks equal and tune one hyperparameter, but it may significantly harm the flexibility and generalization. In this paper, we propose a novel automatic rank selection method for deep model compression that allows learning model weights and decomposition ranks simultaneously. We propose to penalize the ℓ0 (quasi-)norm of the slices of decomposed tensor cores during model training. To avoid combinatorial optimization, we develop a probabilistic formulation and apply an approximate Bernoulli gate to each of the slices of tensor cores, which can be implemented in an end-to-end and scalable framework via gradient descent. It enables the automatic rank selection to be incorporated with arbitrary tensor decompositions and neural network layers such as linear layers, convolutional layers, and embedding layers. Comprehensive experiments on various tasks, including image classification, text sentiment classification, and neural machine translation, demonstrate the superior effectiveness of the proposed method over baselines.

List of keywords

Machine Learning -> ML: Matrix/tensor methods
Machine Learning -> ML: Learning sparse models

2495

A Complete Landscape of EFX Allocations on Graphs: Goods, Chores and Mixed Manna

Bo Li, Minming Li, Tianze Wei, Yu Zhou

We study envy-free up to any item (EFX) allocations on graphs where vertices represent agents and edges represent items. An agent only cares about the items that are incident to her and all other items have zero marginal value to her. Christodoulou et al. [EC, 2023] proposed this setting and studied the case of goods where each edge is liked by both its endpoints. We extend their results to the case of mixed manna where an item may be liked or disliked by its endpoints. In our setting, an agent has an arbitrary valuation over her incident items such that the items she likes have non-negative marginal values to her and those she dislikes have non-positive marginal values. We provide a complete study of the four variants of EFX for mixed manna in the literature (i.e., $\mathrm{EFX}_0^0$, $\mathrm{EFX}^0_-$, $\mathrm{EFX}^+_0$, and $\mathrm{EFX}^+_-$), which differ by whether the removed item can have zero marginal value. We prove that an $\mathrm{EFX}_0^0$ allocation may not exist and determining its existence is NP-complete, while an allocation that satisfies any of the other three notions always exists and can be computed in polynomial time. We also prove that an orientation (i.e., a special allocation where each edge must be allocated to one of its endpoints) that satisfies any of the four notions may not exist, and determining its existence is NP-complete. To complement the hardness results for orientations, we study some basic graphs, for which whether orientations that satisfy the four notions exist can be determined in polynomial time.
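
To make the graph setting concrete, here is a minimal sketch of an EFX check for the goods-only case. It is an illustration under stated assumptions, not the paper's algorithm: valuations are assumed additive (the abstract allows arbitrary valuations), every item is a good, and the names `value`, `is_efx`, `weights`, `orientation`, and `lopsided` are hypothetical. The condition checked is the standard one: agent i does not envy agent j after any single item is removed from j's bundle. The four mixed-manna variants are not modeled.

```python
from itertools import permutations

def value(agent, bundle, weights):
    # Additive value (assumption): an agent only values edges incident to her.
    return sum(weights[(agent, edge)] for edge in bundle if agent in edge)

def is_efx(agents, allocation, weights):
    # Check v_i(A_i) >= v_i(A_j minus {g}) for every pair i != j and every g in A_j.
    for i, j in permutations(agents, 2):
        own = value(i, allocation[i], weights)
        for g in allocation[j]:
            rest = [edge for edge in allocation[j] if edge != g]
            if own < value(i, rest, weights):
                return False
    return True

# Hypothetical triangle: three agents, three edges, every incident edge worth 1.
agents = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c"), ("a", "c")]
weights = {(v, e): 1 for e in edges for v in e}

# An orientation: each edge goes to one of its endpoints.
orientation = {"a": [("a", "b")], "b": [("b", "c")], "c": [("a", "c")]}
# A lopsided allocation: agent "a" receives every edge.
lopsided = {"a": list(edges), "b": [], "c": []}

print(is_efx(agents, orientation, weights))
print(is_efx(agents, lopsided, weights))
```

On this toy instance the orientation passes the check, while the lopsided allocation fails because agent b still envies agent a after the edge (a, b) is removed.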

List of keywords

Game Theory and Economic Paradigms -> GTEP: Fair division
Game Theory and Economic Paradigms -> GTEP: Computational social choice

2504

G2LTraj: A Global-to-Local Generation Approach for Trajectory Prediction

Zhanwei Zhang, Zishuo Hua, Minghao Chen, Wei Lu, Binbin Lin, Deng Cai, Wenxiao Wang

Predicting future trajectories of traffic agents accurately holds substantial importance in various applications such as autonomous driving. Previous methods commonly infer all future steps of an agent either recursively or simultaneously. However, the recursive strategy suffers from the accumulated error, while the simultaneous strategy overlooks the constraints among future steps, resulting in kinematically infeasible predictions. To address these issues, in this paper, we propose G2LTraj, a plug-and-play global-to-local generation approach for trajectory prediction. Specifically, we generate a series of global key steps that uniformly cover the entire future time range. Subsequently, the local intermediate steps between the adjacent key steps are recursively filled in. In this way, we prevent the accumulated error from propagating beyond the adjacent key steps. Moreover, to boost the kinematical feasibility, we not only introduce the spatial constraints among key steps but also strengthen the temporal constraints among the intermediate steps. Finally, to ensure the optimal granularity of key steps, we design a selectable granularity strategy that caters to each predicted trajectory. Our G2LTraj significantly improves the performance of seven existing trajectory predictors across the ETH, UCY and nuScenes datasets. Experimental results demonstrate its effectiveness. Code will be available at https://github.com/Zhanwei-Z/G2LTraj.

List of keywords

Data Mining -> DM: Mining spatial and/or temporal data
Computer Vision -> CV: Motion and tracking
Machine Learning -> ML: Time series and data streams

2505

TFCD: Towards Multi-modal Sarcasm Detection via Training-Free Counterfactual Debiasing

Zhihong Zhu, Xianwei Zhuang, Yunyan Zhang, Derong Xu, Guimin Hu, Xian Wu, Yefeng Zheng

Multi-modal sarcasm detection (MSD), which aims to identify whether a given sample with multi-modal information (i.e., text and image) is sarcastic, has garnered widespread attention. Recent approaches focus on designing sophisticated architectures or mechanisms to extract sarcastic cues from entire or local image and text features. Nevertheless, a long-overlooked issue is that the current MSD task invariably suffers from unintended dataset biases, especially the statistical label bias and sarcasmless word bias. Concretely, such harmful biases are confounders that may mislead existing models to learn spurious correlations, significantly limiting models’ performance. To tackle this issue, this paper proposes a Training-Free Counterfactual Debiasing framework TFCD, which first formulates the causalities among variables in MSD via a tailored causal graph. Then, TFCD extracts biases from the conventionally-trained model by generating counterfactual utterances and contexts and mitigates them using element-wise subtraction. Extensive experiments on two benchmarks demonstrate the effectiveness of the proposed TFCD. Remarkably, TFCD requires neither data balancing nor model modifications, and thus can be seamlessly integrated into diverse state-of-the-art approaches and achieve considerable improvement margins.

List of keywords

Natural Language Processing -> NLP: Sentiment analysis, stylistic analysis, and argument mining

2519

Partial Optimal Transport Based Out-of-Distribution Detection for Open-Set Semi-Supervised Learning

Yilong Ren, Chuanwen Feng, Xike Xie, S. Kevin Zhou

Semi-supervised learning (SSL) is a machine learning paradigm that utilizes both labeled and unlabeled data to enhance the performance of learning tasks. However, SSL methods operate under the assumption that the label spaces of labeled and unlabeled data are identical, which may not hold in open-world applications. In such scenarios, the unlabeled data may contain novel categories that were not present in the labeled training data, essentially outliers. This specific challenge is referred to as the Open-set Semi-supervised Learning (OSSL) problem. In OSSL, a pivotal concern is the detection of out-of-distribution (OOD) samples within unlabeled data. Existing methods often struggle to provide effective OOD detection strategies, especially when dealing with datasets comprising a large number of training categories. In response to this challenge, we model the OOD detection problem in OSSL as a partial optimal transport (POT) problem. With POT theory, we devise a mass score function to measure the likelihood of a sample being an outlier, which enables a binary classifier for OOD detection. Further, we put forward an OOD loss, enabling the seamless integration of the binary classifier and off-the-shelf SSL methods under OSSL settings, all within an end-to-end training framework. We extensively evaluate our proposal under various datasets and OSSL configurations, consistently demonstrating the superior performance of our proposal. Codes are available at https://github.com/ryl0427/Code_for_POT_OSSL.

List of keywords

Machine Learning -> ML: Semi-supervised learning
Machine Learning -> ML: Optimization
Machine Learning -> ML: Robustness

2536

PDF-MVQA: A Comprehensive Dataset for Investigating Multimodal Information Retrieval in PDF-based Visual Question Answering

Yihao Ding, Kaixuan Ren, Siwen Luo, Jiabin Huang, Soyeon Han

Document Question Answering (QA) presents a challenge in understanding visually-rich documents (VRD), particularly those dominated by lengthy textual content like research journal articles. Existing studies primarily focus on real-world documents with sparse text, while challenges persist in comprehending the hierarchical semantic relations among multiple pages to locate multimodal components. To address this gap, we propose PDF-MVQA, which is tailored for research journal articles, encompassing multiple pages and multimodal information retrieval. Unlike traditional machine reading comprehension (MRC) tasks, our approach aims to retrieve entire paragraphs containing answers or visually rich document entities like tables and figures. Our contributions include the introduction of a comprehensive PDF Document VQA dataset, allowing the examination of semantically hierarchical layout structures in text-dominant documents. We also present new VRD-QA frameworks designed to grasp textual contents and relations among document layouts simultaneously, extending page-level understanding to the entire multi-page document. Through this work, we aim to enhance the capabilities of existing vision-and-language models in handling challenges posed by text-dominant documents in VRD-QA.

List of keywords

Natural Language Processing -> NLP: Applications
Natural Language Processing -> NLP: Resources and evaluation

2537

A Deep Probabilistic Spatiotemporal Framework for Dynamic Graph Representation Learning with Application to Brain Disorder Identification

Sin-Yee Yap, Junn Yong Loo, Fuad Noman, Raphael Phan, David Dowe, Adeel Razi, Chee-Ming Ting

Recent applications of pattern recognition techniques on brain connectome classification using functional connectivity (FC) are shifting towards acknowledging the non-Euclidean topology and causal dynamics of brain connectivity across time. In this paper, a deep spatiotemporal variational Bayes (DSVB) framework is proposed to learn time-varying topological structures in dynamic FC networks for identifying autism spectrum disorder (ASD) in human participants. The framework incorporates a spatial-aware recurrent neural network with an attention-based message passing scheme to capture rich spatiotemporal patterns across dynamic FC networks. To overcome model overfitting on limited training datasets, an adversarial training strategy is introduced to learn graph embedding models that generalize well to unseen brain networks. Evaluation on the ABIDE resting-state functional magnetic resonance imaging dataset shows that our proposed framework substantially outperforms state-of-the-art methods in identifying patients with ASD. Dynamic FC analyses with DSVB-learned embeddings reveal apparent group differences between ASD and healthy controls in brain network connectivity patterns and switching dynamics of brain states.

List of keywords

Machine Learning -> ML: Probabilistic machine learning
Machine Learning -> ML: Adversarial machine learning
Machine Learning -> ML: Learning graphical models
Multidisciplinary Topics and Applications -> MTA: Health and medicine

2540

MARS: Multimodal Active Robotic Sensing for Articulated Characterization

Hongliang Zeng, Ping Zhang, Chengjiong Wu, Jiahua Wang, Tingyu Ye, Fang Li

Precise perception of articulated objects is vital for empowering service robots. Recent studies mainly focus on point cloud, a single-modal approach, often neglecting vital texture and lighting details and assuming ideal conditions like optimal viewpoints, unrepresentative of real-world scenarios. To address these limitations, we introduce MARS, a novel framework for articulated object characterization. It features a multi-modal fusion module utilizing multi-scale RGB features to enhance point cloud features, coupled with reinforcement learning-based active sensing for autonomous optimization of observation viewpoints. In experiments conducted with various articulated object instances from the PartNet-Mobility dataset, our method outperformed current state-of-the-art methods in joint parameter estimation accuracy. Additionally, through active sensing, MARS further reduces errors, demonstrating enhanced efficiency in handling suboptimal viewpoints. Furthermore, our method effectively generalizes to real-world articulated objects, enhancing robot interactions. Code is available at https://github.com/robhlzeng/MARS.

List of keywords

Computer Vision -> CV: 3D computer vision
Computer Vision -> CV: Multimodal learning
Robotics -> ROB: Perception
Robotics -> ROB: Manipulation

2565

Graph Attention Network with High-Order Neighbor Information Propagation for Social Recommendation

Fei Xiong, Haoran Sun, Guixun Luo, Shirui Pan, Meikang Qiu, Liang Wang

In recommender systems, graph neural networks (GNN) can integrate interactions between users and items with their attributes, which makes GNN-based methods more powerful. However, directly stacking multiple layers in a graph neural network can easily lead to over-smoothing, hence recommendation systems based on graph neural networks typically underutilize higher-order neighborhoods in their learning. Although some heterogeneous graph random walk methods based on meta-paths can achieve higher-order aggregation, the focus is predominantly on the nodes at the ends of the paths. Moreover, these methods require manually defined meta-paths, which limits the model’s expressiveness and flexibility. Furthermore, path encoding in graph neural networks usually focuses only on the sequence leading to the target node. However, real-world interactions often do not follow this strict sequence, limiting the predictive performance of sequence-based network models. These problems prevent GNN-based methods from being fully effective. We propose a Graph Attention network with Information Propagation path aggregation for Social Recommendation (GAIPSRec). Firstly, we propose a universal heterogeneous graph sampling framework that does not require manually defining meta-paths for path sampling, thereby offering greater flexibility. Moreover, our method takes into account all nodes on the aggregation path and is capable of learning information from higher-order neighbors without succumbing to over-smoothing. Finally, our method utilizes a gate mechanism to fuse sequential and non-sequential dependence in encoding path instances, allowing a more holistic view of the data. Extensive experiments on real-world datasets show that our proposed GAIPSRec improves the performance significantly and outperforms state-of-the-art methods.

List of keywords

Data Mining -> DM: Mining graphs
Data Mining -> DM: Recommender systems

2594

C3L: Content Correlated Vision-Language Instruction Tuning Data Generation via Contrastive Learning

Ji Ma, Wei Suo, Peng Wang, Yanning Zhang

Vision-Language Instruction Tuning (VLIT) is a critical training phase for Large Vision-Language Models (LVLMs). With the improving capabilities of open-source LVLMs, researchers have increasingly turned to generating VLIT data by using open-source LVLMs and achieved significant progress. However, such data generation approaches are bottlenecked by the following challenges: 1) Since multi-modal models tend to be influenced by prior language knowledge, directly using LVLMs to generate VLIT data would inevitably lead to low content relevance between generated data and images. 2) To improve the ability of the models to generate VLIT data, previous methods have incorporated an additional training phase to boost the generative capacity. This process hurts the generalization of the models to unseen inputs (i.e., the “exposure bias” problem). In this paper, we propose a new Content Correlated VLIT data generation via Contrastive Learning (C3L). Specifically, we design a new content relevance module which enhances the content relevance between VLIT data and images by computing Image Instruction Correspondence Scores S(I2C). Moreover, a contrastive learning module is introduced to further boost the VLIT data generation capability of the LVLMs. A large number of automatic measures on four benchmarks show the effectiveness of our method.

List of keywords

Computer Vision -> CV: Vision, language and reasoning
Computer Vision -> CV: Multimodal learning
Natural Language Processing -> NLP: Language models

2608

FineFMPL: Fine-grained Feature Mining Prompt Learning for Few-Shot Class Incremental Learning

Hongbo Sun, Jiahuan Zhou, Xiangteng He, Jinglin Xu, Yuxin Peng

Few-shot class incremental learning (FSCIL) aims to continually learn new classes with few training samples without forgetting already learned old classes. Existing FSCIL methods generally fix the backbone network in incremental sessions to achieve a balance between suppressing forgetting old classes and learning new classes. However, the fixed backbone network causes insufficient learning of new classes from a few samples. Benefiting from the powerful visual and textual understanding ability of vision-language (VL) models, we propose a fine-grained feature mining prompt learning (FineFMPL) approach to adapt the VL model to comprehensively learn and memorize fine-grained discriminative information of classes for facilitating FSCIL. Concretely, the visual probe prompt is first proposed to guide the vision encoder to extract global-level coarse-grained features and object-level fine-grained features, and visual prototypes are preserved based on image patch significance, which contains the discriminative characteristics exclusive to the categories. Secondly, the textual context prompt is constructed by cross-modal mapping of visual prototypes, feeding into the text encoder to memorize the class information as textual prototypes. Finally, integrating visual and textual prototypes based on fine-grained feature mining into the model improves the recognition performance of all classes in FSCIL. Extensive experiments on three benchmark datasets demonstrate that our FineFMPL achieves new state-of-the-art.

List of keywords

Computer Vision -> CV: Recognition (object detection, categorization)

2616

Joint Multimodal Aspect Sentiment Analysis with Aspect Enhancement and Syntactic Adaptive Learning

Linlin Zhu, Heli Sun, Qunshu Gao, Tingzhou Yi, Liang He

As an important task in sentiment analysis, joint multimodal aspect sentiment analysis (JMASA) has received increasing attention in recent years. However, previous approaches either i) directly fuse multimodal data without fully exploiting the correlation between multimodal input data, or ii) equally utilize the dependencies of words in the text for sentiment analysis, ignoring the differences in the importance of different words. To address these limitations, we propose a joint multimodal sentiment analysis method based on Aspect Enhancement and Syntactic Adaptive Learning (AESAL). Specifically, we construct an aspect enhancement pre-training task to enable the model to fully learn the correlation of aspects between multimodal input data. In order to capture the differences in the importance of different words in the text, we design a syntactic adaptive learning mechanism. First, we construct different syntactic dependency graphs based on the distance between words to learn global and local information in the text. Second, we use a multi-channel adaptive graph convolutional network to maintain the uniqueness of each modality while fusing the correlations between different modalities. Experimental results on benchmark datasets show that our method outperforms state-of-the-art methods.

List of keywords

Natural Language Processing -> NLP: Sentiment analysis, stylistic analysis, and argument mining
Computer Vision -> CV: Multimodal learning

2617

Pluggable Watermarking of Deepfake Models for Deepfake Detection

Han Bao, Xuhong Zhang, Qinying Wang, Kangming Liang, Zonghui Wang, Shouling Ji, Wenzhi Chen

Deepfake model misuse poses major security concerns. Existing passive and active Deepfake detection methods both suffer from a lack of generalizability and robustness. In this study, we propose a pluggable and efficient active model watermarking framework for Deepfake detection. This approach facilitates the embedding of identification watermarks across a variety of Deepfake generation models, enabling easy extraction by authorities for detection purposes. Specifically, our method leverages the universal convolutional structure in generative model decoders. It employs convolutional kernel sparsification for adaptive watermark embedding positioning and introduces convolutional kernel normalization to seamlessly integrate watermark parameters with those of the generative model. For watermark extraction, we jointly train a watermark extractor based on a Deepfake detection model and use BCH encoding to identify watermark images effectively. Finally, we apply our approach to eight major types of Deepfake generation models. Experiments show our method successfully detects Deepfakes with an average accuracy exceeding 94% even in heavy lossy channels. This approach operates independently of the generation model’s training without affecting the original model’s performance. Furthermore, our model only requires training a very limited number of parameters, and it is resilient against three major adaptive attacks. The source code can be found at https://github.com/GuaiZao/Pluggable-Watermarking

List of keywords

AI Ethics, Trust, Fairness -> ETF: Trustworthy AI
AI Ethics, Trust, Fairness -> ETF: Safety and robustness

2622

PEACH: Pretrained-Embedding Explanation across Contextual and Hierarchical Structure

Feiqi Cao, Soyeon Han, Hyunsuk Chung

In this work, we propose a novel tree-based explanation technique, PEACH (Pretrained-embedding Explanation Across Contextual and Hierarchical Structure), that can explain how text-based documents are classified by using any pretrained contextual embeddings in a tree-based human-interpretable manner. Note that PEACH can adopt any contextual embeddings of the PLMs as a training input for the decision tree. Using the proposed PEACH, we perform a comprehensive analysis of several contextual embeddings on nine different NLP text classification benchmarks. This analysis demonstrates the flexibility of the model by applying several PLM contextual embeddings, its attribute selections, scaling, and clustering methods. Furthermore, we show the utility of explanations by visualising the feature selection and important trend of text classification via human-interpretable word-cloud-based trees, which clearly identify model mistakes and assist in dataset debugging. Beyond interpretability, PEACH performs on par with or better than the pretrained models. The code and implementation details will be provided via GitHub after acceptance.

List of keywords

Natural Language Processing -> NLP: Interpretability and analysis of models for NLP
Knowledge Representation and Reasoning -> KRR: Other

2656

An Efficient Prototype-Based Clustering Approach for Edge Pruning in Graph Neural Networks to Battle Over-Smoothing

Yuyang Huang, Wenjing Lu, Yang Yang

Topology augmentation is a popular strategy to address the issue of over-smoothing in graph neural networks (GNNs). To prevent potential distortion of node representations, an essential principle is to enhance the separability between embeddings of nodes from different classes while preserving smoothness among nodes of the same class. However, differentiating between inter-class and intra-class edges becomes arduous when class labels are unavailable or the graph is partially labeled. While clustering offers an alternative for identifying closely connected groups of nodes, traditional clustering methods face challenges when applied to GNNs in terms of accuracy, efficiency, adaptability, and scalability to diverse graphs. To address these limitations, we introduce ClusterDrop, which uses learnable prototypes for efficient clustering and incorporates supervised signals to enhance accuracy and adaptability across different graphs. Experiments on six datasets with varying graph structures demonstrate its effectiveness in alleviating over-smoothing and enhancing GNN performance.

List of keywords

Machine Learning -> ML: Sequence and graph learning
Data Mining -> DM: Mining graphs
Data Mining -> DM: Networks

2666

Incorporating Schema-Aware Description into Document-Level Event Extraction

Zijie Xu, Peng Wang, Wenjun Ke, Guozheng Li, Jiajun Liu, Ke Ji, Xiye Chen, Chenxiao Wu

Document-level event extraction (DEE) aims to extract the structured event information from a given document, facing two critical challenges: (1) event arguments always scatter across sentences (arguments-scattering); (2) multiple events can co-occur in one document (multi-event). Most recent studies mainly follow two simplified settings to ease the challenges: one simplifies DEE with the no-trigger-words design (NDEE), and the other focuses on event argument extraction (DEAE), a sub-task of DEE. However, the former excludes trigger extraction and suffers from error propagation in the sub-tasks. The latter relies heavily on the gold triggers as prerequisites and struggles to distinguish multiple arguments playing the same role in different events. To address the limitations above, we propose a novel joint trigger and argument extraction paradigm SEELE to enhance the DEE model via incorporating SchEma-awarE descriptions into Document-Level Event extraction. Specifically, the schema-aware descriptions are leveraged from two aspects: (1) guiding the attention mechanism among event-aware tokens across sentences, which relieves arguments-scattering without error propagation; (2) performing the fine-grained contrastive learning to distinguish different events, which mitigates multi-event without gold triggers. Extensive experiments show the superiority of SEELE, achieving notable improvements (2.1% to 9.7% F1) on three NDEE datasets and competitive performance on two DEAE datasets. Our code is available at https://github.com/TheoryRhapsody/SEELE.

List of keywords

Natural Language Processing -> NLP: Information extraction

2669

Bridging the Gap between General and Down-Closed Convex Sets in Submodular Maximization

Loay Mualem, Murad Tukan, Moran Feldman

Optimization of DR-submodular functions has experienced a notable surge in significance in recent times, marking a pivotal development within the domain of non-convex optimization. Motivated by real-world scenarios, some recent works have delved into the maximization of non-monotone DR-submodular functions over general (not necessarily down-closed) convex set constraints. Up to this point, these works have all used the minimum $\ell_\infty$ norm of any feasible solution as a parameter. Unfortunately, a recent hardness result due to Mualem & Feldman (2023) shows that this approach cannot yield a smooth interpolation between down-closed and non-down-closed constraints. In this work, we suggest novel offline and online algorithms that provably provide such an interpolation based on a natural decomposition of the convex body constraint into two distinct convex bodies: a down-closed convex body and a general convex body. We also empirically demonstrate the superiority of our proposed algorithms across three offline and two online applications.

List of keywords

Constraint Satisfaction and Optimization -> CSO: Constraint optimization problems
Machine Learning -> ML: Online learning
Search -> S: Combinatorial search and optimisation

2671

CLIP-FSAC: Boosting CLIP for Few-Shot Anomaly Classification with Synthetic Anomalies

Zuo Zuo, Yao Wu, Baoqiang Li, Jiahao Dong, You Zhou, Lei Zhou, Yanyun Qu, Zongze Wu

Few-shot anomaly classification (FSAC) is a vital task in the manufacturing industry. Recent methods focus on utilizing CLIP in zero/few normal shot anomaly detection instead of custom models. However, there is a lack of specific text prompts in anomaly classification and most of them ignore the modality gap between image and text. Meanwhile, there is a distribution discrepancy between the pre-trained and the target data. To provide a remedy, in this paper, we propose a method to boost CLIP for few-normal-shot anomaly classification, dubbed CLIP-FSAC, which consists of two training stages that alternately fine-tune two modality-specific adapters. Specifically, in the first stage, we train the image adapter with the text representation output from the text encoder and introduce an image-to-text tuning to enhance multi-modal interaction and facilitate a better language-compatible visual representation. In the second stage, we freeze the image adapter to train the text adapter. Both of them are constrained by a fusion-text contrastive loss. Comprehensive experimental results are provided for evaluating our method in few-normal-shot anomaly classification, which outperforms the state-of-the-art method by 12.2%, 10.9%, and 10.4% AUROC on VisA for the 1-, 2-, and 4-shot settings.

List of keywords

Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning
Computer Vision -> CV: Multimodal learning
Computer Vision -> CV: Recognition (object detection, categorization)

2674

Learning Big Logical Rules by Joining Small Rules

Céline Hocquette, Andreas Niskanen, Rolf Morel, Matti Järvisalo, Andrew Cropper

A major challenge in inductive logic programming is learning big rules. To address this challenge, we introduce an approach where we join small rules to learn big rules. We implement our approach in a constraint-driven system and use constraint solvers to efficiently join rules. Our experiments on many domains, including game playing and drug design, show that our approach can (i) learn rules with more than 100 literals, and (ii) drastically outperform existing approaches in terms of predictive accuracies.

List of keywords

Knowledge Representation and Reasoning -> KRR: Logic programming
Machine Learning -> ML: Symbolic methods

2677

Learning Logic Programs by Discovering Higher-Order Abstractions

Céline Hocquette, Sebastijan Dumancic, Andrew Cropper

We introduce the higher-order refactoring problem, where the goal is to compress a logic program by discovering higher-order abstractions, such as map, filter, and fold. We implement our approach in Stevie, which formulates the refactoring problem as a constraint optimisation problem. Our experiments on multiple domains, including program synthesis and visual reasoning, show that refactoring can improve the learning performance of an inductive logic programming system, specifically improving predictive accuracies by 27% and reducing learning times by 47%. We also show that Stevie can discover abstractions that transfer to multiple domains.
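
As a rough illustration of what discovering a higher-order abstraction buys, the sketch below shows two first-order list programs that repeat the same recursion and their refactored forms written against a shared `map`-style abstraction, plus a `fold`-style example. It is written in Python rather than as a logic program, it is not Stevie's constraint-optimisation procedure, and all function names are hypothetical.

```python
from functools import reduce

# First-order style: each function repeats the same recursive traversal.
def add_one_all(xs):
    return [] if not xs else [xs[0] + 1] + add_one_all(xs[1:])

def double_all(xs):
    return [] if not xs else [xs[0] * 2] + double_all(xs[1:])

# Shared higher-order abstraction extracted from the two definitions above.
def map_list(f, xs):
    return [] if not xs else [f(xs[0])] + map_list(f, xs[1:])

# Refactored versions: the recursion lives in one place, so each program
# shrinks to the part that actually differs.
def add_one_all_ho(xs):
    return map_list(lambda x: x + 1, xs)

def double_all_ho(xs):
    return map_list(lambda x: x * 2, xs)

# A fold-style abstraction covers accumulating programs in the same way.
def sum_all(xs):
    return reduce(lambda acc, x: acc + x, xs, 0)

print(add_one_all_ho([1, 2, 3]))
print(double_all_ho([1, 2, 3]))
print(sum_all([1, 2, 3]))
```

The refactored functions compute the same results as the originals while sharing a single traversal definition, which is the kind of compression the abstract refers to.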

List of keywords

Knowledge Representation and Reasoning -> KRR: Logic programming
Machine Learning -> ML: Symbolic methods

2679

LitE-SNN: Designing Lightweight and Efficient Spiking Neural Network through Spatial-Temporal Compressive Network Search and Joint Optimization

Qianhui Liu, Jiaqi Yan, Malu Zhang, Gang Pan, Haizhou Li

Spiking Neural Networks (SNNs) mimic the information-processing mechanisms of the human brain and are highly energy-efficient, making them well-suited for low-power edge devices. However, the pursuit of accuracy in current studies leads to large, long-timestep SNNs, conflicting with the resource constraints of these devices. In order to design lightweight and efficient SNNs, we propose a new approach named LitE-SNN that incorporates both spatial and temporal compression into the automated network design process. Spatially, we present a novel Compressive Convolution block (CompConv) to expand the search space to support pruning and mixed-precision quantization. Temporally, we are the first to propose a compressive timestep search to identify the optimal number of timesteps under specific computation cost constraints. Finally, we formulate a joint optimization to simultaneously learn the architecture parameters and spatial-temporal compression strategies to achieve high performance while minimizing memory and computation costs. Experimental results on CIFAR-10, CIFAR-100, and Google Speech Command datasets demonstrate our proposed LitE-SNNs can achieve competitive or even higher accuracy with remarkably smaller model sizes and fewer computation costs.

List of keywords

Humans and AI -> HAI: Cognitive modeling
Humans and AI -> HAI: Cognitive systems

2681

Theoretical Study on Multi-objective Heuristic Search

Shawn Skyler, Shahaf Shperberg, Dor Atzmon, Ariel Felner, Oren Salzman, Shao-Hung Chan, Han Zhang, Sven Koenig, William Yeoh, Carlos Hernandez

This paper provides a theoretical study on Multi-Objective Heuristic Search. We first classify states in the state space into must-expand, maybe-expand, and never-expand states and then transfer these definitions to nodes in the search tree. We then formalize a framework that generalizes A* to Multi-Objective Search. We study different ways to order nodes under this framework and its relation to the traditional tie-breaking policies and provide theoretical findings. Finally, we study and empirically compare different ordering functions.

List of keywords

Search -> S: Heuristic search
Search -> S: Other
Search -> General

2684

Cross-Domain Feature Augmentation for Domain Generalization

Yingnan Liu, Yingtian Zou, Rui Qiao, Fusheng Liu, Mong Li Lee, Wynne Hsu

Domain generalization aims to develop models that are robust to distribution shifts. Existing methods focus on learning invariance across domains to enhance model robustness, and data augmentation has been widely used to learn invariant predictors, with most methods performing augmentation in the input space. However, augmentation in the input space has limited diversity, whereas augmentation in the feature space is more versatile and has shown promising results. Nonetheless, feature semantics is seldom considered, and existing feature augmentation methods suffer from a limited variety of augmented features. We decompose features into class-generic, class-specific, domain-generic, and domain-specific components. We propose a cross-domain feature augmentation method named XDomainMix that enables us to increase sample diversity while emphasizing the learning of invariant representations to achieve domain generalization. Experiments on widely used benchmark datasets demonstrate that our proposed method is able to achieve state-of-the-art performance. Quantitative analysis indicates that our feature augmentation approach facilitates the learning of effective models that are invariant across different domains.

List of keywords

Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning
Computer Vision -> CV: Machine learning for vision

2699

Semantics for Non-Flat Assumption-Based Argumentation, Revisited

Jesse Heyninck, Ofer Arieli

Assumption-based argumentation (ABA) is an argumentative formalism that allows for reasoning on the basis of defeasible assumptions and strict rules. Standard semantics for this formalism sometimes give rise to problematic behaviour in the presence of rules with assumptions in their heads. In this paper, we introduce a six-valued labelling semantics that overcomes these shortcomings while preserving all the usual properties of the standard Dung-style three-valued semantics for ABA frameworks, including existence of the complete semantics, uniqueness of the grounded semantics and preservation of the computational complexity of all main reasoning processes.

List of keywords

Knowledge Representation and Reasoning -> KRR: Argumentation
Knowledge Representation and Reasoning -> KRR: Non-monotonic reasoning

2704

CompetEvo: Towards Morphological Evolution from Competition

Kangyao Huang, Di Guo, Xinyu Zhang, Xiangyang Ji, Huaping Liu

Training an agent to adapt to specific tasks through co-optimization of morphology and control has widely attracted attention. However, whether there exists an optimal configuration and tactics for agents in a multiagent competition scenario is still an issue that is challenging to definitively conclude. In this context, we propose competitive evolution (CompetEvo), which co-evolves agents’ designs and tactics in confrontation. We build arenas consisting of three animals and their evolved derivatives, placing agents with different morphologies in direct competition with each other. The results reveal that our method enables agents to evolve a more suitable design and strategy for fighting compared to fixed-morph agents, allowing them to obtain advantages in combat scenarios. Moreover, we demonstrate the amazing and impressive behaviors that emerge when confrontations are conducted under asymmetrical morphs.

List of keywords

Agent-based and Multi-agent Systems -> MAS: Agent-based simulation and emergence
Robotics -> ROB: Learning in robotics
Search -> S: Evolutionary computation

2707

TIM: An Efficient Temporal Interaction Module for Spiking Transformer

Sicheng Shen, Dongcheng Zhao, Guobin Shen, Yi Zeng

Spiking Neural Networks (SNNs), as the third generation of neural networks, have gained prominence for their biological plausibility and computational efficiency, especially in processing diverse datasets. The integration of attention mechanisms, inspired by advancements in neural network architectures, has led to the development of Spiking Transformers. These have shown promise in enhancing SNNs’ capabilities, particularly in the realms of both static and neuromorphic datasets. Despite their progress, a discernible gap exists in these systems, specifically in the Spiking Self Attention (SSA) mechanism’s effectiveness in leveraging the temporal processing potential of SNNs. To address this, we introduce the Temporal Interaction Module (TIM), a novel, convolution-based enhancement designed to augment the temporal data processing abilities within SNN architectures. TIM’s integration into existing SNN frameworks is seamless and efficient, requiring minimal additional parameters while significantly boosting their temporal information handling capabilities. Through rigorous experimentation, TIM has demonstrated its effectiveness in exploiting temporal information, leading to state-of-the-art performance across various neuromorphic datasets. The code is available at https://github.com/BrainCog-X/Brain-Cog/tree/main/examples/TIM.

List of keywords

Humans and AI -> HAI: Cognitive modeling
Computer Vision -> CV: Recognition (object detection, categorization)
Computer Vision -> CV: Representation learning
Humans and AI -> HAI: Cognitive systems

2710

Learning Embeddings for Sequential Tasks Using Population of Agents

Mridul Mahajan, Georgios Tzannetos, Goran Radanovic, Adish Singla

We present an information-theoretic framework to learn fixed-dimensional embeddings for tasks in reinforcement learning. We leverage the idea that two tasks are similar if observing an agent’s performance on one task reduces our uncertainty about its performance on the other. This intuition is captured by our information-theoretic criterion which uses a diverse agent population as an approximation for the space of agents to measure similarity between tasks in sequential decision-making settings. In addition to qualitative assessment, we empirically demonstrate the effectiveness of our techniques based on task embeddings by quantitative comparisons against strong baselines on two application scenarios: predicting an agent’s performance on a new task by observing its performance on a small quiz of tasks, and selecting tasks with desired characteristics from a given set of options.

List of keywords

Machine Learning -> ML: Reinforcement learning
Machine Learning -> ML: Multi-task and transfer learning
Machine Learning -> ML: Representation learning

2717

MOSER: Learning Sensory Policy for Task-specific Viewpoint via View-conditional World Model

Shenghua Wan, Hai-Hang Sun, Le Gan, De-Chuan Zhan

Reinforcement learning from visual observations is a challenging problem with many real-world applications. Existing algorithms mostly rely on a single observation from a well-designed fixed camera that requires human knowledge. Recent studies learn from different viewpoints with multiple fixed cameras, but this incurs high computation and storage costs and may not guarantee the coverage of the optimal viewpoint. To alleviate these limitations, we propose a straightforward View-conditional Partially Observable Markov Decision Processes (VPOMDPs) assumption and develop a new method, the MOdel-based SEnsor controlleR (MOSER). MOSER jointly learns a view-conditional world model (VWM) to simulate the environment, a sensory policy to control the camera, and a motor policy to complete tasks. We design intrinsic rewards from the VWM without additional modules to guide the sensory policy to adjust the camera parameters. Experiments on locomotion and manipulation tasks demonstrate that MOSER autonomously discovers task-specific viewpoints and significantly outperforms most baseline methods.

List of keywords

Machine Learning -> ML: Reinforcement learning
Machine Learning -> ML: Model-based and model learning reinforcement learning
Machine Learning -> ML: Partially observable reinforcement learning and POMDPs

2718

FBLG: A Local Graph Based Approach for Handling Dual Skewed Non-IID Data in Federated Learning

Yi Xu, Ying Li, Haoyu Luo, Xiaoliang Fan, Xiao Liu

In real-world situations, federated learning often needs to process non-IID (non-independent and identically distributed) data with multiple skews, causing inadequate model performance. Existing federated learning methods mainly focus on addressing the problem with a single skew of non-IID data, and hence the performance of global models can be degraded when faced with dual skewed non-IID data caused by heterogeneous label distributions and sample sizes among clients. To address the problem with dual skewed non-IID data, in this paper, we propose a federated learning algorithm based on a local graph, named FBLG. Specifically, to address the label distribution skew, we first construct a local graph based on clients’ local losses and Jensen-Shannon (JS) divergence, so that similar clients can be selected for aggregation to ensure a highly consistent global model. Afterwards, to address the sample size skew, we design the objective function to favor clients with more samples, as models trained with more samples tend to carry more useful information. Experiments on four datasets with dual skewed non-IID data demonstrate that FBLG outperforms nine baseline methods and achieves up to 9% improvement in accuracy. In addition, both theoretical analysis and experiments show that FBLG can converge quickly.
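
As a rough illustration of the Jensen-Shannon divergence mentioned above, the following sketch compares two label distributions. It is not the FBLG implementation: the abstract does not say how the divergence is combined with local losses to build the graph, and the function names and the choice of base-2 logarithms are assumptions made for this example.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence between two discrete distributions.

    With base-2 logarithms the value lies in [0, 1]: 0 for identical
    distributions and 1 for distributions with disjoint support.
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical label distributions of three clients over two classes.
client_a = [0.9, 0.1]
client_b = [0.8, 0.2]
client_c = [0.1, 0.9]
print(js_divergence(client_a, client_b) < js_divergence(client_a, client_c))
```

Under this measure, clients with similar label distributions get a small divergence, which is the kind of similarity signal a client graph could use.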

List of keywords

Machine Learning -> ML: Federated learning
Machine Learning -> ML: Classification
Machine Learning -> ML: Optimization
Machine Learning -> ML: Supervised Learning

2729

Reconfigurability-Aware Selection for Contrastive Active Domain Adaptation

Zeyu Zhang, Chun Shen, Shuai Lü, Shaojie Zhang

Active domain adaptation (ADA) aims to label a small portion of target samples to drastically improve the adaptation performance. Existing ADA methods mostly rely on the output of a domain discriminator or the original prediction probability to design sample selection strategies and do not fully explore the semantic information of source and target domain features, which may lead to selecting valueless target samples. Moreover, most of them require complex network structures (such as introducing an additional domain discriminator, multiple classifiers, or loss predictors) and multiple query functions. In this work, we propose a concise but effective ADA method called Reconfigurability-Aware Selection for Contrastive active domain adaptation (RASC). With the reconfigurability-aware sample selection strategy, RASC can select the most valuable target samples for annotation in the presence of domain shift. To better utilize the selected target samples, we further design a contrastive learning-based gradual active domain adaptation framework. In addition, we propose a variant of RASC called RASC-Ob, which uses a simpler sample annotation method and supplements the learning of misclassified samples. Extensive experimental results on multiple benchmarks demonstrate the superiority of RASC.

List of keywords

Machine Learning -> ML: Multi-task and transfer learning
Machine Learning -> ML: Semi-supervised learning

2744

Temporal Knowledge Graph Extrapolation via Causal Subhistory Identification

Kai Chen, Ye Wang, Xin Song, Siwei Chen, Han Yu, Aiping Li

Temporal knowledge graph extrapolation has become a prominent area of research interest in recent years. Numerous methods for extrapolation have been put forth, mining query-relevant information from history to generate forecasts. However, existing approaches normally do not discriminate between causal and non-causal effects in reasoning; instead, they focus on analyzing the statistical correlation between the future events to be predicted and the historical data given, which may be deceptive and hinder the model’s capacity to learn real causal information that actually affects the reasoning conclusions. To tackle this, we propose a novel approach called Causal Subhistory Identification (CSI), which focuses on extracting the causal subhistory for reasoning purposes from a large amount of historical data. CSI can improve the clarity and transparency of the reasoning process and more effectively convey the logic behind conclusions by giving priority to the causal subhistory and eliminating non-causal correlations. Extensive experiments demonstrate the remarkable potential of our CSI in the following aspects: superiority, improvement, explainability, and robustness.

List of keywords

Knowledge Representation and Reasoning -> KRR: Learning and reasoning
Data Mining -> DM: Knowledge graphs and knowledge base completion
Natural Language Processing -> NLP: Applications

2752

Navigating Continual Test-time Adaptation with Symbiosis Knowledge

Xu Yang, Moqi Li, Jie Yin, Kun Wei, Cheng Deng

Continual test-time domain adaptation seeks to adapt the source pre-trained model to a continually changing target domain without incurring additional data acquisition or labeling costs. Unfortunately, existing mainstream methods may result in a detrimental cycle. This is attributed to noisy pseudo-labels caused by the domain shift, which immediately negatively impacts the model’s knowledge. The long-term accumulation of these negative effects exacerbates the model’s difficulty in generalizing to future domain shifts and contributes to catastrophic forgetting. To address these challenges, this paper introduces a Dual-stream Network that independently optimizes different parameters in each stream to capture symbiotic knowledge from continual domains, thereby ensuring generalization while enhancing instantaneous discrimination. Furthermore, to prevent catastrophic forgetting, a weighted soft parameter alignment method is designed to leverage knowledge from the source model. Finally, efforts are made to calibrate and explore reliable supervision signals to mitigate instantaneous negative optimization. These include label calibration with prior knowledge, label selection using self-adaptive confidence thresholds, and a soft-weighted contrastive module for capturing potential semantics. Extensive experimental results demonstrate that our method achieves state-of-the-art performance on several benchmark datasets.

List of keywords

Machine Learning -> ML: Unsupervised learning
Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning

2758

Equilibria in Two-Stage Facility Location with Atomic Clients

Simon Krogmann, Pascal Lenzner, Alexander Skopalik, Marc Uetz, Marnix Vos

We consider competitive facility location as a two-stage multi-agent system with two types of clients. For a given host graph with weighted clients on the vertices, first facility agents strategically select vertices for opening their facilities. Then, the clients strategically select which of the opened facilities in their neighborhood to patronize. Facilities want to attract as much client weight as possible, clients want to minimize congestion on the chosen facility. All recently studied versions of this model assume that clients can split their weight strategically. We consider clients with unsplittable weights, but allow mixed strategies. So clients may randomize over which facility to patronize. Besides modeling a natural client behavior, this subtle change yields drastic changes, e.g., for a given facility placement, qualitatively different client equilibria are possible. As our main result, we show that pure subgame perfect equilibria always exist if all client weights are identical. For this, we use a novel potential function argument, employing a hierarchical classification of the clients and sophisticated rounding in each step. In contrast, for non-identical clients, we show that deciding the existence of even approximately stable states is computationally intractable. On the positive side, we give a tight bound of 2 on the price of anarchy which implies high social welfare of equilibria, if they exist.

List of keywords

Game Theory and Economic Paradigms -> GTEP: Noncooperative games

2759

Angluin-Style Learning of Deterministic Büchi and Co-Büchi Automata

Yong Li, Sven Schewe, Qiyi Tang

While recently developed Angluin-style learning algorithms for omega-automata have much in common with her classic DFA learning algorithm, there is a huge difference in the cost of the equivalence queries about the target automata. For omega-regular languages, the target is to learn nondeterministic Büchi automata (NBAs) through the vehicle of Families of DFAs (FDFAs). While the cost of equivalence queries is usually idealised as constant in learning, it makes a practical difference that the language equivalence checking about the learned NBAs is computationally hard. We develop efficient techniques for the cases where we learn deterministic Büchi automata (DBAs) or deterministic co-Büchi automata (DCAs). This is based on the observation that some classes of FDFAs can be used to learn DBAs for DBA-recognisable languages, rather than having to resort to nondeterministic ones. We believe that the restriction to DBAs and DCAs in equivalence queries also makes our algorithm more appealing to realistic applications, as the operations are cheap (in NL) for DBAs and DCAs.

List of keywords

Machine Learning -> ML: Active learning
Knowledge Representation and Reasoning -> KRR: Automated reasoning and theorem proving
Knowledge Representation and Reasoning -> KRR: Learning and reasoning
Machine Learning -> ML: Model-based and model learning reinforcement learning

2765

Compilation and Fast Model Counting beyond CNF

Alexis de Colnet, Stefan Szeider, Tianwei Zhang

Circuits in deterministic decomposable negation normal form, or d-DNNF circuits, are representations for Boolean functions that enable linear-time model counting. This paper strengthens our theoretical knowledge of what classes of functions can be efficiently transformed, or compiled, into d-DNNF. Our main contribution is the fixed-parameter tractable (FPT) compilation of systems of specific constraints parameterized by the system’s incidence treewidth. This subsumes the known result for CNF. The constraints in question are all functions representable by constant-width ordered binary decision diagrams (OBDD) for all variable orderings. For instance, this includes parity constraints and cardinality constraints with constant threshold. The running time of the FPT compilation is singly exponential in incidence treewidth but hides large constants in the exponent. To balance that, we also give a more efficient FPT algorithm for model counting that applies to a sub-family of the constraints and does not require compilation.

List of keywords

Knowledge Representation and Reasoning -> KRR: Knowledge compilation

2768

The Orthogonality of Weight Vectors: The Key Characteristics of Normalization and Residual Connections

Zhixing Lu, Yuanyuan Sun, Zhihao Yang, Qin Zhou, Hongfei Lin

Normalization and residual connections find extensive application within the intricate architecture of deep neural networks, contributing significantly to their heightened performance. Nevertheless, the precise factors responsible for this elevated performance have remained elusive. Our theoretical investigations have unveiled a noteworthy revelation: the utilization of normalization and residual connections results in an enhancement of the orthogonality within the weight vectors of deep neural networks. This, in turn, induces the Gram matrix of neural network weights to exhibit a pronounced tendency towards strict diagonal dominance, thereby amplifying the neural network’s capacity for feature learning. Meanwhile, we have designed the parameters independence index (PII) to precisely characterize the orthogonality of parameter vectors. In tandem with our theoretical findings, we undertook empirical validations through experiments conducted on prevalent network models, including fully connected networks (FNNs), convolutional neural networks (CNNs), Transformers, pre-trained language models (PLMs) and large language models (LLMs) composed of Transformers. Finally, we have found that a fine-tuning technique (LoRA) preserves the orthogonality of parameter vectors, a revelation that carries importance within the framework of fine-tuning techniques for LLMs.

List of keywords

Machine Learning -> ML: Explainable/Interpretable machine learning
Machine Learning -> ML: Deep learning architectures

2773

Fast and Continual Knowledge Graph Embedding via Incremental LoRA

Jiajun Liu, Wenjun Ke, Peng Wang, Jiahao Wang, Jinhua Gao, Ziyu Shang, Guozheng Li, Zijie Xu, Ke Ji, Yining Li

Continual Knowledge Graph Embedding (CKGE) aims to efficiently learn new knowledge and simultaneously preserve old knowledge. Dominant approaches primarily focus on alleviating catastrophic forgetting of old knowledge but neglect efficient learning for the emergence of new knowledge. However, in real-world scenarios, knowledge graphs (KGs) are continuously growing, which brings a significant challenge to fine-tuning KGE models efficiently. To address this issue, we propose a fast CKGE framework (FastKGE), incorporating an incremental low-rank adapter (IncLoRA) mechanism to efficiently acquire new knowledge while preserving old knowledge. Specifically, to mitigate catastrophic forgetting, FastKGE isolates and allocates new knowledge to specific layers based on the fine-grained influence between old and new KGs. Subsequently, to accelerate fine-tuning, FastKGE devises an efficient IncLoRA mechanism, which embeds the specific layers into incremental low-rank adapters with fewer training parameters. Moreover, IncLoRA introduces adaptive rank allocation, which makes the LoRA aware of the importance of entities and adjusts its rank scale adaptively. We conduct experiments on four public datasets and two new datasets with a larger initial scale. Experimental results demonstrate that FastKGE can reduce training time by 34%-49% while still achieving competitive link prediction performance against state-of-the-art models on four public datasets (average MRR score of 21.0% vs. 21.1%). Meanwhile, on two newly constructed datasets, FastKGE saves 51%-68% training time and improves link prediction performance by 1.5%.

List of keywords

Data Mining -> DM: Knowledge graphs and knowledge base completion

2778

Focus on the Whole Character: Discriminative Character Modeling for Scene Text Recognition

Bangbang Zhou, Yadong Qu, Zixiao Wang, Zicheng Li, Boqiang Zhang, Hongtao Xie

Recently, scene text recognition (STR) models have shown significant performance improvements. However, existing models still encounter difficulties in recognizing challenging texts that involve factors such as severely distorted and perspective characters. These challenging texts mainly cause two problems: (1) large intra-class variance and (2) small inter-class variance. An extremely distorted character may prominently differ visually from other characters within the same category, while the variance between characters from different classes is relatively small. To address the above issues, we propose a novel method that enriches the character features to enhance the discriminability of characters. Firstly, we propose the Character-Aware Constraint Encoder (CACE) with multiple stacked blocks. CACE introduces a decay matrix in each block to explicitly guide the attention region for each token. By continuously employing the decay matrix, CACE enables tokens to perceive morphological information at the character level. Secondly, an Intra-Inter Consistency Loss (I^2CL) is introduced to consider intra-class compactness and inter-class separability in feature space. I^2CL improves the discriminative capability of features by learning a long-term memory unit for each character category. Trained with synthetic data, our model achieves state-of-the-art performance on common benchmarks (94.1% accuracy) and Union14M-Benchmark (61.6% accuracy). Code is available at https://github.com/bang123-box/CFE.

List of keywords

Computer Vision -> CV: Recognition (object detection, categorization)
Computer Vision -> CV: Multimodal learning

2780

Improved Approximation of Weighted MMS Fairness for Indivisible Chores

Bo Li, Pinyan Lu, Fangxiao Wang

We study how to fairly allocate a set of indivisible chores among n agents who may have different weights corresponding to their involvement in completing these chores. We found that some of the existing fairness notions may place agents with lower weights at a disadvantage, which motivates us to explore weighted maximin share fairness (WMMS). While it is known that a WMMS allocation may not exist, no non-trivial approximation has been discovered thus far. In this paper, we first design a simple sequential picking algorithm that solely relies on the agents’ ordinal rankings of the items, which achieves an approximation ratio of O(log n). Then, for the case involving two agents, we improve the approximation ratio to approximately 1.366, and prove that it is optimal. Finally, we consider the online setting when the items arrive one after another and prove that the optimal competitive ratio an online algorithm can achieve is O(\sqrt{n}).

List of keywords

Game Theory and Economic Paradigms -> GTEP: Fair division

2782

Online Learning with Off-Policy Feedback in Adversarial MDPs

Francesco Bacchiocchi, Francesco Emanuele Stradi, Matteo Papini, Alberto Maria Metelli, Nicola Gatti

In this paper, we face the challenge of online learning in adversarial Markov decision processes with off-policy feedback. In this setting, the learner chooses a policy, but, differently from the traditional on-policy setting, the environment is explored by means of a different, fixed, and possibly unknown policy (named colleague’s policy). The off-policy feedback presents an additional issue that is not present in traditional settings: the learner is charged with the regret of its chosen policy but it observes only the rewards gained by the colleague’s policy. First, we present a lower bound for the setting we propose, which shows that the optimal dependency of the sublinear regret is w.r.t. the dissimilarity between the optimal policy in hindsight and the colleague’s policy. Then, we propose novel algorithms that, by employing pessimistic estimators (commonly adopted in the off-line reinforcement learning literature), ensure sublinear regret bounds depending on the desired dissimilarity, even when the colleague’s policy is unknown.

List of keywords

Machine Learning -> ML: Online learning
Machine Learning -> ML: Reinforcement learning

2785

Optimizing Viscous Democracy

Ben Armstrong, Shiri Alouf-Heffetz, Nimrod Talmon

Viscous democracy is a generalization of liquid democracy, a social choice framework in which voters may transitively delegate their votes. In viscous democracy, a "viscosity" factor decreases the weight of a delegation the further it travels, reducing the chance of excessive weight flowing between ideologically misaligned voters. We demonstrate that viscous democracy often significantly improves the quality of group decision-making over liquid democracy. We first show that finding optimal delegations within a viscous setting is NP-hard. However, simulations allow us to explore the practical effects of viscosity. Across social network structures, competence distributions, and delegation mechanisms we find high viscosity reduces the chance of “super-voters” attaining large amounts of weight and increases the number of voters that are able to affect the outcome of elections. This, in turn, improves group accuracy as a whole. As a result, we argue that viscosity should be considered a core component of liquid democracy.
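
The weight decay described above can be sketched as follows. This is an illustrative reading, not the paper's code: it assumes that a vote travelling d delegation hops arrives with weight alpha ** d for a damping factor alpha in (0, 1], that alpha = 1 recovers liquid democracy, and that votes caught in a delegation cycle are discarded. The names `viscous_weights`, `delegate`, and `alpha` are hypothetical.

```python
from typing import Dict, Optional


def viscous_weights(delegate: Dict[str, Optional[str]], alpha: float) -> Dict[str, float]:
    """Return the total weight held by each voter who votes directly.

    delegate[v] is the voter that v delegates to, or None if v votes directly.
    A vote that travels d hops contributes alpha ** d (assumed decay rule).
    """
    weights = {v: 0.0 for v, d in delegate.items() if d is None}
    for voter in delegate:
        hops, current, seen = 0, voter, set()
        while delegate[current] is not None:
            if current in seen:  # delegation cycle: this vote is discarded
                break
            seen.add(current)
            current = delegate[current]
            hops += 1
        else:  # reached a direct voter without hitting a cycle
            weights[current] += alpha ** hops
    return weights


# a -> b -> c, while d votes directly.
chain = {"a": "b", "b": "c", "c": None, "d": None}
print(viscous_weights(chain, alpha=1.0))  # liquid democracy: c holds 3 votes
print(viscous_weights(chain, alpha=0.5))  # damped: c holds 1 + 0.5 + 0.25
```

With damping, the long chain into c carries less weight, which matches the intuition that viscosity limits how much weight a single "super-voter" can accumulate.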

List of keywords

Game Theory and Economic Paradigms -> GTEP: Computational social choice
Agent-based and Multi-agent Systems -> MAS: Agent-based simulation and emergence
Multidisciplinary Topics and Applications -> MTA: Web and social networks

2802

Reschedule Diffusion-based Bokeh Rendering

Shiyue Yan, Xiaoshi Qiu, Qingmin Liao, Jing-Hao Xue, Shaojun Liu

Bokeh rendering for images shot with small apertures has drawn much attention in practice. Very recently, people have started to explore diffusion models for bokeh rendering, aiming to leverage the models’ surging power of image generation. However, we can clearly observe two big issues with the images rendered by diffusion models: large fluctuation and severe color deviation. To address these issues, we propose in this paper a prior-aware sampling approach, which can adaptively control the noise scale through learned priors, and a prior-aware noise scheduling strategy, which can greatly reduce the number of inference steps without sacrificing performance. Extensive experiments show that our method can effectively alleviate the fluctuation problem of sampling results while ensuring similar color styles to the input image. In addition, our method outperforms state-of-the-art methods, sometimes even with only two steps of sampling. Our code is available at https://github.com/Loeiii/Reschedule-Diffusion-based-Bokeh-Rendering.

List of keywords

Computer Vision -> CV: Image and video synthesis and generation
Humans and AI -> HAI: Applications
Machine Learning -> ML: Applications

2809

Cross-Scale Domain Adaptation with Comprehensive Information for Pansharpening

Meiqi Gong, Hao Zhang, Hebaixu Wang, Jun Chen, Jun Huang, Xin Tian, Jiayi Ma

Deep learning-based pansharpening methods typically use simulated data at the reduced-resolution scale for training. This limits their performance when generalizing the trained model to the full-resolution scale, due to incomplete utilization of the information in panchromatic (PAN) images at the full-resolution scale and low generalization ability. In this paper, we adopt two targeted strategies to address these two problems. On the one hand, we introduce a cross-scale comprehensive information capture module, which improves the information utilization of the original PAN image through fully-supervised reconstruction. On the other hand, we pioneer a domain adaptation strategy to tackle the problem of low generalization across different scales. Considering the intrinsic domain gap between different scales, we leverage the maximum mean discrepancy loss and the inherent pixel-level correlations between features at different scales to reduce the scale variance, thus boosting the generalization ability of our model. Experiments on various satellites demonstrate the superiority of our method over the state of the art in terms of information retention. Our code is publicly available at https://github.com/Meiqi-Gong/SDIPS.
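The maximum mean discrepancy mentioned in this abstract is a standard measure of the gap between two feature distributions. The sketch below is a minimal, biased estimate of the squared MMD with a Gaussian kernel; the kernel choice, the `bandwidth` value, and the toy feature lists `low_res` and `full_res` are assumptions for illustration and are not taken from the paper or its code.

```python
import math


def rbf(u, v, bandwidth):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(rbf(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of the squared maximum mean discrepancy."""
    return (mean_kernel(xs, xs, bandwidth)
            + mean_kernel(ys, ys, bandwidth)
            - 2.0 * mean_kernel(xs, ys, bandwidth))


# Toy features standing in for two scales (hypothetical values).
low_res = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3]]
full_res = [[2.0, 2.1], [2.2, 1.9], [1.8, 2.0]]
gap = mmd_squared(low_res, full_res)
```

Used as a loss term, a quantity of this kind is small when the two feature sets are distributed alike and grows as they drift apart.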

List of keywords

Computer Vision -> CV: Multimodal learning
Computer Vision -> CV: Computational photography

2821

Learning Pareto Set for Multi-Objective Continuous Robot Control

Tianye Shu, Ke Shang, Cheng Gong, Yang Nan, Hisao Ishibuchi

For a control problem with multiple conflicting objectives, there exists a set of Pareto-optimal policies called the Pareto set instead of a single optimal policy. When a multi-objective control problem is continuous and complex, traditional multi-objective reinforcement learning (MORL) algorithms search for many Pareto-optimal deep policies to approximate the Pareto set, which is quite resource-consuming. In this paper, we propose a simple and resource-efficient MORL algorithm that learns a continuous representation of the Pareto set in a high-dimensional policy parameter space using a single hypernet. The learned hypernet can directly generate various well-trained policy networks for different user preferences. We compare our method with two state-of-the-art MORL algorithms on seven multi-objective continuous robot control problems. Experimental results show that our method achieves the best overall performance with the fewest training parameters. An interesting observation is that the Pareto set is well approximated by a curved line or surface in a high-dimensional parameter space. This observation will provide insight for researchers to design new MORL algorithms.
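To make the notion of Pareto optimality in this abstract concrete, the sketch below filters a handful of candidate policies by their objective values, keeping only those not dominated by another candidate. It assumes every objective is to be maximized; the tuples in `returns` are made-up numbers, and the code illustrates only the definition, not the hypernet-based method.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


def pareto_front(points):
    """Keep the points that no other point dominates (maximization)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (objective 1, objective 2) returns of five policies.
returns = [(1.0, 5.0), (2.0, 4.0), (1.5, 3.0), (3.0, 1.0), (0.5, 0.5)]
front = pareto_front(returns)
```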

List of keywords

Machine Learning -> ML: Reinforcement learning
Machine Learning -> ML: Optimization
Robotics -> ROB: Learning in robotics

2823

Towards Geometric Normalization Techniques in SE(3) Equivariant Graph Neural Networks for Physical Dynamics Simulations

Ziqiao Meng, Liang Zeng, Zixing Song, Tingyang Xu, Peilin Zhao, Irwin King

SE(3) equivariance is a crucial property in physical dynamics modeling that aims to keep neural outputs robust when inputs are translated or rotated. Several proposals for SE(3) equivariant graph neural networks (GNNs) have emerged, showing promising results in simulating particle dynamics. However, existing works have overlooked a significant issue: the inability of current SE(3) equivariant GNNs to scale to large particle systems. While some simple normalization techniques are used to stabilize the training dynamics of equivariant graph networks, they actually compromise the SE(3) equivariance of the architectures. In this study, we first demonstrate the numerical instability of training equivariant GNNs on large particle systems. We then analyze existing normalization strategies adopted in modern works. To address these challenges, we propose a new normalization layer called EquiNorm, which preserves SE(3) equivariance and simultaneously stabilizes the training process. We conduct comprehensive experiments on N-body system simulation tasks with larger particle system sizes. The results demonstrate that EquiNorm successfully maintains SE(3) equivariance compared to baseline techniques and stabilizes the training dynamics of SE(3) equivariant GNNs on large systems.

List of keywords

Multidisciplinary Topics and Applications -> MTA: Physical sciences
Machine Learning -> ML: Deep learning architectures
Machine Learning -> ML: Geometric learning
Machine Learning -> ML: Sequence and graph learning

2836

FedGCS: A Generative Framework for Efficient Client Selection in Federated Learning via Gradient-based Optimization

Zhiyuan Ning, Chunlin Tian, Meng Xiao, Wei Fan, Pengyang Wang, Li Li, Pengfei Wang, Yuanchun Zhou

Federated Learning (FL) faces significant challenges in statistical and system heterogeneity, along with high energy consumption, necessitating efficient client selection strategies. Traditional approaches, including heuristic and learning-based methods, fall short of addressing these complexities holistically. In response, we propose FedGCS, a novel generative client selection framework that innovatively recasts the client selection process as a generative task. Drawing inspiration from the methodologies used in large language models, FedGCS efficiently encodes abundant decision-making knowledge within a continuous representation space, enabling efficient gradient-based optimization to search for the optimal client selection, which is finally output via generation. The framework comprises four steps: (1) automatic collection of diverse “selection-score” pair data using classical client selection methods; (2) training an encoder-evaluator-decoder framework on this data to construct a continuous representation space; (3) employing gradient-based optimization in this space for optimal client selection; (4) generating the final optimal client selection via beam search with the well-trained decoder. FedGCS outperforms traditional methods by being more comprehensive, generalizable, and efficient, simultaneously optimizing for model performance, latency, and energy consumption. The effectiveness and superiority of FedGCS are demonstrated through extensive experiments and analyses.

List of keywords

Machine Learning -> ML: Federated learning
Data Mining -> DM: Applications
Data Mining -> DM: Other
Machine Learning -> ML: Applications

2851

Extremal Separation Problems for Temporal Instance Queries

Jean Jung, Vladislav Ryzhikov, Frank Wolter, Michael Zakharyaschev

The separation problem for a class Q of database queries is to find a query in Q that distinguishes between a given set of ‘positive’ and ‘negative’ data examples. Separation provides explanations of examples and underpins the query-by-example paradigm to support database users in constructing and refining queries. As the space of all separating queries can be large, it is helpful to succinctly represent this space by means of its most specific (logically strongest) and general (weakest) members. We investigate this extremal separation problem for classes of instance queries formulated in linear temporal logic LTL with the operators conjunction, ‘next’, and ‘eventually’. Our results range from tight complexity bounds for verifying and counting extremal separators to algorithms computing them.

List of keywords

Knowledge Representation and Reasoning -> KRR: Qualitative, geometric, spatial, and temporal reasoning
Knowledge Representation and Reasoning -> KRR: Learning and reasoning
Multidisciplinary Topics and Applications -> MTA: Databases

2852

Boosting Single Positive Multi-label Classification with Generalized Robust Loss

Yanxi Chen, Chunxiao Li, Xinyang Dai, Jinhuan Li, Weiyu Sun, Yiming Wang, Renyuan Zhang, Tinghe Zhang, Bo Wang

Multi-label learning (MLL) requires comprehensive multi-semantic annotations that are hard to obtain in full, thus often resulting in missing-label scenarios. In this paper, we investigate Single Positive Multi-label Learning (SPML), where each image is associated with merely one positive label. Existing SPML methods only focus on designing losses using mechanisms such as hard pseudo-labeling and robust losses, mostly leading to unacceptable false negatives. To address this issue, we first propose a generalized loss framework based on expected risk minimization to provide soft pseudo labels, and point out that the former losses can be seamlessly converted into our framework. In particular, we design a novel robust loss based on our framework, which enjoys flexible coordination between false positives and false negatives, and can additionally deal with the imbalance between positive and negative samples. Extensive experiments show that our approach can significantly improve SPML performance and outperform the vast majority of state-of-the-art methods on all four benchmarks.

List of keywords

Machine Learning -> ML: Multi-label learning
Machine Learning -> ML: Classification
Machine Learning -> ML: Weakly supervised learning

2859

SketchEdit: Editing Freehand Sketches at the Stroke-Level

Tengjie Li, Shikui Tu, Lei Xu

Recent sketch synthesis methods have demonstrated the capability of generating lifelike outcomes. However, these methods directly encode entire sketches, which makes it challenging to decouple the strokes from the sketches and difficult to control local sketch synthesis, e.g., stroke editing. Besides, the sketch editing task encounters the issue of accurately positioning the edited strokes, because users may not be able to draw on the exact position and the same stroke may appear in various locations in different sketches. We propose SketchEdit to realize flexible editing of sketches at the stroke level for the first time. To tackle the challenge of decoupling strokes, SketchEdit divides a drawing sequence of a sketch into a series of strokes based on the pen state, aligns the stroke segments to have the same starting position, and learns the embeddings of every stroke by a proposed stroke encoder. Moreover, we overcome the problem of stroke placement via a diffusion process, which progressively generates the locations for the strokes to be synthesized, using the stroke features as the guiding condition. Experiments demonstrate that SketchEdit is effective for stroke-level sketch editing and sketch reconstruction. The source code is publicly available at https://github.com/CMACH508/SketchEdit/.
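The preprocessing step described in this abstract (splitting a drawing sequence by pen state and aligning each stroke to a common starting position) can be pictured roughly as follows. The input layout, absolute `(x, y, pen_up)` triples in which `pen_up = 1` marks the last point of a stroke, and the function name `split_and_align` are assumptions for this sketch; the released SketchEdit code may represent sketches differently.

```python
def split_and_align(points):
    """Split a drawing sequence into strokes and shift each to start at (0, 0).

    points: list of (x, y, pen_up) with absolute coordinates (assumed format).
    Returns (strokes, starts): the aligned strokes and their original start points.
    """
    strokes, starts, current = [], [], []

    def flush():
        # Translate the collected stroke so that its first point is the origin.
        x0, y0 = current[0]
        starts.append((x0, y0))
        strokes.append([(px - x0, py - y0) for px, py in current])

    for x, y, pen_up in points:
        current.append((x, y))
        if pen_up:
            flush()
            current = []
    if current:  # trailing stroke without a pen-up marker
        flush()
    return strokes, starts


drawing = [(1, 1, 0), (2, 1, 0), (3, 2, 1), (5, 5, 0), (6, 7, 1)]
strokes, starts = split_and_align(drawing)
```

Keeping the start positions separate from the aligned shapes mirrors the split in the abstract between stroke embeddings and stroke locations, which the paper generates with a diffusion process.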

List of keywords

Machine Learning -> ML: Generative models
Computer Vision -> CV: Applications
Computer Vision -> CV: Image and video synthesis and generation
Computer Vision -> CV: Representation learning

2864

Large Language Model Guided Knowledge Distillation for Time Series Anomaly Detection

Chen Liu, Shibo He, Qihang Zhou, Shizhong Li, Wenchao Meng

Self-supervised methods have gained prominence in time series anomaly detection due to the scarcity of available annotations. Nevertheless, they typically demand extensive training data to acquire a generalizable representation map, which conflicts with scenarios where only a few samples are available, thereby limiting their performance. To overcome this limitation, we propose AnomalyLLM, a knowledge distillation-based time series anomaly detection approach where the student network is trained to mimic the features of the large language model (LLM)-based teacher network that is pretrained on large-scale datasets. During the testing phase, anomalies are detected when the discrepancy between the features of the teacher and student networks is large. To prevent the student network from learning the teacher network’s features of anomalous samples, we devise two key strategies. 1) Prototypical signals are incorporated into the student network to consolidate the normal feature extraction. 2) We use synthetic anomalies to enlarge the representation gap between the two networks. AnomalyLLM demonstrates state-of-the-art performance on 15 datasets, improving accuracy by at least 14.5% in the UCR dataset.
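The test-time rule in this abstract (flag an anomaly when teacher and student features disagree strongly) can be sketched as below. The mean squared difference used as the discrepancy, the `threshold` value, and the toy feature vectors are assumptions for illustration; the paper's actual scoring function is not given in the abstract.

```python
def discrepancy(teacher_feat, student_feat):
    """Mean squared difference between two feature vectors (assumed score)."""
    diffs = [(t - s) ** 2 for t, s in zip(teacher_feat, student_feat)]
    return sum(diffs) / len(diffs)


def detect(teacher_feats, student_feats, threshold):
    """Flag each window whose teacher/student discrepancy exceeds the threshold."""
    return [discrepancy(t, s) > threshold
            for t, s in zip(teacher_feats, student_feats)]


# Hypothetical features for two windows: the student matches the teacher
# on the first (normal) window and diverges on the second.
teacher = [[1.0, 0.0], [0.0, 1.0]]
student = [[0.9, 0.1], [1.0, -1.0]]
flags = detect(teacher, student, threshold=0.5)
```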

List of keywords

Data Mining -> DM: Anomaly/outlier detection
Data Mining -> DM: Mining spatial and/or temporal data
Machine Learning -> ML: Unsupervised learning

2865

The Transformation Logics

Alessandro Ronca

We introduce a new family of temporal logics designed to finely balance the trade-off between expressivity and complexity. Their key feature is the possibility of defining operators of a new kind that we call transformation operators. Some of them subsume existing temporal operators, while others are entirely novel. Of particular interest are transformation operators based on semigroups. They enable logics to harness the richness of semigroup theory, and we show them to yield logics capable of creating hierarchies of increasing expressivity and complexity which are non-trivial to characterise in existing logics. The result is a genuinely novel and yet unexplored landscape of temporal logics, each of them with the potential of matching the trade-off between expressivity and complexity required by specific applications.

List of keywords

Knowledge Representation and Reasoning -> KRR: Knowledge representation languages
Knowledge Representation and Reasoning -> KRR: Computational complexity of reasoning
Knowledge Representation and Reasoning -> KRR: Qualitative, geometric, spatial, and temporal reasoning

2875

First-Order Progression beyond Local-Effect and Normal Actions

Daxin Liu, Jens Claßen

One of the fundamental problems in reasoning about action is progression, which is to update a knowledge base according to the effects of an action into another knowledge base that retains all proper information. The problem is notoriously challenging, as in general, it requires second-order logic. Efforts have been made to find fragments where progression is first-order definable. Liu and Lakemeyer showed that for actions that have only local effects, progression is always first-order definable. They also generalized the result to so-called normal actions, that allow for non-local effects, as long as the affected fluent predicates only depend on local-effect ones, under certain restrictions on the knowledge base. In addition, they showed that for so-called proper+ knowledge bases, progression for normal actions can be efficient under reasonable assumptions. In this paper, we consider a larger class of theories, called the acyclic ones, that strictly subsumes normal actions. In such theories, dependencies between non-local effect fluent predicates are allowed, as long as they do not contain any cycles. We prove progression to be equally first-order definable for this class. Furthermore, under similar but stronger assumptions than those made by Liu and Lakemeyer, we show that progression is efficient as well.

List of keywords

Knowledge Representation and Reasoning -> KRR: Reasoning about actions

2878

BATON: Aligning Text-to-Audio Model Using Human Preference Feedback

Huan Liao, Haonan Han, Kai Yang, Tianjiao Du, Rui Yang, Qinmei Xu, Zunnan Xu, Jingquan Liu, Jiasheng Lu, Xiu Li

With the development of AI-Generated Content (AIGC), text-to-audio models are gaining widespread attention. However, it is challenging for these models to generate audio aligned with human preference due to the inherent information density of natural language and limited model understanding ability. To alleviate this issue, we formulate BATON, the first framework specifically designed to enhance the alignment between generated audio and text prompt using human preference feedback. BATON comprises three key stages. Firstly, we curate a dataset containing both prompts and the corresponding generated audio, which is then annotated based on human feedback. Secondly, we introduce a reward model trained on the constructed dataset, which can mimic human preference by assigning rewards to input text-audio pairs. Finally, we employ the reward model to fine-tune an off-the-shelf text-to-audio model. The experimental results demonstrate that BATON can significantly improve the generation quality of the original text-to-audio models in terms of audio integrity, temporal relationship, and alignment with human preference. The project page is available at https://baton2024.github.io.

List of keywords

Machine Learning -> ML: Generative models
Multidisciplinary Topics and Applications -> MTA: Arts and creativity
Natural Language Processing -> NLP: Speech

2879

Truthful Interval Covering

Argyrios Deligkas, Aris Filos-Ratsikas, Alexandros A. Voudouris

We initiate the study of a novel problem in mechanism design without money, which we term Truthful Interval Covering (TIC). An instance of TIC consists of a set of agents each associated with an individual interval on a line, and the objective is to decide where to place a covering interval to minimize the total social or egalitarian cost of the agents, which is determined by the intersection of this interval with their individual ones. This fundamental problem can model situations of provisioning a public good, such as the use of power generators to prevent or mitigate load shedding in developing countries. In the strategic version of the problem, the agents wish to minimize their individual costs, and might misreport the position and/or length of their intervals to achieve that. Our goal is to design truthful mechanisms to prevent such strategic misreports and achieve good approximations to the best possible social or egalitarian cost. We consider the fundamental setting of known intervals with equal lengths and provide tight bounds on the approximation ratios achieved by truthful deterministic mechanisms. For the social cost, we also design a randomized truthful mechanism that outperforms all possible deterministic ones. Finally, we highlight a plethora of natural extensions of our model for future work, as well as some natural limitations of those settings.

List of keywords

Game Theory and Economic Paradigms -> GTEP: Mechanism design
Game Theory and Economic Paradigms -> GTEP: Computational social choice

2885

Revisiting Causal Discovery from a Complexity-Theoretic Perspective

Robert Ganian, Viktoriia Korchemna, Stefan Szeider

Causal discovery seeks to unveil causal relationships (represented as a so-called causal graph) from observational data. This paper investigates the complex relationship between the graph structure and the efficiency of constraint-based causal discovery algorithms. Our main contributions include (i) a near-tight characterization of which causal graphs admit a small d-separating set for each pair of vertices and thus can potentially be efficiently recovered by a constraint-based causal discovery algorithm, (ii) the explicit construction of a sequence of causal graphs on which the influential PC algorithm might need exponential time, although there is a small d-separating set between every pair of variables, and (iii) the formulation of a new causal discovery algorithm which achieves fixed-parameter running time by considering the maximum number of edge-disjoint paths between variables in the (undirected) super-structure as the parameter. A distinguishing feature of our investigation is that it is carried out within a more fine-grained model which more faithfully captures the infeasibility of performing accurate independence tests for large sets of conditioning variables.

List of keywords

Knowledge Representation and Reasoning -> KRR: Causality
Knowledge Representation and Reasoning -> KRR: Computational complexity of reasoning

2901

Vision-fused Attack: Advancing Aggressive and Stealthy Adversarial Text against Neural Machine Translation

Yanni Xue, Haojie Hao, Jiakai Wang, Qiang Sheng, Renshuai Tao, Yu Liang, Pu Feng, Xianglong Liu

While neural machine translation (NMT) models achieve success in our daily lives, they show vulnerability to adversarial attacks. Despite being harmful, these attacks also offer benefits for interpreting and enhancing NMT models, thus drawing increased research attention. However, existing studies on adversarial attacks are insufficient in both attacking ability and human imperceptibility due to their sole focus on the scope of language. This paper proposes a novel vision-fused attack (VFA) framework to acquire powerful adversarial text, i.e., text that is more aggressive and stealthy. Regarding the attacking ability, we design the vision-merged solution space enhancement strategy to enlarge the limited semantic solution space, which enables us to search for adversarial candidates with higher attacking ability. For human imperceptibility, we propose the perception-retained adversarial text selection strategy to align with the human text-reading mechanism. Thus, the finally selected adversarial text can be more deceptive. Extensive experiments on various models, including large language models (LLMs) like LLaMA and GPT-3.5, strongly support that VFA outperforms the comparisons by large margins (up to 81%/14% improvements on ASR/SSIM).

List of keywords

Natural Language Processing -> NLP: Machine translation and multilinguality
AI Ethics, Trust, Fairness -> ETF: Safety and robustness
Machine Learning -> ML: Trustworthy machine learning

2906

Enhanced DouDiZhu Card Game Strategy Using Oracle Guiding and Adaptive Deep Monte Carlo Method

Qian Luo, Tien Ping Tan, Daochen Zha, Tianqiao Zhang

Deep Reinforcement Learning (DRL) exhibits significant advancements in games with both perfect and imperfect information, such as Go, Chess, Texas Hold’em, and Dota2. However, DRL encounters considerable challenges when tackling the card game DouDiZhu because of the imperfect information, the large state-action space, and the sparse reward issue. This paper presents OADMCDou, which combines Oracle Guiding and Adaptive Deep Monte Carlo Method to address the challenges in DouDiZhu. Oracle Guiding trains an Oracle agent with both imperfect and perfect information, gradually reducing reliance on imperfect information to transition to a standard agent. Adaptive Deep Monte Carlo uses gradient weight clipping and constrains the magnitude of updates to prevent extreme policy updates. We conduct extensive experiments to evaluate the effectiveness of the proposed methods, demonstrating OADMCDou’s superior performance over the state-of-the-art DouDiZhu AI, DouZero. This superiority over DouZero is reflected in two metrics: a 95% confidence interval of 0.104 ± 0.041 for performance, and a 28.6% reduction in loss.

List of keywords

Multidisciplinary Topics and Applications -> MTA: Entertainment
Multidisciplinary Topics and Applications -> MTA: Computer games
Multidisciplinary Topics and Applications -> MTA: Game playing
Agent-based and Multi-agent Systems -> MAS: Other

2920

Zero-shot high-fidelity and pose-controllable character animation

Bingwen Zhu, Fanyi Wang, Tianyi Lu, Peng Liu, Jingwen Su, Jinxiu Liu, Yanhao Zhang, Zuxuan Wu, Yu-Gang Jiang, Guo-Jun Qi

Image-to-video (I2V) generation aims to create a video sequence from a single image, which requires high temporal coherence and visual fidelity. However, existing approaches suffer from inconsistency of character appearances and poor preservation of fine details. Moreover, they require a large amount of video data for training, which can be computationally demanding. To address these limitations, we propose PoseAnimate, a novel zero-shot I2V framework for character animation. PoseAnimate contains three key components: 1) a Pose-Aware Control Module (PACM) that incorporates diverse pose signals into text embeddings, to preserve character-independent content and maintain precise alignment of actions; 2) a Dual Consistency Attention Module (DCAM) that enhances temporal consistency and retains character identity and intricate background details; 3) a Mask-Guided Decoupling Module (MGDM) that refines distinct feature perception abilities, improving animation fidelity by decoupling the character and background. We also propose a Pose Alignment Transition Algorithm (PATA) to ensure smooth action transition. Extensive experimental results demonstrate that our approach outperforms the state-of-the-art training-based methods in terms of character consistency and detail fidelity. Moreover, it maintains a high level of temporal coherence throughout the generated animations.

List of keywords

Computer Vision -> CV: Image and video synthesis and generation
Machine Learning -> ML: Multi-modal learning

2924

Evaluation of Project Performance in Participatory Budgeting

Niclas Boehmer, Piotr Faliszewski, Łukasz Janeczko, Dominik Peters, Grzegorz Pierczyński, Šimon Schierreich, Piotr Skowron, Stanisław Szufa

We study ways of evaluating the performance of losing projects in participatory budgeting (PB) elections by seeking actions that would have led to their victory. We focus on lowering the projects’ costs, obtaining additional approvals for them, and asking supporters to refrain from approving other projects: the larger the change needed, the less successful the given project. We seek efficient algorithms for computing our measures and we analyze and compare them experimentally. We focus on the GreedyAV, Phragmén, and Equal-Shares PB rules.
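As a small illustration of the cost-lowering measure described in this abstract, the sketch below implements a greedy approval rule (projects are considered in order of decreasing approval count and funded whenever they still fit the budget) and then searches, by brute force, for the highest cost at which a losing project would have been funded. The tie-breaking by name, the integer costs, the helper name `max_winning_cost`, and the example election are all assumptions of this sketch; the paper seeks efficient algorithms rather than this exhaustive search.

```python
def greedy_av(projects, budget):
    """projects maps name -> (cost, approvals). Returns the funded project names."""
    funded, remaining = [], budget
    # Most-approved first; ties broken alphabetically (an assumption).
    for name in sorted(projects, key=lambda n: (-projects[n][1], n)):
        cost = projects[name][0]
        if cost <= remaining:
            funded.append(name)
            remaining -= cost
    return funded


def max_winning_cost(projects, budget, loser):
    """Largest integer cost, at most the current one, at which `loser` gets funded."""
    original_cost, approvals = projects[loser]
    for cost in range(original_cost, 0, -1):
        trial = dict(projects)
        trial[loser] = (cost, approvals)
        if loser in greedy_av(trial, budget):
            return cost
    return None  # not funded at any positive integer cost


# Hypothetical election: name -> (cost, number of approvals).
projects = {"park": (60, 50), "library": (50, 40), "bike": (30, 30)}
winners = greedy_av(projects, budget=100)
needed = max_winning_cost(projects, budget=100, loser="library")
```

Here the library loses at a cost of 50 but would win at 40, so a 20% cost reduction separates it from victory; a project needing a larger reduction would count as less successful under this measure.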

List of keywords

Game Theory and Economic Paradigms -> GTEP: Computational social choice

2925

Geometry-Guided Conditional Adaptation for Surrogate Models of Large-Scale 3D PDEs on Arbitrary Geometries

Jingyang Deng, Xingjian Li, Haoyi Xiong, Xiaoguang Hu, Jinwen Ma

Deep learning surrogate models aim to accelerate the solving of partial differential equations (PDEs) and have achieved promising results. Although several mainstream neural operator models have been applied to PDEs on varying geometries, they were designed to map the complex geometry to a latent uniform grid, which remains challenging for networks with general architectures to learn. In this work, we rethink the critical factors of PDE solutions and propose a novel model-agnostic framework, called 3D Geometry-Guided Conditional adaptation (3D-GeoCA), for solving PDEs on arbitrary 3D geometries. Starting with a 3D point cloud geometry encoder, 3D-GeoCA can extract essential and robust representations of any kind of geometric shape, which conditionally guide the adaptation of hidden features in the surrogate model. We conduct experiments on two public computational fluid dynamics datasets, the Shape-Net Car and Ahmed-Body datasets, using several surrogate models as the backbones with various point cloud geometry encoders to simulate the corresponding large-scale Reynolds-averaged Navier-Stokes equations. Equipped with 3D-GeoCA, these backbone models can reduce the L-2 error by a large margin. Moreover, 3D-GeoCA is model-agnostic, so it can be applied to any surrogate model.

List of keywords

Multidisciplinary Topics and Applications -> MTA: Physical sciences
Computer Vision -> CV: 3D computer vision
Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning
Machine Learning -> ML: Supervised Learning

2929

MARCO: A Memory-Augmented Reinforcement Framework for Combinatorial Optimization

Andoni I. Garmendia, Quentin Cappart, Josu Ceberio, Alexander Mendiburu

Neural Combinatorial Optimization (NCO) is an emerging domain where deep learning techniques are employed to address combinatorial optimization problems as a standalone solver. Despite their potential, existing NCO methods often suffer from inefficient search space exploration, frequently leading to local optima entrapment or redundant exploration of previously visited states. This paper introduces a versatile framework, referred to as Memory-Augmented Reinforcement for Combinatorial Optimization (MARCO), that can be used to enhance both constructive and improvement methods in NCO through an innovative memory module. MARCO stores data collected throughout the optimization trajectory and retrieves contextually relevant information at each state. This way, the search is guided by two competing criteria: making the best decision in terms of the quality of the solution and avoiding revisiting already explored solutions. This approach promotes a more efficient use of the available optimization budget. Moreover, thanks to the parallel nature of NCO models, several search threads can run simultaneously, all sharing the same memory module, enabling an efficient collaborative exploration. Empirical evaluations, carried out on the maximum cut, maximum independent set and travelling salesman problems, reveal that the memory module effectively increases the exploration, enabling the model to discover diverse, higher-quality solutions. MARCO achieves good performance at a low computational cost, establishing a promising new direction in the field of NCO.

List of keywords

Search -> S: Combinatorial search and optimisation
Search -> S: Search and machine learning

2937

Guidance Graph Optimization for Lifelong Multi-Agent Path Finding

Yulun Zhang, He Jiang, Varun Bhatt, Stefanos Nikolaidis, Jiaoyang Li

We study how to use guidance to improve the throughput of lifelong Multi-Agent Path Finding (MAPF). Previous studies have demonstrated that while incorporating guidance, such as highways, can accelerate MAPF algorithms, this often results in a trade-off with solution quality. In addition, how to generate good guidance automatically remains largely unexplored, with current methods falling short of surpassing manually designed ones. In this work, we introduce the directed guidance graph as a versatile representation of guidance for lifelong MAPF, framing Guidance Graph Optimization (GGO) as the task of optimizing its edge weights. We present two GGO algorithms to automatically generate guidance for arbitrary lifelong MAPF algorithms and maps. The first method directly solves GGO by employing CMA-ES, a black-box optimization algorithm. The second method, PIU, optimizes an update model capable of generating guidance, demonstrating the ability to transfer optimized guidance graphs to larger maps with similar layouts. Empirically, we show that (1) our guidance graphs improve the throughput of three representative lifelong MAPF algorithms in four benchmark maps, and (2) our update model can generate guidance graphs for maps as large as 93 × 91 and for as many as 3000 agents.

List of keywords

Agent-based and Multi-agent Systems -> MAS: Multi-agent planning
Planning and Scheduling -> PS: Applications
Robotics -> ROB: Multi-robot systems
Search -> S: Evolutionary computation

2939

Capturing Knowledge Graphs and Rules with Octagon Embeddings

Victor Charpenay, Steven Schockaert

Region-based knowledge graph embeddings represent relations as geometric regions. This has the advantage that the rules which are captured by the model are made explicit, making it straightforward to incorporate prior knowledge and to inspect learned models. Unfortunately, existing approaches are severely restricted in their ability to model relational composition, and hence also their ability to model rules, thus failing to deliver on the main promise of region-based models. With the aim of addressing these limitations, we investigate regions which are composed of axis-aligned octagons. Such octagons are particularly easy to work with, as intersections and compositions can be straightforwardly computed, while they are still sufficiently expressive to model arbitrary knowledge graphs. Among other results, we show that our octagon embeddings can properly capture a non-trivial class of rule bases. Finally, we show that our model achieves competitive experimental results.

List of keywords

Knowledge Representation and Reasoning -> KRR: Learning and reasoning
Data Mining -> DM: Knowledge graphs and knowledge base completion
Machine Learning -> ML: Neuro-symbolic methods

2942

Enabling Mixed Effects Neural Networks for Diverse, Clustered Data Using Monte Carlo Methods

Andrej Tschalzev, Paul Nitschke, Lukas Kirchdorfer, Stefan Lüdtke, Christian Bartelt, Heiner Stuckenschmidt

Neural networks often assume independence among input data samples, disregarding correlations arising from inherent clustering patterns in real-world datasets (e.g., due to different sites or repeated measurements). Recently, mixed effects neural networks (MENNs) which separate cluster-specific ‘random effects’ from cluster-invariant ‘fixed effects’ have been proposed to improve generalization and interpretability for clustered data. However, existing methods only allow for approximate quantification of cluster effects and are limited to regression and binary targets with only one clustering feature. We present MC-GMENN, a novel approach employing Monte Carlo techniques to train Generalized Mixed Effects Neural Networks. We empirically demonstrate that MC-GMENN outperforms existing mixed effects deep learning models in terms of generalization performance, time complexity, and quantification of inter-cluster variance. Additionally, MC-GMENN is applicable to a wide range of datasets, including multi-class classification tasks with multiple high-cardinality categorical features. For these datasets, we show that MC-GMENN outperforms conventional encoding and embedding methods, simultaneously offering a principled methodology for interpreting the effects of clustering patterns.

List of keywords

Machine Learning -> ML: Deep learning architectures
Machine Learning -> ML: Classification
Machine Learning -> ML: Explainable/Interpretable machine learning
Machine Learning -> ML: Probabilistic machine learning

2947

Ordinal Maximin Guarantees for Group Fair Division

Pasin Manurangsi, Warut Suksompong

We investigate fairness in the allocation of indivisible items among groups of agents using the notion of maximin share (MMS). While previous work has shown that no nontrivial multiplicative MMS approximation can be guaranteed in this setting for general group sizes, we demonstrate that ordinal relaxations are much more useful. For example, we show that if n agents are distributed equally across g groups, there exists a 1-out-of-k MMS allocation for k = O(g log(n/g)), while if all but a constant number of agents are in the same group, we obtain k = O(log n / log log n). We also establish the tightness of these bounds and provide non-asymptotic results for the case of two groups.
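The ordinal relaxation in the abstract above can be illustrated for a single agent. The sketch below assumes additive, non-negative item values and uses the usual definition: an agent's 1-out-of-k maximin share is the best worst-bundle value she can secure by splitting the items into k bundles. The function name `one_out_of_k_mms` is hypothetical, and the brute-force search is only meant for tiny instances; it is not the paper's construction.

```python
from itertools import product


def one_out_of_k_mms(values, k):
    """Brute-force 1-out-of-k maximin share for one agent.

    Assumes additive, non-negative item values (an illustrative assumption).
    """
    best = 0
    # Try every assignment of items to k bundles; keep the best worst bundle.
    for assignment in product(range(k), repeat=len(values)):
        bundles = [0] * k
        for value, bundle in zip(values, assignment):
            bundles[bundle] += value
        best = max(best, min(bundles))
    return best


values = [3, 1, 2, 2]
print(one_out_of_k_mms(values, 2))  # 4, from the split {3, 1} and {2, 2}
print(one_out_of_k_mms(values, 3))  # 2
```

Increasing k can only lower the share, which is why a larger k gives a weaker but easier-to-satisfy guarantee.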

List of keywords

Game Theory and Economic Paradigms -> GTEP: Fair division
Game Theory and Economic Paradigms -> GTEP: Computational social choice

2950

On Using Admissible Bounds for Learning Forward Search Heuristics

Carlos Núñez-Molina, Masataro Asai, Pablo Mesejo, Juan Fernandez-Olivares

In recent years, there has been growing interest in utilizing modern machine learning techniques to learn heuristic functions for forward search algorithms. Despite this, there has been little theoretical understanding of what they should learn, how to train them, and why we do so. This lack of understanding has resulted in the adoption of diverse training targets (suboptimal vs optimal costs vs admissible heuristics) and loss functions (e.g., square vs absolute errors) in the literature. In this work, we focus on how to effectively utilize the information provided by admissible heuristics in heuristic learning. We argue that learning from poly-time admissible heuristics by minimizing mean square errors (MSE) is not the correct approach, since its result is merely a noisy, inadmissible copy of an efficiently computable heuristic. Instead, we propose to model the learned heuristic as a truncated Gaussian, where admissible heuristics are used not as training targets but as lower bounds of this distribution. This results in a different loss function from the MSE commonly employed in the literature, which implicitly models the learned heuristic as a Gaussian distribution. We conduct experiments where both MSE and our novel loss function are applied to learning a heuristic from optimal plan costs. Results show that our proposed method converges faster during training and yields better heuristics.
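The difference between the two losses mentioned in the abstract can be sketched with the standard negative log-likelihoods. This is a minimal illustration, assuming a Gaussian with mean `mu` and standard deviation `sigma` that is truncated from below at `lower` (standing in for an admissible heuristic value) and a training target `x` (standing in for an optimal plan cost). The function names and the choice of a one-sided truncation are assumptions for illustration, not the authors' implementation.

```python
import math
import sys


def gaussian_nll(x, mu, sigma):
    """Negative log-likelihood of x under N(mu, sigma^2); MSE up to scaling when sigma is fixed."""
    z = (x - mu) / sigma
    return 0.5 * z * z + math.log(sigma) + 0.5 * math.log(2.0 * math.pi)


def truncated_gaussian_nll(x, mu, sigma, lower):
    """Negative log-likelihood of x under N(mu, sigma^2) truncated to [lower, inf)."""
    if x < lower:
        return math.inf  # no probability mass below the lower bound
    # Mass of the untruncated Gaussian at or above the bound: 1 - Phi((lower - mu) / sigma).
    tail = 0.5 * math.erfc((lower - mu) / (sigma * math.sqrt(2.0)))
    tail = max(tail, sys.float_info.min)  # guard against log(0) for extreme bounds
    return gaussian_nll(x, mu, sigma) + math.log(tail)


# Hypothetical numbers: target cost 10, predicted mean 8, admissible bound 7.
print(gaussian_nll(10.0, 8.0, 2.0))
print(truncated_gaussian_nll(10.0, 8.0, 2.0, lower=7.0))
```

The extra `log(tail)` term is what separates the two losses: it depends on `mu` and `sigma`, so the bound influences the gradient even though it is never used as a regression target.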

List of keywords

Planning and Scheduling -> PS: Learning in planning and scheduling
Machine Learning -> ML: Knowledge-aided learning
Search -> S: Heuristic search
Machine Learning -> ML: Neuro-symbolic methods

2971

Towards Automatic Composition of ASP Programs from Natural Language Specifications

Manuel Borroto Santana, Irfan Kareem, Francesco Ricca

This paper takes the first step towards automating the composition of Answer Set Programming (ASP) specifications. In particular, the following contributions are provided: (i) A dataset focused on graph-related problem specifications, designed to develop and assess tools for ASP automatic coding; (ii) A two-step architecture, implemented in the NL2ASP tool, for generating ASP programs from natural language specifications. NL2ASP uses neural machine translation to transform natural language into Controlled Natural Language (CNL) statements. Subsequently, CNL statements are converted into ASP code using the CNL2ASP tool. An experimental analysis confirms the viability of the approach.

List of keywords

Natural Language Processing -> NLP: Applications
Knowledge Representation and Reasoning -> KRR: Logic programming
Machine Learning -> ML: Generative models

2975

ProtoPFormer: Concentrating on Prototypical Parts in Vision Transformers for Interpretable Image Recognition

Mengqi Xue, Qihan Huang, Haofei Zhang, Jingwen Hu, Jie Song, Mingli Song, Canghong Jin

Prototypical part network (ProtoPNet) and its variants have drawn wide attention and been applied to various tasks due to their inherent self-explanatory property. Previous ProtoPNets are primarily built upon convolutional neural networks (CNNs). Therefore, it is natural to investigate whether these explainable methods can be advantageous for the recently emerged Vision Transformers (ViTs). However, directly utilizing ViT-backed models as backbones can lead to prototypes paying excessive attention to background positions rather than foreground objects (i.e., the “distraction” problem). To address the problem, this paper proposes prototypical part Transformer (ProtoPFormer) for interpretable image recognition. Based on the architectural characteristics of ViTs, we modify the original ProtoPNet by creating separate global and local branches, each accompanied by corresponding prototypes that can capture and highlight representative holistic and partial features. Specifically, the global prototypes can guide local prototypes to concentrate on the foreground and effectively suppress the background influence. Subsequently, local prototypes are explicitly supervised to concentrate on different discriminative visual parts. Finally, the two branches mutually correct each other and jointly make the final decisions. Moreover, extensive experiments demonstrate that ProtoPFormer can consistently achieve superior performance on accuracy, visualization results, and quantitative interpretability evaluation over the state-of-the-art (SOTA) baselines. Our code has been released at https://github.com/zju-vipa/ProtoPFormer.

List of keywords

Computer Vision -> CV: Interpretability and transparency
Computer Vision -> CV: Recognition (object detection, categorization)
Computer Vision -> CV: Representation learning

2979

Epistemic Logic Programs: Non-Ground and Counting Complexity

Thomas Eiter, Johannes Fichte, Markus Hecher, Stefan Woltran

This paper establishes the computational complexity of non-ground ELPs. We provide a comprehensive picture for well-known program fragments, which turns out to be complete for the class NEXPTIME with access to oracles up to $\Sigma^P_2$. In the quantitative setting, which enables more fine-grained reasoning, we establish complexity results for counting complexity beyond #EXP. To mitigate the high complexity, we establish encouraging results in case of bounded predicate arity, reaching up to the fourth level of the polynomial hierarchy. Finally, we provide ETH-tight runtime results for the structural parameter treewidth, which has applications in quantitative reasoning modes, where we reason on (marginal) probabilities of epistemic literals.

List of keywords

Knowledge Representation and Reasoning -> KRR: Computational complexity of reasoning
Knowledge Representation and Reasoning -> KRR: Knowledge representation languages
Knowledge Representation and Reasoning -> KRR: Logic programming
Knowledge Representation and Reasoning -> KRR: Non-monotonic reasoning

2980

HVOFusion: Incremental Mesh Reconstruction Using Hybrid Voxel Octree

Shaofan Liu, Junbo Chen, Jianke Zhu

Incremental scene reconstruction is essential to navigation in robotics. Most conventional methods make use of either a TSDF (truncated signed distance function) volume or neural networks to implicitly represent the surface. Due to the voxel representation or the time-consuming sampling involved, they have difficulty in balancing speed, memory storage, and surface quality. In this paper, we propose a novel hybrid voxel-octree approach to effectively fuse octree with voxel structures so that we can take advantage of both implicit surface and explicit triangular mesh representation. Such a sparse structure preserves triangular faces in the leaf nodes and produces partial meshes sequentially for incremental reconstruction. This storage scheme allows us to naturally optimize the mesh in explicit 3D space to achieve higher surface quality. We iteratively deform the mesh towards the target and recover vertex colors by optimizing a shading model. Experimental results on several datasets show that our proposed approach is capable of quickly and accurately reconstructing a scene with realistic colors. Code is available at https://github.com/Frankuzi/HVOFusion.

List of keywords

Robotics -> ROB: Localization, mapping, state estimation
Computer Vision -> CV: 3D computer vision
Computer Vision -> CV: Applications
Robotics -> ROB: Robotics and vision

2991

Advancing Generalized Transfer Attack with Initialization Derived Bilevel Optimization and Dynamic Sequence Truncation

Yaohua Liu, Jiaxin Gao, Xuan Liu, Xianghao Jiao, Xin Fan, Risheng Liu

Transfer attacks generate significant interest for real-world black-box applications by crafting transferable adversarial examples through surrogate models. However, existing works essentially optimize the single-level objective directly w.r.t. the surrogate model, which always leads to poor interpretability of the attack mechanism and limited generalization performance over unknown victim models. In this work, we propose the BilEvel Transfer AttacK (BETAK) framework by establishing an initialization derived bilevel optimization paradigm, which explicitly reformulates the nested constraint relationship between the Upper-Level (UL) pseudo-victim attacker and the Lower-Level (LL) surrogate attacker. Algorithmically, we introduce the Hyper Gradient Response (HGR) estimation as an effective feedback for the transferability over pseudo-victim attackers, and propose the Dynamic Sequence Truncation (DST) technique to dynamically adjust the back-propagation path for HGR and reduce computational overhead simultaneously. Meanwhile, we conduct detailed algorithmic analysis and provide a convergence guarantee that supports non-convexity of the LL surrogate attacker. Extensive evaluations demonstrate substantial improvement of BETAK (e.g., a 53.41% increase of attack success rates against IncRes-v2$_{ens}$) against different victims and defense methods in targeted and untargeted attack scenarios.

List of keywords

Computer Vision -> CV: Adversarial learning, adversarial attack and defense methods
Computer Vision -> CV: Machine learning for vision

2993

Learning from Long-Tailed Noisy Data with Sample Selection and Balanced Loss

Lefan Zhang, Zhang-Hao Tian, Wujun Zhou, Wei Wang

The success of deep learning depends on large-scale and well-curated training data, while data in real-world applications are commonly long-tailed and noisy. Existing methods usually depend on label frequency to tackle class imbalance, while the model bias on different classes is not directly related to label frequency and the true label frequency is inaccessible under label noise. To solve this, we propose a robust method for learning from long-tailed noisy data with sample selection and balanced loss. Specifically, we separate the noisy training data into a clean labeled set and an unlabeled set with sample selection, and train the deep neural network in a semi-supervised manner with a balanced loss based on model bias. Extensive experiments on benchmarks demonstrate that our method outperforms existing state-of-the-art methods.

List of keywords

Machine Learning -> ML: Classification

2994

Preferred Reasoning in ABA by Cycle-Breaking

Kiet Nguyen Anh, Markus Ulbricht

We develop a fixed-parameter tractable (FPT) algorithm for skeptical preferred reasoning in assumption-based argumentation (ABA). To this end we make use of so-called backdoors, i.e. sets of assumptions that need to be evaluated s.t. the remaining ABA framework (ABAF) belongs to a computationally beneficial sub-class. In order to identify such target classes, we employ a suitable notion of a dependency graph of an ABAF. We show that these graphs can be constructed in polynomial time and that one can efficiently check sufficient properties ensuring that reasoning in the underlying ABAF is tractable. After establishing the theoretical foundations, we test our implementation against the ASPforABA solver, which convincingly won the ABA track of the ICCMA’23 competition. As it turns out, our algorithm outperforms ASPforABA on instances with small backdoor sizes.

List of keywords

Knowledge Representation and Reasoning -> KRR: Argumentation
Knowledge Representation and Reasoning -> KRR: Computational complexity of reasoning

3000

PRASS: Probabilistic Risk-averse Robust Learning with Stochastic Search

Tianle Zhang, Yanghao Zhang, Ronghui Mu, Jiaxu Liu, Jonathan Fieldsend, Wenjie Ruan

Deep learning models, despite their remarkable success in various tasks, have been shown to be vulnerable to adversarial perturbations. Although robust learning techniques that consider adversarial risks against worst-case perturbations can effectively increase a model’s robustness, they may not always be the most suitable approach. This is because, in certain scenarios, perturbations are more likely to occur probabilistically than to be intentionally crafted by attackers. To address this challenge, we propose a novel risk-averse robust learning method based on entropic value-at-risk, called PRASS (Probabilistic Risk-Averse Robust Learning with Stochastic Search). Our approach leverages principles of stochastic optimisation and considers perturbing distributions rather than solely worst-case adversaries. By applying adaptive stochastic search to parameterised distributions, we further enhance the scalability of PRASS to handle distributional robustness. Empirical experiments demonstrate that PRASS outperforms existing state-of-the-art baselines.

List of keywords

AI Ethics, Trust, Fairness -> ETF: Trustworthy AI
AI Ethics, Trust, Fairness -> ETF: Safety and robustness
Machine Learning -> ML: Adversarial machine learning
Machine Learning -> ML: Robustness

3009

CPa-WAC: Constellation Partitioning-based Scalable Weighted Aggregation Composition for Knowledge Graph Embedding

Sudipta Modak, Aakarsh Malhotra, Sarthak Malik, Anil Surisetty, Esam Abdel-Raheem

Scalability and training time are crucial for any graph neural network model processing a knowledge graph (KG). While partitioning knowledge graphs helps reduce the training time, the prediction accuracy drops significantly compared to training the model on the whole graph. In this paper, we propose CPa-WAC: a lightweight architecture that incorporates graph convolutional networks and modularity maximization-based constellation partitioning to harness the power of local graph topology. The proposed CPa-WAC method reduces the training time and memory cost of knowledge graph embedding, making the learning model scalable. The results from our experiments on standard databases, such as Wordnet and Freebase, show that by achieving meaningful partitioning, any knowledge graph can be broken down into subgraphs and processed separately to learn embeddings. Furthermore, these learned embeddings can be used for knowledge graph completion, retaining similar performance compared to training a GCN on the whole KG, while speeding up the training process by almost five times. Additionally, the proposed CPa-WAC method outperforms several other state-of-the-art KG embedding methods in terms of prediction accuracy.

List of keywords

Knowledge Representation and Reasoning -> KRR: Applications
Machine Learning -> ML: Automated machine learning
Machine Learning -> ML: Knowledge-aided learning
Machine Learning -> ML: Optimization

3020

Mechanisms That Play a Game, Not Toss a Coin

Toby Walsh

Randomized mechanisms can have good normative properties compared to their deterministic counterparts. However, randomized mechanisms are problematic in several ways such as in their verifiability. We propose here to de-randomize such mechanisms by having agents play a game instead of tossing a coin. The game is designed so agents play randomly, and this play injects “randomness” into the mechanism. Surprisingly this de-randomization retains many of the good normative properties of the original randomized mechanism but gives a mechanism that is deterministic and easy, for instance, to audit. We consider three general purpose methods to de-randomize mechanisms, and apply these to six different domains: voting, facility location, task allocation, school choice, peer selection, and resource allocation. We propose a number of novel de-randomized mechanisms for these six domains with good normative properties (such as equilibria in which agents sincerely report preferences over the original problem). In one domain, we additionally show that a new and desirable normative property emerges as a result of de-randomization.

List of keywords

Game Theory and Economic Paradigms -> GTEP: Mechanism design
Agent-based and Multi-agent Systems -> MAS: Resource allocation
Game Theory and Economic Paradigms -> GTEP: Computational social choice
Game Theory and Economic Paradigms -> GTEP: Fair division

3024

Scalable Landmark Hub Labeling for Optimal and Bounded Suboptimal Pathfinding

Sabine Storandt

Hub Labeling and A* are two well-established algorithms for shortest path computation in large graphs. Hub Labeling offers excellent query times for distance computation, but at the cost of a high space consumption for label storage. Landmark-based A* search requires less space but answers queries much slower. Recently, Landmark Hub Labeling (LHL) has been proposed, which combines both concepts and achieves a smaller space consumption than Hub Labeling and also much better query times than A*. However, the known algorithms for computing an LHL do not scale to large graphs, limiting its applicability. In this paper, we devise novel algorithms for LHL construction that work on graphs with millions of edges. We also further improve the LHL query answering algorithm and investigate how to reduce the space consumption of labeling techniques by performing bounded suboptimal pathfinding. In an extensive experimental study, we demonstrate the effectiveness of our methods and show that sensible trade-offs between space consumption, query time, and path quality can be achieved with LHL.
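For readers unfamiliar with the baseline, the query step of plain Hub Labeling (not the LHL variant studied in the paper) can be sketched as follows. Each node stores a label mapping hub nodes to distances, and a distance query takes the minimum of d(s, h) + d(h, t) over hubs h present in both labels. The sketch assumes an undirected graph and labels that already satisfy the cover property; the function name and the toy labels are hypothetical.

```python
def hub_label_distance(label_s, label_t):
    """Distance query: minimum of d(s, h) + d(h, t) over hubs h shared by both labels."""
    # Iterate over the smaller label and probe the larger one.
    if len(label_s) > len(label_t):
        label_s, label_t = label_t, label_s
    best = float("inf")
    for hub, dist_s in label_s.items():
        dist_t = label_t.get(hub)
        if dist_t is not None:
            best = min(best, dist_s + dist_t)
    return best


# Toy labels for the unit-weight path a - b - c - d (hand-built for illustration).
labels = {
    "a": {"a": 0, "b": 1},
    "b": {"b": 0},
    "c": {"b": 1, "c": 0},
    "d": {"b": 2, "c": 1},
}
print(hub_label_distance(labels["a"], labels["d"]))  # 3
```

The query cost depends only on label sizes, which is the source of both the fast query times and the high space consumption mentioned in the abstract.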

List of keywords

Planning and Scheduling -> PS: Routing
Multidisciplinary Topics and Applications -> MTA: Transportation
Search -> S: Applications
Search -> S: Combinatorial search and optimisation

3030

Toward Completing the Picture of Control in Schulze and Ranked Pairs Elections

Cynthia Maushagen, David Niclaus, Paul Nüsken, Joerg Rothe, Tessa Seeger

Both Schulze and ranked pairs are voting systems that satisfy many natural, desirable axioms. Many standard types of electoral control (with a chair seeking to change the outcome of an election by interfering with the election structure) have already been studied. However, for control by replacing candidates or voters and for (exact) multimode control that combines multiple standard attacks, many questions remain open. We solve a number of these open cases for Schulze and ranked pairs. In addition, we fix a flaw in the reduction of Menton and Singh [IJCAI 2013] showing that Schulze is resistant to constructive control by deleting candidates and re-establish a vulnerability result for destructive control by deleting candidates. In some of our proofs, we study variants of s-t vertex cuts in graphs that are related to our control problems.
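As background for the control problems above, the Schulze rule itself can be sketched in a few lines. The version below assumes complete strict rankings and measures path strength by winning votes; it computes the strongest paths with a Floyd-Warshall-style pass and returns every candidate who is not beaten on path strength. The function name and ballot format are hypothetical choices for illustration.

```python
def schulze_winners(candidates, ballots):
    """Schulze winners; ballots is a list of (ranking, count), each ranking best-first."""
    # d[a][b]: number of voters who rank a above b.
    d = {a: {b: 0 for b in candidates} for a in candidates}
    for ranking, count in ballots:
        for i, a in enumerate(ranking):
            for b in ranking[i + 1:]:
                d[a][b] += count
    # p[a][b]: strength of the strongest path from a to b (winning votes).
    p = {a: {b: (d[a][b] if d[a][b] > d[b][a] else 0) for b in candidates}
         for a in candidates}
    for k in candidates:
        for a in candidates:
            if a == k:
                continue
            for b in candidates:
                if b == a or b == k:
                    continue
                p[a][b] = max(p[a][b], min(p[a][k], p[k][b]))
    return [a for a in candidates
            if all(p[a][b] >= p[b][a] for b in candidates if b != a)]


# Two voters rank A > B > C and one ranks B > C > A; A is the Condorcet winner.
print(schulze_winners(["A", "B", "C"], [("ABC", 2), ("BCA", 1)]))  # ['A']
```

Control actions such as adding, deleting, or replacing candidates or voters change the pairwise counts `d`, and therefore the strongest paths, which is where the graph-cut arguments mentioned in the abstract come in.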

List of keywords

Game Theory and Economic Paradigms -> GTEP: Computational social choice

3049

Attention Based Document-level Relation Extraction with None Class Ranking Loss

Xiaolong Xu, Chenbin Li, Haolong Xiang, Lianyong Qi, Xuyun Zhang, Wanchun Dou

Through document-level relation extraction (RE), the analysis of the global relation between entities in the text is feasible, and more comprehensive and accurate semantic information can be obtained. In document-level RE, the model needs to infer the implicit relations between two entities in different sentences. To obtain more semantic information, existing methods mainly focus on exploring entity representations. However, they ignore the correlations and indivisibility between relations, entities and contexts. Furthermore, current methods only independently estimate the cases of predefined relations, ignoring the case of “no relation”, which results in poor prediction. To address the above issues, we propose a document-level RE method based on attention mechanisms, which considers the case of “no relation”. Specifically, our approach leverages graph attention and multi-head attention networks to capture the correlations and indivisibility among relations, entities, and contexts, respectively. In addition, a novel multi-label loss function that promotes large margins in label confidence scores between each predefined class and the none class is employed to improve the prediction performance. Extensive experiments conducted on benchmarking datasets demonstrate that our proposed method outperforms the state-of-the-art baselines with higher accuracy.

List of keywords

Natural Language Processing -> NLP: Information extraction
Natural Language Processing -> NLP: Embeddings

3057

Delocate: Detection and Localization for Deepfake Videos with Randomly-Located Tampered Traces

Juan Hu, Xin Liao, Difei Gao, Satoshi Tsutsui, Qian Wang, Zheng Qin, Mike Zheng Shou

Deepfake videos are becoming increasingly realistic, showing few tampering traces on facial areas that vary between frames. Consequently, existing Deepfake detection methods struggle to detect unknown domain Deepfake videos while accurately locating the tampered region. To address this limitation, we propose Delocate, a novel Deepfake detection model that can both recognize and localize unknown domain Deepfake videos. Our method consists of two stages named recovering and localization. In the recovering stage, the model randomly masks regions of interest (ROIs) and reconstructs real faces without tampering traces, leading to a relatively good recovery effect for real faces and a poor recovery effect for fake faces. In the localization stage, the output of the recovery phase and the forgery ground truth mask serve as supervision to guide the forgery localization process. This process strategically emphasizes the recovery phase of fake faces with poor recovery, facilitating the localization of tampered regions. Our extensive experiments on four widely used benchmark datasets demonstrate that Delocate not only excels in localizing tampered areas but also enhances cross-domain detection performance.

List of keywords

Multidisciplinary Topics and Applications -> MTA: Security and privacy
Computer Vision -> CV: Biometrics, face, gesture and pose recognition

3058

Deep Multi-Dimensional Classification with Pairwise Dimension-Specific Features

Teng Huang, Bin-Bin Jia, Min-Ling Zhang

In multi-dimensional classification (MDC), each instance is associated with multiple class variables characterizing the semantics of objects from different dimensions. To consider the dependencies among class variables and the specific characteristics contained in different semantic dimensions, a novel deep MDC approach named PIST is proposed to jointly deal with the two issues via learning pairwise dimension-specific features. Specifically, PIST conducts pairwise grouping to model the dependencies between each pair of class variables, which are more reliable with limited training samples. For extracting pairwise dimension-specific features, PIST weights the feature embedding with a feature importance vector, which is learned via utilizing a global loss measurement based on intra-class and inter-class covariance. Final prediction w.r.t. each dimension is determined by combining the joint probabilities related to this dimension. Comparative studies with eleven real-world MDC data sets clearly validate the effectiveness of the proposed approach.

List of keywords

Machine Learning -> ML: Classification
Machine Learning -> ML: Multi-label learning

3083

Fast One-Stage Unsupervised Domain Adaptive Person Search

Tianxiang Cui, Huibing Wang, Jinjia Peng, Ruoxi Deng, Xianping Fu, Yang Wang

Unsupervised person search aims to localize a particular target person from a gallery set of scene images without annotations, which is extremely challenging due to the unexpected variations of the unlabeled domains. However, most existing methods are dedicated to developing multi-stage models to adapt to domain variations while using clustering for iterative model training, which inevitably increases model complexity. To address this issue, we propose a Fast One-stage Unsupervised person Search (FOUS) which complementarily integrates domain adaption with label adaption in an end-to-end manner without iterative clustering. To minimize the domain discrepancy, FOUS introduces an Attention-based Domain Alignment Module (ADAM) which can not only align various domains for both detection and ReID tasks but also construct an attention mechanism to reduce the adverse impacts of low-quality candidates resulting from unsupervised detection. Moreover, to avoid the redundant iterative clustering mode, FOUS adopts a prototype-guided labeling method which minimizes redundant correlation computations for partial samples and assigns noisy coarse label groups efficiently. The coarse label groups are continuously refined via a label-flexible training network with an adaptive selection strategy. With the adapted domains and labels, FOUS achieves state-of-the-art (SOTA) performance on two benchmark datasets, CUHK-SYSU and PRW. The code is available at https://github.com/whbdmu/FOUS.

List of keywords

Computer Vision -> CV: Recognition (object detection, categorization)
Computer Vision -> CV: Image and video retrieval

3090

Discriminative Feature Decoupling Enhancement for Speech Forgery Detection

Yijun Bei, Xing Zhou, Erteng Liu, Yang Gao, Sen Lin, Kewei Gao, Zunlei Feng

The emergence of AIGC has brought attention to the issue of generating realistic deceptive content. While AIGC has the potential to revolutionize content creation, it also facilitates criminal activities. Specifically, the manipulation of speech has been exploited in tele-fraud and financial fraud schemes, posing a significant threat to societal security. Current deep learning-based methods for detecting forged speech extract mixed features from the original speech, which often contain redundant information. Moreover, these methods fail to consider the distinct characteristics of human voice-specific features and the diversity of background environmental sounds. This paper introduces a framework called Discriminative fEature dEcoupling enhanceMent (DEEM) for detecting speech forgery. Initially, the framework decouples the original speech into human voice features and background sound features. Subsequently, DEEM enhances voice-specific features through temporal-dimension aggregation and improves continuity-related features in the background sound map via spectral-dimension aggregation. By employing the decoupling enhancement features, extensive experiments demonstrate that DEEM achieves an accuracy improvement of over 5% on the FoR dataset compared to the state-of-the-art methods.

List of keywords

AI Ethics, Trust, Fairness -> ETF: Safety and robustness
AI Ethics, Trust, Fairness -> ETF: AI and law, governance, regulation
AI Ethics, Trust, Fairness -> ETF: Fairness and diversity
AI Ethics, Trust, Fairness -> ETF: Societal impact of AI

3100

Rethinking Correlation Learning via Label Prior for Open Set Domain Adaptation

Zi-Xian Huang, Chuan-Xian Ren

Open Set Domain Adaptation (OSDA) aims to transfer knowledge from a labeled source domain to an unlabeled target domain, where known classes exist across domains while unknown classes are present only in the target domain. Existing methods rely on the clustering structure to identify the unknown classes, which empirically induces a large identification error if the unknown classes are a mixture of multiple components. To break through this barrier, we formulate OSDA from the view of correlation and propose a correlation metric-based framework called Balanced Correlation Learning (BCL). BCL employs the Hilbert-Schmidt Independence Criterion (HSIC) to characterize the separation between unknown and known classes, where HSIC is reformulated as the nodes’ relation on a graph. By considering the label prior as a variable, theoretical results are derived to analytically show a sufficient condition for the desired learning direction for OSDA. Methodologically, the class-balanced HSIC is proposed to preserve domain-invariant and class-discriminative features. With the guarantee of correlation learning, the entropy-based principle can effectively identify the unknown classes via uncertainty. Empirically, extensive evaluations are conducted, where BCL achieves significant performance improvements.

List of keywords

Computer Vision -> CV: Machine learning for vision
Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning

3106

Exploring the Role of Node Diversity in Directed Graph Representation Learning

Jincheng Huang, Yujie Mo, Ping Hu, Xiaoshuang Shi, Shangbo Yuan, Zeyu Zhang, Xiaofeng Zhu

Many Directed Graph Neural Network (DGNN) methods are designed to treat nodes in the same neighbor set (i.e., out-neighbor set and in-neighbor set) equally for every node, without considering the node diversity in directed graphs, so they are often unable to adaptively acquire suitable information from neighbors of different directions. To alleviate this issue, in this paper, we investigate a new way to first consider node diversity for representation learning on directed graphs, i.e., neighbor diversity and degree diversity, and then propose a new NDDGNN framework to adaptively assign weights to both outgoing information and incoming information at the node level. Extensive experiments on seven real-world datasets validate the superior performance of our method compared to state-of-the-art methods in terms of both node classification and link prediction tasks.

List of keywords

Data Mining -> DM: Mining graphs
Machine Learning -> ML: Representation learning
Machine Learning -> ML: Semi-supervised learning

3110

With a Little Help from Language: Semantic Enhanced Visual Prototype Framework for Few-Shot Learning

Hecheng Cai, Yang Liu, Shudong Huang, Jiancheng Lv

Few-shot learning (FSL) aims to recognize new categories given limited training samples. The core challenge is to avoid overfitting to the minimal data while ensuring good generalization to novel classes. One mainstream method employs prototypes from visual feature extractors as classifier weights, so performance depends on the quality of the prototypes. Since different categories may have similar visual features, the visual prototype has limitations. This is because existing methods only learn a simple visual feature extractor during the pre-training stage but neglect the importance of a well-developed feature space for the prototype. We introduce the Semantic Enhanced Visual Prototype framework (SEVpro) to address this issue. SEVpro refines prototype learning from the pre-training stage and serves as a versatile plug-and-play framework for all prototype-based FSL methods. Specifically, we enhance prototype discriminability by transforming semantic embeddings into the visual space, aiding in separating categories with similar visual features. For novel class learning, we leverage knowledge from base classes and incorporate semantic information to elevate prototype quality further. Extensive experiments on FSL benchmarks and ablation studies demonstrate the superiority of our proposed SEVpro for FSL.

List of keywords

Machine Learning -> ML: Few-shot learning
Machine Learning -> ML: Multi-modal learning

3113

SAEIR: Sequentially Accumulated Entropy Intrinsic Reward for Cooperative Multi-Agent Reinforcement Learning with Sparse Reward

Xin He, Hongwei Ge, Yaqing Hou, Jincheng Yu

Multi-agent reinforcement learning (MARL) performs well on complex cooperative tasks when the scenarios have well-defined dense rewards. However, many real-world multi-agent systems have sparse reward settings, which makes it difficult for MARL algorithms to learn an effective strategy. To tackle this problem, we propose a novel sequentially accumulated entropy intrinsic reward named SAEIR, which utilizes the entropy of the multi-agent system as a bonus to accelerate learning. Specifically, a multi-scale hypergraph critic is proposed to obtain a high-order system state representation, which also enhances the ability to effectively evaluate the action produced by the actor. Based on the comprehensive and compact system state representation, the orderliness of multi-agent systems can be measured to determine the highly valuable states for adding entropy-based intrinsic rewards, which leads to a highly efficient learning process. Empirical results demonstrate that our proposed method achieves state-of-the-art performance in several complex cooperative multi-agent environments with sparse reward settings.

List of keywords

Machine Learning -> ML: Multiagent Reinforcement Learning
Agent-based and Multi-agent Systems -> MAS: Coordination and cooperation
Agent-based and Multi-agent Systems -> MAS: Multi-agent learning

3119

Towards Dynamic-Prompting Collaboration for Source-Free Domain Adaptation

Mengmeng Zhan, Zongqian Wu, Rongyao Hu, Ping Hu, Heng Tao Shen, Xiaofeng Zhu

In domain adaptation, challenges such as data privacy constraints can impede access to source data, catalyzing the development of source-free domain adaptation (SFDA) methods. However, current approaches rely heavily on models trained on source data, posing the risk of overfitting and suboptimal generalization. This paper introduces a dynamic prompt learning paradigm that harnesses the power of large-scale vision-language models to enhance the semantic transfer of source models. Specifically, our approach fosters robust and adaptive collaboration between the source-trained model and the vision-language model, facilitating the reliable extraction of domain-specific information from unlabeled target data, while consolidating domain-invariant knowledge. Without the need for accessing source data, our method amalgamates the strengths inherent in both traditional SFDA approaches and vision-language models, formulating a collaborative framework for addressing SFDA challenges. Extensive experiments conducted on three benchmark datasets showcase the superiority of our framework over previous SOTA methods.

List of keywords

Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning
Computer Vision -> CV: Multimodal learning
Computer Vision -> CV: Representation learning

3125

Hard-Thresholding Meets Evolution Strategies in Reinforcement Learning

Chengqian Gao, William de Vazelhes, Hualin Zhang, Bin Gu, Zhiqiang Xu

Evolution Strategies (ES) have emerged as a competitive alternative for model-free reinforcement learning, showcasing exemplary performance in tasks like MuJoCo and Atari. Notably, they shine in scenarios with imperfect reward functions, making them invaluable for real-world applications where dense reward signals may be elusive. Yet an inherent assumption in ES, namely that all input features are task-relevant, poses challenges, especially when confronted with the irrelevant features common in real-world problems. This work scrutinizes this limitation, focusing on the Natural Evolution Strategies (NES) variant. We propose NESHT, a novel approach that integrates Hard-Thresholding (HT) with NES to promote sparsity, ensuring that only pertinent features are employed. Backed by rigorous analysis and empirical tests, NESHT demonstrates its promise in mitigating the pitfalls of irrelevant features and performs well in complex decision-making problems like noisy MuJoCo and Atari tasks.
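For readers unfamiliar with the hard-thresholding operator mentioned in this abstract, a minimal sketch follows. It shows only the generic operator (keep the k largest-magnitude entries, zero the rest); how NESHT combines it with the NES update is not specified here, and the function name and tie-breaking rule are assumptions for this example.

```python
def hard_threshold(vector, k):
    """Keep the k largest-magnitude entries of `vector`; zero the rest."""
    if k <= 0:
        return [0.0] * len(vector)
    # Indices ordered by decreasing magnitude; the stable sort breaks
    # ties in favor of the earlier position (an assumption of this sketch).
    order = sorted(range(len(vector)), key=lambda i: -abs(vector[i]))
    keep = set(order[:k])
    return [float(v) if i in keep else 0.0 for i, v in enumerate(vector)]

# Hypothetical usage: sparsify a parameter vector to two active features.
sparse = hard_threshold([3.0, -1.0, 0.5, -4.0], k=2)  # [3.0, 0.0, 0.0, -4.0]
```

In a sparsity-constrained optimizer, an operator of this kind is typically applied to the parameters after each update step.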

List of keywords

Machine Learning -> ML: Evolutionary learning
Machine Learning -> ML: Feature extraction, selection and dimensionality reduction
Machine Learning -> ML: Learning sparse models
Machine Learning -> ML: Optimization

3146

SceneDiff: Generative Scene-Level Image Retrieval with Text and Sketch Using Diffusion Models

Ran Zuo, Haoxiang Hu, Xiaoming Deng, Cangjun Gao, Zhengming Zhang, Yu-Kun Lai, Cuixia Ma, Yong-Jin Liu, Hongan Wang

Jointly using text and sketch for scene-level image retrieval exploits the complementarity between text and sketch to describe fine-grained scene content and retrieve the target image, which plays a pivotal role in accurate image retrieval. Existing methods directly fuse the features of sketch and text and thus make limited use of crucial semantic and structural information, leading to inaccurate matching with images. In this paper, we propose SceneDiff, a novel retrieval network that leverages a pre-trained diffusion model to establish a shared generative latent space, enabling joint latent representation learning for both sketch and text features and precise alignment with the corresponding image. Specifically, we encode text, sketch and image features, and project them into the diffusion-based shared space, conditioning the denoising process on sketch and text features to generate latent fusion features, while employing the pre-trained autoencoder for latent image features. Within this space, we introduce the content-aware feature transformation module to reconcile encoded sketch and image features with the diffusion latent space’s dimensional requirements and preserve their visual content information. Then we augment the representation capability of the generated latent fusion features by integrating multiple samplings with partition attention, and utilize contrastive learning to align both direct fusion features and generated latent fusion features with corresponding image representations. Extensive experiments show that our method outperforms state-of-the-art works, providing a novel insight into the related retrieval field.

List of keywords

Computer Vision -> CV: Image and video retrieval
Computer Vision -> CV: Multimodal learning

3164

Pointsoup: High-Performance and Extremely Low-Decoding-Latency Learned Geometry Codec for Large-Scale Point Cloud Scenes

Kang You, Kai Liu, Li Yu, Pan Gao, Dandan Ding

Despite considerable progress in point cloud geometry compression, effectively compressing large-scale scenes with sparse surfaces remains a challenge. Another key challenge lies in reducing decoding latency, a crucial requirement in real-world applications. In this paper, we propose Pointsoup, an efficient learning-based geometry codec that attains high performance and extremely low decoding latency simultaneously. Inspired by the conventional Trisoup codec, a point model-based strategy is devised to characterize local surfaces. Specifically, skin features are embedded from local windows via an attention-based encoder, and dilated windows are introduced as cross-scale priors to infer the distribution of quantized features in parallel. During decoding, features undergo fast refinement, followed by a folding-based point generator that reconstructs point coordinates at fairly high speed. Experiments show that Pointsoup achieves state-of-the-art performance on multiple benchmarks with significantly lower decoding complexity, i.e., up to 90~160× faster than the G-PCCv23 Trisoup decoder on a comparatively low-end platform (e.g., one RTX 2080Ti). Furthermore, it offers variable-rate control with a single neural model (2.9MB), which is attractive for industrial practitioners.

List of keywords

Machine Learning -> ML: Geometric learning
Computer Vision -> CV: 3D computer vision
Multidisciplinary Topics and Applications -> MTA: Real-time systems
Robotics -> ROB: Robotics and vision

3165

ADMN: Agent-Driven Modular Network for Dynamic Parameter Sharing in Cooperative Multi-Agent Reinforcement Learning

Yang Yu, Qiyue Yin, Junge Zhang, Pei Xu, Kaiqi Huang

Parameter sharing is a common strategy in multi-agent reinforcement learning (MARL) to make training more efficient and scalable. However, applying parameter sharing among agents indiscriminately hinders the emergence of agent diversity and degrades the final cooperative performance. To better balance parameter sharing and agent diversity, we propose a novel Agent-Driven Modular Network (ADMN), where agents share a base network consisting of multiple specialized modules, and each agent has its own routing to connect these modules. In ADMN, modules are shared among agents to improve training efficiency, while the combination of different modules brings rich diversity. The agent routing at different time steps is learned end-to-end to achieve a dynamic and adaptive balance. We also propose an information-theoretic regularization between the routing of agents and their behavior to further guarantee the identifiability of different routings. We evaluated ADMN in challenging StarCraft micromanagement games and Google Research Football games, and the results demonstrate the superior performance of ADMN, particularly in larger or heterogeneous cooperative tasks.

List of keywords

Agent-based and Multi-agent Systems -> MAS: Multi-agent learning
Machine Learning -> ML: Reinforcement learning

3180

Unlearning from Weakly Supervised Learning

Yi Tang, Yi Gao, Yonggang Luo, Yang JuCheng, Miao Xu, Min-Ling Zhang

Machine unlearning provides users with the right to remove their private data from a well-trained model. Existing machine unlearning approaches mainly focus on data removal within supervised learning (SL) tasks. However, weakly supervised learning (WSL) is more applicable to real-world scenarios since collecting WSL data is less laborious than collecting fully supervised data. In this paper, we first propose a machine unlearning approach for WSL by updating the model parameters. Motivated by the uniform distributions of untrained model predictions, we derive a formulated target to force the model’s predictions of removed data to be indistinguishable. This encourages the model to forget its ability to recognize features of data slated for unlearning. Moreover, we employ formulated targets to transform classification unlearning into a convex regression, which can significantly reduce computational cost and avoid extra information storage during the training process. Additionally, we discuss how to design a target to ensure that the model’s predictions of removed data are indistinguishable in different learning scenarios, e.g., SL or WSL. Owing to the flexibility in formulating targets, the proposed approach effectively deals with the WSL problem while still excelling on SL models. Empirical studies show the superiority of the proposed approach.

List of keywords

Machine Learning -> ML: Weakly supervised learning
Machine Learning -> ML: Other

3202

A Prior-information-guided Residual Diffusion Model for Multi-modal PET Synthesis from MRI

Zaixin Ou, Caiwen Jiang, Yongsheng Pan, Yuanwang Zhang, Zhiming Cui, Dinggang Shen

Alzheimer’s disease (AD) leads to abnormalities in various biomarkers (i.e., amyloid-β and tau proteins), which makes PET imaging (which can detect these biomarkers) essential in AD diagnosis. However, the high radiation risk of PET imaging limits the number of scans that can be performed within a short period, presenting challenges to the joint multi-biomarker diagnosis of AD. In this paper, we propose a novel unified model to simultaneously synthesize multi-modal PET images from MRI, to achieve low-cost and time-efficient joint multi-biomarker diagnosis of AD. Specifically, we incorporate residual learning into the diffusion model to emphasize inter-domain differences between PET and MRI, thereby forcing each modality to maximally reconstruct its modality-specific details. Furthermore, we leverage prior information, such as age and gender, to guide the diffusion model in synthesizing PET images with semantic consistency, enhancing their diagnostic value. Additionally, we develop an intra-domain difference loss to ensure that the intra-domain differences among synthesized PET images closely match those among real PET images, promoting more accurate synthesis, especially of modality-specific information. Extensive experiments conducted on the ADNI dataset demonstrate that our method achieves superior performance both quantitatively and qualitatively compared to the state-of-the-art methods. All code for this study has been uploaded to GitHub (https://github.com/Ouzaixin/ResDM).

List of keywords

Machine Learning -> ML: Generative models
Machine Learning -> ML: Deep learning architectures
Machine Learning -> ML: Multi-modal learning
Machine Learning -> ML: Supervised Learning

3203

Fine-grained Analysis of Stability and Generalization for Stochastic Bilevel Optimization

Xuelin Zhang, Hong Chen, Bin Gu, Tieliang Gong, Feng Zheng

Stochastic bilevel optimization (SBO) has recently been integrated into many machine learning paradigms, including hyperparameter optimization, meta learning, and reinforcement learning. Along with this wide range of applications, there have been abundant studies concerning the computational behavior of SBO. However, the generalization guarantees of SBO methods are far less understood from the lens of statistical learning theory. In this paper, we provide a systematic generalization analysis of first-order gradient-based bilevel optimization methods. Firstly, we establish the quantitative connections between the on-average argument stability and the generalization gap of SBO methods. Then, we derive upper bounds on the on-average argument stability for single-timescale stochastic gradient descent (SGD) and two-timescale SGD, where three settings (nonconvex-nonconvex (NC-NC), convex-convex (C-C) and strongly-convex-strongly-convex (SC-SC)) are considered respectively. Experimental analysis validates our theoretical findings. Compared with previous algorithmic stability analyses, our results do not require the re-initialization of the inner-level parameters before each iteration and are suited to more general objective functions.

List of keywords

Machine Learning -> ML: Learning theory

3228

Boosting Diffusion Models with an Adaptive Momentum Sampler

Xiyu Wang, AnhDung Dinh, Daochang Liu, Chang Xu

Diffusion probabilistic models (DPMs) have been shown to generate high-quality images without the need for delicate adversarial training. The sampling process of DPMs is mathematically similar to Stochastic Gradient Descent (SGD), with both being iteratively updated with a function increment. Building on this, we present a novel reverse sampler for DPMs in this paper, drawing inspiration from the widely-used Adam optimizer. Our proposed sampler can be readily applied to a pre-trained diffusion model, utilizing momentum mechanisms and adaptive updating to enhance the generated image’s quality. By effectively reusing update directions from early steps, our proposed sampler achieves a better balance between high-level semantics and low-level details. Additionally, this sampler is flexible and can be easily integrated into pre-trained DPMs regardless of the sampler used during training. Our experimental results on multiple benchmarks demonstrate that our proposed reverse sampler yields remarkable improvements over different baselines.

List of keywords

Computer Vision -> CV: Image and video synthesis and generation

3230

Image Retrieval with Self-Supervised Divergence Minimization and Cross-Attention Classification

Vivek Trivedy, Longin Jan Latecki

Common approaches to image retrieval include contrastive methods and specialized loss functions such as ranking losses and entropy regularizers. We present DMCAC (Divergence Minimization with Cross-Attention Classification), a novel image retrieval method that offers a new perspective on this training paradigm. We use self-supervision with a novel divergence loss framework alongside a simple data flow adjustment that minimizes a distribution over a database directly during training. We show that jointly learning a query representation over a database is a competitive and often improved alternative to traditional contrastive methods for image retrieval. We evaluate our method across several model configurations and four datasets, achieving state-of-the-art performance in multiple settings. We also conduct a thorough set of ablations that show the robustness of our method across full vs. approximate retrieval and different hyperparameter configurations.

List of keywords

Computer Vision -> CV: Image and video retrieval
Computer Vision -> CV: Representation learning

3244

Learning Label Dependencies for Visual Information Extraction

MingHong Yao, Liansheng Zhuang, Houqiang Li, Jiuchang Wei

Visual Information Extraction (VIE) refers to the process of extracting specified categories of text information from visually rich document images. Previous methods treat the VIE task as a sequence labeling problem, ignoring the importance of dependency between labels. A feasible solution is to apply linear-chain conditional random fields (CRF), which learn the probabilities of transition from the current label to the next one. However, simply applying a linear-chain CRF does not work well when faced with long-range label dependency, i.e., when the next label depends on several preceding labels. To address this issue, we propose to learn a label language model based on a transformer neural network and an inference network for VIE. The label language model considers a label sequence as a whole and is trained to assign a higher likelihood to the label sequence that respects the long-range label dependency. The inference transformer aims to predict the label sequence by considering not only the features of each text token but also the likelihood of the whole label sequence evaluated by the label language model. Comprehensive experiments on public datasets demonstrate the effectiveness of our method and show that it is a good complement to existing methods.

List of keywords

Natural Language Processing -> NLP: Applications
Natural Language Processing -> NLP: Information extraction

3245

Nonparametric Detection of Gerrymandering in Multiparty Plurality Elections

Dariusz Stolicki, Wojciech Słomczyński, Stanisław Szufa

Partisan gerrymandering, i.e., manipulation of electoral district boundaries for political advantage, is one of the major challenges to election integrity in modern-day democracies. Yet most of the existing methods for detecting partisan gerrymandering are narrowly tailored toward fully contested two-party elections, and fail if there are more parties or if the number of candidates per district varies. We propose a new method, applying nonparametric statistical learning to detect anomalies in the relation between (aggregate) votes and (aggregate) seats. Unlike most of the existing methods, we propose to learn the standard of fairness in districting from empirical data rather than assume one a priori. Finally, we test the proposed methods against experimental data as well as real-life data from 17 countries employing the plurality (FPTP) system.

List of keywords

Game Theory and Economic Paradigms -> GTEP: Computational social choice
Multidisciplinary Topics and Applications -> MTA: Social sciences

3261

An Image-enhanced Molecular Graph Representation Learning Framework

Hongxin Xiang, Shuting Jin, Jun Xia, Man Zhou, Jianmin Wang, Li Zeng, Xiangxiang Zeng

Extracting rich molecular representations is a crucial prerequisite for accurate drug discovery. Recent molecular representation learning methods achieve impressive progress, but the paradigm of learning from a single modality gradually encounters the bottleneck of limited representation capabilities. In this work, we fully consider the rich visual information contained in 3D conformation molecular images (i.e., texture, shadow, color and planar spatial information) and distill graph-based models for more discriminative drug discovery. Specifically, we propose an image-enhanced molecular graph representation learning framework (IEM) that leverages multi-view molecular images rendered from 3D conformations to boost molecular graph representations. To extract useful auxiliary knowledge from multi-view images, we design a teacher, which is pre-trained on 2 million molecules with conformations through five meticulously designed pre-training tasks. To transfer knowledge from the teacher to graph-based students, we propose an efficient cross-modal knowledge distillation strategy with a knowledge enhancer and a task enhancer. It is worth noting that the distillation architecture of IEM can be directly integrated into existing graph-based models, and significantly improves the capabilities of these models (e.g., GIN, EdgePred, GraphMVP, MoleBERT) for molecular representation learning. In particular, GraphMVP and MoleBERT equipped with IEM achieve new state-of-the-art performance on the MoleculeNet benchmark, with average ROC-AUC of 73.89% and 73.81%, respectively. Code is available at https://github.com/HongxinXiang/IEM.

List of keywords

Multidisciplinary Topics and Applications -> MTA: Bioinformatics
Machine Learning -> ML: Knowledge-aided learning
Machine Learning -> ML: Self-supervised Learning
Machine Learning -> ML: Representation learning

3265

Formalisation and Evaluation of Properties for Consequentialist Machine Ethics

Raynaldio Limarga, Yang Song, Abhaya Nayak, David Rajaratnam, Maurice Pagnucco

As artificial intelligence (AI) technologies continue to influence our daily lives, there has been a growing need to ensure that AI-enabled decision-making systems adhere to principles expected of human decision makers. This need has given rise to the area of Machine Ethics. We formalise several ethical principles from the philosophical literature in the situation calculus framework to verify the ethical permissibility of a plan. Moreover, we propose several important properties, including some of our own that are intuitively appealing, and a number derived from the social choice literature that would appear to be relevant in evaluating the various approaches. Finally, we assess how the situation calculus models of Machine Ethics that we examine satisfy the important properties we have identified.

List of keywords

AI Ethics, Trust, Fairness -> ETF: Moral decision making
Knowledge Representation and Reasoning -> KRR: Common-sense reasoning
Knowledge Representation and Reasoning -> KRR: Other
Knowledge Representation and Reasoning -> KRR: Reasoning about actions

3276

Multi-scale Context-Aware Networks Based on Fragment Association for Human Activity Recognition

Zhiqiong Wang, Hanyu Liu, Boyang Zhao, Qi Shen, Mingzhe Li, NingFeng Que, Mingke Yan, Junchang Xin

Sensor-based Human Activity Recognition (HAR) constitutes a key component of many artificial intelligence applications. Although deep feature extraction technology is constantly updated and iterated with excellent results, it is still difficult to find a balance between performance and computational efficiency. Through an in-depth exploration of the inherent characteristics of HAR data, we propose a lightweight feature perception model, which encompasses an internal feature extractor and a contextual feature perceiver. The model mainly consists of two stages. The first stage is a hierarchical multi-scale feature extraction module, which is composed of deep separable convolution and a multi-head attention mechanism. This module serves to extract conventional features for HAR. After the features go through a fragment recombination operation, they are passed into the context-aware module of the second stage, which is based on the Retentive Transformer and optimized by the DropKey method to efficiently extract the relationships between the feature fragments, so as to mine more valuable feature information. Importantly, this does not add too much complexity to the model, thereby preventing excessive resource consumption. We conducted extensive experimental validation on multiple publicly available HAR datasets.

List of keywords

Humans and AI -> HAI: Applications
Data Mining -> DM: Networks

3281

LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory

Zicheng Liu, Li Wang, Siyuan Li, Zedong Wang, Haitao Lin, Stan Z. Li

Transformer models have been successful in various sequence processing tasks, but the self-attention mechanism’s computational cost limits its practicality for long sequences. Although existing attention variants improve computational efficiency, their hand-crafted mixing strategies limit their ability to abstract global information effectively. On the other hand, state-space models (SSMs) are tailored for long sequences but cannot capture complicated local information. Therefore, combining the two as a unified token mixer is a trend in recent long-sequence models. However, linearized attention degrades performance significantly even when equipped with SSMs. To address the issue, we propose a new method called LongVQ. LongVQ uses the vector quantization (VQ) technique to compress the global abstraction into a fixed-length codebook, enabling linear-time computation of the attention matrix. This technique effectively maintains dynamic global and local patterns, which helps to compensate for the lack of long-range dependency modeling. Our experiments on the Long Range Arena benchmark, autoregressive language modeling, and image and speech classification demonstrate the effectiveness of LongVQ. Our model achieves significant improvements over other sequence models, including variants of Transformers, Convolutions, and recent State Space Models.
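The vector quantization step named in this abstract can be illustrated with a generic nearest-codebook lookup. The sketch below is not the LongVQ architecture; the function name, the Euclidean distance, and the toy codebook are assumptions for this example. It shows only why a fixed-size codebook helps: each of the n tokens is compared against a constant number of codes, so the cost grows linearly in n.

```python
import numpy as np

def quantize(tokens, codebook):
    """Map each token (row) to its nearest codebook vector (Euclidean)."""
    # dists[i, j] = squared distance between token i and code j.
    diff = tokens[:, None, :] - codebook[None, :, :]
    dists = np.sum(diff**2, axis=-1)
    codes = np.argmin(dists, axis=1)
    return codes, codebook[codes]

# Hypothetical usage with a two-entry codebook.
tokens = np.array([[0.1, 0.0], [0.9, 1.1], [0.2, -0.1]])
codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
codes, quantized = quantize(tokens, codebook)  # codes: [0, 1, 0]
```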

List of keywords

Machine Learning -> ML: Deep learning architectures
Machine Learning -> ML: Representation learning
Natural Language Processing -> NLP: Embeddings

3286

Unlearning during Learning: An Efficient Federated Machine Unlearning Method

Hanlin Gu, Gongxi Zhu, Jie Zhang, Xinyuan Zhao, Yuxing Han, Lixin Fan, Qiang Yang

Federated Learning (FL) has garnered significant attention as a distributed machine learning paradigm in recent years. However, it is vulnerable to malicious attacks, such as model inversion attacks and membership inference attacks. To address these vulnerabilities and to enable the “right to be forgotten,” the concept of federated machine unlearning (FMU) has emerged. Nevertheless, existing approaches for FMU introduce additional time-consuming steps, such as retraining or fine-tuning, which are not practical in FL due to the necessity of respecting the time constraints of other clients. In this paper, we introduce FedAU, an innovative and efficient FMU framework aimed at overcoming these limitations. Specifically, FedAU incorporates a lightweight auxiliary unlearning module into the learning process and employs a straightforward linear operation to facilitate unlearning. This approach eliminates the requirement for extra time-consuming steps, rendering it well-suited for FL. Furthermore, FedAU exhibits remarkable versatility. It not only enables multiple clients to carry out unlearning tasks concurrently but also supports unlearning at various levels of granularity, including individual data samples, specific classes, and even the client level. We conducted extensive experiments on the MNIST, CIFAR10, and CIFAR100 datasets to evaluate the performance of FedAU. The results demonstrate that FedAU effectively achieves the desired unlearning effect while maintaining model accuracy.

List of keywords

Machine Learning -> ML: Federated learning
Multidisciplinary Topics and Applications -> MTA: Security and privacy

3291

A Behavior-Aware Approach for Deep Reinforcement Learning in Non-stationary Environments without Known Change Points

Zihe Liu, Jie Lu, Guangquan Zhang, Junyu Xuan

Deep reinforcement learning is used in various domains, but usually under the assumption that the environment has stationary conditions like transitions and state distributions. When this assumption is not met, performance suffers. For this reason, tracking continuous environmental changes and adapting to unpredictable conditions is challenging yet crucial because it ensures that systems remain reliable and flexible in practical scenarios. Our research introduces Behavior-Aware Detection and Adaptation (BADA), an innovative framework that merges environmental change detection with behavior adaptation. The key inspiration behind our method is that policies exhibit different global behaviors in changing environments. Specifically, environmental changes are identified by analyzing variations between behaviors using Wasserstein distances without manually set thresholds. The model adapts to the new environment through behavior regularization based on the extent of changes. The results of a series of experiments demonstrate better performance relative to several current algorithms. This research also indicates significant potential for tackling this long-standing challenge.
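To make the distance mentioned in this abstract concrete, the sketch below computes the Wasserstein-1 distance between two equal-size one-dimensional empirical samples, which reduces to the mean absolute difference of the sorted values. How BADA represents behaviors and turns distances into a change signal is not described here; the function name and the scalar "behavior statistic" in the usage lines are assumptions for this example.

```python
def wasserstein_1d(xs, ys):
    """W1 distance between two equal-size 1-D empirical samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    xs, ys = sorted(xs), sorted(ys)
    # In one dimension the optimal transport plan matches sorted values.
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)

# Hypothetical usage: compare a scalar behavior statistic collected
# before and after a suspected environment change.
before = [0.0, 1.0, 2.0]
after = [1.0, 2.0, 3.0]
shift = wasserstein_1d(before, after)  # 1.0
```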

List of keywords

Machine Learning -> ML: Reinforcement learning

3294

Welfare Loss in Connected Resource Allocation

Xiaohui Bei, Alexander Lam, Xinhang Lu, Warut Suksompong

We study the allocation of indivisible goods that form an undirected graph and investigate the worst-case welfare loss when requiring that each agent must receive a connected subgraph. Our focus is on both egalitarian and utilitarian welfare. Specifically, we introduce the concept of egalitarian (resp., utilitarian) price of connectivity, which captures the worst-case ratio between the optimal egalitarian (resp., utilitarian) welfare among all allocations and that among the connected allocations. We provide tight or asymptotically tight bounds on the price of connectivity for various large classes of graphs when there are two agents, and for paths, stars and cycles in the general case. Many of our results are supplemented with algorithms which find connected allocations with a welfare guarantee corresponding to the price of connectivity.
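The ratio underlying the price of connectivity can be computed by brute force for a single small instance, as sketched below. The price of connectivity itself is the worst case of this ratio over instances, which the code does not compute. Additive valuations, complete allocations, treating an empty bundle as connected, and all function names are assumptions for this example; the sketch also assumes the best connected welfare is nonzero.

```python
from itertools import product

def is_connected(bundle, edges):
    """True if `bundle` induces a connected subgraph (empty counts as connected)."""
    bundle = set(bundle)
    if not bundle:
        return True
    start = next(iter(bundle))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for a, b in edges:
            for v, w in ((a, b), (b, a)):
                if v == u and w in bundle and w not in seen:
                    seen.add(w)
                    stack.append(w)
    return seen == bundle

def welfare_ratios(values, edges):
    """Return (egalitarian ratio, utilitarian ratio) for one instance.

    values[i][g] is agent i's additive value for good g.
    """
    n, m = len(values), len(values[0])
    best = {"all": [0, 0], "conn": [0, 0]}  # [egalitarian, utilitarian]
    for assign in product(range(n), repeat=m):
        bundles = [[g for g in range(m) if assign[g] == i] for i in range(n)]
        utils = [sum(values[i][g] for g in bundles[i]) for i in range(n)]
        scores = (min(utils), sum(utils))
        keys = ["all"]
        if all(is_connected(b, edges) for b in bundles):
            keys.append("conn")
        for key in keys:
            best[key] = [max(old, new) for old, new in zip(best[key], scores)]
    return tuple(a / c for a, c in zip(best["all"], best["conn"]))

# Hypothetical instance: four goods on a path, two agents.
path = [(0, 1), (1, 2), (2, 3)]
values = [[1, 0, 0, 1], [0, 1, 1, 0]]
egalitarian, utilitarian = welfare_ratios(values, path)  # 2.0 and 4/3
```

In this instance the first agent wants the two endpoints of the path, which cannot form a connected bundle, so requiring connectivity halves the best egalitarian welfare.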

List of keywords

Game Theory and Economic Paradigms -> GTEP: Fair division
Agent-based and Multi-agent Systems -> MAS: Resource allocation

3297

CausalNET: Unveiling Causal Structures on Event Sequences by Topology-Informed Causal Attention

Hua Zhu, Hong Huang, Kehan Yin, Zejun Fan, Hai Jin, Bang Liu

Causal discovery on event sequences holds a pivotal significance across domains such as healthcare, finance, and industrial systems. The crux of this endeavor lies in unraveling causal structures among event types, typically portrayed as directed acyclic graphs (DAGs). Nonetheless, prevailing methodologies often grapple with untenable assumptions and intricate optimization hurdles. To address these challenges, we present a novel model named CausalNET. At the heart of CausalNET is a special prediction module based on the Transformer architecture, which prognosticates forthcoming events by leveraging historical occurrences, with its predictive prowess amplified by a trainable causal graph engineered to fathom causal relationships among event types. Further, to augment the predictive paradigm, we devise a causal decay matrix to encapsulate the reciprocal influence of events upon each other within the topological network. During training, we alternately refine the prediction module and fine-tune the causal graph. Comprehensive evaluation on a spectrum of real-world and synthetic datasets underscores the superior performance and scalability of CausalNET, which marks a promising step forward in the realm of causal discovery. Code and Appendix are available at https://github.com/CGCL-codes/CausalNET.

List of keywords

Uncertainty in AI -> UAI: Causality, structural causal models and causal inference
Machine Learning -> ML: Causality

3305

Prospective Learning for Personalized Heart Disease Detection via Digital Twin

Yaojun Hu, Jintai Chen, Lianting Hu, Dantong Li, Jiahuan Yan, Haochao Ying, Huiying Liang, Jian Wu

Heart diseases rank among the leading causes of global mortality, demonstrating a crucial need for early diagnosis and intervention. Most traditional electrocardiogram (ECG) based automated diagnosis methods are trained at population level, neglecting the customization of personalized ECGs to enhance individual healthcare management. A potential solution to address this limitation is to employ digital twins to simulate symptoms of diseases in real patients. In this paper, we present an innovative prospective learning approach for personalized heart disease detection, which generates digital twins of healthy individuals’ anomalous ECGs and enhances the model sensitivity to the personalized symptoms. In our approach, a vector quantized feature separator is proposed to locate and isolate the disease symptom and normal segments in ECG signals with ECG report guidance. Thus, the ECG digital twins can simulate specific heart diseases used to train a personalized heart disease detection model. Experiments demonstrate that our approach not only excels in generating high-fidelity ECG signals but also improves personalized heart disease detection. Moreover, our approach ensures robust privacy protection, safeguarding patient data in model development.

List of keywords

Multidisciplinary Topics and Applications -> MTA: Health and medicine
Machine Learning -> ML: Applications
Machine Learning -> ML: Generative models

3314

Label Distribution Learning from Logical Label

Yuheng Jia, Jiawei Tang, Jiahao Jiang

Label distribution learning (LDL) is an effective method to predict the label description degree (a.k.a. label distribution) of a sample. However, annotating label distribution (LD) for training samples is extremely costly. So recent studies often first use label enhancement (LE) to generate the estimated label distribution from the logical label and then apply external LDL algorithms on the recovered label distribution to predict the label distribution for unseen samples. But this step-wise manner overlooks the possible connections between LE and LDL. Moreover, the existing LE approaches may assign some description degrees to invalid labels. To solve the above problems, we propose a novel method to learn an LDL model directly from the logical label, which unifies LE and LDL into a joint model, and avoids the drawbacks of the previous LE methods. We also give the generalization error bound of our method and theoretically prove that directly learning an LDL model from the logical labels is feasible. Extensive experiments on various datasets prove that the proposed approach can construct a reliable LDL model directly from the logical label, and produce more accurate label distribution than the state-of-the-art LE methods. The code and the supplementary file can be found in https://github.com/seutjw/DLDL.

List of keywords

Machine Learning -> ML: Multi-label learning
Constraint Satisfaction and Optimization -> CSO: Constraint optimization problems
Machine Learning -> ML: Optimization

3319

Optimizing Prosumer Policies in Periodic Double Auctions Inspired by Equilibrium Analysis

Bharat Manvi, Sanjay Chandlekar, Easwar Subramanian

We consider a periodic double auction (PDA) wherein the main participants are wholesale suppliers and brokers representing retailers. The suppliers are represented by a composite supply curve and the brokers are represented by individual bids. Additionally, the brokers can also participate in small-scale selling by placing individual asks; hence, they act as prosumers. Specifically, in a PDA, the prosumers who are net buyers have multiple opportunities to buy or sell multiple units of a commodity with the aim of minimising the cost of buying across multiple rounds of the PDA. Formulating optimal bidding strategies for such a PDA setting involves planning across current and future rounds while taking into account the bidding strategies of other agents. In this work, we propose Markov perfect Nash equilibrium (MPNE) policies for a setup where multiple prosumers with knowledge of the composite supply curve compete to procure commodities. Thereafter, the MPNE policies are used to develop an algorithm called MPNE-BBS for the case wherein the prosumers need to re-construct an approximate composite supply curve using past auction information. The efficacy of the proposed algorithm is demonstrated on the PowerTAC wholesale market simulator against several baselines and state-of-the-art bidding policies.

List of keywords

Game Theory and Economic Paradigms -> GTEP: Auctions and market-based systems
Agent-based and Multi-agent Systems -> MAS: Agent theories and models
Agent-based and Multi-agent Systems -> MAS: Agent-based simulation and emergence
Game Theory and Economic Paradigms -> GTEP: Noncooperative games

3333

Dynamic Weighted Graph Fusion for Deep Multi-View Clustering

Yazhou Ren, Jingyu Pu, Chenhang Cui, Yan Zheng, Xinyue Chen, Xiaorong Pu, Lifang He

By exploring complex graph information hidden in data from multiple views, multi-view clustering based on graph neural network significantly enhances the clustering performance and has drawn increasing attention in recent years. Although considerable progress has been made, most existing GNN based MVC models merely consider the explicit presence of graph structure in raw data and ignore that latent graphs of different views also provide specific information for the clustering task. We propose dynamic weighted graph fusion for deep multi-view clustering (DFMVC) to address this issue. Specifically, DFMVC learns embedded features via deep autoencoders and then constructs latent graphs for each individual view. Then, it concatenates the embedded features of all views to form a global feature to leverage complementary information, as well as generates a fusion graph via combining all latent graphs to accurately capture the topological information among samples. Based on the informative fusion graph and global features, the graph convolution module is adopted to derive a representation with global comprehensive information, which is further used to generate pseudo-label information. In a self-supervised manner, such information guides each view to dynamically learn discriminative features and latent graphs. Extensive experimental results demonstrate the efficacy of DFMVC.

List of keywords

Machine Learning -> ML: Multi-view learning
Machine Learning -> ML: Clustering

3339

Dual Semantic Fusion Hashing for Multi-Label Cross-Modal Retrieval

Kaiming Liu, Yunhong Gong, Yu Cao, Zhenwen Ren, Dezhong Peng, Yuan Sun

Cross-modal hashing (CMH) has been widely used for multi-modal retrieval tasks due to its low storage cost and fast query speed. Although existing CMH methods achieve promising performance, most of them mainly rely on coarse-grained supervision information (i.e., pairwise similarity matrix) to measure the semantic similarities between all instances, ignoring the impact of multi-label distribution. To address this issue, we construct fine-grained semantic similarity to explore the cluster-level semantic relationships between multi-label data, and propose a new dual semantic fusion hashing (DSFH) for multi-label cross-modal retrieval. Specifically, we first learn the modal-specific representation and consensus hash codes, thereby merging the specificity with consistency. Then, we fuse the coarse-grained and fine-grained semantics to mine multiple-level semantic relationships, thereby enhancing hash codes discrimination. Extensive experiments on three benchmarks demonstrate the superior performance of our DSFH compared with 16 state-of-the-art methods.

List of keywords

Machine Learning -> ML: Multi-modal learning
Machine Learning -> ML: Multi-view learning

3347

GUIDE: A Guideline-Guided Dataset for Instructional Video Comprehension

Jiafeng Liang, Shixin Jiang, Zekun Wang, Haojie Pan, Zerui Chen, Zheng Chu, Ming Liu, Ruiji Fu, Zhongyuan Wang, Bing Qin

The Internet hosts a substantial number of instructional videos, which provide tutorials for completing various tasks. Existing instructional video datasets only focus on specific steps at the video level, lacking experiential guidelines at the task level, which can lead to beginners struggling to learn new tasks due to the lack of relevant experience. Moreover, the specific steps without guidelines are trivial and unsystematic, making it difficult to provide a clear tutorial. To address these problems, we present the Guide (Guideline-Guided) dataset, which contains 3.5K videos of 560 instructional tasks in 8 domains related to our daily life. Specifically, we annotate each instructional task with a guideline, representing a common pattern shared by all task-related videos. On this basis, we annotate systematic specific steps, including their associated guideline steps, specific step descriptions and timestamps. Our proposed benchmark consists of three sub-tasks to evaluate the comprehension ability of models: (1) Step Captioning: models have to generate captions for specific steps from videos. (2) Guideline Summarization: models have to mine the common pattern in task-related videos and summarize a guideline from them. (3) Guideline-Guided Captioning: models have to generate captions for specific steps under the guidance of the guideline. We evaluate many foundation models with Guide and perform in-depth analysis. Given the diversity and practicality of Guide, we believe that it can be used as a better benchmark for instructional video comprehension.

List of keywords

Computer Vision -> CV: Video analysis and understanding
Natural Language Processing -> NLP: Resources and evaluation

3355

DiffStega: Towards Universal Training-Free Coverless Image Steganography with Diffusion Models

Yiwei Yang, Zheyuan Liu, Jun Jia, Zhongpai Gao, Yunhao Li, Wei Sun, Xiaohong Liu, Guangtao Zhai

Traditional image steganography focuses on concealing one image within another, aiming to avoid steganalysis by unauthorized entities. Coverless image steganography (CIS) enhances imperceptibility by not using any cover image. Recent works have utilized text prompts as keys in CIS through diffusion models. However, this approach faces three challenges: it is invalidated when the private prompt is guessed, public prompts must be crafted for semantic diversity, and prompts risk leaking during frequent transmission. To address these issues, we propose DiffStega, an innovative training-free diffusion-based CIS strategy for universal application. DiffStega uses a password-dependent reference image as an image prompt alongside the text, ensuring that only authorized parties can retrieve the hidden information. Furthermore, we develop the Noise Flip technique to further secure the steganography against unauthorized decryption. To comprehensively assess our method across general CIS tasks, we create a dataset comprising various image steganography instances. Experiments indicate substantial improvements in our method over existing ones, particularly in aspects of versatility, password sensitivity, and recovery quality. Codes are available at https://github.com/evtricks/DiffStega.

List of keywords

Computer Vision -> CV: Machine learning for vision
Computer Vision -> CV: Applications
Computer Vision -> CV: Structural and model-based approaches, knowledge representation and reasoning

3356

Computational Complexity of Verifying the Group No-show Paradox

Farhad Mohsin, Qishen Han, Sikai Ruan, Pin-Yu Chen, Francesca Rossi, Lirong Xia

The (group) no-show paradox refers to the undesirable situation where a group of agents have incentive to abstain from voting to make the winner more favorable to them. To understand whether it is a critical concern in practice, in this paper, we take a computational approach by examining the computational complexity of verifying whether the group no-show paradox exists given agents’ preferences and the voting rule. We prove that, unfortunately, the verification problem is NP-hard to compute for some commonly studied voting rules, i.e., Copeland, maximin, single transferable vote, and all Condorcetified positional scoring rules such as Black’s rule. We propose integer linear programming-based algorithms and a search-based algorithm for the verification problem for different voting rules. Experimental results on synthetic data illustrate that the former is efficient when the number of unique rankings in a profile is not too high, and the latter is efficient for a small number of agents. With the help of these algorithms, we observe that group no-show paradoxes rarely occur in real-world data.
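
The verification problem can be sketched as an exhaustive search over which voters abstain, grouped by ranking type. The example below is a minimal brute-force check for single transferable vote with one winner (instant runoff); it is not the paper's ILP-based or search-based algorithm. The tie-breaking rule (eliminate the alphabetically first of the tied candidates), the function names, and the 15-voter profile are assumptions made for illustration. The search enumerates one abstention count per unique ranking, so its cost grows quickly with the number of unique rankings.

```python
from itertools import product


def irv_winner(profile, candidates):
    """Instant-runoff winner; ties for elimination go to the alphabetically first."""
    remaining = sorted(candidates)
    while len(remaining) > 1:
        tally = {c: 0 for c in remaining}
        for ranking, count in profile.items():
            top = next(c for c in ranking if c in tally)
            tally[top] += count
        loser = min(remaining, key=lambda c: tally[c])
        remaining.remove(loser)
    return remaining[0]


def find_no_show_groups(profile, candidates):
    """List (abstaining group, new winner) pairs where every abstainer gains."""
    original = irv_winner(profile, candidates)
    rankings = list(profile)
    total = sum(profile.values())
    found = []
    for absent in product(*(range(profile[r] + 1) for r in rankings)):
        if sum(absent) in (0, total):
            continue  # skip "nobody abstains" and "everybody abstains"
        reduced = {r: profile[r] - a for r, a in zip(rankings, absent)}
        new = irv_winner(reduced, candidates)
        group = {r: a for r, a in zip(rankings, absent) if a > 0}
        if new != original and all(r.index(new) < r.index(original) for r in group):
            found.append((group, new))
    return original, found


# Hypothetical profile: ranking -> number of voters with that ranking.
profile = {("A", "B", "C"): 5, ("B", "C", "A"): 4, ("C", "B", "A"): 6}
original, groups = find_no_show_groups(profile, ["A", "B", "C"])
```

With everyone voting, B is eliminated first and C wins; if two of the A > B > C voters stay home, A is eliminated first and B wins, which those two voters prefer to C.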

List of keywords

Game Theory and Economic Paradigms -> GTEP: Computational social choice

3359

KG-CoT: Chain-of-Thought Prompting of Large Language Models over Knowledge Graphs for Knowledge-Aware Question Answering

Ruilin Zhao, Feng Zhao, 隆 王, Xianzhi Wang, Guandong Xu

Large language models (LLMs) encounter challenges such as hallucination and factual errors in knowledge-intensive tasks. On the one hand, LLMs sometimes struggle to generate reliable answers based on the black-box parametric knowledge, due to the lack of responsible knowledge. Moreover, fragmented knowledge facts extracted by knowledge retrievers fail to provide explicit and coherent reasoning paths for improving LLM reasoning. To address these challenges, we propose KG-CoT, a novel knowledge-augmented paradigm that leverages a small-scale step-by-step graph reasoning model to reason over knowledge graphs (KGs) and utilizes a reasoning path generation method to generate chains of reasoning with high confidence for large-scale LLMs. Extensive experiments demonstrate that our KG-CoT significantly improves the performance of LLMs on knowledge-intensive question answering tasks, such as multi-hop, single-hop, and open-domain question answering benchmarks, without fine-tuning LLMs. KG-CoT outperforms the CoT prompting as well as prior retrieval-augmented and knowledge base question answering baselines. Moreover, KG-CoT can reduce the number of API calls and cost and generalize to various LLM backbones in a lightweight plug-and-play manner.

List of keywords

Natural Language Processing -> NLP: Question answering
Natural Language Processing -> NLP: Language generation

3360

Global Optimality of Single-Timescale Actor-Critic under Continuous State-Action Space: A Study on Linear Quadratic Regulator

Xuyang Chen, Jingliang Duan, Lin Zhao

Actor-critic methods have achieved state-of-the-art performance in various challenging tasks. However, theoretical understandings of their performance remain elusive and challenging. Existing studies mostly focus on practically uncommon variants such as double-loop or two-timescale stepsize actor-critic algorithms for simplicity. These results certify local convergence on finite state or action spaces only. We push the boundary to investigate the classic single-sample single-timescale actor-critic on continuous (infinite) state-action space, where we employ the canonical linear quadratic regulator (LQR) problem as a case study. We show that the popular single-timescale actor-critic can attain an $\epsilon$-optimal solution with a sample complexity of order $\epsilon^{-2}$ for solving LQR on the demanding continuous state-action space. Our work provides new insights into the performance of single-timescale actor-critic, which further bridges the gap between theory and practice.
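
For readers unfamiliar with LQR, the sketch below solves a textbook scalar, deterministic, discrete-time instance by iterating the Riccati recursion. It is not the actor-critic algorithm analyzed in the paper, and the exact LQR formulation the paper uses is not stated in the abstract; the dynamics x' = a·x + b·u, the stage cost q·x² + r·u², and the function name are assumptions. The result is the optimal linear feedback that a learned policy would be compared against.

```python
def scalar_lqr_gain(a, b, q, r, iterations=200):
    """Iterate the scalar discrete-time Riccati recursion; return (p, k)."""
    p = q  # cost-to-go coefficient: optimal cost from state x is p * x**2
    for _ in range(iterations):
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    k = a * b * p / (r + b * b * p)  # optimal feedback is u = -k * x
    return p, k


p, k = scalar_lqr_gain(a=1.0, b=1.0, q=1.0, r=1.0)
```

For a = b = q = r = 1 the fixed point satisfies p² − p − 1 = 0, so p is the golden ratio and k = p / (1 + p).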

List of keywords

Machine Learning -> ML: Reinforcement learning
Machine Learning -> ML: Learning theory

3364

Learning to Solve Geometry Problems via Simulating Human Dual-Reasoning Process

Tong Xiao, Jiayu Liu, Zhenya Huang, Jinze Wu, Jing Sha, Shijin Wang, Enhong Chen

Geometry Problem Solving (GPS), which is a classic and challenging math problem, has attracted much attention in recent years. It requires a solver to comprehensively understand both text and diagram, master essential geometry knowledge, and appropriately apply it in reasoning. However, existing works follow a paradigm of neural machine translation and only focus on enhancing the capability of encoders, which neglects the essential characteristics of human geometry reasoning. In this paper, inspired by dual-process theory, we propose a Dual-Reasoning Geometry Solver (DualGeoSolver) to simulate the dual-reasoning process of humans for GPS. Specifically, we construct two systems in DualGeoSolver, namely Knowledge System and Inference System. Knowledge System controls an implicit reasoning process, which is responsible for providing diagram information and geometry knowledge according to a step-wise reasoning goal generated by Inference System. Inference System conducts an explicit reasoning process, which specifies the goal in each reasoning step and applies the knowledge to generate program tokens for resolving it. The two systems carry out the above process iteratively, which behaves more in line with human cognition. We conduct extensive experiments on two benchmark datasets, GeoQA and GeoQA+. The results demonstrate the superiority of DualGeoSolver in both solving accuracy and robustness from explicitly modeling human reasoning process and knowledge application.

List of keywords

Natural Language Processing -> NLP: Question answering
Knowledge Representation and Reasoning -> KRR: Learning and reasoning

3379

BlockEcho: Retaining Long-Range Dependencies for Imputing Block-Wise Missing Data

Qiao Han, Mingqian Li, Yao Yang, Yiteng Zhai

Block-wise missing data poses significant challenges in real-world data imputation tasks. Compared to scattered missing data, block-wise gaps exacerbate adverse effects on subsequent analytic and machine learning tasks, as the lack of local neighboring elements significantly reduces the interpolation capability and predictive power. However, this issue has not received adequate attention. Most SOTA matrix completion methods appear less effective, primarily due to overreliance on neighboring elements for predictions. We systematically analyze the issue and propose a novel matrix completion method "BlockEcho" for a more comprehensive solution. This method creatively integrates Matrix Factorization (MF) within Generative Adversarial Networks (GAN) to explicitly retain long-distance inter-element relationships in the original matrix. Besides, we incorporate an additional discriminator for GAN, comparing the generator’s intermediate progress with pre-trained MF results to constrain high-order feature distributions. Subsequently, we evaluate BlockEcho on public datasets across three domains. Results demonstrate superior performance over both traditional and SOTA methods when imputing block-wise missing data, especially at higher missing rates. The advantage also holds for scattered missing data at high missing rates. We also provide theoretical justification for the optimality and convergence of fusing MF and GAN for block-wise missing data.

List of keywords

Machine Learning -> ML: Generative models
Data Mining -> DM: Other

3383

A Grassmannian Manifold Self-Attention Network for Signal Classification

Rui Wang, Chen Hu, Ziheng Chen, Xiao-Jun Wu, Xiaoning Song

In the community of artificial intelligence, significant progress has been made in encoding sequential data using deep learning techniques. Nevertheless, how to effectively mine useful information from channel dimensions remains a major challenge, as these features have a submanifold structure. Linear subspace, the basic element of the Grassmannian manifold, has proven to be an effective manifold-valued feature descriptor in statistical representation. Besides, the Euclidean self-attention mechanism has shown great success in capturing long-range relationships of data. Inspired by these facts, we extend the self-attention mechanism to the Grassmannian manifold. Our framework can effectively characterize the spatiotemporal fluctuations of sequential data encoded in the Grassmannian manifold. Extensive experimental results on three benchmarking datasets (a drone recognition dataset and two EEG signal classification datasets) demonstrate the superiority of our method over the state-of-the-art.

List of keywords

Machine Learning -> ML: Attention models
Machine Learning -> ML: Classification
Machine Learning -> ML: Geometric learning

3384

Fair Distribution of Delivery Orders

Hadi Hosseini, Shivika Narang, Tomasz Wąs

We initiate the study of fair distribution of delivery tasks among a set of agents wherein delivery jobs are placed along the vertices of a graph. Our goal is to fairly distribute delivery costs (modeled as a submodular function) among a fixed set of agents while satisfying some desirable notions of economic efficiency. We adopt well-established fairness concepts—such as envy-freeness up to one item (EF1) and minimax share (MMS)—to our setting and show that fairness is often incompatible with the efficiency notion of social optimality. Yet, we characterize instances that admit fair and socially optimal solutions by exploiting graph structures. We further show that achieving fairness along with Pareto optimality is computationally intractable. Nonetheless, we design an XP algorithm (parameterized by the number of agents) for finding MMS and Pareto optimal solutions on every tree instance, and show that the same algorithm can be modified to find efficient solutions along with EF1, when such solutions exist. We complement these results by theoretically and experimentally analyzing the price of fairness.
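
The EF1 notion for costs can be illustrated with a small check on a tree. In this sketch the cost of a set of delivery vertices is the number of tree edges on the union of hub-to-vertex paths, which is one hypothetical submodular cost and not necessarily the paper's model; all agents share that cost function, and the names `delivery_cost` and `is_ef1` are illustrative. An allocation passes if every agent, after dropping some single job of its own, bears a cost no higher than any other agent's.

```python
def delivery_cost(parent, vertices):
    """Number of tree edges on the union of hub-to-vertex paths (hypothetical cost)."""
    edges = set()
    for v in vertices:
        while v in parent:  # the hub (root) has no parent entry
            edges.add((parent[v], v))
            v = parent[v]
    return len(edges)


def is_ef1(parent, allocation):
    """EF1 for costs: dropping one own job removes envy toward every other agent."""
    for i, mine in enumerate(allocation):
        for j, theirs in enumerate(allocation):
            if i == j or not mine:
                continue
            least = min(delivery_cost(parent, mine - {job}) for job in mine)
            if least > delivery_cost(parent, theirs):
                return False
    return True


# Tree rooted at hub 0: path 0-1-2-3, leaf 4 under the hub, path 0-6-5.
parent = {1: 0, 2: 1, 3: 2, 4: 0, 6: 0, 5: 6}
fair = is_ef1(parent, [{3}, {4}])
unfair = is_ef1(parent, [{3, 5}, {4}])
```

In the second allocation the first agent still pays at least 2 after dropping either job, while the second agent pays 1, so the check fails.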

List of keywords

Game Theory and Economic Paradigms -> GTEP: Fair division
Game Theory and Economic Paradigms -> GTEP: Computational social choice

3391

Unified View Imputation and Feature Selection Learning for Incomplete Multi-view Data

Yanyong Huang, Zongxin Shen, Tianrui Li, Fengmao Lv

Although multi-view unsupervised feature selection (MUFS) is an effective technology for reducing dimensionality in machine learning, existing methods cannot directly deal with incomplete multi-view data where some samples are missing in certain views. These methods should first apply predetermined values to impute missing data, then perform feature selection on the complete dataset. Separating imputation and feature selection processes fails to capitalize on the potential synergy where local structural information gleaned from feature selection could guide the imputation, thereby improving the feature selection performance in turn. Additionally, previous methods only focus on leveraging samples’ local structure information, while ignoring the intrinsic locality of the feature space. To tackle these problems, a novel MUFS method, called UNified view Imputation and Feature selectIon lEaRning (UNIFIER), is proposed. UNIFIER explores the local structure of multi-view data by adaptively learning similarity-induced graphs from both the sample and feature spaces. Then, UNIFIER dynamically recovers the missing views, guided by the sample and feature similarity graphs during the feature selection procedure. Furthermore, the half-quadratic minimization technique is used to automatically weight different instances, alleviating the impact of outliers and unreliable restored data. Comprehensive experimental results demonstrate that UNIFIER outperforms other state-of-the-art methods.

List of keywords

Machine Learning -> ML: Feature extraction, selection and dimensionality reduction
Machine Learning -> ML: Unsupervised learning

3406

SaSDim: Self-Adaptive Noise Scaling Diffusion Model for Spatial Time Series Imputation

Shunyang Zhang, Senzhang Wang, Xianzhen Tan, Renzhi Wang, Ruochen Liu, Jian Zhang, Jianxin Wang

Spatial time series imputation is of great importance to various real-world applications. As the state-of-the-art generative models, diffusion models (e.g., CSDI) have outperformed statistical and autoregressive based models in time series imputation. However, diffusion models may introduce unstable noise owing to the inherent uncertainty in sampling, leading to the generated noise deviating from the intended Gaussian distribution. Consequently, the imputed data may deviate from the real data. To this end, we propose a Self-adaptive noise Scaling Diffusion Model named SaSDim for spatial time series imputation. Specifically, we introduce a novel Probabilistic High-Order SDE Solver Module to scale the noise following the standard Gaussian distribution. The noise scaling operation helps the noise prediction module of the diffusion model to more accurately estimate the variance of noise. To effectively learn the spatial and temporal features, a Spatial guided Global Convolution Module (SgGConv) for multi-periodic temporal dependencies learning with the Fast Fourier Transformation and dynamic spatial dependencies learning with dynamic graph convolution is also proposed. Extensive experiments conducted on three real-world spatial time series datasets verify the effectiveness of SaSDim.

List of keywords

Data Mining -> DM: Mining spatial and/or temporal data

3412

vMFER: Von Mises-Fisher Experience Resampling Based on Uncertainty of Gradient Directions for Policy Improvement

Yiwen Zhu, Jinyi Liu, Wenya Wei, Qianyi Fu, Yujing Hu, Zhou Fang, Bo An, Jianye Hao, Tangjie Lv, Changjie Fan

Reinforcement Learning (RL) is a widely employed technique in decision-making problems, encompassing two fundamental operations — policy evaluation and policy improvement. Enhancing learning efficiency remains a key challenge in RL, with many efforts focused on using ensemble critics to boost policy evaluation efficiency. However, when using multiple critics, the actor in the policy improvement process can obtain different gradients. Previous studies have combined these gradients without considering their disagreements. Therefore, optimizing the policy improvement process is crucial to enhance learning efficiency. This study focuses on investigating the impact of gradient disagreements caused by ensemble critics on policy improvement. We introduce the concept of uncertainty of gradient directions as a means to measure the disagreement among gradients utilized in the policy improvement process. Through measuring the disagreement among gradients, we find that transitions with lower uncertainty of gradient directions are more reliable in the policy improvement process. Building on this analysis, we propose a method called von Mises-Fisher Experience Resampling (vMFER), which optimizes the policy improvement process by resampling transitions and assigning higher confidence to transitions with lower uncertainty of gradient directions. Our experiments demonstrate that vMFER significantly outperforms the benchmark and is particularly well-suited for ensemble structures in RL.
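
One simple way to quantify disagreement among gradient directions is the mean resultant length of the normalized gradients, the statistic from which the concentration of a von Mises-Fisher distribution is commonly estimated. The sketch below computes only that statistic; how vMFER maps it to resampling probabilities is not described in the abstract, so nothing here should be read as the paper's procedure. It assumes every gradient is non-zero, and the function name is hypothetical.

```python
import math


def mean_resultant_length(gradients):
    """Norm of the average unit vector: 1 means full agreement, 0 means none."""
    dim = len(gradients[0])
    total = [0.0] * dim
    for g in gradients:
        norm = math.sqrt(sum(x * x for x in g))  # assumed non-zero
        for d in range(dim):
            total[d] += g[d] / norm
    return math.sqrt(sum(t * t for t in total)) / len(gradients)


# Hypothetical per-critic gradients for two transitions.
agree = mean_resultant_length([(2.0, 0.0), (5.0, 0.0)])
disagree = mean_resultant_length([(1.0, 0.0), (-3.0, 0.0)])
```

Values near 1 correspond to low uncertainty of gradient directions, and values near 0 to high uncertainty; gradient magnitudes do not affect the result.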

List of keywords

Machine Learning -> ML: Reinforcement learning

3415

Contrastive Transformer Cross-Modal Hashing for Video-Text Retrieval

Xiaobo Shen, Qianxin Huang, Long Lan, Yuhui Zheng

As video-based social networks continue to grow exponentially, there is a rising interest in video retrieval using natural language. Cross-modal hashing, which learns compact hash codes for encoding multi-modal data, has proven to be widely effective in large-scale cross-modal retrieval, e.g., image-text retrieval, primarily due to its computation and storage efficiency. However, when applied to video-text retrieval, existing cross-modal hashing methods generally extract features at the frame- or word-level for videos and texts individually, thereby ignoring their long-term dependencies. To address this issue, we propose Contrastive Transformer Cross-Modal Hashing (CTCH), a novel approach designed for the video-text retrieval task. CTCH employs a bidirectional transformer encoder to encode video and text and leverages their long-term dependencies. CTCH further introduces a supervised multi-modality contrastive loss that effectively exploits inter-modality and intra-modality similarities among videos and texts. The experimental results on three video benchmark datasets demonstrate that CTCH outperforms state-of-the-art methods in video-text retrieval tasks.

List of keywords

Computer Vision -> CV: Image and video retrieval
Machine Learning -> ML: Multi-modal learning
Machine Learning -> ML: Multi-view learning

3435

Unsupervised Deep Graph Structure and Embedding Learning

Xiaobo Shen, Lei Shi, Xiuwen Gong, Shirui Pan

Graph Neural Network (GNN) is powerful in graph embedding learning, but its performance has been shown to be heavily degraded under adversarial attacks. Deep graph structure learning (GSL) is proposed to defend against attacks by jointly learning graph structure and graph embedding, typically in the node classification task. Label supervision is expensive in real-world applications, and thus unsupervised GSL is more challenging and still remains less studied. To fill this gap, this paper proposes a new unsupervised GSL method, i.e., unsupervised property GNN (UPGNN). UPGNN first refines graph structure by exploring properties of low rank, sparsity, and feature smoothness. UPGNN employs graph mutual information loss to learn graph embedding by maximizing its correlation with the refined graph. The proposed UPGNN learns graph structure and embedding without label supervision, and thus can be applied to various downstream tasks. We further propose Accelerated UPGNN (AUPGNN) to reduce computational complexity, providing an efficient alternative to UPGNN. Our extensive experiments on node classification and clustering demonstrate the effectiveness of the proposed method over state-of-the-art methods, especially under heavy perturbation.

List of keywords

Data Mining -> DM: Mining graphs
Machine Learning -> ML: Sequence and graph learning
Machine Learning -> ML: Unsupervised learning

3450

Group-Aware Coordination Graph for Multi-Agent Reinforcement Learning

Wei Duan, Jie Lu, Junyu Xuan

Cooperative Multi-Agent Reinforcement Learning (MARL) necessitates seamless collaboration among agents, often represented by an underlying relation graph. Existing methods for learning this graph primarily focus on agent-pair relations, neglecting higher-order relationships. While several approaches attempt to extend cooperation modelling to encompass behaviour similarities within groups, they commonly fall short in concurrently learning the latent graph, thereby constraining the information exchange among partially observed agents. To overcome these limitations, we present a novel approach to infer the Group-Aware Coordination Graph (GACG), which is designed to capture both the cooperation between agent pairs based on current observations and group-level dependencies from behaviour patterns observed across trajectories. This graph is further used in graph convolution for information exchange between agents during decision-making. To further ensure behavioural consistency among agents within the same group, we introduce a group distance loss, which promotes group cohesion and encourages specialization between groups. Our evaluations, conducted on StarCraft II micromanagement tasks, demonstrate GACG’s superior performance. An ablation study further provides experimental evidence of the effectiveness of each component of our method.

List of keywords

Machine Learning -> ML: Multiagent Reinforcement Learning
Agent-based and Multi-agent Systems -> MAS: Coordination and cooperation

3452

Make Bricks with a Little Straw: Large-Scale Spatio-Temporal Graph Learning with Restricted GPU-Memory Capacity

Binwu Wang, Pengkun Wang, Zhengyang Zhou, Zhe Zhao, Wei Xu, Yang Wang

Traffic prediction plays a key role in various smart city applications. Accurate forecasting can help traffic managers make traffic plans in advance, assist online ride-hailing companies in deploying vehicles reasonably, and provide early warning of congestion for safety authorities. While increasingly complex models achieve impressive prediction performance, there are concerns about the effectiveness of these models in handling large-scale road networks. Especially for researchers who don’t have access to powerful GPU devices, the expensive memory burden limits the usefulness of these models. In this paper, we take the first step to learn large-scale spatio-temporal graphs, and propose a divide-and-conquer training strategy for Large Spatio-Temporal Graph Learning, namely LarSTL. The core idea behind this strategy is to divide the large graph into multiple subgraphs, which are treated as task streams to sequentially train the model to conquer each subgraph one by one. We introduce a novel continuous learning paradigm to achieve this goal. Specifically, the experience-based replay strategy consolidates the learned knowledge by replaying the previous subgraph sampling nodes. At the same time, we configure specific feature adaptors for each subgraph to extract personalized features, which is beneficial for consolidating the learned knowledge from the perspective of parameters. We conduct experiments on multiple large-scale networks with only one GPU device with 16GB memory; the results demonstrate that the model can achieve competitive performance and high efficiency.

List of keywords

Data Mining -> DM: Mining spatial and/or temporal data
Data Mining -> DM: Big data and scalability
Data Mining -> DM: Mining graphs

3466

Active Deep Multi-view Clustering

Helin Zhao, Wei Chen, Peng Zhou

Deep multi-view clustering has been widely studied. However, since it is an unsupervised task, where no labels are used to guide the training, it is still unreliable especially when handling complicated data. Although deep semi-supervised multi-view clustering can alleviate this problem by using some supervised information, the supervised information is often pregiven or randomly selected. Unfortunately, as we know, the clustering performance highly depends on the quality of the supervised information and most of the semi-supervised methods ignore the supervised information selection. To tackle this problem, in this paper, we propose a novel active deep multi-view clustering method, which can actively select important data for querying human annotations. In this method, we carefully design a fusion module, an active selection module, a supervised module, and an unsupervised module, and integrate them into a unified framework seamlessly. In this framework, we can obtain a more reliable clustering result with as few annotations as possible. The extensive experiments on benchmark data sets show that our method can outperform state-of-the-art unsupervised and semi-supervised methods, demonstrating the effectiveness and superiority of the proposed method. The code is available at https://github.com/wodedazhuozi/ADMC.

List of keywords

Machine Learning -> ML: Multi-view learning
Machine Learning -> ML: Active learning
Machine Learning -> ML: Multi-modal learning

3473

Decoupled Invariant Attention Network for Multivariate Time-series Forecasting

Haihua Xu, Wei Fan, Kun Yi, Pengyang Wang

To achieve more accurate prediction results in Time Series Forecasting (TSF), it is essential to distinguish between the valuable patterns (invariant patterns) of the temporal-spatial relationship and the patterns that are prone to generate distribution shifts (variant patterns), then combine them for forecasting. The existing works, such as transformer-based models and GNN-based models, focus on capturing the main forecasting dependencies whether they are stable or not, and they tend to overlook patterns that carry both useful information and distribution shifts. In this paper, we propose a model for better forecasting time series: Decoupled Invariant Attention Network (DIAN), which contains two modules to learn temporal and spatial relationships respectively: 1) Spatial Decoupled Invariant-Variant Learning (SDIVL) to decouple the spatial invariant and variant attention scores, and then leverage convolutional networks to effectively integrate them for subsequent layers; 2) Temporal Augmented Invariant-Variant Learning (TAIVL) to decouple temporal invariant and variant patterns and combine them for further forecasting. In this module, we also design a Temporal Intervention Mechanism to create multiple intervened samples by reassembling variant patterns across time stamps to eliminate the spurious impacts of variant patterns. In addition, we propose Joint Optimization to minimize the loss function considering all invariant patterns, variant patterns and intervened patterns so that our model can gain a more stable predictive ability. Extensive experiments on five datasets have demonstrated our superior performance with higher efficiency compared with state-of-the-art methods.

List of keywords

Data Mining -> DM: Mining spatial and/or temporal data

3481

DGR: A General Graph Desmoothing Framework for Recommendation via Global and Local Perspectives

Leilei Ding, Dazhong Shen, Chao Wang, Tianfu Wang, Le Zhang, Yanyong Zhang

Graph Convolutional Networks (GCNs) have become pivotal in recommendation systems for learning user and item embeddings by leveraging the user-item interaction graph’s node information and topology. However, these models often face the famous over-smoothing issue, leading to indistinct user and item embeddings and reduced personalization. Traditional desmoothing methods in GCN-based systems are model-specific, lacking a universal solution. This paper introduces a novel, model-agnostic approach named Desmoothing Framework for GCN-based Recommendation Systems (DGR). It effectively addresses over-smoothing on general GCN-based recommendation models by considering both global and local perspectives. Specifically, we first introduce vector perturbations during each message passing layer to penalize the tendency of node embeddings to become overly similar, with the guidance of the global topological structure. Meanwhile, we further develop a tailor-designed loss term for the readout embeddings to preserve the local collaborative relations between users and their neighboring items. In particular, items that exhibit a high correlation with neighboring items are also incorporated to enhance the local topological information. To validate our approach, we conduct extensive experiments on 5 benchmark datasets based on 5 well-known GCN-based recommendation models, demonstrating the effectiveness and generalization of our proposed framework. Our code is available at GitHub.

List of keywords

Data Mining -> DM: Collaborative filtering
Data Mining -> DM: Recommender systems

3488

Denoising-Aware Contrastive Learning for Noisy Time Series

Shuang Zhou, Daochen Zha, Xiao Shen, Xiao Huang, Rui Zhang, Korris Chung

Time series self-supervised learning (SSL) aims to exploit unlabeled data for pre-training to mitigate the reliance on labels. Despite the great success in recent years, there is limited discussion on the potential noise in the time series, which can severely impair the performance of existing SSL methods. To mitigate the noise, the de facto strategy is to apply conventional denoising methods before model training. However, this pre-processing approach may not fully eliminate the effect of noise in SSL for two reasons: (i) the diverse types of noise in time series make it difficult to automatically determine suitable denoising methods; (ii) noise can be amplified after mapping raw data into latent space. In this paper, we propose denoising-aware contrastive learning (DECL), which uses contrastive learning objectives to mitigate the noise in the representation and automatically selects suitable denoising methods for every sample. Extensive experiments on various datasets verify the effectiveness of our method. The code is open-sourced.

List of keywords

Machine Learning -> ML: Self-supervised Learning
Machine Learning -> ML: Classification
Machine Learning -> ML: Representation learning
Machine Learning -> ML: Time series and data streams

3496

Improving Pseudo Labels with Global-Local Denoising Framework for Cross-lingual Named Entity Recognition

Zhuojun Ding, Wei Wei, Xiaoye Qu, Dangyang Chen

Cross-lingual named entity recognition (NER) aims to train an NER model for the target language leveraging only labeled source language data and unlabeled target language data. Prior approaches either perform label projection on translated source language data or employ a source model to assign pseudo labels for target language data and train a target model on these pseudo-labeled data to generalize to the target language. However, these automatic labeling procedures inevitably introduce noisy labels, thus leading to a performance drop. In this paper, we propose a Global-Local Denoising framework (GLoDe) for cross-lingual NER. Specifically, GLoDe introduces a progressive denoising strategy to rectify incorrect pseudo labels by leveraging both global and local distribution information in the semantic space. The refined pseudo-labeled target language data significantly improves the model’s generalization ability. Moreover, previous methods only consider improving the model with language-agnostic features; however, we argue that target language-specific features are also important and should never be ignored. To this end, we employ a simple auxiliary task to achieve this goal. Experimental results on two benchmark datasets with six target languages demonstrate that our proposed GLoDe significantly outperforms current state-of-the-art methods.

List of keywords

Natural Language Processing -> NLP: Named entities
Natural Language Processing -> NLP: Information extraction

3498

BoostDream: Efficient Refining for High-Quality Text-to-3D Generation from Multi-View Diffusion

Yonghao Yu, Shunan Zhu, Huai Qin, Haorui Li

Witnessing the evolution of text-to-image diffusion models, significant strides have been made in text-to-3D generation. Currently, two primary paradigms dominate the field of text-to-3D: the feed-forward generation solutions, capable of swiftly producing 3D assets but often yielding coarse results, and the Score Distillation Sampling (SDS) based solutions, known for generating high-fidelity 3D assets albeit at a slower pace. The synergistic integration of these methods holds substantial promise for advancing 3D generation techniques. In this paper, we present BoostDream, a highly efficient plug-and-play 3D refining method designed to transform coarse 3D assets into high-quality ones. The BoostDream framework comprises three distinct processes: (1) We introduce 3D model distillation that fits differentiable representations from the 3D assets obtained through feed-forward generation. (2) A novel multi-view SDS loss is designed, which utilizes a multi-view aware 2D diffusion model to refine the 3D assets. (3) We propose to use prompt and multi-view consistent normal maps as guidance in refinement. Our extensive experiments are conducted on different differentiable 3D representations, revealing that BoostDream excels in generating high-quality 3D assets rapidly, overcoming the Janus problem compared to conventional SDS-based methods. This breakthrough signifies a substantial advancement in both the efficiency and quality of 3D generation processes.

List of keywords

Machine Learning -> ML: Generative models
Computer Vision -> CV: 3D computer vision
Multidisciplinary Topics and Applications -> MTA: Arts and creativity

3519

M2Beats: When Motion Meets Beats in Short-form Videos

Dongxiang Jiang, Yongchang Zhang, Shuai He, Anlong Ming

In recent years, short-form videos have gained popularity and the editing of these videos, particularly when motion is synchronized with music, is highly favored due to its beat-matching effect. However, detecting motion rhythm poses a significant challenge as it is influenced by multiple factors that make it difficult to define using explicit rules. While traditional methods attempt to define motion rhythm, they often yield unsatisfactory results. On the other hand, learning-based methods can extract motion rhythm without relying on explicit rules but require high-quality datasets. Unfortunately, existing datasets simply substitute music rhythm for motion rhythm, although the two are not equivalent. To address these challenges, we present the motion rhythm dataset AIST-M2B, which is annotated with meticulously curated motion rhythm labels derived from the profound correlation between motion and music in professional dance. We propose a novel network architecture called M2BNet that is specifically trained on AIST-M2B to effectively extract intricate motion rhythms by incorporating both human body structure and temporal information. Additionally, we introduce a pioneering algorithm for enhancing motion rhythm synchronization with beats. Experimental results substantiate the superior performance of our method compared to other existing algorithms in the domain of motion rhythm analysis. Our code is available at https://github.com/mRobotit/M2Beats.

List of keywords

Computer Vision -> CV: Image and video synthesis and generation
Computer Vision -> CV: Image and video retrieval
Computer Vision -> CV: Motion and tracking

3521

Advancing Medical Image Segmentation via Self-supervised Instance-adaptive Prototype Learning

Guoyan Liang, Qin Zhou, Jingyuan Chen, Zhe Wang, Chang Yao

Medical Image Segmentation (MIS) plays a crucial role in medical therapy planning and robot navigation. Prototype learning methods in MIS focus on generating segmentation masks through pixel-to-prototype comparison. However, current approaches often overlook sample diversity by using a fixed prototype per semantic class and neglect intra-class variation within each input. In this paper, we propose to generate instance-adaptive prototypes for MIS, which integrate a common prototype proposal (CPP) capturing common visual patterns and an instance-specific prototype proposal (IPP) tailored to each input. To further account for the intra-class variation, we propose to guide the IPP generation by re-weighting the intermediate feature map according to its confidence scores. These confidence scores are hierarchically generated using a transformer decoder. Additionally, we introduce a novel self-supervised filtering strategy to prioritize the foreground pixels during the training of the transformer decoder. Extensive experiments demonstrate favorable performance of our method.

List of keywords

Computer Vision -> CV: Biomedical image analysis
Computer Vision -> CV: Representation learning
Computer Vision -> CV: Segmentation
Machine Learning -> ML: Self-supervised Learning

3525

Spatial-Temporal Perceiving: Deciphering User Hierarchical Intent in Session-Based Recommendation

Xiao Wang, Tingting Dai, Qiao Liu, Shuang Liang

Session-based recommendation (SBR) aims to predict the next-interacted item based on anonymous users’ behavior sequences. The main challenge is how to recognize the user intent with limited interactions to achieve a more accurate inference of user behavior. Existing works usually regard several consecutive items in the current session as intent. However, we argue such intent generation based on temporal transition ignores the fact that each item also has its semantically connected items in the feature space, which can be regarded as spatial intent. The limited consideration of intent fails to capture complex behavioral patterns in real-world scenarios, leading to sub-optimal solutions. To address this issue, we propose the Hierarchical Intent Perceiving Contrastive Learning Framework (HearInt) for SBR, which introduces a hierarchical consideration of intents from both temporal and spatial perspectives. Specifically, we first propose that the user’s temporal intents are mutually exclusive while the spatial intents are mutually compatible. Following these analyses, we design a Temporal Intent Decoupling module to mitigate the mutual influence of long-term and short-term intents, and a Cross-scale Contrastive Learning task to enhance the consistency of intents across different spatial scales. Experimental results on three real-world datasets show that HearInt achieves state-of-the-art performance.

List of keywords

Data Mining -> DM: Recommender systems

3528

Towards Robust Trajectory Representations: Isolating Environmental Confounders with Causal Learning

Kang Luo, Yuanshao Zhu, Wei Chen, Kun Wang, Zhengyang Zhou, Sijie Ruan, Yuxuan Liang

Trajectory modeling refers to characterizing human movement behavior, serving as a pivotal step in understanding mobility patterns. Nevertheless, existing studies typically ignore the confounding effects of geospatial context, leading to the acquisition of spurious correlations and limited generalization capabilities. To bridge this gap, we initially formulate a Structural Causal Model (SCM) to decipher the trajectory representation learning process from a causal perspective. Building upon the SCM, we further present a Trajectory modeling framework (TrajCL) based on Causal Learning, which leverages the backdoor adjustment theory as an intervention tool to eliminate the spurious correlations between geospatial context and trajectories. Extensive experiments on two real-world datasets verify that TrajCL markedly enhances performance in trajectory classification tasks while showcasing superior generalization and interpretability.

List of keywords

Data Mining -> DM: Mining spatial and/or temporal data
Multidisciplinary Topics and Applications -> MTA: Transportation

3538

Practical Hybrid Gradient Compression for Federated Learning Systems

Sixu Hu, Linshan Jiang, Bingsheng He

The high communication cost is a major challenge in the federated learning (FL) training process. Several methods have been proposed to reduce communication costs on the uplink channel, primarily sparsification-based methods, which have overlooked the impact of downlink channels. However, model accuracy and communication cost issues arise when applying them in practical FL applications, especially when the bandwidth is limited both on the uplink and downlink channels. In this paper, we propose a novel secure-FL-compatible hybrid gradient compression framework (HGC) that handles both uplink and downlink communication. Specifically, HGC identifies and exploits three types of redundancies in the FL training process. With proposed optimization methods based on compression ratio correction and dynamic momentum correction, HGC improves the trade-off between communication cost and model performance. The extensive theoretical and empirical analysis demonstrates the effectiveness of our framework in achieving a high compression ratio for both uplink and downlink communications with negligible loss of model accuracy, surpassing the state-of-the-art compression methods.

List of keywords

Machine Learning -> ML: Federated learning

3549

Make Graph Neural Networks Great Again: A Generic Integration Paradigm of Topology-Free Patterns for Traffic Speed Prediction

Yicheng Zhou, Pengfei Wang, Hao Dong, Denghui Zhang, Dingqi Yang, Yanjie Fu, Pengyang Wang

Urban traffic speed prediction aims to estimate the future traffic speed for improving urban transportation services. Enormous efforts have been made to exploit Graph Neural Networks (GNNs) for modeling spatial correlations and temporal dependencies of traffic speed evolving patterns, regularized by graph topology. While achieving promising results, current traffic speed prediction methods still suffer from ignoring topology-free patterns, which cannot be captured by GNNs. To tackle this challenge, we propose a generic model for enabling the current GNN-based methods to preserve topology-free patterns. Specifically, we first develop a Dual Cross-Scale Transformer (DCST) architecture, including a Spatial Transformer and a Temporal Transformer, to preserve the cross-scale topology-free patterns and associated dynamics, respectively. Then, to further integrate both topology-regularized/-free patterns, we propose a distillation-style learning framework, in which the existing GNN-based methods are considered as the teacher model, and the proposed DCST architecture is considered as the student model. The teacher model would inject the learned topology-regularized patterns into the student model for integrating topology-free patterns. The extensive experimental results demonstrated the effectiveness of our methods.

List of keywords

Data Mining -> DM: Mining spatial and/or temporal data

3577

AK4Prompts: Aesthetics-driven Automatically Keywords-Ranking for Prompts in Text-To-Image Models

Haiyang Zhang, Mengchao Wang, Shuai He, Anlong Ming

Current text-to-image synthesis (TIS) models have demonstrated the ability to generate high-fidelity images based on textual prompts. However, the efficacy of these models heavily relies on the keywords present in the prompts, and there is a dearth of objective analysis regarding how different keywords impact the ultimate quality of generated results. Therefore, manual evaluation becomes necessary, but it is a limited and inefficient way to ascertain the role played by keywords. In this paper, we propose automated keywords-ranking for prompts (AK4Prompts), a keyword evaluation model based on mainstream TIS models that explicitly quantifies the multidimensional impact of various keywords on image generation based on prompts. To enable personalized keyword evaluation based on prompt content, we propose decoupling the latent representations of keywords and prompts in TIS models, followed by integrating the semantic features of prompts into keywords. For quantitative and multidimensional evaluation, we align the fused features of keywords using HPSv2, aesthetic score, and CLIP score, each representing distinct factors contributing to keyword impact. Our AK4Prompts can flexibly and automatically select the keywords that best match the original prompt based on individual user preferences. Extensive experimental results show that AK4Prompts significantly improves the quality of generated images over strong baselines. Our approach not only enhances usability and user experience but also addresses the current gap in automated analysis and evaluation of keyword effects. Our code is available at https://github.com/mRobotit/AK4Prompts.

List of keywords

Computer Vision -> CV: Image and video synthesis and generation
Computer Vision -> CV: Computational photography
Computer Vision -> CV: Machine learning for vision

3578

Label-efficient Semantic Scene Completion with Scribble Annotations

Song Wang, Jiawei Yu, Wentong Li, Hao Shi, Kailun Yang, Junbo Chen, Jianke Zhu

Semantic scene completion aims to infer the 3D geometric structures with semantic classes from camera or LiDAR, which provides essential occupancy information in autonomous driving. Prior endeavors concentrate on constructing the network or benchmark in a fully supervised manner. However, the dense occupancy grids need point-wise semantic annotations, which incur expensive and tedious labeling costs. In this paper, we build a new label-efficient benchmark, named ScribbleSC, where the sparse scribble-based semantic labels are combined with dense geometric labels for semantic scene completion. In particular, we propose a simple yet effective approach called Scribble2Scene, which bridges the gap between the sparse scribble annotations and full supervision. Our method consists of geometric-aware auto-labelers construction and online model training with an offline-to-online distillation module to enhance the performance. Experiments on SemanticKITTI demonstrate that Scribble2Scene achieves competitive performance against the fully-supervised counterparts, showing 99% performance of the fully-supervised models with only 13.5% voxels labeled. Both annotations of ScribbleSC and our full implementation are available at https://github.com/songw-zju/Scribble2Scene.

List of keywords

Computer Vision -> CV: Scene analysis and understanding
Computer Vision -> CV: Applications

3582

Generalized Taxonomy-Guided Graph Neural Networks

Yu Zhou, Di Jin, Jianguo Wei, Dongxiao He, Zhizhi Yu, Weixiong Zhang

Graph neural networks have been demonstrated to be effective analytic apparatus for mining network data. Most real-world networks are inherently hierarchical, offering unique opportunities to acquire latent, intrinsic network organizational properties by utilizing network taxonomies. The existing approaches for learning implicit hierarchical network structures focus on introducing taxonomy to graph neural networks but often fall short of exploiting the rich network semantics and structural properties in the taxonomy, resulting in poor generalizability and reusability. To address these issues, we propose generalized Taxonomy-Guided Graph Neural Networks (TG-GNN) to integrate taxonomy into network representation learning. We first construct a taxonomy representation learning module that introduces the concept of ego network to propagate and aggregate rich semantic and structural information in the taxonomy. We then design a taxonomy-guided Markov mechanism, which encapsulates taxonomy knowledge in pairwise potential functions, to refine network embeddings. Extensive experiments on various real-world networks illustrate the effectiveness of TG-GNN over the state-of-the-art methods in scenarios involving incomplete taxonomies and inductive settings.

List of keywords

Data Mining -> DM: Mining graphs
Machine Learning -> ML: Sequence and graph learning

3586

Individual Causal Structure Learning from Population Data

Wei Chen, Xiaokai Huang, Zijian Li, Ruichu Cai, Zhiyi Huang, Zhifeng Hao

Learning the causal structure of each individual plays a crucial role in neuroscience, biology, and so on. Existing methods consider data from each individual separately, which may yield inaccurate causal structure estimations in limited samples. To leverage more samples, we consider incorporating data from all individuals as population data. We observe that the variables of all individuals are influenced by the common environment variables they share. These shared environment variables can be modeled as latent variables and serve as a bridge connecting data from different individuals. In particular, we propose an Individual Linear Acyclic Model (ILAM) for each individual from population data, which models the individual’s variables as being linearly influenced by their parents, in addition to environment variables and noise terms. Theoretical analysis shows that the model is identifiable when all environment variables are non-Gaussian, or even if some are Gaussian with an adequate diversity in the variance of noises for each individual. We then develop an individual causal structures learning method based on the Share Independence Component Analysis technique. Experimental results on synthetic and real-world data demonstrate the correctness of the method even when the sample size of each individual’s data is small.

List of keywords

Uncertainty in AI -> UAI: Causality, structural causal models and causal inference

3589

Robust Contrastive Multi-view Kernel Clustering

Peng Su, Yixi Liu, Shujian Li, Shudong Huang, Jiancheng Lv

Multi-view kernel clustering (MKC) aims to fully reveal the consistency and complementarity of multiple views in a potential Hilbert space, thereby enhancing clustering performance. The clustering results of most MKC methods are highly sensitive to the quality of the constructed kernels, as traditional methods independently compute kernel matrices for each view without fully considering complementary information across views. In previous contrastive multi-view kernel learning, the goal was to bring cross-view instances of the same sample closer during the kernel construction process while pushing apart instances across samples to achieve a comprehensive integration of cross-view information. However, its inherent drawback is the potential inappropriate amplification of distances between different instances of the same clusters (i.e., false negative pairs) during the training process, leading to a reduction in inter-class discriminability. To address this challenge, we propose a Robust Contrastive multi-view kernel Learning approach (R-CMK) against false negative pairs. It partitions negative pairs into different intervals based on distance or similarity, and for false negative pairs, reverses their optimization gradient. This effectively avoids further amplification of distances for false negative pairs while simultaneously pushing true negative pairs farther apart. We conducted comprehensive experiments on various MKC methods to validate the effectiveness of the proposed method. The code is available at https://github.com/Duo-laimi/rcmk_main.

List of keywords

Machine Learning -> ML: Multi-view learning
Machine Learning -> ML: Clustering
Machine Learning -> ML: Kernel methods

3593

Regression Residual Reasoning with Pseudo-labeled Contrastive Learning for Uncovering Multiple Complex Compositional Relations

Chengtai Li, Yuting He, Jianfeng Ren, Ruibin Bai, Yitian Zhao, Heng Yu, Xudong Jiang

Abstract Visual Reasoning (AVR) has been widely studied in literature. Our study reveals that AVR models tend to rely on appearance matching rather than a genuine understanding of underlying rules. We hence develop a challenging benchmark, Multiple Complex Compositional Reasoning (MC$^2$R), composed of diverse compositional rules on attributes with intentionally increased variations. It aims to identify two outliers from five given images, in contrast to single-answer questions in previous AVR tasks. To solve MC$^2$R tasks, a Regression Residual Reasoning with Pseudo-labeled Contrastive Learning (R$^3$PCL) is proposed, which first transforms the original problem by selecting three images following the same rule, and iteratively regresses one normal image by using the other two, allowing the model to gradually comprehend the underlying rules. The proposed PCL leverages a set of min-max operations to generate more reliable pseudo labels, and exploits contrastive learning with data augmentation on pseudo-labeled images to boost the discrimination and generalization of features. Experimental results on two AVR datasets show that the proposed R$^3$PCL significantly outperforms state-of-the-art models.

List of keywords

Knowledge Representation and Reasoning -> KRR: Learning and reasoning

3597

ScreenAgent: A Vision Language Model-driven Computer Control Agent

Runliang Niu, Jindong Li, Shiqi Wang, Yali Fu, Xiyu Hu, Xueyuan Leng, He Kong, Qi Wang, Yi Chang

Large Language Models (LLMs) can invoke a variety of tools and APIs to complete complex tasks. The computer, as the most powerful and universal tool, could potentially be controlled by a trained LLM agent. Powered by the computer, we can hopefully build a more generalized agent to assist humans in various daily digital work. In this paper, we construct an environment for a Vision Language Model (VLM) agent to interact with a real computer screen. Within this environment, the agent can observe screenshots and manipulate the Graphical User Interface (GUI) by outputting mouse and keyboard actions. We also design an automated control pipeline that includes planning, acting, and reflecting phases, guiding the agent to continuously interact with the environment and complete multi-step tasks. Additionally, we construct the ScreenAgent Dataset, which collects screenshots and action sequences when completing daily computer tasks. Finally, we train a model, ScreenAgent, which achieves computer control capabilities comparable to GPT-4V and demonstrates more precise UI positioning capabilities. Our attempts could inspire further research on building a generalist LLM agent. The code and more detailed information are at https://github.com/niuzaisheng/ScreenAgent.

List of keywords

Natural Language Processing -> NLP: Dialogue and interactive systems
Agent-based and Multi-agent Systems -> MAS: Human-agent interaction
Computer Vision -> CV: Vision, language and reasoning
Natural Language Processing -> NLP: Resources and evaluation

3600

TFLOP: Table Structure Recognition Framework with Layout Pointer Mechanism

Minsoo Khang, Teakgyu Hong

Table Structure Recognition (TSR) is a task aimed at converting table images into a machine-readable format (e.g. HTML), to facilitate other applications such as information retrieval. Recent works tackle this problem by identifying the HTML tags and text regions, where the latter is used for text extraction from the table document. These works, however, suffer from misalignment issues when mapping text into the identified text regions. In this paper, we introduce a new TSR framework, called TFLOP (TSR Framework with LayOut Pointer mechanism), which reformulates the conventional text region prediction and matching into a direct text region pointing problem. Specifically, TFLOP utilizes text region information to identify both the table’s structure tags and its aligned text regions simultaneously. Without the need for region prediction and alignment, TFLOP circumvents the additional text region matching stage, which requires finely calibrated post-processing. TFLOP also employs span-aware contrastive supervision to enhance the pointing mechanism in tables with complex structure. As a result, TFLOP achieves state-of-the-art performance across multiple benchmarks such as PubTabNet, FinTabNet, and SynthTabNet. In our extensive experiments, TFLOP not only exhibits competitive performance but also shows promising results on industrial document TSR scenarios such as documents with watermarks or in non-English domains. Source code of our work is publicly available at: https://github.com/UpstageAI/TFLOP.

List of keywords

Computer Vision -> CV: Recognition (object detection, categorization)
Computer Vision -> CV: Applications
Natural Language Processing -> NLP: Applications

3601

A Bias-Free Revenue-Maximizing Bidding Strategy for Data Consumers in Auction-based Federated Learning

Xiaoli Tang, Han Yu, Zengxiang Li, Xiaoxiao Li

Auction-based Federated Learning (AFL) is a burgeoning research area. However, existing bidding strategies for AFL data consumers (DCs) primarily focus on maximizing expected accumulated utility, disregarding the more complex goal of revenue maximization. They also only consider winning bids, leading to biased estimates by overlooking information from losing bids. To address these issues, we propose a Bias-free Revenue-maximizing Federated bidding strategy for DCs in AFL (BR-FEDBIDDER). Our theoretical exploration of the relationships between Return on Investment (ROI), bid costs, and utility, and their impact on overall revenue underscores the complexity of maximizing revenue solely by prioritizing ROI enhancement. Leveraging these insights, BR-FEDBIDDER optimizes bid costs with any given ROI constraint. In addition, we incorporate an auxiliary task of winning probability estimation into the framework to achieve bias-free learning by leveraging bid records from historical bid requests, including both winning and losing ones. Extensive experiments on six widely used benchmark datasets show that BR-FEDBIDDER outperforms eight state-of-the-art methods, surpassing the best-performing baseline by 5.66%, 6.08% and 2.44% in terms of the total revenue, ROI, and test accuracy of the resulting FL models, respectively.

List of keywords

Machine Learning -> ML: Federated learning

3643

A Neural Column Generation Approach to the Vehicle Routing Problem with Two-Dimensional Loading and Last-In-First-Out Constraints

Yifan Xia, Xiangyi Zhang

The vehicle routing problem with two-dimensional loading constraints (2L-CVRP) and the last-in-first-out (LIFO) rule presents significant practical and algorithmic challenges. Numerous heuristic approaches have been proposed to address its complexity, which stems from two NP-hard problems: the vehicle routing problem (VRP) and the two-dimensional bin packing problem (2D-BPP). Less attention has been paid to developing exact algorithms. Bridging this gap, this article presents an exact algorithm that integrates advanced machine learning techniques, specifically a novel combination of attention and recurrence mechanisms. This integration accelerates the state-of-the-art exact algorithm by a median of 29.79% across various problem instances. Moreover, the proposed algorithm successfully resolves an open instance in the standard test-bed, demonstrating significant improvements brought about by the incorporation of machine learning models. Code is available at https://github.com/xyfffff/NCG-for-2L-CVRP.

List of keywords

Constraint Satisfaction and Optimization -> CSO: Constraint optimization problems
Constraint Satisfaction and Optimization -> CSO: Modeling
Machine Learning -> ML: Applications
Multidisciplinary Topics and Applications -> MTA: Transportation

3654

KDDC: Knowledge-Driven Disentangled Causal Metric Learning for Pre-Travel Out-of-Town Recommendation

Yinghui Liu, Guojiang Shen, Chengyong Cui, Zhenzhen Zhao, Xiao Han, Jiaxin Du, Xiangyu Zhao, Xiangjie Kong

Pre-travel recommendation is developed to provide a variety of out-of-town Points-of-Interest (POIs) for users who plan to travel away from their hometowns but have not yet decided on their destination. Existing out-of-town recommender systems work on constructing users’ latent preferences and inferring travel intentions from their check-in sequences. However, there are still two challenges that hamper the performance of these approaches: i) Users’ interactive data (including hometown and out-of-town check-ins) tend to be rare, while candidate POIs that come from different regions contain various semantic information; ii) The causes for user check-in include not only interest but also conformity, which are easily entangled and overlooked. To fill these gaps, we propose a Knowledge-Driven Disentangled Causal metric learning framework (KDDC) that mitigates interaction data sparsity by enhancing POI semantic representation and considers the distributions of two causes (i.e., conformity and interest) for pre-travel recommendation. Specifically, we pretrain a constructed POI attribute knowledge graph through a segmented interaction method, and POI semantic information is aggregated via relational heterogeneity. In addition, we devise a disentangled causal metric learning method to model and infer user-related representations. Extensive experiments on two real-world nationwide datasets display the consistent superiority of our KDDC over state-of-the-art baselines.

List of keywords

Data Mining -> DM: Recommender systems
Data Mining -> DM: Mining spatial and/or temporal data

3666

BADFSS: Backdoor Attacks on Federated Self-Supervised Learning

Jiale Zhang, Chengcheng Zhu, Di Wu, Xiaobing Sun, Jianming Yong, Guodong Long

Self-supervised learning (SSL) is capable of learning remarkable representations from centrally available data. Recent works further implement federated learning with SSL to learn from rapidly growing decentralized unlabeled images (e.g., from cameras and phones), often resulting from privacy constraints. Extensive attention has been paid to designing new frameworks or methods that achieve better performance for SSL-based FL. However, such efforts have not yet taken the security of SSL-based FL into consideration. We aim to explore backdoor attacks in the context of SSL-based FL via an in-depth empirical study. In this paper, we propose a novel backdoor attack BADFSS against SSL-based FL. First, BADFSS learns a backdoored encoder via supervised contrastive learning on poison datasets constructed based on local datasets. Then, BADFSS employs attention alignment to enhance the backdoor effect and maintain the consistency between backdoored and global encoders. Moreover, we perform empirical evaluations of the proposed backdoor attacks on four datasets and compare BADFSS with three existing backdoor attacks that are transferred into federated self-supervised learning. The experiments demonstrate that BADFSS outperforms baseline methods and is effective under various settings.

List of keywords

AI Ethics, Trust, Fairness -> ETF: Trustworthy AI
AI Ethics, Trust, Fairness -> ETF: Safety and robustness
Multidisciplinary Topics and Applications -> MTA: Security and privacy

3669

ESP-PCT: Enhanced VR Semantic Performance through Efficient Compression of Temporal and Spatial Redundancies in Point Cloud Transformers

Luoyu Mei, Shuai Wang, Yun Cheng, Ruofeng Liu, Zhimeng Yin, Wenchao Jiang, Shuai Wang, Wei Gong

Semantic recognition is pivotal in virtual reality (VR) applications, enabling immersive and interactive experiences. A promising approach is the utilization of millimeter-wave (mmWave) signals to generate point clouds for this purpose. However, the high computational and memory demands of current mmWave point cloud models hinder their efficiency and reliability. To address this, our paper introduces ESP-PCT, a novel Enhanced Semantic Performance Point Cloud Transformer with a two-stage semantic recognition framework tailored for VR applications. ESP-PCT takes advantage of the accuracy of sensory point cloud data and optimizes the semantic recognition process by sharing the same network parameters across both the localization and focus stages, which are trained jointly in an end-to-end manner. We rigorously evaluate ESP-PCT on various VR semantic recognition tasks, demonstrating substantial enhancements in recognition efficiency. Notably, ESP-PCT achieves a remarkable accuracy of 93.2%, while concurrently reducing the computational requirements (FLOPs) by 76.9% and memory usage by 78.2% compared to the existing Point Transformer model. These results underscore ESP-PCT’s potential in revolutionizing VR semantic recognition by achieving high accuracy and reducing redundancy.

List of keywords

Computer Vision -> CV: Applications
Computer Vision -> CV: Motion and tracking
Machine Learning -> ML: Optimization
Multidisciplinary Topics and Applications -> MTA: Security and privacy

3675

Estimating before Debiasing: A Bayesian Approach to Detaching Prior Bias in Federated Semi-Supervised Learning

Guogang Zhu, Xuefeng Liu, Xinghao Wu, Shaojie Tang, Chao Tang, Jianwei Niu, Hao Su

Federated Semi-Supervised Learning (FSSL) leverages both labeled and unlabeled data on clients to collaboratively train a model. In FSSL, the heterogeneous data can introduce prediction bias into the model, causing the model’s predictions to skew towards certain classes. Existing FSSL methods primarily tackle this issue by enhancing consistency in model parameters or outputs. However, as the models themselves are biased, merely constraining their consistency is not sufficient to alleviate prediction bias. In this paper, we explore this bias from a Bayesian perspective and demonstrate that it principally originates from label prior bias within the training data. Building upon this insight, we propose a debiasing method for FSSL named FedDB. FedDB utilizes the Average Prediction Probability of Unlabeled Data (APP-U) to approximate the biased prior. During local training, FedDB employs APP-U to refine pseudo-labeling through Bayes’ theorem, thereby significantly reducing the label prior bias. Concurrently, during model aggregation, FedDB uses APP-U from participating clients to formulate unbiased aggregate weights, thereby effectively diminishing bias in the global model. Experimental results show that FedDB can surpass existing FSSL methods. The code is available at https://github.com/GuogangZhu/FedDB.

List of keywords

Data Mining -> DM: Parallel, distributed and cloud-based high performance mining
Data Mining -> DM: Privacy-preserving data mining
Machine Learning -> ML: Semi-supervised learning

3695

FlagVNE: A Flexible and Generalizable Reinforcement Learning Framework for Network Resource Allocation

Tianfu Wang, Qilin Fan, Chao Wang, Long Yang, Leilei Ding, Nicholas Jing Yuan, Hui Xiong

Virtual network embedding (VNE) is an essential resource allocation task in network virtualization, aiming to map virtual network requests (VNRs) onto physical infrastructure. Reinforcement learning (RL) has recently emerged as a promising solution to this problem. However, existing RL-based VNE methods are limited by the unidirectional action design and one-size-fits-all training strategy, resulting in restricted searchability and generalizability. In this paper, we propose a FLexible And Generalizable RL framework for VNE, named FlagVNE. Specifically, we design a bidirectional action-based Markov decision process model that enables the joint selection of virtual and physical nodes, thus improving the exploration flexibility of solution space. To tackle the expansive and dynamic action space, we design a hierarchical decoder to generate adaptive action probability distributions and ensure high training efficiency. Furthermore, to overcome the generalization issue for varying VNR sizes, we propose a meta-RL-based training method with a curriculum scheduling strategy, facilitating specialized policy training for each VNR size. Finally, extensive experimental results show the effectiveness of FlagVNE across multiple key metrics. Our code is available at https://github.com/GeminiLight/flag-vne.

List of keywords

Data Mining -> DM: Applications
Data Mining -> DM: Parallel, distributed and cloud-based high performance mining
Machine Learning -> ML: Applications

3696

Implicit Prompt Learning for Image Denoising

Yao Lu, Bo Jiang, Guangming Lu, Bob Zhang

Recently, various deep denoising methods have been proposed to solve the insufficient feature problem in image denoising. These methods can be mainly classified into two categories: (1) Injecting learnable tensors into the denoising backbone to supplement features, which is effective to some extent but may cause serious over-fitting. (2) Using diverse natural images from large image datasets to synthesize noisy images and pre-train denoising models, which can bring model generalization but requires large model size and expensive training costs. To address these issues, this paper proposes the Implicit Prompt Learning for Image Denoising (IPLID) method to flexibly generate adaptive prompts without meticulously designing them. Specifically, we first introduce an efficient Linear Prompt (LP) block with ultra-few parameters to produce dynamic prompts for both different stages and samples in the denoising procedure. We further propose an efficient Compact Feature Fusion (CFF) block to process previous multi-level prompted denoising features to reconstruct the denoised images. Finally, to further efficiently and effectively produce satisfactory prompt and denoising performance, a Gradient Accumulation (GA) learning scheme is proposed. Experiments on multiple benchmarks show that the proposed IPLID achieves competitive results with only 1% of pre-trained backbone parameters, outperforming classical denoising methods in both efficiency and quality of restored images.

List of keywords

Machine Learning -> ML: Knowledge-aided learning

3699

QFormer: An Efficient Quaternion Transformer for Image Denoising

Bo Jiang, Yao Lu, Guangming Lu, Bob Zhang

Since Deep Convolutional Neural Networks (DCNNs) and Vision Transformers perform well in learning generalizable image priors from large-scale data, these models have been widely used in image denoising tasks. However, vanilla DCNNs and Transformers suffer from two problems. First, vanilla DCNNs and Transformers only accumulate the output along the channel axis, ignoring the internal relationship among channels. This results in a severely inadequate color structure representation retrieved from color images. Second, DCNN- or Transformer-based image denoising models usually have a large number of parameters, high computational complexity, and slow inference speed. To resolve these issues, this paper proposes a highly efficient Quaternion Transformer (QFormer) for image denoising. Specifically, the proposed Quaternion Transformer Block (QTB) simplifies the typical Transformer from a multi-branch structure to an elaborately sequential structure mainly with quaternion transformations, to alternately capture both long-range dependencies and local contextual features with color structure information. Furthermore, the proposed QTB can also avoid the considerable element-wise multiplications involved in computing the self-attention matrices. Thus, our QTB can significantly reduce the computational complexity, and its sequential structure can further improve the practical inference speed. Comprehensive experiments demonstrate that the proposed QFormer produces state-of-the-art results in both denoising performance and efficiency. We hope that our work will encourage further research to explore the Quaternion Transformer architecture for image denoising tasks.

List of keywords

Machine Learning -> ML: Knowledge-aided learning

3718

Zero-shot Learning for Preclinical Drug Screening

Kun Li, Weiwei Liu, Yong Luo, Xiantao Cai, Jia Wu, Wenbin Hu

Conventional deep learning methods typically employ supervised learning for drug response prediction (DRP). This entails dependence on labeled response data from drugs for model training. However, practical applications in the preclinical drug screening phase demand that DRP models predict responses for novel compounds, often with unknown drug responses. This presents a challenge, rendering supervised deep learning methods unsuitable for such scenarios. In this paper, we propose a zero-shot learning solution for the DRP task in preclinical drug screening. Specifically, we propose a Multi-branch Multi-Source Domain Adaptation Test Enhancement Plug-in, called MSDA. MSDA can be seamlessly integrated with conventional DRP methods, learning invariant features from the prior response data of similar drugs to enhance real-time predictions of unlabeled compounds. The results of experiments on two large drug response datasets showed that MSDA efficiently predicts drug responses for novel compounds, leading to a general performance improvement of 5-10% in the preclinical drug screening phase. The significance of this solution resides in its potential to accelerate the drug discovery process, improve drug candidate assessment, and facilitate the success of drug discovery. The code is available at https://github.com/DrugD/MSDA.

List of keywords

Data Mining -> DM: Mining graphs
Data Mining -> DM: Knowledge graphs and knowledge base completion
Multidisciplinary Topics and Applications -> MTA: Bioinformatics

3743

Correct and Optimal: The Regular Expression Inference Challenge

Mojtaba Valizadeh, Philip John Gorinski, Ignacio Iacobacci, Martin Berger

We propose regular expression inference (REI) as a challenge for code/language modelling, and the wider machine learning community. REI is a supervised machine learning (ML) and program optimisation task, and poses the problem of finding minimal regular expressions from examples: Given two finite sets of strings P and N and a cost function cost(·), the task is to generate an expression r that accepts all strings in P and rejects all strings in N, while no other such expression r′ exists with cost(r′) < cost(r). REI has advantages as a challenge problem: (i) regular expressions are well-known, widely used, and a natural idealisation of code; (ii) REI’s asymptotic worst-case complexity is well understood; (iii) REI has a small number of easy to understand parameters (e.g. P or N cardinality, string lengths of examples, or the cost function); this lets us easily finetune REI-hardness; (iv) REI, with its emphasis on optimisation, is an unsolved problem for deep learning based ML. Recently, an REI solver was implemented on GPUs, using program synthesis techniques. This enabled, for the first time, fast generation of minimal regular expressions for complex REI instances. Building on this advance, we generate and publish the first large-scale datasets for REI, and devise and evaluate several initial heuristic and machine learning baselines. We invite the community to participate and explore ML methods that learn to solve REI problems. We believe that progress in REI directly translates to progress in code/language modelling.

List of keywords

Natural Language Processing -> NLP: Resources and evaluation
Machine Learning -> ML: Applications
Machine Learning -> ML: Other
Natural Language Processing -> NLP: Other

3762

Contrastive Learning Is Not Optimal for Quasiperiodic Time Series

Adrian Atienza, Jakob Bardram, Sadasivan Puthusserypady

Despite recent advancements in Self-Supervised Learning (SSL) for Time Series analysis, a noticeable gap persists between the anticipated achievements and actual performance. While these methods have demonstrated formidable generalization capabilities with minimal labels in various domains, their effectiveness in distinguishing between different classes based on a limited number of annotated records is notably lacking. Our hypothesis attributes this bottleneck to the prevalent use of Contrastive Learning, a shared training objective in previous state-of-the-art (SOTA) methods. By mandating distinctiveness between representations for negative pairs drawn from separate records, this approach compels the model to encode unique record-based patterns but simultaneously neglects changes occurring across the entire record. To overcome this challenge, we introduce Distilled Embedding for Almost-Periodic Time Series (DEAPS) in this paper, offering a non-contrastive method tailored for quasiperiodic time series, such as electrocardiogram (ECG) data. By avoiding the use of negative pairs, we not only mitigate the model’s blindness to temporal changes but also enable the integration of a "Gradual Loss (L_gra)" function. This function guides the model to effectively capture dynamic patterns evolving throughout the record. The outcomes are promising, as DEAPS demonstrates a notable improvement of +10% over existing SOTA methods when just a few annotated records are presented to fit a Machine Learning (ML) model based on the learned representation.

List of keywords

Machine Learning -> ML: Self-supervised Learning
Machine Learning -> ML: Time series and data streams
Multidisciplinary Topics and Applications -> MTA: Health and medicine

3769

Are Logistic Models Really Interpretable?

Danial Dervovic, Freddy Lecue, Nicolas Marchesotti, Daniele Magazzeni

The demand for open and trustworthy AI models points towards widespread publishing of model weights. Consumers of these model weights must be able to act accordingly with the information provided. That said, one of the simplest AI classification models, Logistic Regression (LR), has an unwieldy interpretation of its model weights, with greater difficulties when extending LR to generalised additive models. In this work, we show via a User Study that skilled participants are unable to reliably reproduce the action of small LR models given the trained parameters. As an antidote to this, we define Linearised Additive Models (LAMs), an optimal piecewise linear approximation that augments any trained additive model equipped with a sigmoid link function, requiring no retraining. We argue that LAMs are more interpretable than logistic models — survey participants are shown to solve model reasoning tasks with LAMs much more accurately than with LR given the same information. Furthermore, we show that LAMs do not suffer from large performance penalties in terms of ROC-AUC and calibration with respect to their logistic counterparts on a broad suite of public financial modelling data.

List of keywords

AI Ethics, Trust, Fairness -> ETF: Explainability and interpretability
Machine Learning -> ML: Classification

3807

Contrastive Learning Drug Response Models from Natural Language Supervision

Kun Li, Xiuwen Gong, Jia Wu, Wenbin Hu

Deep learning-based drug response prediction (DRP) methods can accelerate the drug discovery process and reduce research and development costs. Despite their high accuracy, generating regression-aware representations remains challenging for mainstream approaches. For instance, the representations are often disordered, aggregated, and overlapping, and they fail to characterize distinct samples effectively. This results in poor representation during the DRP task, diminishing generalizability and potentially leading to substantial costs during drug discovery. In this paper, we propose CLDR, a contrastive learning framework with natural language supervision for DRP. CLDR converts regression labels into text, which is merged with the drug response caption as a second sample modality instead of the traditional modes, i.e., graphs and sequences. Simultaneously, a common-sense numerical knowledge graph is introduced to improve the continuous text representation. Our framework is validated using the Genomics of Drug Sensitivity in Cancer dataset, with average performance increases ranging from 7.8% to 31.4%. Furthermore, experiments demonstrate that the proposed CLDR effectively maps samples with distinct label values into a high-dimensional space. In this space, the sample representations are scattered, significantly alleviating feature overlap. The code is available at: https://github.com/DrugD/CLDR.

List of keywords

Data Mining -> DM: Mining graphs
Data Mining -> DM: Knowledge graphs and knowledge base completion
Multidisciplinary Topics and Applications -> MTA: Bioinformatics

3816

MGCBS: An Optimal and Efficient Algorithm for Solving Multi-Goal Multi-Agent Path Finding Problem

Mingkai Tang, Yuanhang Li, Hongji Liu, Yingbing Chen, Ming Liu, Lujia Wang

With the expansion of the scale of robotics applications, the multi-goal multi-agent pathfinding (MG-MAPF) problem began to gain widespread attention. This problem requires each agent to visit pre-assigned multiple goal points at least once without conflict. Some previous methods have been proposed to solve the MG-MAPF problem based on Decoupling the goal Vertex visiting order search and the Single-agent pathfinding (DVS). However, this paper demonstrates that the methods based on DVS cannot always obtain the optimal solution. To obtain the optimal result, we propose the Multi-Goal Conflict-Based Search (MGCBS), which is based on Decoupling the goal Safe interval visiting order search and the Single-agent pathfinding (DSS). Additionally, we present the Time-Interval-Space Forest (TIS Forest) to enhance the efficiency of MGCBS by maintaining the shortest paths from any start point at any start time step to each safe interval at the goal points. The experiment demonstrates that our method can consistently obtain optimal results and execute up to 7 times faster than the state-of-the-art method in our evaluation.

List of keywords

Agent-based and Multi-agent Systems -> MAS: Multi-agent planning
Planning and Scheduling -> PS: Robot planning
Robotics -> ROB: Multi-robot systems
Search -> S: Combinatorial search and optimisation

3838

Deep Frequency Derivative Learning for Non-stationary Time Series Forecasting

Wei Fan, Kun Yi, Hangting Ye, Zhiyuan Ning, Qi Zhang, Ning An

Since most time series are non-stationary, models inevitably face the distribution shift issue in time series forecasting. Existing solutions manipulate statistical measures (usually mean and std.) to adjust the time series distribution. However, these operations can be theoretically seen as a transformation towards the zero-frequency component of the spectrum, which cannot reveal full distribution information and further leads to an information utilization bottleneck in normalization, thus hindering forecasting performance. To address this problem, we propose to utilize the whole frequency spectrum to transform time series to make full use of data distribution from the frequency perspective. We present a deep frequency derivative learning framework, DERITS, for non-stationary time series forecasting. Specifically, DERITS is built upon a novel reversible transformation, namely Frequency Derivative Transformation (FDT), that derives signals in the frequency domain to acquire more stationary frequency representations. Then, we propose the Order-adaptive Fourier Convolution Network to conduct adaptive frequency filtering and learning. Furthermore, we organize DERITS as a parallel-stacked architecture for multi-order derivation and fusion for forecasting. Finally, we conduct extensive experiments on several datasets, which show its consistent superiority in both time series forecasting and shift alleviation.

List of keywords

Machine Learning -> ML: Time series and data streams
Data Mining -> DM: Mining spatial and/or temporal data

3854

On the Power and Limitations of Examples for Description Logic Concepts

Raoul Koudijs, Balder Ten Cate, Ana Ozaki

Labeled examples (i.e., positive and negative examples) are an attractive medium for communicating complex concepts. They are useful for deriving concept expressions (such as in concept learning, interactive concept specification, and concept refinement) as well as for illustrating concept expressions to a user or domain expert. We investigate the power of labeled examples for describing description-logic concepts. Specifically, we systematically study the existence and efficient computability of finite characterizations, i.e., finite sets of labeled examples that uniquely characterize a single concept, for a wide variety of description logics between $\mathcal{EL}$ and $\mathcal{ALCQI}$, both without an ontology and in the presence of a DL-Lite ontology. Finite characterizations are relevant for debugging purposes, and their existence is a necessary condition for exact learnability with membership queries.

List of keywords

Knowledge Representation and Reasoning -> KRR: Learning and reasoning
Knowledge Representation and Reasoning -> KRR: Description logics and ontologies
Machine Learning -> ML: Learning theory

3863

HypBO: Accelerating Black-Box Scientific Experiments Using Experts’ Hypotheses

Abdoulatif Cissé, Xenophon Evangelopoulos, Sam Carruthers, Vladimir V. Gusev, Andrew I. Cooper

Robotics and automation offer massive acceleration for solving intractable, multivariate scientific problems such as materials discovery, but the available search spaces can be dauntingly large. Bayesian optimization has emerged as a popular sample-efficient optimization engine, thriving in tasks where no analytic form of the target function/property is known. Here, we exploit expert human knowledge in the form of hypotheses to direct Bayesian searches more quickly to promising regions of chemical space. Previous methods have used underlying distributions derived from existing experimental measurements, which is unfeasible for new, unexplored scientific tasks. Also, such distributions cannot capture intricate hypotheses. Our proposed method uses expert human hypotheses to generate improved seed samples. Unpromising seeds are automatically discounted, while promising seeds are used to augment the surrogate model data, thus achieving better-informed sampling. This process continues in a global versus local search fashion, organized in a bilevel optimization framework. We validate the performance of our method on a range of synthetic functions and demonstrate its practical utility on a real chemical design task where the use of expert hypotheses accelerates the search performance significantly.

List of keywords

Machine Learning -> ML: Optimization
Humans and AI -> HAI: Human-AI collaboration
Knowledge Representation and Reasoning -> KRR: Learning and reasoning
Machine Learning -> ML: Bayesian learning

3866

PoRank: A Practical Framework for Learning to Rank Policies

Pengjie Gu, Mengchen Zhao, Xu He, Yi Cai, Bo An

In many real-world scenarios, we need to select from a set of candidate policies before online deployment. Although existing off-policy evaluation (OPE) methods can be used to estimate the online performance, they suffer from high variance. Fortunately, we care only about the ranking of the candidate policies, rather than their exact online rewards. Based on this, we propose a novel framework PoRank for learning to rank policies. In practice, learning to rank policies faces two main challenges: 1) generalization over the huge policy space and 2) lack of supervision signals. To overcome the first challenge, PoRank uses a Policy Comparison Transformer (PCT) for learning cross-policy representations, which capture the core discrepancies between policies and generalize well across the whole policy space. The second challenge arises because learning to rank requires online comparisons of policies as ground-truth labels, whereas deploying policies online might be highly expensive. To overcome this, PoRank adopts a crowdsourcing-based learning-to-rank (LTR) framework, where a set of OPE algorithms is employed to provide weak comparison labels. Experimental results show that PoRank not only outperforms baselines when the ground-truth labels are provided, but also achieves competitive performance when the ground-truth labels are unavailable.

List of keywords

Machine Learning -> ML: Reinforcement learning

3874

Dirichlet-based Uncertainty Quantification for Personalized Federated Learning with Improved Posterior Networks

Nikita Kotelevskii, Samuel Horváth, Karthik Nandakumar, Martin Takac, Maxim Panov

In modern federated learning, one of the main challenges is to account for inherent heterogeneity and the diverse nature of data distributions for different clients. This problem is often addressed by introducing personalization of the models towards the data distribution of the particular client. However, a personalized model might be unreliable when applied to the data that is not typical for this client. Eventually, it may perform worse for these data than the non-personalized global model trained in a federated way on the data from all the clients. This paper presents a new approach to federated learning that allows selecting a model from global and personalized ones that would perform better for a particular input point. It is achieved through a careful modeling of predictive uncertainties that helps to detect local and global in- and out-of-distribution data and use this information to select the model that is confident in a prediction. The comprehensive experimental evaluation on the popular real-world image datasets shows the superior performance of the model in the presence of out-of-distribution data while performing on par with state-of-the-art personalized federated learning algorithms in the standard scenarios.

List of keywords

Uncertainty in AI -> UAI: Bayesian networks
Machine Learning -> ML: Bayesian learning
Machine Learning -> ML: Federated learning
Machine Learning -> ML: Probabilistic machine learning

3880

Are Watermarks Bugs for Deepfake Detectors? Rethinking Proactive Forensics

Xiaoshuai Wu, Xin Liao, Bo Ou, Yuling Liu, Zheng Qin

AI-generated content has accelerated the topic of media synthesis, particularly Deepfake, which can manipulate our portraits for positive or malicious purposes. Before releasing these threatening face images, one promising forensics solution is the injection of robust watermarks to track their own provenance. However, we argue that current watermarking models, originally devised for genuine images, may harm the deployed Deepfake detectors when directly applied to forged images, since the watermarks are prone to overlap with the forgery signals used for detection. To bridge this gap, we thus propose AdvMark, on behalf of proactive forensics, to exploit the adversarial vulnerability of passive detectors for good. Specifically, AdvMark serves as a plug-and-play procedure for fine-tuning any robust watermarking into adversarial watermarking, to enhance the forensic detectability of watermarked images; meanwhile, the watermarks can still be extracted for provenance tracking. Extensive experiments demonstrate the effectiveness of the proposed AdvMark, leveraging robust watermarking to fool Deepfake detectors, which can help improve the accuracy of downstream Deepfake detection without tuning the in-the-wild detectors. We believe this work will shed some light on the harmless proactive forensics against Deepfake.

List of keywords

Multidisciplinary Topics and Applications -> MTA: Security and privacy
Computer Vision -> CV: Biometrics, face, gesture and pose recognition

3882

It Ain’t That Bad: Understanding the Mysterious Performance Drop in OOD Generalization for Generative Transformer Models

Xingcheng Xu, Zihao Pan, Haipeng Zhang, Yanqing Yang

Large language models (LLMs) have achieved remarkable proficiency in solving diverse problems. However, their generalization ability is not always satisfying and the generalization problem is common for generative transformer models in general. Researchers take basic mathematical tasks like n-digit addition or multiplication as important perspectives for investigating their generalization behaviors. It is observed that when training models on n-digit operations (e.g., additions) in which both input operands are n-digit in length, models generalize successfully on unseen n-digit inputs (in-distribution (ID) generalization), but fail miserably on longer, unseen cases (out-of-distribution (OOD) generalization). We bring this unexplained performance drop to attention and ask whether there is systematic OOD generalization. Towards understanding LLMs, we train various smaller language models which may share the same underlying mechanism. We discover that the strong ID generalization stems from structured representations, while behind the unsatisfying OOD performance, the models still exhibit clear learned algebraic structures. Specifically, these models map unseen OOD inputs to outputs with learned equivalence relations in the ID domain, which we call the equivalence generalization. These findings deepen our knowledge regarding the generalizability of generative models including LLMs, and provide insights into potential avenues for improvement.

List of keywords

Natural Language Processing -> NLP: Interpretability and analysis of models for NLP
AI Ethics, Trust, Fairness -> ETF: Explainability and interpretability
Knowledge Representation and Reasoning -> KRR: Learning and reasoning

3885

Exterior Penalty Policy Optimization with Penalty Metric Network under Constraints

Shiqing Gao, Jiaxin Ding, Luoyi Fu, Xinbing Wang, Chenghu Zhou

In Constrained Reinforcement Learning (CRL), agents explore the environment to learn the optimal policy while satisfying constraints. The penalty function method has recently been studied as an effective approach for handling constraints, which imposes constraint penalties on the objective to transform the constrained problem into an unconstrained one. However, it is challenging to choose appropriate penalties that balance policy performance and constraint satisfaction efficiently. In this paper, we propose a theoretically guaranteed penalty function method, Exterior Penalty Policy Optimization (EPO), with adaptive penalties generated by a Penalty Metric Network (PMN). PMN responds appropriately to varying degrees of constraint violations, enabling efficient constraint satisfaction and safe exploration. We theoretically prove that EPO consistently improves constraint satisfaction with a convergence guarantee. We propose a new surrogate function and provide worst-case constraint violation and approximation error. In practice, we propose an effective smooth penalty function, which can be easily implemented with a first-order optimizer. Extensive experiments are conducted, showing that EPO outperforms the baselines in terms of policy performance and constraint satisfaction with a stable training process, particularly on complex tasks.

List of keywords

Machine Learning -> ML: Reinforcement learning
Constraint Satisfaction and Optimization -> CSO: Constraint optimization problems
Machine Learning -> ML: Optimization
Machine Learning -> ML: Theory of deep learning

3891

Learning-Based Tracking-before-Detect for RF-Based Unconstrained Indoor Human Tracking

Zhi Wu, Dongheng Zhang, Zixin Shang, Yuqin Yuan, Hanqin Gong, Binquan Wang, Zhi Lu, Yadong Li, Yang Hu, Qibin Sun, Yan Chen

Existing efforts on human tracking using wireless signals are primarily focused on constrained scenarios with only a few individuals in empty spaces. However, in practical unconstrained scenarios with severe interference and attenuation, accurate multi-person tracking has been intractable. In this paper, we propose NeuralTBD, utilizing the capability of deep models and advancement of Tracking-Before-Detect (TBD) methodology to achieve accurate human tracking. TBD is a classical tracking methodology from signal processing that accumulates measurements in the time domain to distinguish target traces from interference, which however relies on handcrafted shape/motion models, impeding efficacy in complex indoor scenarios. To tackle this challenge, we build an end-to-end learning-based TBD framework that leverages the advanced modeling capabilities of deep models to significantly enhance the performance of TBD. To evaluate NeuralTBD, we collect an RF-based tracking dataset in unconstrained scenarios, which encompasses 4 million annotated radar frames with up to 19 individuals acting in 6 different scenarios. NeuralTBD realizes a 70% improvement in performance compared to conventional TBD methods. To our knowledge, this is the first attempt dealing with RF-based unconstrained human tracking. The code and dataset will be released.

List of keywords

Multidisciplinary Topics and Applications -> MTA: Sensor networks and smart cities
Multidisciplinary Topics and Applications -> MTA: Ubiquitous computing systems

3894

Dual Contrastive Graph-Level Clustering with Multiple Cluster Perspectives Alignment

Jinyu Cai, Yunhe Zhang, Jicong Fan, Yali Du, Wenzhong Guo

Graph-level clustering, essential for data analysis in medical, biomedical, and social networking, involves grouping a set of graphs into various clusters. However, existing methods generally rely on single clustering criteria, e.g., k-means, which limits their ability to fully exploit the complex Euclidean and structural information inherent in graphs. To bridge this gap, this paper proposes a dual contrastive graph-level clustering (DCGLC) method. DCGLC leverages graph contrastive learning and introduces the Euclidean-based and subspace-based cluster heads to capture the cluster information from different cluster perspectives. To overcome the inconsistent estimations and fuse the cluster information of multiple cluster heads, we propose a contrastive mechanism to align the cluster information derived from them. The cluster perspectives contrast facilitates the capture of more comprehensive cluster information. Importantly, DCGLC is an end-to-end framework in which graph contrastive learning and cluster perspectives contrast are mutually improved. We demonstrate its superiority against the state-of-the-art baselines on numerous graph benchmarks.

List of keywords

Machine Learning -> ML: Unsupervised learning
Machine Learning -> ML: Clustering

3909

Alleviating Imbalanced Pseudo-label Distribution: Self-Supervised Multi-Source Domain Adaptation with Label-specific Confidence

Shuai Lü, Meng Kang, Ximing Li

The existing self-supervised Multi-Source Domain Adaptation (MSDA) methods often suffer from an imbalanced characteristic in the distribution of pseudo-labels. Such an imbalanced characteristic results in many labels with too many or too few pseudo-labeled samples on the target domain, referred to as easy-to-learn labels and hard-to-learn labels, respectively. Both kinds of labels hurt the generalization performance on the target domain. To alleviate this problem, in this paper we propose a novel multi-source domain adaptation method, namely Self-Supervised multi-Source Domain Adaptation with Label-specific Confidence (S3DA-LC). Specifically, we estimate the label-specific confidences, i.e., the learning difficulties of labels, and adopt them to generate the pseudo-labels for target samples, enabling us to simultaneously constrain and enrich the pseudo supervised signals for easy-to-learn and hard-to-learn labels. We evaluate S3DA-LC on several benchmark datasets, indicating its superior performance compared with the existing MSDA baselines.

List of keywords

Machine Learning -> ML: Multi-task and transfer learning
Machine Learning -> ML: Classification
Machine Learning -> ML: Self-supervised Learning

3918

Prompt-enhanced Network for Hateful Meme Classification

Junxi Liu, Yanyan Feng, Jiehai Chen, Yun Xue, Fenghuan Li

The dynamic expansion of social media has led to an inundation of hateful memes on media platforms, accentuating the growing need for efficient identification and removal. Acknowledging the constraints of conventional multimodal hateful meme classification, which heavily depends on external knowledge and poses the risk of including irrelevant or redundant content, we developed Pen—a prompt-enhanced network framework based on the prompt learning approach. Specifically, after constructing the sequence through the prompt method and encoding it with a language model, we performed region information global extraction on the encoded sequence for multi-view perception. By capturing global information about inference instances and demonstrations, Pen facilitates category selection by fully leveraging sequence information. This approach significantly improves model classification accuracy. Additionally, to bolster the model’s reasoning capabilities in the feature space, we introduced prompt-aware contrastive learning into the framework to improve the quality of sample feature distributions. Through extensive ablation experiments on two public datasets, we evaluate the effectiveness of the Pen framework, concurrently comparing it with state-of-the-art model baselines. Our research findings highlight that Pen surpasses manual prompt methods, showcasing superior generalization and classification accuracy in hateful meme classification tasks. Our code is available at https://github.com/juszzi/Pen.

List of keywords

Natural Language Processing -> NLP: Text classification
Machine Learning -> ML: Clustering
Machine Learning -> ML: Feature extraction, selection and dimensionality reduction
Natural Language Processing -> NLP: Language models

3920

Pre-DyGAE: Pre-training Enhanced Dynamic Graph Autoencoder for Occupational Skill Demand Forecasting

Xi Chen, Chuan Qin, Zhigaoyuan Wang, Yihang Cheng, Chao Wang, Hengshu Zhu, Hui Xiong

Occupational skill demand (OSD) forecasting seeks to predict dynamic skill demand specific to occupations, beneficial for employees and employers to grasp occupational nature and maintain a competitive edge in the rapidly evolving labor market. Although recent research has proposed data-driven techniques for forecasting skill demand, the focus has remained predominantly on overall trends rather than occupational granularity. In this paper, we propose a novel Pre-training Enhanced Dynamic Graph Autoencoder (Pre-DyGAE), forecasting skill demand from an occupational perspective. Specifically, we aggregate job descriptions (JDs) by occupation and segment them into several timestamps. Subsequently, in the initial timestamps, we pre-train a graph autoencoder (GAE), consisting of a semantically-aware cross-attention enhanced uncertainty-aware encoder and decoders for link prediction and edge regression to achieve graph reconstruction. In particular, we utilize contrastive learning on skill co-occurrence clusters to address data sparsity and a unified Tweedie and ranking loss for predicting the imbalanced distribution. Afterward, we incorporate an adaptive temporal encoding unit and a temporal shift module into GAE to achieve a dynamic GAE (DyGAE). Furthermore, we fine-tune the DyGAE with a two-stage optimization strategy and infer future representations. Extensive experiments on four real-world datasets validate the effectiveness of Pre-DyGAE compared with state-of-the-art baselines.

List of keywords

Data Mining -> DM: Applications
Data Mining -> DM: Mining spatial and/or temporal data

3926

Enhancing Dual-Target Cross-Domain Recommendation with Federated Privacy-Preserving Learning

Zhenghong Lin, Wei Huang, Hengyu Zhang, Jiayu Xu, Weiming Liu, Xingting Liao, Fan Wang, Shiping Wang, Yanchao Tan

Recently, dual-target cross-domain recommendation (DTCDR) has been proposed to alleviate the data sparsity problem by sharing the common knowledge across domains simultaneously. However, existing methods often assume that personal data containing abundant identifiable information can be directly accessed, which results in a controversial privacy leakage problem of DTCDR. To this end, we introduce the P2DTR framework, a novel approach to DTCDR that protects private user information. Specifically, we first design a novel inter-client knowledge extraction mechanism, which exploits the private set intersection algorithm and prototype-based federated learning to enable collaborative modeling among multiple users and a server. Furthermore, to improve the recommendation performance based on the extracted common knowledge across domains, we propose an intra-client enhanced recommendation, consisting of a constrained dominant set (CDS) propagation mechanism and dual-recommendation module. Extensive experiments on real-world datasets validate that our proposed P2DTR framework achieves superior utility under a privacy-preserving guarantee on both domains.

List of keywords

Data Mining -> DM: Recommender systems
Data Mining -> DM: Applications
Data Mining -> DM: Privacy-preserving data mining

3942

One-step Spiking Transformer with a Linear Complexity

Xiaotian Song, Andy Song, Rong Xiao, Yanan Sun

Spiking transformers have recently emerged as a robust alternative in deep learning. One focus of this field is the reduction of energy consumption, given that spiking transformers require lengthy simulation timesteps and complex floating-point attention mechanisms. In this paper, we propose a one-step approach that requires only one timestep and is of linear complexity. The proposed One-step Spiking Transformer (OST) incorporates a Time Domain Compression and Compensation (TDCC) component, which can significantly mitigate the spatio-temporal overhead of spiking transformers. Another novel component in OST is the Spiking Linear Transformation (SLT), designed to greatly reduce the number of floating-point multiply-and-accumulate operations. Experiments on both static and neuromorphic images show that OST can perform as well as or better than SOTA methods with just one timestep, even for more difficult tasks. For instance, compared with Spikeformer, OST gains 1.59% in accuracy on ImageNet while being 40.27% more efficient, and gains 0.7% on DVS128 Gesture. The supplementary materials and source code are available at https://github.com/songxt3/OST.

List of keywords

Humans and AI -> HAI: Cognitive modeling
Humans and AI -> HAI: Applications
Machine Learning -> General

3966

LG-GNN: Local-Global Adaptive Graph Neural Network for Modeling Both Homophily and Heterophily

Dongxiao He, Bin Feng, Zhizhi Yu, Zizhen Wang, Yuxiao Huang, Zhiyong Feng

Most Graph Neural Networks (GNNs) are based on the homophily assumption, where nodes with the same label or similar features tend to be connected to each other. However, real-world graphs often do not adhere to this homophily assumption. Currently, most studies aggregate multi-hop neighbor information to discover more potentially relevant nodes. However, in the aggregation process of GNNs, the difference between modeling global and local information is not considered, leading to information loss during aggregation. Inspired by this, we propose LG-GNN, a local-global adaptive graph neural network for modeling both homophily and heterophily. Specifically, we model the long-distance structural similarity and local feature similarity between nodes from global and local perspectives, in order to capture distant dependencies in highly heterophilic networks while reducing the mixing of locally dissimilar feature nodes, thereby increasing the effectiveness of information aggregation in highly heterophilic networks. Extensive experiments on a wide range of real-world datasets demonstrate that our proposed approach performs well in both heterophilic and homophilic graphs.

List of keywords

Data Mining -> DM: Mining graphs
Machine Learning -> ML: Sequence and graph learning

3971

ROME: Robust Multi-Modal Density Estimator

Anna Mészáros, Julian F. Schumann, Javier Alonso-Mora, Arkady Zgonnikov, Jens Kober

The estimation of probability density functions is a fundamental problem in science and engineering. However, common methods such as kernel density estimation (KDE) have been demonstrated to lack robustness, while more complex methods have not been evaluated in multi-modal estimation problems. In this paper, we present ROME (RObust Multi-modal Estimator), a non-parametric approach for density estimation which addresses the challenge of estimating multi-modal, non-normal, and highly correlated distributions. ROME utilizes clustering to segment a multi-modal set of samples into multiple uni-modal ones and then combines simple KDE estimates obtained for individual clusters into a single multi-modal estimate. We compared our approach to state-of-the-art methods for density estimation as well as ablations of ROME, showing that it not only outperforms established methods but is also more robust to a variety of distributions. Our results demonstrate that ROME can overcome the issues of over-fitting and over-smoothing exhibited by other estimators.
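The cluster-then-combine idea described in this abstract can be sketched in one dimension as follows. This is not the ROME implementation: the abstract does not specify the clustering algorithm or the KDE details, so the gap-based `split_by_gap` clustering, the Silverman rule-of-thumb bandwidth, the fallback bandwidth, and the sample data are all assumptions chosen to keep the example short. Each cluster gets its own Gaussian KDE, and the per-cluster densities are mixed with weights proportional to cluster size.

```python
import math
import statistics

def split_by_gap(samples, gap):
    """Stand-in clustering: start a new cluster wherever sorted
    neighbours differ by more than `gap` (an assumed heuristic)."""
    ordered = sorted(samples)
    clusters = [[ordered[0]]]
    for value in ordered[1:]:
        if value - clusters[-1][-1] > gap:
            clusters.append([value])
        else:
            clusters[-1].append(value)
    return clusters

def silverman_bandwidth(cluster, fallback=1.0):
    """Rule-of-thumb bandwidth 1.06 * sigma * n^(-1/5); falls back to a
    fixed value when the spread cannot be estimated."""
    if len(cluster) < 2:
        return fallback
    spread = statistics.stdev(cluster)
    if spread == 0:
        return fallback
    return 1.06 * spread * len(cluster) ** (-1 / 5)

def gaussian_kde(cluster, bandwidth):
    """Gaussian kernel density estimate for one (uni-modal) cluster."""
    norm = len(cluster) * bandwidth * math.sqrt(2 * math.pi)
    def density(x):
        return sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in cluster
        ) / norm
    return density

def clustered_kde(samples, gap):
    """Mix per-cluster KDEs, weighting each by its share of the samples."""
    clusters = split_by_gap(samples, gap)
    parts = [
        (len(c) / len(samples), gaussian_kde(c, silverman_bandwidth(c)))
        for c in clusters
    ]
    def density(x):
        return sum(weight * kde(x) for weight, kde in parts)
    return density

# Hypothetical bimodal sample with modes near 1 and 10.
samples = [0.9, 1.0, 1.1, 1.2, 0.8, 9.8, 10.0, 10.1, 10.3, 9.9]
density = clustered_kde(samples, gap=3.0)
```

Because each cluster's bandwidth is fitted to that cluster alone, the wide gap between the two modes does not inflate the bandwidth the way a single global rule-of-thumb estimate would.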

List of keywords

Machine Learning -> ML: Evaluation
Machine Learning -> ML: Probabilistic machine learning

3988

Deep Embedding Clustering Driven by Sample Stability

Zhanwen Cheng, Feijiang Li, Jieting Wang, Yuhua Qian

Deep clustering methods improve the performance of clustering tasks by jointly optimizing deep representation learning and clustering. While numerous deep clustering algorithms have been proposed, most of them rely on artificially constructed pseudo targets for performing clustering. This construction process requires some prior knowledge, and it is challenging to determine a suitable pseudo target for clustering. To address this issue, we propose a deep embedding clustering algorithm driven by sample stability (DECS), which eliminates the requirement of pseudo targets. Specifically, we start by constructing the initial feature space with an autoencoder and then learn the cluster-oriented embedding feature constrained by sample stability. The sample stability aims to explore the deterministic relationship between samples and all cluster centroids, pulling samples to their respective clusters and keeping them away from other clusters with high determinacy. We analyzed the convergence of the loss using Lipschitz continuity in theory, which verifies the validity of the model. The experimental results on five datasets illustrate that the proposed method achieves superior performance compared to state-of-the-art clustering approaches.

List of keywords

Machine Learning -> ML: Clustering
Machine Learning -> ML: Convolutional networks
Machine Learning -> ML: Deep learning architectures
Machine Learning -> ML: Unsupervised learning

3991

VCformer: Variable Correlation Transformer with Inherent Lagged Correlation for Multivariate Time Series Forecasting

Yingnan Yang, Qingling Zhu, Jianyong Chen

Multivariate time series (MTS) forecasting has been extensively applied across diverse domains, such as weather prediction and energy consumption. However, current studies still rely on the vanilla point-wise self-attention mechanism to capture cross-variable dependencies, which is inadequate in extracting the intricate cross-correlation implied between variables. To fill this gap, we propose Variable Correlation Transformer (VCformer), which utilizes a Variable Correlation Attention (VCA) module to mine the correlations among variables. Specifically, based on the stochastic process theory, VCA calculates and integrates the cross-correlation scores corresponding to different lags between queries and keys, thereby enhancing its ability to uncover multivariate relationships. Additionally, inspired by Koopman dynamics theory, we also develop a Koopman Temporal Detector (KTD) to better address non-stationarity in time series. The two key components enable VCformer to extract both multivariate correlations and temporal dependencies. Our extensive experiments on eight real-world datasets demonstrate the effectiveness of VCformer, achieving top-tier performance compared to other state-of-the-art baseline models. Code is available at this repository: https://github.com/CSyyn/VCformer.

List of keywords

Machine Learning -> ML: Time series and data streams
Machine Learning -> ML: Attention models

3992

Integrating Vision-Language Semantic Graphs in Multi-View Clustering

JunLong Ke, Zichen Wen, Yechenhao Yang, Chenhang Cui, Yazhou Ren, Xiaorong Pu, Lifang He

In recent years, a variety of graph learning-based multi-view clustering (MVC) methods have emerged. However, these methods continue to face challenges in extracting latent features from real-world data, particularly in scenarios involving high-resolution color images and high-dimensional features. This task is notably difficult in cases where images are visually similar yet semantically diverse. To address this issue, we present a novel large-scale pre-trained model for multi-view clustering, named Integrate Vision-Language Semantic Graphs in Multi-View Clustering (IVSGMV), which harnesses the capabilities of visual-language pre-training models to enhance clustering performance and confronts issues in the unsupervised tuning of pre-trained models for multi-view data. We introduce an effective unsupervised approach for creating semantic graphs from image multi-view datasets using pre-trained encoders. Our method addresses the inherent spatial noise and imbalance in these encoders by employing graph filters and a joint process that integrates both image node and edge features. Additionally, we demonstrate the application of our approach to multi-view image clustering on extensive datasets, notably the high-resolution MVImgNet, achieving an impressive 82% accuracy. Furthermore, our method extends the zero-shot capabilities of large-scale pre-trained models, resulting in good performance in clustering tasks on untrained multi-view datasets.

List of keywords

Machine Learning -> ML: Multi-view learning
Machine Learning -> ML: Clustering
Machine Learning -> ML: Multi-modal learning

4008

Model Checking Causality

Tiago de Lima, Emiliano Lorini

We present a novel modal language for causal reasoning and interpret it by means of a semantics in which causal information is represented using causal bases in propositional form. The language includes modal operators of conditional causal necessity where the condition is a causal change operation. We provide a succinct formulation of model checking for our language and a model checking procedure based on a polysize reduction to QBF. We illustrate the expressiveness of our language through some examples and show that it allows us to represent and to formally verify a variety of concepts studied in the field of explainable AI including abductive explanation, intervention and actual cause.

List of keywords

Knowledge Representation and Reasoning -> KRR: Causality
Knowledge Representation and Reasoning -> KRR: Knowledge representation languages

4034

Scene-Adaptive Person Search via Bilateral Modulations

Yimin Jiang, Huibing Wang, Jinjia Peng, Xianping Fu, Yang Wang

Person search aims to localize a specific target person from a gallery set of images with various scenes. As the scene around a moving pedestrian changes, the captured person images inevitably bring in a large amount of background noise and foreground noise on the person feature, which are completely unrelated to the person identity, leading to severe performance degeneration. To address this issue, we present a Scene-Adaptive Person Search (SEAS) model by introducing bilateral modulations to simultaneously eliminate scene noise and maintain a consistent person representation to adapt to various scenes. In SEAS, a Background Modulation Network (BMN) is designed to encode the feature extracted from the detected bounding box into a multi-granularity embedding, which reduces the input of background noise from multiple levels in a norm-aware manner. Additionally, to mitigate the effect of foreground noise on the person feature, SEAS introduces a Foreground Modulation Network (FMN) to compute the clutter reduction offset for the person embedding based on the feature map of the scene image. By bilateral modulations on both background and foreground in an end-to-end manner, SEAS obtains consistent feature representations without scene noise. SEAS can achieve state-of-the-art (SOTA) performance on two benchmark datasets, CUHK-SYSU with 97.1% mAP and PRW with 60.5% mAP. The code is available at https://github.com/whbdmu/SEAS.

List of keywords

Computer Vision -> CV: Image and video retrieval
Computer Vision -> CV: Representation learning

4044

Self-Promoted Clustering-based Contrastive Learning for Brain Networks Pretraining

Junbo Ma, Caixuan Luo, Jia Hou, Kai Zhao

Rapid advancements in neuroimaging techniques, such as magnetic resonance imaging (MRI), have facilitated the acquisition of the structural and functional characteristics of the brain. Brain network analysis is one of the essential tools for exploring brain mechanisms from MRI, providing valuable insights into the brain’s organization and advancing the understanding of brain cognition and the pathology of neurological diseases. Graph Neural Networks (GNNs) are commonly used for brain network analysis, but they are limited by the scarcity of medical data. Although Graph Contrastive Learning methods have been developed to address this, they often involve graph augmentations that distort the anatomical brain structures. To address these challenges, an augmentation-free contrastive learning method, named Self-Promoted Clustering-based Contrastive Learning (SPCCL), is proposed in this paper. Specifically, by introducing a Clustering-based Contrastive Learning loss and a self-promoted contrastive pairs generation scheme, the proposed SPCCL can be pre-trained on additional data from healthy subjects, which are easier to acquire than data from subjects with disorders. The proposed SPCCL introduces extra data without damaging the data balance and maintains label consistency while respecting the integrity of the original brain network structure, making SPCCL a promising approach for effective brain network analysis. Comprehensive experiments are conducted on an open-access schizophrenia dataset, demonstrating the effectiveness of the proposed method.

List of keywords

Computer Vision -> CV: Biomedical image analysis
Humans and AI -> HAI: Brain sciences
Machine Learning -> ML: Deep learning architectures
Machine Learning -> ML: Multi-modal learning

4050

A New Paradigm for Counterfactual Reasoning in Fairness and Recourse

Lucius Bynum, Joshua Loftus, Julia Stoyanovich

Counterfactuals and counterfactual reasoning underpin numerous techniques for auditing and understanding artificial intelligence (AI) systems. The traditional paradigm for counterfactual reasoning in this literature is the interventional counterfactual, where hypothetical interventions are imagined and simulated. For this reason, the starting point for causal reasoning about legal protections and demographic data in AI is an imagined intervention on a legally-protected characteristic, such as ethnicity, race, gender, disability, age, etc. We ask, for example, what would have happened had your race been different? An inherent limitation of this paradigm is that some demographic interventions — like interventions on race — may not translate into the formalisms of interventional counterfactuals. In this work, we explore a new paradigm based instead on the backtracking counterfactual, where rather than imagine hypothetical interventions on legally-protected characteristics, we imagine alternate initial conditions while holding these characteristics fixed. We ask instead, what would explain a counterfactual outcome for you as you actually are or could be? This alternate framework allows us to address many of the same social concerns, but to do so while asking fundamentally different questions that do not rely on demographic interventions.

List of keywords

Uncertainty in AI -> UAI: Causality, structural causal models and causal inference
AI Ethics, Trust, Fairness -> ETF: Fairness and diversity
AI Ethics, Trust, Fairness -> ETF: Moral decision making

4089

Resolving Word Vagueness with Scenario-guided Adapter for Natural Language Inference

Yonghao Liu, Mengyu Li, Di Liang, Ximing Li, Fausto Giunchiglia, Lan Huang, Xiaoyue Feng, Renchu Guan

Natural Language Inference (NLI) is a crucial task in natural language processing that involves determining the relationship between two sentences, typically referred to as the premise and the hypothesis. However, traditional NLI models rely solely on the semantic information inherent in independent sentences and lack relevant situational visual information, which can hinder a complete understanding of the intended meaning of the sentences due to the ambiguity and vagueness of language. To address this challenge, we propose an innovative ScenaFuse adapter that simultaneously integrates large-scale pre-trained linguistic knowledge and relevant visual information for NLI tasks. Specifically, we first design an image-sentence interaction module to incorporate visuals into the attention mechanism of the pre-trained model, allowing the two modalities to interact comprehensively. Furthermore, we introduce an image-sentence fusion module that can adaptively integrate visual information from images and semantic information from sentences. By incorporating relevant visual information and leveraging linguistic knowledge, our approach bridges the gap between language and vision, leading to improved understanding and inference capabilities in NLI tasks. Extensive benchmark experiments demonstrate that our proposed ScenaFuse, a scenario-guided approach, consistently boosts NLI performance.

List of keywords

Machine Learning -> ML: Multi-modal learning
Data Mining -> DM: Mining text, web, social media
Knowledge Representation and Reasoning -> KRR: Learning and reasoning

4094

Attention Shifting to Pursue Optimal Representation for Adapting Multi-granularity Tasks

Gairui Bai, Wei Xi, Yihan Zhao, Xinhui Liu, Jizhong Zhao

Object recognition in open environments, e.g., video surveillance, poses significant challenges due to the inclusion of unknown and multi-granularity tasks (MGT). However, recent methods exhibit limitations as they struggle to capture subtle differences between different parts within an object and adaptively handle MGT. To address this limitation, this paper proposes a Class-semantic Guided Attention Shift (SegAS) method. SegAS transforms adaptive MGT into dynamic combinations of invariant discriminant representations across different levels to effectively enhance adaptability to multi-granularity downstream tasks. Specifically, SegAS incorporates a hardness-based Attention Part Filtering Strategy (ApFS) to dynamically decompose objects into complementary parts based on the object structure and relevance to the instance. Then, SegAS shifts attention to the optimal discriminant region of each part under the guidance of hierarchical class semantics. Finally, a diversity loss is employed to emphasize the importance and distinction of different partial features. Extensive experiments validate SegAS’ effectiveness in multi-granularity recognition of three tasks.

List of keywords

Computer Vision -> CV: Representation learning
Computer Vision -> CV: Recognition (object detection, categorization)

4106

Core-Structures-Guided Multi-Modal Classification Neural Architecture Search

Pinhan Fu, Xinyan Liang, Tingjin Luo, Qian Guo, Yayu Zhang, Yuhua Qian

Multi-modal classification methods based on neural architecture search (NAS-MMC) can automatically learn a satisfactory classifier from a given multi-modal search space. However, as the number of multi-modal features and fusion operators increases, the complexity of the search space grows dramatically. Rapidly identifying a satisfactory fusion model from this vast space is very challenging. In this paper, we propose an efficient NAS-MMC method based on the idea of a shrink-and-expand search space, called core-structure-guided neural architecture search (CSG-NAS). Specifically, an evolutionary algorithm is first used to find core structures from a shrunk space (also called the core structure search space) determined by high-quality features and fusion operators. Then a local search algorithm is used to find the optimal MMC model from the expanded space determined by the discovered core structures and the remaining features and fusion operators. Moreover, a knowledge transfer strategy is introduced to further improve the overall performance and efficiency of the entire search process. Finally, extensive experimental results demonstrate the effectiveness of our CSG-NAS, which is superior in classification performance, training efficiency, and model complexity compared to state-of-the-art competitors on several public benchmark multi-modal tasks. The source code is available at https://github.com/fupinhan123/CSG-NAS.

List of keywords

Machine Learning -> ML: Multi-view learning
Machine Learning -> ML: Classification
Machine Learning -> ML: Evolutionary learning
Machine Learning -> ML: Multi-modal learning

4130

FedFa: A Fully Asynchronous Training Paradigm for Federated Learning

Haotian Xu, Zhaorui Zhang, Sheng Di, Benben Liu, Khalid Ayed Alharthi, Jiannong Cao

Federated learning has been identified as an efficient decentralized training paradigm for scaling machine learning model training to a large number of devices while guaranteeing the data privacy of the trainers. FedAvg has become a foundational parameter update strategy for federated learning, which promises to eliminate the effect of heterogeneous data across clients and guarantee convergence. However, the synchronization barriers for parameter updates in each communication round cause significant waiting time during training, slowing down the training procedure. Therefore, recent state-of-the-art solutions propose using semi-asynchronous approaches to mitigate the waiting time cost with guaranteed convergence. Nevertheless, emerging semi-asynchronous approaches are unable to eliminate the waiting time completely. We propose a fully asynchronous training paradigm called FedFa, which can guarantee model convergence and eliminate the waiting time completely for federated learning by using a few buffered results on the server for parameter updating. Further, we provide a theoretical proof of the convergence rate for our proposed FedFa. Extensive experimental results indicate that our approach effectively improves the training performance of federated learning, with up to 6x and 4x speedup compared to the state-of-the-art synchronous and semi-asynchronous strategies, respectively, while retaining high accuracy in both IID and Non-IID scenarios.

List of keywords

Machine Learning -> ML: Federated learning
Multidisciplinary Topics and Applications -> MTA: Software engineering

4131

Contrastive Transformer Masked Image Hashing for Degraded Image Retrieval

Xiaobo Shen, Haoyu Cai, Xiuwen Gong, Yuhui Zheng

Hashing leverages hash codes as compact image representations and has achieved promising performance in large-scale image retrieval due to its superiority in computation and storage. Degraded images are commonly found on social media platforms due to imperfections in the image capturing process, which poses new challenges to conventional image retrieval. To address this issue, we propose a new deep unsupervised hashing method, i.e., Contrastive Transformer Masked Image Hashing (CTMIH), for the challenging yet less studied task of degraded image retrieval. The key idea is to train CTMIH on transformed images and masked images to learn transform-invariant hash codes in an unsupervised manner to mitigate performance degradation caused by image degradation. CTMIH employs Vision Transformer (ViT) on image patches to capture distant semantic relevance. CTMIH introduces a cross-view debiased contrastive loss to align hash tokens of augmented views from the same image, and presents a patch-level semantic mask reconstruction loss to recover masked patch tokens. Extensive empirical studies on three benchmark datasets demonstrate the superiority of the proposed method over state-of-the-art methods on both degraded and normal image retrieval.

List of keywords

Computer Vision -> CV: Image and video retrieval
Machine Learning -> ML: Unsupervised learning

4133

Learning Multi-Granularity and Adaptive Representation for Knowledge Graph Reasoning

Ziyu Shang, Peng Wang, Wenjun Ke, Jiajun Liu, Hailang Huang, Guozheng Li, Chenxiao Wu, Jianghan Liu, Xiye Chen, Yining Li

Knowledge graph reasoning (KGR) aims to infer new factual triples from existing knowledge graphs (KGs). Recently, a new category of methods, possessing both transductive and inductive reasoning capabilities, has been proposed to tackle this task via learning entity-independent representations from local neighboring structures. However, these methods are plagued by inefficiency issues and they exclusively capture evidence from well-designed local structures, ignoring the correlation between the query and different structures within KGs. In this work, we first propose a novel multi-granularity and adaptive representation framework, MulGA, exploiting the connectivity subgraph to uniformly and hierarchically model query-related triples, relation paths, and subgraphs without explicitly extracting any graph structure, hence mitigating inefficiency issues. Second, we introduce a message-passing mechanism across connectivity subgraphs, enabling all entities to attain query-related structural representations of diverse granularity levels, i.e., triples and relation paths of different lengths. Third, we design a self-attention-based merging mechanism that allocates weights to different granularities and then consolidates them into subgraph granularity representations for reasoning. Systematic experiments on 15 benchmarks show that MulGA improves MRR over existing state-of-the-art methods by an average of 1.5% on transductive tasks and 2.7% on inductive tasks. Moreover, MulGA offers faster convergence and competitive inference time, and alleviates the over-smoothing prevalent in graph neural networks.

List of keywords

Data Mining -> DM: Knowledge graphs and knowledge base completion

4140

EFEVD: Enhanced Feature Extraction for Smart Contract Vulnerability Detection

Chi Jiang, Xihan Liu, Shenao Wang, Jinzhuo Liu, Yin Zhang

With the wide deployment of smart contracts, vulnerabilities in smart contracts pose a challenging risk to blockchain security. Deep learning-based vulnerability detection has become one of the most attractive solutions due to its capability to identify complex patterns and features. Existing methods mainly consider the features of contract code content, expert knowledge patterns, and contract code modality. To further enhance the features used for smart contract vulnerability detection, this paper attempts to identify community features from smart contracts with similar semantic and syntactic structure, and shared features from two related vulnerability detection tasks, i.e., vulnerability classification and localization. The experimental results verify that the proposed approach significantly outperforms state-of-the-art methods in terms of accuracy, recall, precision, and F1-score.

List of keywords

Machine Learning -> ML: Feature extraction, selection and dimensionality reduction
Data Mining -> DM: Exploratory data mining
Multidisciplinary Topics and Applications -> MTA: Security and privacy

4146

Revisiting Neural Networks for Continual Learning: An Architectural Perspective

Aojun Lu, Tao Feng, Hangjie Yuan, Xiaotian Song, Yanan Sun

Efforts to overcome catastrophic forgetting have primarily centered around developing more effective Continual Learning (CL) methods. In contrast, less attention has been devoted to analyzing the role of network architecture design (e.g., network depth, width, and components) in contributing to CL. This paper seeks to bridge this gap between network architecture design and CL, and to present a holistic study on the impact of network architectures on CL. This work considers architecture design at the network scaling level, i.e., width and depth, and also at the network component level, i.e., skip connections, global pooling layers, and down-sampling. In both cases, we first derive insights through systematically exploring how architectural designs affect CL. Then, grounded in these insights, we craft a specialized search space for CL and further propose a simple yet effective ArchCraft method to steer toward a CL-friendly architecture; namely, this method recrafts AlexNet/ResNet into AlexAC/ResAC. Experimental validation across various CL settings and scenarios demonstrates that the improved architectures are parameter-efficient, achieving state-of-the-art CL performance while being 86%, 61%, and 97% more compact in terms of parameters than the naive CL architecture in Task IL and Class IL. Code is available at https://github.com/byyx666/ArchCraft.

List of keywords

Machine Learning -> ML: Incremental learning
Computer Vision -> CV: Machine learning for vision

4147

Look-ahead Search on Top of Policy Networks in Imperfect Information Games

Ondřej Kubíček, Neil Burch, Viliam Lisy

Search at test time is often used to improve the performance of reinforcement learning algorithms. Performing theoretically sound search in fully adversarial two-player games with imperfect information is notoriously difficult and requires a complicated training process. We present a method for adding test-time search to an arbitrary policy-gradient algorithm that learns from sampled trajectories. Besides the policy network, the algorithm trains an additional critic network, which estimates the expected values of players following various transformations of the policies given by the policy network. These values are then used for depth-limited search. We show how the values from this critic can create a value function for imperfect information games. Moreover, they can be used to compute the summary statistics necessary to start the search from an arbitrary decision point in the game. The presented algorithm is scalable to very large games since it does not require any search during training. We evaluate the algorithm’s performance when trained along Regularized Nash Dynamics, and we evaluate the benefit of using the search in the standard benchmark game of Leduc hold’em, multiple variants of imperfect information Goofspiel, and Battleships.

List of keywords

Machine Learning -> ML: Multiagent Reinforcement Learning
Agent-based and Multi-agent Systems -> MAS: Multi-agent learning
Search -> S: Game playing
Search -> S: Local search

4155

Efficient Cost-Minimization Schemes for Electrical Energy Demand Satisfaction by Prosumers in Microgrids with Battery Storage Capabilities

Gergely Csáji, Matthias Mnich, Laura Codazzi

We introduce and study various models for satisfying electrical energy demands of prosumers in a microgrid, while optimizing their costs. Each prosumer has individual demands of electrical energy, which can vary day-by-day, and which they can satisfy by generating electrical energy through a self-operated mini power plant like a solar panel, by buying from an external energy provider such as the main grid, or by trading with other prosumers. Our models take into account two key aspects motivated by real-life scenarios: first, we consider a daily volatility of prices for buying and selling the energy, and second, the possibility to store the self-generated energy in a battery of finite capacity to be either self-consumed or sold to other prosumers in the future. We provide a thorough complexity analysis, as well as efficient algorithms, so that prosumers can minimize their overall cost over the entire time horizon. As a byproduct, we also solve a new, generalized version of the KNAPSACK problem which may be of independent interest. We complement our theoretical findings by extensive experimental evaluations on realistic data sets.

List of keywords

Constraint Satisfaction and Optimization -> CSO: Constraint optimization problems
Planning and Scheduling -> PS: Planning algorithms
Planning and Scheduling -> PS: Planning under uncertainty
Planning and Scheduling -> PS: Theoretical foundations of planning

4164

Couples Can Be Tractable: New Algorithms and Hardness Results for the Hospitals/Residents Problem with Couples

Gergely Csáji, David Manlove, Iain McBride, James Trimble

In this paper we study the Hospitals/Residents problem with Couples (HRC), where a solution is a stable matching or a report that none exists. We present a novel polynomial-time algorithm that can find a near-feasible stable matching (adjusting the hospitals’ capacities by at most 1) in an HRC instance where the couples’ preferences are sub-responsive (i.e., if one member switches to a better hospital, then the couple also improves if the new pair is also acceptable) and sub-complete (i.e., each pair of hospitals that are individually acceptable to both members are jointly acceptable for the couple) by reducing it to an instance of the Stable Fixtures problem. We also present a polynomial-time algorithm for HRC in a sub-responsive, sub-complete instance that is a Dual Market, or where all couples are one of several possible types. Our polynomial-time solvability results greatly expand the class of known tractable instances of HRC. We complement our algorithms with several hardness results. We show that HRC with sub-responsive and sub-complete couples is NP-hard, even with other strong restrictions. We also show that HRC with a Dual Market is NP-hard under several simultaneous restrictions.

List of keywords

Game Theory and Economic Paradigms -> GTEP: Mechanism design
Game Theory and Economic Paradigms -> GTEP: Computational social choice

4175

Popular and Dominant Matchings with Uncertain and Multimodal Preferences

Gergely Csáji

We study the Popular Matching (PM) problem in multiple models, where the preferences of the agents in the instance may change or may be unknown or uncertain. In particular, we study an Uncertainty model, where each agent has a possible set of preference lists, a Multilayer model, where there are layers of preference profiles, and a Robust popularity model, where any agent may move some other agents up or down some places in his preference list. Our goal is always to find a matching that is popular in any possible preference profile. We study both one-sided (only one class of the agents has preferences) and two-sided bipartite markets. In the one-sided model, we show that all our problems can be solved in polynomial time by utilizing the structure of popular matchings. We also obtain nice structural results. With two-sided preferences, we show that all three above models lead to NP-hard questions for popular matchings. By using the connection between dominant matchings and stable matchings, we show that in the robust and uncertainty models, a certainly dominant matching in all possible preference profiles can be found in polynomial time, whereas in the multilayer model, the problem remains NP-hard for dominant matchings too. We also answer an open question about d-robust stable matchings.

List of keywords

Game Theory and Economic Paradigms -> GTEP: Mechanism design
Game Theory and Economic Paradigms -> GTEP: Computational social choice

4176

Dynamic against Dynamic: An Open-Set Self-Learning Framework

Haifeng Yang, Chuanxing Geng, Pong C. Yuen, Songcan Chen

In open set recognition, existing methods generally learn statically fixed decision boundaries to reject unknown classes. Though they have achieved promising results, such decision boundaries are evidently insufficient for universal unknown classes in dynamic and open scenarios, as these classes can potentially appear at any position in the feature space. Moreover, these methods simply reject unknown class samples during testing without making any effective use of them. In fact, such samples can fully constitute the true instantiated representation of the unknown classes to further enhance the model’s performance. To address these issues, this paper proposes a novel dynamic against dynamic idea, i.e., a dynamic method against the dynamically changing open-set world, where an open-set self-learning (OSSL) framework is correspondingly developed. OSSL starts with a good closed-set classifier trained on known classes and utilizes available test samples for model adaptation during testing, thus gaining adaptability to changing data distributions. In particular, a novel self-matching module is designed for OSSL, which achieves the adaptation by automatically identifying known class samples while rejecting unknown class samples, which are further utilized to enhance the discriminability of the model as the instantiated representation of unknown classes. Our method establishes new performance milestones on almost all standard and cross-data benchmarks.

List of keywords

Machine Learning -> ML: Supervised Learning
Computer Vision -> CV: Machine learning for vision
Machine Learning -> ML: Classification

4177

Enhancing Fine-Grained Urban Flow Inference via Incremental Neural Operator

Qiang Gao, Xiaolong Song, Li Huang, Goce Trajcevski, Fan Zhou, Xueqin Chen

Fine-grained urban flow inference (FUFI), which involves inferring fine-grained flow maps from their coarse-grained counterparts, is of tremendous interest in the realm of sustainable urban traffic services. To address FUFI, existing solutions mainly concentrate on investigating spatial dependencies, introducing external factors, reducing excessive memory costs, etc., while rarely considering the catastrophic forgetting (CF) problem. Motivated by recent operator learning, we present an Urban Neural Operator solution with Incremental learning (UNOI), primarily seeking to learn grained-invariant solutions for FUFI in addition to addressing CF. Specifically, we devise an urban neural operator (UNO) in UNOI that learns mappings between approximation spaces by treating the different-grained flows as continuous functions, allowing a more flexible capture of spatial correlations. Furthermore, the phenomenon of CF behind time-related flows could hinder the capture of flow dynamics. Thus, UNOI mitigates CF concerns as well as privacy issues by placing UNO blocks in two incremental settings, i.e., flow-related and task-related. Experimental results on large-scale real-world datasets demonstrate the superiority of our proposed solution against the baselines.

List of keywords

Multidisciplinary Topics and Applications -> MTA: Sensor networks and smart cities
Data Mining -> DM: Applications
Data Mining -> DM: Mining spatial and/or temporal data
Multidisciplinary Topics and Applications -> MTA: Transportation

4184

MuEP: A Multimodal Benchmark for Embodied Planning with Foundation Models

Kanxue Li, Baosheng Yu, Qi Zheng, Yibing Zhan, Yuhui Zhang, Tianle Zhang, Yijun Yang, Yue Chen, Lei Sun, Qiong Cao, Li Shen, Lusong Li, Dapeng Tao, Xiaodong He

Foundation models have demonstrated significant emergent abilities, holding great promise for enhancing embodied agents’ reasoning and planning capacities. However, the absence of a comprehensive benchmark for evaluating embodied agents with multimodal observations in complex environments remains a notable gap. In this paper, we present MuEP, a comprehensive Multimodal benchmark for Embodied Planning. MuEP facilitates the evaluation of multimodal and multi-turn interactions of embodied agents in complex scenes, incorporating fine-grained evaluation metrics that provide insights into the performance of embodied agents throughout each task. Furthermore, we evaluate embodied agents with recent state-of-the-art foundation models, including large language models (LLMs) and large multimodal models (LMMs), on the proposed benchmark. Experimental results show that foundation models based on textual representations of environments usually outperform their visual or multimodal counterparts, suggesting a gap in embodied planning abilities with multimodal observations. We also find that control language generation is an indispensable ability beyond common-sense knowledge for accurate embodied task completion. We hope the proposed MuEP benchmark can contribute to the advancement of embodied AI with foundation models.

List of keywords

Agent-based and Multi-agent Systems -> MAS: Agent-based simulation and emergence
Computer Vision -> CV: Embodied vision: Active agents, simulation
Multidisciplinary Topics and Applications -> MTA: Databases

4197

ABM: Attention before Manipulation

Fan Zhuo, Tiffany He, Fei Yu, Pengteng Li, Zheyi Zhao, Xilong Sun

Vision-language models (VLMs) show promising generalization and zero-shot capabilities, offering a potential solution to the impracticality and cost of enabling robots to comprehend diverse human instructions and scene semantics in the real world. Most existing approaches directly integrate the semantic representations from pre-trained VLMs with policy learning. However, these methods are limited to the labeled data they learn from, resulting in poor generalization to unseen instructions and objects. To address the above limitation, we propose a simple method called "Attention Before Manipulation" (ABM), which fully leverages the object knowledge encoded in CLIP to extract information about the target object in the image. It constructs an Object Mask Field, serving as a better representation of the target object for the model to separate visual grounding from action prediction and acquire specific manipulation skills effectively. We train ABM for 8 RLBench tasks and 2 real-world tasks via behavior cloning. Extensive experiments show that our method significantly outperforms the baselines in the zero-shot and compositional generalization experiment settings.

List of keywords

Computer Vision -> CV: Embodied vision: Active agents, simulation
Robotics -> ROB: Applications
Robotics -> ROB: Manipulation
Robotics -> ROB: Robotics and vision

4200

Multi-level Disentangling Network for Cross-Subject Emotion Recognition Based on Multimodal Physiological Signals

Ziyu Jia, Fengming Zhao, Yuzhe Guo, Hairong Chen, Tianzi Jiang

Emotion recognition based on multimodal physiological signals is attracting more and more attention. However, the consistency and heterogeneity of multimodal physiological signals, as well as individual differences across subjects, pose two significant challenges. In this paper, we propose a Multi-level Disentangling Network named MDNet for cross-subject emotion recognition based on multimodal physiological signals. Specifically, MDNet consists of a modality-level disentangling module and a subject-level disentangling module. The modality-level disentangling module projects multimodal physiological signals into modality-invariant subspace and modality-specific subspace, capturing modality-invariant features and modality-specific features. The subject-level disentangling module separates subject-shared features and subject-private features among different subjects from multimodal data, which facilitates cross-subject emotion recognition. Experiments on two multimodal emotion datasets demonstrate that MDNet outperforms other state-of-the-art baselines.

List of keywords

Humans and AI -> HAI: Applications
Machine Learning -> ML: Multi-modal learning

4207

Reinforcement Learning from Diverse Human Preferences

Wanqi Xue, Bo An, Shuicheng Yan, Zhongwen Xu

The complexity of designing reward functions has been a major obstacle to the wide application of deep reinforcement learning (RL) techniques. Describing an agent’s desired behaviors and properties can be difficult, even for experts. A new paradigm called reinforcement learning from human preferences (or preference-based RL) has emerged as a promising solution, in which reward functions are learned from human preference labels among behavior trajectories. However, existing methods for preference-based RL are limited by the need for accurate oracle preference labels. This paper addresses this limitation by developing a method for learning from diverse human preferences. The key idea is to stabilize reward learning through regularization and correction in a latent space. To ensure temporal consistency, a strong constraint is imposed on the reward model that forces its latent space to be close to a non-parameterized distribution. Additionally, a confidence-based reward model ensembling method is designed to generate more stable and reliable predictions. The proposed method is tested on a variety of tasks in DMcontrol and Meta-world and has shown consistent and significant improvements over existing preference-based RL algorithms when learning from diverse feedback, paving the way for real-world applications of RL methods.

List of keywords

Machine Learning -> ML: Reinforcement learning
Machine Learning -> ML: Applications

4215

MISA: MIning Saliency-Aware Semantic Prior for Box Supervised Instance Segmentation

Hao Zhu, Yan Zhu, Jiayu Xiao, Yike Ma, Yucheng Zhang, Jintao Li, Feng Dai

Box supervised instance segmentation (BSIS) aims to achieve an effective trade-off between annotation costs and model performance by solely relying on bounding box annotations during the training process. However, we observe that the BSIS model is bottlenecked by the intricate objective under limited guidance, and tends to sacrifice segmentation capability in order to effectively recognize multiple instances. To boost the BSIS model’s perceptual ability for object shape and contour, we introduce MISA, that is, MIning Saliency-Aware semantic prior from a well-optimized box supervised semantic segmentation (BSSS) network, and incorporating cross-model guidance into the learning process of BSIS. Specifically, we first design a Frequency-Space Distillation (FSD) module to extract assorted salient prior knowledge from the BSSS model, and perform cross-model alignment for transferring the prior to the BSIS model. Furthermore, we introduce Semantic-Enhanced Pairwise Affinity (SEPA), which borrows the object perceptual ability of the BSSS model to emphasize the contribution of salient objects for pairwise affinity, providing more accurate guidance for the BSIS network. Extensive experiments show that our proposed MISA consistently surpasses the existing state-of-the-art methods by a large margin in the BSIS scenario.

List of keywords

Computer Vision -> CV: Segmentation
Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning

4216

SACNN: Self Attention-based Convolutional Neural Network for Fraudulent Behaviour Detection in Sports

Maxx Richard Rahman, Lotfy Abdel Khaliq, Thomas Piper, Hans Geyer, Tristan Equey, Norbert Baume, Reid Aikin, Wolfgang Maass

Doping practices in sports by unscrupulous athletes have been an important societal issue for several decades. Recently, sample swapping has been raised as a potential practice performed by athletes to swap their doped samples with clean samples to evade a positive doping test. So far, the only proven method to detect such cases is by performing DNA analysis on samples. However, it is expensive and time-consuming, and it exceeds the budgetary limits of anti-doping organisations when applied to all the samples collected during sports events. Therefore, in this paper, we propose a self attention-based convolutional neural network (SACNN) that incorporates both spatial and temporal behaviour of the longitudinal profile and generates embedding maps for solving the fraud detection problem in sports. We conduct extensive experiments on real-world datasets. The results show that SACNN outperforms other state-of-the-art baseline models for sequential anomaly detection. Moreover, we conduct a study with domain experts on real-world profiles using both DNA analysis and our proposed method; the result demonstrates the effectiveness of our proposed method and the impact it could bring to society.

List of keywords

Multidisciplinary Topics and Applications -> MTA: Sports
Machine Learning -> ML: Applications
AI Ethics, Trust, Fairness -> ETF: Societal impact of AI

4221

SPGNet: A Shape-prior Guided Network for Medical Image Segmentation

Zhengxuan Song, Xun Liu, Wenhao Zhang, Yongyi Gong, Tianyong Hao, Kun Zeng

Given the intricacy and variability of anatomical structures in medical images, some methods employ shape priors to constrain segmentation. However, limited by the representational capability of these priors, existing approaches often struggle to capture diverse target structure morphologies. To address this, we propose SPGNet to guide segmentation by fully exploiting category-specific shape knowledge. The key idea is to enable the network to perceive data shape distributions by learning from statistical shape models. We uncover shape relationships via clustering and obtain statistical prior knowledge using principal component analysis. Our dual-path network comprises a segmentation path and a shape-prior path that collaboratively discern and harness shape prior distribution to improve segmentation robustness. The shape-prior path further serves to refine shapes iteratively by cropping features from the segmentation path, guiding the segmentation path and directing attention specifically to the edges of shapes which could be most significantly susceptible to segmentation error. We demonstrate superior performance on chest X-ray and breast ultrasound benchmarks.

List of keywords

Computer Vision -> CV: Biomedical image analysis
Computer Vision -> CV: Segmentation

4230

CAP: A Context-Aware Neural Predictor for NAS

Han Ji, Yuqi Feng, Yanan Sun

Neural predictors are effective in boosting the time-consuming performance evaluation stage in neural architecture search (NAS), owing to their direct estimation of unseen architectures. Despite the effectiveness, training a powerful neural predictor with fewer annotated architectures remains a huge challenge. In this paper, we propose a context-aware neural predictor (CAP) which only needs a few annotated architectures for training based on the contextual information from the architectures. Specifically, the input architectures are encoded into graphs and the predictor infers the contextual structure around the nodes inside each graph. Then, enhanced by the proposed context-aware self-supervised task, the pre-trained predictor can obtain expressive and generalizable representations of architectures. Therefore, only a few annotated architectures are sufficient for training. Experimental results in different search spaces demonstrate the superior performance of CAP compared with state-of-the-art neural predictors. In particular, CAP can rank architectures precisely at the budget of only 172 ($0.04\%$ of the entire search space) annotated architectures in NAS-Bench-101. Moreover, CAP can help find promising architectures in both NAS-Bench-101 and DARTS search spaces with $94.18\%$ and $97.58\%$ top-1 test accuracy on CIFAR-10, respectively.

List of keywords

Machine Learning -> ML: Evaluation
Machine Learning -> ML: Automated machine learning

4235

WeatherGNN: Exploiting Meteo- and Spatial-Dependencies for Local Numerical Weather Prediction Bias-Correction

Binqing Wu, Weiqi Chen, Wenwei Wang, Bingqing Peng, Liang Sun, Ling Chen

Due to insufficient local area information, numerical weather prediction (NWP) may yield biases for specific areas. Previous studies correct biases mainly by employing handcrafted features or applying data-driven methods intuitively, overlooking the complicated dependencies between weather factors and between areas. To address this issue, we propose WeatherGNN, a local NWP bias-correction method that utilizes Graph Neural Networks (GNNs) to exploit meteorological dependencies and spatial dependencies under the guidance of domain knowledge. Specifically, we introduce a factor GNN to capture area-specific meteorological dependencies adaptively based on spatial heterogeneity. In addition, we introduce a fast hierarchical GNN to capture dynamic spatial dependencies efficiently guided by Tobler’s first and second laws of geography. Our experimental results on two real-world datasets demonstrate that WeatherGNN achieves state-of-the-art performance, outperforming the best baseline by an average of 4.75% on RMSE.

List of keywords

Data Mining -> DM: Applications
Data Mining -> DM: Mining spatial and/or temporal data

4250

A Density-driven Iterative Prototype Optimization for Transductive Few-shot Learning

Jingcong Li, Chunjin Ye, Fei Wang, Jiahui Pan

Few-shot learning (FSL) poses a considerable challenge since it aims to improve the model generalization ability with limited labeled data. Previous works usually attempt to construct class-specific prototypes and then predict novel classes using these prototypes. However, the feature distribution represented by the limited labeled data is coarse-grained, leading to a large information gap between the labeled and unlabeled data as well as biases in the prototypes. In this paper, we investigate the correlation between sample quality and density, and propose a Density-driven Iterative Prototype Optimization to acquire high-quality prototypes and further improve few-shot learning performance. Specifically, the proposed method consists of two optimization strategies. The similarity-evaluating strategy captures the information gap between the labeled and unlabeled data by reshaping the feature manifold for the novel feature distribution. The density-driven strategy iteratively refines the prototypes in the direction of density growth. The proposed method could reach or even exceed the state-of-the-art performance on four benchmark datasets, including mini-ImageNet, tiered-ImageNet, CUB, and CIFAR-FS. The code will be available soon at https://github.com/tailofcat/DIPO.

List of keywords

Machine Learning -> ML: Few-shot learning
Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning

4255

X-Light: Cross-City Traffic Signal Control Using Transformer on Transformer as Meta Multi-Agent Reinforcement Learner

Haoyuan Jiang, Ziyue Li, Hua Wei, Xuantang Xiong, Jingqing Ruan, Jiaming Lu, Hangyu Mao, Rui Zhao

The effectiveness of traffic light control has been significantly improved by current reinforcement learning-based approaches via better cooperation among multiple traffic lights. However, an issue persists: how to obtain a multi-agent traffic signal control algorithm with remarkable transferability across diverse cities? In this paper, we propose a Transformer on Transformer (TonT) model for cross-city meta multi-agent traffic signal control, named X-Light. We input the full Markov Decision Process trajectories; the Lower Transformer aggregates the states, actions, and rewards among the target intersection and its neighbors within a city, and the Upper Transformer learns the general decision trajectories across different cities. This dual-level approach bolsters the model’s robust generalization and transferability. Notably, when directly transferring to unseen scenarios, our method surpasses all baseline methods by +7.91% on average, and even +16.3% in some cases, yielding the best results.

List of keywords

Agent-based and Multi-agent Systems -> MAS: Applications
Machine Learning -> ML: Meta-learning
Machine Learning -> ML: Multi-task and transfer learning
Multidisciplinary Topics and Applications -> MTA: Transportation

4276

Solving Quantified Boolean Formulas with Few Existential Variables

Leif Eriksson, Victor Lagerkvist, Sebastian Ordyniak, George Osipov, Fahad Panolan, Mateusz Rychlicki

The quantified Boolean formula (QBF) problem is an important decision problem generally viewed as the archetype for PSPACE-completeness. Many problems of central interest in AI are in general not included in NP, e.g., planning, model checking, and non-monotonic reasoning, and for such problems QBF has successfully been used as a modelling tool. However, solvers for QBF are not as advanced as state-of-the-art SAT solvers, which has prevented QBF from becoming a universal modelling language for PSPACE-complete problems. A theoretical explanation is that QBF (as well as many other PSPACE-complete problems) lacks natural parameters guaranteeing fixed-parameter tractability (FPT). In this paper we tackle this problem and consider a simple but overlooked parameter: the number of existentially quantified variables. This natural parameter is virtually unexplored in the literature, which one might find surprising given the general scarcity of FPT algorithms for QBF. Via this parameterization we then develop a novel FPT algorithm applicable to QBF instances in conjunctive normal form (CNF) of bounded clause length. We complement this by a W[1]-hardness result for QBF in CNF of unbounded clause length as well as sharper lower bounds for the bounded arity case under the (strong) exponential-time hypothesis.

List of keywords

Constraint Satisfaction and Optimization -> CSO: Satisfiability
Constraint Satisfaction and Optimization -> CSO: Constraint satisfaction
Knowledge Representation and Reasoning -> KRR: Computational complexity of reasoning

4283

A Top-Down Tree Model Counter for Quantified Boolean Formulas

Jean-Marie Lagniez, Florent Capelli, Andreas Plank, Martina Seidl

This paper addresses the challenge of solution counting for Quantified Boolean Formulas (QBFs), a task distinct from the well-established model counting problem for SAT (#SAT). Unlike SAT, where models are straightforward assignments to Boolean variables, QBF solution counting involves tree models that capture dependencies among variables within different quantifier blocks. We present a comprehensive top-down tree model counter capable of handling diverse satisfiable QBF formulas. Emphasizing the critical role of the branching heuristic, which must consider variables in the correct order according to quantification blocks, we further demonstrate the importance of addressing connected components, free variables, and caching. Experimental results indicate that our proposed approach for counting tree models of QBF formulas is highly efficient in practice, surpassing existing state-of-the-art methods designed for this specific purpose.

List of keywords

Constraint Satisfaction and Optimization -> CSO: Satisfiability
Constraint Satisfaction and Optimization -> CSO: Solvers and tools

4288

Learning Fair Representations for Recommendation via Information Bottleneck Principle

Junsong Xie, Yonghui Yang, Zihan Wang, Le Wu

User-oriented recommender systems (RS) characterize users’ preferences based on observed behaviors and are widely deployed in personalized services. However, RS may unintentionally capture biases related to sensitive attributes (e.g., gender) from behavioral data, leading to unfairness issues and discrimination against particular groups (e.g., females). Adversarial training is a popular technique for fairness-aware RS, filtering sensitive information in user modeling. Despite advancements in fairness, achieving a good accuracy-fairness trade-off remains a challenge in adversarial training. In this paper, we investigate fair representation learning from a novel information theory perspective. Specifically, we propose FairIB, a model-agnostic Fair recommendation method via the Information Bottleneck principle. The learning objective of FairIB is to maximize the mutual information between user representations and observed interactions, while simultaneously minimizing it between user representations and sensitive attributes. This approach facilitates the capturing of essential collaborative signals in user representations while mitigating the inclusion of unnecessary sensitive information. Empirical studies on two real-world datasets demonstrate the effectiveness of the proposed FairIB, which significantly improves fairness while maintaining competitive recommendation accuracy, in both single and multiple sensitive attribute scenarios. The code is available at https://github.com/jsxie9/IJCAI_FairIB.
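
The learning objective described in this abstract can be written schematically as follows. This is only a sketch of the stated idea, not the paper's exact formulation: the symbols $Z$ (user representations), $R$ (observed interactions), $S$ (sensitive attributes), $\theta$ (model parameters), and the trade-off weight $\beta > 0$ are assumed here for illustration, and a practical method would likely optimize tractable bounds on these mutual-information terms.

```latex
\max_{\theta} \; I(Z; R) \;-\; \beta \, I(Z; S)
```

The first term rewards representations that are informative about interactions; the second penalizes information they carry about sensitive attributes.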

List of keywords

Data Mining -> DM: Recommender systems
AI Ethics, Trust, Fairness -> ETF: Fairness and diversity
Data Mining -> DM: Collaborative filtering
Machine Learning -> ML: Representation learning

4291

Heterogeneous Graph Transformer with Poly-Tokenization

Zhiyuan Lu, Yuan Fang, Cheng Yang, Chuan Shi

Graph neural networks have shown widespread success for learning on graphs, but they still face fundamental drawbacks, such as limited expressive power, over-smoothing, and over-squashing. Meanwhile, the transformer architecture offers a potential solution to these issues. However, existing graph transformers primarily cater to homogeneous graphs and are unable to model the intricate semantics of heterogeneous graphs. Moreover, unlike small molecular graphs where the entire graph can be considered as the receptive field in graph transformers, real-world heterogeneous graphs comprise a significantly larger number of nodes and cannot be entirely treated as such. Consequently, existing graph transformers struggle to capture the long-range dependencies in these complex heterogeneous graphs. To address these two limitations, we present Poly-tokenized Heterogeneous Graph Transformer (PHGT), a novel transformer-based heterogeneous graph model. In addition to traditional node tokens, PHGT introduces a novel poly-token design with two more token types: semantic tokens and global tokens. Semantic tokens encapsulate high-order heterogeneous semantic relationships, while global tokens capture semantic-aware long-range interactions. We validate the effectiveness of PHGT through extensive experiments on standardized heterogeneous graph benchmarks, demonstrating significant improvements over state-of-the-art heterogeneous graph representation learning models.

List of keywords

Data Mining -> DM: Mining heterogenous data
Data Mining -> DM: Mining graphs

4300

Exploring Learngene via Stage-wise Weight Sharing for Initializing Variable-sized Models

Shi-Yu Xia, Wenxuan Zhu, Xu Yang, Xin Geng

In practice, we usually need to build variable-sized models adapted to diverse resource constraints in different application scenarios, where weight initialization is an important step prior to training. The recently introduced Learngene framework first learns one compact part, termed learngene, from a large well-trained model, after which learngene is expanded to initialize variable-sized models. In this paper, we start by analysing the importance of guidance for the expansion of well-trained learngene layers, inspiring the design of a simple but highly effective Learngene approach termed SWS (Stage-wise Weight Sharing), where both learngene layers and their learning process critically contribute to providing knowledge and guidance for initializing models at varying scales. Specifically, to learn learngene layers, we build an auxiliary model comprising multiple stages where the layer weights in each stage are shared, after which we train it through distillation. Subsequently, we expand these learngene layers containing stage information at their corresponding stage to initialize models of variable depths. Extensive experiments on ImageNet-1K demonstrate that SWS achieves consistently better performance compared to many models trained from scratch, while reducing total training costs by around 6.6×. In some cases, SWS performs better after only 1 epoch of tuning. When initializing variable-sized models adapted to different resource constraints, SWS achieves better results while reducing the parameters stored to initialize these models by around 20× and pre-training costs by around 10×, in contrast to the pre-training and fine-tuning approach.

List of keywords

Machine Learning -> ML: Deep learning architectures
Machine Learning -> ML: Classification

4308

Enhancing Cross-Modal Retrieval via Visual-Textual Prompt Hashing

Bingzhi Chen, Zhongqi Wu, Yishu Liu, Biqing Zeng, Guangming Lu, Zheng Zhang

Cross-modal hashing has garnered considerable research interest due to its rapid retrieval and low storage costs. However, the majority of existing methods suffer from the limitations of context loss and information redundancy, particularly in simulated textual environments enriched with manually annotated tags or virtual descriptions. To mitigate these issues, we propose a novel Visual-Textual Prompt Hashing (VTPH) that aims to bridge the gap between simulated textual and visual modalities within a unified prompt optimization paradigm for cross-modal retrieval. By seamlessly integrating robust reasoning capabilities inherent in large-scale models, we design the visual and textual alignment prompt mechanisms to collaboratively enhance the contextual awareness and semantic capabilities embedded within simulated textual features. Furthermore, an affinity-adaptive contrastive learning strategy is dedicated to dynamically recalibrating the semantic interaction between visual and textual modalities by modeling the nuanced heterogeneity and semantic gaps between simulated and real-world textual environments. To the best of our knowledge, this is the first attempt to integrate both visual and textual prompt learning into cross-modal hashing, facilitating the efficacy of semantic coherence between diverse modalities. Extensive experiments on multiple benchmark datasets consistently demonstrate the superiority and robustness of our VTPH method over state-of-the-art competitors.

List of keywords

Computer Vision -> CV: Image and video retrieval
Computer Vision -> CV: Multimodal learning
Computer Vision -> CV: Scene analysis and understanding
Computer Vision -> CV: Vision, language and reasoning

4316

Individual-Rationality in Topological Distance Games Is Surprisingly Hard

Argyrios Deligkas, Eduard Eiben, Dušan Knop, Šimon Schierreich

In the recently introduced topological distance games, strategic agents need to be assigned to a subset of vertices of a topology. In the assignment, the utility of an agent depends on both the agent’s inherent utilities for other agents and its distance from them on the topology. We study the computational complexity of finding individually-rational outcomes; this notion is widely assumed to be the very minimal stability requirement and requires that the utility of every agent in a solution is non-negative. We perform a comprehensive study of the problem’s complexity, and we prove that even in very basic cases, deciding whether an individually-rational solution exists is intractable. To reach at least some tractability, one needs to combine multiple restrictions of the input instance, including the number of agents and the topology and the influence of distant agents on the utility.

List of keywords

Game Theory and Economic Paradigms -> GTEP: Computational social choice

4319

Approximate Algorithms for $k$-Sparse Wasserstein Barycenter with Outliers

Qingyuan Yang, Hu Ding

Wasserstein Barycenter (WB) is one of the most fundamental optimization problems in optimal transportation. Given a set of distributions, the goal of WB is to find a new distribution that minimizes the average Wasserstein distance to them. The problem becomes even harder if we restrict the solution to be “$k$-sparse”. In this paper, we study the $k$-sparse WB problem in the presence of outliers, which is a more practical setting since real-world data often contains noise. Existing WB algorithms cannot be directly extended to handle the case with outliers, and thus novel ideas are needed. First, we investigate the relation between $k$-sparse WB with outliers and the clustering (with outliers) problems. In particular, we propose a clustering-based LP method that yields a constant approximation factor for the $k$-sparse WB with outliers problem. Further, we utilize the coreset technique to achieve a $(1+\epsilon)$-approximation factor for any $\epsilon>0$, if the dimensionality is not high. Finally, we conduct experiments for our proposed algorithms and illustrate their efficiency in practice.
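
For orientation, the outlier-free problem described in this abstract can be sketched as the optimization below. The notation is assumed for illustration and does not come from the abstract: $\mu_1,\dots,\mu_m$ are the input distributions, $W$ is a Wasserstein distance, and "$k$-sparse" is read as a support of at most $k$ points. The variant with outliers and the exact order or power of the distance are left unspecified here.

```latex
\min_{\nu \,:\, |\mathrm{supp}(\nu)| \le k} \;\; \frac{1}{m} \sum_{i=1}^{m} W(\mu_i, \nu)
```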

List of keywords

Machine Learning -> ML: Optimization
Data Mining -> DM: Anomaly/outlier detection
Machine Learning -> ML: Clustering

4335

Parameterized Analysis of Bribery in Challenge the Champ Tournaments

Juhi Chaudhary, Hendrik Molter, Meirav Zehavi

Challenge the champ tournaments are one of the simplest forms of competition, where an (initially selected) champ is repeatedly challenged by other players. If a player beats the champ, then that player is considered the new (current) champ. Each player in the competition challenges the current champ once in a fixed order. The champ of the last round is considered the winner of the tournament. We investigate a setting where players can be bribed to lower their winning probability against the initial champ. The goal is to maximize the probability of the initial champ winning the tournament by bribing the other players, while not exceeding a given budget for the bribes. Mattei et al. [Journal of Applied Logic, 2015] showed that the problem can be solved in pseudo-polynomial time, and that it is in XP when parameterized by the number of players. We show that the problem is weakly NP-hard and W[1]-hard when parameterized by the number of players. On the algorithmic side, we show that the problem is fixed-parameter tractable when parameterized either by the number of different bribe values or the number of different probability values. To this end, we establish several results that are of independent interest. In particular, we show that the product knapsack problem is W[1]-hard when parameterized by the number of items in the knapsack, and that constructive bribery for cup tournaments is W[1]-hard when parameterized by the number of players. Furthermore, we present a novel way of designing mixed integer linear programs, ensuring optimal solutions where all variables are integers.
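
The bribery setting in this abstract can be illustrated with a small brute-force sketch. It is not the authors' algorithm, and the modelling details are assumptions made for illustration: matches are treated as independent, the initial champ wins the tournament only by beating every challenger, and each player has a hypothetical list of `(cost, beat_prob)` options, where the zero-cost option means no bribe. The function names are likewise made up.

```python
from itertools import product
from math import prod


def champ_win_probability(beat_probs):
    """Probability that the initial champ beats every challenger.

    beat_probs holds, per challenger, the probability of beating the
    initial champ; matches are assumed independent.
    """
    return prod(1.0 - p for p in beat_probs)


def best_bribery(options, budget):
    """Exhaustively pick one (cost, beat_prob) option per player.

    options[i] lists the hypothetical choices for player i and should
    include a zero-cost entry for "no bribe". Runs in exponential time,
    so it is only suitable for tiny instances.
    """
    best_prob, best_choice = -1.0, None
    for choice in product(*options):
        total_cost = sum(cost for cost, _ in choice)
        if total_cost > budget:
            continue  # this combination of bribes exceeds the budget
        prob = champ_win_probability(p for _, p in choice)
        if prob > best_prob:
            best_prob, best_choice = prob, choice
    return best_prob, best_choice


if __name__ == "__main__":
    # Two challengers; each can be left alone or bribed once.
    example = [[(0, 0.5), (2, 0.2)], [(0, 0.4), (3, 0.1)]]
    print(best_bribery(example, budget=3))
```

Because the win probability is a product over players, the budgeted choice resembles a product knapsack, which is the connection the abstract draws.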

List of keywords

Game Theory and Economic Paradigms -> GTEP: Computational social choice

4348

Enhancing Cooperation through Selective Interaction and Long-term Experiences in Multi-Agent Reinforcement Learning

Tianyu Ren, Xiao-Jun Zeng

The significance of network structures in promoting group cooperation within social dilemmas has been widely recognized. Prior studies attribute this facilitation to the assortment of strategies driven by spatial interactions. Although reinforcement learning has been employed to investigate the impact of dynamic interaction on the evolution of cooperation, there remains a lack of understanding about how agents develop neighbour selection behaviours and the formation of strategic assortment within an explicit interaction structure. To address this, our study introduces a computational framework based on multi-agent reinforcement learning in the spatial Prisoner’s Dilemma game. This framework allows agents to select dilemma strategies and interacting neighbours based on their long-term experiences, differing from existing research that relies on preset social norms or external incentives. By modelling each agent using two distinct Q-networks, we disentangle the coevolutionary dynamics between cooperation and interaction. The results indicate that long-term experience enables agents to develop the ability to identify non-cooperative neighbours and exhibit a preference for interaction with cooperative ones. This emergent self-organizing behaviour leads to the clustering of agents with similar strategies, thereby increasing network reciprocity and enhancing group cooperation.

List of keywords

Agent-based and Multi-agent Systems -> MAS: Coordination and cooperation
Game Theory and Economic Paradigms -> GTEP: Cooperative games
Machine Learning -> ML: Multiagent Reinforcement Learning
Multidisciplinary Topics and Applications -> MTA: Social sciences

4349

A Tensor-Based Formalization of the Event Calculus

Efthimis Tsilionis, Alexander Artikis, Paliouras Georgios

We present a formalization of the Event Calculus (EC) in tensor spaces. The motivation for a tensor-based predicate calculus comes from the area of composite event recognition (CER). As a CER engine, we adopt a logic programming implementation of EC with optimizations for continuous narrative assimilation on data streams. We show how to evaluate EC rules algebraically and solve a linear equation to compute the corresponding models. We demonstrate the scalability of our approach with the use of large datasets from a real-world application domain, and show that it significantly outperforms symbolic EC in terms of processing time.

List of keywords

Knowledge Representation and Reasoning -> KRR: Non-monotonic reasoning
Knowledge Representation and Reasoning -> KRR: Logic programming
Knowledge Representation and Reasoning -> KRR: Qualitative, geometric, spatial, and temporal reasoning
Machine Learning -> ML: Matrix/tensor methods

4372

Fast Unpaired Multi-view Clustering

Xingfeng Li, Yuangang Pan, Yinghui Sun, Quansen Sun, Ivor Tsang, Zhenwen Ren

Anchor-based pairwise multi-view clustering often assumes multi-view data are paired, and has demonstrated significant advancements in recent years. However, this presumption is easily violated, and data are commonly fully unpaired in practical applications due to the influence of data collection and storage processes. Addressing unpaired large-scale multi-view data through anchor learning remains a research gap. The absence of pairing in multi-view data disrupts the consistency and complementarity of multiple views, posing significant challenges in learning powerful and meaningful anchors and bipartite graphs from unpaired multi-view data. To tackle this challenge, this study proposes a novel Fast Unpaired Multi-view Clustering (FUMC) framework for fully unpaired large-scale multi-view data. Specifically, FUMC first designs an inverse local manifold learning paradigm to guide the learned anchors for effective pairing and balancing, ensuring alignment, fairness, and power in unpaired multi-view data. Meanwhile, a novel bipartite graph matching framework is developed to align unpaired bipartite graphs, creating a consistent bipartite graph from unpaired multi-view data. The efficacy, efficiency, and superiority of our FUMC are corroborated through extensive evaluations on numerous benchmark datasets with shallow and deep SOTA methods.

List of keywords

Machine Learning -> ML: Multi-view learning
Machine Learning -> ML: Clustering

4392

CoAtFormer: Vision Transformer with Composite Attention

Zhiyong Chang, Mingjun Yin, Yan Wang

Transformer has recently gained significant attention and achieved state-of-the-art performance in various computer vision applications, including image classification, instance segmentation, and object detection. However, the self-attention mechanism underlying the transformer leads to quadratic computational cost with respect to image size, limiting its widespread adoption in state-of-the-art vision backbones. In this paper we introduce an efficient and effective attention module we call Composite Attention. It features parallel branches, enabling the modeling of various global dependencies. In each composite attention module, one branch employs a dynamic channel attention module to capture global channel dependencies, while the other branch utilizes an efficient spatial attention module to extract long-range spatial interactions. In addition, we effectively blend the composite attention module with convolutions, and accordingly develop a simple hierarchical vision backbone, dubbed CoAtFormer, by simply repeating the basic building block over multiple stages. Extensive experiments show our CoAtFormer achieves state-of-the-art results on various tasks. Without any pre-training and extra data, CoAtFormer-Tiny, CoAtFormer-Small, and CoAtFormer-Base achieve 84.4%, 85.3%, and 85.9% top-1 accuracy on ImageNet-1K with 24M, 37M, and 73M parameters, respectively. Furthermore, CoAtFormer also consistently outperforms prior work in other vision tasks such as object detection, instance segmentation, and semantic segmentation. When further pretraining on the larger dataset ImageNet-22k, we achieve 88.7% top-1 accuracy on ImageNet-1K.

List of keywords

Computer Vision -> CV: Representation learning
Machine Learning -> ML: Deep learning architectures

4407

HyDiscGAN: A Hybrid Distributed cGAN for Audio-Visual Privacy Preservation in Multimodal Sentiment Analysis

Zhuojia Wu, Qi Zhang, Duoqian Miao, Kun Yi, Wei Fan, Liang Hu

Multimodal Sentiment Analysis (MSA) aims to identify speakers’ sentiment tendencies in multimodal video content, raising serious concerns about privacy risks associated with multimodal data, such as voiceprints and facial images. Recent distributed collaborative learning has been verified as an effective paradigm for privacy preservation in multimodal tasks. However, such methods often overlook the privacy distinctions among different modalities, struggling to strike a balance between performance and privacy preservation. This raises the intriguing question of how to maximize multimodal utilization to improve performance while simultaneously protecting the necessary modalities. This paper forms the first attempt at modality-specified (i.e., audio and visual) privacy preservation in MSA tasks. We propose a novel Hybrid Distributed cross-modality cGAN framework (HyDiscGAN), which learns multimodality alignment to generate fake audio and visual features conditioned on shareable de-identified textual data. The objective is to leverage the fake features to approximate real audio and visual content to guarantee privacy preservation while effectively enhancing performance. Extensive experiments show that compared with the state-of-the-art MSA model, HyDiscGAN can achieve superior or competitive performance while preserving privacy.

List of keywords

Natural Language Processing -> NLP: Sentiment analysis, stylistic analysis, and argument mining
Machine Learning -> ML: Multi-modal learning
Multidisciplinary Topics and Applications -> MTA: Security and privacy

4427

Towards Generalizable Neural Solvers for Vehicle Routing Problems via Ensemble with Transferrable Local Policy

Chengrui Gao, Haopu Shang, Ke Xue, Dong Li, Chao Qian

Machine learning has been adapted to help solve NP-hard combinatorial optimization problems. One prevalent way is learning to construct solutions by deep neural networks, which has been receiving more and more attention due to its high efficiency and lower requirement for expert knowledge. However, many neural construction methods for Vehicle Routing Problems (VRPs) focus on synthetic problem instances with specified node distributions and limited scales, leading to poor performance on real-world problems, which usually involve complex and unknown node distributions together with large scales. To make neural VRP solvers more practical, we design an auxiliary policy that learns from the local transferable topological features, named local policy, and integrate it with a typical construction policy (which learns from the global information of VRP instances) to form an ensemble policy. With joint training, the aggregated policies perform cooperatively and complementarily to boost generalization. The experimental results on two well-known benchmarks, TSPLIB and CVRPLIB, of travelling salesman problem and capacitated VRP show that the ensemble policy significantly improves both cross-distribution and cross-scale generalization performance, and even performs well on real-world problems with several thousand nodes.

List of keywords

Search -> S: Search and machine learning
Machine Learning -> ML: Reinforcement learning
Search -> S: Combinatorial search and optimisation

4442

Sentence-Level or Token-Level? A Comprehensive Study on Knowledge Distillation

Jingxuan Wei, Linzhuang Sun, Yichong Leng, Xu Tan, Bihui Yu, Ruifeng Guo

Knowledge distillation, transferring knowledge from a teacher model to a student model, has emerged as a powerful technique in neural machine translation for compressing models or simplifying training targets. Knowledge distillation encompasses two primary methods: sentence-level distillation and token-level distillation. In sentence-level distillation, the student model is trained to align with the output of the teacher model, which can alleviate the training difficulty and give the student model a comprehensive understanding of global structure. In contrast, token-level distillation requires the student model to learn the output distribution of the teacher model, facilitating a more fine-grained transfer of knowledge. Studies have revealed divergent performances between sentence-level and token-level distillation across different scenarios, leading to confusion about the empirical selection of knowledge distillation methods. In this study, we argue that token-level distillation, with its more complex objective (i.e., distribution), is better suited for “simple” scenarios, while sentence-level distillation excels in “complex” scenarios. To substantiate our hypothesis, we systematically analyze the performance of distillation methods by varying the model size of student models, the complexity of text, and the difficulty of decoding procedure. While our experimental results validate our hypothesis, defining the complexity level of a given scenario remains a challenging task. We therefore introduce a novel hybrid method that combines token-level and sentence-level distillation through a gating mechanism, aiming to leverage the advantages of both individual methods. Experiments demonstrate that the hybrid method surpasses token-level and sentence-level distillation methods, as well as previous works, by a margin, demonstrating its effectiveness.
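To make the two objectives in the abstract above concrete, here is a minimal Python sketch. It assumes the usual textbook forms: token-level distillation as cross-entropy against the teacher's per-position distribution, and sentence-level distillation as negative log-likelihood of the teacher's decoded tokens. The scalar `gate` and the convex combination in `hybrid_loss` are illustrative assumptions only; the abstract does not specify how the paper's gating mechanism is computed.

```python
import math

def token_level_loss(teacher_probs, student_probs):
    """Mean cross-entropy between teacher and student distributions per position."""
    total = 0.0
    for t_dist, s_dist in zip(teacher_probs, student_probs):
        total += -sum(t * math.log(s) for t, s in zip(t_dist, s_dist) if t > 0)
    return total / len(teacher_probs)

def sentence_level_loss(teacher_tokens, student_probs):
    """Mean negative log-likelihood of the teacher's decoded tokens under the student."""
    total = 0.0
    for tok, s_dist in zip(teacher_tokens, student_probs):
        total += -math.log(s_dist[tok])
    return total / len(teacher_tokens)

def hybrid_loss(gate, teacher_probs, teacher_tokens, student_probs):
    """Hypothetical gated mix: gate=1 is pure token-level, gate=0 pure sentence-level."""
    if not 0.0 <= gate <= 1.0:
        raise ValueError("gate must lie in [0, 1]")
    tok = token_level_loss(teacher_probs, student_probs)
    sen = sentence_level_loss(teacher_tokens, student_probs)
    return gate * tok + (1.0 - gate) * sen

# Toy example: a two-token vocabulary and a two-position target.
teacher_probs = [[0.9, 0.1], [0.2, 0.8]]
teacher_tokens = [0, 1]              # teacher's decoded output
student_probs = [[0.7, 0.3], [0.4, 0.6]]
loss = hybrid_loss(0.5, teacher_probs, teacher_tokens, student_probs)
```

In a real system the gate would presumably be produced by the model per token or per sentence; the fixed scalar here only shows how the two signals can be blended.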

List of keywords

Natural Language Processing -> NLP: Machine translation and multilinguality
Natural Language Processing -> NLP: Interpretability and analysis of models for NLP
Natural Language Processing -> NLP: Other
Natural Language Processing -> NLP: Summarization

4447

Skip-Timeformer: Skip-Time Interaction Transformer for Long Sequence Time-Series Forecasting

Wenchang Zhang, Hua Wang, Fan Zhang

Recent studies have raised questions about the suitability of the Transformer architecture for long sequence time-series forecasting. These forecasting models leverage Transformers to capture dependencies between multiple time steps in a time series, with embedding tokens composed of data from individual time steps. However, challenges arise when applying Transformers to predict long sequences with strong periodicity, leading to performance degradation and increased computational burden. Furthermore, embedding tokens formed one time step at a time may struggle to reveal meaningful information in long sequences, failing to capture correlations between different time steps. In this study, we propose Skip-Timeformer, a Transformer-based model that utilizes a skip-time interaction for long sequence time-series forecasting. Specifically, we decompose the time series into multiple subsequences based on different time intervals, embedding various time steps into variable tokens across multiple sequences. The skip-time interaction mechanism utilizes these variable tokens to capture dependencies in the skip-time dimension. Additionally, skip-time interaction is employed to learn dependencies between sequences missed by multiple skip time steps. The Skip-Timeformer model demonstrates state-of-the-art performance on various real-world datasets, further enhancing the long sequence forecasting capabilities of the Transformer variations and better adapting to arbitrary lookback windows.

List of keywords

Machine Learning -> ML: Time series and data streams

4451

Temporal Graph ODEs for Irregularly-Sampled Time Series

Alessio Gravina, Daniele Zambon, Davide Bacciu, Cesare Alippi

Modern graph representation learning works mostly under the assumption of dealing with regularly sampled temporal graph snapshots, which is far from realistic, e.g., social networks and physical systems are characterized by continuous dynamics and sporadic observations. To address this limitation, we introduce the Temporal Graph Ordinary Differential Equation (TG-ODE) framework, which learns both the temporal and spatial dynamics from graph streams where the intervals between observations are not regularly spaced. We empirically validate the proposed approach on several graph benchmarks, showing that TG-ODE can achieve state-of-the-art performance in irregular graph stream tasks.

List of keywords

Machine Learning -> ML: Sequence and graph learning
Machine Learning -> ML: Deep learning architectures
Machine Learning -> ML: Time series and data streams

4454

Privacy-Preserving UCB Decision Process Verification via zk-SNARKs

Xikun Jiang, He Lyu, Chenhao Ying, Yibin Xu, Boris Düdder, Yuan Luo

With the increasingly widespread application of machine learning, how to strike a balance between protecting the privacy of data and algorithm parameters and ensuring the verifiability of machine learning has always been a challenge. This study explores the intersection of reinforcement learning and data privacy, specifically addressing the Multi-Armed Bandit (MAB) problem with the Upper Confidence Bound (UCB) algorithm. We introduce zkUCB, an innovative algorithm that employs the Zero-Knowledge Succinct Non-Interactive Argument of Knowledge (zk-SNARKs) to enhance UCB. zkUCB is carefully designed to safeguard the confidentiality of training data and algorithmic parameters, ensuring transparent UCB decision-making. Experiments highlight zkUCB’s superior performance, attributing its enhanced reward to judicious quantization bit usage that reduces information entropy in the decision-making process. zkUCB’s proof size and verification time scale linearly with the execution steps of zkUCB. This showcases zkUCB’s adept balance between data security and operational efficiency. This approach contributes significantly to the ongoing discourse on reinforcing data privacy in complex decision-making processes, offering a promising solution for privacy-sensitive applications.
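For readers unfamiliar with the UCB decision process that zkUCB builds on, the following is a minimal sketch of the standard UCB1 rule in plain Python. It is not zkUCB: it contains no quantization and no zk-SNARK proof generation, and the Bernoulli arms in the example are made up for illustration.

```python
import math
import random

def ucb1(pull, n_arms, horizon):
    """Run standard UCB1: play each arm once, then maximise mean + exploration bonus."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # initial round-robin so every count is positive
        else:
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
    return counts, sums

# Example with two hypothetical Bernoulli arms.
rng = random.Random(0)
means = [0.2, 0.8]
counts, sums = ucb1(lambda a: 1.0 if rng.random() < means[a] else 0.0, 2, 500)
```

Each step of this loop is the kind of decision whose correctness a verifier would want to check without seeing the rewards or parameters, which is the setting the abstract describes.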

List of keywords

Multidisciplinary Topics and Applications -> MTA: Security and privacy
Machine Learning -> ML: Multi-armed bandits
Machine Learning -> ML: Trustworthy machine learning

4460

Joint Domain Adaptive Graph Convolutional Network

Niya Yang, Dongxiao He, Xin Huang, Zhizhi Yu, Ye Wang, Di Jin

In the realm of cross-network tasks, graph domain adaptation is an effective tool due to its ability to transfer abundant labels from nodes in the source domain to those in the target domain. Existing adversarial domain adaptation methods mainly focus on domain-wise alignment. These approaches, while effective in mitigating the marginal distribution shift between the two domains, often ignore the integral aspect of structural alignment, potentially leading to negative transfer. To address this issue, we propose a joint adversarial domain adaptive graph convolutional network (JDA-GCN) that is uniquely augmented with structural graph alignment, so as to enhance the efficacy of knowledge transfer. Specifically, we construct a structural graph to delineate the interconnections among nodes within identical categories across the source and target domains. To further refine node representation, we integrate the local consistency matrix with the global consistency matrix, thereby leveraging the learning of the sub-structure similarity of nodes to enable more robust and effective representation of nodes. Empirical evaluation on diverse real-world datasets substantiates the superiority of our proposed method, marking a significant advancement over existing state-of-the-art graph domain adaptation algorithms.

List of keywords

Data Mining -> DM: Mining graphs
Machine Learning -> ML: Classification
Machine Learning -> ML: Sequence and graph learning

4464

Explaining Arguments’ Strength: Unveiling the Role of Attacks and Supports

Xiang Yin, Nico Potyka, Francesca Toni

Quantitatively explaining the strength of arguments under gradual semantics has recently received increasing attention. Specifically, several works in the literature provide quantitative explanations by computing the attribution scores of arguments. These works disregard the importance of attacks and supports, even though they play an essential role when explaining arguments’ strength. In this paper, we propose a novel theory of Relation Attribution Explanations (RAEs), adapting Shapley values from game theory to offer fine-grained insights into the role of attacks and supports in quantitative bipolar argumentation towards obtaining the arguments’ strength. We show that RAEs satisfy several desirable properties. We also propose a probabilistic algorithm to approximate RAEs efficiently. Finally, we show the application value of RAEs in fraud detection and large language models case studies.
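The abstract above mentions adapting Shapley values and approximating them with a probabilistic algorithm. A common way to approximate Shapley values in general is permutation sampling, sketched below in Python. This is the generic estimator, not necessarily the algorithm in the paper. Treating attack and support relations as the "players", with `value` returning an argument's strength when only a given subset of relations is present, is an assumed reading for illustration; the toy `value` function here is simply additive.

```python
import random

def shapley_monte_carlo(players, value, n_samples, seed=0):
    """Estimate Shapley values by averaging marginal contributions over random orderings."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    order = list(players)
    for _ in range(n_samples):
        rng.shuffle(order)
        coalition = set()
        prev = value(frozenset(coalition))
        for p in order:
            coalition.add(p)
            cur = value(frozenset(coalition))
            totals[p] += cur - prev  # marginal contribution of p in this ordering
            prev = cur
    return {p: totals[p] / n_samples for p in players}

# Toy example: hypothetical relations with additive effects on a strength score.
effects = {"attack_1": -0.5, "support_1": 1.0, "support_2": 2.0}
estimates = shapley_monte_carlo(
    list(effects), lambda s: sum(effects[r] for r in s), n_samples=100
)
```

For an additive value function every ordering yields the same marginal contributions, so the estimates coincide with the individual effects; for non-additive functions the estimates still sum to the value of the full set minus the value of the empty set.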

List of keywords

Knowledge Representation and Reasoning -> KRR: Argumentation

4476

ROCES: Robust Class Expression Synthesis in Description Logics via Iterative Sampling

N’Dah Jean Kouagou, Stefan Heindorf, Caglar Demir, Axel-Cyrille Ngonga Ngomo

We consider the problem of class expression learning using cardinality-minimal sets of examples. Recent class expression learning approaches employ deep neural networks and have demonstrated tremendous performance improvements in execution time and quality of the computed solutions. However, they lack generalization capabilities when it comes to the number of examples used in a learning problem, i.e., they often perform poorly on unseen learning problems where only a few examples are given. In this work, we propose a generalization of the classical class expression learning problem to address the limitations above. In short, our generalized learning problem ($\mathcal{GLP}$) forces learning systems to solve the classical class expression learning problem using the smallest possible subsets of examples, thereby improving the learning systems’ ability to solve unseen learning problems with arbitrary numbers of examples. Moreover, we develop ROCES, a learning algorithm for synthesis-based approaches to solve $\mathcal{GLP}$. Experimental results suggest that post training, ROCES outperforms existing synthesis-based approaches on out-of-distribution learning problems while remaining highly competitive overall.

List of keywords

Machine Learning -> ML: Neuro-symbolic methods
Knowledge Representation and Reasoning -> KRR: Description logics and ontologies
Knowledge Representation and Reasoning -> KRR: Learning and reasoning
Machine Learning -> ML: Explainable/Interpretable machine learning

4479

Selecting the Most Conflicting Pair of Candidates

Théo Delemazure, Łukasz Janeczko, Andrzej Kaczmarczyk, Stanisław Szufa

We study committee elections from a perspective of finding the most conflicting candidates, that is, candidates that imply the largest amount of conflict, as per voter preferences. By proposing basic axioms to capture this objective, we show that none of the prominent multiwinner voting rules meet them. Consequently, we design committee voting rules compliant with our desiderata, introducing conflictual voting rules. A subsequent deepened analysis sheds more light on how they operate. Our investigation identifies various aspects of conflict, for which we come up with relevant axioms and quantitative measures, which may be of independent interest. We support our theoretical study with experiments on both real-life and synthetic data.

List of keywords

Game Theory and Economic Paradigms -> GTEP: Computational social choice

4485

Enhancing Controlled Query Evaluation through Epistemic Policies

Gianluca Cima, Domenico Lembo, Lorenzo Marconi, Riccardo Rosati, Domenico Fabio Savo

In this paper, we propose the use of epistemic dependencies to express data protection policies in Controlled Query Evaluation (CQE), which is a form of confidentiality-preserving query answering over ontologies and databases. The resulting policy language goes significantly beyond those proposed in the literature on CQE so far, allowing for very rich and practically interesting forms of data protection rules. We show the expressive abilities of our framework and study the data complexity of CQE for (unions of) conjunctive queries when ontologies are specified in the Description Logic DL-LiteR. Interestingly, while we show that the problem is in general intractable, we prove tractability for the case of acyclic epistemic dependencies by providing a suitable query rewriting algorithm. The latter result paves the way towards the implementation and practical application of this new approach to CQE.

List of keywords

Knowledge Representation and Reasoning -> KRR: Description logics and ontologies
Knowledge Representation and Reasoning -> KRR: Computational complexity of reasoning

4491

Atomic Recovery Property for Multi-view Subspace-Preserving Recovery

Yulong Wang

As the theoretical underpinnings for subspace clustering and classification, subspace-preserving recovery has attracted intensive attention in recent years. However, previous theoretical advances for subspace-preserving recovery focus only on single-view data, and most of them are based on conditions that are only sufficient. In this paper, we propose a necessary and sufficient condition referred to as Atomic Recovery Property (ARP) for multi-view subspace-preserving recovery. To this end, we generalize the atomic norm from single-view data to multi-view data and define the Multi-view Atomic Norm (MAN). Another contribution of ours is to provide a geometrically more interpretable characterization of ARP with respect to the unit ball of MAN. Based on the proposed multi-view subspace-preserving recovery theory, we also derive novel theoretical results for multi-view subspace clustering and classification, respectively.

List of keywords

Machine Learning -> ML: Clustering
Machine Learning -> ML: Classification
Machine Learning -> ML: Matrix/tensor methods
Machine Learning -> ML: Multi-view learning

4516

A Logic for Reasoning about Aggregate-Combine Graph Neural Networks

Pierre Nunn, Marco Sälzer, Francois Schwarzentruber, Nicolas Troquard

In this paper, we propose a modal logic in which counting modalities appear in linear inequalities. We show that each formula can be transformed into an equivalent graph neural network (GNN). We also show that a broad class of GNNs can be transformed efficiently into a formula, thus significantly improving upon the literature about the logical expressiveness of GNNs. We also show that the satisfiability problem is PSPACE-complete. These results bring together the promise of using standard logical methods for reasoning about GNNs and their properties, particularly in applications such as GNN querying, equivalence checking, etc. We prove that such natural problems can be solved in polynomial space.

List of keywords

Knowledge Representation and Reasoning -> KRR: Learning and reasoning
Machine Learning -> ML: Explainable/Interpretable machine learning
Machine Learning -> ML: Learning theory

4535

Normative Testimony and Belief Functions: A Formal Theory of Norm Learning

Taylor Olson, Kenneth D. Forbus

The ability to learn another’s moral beliefs is necessary for all social agents. It allows us to predict their behavior and is a prerequisite to correcting their beliefs if they are incorrect. To make AI systems more socially competent, a formal theory for learning internal normative beliefs is thus needed. However, to the best of our knowledge, a philosophically justified formal theory for this process does not yet exist. This paper begins the development of such a theory, focusing on learning from testimony. We make four main contributions. First, we provide a set of axioms that any such theory must satisfy. Second, we provide justification for belief functions, as opposed to traditional probability theory, for modeling norm learning. Third, we construct a novel learning function that satisfies these axioms. Fourth, we provide a complexity analysis of this formalism and a proof that deontic rules are sound under its semantics. This paper thus serves as a theoretical contribution towards modeling learning norms from testimony, paving the road towards more social AI systems.

List of keywords

AI Ethics, Trust, Fairness -> ETF: Values
Agent-based and Multi-agent Systems -> MAS: Normative systems
Knowledge Representation and Reasoning -> KRR: Learning and reasoning
Uncertainty in AI -> UAI: Uncertainty representations

4551

Weighted EF1 and PO Allocations with Few Types of Agents or Chores

Jugal Garg, Aniket Murhekar, John Qin

We investigate the existence of fair and efficient allocations of indivisible chores to asymmetric agents who have unequal entitlements or weights. We consider the fairness notion of weighted envy-freeness up to one chore (wEF1) and the efficiency notion of Pareto-optimality (PO). The existence of EF1 and PO allocations of chores to symmetric agents is a major open problem in discrete fair division, and positive results are known only for certain structured instances. In this paper, we study this problem for a more general setting of asymmetric agents and show that an allocation that is wEF1 and PO exists and can be computed in polynomial time for instances with:
– Three types of agents, where agents with the same type have identical preferences but can have different weights.
– Two types of chores.
For symmetric agents, our results establish that EF1 and PO allocations exist for three types of agents and also generalize known results for three agents, two types of agents, and two types of chores. Our algorithms use a weighted picking sequence algorithm as a subroutine; we expect this idea and our analysis to be of independent interest.
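As an illustration of the fairness notion named above, the sketch below checks wEF1 for an allocation of chores. It assumes additive costs and the commonly used definition: agent i does not weight-envy agent j up to one chore if, after removing some chore from i's own bundle, i's cost divided by i's weight is at most i's cost for j's bundle divided by j's weight. The function name and data layout are hypothetical, and the code only checks fairness; it does not test Pareto-optimality or compute allocations.

```python
def is_wef1(costs, weights, allocation):
    """Check weighted envy-freeness up to one chore under additive costs.

    costs[i][c]   : cost agent i assigns to chore c (non-negative)
    weights[i]    : entitlement of agent i (positive)
    allocation[i] : list of chores assigned to agent i
    """
    n = len(weights)
    for i in range(n):
        bundle = allocation[i]
        if not bundle:
            continue  # an agent with no chores envies nobody
        own = sum(costs[i][c] for c in bundle)
        # Removing the costliest chore is the best case for agent i.
        own_minus_worst = own - max(costs[i][c] for c in bundle)
        for j in range(n):
            if j == i:
                continue
            other = sum(costs[i][c] for c in allocation[j])
            # Compare own_minus_worst / w_i <= other / w_j without dividing.
            if own_minus_worst * weights[j] > other * weights[i]:
                return False
    return True

# Four identical chores, agent 0 has three times the weight of agent 1.
costs = [[1, 1, 1, 1], [1, 1, 1, 1]]
weights = [3, 1]
fair = is_wef1(costs, weights, [[0, 1, 2], [3]])
unfair = is_wef1(costs, weights, [[], [0, 1, 2, 3]])
```

Cross-multiplying by the weights keeps the comparison exact when costs and weights are integers.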

List of keywords

Game Theory and Economic Paradigms -> GTEP: Fair division
Agent-based and Multi-agent Systems -> MAS: Resource allocation

4553

Sampling Winners in Ranked Choice Voting

Matthew Iceland, Anson Kahng, Joseph Saber

Ranked choice voting (RCV) is a voting rule that iteratively eliminates least-popular candidates until there is a single winner with a majority of all remaining votes. In this work, we explore three central questions about predicting the outcome of RCV on an election given a uniform sample of votes. First, in theory, how poorly can RCV sampling predict RCV outcomes? Second, can we use insights from the recently-proposed map of elections to better predict RCV outcomes? Third, is RCV the best rule to use on a sample to predict the outcome of RCV in real-world elections? We find that although RCV can do quite poorly in the worst case and it may be better to use other rules to predict RCV winners on synthetic data from the map of elections, RCV generally predicts itself well on real-world data, further contributing to its appeal as a theoretically-flawed but practicable voting process. We further supplement our work by exploring the effect of margin of victory (MoV) on sampling accuracy.
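The rule described in the first sentence of the abstract above, and the idea of predicting its outcome from a uniform sample of votes, can be sketched in a few lines of Python. The tie-breaking rule (alphabetical order of candidate names) is an assumption made here for determinism, and the ballots are invented for the example.

```python
import random

def rcv_winner(ballots, candidates):
    """Eliminate the least-popular candidate until one has a majority of remaining votes."""
    remaining = set(candidates)
    while True:
        tally = {c: 0 for c in remaining}
        for ballot in ballots:
            for c in ballot:  # count the highest-ranked remaining candidate
                if c in remaining:
                    tally[c] += 1
                    break
        total = sum(tally.values())
        # sorted() makes ties resolve alphabetically (an assumption of this sketch).
        leader = max(sorted(remaining), key=lambda c: tally[c])
        if len(remaining) == 1 or 2 * tally[leader] > total:
            return leader
        loser = min(sorted(remaining), key=lambda c: tally[c])
        remaining.remove(loser)

def sampled_prediction(ballots, candidates, k, seed=0):
    """Predict the winner by running RCV on a uniform sample of k ballots."""
    rng = random.Random(seed)
    return rcv_winner(rng.sample(ballots, k), candidates)

ballots = [("A", "B", "C")] * 4 + [("B", "C", "A")] * 3 + [("C", "B", "A")] * 2
full_winner = rcv_winner(ballots, ["A", "B", "C"])
guess = sampled_prediction(ballots, ["A", "B", "C"], k=5)
```

In this example A leads the first round without a majority, C is eliminated, and C's ballots transfer to B. Comparing `guess` with `full_winner` over many seeds gives an empirical estimate of sampling accuracy for a given sample size.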

List of keywords

Game Theory and Economic Paradigms -> GTEP: Computational social choice

4563

Using Large Language Models to Improve Query-based Constraint Acquisition

Younes Mechqrane, Christian Bessiere, Ismail Elabbassi

Most active constraint acquisition systems suffer from two weaknesses. They require the explicit generation of the set of potential constraints (the bias), whose size can be prohibitive for practical use of these systems, and the answers to queries contain little information. In this paper, we introduce ACQNOGOODS, an active learning schema that does not require the construction of a bias. We then propose LLMACQ, an active learning system that incorporates a Large Language Model component in the ACQNOGOODS schema. LLMACQ interprets the user’s answers given in natural language, leading to more informative communication. As our experiments show, not requiring a bias in extension, combined with the higher-level communication with the user, allows LLMACQ to learn constraints of any arity and to dramatically decrease the number of queries.

List of keywords

Constraint Satisfaction and Optimization -> CSO: Constraint learning and acquisition

4564

Towards Exact Computation of Inductive Bias

Akhilan Boopathy, William Yue, Jaedong Hwang, Abhiram Iyer, Ila Fiete

Much research in machine learning involves finding appropriate inductive biases (e.g. convolutional neural networks, momentum-based optimizers, transformers) to promote generalization on tasks. However, quantification of the amount of inductive bias associated with these architectures and hyperparameters has been limited. We propose a novel method for efficiently computing the inductive bias required for generalization on a task with a fixed training data budget; formally, this corresponds to the amount of information required to specify well-generalizing models within a specific hypothesis space of models. Our approach involves modeling the loss distribution of random hypotheses drawn from a hypothesis space to estimate the required inductive bias for a task relative to these hypotheses. Unlike prior work, our method provides a direct estimate of inductive bias without using bounds and is applicable to diverse hypothesis spaces. Moreover, we derive approximation error bounds for our estimation approach in terms of the number of sampled hypotheses. Consistent with prior results, our empirical results demonstrate that higher dimensional tasks require greater inductive bias. We show that relative to other expressive model classes, neural networks as a model class encode large amounts of inductive bias. Furthermore, our measure quantifies the relative difference in inductive bias between different neural network architectures. Our proposed inductive bias metric provides an information-theoretic interpretation of the benefits of specific model architectures for certain tasks and provides a quantitative guide to developing tasks requiring greater inductive bias, thereby encouraging the development of more powerful inductive biases.

List of keywords

Machine Learning -> ML: Learning theory
Machine Learning -> ML: Explainable/Interpretable machine learning
Machine Learning -> ML: Evaluation
Machine Learning -> ML: Other

4582

Structured d-DNNF Is Not Closed under Negation

Harry Vinall-Smeeth

Both structured d-DNNF and SDD can be exponentially more succinct than OBDD. Moreover, SDD is essentially as tractable as OBDD. But this leaves two important open questions. Firstly, does OBDD support more tractable transformations than structured d-DNNF? And secondly, is structured d-DNNF more succinct than SDD? In this paper, we answer both questions in the affirmative. For the first question we show that, unlike OBDD, structured d-DNNF does not support polytime negation, disjunction, or existential quantification operations. As a corollary, we deduce that there are functions with an equivalent polynomial-sized structured d-DNNF but with no such representation as an SDD, thus answering the second question. We also lift this second result to arithmetic circuits (AC) to show a succinctness gap between PSDD and the positive AC analogue to structured d-DNNF.

List of keywords

Knowledge Representation and Reasoning -> KRR: Knowledge compilation

4594

GRASP: A Novel Benchmark for Evaluating Language GRounding and Situated Physics Understanding in Multimodal Language Models

Serwan Jassim, Mario Holubar, Annika Richter, Cornelius Wolff, Xenia Ohmer, Elia Bruni

This paper presents GRASP, a novel benchmark to evaluate the language grounding and physical understanding capabilities of video-based multimodal large language models (LLMs). This evaluation is accomplished via a two-tier approach leveraging Unity simulations. The first level tests for language grounding by assessing a model’s ability to relate simple textual descriptions with visual information. The second level evaluates the model’s understanding of "Intuitive Physics" principles, such as object permanence and continuity. In addition to releasing the benchmark, we use it to evaluate several state-of-the-art multimodal LLMs. Our evaluation reveals significant shortcomings in the language grounding and intuitive physics capabilities of these models. Although they exhibit at least some grounding capabilities, particularly for colors and shapes, these capabilities depend heavily on the prompting strategy. At the same time, all models perform below or at the chance level of 50% in the Intuitive Physics tests, while human subjects are on average 80% correct. These identified limitations underline the importance of using benchmarks like GRASP to monitor the progress of future models in developing these competencies.

List of keywords

Natural Language Processing -> NLP: Resources and evaluation
Computer Vision -> CV: Vision, language and reasoning
Natural Language Processing -> NLP: Language grounding

4598

Searching for Programmatic Policies in Semantic Spaces

Rubens Moraes, Levi Lelis

Syntax-guided synthesis is the approach commonly used to synthesize programs encoding policies. In syntax-guided synthesis, the set of programs that one can write in a domain-specific language defines the search space and an algorithm searches within this space for programs encoding policies that maximize the agent’s reward. In this paper, we show an alternative approach to the synthesis of programmatic policies, where we search in an approximation of the underlying semantic space of the language. We hypothesized that searching in semantic spaces is a more sample-efficient approach to the synthesis of programmatic policies. We posit that the search is more efficient if the algorithm evaluates different agent behaviors as it searches through the space, a feature often lacking in syntax-based spaces. This is because small changes to the syntax of a program often do not result in different agent behaviors. We define semantic spaces by learning a library of programs that present different agent behaviors. Then, we approximate the semantic space by defining a neighborhood function for local search algorithms, where we replace parts of the current candidate program with programs from the library. We evaluated our hypothesis in a real-time strategy game called MicroRTS. The empirical results support our hypothesis that searching in semantic spaces can be more sample-efficient than searching in syntax spaces. Our results also show that the programmatic policies our system generates are able to outperform the winner of the last three MicroRTS competitions.

List of keywords

Multidisciplinary Topics and Applications -> MTA: Computer games
Multidisciplinary Topics and Applications -> MTA: Game playing

4601

What Is Best for Students, Numerical Scores or Letter Grades?

Evi Micha, Shreyas Sekar, Nisarg Shah

We study letter grading schemes, which are routinely employed for evaluating student performance. Typically, a numerical score obtained via one or more evaluations is converted into a letter grade (e.g., A+, B-, etc.) by associating a disjoint interval of numerical scores to each letter grade. We propose the first model for studying the (de)motivational effects of such grading on the students and, consequently, on their performance in future evaluations. We use the model to compare uniform letter grading schemes, in which the range of scores is divided into equal-length parts that are mapped to the letter grades, to numerical scoring, in which the score is not converted to any letter grade (equivalently, every score is its own letter grade). Theoretically, we identify realistic conditions under which numerical scoring is better than any uniform letter grading scheme. Our experiments confirm that this holds under even weaker conditions, but also find cases where the converse occurs.
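A uniform letter grading scheme, as described above, divides the score range into equal-length parts and maps each part to a letter. The Python sketch below shows one such mapping. The 0 to 100 range, the five letters, and the convention that each interval includes its lower endpoint are assumptions for illustration, and the sketch does not model the motivational effects studied in the paper.

```python
def uniform_letter_grade(score, letters, lo=0.0, hi=100.0):
    """Map a numerical score to a letter using equal-length score intervals.

    letters are ordered from the lowest grade to the highest; each interval
    includes its lower endpoint, and the top interval also includes hi.
    """
    if not lo <= score <= hi:
        raise ValueError("score outside the grading range")
    k = len(letters)
    index = min(int((score - lo) * k / (hi - lo)), k - 1)
    return letters[index]

letters = ["F", "D", "C", "B", "A"]
grades = [uniform_letter_grade(s, letters) for s in (0, 19.9, 20, 79.9, 85, 100)]
```

Numerical scoring corresponds to the limiting case in which every distinct score is its own grade, so no information about the score is lost in the conversion.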

List of keywords

Game Theory and Economic Paradigms -> GTEP: Computational social choice

4605

Faster Optimal Coalition Structure Generation via Offline Coalition Selection and Graph-Based Search

Redha Taguelmimt, Samir Aknine, Djamila Boukredera, Narayan Changder, Tuomas Sandholm

Coalition formation is a key capability in multi-agent systems. An important problem in coalition formation is coalition structure generation: partitioning agents into coalitions to optimize the social welfare. This is a challenging problem that has been the subject of active research for the past three decades. In this paper, we present a novel algorithm, SMART, for the problem based on a hybridization of three innovative techniques. Two of these techniques are based on dynamic programming, where we show a powerful connection between the coalitions selected for evaluation and the performance of the algorithms. These algorithms use offline phases to optimize the choice of coalitions to evaluate. The third one uses branch-and-bound and integer partition graph search to explore the solution space. Our techniques bring a new way of approaching the problem and a new level of precision to the field. In experiments over several common value distributions, we show that the hybridization of these techniques in SMART is faster than the fastest prior algorithms (ODP-IP, BOSS) in generating optimal solutions across all the value distributions.
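
For readers unfamiliar with coalition structure generation, the sketch below shows the textbook dynamic program over agent subsets. It is not SMART, ODP-IP, or BOSS. The bitmask encoding and the name `optimal_coalition_structure` are assumptions made for illustration.

```python
def optimal_coalition_structure(n, values):
    """Textbook subset DP for coalition structure generation (n >= 1).

    `values[mask]` is the value of the coalition whose members are the set
    bits of `mask` (agent i is bit i). Returns (best welfare, partition).
    Illustrative baseline only; not the algorithm proposed in the paper.
    """
    full = (1 << n) - 1
    best = [0] * (full + 1)    # best[mask]: optimal welfare for agents in mask
    split = [0] * (full + 1)   # split[mask]: first part of the best split
    for mask in range(1, full + 1):
        best[mask] = values[mask]   # option 1: keep mask as one coalition
        split[mask] = mask
        low = mask & -mask          # lowest agent stays in `sub` to skip mirrored splits
        sub = (mask - 1) & mask     # enumerate proper non-empty subsets of mask
        while sub:
            if sub & low:
                cand = best[sub] + best[mask ^ sub]
                if cand > best[mask]:
                    best[mask], split[mask] = cand, sub
            sub = (sub - 1) & mask
    # Walk the recorded splits to recover the coalitions themselves.
    structure, stack = [], [full]
    while stack:
        m = stack.pop()
        if split[m] == m:
            structure.append([i for i in range(n) if m >> i & 1])
        else:
            stack.extend((split[m], m ^ split[m]))
    return best[full], sorted(structure)


# Three agents; only the pair {0, 1} has synergy.
values = {0b001: 1, 0b010: 1, 0b100: 1, 0b011: 5, 0b101: 1, 0b110: 1, 0b111: 4}
print(optimal_coalition_structure(3, values))
```

This baseline evaluates every split of every subset, which takes on the order of 3^n steps, so it is only practical for small numbers of agents.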

List of keywords

Agent-based and Multi-agent Systems -> MAS: Coordination and cooperation

4611

Physics-Informed Neural Networks: Minimizing Residual Loss with Wide Networks and Effective Activations

Nima Hosseini Dashtbayaz, Ghazal Farhani, Boyu Wang, Charles X. Ling

The residual loss in Physics-Informed Neural Networks (PINNs) alters the simple recursive relation of layers in a feed-forward neural network by applying a differential operator, resulting in a loss landscape that is inherently different from those of common supervised problems. Therefore, relying on the existing theory leads to unjustified design choices and suboptimal performance. In this work, we analyze the residual loss by studying its characteristics at critical points to find the conditions that result in effective training of PINNs. Specifically, we first show that under certain conditions, the residual loss of PINNs can be globally minimized by a wide neural network. Furthermore, our analysis also reveals that an activation function with well-behaved high-order derivatives plays a crucial role in minimizing the residual loss. In particular, to solve a $k$-th order PDE, the $k$-th derivative of the activation function should be bijective. The established theory paves the way for designing and choosing effective activation functions for PINNs and explains why periodic activations have shown promising performance in certain cases. Finally, we verify our findings by conducting a set of experiments on several PDEs. Our code is publicly available at https://github.com/nimahsn/pinns_tf2.

List of keywords

Multidisciplinary Topics and Applications -> MTA: Physical sciences
Machine Learning -> ML: Applications

4625

Bypassing the ASP Bottleneck: Hybrid Grounding by Splitting and Rewriting

Alexander Beiser, Markus Hecher, Kaan Unalan, Stefan Woltran

Answer Set Programming (ASP) is a key paradigm for problems in artificial intelligence and industrial contexts. In ASP, problems are modeled via a set of rules. Over time, this paradigm has grown into a rich language, enabling complex rule types like aggregate expressions. Most practical ASP systems follow a ground-and-solve pattern, where rule schemes are grounded and the resulting rules are solved. There, the so-called grounding bottleneck may prevent solving due to sheer grounding sizes. Recently, body-decoupled grounding (BDG) demonstrated how to reduce grounding sizes by delegating effort to solving. However, BDG provides limited interoperability with traditional grounders and only covers simple rule types. In this work, we establish hybrid grounding, based on a novel splitting theorem that allows us to freely combine BDG with traditional grounders. To mitigate huge groundings in practice, we define rewriting procedures for efficiently deferring grounding effort of aggregates to solving. Our experimental results indicate that this approach is competitive, especially for instances where traditional grounding fails.

List of keywords

Knowledge Representation and Reasoning -> KRR: Logic programming
Knowledge Representation and Reasoning -> KRR: Applications
Knowledge Representation and Reasoning -> KRR: Computational complexity of reasoning
Knowledge Representation and Reasoning -> KRR: Non-monotonic reasoning

4626

Scalable Ultrafast Almost-optimal Euclidean Shortest Paths

Stefan Funke, Daniel Koch, Claudius Proissl, Axel Schneewind, Armin Weiß, Felix Weitbrecht

We consider the problem of computing high-quality Euclidean shortest paths amidst obstacles on a large scale. By transferring and adapting speed-up techniques from the network-constrained setting, we are able to compute source-target paths amidst obstacles in problem instances of several million obstacle vertices within a few milliseconds. Based on a new lower-bounding technique, we can show that our computed paths are on average only a few percent longer than the optimal paths. We compare our approach with the current state of the art on large problem instances derived from the OpenStreetMap project.

List of keywords

Planning and Scheduling -> PS: Routing
Multidisciplinary Topics and Applications -> MTA: Transportation
Planning and Scheduling -> PS: Applications
Search -> S: Combinatorial search and optimisation

4631

Multi-TA: Multilevel Temporal Augmentation for Robust Septic Shock Early Prediction

Hyunwoo Sohn, Kyungjin Park, Baekkwan Park, Min Chi

Early prediction of the onset of a disease, where a model determines whether a patient will develop the disease n hours later, is critical to timely and accurate clinical decision-making. While deep learning algorithms have demonstrated great success using multivariate irregular time-series data such as electronic health records (EHRs), they often lack temporal robustness because data scarcity problems become more prominent at multiple levels as n increases. At the event level, the decreasing number of available events per trajectory increases uncertainty in anticipating future disease behavior. At the trajectory level, the scarcity of patient trajectories limits diversity in the training population, hindering the model’s generalization. This work introduces Multi-TA, a multilevel temporal augmentation framework that leverages BERT-based temporal EHR representation learning and a unified data augmentation approach, effectively addressing data scarcity issues at both event and trajectory levels. Validated on two real-world EHR datasets for septic shock, Multi-TA outperforms mixup and GAN-based state-of-the-art models across eight prediction windows, demonstrating improved temporal robustness. Further, we provide in-depth analyses of data augmentation for clarification.

List of keywords

Multidisciplinary Topics and Applications -> MTA: Health and medicine
Machine Learning -> ML: Robustness
Machine Learning -> ML: Time series and data streams
Natural Language Processing -> NLP: Applications

4638

Hierarchical Reinforcement Learning for Point of Interest Recommendation

Yanan Xiao, Lu Jiang, Kunpeng Liu, Yuanbo Xu, Pengyang Wang, Minghao Yin

With the increasing popularity of location-based services, accurately recommending points of interest (POIs) has become a critical task. Although existing technologies are proficient in processing time-series data, they fall short when it comes to accommodating the diversity and dynamism in users’ POI selections, particularly in extracting key signals from complex historical behaviors. To address this challenge, we introduced the Hierarchical Reinforcement Learning Preprocessing Framework (HRL-PRP), a framework that can be integrated into existing recommendation models to effectively optimize user profiles. The HRL-PRP framework employs a two-tiered decision-making process, where the high-level process determines the necessity of modifying profiles, and the low-level process focuses on selecting POIs within the profiles. Through evaluations on multiple real-world datasets, we have demonstrated that HRL-PRP surpasses existing state-of-the-art methods in various recommendation performance metrics.

List of keywords

Data Mining -> DM: Recommender systems

4641

Natural Language Decomposition and Interpretation of Complex Utterances

Harsh Jhamtani, Hao Fang, Patrick Xia, Eran Levy, Jacob Andreas, Benjamin Van Durme

Designing natural language interfaces has historically required collecting supervised data to translate user requests into carefully designed intent representations. This requires enumerating and labeling a long tail of user requests, which is challenging. At the same time, large language models (LLMs) encode knowledge about goals and plans that can help conversational assistants interpret user requests requiring numerous steps to complete. We introduce an approach to handle complex-intent-bearing utterances from a user via a process of hierarchical natural language decomposition and interpretation. Our approach uses a pre-trained language model to decompose a complex utterance into a sequence of simpler natural language steps and interprets each step using the language-to-program model designed for the interface. To test our approach, we collect and release DeCU, a new NL-to-program benchmark to evaluate Decomposition of Complex Utterances. Experiments show that the proposed approach enables the interpretation of complex utterances with almost no complex training data, while outperforming standard few-shot prompting approaches.

List of keywords

Natural Language Processing -> NLP: Dialogue and interactive systems
Natural Language Processing -> NLP: Language grounding
Natural Language Processing -> NLP: Natural language semantics

4643

ConstrainedZero: Chance-Constrained POMDP Planning Using Learned Probabilistic Failure Surrogates and Adaptive Safety Constraints

Robert J. Moss, Arec Jamgochian, Johannes Fischer, Anthony Corso, Mykel J. Kochenderfer

To plan safely in uncertain environments, agents must balance utility with safety constraints. Safe planning problems can be modeled as a chance-constrained partially observable Markov decision process (CC-POMDP) and solutions often use expensive rollouts or heuristics to estimate the optimal value and action-selection policy. This work introduces the ConstrainedZero policy iteration algorithm that solves CC-POMDPs in belief space by learning neural network approximations of the optimal value and policy with an additional network head that estimates the failure probability given a belief. This failure probability guides safe action selection during online Monte Carlo tree search (MCTS). To avoid overemphasizing search based on the failure estimates, we introduce Δ-MCTS, which uses adaptive conformal inference to update the failure threshold during planning. The approach is tested on a safety-critical POMDP benchmark, an aircraft collision avoidance system, and the sustainability problem of safe CO₂ storage. Results show that by separating safety constraints from the objective we can achieve a target level of safety without optimizing the balance between rewards and costs.

List of keywords

Planning and Scheduling -> PS: POMDPs
Machine Learning -> ML: Partially observable reinforcement learning and POMDPs
Planning and Scheduling -> PS: Planning under uncertainty

4663

Shadow-Free Membership Inference Attacks: Recommender Systems Are More Vulnerable Than You Thought

Xiaoxiao Chi, Xuyun Zhang, Yan Wang, Lianyong Qi, Amin Beheshti, Xiaolong Xu, Kim-Kwang Raymond Choo, Shuo Wang, Hongsheng Hu

Recommender systems have been successfully applied in many applications. Nonetheless, recent studies demonstrate that recommender systems are vulnerable to membership inference attacks (MIAs), leading to the leakage of users’ membership privacy. However, existing MIAs relying on shadow training suffer a large performance drop when the attacker lacks knowledge of the training data distribution and the model architecture of the target recommender system. To better understand the privacy risks of recommender systems, we propose shadow-free MIAs that directly leverage a user’s recommendations for membership inference. Without shadow training, the proposed attack can conduct MIAs efficiently and effectively under a practical scenario where the attacker is given only black-box access to the target recommender system. The proposed attack leverages the intuition that the recommender system personalizes a user’s recommendations if that user’s historical interactions are used by it. Thus, an attacker can infer membership by determining whether the recommendations are more similar to the user’s interactions or to generally popular items. We conduct extensive experiments on benchmark datasets across various recommender systems. Remarkably, our attack achieves far better attack accuracy than baselines, with low false positive rates and a much lower computational cost.

List of keywords

Multidisciplinary Topics and Applications -> MTA: Security and privacy
AI Ethics, Trust, Fairness -> ETF: Safety and robustness
AI Ethics, Trust, Fairness -> ETF: Trustworthy AI

4665

Symplectic Neural Gaussian Processes for Meta-learning Hamiltonian Dynamics

Tomoharu Iwata, Yusuke Tanaka

We propose a meta-learning method for modeling Hamiltonian dynamics from a limited amount of data. Although Hamiltonian neural networks have been successfully used for modeling dynamics that obey the energy conservation law, they require a large amount of data to achieve high performance. The proposed method meta-learns our neural network-based model using datasets from various dynamical systems, such that our model can predict vector fields of unseen systems. In our model, a system representation is inferred from a small amount of given data using an encoder network. Then, the system-specific vector field is predicted by modeling the Hamiltonian using a Gaussian process (GP) with neural network-based mean and kernel functions that depend on the inferred system representation. This GP-based Hamiltonian allows us to analytically obtain predictions that are adapted to the small data while imposing the constraint of the conservation law. The neural networks are shared across systems, which enables us to learn knowledge from multiple systems and use it for unseen systems. In our experiments, we demonstrate that the proposed method outperforms existing methods for predicting dynamics from a small number of observations in target systems.

List of keywords

Machine Learning -> ML: Meta-learning
Multidisciplinary Topics and Applications -> MTA: Physical sciences

4724

Dual Calibration-based Personalised Federated Learning

Xiaoli Tang, Han Yu, Run Tang, Chao Ren, Anran Li, Xiaoxiao Li

Personalized federated learning (PFL) is designed for scenarios with non-independent and identically distributed (non-IID) client data. Existing model mixup-based methods, one of the main approaches of PFL, can only extract either global or personalized features during training, thereby limiting effective knowledge sharing among clients. To address this limitation, we propose the Dual Calibration-based PFL (DC-PFL). It divides local models into a heterogeneous feature extractor and a homogeneous classifier. The FL server utilizes mean and covariance representations from clients’ feature extractors to train a global generalized classifier, facilitating information exchange while preserving privacy. To enhance personalization and convergence, we design a feature extractor-level calibration method with an auxiliary loss for local models to refine feature extractors using global knowledge. Furthermore, DC-PFL refines the global classifier through the global classifier-level calibration, utilizing sample representations derived from an approximate Gaussian distribution model specific to each class. This method precludes the need to transmit original data representations, further enhancing privacy preservation. Extensive experiments on widely used benchmark datasets demonstrate that DC-PFL outperforms eight state-of-the-art methods, surpassing the best-performing baseline by 1.22% and 9.22% in terms of accuracy on datasets CIFAR-10 and CIFAR-100, respectively.

List of keywords

Machine Learning -> ML: Federated learning

4725

LEAP: Optimization Hierarchical Federated Learning on Non-IID Data with Coalition Formation Game

Jianfeng Lu, Yue Chen, Shuqin Cao, Longbiao Chen, Wei Wang, Yun Xin

Although Hierarchical Federated Learning (HFL) utilizes edge servers (ESs) to alleviate communication burdens, its model performance is degraded by non-IID data and limited communication resources. Current works often assume that data is uniformly distributed, which contradicts the heterogeneity of IoT. Solutions involving additional model training to check the data distribution inevitably increase computational costs and the risk of privacy leakage. The challenges in solving these issues are how to reduce the impact of non-IID data without involving raw data, and how to rationalize the communication resource allocation to address the straggler problem. To tackle these challenges, we propose a novel optimization method based on coaLition formation gamE and grAdient Projection, called LEAP. Specifically, we combine edge data distribution with a coalition formation game to dynamically adjust the correlations between clients and ESs, ensuring optimal correlations. We further capture the client heterogeneity to achieve rational bandwidth allocation from coalition perception and determine the optimal transmission power within specified delay constraints at the client level. Experimental results on four real datasets show that LEAP achieves a 20.62% improvement in model accuracy compared to the state-of-the-art baselines. Moreover, LEAP effectively reduces transmission energy consumption by at least about 2.24 times.

List of keywords

Machine Learning -> ML: Evaluation
Agent-based and Multi-agent Systems -> MAS: Resource allocation
Game Theory and Economic Paradigms -> GTEP: Mechanism design
Machine Learning -> ML: Game Theory

4738

Stochastic Neural Simulator for Generalizing Dynamical Systems across Environments

Liu Jiaqi, Jiaxu Cui, Jiayi Yang, Bo Yang

Neural simulators for modeling complex dynamical systems have been extensively studied for various real-world applications, such as weather forecasting, ocean current prediction, and computational fluid dynamics simulation. Although they have demonstrated powerful fitting and prediction capabilities, most existing models are only built to learn single-system dynamics. Several recent studies have considered learning dynamics across environments, which can exploit the potential commonalities among the dynamics across environments and adapt to new environments. However, these methods are still prone to scarcity problems where per-environment data is sparse or limited. Therefore, we propose a novel CoNDP (Context-Informed Neural ODE Processes) to learn system dynamics from sparse observations across environments. It can fully use the contextual information of each environment to better capture the intrinsic commonalities across environments and the distinguishable differences among environments while modeling the uncertainty of system evolution, producing more accurate predictions. Intensive experiments are conducted on five complex dynamical systems in various fields. Results show that the proposed CoNDP achieves the best results compared with common neural simulators and state-of-the-art cross-environmental models.

List of keywords

Multidisciplinary Topics and Applications -> MTA: Physical sciences
Machine Learning -> ML: Time series and data streams

4747

ZeroDDI: A Zero-Shot Drug-Drug Interaction Event Prediction Method with Semantic Enhanced Learning and Dual-modal Uniform Alignment

Ziyan Wang, Zhankun Xiong, Feng Huang, Xuan Liu, Wen Zhang

Drug-drug interactions (DDIs) can result in various pharmacological changes, which can be categorized into different classes known as DDI events (DDIEs). In recent years, previously unobserved/unseen DDIEs have been emerging, posing a new classification task in which unseen classes have no labelled instances in the training stage, which is formulated as a zero-shot DDIE prediction (ZS-DDIE) task. However, existing computational methods are not directly applicable to ZS-DDIE, which has two primary challenges: obtaining suitable DDIE representations and handling the class imbalance issue. To overcome these challenges, we propose a novel method named ZeroDDI for the ZS-DDIE task. Specifically, we design a biological semantic enhanced DDIE representation learning module, which emphasizes the key biological semantics and distills discriminative molecular substructure-related semantics for DDIE representation learning. Furthermore, we propose a dual-modal uniform alignment strategy to distribute drug pair representations and DDIE semantic representations uniformly on the unit sphere and align the matched ones, which can mitigate the issue of class imbalance. Extensive experiments show that ZeroDDI surpasses the baselines and indicate that it is a promising tool for detecting unseen DDIEs. Our code has been released at https://github.com/wzy-Sarah/ZeroDDI.

List of keywords

Multidisciplinary Topics and Applications -> MTA: Bioinformatics
Multidisciplinary Topics and Applications -> MTA: Health and medicine

4750

Finding Increasingly Large Extremal Graphs with AlphaZero and Tabu Search

Abbas Mehrabian, Ankit Anand, Hyunjik Kim, Nicolas Sonnerat, Matej Balog, Gheorghe Comanici, Tudor Berariu, Andrew Lee, Anian Ruoss, Anna Bulanova, Daniel Toyama, Sam Blackwell, Bernardino Romera-Paredes, Petar Veličković, Laurent Orseau, Joonkyung Lee, Anurag Murty Naredla, Doina Precup, Adam Wagner

This work proposes a new learning-to-search benchmark and uses AI to discover new mathematical knowledge related to an open conjecture of Erdős (1975) in extremal graph theory. The problem is to find graphs with a given size (number of nodes) that maximize the number of edges without having 3- or 4-cycles. We formulate this as a sequential decision-making problem and compare AlphaZero, a neural network-guided tree search, with tabu search, a heuristic local search method. Using either method, by introducing a curriculum (jump-starting the search for larger graphs using good graphs found at smaller sizes), we improve the state-of-the-art lower bounds for several sizes. We also propose a flexible graph-generation environment and a permutation-invariant network architecture for learning to search in the space of graphs.
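
The feasibility condition in this problem (no 3- or 4-cycles) can be checked with a simple common-neighbor test. The sketch below is an illustration only, not the paper's environment. It assumes a simple undirected graph with nodes labeled 0 to n-1, and the helper name `has_no_3_or_4_cycles` is hypothetical.

```python
from itertools import combinations


def has_no_3_or_4_cycles(n, edges):
    """Return True if the simple undirected graph has no 3- or 4-cycles.

    Two distinct nodes with two or more common neighbors close a 4-cycle;
    two adjacent nodes with any common neighbor close a triangle.
    Assumes no self-loops; node labels are 0..n-1 (illustrative convention).
    """
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    for u, v in combinations(range(n), 2):
        common = len(adj[u] & adj[v])
        if common > 1:
            return False              # u-a-v-b-u is a 4-cycle
        if common == 1 and v in adj[u]:
            return False              # u-v-a-u is a triangle
    return True


# The Petersen graph: 10 nodes, 15 edges, shortest cycle of length 5.
petersen = (
    [(i, (i + 1) % 5) for i in range(5)]             # outer 5-cycle
    + [(i, i + 5) for i in range(5)]                 # spokes
    + [(5 + i, 5 + (i + 2) % 5) for i in range(5)]   # inner pentagram
)
print(len(petersen), has_no_3_or_4_cycles(10, petersen))
```

A search method then only needs to maximize the number of edges among graphs for which this check succeeds.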

List of keywords

Search -> S: Search and machine learning
Multidisciplinary Topics and Applications -> MTA: Other
Search -> S: Local search
Search -> S: Combinatorial search and optimisation

4761

Full Bayesian Significance Testing for Neural Networks in Traffic Forecasting

Zehua Liu, Jingyuan Wang, Zimeng Li, Yue He

Due to the complex and dynamic traffic contexts, the interpretability and uncertainty of traffic forecasting have gained increasing attention. Significance testing is a powerful tool in statistics used to determine whether a hypothesis is valid, facilitating the identification of pivotal features that predominantly contribute to the true relationship. However, existing works mainly regard traffic forecasting as a deterministic problem, making it challenging to perform effective significance testing. To fill this gap, we propose to conduct Full Bayesian Significance Testing for Neural Networks in Traffic Forecasting, namely ST-nFBST. A Bayesian neural network is utilized to capture the complicated traffic relationships through an optimization function resolved in the context of aleatoric uncertainty and epistemic uncertainty. Thereupon, ST-nFBST can achieve the significance testing by means of a delicate grad-based evidence value, further capturing the inherent traffic schema for better spatiotemporal modeling. Extensive experiments are conducted on METR-LA and PEMS-BAY to verify the advantages of our method in terms of uncertainty analysis and significance testing, helping the interpretability and promotion of traffic forecasting.

List of keywords

Data Mining -> DM: Mining spatial and/or temporal data
Knowledge Representation and Reasoning -> KRR: Learning and reasoning
Uncertainty in AI -> UAI: Uncertainty representations
Machine Learning -> ML: Explainable/Interpretable machine learning

4766

LLMs Can Find Mathematical Reasoning Mistakes by Pedagogical Chain-of-Thought

Zhuoxuan Jiang, Haoyuan Peng, Shanshan Feng, Fan Li, Dongsheng Li

Self-correction is emerging as a promising approach to mitigate the issue of hallucination in Large Language Models (LLMs). To facilitate effective self-correction, recent research has proposed mistake detection as its initial step. However, current literature suggests that LLMs often struggle with reliably identifying reasoning mistakes when using simplistic prompting strategies. To address this challenge, we introduce a unique prompting strategy, termed the Pedagogical Chain-of-Thought (PedCoT), which is specifically designed to guide the identification of reasoning mistakes, particularly mathematical reasoning mistakes. PedCoT consists of pedagogical principles for prompts (PPP) design, two-stage interaction process (TIP) and grounded PedCoT prompts, all inspired by the educational theory of the Bloom Cognitive Model (BCM). We evaluate our approach on two public datasets featuring math problems of varying difficulty levels. The experiments demonstrate that our zero-shot prompting strategy significantly outperforms strong baselines. The proposed method can achieve the goal of reliable mathematical mistake identification and provide a foundation for automatic math answer grading. The results underscore the significance of educational theory, serving as domain knowledge, in guiding prompting strategy design for addressing challenging tasks with LLMs effectively.

List of keywords

Knowledge Representation and Reasoning -> KRR: Diagnosis and abductive reasoning
Knowledge Representation and Reasoning -> KRR: Automated reasoning and theorem proving
Multidisciplinary Topics and Applications -> MTA: Education

4777

SCTrans: Multi-scale scRNA-seq Sub-vector Completion Transformer for Gene-selective Cell Type Annotation

Lu Lin, Wen Xue, Xindian Wei, Wenjun Shen, Cheng Liu, Si Wu, Hau San Wong

Cell type annotation is pivotal to single-cell RNA sequencing data (scRNA-seq)-based biological and medical analysis, e.g., identifying biomarkers, exploring cellular heterogeneity, and understanding disease mechanisms. The previous annotation methods typically learn a nonlinear mapping to infer cell type from gene expression vectors, and thus fall short in discovering and associating salient genes with specific cell types. To address this issue, we propose a multi-scale scRNA-seq Sub-vector Completion Transformer, and our model is referred to as SCTrans. Considering that the expressiveness of gene sub-vectors is richer than that of individual genes, we perform multi-scale partitioning on gene vectors followed by masked sub-vector completion, conditioned on unmasked ones. Toward this end, the multi-scale sub-vectors are tokenized, and the intrinsic contextual relationships are modeled via self-attention computation and conditional contrastive regularization imposed on an encoding transformer. By performing mutual learning between the encoder and an additional lightweight counterpart, the salient tokens can be distinguished from the others. As a result, we can perform gene-selective cell type annotation, which contributes to our superior performance over state-of-the-art annotation methods.

List of keywords

Multidisciplinary Topics and Applications -> MTA: Bioinformatics
Computer Vision -> CV: Applications
Data Mining -> DM: Applications

4778

RSAP-DFM: Regime-Shifting Adaptive Posterior Dynamic Factor Model for Stock Returns Prediction

Quanzhou Xiang, Zhan Chen, Qi Sun, Rujun Jiang

As the latest development in asset pricing research, how to use machine learning to improve the performance of factor models has become a topic of concern in recent years. The variability of the instantaneous macro environment brings great difficulties to quantitative investment, so an extended factor model must learn to self-adapt in order to extract the macro pattern from the massive stock volume and price information. How to continuously map the extracted macro pattern to stock investment is also an open question. To this end, we propose the first continuous regime-based dynamic factor model, RSAP-DFM, which adaptively extracts continuous macroeconomic information and completes the dynamic explicit mapping of stock returns for the first time through dual regime shifting, while the adversarial posterior factors effectively correct the mapping deviation of prior factors. In addition, our model integrates an innovative two-stage optimization algorithm and normally distributed sampling, which further enhances the robustness of the model. Results on three real stock datasets validate our model, which outperforms previous methods.

List of keywords

Multidisciplinary Topics and Applications -> MTA: Finance
Machine Learning -> ML: Applications

4787

Dialogue Cross-Enhanced Central Engagement Attention Model for Real-Time Engagement Estimation

Jun Yu, Keda Lu, Ji Zhao, Zhihong Wei, Iek-Heng Chu, Peng Chang

Real-time engagement estimation has been an important research topic in human-computer interaction in recent years. The emergence of the NOvice eXpert Interaction (NOXI) dataset, enriched with frame-wise engagement annotations, has catalyzed a surge in research efforts in this domain. Existing feature sequence partitioning methods for ultra-long videos have encountered challenges including insufficient information utilization and repetitive inference. Moreover, those studies focus mainly on the target participants’ features without taking into account those of the interlocutor. To address these issues, we propose the center-based sliding window method to obtain feature subsequences. The core of these subsequences is modeled using our innovative Central Engagement Attention Model (CEAM). Additionally, we introduce the dialogue cross-enhanced module that effectively incorporates the interlocutor’s features via cross-attention. Our proposed method outperforms the current best model, achieving a substantial gain of 1.5% in concordance correlation coefficient (CCC) and establishing a new state-of-the-art result. Our source code and model checkpoints are available at https://github.com/wujiekd/Dialogue-Cross-Enhanced-CEAM.

List of keywords

Humans and AI -> HAI: Human-computer interaction
Humans and AI -> HAI: Computer-aided education
Humans and AI -> HAI: Personalization and user modeling

4797

DeepHGS: Deep Learning Based Hybrid Genetic Search Algorithm for the Capacitated Vehicle Routing Problem

Zhou Shipei, Jin-Kao Hao, Yan Jin

This paper presents a novel deep learning based evolutionary algorithm DeepHGS for solving the well-known Capacitated Vehicle Routing Problem (CVRP). DeepHGS incorporates a deep neural network, coupling a distance-based parent selection into the popular Hybrid Genetic Search (HGS), where local search is the key search component used to improve the quality of offspring solutions produced by the crossover of the HGS algorithm. However, each local search run is time consuming due to its iterative nature. Therefore, it is critical to select the most promising solution for local improvement among the offspring solutions after each crossover application, allowing the algorithm to make better use of the given time budget and thus increase the chance of finding better solutions. For this purpose, we design a route permutation invariant neural network to predict the performance of each offspring solution, enabling the identification of the best offspring to be submitted for further improvement by local search. To ensure a high prediction accuracy of the neural network, we use an online supervised training method that operates on a periodically updated training dataset whose data are collected during the local search over multiple generations. We perform extensive experiments on 100 popular CVRP benchmark instances with 100 to 1000 customers and 6 to 207 routes. The experimental results demonstrate the effectiveness of DeepHGS, in particular outperforming the state-of-the-art algorithms on different categories of instances in terms of solution quality with the same time limit, and showing good generalization over different problem characteristics.

List of keywords

Machine Learning -> ML: Evolutionary learning
Machine Learning -> ML: Online learning
Machine Learning -> ML: Self-supervised Learning

4799

Dual Expert Distillation Network for Generalized Zero-Shot Learning

Zhijie Rao, Jingcai Guo, Xiaocheng Lu, Jingming Liang, Jie Zhang, Haozhao Wang, Kang Wei, Xiaofeng Cao

Zero-shot learning has consistently yielded remarkable progress via modeling nuanced one-to-one visual-attribute correlation. Existing studies resort to refining a uniform mapping function to align and correlate the sample regions and subattributes, ignoring two crucial issues: 1) the inherent asymmetry of attributes; and 2) the unutilized channel information. This paper addresses these issues by introducing a simple yet effective approach, dubbed Dual Expert Distillation Network (DEDN), where two experts are dedicated to coarse- and fine-grained visual-attribute modeling, respectively. Concretely, one coarse expert, namely cExp, has a complete perceptual scope to coordinate visual-attribute similarity metrics across dimensions, and moreover, another fine expert, namely fExp, consists of multiple specialized subnetworks, each corresponding to an exclusive set of attributes. The two experts cooperatively distill from each other to reach a mutual agreement during training. Meanwhile, we further equip DEDN with a newly designed backbone network, i.e., Dual Attention Network (DAN), which incorporates both region and channel attention information to fully exploit and leverage visual semantic knowledge. Extensive experiments on various benchmark datasets indicate a new state-of-the-art. Code is available at github.com/zjrao/DEDN.

List of keywords

Machine Learning -> ML: Cost-sensitive learning
Machine Learning -> ML: Few-shot learning

4807

Exploring Urban Semantics: A Multimodal Model for POI Semantic Annotation with Street View Images and Place Names

Dabin Zhang, Meng Chen, Weiming Huang, Yongshun Gong, Kai Zhao

Semantic annotation for points of interest (POIs) is the process of annotating a POI with a category label, which facilitates many services related to POIs, such as POI search and recommendation. Most of the existing solutions extract features related to POIs from abundant user-generated content data (e.g., check-ins and user comments). However, such data are often difficult to obtain, especially for newly created POIs. In this paper, we aim to explore semantic annotation for POIs with limited information such as POI (place) names and geographic locations. Additionally, we have found that the street view images provide extensive visual clues about POI attributes and could be an essential supplement to limited information of POIs that enables semantic annotation. To this end, we propose a novel multimodal model for POI semantic annotation, namely M3PA, which achieves enhanced semantic annotation through fusing a POI’s textual and visual representations. Specifically, M3PA extracts visual features from street view images using a pre-trained image encoder and integrates these features to generate the visual representation of a targeted POI based on a geographic attention mechanism. Furthermore, M3PA utilizes the contextual information of neighboring POIs to extract textual features and captures their spatial relationships through geographical encoding to generate the textual representation of a targeted POI. Finally, the visual and textual representations of a POI are fused for semantic annotation. Extensive experiments with POI data from Amap validate the effectiveness of M3PA for POI semantic annotation, compared with several competitive baselines.

List of keywords

Data Mining -> DM: Mining spatial and/or temporal data

4831

ParaILP: A Parallel Local Search Framework for Integer Linear Programming with Cooperative Evolution Mechanism

Peng Lin, Mengchuan Zou, Zhihan Chen, Shaowei Cai

The integer linear programming (ILP) problem is a fundamental research topic in operations research, and the local search method is an important class of algorithms for quickly solving many combinatorial optimization problems. With rapidly increasing computing power, parallelism turns out to be a promising approach to enhancing the efficiency of problem-solving. However, few studies have investigated parallel local search algorithms for solving the general ILP problem. We propose the first parallel local search framework (ParaILP) for solving the general ILP problem, based on two novel ideas: a new initialization method named polarity initialization to construct different initial solutions for local search threads and a cooperative evolution mechanism for managing and generating high-quality solutions using information shared by different threads. Extensive experiments demonstrate that ParaILP is significantly better than the state-of-the-art academic parallel solvers FiberSCIP and HiGHS, and is competitive with the state-of-the-art commercial parallel solver Gurobi. Experiments are also conducted to analyze the parallelization scalability and the effectiveness of our techniques.

List of keywords

Search -> S: Local search
Search -> S: Evolutionary computation

4841

Visual Attention Prompted Prediction and Learning

Yifei Zhang, Bo Pan, Siyi Gu, Guangji Bai, Meikang Qiu, Xiaofeng Yang, Liang Zhao

Visual explanation (attention)-guided learning uses not only labels but also explanations to guide the model reasoning process. While visual attention-guided learning has shown promising results, it requires a large number of explanation annotations that are time-consuming to prepare. However, in many real-world situations, it is usually desired to prompt the model with visual attention without model retraining. For example, when doing AI-assisted cancer classification on a medical image, users (e.g., clinicians) can provide the AI model with visual attention prompts on which areas are indispensable and which are precluded. Despite its promising objectives, achieving visual attention-prompted prediction presents several major challenges: 1) How can the visual prompt be effectively integrated into the model’s reasoning process? 2) How should the model handle samples that lack visual prompts? 3) What is the impact on the model’s performance when a visual prompt is imperfect? This paper introduces a novel framework for visual attention prompted prediction and learning, utilizing visual prompts to steer the model’s reasoning process. To improve performance in non-prompted situations and align it with prompted scenarios, we propose a co-training approach for both non-prompted and prompted models, ensuring they share similar parameters and activation. Additionally, for instances where the visual prompt does not encompass the entire input image, we have developed innovative attention prompt refinement methods. These methods interpolate the incomplete prompts while maintaining alignment with the model’s explanations. Extensive experiments on four datasets demonstrate the effectiveness of our proposed framework in enhancing predictions for samples both with and without prompt.

List of keywords

Machine Learning -> ML: Knowledge-aided learning
Humans and AI -> HAI: Human-AI collaboration
Machine Learning -> ML: Explainable/Interpretable machine learning
Machine Learning -> ML: Multi-task and transfer learning

4850

Integrating Intent Understanding and Optimal Behavior Planning for Behavior Tree Generation from Human Instructions

Xinglin Chen, Yishuai Cai, Yunxin Mao, Minglong Li, Wenjing Yang, Weixia Xu, Ji Wang

Robots executing tasks following human instructions in domestic or industrial environments essentially require both adaptability and reliability. Behavior Tree (BT) emerges as an appropriate control architecture for these scenarios due to its modularity and reactivity. Existing BT generation methods, however, either do not involve interpreting natural language or cannot theoretically guarantee the BTs’ success. This paper proposes a two-stage framework for BT generation, which first employs large language models (LLMs) to interpret goals from high-level instructions, then constructs an efficient goal-specific BT through the Optimal Behavior Tree Expansion Algorithm (OBTEA). We represent goals as well-formed formulas in first-order logic, effectively bridging intent understanding and optimal behavior planning. Experiments in the service robot validate the proficiency of LLMs in producing grammatically correct and accurately interpreted goals, demonstrate OBTEA’s superiority over the baseline BT Expansion algorithm in various metrics, and finally confirm the practical deployability of our framework. The project website is https://dids-ei.github.io/Project/LLM-OBTEA.

List of keywords

Robotics -> ROB: Behavior and control
Planning and Scheduling -> PS: Robot planning
Robotics -> ROB: Human robot interaction

4856

Multi-Modality Spatio-Temporal Forecasting via Self-Supervised Learning

Jiewen Deng, Renhe Jiang, Jiaqi Zhang, Xuan Song

Multi-modality spatio-temporal (MoST) data extends spatio-temporal (ST) data by incorporating multiple modalities, which is prevalent in monitoring systems, encompassing diverse traffic demands and air quality assessments. Despite significant strides in ST modeling in recent years, there remains a need to emphasize harnessing the potential of information from different modalities. Robust MoST forecasting is more challenging because it possesses (i) high-dimensional and complex internal structures and (ii) dynamic heterogeneity caused by temporal, spatial, and modality variations. In this study, we propose a novel MoST learning framework via Self-Supervised Learning, namely MoSSL, which aims to uncover latent patterns from temporal, spatial, and modality perspectives while quantifying dynamic heterogeneity. Experiment results on two real-world MoST datasets verify the superiority of our approach compared with the state-of-the-art baselines. Model implementation is available at https://github.com/beginner-sketch/MoSSL.

List of keywords

Data Mining -> DM: Mining spatial and/or temporal data
Knowledge Representation and Reasoning -> KRR: Qualitative, geometric, spatial, and temporal reasoning
Machine Learning -> ML: Time series and data streams

4866

Deciphering the Projection Head: Representation Evaluation Self-supervised Learning

Jiajun Ma, Tianyang Hu, Wenjia Wang

Self-supervised learning (SSL) aims to learn the intrinsic features of data without labels. Despite the diverse SSL architectures, the projection head always plays an important role in improving downstream task performance. In this study, we systematically investigate the role of the projection head in SSL. We find that the projection head targets the uniformity aspect, which maps samples into uniform distribution and enables the encoder to focus on extracting semantic features. Drawing on this insight, we propose a Representation Evaluation Design (RED) in SSL models in which a shortcut connection between the representation and the projection vectors is built. Our extensive experiments with different architectures (including SimCLR, MoCo-V2, and SimSiam) on various datasets demonstrate that the RED-SSL consistently outperforms their baseline counterparts in downstream tasks. Furthermore, the RED-SSL learned representations exhibit superior robustness to previously unseen augmentations and out-of-distribution data.

List of keywords

Machine Learning -> ML: Self-supervised Learning
Machine Learning -> ML: Explainable/Interpretable machine learning
Machine Learning -> ML: Representation learning

4883

Safety Constrained Multi-Agent Reinforcement Learning for Active Voltage Control

Yang Qu, Jinming Ma, Feng Wu

Active voltage control presents a promising avenue for relieving power congestion and enhancing voltage quality, taking advantage of the distributed controllable generators in the power network, such as roof-top photovoltaics. While Multi-Agent Reinforcement Learning (MARL) has emerged as a compelling approach to address this challenge, existing MARL approaches tend to overlook the constrained optimization nature of this problem, failing to guarantee safety constraints. In this paper, we formalize the active voltage control problem as a constrained Markov game and propose a safety-constrained MARL algorithm. We expand the primal-dual optimization RL method to multi-agent settings, and augment it with a novel approach of double safety estimation to learn the policy and to update the Lagrange multiplier. In addition, we propose different cost functions and investigate their influences on the behavior of our constrained MARL method. We evaluate our approach in the power distribution network simulation environment with real-world scale scenarios. Experimental results demonstrate the effectiveness of the proposed method compared with the state-of-the-art MARL methods.

List of keywords

Agent-based and Multi-agent Systems -> MAS: Applications

4888

Designing Behavior-Aware AI to Improve the Human-AI Team Performance in AI-Assisted Decision Making

Syed Hasan Amin Mahmood, Zhuoran Lu, Ming Yin

With the rapid development of decision aids that are driven by AI models, the practice of AI-assisted decision making has become increasingly prevalent. To improve the human-AI team performance in decision making, earlier studies mostly focus on enhancing humans’ capability in better utilizing a given AI-driven decision aid. In this paper, we tackle this challenge through a complementary approach—we aim to train "behavior-aware AI" by adjusting the AI model underlying the decision aid to account for humans’ behavior in adopting AI advice. In particular, as humans are observed to accept AI advice more when their confidence in their own judgement is low, we propose to train AI models with a human-confidence-based instance weighting strategy, instead of solving the standard empirical risk minimization problem. Under an assumed, threshold-based model characterizing when humans will adopt the AI advice, we first derive the optimal instance weighting strategy for training AI models. We then validate the efficacy and robustness of our proposed method in improving the human-AI joint decision making performance through systematic experimentation on synthetic datasets. Finally, via randomized experiments with real human subjects along with their actual behavior in adopting the AI advice, we demonstrate that our method can significantly improve the decision making performance of the human-AI team in practice.

List of keywords

Humans and AI -> HAI: Human-computer interaction
Humans and AI -> HAI: Human computation and crowdsourcing
Humans and AI -> HAI: Human-AI collaboration

4892

Reinforcement Nash Equilibrium Solver

Xinrun Wang, Chang Yang, Shuxin Li, Pengdeng Li, Xiao Huang, Hau Chan, Bo An

Nash Equilibrium (NE) is the canonical solution concept of game theory, which provides an elegant tool to understand the rationalities. Though mixed strategy NE exists in any game with finite players and actions, computing NE in two- or multi-player general-sum games is PPAD-Complete. Various alternative solutions, e.g., Correlated Equilibrium (CE), and learning methods, e.g., fictitious play (FP), are proposed to approximate NE. For convenience, we call these methods “inexact solvers”, or “solvers” for short. However, the alternative solutions differ from NE and the learning methods generally fail to converge to NE. Therefore, in this work, we propose REinforcement Nash Equilibrium Solver (RENES), which trains a single policy to modify the games with different sizes and applies the solvers on the modified games where the obtained solution is evaluated on the original games. Specifically, our contributions are threefold. i) We represent the games as $\alpha$-rank response graphs and leverage graph neural network (GNN) to handle the games with different sizes as inputs; ii) We use tensor decomposition, e.g., canonical polyadic (CP), to make the dimension of modifying actions fixed for games with different sizes; iii) We train the modifying strategy for games with the widely-used proximal policy optimization (PPO) and apply the solvers to solve the modified games, where the obtained solution is evaluated on original games. Extensive experiments on large-scale normal-form games show that our method can further improve the approximation of NE of different solvers, i.e., $\alpha$-rank, CE, FP and PRD, and can be generalized to unseen games.

List of keywords

Agent-based and Multi-agent Systems -> MAS: Multi-agent learning
Game Theory and Economic Paradigms -> GTEP: Noncooperative games
Machine Learning -> ML: Game Theory
Machine Learning -> ML: Reinforcement learning

4902

Making LLMs as Fine-Grained Relation Extraction Data Augmentor

Yifan Zheng, Wenjun Ke, Qi Liu, Yuting Yang, Ruizhuo Zhao, Dacheng Feng, Jianwei Zhang, Zhi Fang

Relation Extraction (RE) identifies relations between entities in text, typically relying on supervised models that demand abundant high-quality data. Various approaches, including Data Augmentation (DA), have been proposed as promising solutions for addressing low-resource challenges in RE. However, existing DA methods in RE often struggle to ensure consistency and contextual diversity in generated data due to the fine-grained nature of RE. Inspired by the extensive generative capabilities of large language models (LLMs), we introduce a novel framework named ConsistRE, aiming to maintain context consistency in RE. ConsistRE begins by collecting a substantial corpus from external resources and employing statistical algorithms and semantics to identify keyword hints closely related to relation instances. These keyword hints are subsequently integrated as contextual constraints in sentence generation, ensuring the preservation of relation dependence and diversity with LLMs. Additionally, we implement syntactic dependency selection to enhance the syntactic structure of the generated sentences. Experimental results on the SemEval, TACRED, and TACREV datasets demonstrate that ConsistRE outperforms other baselines in F1 values by 1.76%, 3.92%, and 2.53%, respectively, particularly when operating under low-resource experimental conditions.

List of keywords

Natural Language Processing -> NLP: Language generation
Natural Language Processing -> NLP: Information extraction
Natural Language Processing -> NLP: Information retrieval and text mining
Natural Language Processing -> NLP: Resources and evaluation

4906

Diversification of Adaptive Policy for Effective Offline Reinforcement Learning

Yunseon Choi, Li Zhao, Chuheng Zhang, Lei Song, Jiang Bian, Kee-Eung Kim

Offline Reinforcement Learning (RL) aims to learn policies from pre-collected datasets that capture only a subset of the environment’s dynamics. The predominant approach has been to solve a constrained optimization formulation, which ensures that the policy visits state-action pairs within the support of the offline dataset. However, this approach limits the agent's ability to make decisions when it faces unknown parts of the environment at deployment time. To address the challenge of decision-making in out-of-support regions, model-based Bayes-adaptive approaches have been proposed by considering all dynamics models that could potentially be the true environment. Since it is generally infeasible to compute the posterior of all dynamics models based on the offline dataset, these approaches usually approximate the posterior by using a finite ensemble of highly probable dynamics models. Hence, the diversity of these models is the key to obtaining good policies. In this work, we propose MoDAP (Model-based Diverse Adaptive Policy Learning), an algorithm to enable the adaptive policy to make informed decisions in previously unexplored states. MoDAP adopts an iterative strategy that simultaneously trains the policy and dynamics models. The policy optimization seeks to maximize expected returns across dynamics models, while the dynamics models are trained to promote policy diversification through the proposed information-theoretic objective. We evaluate MoDAP through experiments on the D4RL and NeoRL benchmarks, showcasing its performance superiority over state-of-the-art algorithms.

List of keywords

Machine Learning -> ML: Offline reinforcement learning
Machine Learning -> ML: Model-based and model learning reinforcement learning

4921

CONC: Complex-noise-resistant Open-set Node Classification with Adaptive Noise Detection

Qin Zhang, Jiexin Lu, Xiaowei Li, Huisi Wu, Shirui Pan, Junyang Chen

Node classification is a popular graph learning task, where the goal is to label nodes based on their features and connections. However, an important challenge for its application in real-world scenarios is the presence of newly-emerged out-of-distribution samples and noisy samples, which affect the quality and robustness of learned classifiers. Out-of-distribution (OOD) samples are often found in both the training and testing phases. They are samples that do not belong to any known classes. These OOD samples are outliers if they occur in training (OOD noise), and open-set samples if they occur in testing. Meanwhile, in-distribution (IND) noisy data, i.e., known class samples with wrong labels, are also prevalent and inevitably degrade a model’s performance. The problem of open set learning with complex IND and OOD noise has not been sufficiently explored so far, and it becomes even more difficult for non-IID graph data. To address these challenges, this paper proposes a novel complex-noise-resistant open-set node classification method, for open-set graph data with both IND and OOD noisy nodes. Specifically, a trustworthiness learner is adopted to learn the trustworthiness rates of the feature and label for each node while a decoder and an open-set classifier are trained to reconstruct the structure of a node and to predict its category simultaneously with the guidance of node trustworthiness. Experimental evaluations of CONC demonstrate its superiority.

List of keywords

Machine Learning -> ML: Classification
Data Mining -> DM: Anomaly/outlier detection
Data Mining -> DM: Applications

4923

LeRet: Language-Empowered Retentive Network for Time Series Forecasting

Qihe Huang, Zhengyang Zhou, Kuo Yang, Gengyu Lin, Zhongchao Yi, Yang Wang

Time series forecasting (TSF) plays a pivotal role in many real-world applications. Recently, the utilization of Large Language Models (LLM) in TSF has demonstrated exceptional predictive performance, surpassing most task-specific forecasting models. The success of LLM-based forecasting methods underscores the importance of causal dependence modeling and pre-trained knowledge transfer. However, challenges persist in directly applying LLM to TSF, i.e., the unacceptable parameter scales for resource-intensive model optimization, and the significant gap of feature space between structural numerical time series and natural language. To this end, we propose LeRet, a Language-empowered Retentive network for TSF. Technically, inspired by the causal extraction in LLM, we propose a causal dependence learner, enhanced by a patch-level pre-training task, to capture sequential causal evolution. To minimize the gap between numeric data and language, we initialize a language description protocol for time series and design a TS-related language knowledge extractor to learn from language description, avoiding training with large-scale parameters. Finally, we devise a Language-TS Modality Integrator for the fusion of the two types of data, enabling language-empowered sequence forecasting. Extensive evaluations demonstrate the effectiveness of LeRet and, in particular, its superiority on few-shot and zero-shot forecasting tasks.

List of keywords

Machine Learning -> ML: Time series and data streams
Data Mining -> DM: Applications
Machine Learning -> ML: Applications
Machine Learning -> ML: Regression

4926

Purpose Enhanced Reasoning through Iterative Prompting: Uncover Latent Robustness of ChatGPT on Code Comprehension

Yi Wang, Qidong Zhao, Dongkuan Xu, Xu Liu

Code comments are crucial for gaining in-depth insights to facilitate code comprehension. The key to obtaining these insights lies in precisely summarizing the main purpose of the code. Recent approaches on code comment generation lie in prompting large language models (LLMs) such as ChatGPT, instead of training/fine-tuning specific models. Although ChatGPT demonstrates an impressive performance in code comprehension, it still suffers from robustness challenges in consistently producing high-quality code comments. This is because ChatGPT prioritizes the semantics of code tokens, which makes it vulnerable to commonly encountered benign perturbations such as variable name replacements. This study proposes a modular prompting paradigm Perthept to effectively mitigate the negative effects caused by such minor perturbations. Perthept iteratively enhances the reasoning depth to reach the main purpose of the code. Perthept demonstrates robustness under the scenario where there is stochasticity or unreliability in ChatGPT’s responses. We give a comprehensive evaluation across four public datasets to show the consistent robustness improvement with our proposed methodology over other models.

List of keywords

Natural Language Processing -> NLP: Summarization
Natural Language Processing -> NLP: Tools

4932

Self-adaptive PSRO: Towards an Automatic Population-based Game Solver

Pengdeng Li, Shuxin Li, Chang Yang, Xinrun Wang, Xiao Huang, Hau Chan, Bo An

Policy-Space Response Oracles (PSRO) as a general algorithmic framework has achieved state-of-the-art performance in learning equilibrium policies of two-player zero-sum games. However, the hand-crafted hyperparameter value selection in most of the existing works requires extensive domain knowledge, forming the main barrier to applying PSRO to different games. In this work, we make the first attempt to investigate the possibility of self-adaptively determining the optimal hyperparameter values in the PSRO framework. Our contributions are three-fold: (1) Using several hyperparameters, we propose a parametric PSRO that unifies the gradient descent ascent (GDA) and different PSRO variants. (2) We propose the self-adaptive PSRO (SPSRO) by casting the hyperparameter value selection of the parametric PSRO as a hyperparameter optimization (HPO) problem where our objective is to learn an HPO policy that can self-adaptively determine the optimal hyperparameter values during the running of the parametric PSRO. (3) To overcome the poor performance of online HPO methods, we propose a novel offline HPO approach to optimize the HPO policy based on the Transformer architecture. Experiments on various two-player zero-sum games demonstrate the superiority of SPSRO over different baselines.

List of keywords

Agent-based and Multi-agent Systems -> MAS: Multi-agent learning
Game Theory and Economic Paradigms -> GTEP: Noncooperative games
Machine Learning -> ML: Game Theory
Machine Learning -> ML: Hyperparameter optimization

4944

CF-Deformable DETR: An End-to-End Alignment-Free Model for Weakly Aligned Visible-Infrared Object Detection

Haolong Fu, Jin Yuan, Guojin Zhong, Jiacheng Lin, Xuan He, Zhiyong Li

Weakly aligned visible-infrared object detection poses significant challenges due to the imprecise alignment between visible and infrared images. Most existing methods explore the alignment strategies between visible and infrared images, incurring prohibitive computation costs. This paper proposes the first end-to-end alignment-free architecture, Cross-modal Fusion Deformable DEtection TRansformer (“CF-Deformable DETR”), for weakly aligned visible-infrared object detection. Abandoning the traditional image alignment, CF-Deformable DETR introduces a simple yet effective cross-modal deformable attention mechanism to directly implement automatic cross-modal point mapping, generating well-aligned bimodal features with high efficiency. Moreover, we design a Point-level Feature Consistency Loss to guide the cross-modal point mapping, ensuring the consistency of paired features to support the subsequent fusion. Extensive experiments are conducted on three benchmark datasets. The experimental results demonstrate that CF-Deformable DETR achieves comparable accuracy on weakly aligned and strictly aligned data and maintains stable performance, to a certain extent, against various offset degrees of weakly aligned data.

List of keywords

Computer Vision -> CV: Recognition (object detection, categorization)
Computer Vision -> CV: Multimodal learning

4959

Pareto Inverse Reinforcement Learning for Diverse Expert Policy Generation

Woo Kyung Kim, Minjong Yoo, Honguk Woo

Data-driven offline reinforcement learning and imitation learning approaches have been gaining popularity in addressing sequential decision-making problems. Yet, these approaches rarely consider learning Pareto-optimal policies from a limited pool of expert datasets. This becomes particularly marked due to practical limitations in obtaining comprehensive datasets for all preferences, where multiple conflicting objectives exist and each expert might hold a unique optimization preference for these objectives. In this paper, we adapt inverse reinforcement learning (IRL) by using reward distance estimates for regularizing the discriminator. This enables progressive generation of a set of policies that accommodate diverse preferences on the multiple objectives, while using only two distinct datasets, each associated with a different expert preference. In doing so, we present a Pareto IRL framework (ParIRL) that establishes a Pareto policy set from these limited datasets. In the framework, the Pareto policy set is then distilled into a single, preference-conditioned diffusion model, thus allowing users to immediately specify which expert’s patterns they prefer. Through experiments, we show that ParIRL outperforms other IRL algorithms for various multi-objective control tasks, achieving the dense approximation of the Pareto frontier. We also demonstrate the applicability of ParIRL with autonomous driving in CARLA.

List of keywords

Machine Learning -> ML: Reinforcement learning

4979

A Self-explaining Neural Architecture for Generalizable Concept Learning

Sanchit Sinha, Guangzhi Xiong, Aidong Zhang

With the wide proliferation of Deep Neural Networks in high-stake applications, there is a growing demand for explainability behind their decision-making process. Concept learning models attempt to learn high-level ‘concepts’ – abstract entities that align with human understanding, and thus provide interpretability to DNN architectures. However, in this paper, we demonstrate that present SOTA concept learning approaches suffer from two major problems – lack of concept fidelity wherein the models fail to learn consistent concepts among similar classes and limited concept interoperability wherein the models fail to generalize learned concepts to new domains for the same task. Keeping these in mind, we propose a novel self-explaining architecture for concept learning across domains which – i) incorporates a new concept saliency network for representative concept selection, ii) utilizes contrastive learning to capture representative domain invariant concepts, and iii) uses a novel prototype-based concept grounding regularization to improve concept alignment across domains. We demonstrate the efficacy of our proposed approach over current SOTA concept learning approaches on four widely used real-world datasets. Empirical results show that our method improves both concept fidelity measured through concept overlap and concept interoperability measured through domain adaptation performance. An appendix of the paper with more comprehensive results can also be viewed at https://arxiv.org/abs/2405.00349.

List of keywords

AI Ethics, Trust, Fairness -> ETF: Explainability and interpretability
AI Ethics, Trust, Fairness -> ETF: Trustworthy AI
Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning
Machine Learning -> ML: Explainable/Interpretable machine learning

4997

Personalized Federated Learning for Cross-city Traffic Prediction

Yu Zhang, Hua Lu, Ning Liu, Yonghui Xu, Qingzhong Li, Lizhen Cui

Traffic prediction is an important problem in urban computing. However, many cities face data scarcity due to low levels of urban development. Although many approaches propose to transfer knowledge from data-rich cities to data-scarce cities, the centralized training paradigm cannot uphold data privacy. To bridge this gap, we propose a novel personalized Federated learning method for Cross-city Traffic Prediction (pFedCTP). It learns traffic knowledge from multiple data-rich source cities and transfers the knowledge to a data-scarce target city while preserving inter-city data privacy. Specifically, we design an ST-Net to extract spatial structure features, spatio-temporal knowledge, and traffic patterns. The ST-Net is decoupled to learn common traffic patterns in order to mitigate the effect of cross-city spatial structural differences. In addition, pFedCTP adaptively aggregates the layer-wise global and local parameters to deal with data heterogeneity across cities. Extensive experiments on four real-world traffic datasets demonstrate significant advantages of pFedCTP over seven state-of-the-art methods. pFedCTP reduces average MAE and RMSE by 1.9% and 0.8%, respectively, compared to the best-performing baseline.

List of keywords

Machine Learning -> ML: Federated learning
Data Mining -> DM: Mining spatial and/or temporal data
Multidisciplinary Topics and Applications -> MTA: Transportation

5016

A Coarse-to-Fine Fusion Network for Event-Based Image Deblurring

Huan Li, Xingyu Gao, Hailong Shi

Event-driven image deblurring is an innovative approach that takes event streams obtained from an event camera as input alongside blurred frames to facilitate the deblurring process. Unlike conventional cameras, event cameras exhibit low latency and are immune to motion blur, which has led to significant advancements in image deblurring. In this study, we present a pioneering event-based coarse-to-fine image deblurring network named CFFNet. In contrast to existing deblurring methods, our approach incorporates event data, generating multiple coarse frames from a single frame before further refining them into a clear image. We introduce an Event Image Fusion Block (EIFB) for the coarse fusion of events and images, producing coarse frames at different time intervals. Additionally, we propose a Bidirectional Frame Fusion Block (BFFB) for the fine fusion of coarse frames. CFFNet effectively harnesses the spatiotemporal information of event data through a comprehensive fusion process from coarse to fine. Evaluation results on the GoPro and REBlur datasets affirm that our method attains state-of-the-art performance in image deblurring tasks.

List of keywords

Computer Vision -> CV: Applications

5025

HyQ: Hardware-Friendly Post-Training Quantization for CNN-Transformer Hybrid Networks

Nam Joon Kim, Jongho Lee, Hyun Kim

Hybrid models that combine CNNs and ViTs have recently emerged as state-of-the-art computer vision models. To efficiently deploy these hybrid models on resource-constrained mobile/edge devices, quantization is emerging as a promising solution. However, post-training quantization (PTQ), which does not require retraining or labeled data, has not been extensively studied for hybrid models. In this study, we propose a novel PTQ technique specialized for CNN-transformer hybrid models by considering the hardware design of hybrid models on AI accelerators such as GPUs and FPGAs. First, we introduce quantization-aware distribution scaling to address the large outliers caused by inter-channel variance in convolution layers. Furthermore, in the transformer block, we propose approximating the integer-only softmax with a linear function. This approach allows us to avoid costly FP32/INT32 multiplications, resulting in more efficient computations. Experimental results show that the proposed quantization method with INT8 precision demonstrated a 0.39% accuracy drop compared with the FP32 baseline on MobileViT-s with the ImageNet-1k dataset. Furthermore, when implemented on the FPGA platform, the proposed linear softmax achieved significant resource savings, reducing the look-up table and flip-flop usage by 1.8 ~ 2.1x and 1.3 ~ 1.9x, respectively, compared with the existing second-order polynomial approximation. The code is available at https://github.com/IDSL-SeoulTech/HyQ.

List of keywords

Machine Learning -> ML: Optimization
Computer Vision -> CV: Machine learning for vision
Computer Vision -> CV: Recognition (object detection, categorization)
Machine Learning -> ML: Deep learning architectures

5040

MetaJND: A Meta-Learning Approach for Just Noticeable Difference Estimation

Miaohui Wang, Yukuan Zhu, Rong Zhang, Wuyuan Xie

The modeling of just noticeable difference (JND) in supervised learning for visual signals has made significant progress. However, existing JND models often suffer from limited generalization due to the need for large-scale training data and their restriction to certain image types. Moreover, these models primarily focus on a single RGB modality, ignoring the potential complementary impacts of multiple modalities. To address these challenges, we propose a new meta-learning approach for JND modeling, called MetaJND. We introduce two key visually sensitive modalities, saliency and depth, and leverage a self-attention mechanism to capture the interdependence of multi-modal features. Additionally, we incorporate meta-learning for the modality alignment, facilitating dynamic weight generation. Furthermore, we perform hierarchical fusion through multi-layer channel and spatial feature rectification. Experimental results on four benchmark datasets demonstrate the effectiveness of MetaJND. Moreover, we have also evaluated its performance in compression and watermarking applications, observing higher bit-rate savings and better watermark hiding capabilities.

List of keywords

Humans and AI -> HAI: Cognitive modeling
Humans and AI -> HAI: Applications
Humans and AI -> HAI: Cognitive systems

5041

Learning Robust Classifiers with Self-Guided Spurious Correlation Mitigation

Guangtao Zheng, Wenqian Ye, Aidong Zhang

Deep neural classifiers tend to rely on spurious correlations between spurious attributes of inputs and targets to make predictions, which could jeopardize their generalization capability. Training classifiers robust to spurious correlations typically relies on annotations of spurious correlations in data, which are often expensive to get. In this paper, we tackle an annotation-free setting and propose a self-guided spurious correlation mitigation framework. Our framework automatically constructs fine-grained training labels tailored for a classifier obtained with empirical risk minimization to improve its robustness against spurious correlations. The fine-grained training labels are formulated with different prediction behaviors of the classifier identified in a novel spuriousness embedding space. We construct the space with automatically detected conceptual attributes and a novel spuriousness metric which measures how likely a class-attribute correlation is exploited for predictions. We demonstrate that training the classifier to distinguish different prediction behaviors reduces its reliance on spurious correlations without knowing them a priori and outperforms prior methods on five real-world datasets.

List of keywords

Machine Learning -> ML: Robustness
Machine Learning -> ML: Knowledge-aided learning

5055

Detector Collapse: Backdooring Object Detection to Catastrophic Overload or Blindness in the Physical World

Hangtao Zhang, Shengshan Hu, Yichen Wang, Leo Yu Zhang, Ziqi Zhou, Xianlong Wang, Yanjun Zhang, Chao Chen

Object detection tasks, crucial in safety-critical systems like autonomous driving, focus on pinpointing object locations. These detectors are known to be susceptible to backdoor attacks. However, existing backdoor techniques have primarily been adapted from classification tasks, overlooking deeper vulnerabilities specific to object detection. This paper bridges this gap by introducing Detector Collapse (DC), a brand-new backdoor attack paradigm tailored for object detection. DC is designed to instantly incapacitate detectors (i.e., severely impairing the detector’s performance and culminating in a denial-of-service). To this end, we develop two innovative attack schemes: Sponge for triggering widespread misidentifications and Blinding for rendering objects invisible. Remarkably, we introduce a novel poisoning strategy exploiting natural objects, enabling DC to act as a practical backdoor in real-world environments. Our experiments on different detectors across several benchmarks show a significant improvement (~10%-60% absolute and ~2-7x relative) in attack efficacy over state-of-the-art attacks.

List of keywords

Computer Vision -> CV: Recognition (object detection, categorization)
AI Ethics, Trust, Fairness -> ETF: Trustworthy AI
AI Ethics, Trust, Fairness -> ETF: Safety and robustness

5059

Rethinking the Effectiveness of Graph Classification Datasets in Benchmarks for Assessing GNNs

Zhengdao Li, Yong Cao, Kefan Shuai, Yiming Miao, Kai Hwang

Graph classification benchmarks, vital for assessing and developing graph neural network (GNN) models, have recently been scrutinized, as simple methods like MLPs have demonstrated comparable performance. This leads to an important question: Do these benchmarks effectively distinguish the advancements of GNNs over other methodologies? If so, how do we quantitatively measure this effectiveness? In response, we first propose an empirical protocol based on a fair benchmarking framework to investigate the performance discrepancy between simple methods and GNNs. We further propose a novel metric to quantify the dataset effectiveness by considering both dataset complexity and model performance. To the best of our knowledge, our work is the first to thoroughly study and provide an explicit definition for dataset effectiveness in the graph learning area. Through testing across 16 real-world datasets, we found our metric to align with existing studies and intuitive assumptions. Finally, we explore the causes behind the low effectiveness of certain datasets by investigating the correlation between intrinsic graph properties and class labels, and we developed a novel technique supporting the correlation-controllable synthetic dataset generation. Our findings shed light on the current understanding of benchmark datasets, and our new platform could fuel the future evolution of graph classification benchmarks.

List of keywords

Data Mining -> DM: Mining graphs
Machine Learning -> ML: Classification
Machine Learning -> ML: Representation learning
Machine Learning -> ML: Supervised Learning

5099

BadFusion: 2D-Oriented Backdoor Attacks against 3D Object Detection

Saket Sanjeev Chaturvedi, Lan Zhang, Wenbin Zhang, Pan He, Xiaoyong Yuan

3D object detection plays an important role in autonomous driving; however, its vulnerability to backdoor attacks has become evident. By injecting “triggers” to poison the training dataset, backdoor attacks manipulate the detector’s prediction for inputs containing these triggers. Existing backdoor attacks against 3D object detection primarily poison 3D LiDAR signals, where large-sized 3D triggers are injected to ensure their visibility within the sparse 3D space, rendering them easy to detect and impractical in real-world scenarios. In this paper, we delve into the robustness of 3D object detection, exploring a new backdoor attack surface through 2D cameras. Given the prevalent adoption of camera and LiDAR signal fusion for high-fidelity 3D perception, we investigate the latent potential of camera signals to disrupt the process. Although the dense nature of camera signals enables the use of nearly imperceptible small-sized triggers to mislead 2D object detection, realizing 2D-oriented backdoor attacks against 3D object detection is non-trivial. The primary challenge emerges from the fusion process that transforms camera signals into a 3D space, compromising the association between the 2D trigger and the target output. To tackle this issue, we propose an innovative 2D-oriented backdoor attack against LiDAR-camera fusion methods for 3D object detection, named BadFusion, which preserves trigger effectiveness throughout the entire fusion process. The evaluation demonstrates the effectiveness of BadFusion, achieving a significantly higher attack success rate compared to existing 2D-oriented attacks.

List of keywords

AI Ethics, Trust, Fairness -> ETF: Trustworthy AI
AI Ethics, Trust, Fairness -> ETF: Safety and robustness
Computer Vision -> CV: Adversarial learning, adversarial attack and defense methods
Computer Vision -> CV: 3D computer vision

5109

ReinforceNS: Reinforcement Learning-based Multi-start Neighborhood Search for Solving the Traveling Thief Problem

Tao Wu, Huachao Cui, Tao Guan, Yuesong Wang, Yan Jin

The Traveling Thief Problem (TTP) is a challenging combinatorial optimization problem with broad practical applications. TTP combines two NP-hard problems: the Traveling Salesman Problem (TSP) and Knapsack Problem (KP). While a number of machine learning and deep learning based algorithms have been developed for TSP and KP, there is limited research dedicated to TTP. In this paper, we present the first reinforcement learning based multi-start neighborhood search algorithm, denoted by ReinforceNS, for solving TTP. To accelerate the search, we employ a pre-processing procedure for neighborhood reduction. A TSP routing and an iterated greedy packing are independently utilized to construct a high-quality initial solution, further improved by a reinforcement learning based neighborhood search. Additionally, a post-optimization procedure is devised for continued solution improvement. We conduct extensive experiments on 60 commonly used benchmark instances with 76 to 33810 cities in the literature. The experimental results demonstrate that our proposed ReinforceNS algorithm outperforms three state-of-the-art algorithms in terms of solution quality with the same time limit. In particular, ReinforceNS achieves 12 new results for 18 instances publicly reported in a recent TTP competition. We also perform an additional experiment to validate the effectiveness of the reinforcement learning strategy.

List of keywords

Search -> S: Search and machine learning
Machine Learning -> ML: Reinforcement learning
Search -> S: Heuristic search

5130

Cross-Domain Few-Shot Semantic Segmentation via Doubly Matching Transformation

Jiayi Chen, Rong Quan, Jie Qin

Cross-Domain Few-shot Semantic Segmentation (CD-FSS) aims to train generalized models that can segment classes from different domains with a few labeled images. Previous works have proven the effectiveness of feature transformation in addressing CD-FSS. However, they completely rely on support images for feature transformation, and repeatedly utilizing a few support images for each class may easily lead to overfitting and overlooking intra-class appearance differences. In this paper, we propose a Doubly Matching Transformation-based Network (DMTNet) to solve the above issue. Instead of completely relying on support images, we propose Self-Matching Transformation (SMT) to construct query-specific transformation matrices based on query images themselves to transform domain-specific query features into domain-agnostic ones. Calculating query-specific transformation matrices can prevent overfitting, especially in the meta-testing stage, where only one or several images are used as support images to segment hundreds or thousands of images. After obtaining domain-agnostic features, we exploit a Dual Hypercorrelation Construction (DHC) module to explore the hypercorrelations between the query image and the foreground and background of the support image, based on which foreground and background prediction maps are generated and supervised, respectively, to enhance the segmentation result. In addition, we propose a Test-time Self-Finetuning (TSF) strategy to more accurately self-tune the query prediction in unseen domains. Extensive experiments on four popular datasets show that DMTNet achieves superior performance over state-of-the-art approaches. Code is available at https://github.com/ChenJiayi68/DMTNet.

List of keywords

Computer Vision -> CV: Segmentation
Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning

5131

Reconstructing Missing Variables for Multivariate Time Series Forecasting via Conditional Generative Flows

Xuanming Hu, Wei Fan, Haifeng Chen, Pengyang Wang, Yanjie Fu

The Variable Subset Forecasting (VSF) problem, where the majority of variables are unavailable during the inference stage of multivariate forecasting, has been an important but under-explored task with broad impacts in many real-world applications. Missing values, absent inter-correlation, and the impracticality of retraining hinder the ability of multivariate forecasting models to capture inherent relationships among variables, impacting their performance. Nevertheless, existing approaches aimed at addressing these issues either rely heavily on local temporal correlation or fall short of fully recovering missing information from the unavailable subset, while simultaneously introducing significant computational costs. To tackle these problems, we propose a novel density estimation solution to recover the information of missing variables via a flow-based generative framework. In particular, a novel generative network for time series, namely Time-series Reconstruction Flows (TRF), is proposed to estimate and reconstruct the missing subset. In addition, we design Variable-Agnostic Meta Learning as the training framework to improve the generalization ability of TRF. Finally, extensive experiments are conducted to demonstrate the consistent superiority of our proposed method compared with baseline methods.

List of keywords

Data Mining -> DM: Mining spatial and/or temporal data

5143

Vertical Symbolic Regression via Deep Policy Gradient

Nan Jiang, Md Nasim, Yexiang Xue

Vertical Symbolic Regression (VSR) has recently been proposed to expedite the discovery of symbolic equations with many independent variables from experimental data. VSR reduces the search space by following a vertical discovery path, building from reduced-form equations involving a subset of independent variables to full-fledged ones. Having proved successful in many symbolic regressors, deep neural networks are expected to further scale up VSR. Nevertheless, directly combining VSR with deep neural networks results in difficulty in passing gradients and other engineering issues. We propose Vertical Symbolic Regression using Deep Policy Gradient (VSR-DPG) and demonstrate that VSR-DPG can recover ground-truth equations involving multiple input variables, significantly beyond both deep reinforcement learning-based approaches and previous VSR variants. Our VSR-DPG models symbolic regression as a sequential decision-making process, in which equations are built from repeated applications of grammar rules. The integrated deep model is trained to maximize a policy gradient objective. Experimental results demonstrate that our VSR-DPG significantly outperforms popular baselines in identifying both algebraic equations and differential equations on a series of benchmarks.

List of keywords

Multidisciplinary Topics and Applications -> MTA: Physical sciences
Machine Learning -> ML: Symbolic methods

5149

EC-SNN: Splitting Deep Spiking Neural Networks for Edge Devices

Di Yu, Xin Du, Linshan Jiang, Wentao Tong, Shuiguang Deng

Deep Spiking Neural Networks (SNNs), as an advanced form of SNNs characterized by their multi-layered structure, have recently achieved significant breakthroughs in performance across various domains. The biological plausibility and energy efficiency of SNNs naturally align with the requisites of edge computing (EC) scenarios, thereby prompting increased interest among researchers in migrating these deep SNN models onto edge devices such as sensors and smartphones. However, this migration has been notably challenging due to the substantial increase in model parameters and the demanding computational requirements in practical applications. In this work, we propose a deep SNN splitting framework named EC-SNN to run intricate SNN models on edge devices. We first partition the full SNN models into smaller sub-models to allocate their model parameters on multiple edge devices. Then, we provide a channel-wise pruning method to reduce the size of each sub-model, thereby further reducing the computational load. We design extensive experiments on six datasets (i.e., four non-neuromorphic and two neuromorphic datasets) to substantiate that our approach can significantly diminish the inference execution latency on edge devices and reduce the overall energy consumption per deployed device, with average reductions of 60.7% and 27.7%, respectively, while maintaining accuracy.

List of keywords

Machine Learning -> ML: Applications
Humans and AI -> HAI: Applications
Machine Learning -> ML: Deep learning architectures
Machine Learning -> ML: Feature extraction, selection and dimensionality reduction

5158

VF-Detector: Making Multi-Granularity Code Changes on Vulnerability Fix Detector Robust to Mislabeled Changes

Zhenkan Fu, Shikai Guo, Hui Li, Rong Chen, Xiaochen Li, He Jiang

As software development projects increasingly rely on open-source software, users face the risk of security vulnerabilities from third-party libraries. To address label and character noise in code changes, we present VF-Detector to automatically identify bug-fix commits in real, noisy development environments. VF-Detector consists of three components: Data Pre-processing (DP), Vulnerability Confidence Computation (VCC) and Confidence Learning Denoising (CLD). The DP component is responsible for preprocessing code change data. The VCC component calculates a code change confidence value for each bug-fix by extracting features at various granularity levels. The CLD component removes noise and enhances model robustness by pruning noisy data with confidence values and performing effort-aware adjustments. Experimental results demonstrate VF-Detector’s superiority over state-of-the-art methods in the EffortCost@L and Popt@L metrics on Java and Python datasets. The improvements were 6.5% and 5% for Java, and 23.4% and 17.8% for Python.

List of keywords

Multidisciplinary Topics and Applications -> MTA: Software engineering
Agent-based and Multi-agent Systems -> MAS: Trust and reputation
Data Mining -> DM: Applications
Machine Learning -> ML: Applications

5189

SVD-AE: Simple Autoencoders for Collaborative Filtering

Seoyoung Hong, Jeongwhan Choi, Yeon-Chang Lee, Srijan Kumar, Noseong Park

Collaborative filtering methods for recommendation systems have been extensively researched, ranging from matrix factorization and autoencoder-based methods to graph filtering-based methods. In particular, lightweight methods that require almost no training have been recently proposed to reduce the overall computation – for example, designing a linear autoencoder model using a closed-form solution. Despite their successes, existing methods include heuristic techniques and still have room to improve the trade-offs among accuracy, efficiency, and robustness. In particular, there are no well-designed closed-form studies for balanced collaborative filtering in terms of the aforementioned trade-offs. In this paper, we design SVD-AE, a simple yet effective singular value decomposition (SVD)-based linear autoencoder for collaborative filtering, whose closed-form solution can be defined based on SVD. Since its closed-form solution can be calculated at once, our proposed method does not involve any iterative training processes. Furthermore, given the noisy nature of the rating matrix, we explore the robustness of existing collaborative filtering methods and our SVD-AE against such noisy interactions. As a result, we demonstrate that our simple design choice based on truncated SVD can be used to strengthen the noise robustness of the recommendation while improving efficiency. In the end, we conclude that our method offers the best overall balance among recommendation accuracy, computation time, and robustness.
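
As an illustration of what a closed-form, training-free linear autoencoder built on truncated SVD can look like, the sketch below projects an interaction matrix onto its top-k right singular vectors. This is a generic construction and an assumption on our part, not the formula from the SVD-AE paper; the function name `truncated_svd_autoencoder`, the toy matrix `R`, and the rank `k` are all hypothetical.

```python
import numpy as np


def truncated_svd_autoencoder(R, k):
    """Return an item-item matrix B = V_k V_k^T (illustrative, not the paper's formula).

    R @ B is then the best rank-k approximation of R in the least-squares
    sense, obtained in one shot with no iterative training.
    """
    # Economy-size SVD: R = U diag(s) Vt, singular values sorted descending.
    _, _, Vt = np.linalg.svd(R, full_matrices=False)
    Vk = Vt[:k].T            # top-k right singular vectors, shape (items, k)
    return Vk @ Vk.T         # projection onto the top-k item subspace


# Toy user-item interaction matrix (rows: users, columns: items).
R = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
], dtype=float)

B = truncated_svd_autoencoder(R, k=2)
scores = R @ B               # dense scores used to rank unseen items per user
```

Keeping only the leading singular directions discards the low-energy components of the matrix, which is one common intuition for why a truncated decomposition can be less sensitive to noisy interactions.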

List of keywords

Data Mining -> DM: Collaborative filtering
Multidisciplinary Topics and Applications -> MTA: Web and social networks

5196

Why Only Text: Empowering Vision-and-Language Navigation with Multi-modal Prompts

Haodong Hong, Sen Wang, Zi Huang, Qi Wu, Jiajun Liu

Current Vision-and-Language Navigation (VLN) tasks mainly employ textual instructions to guide agents. However, being inherently abstract, the same textual instruction can be associated with different visual signals, causing severe ambiguity and limiting the transfer of prior knowledge in the vision domain from the user to the agent. To fill this gap, we propose Vision-and-Language Navigation with Multi-modal Prompts (VLN-MP), a novel task augmenting traditional VLN by integrating both natural language and images in instructions. VLN-MP not only maintains backward compatibility by effectively handling text-only prompts but also consistently shows advantages with different quantities and relevance of visual prompts. Possible forms of visual prompts include both exact and similar object images, providing adaptability and versatility in diverse navigation scenarios. To evaluate VLN-MP under a unified framework, we implement a new benchmark that offers: (1) a training-free pipeline to transform textual instructions into multi-modal forms with landmark images; (2) diverse datasets with multi-modal instructions for different downstream tasks; (3) a novel module designed to process various image prompts for seamless integration with state-of-the-art VLN models. Extensive experiments on four VLN benchmarks (R2R, RxR, REVERIE, CVDN) show that incorporating visual prompts would significantly boost navigation performance. While maintaining efficiency with text-only prompts, VLN-MP enables agents to navigate in the pre-explore setting and outperform text-based models, showing its broader applicability. Code is available at https://github.com/honghd16/VLN-MP.

List of keywords

Computer Vision -> CV: Vision, language and reasoning
Computer Vision -> CV: Multimodal learning
Machine Learning -> ML: Multi-modal learning

5198

A Transformer-Based Adaptive Prototype Matching Network for Few-Shot Semantic Segmentation

Sihan Chen, Yadang Chen, Yuhui Zheng, Zhi-Xin Yang, Enhua Wu

Few-shot semantic segmentation (FSS) aims to generate a model for segmenting novel classes using a limited number of annotated samples. Previous FSS methods have shown sensitivity to background noise due to inherent bias, attention bias, and spatial-aware bias. In this study, we propose a Transformer-Based Adaptive Prototype Matching Network to establish robust matching relationships by improving the semantic and spatial perception of query features. The model includes three modules: target enhancement module (TEM), dual constraint aggregation module (DCAM), and dual classification module (DCM). In particular, TEM mitigates inherent bias by exploring the relevance of multi-scale local context to enhance foreground features. Then, DCAM addresses attention bias through the dual semantic-aware attention mechanism to strengthen constraints. Finally, the DCM module decouples the segmentation task into semantic alignment and spatial alignment to alleviate spatial-aware bias. Extensive experiments on PASCAL-5i and COCO-20i confirm the effectiveness of our approach.

List of keywords

Computer Vision -> CV: Segmentation
Computer Vision -> CV: Representation learning
Computer Vision -> CV: Scene analysis and understanding
Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning

5205

PHSIC against Random Consistency and Its Application in Causal Inference

Jue Li, Yuhua Qian, Jieting Wang, Saixiong Liu

The Hilbert-Schmidt Independence Criterion (HSIC) based on kernel functions is capable of detecting nonlinear dependencies between variables, making it a common method for association relationship mining. However, in situations with small samples, high dimensions, or noisy data, it may generate spurious associations, assigning non-negligible scores to two unrelated variables. To address this issue, we propose a novel criterion, named the Pure Hilbert-Schmidt Independence Criterion (PHSIC). PHSIC is obtained by subtracting the mean HSIC obtained under random conditions from the original HSIC value. We demonstrate three significant advantages of PHSIC through theoretical analysis and simulation experiments: (1) PHSIC has a baseline of zero, enhancing the interpretability of HSIC. (2) Compared to HSIC, PHSIC exhibits lower bias. (3) PHSIC enables a fairer comparison across different samples and dimensions. To validate the effectiveness of PHSIC, we apply it to multiple causal inference tasks to measure the independence between cause and residual. Experimental results demonstrate that the causal model based on PHSIC performs well compared to other methods in scenarios involving small sample sizes and noisy data, on both real and simulated datasets.
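
The subtraction described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the "random conditions" are approximated here by randomly permuting one variable, HSIC uses the standard biased estimator trace(KHLH)/(n-1)^2 with Gaussian kernels, and the names `gaussian_gram`, `center`, `hsic`, and `phsic`, as well as the parameters `sigma` and `n_perm`, are hypothetical.

```python
import math
import random


def gaussian_gram(values, sigma=1.0):
    """Gram matrix of a Gaussian (RBF) kernel over scalar samples."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2)) for b in values]
            for a in values]


def center(K):
    """Double-center a symmetric Gram matrix: H K H with H = I - (1/n) 1 1^T."""
    n = len(K)
    row_mean = [sum(r) / n for r in K]
    grand_mean = sum(row_mean) / n
    return [[K[i][j] - row_mean[i] - row_mean[j] + grand_mean for j in range(n)]
            for i in range(n)]


def hsic(Kc, L, order=None):
    """Biased HSIC estimate trace(K H L H) / (n - 1)^2, given Kc = H K H.

    `order` optionally re-indexes the samples behind L, which is equivalent
    to shuffling the second variable before building its Gram matrix.
    """
    n = len(L)
    idx = order if order is not None else list(range(n))
    total = sum(Kc[i][j] * L[idx[i]][idx[j]] for i in range(n) for j in range(n))
    return total / (n - 1) ** 2


def phsic(x, y, n_perm=100, sigma=1.0, seed=0):
    """Observed HSIC minus the mean HSIC over random permutations of y (assumed baseline)."""
    rng = random.Random(seed)
    Kc = center(gaussian_gram(x, sigma))
    L = gaussian_gram(y, sigma)
    observed = hsic(Kc, L)
    order = list(range(len(y)))
    baseline = 0.0
    for _ in range(n_perm):
        rng.shuffle(order)               # break any pairing between x and y
        baseline += hsic(Kc, L, order)
    return observed - baseline / n_perm
```

Under this construction, two unrelated variables should score close to zero regardless of sample size, which is the zero-baseline property the abstract highlights; a Monte Carlo permutation average is only one way to estimate the random baseline.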

List of keywords

Data Mining -> DM: Exploratory data mining
AI Ethics, Trust, Fairness -> ETF: Trustworthy AI
Machine Learning -> ML: Causality

5207

Automated CPU Design by Learning from Input-Output Examples

Shuyao Cheng, Pengwei Jin, Qi Guo, Zidong Du, Rui Zhang, Xing Hu, Yongwei Zhao, Yifan Hao, Guan Xiangtao, Husheng Han, Zhengyue Zhao, Ximing Liu, Xishan Zhang, Yuejie Chu, Weilong Mao, Tianshi Chen, Yunji Chen

Designing a central processing unit (CPU) requires intensive manual work by talented experts to implement the circuit logic from design specifications. Although considerable progress has been made in electronic design automation (EDA) to relieve human effort, all existing EDA tools require hand-crafted formal program code (e.g., Verilog, Chisel, or C) as the input. To automate CPU design without human programming, we are motivated to learn the CPU design from only input-output (IO) examples. The key challenge is that the learned CPU design should have almost zero tolerance for inaccuracy, which makes well-known approximate algorithms such as neural networks ineffective. We propose a new AI approach to generate the CPU design in the form of a large-scale Boolean function, from only external IO examples instead of formal program code. This approach employs a novel graph structure called Binary Speculative Diagram (BSD) to approximate the CPU-scale Boolean function accurately. We propose an efficient BSD expansion method based on Boolean Distance, a new metric to quantitatively measure the structural similarity between Boolean functions, gradually increasing the design accuracy up to 100%. Our approach generates an industrial-scale RISC-V CPU design within 5 hours, reducing the design cycle by about 1000x without human involvement. The taped-out chip, Enlightenment-1, the world’s first CPU designed by AI, successfully runs the Linux operating system and performs comparably against the human-designed Intel 80486SX CPU. Our approach even autonomously discovers human knowledge of the von Neumann architecture.

List of keywords

Machine Learning -> ML: Applications

5227

Predictive Accuracy-Based Active Learning for Medical Image Segmentation

Jun Shi, Shulan Ruan, Ziqi Zhu, Minfan Zhao, Hong An, Xudong Xue, Bing Yan

Active learning is considered a viable solution to alleviate the contradiction between the high dependency of deep learning-based segmentation methods on annotated data and the expensive pixel-level annotation cost of medical images. However, most existing methods suffer from unreliable uncertainty assessment and the struggle to balance diversity and informativeness, leading to poor performance in segmentation tasks. In response, we propose an efficient Predictive Accuracy-based Active Learning (PAAL) method for medical image segmentation, first introducing predictive accuracy to define uncertainty. Specifically, PAAL mainly consists of an Accuracy Predictor (AP) and a Weighted Polling Strategy (WPS). The former is an attached learnable module that can accurately predict the segmentation accuracy of unlabeled samples relative to the target model with the predicted posterior probability. The latter provides an efficient hybrid querying scheme by combining predicted accuracy and feature representation, aiming to ensure the uncertainty and diversity of the acquired samples. Comprehensive evaluations and comparisons on multiple open-source datasets demonstrate the superiority of PAAL over existing methods. PAAL achieves comparable accuracy to fully annotated data while reducing annotation costs by approximately 50% to 80%, showcasing significant potential in clinical applications. The code is available at https://github.com/shijun18/PAAL-MedSeg.

List of keywords

Machine Learning -> ML: Active learning
Computer Vision -> CV: Biomedical image analysis
Computer Vision -> CV: Segmentation
Uncertainty in AI -> UAI: Uncertainty representations

5237

The Distortion of Threshold Approval Matching

Mohamad Latifian, Alexandros A. Voudouris

We study matching settings in which a set of agents have private utilities over a set of items. Each agent reports a partition of the items into approval sets of different threshold utility levels. Given this limited information on input, the goal is to compute an assignment of the items to the agents (subject to cardinality constraints depending on the application) that (approximately) maximizes the social welfare (the total utility of the agents for their assigned items). We first consider the well-known, simple one-sided matching problem in which each of a set of agents is to be assigned exactly one item. We show tight bounds on distortion of deterministic and randomized matching algorithms that are functions of the number of threshold utility levels. We further show that our distortion bounds extend to a more general setting in which there are multiple copies of the items, each agent can be assigned a number of items (even copies of the same one) up to a capacity, and the utility of an agent for an item depends on the number of its copies that the agent is given.

List of keywords

Game Theory and Economic Paradigms -> GTEP: Computational social choice
Game Theory and Economic Paradigms -> GTEP: Mechanism design

5243

Innovative Directional Encoding in Speech Processing: Leveraging Spherical Harmonics Injection for Multi-Channel Speech Enhancement

Jiahui Pan, Pengjie Shen, Hui Zhang, Xueliang Zhang

Multi-channel speech enhancement leverages multiple microphones to extract target speech signals amid background noise. Effectively utilizing directional cues is key for robust enhancement. While deep learning shows promise for multi-channel speech processing, most methods operate on short-time Fourier transform (STFT) coefficients directly. We propose using spherical harmonics transform (SHT) coefficients as auxiliary inputs to models, which concisely represent spatial distributions. SHT allows signals from varying numbers of microphones to be converted into coefficients of a consistent dimension. The proposed technique enables a single model to generalize to microphone arrays with varying configurations, rather than requiring a specialized model for each array layout. We present two architectures with SHT-based auxiliary inputs: parallel and serial. Specifically, the parallel model contains two encoders – one for STFT and another for SHT. By fusing both encoders’ outputs in the decoder to estimate the enhanced STFT, it effectively incorporates spatial context. For the serial approach, we first apply SHT to the signals and then take STFT of the transformed signals as network inputs. Evaluations on the TIMIT dataset under fluctuating noise and reverberation demonstrate our model outperforms established benchmarks. Remarkably, these results are attained with reduced computations and parameters. Furthermore, experiments on the MS-SNSD dataset show the proposed method can enhance the generalization ability of networks. The source code is publicly accessible at https://github.com/Pandade1997/SH_injection.

List of keywords

Natural Language Processing -> NLP: Information extraction
Machine Learning -> ML: Applications
Machine Learning -> ML: Representation learning
Machine Learning -> ML: Trustworthy machine learning

5245

Guiding GBFS through Learned Pairwise Rankings

Mingyu Hao, Felipe Trevizan, Sylvie Thiébaux, Patrick Ferber, Jörg Hoffmann

We propose a new approach based on ranking to learn to guide Greedy Best-First Search (GBFS). Like previous ranking approaches, ours is based on the observation that directly learning a heuristic function is overly restrictive, and that GBFS is capable of efficiently finding good plans for a much more flexible class of total quasi-orders over states. In order to learn an optimal ranking function, we introduce a new ranking framework capable of leveraging any neural network regression model and efficiently handling the training data through batching. Compared with previous ranking approaches for planning, ours does not require complex loss functions and allows training on states outside the optimal plan with minimal overhead. Our experiments on the domains of the latest planning competition learning track show that our approach substantially improves the coverage of the underlying neural network models without degrading plan quality.

List of keywords

Planning and Scheduling -> PS: Learning in planning and scheduling

5253

Diversifying Training Pool Predictability for Zero-shot Coordination: A Theory of Mind Approach

Dung Nguyen, Hung Le, Kien Do, Sunil Gupta, Svetha Venkatesh, Truyen Tran

A key challenge in constructing artificial social agents is enabling them to adapt to novel agents, a problem known as zero-shot coordination (ZSC). A promising approach is to train the adaptive agents by interacting with a diverse pool of collaborators, assuming that the greater the diversity in other agents seen during training, the better the generalisation. In this paper, we explore an alternative procedure by considering the behavioural predictability of collaborators, i.e. whether their actions and intentions are predictable, and use it to select a diverse set of agents for the training pool. More specifically, we develop a pool of agents through self-play training during which agents’ behaviour evolves and has diversity in levels of behavioural predictability (LoBP) through its evolution. We construct an observer to compute the level of behavioural predictability for each version of the collaborators. To do so, the observer is equipped with the theory of mind (ToM) capability to learn to infer the actions and intentions of others. We then use an episodic memory based on the LoBP metric to maintain agents with different levels of behavioural predictability in the pool of agents. Since behaviours that emerge at the later training phase are more complex and meaningful, the memory is updated with the latest versions of training agents. Our extensive experiments demonstrate that LoBP-based diversity training leads to better ZSC than other diversity training methods.

List of keywords

Agent-based and Multi-agent Systems -> MAS: Coordination and cooperation
Machine Learning -> ML: Multiagent Reinforcement Learning
Machine Learning -> ML: Reinforcement learning

5275

Layered Graph Security Games

Jakub Cerny, Chun Kai Ling, Christian Kroer, Garud Iyengar

Security games model strategic interactions in adversarial real-world applications. Such applications often involve extremely large but highly structured strategy sets (e.g., selecting a distribution over all patrol routes in a given graph). In this paper, we represent each player’s strategy space using a layered graph whose paths represent an exponentially large strategy space. Our formulation entails not only classic pursuit-evasion games, but also other security games, such as those modeling anti-terrorism and logistical interdiction. We study two-player zero-sum games under two distinct utility models: linear and binary utilities. We show that under linear utilities, Nash equilibrium can be computed in polynomial time, while binary utilities may lead to situations where even computing a best-response is computationally intractable. To this end, we propose a practical algorithm based on incremental strategy generation and mixed integer linear programs. We show through extensive experiments that our algorithm efficiently computes ε-equilibrium for many games of interest. We find that target values and graph structure often have a larger influence on running times as compared to the size of the graph per se.

List of keywords

Game Theory and Economic Paradigms -> GTEP: Noncooperative games

5300

A Graph-based Representation Framework for Trajectory Recovery via Spatiotemporal Interval-Informed Seq2Seq

Yaya Zhao, Kaiqi Zhao, Zhiqian Chen, Yuanyuan Zhang, Yalei Du, Xiaoling Lu

The prevalent issue in urban trajectory data usage, notably in low-sample rate datasets, revolves around the accuracy of travel time estimations, traffic flow predictions, and trajectory similarity measurements. Conventional methods, often relying on simplistic mixes of static road networks and raw GPS data, fail to adequately integrate both network and trajectory dimensions. Addressing this, the innovative GRFTrajRec framework offers a graph-based solution for trajectory recovery. Its key feature is a trajectory-aware graph representation, enhancing the understanding of trajectory-road network interactions and facilitating the extraction of detailed embedding features for road segments. Additionally, GRFTrajRec’s trajectory representation acutely captures spatiotemporal attributes of trajectory points. Central to this framework is a novel spatiotemporal interval-informed seq2seq model, integrating an attention-enhanced transformer and a feature differences-aware decoder. This model specifically excels in handling spatiotemporal intervals, crucial for restoring missing GPS points in low-sample datasets. Validated through extensive experiments on two large real-life trajectory datasets, GRFTrajRec has proven its efficacy in significantly boosting prediction accuracy and spatial consistency.

List of keywords

Data Mining -> DM: Mining spatial and/or temporal data
Data Mining -> DM: Applications
Data Mining -> DM: Exploratory data mining

5315

FedES: Federated Early-Stopping for Hindering Memorizing Heterogeneous Label Noise

Bixiao Zeng, Xiaodong Yang, Yiqiang Chen, Zhiqi Shen, Hanchao Yu, Yingwei Zhang

Federated learning (FL) facilitates collaborative model training across distributed clients while maintaining privacy. Federated noisy label learning (FNLL) is more of a challenge for data inaccessibility and noise heterogeneity. Existing works primarily assume clients are either noisy or clean, which may lack the flexibility to adapt to diverse label noise across different clients, especially when entirely clean or noisy clients are not the majority. To address this, we propose a general noise-robust federated learning framework called Federated Early-Stopping (FedES), which adaptively updates critical parameters of each local model based on their noise rates, thereby avoiding overfitting to noisy labels. FedES is composed of two stages: federated noise estimation and parameter-adaptive local updating & global aggregation. We introduce a signed distance based on local and global gradients during a federated round to estimate clients’ noise rates without requiring additional information. Based on this measure, we employ various degrees of early-stopping during local updating on the clients, and further, a noise-aware global aggregation is employed to achieve noise-robust learning. Extensive experiments conducted on varying synthetic and real-world label noise demonstrate the superior performance of FedES over the state-of-the-art methods.

List of keywords

Machine Learning -> ML: Federated learning
Machine Learning -> ML: Applications
Machine Learning -> ML: Weakly supervised learning

5337

PDENNEval: A Comprehensive Evaluation of Neural Network Methods for Solving PDEs

Ping Wei, Menghan Liu, Jianhuan Cen, Ziyang Zhou, Liao Chen, Qingsong Zou

The rapid development of neural network (NN) methods for solving partial differential equations (PDEs) has created an urgent need for evaluation and comparison of these methods. In this study, we propose PDENNEval, a comprehensive and systematic evaluation of 12 NN methods for PDEs. These methods are classified into function learning type and operator learning type based on their different mathematical foundations. The evaluation is implemented using a diverse dataset comprising 19 distinct PDE problems selected from various scientific fields such as fluid, materials, finance, and electromagnetic. Several evaluation results are reported, aiming to provide guidance for further research in this field. Our code and data are publicly available at https://github.com/zhouzy36/PDENNEval.

List of keywords

Machine Learning -> ML: Evaluation
Machine Learning -> ML: Applications
Multidisciplinary Topics and Applications -> MTA: Physical sciences

5345

NanoAdapt: Mitigating Negative Transfer in Test Time Adaptation with Extremely Small Batch Sizes

Shiji Zhao, Shao-Yuan Li, Sheng-Jun Huang

Test Time Adaptation (TTA) has garnered significant attention in recent years, with the research focus on addressing distribution shifts during test time. As one fundamental component of many TTA methods, the Batch Normalization (BN) layer plays a crucial role in enabling the model adaptability. However, existing BN strategies can prove detrimental when the batch size is (extremely) small. In numerous real-world scenarios, limited hardware resources or just-in-time demand often necessitates adjusting models with very small batch sizes, making existing methods less practical. In this paper, we first showcase and thoroughly analyze the negative transfer phenomenon in previous TTA methods encountering extremely small batch sizes. Subsequently, we propose a novel batch size-agnostic method called NanoAdapt to effectively mitigate the negative transfer even with batch size 1. NanoAdapt is composed of three key components: a dynamic BN calibration strategy that leverages historical information and the Taylor series to refine the statistics estimations, an entropy-weighted gradient accumulation strategy that uses the entropy of each sample’s label prediction to weigh and accumulate the loss for backpropagation, and a novel proxy computation graph to capture the sample interactions. Extensive experiments are conducted to validate the superiority of NanoAdapt, showing its consistent efficacy in improving existing TTA methods.

List of keywords

Machine Learning -> ML: Multi-task and transfer learning
Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning

5353

Metric Distortion with Elicited Pairwise Comparisons

Soroush Ebadian, Daniel Halpern, Evi Micha

In many social choice applications, information about individuals’ preferences can only be elicited using a limited number of pairwise comparisons. In these cases, the task is twofold: we must first choose the queries, and then second, we must aggregate the responses to choose an outcome. We study the problem of designing algorithms for this setting. To compare the effectiveness of different outcomes, we use the metric distortion framework. In addition, we consider various constraints on the query algorithms, namely, placing restrictions on how the choice of the next query may depend on previous answers. Our main contributions are nearly optimal algorithms for all settings considered.

List of keywords

Game Theory and Economic Paradigms -> GTEP: Computational social choice

5354

Best Arm Identification with Retroactively Increased Sampling Budget for More Resource-Efficient HPO

Jasmin Brandt, Marcel Wever, Viktor Bengs, Eyke Hüllermeier

Hyperparameter optimization (HPO) is indispensable for achieving optimal performance in machine learning tasks. A popular class of methods in this regard is based on Successive Halving (SHA), which casts HPO into a pure-exploration multi-armed bandit problem under finite sampling budget constraints. This is accomplished by considering hyperparameter configurations as arms and rewards as the negative validation losses. While enjoying theoretical guarantees as well as working well in practice, SHA comes, however, with several hyperparameters itself, one of which is the maximum budget that can be allocated to evaluate a single arm (hyperparameter configuration). Although there are already solutions to this meta hyperparameter optimization problem, such as the doubling trick or asynchronous extensions of SHA, these are either practically inefficient or lack theoretical guarantees. In this paper, we propose incremental SHA (iSHA), a synchronous extension of SHA that allows the maximum budget to be increased a posteriori while still enjoying theoretical guarantees. Our empirical analysis of HPO problems corroborates our theoretical findings and shows that iSHA is more resource-efficient than existing SHA-based approaches.

List of keywords

Machine Learning -> ML: Multi-armed bandits
Machine Learning -> ML: Hyperparameter optimization
Machine Learning -> ML: Incremental learning

5363

Multimodal Representation Distribution Learning for Medical Image Segmentation

Chao Huang, Weichao Cai, Qiuping Jiang, Zhihua Wang

Medical image segmentation is one of the most critical tasks in medical image analysis. However, the performance of existing methods is limited by the lack of high-quality labeled data due to the expensive data annotation. To alleviate this limitation, we propose a novel multi-modal learning method for medical image segmentation. In our method, medical text annotation is incorporated to compensate for the quality deficiency in image data. Moreover, previous multi-modal fusion methods ignore the commonalities and differences between different modalities. Ideally, the fused features should maximize valuable information while minimizing redundant information. To achieve this goal, we propose a multimodal feature distribution learning method. It is adopted to model the commonalities and differences between text and image. Since medical image segmentation needs to predict detailed segmentation boundaries, we also design a prompt encoder to achieve fine-grained segmentation. Experimental results on three datasets show that the proposed method obtains superior segmentation performance. Source codes will be available at https://github.com/GPIOX/Multimodal.git.

List of keywords

Machine Learning -> ML: Multi-modal learning
Computer Vision -> CV: Segmentation
Computer Vision -> CV: Representation learning

5391

Efficient Multi-view Unsupervised Feature Selection with Adaptive Structure Learning and Inference

Chenglong Zhang, Yang Fang, Xinyan Liang, Han Zhang, Peng Zhou, Xingyu Wu, Jie Yang, Bingbing Jiang, Weiguo Sheng

As data with diverse representations become high-dimensional, multi-view unsupervised feature selection has been an important learning paradigm. Generally, existing methods encounter the following challenges: (i) traditional solutions either concatenate different views or introduce extra parameters to weight them, affecting the performance and applicability; (ii) emphasis is typically placed on graph construction, yet disregarding the clustering information of data; (iii) exploring the similarity structure of all samples from the original features is suboptimal and extremely time-consuming. To solve this dilemma, we propose an efficient multi-view unsupervised feature selection (EMUFS) to construct bipartite graphs between samples and anchors. Specifically, a parameter-free manner is devised to collaboratively fuse the membership matrices and graphs to learn the compatible structure information across all views, naturally balancing different views. Moreover, EMUFS leverages the similarity relations of data in the feature subspace induced by l2,0-norm to dynamically update the graph. Accordingly, the cluster information of anchors can be accurately propagated to samples via the graph structure and further guide feature selection, improving the quality of selected features and reducing the computational costs in solution processes. A convergent optimization is developed to solve the formulated problem, and experiments demonstrate the effectiveness and efficiency of EMUFS.

List of keywords

Machine Learning -> ML: Multi-view learning
Machine Learning -> ML: Clustering
Machine Learning -> ML: Feature extraction, selection and dimensionality reduction
Machine Learning -> ML: Unsupervised learning

5395

Contextualized Speech Recognition: Rethinking Second-Pass Rescoring with Generative Large Language Models

Yixuan Tang, Anthony K. H. Tung

Automatic Speech Recognition (ASR) systems have witnessed notable advancements in recent years. Contextualized ASR tasks require recognizing speech not as isolated utterances but within the broader context in which they occur. Conventional approaches often employ a second-pass paradigm to re-rank initial transcriptions, yet they risk propagating errors across candidate hypotheses, thereby compromising recognition precision. In this study, we introduce a novel framework that diverges from typical second-pass rescoring methods. Given N-best hypotheses, we leverage prompting with a large language model for contextualized second-pass generation. Besides pursuing higher accuracy, we aim to explore the performance boundaries without substantially altering the underlying pre-trained speech and language models. We investigate the effectiveness of the proposed paradigm through zero-shot prompting and strategic low-rank adaptation tuning. On the multi-accent spoken reading comprehension benchmark SQuAD-SRC, both prompting and fine-tuned models outperform the 1-best ASR hypothesis, achieving notable relative Word Error Rate (WER) improvements of 10.9% and 45.0%, respectively. The results suggest that the proposed approach enhances transcription accuracy and contextual understanding.

List of keywords

Natural Language Processing -> NLP: Speech

5455

Intention Progression with Temporally Extended Goals

Yuan Yao, Natasha Alechina, Brian Logan

The Belief-Desire-Intention (BDI) approach to agent development has formed the basis for much of the research on architectures for autonomous agents. A key advantage of the BDI approach is that agents may pursue multiple intentions in parallel. However, previous approaches to managing possible interactions between concurrently executing intentions are limited to interactions between simple achievement goals (and in some cases maintenance goals). In this paper we present a new approach to intention progression for agents with temporally extended goals which allow mixing reachability and invariant properties, e.g., “travel to location A while not exceeding a gradient of 5%”. Temporally extended goals may be specified at run-time (top-level goals), and as subgoals in plans. In addition, our approach allows human-authored plans and plans implemented as RL policies to be freely mixed in an agent program, allowing the development of agents with ‘neuro-symbolic’ architectures.

List of keywords

Agent-based and Multi-agent Systems -> MAS: Agent theories and models
Agent-based and Multi-agent Systems -> MAS: Engineering methods, platforms, languages and tools

5461

NegativePrompt: Leveraging Psychology for Large Language Models Enhancement via Negative Emotional Stimuli

Xu Wang, Cheng Li, Yi Chang, Jindong Wang, Yuan Wu

Large Language Models (LLMs) have become integral to a wide spectrum of applications, ranging from traditional computing tasks to advanced artificial intelligence (AI) applications. This widespread adoption has spurred extensive research into LLMs across various disciplines, including the social sciences. Notably, studies have revealed that LLMs possess emotional intelligence, which can be further developed through positive emotional stimuli. This discovery raises an intriguing question: can negative emotions similarly influence LLMs, potentially enhancing their performance? In response to this question, we introduce NegativePrompt, a novel approach underpinned by psychological principles, involving ten specifically designed negative emotional stimuli. We embark on rigorous experimental evaluations of five LLMs including Flan-T5-Large, Vicuna, Llama 2, ChatGPT, and GPT-4, across a set of 45 tasks. The results are revealing: NegativePrompt markedly enhances the performance of LLMs, evidenced by relative improvements of 12.89% in Instruction Induction tasks and 46.25% in BIG-Bench tasks. Moreover, we conduct attention visualization experiments to decipher the underlying mechanisms of NegativePrompt’s influence. Our research contributes significantly to the understanding of LLMs and emotion interaction, demonstrating the practical efficacy of NegativePrompt as an emotion-driven method and offering novel insights for the enhancement of LLMs in real-world applications. The code is available at https://github.com/wangxu0820/NegativePrompt.

List of keywords

Natural Language Processing -> NLP: Language models
Natural Language Processing -> NLP: Applications

5467

CIC: A Framework for Culturally-Aware Image Captioning

Youngsik Yun, Jihie Kim

Image Captioning generates descriptive sentences from images using Vision-Language Pre-trained models (VLPs) such as BLIP, which has improved greatly. However, current methods lack the generation of detailed descriptive captions for the cultural elements depicted in the images, such as the traditional clothing worn by people from Asian cultural groups. In this paper, we propose a new framework, Culturally-aware Image Captioning (CIC), that generates captions and describes cultural elements extracted from cultural visual elements in images representing cultures. Inspired by methods combining visual modality and Large Language Models (LLMs) through appropriate prompts, our framework (1) generates questions based on cultural categories from images, (2) extracts cultural visual elements from Visual Question Answering (VQA) using generated questions, and (3) generates culturally-aware captions using LLMs with the prompts. Our human evaluation conducted on 45 participants from 4 different cultural groups with a high understanding of the corresponding culture shows that our proposed framework generates more culturally descriptive captions when compared to the image captioning baseline based on VLPs. Resources can be found at https://shane3606.github.io/cic.

List of keywords

Computer Vision -> CV: Bias, fairness and privacy
Computer Vision -> CV: Scene analysis and understanding
Computer Vision -> CV: Vision, language and reasoning

5468

DBPNet: Dual-Branch Parallel Network with Temporal-Frequency Fusion for Auditory Attention Detection

Qinke Ni, Hongyu Zhang, Cunhang Fan, Shengbing Pei, Chang Zhou, Zhao Lv

Auditory attention decoding (AAD) aims to recognize the attended speaker based on electroencephalography (EEG) signals in multi-talker environments. Most AAD methods only focus on the temporal or frequency domain, but neglect the relationships between these two domains, which results in the inability to simultaneously consider both time-varying and spectral-spatial information. To address this issue, this paper proposes a dual-branch parallel network with temporal-frequency fusion for AAD, named DBPNet, which consists of the temporal attentive branch and the frequency residual branch. Specifically, the temporal attentive branch aims to capture the time-varying features in the EEG time-series signal. The frequency residual branch aims to extract spectral-spatial features of multi-band EEG signals by the residual convolution. Finally, these dual branches are fused to consider both the time-varying and spectral-spatial features of EEG signals and produce classification results. Experimental results show that compared with the best baseline, DBPNet achieves a relative improvement of 20.4% with a 0.1-second decision window for the MM-AAD dataset, while the number of trainable parameters is reduced by a factor of about 91.

List of keywords

Humans and AI -> HAI: Brain sciences
Humans and AI -> HAI: Cognitive modeling

5479

Sample Quality Heterogeneity-aware Federated Causal Discovery through Adaptive Variable Space Selection

Xianjie Guo, Kui Yu, Hao Wang, Lizhen Cui, Han Yu, Xiaoxiao Li

Federated causal discovery (FCD) aims to uncover causal relationships among variables from decentralized data across multiple clients, while preserving data privacy. In practice, the sample quality of each client’s local data may vary across different variable spaces, referred to as sample quality heterogeneity. Thus, data from different clients might be suitable for learning different causal relationships among variables. Model aggregation under existing FCD methods requires the entire model parameters from each client, thereby being unable to handle the sample quality heterogeneity issue. In this paper, we propose the Federated Adaptive Causal Discovery (FedACD) method to bridge this gap. During federated model aggregation, it adaptively selects the causal relationships learned under the "good" variable space (i.e., one with high-quality samples) from each client, while masking those learned under the "bad" variable space (i.e., one with low-quality samples). This way, each client only needs to send the optimal learning results to the server, achieving accurate FCD. Extensive experiments on various types of datasets demonstrate significant advantages of FedACD over existing methods. The source code is available at https://github.com/Xianjie-Guo/FedACD.

L