Offline Meta-Reinforcement Learning with Flow-Based Task Inference and Adaptive Correction of Feature Overgeneralization

Min Wang; Xin Li; Mingzhong Wang; Hasnaa Bennis

doi:10.1609/aaai.v40i31.39845

Conference paper

Peer reviewed

Proceedings of the AAAI Conference on Artificial Intelligence, Vol.40(31), pp.26390-26397

Association for the Advancement of Artificial Intelligence (AAAI) Conference on Artificial Intelligence, 40th (Singapore, 20-Jan-2026–27-Jan-2026)

AAAI Press

2026

Files and links (1)

URL: https://doi.org/10.1609/aaai.v40i31.39845

Published Version Open

Related material

Has preprint

Offline Meta-Reinforcement Learning with Flow-Based Task Inference and Adaptive ...

Abstract

Offline meta-reinforcement learning (OMRL) combines the ability of offline RL to learn from diverse datasets with the adaptability of meta-RL to new tasks, promising safe and efficient knowledge acquisition by RL agents. However, OMRL still suffers from extrapolation errors due to out-of-distribution (OOD) actions, compounded by broad task distributions and Markov Decision Process (MDP) ambiguity in meta-RL setups. Existing research indicates that the generalization of the Q network affects the extrapolation error in offline RL. This paper investigates this relationship by decomposing the Q value into feature and weight components, observing that while decomposition enhances adaptability and convergence in the case of high-quality data, it often leads to policy degeneration or collapse in complex tasks. We observe that decomposed Q values introduce a large estimation bias when the feature encounters OOD samples, a phenomenon we term "feature overgeneralization". To address this issue, we propose FLORA, which identifies OOD samples by modeling feature distributions and estimating their uncertainties. FLORA integrates a return feedback mechanism to adaptively adjust feature components. Furthermore, to learn precise task representations, FLORA explicitly models the complex task distribution using a chain of invertible transformations. We theoretically and empirically demonstrate that FLORA achieves rapid adaptation and meta-policy improvement compared to baselines across various environments.
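
As a rough illustration of what "a chain of invertible transformations" means in general, the sketch below evaluates a density with the standard change-of-variables rule for a normalizing flow. It is not FLORA's implementation: this record does not specify the flow architecture, so the elementwise affine layers, the standard-normal base distribution, and every name here (AffineLayer, flow_log_prob, flow_forward) are assumptions made only for illustration.

```python
import math


class AffineLayer:
    """Hypothetical invertible layer: x = z * exp(log_scale) + shift, elementwise."""

    def __init__(self, log_scale, shift):
        self.log_scale = log_scale
        self.shift = shift

    def forward(self, z):
        # Map a latent vector z toward data space.
        return [zi * math.exp(s) + b
                for zi, s, b in zip(z, self.log_scale, self.shift)]

    def inverse(self, x):
        # Undo the map and return log|det| of the inverse Jacobian.
        z = [(xi - b) * math.exp(-s)
             for xi, s, b in zip(x, self.log_scale, self.shift)]
        return z, -sum(self.log_scale)


def standard_normal_log_prob(z):
    # Log density of an isotropic standard normal (assumed base distribution).
    return sum(-0.5 * zi * zi - 0.5 * math.log(2.0 * math.pi) for zi in z)


def flow_forward(z, layers):
    # Apply the chain in order: base sample -> modeled variable.
    for layer in layers:
        z = layer.forward(z)
    return z


def flow_log_prob(x, layers):
    # Change of variables: log p(x) = log p_base(z) + sum of inverse log-dets.
    total_log_det = 0.0
    for layer in reversed(layers):
        x, log_det = layer.inverse(x)
        total_log_det += log_det
    return standard_normal_log_prob(x) + total_log_det


# Two chained layers give x = 2 * (z + 1), so x is distributed as N(2, 2^2).
layers = [AffineLayer([0.0], [1.0]), AffineLayer([math.log(2.0)], [0.0])]
x = flow_forward([0.0], layers)
log_p = flow_log_prob(x, layers)
```

Because each layer is invertible with a tractable Jacobian, the chained density stays exact however many layers are stacked, which is the general property that makes flows suitable for modeling complex distributions. Practical flows replace the fixed affine parameters with learned, input-conditioned ones.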

Details

Title: Offline Meta-Reinforcement Learning with Flow-Based Task Inference and Adaptive Correction of Feature Overgeneralization
Authors: Min Wang - Beijing Institute of Technology
Xin Li - Jilin University
Mingzhong Wang - University of the Sunshine Coast
Hasnaa Bennis - Beijing Institute of Technology
Publication details: Proceedings of the AAAI Conference on Artificial Intelligence, Vol.40(31), pp.26390-26397
Conference details: Association for the Advancement of Artificial Intelligence (AAAI) Conference on Artificial Intelligence, 40th (Singapore, 20-Jan-2026–27-Jan-2026)
Publisher: AAAI Press
Date published: 2026
DOI: 10.1609/aaai.v40i31.39845
ISSN: 2374-3468
Grant note: This work was partially supported by the NSFC under Grants 92270125 and 62276024; by the Fundamental Research Funds for the Central Universities, JLU, under Grant 93K172025K01; and by the Fundamental Research Funds for the Central Universities under Grant 2025CX01010.
Organisation Unit: Healthy Ageing Research Cluster; School of Science, Technology and Engineering
Language: English
Record Identifier: 991224885302621
Output Type: Conference paper
