| 
                 
                  Zhi Wang
                 
                I'm an Associate Professor at Nanjing University in Nanjing, China. I received my Ph.D. from City University of Hong Kong and my bachelor's degree from Nanjing University. I was a visiting scholar at the University of New South Wales and at the Institute of Automation, Chinese Academy of Sciences.
                 
                
                
                  Email  / 
                  Google Scholar  / 
                  GitHub  / 
                  Publications
                 
               | 
              
                 
               | 
             
           
          
              
              
                News
                
              
                 [Sept. 19th, 2025] Three papers on in-context RL, language agents, and RL for LLM reasoning were accepted at NeurIPS 2025.
                  
                 [May 1st, 2025] One paper on hierarchical LLM agents was accepted at ICML 2025.
                  
                 [Jan. 29th, 2025] One paper on interpretable multi-agent RL was accepted at TPAMI.
                  
                 [Sept. 26th, 2024] One paper on generalist RL agents was accepted at NeurIPS 2024.
                  
                 [Jan. 16th, 2024] One paper on efficient multi-agent RL coordination was accepted at ICLR 2024.
                  
               
               | 
             
           
          
              
              
                Research
                
                  I'm interested in reinforcement learning algorithms and applications.
                  Specifically, I study how learning algorithms can scale RL agents to i) dynamic environments, ii) offline settings, and iii) multi-agent systems, allowing them to autonomously adapt to i) non-stationary task distributions, ii) non-interactive scenarios, and iii) cooperative or competitive task assignments, thereby facilitating RL's deployment in real-world domains.
                 
                
                  Recently, I have been working on leveraging foundation models for decision-making problems, exploring ideas in language agents, RL for LLM reasoning, in-context RL, and embodied intelligence.
                 
               | 
             
           
    
    
      
        
          
            Language Agents, RL for LLM Reasoning
           | 
         
      
     
  
      
        
          | 
            
            
           | 
          
          
          Diversity-Incentivized Exploration for Versatile Reasoning
          
             
            Zican Hu, Shilin Zhang, Yafu Li, Jianhao Yan, Xuyang Hu, Leyang Cui, Xiaoye Qu, Chunlin Chen, Yu Cheng, Zhi Wang
             
            arXiv preprint, arXiv:2509.26209, 2025
             
            paper / code
            
            
            Based on a strong positive correlation between global diversity and reasoning capacity, we introduce global diversity incentives as an intrinsic reward to promote deep exploration in a semantically structured space. 
             
           | 
         
      
     
    
      
        
          | 
            
            
           | 
          
          
          ExGRPO: Learning to Reason from Experience
          
             
            Runzhe Zhan, Yafu Li, Zhi Wang, Xiaoye Qu, Dongrui Liu, Jing Shao, Derek F. Wong, Yu Cheng
             
            arXiv preprint, arXiv:2510.02245, 2025
             
            paper / code
            
            
            We investigate what makes a reasoning experience valuable and identify rollout correctness and entropy as effective indicators of experience value. Based on these insights, we prioritize valuable experiences with a mixed-policy objective to balance exploration with experience exploitation.
             
           | 
         
      
     
  
      
        
          | 
            
            
           | 
          
          
          The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models
          
             
            Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, Ning Ding
             
            arXiv preprint, arXiv:2505.22617, 2025
             
            paper / code
            
            
            Our findings necessitate entropy management for continuous exploration when scaling compute for RL.
            By understanding the mechanism behind entropy dynamics, we propose controlling entropy by restricting the updates of high-covariance tokens.
             
           | 
         
      
     
  
    
      
        
          | 
            
            
           | 
          
          
          Text-to-Decision Agent: Offline Meta-Reinforcement Learning from Natural Language Supervision
          
             
            Shilin Zhang, Zican Hu, Wenhao Wu, Xinyi Xie, Jianxiang Tang, Chunlin Chen, Daoyi Dong, Yu Cheng, Zhenhong Sun, Zhi Wang*
             
            Advances in Neural Information Processing Systems (NeurIPS), 2025
             
            paper / code
            
            
            We propose Text-to-Decision Agent (T2DA), a simple and scalable pre-training framework for learning generalist policies by aligning language knowledge with the environment dynamics of decision tasks.
             
           | 
         
      
     
    
      
        
          | 
            
            
           | 
          
          
          Learning to Reason under Off-Policy Guidance
          
             
            Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, Yue Zhang
             
            Advances in Neural Information Processing Systems (NeurIPS), 2025
             
            paper / code
            
            
            We introduce LUFFY (Learning to reason Under oFF-policY guidance), a framework that augments zero-RL with off-policy reasoning traces, balancing imitation and exploration by combining off-policy demonstrations with on-policy rollouts during training.
             
           | 
         
      
     
    
      
        
          
            In-context RL, Generalization in RL
           | 
         
      
     
    
      
        
          | 
            
            
           | 
          
          
            Mixture-of-Experts Meets In-Context Reinforcement Learning
          
             
            Wenhao Wu, Fuhong Liu, Haoru Li, Zican Hu, Daoyi Dong, Chunlin Chen, Zhi Wang*
             
            Advances in Neural Information Processing Systems (NeurIPS), 2025
             
            paper / code
            
            
            We propose T2MIR (Token- and Task-wise MoE for In-context RL), an innovative framework that introduces architectural advances of mixture-of-experts (MoE) into transformer-based decision models.
             
           | 
         
      
     
        
      
        
          | 
            
            
           | 
          
          
            Scalable In-Context Q-Learning
          
             
            Jinmei Liu, Fuhong Liu, Jianye Hao, Bo Wang, Huaxiong Li, Chunlin Chen, Zhi Wang*
             
            arXiv preprint, arXiv:2506.01299, 2025
             
            paper / code
            
            
            We propose SICQL, an innovative framework that harnesses dynamic programming and world modeling to steer ICRL toward efficient reward maximization and task generalization, while retaining the scalability and stability of supervised pretraining.
             
           | 
         
      
     
          
    
    
    
          
    
      
        
          | 
            
            
           | 
          
            
          Attention-Guided Contrastive Role Representations for Multi-Agent Reinforcement Learning
            
             
            Zican Hu, Zongzhang Zhang, Huaxiong Li, Chunlin Chen, Hongyu Ding, Zhi Wang*
             
            International Conference on Learning Representations (ICLR), 2024
             
            paper / code
            
            
            Our main insight is to learn a compact role representation that can capture complex behavior patterns of agents, and use that role representation to promote behavior heterogeneity, knowledge transfer, and skillful coordination across agents.
             
           | 
         
        
          | 
            
            
           | 
          
            
          MIXRTs: Toward Interpretable Multi-Agent Reinforcement Learning Via Mixing Recurrent Soft Decision Trees
            
             
            Zichuan Liu, Yuanyang Zhu, Zhi Wang*, Yang Gao, Chunlin Chen
             
            IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2025
             
            paper
            
            
            We propose a novel architecture based on differentiable soft decision trees to tackle the tension between model interpretability and learning performance in MARL domains, paving the way for interpretable and high-performing MARL systems.
             
           | 
         
      
     
    
            
              
                RL Applications and Others
               | 
             
           
          
    
    
      
        
          | 
            
            
           | 
          
            
          Better Fine-Tuning via Instance Weighting for Text Classification
            
             
            Zhi Wang, Wei Bi, Yan Wang, Xiaojiang Liu
             
            AAAI Conference on Artificial Intelligence (AAAI), 2019
             
            paper / supp
            
            
            We propose an Instance Weighting based Fine-tuning (IW-Fit) method, which revises the fine-tuning stage to improve classification accuracy on the target domain when a pre-trained model from the source domain is given.
             
           | 
         
      
     
          
					
          
          
              | 
                  
                  Teaching
                 
               | 
              
                
               | 
             
            
              | 
						      
								  Academic Service
								 
               | 
              
              
                - Area Chair: AAMAS 2026
 
                - Associate Editor: IEEE SMC 2023/2022/2021, IEEE ICNSC 2020
 
                - Reviewer: ICML/NeurIPS/ICLR/CVPR/AAAI/ECAI, IEEE TPAMI/TNNLS/TCYB/TSYS/TMECH/JAS
 
               
               | 
             
						
           
        
          
          
        
      
     
   |