Enhancing LLM Safety via Constrained Fine-Tuning and Decoding
Personnel
- Principal Investigator: Zizhan Zheng
- Graduate Research Associate: Zixuan Liu
Goals
Large language models (LLMs) have demonstrated remarkable proficiency in tasks such as chat completion, instruction following, coding, and problem-solving. However, they also exhibit weaknesses and vulnerabilities that can be a barrier to their use in safety-critical domains. Techniques such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) or AI feedback (RLAIF) have been employed to align these models more closely with human preferences, yet they fall short of providing strong safety guarantees. This limitation stems from conflicting objectives inherent in LLM training, such as helpfulness versus harmlessness, which are difficult to balance with the single preference model typically used in practice. In this project, we propose a new paradigm for improving LLM safety that integrates constrained fine-tuning and decoding, addressing two types of uncertainty: one arising from diverse human preferences and one from sampling randomness. Our framework provides a systematic approach to ensuring LLM safety while achieving a desired tradeoff between safety, utility, and efficiency.
Tasks
First, we will develop a risk-constrained RLHF framework that uses conditional value-at-risk (CVaR) as the risk measure, yielding a spectrum of policies with varying levels of risk sensitivity. We will further leverage multi-model learning to identify a small set of representative policies that can be combined at inference time to match each user's desired risk sensitivity.
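To make the risk measure concrete, below is a minimal sketch of how a CVaR-constrained objective could be estimated from sampled responses. The function names (`empirical_cvar`, `constrained_objective`), the dual variable `lam`, and the cost budget `budget` are illustrative assumptions, not the project's finalized formulation.

```python
import numpy as np

def empirical_cvar(costs: np.ndarray, alpha: float) -> float:
    """Empirical CVaR_alpha: mean cost over the worst (1 - alpha) tail.

    costs: per-response safety costs sampled from the current policy.
    alpha: confidence level, e.g. 0.95 focuses on the riskiest 5% of outputs.
    """
    var = np.quantile(costs, alpha)   # value-at-risk threshold
    tail = costs[costs >= var]        # worst-case tail of the cost distribution
    return float(tail.mean())

def constrained_objective(rewards, costs, lam, alpha, budget):
    """Lagrangian-style objective: maximize mean reward while penalizing
    any excess of the CVaR of the safety cost over the budget."""
    return rewards.mean() - lam * (empirical_cvar(costs, alpha) - budget)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rewards = rng.normal(1.0, 0.3, size=1000)   # stand-in reward samples
    costs = rng.exponential(0.5, size=1000)     # stand-in safety costs
    print("CVaR_0.95 of cost:", empirical_cvar(costs, 0.95))
    print("objective:", constrained_objective(rewards, costs,
                                              lam=1.0, alpha=0.95, budget=1.0))
```

Sweeping `alpha` (or `budget`) in such a formulation is one way to obtain the spectrum of risk-sensitive policies described above.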
Second, we will explore how fine-grained process reward and cost functions can guide content generation to improve the helpfulness-safety tradeoff on the fly. We will further design learning methods to estimate the uncertainty of rewards and costs and develop uncertainty-aware adaptive decoding strategies to improve sampling efficiency.
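As a rough illustration of the second task, the sketch below selects among candidate continuations using a process reward minus a pessimistic, uncertainty-penalized cost. All names (`guided_decode_step`, `reward_fn`, `cost_fn`, `cost_std_fn`, and the weights `lam` and `kappa`) are hypothetical stand-ins for learned models, and the pessimistic penalty is one possible way to make decoding uncertainty-aware.

```python
import random
from typing import Callable, List

def guided_decode_step(
    candidates: List[str],
    reward_fn: Callable[[str], float],    # process reward model (assumed)
    cost_fn: Callable[[str], float],      # process cost / harm model (assumed)
    cost_std_fn: Callable[[str], float],  # estimated uncertainty of the cost
    lam: float = 1.0,                     # weight on safety cost
    kappa: float = 2.0,                   # pessimism on uncertain costs
) -> str:
    """Pick the candidate continuation with the best reward-minus-risk score.

    Uncertain costs are penalized pessimistically (cost + kappa * std), so the
    decoder steers away from continuations whose safety is poorly understood.
    """
    def score(c: str) -> float:
        return reward_fn(c) - lam * (cost_fn(c) + kappa * cost_std_fn(c))
    return max(candidates, key=score)

# Toy demo with stub scorers standing in for learned reward/cost models.
if __name__ == "__main__":
    random.seed(0)
    cands = ["continuation A", "continuation B", "continuation C"]
    best = guided_decode_step(
        cands,
        reward_fn=lambda c: random.random(),
        cost_fn=lambda c: random.random(),
        cost_std_fn=lambda c: 0.1 * random.random(),
    )
    print("selected:", best)
```

In an adaptive variant, the number of candidates scored at each step could shrink when the uncertainty estimates are small, which is one route to the sampling-efficiency gains mentioned above.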
Publications
- Robust Optimization for Mitigating Reward Hacking with Correlated Proxies
Zixuan Liu, Xiaolin Sun, and Zizhan Zheng
International Conference on Learning Representations (ICLR), April 2026.
Support
The project is funded by a Louisiana Board of Regents Research Competitiveness Subprogram (RCS) grant.
Disclaimer: Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Louisiana Board of Regents.