Enhancing LLM Safety via Constrained Fine-Tuning and Decoding
Personnel
- Principal Investigator: Zizhan Zheng
- Graduate Research Associate: Zixuan Liu
Goals
Large language models (LLMs) have demonstrated remarkable proficiency in tasks such as chat completion, instruction following, coding, and problem-solving. However, they also exhibit weaknesses and vulnerabilities that can be a barrier to their use in safety-critical domains. Techniques such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) or AI feedback (RLAIF) have been employed to align these models more closely with human preferences, yet they fall short of providing strong safety guarantees. This limitation stems from conflicting objectives inherent in LLM training, such as helpfulness versus harmlessness, which are difficult to balance with the single preference model typically used in practice. In this project, we propose a new paradigm for improving LLM safety that integrates constrained fine-tuning and decoding, addressing two types of uncertainty: one arising from diverse human preferences and one from sampling randomness. Our framework provides a systematic approach to ensuring LLM safety while achieving a desired tradeoff between safety, utility, and efficiency.
Tasks
First, we will develop a risk-constrained RLHF framework that uses conditional value-at-risk (CVaR) as the risk measure, yielding a spectrum of policies with varying levels of risk sensitivity. We will further leverage multi-model learning to identify a small set of representative policies that can be combined at inference time to match each user's desired risk sensitivity.
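To make the risk measure concrete, below is a minimal sketch of how a CVaR-constrained objective could be estimated from sampled responses. The function names (`empirical_cvar`, `constrained_objective`), the dual variable `lam`, and the cost budget `budget` are illustrative assumptions, not the project's finalized formulation.

```python
import numpy as np

def empirical_cvar(costs: np.ndarray, alpha: float) -> float:
    """Empirical CVaR_alpha: mean cost over the worst (1 - alpha) tail.

    costs: per-response safety costs sampled from the current policy.
    alpha: confidence level, e.g. 0.95 focuses on the riskiest 5% of outputs.
    """
    var = np.quantile(costs, alpha)   # value-at-risk threshold
    tail = costs[costs >= var]        # worst-case tail of the cost distribution
    return float(tail.mean())

def constrained_objective(rewards, costs, lam, alpha, budget):
    """Lagrangian-style objective: maximize mean reward while penalizing
    any excess of the CVaR of the safety cost over the budget."""
    return rewards.mean() - lam * (empirical_cvar(costs, alpha) - budget)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rewards = rng.normal(1.0, 0.3, size=1000)   # stand-in reward samples
    costs = rng.exponential(0.5, size=1000)     # stand-in safety costs
    print("CVaR_0.95 of cost:", empirical_cvar(costs, 0.95))
    print("objective:", constrained_objective(rewards, costs,
                                              lam=1.0, alpha=0.95, budget=1.0))
```

Sweeping `alpha` (or `budget`) in such a formulation is one way to obtain the spectrum of risk-sensitive policies described above.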
Second, we will explore how fine-grained process reward and cost functions can guide content generation to improve the helpfulness-safety tradeoff on the fly. We will further design learning methods to estimate the uncertainty of rewards and costs and develop uncertainty-aware adaptive decoding strategies to improve sampling efficiency.
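As a rough illustration of the second task, the sketch below selects among candidate continuations using a process reward minus a pessimistic, uncertainty-penalized cost. All names (`guided_decode_step`, `reward_fn`, `cost_fn`, `cost_std_fn`, and the weights `lam` and `kappa`) are hypothetical stand-ins for learned models, and the pessimistic penalty is one possible way to make decoding uncertainty-aware.

```python
import random
from typing import Callable, List

def guided_decode_step(
    candidates: List[str],
    reward_fn: Callable[[str], float],    # process reward model (assumed)
    cost_fn: Callable[[str], float],      # process cost / harm model (assumed)
    cost_std_fn: Callable[[str], float],  # estimated uncertainty of the cost
    lam: float = 1.0,                     # weight on safety cost
    kappa: float = 2.0,                   # pessimism on uncertain costs
) -> str:
    """Pick the candidate continuation with the best reward-minus-risk score.

    Uncertain costs are penalized pessimistically (cost + kappa * std), so the
    decoder steers away from continuations whose safety is poorly understood.
    """
    def score(c: str) -> float:
        return reward_fn(c) - lam * (cost_fn(c) + kappa * cost_std_fn(c))
    return max(candidates, key=score)

# Toy demo with stub scorers standing in for learned reward/cost models.
if __name__ == "__main__":
    random.seed(0)
    cands = ["continuation A", "continuation B", "continuation C"]
    best = guided_decode_step(
        cands,
        reward_fn=lambda c: random.random(),
        cost_fn=lambda c: random.random(),
        cost_std_fn=lambda c: 0.1 * random.random(),
    )
    print("selected:", best)
```

In an adaptive variant, the number of candidates scored at each step could shrink when the uncertainty estimates are small, which is one route to the sampling-efficiency gains mentioned above.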
Publications
- Robust Optimization for Mitigating Reward Hacking with Correlated Proxies
Zixuan Liu, Xiaolin Sun, and Zizhan Zheng
International Conference on Learning Representations (ICLR), April 2026.
Support
The project is funded by a Louisiana Board of Regents Research Competitiveness Subprogram (RCS) grant.
Disclaimer: Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Louisiana Board of Regents.