DPO-LLM

1 Introduction

1.1 Problem Statement

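LLMs trained in an unsupervised manner have shortcomings in areas such as reasoning and knowledge, and need to be fine-tuned on human labels to produce outputs that better match human expectations. Current PPO-based methods are typically complex and unstable procedures.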

1.2 Stated Contribution

RLHF (reinforcement learning from human feedback) is a

3 Method

4 Experiment

4.1 Implementation Details

5 Summary

5.1 Relative Position

6 Comments