Direct Preference Optimization: Your Language Model is Secretly a Reward Model

References

  1. Rafailov et al. (2023), Direct Preference Optimization: Your Language Model is Secretly a Reward Model