BramVanroy posted an update Jan 23
🕵️ Looking for DPO experts!

I have a dataset with gpt-4-turbo outputs as the chosen responses and a lower-performing model's outputs as the rejected ones. The objective should therefore be fairly easy, because the two are easy to tell apart. As a consequence, the model achieves very low losses (0.021 train; 0.013 validation) and high reward accuracies (0.995). **However**, when using the model in practice, it often deteriorates after the first one or two tokens and continuously outputs sequences of /*****/. So despite the good performance on the DPO objective and strong scores on the validation set (no overfitting), something seems to go wrong. Perhaps the outputs are too different and the task is too easy, in which case DPO is not useful. But why, then, would the model start hallucinating and repeating the same token over and over again?

Any thoughts? Any suggestions to get around this? All discussions are welcome!

Not an expert, but I think you should create your negative examples in such a way that the first few tokens are not enough to differentiate between good and bad.

One easy way to do this (sketched below) would be to first sample the GPT-4 examples, then keep "n" tokens (with n sampled from 0 to the length of the answer), and generate the rest of the answer with the other (worse) model.

That way, the DPO model cannot just ignore every token after the first few, because the branch point can occur anywhere in the answer.
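Not from the thread, but here is a minimal sketch of that idea, assuming a Hugging Face causal LM as the weaker model (the model name is a placeholder):

```python
import random

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder: any weaker model you want to use for the rejected answers.
weak_name = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(weak_name)
weak_model = AutoModelForCausalLM.from_pretrained(
    weak_name, torch_dtype=torch.bfloat16, device_map="auto"
)

def make_rejected(prompt: str, chosen: str, max_new_tokens: int = 512) -> str:
    """Keep a random-length prefix of the chosen (GPT-4) answer and let the
    weaker model continue from there, so the branch point varies per example."""
    chosen_ids = tokenizer(chosen, add_special_tokens=False).input_ids
    n_keep = random.randint(0, len(chosen_ids))
    prefix = tokenizer.decode(chosen_ids[:n_keep])

    inputs = tokenizer(prompt + prefix, return_tensors="pt").to(weak_model.device)
    output = weak_model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True)
    continuation = tokenizer.decode(
        output[0, inputs.input_ids.shape[1]:], skip_special_tokens=True
    )
    return prefix + continuation
```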


I think this is a sensible point, but looking at other DPO datasets (e.g. https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized), this should not be "required" for DPO to work. So before I go down this rabbit hole, I'm first looking into other options. Thanks for brainstorming!

(My other thought is that you could increase the KL-divergence penalty if your DPO model diverges too much from your initial model, but I think improving the negative examples is the stronger first step.)

I think this happened to my model (CultriX/MistralTrix-v1) as well. It sometimes starts to randomly output sequences of exactly that, and I am not sure why. @mlabonne, maybe you have some input on this?


I've never seen that before actually. Maybe too many steps?

What model and hyperparameters are you using for your DPO training?

I have seen similar behaviour with Mistral 7B when beta was large (i.e. >0.5) and also when training for too many epochs. In both cases I think it's a form of "reward hacking" where the model learns that some tokens have a spuriously high reward.

In general, DPO takes a bit of tweaking to find good hyperparameters, and the main ones I'd look at are beta, batch size, and number of epochs (assuming you are using a small LR like 5e-7).
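To make those knobs concrete, here is a rough sketch using trl's DPOTrainer as it looked around early 2024 (beta passed to the trainer; newer trl versions move these settings into a DPOConfig). The model and dataset names are placeholders, not the setup from this thread, and the dataset columns would still need mapping to prompt/chosen/rejected strings:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_name = "mistralai/Mistral-7B-v0.1"  # placeholder base model
model = AutoModelForCausalLM.from_pretrained(model_name)
ref_model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Placeholder dataset; map its columns to prompt/chosen/rejected strings first.
train_dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

training_args = TrainingArguments(
    output_dir="dpo-run",
    learning_rate=5e-7,              # keep the learning rate small
    per_device_train_batch_size=4,   # batch size is one of the main knobs
    num_train_epochs=1,              # more epochs tends to invite reward hacking
    remove_unused_columns=False,     # keep the raw preference columns for the trainer
    bf16=True,
)

trainer = DPOTrainer(
    model,
    ref_model,
    args=training_args,
    beta=0.1,                        # large beta (>= 0.5) coincided with the repetition issue
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    max_length=1024,
    max_prompt_length=512,
)
trainer.train()
```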

The other ablation would be to run your model through something like dpo_orca_pairs and see if you observe the same effect; if you do, then you can eliminate your dataset as the source of the problem.


I used the alignment-handbook hyperparameters for Zephyr (beta 0.01 for 1 epoch), and the architecture is also based on Mistral. The only change I made was setting max_length=8192, because I didn't understand why it was set to 1024. But would length impact the result that much?

Another perspective to consider:

  1. While you've used reward accuracies to validate the effectiveness of your model, there are additional aspects to examine. It's advisable to check that the chosen responses are actually assigned higher (log-)probabilities by your policy model.
  2. However, even if a chosen response has a higher probability than the rejected one, that does not necessarily mean it is the optimal response with the highest probability for the given prompt. The DPO loss primarily focuses on preferring the chosen responses over the rejected ones, rather than identifying the absolute best response (see the objective written out below).
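For context (not from the thread), this is visible directly in the DPO objective from the original paper, which only scores the margin between the chosen response $y_w$ and the rejected response $y_l$ relative to the reference model, and never pushes $\pi_\theta(y_w \mid x)$ to be high in absolute terms:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$

A degenerate continuation can still make that margin large if a few tokens carry most of the preference signal, which matches the failure mode described in the original post.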

This approach to generating DPO pairs seems to lead to reward hacking, as it becomes easy for the model to quickly exploit patterns in the chosen vs. rejected responses (even the first words). It happens with the original Orca pairs too, where the model overfits very quickly (see Argilla's version). Besides all the recommendations above, I'd try working on the dataset 🙂 You can try PairRM, which is cheap to compute (sketched below), and see if re-ranking the pairs helps, unless you're using a pretty bad model for the rejected responses (which I would discourage).
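A hedged sketch of the PairRM re-ranking idea, assuming the llm-blender package's documented usage (double-check the exact API against the PairRM model card; the example data below is made up):

```python
# pip install llm-blender
import llm_blender

blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")  # lightweight pairwise reward model

# Placeholder data: one instruction with its candidate answers per example.
instructions = ["Explain what DPO is in one paragraph."]
candidates = [["gpt-4-turbo answer ...", "weaker model answer ..."]]

# Rank candidates per instruction (rank 1 = preferred). If PairRM disagrees with
# your chosen/rejected labels on many pairs, the pairs may be too easy or noisy.
ranks = blender.rank(instructions, candidates)
print(ranks)
```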

Just guessing here, but perhaps part of the syntax of the positives is not parsed correctly and ends up in the tokens used for DPO. If that is always present in the positives, those sequences drive the DPO signal and end up as the only thing being generated.
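One cheap way to check that (a sketch with placeholder tokenizer/dataset names, not the thread's actual setup) is to tokenize the chosen answers and see whether a handful of tokens, such as pieces of /*****/, dominate the positives:

```python
from collections import Counter

from datasets import load_dataset
from transformers import AutoTokenizer

# Placeholders: substitute your own tokenizer and DPO dataset.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

counts = Counter()
for example in dataset:
    # In ultrafeedback_binarized, "chosen" is a list of chat messages and the
    # last one is the assistant answer; adjust for your own dataset format.
    chosen_text = example["chosen"][-1]["content"]
    counts.update(tokenizer.tokenize(chosen_text))

# If a few odd tokens dominate the positives, they can end up driving the DPO signal.
print(counts.most_common(20))
```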

So I tweaked the learning rate to 1e-7 (was 5e-7), removed a portion of the dataset (I had used two datasets, one based on Orca Pairs and the other on UltraFeedback; I kept the latter), and decreased the lengths to max_length=2048 and max_prompt_length=1152. EDIT: I thought this had worked, but apparently I was wrong. For some prompts, the same result occurs, with repetitions of /******/ everywhere. I am very confused about this and should find time to dig deeper, but it is a tedious trial-and-error process of training and testing that eats up a lot of my time. If anyone wants to have a look, I can provide gated access to the model.

Mostly cc @lewtun but also a shout out to everyone in this thread for brainstorming along!


You mention the Mistral model. Have you tried the different versions available for that model?
I've tried DPO on multiple runs with Mistral and haven't run into similar issues.

It might be the dataset... but testing it with different models and the same setup might give you some more clues.