
phi-2-gpo-newSFT-b0.001-i0

This model is a fine-tuned version of DUAL-GPO/phi-2-sft-lora-ultrachat-merged on the HuggingFaceH4/ultrafeedback_binarized dataset (a loading sketch follows the metrics below). It achieves the following results on the evaluation set:

  • Loss: 0.0401
  • Rewards/chosen: -0.0556
  • Rewards/rejected: -0.0895
  • Rewards/accuracies: 0.6018
  • Rewards/margins: 0.0338
  • Logps/rejected: -337.8973
  • Logps/chosen: -325.6151
  • Logits/rejected: 0.2226
  • Logits/chosen: 0.1916
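
Since this repository ships a PEFT (LoRA) adapter rather than standalone weights, the adapter has to be attached to the base SFT model before use. The snippet below is a minimal loading sketch, assuming the repository ids named above; the prompt and generation settings are purely illustrative.

```python
# Minimal loading sketch: attach this LoRA adapter to the merged SFT base.
# bfloat16 is an assumption for memory savings, not a documented requirement.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "DUAL-GPO/phi-2-sft-lora-ultrachat-merged"
adapter_id = "DUAL-GPO/phi-2-gpo-newSFT-b0.001-i0"

base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, adapter_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Illustrative prompt only; format inputs to match the chat template used at SFT time.
inputs = tokenizer("What is preference optimization?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```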

Model description

This repository contains a PEFT (LoRA) adapter for the merged phi-2 SFT model named above, not full standalone weights. Judging by the repository name, it corresponds to a GPO preference-optimization run with beta = 0.001 at iteration 0; no further description is provided.

Intended uses & limitations

More information needed

Training and evaluation data

As noted above, the adapter was trained and evaluated on the HuggingFaceH4/ultrafeedback_binarized preference dataset; preprocessing and split details are not documented.

Training procedure

Training hyperparameters

The following hyperparameters were used during training (a configuration sketch follows the list):

  • learning_rate: 5e-06
  • train_batch_size: 4
  • eval_batch_size: 4
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 3
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 48
  • total_eval_batch_size: 12
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1
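
These settings map directly onto transformers TrainingArguments, as sketched below. This is an assumed reconstruction for orientation only, not the authors' actual training script; the GPO preference loss itself would come from a separate trainer and is omitted.

```python
# Hedged reconstruction of the run configuration from the list above;
# the GPO loss/trainer is not part of this card and is not shown here.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="phi-2-gpo-newSFT-b0.001-i0",  # assumption: named after the model
    learning_rate=5e-6,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,  # 4 per device x 3 GPUs x 4 steps = 48 effective
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    seed=42,
    adam_beta1=0.9,    # Adam settings exactly as listed above
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)
```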

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.0569 | 0.08 | 100 | 0.0535 | 0.0003 | -0.0004 | 0.5973 | 0.0006 | -248.8037 | -269.7070 | 1.0547 | 0.9877 |
| 0.0565 | 0.16 | 200 | 0.0515 | -0.0016 | -0.0068 | 0.6078 | 0.0052 | -255.2234 | -271.5318 | 0.9706 | 0.8998 |
| 0.0467 | 0.24 | 300 | 0.0470 | -0.0316 | -0.0485 | 0.6033 | 0.0169 | -296.9762 | -301.5743 | 0.5528 | 0.4973 |
| 0.0467 | 0.31 | 400 | 0.0443 | -0.0370 | -0.0583 | 0.6033 | 0.0213 | -306.7241 | -306.9802 | 0.3457 | 0.3071 |
| 0.0359 | 0.39 | 500 | 0.0428 | -0.0574 | -0.0869 | 0.5988 | 0.0296 | -335.3609 | -327.3275 | 0.2119 | 0.1835 |
| 0.0431 | 0.47 | 600 | 0.0418 | -0.0450 | -0.0725 | 0.6033 | 0.0275 | -320.9161 | -314.9630 | 0.2891 | 0.2554 |
| 0.0438 | 0.55 | 700 | 0.0413 | -0.0574 | -0.0889 | 0.6018 | 0.0316 | -337.3519 | -327.3254 | 0.2356 | 0.2040 |
| 0.0446 | 0.63 | 800 | 0.0409 | -0.0522 | -0.0842 | 0.6048 | 0.0320 | -332.6603 | -322.1777 | 0.2566 | 0.2236 |
| 0.0426 | 0.71 | 900 | 0.0408 | -0.0624 | -0.0977 | 0.6048 | 0.0353 | -346.1494 | -332.3424 | 0.2089 | 0.1797 |
| 0.0448 | 0.79 | 1000 | 0.0403 | -0.0545 | -0.0869 | 0.6063 | 0.0324 | -335.3596 | -324.4480 | 0.2463 | 0.2141 |
| 0.0411 | 0.86 | 1100 | 0.0402 | -0.0549 | -0.0884 | 0.6018 | 0.0336 | -336.8657 | -324.8257 | 0.2283 | 0.1971 |
| 0.0459 | 0.94 | 1200 | 0.0401 | -0.0558 | -0.0896 | 0.6033 | 0.0338 | -338.0042 | -325.7257 | 0.2205 | 0.1898 |

Framework versions

  • PEFT 0.7.1
  • Transformers 4.36.2
  • Pytorch 2.1.2
  • Datasets 2.14.6
  • Tokenizers 0.15.2