griffin-c3t-8L-v0.02-fineweb

Pretraining experiment with the griffin/recurrent_gemma architecture (~168M parameters).

Model description

Further training of pszemraj/griffin-v0.01-c3t-8layer-simplewiki-silu on the BEE-spoke-data/fineweb-1M_en-med dataset. It achieves the following results on the evaluation set:

  • Loss: 5.1888
  • Accuracy: 0.2326
  • Num Input Tokens Seen: 798,621,696
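A minimal usage sketch, hedged: it assumes the checkpoint's custom griffin/recurrent_gemma code exposes the standard causal-LM interface, which the trust_remote_code=True flag in the eval below suggests:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pszemraj/griffin-c3t-8L-v0.02-fineweb"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# trust_remote_code=True is needed because the griffin/recurrent_gemma
# modeling code ships with the checkpoint rather than with transformers
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.float32
)

inputs = tokenizer("The huggingface_hub library is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```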

numbers

tl;dr: it's bad / it would need more training:

hf (pretrained=pszemraj/griffin-c3t-8L-v0.02-fineweb,trust_remote_code=True,dtype=float), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 4

| Tasks          | Version | Filter | n-shot | Metric     |       Value |       Stderr |
|----------------|--------:|--------|-------:|------------|------------:|-------------:|
| winogrande     |       1 | none   |      0 | acc        |      0.5146 | ±     0.0140 |
| piqa           |       1 | none   |      0 | acc        |      0.5511 | ±     0.0116 |
|                |         | none   |      0 | acc_norm   |      0.5261 | ±     0.0116 |
| openbookqa     |       1 | none   |      0 | acc        |      0.1140 | ±     0.0142 |
|                |         | none   |      0 | acc_norm   |      0.2240 | ±     0.0187 |
| lambada_openai |       1 | none   |      0 | perplexity | 209503.2246 | ± 11711.4041 |
|                |         | none   |      0 | acc        |      0.0000 | ±     0.0000 |
| boolq          |       2 | none   |      0 | acc        |      0.3783 | ±     0.0085 |
| arc_easy       |       1 | none   |      0 | acc        |      0.2593 | ±     0.0090 |
|                |         | none   |      0 | acc_norm   |      0.2774 | ±     0.0092 |
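These numbers can be re-run with EleutherAI's lm-evaluation-harness; a minimal sketch using its Python API (lm_eval.simple_evaluate, available in harness v0.4+), with model_args and batch size taken from the header line above:

```python
import lm_eval

# Zero-shot eval matching the configuration in the header line above
results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=pszemraj/griffin-c3t-8L-v0.02-fineweb,"
        "trust_remote_code=True,dtype=float"
    ),
    tasks=["winogrande", "piqa", "openbookqa", "lambada_openai", "boolq", "arc_easy"],
    batch_size=4,
)
print(results["results"])
```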

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.0003
  • train_batch_size: 2
  • eval_batch_size: 2
  • seed: 80085
  • gradient_accumulation_steps: 32
  • total_train_batch_size: 64
  • optimizer: Adam with betas=(0.9,0.99) and epsilon=1e-07
  • lr_scheduler_type: inverse_sqrt
  • lr_scheduler_warmup_ratio: 0.05
  • num_epochs: 1.0
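For reference, a hedged sketch of how these settings map onto stock transformers TrainingArguments (the training script itself is not included with this card, so treat the field mapping as an assumption):

```python
from transformers import TrainingArguments

# Field names are the stock Trainer ones; values come from the list above
args = TrainingArguments(
    output_dir="griffin-c3t-8L-v0.02-fineweb",
    learning_rate=3e-4,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    seed=80085,
    gradient_accumulation_steps=32,   # 2 x 32 = 64 total train batch size
    lr_scheduler_type="inverse_sqrt",
    warmup_ratio=0.05,
    num_train_epochs=1.0,
    adam_beta1=0.9,
    adam_beta2=0.99,
    adam_epsilon=1e-7,
)
```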

Training results

| Training Loss | Epoch  | Step | Validation Loss | Accuracy | Input Tokens Seen |
|--------------:|-------:|-----:|----------------:|---------:|------------------:|
| 6.0703        | 0.0656 |  400 | 6.2332          | 0.1701   | 52428800          |
| 5.723         | 0.1313 |  800 | 5.9116          | 0.1893   | 104857600         |
| 5.5106        | 0.1969 | 1200 | 5.7516          | 0.1976   | 157286400         |
| 5.455         | 0.2626 | 1600 | 5.6427          | 0.2032   | 209715200         |
| 5.3236        | 0.3282 | 2000 | 5.5567          | 0.2103   | 262144000         |
| 5.2764        | 0.3938 | 2400 | 5.4919          | 0.2151   | 314572800         |
| 5.1625        | 0.4595 | 2800 | 5.4436          | 0.2176   | 367001600         |
| 5.1851        | 0.5251 | 3200 | 5.3975          | 0.2206   | 419430400         |
| 5.0618        | 0.5908 | 3600 | 5.3624          | 0.2199   | 471859200         |
| 5.0278        | 0.6564 | 4000 | 5.3242          | 0.2236   | 524288000         |
| 5.0389        | 0.7220 | 4400 | 5.2920          | 0.2264   | 576716800         |
| 4.9732        | 0.7877 | 4800 | 5.2674          | 0.2276   | 629145600         |
| 4.9375        | 0.8533 | 5200 | 5.2418          | 0.2292   | 681574400         |
| 4.9322        | 0.9190 | 5600 | 5.2166          | 0.2312   | 734003200         |
| 4.8818        | 0.9846 | 6000 | 5.1981          | 0.2315   | 786432000         |
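As a sanity check on the token accounting (assuming fixed-length packed sequences, which the card does not state explicitly), the per-interval counts imply a 2048-token context length:

```python
# Each eval interval spans 400 optimizer steps at an effective batch of 64
tokens_per_interval = 52_428_800          # first row of the table above
steps_per_interval, effective_batch = 400, 64
seq_len = tokens_per_interval // (steps_per_interval * effective_batch)
assert seq_len == 2048                    # implied context length
# Cross-check against the last row: 6000 steps -> 786,432,000 tokens
assert 6000 * effective_batch * seq_len == 786_432_000
```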

Framework versions

  • Transformers 4.40.1
  • Pytorch 2.3.0+cu121
  • Datasets 2.19.0
  • Tokenizers 0.19.1
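A minimal environment check against these pins (version mismatches don't necessarily break anything, but custom-code checkpoints are sensitive to the transformers version):

```python
import datasets, tokenizers, torch, transformers

# Versions this card was produced with
expected = {
    transformers: "4.40.1",
    torch: "2.3.0+cu121",
    datasets: "2.19.0",
    tokenizers: "0.19.1",
}
for module, version in expected.items():
    if module.__version__ != version:
        print(f"{module.__name__}: have {module.__version__}, card used {version}")
```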