
# griffin-llama3t-8L-v0.02-fineweb

Pretraining experiment with the griffin/recurrent_gemma architecture (~234M parameters). This variant uses the Llama-3 tokenizer.

## Model description

Further training of pszemraj/griffin-1024-llama3t-8layer-simplewiki-silu on the BEE-spoke-data/fineweb-1M_en-med dataset. It achieves the following results on the evaluation set:

  • Loss: 5.6538
  • Accuracy: 0.1881
  • Num Input Tokens Seen: 766509056
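
A minimal loading sketch, assuming the standard `transformers` auto classes (the custom griffin/recurrent_gemma code lives in the model repo, so `trust_remote_code=True` is needed, as in the eval config below; the prompt is just an arbitrary example):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pszemraj/griffin-llama3t-8L-v0.02-fineweb"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# custom architecture code is pulled from the model repo
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer("The griffin architecture is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```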

## Evals

tl;dr: it's bad and would need more training:

`hf (pretrained=pszemraj/griffin-llama3t-8L-v0.02-fineweb,trust_remote_code=True,dtype=float), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 4`

| Tasks          | Version | Filter | n-shot | Metric     |       Value |   |     Stderr |
|----------------|--------:|--------|-------:|------------|------------:|---|-----------:|
| winogrande     |       1 | none   |      0 | acc        |      0.4964 | ± |     0.0141 |
| piqa           |       1 | none   |      0 | acc        |      0.5332 | ± |     0.0116 |
|                |         | none   |      0 | acc_norm   |      0.5299 | ± |     0.0116 |
| openbookqa     |       1 | none   |      0 | acc        |      0.1280 | ± |     0.0150 |
|                |         | none   |      0 | acc_norm   |      0.2320 | ± |     0.0189 |
| lambada_openai |       1 | none   |      0 | perplexity | 638060.0702 | ± | 43608.0044 |
|                |         | none   |      0 | acc        |      0.0000 | ± |     0.0000 |
| boolq          |       2 | none   |      0 | acc        |      0.3783 | ± |     0.0085 |
| arc_easy       |       1 | none   |      0 | acc        |      0.2614 | ± |     0.0090 |
|                |         | none   |      0 | acc_norm   |      0.2744 | ± |     0.0092 |
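
These numbers come from the EleutherAI lm-evaluation-harness (the config line above the table is its run summary). A hedged reproduction sketch via the harness's Python entry point; the task list is inferred from the table and the exact flags of the original run are an assumption:

```python
import lm_eval

# assumed reproduction of the run summarized above; task list inferred from the results table
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=pszemraj/griffin-llama3t-8L-v0.02-fineweb,trust_remote_code=True,dtype=float",
    tasks=["winogrande", "piqa", "openbookqa", "lambada_openai", "boolq", "arc_easy"],
    num_fewshot=0,  # matches the n-shot column above
    batch_size=4,
)
print(results["results"])
```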

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.0003
  • train_batch_size: 2
  • eval_batch_size: 2
  • seed: 80085
  • gradient_accumulation_steps: 32
  • total_train_batch_size: 64
  • optimizer: Adam with betas=(0.9,0.99) and epsilon=1e-07
  • lr_scheduler_type: inverse_sqrt
  • lr_scheduler_warmup_ratio: 0.05
  • num_epochs: 1.0
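
For orientation, a sketch of how these map onto `transformers.TrainingArguments`, assuming the run used the Hugging Face `Trainer` (other options from the actual training script are not shown in this card):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="griffin-llama3t-8L-v0.02-fineweb",  # hypothetical output path
    learning_rate=3e-4,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=32,  # 2 * 32 = 64 total train batch size
    seed=80085,
    adam_beta1=0.9,
    adam_beta2=0.99,
    adam_epsilon=1e-7,
    lr_scheduler_type="inverse_sqrt",
    warmup_ratio=0.05,
    num_train_epochs=1.0,
)
```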

### Training results

| Training Loss | Epoch  | Step | Validation Loss | Accuracy | Input Tokens Seen |
|:-------------:|:------:|:----:|:---------------:|:--------:|:-----------------:|
| 6.4019        | 0.0684 | 400  | 6.7690          | 0.1278   | 52428800          |
| 6.0547        | 0.1368 | 800  | 6.4214          | 0.1460   | 104857600         |
| 5.8133        | 0.2052 | 1200 | 6.2566          | 0.1550   | 157286400         |
| 5.7212        | 0.2736 | 1600 | 6.1411          | 0.1620   | 209715200         |
| 5.6175        | 0.3420 | 2000 | 6.0502          | 0.1669   | 262144000         |
| 5.5014        | 0.4104 | 2400 | 5.9827          | 0.1687   | 314572800         |
| 5.4882        | 0.4788 | 2800 | 5.9203          | 0.1731   | 367001600         |
| 5.3972        | 0.5472 | 3200 | 5.8614          | 0.1782   | 419430400         |
| 5.3983        | 0.6156 | 3600 | 5.8340          | 0.1773   | 471859200         |
| 5.3175        | 0.6840 | 4000 | 5.7916          | 0.1814   | 524288000         |
| 5.3014        | 0.7524 | 4400 | 5.7565          | 0.1814   | 576716800         |
| 5.2749        | 0.8208 | 4800 | 5.7303          | 0.1849   | 629145600         |
| 5.2264        | 0.8892 | 5200 | 5.6993          | 0.1850   | 681574400         |
| 5.2107        | 0.9576 | 5600 | 5.6745          | 0.1884   | 734003200         |

### Framework versions

  • Transformers 4.40.1
  • Pytorch 2.3.0+cu121
  • Datasets 2.19.0
  • Tokenizers 0.19.1