Model Card for pseudo-flex-base (1024x1024 base resolution)

stable-diffusion-2-1 (stabilityai/stable-diffusion-2-1) finetuned with different aspect ratios, into a photography model (ptx0/pseudo-real-beta).

Sample images

Seed: 2695929547

Steps: 25

Sampler: DDIM, default model config settings

Version: Pytorch 2.0.1, Diffusers 0.17.1

Guidance: 9.2

Guidance rescale: 0.0

Comparison images (stable diffusion, pseudo-flex, realism-engine) are not reproduced here; the resolutions compared were:

resolution	model
753x1004 (4:3)	v2-1
1280x720 (16:9)	v2-1
1024x1024 (1:1)	v2-1
1024x1024 (1:1)	v2-1

Background

The ptx0/pseudo-real-beta pretrained checkpoint had its unet trained for 4,200 steps and its text encoder trained for 15,600 steps at a batch size of 15 with 10 gradient accumulations, on a diverse dataset:

cushman (8000 kodachrome slides from 1939 to 1969)
midjourney v5.1-filtered (about 22,000 upscaled v5.1 images)
national geographic (about 3-4,000 >1024x768 images of animals, wildlife, landscapes, history)
a small dataset of stock images of people vaping / smoking

It is capable of a diverse range of photorealistic and adventure imagery, with strong prompt coherence. However, it lacks multi-aspect capability.

The code used to train pseudo-real-beta did not have aspect bucketing support. I discovered pseudo-flex-base by @ttj, which supported theories I had.

Training code

I added thorough aspect bucketing support to my training loop's dataloader: it discards any image smaller than 1024x1024 and resizes the rest so that the smaller side of the image is 1024. The aspect ratio of the image then determines the new length of the other side, e.g. it is used as a multiplier for landscape images or as a divisor for portrait images.

All images in a batch have the same resolution. Different resolutions at the same aspect ratio are all resized to 1024x... or ...x1024. For example, a 1920x1080 image becomes approximately 1820x1024.
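
The resizing rule described above can be sketched in a few lines. This is a minimal illustration, not the actual dataloader code: the function name `bucket_resolution` is hypothetical, and rounding to the nearest pixel is an assumption.

```python
def bucket_resolution(width: int, height: int, base: int = 1024):
    """Return the (width, height) an image is resized to, or None if it is discarded."""
    # Images smaller than the base resolution on either side are thrown away.
    if width < base or height < base:
        return None
    aspect = width / height
    if aspect >= 1:
        # Landscape or square: height is the smaller side, aspect acts as a multiplier.
        return round(base * aspect), base
    # Portrait: width is the smaller side, aspect acts as a divisor.
    return base, round(base / aspect)


print(bucket_resolution(1920, 1080))  # (1820, 1024)
```

Images that map to the same output size can then be grouped into one batch.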

Starting checkpoint

This model, pseudo-flex-base, was created by fine-tuning the base stabilityai/stable-diffusion-2-1 768 model, with its text encoder frozen, for 1000 steps on 148,000 images from LAION HD, using the TEXT field as the caption.

The effective batch size was again 150: a batch size of 15 with 10 gradient accumulation steps. This is very slow at very high resolutions; an aspect ratio of 1.5-1.7 takes about 700 seconds per iteration on an A100 80G.

This training took two days.

Text encoder swap

At 1000 steps, the text encoder from ptx0/pseudo-real-beta was used experimentally with this model's unet in an attempt to resolve some residual image noise, e.g. pixelation. That worked!

The training was restarted from checkpoint 1000 with this text encoder.

The beginnings of wide / portrait aspect appearing

Validation prompts began to "pull together" from 1300 to 2950 steps. Some checkpoints showed regression, but these usually resolved within about 100 steps. Improvements were always present, despite the regressions.

Degradation and dataset swap

After training for some time on 148,000 images at a batch size of 150 over 3000 steps, images began to degrade. This is presumably due to having completed 3 repeats on all images in the set, and that's IF all images in the set had been used. Considering that some of the image filters discarded about 50,000 images, we landed at 9 repeats per image on our super low learning rate.

This caused three issues:

The images were beginning to show static noise.
The training was taking a very long time, and each checkpoint showed little improvement.
Overfitting to prompt vocabulary, and a lack of generalization.
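
The repeat counts mentioned above come down to simple arithmetic: the number of samples seen (steps times effective batch size) divided by the number of usable images. A rough check using the figures quoted in this section follows. The function name is hypothetical, and the filtered size of 98,000 is an assumption derived from "about 50,000 discarded". The repeat count depends heavily on how many images actually survived filtering, so this estimate need not match the figure of 9 quoted above.

```python
def repeats_per_image(steps: int, effective_batch_size: int, dataset_size: int) -> float:
    """Average number of times each image is seen during training."""
    return steps * effective_batch_size / dataset_size


# 3000 steps at an effective batch size of 150, over the full 148,000 images.
print(round(repeats_per_image(3000, 150, 148_000), 2))  # 3.04

# Same run, assuming roughly 98,000 images remained after filtering.
print(round(repeats_per_image(3000, 150, 98_000), 2))  # 4.59
```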

Ergo, at 1300 steps, the decision was made to cease training on the original LAION HD dataset, and instead, train on a new freshly-retrieved subset of high-resolution Midjourney v5.1 data.

This consisted of 17,800 images at a base resolution of 1024x1024, with about 700 samples in portrait and 700 samples in landscape.

Contrast issues

When checkpoint 3275 was tested, a common observation was that darker images were washed out and brighter images seemed "meh".

Various CFG rescale and guidance levels were tested. The best dark images occurred around guidance_scale=9.2 and guidance_rescale=0.0, but they remained "washed out".

Dataset change number two

A new LAION subset was prepared with unique images and no square images - just a limited collection of aspect ratios:

16:9
9:16
2:3
3:2

This was intended to speed up the understanding of the model, and prevent overfitting on captions.

This LAION subset contained 17,800 images, evenly distributed through aspect ratios.

The images were then captioned using T5 Flan with BLIP2, to obtain highly accurate results.

Contrast fix: offset noise / SNR gamma to the rescue?

Offset noise and SNR gamma were applied experimentally to the checkpoint 4250:

snr_gamma=5.0
noise_offset=0.2
noise_pertubation=0.1
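
As a rough illustration of what noise_offset does: offset noise adds a random constant to each channel of the Gaussian training noise, which gives the model a signal for overall brightness shifts. The sketch below uses plain Python lists rather than latent tensors. The function name and the list layout are hypothetical, and this is not the actual training code.

```python
import random


def sample_offset_noise(channels, height, width, noise_offset=0.2, rng=None):
    """Gaussian noise with an extra random constant added to each channel."""
    if rng is None:
        rng = random.Random()
    noise = []
    for _ in range(channels):
        # One shift shared by every pixel in the channel, scaled by noise_offset.
        shift = noise_offset * rng.gauss(0.0, 1.0)
        channel = [
            [rng.gauss(0.0, 1.0) + shift for _ in range(width)]
            for _ in range(height)
        ]
        noise.append(channel)
    return noise


# Example: noise for a 4-channel 8x8 latent.
noise = sample_offset_noise(4, 8, 8, noise_offset=0.2, rng=random.Random(0))
```

With noise_offset=0.0 this reduces to ordinary Gaussian noise.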

Within 25 steps of training, the contrast was back, and the prompt "a solid black square" once again produced a reasonable result.

At 50 steps of offset noise, things really seemed to "click", and "a solid black square" had the fewest deformities I've seen.

The step 75 checkpoint was broken. The SNR gamma math resulted in numeric instability, so it was disabled. The offset noise parameters were left untouched.
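
For reference, snr_gamma usually refers to Min-SNR loss weighting, which clamps the per-timestep signal-to-noise ratio at gamma before using it to weight the loss. Below is a minimal sketch of the commonly used formula, assuming that is what was applied here; the function name is hypothetical. The epsilon-prediction form divides by the SNR, so it is undefined at an SNR of exactly 0. Whether that was the cause of the instability described above is not stated.

```python
def min_snr_weight(snr: float, gamma: float = 5.0, v_prediction: bool = False) -> float:
    """Min-SNR-gamma loss weight for a single timestep."""
    clipped = min(snr, gamma)
    if v_prediction:
        # v-prediction objective: divide by SNR + 1.
        return clipped / (snr + 1.0)
    # Epsilon-prediction objective: divide by SNR (undefined when snr == 0).
    return clipped / snr


# High-SNR (low-noise) timesteps are down-weighted; low-SNR ones keep weight 1.
print(min_snr_weight(10.0))  # 0.5
print(min_snr_weight(1.0))   # 1.0
```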

Success! Improvement in quality and contrast.

Similar to the text encoder swap, the images showed a marked improvement over the next several checkpoints.

It was left to its own devices, and at step 4475, enough improvement was observed that another revision in this repository was created.

Status: Test release

This model has been packaged up in a test form so that it can be thoroughly assessed by users.

For usage, see How to Get Started with the Model.

It aims to solve the following issues:

Generated images look like they are cropped from a larger image.
Generating non-square images creates weird results, due to the model being trained on square images.

Limitations:

It's trained on a small dataset, so its improvements may be limited.
The model architecture of SD 2.1 is older than SDXL, and will not generate comparably good results.

For the 1:1 aspect ratio, it is fine-tuned at 1024x1024, although ptx0/pseudo-real-beta, which it was based on, was last finetuned at 768x768.

Potential improvements:

Train on a captioned dataset. This model used the TEXT field from LAION for convenience, though COCO-generated captions would be superior.
Train the text encoder on large images.
Enforce periodic caption drop-out to help condition classifier-free guidance capabilities.
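
Caption drop-out, the last item above, is typically done by replacing a caption with the empty string for a small fraction of training samples, so the model also learns the unconditional case that classifier-free guidance relies on. This is a minimal sketch of that idea, not code from this project: the function name and the 10% default are assumptions.

```python
import random


def maybe_drop_caption(caption: str, dropout_probability: float = 0.1, rng=None) -> str:
    """Return an empty caption with the given probability, else the original."""
    if rng is None:
        rng = random
    # random() is in [0, 1), so probability 0.0 never drops and 1.0 always drops.
    return "" if rng.random() < dropout_probability else caption


captions = ["a leopard gecko stalking a cricket", "a knight protecting a castle"]
training_captions = [maybe_drop_caption(c) for c in captions]
```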

Model Card for pseudo-flex-base
Table of Contents
Model Details
- Model Description
Uses
Bias, Risks, and Limitations
- Recommendations
Training Details
- Training Data
- Training Procedure
  - Preprocessing
  - Speeds, Sizes, Times
Evaluation
- Testing Data, Factors & Metrics
- Results
Model Examination
Environmental Impact
Technical Specifications [optional]
- Model Architecture and Objective
- Compute Infrastructure
  - Hardware
  - Software
Citation
Glossary [optional]
More Information [optional]
Model Card Authors [optional]
Model Card Contact
How to Get Started with the Model

Model Details

Model Description

stable-diffusion-2-1 (stabilityai/stable-diffusion-2-1 and ptx0/pseudo-real-beta) finetuned for dynamic aspect ratios.

finetuned resolutions:

	width	height	aspect ratio	images
0	1024	1024	1:1	90561
1	1536	1024	3:2	8716
2	1365	1024	4:3	6933
3	1468	1024	~3:2	113
4	1778	1024	~5:3	6315
5	1200	1024	~5:4	6376
6	1333	1024	~4:3	2814
7	1281	1024	~5:4	52
8	1504	1024	~3:2	139
9	1479	1024	~3:2	25
10	1384	1024	~4:3	1676
11	1370	1024	~4:3	63
12	1499	1024	~3:2	436
13	1376	1024	~4:3	68

Other aspects were in smaller buckets. It could have been done more succinctly or carefully, but careless handling of the data was a part of the experiment parameters.

Developed by: pseudoterminal
Model type: Diffusion-based text-to-image generation model
Language(s): English
License: creativeml-openrail-m
Parent Model: https://huggingface.co/ptx0/pseudo-real-beta
Resources for more information: More information needed

Uses

see https://huggingface.co/stabilityai/stable-diffusion-2-1

Training Details

Training Data

LAION HD dataset subsets
- https://huggingface.co/datasets/laion/laion-high-resolution (we only used a small portion of it; see Preprocessing)

Preprocessing

All pre-processing is done via the scripts in bghira/SimpleTuner on GitHub.

Speeds, Sizes, Times

Dataset size: 100k image-caption pairs, after filtering.
Hardware: 1 A100 80G GPU
Optimizer: 8bit Adam
Batch size: 150
- actual batch size: 15
- gradient_accumulation_steps: 10
- effective batch size: 150
Learning rate: Constant 4e-8 which was adjusted by reducing batch size over time.
Training steps: WIP (ongoing)
Training time: approximately 4 days (so far)

Results

More information needed

Model Card Authors

pseudoterminal

How to Get Started with the Model

Use the code below to get started with the model.

# Use PyTorch 2!
import os

import torch
from diffusers import DiffusionPipeline, DDPMScheduler

# Any model currently on Hugging Face Hub.
model_id = 'ptx0/pseudo-flex-base'
pipeline = DiffusionPipeline.from_pretrained(model_id)

# Optimize!
pipeline.unet = torch.compile(pipeline.unet)

# Optional: load the scheduler config shipped with the model.
# It is not attached to the pipeline here; set pipeline.scheduler = scheduler to use it.
scheduler = DDPMScheduler.from_pretrained(
    model_id,
    subfolder="scheduler"
)

# Remove this if you get an error.
torch.set_float32_matmul_precision('high')

pipeline.to('cuda')
prompts = {
    "woman": "a woman, hanging out on the beach",
    "man": "a man playing guitar in a park",
    "lion": "Explore the ++majestic beauty++ of untamed ++lion prides++ as they roam the African plains --captivating expressions-- in the wildest national geographic adventure",
    "child": "a child flying a kite on a sunny day",
    "bear": "best quality ((bear)) in the swiss alps cinematic 8k highly detailed sharp focus intricate fur",
    "alien": "an alien exploring the Mars surface",
    "robot": "a robot serving coffee in a cafe",
    "knight": "a knight protecting a castle",
    "menn": "a group of smiling and happy men",
    "bicycle": "a bicycle, on a mountainside, on a sunny day",
    "cosmic": "cosmic entity, sitting in an impossible position, quantum reality, colours",
    "wizard": "a mage wizard, bearded and gray hair, blue  star hat with wand and mystical haze",
    "wizarddd": "digital art, fantasy, portrait of an old wizard, detailed",
    "macro": "a dramatic city-scape at sunset or sunrise",
    "micro": "RNA and other molecular machinery of life",
    "gecko": "a leopard gecko stalking a cricket"
}

# Make sure the output directory exists before saving.
os.makedirs('test', exist_ok=True)

for shortname, prompt in prompts.items():
    image = pipeline(
        prompt=prompt,
        negative_prompt='malformed, disgusting, overexposed, washed-out',
        # 25 steps, matching the sample settings listed at the top of this card.
        num_inference_steps=25,
        # A fixed seed makes the outputs reproducible.
        generator=torch.Generator(device='cuda').manual_seed(1641421826),
        width=1368, height=720, guidance_scale=7.5, guidance_rescale=0.3,
    ).images[0]
    image.save(f'test/{shortname}_nobetas.png', format="PNG")
