SEED-X

We introduce SEED-X, a unified and versatile foundation model, which can serve as various multimodal AI assistants in the real world after different instruction tuning, capable of responding to a variety of user needs through unifying multi-granularity comprehension and generation.

All models and inference code are released!

News

2024-04-22 :hugs: We release the models including the pre-trained foundation model SEED-X, the general instruction-tuned model SEED-X-I, the editing model SEED-X-Edit, and our de-tokenier, which can generate realistic images from ViT features (w/o or w/ a condition image).

2024-04-22 :hugs: We release an online gradio demo of a general instruction-tuned model SEED-X-I. SEED-X-I can follow multimodal instruction (including images with dynamic resolutions) and make responses with images, texts and bounding boxes in multi-turn conversation. SEED-X-I does not support image manipulation. If you want to experience SEED-X-Edit for high-precision image editing, the inference code and model will be released soon.

TODOs

Release the multimodal foundation model SEED-X.
Release the instruction-tuned model SEED-X-Edit for high-precision image editing.
Release 3.7M in-house image editing data.

Usage

Dependencies

Python >= 3.8 (Recommend to use Anaconda)
PyTorch >=2.0.1
NVIDIA GPU + CUDA

Installation

Clone the repo and install dependent packages

git clone https://github.com/AILab-CVC/SEED-X.git
cd SEED-X
pip install -r requirements.txt

Model Weights

We release the pretrained De-Tokenizer, the pre-trained foundation model SEED-X, the general instruction-tuned model SEED-X-I, the editing model SEED-X-Edit in in SEED-X-17B Hugging Face.

Please download the checkpoints and save them under the folder ./pretrained. For example, ./pretrained/seed_x.

You also need to download stable-diffusion-xl-base-1.0 and Qwen-VL-Chat, and save them under the folder ./pretrained. Please use the following script to extract the weights of visual encoder in Qwen-VL-Chat.

python3 src/tools/reload_qwen_vit.py

Inference with SEED-X De-tokenizer

# For image reconstruction with ViT image features
python3 src/inference/eval_seed_x_detokenizer.py
# For image reconstruction with ViT image features and conditional image
python3 src/inference/eval_seed_x_detokenizer_with_condition.py

Inference with pre-trained model SEED-X

# For image comprehension and detection
python3 src/inference/eval_img2text_seed_x.py
# For image generation
python3 src/inference/eval_text2img_seed_x.py

Inference with the general instruction-tuned model SEED-X-I

# For image comprehension and detection
python3 src/inference/eval_img2text_seed_x_i.py
# For image generation
python3 src/inference/eval_text2img_seed_x_i.py

Inference with the editing model SEED-X-Edit

# For image editing
python3 src/inference/eval_img2edit_seed_x_edit.py

Citation

If you find the work helpful, please consider citing:

@article{ge2024seed,
  title={SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation},
  author={Ge, Yuying and Zhao, Sijie and Zhu, Jinguo and Ge, Yixiao and Yi, Kun and Song, Lin and Li, Chen and Ding, Xiaohan and Shan, Ying},
  journal={arXiv preprint arXiv:2404.14396},
  year={2024}
}

License

SEED is licensed under the Apache License Version 2.0 except for the third-party components listed in License.

During training SEED-X, we freeze the original parameters of LLaMA2 and optimize the LoRA module.