CCMat's Collections: toread
- MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training (arXiv:2311.17049)
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (arXiv:2405.04434)
- A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision (arXiv:2303.17376)
- Sigmoid Loss for Language Image Pre-Training (arXiv:2303.15343)
- Better & Faster Large Language Models via Multi-token Prediction (arXiv:2404.19737)
- Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads (arXiv:2401.10774)
- InstantFamily: Masked Attention for Zero-shot Multi-ID Image Generation (arXiv:2404.19427)
- CogVLM: Visual Expert for Pretrained Language Models (arXiv:2311.03079)
- InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD (arXiv:2404.06512)
- InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model (arXiv:2401.16420)
- InstantStyle: Free Lunch towards Style-Preserving in Text-to-Image Generation (arXiv:2404.02733)
- Demonstration-Regularized RL (arXiv:2310.17303)
- Vision Transformers Need Registers (arXiv:2309.16588)
- StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation (arXiv:2405.01434)
- Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation (arXiv:2404.19752)
- Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models (arXiv:2405.01535)
- LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report (arXiv:2405.00732)
- RLHF Workflow: From Reward Modeling to Online RLHF (arXiv:2405.07863)
- What matters when building vision-language models? (arXiv:2405.02246)
- Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding (arXiv:2405.08748)
- Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection (arXiv:2405.10300)
- Many-Shot In-Context Learning in Multimodal Foundation Models (arXiv:2405.09798)
- CAT3D: Create Anything in 3D with Multi-View Diffusion Models (arXiv:2405.10314)
- LoRA Learns Less and Forgets Less (arXiv:2405.09673)
- Chameleon: Mixed-Modal Early-Fusion Foundation Models (arXiv:2405.09818)
- Layer-Condensed KV Cache for Efficient Inference of Large Language Models (arXiv:2405.10637)
- OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework (arXiv:2405.11143)
- MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning (arXiv:2405.12130)
- FIFO-Diffusion: Generating Infinite Videos from Text without Training (arXiv:2405.11473)
- Face Adapter for Pre-Trained Diffusion Models with Fine-Grained ID and Attribute Control (arXiv:2405.12970)
- Reducing Transformer Key-Value Cache Size with Cross-Layer Attention (arXiv:2405.12981)
- Diffusion for World Modeling: Visual Details Matter in Atari (arXiv:2405.12399)
- Your Transformer is Secretly Linear (arXiv:2405.12250)
- ReVideo: Remake a Video with Motion and Content Control (arXiv:2405.13865)
- Matryoshka Multimodal Models (arXiv:2405.17430)
- An Introduction to Vision-Language Modeling (arXiv:2405.17247)
- ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models (arXiv:2405.15738)