Summary of "Chameleon: Mixed-Modal Early-Fusion Foundation Models"

Chameleon is a mixed-modal, early-fusion foundation model that understands and generates both text and images within a single token stream. Its architecture largely follows LLaMa-2, with key deviations to improve training stability: query-key normalization, dropout, re-ordering of the normalization layers, and z-loss regularization (a minimal sketch of query-key normalization and z-loss follows the summary).

At its core, Chameleon encodes each image as a sequence of 1024 discrete tokens drawn from a codebook of 8192 entries, using an image tokenizer trained specifically for the model and based on Gafni et al. (2022). This tokenizer was trained on licensed images and struggles to reconstruct images containing large amounts of text, which limits performance on OCR-style tasks. Text is tokenized with a vocabulary of 65,536 tokens that includes the 8192 image tokens, so both modalities share a single vocabulary (see the shared-vocabulary sketch below).

Chameleon was pre-trained on a massive dataset comprising 2.9 trillion text-only tokens from the LLaMa-2 and CodeLLaMa corpora, 1.4 billion text-image pairs (about 1.5 trillion tokens), and 400 billion tokens of interleaved text and images from public web sources. The last 20% of pre-training shifted to higher-quality datasets, including filtered instruction-tuning data. Training the 7B-parameter model required 1024 NVIDIA A100 GPUs and 856,481 GPU hours, while the 34B model used 3072 GPUs and 4,282,407 GPU hours on Meta's Research Super Cluster. Optimization used AdamW with linear warmup, exponential learning-rate decay, weight decay, gradient clipping, and varying batch sizes (a generic warmup-plus-decay schedule is sketched below).

Chameleon's text-only capabilities were evaluated following LLaMa-2's protocol on commonsense-reasoning, reading-comprehension, math, and world-knowledge benchmarks. Image-to-text performance was assessed on image captioning (COCO, Flickr30k) and visual question answering (VQAv2), where it matched or outperformed larger open-source models such as Flamingo and IDEFICS. For mixed-modal reasoning, human annotators wrote 1048 prompts (441 mixed-modal, 607 text-only) across 12 task types; Chameleon's responses were preferred over those of GPT-4V and Gemini Pro, both with and without augmenting the baselines' text-only outputs with images. Certain visual-understanding tasks, such as OCR and infographic interpretation, were not evaluated.

Safety and alignment were handled through lightweight supervised fine-tuning on curated datasets covering text, code, visual chat, image generation, interleaved generation, and safety examples, drawn from sources such as LLaMa-2-Chat, Rainbow Teaming, Pick-A-Pic, and internal data. Safety testing combined crowdsourced prompts with internal red-team probing for vulnerabilities. While Chameleon demonstrates impressive performance, its limitations include the image tokenizer's weakness on text-heavy images and the lack of a direct comparison with other native mixed-modal models, since their APIs do not expose interleaved output.
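
To make the stability tweaks concrete, here is a minimal PyTorch sketch of two of them: query-key normalization (normalizing queries and keys before the attention score computation) and z-loss (penalizing the log of the softmax partition function). The tensor layout, choice of LayerNorm, and the z-loss coefficient are illustrative assumptions, not the paper's exact hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Causal self-attention with query-key normalization (sketch)."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        # Normalization over the per-head dimension of queries and keys;
        # the exact norm type (LayerNorm vs. RMSNorm) is an assumption here.
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(b, t, self.n_heads, self.head_dim)
        k = k.view(b, t, self.n_heads, self.head_dim)
        v = v.view(b, t, self.n_heads, self.head_dim)
        # Normalizing q and k bounds the attention logits, which helps
        # prevent the divergence seen when training large mixed-modal models.
        q, k = self.q_norm(q), self.k_norm(k)
        attn = F.scaled_dot_product_attention(
            q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2),
            is_causal=True,
        )
        return self.out(attn.transpose(1, 2).reshape(b, t, d))


def z_loss(logits: torch.Tensor, coeff: float = 1e-4) -> torch.Tensor:
    """Z-loss: penalize log(Z)^2, where Z is the softmax partition function,
    to keep output logits from drifting. The coefficient is a placeholder."""
    log_z = torch.logsumexp(logits, dim=-1)
    return coeff * (log_z ** 2).mean()
```

In practice the z-loss term would simply be added to the usual cross-entropy objective during pre-training.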
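The shared vocabulary is what makes early fusion work: image codebook indices are mapped into the same id space as text tokens, so one transformer models both. The sketch below shows one way to splice a tokenized image into a text token stream; the offset placement and the begin/end-of-image sentinel ids are assumptions for illustration, not Chameleon's actual vocabulary layout.

```python
import torch

# Counts taken from the summary; the offset and sentinel ids are assumed.
VOCAB_SIZE = 65_536          # total shared vocabulary
NUM_IMAGE_CODES = 8_192      # size of the image tokenizer codebook
TOKENS_PER_IMAGE = 1_024     # discrete tokens per encoded image
IMAGE_CODE_OFFSET = VOCAB_SIZE - NUM_IMAGE_CODES  # reserved range (assumed)
BOI, EOI = 0, 1              # hypothetical begin/end-of-image sentinels


def splice_image(text_ids: list[int], image_codes: torch.Tensor,
                 position: int) -> list[int]:
    """Insert a tokenized image into a text token stream.

    `image_codes` is the (1024,) tensor of codebook indices produced by the
    VQ image tokenizer; shifting them by IMAGE_CODE_OFFSET maps them into
    the shared vocabulary alongside the text tokens.
    """
    assert image_codes.shape == (TOKENS_PER_IMAGE,)
    shifted = (image_codes + IMAGE_CODE_OFFSET).tolist()
    return text_ids[:position] + [BOI] + shifted + [EOI] + text_ids[position:]
```

Generation works in reverse: whenever the model emits ids in the reserved image range, they are collected and decoded back into pixels by the image tokenizer's decoder.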
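The data mixture can be read as sampling proportions across the three sources. The snippet below is a generic weighted-sampling sketch using the token counts from the summary; it is not Chameleon's data-loading code, and the second-stage reweighting is only noted in a comment.

```python
import random

# Approximate token counts from the summary, in billions of tokens.
PRETRAIN_MIX = {
    "text_only": 2_900,        # LLaMa-2 + CodeLLaMa text data
    "text_image_pairs": 1_500, # ~1.4B captioned images
    "interleaved_web": 400,    # interleaved text-image documents
}

def sample_source(mix: dict[str, float]) -> str:
    """Pick a data source in proportion to its token count (generic sketch)."""
    sources, weights = zip(*mix.items())
    return random.choices(sources, weights=weights, k=1)[0]

# The last 20% of training shifts toward higher-quality, filtered data;
# one common way to express such a curriculum is a second mix dict with
# different weights (values would be placeholders, not the paper's).
```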
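Finally, here is a generic sketch of the optimizer setup the summary lists: AdamW with weight decay, linear warmup, exponential learning-rate decay, and gradient clipping. All numeric values are placeholders, not the paper's hyperparameters.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Placeholder hyperparameters; the real values are in the paper.
PEAK_LR = 1e-4
WARMUP_STEPS = 4_000
TOTAL_STEPS = 500_000
DECAY_RATE = 0.1          # fraction of peak LR reached at the final step
MAX_GRAD_NORM = 1.0

def make_optimizer(model: torch.nn.Module):
    opt = AdamW(model.parameters(), lr=PEAK_LR,
                betas=(0.9, 0.95), weight_decay=0.1)

    def lr_lambda(step: int) -> float:
        if step < WARMUP_STEPS:
            # Linear warmup from 0 to the peak learning rate.
            return step / max(1, WARMUP_STEPS)
        # Exponential decay from the peak toward DECAY_RATE * peak.
        progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
        return DECAY_RATE ** progress

    return opt, LambdaLR(opt, lr_lambda)

# In the training loop, gradients are clipped before each optimizer step:
#   torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
#   opt.step(); scheduler.step(); opt.zero_grad()
```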
