LLaVA-OneVision: A family of open large multimodal models (LMMs) to simplify visual task transfer

A key goal in AI development is to create general-purpose assistants built on large multimodal models (LMMs). Building AI systems that can work with humans across different environments and on a wide variety of tasks is central to this idea. Such assistants are not limited to a single area of expertise; they can handle customer service, creative projects, personal task management, and even demanding analytical work. By using LMMs, these assistants can process and respond to a wider variety of inputs, which increases their versatility and practical value.

A collaboration between ByteDance, NTU, CUHK, and HKUST has produced LLaVA-OneVision, a significant advancement in Large Language and Vision Assistant (LLaVA) research. The system demonstrates how to build a single model that can understand and perform a wide range of computer-vision tasks in real-world scenarios. Its recipe, a simple connector module that links vision encoders to large language models (LLMs), is inexpensive and can benefit the entire AI community.

The first LLaVA model showed remarkable multimodal conversational capabilities, occasionally mimicking GPT-4V's behavior on new images and instructions. LLaVA-1.5 achieved state-of-the-art (SoTA) performance across a broad set of benchmarks with a data-efficient recipe, greatly extending its capabilities by incorporating more academic instruction data. LLaVA-NeXT built on this foundation and significantly improved performance through three main techniques: the AnyRes scheme for processing high-resolution images, an expanded pool of high-quality instruction data, and the largest open-source LLMs available at the time. The minimalist design of the LLaVA series carries over into the model architecture, with the main goals of making good use of the pre-trained capabilities of the LLM and the vision model, and of enabling strong data and model scaling behavior.

Modelling of LLaVA-OneVision

The representation of visual signals is key to successful visual encoding. Two factors determine the configuration of the input visual representation: the raw pixel resolution and the number of tokens in the feature space. Scaling both improves performance, especially on tasks that require fine visual detail. The researchers found that scaling resolution is more effective than scaling the token count for balancing performance against cost, and they propose an AnyRes strategy with pooling: a high-resolution image is split into crops that are encoded separately, and the resulting visual tokens are pooled whenever the total would exceed a budget.
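To make the resolution-versus-token trade-off concrete, here is a minimal PyTorch sketch of the AnyRes-with-pooling idea. The helper name, the token budget, and the assumption of a square patch grid are illustrative choices, not the authors' exact implementation.

```python
# Minimal sketch of AnyRes with pooling (illustrative, not the official code).
import torch
import torch.nn.functional as F

def encode_anyres(crops: torch.Tensor, vision_encoder, max_total_tokens: int = 7290):
    """crops: (num_crops, 3, H, W) -- a resized base image plus high-resolution crops."""
    feats = vision_encoder(crops)                 # (num_crops, tokens_per_crop, dim)
    num_crops, n_tok, dim = feats.shape

    # If the combined token count exceeds the budget, shrink each crop's
    # patch grid with bilinear interpolation instead of dropping crops.
    if num_crops * n_tok > max_total_tokens:
        side = int(n_tok ** 0.5)                  # assume a square patch grid
        target = int((max_total_tokens / num_crops) ** 0.5)
        grid = feats.view(num_crops, side, side, dim).permute(0, 3, 1, 2)
        grid = F.interpolate(grid, size=(target, target), mode="bilinear", align_corners=False)
        feats = grid.permute(0, 2, 3, 1).reshape(num_crops, target * target, dim)

    # Flatten all crops into one visual token sequence for the LLM.
    return feats.reshape(1, -1, dim)
```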

For data scaling in multimodal pre-training, the proposed approach is more efficient than ingesting ever more web-scale image-text pairs, whose quality is often poor. By focusing on high-quality knowledge learning within a limited computational budget, the researchers aim to refine and extend the knowledge that the pre-trained LLMs and ViTs already possess. To that end, they carefully curate data from three main areas (a sketch of assembling this mixture follows the list):

  • Re-captioned detailed description data: Among open-source LMMs, LLaVA-NeXT-34B stands out for its ability to generate detailed captions. The team used that model to generate new captions for the COCO118K, BLIP558K, and CC3M datasets, producing 3.5 million re-captioned detailed-description examples. Because an earlier version of the model generates its own training data, this can be viewed as a basic form of AI self-improvement.
  • Document and optical character recognition (OCR) data: The team used the 100,000-record text-reading subset of the UReader dataset, which is easy to obtain through PDF rendering. Combining this text-reading data with SynDOG EN/CN yielded 1.1 million document/OCR examples.
  • Chinese and language data: To strengthen the model's Chinese capability, the researchers used the original ShareGPT4V images and GPT-4V (via the Azure API) to generate 92,000 detailed Chinese caption examples. To keep the model's language understanding balanced against the large amount of detailed caption data, they also drew 143,000 examples from the Evo-Instruct dataset.
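As a rough illustration of how such a high-quality knowledge mixture might be assembled (file names and loading code are hypothetical; only the three buckets and their approximate sizes come from the description above):

```python
# Hypothetical sketch of pooling the three high-quality knowledge buckets.
import json
import random

def load_jsonl(path):
    """Each line is assumed to be one {image, conversations} training record."""
    with open(path) as f:
        return [json.loads(line) for line in f]

MIXTURE = {
    # bucket name                       -> placeholder file                      (approx. size)
    "recaptioned_detailed_descriptions": "recap_coco118k_blip558k_cc3m.jsonl",   # ~3.5M
    "document_and_ocr":                  "ureader_tr_plus_syndog_en_cn.jsonl",   # ~1.1M
    "chinese_and_language":              "sharegpt4v_zh_plus_evo_instruct.jsonl", # ~235K
}

pool = []
for bucket, path in MIXTURE.items():
    pool.extend(load_jsonl(path))
random.shuffle(pool)  # a single shuffled pass over the pooled data in this sketch
```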

Adapting an LMM to interpret and respond to visual instructions is called visual instruction tuning. The large multimodal model (LMM) processes instructions that may involve text, images, or videos, and producing the required responses means combining visual understanding with natural language processing. Previous research has shown that LMM capability relies heavily on visual instruction-tuning data, so it is important and beneficial for the community to maintain a repository of high-quality datasets. The researchers collected data across categories in an uneven ratio from a variety of original sources to create a large pool of instruction-tuning datasets, and they also used several newly acquired subsets of the Cauldron and Cambrian datasets. The data is classified with a three-level hierarchy of vision, instruction, and response (illustrated in the sketch below).
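The hierarchy can be pictured roughly as follows; the concrete category labels are illustrative groupings drawn from the descriptions in this article, not an exhaustive reproduction of the authors' taxonomy.

```python
# Illustrative view of the vision / instruction / response hierarchy.
DATA_HIERARCHY = {
    "vision": ["single-image", "multi-image", "video"],
    "instruction": ["general QA", "document/OCR", "math & reasoning", "language/conversation"],
    "response": {
        "free-form": "answers written by strong annotators such as GPT-4V/o or Gemini",
        "fixed-form": "short answers / multiple choice from academic datasets (VQAv2, GQA, ...)",
    },
}
```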

Academic datasets such as VQAv2, GQA, and Visual Genome provide fixed-form data, while advanced models such as Gemini and GPT-4V/o annotate free-form data. For free-form data, the original answers are preserved. For fixed-form data, however, the team manually reviews the material and fixes any errors they find in the question-and-answer formats. For data types such as multiple choice, short answers, and special tasks (e.g., OCR), the LLaVA-1.5 prompting strategy is applied (see the sketch below). This is important to guide the model's behavior, avoid conflicts between different data sources, and strike a good balance between QA performance, conversational ability, and reasoning on more complex tasks.
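Below is a small sketch of how such response-format prompts can be appended to fixed-form questions. The two prompt strings follow the LLaVA-1.5 convention; the helper itself is illustrative rather than the authors' preprocessing code.

```python
# Sketch of LLaVA-1.5-style response-format prompting for fixed-form data.
FORMAT_PROMPTS = {
    "short_answer": "Answer the question using a single word or phrase.",
    "multiple_choice": "Answer with the option's letter from the given choices directly.",
}

def format_fixed_form(question, task_type, options=None):
    """Append answer options (if any) and the matching format prompt."""
    if task_type == "multiple_choice" and options:
        letters = "ABCDEFGH"
        question += "\n" + "\n".join(f"{letters[i]}. {opt}" for i, opt in enumerate(options))
    return f"{question}\n{FORMAT_PROMPTS[task_type]}"

# Example:
# format_fixed_form("What is shown in the image?", "multiple_choice",
#                   ["a cat", "a dog", "a bird"])
```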

The instruction data is split into two groups: one for single-image scenarios and one for all vision scenarios (single-image, multi-image, and video). Their previous research motivated this separation by demonstrating the interdependence of image and video models; in particular, a more robust image model generalizes better to multi-image and video tasks. Training datasets for single-image tasks are also far larger and of higher quality than those for video and multi-image tasks.

To equip the LLM with multimodal skills, the team divides training into three learning stages, which also makes ablation experiments straightforward. Training follows a curriculum-learning principle, ordering training objectives and examples from easier to increasingly challenging tasks.

  1. The first stage is language-image alignment. The goal is to align the visual features with the word embedding space of the LLM.
  2. The next stage is high-quality knowledge learning. The researchers propose feeding carefully curated, high-quality knowledge into LMM learning to balance computational cost against injecting new information into the model.
  3. The researchers then perform visual instruction tuning, categorizing the instruction data into multiple groups to train the LMM to respond appropriately to different visual tasks. The procedure involves two distinct steps: (i) single-image training: after training on 3.2 million single-image examples, the model develops a strong ability to follow diverse instructions to complete visual tasks involving one image; (ii) OneVision training: the model is then trained on a mixture of video, single-image, and multi-image data. At this point it can handle scenarios beyond a single image, and emergent capabilities appear as it learns to follow instructions across different settings and transfers this knowledge to new scenarios. A schematic of this staging is sketched after the list.
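A schematic of the curriculum, expressed as a stage list plus a helper that freezes or unfreezes modules by parameter-name prefix, might look like the sketch below. The staging order mirrors the description above; which modules are trained at each stage, and all names and data labels, are assumptions for illustration rather than the paper's exact settings.

```python
# Schematic three-stage curriculum (illustrative; not the official training script).
import torch

STAGES = [
    {"name": "stage1_language_image_alignment",
     "data": "alignment_caption_data",
     "trainable": ["projector"]},                        # align vision features to the LLM
    {"name": "stage1_5_high_quality_knowledge",
     "data": "recaption_ocr_chinese_mixture",
     "trainable": ["projector", "vision_encoder", "llm"]},
    {"name": "stage2_single_image",
     "data": "single_image_instructions_3_2M",
     "trainable": ["projector", "vision_encoder", "llm"]},
    {"name": "stage2_onevision",
     "data": "single_image_multi_image_video_mixture",
     "trainable": ["projector", "vision_encoder", "llm"]},
]

def set_trainable(model: torch.nn.Module, prefixes):
    """Freeze everything, then unfreeze parameters whose names start with a listed prefix."""
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(p) for p in prefixes)

# for stage in STAGES:
#     set_trainable(model, stage["trainable"])
#     train_one_stage(model, data=stage["data"])   # training loop omitted from this sketch
```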

Using LMMs-Eval, the researchers run consistent and reproducible evaluations of the LLaVA-OneVision models across all benchmarks. For other well-known LMMs, they mainly report numbers from the original papers so that comparisons are fair; when no results are available, they load the models into LMMs-Eval and test them with consistent settings. Unless otherwise stated, all results use greedy decoding and 0-shot settings. To probe the effectiveness and generalizability of the proposed paradigm, they evaluate the LLaVA-OneVision models thoroughly across single-image, multi-image, and video modalities. The checkpoints obtained after the single-image and OneVision phases of training are referred to as LLaVA-OV (SI) and LLaVA-OV, respectively. Three model sizes are available (0.5B, 7B, and 72B), serving applications from edge devices to cloud serving with different performance and throughput trade-offs.

GPT-4V and GPT-4o serve as reference points for these results. The largest model, LLaVA-OneVision-72B, outperforms GPT-4V on most benchmarks but still trails GPT-4o. The results show that the recipe is effective, which bodes well for future scaling efforts. Nevertheless, a significant gap remains on more complicated tasks such as visual chat scenarios; the team leaves this to future work focused on stronger LLMs, larger training datasets, and improved preference learning.


Check out the Paper and Project page. All credit for this research goes to the researchers of this project.



Dhanshree Shenwai is a Computer Science Engineer with extensive experience in FinTech companies in Finance, Cards & Payments and Banking with a keen interest in AI applications. She is passionate about exploring new technologies and advancements in today’s evolving world that make everyone’s life easier.
