AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling

1 Fudan University, 2 Multimodal Art Projection Research Community, 3 Shanghai AI Laboratory


Abstract

We introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete representations for the unified processing of various modalities, including speech, text, images, and music. AnyGPT can be trained stably without any alterations to the current large language model (LLM) architecture or training paradigms. Instead, it relies exclusively on data-level preprocessing, facilitating the seamless integration of new modalities into LLMs, akin to the incorporation of new languages. We build a multimodal text-centric dataset for multimodal alignment pre-training. Utilizing generative models, we synthesize the first large-scale any-to-any multimodal instruction dataset. It consists of 108k samples of multi-turn conversations that intricately interweave various modalities, thus equipping the model to handle arbitrary combinations of multimodal inputs and outputs. Experimental results demonstrate that AnyGPT is capable of facilitating any-to-any multimodal conversation while achieving performance comparable to specialized models across all modalities, proving that discrete representations can effectively and conveniently unify multiple modalities within a language model.


Method


• Architecture


Figure 1: An overview of the AnyGPT model architecture. All modalities are tokenized into discrete tokens, upon which the LLM performs multimodal understanding and generation autoregressively. Only data pre-processing and post-processing are required, with the model's architecture and training objectives remaining unaltered.
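To make the data-level idea in Figure 1 concrete, here is a minimal Python sketch: per-modality tokenizers produce discrete codes, the codes are shifted into a shared vocabulary and wrapped in modality boundary tokens, and the resulting flat sequence is what an unmodified autoregressive LLM models with the ordinary next-token objective. The codebook sizes, vocabulary offsets, and special-token IDs below are illustrative assumptions, not the released configuration.

```python
# Minimal sketch (not the released implementation) of discrete-token unification:
# every modality becomes token IDs in one shared vocabulary, so the LLM's
# architecture and training objective never change.
from typing import List

# Hypothetical per-modality codebook sizes and offsets into the shared vocabulary.
TEXT_VOCAB = 32000
IMAGE_CODES = 8192
SPEECH_CODES = 1024

IMAGE_OFFSET = TEXT_VOCAB                 # image codes placed after text tokens
SPEECH_OFFSET = TEXT_VOCAB + IMAGE_CODES  # speech codes placed after image codes

# Illustrative special-token IDs marking modality boundaries.
SOI, EOI = 50000, 50001   # start / end of image
SOS, EOS = 50002, 50003   # start / end of speech


def wrap_image(codes: List[int]) -> List[int]:
    """Shift image codebook indices into the shared vocab and add boundary tokens."""
    return [SOI] + [c + IMAGE_OFFSET for c in codes] + [EOI]


def wrap_speech(codes: List[int]) -> List[int]:
    """Shift speech codebook indices into the shared vocab and add boundary tokens."""
    return [SOS] + [c + SPEECH_OFFSET for c in codes] + [EOS]


def build_sequence(text_ids: List[int],
                   image_codes: List[int],
                   speech_codes: List[int]) -> List[int]:
    """Interleave text, image, and speech tokens into one flat training sequence."""
    return text_ids + wrap_image(image_codes) + wrap_speech(speech_codes)


if __name__ == "__main__":
    # Toy token IDs standing in for real tokenizer outputs.
    demo = build_sequence(text_ids=[12, 345, 678],
                          image_codes=[5, 17, 4090],
                          speech_codes=[3, 999])
    print(demo)
```

Because the only change is a longer vocabulary plus pre- and post-processing, adding a new modality amounts to adding a new tokenizer and a new block of token IDs, much like adding a new language.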




• AnyInstruct Dataset Synthesis Process



Figure 4: The construction process of the multimodal interleaved instruction dataset AnyInstruct is divided into two stages: (1) generation of text-based conversations incorporating multimodal elements, and (2) text-to-multimodality conversion. The first stage generates topics, scenarios, and textual dialogues, while the second stage produces the final multimodal dialogues.
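The two stages in Figure 4 can be summarized with the hedged Python sketch below: stage 1 uses an LLM to draft purely textual dialogues whose images, speech, and music exist only as descriptions, and stage 2 hands those descriptions to modality-specific generative models. The Turn dataclass, the draft_llm callable, and the generators mapping are hypothetical stand-ins, not the actual AnyInstruct tooling.

```python
# Illustrative two-stage synthesis pipeline; all callables are assumed stand-ins
# for the LLM and the text-to-image / text-to-speech / text-to-music models.
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Turn:
    role: str                                        # "user" or "assistant"
    text: str                                        # textual content of the turn
    media: List[Dict] = field(default_factory=list)  # e.g. {"type": "image", "desc": "..."}


def stage1_text_dialogues(draft_llm: Callable[[str], List[Turn]],
                          topics: List[str]) -> List[List[Turn]]:
    """Stage 1: an LLM expands each topic into a scenario and a purely textual
    dialogue in which multimodal content appears only as descriptions."""
    return [draft_llm(topic) for topic in topics]


def stage2_render(dialogues: List[List[Turn]],
                  generators: Dict[str, Callable[[str], bytes]]) -> List[List[Turn]]:
    """Stage 2: modality-specific generative models turn each description into
    actual media, yielding interleaved multimodal conversations."""
    for dialogue in dialogues:
        for turn in dialogue:
            for item in turn.media:
                # generators["image"], generators["speech"], generators["music"]
                # map a textual description to rendered media bytes.
                item["data"] = generators[item["type"]](item["desc"])
    return dialogues
```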


Demonstrations



• Example-1: Speech conversations (Voice Clone)

Voice prompt (to clone): [audio clip]

User (speech): write a spring poem

AnyGPT (speech):
Spring's beauty,
The grass is green,
Colors abound,
A blooming spring we'll see,
Nature's miracle,
A new season of grace,
Spring's here, it's Spring!

Angry voice prompt (to clone): [audio clip]

User (speech): Generate an angry dramatic character line

AnyGPT (speech): I'm not going to take this anymore! I know it's a long-running joke but please, let me live in peace!

• Example-2: Speech Instruction → Text + Image + Music + Speech Response

User (speech): I've been too tired recently. Where do you think suitable to relax?

AnyGPT (speech): How about a serene lakeside?

User (speech): Can you give me some relaxing music?

AnyGPT (speech): Enjoy this peaceful setting. [music clip]

User (speech): Can you draw me a picture of sunny beach?

AnyGPT: [image]

User (speech): Give me a similar style of music?

AnyGPT: [music clip]

• Example-3: Speech Instruction + Music → Text + Image + Speech Response

User (speech + music): Can you convert the feeling of this music into an image?

AnyGPT (speech): The music feels energetic and triumphant. [image]

• Example-4: Speech Instruction + Image → Text + Music + Speech Response

User (speech + image): What do you see in this picture?

AnyGPT (speech): Unity and the warmth of human connection shining brightly out here. Want to feel it too?

User (speech): Generate a music for it.

AnyGPT: [music clip]

• Example-5: Text → Image + Music

User: Can you show me a picture of a sunny beach?

AnyGPT: [image]

User: Can you give me a piece of similar style of music?

AnyGPT: [music clip]

User: Draw a scene of a mysterious forest where night has fallen and moonlight filters through the treetops. In the center of the image, a winding stream quietly flows, reflecting the moonlight.

AnyGPT: [image]

User: Compose a piece of music for this scene.

AnyGPT: [music clip]

• Example-6: Text + Image → Music

User (text + image): Generate a piece of background music for this image.

AnyGPT: [music clip] (case 1)

AnyGPT: [music clip] (case 2)

AnyGPT: [music clip] (case 3)

User (text + image): Generate a piece of music based on the picture's emotion.

AnyGPT: [music clip]

• Example-7: Text + Image → Text + Music

User (text + image): Can you translate the emotion in this picture into music?

AnyGPT: The image exudes tranquility and meditation. [music clip]

User (text + image): Understand this picture, then generate a piece of matching music. Please think step by step.

AnyGPT: The painting is abstract with bold use of color. Here's the music: [music clip]

• Example-8: Text + Music → Text + Image

User (text + music): What instrument is in this piece of music?

AnyGPT: The bass drum is prominently featured, often indicating a strong beat in music.

User: Draw a picture about it.

AnyGPT: [image]

User (text + music): Can you find an image to match this song's mood?

AnyGPT: The music feels tender and nostalgic. [image]

• Example-9: Text + Music → Music

User (text + music): Can you generate similar music I might enjoy?

AnyGPT: [music clip]


Related Links: SpeechGPT, SEED, CoDi, NExTGPT