ChatAnything: Facetime Chat with LLM-Enhanced Personas


Nankai University, Bytedance Inc.   *Equal Contribution   §Project Lead   +Corresponding Author
"Hello, how are you doing today!"


A simple video demonstrating the application.

Abstract

In this technical report, we target generating anthropomorphized personas for LLM-based characters in an online manner, including visual appearance, personality, and tones, with only text descriptions. To achieve this, we first leverage the in-context learning capability of LLMs for personality generation by carefully designing a set of system prompts. We then propose two novel concepts: the mixture of voices (MoV) and the mixture of diffusers (MoD) for diverse voice and appearance generation. For MoV, we utilize text-to-speech (TTS) algorithms with a variety of pre-defined tones and automatically select the one that best matches the user-provided text description. For MoD, we combine recent popular text-to-image generation techniques and talking-head algorithms to streamline the process of generating talking objects. We term the whole framework ChatAnything. With it, users are able to animate anything with anthropomorphic personas using just a few text inputs. However, we have observed that the anthropomorphic objects produced by current popular generative models often go undetected by pre-trained face landmark detectors, leading to failures in face motion generation even when the faces possess human-like appearances, because such images are rarely seen during training (i.e., they are out-of-distribution samples). To address this issue, we incorporate pixel-level guidance that infuses human face landmarks during the image generation phase. To benchmark these metrics, we have built an evaluation dataset. On it, we verify that the face landmark detection rate is significantly increased from 57.0% to 92.5%, allowing automatic face animation based on the generated speech content. In the whole process, only text is needed to define both the static image and the driving signal.
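The detection rates quoted above (57.0% vs. 92.5%) are obtained by running a pre-trained face landmark detector over the generated portraits. The report does not state here which detector backs the released evaluation, so the following is only a minimal sketch of how such a rate can be computed, with dlib's 68-point predictor and the folder layout standing in as assumptions.

    # Minimal sketch: estimate the face-landmark detection rate over a folder of
    # generated portraits. The detector (dlib's 68-point predictor) and the folder
    # layout are illustrative assumptions, not the authors' exact evaluation setup.
    import glob
    import cv2
    import dlib

    detector = dlib.get_frontal_face_detector()
    predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

    def has_landmarks(path: str) -> bool:
        """Return True if at least one face with 68 landmarks is found."""
        img = cv2.imread(path)
        if img is None:
            return False
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        faces = detector(gray, 1)  # upsample once to catch smaller faces
        return any(predictor(gray, rect).num_parts == 68 for rect in faces)

    paths = sorted(glob.glob("generated_portraits/*.png"))
    rate = sum(has_landmarks(p) for p in paths) / max(len(paths), 1)
    print(f"landmark detection rate: {rate:.1%}")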

  1. ChatAnything Framework. We introduce a novel framework dedicated to the generation of LLM-enhanced personas exclusively from textual inputs. Predicated on user-specified keywords, our method synthesizes a portrait together with an associated personality and voice, facilitating meaningful user interaction with the resultant persona.
  2. Mixture of Diffusers and Mixture of Voices. We propose MoD and MoV, which automatically select the best-matching finetuned diffusion model and pre-defined TTS tone from the user's text description, enabling diverse appearance and voice customization.
  3. Generative-Model-to-Talking-Head Alignment. We introduce a zero-shot approach designed to harmonize the distribution between pre-trained generative models and pre-trained talking-head models. This alignment ensures the production of expressive facial movements based on the synthesized avatar portrait.
  4. Evaluation Dataset. We propose an evaluation dataset to quantify the alignment between the generative models and the talking-head models.

ChatAnything Architecture

The overall pipeline of ChatAnything. The ChatAnything framework includes four key components:

  1. a personality generation component. An LLM-based control module that initializes the personality of the persona from the user's text description. It also manages system operation and calls applications based on interactions with the user.
  2. a portrait generation component. A portrait initializer that generates the reference image for the persona. It includes a mixture of finetuned diffusion models (MoD) along with their LoRA modules (if applicable). Each model is specialized in generating a specific style of image, and the best-matching model is called automatically based on the user's text description of the persona via the LLM.
  3. a voice generation component. A mixture of text-to-speech modules (MoV) that converts the persona's text output into speech with customized tones. The selection is done automatically based on the user's text description via the LLM.
  4. a face-driving component. A motion generation module that takes in the speech signal and drives the generated image.
To further increase the customization freedom of the generated personas, we introduce two novel concepts: the mixture of diffusers (MoD) and the mixture of voices (MoV), with which the style of the appearance and the tone of the voice can be customized based on the user's text description. A structural sketch of how these components can be wired together follows the pipeline figure below.
Pipeline of ChatAnything
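As referenced above, the sketch below outlines how the four components could be wired together. The interfaces (DiffuserMixture, VoiceMixture, the llm and drive callables) are hypothetical stand-ins for modules whose exact APIs are not shown on this page; treat this as a structural outline rather than the released implementation.

    # Structural sketch of the ChatAnything pipeline. All interfaces below are
    # hypothetical stand-ins for the four components described above.
    from dataclasses import dataclass
    from typing import Callable, Protocol

    class DiffuserMixture(Protocol):          # MoD: mixture of finetuned diffusion models
        def generate_portrait(self, description: str) -> str: ...   # returns image path

    class VoiceMixture(Protocol):             # MoV: mixture of pre-defined TTS tones
        def select_voice(self, description: str) -> str: ...
        def synthesize(self, voice_id: str, text: str) -> str: ...  # returns audio path

    LLM = Callable[[str, str], str]           # (system_prompt, user_message) -> reply
    TalkingHead = Callable[[str, str], str]   # (portrait_path, audio_path) -> video path

    @dataclass
    class Persona:
        system_prompt: str   # personality produced by the LLM control module
        portrait_path: str   # reference image from the portrait initializer (MoD)
        voice_id: str        # TTS tone selected by MoV

    def create_persona(description: str, llm: LLM,
                       mod: DiffuserMixture, mov: VoiceMixture) -> Persona:
        # 1) Personality generation: the LLM turns the description into a system prompt.
        personality = llm("Write a system prompt defining this persona.", description)
        # 2) Portrait generation: MoD renders the reference image.
        portrait = mod.generate_portrait(description)
        # 3) Voice generation: MoV picks the best-matching pre-defined tone.
        voice = mov.select_voice(description)
        return Persona(personality, portrait, voice)

    def chat_turn(p: Persona, msg: str, llm: LLM,
                  mov: VoiceMixture, drive: TalkingHead) -> str:
        reply = llm(p.system_prompt, msg)           # persona answers in character
        audio = mov.synthesize(p.voice_id, reply)   # speech with the customized tone
        # 4) Face driving: animate the portrait with the speech; returns a video path.
        return drive(p.portrait_path, audio)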

ChatAnything: Face Landmark Control

Impact of landmark guidance during the diffusion process. As shown in the first row, directly applying SD 1.5 for portrait generation tends to produce abstract face images. Such images are rarely seen during the training of the talking-head models and thus cannot be used as input for facial expression generation. In contrast, after applying the techniques proposed in ChatAnything (including face landmark guidance, prompt engineering, and LoRA fine-tuning for aesthetic improvement), the model tends to generate more anthropomorphic images of high visual quality that can be used as input for pre-trained talking-head models.

Adopting pretrained facial landmark control and diffusion inversion on a human face template demonstrates a powerful use of finetuned derivatives of pretrained image generative models. This ensures a high-aesthetics starting point for the talking-head chat.
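The exact checkpoints and guidance settings are not listed on this page, so the snippet below is only a minimal sketch of injecting pixel-level facial-landmark guidance into SD 1.5 with the diffusers ControlNet pipeline; the ControlNet checkpoint path, prompt, and conditioning scale are placeholders, and the diffusion-inversion initialization mentioned above is omitted for brevity.

    # Minimal sketch: condition SD 1.5 on a rendered face-landmark map so the
    # generated persona keeps a detectable, human-like facial layout.
    # The landmark ControlNet checkpoint path is a placeholder assumption.
    import torch
    from PIL import Image
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

    controlnet = ControlNetModel.from_pretrained(
        "path/to/face-landmark-controlnet",          # placeholder checkpoint
        torch_dtype=torch.float16,
    )
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        controlnet=controlnet,
        torch_dtype=torch.float16,
    ).to("cuda")

    # A landmark map rendered from a human face template (e.g., 68 points drawn
    # on a black canvas) serves as the pixel-level guidance signal.
    landmark_map = Image.open("face_template_landmarks.png")

    portrait = pipe(
        prompt="an anthropomorphic teapot persona, portrait, best quality",
        image=landmark_map,
        num_inference_steps=30,
        controlnet_conditioning_scale=0.8,   # strength of the landmark guidance
    ).images[0]
    portrait.save("persona_portrait.png")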

Examples of Text-based Chat Personas

To prompt everything with a simple text input. Multi image generative models and multi TTS Voice are selected by the powerful Language Model. A pretrained generative model will be selected for the initial frame generation. And with the generated frame, a open-source animation module is used for rendering the video base on the audio output of another selected TTS Voice. To start with, here are a grid of faces for animation generated by the pipeline, showing an average expectation of the Imaginary talking face.
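This page does not show the prompt or LLM backend used for the selection step, so the snippet below is just one way it could be implemented with the OpenAI chat API; the model registry, voice list, model name, and prompt wording are all illustrative assumptions.

    # Illustrative sketch of LLM-driven selection: given the user's persona
    # description, ask a chat model to pick one diffusion model and one TTS voice
    # from small registries. Registry contents and the model name are assumptions.
    import json
    from openai import OpenAI

    DIFFUSION_MODELS = ["realistic-portrait", "anime-style", "3d-cartoon"]   # MoD candidates
    TTS_VOICES = ["warm-female", "deep-male", "childlike", "elderly"]        # MoV candidates

    client = OpenAI()

    def select_components(description: str) -> dict:
        system = (
            "You match a persona description to generation components. "
            f"Diffusion models: {DIFFUSION_MODELS}. TTS voices: {TTS_VOICES}. "
            'Reply with JSON: {"diffuser": ..., "voice": ...}.'
        )
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": system},
                {"role": "user", "content": description},
            ],
        )
        return json.loads(resp.choices[0].message.content)

    print(select_components("a cheerful old wizard made of porcelain"))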


Examples of Image-based Chat Personas

We support uploading your own face image as guidance to the generative model. Also try out some non-face images; there might be a surprise.
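One plausible way to use an uploaded photo as guidance is to extract its landmarks and render them as the condition image for the ControlNet sketch shown earlier; the detector choice and drawing style below are illustrative assumptions, not the released implementation.

    # Sketch: turn an uploaded face photo into a landmark map that can replace the
    # template condition image in the ControlNet sketch above.
    import cv2
    import dlib
    import numpy as np

    detector = dlib.get_frontal_face_detector()
    predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

    def landmark_map_from_upload(path: str, out_path: str = "uploaded_landmarks.png") -> str:
        img = cv2.imread(path)
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        faces = detector(gray, 1)
        if not faces:
            raise ValueError("no face found in the uploaded image")
        shape = predictor(gray, faces[0])
        canvas = np.zeros_like(img)  # black canvas, same size as the upload
        for i in range(shape.num_parts):
            p = shape.part(i)
            cv2.circle(canvas, (p.x, p.y), 2, (255, 255, 255), -1)
        cv2.imwrite(out_path, canvas)
        return out_path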


Examples of Age Adjustment

BibTeX


        @misc{zhao2023ChatAnything,
          title={ChatAnything: Facetime Chat with LLM-Enhanced Personas},
          author={Zhao, Yilin and Yuan, Xinbin and Gao, Shanghua and Lin, Zhijie and Hou, Qibin and Feng, Jiashi and Zhou, Daquan},
          publisher={arXiv:2311.06772},
          year={2023},
        }
  

The logo is an image generated with "Sichuan Opera with modern elements. Cyberpunk, ultra-modern, futurism, mechanical ascension. Mechanical style, super fantasy, dreamlike fairyland and cyberpunk. Artificial intelligence Sichuan Opera with an internationalist perspective."