TheaterGen

TheaterGen: Character Management with LLM for Consistent Multi-turn Image Generation

Junhao Cheng¹, Baiqiao Yin¹, Kaixin Cai¹, Minbin Huang², Hanhui Li¹, Yuxin He¹, Xi Lu¹, Yue Li¹, Yifei Li¹, Yuhao Cheng³, Yiqiang Yan³, Xiaodan Liang¹

1. Shenzhen Campus of Sun Yat-sen University
2. The Chinese University of Hong Kong
3. Lenovo Research

Abstract

Recent advances in diffusion models make it possible to generate high-quality, visually striking images from text. However, multi-turn image generation, which is in high demand in real-world scenarios, still faces challenges in maintaining semantic consistency between images and texts, as well as contextual consistency of the same subject across multiple interactive turns. To address this issue, we introduce TheaterGen, a training-free framework that integrates large language models (LLMs) and text-to-image (T2I) models to enable multi-turn image generation. Within this framework, LLMs, acting as a "Screenwriter", engage in multi-turn interaction, generating and managing a standardized prompt book that contains the prompts and layout designs for each character in the target image. Based on the character prompts and layouts, we generate a list of character images and extract guidance information from them, a step akin to a "Rehearsal". Subsequently, by incorporating the prompt book and guidance information into the reverse denoising process of T2I diffusion models, we generate the final image, which constitutes the "Final Performance". With the effective management of prompt books and character images, TheaterGen significantly improves semantic and contextual consistency in synthesized images. Furthermore, we introduce a dedicated benchmark, CMIGBench (Consistent Multi-turn Image Generation Benchmark), with 8000 multi-turn instructions. Unlike previous multi-turn benchmarks, CMIGBench does not define characters in advance, and hence it is highly diverse. Both story generation and multi-turn editing tasks are included in CMIGBench for comprehensive evaluation. Extensive experimental results show that TheaterGen outperforms state-of-the-art methods significantly. For example, it raises the performance bar of the cutting-edge Mini DALL·E 3 model by 21% in average character-character similarity and 19% in average text-image similarity.
Our code and CMIGBench can be found in the supplementary materials.

Framework

The overall structure of TheaterGen. TheaterGen uses three key components to generate an image in each interaction turn: (a) an LLM-based character designer, the "screenwriter", which interacts with the user and maintains a structured prompt book of all character prompts and layouts; (b) a character image manager for the "rehearsal", which generates character images and extracts guidance based on the prompt book; (c) a character-guided generator that conducts the "final performance", i.e., generates the final image for the current turn by combining the prompt book and the guidance information.
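The division of labor between the prompt book and the character image manager can be illustrated with a minimal Python sketch. Everything here is hypothetical and not taken from the TheaterGen code: the class names `Character` and `PromptBook`, the `rehearse` helper, the stub `fake_generate`, and in particular the assumption that a character image is generated once per ID and reused in later turns. The object tuples follow the shape of the benchmark entries (description, box, ID).

```python
from dataclasses import dataclass, field


@dataclass
class Character:
    char_id: int
    prompt: str
    box: tuple  # assumed to be (x, y, width, height)


@dataclass
class PromptBook:
    """Hypothetical prompt book: one entry per character ID, updated every turn."""
    background: str = ""
    negative: str = ""
    characters: dict = field(default_factory=dict)

    def update(self, background, negative, objects):
        """Apply one turn of designer output; return the IDs present in this turn."""
        self.background = background
        self.negative = negative
        present = set()
        for prompt, box, char_id in objects:
            self.characters[char_id] = Character(char_id, prompt, tuple(box))
            present.add(char_id)
        return present


def rehearse(book, present, cache, generate):
    """Return a reference image per present character, generating only unseen IDs."""
    for char_id in present:
        if char_id not in cache:
            cache[char_id] = generate(book.characters[char_id].prompt)
    return {char_id: cache[char_id] for char_id in present}


# Stand-in for a T2I call; it only records which prompts were requested.
calls = []

def fake_generate(prompt):
    calls.append(prompt)
    return f"<image of {prompt}>"


book, cache = PromptBook(), {}

present = book.update("A silent library", "None",
                      [["a tiny sparrow", [115.5, 170.5, 89, 59], 1]])
rehearse(book, present, cache, fake_generate)

present = book.update("A silent library", "None",
                      [["an attentive lion", [300.5, 221.0, 162, 180], 3],
                       ["a tiny sparrow", [40.5, 101.0, 89, 59], 1]])
refs = rehearse(book, present, cache, fake_generate)
```

In this sketch the sparrow keeps its first-turn reference image while its layout box is updated, which is one simple way to keep a character's appearance stable across turns.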

The proposed guidance extractor. It first extracts the subjects from the character images and arranges them in a single image according to the layout. The lineart guidance and the latent guidance for subsequent image generation are then obtained via a lineart processor and the forward diffusion process, respectively.
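Two of the steps in this caption, arranging subjects by layout and noising the result with the forward diffusion process, can be sketched on toy data. The sketch below is an illustration under stated assumptions, not the authors' implementation: images are plain 2D lists of floats, subject extraction and the lineart processor are omitted, the function names `compose` and `forward_diffuse` are hypothetical, and the noising uses the standard DDPM closed form x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε, which may differ from the schedule used in practice.

```python
import math
import random


def compose(canvas_size, cutouts):
    """Paste each cutout (2D list of floats) at its (x, y) top-left position."""
    width, height = canvas_size
    canvas = [[0.0] * width for _ in range(height)]
    for pixels, (x, y) in cutouts:
        for dy, row in enumerate(pixels):
            for dx, value in enumerate(row):
                # Ignore pixels that fall outside the canvas.
                if 0 <= y + dy < height and 0 <= x + dx < width:
                    canvas[y + dy][x + dx] = value
    return canvas


def forward_diffuse(image, alpha_bar_t, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    return [[signal * v + noise * rng.gauss(0.0, 1.0) for v in row]
            for row in image]


# Toy "subjects": a 2x1 patch and a 1x2 patch placed on a 4x3 canvas.
sparrow = [[1.0, 1.0]]
lion = [[0.5], [0.5]]
canvas = compose((4, 3), [(sparrow, (0, 0)), (lion, (3, 1))])

# Partially noised copy of the composed image, seeded for repeatability.
latent = forward_diffuse(canvas, alpha_bar_t=0.9, rng=random.Random(0))
```

A value of `alpha_bar_t` close to 1 keeps most of the composed layout, while a value close to 0 yields almost pure noise, so this parameter controls how strongly the arranged subjects constrain the later denoising.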

Benchmark


        {
          "dialogue 1": {
            "characters": ["sparrow", "lion", "eagle"],
            "scene": ["library"],
            "turn 1": {
              "caption": "In the silent library, a tiny sparrow was fluttering near a shelf.",
              "objects": [
                ["a tiny sparrow", [115.5, 170.5, 89, 59], 1],
                ["a library shelf", [215.5, 165.5, 171, 171], 2]
              ],
              "background": "A silent library",
              "negative": "None"
            },
            "turn 2": {
              "caption": "An attentive lion in one corner was carefully observing the bird and holding its breath.",
              "objects": [
                ["an attentive lion", [300.5, 221.0, 162, 180], 3],
                ["a tiny sparrow", [40.5, 101.0, 89, 59], 1]
              ],
              "background": "A silent library",
              "negative": "None"
            },
            "turn 3": {
              "caption": "Above them, a vigilant eagle watched the suspenseful scene unfold from the library ceiling.",
              "objects": [
                ["a vigilant eagle", [345.5, 41.0, 119, 72], 4],
                ["an observing lion", [295.5, 281.0, 162, 180], 3],
                ["a sparrow", [45.5, 171.0, 89, 59], 1]
              ],
              "background": "A silent library",
              "negative": "None"
            },
            "turn 4": {
              "caption": "The scenario ended peacefully as the eagle, the lion, and the sparrow all resumed their own activities in the vast library.",
              "objects": [
                ["an occupying eagle", [335.5, 41.0, 119, 72], 4],
                ["a peaceful lion", [285.5, 281.0, 162, 180], 3],
                ["a sparrow", [55.5, 181.0, 89, 59], 1]
              ],
              "background": "A vast library",
              "negative": "None"
            }
          }
        }
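
In the example above, each object appears to be a triple of a description, a four-number box, and an integer that stays the same for a given character across turns (1 for the sparrow, 3 for the lion, 4 for the eagle) even when the description changes. The box looks like [x, y, width, height], since the last two values stay fixed per character, but the source does not state this, so treat it as an assumption. The Python sketch below reads a trimmed two-turn excerpt and groups objects by that integer; the helper names `turns` and `track_characters` are hypothetical.

```python
import json

# Trimmed two-turn excerpt in the same shape as the example above.
SAMPLE = """
{
  "dialogue 1": {
    "characters": ["sparrow", "lion"],
    "scene": ["library"],
    "turn 1": {
      "caption": "In the silent library, a tiny sparrow was fluttering near a shelf.",
      "objects": [
        ["a tiny sparrow", [115.5, 170.5, 89, 59], 1],
        ["a library shelf", [215.5, 165.5, 171, 171], 2]
      ],
      "background": "A silent library",
      "negative": "None"
    },
    "turn 2": {
      "caption": "An attentive lion in one corner was carefully observing the bird and holding its breath.",
      "objects": [
        ["an attentive lion", [300.5, 221.0, 162, 180], 3],
        ["a tiny sparrow", [40.5, 101.0, 89, 59], 1]
      ],
      "background": "A silent library",
      "negative": "None"
    }
  }
}
"""


def turns(dialogue):
    """Return (turn_number, turn_dict) pairs in numeric order."""
    numbered = [(int(key.split()[1]), value)
                for key, value in dialogue.items()
                if key.startswith("turn ")]
    return sorted(numbered, key=lambda pair: pair[0])


def track_characters(dialogue):
    """Map each object ID to its (turn, description, box) history."""
    history = {}
    for number, turn in turns(dialogue):
        for description, box, obj_id in turn["objects"]:
            history.setdefault(obj_id, []).append((number, description, box))
    return history


history = track_characters(json.loads(SAMPLE)["dialogue 1"])
for obj_id, appearances in sorted(history.items()):
    print(obj_id, [(n, d) for n, d, _ in appearances])
```

Grouping by the integer rather than by the description is what makes it possible to follow one character through a dialogue when its wording varies from turn to turn.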