TheaterGen: Character Management with LLM for Consistent Multi-turn Image Generation

Junhao Cheng1, Baiqiao Yin1, Kaixin Cai1, Minbin Huang2, Hanhui Li1, Yuxin He1, Xi Lu1, Yue Li1, Yifei Li1, Yuhao Cheng3, Yiqiang Yan3, Xiaodan Liang1
1. Shenzhen Campus of Sun Yat-sen University
2. The Chinese University of Hong Kong
3. Lenovo Research

TheaterGen can interact with users to consistently generate images over multiple turns.

Abstract

Recent advances in diffusion models make it possible to generate high-quality, visually striking images from text. However, multi-turn image generation, which is in high demand in real-world scenarios, still struggles to maintain semantic consistency between images and texts, as well as contextual consistency of the same subject across multiple interactive turns. To address this issue, we introduce TheaterGen, a training-free framework that integrates large language models (LLMs) and text-to-image (T2I) models to enable multi-turn image generation. Within this framework, an LLM acting as a "Screenwriter" engages in multi-turn interaction, generating and managing a standardized prompt book that contains the prompts and layout designs for each character in the target image. Based on the character prompts and layouts, we generate a list of character images and extract guidance information from them, akin to a "Rehearsal". Subsequently, by incorporating the prompt book and guidance information into the reverse denoising process of T2I diffusion models, we generate the final image, akin to the "Final Performance". With effective management of prompt books and character images, TheaterGen significantly improves semantic and contextual consistency in synthesized images. Furthermore, we introduce a dedicated benchmark, CMIGBench (Consistent Multi-turn Image Generation Benchmark), with 8000 multi-turn instructions. Unlike previous multi-turn benchmarks, CMIGBench does not define characters in advance and is therefore highly diverse. It includes both story generation and multi-turn editing tasks for comprehensive evaluation. Extensive experimental results show that TheaterGen significantly outperforms state-of-the-art methods. For example, it raises the performance bar of the cutting-edge Mini DALL·E 3 model by 21% in average character-character similarity and 19% in average text-image similarity.
Our code and CMIGBench can be found in the supplementary materials.

Framework
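
As a rough illustration of the prompt book described in the abstract, the sketch below keeps one entry per character and updates it turn by turn, so that a character keeps its identity when its description or layout changes. The names `CharacterEntry`, `PromptBook`, and `update` are hypothetical and are not taken from the TheaterGen code; the `(prompt, box, char_id)` triple mirrors the object format of the benchmark sample, and the meaning of the four box numbers is left open.

```python
from dataclasses import dataclass, field


@dataclass
class CharacterEntry:
    """One character in the prompt book (hypothetical structure)."""
    char_id: int
    prompt: str   # latest text description of the character
    box: tuple    # four layout numbers; their convention is not assumed here


@dataclass
class PromptBook:
    """Hypothetical prompt book that persists characters across turns."""
    background: str = ""
    characters: dict = field(default_factory=dict)  # char_id -> CharacterEntry

    def update(self, background, objects):
        """Apply one turn and return the characters active in that turn.

        `objects` is a list of (prompt, box, char_id) triples, as an LLM
        "Screenwriter" might emit them for the current instruction.
        """
        self.background = background
        for prompt, box, char_id in objects:
            # Reusing char_id overwrites prompt and layout but keeps identity.
            self.characters[char_id] = CharacterEntry(char_id, prompt, tuple(box))
        return [self.characters[char_id] for _, _, char_id in objects]


book = PromptBook()
book.update("A silent library", [("a tiny sparrow", [115.5, 170.5, 89, 59], 1)])
active = book.update(
    "A silent library",
    [("an attentive lion", [300.5, 221.0, 162, 180], 3),
     ("a tiny sparrow", [40.5, 101.0, 89, 59], 1)],
)
for entry in active:
    print(entry.char_id, entry.prompt, entry.box)
```

In this sketch the sparrow keeps ID 1 across both turns while its layout changes. A real system would use that persistent ID to look up the character image generated in the "Rehearsal" stage.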

Benchmark


        {
          "dialogue 1": {
            "characters": ["sparrow", "lion", "eagle"],
            "scene": ["library"],
            "turn 1": {
              "caption": "In the silent library, a tiny sparrow was fluttering near a shelf.",
              "objects": [
                ["a tiny sparrow", [115.5, 170.5, 89, 59], 1],
                ["a library shelf", [215.5, 165.5, 171, 171], 2]
              ],
              "background": "A silent library",
              "negative": "None"
            },
            "turn 2": {
              "caption": "An attentive lion in one corner was carefully observing the bird and holding its breath.",
              "objects": [
                ["an attentive lion", [300.5, 221.0, 162, 180], 3],
                ["a tiny sparrow", [40.5, 101.0, 89, 59], 1]
              ],
              "background": "A silent library",
              "negative": "None"
            },
            "turn 3": {
              "caption": "Above them, a vigilant eagle watched the suspenseful scene unfold from the library ceiling.",
              "objects": [
                ["a vigilant eagle", [345.5, 41.0, 119, 72], 4],
                ["an observing lion", [295.5, 281.0, 162, 180], 3],
                ["a sparrow", [45.5, 171.0, 89, 59], 1]
              ],
              "background": "A silent library",
              "negative": "None"
            },
            "turn 4": {
              "caption": "The scenario ended peacefully as the eagle, the lion, and the sparrow all resumed their own activities in the vast library.",
              "objects": [
                ["an occupying eagle", [335.5, 41.0, 119, 72], 4],
                ["a peaceful lion", [285.5, 281.0, 162, 180], 3],
                ["a sparrow", [55.5, 181.0, 89, 59], 1]
              ],
              "background": "A vast library",
              "negative": "None"
            }
          }
        }

We propose the Consistent Multi-turn Image Generation Benchmark (CMIGBench). It focuses on two common tasks in multi-turn image generation, story generation and multi-turn editing, and comprises 8000 multi-turn scripted dialogues (4000 per task), each consisting of 4 turns of natural-language instructions. This allows us to evaluate the semantic consistency and contextual consistency of different models in multi-turn image generation in a zero-shot manner.
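
A dialogue in the format of the sample entry can be walked with a few lines of Python. The sketch below is a minimal example, assuming the data is valid JSON with a single top-level object keyed by dialogue name. It also assumes that the integer ending each object triple is a per-character ID that stays fixed across turns, which is inferred from the sample and not stated on this page. The helper name `track_characters` is hypothetical, and the embedded sample is shortened to two turns.

```python
import json
from collections import defaultdict

# Two-turn excerpt of the sample dialogue, embedded so the sketch is self-contained.
SAMPLE = """
{
  "dialogue 1": {
    "characters": ["sparrow", "lion"],
    "scene": ["library"],
    "turn 1": {
      "caption": "In the silent library, a tiny sparrow was fluttering near a shelf.",
      "objects": [
        ["a tiny sparrow", [115.5, 170.5, 89, 59], 1],
        ["a library shelf", [215.5, 165.5, 171, 171], 2]
      ],
      "background": "A silent library",
      "negative": "None"
    },
    "turn 2": {
      "caption": "An attentive lion in one corner was carefully observing the bird and holding its breath.",
      "objects": [
        ["an attentive lion", [300.5, 221.0, 162, 180], 3],
        ["a tiny sparrow", [40.5, 101.0, 89, 59], 1]
      ],
      "background": "A silent library",
      "negative": "None"
    }
  }
}
"""


def track_characters(dialogue):
    """Map each object ID to the (turn, description, box) tuples where it appears."""
    appearances = defaultdict(list)
    # Sort "turn N" keys numerically so that "turn 10" would follow "turn 9".
    turn_keys = sorted(
        (key for key in dialogue if key.startswith("turn ")),
        key=lambda key: int(key.split()[1]),
    )
    for key in turn_keys:
        for description, box, obj_id in dialogue[key]["objects"]:
            appearances[obj_id].append((key, description, box))
    return dict(appearances)


data = json.loads(SAMPLE)
tracks = track_characters(data["dialogue 1"])
for obj_id, seen in sorted(tracks.items()):
    print(obj_id, [(turn, description) for turn, description, _ in seen])
```

Grouping by ID instead of by description matters because the description of a character can change between turns (for example "a tiny sparrow" and later "a sparrow") while the ID stays the same.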

Story generation

Multi-turn editing