Part 1 - Content Generation

Introduction

In this series of posts I'll walk you through building a fully fledged, fully automated, visual story generator including text, images, audio, video and background music.

Your code will go from an input sentence as short as "The Whistling Scarecrow" to the below video in ~1 min.

This is my favourite ever side project. I built it early this year when OpenAI APIs came out. The work will be fully in Python and the full code is published open source in this repo.

https://github.com/hatemfaheem/ai-story-generator

High Level Overview

The diagram below shows a high-level overview of what we will be building. As you can see on the right side, we produce lots of raw and processed results, but most importantly a video (like the one shown in the intro).

There are 5 main subproblems with different levels of complexity and different tools. If we can solve these problems independently, we can just pipe them together to create our beautiful video.

Content Generation & NLP: Our story needs text and images, which are the core content of the story. We will be using OpenAI for this. We will also need to process the text of the story (Natural Language Processing) for a couple of reasons: (a) to break it down into sentences/pages and (b) to produce keywords for SEO (we will not dive deep into SEO, but I'll show you how to produce keywords for things like hashtags).
Text to Speech (Audio): A nice video story is not complete without a narrator. And guess what, we will generate this too.
Image Processing: Once we have images and text, we will need to combine these into nice looking visual pages. This is a super cool subsystem written using Pillow (a popular image processing Python library). And yes, this will include the page wrinkling effect.
Video Processing: Once we have nice looking pages and narrator audio, we compile a full video.
PDF Processing: Similar to video, but this time we compile a PDF, like a printable version of the story.

The same diagram above can also be viewed as a pipeline (data flow diagram).
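
To make the pipeline view concrete, here is a minimal sketch of how independent stages could be piped together. Every name in it (run_pipeline, the stage functions and the dictionary keys) is a hypothetical placeholder rather than code from the repo, and each stage only records a stub value instead of doing real work.

```python
from typing import Any, Callable, Dict, List

# A stage receives the artifacts produced so far and returns them plus its own.
Artifacts = Dict[str, Any]
Stage = Callable[[Artifacts], Artifacts]


def generate_content(artifacts: Artifacts) -> Artifacts:
    # Placeholder for story text, sentence splitting, keywords and images.
    return {**artifacts, "sentences": ["Once upon a time."], "images": ["image_000.png"]}


def generate_audio(artifacts: Artifacts) -> Artifacts:
    # Placeholder for text to speech: one audio file per sentence.
    audio = [f"audio_{i:03d}.mp3" for i in range(len(artifacts["sentences"]))]
    return {**artifacts, "audio": audio}


def render_pages(artifacts: Artifacts) -> Artifacts:
    # Placeholder for image processing: one rendered page per image.
    pages = [f"page_{i:03d}.png" for i in range(len(artifacts["images"]))]
    return {**artifacts, "pages": pages}


def compile_video(artifacts: Artifacts) -> Artifacts:
    # Placeholder for combining pages and narration into a video.
    return {**artifacts, "video": "story.mp4"}


def compile_pdf(artifacts: Artifacts) -> Artifacts:
    # Placeholder for combining pages into a printable PDF.
    return {**artifacts, "pdf": "story.pdf"}


def run_pipeline(title: str, stages: List[Stage]) -> Artifacts:
    """Feed the output of each stage into the next one."""
    artifacts: Artifacts = {"title": title}
    for stage in stages:
        artifacts = stage(artifacts)
    return artifacts


result = run_pipeline(
    "The Whistling Scarecrow",
    [generate_content, generate_audio, render_pages, compile_video, compile_pdf],
)
print(sorted(result))
```

Because each stage only depends on the artifacts it reads, any one of them can be developed and tested on its own before being plugged into the chain.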

Content Generation & NLP

Story Text

Let's jump in straight away. Given a simple sentence, i.e. the story title, we want to generate a story. In this case, we just need the story text. Thanks to OpenAI APIs, we can use the text-davinci-003 model to obtain this with a few lines of code.

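import openai  # legacy (pre-1.0) OpenAI Python SDK interface

# prompt: str = "The Whistling Scarecrow"
# This snippet runs inside a class method; self._MAX_TOKENS is an attribute of that class.
story_content = openai.Completion.create(
    model="text-davinci-003",
    prompt="Give me a story about " + prompt,
    max_tokens=self._MAX_TOKENS,
    temperature=0,  # lowest randomness, so output is as repeatable as possible
)
# The generated story is the text of the first (and only) completion choice
story_raw_text = story_content["choices"][0]["text"]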

As you can see, I had to prepend "Give me a story about " to the title prompt to instruct OpenAI to give me a story about The Whistling Scarecrow. Unsurprisingly, it's very good at generating such stories (try it out on ChatGPT if you have access). You may be used to this level of AI now, but the quality of the stories was super impressive when I was writing this code in December 2022.

Next, we need to split this text into sentences, i.e. story pages. You might think of something as simple as this:

story_raw_text.split(".")

This will work for lots of stories, but it's not reliable. Consider the following story.

In the small village of Elmridge, people told tales of the Whistling Scarecrow. Not as a mere bedtime story, but as a local legend that had seen generations.

Splitting on '.' will produce the following list (ignoring leading whitespace and the empty string left after the final period).

[
  "In the small village of Elmridge, people told tales of the Whistling Scarecrow",
  "Not as a mere bedtime story, but as a local legend that had seen generations"
]

That is actually correct, but here are the next few sentences in the same story:

The scarecrow stood in the middle of Mr. Whitaker's cornfield, lanky and faded from years under the sun and rain. Its clothes were tattered, its straw body peeking out from holes and tears, yet it stood proud, guarding the field as though it were its own.

As you can see, this solution breaks at "Mr.": it splits "Mr" and "Whitaker's" into two different sentences, although it shouldn't. How can we fix this? We use a smarter sentence tokenizer. Thanks to NLTK, we can do this in one line:

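import nltk

nltk.download("punkt")  # one-time tokenizer data download; newer NLTK releases may need "punkt_tab"
story_sentences = nltk.sent_tokenize(story_raw_text)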
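
To see the "Mr." failure concretely without installing anything, here is a small standard-library sketch. naive_split mirrors the split on '.' idea, while split_sentences is a toy splitter with a hand-written abbreviation list. Both function names and the ABBREVIATIONS set are assumptions made for illustration; this is not how NLTK's tokenizer works internally, just a way to show why abbreviations need special handling.

```python
import re
from typing import List

# Shortened version of the story excerpt, two real sentences.
text = (
    "The scarecrow stood in the middle of Mr. Whitaker's cornfield. "
    "Its clothes were tattered."
)

# Toy list; a real tokenizer covers far more cases than this.
ABBREVIATIONS = {"Mr", "Mrs", "Ms", "Dr"}


def naive_split(story: str) -> List[str]:
    """Split on every period and drop empty pieces."""
    return [part.strip() for part in story.split(".") if part.strip()]


def split_sentences(story: str) -> List[str]:
    """Split after ., ! or ? unless the preceding word is a known abbreviation."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]\s+", story):
        words = story[start:match.start()].split()
        if words and words[-1] in ABBREVIATIONS:
            continue  # "Mr." and friends do not end a sentence
        sentences.append(story[start:match.start() + 1].strip())
        start = match.end()
    tail = story[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


naive = naive_split(text)
smart = split_sentences(text)
print(len(naive), naive)  # 3 pieces: the first sentence is cut after "Mr"
print(len(smart), smart)  # 2 sentences, "Mr. Whitaker's" stays together
```

A hand-maintained abbreviation list gets out of date quickly, which is a good reason to rely on a library tokenizer instead.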

Keywords

Now that we have the story text, and since we are already talking about text processing, let's also generate a bunch of keywords that are representative of the story content.

Why do we need keywords?

These could be used as hashtags if you're publishing the story to social media.

How do we automatically generate high quality keywords?

The answer is KeyBERT. KeyBERT is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to create keywords and keyphrases that are most similar to a document.

from keybert import KeyBERT

keybert_model = KeyBERT()
# Returns a list of (keyword, similarity score) tuples
keywords = keybert_model.extract_keywords(story_raw_text)

For the Whistling Scarecrow story in this video, that generated the following set of keywords, which is similar to human tagging abilities if you think about it.

scarecrow, farmer, whistling, whistle, crops
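
Since hashtags are the stated use case, here is a minimal sketch of turning keyword results into a hashtag string. It assumes the input is a list of (keyword, score) tuples, which is the shape KeyBERT's extract_keywords returns. The to_hashtags helper is a hypothetical name, and the scores in the example are made up; only the keywords come from the story above.

```python
from typing import List, Tuple


def to_hashtags(keywords: List[Tuple[str, float]]) -> str:
    """Turn (keyword, score) pairs into a space-separated hashtag string."""
    tags: List[str] = []
    for keyword, _score in keywords:
        # Keep only letters and digits so multi-word keyphrases
        # collapse into a single tag.
        tag = "#" + "".join(ch for ch in keyword if ch.isalnum())
        if tag != "#" and tag not in tags:
            tags.append(tag)
    return " ".join(tags)


# Illustrative scores only; they are not real model output.
example_keywords = [
    ("scarecrow", 0.5),
    ("farmer", 0.4),
    ("whistling", 0.4),
    ("whistle", 0.3),
    ("crops", 0.3),
]
hashtags = to_hashtags(example_keywords)
print(hashtags)  # #scarecrow #farmer #whistling #whistle #crops
```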

Story Images

Now that we have generated and processed the story text, let's jump into generating images for each sentence. For this we will use DALL·E 2 from OpenAI. It may not be the best image generation model on the market, but it has an API that allows us to automate this process.

# prompt -> story sentence
def generate_image(prompt: str) -> str:
    response = openai.Image.create(
        prompt=prompt, n=1, size="1024x1024"
    )
    return response["data"][0]["url"]

Given the image url, we can download the actual image by doing something like this:

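import os
from io import BytesIO
from typing import Tuple

import requests
from PIL import Image

def download_image(
    workdir: str, url: str, image_number: str
) -> Tuple[Image.Image, str]:
    response = requests.get(url)
    response.raise_for_status()  # fail early on an HTTP error instead of parsing an error page
    img = Image.open(BytesIO(response.content))
    filepath = os.path.join(workdir, f"image_{image_number}.png")
    img.save(filepath)
    return img, filepath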

We first make a GET request to fetch the URL content, then open the image using the PIL library (Pillow; we will talk a lot about this in the image processing part of this series). We then save the image to a local directory.

Story Content Generator

Now let's bring it all together. The story content generation algorithm is simple:

Generate and process story text for the given prompt/title.
Generate and download images for each sentence in the story.
Construct StoryContent object that contains all story content/details.

def generate_new_story(
    self, workdir_images: str, story_seed_prompt: str, story_size: StorySize
) -> StoryContent:
    """Generate a new story for the given prompt

    Args:
        workdir_images: The workdir where images should be stored
        story_seed_prompt: The title/seed of the story
        story_size: Story size configuration

    Returns: The contents of the newly generated story
    """
    story_text = self.text_generator.generate_story_text(story_seed_prompt)
    raw_text = story_text.raw_text
    processed_sentences = story_text.processed_sentences
    page_contents = []

    for i in range(len(processed_sentences)):
        image_prompt = (
            f"A painting for '{processed_sentences[i]}'. "
            f"{story_seed_prompt}."
        )
        url = self.image_generator.generate_image(
            prompt=image_prompt, story_size=story_size
        )
        image_number: str = str(i).zfill(3)
        image, image_path = self.image_generator.download_image(
            workdir=workdir_images,
            url=url,
            image_number=image_number,
        )
        story_page_content = StoryPageContent(
            sentence=processed_sentences[i],
            image=image,
            image_path=image_path,
            page_number=image_number,
        )
        page_contents.append(story_page_content)

    return StoryContent(
        story_seed=story_seed_prompt,
        raw_text=raw_text,
        page_contents=page_contents,
        story_size=story_size,
    )
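
One small detail in the loop above is the zero-padded page number produced by str(i).zfill(3). The sketch below shows why that padding is handy: padded file names sort in page order, while unpadded ones do not. The image_filename helper and the "story" directory are hypothetical and only mirror the naming pattern used in download_image.

```python
import os


def image_filename(workdir: str, index: int) -> str:
    """Build the image path using the same zero-padded pattern as above."""
    image_number = str(index).zfill(3)
    return os.path.join(workdir, f"image_{image_number}.png")


names = [os.path.basename(image_filename("story", i)) for i in (0, 7, 42)]
print(names)  # ['image_000.png', 'image_007.png', 'image_042.png']

# Plain string sorting puts page 10 before page 2 without padding.
unpadded = sorted(f"image_{i}.png" for i in (2, 10))
padded = sorted(f"image_{str(i).zfill(3)}.png" for i in (2, 10))
print(unpadded)  # ['image_10.png', 'image_2.png']
print(padded)    # ['image_002.png', 'image_010.png']
```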

In the next part of the series we will talk about how to represent the story generation problem as a set of data structures, including StoryContent and StoryPageContent shown in the previous section.
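
Until then, here is a rough sketch of what these two containers might look like, inferred only from the constructor calls in generate_new_story. The field names come from that code, but the use of dataclasses and the field types are assumptions. The image and story_size fields are typed as Any to keep the sketch dependency-free, and the real definitions in the repo may differ.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class StoryPageContent:
    sentence: str
    image: Any  # a PIL image in generate_new_story; Any avoids the dependency here
    image_path: str
    page_number: str  # zero-padded string such as "000"


@dataclass
class StoryContent:
    story_seed: str
    raw_text: str
    page_contents: List[StoryPageContent]
    story_size: Any  # StorySize is not shown in this part, so this is a placeholder


# Placeholder values for illustration only.
page = StoryPageContent(
    sentence="In the small village of Elmridge, people told tales of the Whistling Scarecrow.",
    image=None,
    image_path="image_000.png",
    page_number="000",
)
story = StoryContent(
    story_seed="The Whistling Scarecrow",
    raw_text=page.sentence,
    page_contents=[page],
    story_size=None,
)
print(len(story.page_contents), story.page_contents[0].page_number)
```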

Hatem Elseidy