deepfake-lip-sync
A Generative Adversarial Network that deepfakes a person's lips to match a given audio source. The version shown at CUCAI 2023 is saved in the legacy branch.
Current Progress
File structure
Make sure you process the dataset to get this file structure:
deepfake-lip-sync
|---dataset
| |---test
| | |---real
| | |---fake
| |---train
| | |---real
| | |---fake
| |---valid
| | |---real
| | |---fake
Within each of the test, train and valid folders, place the 320x320 images into their respective real and fake subfolders.
The code expects this structure to be available in the project root.
Saved models
Contact the team for the trained models. I am too broke for LFS.
So hi, I'm back. It has been a while, and I suppose a lot of things have happened. I actually wrote a draft for Lifelike Devblog #3 but never really got down to finishing it. I guess this is a pretty massive case of writer's block. But the most important thing is: I'm back.
What has happened in the meantime? I basically just finished my final undergraduate exam ever. Last summer, when I was making Lifelike, it felt like any other summer: opportunities, sunlight, beaches, vacations, dreading the future, and for some of us, maybe some side hustles. Now, the whole thing has a sense of finality to it. For the first time in my life, I don't have a plan for September. I don't have a job offer lined up because the market is in the gutter. Maybe I should give academia another shot, but it can be pretty unfair to international students.
The Project
Either way, there is a silver lining to all this: time to dig into my project backlog. Also, I feel it is worth mentioning that this is not a one-devblog project; this post simply looks at what I've done so far, and there will be a part 2. First on the menu is deepfakes. Ah yes, how immoral of me. But before you judge: it is not meant to be good, I am simply challenging myself and getting something fun done in the process. Still, to justify myself, I will introduce you to a great game that I really enjoy: Ghost of Tsushima (if you love this, you will probably also love the new Shogun show).
As much as I love this game, especially the Japanese dub, it was clearly made to be played in English: the lip movement syncs to the English voices, not the Japanese ones. This absolutely took me out of the experience (and it is part of why I think anime is better subbed). Spoiler alert for the ending in the video below.
Naturally, I started thinking about what could be done about it. So in the summer of 2022, I decided that the team I was going to get through this little club at college called QMIND (Queen's AI club, essentially) would be working on a deepfake lip-syncing AI. What I turned in for the biggest undergraduate conference in AI (CUCAI 2023) was less than desirable. But it worked, and it attracted a lot of attention (I have Mustafa's mutilated deepfaked persona to thank for that). Don't get me wrong, the team was the best I could have asked for, but not much could have kept up with my ambitions for the project. Here's what we're working with:
I regret writing this at 3 in the morning. Needless to say, I think Mustafa looks better than this even on his worst day, so I'm making it my mission to touch him up a little. Before I address what went wrong (which is a lot), I'll do a walkthrough of the current state:
Current State
First things first, we're working with a neural network, specifically of the Generative Adversarial variant. Think of it as a rivalry between 2 different models, the Generator and the Discriminator. The Generator does exactly as advertised, while the Discriminator tries to figure out if the Generator's output is real or fake (or simply achieves some other objective). I like to think of the Discriminator as a trainable objective (or loss) function.
Within the code, the Generator is essentially an autoencoder stack that takes in 2 inputs: the spectrogram of the audio (sounds crazy, but yes, you can use computer vision techniques to process audio) and a stacked image (I'll explain it, and why I won't do it again, soon enough) made up of the frame we are trying to lip-sync together with a random reference image; it then spits out the lip-synced frame via Convolutional Transpose layers. The model I ran with was built around this architecture.
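To give a rough idea of the shape of the thing in code, here is a minimal Keras sketch. The layer counts, filter sizes and spectrogram shape are placeholders I made up for illustration, not the exact model I trained.
from tensorflow import keras
from tensorflow.keras import layers

# Sketch only: layer counts, filter sizes and the spectrogram shape are placeholders
def build_generator_sketch(img_size=256, spec_shape=(80, 16, 1)):
    # The stacked input: masked target frame + reference frame, 3 + 3 = 6 channels
    face_in = keras.Input(shape=(img_size, img_size, 6), name="stacked_faces")
    audio_in = keras.Input(shape=spec_shape, name="spectrogram")

    # Face encoder: downsample the stacked faces into a compact embedding
    x = face_in
    for filters in (32, 64, 128, 256):
        x = layers.Conv2D(filters, 3, strides=2, padding="same", activation="relu")(x)
    face_emb = layers.GlobalAveragePooling2D()(x)

    # Audio encoder: treat the spectrogram like a small image
    a = audio_in
    for filters in (32, 64, 128):
        a = layers.Conv2D(filters, 3, strides=2, padding="same", activation="relu")(a)
    audio_emb = layers.GlobalAveragePooling2D()(a)

    # Fuse both embeddings, then decode back to an image with Conv2DTranspose layers
    z = layers.Concatenate()([face_emb, audio_emb])
    z = layers.Dense(8 * 8 * 256, activation="relu")(z)
    z = layers.Reshape((8, 8, 256))(z)
    for filters in (256, 128, 64, 32, 16):
        z = layers.Conv2DTranspose(filters, 3, strides=2, padding="same", activation="relu")(z)
    out = layers.Conv2D(3, 3, padding="same", activation="sigmoid")(z)  # the lip-synced frame

    return keras.Model([face_in, audio_in], out, name="generator_sketch")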
The Discriminator is pretty similar: it takes in 2 inputs, an image and the spectrogram, and we essentially train it to recognize how well the image fits the piece of audio. To do this, I simply calculate the L2 distance between the embeddings of the two inputs. I ran with a model built along those lines.
For training, we simply let the Generator generate something (it will start out as random noise). We then compute 2 different loss values: the Mean Absolute Error against the image it was supposed to generate, and the Discriminator's output. We perform gradient descent on the Generator using the combined loss, and on the Discriminator based on how much it got wrong about the synchronicity. I recorded the generation of the same 2 faces on their respective audio pieces and put it into a gif (each frame is every 1000 epochs, I believe). I am fairly certain the way I record losses is faulty, so I intend to do it properly this time.
The Problem
If you are an expert in this field, you may already see a problem or two with my implementation. If so, I would appreciate a comment. If not, here goes:
The Generator's Weird Inputs
The one part of the Wav2Lip paper that I don't really like is the fact that they stack the reference top half of the face "on top of" the pose prior (how I describe the same face in a different pose). The main issue is that it takes up unnecessary space and thus artificially inflates the model with useless parameters. If you mask out the bottom half of a 256x256 face, you are left with a 256x256 image whose lower half is all zeros; stack that channel-wise onto another 256x256 image and you get a 256x256x6 array, a quarter of which will be 0 no matter what. There is a good chance that gradient descent will not catch this and drive the parameters corresponding to that dead region of the input to 0. This could explain why the model tends to plateau before it even generates a realistic human picture.
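To make the wasted space concrete, here is a quick NumPy illustration (the 256x256 size is just the example from above):
import numpy as np

# Mask out the bottom half of one face, then stack it channel-wise with a reference face
face = np.random.rand(256, 256, 3)
reference = np.random.rand(256, 256, 3)

masked = face.copy()
masked[128:, :, :] = 0.0  # bottom half is now all zeros

stacked = np.concatenate([masked, reference], axis=-1)
print(stacked.shape)            # (256, 256, 6)
print(np.mean(stacked == 0.0))  # ~0.25, i.e. a quarter of the input is always zero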
The Generator's Loss Calculations
The generator currently only has one loss function:
Mean Absolute Error compared to the image it was supposed to generate
This can be seen in the code for train step:
def train_step(images, gen_imgs, audio_samples, unsynced_audio_samples):
    # with tf.GradientTape() as gen_tape, tf.GradientTape() as disc_tape:
    real = np.zeros((len(images), 1))
    fake = np.ones((len(images), 1))
    generated_images, gd_loss = gen([gen_imgs, audio_samples])

    # Very hacky, without this, the discriminator cannot be trained
    qual_disc.trainable = True

    # train discriminator
    if np.random.choice([True, False]):  # Make sure the discriminator will not be too strong
        disc_loss = qual_disc.train_on_batch([generated_images, audio_samples], fake)
        unsync_loss = qual_disc.test_on_batch([images, np.roll(audio_samples, 10, axis=0)], fake)
    else:
        disc_loss = qual_disc.test_on_batch([generated_images, audio_samples], fake)
        unsync_loss = qual_disc.train_on_batch([images, np.roll(audio_samples, 10, axis=0)], fake)
    real_loss = qual_disc.train_on_batch([images, audio_samples], real)
    qual_disc.trainable = False  # Do not allow the discriminator to be trained

    # train generator
    total, mae, adv = gen.train_on_batch([gen_imgs, audio_samples], [images, real])

    print("Disc Loss: {} Real Loss: {} Unsync Loss: {} Total Gen Loss: {} MAE: {} Adv Loss: {}".format(
        disc_loss, real_loss, unsync_loss, total, mae, adv))
    return (disc_loss + real_loss + unsync_loss) / 3, total
You might already see the problem: the generator is not really trying to fool the discriminator, or at least I don't think it is. The loss graph disputes this a little, and the discriminator's output is technically used in the generator's training, but since I was a novice at the time, a lot of things could have gone wrong. To potentially fix this, I will completely rewrite a lot of the code with Tensorflow's functional API, which let me do this exact thing without problems in my Facial Image Compression project (there is a rough sketch of what I mean after the discriminator code below).
The other loss value in play is the Discriminator's output: the L2 distance (L2-norm, i.e. Euclidean distance) between the generated image's embedding and the audio piece's embedding, both of which are learned by the Discriminator.
So, mathematically (using LaTeX so I seem smart):
$$l_2 = \sqrt{\sum_{i=1}^{N} (a_i - b_i)^2}$$
This is translated into code as seen here:
# L2-normalize the encoding tensors
image_encoding = tf.math.l2_normalize(image_encoding, axis=1)
audio_encoding = tf.math.l2_normalize(audio_encoding, axis=1)

# Find euclidean distance between image_encoding and audio_encoding
# Essentially trying to detect if the face is saying the audio
# Will return nan without the 1e-12 offset due to https://github.com/tensorflow/tensorflow/issues/12071
d = tf.norm((image_encoding - audio_encoding) + 1e-12, ord='euclidean', axis=1, keepdims=True)

discriminator = keras.Model(inputs=[image_input, audio_input], outputs=[d], name="discriminator")
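To make the intended fix a bit more concrete, here is a rough sketch of the kind of generator step I have in mind for the rewrite. This is not the current code: gen and qual_disc stand in for the real models (I'm assuming a generator that outputs just the frame), and the 0.1 weighting is a placeholder I would still have to tune.
import tensorflow as tf

mae = tf.keras.losses.MeanAbsoluteError()
gen_optimizer = tf.keras.optimizers.Adam(1e-4)

@tf.function
def generator_step(gen_imgs, audio_samples, target_images):
    with tf.GradientTape() as tape:
        generated = gen([gen_imgs, audio_samples], training=True)
        # Reconstruction term: how far the generated frame is from the ground truth
        reconstruction = mae(target_images, generated)
        # Adversarial term: the discriminator outputs a distance where 0 means
        # "this face matches this audio", so the generator tries to minimize it
        sync_distance = tf.reduce_mean(qual_disc([generated, audio_samples], training=False))
        total = reconstruction + 0.1 * sync_distance
    grads = tape.gradient(total, gen.trainable_variables)
    gen_optimizer.apply_gradients(zip(grads, gen.trainable_variables))
    return total, reconstruction, sync_distance
The point is that the discriminator's distance would be explicitly part of the generator's gradient, instead of something I merely print out.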
Long Train Time
Unfortunately I do not want to run the whole thing for 30 hours again to record the time it takes to run 1 full epoch. However, I did this a year ago, and wrote a little comment in the code that says the main bottleneck is in my batch generation function. I know I said that 2023 me was stupid, but in hindsight, 2023 me had a job the previous summer and I don't. Therefore, I am inclined to trust him with this.
Disgusting Code
Self-explanatory
Does not Synchronize well as a video
Remember when I said this is a rough adaptation of the Wav2Lip model? It's 100% rough. The novel feature that Wav2Lip introduces is a second Discriminator: the Lip-Sync Expert, which is a pre-trained model (as opposed to traditional discriminators that are trained alongside the generator). Needless to say, I did not do that. For this, I think I will change the batch generation code to load the frame directly preceding the target in the timeline as the pose prior, and hopefully the model will learn to generate a frame that works as the frame directly after it.
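Here is roughly what I mean by that change, as a sketch; frames and spectrograms are placeholder names for time-aligned arrays from one video, not my actual batch generation code.
import numpy as np

def make_sequential_batch(frames, spectrograms, batch_size=16):
    # Pick target indices starting from 1 so that index - 1 always exists
    idx = np.random.randint(1, len(frames), size=batch_size)
    target = frames[idx]          # the frame the generator should produce
    pose_prior = frames[idx - 1]  # the frame directly preceding it in the timeline
    audio = spectrograms[idx]     # the audio window aligned with the target frame
    return pose_prior, audio, target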
The Plan
Based on all of the above, I think I know where to start. To address the long train time, I will decrease the image size to 64x64. When I built the facial compression GAN a few months ago, I realized that the smallest size for an image containing a face with a mouth that still looks distinguishable is around 54x54. Nonetheless, my lucky number is 2, and 2 to the power of 6 is 64, so 64x64 it is. I will take all the help I can get with this project, so Lady Luck will be a guest at my house this week. Another thing I will do is revamp the model definitions themselves with the Tensorflow functional API, which I will discuss further in part 2. This allows my custom training loop to run in Keras' graph mode and achieve the best performance possible. Plus, it looks clean as hell. I will also change the inputs of the applicable models to take in the reference image and the pose prior separately, instead of stacked, to avoid the useless parameters mentioned above (a rough sketch of that is below).
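For reference, the separated-input version I have in mind would look something like this; again only a sketch, with placeholder filter counts and spectrogram shape, not a final design.
from tensorflow import keras
from tensorflow.keras import layers

def build_separated_generator_sketch(img_size=64, spec_shape=(80, 16, 1)):
    # The reference face and the pose prior come in as two separate 3-channel inputs
    # instead of one half-empty 6-channel stack
    reference_in = keras.Input(shape=(img_size, img_size, 3), name="reference_face")
    pose_in = keras.Input(shape=(img_size, img_size, 3), name="pose_prior")
    audio_in = keras.Input(shape=spec_shape, name="spectrogram")

    def encode(x):
        for filters in (32, 64, 128):
            x = layers.Conv2D(filters, 3, strides=2, padding="same", activation="relu")(x)
        return layers.GlobalAveragePooling2D()(x)

    a = audio_in
    for filters in (32, 64, 128):
        a = layers.Conv2D(filters, 3, strides=2, padding="same", activation="relu")(a)
    audio_emb = layers.GlobalAveragePooling2D()(a)

    # Fuse the three embeddings and decode back up to 64x64
    z = layers.Concatenate()([encode(reference_in), encode(pose_in), audio_emb])
    z = layers.Dense(4 * 4 * 256, activation="relu")(z)
    z = layers.Reshape((4, 4, 256))(z)
    for filters in (128, 64, 32, 16):
        z = layers.Conv2DTranspose(filters, 3, strides=2, padding="same", activation="relu")(z)
    out = layers.Conv2D(3, 3, padding="same", activation="sigmoid")(z)

    return keras.Model([reference_in, pose_in, audio_in], out, name="generator_64_sketch")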
If anything, I am mostly relying on the fact that I have more experience working on this kind of project now, and I can probably do a lot better if I start from scratch. If it works, well hey, that's a free ego boost, because I'll know that I have improved a lot in a year.
Thank you for Reading
I don't exactly know how many people read my blogs, because to be completely honest it's mostly me rambling about some Frankenstein of a creation I have made. For those who are returning, thank you. I don't write these with an audience in mind, but I do hope I either gave you some extra insight or simply inspired you to sit down and make something awesome. I apologize for the mess that was the Lifelike Devblog release schedule though; both Mustafa and I simply did not feel like the idea had baked in the oven for long enough. Lifelike was cool, and I think I will come back to it after I am done with this little project of mine. Honestly, I was really burned out by the end of Nights&Weekends; while I think of myself as a half-decent programmer, I am definitely not a good salesman, and it showed there. Plus, with the power of almighty hindsight, I think the 2023 version of me was pretty shit. I have learned a lot since then and realized a lot of the mistakes I made at the time. At the end of the day, maybe that was the real win.
Visions of the Future
Nonetheless, I have major ambitions for the interrogation game and I would definitely like it to be made. Artificial Intelligence is absolutely awesome, but it gets a lot of flak from the art community (and for good reason). With the interrogation demo I want to make a point that it should be used as a tool for game devs, not a replacement. I mean, that was the whole idea behind Lifelike. But, before big ambitions come into play, it's the little devblogs, the little projects and the little things we do along the way that really make it all worth it. Take your eyes off of the future and just take on your little side quests, see where it goes.