Deepshot in HD - New Features & Timeline
Explore what's in store for Deepshot in the near future along with our timeline for rolling out new features.
Our new, high-definition lip generation model has just been finished, and we couldn't be more excited to bring this technology to you all.
We plan to put out a more formal "research" post sometime later this week, so expect A LOT more information on the specifics from us in the coming days.
Based on current progress, we should only have around a week's worth of work left before everything will be ready.
One of the most significant advancements in our new model is the increase in output resolution from 96x96 to a much clearer 512x512. This leap might sound technical and unassuming, yet it marks a major upgrade in the quality of the videos produced.
When we talk about our model "generating in 1080p," it's important to clarify what this truly means. In a 1920x1080 frame, it's not that every pixel is dedicated to the mouth - such a scenario would not only be impractical but virtually useless since a speaker seldom occupies the entire frame in a shot. Instead, in instances where the face in your video doesn't consume more than 50% of the frame, our model ensures the newly generated lips are virtually indistinguishable from the original speaking sequence.
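To make the resolution claim concrete, here is a rough sketch of the underlying arithmetic: a generated 512x512 lip region needs no upscaling as long as the face crop it replaces is no larger than 512 pixels. The frame sizes and threshold function below are illustrative simplifications, not Deepshot's actual pipeline.

```python
# Sketch: when does a 512x512 generation preserve native detail in a
# 1080p frame? (Illustrative arithmetic only.)

FRAME_W, FRAME_H = 1920, 1080
MODEL_RES = 512  # new model output resolution (up from 96)

def native_detail(face_frac: float) -> bool:
    """True if a square face crop spanning `face_frac` of the frame
    height fits inside the model's output resolution, i.e. the
    generated lips need no upscaling."""
    crop_px = face_frac * FRAME_H
    return crop_px <= MODEL_RES

# A face spanning 40% of a 1080p frame is a 432 px crop -> fits in 512.
print(native_detail(0.40))  # True
# A face filling 60% of the frame (648 px) would need upscaling.
print(native_detail(0.60))  # False
```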
Our new model also employs a new masking technique which takes into account the dynamic motion of the head, a key aspect often overlooked by previous methods.
Traditional techniques, like the rectangular lower-half mouth masking approach, often struggle with active head movement, leading to undesirable distortions and unnatural jaw movements in the generated visuals. What we've done differently is to use a 3D face mesh predictor that captures 3D parameters to predict dense face geometry from given video frames.
What we end up with is a 'pose-aware mask' that not only understands the pose information but also mimics facial semantics such as jaw shape, thereby enhancing the visual quality of the final result. This improvement is backed by our detailed studies, which demonstrate how this pose-aware masking contributes to a more visually pleasing output, even with dynamic head movement.
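The difference between the two masking styles can be sketched in a few lines: instead of blanking a fixed lower-half rectangle, a pose-aware mask rasterizes a polygon that follows the jaw contour predicted by the face-mesh model. The landmark coordinates below are hypothetical, and the ray-casting rasterizer stands in for whatever a production pipeline would use.

```python
import numpy as np

def polygon_mask(h: int, w: int, poly: np.ndarray) -> np.ndarray:
    """Boolean (h, w) mask of pixels inside `poly` ((N, 2) xy points),
    computed with even-odd ray casting."""
    ys, xs = np.mgrid[0:h, 0:w]
    inside = np.zeros((h, w), dtype=bool)
    n = len(poly)
    for i in range(n):
        x0, y0 = poly[i]
        x1, y1 = poly[(i + 1) % n]
        # Edge crosses the horizontal line through each pixel row?
        crosses = (ys < y0) != (ys < y1)
        with np.errstate(divide="ignore", invalid="ignore"):
            x_at_y = x0 + (ys - y0) / (y1 - y0) * (x1 - x0)
        # Toggle pixels left of the crossing point (even-odd rule).
        inside ^= crosses & (xs < x_at_y)
    return inside

# Hypothetical jaw/lip contour (pixel coords) for a 64x64 face crop,
# tilted as it would be under head rotation.
jaw = np.array([[12, 30], [18, 50], [32, 58], [46, 48], [52, 28]], float)
mask = polygon_mask(64, 64, jaw)
```

Because the polygon is rebuilt per frame from the predicted geometry, the masked region rotates and deforms with the head rather than staying axis-aligned.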
In the picture below, the discolored region of the face shows our new masking technique in action.
Along with our new lip generation model, we will also be rolling out face detection, eliminating the need to cut your videos before uploading and using them on Deepshot.
In practice, the new flow will resemble something like the following:
1. Upload your video
2. Specify how many speakers are present
3. Face detection runs
After face detection runs, you will be presented with a clip of each distinct face found in your video. For each clip you have two options: add the clip to your profile (meaning you can generate audio on it) or ignore it.
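The review step described above can be sketched as a small data model: each detected face becomes a clip, and the user keeps or ignores each one. All names and types here are illustrative placeholders, not Deepshot's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class FaceClip:
    """One distinct face found by detection (hypothetical structure)."""
    face_id: int
    start_sec: float
    end_sec: float

@dataclass
class Profile:
    clips: list = field(default_factory=list)

    def review(self, detected: list, keep: set) -> None:
        """Add clips whose face_id is in `keep`; ignore the rest."""
        self.clips.extend(c for c in detected if c.face_id in keep)

# Detection found two faces; the user keeps the first and ignores the second.
detected = [FaceClip(0, 0.0, 12.5), FaceClip(1, 3.2, 20.0)]
profile = Profile()
profile.review(detected, keep={0})
print([c.face_id for c in profile.clips])  # [0]
```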
Following the release of our new model, we plan to integrate seamless translation. The end goal of what this will look like is the following:
1. Upload your video
2. Specify the video's current language
3. Choose one (or more) output languages
4. Generate your video
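The four steps above amount to a simple fan-out: one uploaded video plus a source language produces one generation job per chosen target language. The function and field names below are placeholders for illustration, not a real Deepshot API.

```python
def generate_translations(video: str, source_lang: str,
                          target_langs: list) -> list:
    """Return one (hypothetical) generation job per target language."""
    return [
        {"input": video, "from": source_lang, "to": lang}
        for lang in target_langs
    ]

# Uploading one English video and choosing two outputs yields two jobs.
jobs = generate_translations("talk.mp4", "en", ["es", "de"])
print(len(jobs))  # 2
```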
Expect more information regarding quick translation after we drop our next update.