ByteDance steps closer to AI video human reality – TechHQ

Joe Green
@TechForge_Media
joe@techforge.pub
“Image” by alittleblackegg is licensed under CC BY-NC-SA 2.0.
A proposal from ByteDance, the company behind TikTok, outlines an AI video generation framework designed to generate lifelike video of humans.
Given a still image and ‘signal inputs’ – especially audio, but also text – the method outlined is able to create animations at near-realistic resolutions in any aspect ratio, irrespective of that of the given image. In short, feed the model an image and an audio soundtrack, and the algorithm does the rest: lip-syncing, moving the virtual body, handling lighting and texture, and detailing the necessary facial and bodily expressions.
The company’s GitHub.io page showcases several examples of the software’s output in a range of styles, including stylised animation, vox-pops, an anthropomorphised head of kale, and some representations of virtual humans ‘playing’ musical instruments that will leave any human musician rolling in the aisles.
The company’s methods are detailed in a paper. Producing a simulacrum involves ‘driving signals’ drawn from one or more of text, audio, and pose. The approach has two parts: the OmniHuman diffusion model and a multi-condition training phase. The team assembled a little over two years’ worth of video content, whittling that down to around 100 days of footage chosen for its focus on lip-syncing and moving human poses.
The team found that the mix of ‘weak’ data, such as text instructions, with the higher-quality video containing rich images of natural movement affected how closely the rendered output resembled reality. The more audio in the training mix, for example, the less expressive movement was produced, negatively affecting elements like lip-syncing.
Interestingly, one of the reasons stated for pursuing the project was “to overcome the scarcity of high-quality data faced by previous methods,” to produce “highly realistic human videos from weak signals, especially audio.” The paucity of available training data, even taking into consideration the mass scrapes of video content, has led to the emergence of a small industry that arbitrates between high-quality content creators and AI companies keen to source materials to add to their training corpora. AI companies are willing to pay for well-known creators’ out-takes to help them produce more lifelike video and audio.
ByteDance’s GitHub summary of the project states, “Currently, we do not offer services/downloads anywhere, nor do we have any SNS [social media] accounts for the project,” so if the project comes to market other than in a web instance, or is released under an open-source licence, it will not be in its current form.
Like text created by large language models, the output from any AI, at present, can be described as a re-presentation of digitally captured human activity, but created via algorithms that are not yet able to get past a human’s ability to spot, or even sense, anomalies. The obvious examples in ByteDance’s videos are the constructed ‘poses’ that recreate virtual musicians playing musical instruments, which are woefully inaccurate. Similarly, videographers, audio specialists, and 3D artists will likely be able to detect mistakes in the elements of the renders that fall within their specialisation.
There is a significant technical challenge in producing trained models that faithfully represent the human form, one that as a species we have evolved to be inherently expert in. Attaining true human likeness with software and data is likely to be achievable in time.
But such a commitment of resources raises the question: to what end? As tool-makers, humans strive to make tasks easier. Part of the reason for doing so is to allow more time for activities as whimsical (or important, depending on point of view) as creativity. It seems pointless to commit countless hours and billions of dollars to having our silicon tools engage in creativity for us, other than as a technical exercise to explore the limits of possibility.
It will likely take several generations (developmental and human) for the worthwhile uses of AI to become apparent, but making video of humans (or vegetables) engaged in human activities won’t be one of the better ones.

13 February 2025
10 February 2025