AI Video Generation Will Usher In The Era Of GPT?
When will the video field usher in its own GPT era? Over the past year, progress in the Text-to-Video field has been much faster than expected. Runway has released Gen-1 and Gen-2, and the introduction of Motion Brush has taken a step forward in controllability. Stability AI also recently released its first Text-to-Video model, Stable Video Diffusion.
AI video products can be broadly categorized into three types: video generation (Text-to-Video), AI video editing (AI Editor), and digital avatars (Avatars). The latter two focus on enhancing video editing processes with AI, while video generation represents a significant democratization of content creation, with the potential to revolutionize traditional workflows.
From a technical perspective, video generation (Text-to-Video) has always been considered a challenging aspect of AI and computer-generated content. It is often regarded as a “frontier” within the field of AI and Graphics Computing. Video generation faces significant challenges such as immense computational requirements, scarcity of high-quality datasets, and issues related to controllability.
AI video generation has made significant progress. If you compare the results of AI-generated videos from last year to those from March this year, and then to the ones from the past couple of months, you’ll notice that video generation models have been advancing rapidly. It’s possible that in the near future, perhaps even next year, they could reach the level of AI image generation. Although video models may not be perfect yet, it’s important to remember that image models were not great a year and a half ago either, and now they have improved significantly.
Video generation shares some foundations with image generation: the models for AI-generated images and AI-generated videos have much in common, and both differ from language models because they are specialized for visual output. Videos, however, present unique challenges. The output must be temporally smooth, motion must be preserved across frames, and the data is much larger than for images, which demands more GPU memory. There is also a structural decision to make: whether to generate a video frame by frame or as a whole clip. Currently, many models generate the whole clip at once, which limits them to shorter videos.
Each frame of a video is an image, but video generation is far more challenging than generating individual images. Every frame must be generated at high quality, and adjacent frames must be coherent; keeping every frame consistent becomes increasingly difficult as the video grows longer. The model must also handle many images at once: simply moving 100 frames onto the GPU is already a burden. During inference, generating a large number of frames is much slower than generating a single image, which drives up computational costs.
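To get a feel for the memory pressure described above, here is a minimal back-of-the-envelope sketch in Python. The resolution, channel count, fp16 precision, and 100-frame clip length are illustrative assumptions, not figures from any particular model.

```python
# Rough estimate of GPU memory needed just to hold raw frame tensors,
# comparing a single image with a 100-frame clip at the same resolution.
# All sizes are illustrative assumptions (512x512, 3 channels, fp16).

BYTES_PER_VALUE = 2          # fp16
CHANNELS, HEIGHT, WIDTH = 3, 512, 512

def tensor_bytes(frames: int) -> int:
    """Memory for `frames` raw frames of shape (C, H, W)."""
    return frames * CHANNELS * HEIGHT * WIDTH * BYTES_PER_VALUE

image = tensor_bytes(1)
clip = tensor_bytes(100)

print(f"single image  : {image / 2**20:6.1f} MiB")
print(f"100-frame clip: {clip / 2**20:6.1f} MiB  ({clip // image}x larger)")

# Activations inside a diffusion U-Net are typically many times larger than
# the raw input, so the gap between image and video workloads is even wider
# in practice.
```

This only counts the input tensors; intermediate activations and attention over the temporal dimension grow the footprint further, which is one reason whole-clip generation stays short.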
Currently, language models have a clear roadmap, in part because OpenAI has invested significant resources in exploring and refining them. Models like GPT may not yet have been applied to video simply because that manpower and those resources have been focused primarily on text models. If a company were to allocate substantial funding, perhaps diffusion models could also produce impressive language models. However, since OpenAI's approach has proven effective, many see little need to invest large amounts of funding in alternative methods.
Video generation today is at a stage similar to where GPT-2 once was, and it is highly likely to see significant advances within the next year. Looking back at image generation in 2018, we might have thought, "How amazing would it be if we could generate illustrations for Wikipedia from its descriptions!" Now we have Stable Diffusion and other large-scale image generation models. The breakthrough in video generation may therefore happen faster than we imagine. By that time, generating video should be a highly controllable process, allowing us to generate clips of arbitrary length in a far more creative manner. We could direct the actions of the main characters, such as going to a café for coffee or attending classes at school, and then use our product to stitch all the clips into a complete short film, just like a director.
The competition in the field of video generation may follow a similar pattern to that of language models. When a company releases a new model, they may already have more advanced models internally, giving them a one to two-year head start over other companies. In the future, we can expect a similar dynamic in the video domain, with one company taking the lead and pushing the boundaries while others strive to catch up.
What determines who leads comes down to a few factors. First is technology: whether the team is the smartest and most innovative. Second is execution: the team needs a clear goal and must pursue it step by step. Data is one example of an important issue. Processing a dataset is not a simple task; obtaining a billion data points and then filtering and annotating them is hard, and not many teams can handle that entire pipeline exceptionally well.
High-quality data in video generation has several dimensions. The first is pixel quality: how good the image quality is. The second is aesthetics and artistic composition. The third is the presence of meaningful actions in the video.
For example, in movies, there are actually many beautiful videos, but most of the actions involve people standing still and talking. Although these visuals are very stunning and specifically designed by famous directors, using them alone to train models may not yield optimal results. If only such data is used without incorporating other content, the model may end up learning that people only move their mouths and not engage in other meaningful actions.
Additionally, the length of the training clips is crucial. If models are trained only on 1-second videos, it is difficult for them to generate 30-second videos. This can be addressed either by collecting longer clips for the model to learn from or by redesigning the algorithm so that it can extrapolate to longer videos after learning from 1-second clips. Data innovation or algorithmic innovation is therefore essential.
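To make these data criteria concrete, here is a minimal filtering sketch. The Clip fields, thresholds, and scores are all illustrative assumptions; in a real pipeline, aesthetic and motion scores would come from dedicated scoring models and optical-flow estimators.

```python
from dataclasses import dataclass

@dataclass
class Clip:
    width: int           # pixel-quality proxy: resolution
    height: int
    aesthetic: float     # assumed 0-10 score from an aesthetic model
    motion: float        # assumed mean optical-flow magnitude
    duration_s: float    # clip length in seconds

def keep(clip: Clip) -> bool:
    """Apply the three quality criteria plus a minimum length."""
    sharp_enough   = clip.width >= 1280 and clip.height >= 720
    looks_good     = clip.aesthetic >= 5.5
    meaningful_act = clip.motion >= 1.0      # reject near-static "talking head" shots
    long_enough    = clip.duration_s >= 2.0
    return sharp_enough and looks_good and meaningful_act and long_enough

# Illustrative usage with made-up clips.
clips = [
    Clip(1920, 1080, aesthetic=7.2, motion=3.4, duration_s=6.0),   # kept
    Clip(1920, 1080, aesthetic=8.0, motion=0.2, duration_s=12.0),  # static dialogue, dropped
    Clip(640, 360, aesthetic=6.0, motion=2.5, duration_s=4.0),     # too low-res, dropped
]
print([keep(c) for c in clips])   # [True, False, False]
```

The second clip illustrates the movie-footage problem above: it can look beautiful and still be rejected because nothing in it moves.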
Video generation models also require innovation in the models themselves and a great deal of engineering, which is not something everyone can accomplish. OpenAI has likewise established a technological barrier: even though open-source models like LLaMA are available and many people can do remarkable things with them, only OpenAI can create GPT-4.
To maintain a leading position in the industry and preserve the first-mover advantage, it is important to continuously accumulate resources, including user resources, data, GPU resources, and more. The development of technology and resource accumulation is an ongoing process. For instance, accumulating more users can help us train models better. The technology team is also crucial, and recruiting more technical talent is essential.
Interface design is also of great importance. In the end, it is likely a combination of both technology and design. Design can inspire technological advancements, and technology can support design. The barrier between design and technology may become increasingly blurred in the future.
The open-source community may not have sufficient computing power to train new video models because training a new video model requires a significant amount of computational resources. For Stable Diffusion, someone may be able to start from scratch and achieve good results using just 8 A100 GPUs. However, for video models, 8 A100 GPUs may not be enough, making it challenging to train a high-quality model. Moreover, the inherent issues of video models have not been fully resolved, leading to potential bottlenecks. Firstly, the performance of the models may not be satisfactory, and secondly, there may be algorithmic challenges that need to be addressed.
However, modifying models, architectures, and algorithms often requires starting from scratch, and individuals in the open-source community, including researchers in universities, may not have access to sufficient computational resources for such exploratory work. Therefore, the open-source community faces significant challenges unless individuals like POTX or TAI, who have access to ample resources and are willing to contribute charitably, decide to open-source a model. Apart from models being open-sourced by large companies, it is difficult for the general open-source community to engage in exploratory work.
Video models may eventually require the same scale of computational power as training GPT. Currently, we are not utilizing as much computational power for video models. This is partly because video models have not yet reached the same level as GPT, and partly because there are still architectural and technical challenges that need to be addressed. Once these issues are improved, we can expect a new generation of video models to achieve a scale similar to GPT.
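For a rough sense of what "GPT scale" might mean here, a common approximation for dense-transformer training compute is about 6 × parameters × training tokens. The sketch below applies that rule with assumed figures for a hypothetical video model; none of the numbers come from the article.

```python
# Back-of-the-envelope training compute using the common ~6 * N * D rule
# for dense transformers (N = parameters, D = training tokens).
# All concrete numbers below are assumptions chosen only for illustration.

def train_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

# Assumed language-model scale: 100B parameters trained on 2T tokens.
lm = train_flops(100e9, 2e12)

# Assumed video model: 10B parameters, but video is token-hungry.
# Suppose each short clip compresses to ~50k latent tokens and we
# train on 50M clips, giving 2.5e12 tokens.
video = train_flops(10e9, 50_000 * 50e6)

print(f"language model : {lm:.2e} FLOPs")
print(f"video model    : {video:.2e} FLOPs")

# Even with a 10x smaller model, the sheer number of visual tokens keeps
# the video run within an order of magnitude of the language-model run.
```

The point is not the specific figures but the shape of the argument: once architectural bottlenecks are resolved, the token volume of video data alone pushes training budgets toward GPT territory.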