21 May 2025
Papers and Models on Video Generation
- Exploring the Evolution of Physics Cognition in Video Generation: A Survey
- A Survey of Interactive Generative Video
- Video Diffusion Models: A Survey
- Video Is Worth a Thousand Images: Exploring the Latest Trends in Long Video Generation
- Opportunities and challenges of diffusion models for generative AI
- Sora
- Video Diffusion Models
- Imagen Video
- Phenaki
- Lumiere
- Stable Video Diffusion
- AnimateDiff
- Open-Sora
- CausVid
- VideoGPT
- DVD-GAN
- MoCoGAN
- VGAN
31 January 2025
Generative Models Focus On Decoder Architecture
- Autoregressive Generation: Decoders excel at predicting the next item in a sequence, conditioned on everything generated so far. This autoregressive approach is fundamental to producing coherent and novel content (a minimal code sketch follows this list).
- Sequential Processing: Decoders process information step by step, building on their previous generations, which lets them capture long-range dependencies and produce complex, structured outputs.
- Task-Specific Optimization: They can be tuned to handle the complexities of grammar, semantics, and context.
- Simplified Training: Decoder-only models are simpler to train because they only need to learn the conditional probability distribution of the next token given the previous ones; there is no separate encoding step.
- Focus on Sequence-to-Sequence Tasks: Early successes on sequence-to-sequence tasks showed that, for generation, the decoder was both the most important component and the most computationally expensive one.
- Efficiency: For tasks that mainly require generating new content, a decoder-only model can achieve comparable or better results at a lower computational cost.
- Attention Is All You Need: Transformer decoders already relied on (masked) self-attention, so adapting the architecture to a decoder-only setup was straightforward and highly effective.
- Scalability: Decoder-only Transformers scale well, paving the way for very large models that generate highly coherent and creative text.
- Computational Cost: Training encoder-decoder models is more expensive and often prohibitive, especially for organizations on limited budgets.
- Performance Gains: For generative tasks, adding an encoder step did not yield gains substantial enough to justify the additional computational cost.
- Contextual Understanding: Encoders provide rich representations that capture meaning and context, which can make the decoder's generations more relevant.
- Feature Extraction: Encoders extract key features which the decoder can then use to generate context-specific output.
- T5: This model casts both input and output as text in a unified encoder-decoder architecture, where the encoder understands the input and the decoder generates the output.
- GPT-1 (2018): demonstrated the potential of the decoder-only Transformer architecture with 117 million parameters
- GPT-2 (2019): scaled up to 1.5 billion parameters; this increase led to improvements in text generation quality and coherence, while raising concerns about misuse
- GPT-3 (2020): a massive step forward with 175 billion parameters; scaling up brought greater performance and wider coverage of NLP tasks such as text generation, translation, and few-shot learning
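To make the autoregressive, decoder-only idea concrete, here is a minimal sketch of causal (masked) self-attention plus a greedy next-token loop. It uses a tiny, randomly initialized toy model, so the output is meaningless; names like `vocab`, `next_token_logits`, and `generate` are illustrative, not from any specific library.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["<bos>", "the", "cat", "sat", "on", "mat", "."]
V, D = len(vocab), 16          # vocabulary size, embedding width

# Randomly initialized toy parameters (a real model learns these).
embed = rng.normal(0, 0.1, (V, D))
W_q = rng.normal(0, 0.1, (D, D))
W_k = rng.normal(0, 0.1, (D, D))
W_v = rng.normal(0, 0.1, (D, D))
W_out = rng.normal(0, 0.1, (D, V))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def next_token_logits(token_ids):
    """One masked (causal) self-attention layer over the prefix,
    returning logits for the next token, i.e. p(x_t | x_<t)."""
    x = embed[token_ids]                       # (T, D) token embeddings
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(D)              # (T, T) attention scores
    mask = np.triu(np.ones_like(scores), k=1)  # 1s above the diagonal
    scores = np.where(mask == 1, -1e9, scores) # causal mask: no peeking ahead
    h = softmax(scores) @ v                    # (T, D) attended representation
    return h[-1] @ W_out                       # logits for the next position

def generate(prompt_ids, max_new_tokens=5):
    """Greedy autoregressive loop: each step conditions on everything
    generated so far, which is what gives decoder-only models coherence."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(np.array(ids))
        ids.append(int(np.argmax(logits)))     # greedy pick; sampling is also common
    return ids

print([vocab[i] for i in generate([0, 1, 2])])  # arbitrary output: untrained weights
```

The causal mask is what forces each position to attend only to earlier tokens, so training reduces to learning the next-token conditional distribution; swapping the argmax for sampling gives the stochastic decoding typically used in practice.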
28 January 2025
Nova, Llama, Mistral, DeepSeek, and Gemini
Amazon Nova
- Focus on Practicality: Designed for real-world applications with a balance of accuracy, speed, and cost-effectiveness
- Multimodal Capabilities: Strong across data types
- Cost-Effectiveness: models that are affordable and highly performant
- Weakness: a newer model whose full performance benchmarks are yet to be established
- When to use: when the priority is practical applications, cost-effectiveness, and strong multimodal capabilities within the Amazon ecosystem
Llama
- Open Source: useful for research and innovation, since anyone can access and build upon the model
- Strong Performance: consistently high performance across benchmarks
- Large Community: a growing community of users and contributors, with adoption across many applications
- Weakness: potential for misuse, such as generating harmful content or enabling malicious activities
- When to use: open source, balance of performance and accessibility, extensible
Mistral
- High Performance: Strong benchmark results, outperforming many competitor models
- Focus on Safety: Strong emphasis on safety and bias mitigation
- Efficiency: delivers strong performance while using computational resources efficiently
- Weakness: full performance benchmarks are yet to be determined
- When to use: high performance, safety, and efficiency
Gemini
- Advanced Capabilities: cutting-edge capabilities in reasoning, code generation, and multimodal understanding
- Strong Backing: support from Google that provides significant dedicated resources and expertise
- Weakness: available only through Google services, which limits accessibility and flexibility for independent developers and researchers; can be prone to gaps and hallucinated responses, and threads of related responses can be brittle
- When to use: advanced capabilities within a Google ecosystem of services
DeepSeek
- High Performance: state-of-the-art performance on benchmarks, surpassing many proprietary models
- Open Source: built with an open community in mind, enabling continuous innovation and development
- Focus on Reasoning: strong at reasoning tasks, understanding and solving complex problems
- Weakness: a relatively new model whose full performance benchmarks are yet to be established
- When to use: open source, high performance, strong reasoning, extensible
Weaknesses of Gemini Compared to DeepSeek
Cost: Gemini is more expensive for large-scale applications, with a proprietary license and higher compute costs. DeepSeek, by contrast, is open source and offers a more cost-effective and accessible option.
Multimodal Capabilities: Gemini is strong on multimodal tasks, whereas DeepSeek focuses more on text-based reasoning. Gemini is more versatile across images, audio, video, and text, though DeepSeek has recently extended its capabilities with Janus-Pro to enable multimodal AI.
Community and Support: Gemini comes with Google's backing, a large community, and strong support, whereas DeepSeek is relatively new with a smaller community of users.
Maturity: Gemini has had a longer development history. DeepSeek is still under active development, draws heavily on reengineering existing approaches, and lacks the same level of refinement.
An open source model tends to attract a larger community of users because of its accessibility and community-driven development cycle. In many cases, an open source model may outpace a proprietary one. DeepSeek is showing multimodal strengths and may eventually outshine its competitors; indeed, it has already challenged the competition with some serious capabilities.