21 May 2025

Papers and Models on Video Generation

Papers:
  • Exploring the Evolution of Physics Cognition in Video Generation: A Survey
  • A Survey of Interactive Generative Video
  • Video Diffusion Models: A Survey
  • Video Is Worth a Thousand Images: Exploring the Latest Trends in Long Video Generation
  • Opportunities and challenges of diffusion models for generative AI
Models:
  • Sora
  • Video Diffusion Models
  • Imagen Video
  • Phenaki
  • Lumiere
  • Stable Video Diffusion
  • AnimateDiff
  • Open-Sora
  • CausVid
  • VideoGPT
  • DVD-GAN
  • MoCoGAN
  • VGAN

31 January 2025

Generative Models Focus On Decoder Architecture

The influential LLM milestone that popularized the current cascade of generative models was GPT-1, with its decoder-only architecture, which eventually led to GPT-3. Since then, the trend has been for LLMs to be more decoder-focused. Other models that have played significant roles include BERT, Transformer-XL, ELMo, ULMFiT, and LaMDA. T5 has been one of the few exceptions, using an encoder-decoder architecture. You might wonder why this is the case.

Why Decoders Dominate:
  • Autoregressive Generation: Decoders excel at predicting the next item in a sequence. This autoregressive approach is fundamental to generating coherent, novel content (see the sketch after this list).
  • Sequential Processing: Decoders process information step by step, building on what they have already generated, which lets them capture long-range dependencies and produce complex, structured outputs.
  • Task-Specific Optimization: They can handle complexities in grammar, semantics, and context.
  • Simplified Training: Decoder-only models are simpler to train because they only need to learn the conditional probability distribution of the next token given the previous ones. There is no separate encoding step.
  • Focus on Sequence-to-Sequence Tasks: Early successes on sequence-to-sequence tasks showed that, for generation, the decoder was both the most important component and the most computationally expensive one.
  • Efficiency: For tasks that mainly require generating new content, a decoder-only model can achieve comparable or better results at lower computational cost.
  • Attention Is All You Need: Decoders already relied on self-attention, so adapting the Transformer to a decoder-only setup was straightforward and highly effective.
  • Scalability: Decoder-only Transformers scale well, paving the way for very large models that generate highly coherent and creative text.
  • Computational Cost: Training encoder-decoder models is more expensive and often prohibitive, especially for organizations on limited budgets.
  • Performance Gains: For generative tasks, the gains from adding an encoder step were not substantial enough to justify the additional computational cost.
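To make the autoregressive loop concrete, here is a minimal sketch of greedy next-token decoding with a decoder-only model. It assumes the Hugging Face transformers library and uses GPT-2 only because it is small and public; any causal checkpoint follows the same pattern.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 is used only because it is small and freely available; any
# decoder-only (causal) checkpoint behaves the same way.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Generative models focus on"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Autoregressive loop: each step models P(next token | all previous
# tokens) and appends the greedy (argmax) choice to the context.
with torch.no_grad():
    for _ in range(20):
        logits = model(input_ids).logits          # (1, seq_len, vocab)
        next_id = logits[0, -1].argmax()          # most likely next token
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))
```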

The Role of Encoders: While decoders are used for generation, encoders are used for understanding and representing inputs.
  • Contextual Understanding: Encoders provide rich representations, capturing meaning and context so the decoder can generate more relevant output (see the sketch after this list).
  • Feature Extraction: Encoders extract key features which the decoder can then use to generate context-specific output.
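As an illustration, here is a minimal sketch of an encoder producing contextual representations, again assuming the Hugging Face transformers library; BERT is just an example of an encoder-only checkpoint.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# BERT is encoder-only: it maps text to contextual embeddings
# rather than generating new tokens.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Encoders build rich representations.", return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)

# One contextual vector per input token; a decoder or classifier
# can condition on these features downstream.
print(outputs.last_hidden_state.shape)  # e.g. (1, num_tokens, 768)
```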

T5: This model uses both an encoder and a decoder, balancing context understanding with generation. It casts both input and output as text in a unified architecture: the encoder understands the input and the decoder generates the output (see the sketch below).
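Here is a minimal text-to-text sketch with T5 via Hugging Face transformers; the translation prefix is one of T5's standard task formats.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# T5 casts every task as text-to-text: the encoder reads the whole
# input, the decoder generates the output autoregressively.
input_ids = tokenizer(
    "translate English to German: The house is wonderful.",
    return_tensors="pt",
).input_ids
output_ids = model.generate(input_ids, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```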

GPT Recap 
  • GPT-1 (2018): demonstrated the potential of the decoder-only Transformer architecture with 117 million parameters
  • GPT-2 (2019): scaled up to 1.5 billion parameters; the increase led to improvements in text generation quality and coherence, and raised concerns about misuse
  • GPT-3 (2020): a massive step forward with 175 billion parameters; the scale-up achieved greater performance and wider coverage of NLP tasks, including text generation, translation, and few-shot learning

Throughout this process, training techniques and model architectures were improved. Eventually, this effort shaped the advancement of LLMs and the trend toward more versatile models.

28 January 2025

LLM API Full Guide

Nova, Llama, Mistral, DeepSeek, and Gemini

Amazon Nova

  • Focus on Practicality: real-world applications balancing accuracy, speed, and cost-effectiveness
  • Multimodal Capabilities: strong across data types
  • Cost-Effectiveness: models that are affordable and highly performant
  • Weakness: a new model family whose full performance benchmarks are yet to be determined
  • When to use: practical applications, cost-effectiveness, and strong multimodal capabilities within the Amazon ecosystem (a minimal call sketch follows)
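A minimal sketch of calling a Nova model through the Amazon Bedrock Converse API with boto3. The model ID below (amazon.nova-lite-v1:0) is an assumption; check the Bedrock model catalog for the IDs available in your region.

```python
import boto3

# Bedrock runtime client; credentials and region come from your AWS config.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="amazon.nova-lite-v1:0",  # assumed ID -- verify in the Bedrock catalog
    messages=[{"role": "user", "content": [{"text": "Summarize what Amazon Nova is."}]}],
    inferenceConfig={"maxTokens": 256, "temperature": 0.5},
)
print(response["output"]["message"]["content"][0]["text"])
```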

Llama

  • Open Source: useful for research and innovation as a way to access and build upon the model
  • Strong Performance: consistently high performance across benchmarks
  • Large Community: growing community of users and contributors and has been used across applications
  • Weakness: potential for misuse for generating harmful content and malicious activities
  • When to use: open source, a balance of performance and accessibility, extensible (a minimal local-inference sketch follows)
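Because the Llama weights are openly available, a common path is local inference with Hugging Face transformers. The checkpoint name below is an example; Meta's repos are gated, so you must accept the license on Hugging Face and authenticate before downloading.

```python
from transformers import pipeline

# Example checkpoint; requires accepting Meta's license on Hugging Face.
generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain decoder-only models in one sentence."}]
output = generator(messages, max_new_tokens=80)
# With chat-style input, generated_text holds the conversation; the
# last message is the model's reply.
print(output[0]["generated_text"][-1]["content"])
```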

Mistral

  • High Performance: strong benchmark results, beating many competitor models
  • Focus on Safety: Strong emphasis on safety and bias mitigation
  • Efficiency: performance with efficiency in computational resources
  • Weakness: full performance benchmarks are yet to be determined
  • When to use: high performance, safety, and efficiency (a minimal call sketch follows)
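A minimal sketch using the official mistralai Python SDK (v1-style client); the model name is an example from Mistral's catalog.

```python
from mistralai import Mistral

# v1-style SDK client; model name is an example.
client = Mistral(api_key="YOUR_MISTRAL_API_KEY")

response = client.chat.complete(
    model="mistral-small-latest",
    messages=[{"role": "user", "content": "Give one tip for prompt design."}],
)
print(response.choices[0].message.content)
```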

Gemini

  • Advanced Capabilities: cutting-edge capabilities in reasoning, code generation, and multimodal understanding
  • Strong Backing: support from Google that provides significant dedicated resources and expertise
  • Weakness: only available through Google services, which limits accessibility and flexibility for independent developers and researchers; prone to hallucinated responses; threads of related responses can be brittle
  • When to use: advanced capabilities within the Google ecosystem of services (a minimal call sketch follows)
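A minimal sketch using the google-generativeai Python SDK; the model name is an example and newer Gemini releases may supersede it.

```python
import google.generativeai as genai

# Model name is an example; newer Gemini versions may replace it.
genai.configure(api_key="YOUR_GOOGLE_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

response = model.generate_content("Describe multimodal models in two sentences.")
print(response.text)
```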

DeepSeek

  • High Performance: state-of-the-art performance on benchmarks and surpassing many proprietary models
  • Open source: Built with an open community in mind for continuous innovation and development
  • Focus on Reasoning: strong in reasoning tasks, understanding, and solving complex problems
  • Weakness: relatively new model and full performance benchmarks are yet to be determined
  • When to use: open source, high performance, strong reasoning, extensible (a minimal call sketch follows)
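DeepSeek's hosted API is OpenAI-compatible, so the standard openai client can point at its endpoint. The base URL and model name below follow DeepSeek's public docs but are worth double-checking.

```python
from openai import OpenAI

# OpenAI-compatible endpoint; verify base URL and model name against
# DeepSeek's current documentation.
client = OpenAI(api_key="YOUR_DEEPSEEK_API_KEY", base_url="https://api.deepseek.com")

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Walk through 17 * 24 step by step."}],
)
print(response.choices[0].message.content)
```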

Weakness of Gemini Compared To DeepSeek

Cost: Gemini is more expensive for large-scale applications, with a proprietary license and higher compute requirements. DeepSeek, being open source, offers a more cost-effective and accessible option.

Multimodal Capabilities: Gemini is strong at multimodal tasks, while DeepSeek focuses more on text-based reasoning. Gemini is more versatile across images, audio, video, and text, though DeepSeek has recently extended its capabilities with Janus-Pro to enable multimodal AI.

Community and Support: Gemini comes with Google's large backing, community, and support, whereas DeepSeek is relatively new with a smaller community of users.

Maturity: Gemini has had a longer development history. DeepSeek is still under development, draws heavily on reengineering existing work, and lacks the same level of refinement.

An open source model tends to attract a larger community of users because of its accessibility and a fast-moving, community-driven development cycle. In many cases, an open source model may outpace its proprietary counterpart. DeepSeek is showing multimodal strengths and may eventually outshine competing models; it has already challenged the competition with some serious capabilities.