21 May 2025

Papers and Models on Video Generation

Papers:
  • Exploring the Evolution of Physics Cognition in Video Generation: A Survey
  • A Survey of Interactive Generative Video
  • Video Diffusion Models: A Survey
  • Video Is Worth a Thousand Images: Exploring the Latest Trends in Long Video Generation
  • Opportunities and challenges of diffusion models for generative AI
Models:
  • Sora
  • Video Diffusion Models
  • Imagen Video
  • Phenaki
  • Lumiere
  • Stable Video Diffusion
  • AnimateDiff
  • Open-Sora
  • CausVid
  • VideoGPT
  • DVD-GAN
  • MoCoGAN
  • VGAN

31 January 2025

Generative Models Focus On Decoder Architecture

The influential LLM milestone that popularized the current cascade of generative models was GPT-1, with its decoder-only architecture, which eventually led to GPT-3. Since then, the trend has been for LLMs to be more decoder-focused. Other models that have played significant roles include BERT, Transformer-XL, ELMo, ULMFiT, and LaMDA. T5 has been one of the few exceptions, using an encoder-decoder architecture. You might wonder why this is the case.

Why Decoders Dominate:
  • Autoregressive Generation: Decoders excel at predicting the next item in a sequence. This autoregressive approach is fundamental to generating coherent, novel content (see the sketch after this list).
  • Sequential Processing: Decoders process information step by step, building on what they have already generated, which lets them capture long-range dependencies and produce complex, structured outputs.
  • Task-Specific Optimization: They can handle complexities in grammar, semantics, and context.
  • Simplified Training: Decoder-only models are simpler to train because they only need to learn the conditional probability distribution of the next token given the previous ones. There is no separate encoding step.
  • Focus on Sequence-to-Sequence Tasks: Early successes on sequence-to-sequence tasks showed that, for generation, the decoder was both the most important component and the most computationally expensive one.
  • Efficiency: For tasks that mainly require generating new content, a decoder-only model can achieve comparable or better results at lower computational cost.
  • Attention Is All You Need: Decoders already relied on self-attention, so adapting the Transformer to a decoder-only setup was straightforward and highly effective.
  • Scalability: Decoder-only Transformers scale well, paving the way for very large models that generate highly coherent and creative text.
  • Computational Cost: Training encoder-decoder models is more expensive and often prohibitive, especially for organizations on limited budgets.
  • Performance Gains: For generative tasks, the gains from adding an encoder step were not substantial enough to justify the additional computational cost.
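To make the autoregressive loop concrete, here is a minimal sketch of greedy next-token decoding with a decoder-only model. It assumes the Hugging Face transformers library and uses GPT-2 only because it is small and public; any causal checkpoint follows the same pattern.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 is used only because it is small and freely available; any
# decoder-only (causal) checkpoint behaves the same way.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Generative models focus on"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Autoregressive loop: each step models P(next token | all previous
# tokens) and appends the greedy (argmax) choice to the context.
with torch.no_grad():
    for _ in range(20):
        logits = model(input_ids).logits          # (1, seq_len, vocab)
        next_id = logits[0, -1].argmax()          # most likely next token
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))
```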

The Role of Encoders: While decoders are used for generation, encoders are used for understanding and representing inputs.
  • Contextual Understanding: Encoders provide rich representations, capturing meaning and context so the decoder can generate more relevant output (see the sketch after this list).
  • Feature Extraction: Encoders extract key features which the decoder can then use to generate context-specific output.
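As an illustration, here is a minimal sketch of an encoder producing contextual representations, again assuming the Hugging Face transformers library; BERT is just an example of an encoder-only checkpoint.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# BERT is encoder-only: it maps text to contextual embeddings
# rather than generating new tokens.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Encoders build rich representations.", return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)

# One contextual vector per input token; a decoder or classifier
# can condition on these features downstream.
print(outputs.last_hidden_state.shape)  # e.g. (1, num_tokens, 768)
```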

T5: This model uses both an encoder and a decoder, balancing context understanding with generation. It casts both input and output as text in a unified architecture: the encoder understands the input and the decoder generates the output (see the sketch below).
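Here is a minimal text-to-text sketch with T5 via Hugging Face transformers; the translation prefix is one of T5's standard task formats.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# T5 casts every task as text-to-text: the encoder reads the whole
# input, the decoder generates the output autoregressively.
input_ids = tokenizer(
    "translate English to German: The house is wonderful.",
    return_tensors="pt",
).input_ids
output_ids = model.generate(input_ids, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```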

GPT Recap 
  • GPT-1 (2018): demonstrated the potential of the decoder-only Transformer architecture with 117 million parameters
  • GPT-2 (2019): scaled up to 1.5 billion parameters; the increase led to improvements in text generation quality and coherence, and raised concerns about misuse
  • GPT-3 (2020): a massive step forward with 175 billion parameters; the scale-up achieved greater performance and wider coverage of NLP tasks, including text generation, translation, and few-shot learning

Throughout this process, training techniques and model architectures were improved. Eventually, this effort shaped the advancement of LLMs and the trend toward more versatile models.

28 January 2025

LLM API Full Guide

Nova, Llama, Mistral, DeepSeek, and Gemini

Amazon Nova

  • Focus on Practicality: real-world applications balancing accuracy, speed, and cost-effectiveness
  • Multimodal Capabilities: strong across data types
  • Cost-Effectiveness: models that are affordable and highly performant
  • Weakness: a new model family whose full performance benchmarks are yet to be determined
  • When to use: practical applications, cost-effectiveness, and strong multimodal capabilities within the Amazon ecosystem (a minimal call sketch follows)
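A minimal sketch of calling a Nova model through the Amazon Bedrock Converse API with boto3. The model ID below (amazon.nova-lite-v1:0) is an assumption; check the Bedrock model catalog for the IDs available in your region.

```python
import boto3

# Bedrock runtime client; credentials and region come from your AWS config.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="amazon.nova-lite-v1:0",  # assumed ID -- verify in the Bedrock catalog
    messages=[{"role": "user", "content": [{"text": "Summarize what Amazon Nova is."}]}],
    inferenceConfig={"maxTokens": 256, "temperature": 0.5},
)
print(response["output"]["message"]["content"][0]["text"])
```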

Llama

  • Open Source: useful for research and innovation as a way to access and build upon the model
  • Strong Performance: consistently high performance across benchmarks
  • Large Community: growing community of users and contributors and has been used across applications
  • Weakness: potential for misuse for generating harmful content and malicious activities
  • When to use: open source, a balance of performance and accessibility, extensible (a minimal local-inference sketch follows)
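Because the Llama weights are openly available, a common path is local inference with Hugging Face transformers. The checkpoint name below is an example; Meta's repos are gated, so you must accept the license on Hugging Face and authenticate before downloading.

```python
from transformers import pipeline

# Example checkpoint; requires accepting Meta's license on Hugging Face.
generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain decoder-only models in one sentence."}]
output = generator(messages, max_new_tokens=80)
# With chat-style input, generated_text holds the conversation; the
# last message is the model's reply.
print(output[0]["generated_text"][-1]["content"])
```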

Mistral

  • High Performance: strong benchmark results, beating many competitor models
  • Focus on Safety: Strong emphasis on safety and bias mitigation
  • Efficiency: performance with efficiency in computational resources
  • Weakness: full performance benchmarks are yet to be determined
  • When to use: high performance, safety, and efficiency (a minimal call sketch follows)
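A minimal sketch using the official mistralai Python SDK (v1-style client); the model name is an example from Mistral's catalog.

```python
from mistralai import Mistral

# v1-style SDK client; model name is an example.
client = Mistral(api_key="YOUR_MISTRAL_API_KEY")

response = client.chat.complete(
    model="mistral-small-latest",
    messages=[{"role": "user", "content": "Give one tip for prompt design."}],
)
print(response.choices[0].message.content)
```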

Gemini

  • Advanced Capabilities: cutting-edge capabilities in reasoning, code generation, and multimodal understanding
  • Strong Backing: support from Google that provides significant dedicated resources and expertise
  • Weakness: only available through Google services, which limits accessibility and flexibility for independent developers and researchers; prone to hallucinated responses; threads of related responses can be brittle
  • When to use: advanced capabilities within the Google ecosystem of services (a minimal call sketch follows)
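A minimal sketch using the google-generativeai Python SDK; the model name is an example and newer Gemini releases may supersede it.

```python
import google.generativeai as genai

# Model name is an example; newer Gemini versions may replace it.
genai.configure(api_key="YOUR_GOOGLE_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

response = model.generate_content("Describe multimodal models in two sentences.")
print(response.text)
```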

DeepSeek

  • High Performance: state-of-the-art performance on benchmarks and surpassing many proprietary models
  • Open source: Built with an open community in mind for continuous innovation and development
  • Focus on Reasoning: strong in reasoning tasks, understanding, and solving complex problems
  • Weakness: relatively new model and full performance benchmarks are yet to be determined
  • When to use: open source, high performance, strong reasoning, extensible (a minimal call sketch follows)
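DeepSeek's hosted API is OpenAI-compatible, so the standard openai client can point at its endpoint. The base URL and model name below follow DeepSeek's public docs but are worth double-checking.

```python
from openai import OpenAI

# OpenAI-compatible endpoint; verify base URL and model name against
# DeepSeek's current documentation.
client = OpenAI(api_key="YOUR_DEEPSEEK_API_KEY", base_url="https://api.deepseek.com")

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Walk through 17 * 24 step by step."}],
)
print(response.choices[0].message.content)
```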

Weakness of Gemini Compared To DeepSeek

Cost: Gemini is more expensive for large-scale applications, with a proprietary license and higher compute requirements. DeepSeek, being open source, offers a more cost-effective and accessible option.

Multimodal Capabilities: Gemini is strong at multimodal tasks, while DeepSeek focuses more on text-based reasoning. Gemini is more versatile across images, audio, video, and text, though DeepSeek has recently extended its capabilities with Janus-Pro to enable multimodal AI.

Community and Support: Gemini comes with Google's large backing, community, and support, whereas DeepSeek is relatively new with a smaller community of users.

Maturity: Gemini has had a longer development history. DeepSeek is still under development, draws heavily on reengineering existing work, and lacks the same level of refinement.

An open source model tends to attract a larger community of users because of its accessibility and a fast-moving, community-driven development cycle. In many cases, an open source model may outpace its proprietary counterpart. DeepSeek is showing multimodal strengths and may eventually outshine competing models; it has already challenged the competition with some serious capabilities.