
Monday, December 8, 2025

AI’s Next Phase -- Specialization, Scaling, and the Coming Agent Platform Wars -- Mistral 3 vs DeepSeek 3.2 vs Claude Opus 4.5


See All Articles on AI

As 2025 comes to a close, the AI world is doing the opposite of slowing down. In just a few weeks, we’ve seen three major model launches from different labs:

  • Mistral 3

  • DeepSeek 3.2

  • Claude Opus 4.5

All three are strong. None are obviously “bad.” That alone is a big shift from just a couple of years ago, when only a handful of labs could credibly claim frontier-level models.

But the interesting story isn’t just that everything is good now.

The real story is this:

AI is entering a phase where differentiation comes from specialization and control over platforms, not just raw model quality.

We can see this in three places:

  1. How Mistral, DeepSeek, and Anthropic are carving out different strengths.

  2. How “scaling laws” are quietly becoming “experimentation laws.”

  3. How Amazon’s move against ChatGPT’s shopping agent signals an emerging platform war around agents.

Let’s unpack each.


1. Mistral vs. DeepSeek vs. Claude: When Everyone Is Good, What Makes You Different?

On paper, the new Mistral and DeepSeek releases look like they’re playing the same game: open models, strong benchmarks, competitive quality.

Under the hood, they’re leaning into very different philosophies.

DeepSeek 3.2: Reasoning and Sparse Attention for Agents

DeepSeek has become synonymous with novel attention mechanisms and high-efficiency large models. The 3.2 release extends that trend with:

  • Sparse attention techniques that help big models run more efficiently (a toy sketch appears at the end of this subsection).

  • A strong emphasis on reasoning-first performance, especially around:

    • Tool use

    • Multi-step “agentic” workflows

    • Math and code-heavy tasks

If you squint, DeepSeek is trying to be “the reasoning lab”:

If your workload is complex multi-step thinking with tools, we want to be your default.
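
To make the sparse-attention idea concrete, here's a toy sketch of top-k sparse attention in NumPy: each query's output depends on only its k highest-scoring keys rather than the whole sequence. This is an illustrative simplification, not DeepSeek's actual mechanism; all sizes are invented.

```python
import numpy as np

def topk_sparse_attention(Q, K, V, k=4):
    """Toy top-k sparse attention: each query's output depends on only its k
    highest-scoring keys. (For clarity this toy still computes the full score
    matrix; real sparse-attention kernels avoid doing that.)"""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # (n_queries, n_keys) raw attention scores
    # Keep only each query's top-k scores; mask the rest out before softmax.
    kth = np.partition(scores, -k, axis=-1)[:, -k:].min(axis=-1, keepdims=True)
    masked = np.where(scores >= kth, scores, -np.inf)
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # (n_queries, d) attended values

# Tiny example: 8 tokens, 16-dim heads, each query attends to only 4 keys.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(8, 16)) for _ in range(3))
print(topk_sparse_attention(Q, K, V, k=4).shape)  # (8, 16)
```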

Mistral 3: Simple Transformer, Strong Multimodality, Open Weights

Mistral takes almost the opposite architectural route.

  • No flashy linear attention.

  • No wild new topology.

  • Just a dense, relatively “plain” transformer — tuned very well.

The innovation is in how they’ve packaged the lineup:

  • Multimodal by default across the range, including small models.

  • You can run something like Mistral 3B locally and still get solid vision + text capabilities (a minimal local-inference sketch appears at the end of this subsection).

  • That makes small, on-device, multimodal workflows actually practical.

The message from Mistral is:

You don’t need a giant proprietary model to do serious multimodal work. You can self-host it, and it’s Apache 2.0 again, not a bespoke “research-only” license.
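
As a rough illustration of what "small, local, and multimodal" looks like in practice, here's a minimal sketch using Hugging Face's image-text-to-text pipeline (available in recent transformers releases). The model ID is a placeholder, not a confirmed Mistral 3 checkpoint name, and the exact message and output formats can vary by library version.

```python
# Minimal local vision+text sketch via Hugging Face transformers.
# The model ID is a placeholder, not a confirmed Mistral 3 checkpoint name.
from transformers import pipeline

MODEL_ID = "mistralai/<small-multimodal-checkpoint>"  # hypothetical placeholder

vlm = pipeline("image-text-to-text", model=MODEL_ID)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/receipt.jpg"},
            {"type": "text", "text": "Summarize the line items on this receipt."},
        ],
    }
]

# Output structure depends on the transformers version; print it to inspect.
result = vlm(text=messages, max_new_tokens=200)
print(result)
```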

Claude Opus 4.5: From Assistant to Digital Worker

Anthropic’s Claude Opus 4.5 sits more on the closed, frontier side of the spectrum. Its differentiation isn’t just capabilities, but how it behaves as a collaborator.

A few emerging themes:

  • Strong focus on software engineering, deep code understanding, and long-context reasoning.

  • A growing sense of “personality continuity”:

    • Users report the model doing natural “callbacks” to earlier parts of the conversation.

    • It feels less like a stateless chat and more like an ongoing working relationship.

  • Framed by Anthropic as more of a “digital worker” than a simple assistant:

    • Read the 200-page spec.

    • Propose changes.

    • Keep state across a long chain of tasks.

If DeepSeek is leaning into reasoning, and Mistral into open multimodal foundations, Claude is leaning into:

“Give us your workflows and we’ll embed a digital engineer into them.”

The Big Shift: Differentiation by Domain, Not Just Quality

A few years ago, the question was: “Which model is the best overall?”

Now the better question is:

“Best for what?”

  • Best for local multimodal tinkering? Mistral is making a strong case.

  • Best for tool-heavy reasoning and math/code? DeepSeek is aiming at that.

  • Best for enterprise-grade digital teammates? Claude wants that slot.

This is how the “no moat” moment is resolving:
When everyone can make a good general model, you specialize by domain and workflow, not just by raw benchmark scores.


2. Are Scaling Laws Still a Thing? Or Are We Just Scaling Experimentation?

A recent blog post from VC Tomasz Tunguz reignited debate about scaling laws. His claim, paraphrased: Gemini 3 shows that the old scaling laws are still working—with enough compute, we still get big capability jumps.

There’s probably some truth there, but the nuance matters.

Scaling Laws, the Myth Version

The “myth” version of scaling laws goes something like:

“Make the model bigger. Feed it more data. Profit.”

If that were the full story, only the labs with the most GPUs (or TPUs) would ever meaningfully advance the frontier. Google, with deep TPU integration, is the clearest example: it has “the most computers that ever computed” and the tightest hardware–software stack.

But that’s not quite what seems to be happening.

What’s Really Scaling: Our Ability to Experiment

With Gemini 3, Google didn’t massively increase parameters relative to Gemini 1.5. The improvements likely came from:

  • Better training methods

  • Smarter data curation and filtering

  • Different mixtures of synthetic vs human data

  • Improved training schedules and hyperparameters

In other words, the action is shifting from:

“Make it bigger” → to → “Train it smarter.”

The catch?
Training smarter still requires a lot of room to experiment. When:

  • One full-scale training run costs millions of dollars, and

  • Takes weeks or months,

…you can’t explore the space of training strategies very thoroughly. There’s a huge hyperparameter and design space we’ve barely touched, simply because it’s too expensive to try things.

That leads to a more realistic interpretation:

Scaling laws are quietly turning into experimentation laws.

The more compute you have, the more experiments you can run on:

  • architecture

  • training data

  • curricula

  • optimization tricks

…and that’s what gives you better models.

From this angle, Google’s big advantage isn’t just size—it’s iteration speed at massive scale. As hardware gets faster, what really scales is how quickly we can search for better training strategies.
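
A back-of-the-envelope illustration of that point, with entirely invented numbers: under a fixed budget, every factor-of-N drop in the cost of a full training run buys a factor-of-N more experiments you can afford to run.

```python
# Toy "experimentation law": experiments affordable under a fixed budget.
# All dollar figures are invented for illustration.

def runs_affordable(total_budget_usd: float, cost_per_run_usd: float) -> int:
    """How many full training experiments fit in the budget."""
    return int(total_budget_usd // cost_per_run_usd)

BUDGET = 200_000_000  # hypothetical annual training budget in USD

for cost_per_run in (50_000_000, 10_000_000, 1_000_000):
    n = runs_affordable(BUDGET, cost_per_run)
    print(f"cost/run ${cost_per_run:>12,} -> {n:>4} experiments/year")

# cost/run $  50,000,000 ->    4 experiments/year
# cost/run $  10,000,000 ->   20 experiments/year
# cost/run $   1,000,000 ->  200 experiments/year
```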


3. Agents vs Platforms: Amazon, ChatGPT, and the New Walled Gardens

While models are getting better, a different battle is playing out at the application layer: agents.

OpenAI’s Shopping Research agent is a clear example of the agent vision:

“Tell the agent what you need. It goes out into the world, compares products, and comes back with recommendations.”

If you think “online shopping,” you think Amazon. But Amazon recently took a decisive step:
It began blocking ChatGPT’s shopping agent from accessing product detail pages, review data, and deals.

Why Would Amazon Block It?

You don’t need a conspiracy theory to answer this. A few obvious reasons:

  • Control over the funnel
    Amazon doesn’t want a third-party agent sitting between users and its marketplace.

  • Protection of ad and search economics
    Product discovery is where Amazon makes a lot of money.

  • They’re building their own AI layers
    With things like Alexa+ and Rufus, Amazon wants its own assistants to be the way you shop.

In effect, Amazon is saying:

“If you want to shop here, you’ll use our agent, not someone else’s.”

The Deeper Problem: Agents Need an Open Internet, but the Internet Is Not Open

Large-language-model agents rely on a simple assumption:

“They can go out and interact with whatever site or platform is needed on your behalf.”

But the reality is:

  • Cloudflare has started blocking AI agents by default.

  • Amazon is blocking shopping agents.

  • Many platforms are exploring paywalls or tollbooths for automated access.

So before we hit technical limits on what agents can do, we’re hitting business limits on where they’re allowed to go.
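
Mechanically, much of this gating runs through the same machinery that has governed crawlers for decades: robots.txt rules, user-agent policies, and edge-level bot filtering. Here's a minimal sketch (Python standard library only) of the check a well-behaved agent would make before fetching a page; the agent name and URLs are made-up examples.

```python
# Minimal robots.txt check a well-behaved agent should perform before fetching.
# Uses only the Python standard library; the agent name and URLs are examples.
from urllib import robotparser

AGENT_UA = "ExampleShoppingAgent/1.0"  # hypothetical agent user-agent string
TARGET = "https://www.example-store.com/dp/B000000000"

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example-store.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

if rp.can_fetch(AGENT_UA, TARGET):
    print("robots.txt permits this fetch for", AGENT_UA)
else:
    print("robots.txt disallows this fetch; a compliant agent should stop here")
```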

It raises an uncomfortable question:

Can we really have a “universal agent” if every major platform wants to be its own closed ecosystem?

Likely Outcome: Agents Become the New Apps

The original dream:

  • One personal agent

  • Talks to every service

  • Does everything for you across the web

The likely reality:

  • You’ll have a personal meta-agent, but it will:

    • Call Amazon’s agent for shopping

    • Call your bank’s agent for finance

    • Call your airline’s agent for travel

  • Behind the scenes, this will look less like a single unified agent and more like:

    “A multi-agent OS for your life, glued together by your personal orchestrator.”

In other words, we may not be escaping the “app world” so much as rebuilding it with agents instead of apps.
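
Here's a rough sketch of what that orchestration layer could look like in code. Every class, agent, and routing rule below is invented for illustration; a real system would use an LLM-based classifier and platform-sanctioned agent APIs rather than keyword matching.

```python
# Toy sketch of a personal "meta-agent" routing requests to platform-owned agents.
# All agent names and the routing logic are invented for illustration.
from typing import Protocol


class DomainAgent(Protocol):
    def handle(self, request: str) -> str: ...


class ShoppingAgent:
    def handle(self, request: str) -> str:
        return f"[retailer's agent] searching its own catalog for: {request}"


class TravelAgent:
    def handle(self, request: str) -> str:
        return f"[airline's agent] checking fares for: {request}"


class PersonalOrchestrator:
    """Classifies a request and delegates to the matching domain agent."""

    def __init__(self) -> None:
        self.routes: dict[str, DomainAgent] = {
            "shopping": ShoppingAgent(),
            "travel": TravelAgent(),
        }

    def route(self, request: str) -> str:
        # A real orchestrator would use an LLM classifier; keywords keep this simple.
        domain = "travel" if "flight" in request.lower() else "shopping"
        return self.routes[domain].handle(request)


if __name__ == "__main__":
    orchestrator = PersonalOrchestrator()
    print(orchestrator.route("Find a budget mechanical keyboard"))
    print(orchestrator.route("Book a flight to Lisbon in March"))
```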


The Big Picture: What Phase Are We Entering?

If you zoom out, these threads are connected:

  1. Models are converging on “good enough,” so labs specialize by domain and workflow.

  2. Scaling is shifting from “make it bigger” to “let us run more experiments on architectures, data, and training.”

  3. Agents are bumping into platform economics and control, not just technical feasibility.

Put together, it suggests we’re entering a new phase:

From the Open Frontier Phase → to the Specialization and Platform Phase.

  • Labs will succeed by owning specific domains and developer workflows.

  • The biggest performance jumps may come from training strategy innovation, not parameter count.

  • Agent ecosystems will reflect platform power struggles as much as technical imagination.

The excitement isn’t going away. But the rules of the game are changing—from who can train the biggest model to who can:

  • Specialize intelligently

  • Experiment fast

  • Control key platforms

  • And still give users something that feels like a single, coherent AI experience.

That’s the next frontier.

Tags: Artificial Intelligence,Technology,

Where We Stand on AGI: Latest Developments, Numbers, and Open Questions


See All Articles on AI

Executive summary (one line)

Top models have made rapid, measurable gains (e.g., GPT‑5 reported around 50–70% on several AGI-oriented benchmarks), but persistent, hard-to-solve gaps — especially durable continual learning, robust multimodal world models, and reliable truthfulness — mean credible AGI timelines still range from a few years (for narrow definitions) to several decades (for robust human‑level generality). Numbers below are reported by labs and studies; where results come from internal tests or single groups I flag them as provisional.

Quick snapshot of major recent headlines

  • OpenAI released GPT‑5 (announced Aug 7, 2025) — presented as a notable step up in reasoning, coding and multimodal support (press release and model paper reported improvements).
  • Benchmarks and expert studies place current top models roughly “halfway” to some formal AGI definitions: a ten‑ability AGI framework reported GPT‑4 at 27% and GPT‑5 at 57% toward its chosen AGI threshold (framework authors’ reported scores).
  • Some industry/academic reports and panels (for example, an MIT/Arm deep dive) warn AGI‑like systems might appear as early as 2026; other expert surveys keep median predictions later (many 50%‑probability dates clustered around 2040–2060).
  • Policy and geopolitics matter: RAND (modeling reported Dec 1, 2025) frames the US–China AGI race as a prisoner’s dilemma — incentives favor speed absent stronger international coordination and verification.

Methods and definitions (short)

What “AGI score” means here: this article draws on several benchmarking frameworks that combine multiple task categories (reasoning, planning, perception, memory, tool use). Each framework weights abilities differently and maps aggregate performance to a 0–100% scale relative to an internal "AGI threshold" chosen by its authors. These mappings are normative — not universally agreed — so percentages should be read as framework‑specific progress indicators, not absolute measures of human‑level general intelligence.
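
As a minimal illustration of the kind of aggregation these frameworks perform (the ability names, weights, and scores below are invented, not any framework's actual values):

```python
# Toy version of a weighted "AGI score": ability-level scores (0-1) are combined
# with framework-chosen weights into a single 0-100% figure.
# All ability names, weights, and scores below are invented.

weights = {"reasoning": 0.2, "planning": 0.2, "perception": 0.2,
           "memory": 0.2, "tool_use": 0.2}

model_scores = {"reasoning": 0.7, "planning": 0.6, "perception": 0.5,
                "memory": 0.4, "tool_use": 0.9}

agi_score = 100 * sum(weights[a] * model_scores[a] for a in weights)
print(f"framework-specific AGI score: {agi_score:.0f}%")  # 62% with these toy numbers
```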

Provenance notes: I flag results as (a) published/peer‑reviewed, (b) public benchmark results, or (c) reported/internal tests by labs. Where items are internal or single‑lab reports they are provisional and should be independently verified before being used as firm evidence.

Benchmarks and headline numbers (compact table)

Benchmark | What it measures | Model / Score | Human baseline / Notes | Source type
--- | --- | --- | --- | ---
Ten‑ability AGI framework | Aggregate across ~10 cognitive abilities | GPT‑4: 27% · GPT‑5: 57% | Framework‑specific AGI threshold (authors' mapping) | Reported framework scores (authors)
SPACE (visual reasoning subset) | Visual reasoning tasks (subset) | GPT‑4o: 43.8% · GPT‑5 (Aug 2025): 70.8% | Human average: 88.9% | Internal/public benchmark reports (reported)
MindCube | Spatial / working‑memory tests | GPT‑4o: 38.8% · GPT‑5: 59.7% | Still below typical human average | Benchmark reports (reported)
SimpleQA | Hallucination / factual accuracy | GPT‑5: hallucinations in >30% of questions (reported) | Some other models (e.g., Anthropic Claude variants) report lower hallucination rates | Reported / model vendor comparisons
METR endurance test | Sustained autonomous task performance | GPT‑5.1‑Codex‑Max: ~2 hours 42 minutes · GPT‑4: a few minutes | Measures autonomous chaining and robustness over time | Internal lab test (provisional)
IMO 2025 (DeepMind Gemini, "Deep Think" mode) | Formal math problem solving under contest constraints | Solved 5 of 6 problems within 4.5 hours (gold‑level performance reported) | Shows strong formal reasoning in a constrained task | Reported by DeepMind (lab result)

Where models still struggle (the real bottlenecks)

  • Continual learning / long‑term memory: Most models remain effectively "frozen" after training; reliably updating and storing durable knowledge over weeks/months remains unsolved and is widely cited as a high‑uncertainty obstacle.
  • Multimodal perception (vision & world models): Text and math abilities have improved faster than visual induction and physical‑world modeling; visual working memory and physical plausibility judgments still lag humans.
  • Hallucinations and reliable retrieval: High‑confidence errors persist (SimpleQA >30% hallucination reported for GPT‑5 in one test); different model families show substantial variance.
  • Low‑latency tool use & situated action: Language is fast; perception‑action loops and real‑world tool use (robotics) remain harder and slower.

How researchers think we’ll get from here to AGI

Two broad routes dominate discussion:

  1. Scale current methods: Proponents argue more parameters, compute and better data will continue yielding returns. Historical training‑compute growth averaged ~4–5×/year (with earlier bursts up to ~9×/year until mid‑2020).
  2. New architectures / breakthroughs: Others (e.g., prominent ML researchers) argue scaling alone won’t close key gaps and that innovations (robust world models, persistent memory systems, tighter robotics integration) are needed.

Compute projections vary: one analysis (Epoch AI) suggested training budgets up to ~2×10^29 FLOPs could be feasible by 2030 under optimistic assumptions; other reports place upper bounds near ~3×10^31 FLOPs depending on power and chip production assumptions.
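
As a rough sanity check on where numbers of that order come from: compounding the ~4–5×/year growth noted above for five years, starting from a present-day frontier run on the order of 10^26 FLOPs, lands in the same ballpark. The baseline is an assumption for illustration, not a figure from the cited analysis.

```python
# Back-of-the-envelope compounding check; the 1e26 baseline is an assumption,
# not a figure from the cited Epoch AI analysis.
baseline_flops_2025 = 1e26   # assumed order of magnitude for a frontier run today
growth_per_year = 4.5        # midpoint of the reported ~4-5x/year trend
years = 5                    # 2025 -> 2030

projected = baseline_flops_2025 * growth_per_year ** years
print(f"projected 2030 training run: ~{projected:.1e} FLOPs")  # ~1.8e+29
```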

Timelines: why predictions disagree

Different metrics, definitions and confidence levels drive wide disagreement. Aggregated expert surveys show medians often in the 2040–2060 range, while some narrow frameworks and industry estimates give earlier dates (one internal framework estimated 50% by end‑2028 and 80% by end‑2030 under its assumptions). A minority of experts and some industry reports have suggested AGI‑like capabilities could appear as early as 2026. When using these numbers, note the underlying definition of AGI, which benchmark(s) are weighted most heavily, and whether the estimate is conditional on continued scaling or a specific breakthrough.

Risks, governance and geopolitics

  • Geopolitics: RAND models (Dec 1, 2025 reporting) show a prisoner’s dilemma: nations face incentives to accelerate unless international verification and shared risk assessments improve.
  • Security risks: Reports warn of misuse (e.g., advances in bio‑expertise outputs), espionage, and supply‑chain chokepoints (chip export controls and debates around GPU access matter for pace of progress).
  • Safety strategies: Proposals range from technical assurance and transparency to verification regimes and deterrence ideas; all face verification and observability challenges.
  • Ethics and law: Active debates continue over openness, liability, and model access control (paywalls vs open releases).

Bottom line for students (and what to watch)

Progress is real and measurable: top models now match or beat humans on many narrow tasks, have larger context windows, and can sustain autonomous code writing for hours in some internal tests. But key human‑like capacities — durable continual learning, reliable multimodal world models, and trustworthy factuality — remain outstanding. Timelines hinge on whether these gaps are closed by continued scaling, a single breakthrough (e.g., workable continual learning), or new architectures. Policy and safety research must accelerate in parallel.

Watch these signals: AGI‑score framework updates, SPACE / IntPhys / MindCube / SimpleQA benchmark results, compute growth analyses (e.g., Epoch AI), major model releases (GPT‑5 and successors), METR endurance reports, and policy studies like RAND’s — and when possible, prioritize independently reproducible benchmark results over single‑lab internal tests.

References and sources (brief)

  • OpenAI GPT‑5 announcement — Aug 7, 2025 (model release/press materials; reported performance claims).
  • Ten‑ability AGI framework — authors’ reported scores for GPT‑4 (27%) and GPT‑5 (57%) (framework paper/report; framework‑specific mapping to AGI threshold).
  • SPACE visual reasoning subset results — reported GPT‑4o 43.8%, GPT‑5 (Aug 2025) 70.8%, human avg 88.9% (benchmark report / lab release; flagged as reported/internal where applicable).
  • MindCube spatial/working‑memory benchmark — reported GPT‑4o 38.8%, GPT‑5 59.7% (benchmark report).
  • SimpleQA factuality/hallucination comparison — GPT‑5 reported >30% hallucination rate; other models (Anthropic Claude variants) report lower rates (vendor/benchmark reports).
  • METR endurance test — reported GPT‑5.1‑Codex‑Max sustained autonomous performance ~2 hours 42 minutes vs GPT‑4 few minutes (internal lab test; provisional).
  • DeepMind Gemini (’Deep Think’ mode) — reported solving 5 of 6 IMO 2025 problems within 4.5 hours (DeepMind report; task‑constrained result).
  • Epoch AI compute projection — suggested ~2×10^29 FLOPs feasible by 2030 under some assumptions; other reports give upper bounds up to ~3×10^31 FLOPs (compute projection studies).
  • RAND modeling of US–China race — reported Dec 1, 2025 (prisoner’s dilemma framing; policy analysis report).
  • Expert surveys and timeline aggregates — multiple surveys report medians often in 2040–2060 with notable variance (survey meta‑analyses / aggregated studies).

Notes: Where a result was described in the original draft as coming from “internal tests” or a single lab, I preserved the claim but flagged it above as provisional and recommended independent verification. For any use beyond classroom discussion, consult the original reports and benchmark datasets to confirm methodology, sample sizes, dates and reproducibility.

Tags: Artificial Intelligence,Technology,

Sunday, December 7, 2025

Model Alert... World Labs launched Marble -- Generated, Editable Virtual Spaces

See All on AI Model Releases

Generated, Editable Virtual Spaces

 

Models that generate 3D spaces typically render them on the fly as users move through them, without producing a persistent world that can be explored later. A new model produces 3D worlds that can be exported and modified.

 

What’s new: World Labs launched Marble, which generates persistent, editable, reusable 3D spaces from text, images, and other inputs. The company also debuted Chisel, an integrated editor that lets users modify Marble’s output via text prompts and craft spaces from scratch.

  • Input/output: Text, images, panoramas, videos, 3D layouts of boxes and planes in; Gaussian splats, meshes, or videos out.
  • Features: Expand spaces, combine spaces, alter visual style, edit spaces via text prompts or visual inputs, download generated spaces
  • Availability: Subscription tiers include Free (4 outputs based on text, images, or panoramas), $20 per month (12 outputs based on multiple images, videos, or 3D layouts), $35 per month (25 outputs with expansion and commercial rights), and $95 per month (75 outputs, all features)

How it works: Marble accepts several media types and exports 3D spaces in a variety of formats.

  • The model can generate a 3D space from a single text prompt or image. For more control, it accepts multiple images with text prompts (like front, back, left, or right) that specify which image should map to what areas. Users can also input short videos, 360-degree panoramas, or 3D models and connect outputs to build complex spaces.
  • The Chisel editor can create and edit 3D spaces directly. Geometric shapes like planes or blocks can be used to build structural elements like walls or furniture and styled via text prompts or images.
  • Generated spaces can be extended or connected by clicking on the relevant area.
  • Model outputs can be Gaussian splats (high-quality representations composed of semi-transparent particles that can be rendered in web browsers), collider meshes (simplified 3D geometries that define object boundaries for physics simulations), and high-quality meshes (detailed geometries suitable for editing). Video output can include controllable camera paths and effects like smoke or flowing water.
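
For readers unfamiliar with the format, a Gaussian-splat scene is essentially a long array of simple primitives. Here's a simplified sketch of one such primitive; real exporters store more attributes (for example, spherical-harmonic color coefficients).

```python
# Simplified sketch of a single 3D Gaussian splat primitive.
# Real formats store more (e.g., spherical-harmonic color coefficients).
from dataclasses import dataclass

@dataclass
class GaussianSplat:
    position: tuple[float, float, float]          # center of the Gaussian in world space
    scale: tuple[float, float, float]             # per-axis extent of the ellipsoid
    rotation: tuple[float, float, float, float]   # orientation as a quaternion
    color: tuple[float, float, float]             # base RGB color
    opacity: float                                # 0 = invisible, 1 = fully opaque

# A scene is a long list of these, rendered with depth sorting and alpha blending.
scene = [
    GaussianSplat((0.0, 1.0, -2.0), (0.1, 0.1, 0.1), (0, 0, 0, 1), (0.8, 0.2, 0.2), 0.9),
]
print(len(scene), "splat(s) in the toy scene")
```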

Performance: Early users report generating game-like environments and photorealistic recreations of real-world locations.

  • Marble generates more complete 3D structures than depth maps or point clouds, which represent surfaces but not object geometries, World Labs said.
  • Its mesh outputs integrate with tools commonly used in game development, visual effects, and 3D modeling.

Behind the news: Earlier generative models can produce 3D spaces on the fly, but typically such spaces can’t be saved or revisited interactively; in October, for instance, World Labs introduced RTFM, which generates spaces in real time as users navigate through them. Competing systems from startups like Decart and Odyssey are available as demos, and Google’s Genie 3 remains a research preview. Marble stands out by generating spaces that can be saved and edited.

 

Why it matters: World Labs founder and Stanford professor Fei-Fei Li argues that spatial intelligence — understanding how physical objects occupy and move through space — is a key aspect of intelligence that language models can’t fully address. With Marble, World Labs aspires to catalyze development in spatial AI just as ChatGPT and subsequent large language models ignited progress in text processing.

 

We’re thinking: Virtual spaces produced by Marble are geometrically consistent, which may prove valuable in gaming, robotics, and virtual reality. However, the objects within them are static. Virtual worlds that include motion will bring AI even closer to understanding physics.

 

Tags: AI Model Alert,Artificial Intelligence,Technology,

Model Alert... Open 3D Generation Pipeline -- Meta’s Segment Anything Model (SAM) image-segmentation model

See All on AI Model Releases

Open 3D Generation Pipeline

 

Meta’s Segment Anything Model (SAM) has evolved from an image-segmentation model into an open-weights suite for generating 3D objects. SAM 3 segments images, SAM 3D turns the segments into 3D objects, and SAM 3D Body produces 3D figures of any people among the segments. You can experiment with all three.
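
The division of labor is easiest to see as a pipeline. Here's a pseudocode-style sketch of how the three stages chain together; the function names are placeholders for illustration, not Meta's published APIs.

```python
# Pseudocode-style sketch of chaining the three models; the function names are
# placeholders for illustration, not Meta's published APIs.

def segment_with_sam3(image, text_prompt):
    """Return segmentation masks for objects matching the text prompt."""
    raise NotImplementedError("placeholder for SAM 3 inference")

def lift_to_3d_with_sam3d(image, mask):
    """Return a 3D object (mesh / Gaussian splat) for one segmented region."""
    raise NotImplementedError("placeholder for SAM 3D inference")

def reconstruct_people_with_sam3d_body(image, masks):
    """Return 3D human figures for any people among the segments."""
    raise NotImplementedError("placeholder for SAM 3D Body inference")

def image_to_3d_scene(image, text_prompt="all objects"):
    masks = segment_with_sam3(image, text_prompt)                # stage 1: 2D segments
    objects = [lift_to_3d_with_sam3d(image, m) for m in masks]   # stage 2: 3D objects
    people = reconstruct_people_with_sam3d_body(image, masks)    # stage 3: 3D humans
    return objects, people
```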

 

SAM 3: SAM 3 now segments images and videos based on input text. It retains the ability to segment objects based on input geometry (bounding boxes or points that are labeled to include or exclude the objects at those locations), like the previous version. 

  • Input/output: Images, video, text, geometry in; segmented images or video out
  • Performance: In Meta’s tests, SAM 3 outperformed almost all competitors on a variety of benchmarks that test image and video segmentation. For instance, on LVIS (segmenting objects from text), SAM 3 (48.5 percent average precision) outperformed DINO-X (38.5 percent average precision). It fell behind APE-D (53.0 percent average precision), which was trained on LVIS’ training set. 
  • Availability: Weights and fine-tuning code freely available for noncommercial and commercial uses in countries that don’t violate U.S., EU, UK, and UN trade restrictions under Meta license 

SAM 3D: This model generates 3D objects from images based on segmentation masks. By individually predicting each object in an image, it can represent the entire scene. It can also take in point clouds to improve its output.

  • Input/output: Image, mask, point cloud in; 3D object (mesh, Gaussian splat) out
  • Performance: Judging both objects and scenes generated from photos, humans preferred SAM 3D’s outputs over those by other models. For instance, when generating objects from the LVIS dataset, people preferred SAM 3D nearly 80 percent of the time, Hunyuan3D 2.0 about 12 percent of the time, and other models 8 percent of the time.
  • Availability: Weights and inference code freely available for noncommercial and commercial uses in countries that don’t violate U.S., EU, UK, and UN trade restrictions under Meta license

SAM 3D Body: Meta released an additional model that produces 3D human figures from images. Input bounding boxes or masks can also determine which figures to produce, and an optional transformer decoder can refine the positions and shapes of human hands.

  • Input/output: Image, bounding boxes, masks in; 3D objects (mesh, Gaussian splat) out
  • Performance: In Meta’s tests, SAM 3D Body achieved the best performance across a number of datasets compared to other models that take images or videos and generate 3D human figures. For example, on the EMDB dataset of people in the wild, SAM 3D Body achieved a Mean Per Joint Position Error (MPJPE, a measure of how far predicted joint positions are from the ground truth; lower is better) of 62.9, compared to the next-best Neural Localizer Fields at 68.4 (a short sketch of how MPJPE is computed follows this list). On FreiHAND (a test of hand correctness), SAM 3D Body achieved similar or slightly worse performance than models that specialize in estimating hand poses. (The authors claim the other models were trained on FreiHAND’s training set.)
  • Availability: Weights, inference code, and training data freely available in countries that don’t violate U.S., EU, UK, and UN trade restrictions under Meta license
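
Since MPJPE figures like the one above can be hard to interpret, here is a tiny sketch of how the metric is computed. The joint coordinates are toy values; real evaluations typically align a root joint first and report the result in millimeters.

```python
import numpy as np

def mpjpe(predicted: np.ndarray, ground_truth: np.ndarray) -> float:
    """Mean Per Joint Position Error: average Euclidean distance between
    predicted and ground-truth joint positions (lower is better)."""
    return float(np.linalg.norm(predicted - ground_truth, axis=-1).mean())

# Toy example: 3 joints in 3D; units are arbitrary here (benchmarks use mm).
pred = np.array([[0.0, 0.0, 0.0], [1.1, 0.0, 0.0], [2.0, 0.2, 0.0]])
gt   = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
print(f"MPJPE = {mpjpe(pred, gt):.3f}")  # (0 + 0.1 + 0.2) / 3 = 0.100
```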

Why it matters: This SAM series offers a unified pipeline for making 3D models from images. Each model advances the state of the art, enabling more-accurate image segmentations from text, 3D objects that human judges preferred, and 3D human figures that also appealed to human judges. These models are already driving innovations in Meta’s user experience. For instance, SAM 3 and SAM 3D enable users of Facebook Marketplace to see what furniture or other home decor looks like in a particular space.

 

We’re thinking:  At the highest level, all three models learned from a similar data pipeline: Find examples the model currently performs poorly on, use humans to annotate them, and train on the annotations. According to Meta’s publications, this process greatly reduced the time and money required to annotate quality datasets.

 

Tags: Technology,Artificial Intelligence,AI Model Alert,

Model Alert... Ernie -- Baidu’s Multimodal Bids

See All on AI Model Releases

Baidu’s Multimodal Bids

 

Baidu debuted two models: a lightweight, open-weights, vision-language model and a giant, proprietary, multimodal model built to take on U.S. competitors.

 

Ernie-4.5-VL-28B-A3B-Thinking: Baidu’s new open-weights model is based on the earlier Ernie-4.5-21B-A3B-Thinking, a text-only MoE reasoning model, plus a 7 billion-parameter vision encoder to process images. It outperforms comparable and larger models on visual reasoning tasks. It can extract on-screen text and analyze videos across time, and it can call tools to zoom in on image details and search for related images.

  • Input/output: Text, image, video in (up to 128,000 tokens); text out
  • Architecture: Mixture-of-experts (MoE) transformer (28 billion parameters total, 3 billion active per token): a 21 billion-parameter language model plus the 7 billion-parameter vision encoder (a toy sketch of top-k expert routing follows this list).
  • Training: The authors used vision-language reasoning examples during mid-training, an emerging phase that typically uses mid-size datasets to sharpen distinct skills or impart domain-specific knowledge prior to fine-tuning. In addition, they fine-tuned via reinforcement learning (RL) with multimodal data. Because MoE architectures can become unstable during RL, the team used a combination of GSPO and IcePop to stabilize the fine-tuning.
  • Features: Tool use, reasoning
  • Performance: Ernie-4.5-VL-28B-A3B-Thinking competes with larger proprietary models on document understanding tasks despite activating only 3 billion parameters, Baidu said. For instance, on ChartQA (chart interpretation), Ernie-4.5-VL-28B-A3B-Thinking reached 87.1 percent accuracy, outperforming Gemini 2.5 Pro (76.3 percent) and GPT-5 set to high reasoning (78.2 percent). On OCRBench (text recognition in images), it achieved 858, ahead of GPT-5 set to high reasoning (810) but trailing Gemini 2.5 Pro (866).
  • Availability: Weights free for noncommercial and commercial uses under Apache 2.0 license via HuggingFace. API $0.14/$0.56 per million input/output tokens via Baidu Qianfan.
  • Undisclosed: Output size limit, training data, reward models
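
To make "3 billion active per token" concrete, here's a toy sketch of top-k expert routing in a mixture-of-experts layer. The sizes and routing rule are invented for illustration and are far smaller than Ernie's; the point is that only the selected experts' weights are used for any given token.

```python
import numpy as np

# Toy top-k MoE routing: each token is processed by only k of the experts,
# so most parameters stay inactive for any given token. Sizes are invented.
rng = np.random.default_rng(0)
d_model, n_experts, k = 64, 8, 2

experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]  # expert weights
router = rng.normal(size=(d_model, n_experts))                             # routing matrix

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route a single token vector through its top-k experts."""
    logits = x @ router
    top = np.argsort(logits)[-k:]                            # indices of the k best experts
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over selected experts
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

token = rng.normal(size=d_model)
out = moe_layer(token)
print(out.shape, f"-- used {k}/{n_experts} experts for this token")
```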

Ernie-5.0: Baidu describes Ernie-5.0’s approach as natively multimodal, meaning it was trained on text, images, audio, and video together rather than fusing different media encoders after training or routing inputs to specialized models. It performs comparably to the similarly multimodal Google Gemini 2.5 or OpenAI GPT-5, according to Baidu.

  • Input/output: Text, image, audio, and video in (up to 128,000 tokens); text, image, audio, video out (up to 64,000 tokens)
  • Architecture: Mixture-of-experts (MoE) transformer (2.4 trillion parameters total, less than 72 billion active per token)
  • Features: Vision-language-audio understanding, reasoning, agentic planning, tool use
  • Performance: In Baidu’s tests of multimodal reasoning, document understanding, and visual question-answering, the company reports that Ernie-5.0 matched or exceeded OpenAI GPT-5 set to high reasoning and Google Gemini 2.5 Pro. For instance, on OCRBench (text recognition in images), DocVQA (document comprehension), and ChartQA (structured data reasoning), Baidu Ernie-5.0 achieved top scores. On MM-AU (multimodal audio understanding) and TUT2017 (acoustic scene classification), it demonstrated competitive performance, Baidu said without publishing specific metrics.
  • Availability: Free web interface, API $0.85/$3.40 per million input/output tokens via Baidu Qianfan
  • Undisclosed: Training data, training methods

Yes, but: Shortly after Ernie-5.0's launch, a developer reported that the model repeatedly called tools even after being instructed not to. Baidu acknowledged the issue and said it was fixing it.

 

Why it matters: Ernie-4.5-VL-28B-A3B-Thinking offers top visual reasoning at a fraction of the cost of competing models, plus more flexibility for fine-tuning and other commercial customizations. However, the long-awaited Ernie-5.0 appears to fall short of expectations: it matches top models on some visual tasks but stops short of the forefront (which includes Qwen3-Max and Kimi-K2-Thinking) on leaderboards like LM Arena. Pretraining on text, images, video, and audio together is a relatively fresh approach that could simplify current systems that piece together different encoders and decoders for different media types.

 

We’re thinking: Ernie-5.0 may outperform Gemini 2.5 and GPT-5, but Google and OpenAI have already moved on to Gemini 3 and GPT-5.1!