Showing posts with label AI Model Alert. Show all posts

Thursday, March 5, 2026

GPT-5.4 Is Here -- and It Looks a Lot More Like a Coworker

See All on AI Model Releases

OpenAI’s newest release, GPT-5.4, feels less like a routine model update and more like a clear statement about where AI is headed next: toward real professional work.

That’s the big idea behind the launch. GPT-5.4 is being positioned as a model for people who don’t just want clever answers—they want polished spreadsheets, usable presentations, better code, stronger research, and agents that can actually move through multi-step workflows without constantly needing rescue. There’s also a GPT-5.4 Pro tier for users who want maximum performance on harder tasks.

What stands out most is not any single feature, but how many strengths have been folded into one system. GPT-5.4 combines reasoning, coding, tool use, vision, long-context handling, and computer interaction into a single model. In plain English: it’s trying to become the model you use when the task starts looking like actual work.

The benchmark story backs that up.

On GDPval, OpenAI’s benchmark for well-specified knowledge work across 44 occupations, GPT-5.4 wins or ties professionals 83.0% of the time. That’s a sizable jump from 70.9% for GPT-5.2. This is one of the most telling numbers in the release, because GDPval is not about trivia or math puzzles. It is about producing things professionals actually make: sales decks, accounting spreadsheets, schedules, diagrams, and other business deliverables.

That theme shows up again in more specialized evaluations. On internal investment banking spreadsheet modeling tasks, GPT-5.4 scores 87.3%, up from 68.4% for GPT-5.2. In presentations, human raters preferred GPT-5.4 outputs 68% of the time over GPT-5.2, citing stronger aesthetics, more visual range, and better use of generated imagery. In other words, the model is not only getting more accurate—it is getting better at making work products people would actually want to send.

Coding remains a major part of the story too. GPT-5.4 inherits the strengths of GPT-5.3-Codex and edges past it on SWE-Bench Pro, scoring 57.7% versus 56.8%. That is not an enormous leap, but it matters because GPT-5.4 is doing this while also being a broader general-purpose model. It is not just a coding specialist. On Terminal-Bench 2.0, GPT-5.4 posts 75.1%, slightly behind GPT-5.3-Codex at 77.3%, which is a useful reminder that “best overall” does not mean “best on every single benchmark.” Still, the overall message is that coding ability has been preserved while the rest of the model has grown significantly.

One of the most interesting upgrades is computer use. GPT-5.4 is the first general-purpose OpenAI model with native computer-use capabilities, meaning it can interpret screenshots, navigate interfaces, and interact with software using mouse and keyboard style actions. On OSWorld-Verified, which evaluates desktop task completion, GPT-5.4 reaches 75.0%, beating GPT-5.2’s 47.3% and even surpassing the reported human baseline of 72.4%. That is a headline-level result, because it suggests the model is no longer just advising users what to click—it can increasingly operate in digital environments itself.
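Under the hood, computer-use agents generally run an observe-act loop: capture the screen, ask the model for the next action, execute it, and repeat. The sketch below is a generic illustration of that loop with stubbed-out capture, model, and execution functions; it is not OpenAI's implementation.

```python
# Generic observe-act loop behind "computer use" agents: capture the
# screen, ask the model for the next action, execute it, repeat.
# take_screenshot, model_next_action, and execute are stubs here.

def take_screenshot():
    return "screenshot-bytes"  # placeholder for a real screen capture

def model_next_action(goal, screenshot):
    # A real system would send the screenshot and goal to the model and
    # get back a structured action such as a click or keystroke.
    return {"type": "click", "x": 120, "y": 48}

def execute(action):
    print(f"executing {action['type']} at ({action['x']}, {action['y']})")

def run_agent(goal, max_steps=3):
    """Loop until the task is done or the step budget runs out."""
    for step in range(max_steps):
        shot = take_screenshot()
        action = model_next_action(goal, shot)
        execute(action)
    return "stopped after step budget"

run_agent("Open the settings panel")
```

The interesting engineering lives in the stubs: turning pixels into reliable structured actions is exactly what the OSWorld-Verified numbers are measuring.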

Its vision results also improve. GPT-5.4 scores 81.2% on MMMU-Pro without tools, up from 79.5% for GPT-5.2, and shows better document parsing on OmniDocBench with a lower normalized edit distance of 0.109 versus 0.140. It also introduces higher-fidelity image input options, which should matter for dense screenshots, large documents, and tasks where visual precision affects performance.

Then there is tool use, which may be the most important capability for serious agent workflows. GPT-5.4 improves on Toolathlon, scoring 54.6% versus 45.7% for GPT-5.2, and reaches 67.2% on MCP Atlas. OpenAI is also introducing “tool search,” which lets the model pull in tool definitions only when needed instead of stuffing every tool into the prompt upfront. In OpenAI’s example, this cut token usage by 47% while maintaining the same accuracy. That is a practical improvement, not just a benchmark win: cheaper, faster, cleaner workflows.
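The mechanism is easy to picture in code. The sketch below is a hypothetical illustration of the pattern (the registry, names, and matching logic are invented, not OpenAI's API): keep tool definitions in a registry and inline only the ones relevant to the current task.

```python
# Hypothetical sketch of "tool search": tool definitions live in a
# registry and are added to the prompt only when actually relevant,
# instead of being sent with every request.

TOOL_REGISTRY = {
    "get_weather": {"description": "Look up current weather", "params": {"city": "string"}},
    "send_email": {"description": "Send an email", "params": {"to": "string", "body": "string"}},
    "query_crm": {"description": "Search CRM records", "params": {"query": "string"}},
}

def search_tools(keywords):
    """Return only the tool definitions whose description matches a keyword."""
    hits = {}
    for name, spec in TOOL_REGISTRY.items():
        if any(kw.lower() in spec["description"].lower() for kw in keywords):
            hits[name] = spec
    return hits

def build_prompt(task, keywords):
    """Inline only the relevant tools, keeping the prompt small."""
    return {"task": task, "tools": search_tools(keywords)}

prompt = build_prompt("Email the forecast for Berlin", ["weather", "email"])
print(sorted(prompt["tools"]))  # only the two relevant tools are included
```

With dozens or hundreds of registered tools, shipping two definitions instead of all of them is where the reported token savings would come from.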

Web research is another area where GPT-5.4 appears stronger. On BrowseComp, it reaches 82.7%, while GPT-5.4 Pro hits 89.3%, compared with 65.8% for GPT-5.2. That suggests a noticeable jump in persistent, multi-step browsing—the kind needed for hard-to-find information rather than quick fact lookups.

There are also quality-of-life improvements in ChatGPT itself. GPT-5.4 Thinking can now give an upfront plan on longer tasks, and users can redirect it mid-response. That may sound small, but it changes the interaction style: less “ask, wait, retry,” and more “steer while it works.”

The pricing reflects the upgrade. GPT-5.4 costs more than GPT-5.2 in the API—$2.50 per million input tokens versus $1.75, and $15 per million output tokens versus $14—but OpenAI argues that the model’s greater token efficiency can reduce total usage on many tasks. GPT-5.4 Pro, as expected, is much pricier.

The simplest way to read this launch is that OpenAI is no longer just shipping smarter chatbots. It is shipping models designed to function as capable digital workers: better at research, better at documents, better at code, better at tools, and increasingly able to act instead of only respond.

GPT-5.4 is not just trying to sound intelligent. It is trying to be useful where usefulness is hardest to fake: in the messy middle of real work.

Ref

GPT-5.3 Instant prioritizes natural conversation over caution

OpenAI released GPT-5.3 Instant, an update focused on improving everyday conversational quality by reducing unnecessary refusals, eliminating overly cautious preambles, and adopting a more natural tone. 

The model reduces hallucination rates by 26.8 percent in high-stakes domains like medicine and law when using web search, and 19.7 percent without web access. 

Web search integration now better contextualizes results with internal knowledge rather than simply listing links, surfacing more relevant information upfront. 

The update addresses problems that don’t surface in traditional benchmarks but directly affect whether users perceive ChatGPT as helpful or frustrating in daily use. 

GPT-5.3 Instant is available now across ChatGPT free and paid tiers and via OpenAI’s API, with GPT-5.2 Instant supported until June 3, 2026. 

Ref: OpenAI

Google Gemini's Lightweight Flash model boosts performance at lower costs

Google introduced Gemini 3.1 Flash-Lite, a cost-optimized model designed for high-volume developer workloads, now available in preview via Google AI Studio and Vertex AI. 

Priced at $0.25 per million input tokens and $1.50 per million output tokens, the model achieves 2.5X faster time to first token and 45 percent faster output speed than Gemini 2.5 Flash while maintaining similar or better quality. 

On industry benchmarks, Flash-Lite scores 1432 on Arena.AI’s leaderboard and outperforms larger Gemini models from prior generations, reaching 86.9 percent on GPQA Diamond and 76.8 percent on MMMU Pro despite its smaller footprint. 

The model ships with adjustable thinking levels, letting developers control reasoning depth to manage costs, from high-frequency tasks like translation and content moderation up to more complex ones like UI generation and multi-step agent execution. 

Observers noted that while the new Flash-Lite costs less than Flash or Pro, it costs more than earlier iterations of Flash-Lite. 

Ref: Google

Qwen 3.5 - Small models match or beat larger open competitors

Alibaba released the Qwen3.5 Small model series, consisting of four AI models ranging from 0.8 billion to 9 billion parameters that run on standard laptops and mobile devices. 

The largest, Qwen3.5-9B, achieves a score of 81.7 on the GPQA Diamond graduate-level reasoning benchmark, surpassing OpenAI’s gpt-oss-120B (80.1) despite being 13.5 times smaller, and leads in multimodal tasks with 70.1 on MMMU-Pro visual reasoning versus Gemini 2.5 Flash-Lite’s 59.7. (Although Google has released an update to Gemini Flash-Lite: version 3.1.) 

Qwen’s small models use a hybrid architecture combining Gated Delta Networks with sparse Mixture-of-Experts and native multimodal training through early fusion, enabling the 4B and 9B versions to handle video analysis, document parsing, and UI navigation tasks previously requiring models ten times larger. 
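The sparse Mixture-of-Experts side of that design can be illustrated with a toy router: every expert is scored, but only the top-k actually run for a given token, which is how total parameter count can grow without per-token compute growing with it. This is a generic toy, not Qwen's actual routing.

```python
import math

# Toy sparse MoE routing: score every expert, keep only the top-k,
# and mix their outputs weighted by normalized scores.

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route(token_scores, k=2):
    """Pick the k highest-scoring experts and normalize their weights."""
    ranked = sorted(range(len(token_scores)), key=lambda i: token_scores[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([token_scores[i] for i in chosen])
    return list(zip(chosen, weights))

# 8 experts, but only 2 fire for this token: that's the "sparse" part.
scores = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.2]
print(route(scores, k=2))  # experts 1 and 4 carry this token
```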

All weights are available under Apache 2.0 licenses on Hugging Face and ModelScope, allowing unrestricted commercial use and customization. 

The efficiency gains shift which model sizes developers can deploy for production agentic workflows — tasks like automated coding, visual workflow automation, and real-time edge analysis now run locally without cloud API costs or latency.

Ref: Hugging Face

Friday, December 12, 2025

GPT-5.2, Gemini, and the AI Race -- Does Any of This Actually Help Consumers?


The AI world is ending the year with a familiar cocktail of excitement, rumor, and exhaustion. The biggest talk of December: OpenAI is reportedly rushing to ship GPT-5.2 after Google’s Gemini models lit up the leaderboard. Some insiders even describe the mood at OpenAI as a “code red,” signaling just how aggressively they want to reclaim attention, mindshare, and—let’s be honest—investor confidence.

But amid all the hype cycles and benchmark duels, a more important question rises to the surface:

Are consumers or enterprises actually better off after each new model release? Or are we simply watching a very expensive and very flashy arms race?

Welcome to Mixture of Experts.


The Model Release Roller Coaster

A year ago, it seemed like OpenAI could do no wrong—GPT-4 had set new standards, competitors were scrambling, and the narrative looked settled. Fast-forward to today: Google Gemini is suddenly the hot new thing, benchmarks are being rewritten, and OpenAI is seemingly playing catch-up.

The truth? This isn’t new. AI progress moves in cycles, and the industry’s scoreboard changes every quarter. As one expert pointed out: “If this entire saga were a movie, it would be nothing but plot twists.”

And yes—actors might already be fighting for who gets to play Sam Altman and Demis Hassabis in the movie adaptation.


Does GPT-5.2 Actually Matter?

The short answer: Probably not as much as the hype suggests.

While GPT-5.2 may bring incremental improvements—speed, cost reduction, better performance in IDEs like Cursor—don’t expect a productivity revolution the day after launch.

Several experts agreed:

  • Most consumers won’t notice a big difference.

  • Most enterprises won’t switch models instantly anyway.

  • If it were truly revolutionary, they’d call it GPT-6.

The broader sentiment is fatigue. It seems like every week, there’s a new “state-of-the-art” release, a new benchmark victory, a new performance chart making the rounds on social media. The excitement curve has flattened; now the industry is asking:

Are we optimizing models, or just optimizing marketing?


Benchmarks Are Broken—But Still Drive Everything

One irony in today’s AI landscape is that everyone agrees benchmarks are flawed, easily gamed, and often disconnected from real-world usage. Yet companies still treat them as existential battlegrounds.

The result:
An endless loop of model releases aimed at climbing leaderboard rankings that may not reflect what users actually need.

Benchmarks motivate corporate behavior more than consumer benefit. And that’s how we get GPT-5.2 rushed to market—not because consumers demanded it, but because Gemini scored higher.


The Market Is Asking the Wrong Question About Transparency

Another major development this month: Stanford’s latest AI Transparency Index. The most striking insight?

Transparency across the industry has dropped dramatically—from 74% model-provider participation last year to only 30% this year.

But not everyone is retreating. IBM’s Granite team took the top spot with a 95/100 transparency score, driven by major internal investments in dataset lineage, documentation, and policy.

Why the divergence?

Because many companies conflate transparency with open source.
And consumers—enterprises included—aren’t always sure what they’re actually asking for.

The real demand isn’t for “open weights.” It’s for knowability:

  • What data trained this model?

  • How safe is it?

  • How does it behave under stress?

  • What were the design choices?

Most consumers don’t have vocabulary for that yet. So they ask for open source instead—even when transparency and openness aren’t the same thing.

As one expert noted:
“People want transparency, but they’re asking the wrong questions.”


Amazon Nova: Big Swing or Big Hype?

At AWS re:Invent, Amazon introduced its newest Nova Frontier models, with claims that they’re positioned to compete directly with OpenAI, Google, and Anthropic.

Highlights:

  • Nova Forge promises checkpoint-based custom model training for enterprises.

  • Nova Act is Amazon’s answer to agentic browser automation, optimized for enterprise apps instead of consumer websites.

  • Speech-to-speech frontier models catch up with OpenAI and Google.

Sounds exciting—but there’s a catch.

Most enterprises don’t actually want to train or fine-tune models.

They think they do.
They think they have the data, GPUs, and specialization to justify it.

But the reality is harsh:

  • Fine-tuning pipelines are expensive and brittle.

  • Enterprise data is often too noisy or inconsistent.

  • Tool-use, RAG, and agents outperform fine-tuning for most use cases.

Only the top 1% of organizations will meaningfully benefit from Nova Forge today.
Everyone else should use agents, not custom models.


The Future: Agents That Can Work for Days

Amazon also teased something ambitious: frontier agents that can run for hours or even days to complete complex tasks.

At first glance, that sounds like science fiction—but the core idea already exists:

  • Multi-step tool use

  • Long-running workflows

  • MapReduce-style information gathering

  • Automated context management

  • Self-evals and retry loops
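The last two ingredients in particular, self-evals and retry loops, fit in a small skeleton. The sketch below is generic, with the actual work and the evaluation stubbed out:

```python
import random

def run_step(task):
    """Stub for one unit of agent work (a tool call, a search, a draft)."""
    return {"task": task, "ok": random.random() > 0.3}

def self_eval(result):
    """Stub self-evaluation: a real agent would grade its own output."""
    return result["ok"]

def run_with_retries(task, max_retries=3):
    """Retry loop: long-running agents are many such loops chained
    together, with context management in between."""
    for attempt in range(1, max_retries + 1):
        result = run_step(task)
        if self_eval(result):
            return {"status": "done", "attempts": attempt}
    return {"status": "gave_up", "attempts": max_retries}

print(run_with_retries("gather sources on semiconductor supply chains"))
```

Chaining thousands of these iterations is easy; making the self-evaluation trustworthy enough that the final artifact is good is the hard part.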

The limiting factor isn’t runtime. It’s reliability.

We’re entering a future where you might genuinely say:

“Okay AI, write me a 300-page market analysis on the global semiconductor supply chain,”
and the agent returns the next morning with a comprehensive draft.

But that’s only useful if accuracy scales with runtime—and that’s the new frontier the industry is chasing.

As one expert put it:

“You can run an agent for weeks. That doesn’t mean you’ll like what it produces.”


So… Who’s Actually Winning?

Not OpenAI.
Not Google.
Not Amazon.
Not Anthropic.

The real winner is competition itself.

Competition pushes capabilities forward.
But consumers? They’re not seeing daily life transformation with each release.
Enterprises? They’re cautious, slow to adopt, and unwilling to rebuild entire stacks for minor gains.

The AI world is moving fast—but usefulness is moving slower.

Yet this is how all transformative technologies evolve:
Capabilities first, ethics and transparency next, maturity last.

Just like social media’s path from excitement → ubiquity → regulation,
AI will go through the same arc.

And we’re still early.


Final Thought

We’ll keep seeing rapid-fire releases like GPT-5.2, Gemini Ultra, Nova, and beyond. But model numbers matter less than what we can actually build on top of them.

AI isn’t a model contest anymore.
It’s becoming a systems contest—agents, transparency tooling, deployment pipelines, evaluation frameworks, and safety assurances.

And that’s where the real breakthroughs of 2026 and beyond will come from.

Until then, buckle up. The plot twists aren’t slowing down.


GPT-5.2 is now live in the OpenAI API


Sunday, December 7, 2025

Model Alert... World Labs launched Marble -- Generated, Editable Virtual Spaces


Generated, Editable Virtual Spaces

 

Models that generate 3D spaces typically render them on the fly as users move through them, without producing a persistent world that can be explored later. A new model produces 3D worlds that can be exported and modified.

 

What’s new: World Labs launched Marble, which generates persistent, editable, reusable 3D spaces from text, images, and other inputs. The company also debuted Chisel, an integrated editor that lets users modify Marble’s output via text prompts and craft environments from scratch.

  • Input/output: Text, images, panoramas, videos, 3D layouts of boxes and planes in; Gaussian splats, meshes, or videos out.
  • Features: Expand spaces, combine spaces, alter visual style, edit spaces via text prompts or visual inputs, download generated spaces
  • Availability: Subscription tiers include Free (4 outputs based on text, images, or panoramas), $20 per month (12 outputs based on multiple images, videos, or 3D layouts), $35 per month (25 outputs with expansion and commercial rights), and $95 per month (75 outputs, all features)

How it works: Marble accepts several media types and exports 3D spaces in a variety of formats.

  • The model can generate a 3D space from a single text prompt or image. For more control, it accepts multiple images with text prompts (like front, back, left, or right) that specify which image should map to what areas. Users can also input short videos, 360-degree panoramas, or 3D models and connect outputs to build complex spaces.
  • The Chisel editor can create and edit 3D spaces directly. Geometric shapes like planes or blocks can be used to build structural elements like walls or furniture and styled via text prompts or images.
  • Generated spaces can be extended or connected by clicking on the relevant area.
  • Model outputs can be Gaussian splats (high-quality representations composed of semi-transparent particles that can be rendered in web browsers), collider meshes (simplified 3D geometries that define object boundaries for physics simulations), and high-quality meshes (detailed geometries suitable for editing). Video output can include controllable camera paths and effects like smoke or flowing water.

Performance: Early users report generating game-like environments and photorealistic recreations of real-world locations.

  • Marble generates more complete 3D structures than depth maps or point clouds, which represent surfaces but not object geometries, World Labs said.
  • Its mesh outputs integrate with tools commonly used in game development, visual effects, and 3D modeling.

Behind the news: Earlier generative models can produce 3D spaces on the fly, but typically such spaces can’t be saved or revisited interactively. Marble stands out by generating spaces that can be saved and edited. For instance, in October, World Labs introduced RTFM, which generates spaces in real time as users navigate through them. Competing systems from startups like Decart and Odyssey are available as demos, and Google’s Genie 3 remains a research preview.

 

Why it matters: World Labs founder and Stanford professor Fei-Fei Li argues that spatial intelligence — understanding how physical objects occupy and move through space — is a key aspect of intelligence that language models can’t fully address. With Marble, World Labs aspires to catalyze development in spatial AI just as ChatGPT and subsequent large language models ignited progress in text processing.

 

We’re thinking: Virtual spaces produced by Marble are geometrically consistent, which may prove valuable in gaming, robotics, and virtual reality. However, the objects within them are static. Virtual worlds that include motion will bring AI even closer to understanding physics.

 

Tags: AI Model Alert, Artificial Intelligence, Technology

Model Alert... Open 3D Generation Pipeline -- Meta’s Segment Anything Model (SAM) image-segmentation model


Open 3D Generation Pipeline

 

Meta’s Segment Anything Model (SAM) image-segmentation model has evolved into an open-weights suite for generating 3D objects. SAM 3 segments images, SAM 3D turns the segments into 3D objects, and SAM 3D Body produces 3D objects of any people among the segments. You can experiment with all three.

 

SAM 3: SAM 3 now segments images and videos based on input text. It retains the ability to segment objects based on input geometry (bounding boxes or points that are labeled to include or exclude the objects at those locations), like the previous version. 

  • Input/output: Images, video, text, geometry in; segmented images or video out
  • Performance: In Meta’s tests, SAM 3 outperformed almost all competitors on a variety of benchmarks that test image and video segmentation. For instance, on LVIS (segmenting objects from text), SAM 3 (48.5 percent average precision) outperformed DINO-X (38.5 percent average precision). It fell behind APE-D (53.0 percent average precision), which was trained on LVIS’ training set. 
  • Availability: Weights and fine-tuning code freely available for noncommercial and commercial uses in countries that don’t violate U.S., EU, UK, and UN trade restrictions under Meta license 

SAM 3D: This model generates 3D objects from images based on segmentation masks. By individually predicting each object in an image, it can represent the entire scene. It can also take in point clouds to improve its output.

  • Input/output: Image, mask, point cloud in; 3D object (mesh, Gaussian splat) out
  • Performance: Judging both objects and scenes generated from photos, humans preferred SAM 3D’s outputs over those by other models. For instance, when generating objects from the LVIS dataset, people preferred SAM 3D nearly 80 percent of the time, Hunyuan3D 2.0 about 12 percent of the time, and other models 8 percent of the time.
  • Availability: Weights and inference code freely available for noncommercial and commercial uses in countries that don’t violate U.S., EU, UK, and UN trade restrictions under Meta license

SAM 3D Body: Meta released an additional model that produces 3D human figures from images. Input bounding boxes or masks can also determine which figures to produce, and an optional transformer decoder can refine the positions and shapes of human hands.

  • Input/output: Image, bounding boxes, masks in; 3D objects (mesh, Gaussian splat) out
  • Performance: In Meta’s tests, SAM 3D Body achieved the best performance across a number of datasets compared to other models that take images or videos and generate 3D human figures. For example, on the EMDB dataset of people in the wild, SAM 3D Body achieved 62.9 Mean Per Joint Position Error (MPJPE, a measure of how different the predicted joint positions are from the ground truth, lower is better) compared to next best Neural Localizer Fields, which achieved 68.4 MPJPE. On Freihand (a test of hand correctness), SAM 3D Body achieved similar or slightly worse performance than models that specialize in estimating hand poses. (The authors claim the other models were trained on Freihand’s training set.)
  • Availability: Weights, inference code, and training data freely available in countries that don’t violate U.S., EU, UK, and UN trade restrictions under Meta license
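MPJPE itself is straightforward to compute: the mean Euclidean distance between predicted and ground-truth joint positions. A minimal pure-Python version, with made-up joint coordinates:

```python
import math

def mpjpe(predicted, ground_truth):
    """Mean Per Joint Position Error: average Euclidean distance
    between corresponding predicted and ground-truth joints."""
    assert len(predicted) == len(ground_truth)
    total = 0.0
    for (px, py, pz), (gx, gy, gz) in zip(predicted, ground_truth):
        total += math.sqrt((px - gx) ** 2 + (py - gy) ** 2 + (pz - gz) ** 2)
    return total / len(predicted)

# Two joints, off by exactly 3 and 4 units along one axis each.
pred = [(0.0, 0.0, 3.0), (4.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
print(mpjpe(pred, gt))  # (3 + 4) / 2 = 3.5
```

In body-pose benchmarks like EMDB the distances are usually reported in millimeters, so lower scores mean predicted joints sit closer to the ground truth.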

Why it matters: This SAM series offers a unified pipeline for making 3D models from images. Each model advances the state of the art, enabling more-accurate image segmentations from text, 3D objects that human judges preferred, and 3D human figures that also appealed to human judges. These models are already driving innovations in Meta’s user experience. For instance, SAM 3 and SAM 3D enable users of Facebook marketplace to see what furniture or other home decor looks like in a particular space.

 

We’re thinking:  At the highest level, all three models learned from a similar data pipeline: Find examples the model currently performs poorly on, use humans to annotate them, and train on the annotations. According to Meta’s publications, this process greatly reduced the time and money required to annotate quality datasets.

 

Tags: Technology, Artificial Intelligence, AI Model Alert