Showing posts with label Artificial Intelligence. Show all posts

Saturday, December 13, 2025

This Week in AI... Why Agentic Systems, GPT-5.2, and Open Models Matter More Than Ever


See All Articles on AI

If it feels like the AI world is moving faster every week, you’re not imagining it.

In just a few days, we’ve seen new open-source foundations launched, major upgrades to large language models, cheaper and faster coding agents, powerful vision-language models, and even sweeping political moves aimed at reshaping how AI is regulated.

Instead of treating these as disconnected announcements, let’s slow down and look at the bigger picture. What’s actually happening here? Why do these updates matter? And what do they tell us about where AI is heading next?

This post breaks it all down — without the hype, and without assuming you already live and breathe AI research papers.


The Quiet Rise of Agentic AI (And Why Governance Matters)

One of the most important stories this week didn’t come with flashy demos or benchmark charts.

The Agentic AI Foundation (AAIF) was created to provide neutral governance for a growing ecosystem of open-source agent technologies. That might sound bureaucratic, but it’s actually a big deal.

At launch, AAIF is stewarding three critical projects:

  • Model Context Protocol (MCP) from Anthropic

  • Goose, Block’s agent framework built on MCP

  • AGENTS.md, OpenAI’s lightweight standard for describing agent behavior in projects

If you’ve been following AI tooling closely, you’ve probably noticed a shift. We’re moving away from single prompt → single response systems, and toward agents that can:

  • Use tools

  • Access files and databases

  • Call APIs

  • Make decisions across multiple steps

  • Coordinate with other agents

MCP, in particular, has quietly become a backbone for this movement. With over 10,000 published servers, it’s turning into a kind of “USB-C for AI agents” — a standard way to connect models to tools and data.
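To make the "USB-C for AI agents" analogy concrete, here is a minimal sketch (in Python) of the pattern that protocols like MCP standardize: the model emits a structured tool call, and a thin client routes it to a registered tool. The names below (ToolRegistry, word_count) are illustrative only, not part of the actual MCP specification.

```python
# Minimal sketch of the tool-calling pattern that standards like MCP formalize.
# All names here (ToolRegistry, word_count) are illustrative, not real MCP APIs.
from typing import Any, Callable, Dict


class ToolRegistry:
    """Maps tool names to callables and dispatches structured tool calls."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = fn

    def dispatch(self, call: Dict[str, Any]) -> Any:
        # A model would emit something like {"tool": "...", "arguments": {...}}.
        return self._tools[call["tool"]](**call["arguments"])


def word_count(text: str) -> int:
    return len(text.split())


registry = ToolRegistry()
registry.register("word_count", word_count)

# Pretend the model produced this structured call in response to a user request.
model_call = {"tool": "word_count", "arguments": {"text": "agents need scaffolding"}}
print(registry.dispatch(model_call))  # -> 3
```

The value of a shared standard is that the registry half of this picture can live inside any tool, database, or SaaS product, and any compliant model client can call it the same way.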

What makes AAIF important is not just the tech, but the governance. Instead of one company controlling these standards, the foundation includes contributors from AWS, Google, Microsoft, OpenAI, Anthropic, Cloudflare, Bloomberg, and others.

That signals something important:

Agentic AI isn’t a side experiment anymore — it’s infrastructure.


GPT-5.2: The AI Office Worker Has Arrived

Now let’s talk about the headline grabber: GPT-5.2.

OpenAI positions GPT-5.2 as a model designed specifically for white-collar knowledge work. Think spreadsheets, presentations, reports, codebases, and analysis — the kind of tasks that dominate modern office jobs.

According to OpenAI’s claims, GPT-5.2:

  • Outperforms human professionals on ~71% of tasks across 44 occupations (GDPval benchmark)

  • Runs 11× faster than previous models

  • Costs less than 1% of earlier generations for similar workloads

Those numbers are bold, but the more interesting part is how the model is being framed.

GPT-5.2 isn’t just “smarter.” It’s packaged as a document-first, workflow-aware system:

  • Building structured spreadsheets

  • Creating polished presentations

  • Writing and refactoring production code

  • Handling long documents with fewer errors

Different variants target different needs:

  • GPT-5.2 Thinking emphasizes structured reasoning

  • GPT-5.2 Pro pushes the limits on science and complex problem-solving

  • GPT-5.2 Instant focuses on speed and responsiveness

The takeaway isn’t that AI is replacing all office workers tomorrow. It’s that AI is becoming a reliable first draft for cognitive labor — not just text, but work artifacts.


Open Models Are Getting Smaller, Cheaper, and Smarter

While big proprietary models grab headlines, some of the most exciting progress is happening in open-source land.

Mistral’s Devstral 2: Serious Coding Power, Openly Licensed

Mistral released Devstral 2, a 123B-parameter coding model, alongside a smaller 24B version called Devstral Small 2.

Here’s why that matters:

  • Devstral 2 scores 72.2% on SWE-bench Verified

  • It’s much smaller than competitors like DeepSeek V3.2

  • Mistral claims it’s up to 7× more cost-efficient than Claude Sonnet

  • Both models support massive 256K token contexts

Even more importantly, the models are released under open licenses:

  • Modified MIT for Devstral 2

  • Apache 2.0 for Devstral Small 2

That means companies can run, fine-tune, and deploy these models without vendor lock-in.

Mistral also launched Mistral Vibe CLI, a tool that lets developers issue natural-language commands across entire codebases — a glimpse into how coding agents will soon feel more like collaborators than autocomplete engines.


Vision + Language + Tools: A New Kind of Reasoning Model

Another major update came from Zhipu AI, which released GLM-4.6V, a vision-language reasoning model with native tool calling.

This is subtle, but powerful.

Instead of treating images as passive inputs, GLM-4.6V can:

  • Accept images as parameters to tools

  • Interpret charts, search results, and tool outputs

  • Reason across text, visuals, and structured data

In practical terms, that enables workflows like:

  • Turning screenshots into functional code

  • Analyzing documents that mix text, tables, and images

  • Running visual web searches and reasoning over results

With both large (106B) and local (9B) versions available, this kind of multimodal agent isn’t just for big cloud players anymore.
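To illustrate what "images as parameters to tools" looks like in practice, here is a hypothetical pair of message payloads. The field names are invented for illustration and do not mirror Zhipu AI's actual API.

```python
# Hypothetical payloads showing an image passed as a tool argument and the
# structured result the model then reasons over. Field names are illustrative,
# not Zhipu AI's real API schema.
import json

tool_call = {
    "role": "assistant",
    "tool_call": {
        "name": "extract_table",  # hypothetical chart-parsing tool
        "arguments": {
            "image": {"type": "image_url", "url": "https://example.com/q3_revenue_chart.png"},
            "format": "csv",
        },
    },
}

tool_result = {
    "role": "tool",
    "name": "extract_table",
    "content": "quarter,revenue\nQ1,1.2\nQ2,1.4\nQ3,1.9",  # structured data for the next reasoning step
}

print(json.dumps([tool_call, tool_result], indent=2))
```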


Developer Tools Are Becoming Agentic, Too

AI models aren’t the only thing evolving — developer tools are changing alongside them.

Cursor 2.2 introduced a new Debug Mode that feels like an early glimpse of agentic programming environments.

Instead of just pointing out errors, Cursor:

  1. Instruments your code with logs

  2. Generates hypotheses about what’s wrong

  3. Asks you to confirm or reproduce behavior

  4. Iteratively applies fixes

It also added a visual web editor, letting developers:

  • Click on UI elements

  • Inspect props and components

  • Describe changes in plain language

  • Update code and layout in one integrated view

This blending of code, UI, and agent reasoning hints at a future where “programming” looks much more collaborative — part conversation, part verification.
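For readers who like to see the control flow, here is a rough sketch of the instrument, hypothesize, confirm, fix loop described above. It is not Cursor's implementation, just the general shape of an agentic debugger.

```python
# Rough sketch of an instrument -> hypothesize -> confirm -> fix loop.
# This is the general pattern, not Cursor's actual implementation.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hypothesis:
    description: str
    log_statement: str          # instrumentation to insert into the code
    fix: Callable[[str], str]   # proposed source transformation


def debug_loop(source: str, hypotheses: List[Hypothesis],
               confirm: Callable[[Hypothesis, str], bool]) -> str:
    """Try each hypothesis: instrument the code, ask the user to confirm, apply the fix."""
    for hyp in hypotheses:
        instrumented = source + "\n" + hyp.log_statement   # 1. instrument with logs
        if confirm(hyp, instrumented):                      # 2-3. user reproduces and confirms
            return hyp.fix(source)                          # 4. apply the fix
    return source  # nothing confirmed; leave the code untouched


# Toy usage: suspect an off-by-one error in a loop bound.
hyp = Hypothesis(
    description="loop stops one element early",
    log_statement='print("last index seen:", i)',
    fix=lambda src: src.replace("range(len(xs) - 1)", "range(len(xs))"),
)
buggy = "total = 0\nfor i in range(len(xs) - 1):\n    total += xs[i]"
print(debug_loop(buggy, [hyp], confirm=lambda h, code: True))
```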


The Political Dimension: Centralizing AI Regulation

Not all AI news is technical.

This week also saw a major U.S. executive order aimed at creating a single federal AI regulatory framework, overriding state-level laws.

The order:

  • Preempts certain state AI regulations

  • Establishes an AI Litigation Task Force

  • Ties federal funding eligibility to regulatory compliance

  • Directs agencies to assess whether AI output constraints violate federal law

Regardless of where you stand politically, this move reflects a growing realization:
AI governance is now a national infrastructure issue, not just a tech policy debate.

As AI systems become embedded in healthcare, finance, education, and government, fragmented regulation becomes harder to sustain.


The Bigger Pattern: AI Is Becoming a System, Not a Tool

If there’s one thread connecting all these stories, it’s this:

AI is no longer about individual models — it’s about systems.

We’re seeing:

  • Standards for agent behavior

  • Open governance for shared infrastructure

  • Models optimized for workflows, not prompts

  • Tools that reason, debug, and collaborate

  • Governments stepping in to shape long-term direction

The era of “just prompt it” is fading. What’s replacing it is more complex — and more powerful.

Agents need scaffolding. Models need context. Tools need interoperability. And humans are shifting from direct operators to supervisors, reviewers, and designers of AI-driven processes.


So What Should You Take Away From This?

If you’re a student, developer, or knowledge worker, here’s the practical takeaway:

  • Learn how agentic workflows work — not just prompting

  • Pay attention to open standards like MCP

  • Don’t ignore smaller, cheaper models — they’re closing the gap fast

  • Expect AI tools to increasingly ask for confirmation, not blind trust

  • Understand that AI’s future will be shaped as much by policy and governance as by benchmarks

The AI race isn’t just about who builds the biggest model anymore.

It’s about who builds the most usable, reliable, and well-governed systems — and who learns to work with them intelligently.

And that race is just getting started.

Monday, December 8, 2025

AI’s Next Phase -- Specialization, Scaling, and the Coming Agent Platform Wars -- Mistral 3 vs DeepSeek 3.2 vs Claude Opus 4.5


See All Articles on AI

As 2025 comes to a close, the AI world is doing the opposite of slowing down. In just a few weeks, we’ve seen three major model launches from different labs:

  • Mistral 3

  • DeepSeek 3.2

  • Claude Opus 4.5

All three are strong. None are obviously “bad.” That alone is a big shift from just a couple of years ago, when only a handful of labs could credibly claim frontier-level models.

But the interesting story isn’t just that everything is good now.

The real story is this:

AI is entering a phase where differentiation comes from specialization and control over platforms, not just raw model quality.

We can see this in three places:

  1. How Mistral, DeepSeek, and Anthropic are carving out different strengths.

  2. How “scaling laws” are quietly becoming “experimentation laws.”

  3. How Amazon’s move against ChatGPT’s shopping agent signals an emerging platform war around agents.

Let’s unpack each.


1. Mistral vs. DeepSeek vs. Claude: When Everyone Is Good, What Makes You Different?

On paper, the new Mistral and DeepSeek releases look like they’re playing the same game: open models, strong benchmarks, competitive quality.

Under the hood, they’re leaning into very different philosophies.

DeepSeek 3.2: Reasoning and Sparse Attention for Agents

DeepSeek has become synonymous with novel attention mechanisms and high-efficiency large models. The 3.2 release extends that trend with:

  • Sparse attention techniques that help big models run more efficiently.

  • A strong emphasis on reasoning-first performance, especially around:

    • Tool use

    • Multi-step “agentic” workflows

    • Math and code-heavy tasks

If you squint, DeepSeek is trying to be “the reasoning lab”:

If your workload is complex multi-step thinking with tools, we want to be your default.

Mistral 3: Simple Transformer, Strong Multimodality, Open Weights

Mistral takes almost the opposite architectural route.

  • No flashy linear attention.

  • No wild new topology.

  • Just a dense, relatively “plain” transformer — tuned very well.

The innovation is in how they’ve packaged the lineup:

  • Multimodal by default across the range, including small models.

  • You can run something like Mistral 3B locally and still get solid vision + text capabilities.

  • That makes small, on-device, multimodal workflows actually practical.

The message from Mistral is:

You don’t need a giant proprietary model to do serious multimodal work. You can self-host it, and it’s Apache 2.0 again, not a bespoke “research-only” license.

Claude Opus 4.5: From Assistant to Digital Worker

Anthropic’s Claude Opus 4.5 sits more on the closed, frontier side of the spectrum. Its differentiation isn’t just capabilities, but how it behaves as a collaborator.

A few emerging themes:

  • Strong focus on software engineering, deep code understanding, and long-context reasoning.

  • A growing sense of “personality continuity”:

    • Users report the model doing natural “callbacks” to earlier parts of the conversation.

    • It feels less like a stateless chat and more like an ongoing working relationship.

  • Framed by Anthropic as more of a “digital worker” than a simple assistant:

    • Read the 200-page spec.

    • Propose changes.

    • Keep state across a long chain of tasks.

If DeepSeek is leaning into reasoning, and Mistral into open multimodal foundations, Claude is leaning into:

“Give us your workflows and we’ll embed a digital engineer into them.”

The Big Shift: Differentiation by Domain, Not Just Quality

A few years ago, the question was: “Which model is the best overall?”

Now the better question is:

“Best for what?”

  • Best for local multimodal tinkering? Mistral is making a strong case.

  • Best for tool-heavy reasoning and math/code? DeepSeek is aiming at that.

  • Best for enterprise-grade digital teammates? Claude wants that slot.

This is how the “no moat” moment is resolving:
When everyone can make a good general model, you specialize by domain and workflow, not just by raw benchmark scores.


2. Are Scaling Laws Still a Thing? Or Are We Just Scaling Experimentation?

A recent blog post from VC Tomasz Tunguz reignited debate about scaling laws. His claim, paraphrased: Gemini 3 shows that the old scaling laws are still working—with enough compute, we still get big capability jumps.

There’s probably some truth there, but the nuance matters.

Scaling Laws, the Myth Version

The “myth” version of scaling laws goes something like:

“Make the model bigger. Feed it more data. Profit.”

If that were the full story, only the labs with the most GPUs (or TPUs) would ever meaningfully advance the frontier. Google, with deep TPU integration, is the clearest example: it has “the most computers that ever computed” and the tightest hardware–software stack.

But that’s not quite what seems to be happening.

What’s Really Scaling: Our Ability to Experiment

With Gemini 3, Google didn’t massively increase parameters relative to Gemini 1.5. The improvements likely came from:

  • Better training methods

  • Smarter data curation and filtering

  • Different mixtures of synthetic vs human data

  • Improved training schedules and hyperparameters

In other words, the action is shifting from:

“Make it bigger” → to → “Train it smarter.”

The catch?
Training smarter still requires a lot of room to experiment. When:

  • One full-scale training run costs millions of dollars, and

  • Takes weeks or months,

…you can’t explore the space of training strategies very fully. There’s a huge hyperparameter and design space we’ve barely touched, simply because it’s too expensive to try things.
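A quick back-of-envelope calculation shows why. The numbers below are assumptions chosen purely for illustration, not figures from any lab:

```python
# Back-of-envelope: how many training experiments a fixed budget allows.
# All numbers are assumptions for illustration, not figures from any lab.
budget_usd = 500_000_000          # assumed annual experimentation budget
cost_full_run = 50_000_000        # assumed cost of one frontier-scale training run
cost_proxy_run = 500_000          # assumed cost of a small-scale proxy experiment

print("Full-scale runs per year:", budget_usd // cost_full_run)    # 10
print("Proxy-scale runs per year:", budget_usd // cost_proxy_run)  # 1000
# With only ~10 full-scale shots a year, most of the search over data mixes,
# schedules, and hyperparameters has to happen at proxy scale -- which is why
# extra compute buys better models mainly by buying more (and bigger) experiments.
```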

That leads to a more realistic interpretation:

Scaling laws are quietly turning into experimentation laws.

The more compute you have, the more experiments you can run on:

  • architecture

  • training data

  • curricula

  • optimization tricks
    …and that’s what gives you better models.

From this angle, Google’s big advantage isn’t just size—it’s iteration speed at massive scale. As hardware gets faster, what really scales is how quickly we can search for better training strategies.


3. Agents vs Platforms: Amazon, ChatGPT, and the New Walled Gardens

While models are getting better, a different battle is playing out at the application layer: agents.

OpenAI’s Shopping Research agent is a clear example of the agent vision:

“Tell the agent what you need. It goes out into the world, compares products, and comes back with recommendations.”

If you think “online shopping,” you think Amazon. But Amazon recently took a decisive step:
It began blocking ChatGPT’s shopping agent from accessing product detail pages, review data, and deals.

Why Would Amazon Block It?

You don’t need a conspiracy theory to answer this. A few obvious reasons:

  • Control over the funnel
    Amazon doesn’t want a third-party agent sitting between users and its marketplace.

  • Protection of ad and search economics
    Product discovery is where Amazon makes a lot of money.

  • They’re building their own AI layers
    With things like Alexa+ and Rufus, Amazon wants its own assistants to be the way you shop.

In effect, Amazon is saying:

“If you want to shop here, you’ll use our agent, not someone else’s.”

The Deeper Problem: Agents Need an Open Internet, but the Internet Is Not Open

Large-language-model agents rely on a simple assumption:

“They can go out and interact with whatever site or platform is needed on your behalf.”

But the reality is:

  • Cloudflare has started blocking AI agents by default.

  • Amazon is blocking shopping agents.

  • Many platforms are exploring paywalls or tollbooths for automated access.

So before we hit technical limits on what agents can do, we’re hitting business limits on where they’re allowed to go.

It raises an uncomfortable question:

Can we really have a “universal agent” if every major platform wants to be its own closed ecosystem?

Likely Outcome: Agents Become the New Apps

The original dream:

  • One personal agent

  • Talks to every service

  • Does everything for you across the web

The likely reality:

  • You’ll have a personal meta-agent, but it will:

    • Call Amazon’s agent for shopping

    • Call your bank’s agent for finance

    • Call your airline’s agent for travel

  • Behind the scenes, this will look less like a single unified agent and more like:

    “A multi-agent OS for your life, glued together by your personal orchestrator.”

In other words, we may not be escaping the “app world” so much as rebuilding it with agents instead of apps.
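If that multi-agent picture sounds abstract, here is a tiny sketch of what a personal orchestrator routing to platform-owned agents could look like. The agent names and routing rules are hypothetical.

```python
# Sketch of a personal "meta-agent" delegating to platform-owned agents.
# Agent names and routing rules are hypothetical.
from typing import Callable, Dict


def amazon_shopping_agent(request: str) -> str:
    return f"[Amazon agent] results for: {request}"


def bank_agent(request: str) -> str:
    return f"[Bank agent] handled: {request}"


class MetaAgent:
    """Personal orchestrator that delegates to whichever domain agents it may call."""

    def __init__(self, routes: Dict[str, Callable[[str], str]]) -> None:
        self.routes = routes

    def handle(self, domain: str, request: str) -> str:
        if domain not in self.routes:
            return f"No agent available for '{domain}' (the platform may block third-party access)."
        return self.routes[domain](request)


me = MetaAgent({"shopping": amazon_shopping_agent, "finance": bank_agent})
print(me.handle("shopping", "noise-cancelling headphones under $200"))
print(me.handle("travel", "book a flight to Lisbon"))  # no agent registered for this domain
```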


The Big Picture: What Phase Are We Entering?

If you zoom out, these threads are connected:

  1. Models are converging on “good enough,” so labs specialize by domain and workflow.

  2. Scaling is shifting from “make it bigger” to “let us run more experiments on architectures, data, and training.”

  3. Agents are bumping into platform economics and control, not just technical feasibility.

Put together, it suggests we’re entering a new phase:

From the Open Frontier Phase → to the Specialization and Platform Phase.

  • Labs will succeed by owning specific domains and developer workflows.

  • The biggest performance jumps may come from training strategy innovation, not parameter count.

  • Agent ecosystems will reflect platform power struggles as much as technical imagination.

The excitement isn’t going away. But the rules of the game are changing—from who can train the biggest model to who can:

  • Specialize intelligently

  • Experiment fast

  • Control key platforms

  • And still give users something that feels like a single, coherent AI experience.

That’s the next frontier.

Tags: Artificial Intelligence,Technology,

Where We Stand on AGI: Latest Developments, Numbers, and Open Questions


See All Articles on AI

Executive summary (one line)

Top models have made rapid, measurable gains (e.g., GPT‑5 reported around 50–70% on several AGI-oriented benchmarks), but persistent, hard-to-solve gaps — especially durable continual learning, robust multimodal world models, and reliable truthfulness — mean credible AGI timelines still range from a few years (for narrow definitions) to several decades (for robust human‑level generality). Numbers below are reported by labs and studies; where results come from internal tests or single groups I flag them as provisional.

Quick snapshot of major recent headlines

  • OpenAI released GPT‑5 (announced Aug 7, 2025) — presented as a notable step up in reasoning, coding and multimodal support (press release and model paper reported improvements).
  • Benchmarks and expert studies place current top models roughly “halfway” to some formal AGI definitions: a ten‑ability AGI framework reported GPT‑4 at 27% and GPT‑5 at 57% toward its chosen AGI threshold (framework authors’ reported scores).
  • Some industry/academic reports and panels (for example, an MIT/Arm deep dive) warn AGI‑like systems might appear as early as 2026; other expert surveys keep median predictions later (many 50%‑probability dates clustered around 2040–2060).
  • Policy and geopolitics matter: RAND (modeling reported Dec 1, 2025) frames the US–China AGI race as a prisoner’s dilemma — incentives favor speed absent stronger international coordination and verification.

Methods and definitions (short)

What “AGI score” means here: this post uses several benchmarking frameworks that combine multiple task categories (reasoning, planning, perception, memory, tool use). Each framework weights abilities differently and maps aggregate performance to a 0–100% scale relative to an internal "AGI threshold" chosen by its authors. These mappings are normative — not universally agreed — so percentages should be read as framework‑specific progress indicators, not absolute measures of human‑level general intelligence.
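As a concrete (and entirely made-up) illustration of how such a framework turns per-ability scores into a single percentage, here is a weighted aggregate. The ability names, weights, and scores are invented and are not any framework's actual values.

```python
# Illustrative aggregation in the spirit of multi-ability AGI frameworks.
# Ability names, weights, and scores are invented for illustration only.
abilities = {
    "reasoning":  {"weight": 0.15, "score": 0.70},
    "planning":   {"weight": 0.10, "score": 0.55},
    "perception": {"weight": 0.15, "score": 0.45},
    "memory":     {"weight": 0.15, "score": 0.30},
    "tool_use":   {"weight": 0.10, "score": 0.65},
    "language":   {"weight": 0.35, "score": 0.85},
}

# Each score is relative to a framework-defined "human level" of 1.0 per ability,
# so the weighted mean is a framework-specific percentage, not an absolute measure.
aggregate = sum(a["weight"] * a["score"] for a in abilities.values())
print(f"Framework score: {aggregate:.0%}")
```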

Provenance notes: I flag results as (a) published/peer‑reviewed, (b) public benchmark results, or (c) reported/internal tests by labs. Where items are internal or single‑lab reports they are provisional and should be independently verified before being used as firm evidence.

Benchmarks and headline numbers (compact table)

Benchmark | What it measures | Model / Score | Human baseline / Notes | Source type
Ten‑ability AGI framework | Aggregate across ~10 cognitive abilities | GPT‑4: 27% · GPT‑5: 57% | Framework‑specific AGI threshold (authors' mapping) | Reported framework scores (authors)
SPACE (visual reasoning subset) | Visual reasoning tasks (subset) | GPT‑4o: 43.8% · GPT‑5 (Aug 2025): 70.8% | Human average: 88.9% | Internal/public benchmark reports (reported)
MindCube | Spatial / working‑memory tests | GPT‑4o: 38.8% · GPT‑5: 59.7% | Still below typical human average | Benchmark reports (reported)
SimpleQA | Hallucination / factual accuracy | GPT‑5: hallucinations in >30% of questions (reported) | Some other models (e.g., Anthropic Claude variants) report lower hallucination rates | Reported / model vendor comparisons
METR endurance test | Sustained autonomous task performance | GPT‑5.1‑Codex‑Max: ~2 hours 42 minutes · GPT‑4: a few minutes | Measures autonomous chaining and robustness over time | Internal lab test (provisional)
IMO 2025 (DeepMind Gemini, "Deep Think" mode) | Formal math problem solving under contest constraints | Solved 5 of 6 problems within 4.5 hours (gold‑level performance reported) | Shows strong formal reasoning in a constrained task | Reported by DeepMind (lab result)

Where models still struggle (the real bottlenecks)

  • Continual learning / long‑term memory: Most models remain effectively "frozen" after training; reliably updating and storing durable knowledge over weeks/months remains unsolved and is widely cited as a high‑uncertainty obstacle.
  • Multimodal perception (vision & world models): Text and math abilities have improved faster than visual induction and physical‑world modeling; visual working memory and physical plausibility judgments still lag humans.
  • Hallucinations and reliable retrieval: High‑confidence errors persist (SimpleQA >30% hallucination reported for GPT‑5 in one test); different model families show substantial variance.
  • Low‑latency tool use & situated action: Language is fast; perception‑action loops and real‑world tool use (robotics) remain harder and slower.

How researchers think we’ll get from here to AGI

Two broad routes dominate discussion:

  1. Scale current methods: Proponents argue more parameters, compute and better data will continue yielding returns. Historical training‑compute growth averaged ~4–5×/year (with earlier bursts up to ~9×/year until mid‑2020).
  2. New architectures / breakthroughs: Others (e.g., prominent ML researchers) argue scaling alone won’t close key gaps and that innovations (robust world models, persistent memory systems, tighter robotics integration) are needed.

Compute projections vary: one analysis (Epoch AI) suggested training budgets up to ~2×10^29 FLOPs could be feasible by 2030 under optimistic assumptions; other reports place upper bounds near ~3×10^31 FLOPs depending on power and chip production assumptions.
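To see how those projections relate to the growth rate above, here is a rough extrapolation; the 2025 baseline is an assumed value used only for illustration.

```python
# Rough projection of frontier training compute using the ~4-5x/year growth rate
# cited above. The 2025 baseline is an assumed value for illustration only.
flops = 5e25                 # assumed compute of a frontier training run circa 2025
growth_per_year = 4.5        # midpoint of the ~4-5x/year range

for year in range(2026, 2031):
    flops *= growth_per_year
    print(f"{year}: ~{flops:.1e} FLOPs")
# By 2030 this lands around 1e29 FLOPs, the same order of magnitude as the
# ~2e29 FLOPs feasibility figure attributed to Epoch AI (under its assumptions).
```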

Timelines: why predictions disagree

Different metrics, definitions and confidence levels drive wide disagreement. Aggregated expert surveys show medians often in the 2040–2060 range, while some narrow frameworks and industry estimates give earlier dates (one internal framework estimated 50% by end‑2028 and 80% by end‑2030 under its assumptions). A minority of experts and some industry reports have suggested AGI‑like capabilities could appear as early as 2026. When using these numbers, note the underlying definition of AGI, which benchmark(s) are weighted most heavily, and whether the estimate is conditional on continued scaling or a specific breakthrough.

Risks, governance and geopolitics

  • Geopolitics: RAND models (Dec 1, 2025 reporting) show a prisoner’s dilemma: nations face incentives to accelerate unless international verification and shared risk assessments improve.
  • Security risks: Reports warn of misuse (e.g., advances in bio‑expertise outputs), espionage, and supply‑chain chokepoints (chip export controls and debates around GPU access matter for pace of progress).
  • Safety strategies: Proposals range from technical assurance and transparency to verification regimes and deterrence ideas; all face verification and observability challenges.
  • Ethics and law: Active debates continue over openness, liability, and model access control (paywalls vs open releases).

Bottom line for students (and what to watch)

Progress is real and measurable: top models now match or beat humans on many narrow tasks, have larger context windows, and can sustain autonomous code writing for hours in some internal tests. But key human‑like capacities — durable continual learning, reliable multimodal world models, and trustworthy factuality — remain outstanding. Timelines hinge on whether these gaps are closed by continued scaling, a single breakthrough (e.g., workable continual learning), or new architectures. Policy and safety research must accelerate in parallel.

Watch these signals: AGI‑score framework updates, SPACE / IntPhys / MindCube / SimpleQA benchmark results, compute growth analyses (e.g., Epoch AI), major model releases (GPT‑5 and successors), METR endurance reports, and policy studies like RAND’s — and when possible, prioritize independently reproducible benchmark results over single‑lab internal tests.

References and sources (brief)

  • OpenAI GPT‑5 announcement — Aug 7, 2025 (model release/press materials; reported performance claims).
  • Ten‑ability AGI framework — authors’ reported scores for GPT‑4 (27%) and GPT‑5 (57%) (framework paper/report; framework‑specific mapping to AGI threshold).
  • SPACE visual reasoning subset results — reported GPT‑4o 43.8%, GPT‑5 (Aug 2025) 70.8%, human avg 88.9% (benchmark report / lab release; flagged as reported/internal where applicable).
  • MindCube spatial/working‑memory benchmark — reported GPT‑4o 38.8%, GPT‑5 59.7% (benchmark report).
  • SimpleQA factuality/hallucination comparison — GPT‑5 reported >30% hallucination rate; other models (Anthropic Claude variants) report lower rates (vendor/benchmark reports).
  • METR endurance test — reported GPT‑5.1‑Codex‑Max sustained autonomous performance ~2 hours 42 minutes vs GPT‑4 few minutes (internal lab test; provisional).
  • DeepMind Gemini (’Deep Think’ mode) — reported solving 5 of 6 IMO 2025 problems within 4.5 hours (DeepMind report; task‑constrained result).
  • Epoch AI compute projection — suggested ~2×10^29 FLOPs feasible by 2030 under some assumptions; other reports give upper bounds up to ~3×10^31 FLOPs (compute projection studies).
  • RAND modeling of US–China race — reported Dec 1, 2025 (prisoner’s dilemma framing; policy analysis report).
  • Expert surveys and timeline aggregates — multiple surveys report medians often in 2040–2060 with notable variance (survey meta‑analyses / aggregated studies).

Notes: Where a result was described in the original draft as coming from “internal tests” or a single lab, I preserved the claim but flagged it above as provisional and recommended independent verification. For any use beyond classroom discussion, consult the original reports and benchmark datasets to confirm methodology, sample sizes, dates and reproducibility.

Tags: Artificial Intelligence,Technology,

Sunday, December 7, 2025

Model Alert... World Labs launched Marble -- Generated, Editable Virtual Spaces

See All on AI Model Releases

Generated, Editable Virtual Spaces

 

Models that generate 3D spaces typically render them on the fly as users move through them, without producing a persistent world that can be explored later. A new model produces 3D worlds that can be exported and modified.

 

What’s new: World Labs launched Marble, which generates persistent, editable, reusable 3D spaces from text, images, and other inputs. The company also debuted Chisel, an integrated editor that lets users modify Marble’s output via text prompts and craft environments from scratch.

  • Input/output: Text, images, panoramas, videos, 3D layouts of boxes and planes in; Gaussian splats, meshes, or videos out.
  • Features: Expand spaces, combine spaces, alter visual style, edit spaces via text prompts or visual inputs, download generated spaces
  • Availability: Subscription tiers include Free (4 outputs based on text, images, or panoramas), $20 per month (12 outputs based on multiple images, videos, or 3D layouts), $35 per month (25 outputs with expansion and commercial rights), and $95 per month (75 outputs, all features)

How it works: Marble accepts several media types and exports 3D spaces in a variety of formats.

  • The model can generate a 3D space from a single text prompt or image. For more control, it accepts multiple images with text prompts (like front, back, left, or right) that specify which image should map to what areas. Users can also input short videos, 360-degree panoramas, or 3D models and connect outputs to build complex spaces.
  • The Chisel editor can create and edit 3D spaces directly. Geometric shapes like planes or blocks can be used to build structural elements like walls or furniture and styled via text prompts or images.
  • Generated spaces can be extended or connected by clicking on the target area.
  • Model outputs can be Gaussian splats (high-quality representations composed of semi-transparent particles that can be rendered in web browsers), collider meshes (simplified 3D geometries that define object boundaries for physics simulations), and high-quality meshes (detailed geometries suitable for editing). Video output can include controllable camera paths and effects like smoke or flowing water.
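To give a sense of what a Gaussian splat actually stores, here is a minimal per-particle record following the standard 3D Gaussian splatting parameterization from the literature; it is not Marble's actual export schema.

```python
# Minimal per-particle record in the spirit of standard 3D Gaussian splatting.
# This mirrors the common parameterization in the literature, not Marble's file format.
from dataclasses import dataclass
from typing import Tuple


@dataclass
class GaussianSplat:
    position: Tuple[float, float, float]          # particle center in world space
    scale: Tuple[float, float, float]             # per-axis extent of the anisotropic Gaussian
    rotation: Tuple[float, float, float, float]   # orientation as a quaternion (w, x, y, z)
    color: Tuple[float, float, float]             # RGB (real systems often store SH coefficients)
    opacity: float                                # used when alpha-blending overlapping particles


# A scene is a large collection of these records, rendered by projecting and blending them;
# a collider mesh, by contrast, stores vertices and triangle indices for physics checks.
splat = GaussianSplat((0.0, 1.2, -3.0), (0.05, 0.05, 0.02), (1.0, 0.0, 0.0, 0.0), (0.8, 0.7, 0.6), 0.9)
print(splat)
```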

Performance: Early users report generating game-like environments and photorealistic recreations of real-world locations.

  • Marble generates more complete 3D structures than depth maps or point clouds, which represent surfaces but not object geometries, World Labs said.
  • Its mesh outputs integrate with tools commonly used in game development, visual effects, and 3D modeling.

Behind the news: Earlier generative models can produce 3D spaces on the fly, but typically such spaces can’t be saved or revisited interactively; for instance, in October, World Labs introduced RTFM, which generates spaces in real time as users navigate through them. Marble stands out by generating spaces that can be saved and edited. Competing offerings from startups like Decart and Odyssey are available as demos, and Google’s Genie 3 remains a research preview.

 

Why it matters: World Labs founder and Stanford professor Fei-Fei Li argues that spatial intelligence — understanding how physical objects occupy and move through space — is a key aspect of intelligence that language models can’t fully address. With Marble, World Labs aspires to catalyze development in spatial AI just as ChatGPT and subsequent large language models ignited progress in text processing.

 

We’re thinking: Virtual spaces produced by Marble are geometrically consistent, which may prove valuable in gaming, robotics, and virtual reality. However, the objects within them are static. Virtual worlds that include motion will bring AI even closer to understanding physics.

 

Tags: AI Model Alert,Artificial Intelligence,Technology,

Model Alert... Open 3D Generation Pipeline -- Meta’s Segment Anything Model (SAM) image-segmentation model

See All on AI Model Releases

Open 3D Generation Pipeline

 

Meta’s Segment Anything Model (SAM) image-segmentation model has evolved into an open-weights suite for generating 3D objects. SAM 3 segments images, SAM 3D turns the segments into 3D objects, and SAM 3D Body produces 3D figures of any people among the segments. You can experiment with all three.

 

SAM 3: SAM 3 now segments images and videos based on input text. It retains the ability to segment objects based on input geometry (bounding boxes or points that are labeled to include or exclude the objects at those locations), like the previous version. 

  • Input/output: Images, video, text, geometry in; segmented images or video out
  • Performance: In Meta’s tests, SAM 3 outperformed almost all competitors on a variety of benchmarks that test image and video segmentation. For instance, on LVIS (segmenting objects from text), SAM 3 (48.5 percent average precision) outperformed DINO-X (38.5 percent average precision). It fell behind APE-D (53.0 percent average precision), which was trained on LVIS’ training set. 
  • Availability: Weights and fine-tuning code freely available for noncommercial and commercial uses in countries that don’t violate U.S., EU, UK, and UN trade restrictions under Meta license 

SAM 3D: This model generates 3D objects from images based on segmentation masks. By individually predicting each object in an image, it can represent the entire scene. It can also take in point clouds to improve its output.

  • Input/output: Image, mask, point cloud in; 3D object (mesh, Gaussian splat) out
  • Performance: Judging both objects and scenes generated from photos, humans preferred SAM 3D’s outputs over those of other models. For instance, when generating objects from the LVIS dataset, people preferred SAM 3D nearly 80 percent of the time, Hunyuan3D 2.0 about 12 percent of the time, and other models 8 percent of the time.
  • Availability: Weights and inference code freely available for noncommercial and commercial uses in countries that don’t violate U.S., EU, UK, and UN trade restrictions under Meta license

SAM 3D Body: Meta released an additional model that produces 3D human figures from images. Input bounding boxes or masks can also determine which figures to produce, and an optional transformer decoder can refine the positions and shapes of human hands.

  • Input/output: Image, bounding boxes, masks in; 3D objects (mesh, Gaussian splat) out
  • Performance: In Meta’s tests, SAM 3D Body achieved the best performance across a number of datasets compared to other models that take images or videos and generate 3D human figures. For example, on the EMDB dataset of people in the wild, SAM 3D Body achieved 62.9 Mean Per Joint Position Error (MPJPE, a measure of how far predicted joint positions deviate from ground truth; lower is better, see the sketch after this list) compared to the next-best model, Neural Localizer Fields, at 68.4 MPJPE. On FreiHAND (a test of hand-pose correctness), SAM 3D Body performed similarly to or slightly worse than models that specialize in estimating hand poses. (The authors claim those models were trained on FreiHAND’s training set.)
  • Availability: Weights, inference code, and training data freely available in countries that don’t violate U.S., EU, UK, and UN trade restrictions under Meta license
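The MPJPE figure cited above is straightforward to compute once you have predicted and ground-truth joints. The sketch below shows the plain version, leaving out the root-alignment and Procrustes variants some evaluations apply first.

```python
# Mean Per Joint Position Error (MPJPE): average Euclidean distance between
# predicted and ground-truth joints, typically reported in millimeters.
# This plain version omits the root-alignment / Procrustes variants some papers use.
import numpy as np


def mpjpe(predicted: np.ndarray, ground_truth: np.ndarray) -> float:
    """predicted and ground_truth have shape (num_joints, 3), in millimeters."""
    return float(np.linalg.norm(predicted - ground_truth, axis=1).mean())


pred = np.array([[0.0, 0.0, 0.0], [100.0, 0.0, 0.0], [200.0, 50.0, 0.0]])
gt = np.array([[0.0, 0.0, 5.0], [110.0, 0.0, 0.0], [195.0, 55.0, 0.0]])
print(f"MPJPE: {mpjpe(pred, gt):.1f} mm")  # lower is better
```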

Why it matters: This SAM series offers a unified pipeline for making 3D models from images. Each model advances the state of the art, enabling more-accurate image segmentations from text, 3D objects that human judges preferred, and 3D human figures that also appealed to human judges. These models are already driving innovations in Meta’s user experience. For instance, SAM 3 and SAM 3D enable users of Facebook Marketplace to see what furniture or other home decor looks like in a particular space.

 

We’re thinking:  At the highest level, all three models learned from a similar data pipeline: Find examples the model currently performs poorly on, use humans to annotate them, and train on the annotations. According to Meta’s publications, this process greatly reduced the time and money required to annotate quality datasets.
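The loop Meta describes is essentially an active-learning data engine. Here is a sketch of one round of that pattern; the function names and scoring rule are invented for illustration and are not Meta's actual pipeline.

```python
# One round of a "find hard examples -> human annotation -> retrain" data engine.
# This is the general pattern described above, not Meta's actual pipeline.
from typing import Callable, List, Tuple


def data_engine_round(model_score: Callable[[str], float],
                      unlabeled_pool: List[str],
                      annotate: Callable[[str], str],
                      budget: int) -> List[Tuple[str, str]]:
    """Pick the examples the current model handles worst and send them to annotators."""
    ranked = sorted(unlabeled_pool, key=model_score)   # lowest score = hardest example
    hard_examples = ranked[:budget]                    # spend the human-annotation budget here
    return [(x, annotate(x)) for x in hard_examples]   # labeled data for the next training round


# Toy usage with stand-in functions.
pool = ["easy scene", "cluttered scene", "occluded object"]
score = lambda x: {"easy scene": 0.9, "cluttered scene": 0.4, "occluded object": 0.2}[x]
print(data_engine_round(score, pool, annotate=lambda x: f"mask for {x}", budget=2))
```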

 

Tags: Technology,Artificial Intelligence,AI Model Alert,