GLM-5.2 vs Claude Opus

13 min read
James Daniel Whitford
Software engineer and technical writer

GLM-5.2 just came out, and it's another step forward for what open models can do.

Naturally, the internet freaked out. There's a lot of hype around it right now, and it can be hard to tell what the model actually is, how you can use it, and what it can and can't do.

This guide helps you navigate the hype. We'll show you what people are saying, the pros and the cons, then run our own vibe test pitting Claude Opus against GLM-5.2.

Here's a preview of the two games the models built. Both are browser games written from scratch, with no game engine or 3D rendering library like Three.js. The 3D models are provided by Kenney.

What Opus made

Opus's game, played start to finish

What GLM-5.2 made

GLM-5.2's game, played start to finish

What is GLM-5.2

GLM-5.2 is Z.ai's latest flagship model. It's open weights under an MIT license, so you can download it, run it yourself, or call it through Z.ai's API.

It's built for long-horizon tasks, the kind of long, multi-step coding-agent work that runs for hours. It ships with a 1M-token context window and two thinking effort levels, High and Max, that trade speed for capability.

Note: GLM-5.2 is text-only, not multimodal. It can't read images, so workflows built around screenshots or diagrams still need a model like Claude Opus.

In its announcement, Z.ai positions it roughly between Claude Opus 4.7 and 4.8 at similar token usage.

Pricing and access

Because it's open weights, GLM-5.2 is cheap. Through an API it costs a fraction of Opus, and you can run it yourself for free if you have the hardware.

Pricing, per 1M tokens (vendor docs):

| Model | Input | Cache read | Output |
| --- | --- | --- | --- |
| Claude Opus 4.8 | $5 | $0.50 | $25 |
| GLM-5.2 | $1.40 | $0.26 | $4.40 |

On output tokens, GLM-5.2 is less than a fifth the price of Opus.
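The output price is only part of a real bill, so it can help to price a whole job. The sketch below applies the list prices quoted above to a hypothetical workload; the token counts and the `job_cost` helper are made up for illustration, and actual bills depend on how your provider counts cached and fresh input.

```python
# Per-1M-token list prices in USD, copied from the pricing figures above.
PRICES = {
    "Claude Opus 4.8": {"input": 5.00, "cache_read": 0.50, "output": 25.00},
    "GLM-5.2": {"input": 1.40, "cache_read": 0.26, "output": 4.40},
}


def job_cost(model: str, input_tokens: int, cache_read_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one job at list prices (hypothetical helper)."""
    p = PRICES[model]
    total = (
        input_tokens * p["input"]
        + cache_read_tokens * p["cache_read"]
        + output_tokens * p["output"]
    )
    return total / 1_000_000


# Hypothetical agent run: 2M fresh input, 10M cached input, 200k output tokens.
workload = dict(input_tokens=2_000_000, cache_read_tokens=10_000_000, output_tokens=200_000)

for model in PRICES:
    print(f"{model}: ${job_cost(model, **workload):.2f}")

# Ratio of output prices only, the comparison made in the text.
output_ratio = PRICES["GLM-5.2"]["output"] / PRICES["Claude Opus 4.8"]["output"]
print(f"GLM-5.2 output price is {output_ratio:.1%} of Opus's")
```

For this particular mix, GLM-5.2 comes to roughly a third of Opus's cost rather than a fifth, because cache reads dominate the token count and the cache-read prices are much closer together than the output prices.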

The weights are on Hugging Face and ModelScope under an MIT license, with no regional restrictions. You can serve it locally with frameworks like vLLM, SGLang, or Transformers.

The benchmarks

Z.ai published these benchmark numbers alongside the release, on its model card.

* = Anthropic self-reported.

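| Benchmark | GLM-5.2 | Opus 4.8 | GPT-5.5 | Gemini 3.1 Pro |
| --- | --- | --- | --- | --- |
| Reasoning | | | | |
| HLE | 40.5 | 49.8* | 41.4* | 45 |
| HLE (w/ tools) | 54.7 | 57.9* | 52.2* | 51.4* |
| AIME 2026 | 99.2 | 95.7 | 98.3 | 98.2 |
| GPQA-Diamond | 91.2 | 93.6 | 93.6 | 94.3 |
| IMOAnswerBench | 91.0 | 83.5 | 81 | |
| Coding | | | | |
| SWE-bench Pro | 62.1 | 69.2 | 58.6 | 54.2 |
| NL2Repo | 48.9 | 69.7 | 50.7 | 33.4 |
| DeepSWE | 46.2 | 58 | 70 | 10 |
| ProgramBench | 63.7 | 71.9 | 70.8 | 39.5 |
| Terminal Bench 2.1 (Terminus-2) | 81.0 | 85 | 84 | 74 |
| Terminal Bench 2.1 (best harness) | 82.7 | 78.9 | 83.4 | 70.7 |
| SWE-Marathon | 13.0 | 26.0 | 12.0 | 4.0 |
| Agentic | | | | |
| MCP-Atlas (public) | 76.8 | 77.8 | 75.3 | 69.2 |
| Tool-Decathlon | 48.2 | 59.9 | 55.6 | 48.8 |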

An independent run by Artificial Analysis broadly agrees:

  • Intelligence Index v4.1: 51, the highest among open-weights models (MiniMax-M3 44, DeepSeek V4 Pro 44, Kimi K2.6 43).
  • TerminalBench v2.1: 78%, against 81.0 (Terminus-2) and 82.7 (best harness) on the model card, which use different harnesses.
  • Output tokens per task: ~43k, up from 26k for GLM-5.1.

These benchmarks span three areas: reasoning (hard math and science exams), coding (fixing bugs and building whole projects), and agentic tool use (calling and chaining real tools). For what each one tests, see the benchmark notes at the end.
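To see where the gaps concentrate, the sketch below averages GLM-5.2 minus Opus 4.8 over a subset of the model-card scores above, grouped by area. Treat it as a rough summary only: averaging points across unrelated benchmarks is not a meaningful score, and the choice of rows is ours.

```python
from collections import defaultdict

# (area, benchmark, GLM-5.2, Opus 4.8), a subset of the model-card scores above.
SCORES = [
    ("Reasoning", "HLE", 40.5, 49.8),
    ("Reasoning", "HLE (w/ tools)", 54.7, 57.9),
    ("Reasoning", "AIME 2026", 99.2, 95.7),
    ("Reasoning", "GPQA-Diamond", 91.2, 93.6),
    ("Coding", "SWE-bench Pro", 62.1, 69.2),
    ("Coding", "NL2Repo", 48.9, 69.7),
    ("Coding", "ProgramBench", 63.7, 71.9),
    ("Coding", "SWE-Marathon", 13.0, 26.0),
    ("Agentic", "MCP-Atlas (public)", 76.8, 77.8),
    ("Agentic", "Tool-Decathlon", 48.2, 59.9),
]

# Collect the per-benchmark gap (GLM-5.2 minus Opus 4.8) under each area.
gaps = defaultdict(list)
for area, _name, glm, opus in SCORES:
    gaps[area].append(glm - opus)

# Mean gap per area; negative means GLM-5.2 trails Opus 4.8.
mean_gap = {area: sum(values) / len(values) for area, values in gaps.items()}

for area, gap in mean_gap.items():
    print(f"{area}: {gap:+.1f} points (GLM-5.2 minus Opus 4.8)")
```

On these rows, GLM-5.2 trails Opus most in coding and least in reasoning, where AIME 2026 is the one benchmark it leads.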

It can be hard to tell what's real and what isn't online these days. So we compiled a couple of real-world examples to give you the general vibe of what people are saying about GLM-5.2.

"It keeps up with the top closed models"

This tweet compares GLM-5.2 against Claude Opus 4.8 (high), Claude Fable 5, and GPT-5.5 (high). The video shows each model rendering a 3D scene and building a few assets from scratch.

The takeaway people draw is that an open model now lands near the best closed models in the world.

But this is also the kind of thing that shades into astroturfing. The constraints aren't clear, and it's not obvious the task really pits the models against each other.

So treat it as a vibe, not a result. It's a basic demo that impresses on sight, with no technical scrutiny required.

A lot of what you'll see online is exactly this.

"This model is insane at design"

Another common sentiment is that it's strong at user-interface design, on par with the top closed models. This tweet had GLM-5.2 and Opus 4.8 each build a landing page.

The two are hard to tell apart. Design is subjective, so have a look yourself.

It also flags the price: the GLM build cost $0.06 against Opus's $0.49, roughly an eighth of the cost, and the tweet says it was faster too. That cheap-and-open angle is a big part of why people are hyped.

"It can't read images"

Not all the talk is positive. This tweet points out that GLM-5.2 can't read an attached image, because it isn't multimodal.

Models like Claude Opus take images natively, which matters for workflows built around screenshots, diagrams, or design mockups.

We ran our own vibe test

To cut through the vibes, we ran our own test. We gave Opus 4.8 and GLM-5.2 the same one-shot prompt: build a 3D platformer game from scratch, in raw WebGL, with no game engine or 3D library.

To finish, each model had to build:

  • A 3D engine and renderer in raw WebGL, no Three.js or any library.
  • A loader for the supplied 3D character and world models.
  • A character that runs and jumps around an arena, with gravity and collision.
  • A follow camera and keyboard controls.
  • The whole thing runnable in the browser with one command.

That stresses a few capabilities at once:

  • Long-horizon work: holding a layered, multi-file project together over many steps.
  • Hard reasoning and code taste: getting the subtle engine internals right, the parts that look fine but quietly break.
  • Correctness over looks: whether the rendering and physics actually work on screen, not just a pretty page.

Both got the same prompt, the same assets, and one attempt with no hints. The 3D models are free CC0 assets from Kenney.

Note: We ran Opus 4.8 with extended thinking on high, and GLM-5.2 with thinking set to High, so both models ran at a matching effort level. We did not use GLM-5.2's Max setting.

How long it took, and what it cost

Opus 4.8 built in Claude Code; GLM-5.2 built in Pi over OpenRouter. Here's how the two runs compared on time, tokens, and cost.

Side-by-side timelapse. Opus finishes at about 34:00, GLM-5.2 at about 1:11:00.

| Metric | Opus (Claude Code) | GLM-5.2 (Pi/OpenRouter) |
| --- | --- | --- |
| Wall-clock build time | 33m 30s | 1h 10m 40s |
| Output tokens | 216,809 | 131,000 |
| Peak context window | 19% of 1M | 16% of 1M |
| Tool calls | 153 | 128 |
| Cost | ~$21.92 (estimate, list pricing) | $5.39 (real billed) |

Opus finished in under half the time. GLM-5.2 cost about a quarter as much, though the Opus figure is an estimate from list pricing rather than a bill.
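For anyone who wants the ratios behind that summary, here is the arithmetic on the figures from the run. The variable names are ours, and the tokens-per-second number is output tokens divided by wall-clock time, so it includes tool calls and waiting and should not be read as decoding speed.

```python
# Figures from the two runs reported above.
opus_seconds = 33 * 60 + 30            # 33m 30s
glm_seconds = 1 * 3600 + 10 * 60 + 40  # 1h 10m 40s

opus_output_tokens = 216_809
glm_output_tokens = 131_000

opus_cost_usd = 21.92  # estimate at list pricing
glm_cost_usd = 5.39    # billed

# How much longer GLM-5.2 took, and how its cost compares.
time_ratio = glm_seconds / opus_seconds
cost_ratio = glm_cost_usd / opus_cost_usd

# Output tokens per wall-clock second (includes tool time, not pure decoding).
opus_tokens_per_second = opus_output_tokens / opus_seconds
glm_tokens_per_second = glm_output_tokens / glm_seconds

print(f"GLM-5.2 took {time_ratio:.2f}x as long as Opus")
print(f"GLM-5.2 cost {cost_ratio:.0%} of the Opus estimate")
print(f"Opus: {opus_tokens_per_second:.0f} output tokens per wall-clock second")
print(f"GLM-5.2: {glm_tokens_per_second:.0f} output tokens per wall-clock second")
```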

Playtesting both games

We played both games start to finish. Here's how each one held up.

Opus

Opus's game plays well.

Opus, start to finish.

From the playthrough:

  • The camera and controller work.
  • One obstacle sits off the player's path, which is a little odd.
  • The spike hazard kills the player, so that logic is correct.
  • It looks good overall, and you can reach the flag and win. There's a real win condition.

The animations look good and run smoothly, with textures applied properly.

Opus: animations, textures, controller working.

GLM-5.2

GLM-5.2's game is rougher.

GLM-5.2, start to finish.

From the playthrough:

  • It doesn't look as good overall.
  • The character is missing some of its materials.
  • The spike hazard doesn't kill the character.
  • Reaching the flag does nothing. There's no win condition.

So it's not that great. It did nail one thing, though: the spring.

GLM-5.2 spring launch.

You can jump on the spring and launch up to the next platform.

How each model checked its own work

The task told both models to verify their own work before stopping. They differed in one way: Opus is multimodal and can read images, GLM-5.2 is text-only.

Opus could see its output

Opus is multimodal. Its verification test rendered the game and saved a screenshot "for visual confirmation," and the final result shows a clean HUD with the debug readouts cleared.

Opus's screenshot: clean HUD, debug readouts removed.

GLM-5.2 checked the numbers

GLM-5.2 is text-only. It verified through console logs and an on-screen debug readout: FPS, position, grounded state, animation, coins, deaths.

GLM-5.2's final screenshot: the debug overlay is still on. It never saw the frame.

The numbers all checked out, so it stopped. It couldn't tell the debug text was still sitting over the game.

The trade-off

A model that can read images can review its own visual output. It can catch problems that never reach the logs: leftover debug text, bad framing, a model rendered gray instead of textured.

A text-only model checks its work through numbers and console output. That covers non-visual logic, but it misses anything you have to look at, which is how GLM-5.2 left its debug overlay in the final build.

The bugs

Both games had bugs. Here's what broke in each.

GLM-5.2

GLM-5.2's bugs were more frequent and more visible, and several broke fundamentals.

The character faces the wrong way

It walks in the right direction, but the model is turned backwards the whole time.

GLM-5.2 character walk and facing bug

Missing textures and a disappearing head

The character renders flat gray instead of textured, and its head vanishes whenever the camera moves.

GLM-5.2 animation controller bugs

The death spike doesn't kill

The character lands right on a spike hazard and nothing happens. No death, no reset.

GLM-5.2 spike collision bug

Opus

Opus's bugs were fewer and subtler: edge cases rather than broken basics.

Standing on thin air

The character can sit beside a platform, in mid-air, without falling. A collision edge case.

Opus coyote-time bug, character stands beside platform without falling
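The clip's label calls this a coyote-time bug. Coyote time is a common platformer technique: for a short grace window after the character leaves a ledge, a jump is still allowed. We did not trace the cause in Opus's code, so the sketch below only shows the technique, in Python rather than the game's own language. Every name and constant in it is hypothetical.

```python
from dataclasses import dataclass

GRAVITY = -30.0    # units per second squared (hypothetical value)
COYOTE_TIME = 0.1  # seconds of jump grace after leaving a ledge (hypothetical value)


@dataclass
class Player:
    y: float = 1.0
    vy: float = 0.0
    time_since_ground: float = 0.0


def step(p: Player, on_platform: bool, dt: float) -> None:
    """Advance one frame; on_platform is the result of the real collision query."""
    if on_platform:
        p.time_since_ground = 0.0
        p.vy = 0.0
    else:
        p.time_since_ground += dt
        # Gravity depends on real contact, not on the grace window.
        p.vy += GRAVITY * dt
        p.y += p.vy * dt


def can_jump(p: Player) -> bool:
    """Jumping stays allowed for COYOTE_TIME seconds after leaving the ground."""
    return p.time_since_ground <= COYOTE_TIME


# The character steps off a ledge and falls for three 50 ms frames.
p = Player()
for frame in range(3):
    step(p, on_platform=False, dt=0.05)
    print(frame, can_jump(p), round(p.y, 3))
```

The design point is that the grace window should only extend the jump, never the support. If gravity were gated on `can_jump` instead of on real contact, or if a generous ground check kept resetting the timer next to a platform, the character could hover the way it does in the clip.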

Winning from too far away

The win triggers while the character is still well short of the flag.

Opus early-finish bug, win triggered too far from the flag
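We did not inspect how Opus's game detects the win, so the following is only a guess at the kind of check involved. A common approach is a distance test against a trigger radius, and a radius that is too generous fires early. The function name, coordinates, and radii below are hypothetical, and the sketch is in Python rather than the game's own language.

```python
import math


def reached_flag(player_xz: tuple[float, float], flag_xz: tuple[float, float], trigger_radius: float) -> bool:
    """True when the player is within trigger_radius of the flag on the ground plane."""
    return math.dist(player_xz, flag_xz) <= trigger_radius


flag = (10.0, 0.0)
player = (6.0, 0.0)  # four units short of the flag

# A tight radius waits until the player is actually at the flag.
print(reached_flag(player, flag, trigger_radius=1.0))

# An oversized radius declares the win from well short of it.
print(reached_flag(player, flag, trigger_radius=5.0))
```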

The verdict

So, is the hype real? Mostly.

GLM-5.2 is a genuinely strong open model, at a fraction of Opus's price. For a lot of work, that combination is hard to beat.

But it isn't Opus. In our test, Opus was faster, shipped a cleaner and more correct game, and could check its own work by looking at it.

GLM-5.2 was far cheaper, but rougher, and it's text-only.

Use GLM-5.2 when cost and openness matter and the work is mostly text and logic. Use Opus when correctness, polish, and visual judgment matter, and you'll pay for it.

What the benchmarks measure

HLE

Humanity's Last Exam. Thousands of expert-level questions across many subjects, built to be extremely hard.

HLE (w/ tools)

The same exam, but the model can use tools like web search and code.

AIME 2026

A hard American high-school math competition.

GPQA-Diamond

Graduate-level science questions written so they can't be answered with a quick search.

IMOAnswerBench

Math-olympiad-style problems, scored on the final answer.

SWE-bench Pro

Fixing real issues in real codebases, often with changes across several files.

NL2Repo

Building a whole, runnable codebase from a single written spec.

DeepSWE

Agentic software-engineering tasks in a sandboxed container with no internet.

ProgramBench

Rebuilding a full program from only its compiled binary and documentation, with no source or spec given.

Terminal Bench 2.1

Tasks completed through a real terminal. The two rows use a fixed harness (Terminus-2) and each model's best harness.

SWE-Marathon

Twenty ultra-long-horizon engineering tasks, each running for hours.

MCP-Atlas

Tool-use tasks run against real MCP servers, each needing several tool calls.

Tool-Decathlon

Long-horizon tasks across many real apps, each needing a long chain of tool calls.

About the author

James Daniel Whitford is a software engineer and technical writer at Ritza. He writes about developer tooling, AI agents, and full-stack web development, and contributes hands-on tool comparisons to TechStackups.