
Can Claude Opus 4.8 Be Used by Technical Writers to Evaluate Their Own Work?
Agentic AI can do a lot of things, but it’s proven consistently mediocre when it comes to writing and editing. And yet, it remains a tool that could ease the workflows of overworked and overburdened technical writers: some may not have the time to edit their work extensively, some may not speak English as a first language, and some are simply there for the code and not the prose.
The goal of this article, then, is to ascertain whether the ever-lauded Claude Opus 4.8 is capable of accomplishing editing tasks at a level acceptable to a technical editor.
I’ve put two articles through Claude Code: What Is Agent Experience and Why Should You Care? and How to Do an AX Audit, which I’ll refer to as Article 1 and Article 2 respectively. For each, I gave Claude Code a series of identical prompts following the questions a human editor might ask.
Note: All testing in this article was done with Claude Opus 4.8.
Also, for the purpose of keeping this analysis somewhat digestible, I’ve used the abridged version of this prompting process, and I haven’t included my own edits of the two articles. Rather, I’ve graded Claude Code’s edits, giving them a score between 1 and 3 as per the rubric below:
| 1 | 2 | 3 |
|---|---|---|
| Incorrect and/or insubstantial | Partially correct and/or missing nuance | Correct and thorough |
Step 1: The 3-stage pass
The first thing I asked Claude to do was go over the article in 3 passes to flag structural, logical, and grammatical/spelling issues (in that order). All three passes were prompted in the same session for a single article.
The reason I started with this was that these are fairly straightforward evaluations: you can tell right off the bat whether an article contains a structural or logical issue, and a spelling mistake is a spelling mistake.
Here’s what I got.
Structural Issues
Prompt: Ignore logical errors, but identify any structural problems with this article.
I graded the corrections Claude presented for Articles 1 and 2 as follows:
| | Total Corrections | Grade 1 | Grade 2 | Grade 3 |
|---|---|---|---|---|
| Article 1 | 6 | 2 | 1 | 3 |
| Article 2 | 5 | 0 | 1 | 4 |
| Percentage | 100% | 18.18% | 18.18% | 63.64% |
So I’d put Claude’s structural editing at roughly 60/40: about 64% of its suggestions were correct and thorough, while the remaining 36% either missed the nuance or were outright duds.
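For anyone who wants to check the arithmetic, the percentage row can be reproduced with a few lines of Python. This is only a sketch: the `structural` dictionary and the `grade_percentages` helper are names made up for illustration, and the counts are copied from the table above.

```python
from collections import Counter

# Grade counts per article for the structural pass, copied from the table above.
# Keys are rubric grades (1, 2, 3); values are the number of corrections.
structural = {
    "Article 1": {1: 2, 2: 1, 3: 3},
    "Article 2": {1: 0, 2: 1, 3: 4},
}


def grade_percentages(counts_by_article):
    """Return each grade's share of all corrections, as a percentage rounded to 2 places."""
    totals = Counter()
    for counts in counts_by_article.values():
        totals.update(counts)  # adds the per-article counts grade by grade
    n = sum(totals.values())
    return {grade: round(100 * totals[grade] / n, 2) for grade in sorted(totals)}


shares = grade_percentages(structural)
print(shares)
```

The same helper should work for the logical and spelling/grammar passes if their counts are swapped in.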
I won’t include the exhaustive list here, but here are a few sample corrections representative of my grading rationale:
Article 1
- Organizing model revealed too late
- Problem: The article is built on three stages — discovery → onboarding → usage — but only names them in the final section.
- Fix: Introduce the three-stage frame in the intro so it pays off at the end.
Grade: 3
This suggestion nails the primary structural issue in this article – a lack of signposting for the reader in the introduction.
- Two threads interleaved confusingly
- Problem: The Skyscanner and Google tasks are each introduced in one section and resolved in another, forcing the reader to hold both half-finished.
- Fix: Run each task start-to-finish in its own block.
Grade: 2
This suggestion fails to consider that there is a purpose behind the two threads being woven together. The author had the Skyscanner and Google tasks running concurrently; the results are then presented chronologically rather than discretely.
- Orphan heading (line 53)
- Problem: A single H4 ("Do agents know how much you cost?") with no siblings.
- Fix: Promote to H3 or fold into the surrounding prose.
Grade: 3
...yeah, that’s fair.
Article 2
- Per-stage template applied inconsistently (lines 228, 318, ~221)
- Problem: The four stage sections don't follow the same shape. Stages 3 and 4 open with an early "Kestra scores X/4 for…" line that Stages 1 and 2 lack, and the Onboarding stage is missing the closing ax-section-scorecard that every other stage has.
- Fix: Pick one template for all four — drop the early score lines (228, 318) and add the missing closing card to Onboarding.
Grade: 3
A flag worth noting. Inconsistency in scoring could throw the reader for a loop.
- Duplicated scorecards have drifted apart (line 60 vs line 373)
- Problem: The top and bottom scorecards should hold identical data but don't — the summary row is labeled "Overall" in one and "Average" in the other, and several card descriptions are reworded between them.
- Fix: Make the two scorecards byte-identical (reconcile the label to "Overall"), or trim the top one to a teaser so there's a single source of truth.
Grade: 2
The final judgement is correct, but Claude fails to note that the final scorecard uses the word “Average” because it falls under the heading of “Overall scorecard: Kestra.” Using “Overall” in the scorecard here would be redundant. That said, both scorecards should, ideally, be consistent in the term they use.
- Missing paragraph break creates a run-on (lines 219–220)
- Problem: The API-key paragraph runs directly into the "Kestra scores 3/4 for onboarding" sentence with no blank line, so they render as one paragraph.
- Fix: Insert a blank line so the wrap-up score is its own paragraph (and add the closing score card here, per Issue 1).
Grade: 3
Correct.
Logical Issues
Prompt: Ignore structural errors, but identify any logical errors in this article.
As with the previous section, I’ve included a few standout examples and my grading rationale.
| | Total Corrections | Grade 1 | Grade 2 | Grade 3 |
|---|---|---|---|---|
| Article 1 | 6 | 2 | 1 | 3 |
| Article 2 | 7 | 3 | 2 | 2 |
| Percentage | 100% | 38.46% | 23.08% | 38.46% |
The split here is closer to 40/60: roughly 38% of Claude’s corrections on logic were airtight, while the remaining 62% were incomplete or useless.
Article 1
- Claim contradicts its own examples (line 23)
- Problem: "...not just for coding" — but the examples given (starting projects, adding integrations) are coding.
- Fix: Name the real distinction (choosing which tools, not just writing code).
Grade: 1
Not an issue. The umbrella phrasing of “selling a platform, technical product, library, framework, or anything else” still leaves room for non-coding examples.
- False causal link (lines 17–19)
- Problem: "Agents... don't get bored. So for a human reader, your quickstart needs to be short" — the human conclusion doesn't follow from the fact about agents.
- Fix: Replace "So" with a contrast ("Whereas for a human reader...").
Grade: 3
Correct.
- Conclusion contradicts the narrative (line 175)
- Problem: "the ease with which I built POCs" — but the story describes 45-min CAPTCHA fights, timeouts, and one run that gave up.
- Fix: Acknowledge the friction, then land the favorable conclusion as earned.
Grade: 2
This is a reasonable flag – but it doesn’t account for the possibility that this is a normal amount of friction; in fact, it may even have been less than the author would normally encounter. Adding the word “relative” before “ease” would be sufficient.
Article 2
- Contradiction: is the community-edition API usable or not? (line 219 vs line 272)
- Problem: Onboarding says community users "must use basic authentication instead" (implying the API works without a token), but Integration says "API access is not included in the open source edition" — while describing the agent using basic auth to drive that very API.
Grade: 1
Claude has misidentified a conflict; when reading the article with the larger context in mind, we find that the second statement follows on logically from the first.
- "Works as expected" contradicts the unmet spec (lines 270, 274)
- Problem: The text says the agent ignored the requested SQLite storage and hacked in Postgres, then calls the result one that "works as expected" — but it didn't do what was asked.
- Fix: State the deviation plainly: "runs end to end and produces the chart, but doesn't match the spec — it used Postgres instead of the requested SQLite."
Grade: 2
Claude has correctly identified an issue (a claim seemingly contradicts the result) but provided the wrong fix. Whether the result was one that “works as expected” or not is up to the author’s discretion; what the author needs to do here is clarify what the expectation was.
- Integration test violates the article's own methodology (lines 262, 268)
- Problem: The methodology says keep the test as close to "hello world" as possible, but the actual task (scheduled scrape + persistent cross-run storage + accumulating long-term chart) is a multi-stage pipeline well beyond that.
- Fix: Either soften the rule to allow a "deliberately slightly-beyond-trivial" task and note the test was chosen that way, or swap in a genuine hello-world task.
Grade: 1
A case of misinterpreted hyperbole. The “hello world” example is used to refer to a simple, easily repeatable task; Claude, however, has interpreted that literally to mean creating a task as close as possible to displaying “hello world” text. It’s not that the correction is entirely off-base, but that it fails to account for the reader understanding a loose comparison.
Spelling & Grammar Issues
Prompt: Identify any spelling and grammar errors in the article.
| | Total Corrections | Grade 1 | Grade 2 | Grade 3 |
|---|---|---|---|---|
| Article 1 | 5 | 0 | 0 | 5 |
| Article 2 | 7 | 1 | 2 | 4 |
| Percentage | 100% | 8.33% | 16.67% | 75% |
Claude is strongest when it has to identify spelling and grammar errors. (Use with caution though – double-check everything yourself.)
Article 1
- No misspelled words found. The issues are grammar, punctuation, and agreement.
Grade: 3
Correct, and points for noting that there weren’t any misspelled words.
- Missing period (line 27)
- Problem: The third bullet lacks the closing period the other two have.
- Fix: Add a period.
Grade: 3
Correct.
- Inconsistent singular/plural agreement for "Steel" (lines 73, 115, 165, 175)
- Problem: Switches between "Steel has / its" and "they feed / they got." Line 115 mixes both in one sentence.
- Fix: Treat "Steel" as singular ("it") throughout.
Grade: 3
Correct.
- Minor / stylistic
- "et al" (line 23) conventionally takes a period: "et al."
- "One more warning..., a registration form..., and then I'm..." (line 73) is a comma-spliced list — works as deliberate rapid-fire, tighten only if desired.
Grade: 3
Correct.
Article 2
- Wrong preposition (line 270)
- Problem: "friction with attempting to set up a local SQLite file" is non-idiomatic.
- Fix: "friction in attempting to set up…" (or "friction when it attempted to…").
Grade: 1
This is a non-issue – both “with” and “in” would be correct. However, the real issue in this sentence wasn’t flagged – “some significant friction” should have been corrected to either “some friction” or “significant friction.”
- Broken parallelism (line 303)
- Problem: "failed attempts in working within the Docker Compose setup, and hacking the image output…" — the two list items don't parallel.
- Fix: Repeat the preposition — "in working within… and in hacking…" — or recast both as gerunds.
Grade: 3
Correct.
- Inconsistent capitalization of a product tier (line 410)
- Problem: "Enterprise paywall" and "enterprise one" capitalized differently in the same sentence.
- Fix: Use "Enterprise" consistently.
Grade: 3
Correct.
- Word repetition (lines 270, 303) — stylistic, optional
- Problem: "significant friction" appears twice, and "significant" is used twice within line 303.
- Fix: Vary the wording in one instance.
Grade: 3
Correct. (Once again noting that “some significant” wasn’t flagged.)
Step 2: The qualitative questions
The next few prompts were aimed at evaluating whether the technical article accomplished the following goals:
- Does it teach the reader something?
- Does it make an impact on the reader?
I could just have plugged those exact questions into Claude, but then the result would vary every session. So, first, I set parameters to standardise Claude’s answer (as far as is possible with a glorified magic 8-ball, at least), asking it sub-questions that would, in turn, answer that overarching question.
Does the article teach the reader something?
I figured that, in order to teach the reader something, the article needs to contain a throughline, repeatable steps, and a clear, provable conclusion. If any of those elements is missing, I would argue that there isn’t enough information to teach the reader effectively.
I used the following three prompts (each in a fresh session) to ascertain this for each article:
Does this article have a clear narrative throughline?
Can the reader replicate any experiments or tasks performed in this article?
Does this article contain a final stance (or an explanation for the absence of one)?
Once I had the answers to each of these, I pasted those answers (and their corresponding questions) back into Claude and asked it the final question:
Based on these conclusions, does this article teach the reader something?
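The paste-back step can be sketched in Python as below. This is a rough illustration rather than the exact text I sent: the `build_final_prompt` helper, the "Question:/Answer:" labels, and the placeholder answers are all assumptions made for the example.

```python
def build_final_prompt(qa_pairs, final_question):
    """Combine the sub-question answers and the final question into one prompt."""
    sections = []
    for question, answer in qa_pairs:
        # The "Question:/Answer:" labels are an assumed layout, not a required format.
        sections.append(f"Question: {question}\nAnswer: {answer}")
    return "\n\n".join(sections) + f"\n\n{final_question}"


sub_questions = [
    "Does this article have a clear narrative throughline?",
    "Can the reader replicate any experiments or tasks performed in this article?",
    "Does this article contain a final stance (or an explanation for the absence of one)?",
]

# Placeholder answers: in practice these are pasted in from each fresh session.
answers = ["<answer 1>", "<answer 2>", "<answer 3>"]

final_prompt = build_final_prompt(
    list(zip(sub_questions, answers)),
    "Based on these conclusions, does this article teach the reader something?",
)
print(final_prompt)
```

Keeping each question next to its answer means the final prompt carries its own context, which matters because the sub-questions were each asked in separate sessions.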
For both articles, I’ve provided the summary of the answers below:
| Question | Article 1 | Article 2 |
|---|---|---|
| Does this article have a clear narrative throughline? | There is a throughline — two of them, actually, and that's the problem. The conceptual spine (AX = discovery + onboarding + usage; test your own product) and the narrative spine (my journey discovering and adopting Steel) run in parallel but the narrative one takes over by the midpoint. It holds together because Steel is a consistent anchor, but a reader looking for the "what should I do" payoff has to dig it out from under the Steel adventure. | The article has a clear narrative throughline: it moves cleanly from problem to method to a Kestra worked example, then closes with a verdict and per-stage fixes, bookended by a scorecard shown up front and again at the end. Two soft spots slightly weaken it: first, the intro promises measuring AX "objectively," but the actual method is a human-judged qualitative 1–4 rubric — grounded, but not objective as implied, so the framing overpromises; second, the stage rhythm is uneven. |
| Can the reader replicate any experiments or tasks performed in this article? | A reader can absolutely replicate the method — play-acting as a developer, asking an agent discovery/pricing questions, then running the quickstart and POC prompts (which are all quoted). What they cannot replicate is the exact outcomes (which products get recommended, whether CAPTCHAs get solved, timings), because those depend on model state, live websites, and Steel's evolving docs — all of which the article itself acknowledges are fast-moving. | A reader can replicate the process on any product, and can re-run the integration and onboarding tasks against Kestra fairly faithfully. They cannot reproduce the discoverability numbers (full prompt set not published, plus model variance) or run the exact integration flow (YAML truncated). |
| Does this article contain a final stance (or an explanation for the absence of one)? | Yes — the article lands a clear final stance, delivered in two places. The explicit conclusion is the closing "Predictions" section. A secondary, personal stance appears in the Steel case study. | Yes. The article reaches and states a clear final stance. It also backs the stance with a closing analysis, plus concrete recommendations for each stage. |
| Based on these conclusions, does this article teach the reader something? | It teaches a framework (the three stages) and a disposition (evaluate products the way an agent would, invest in genuine quality over GEO gaming) — both of which land. What it does not teach is a repeatable procedure; the "here's the exercise to run on your own product" promise is demonstrated on Steel but never handed to the reader as steps. So: yes, the reader leaves knowing something real and reframing how they think about docs/onboarding — but they leave inspired and reframed more than equipped. | It teaches a method well, teaches a case study well, and slightly over-promises on objectivity while under-delivering on reproducing one of its four stages. A motivated reader finishes it able to run their own audit — which is the main thing it sets out to teach. |
According to Claude, both articles succeed in teaching the reader something. I would rate Claude’s ability to answer this question (as long as it’s given the necessary context) 3/3. It tells you what the article’s methodology of teaching is, whether any informational components are lacking, and how to make any necessary fixes.
Does this article make an impact on the reader?
This is a far trickier question than any of the previous ones. How do we define what “impact” is? How do we replicate these results every single time?
We can’t, really, so the next-best thing is to define parameters you would be comfortable measuring the impact of your technical articles by. These are the ones I’ve chosen:
- The reader’s mind needs to be changed (or at the very least influenced) in some way.
- The author’s voice needs to be strong enough to make the overarching point persuasively.
For a human, these questions are far easier to answer (if we discount subjective judgement). For the LLM, we’ll need to patchwork more questions together.
I used the following four prompts (each in a fresh session) to ascertain this for each article:
What is this article teaching the reader that they are probably wrong (or right) about?
What does the reader’s journey look like from start to finish, summarised in 100 words?
Does the author maintain continuity in the points they make i.e. can the reader trust the author to know their own mind?
Grade the persuasiveness of this article from 1 (weak) to 5 (very persuasive).
The process was the same for this question: once I had the answers to each of these, I pasted those answers (and their corresponding questions) back into Claude and asked it the final question:
Based on these conclusions, does this article make an impact on the reader? Evaluate memorability, persuasion-to-action, and behavior change, and identify any fixes needed.
Once again, I’ve provided the summary of the answers below:
| Question | Article 1 | Article 2 |
|---|---|---|
| What is this article teaching the reader that they are probably wrong (or right) about? | The durable, correct lesson is the prediction in the last paragraph: in 2026, companies that invest in genuine quality (thorough docs, frictionless signup, organically-earned mentions) win, and shortcut-takers lose. The thing readers should be most skeptical of is treating one delightful demo as proof of a measurement methodology — the "how to test your AX" section is a vibe-check, not a rigorous framework, and the article is honest enough to call it "get an initial feel." | Probably right: agents now matter as evaluators, lean on training-data priors that favor incumbents, hallucinate confidently about your docs, and won't find your agent tooling unprompted. Probably overselling: the rubric's precision and objectivity (single model, prompt-based, "rounded up from 2.75"), the "less friction is always better" lens it concedes then ignores, and easy fixes whose impact is small versus the unactionable root cause. |
| What does the reader's journey look like from start to finish, summarised in 100 words? | The article defines Agent Experience (AX) and argues why it matters: agents now discover, sign up for, and integrate tools, so products must be findable, frictionless, and correctly usable — or lose out. It then guides readers through testing their own AX via a hands-on case study with headless-browser tools, role-playing as a developer using an agent. The journey moves through three stages — discovery (finding Steel), onboarding (smooth sign-up), and usage (escalating to CAPTCHA-solving) — surfacing quirks like ignored session limits and agents upselling paid plans. It closes with 2026 predictions favoring quality docs over shortcuts. | You open with a hook from a prior piece: if AI agents can't find and use your platform, you're invisible. The natural question — "how do I measure that?" — is posed, and the naive answer (just ask an agent) is debunked. You're then given the four-stage framework (Discoverability, Onboarding, Integration, Agent Tooling) and a 1–4 scoring scale. A topline Kestra scorecard previews the verdict. Each stage then unfolds in turn: what it measures, how to test it, and Kestra's worked result. You close with an overall scorecard, fix recommendations, and a Ritza CTA. |
| Does the author maintain continuity in the points they make i.e. can the reader trust the author to know their own mind? | The reader can trust this author. The framework is announced, used, and recalled; early assertions return as later payoffs; and he keeps his evaluative lens steady even when his own demo underperforms. The only thing a sharp editor might flag is the tonal gap between the high-stakes intro ("game over", "throw the whole project away") and the casual, exploratory romp of the body — but that's a register issue, not a continuity-of-argument one. | The author clearly knows their own mind — the argument doesn't drift, reverse, or contradict itself. The only fixes worth making are cosmetic: pick one friction reason for the integration cards (Docker is what the prose argues, so I'd use that consistently), and standardize on either "Overall" or "Average." |
| Grade the persuasiveness of this article from 1 (weak) to 5 (very persuasive). | The article scores 4/5 for persuasiveness. It convinces through a real, honest end-to-end agent walkthrough — screenshots, transcripts, and admitted failures build credibility. But it relies on a single vendor and one anecdote, asserts rather than proves its broad market claims, and a glib swipe at skeptics slightly undercuts its reach. | (4/5): grounded in concrete, falsifiable test data and honestly reported failures, with reproducible methodology. Weakened by a single validating example, a borrowed core premise, and unjustified rubric thresholds and weighting. |
| Based on these conclusions, does this article make an impact on the reader? Evaluate memorability, persuasion-to-action, and behavior change, and identify any fixes needed. | Memorability: high — the AX/DX frame and "long docs are fine" will stick. Persuasion-to-action: moderate — strong for the already-interested, weak for the skeptic, because the macro market claims rest on confidence rather than data. Behavior change: real but modest — a reader is plausibly nudged to try the test, less plausibly nudged to re-prioritize a roadmap. Conclusion: it makes an impact — it gives the reader a vocabulary, a method, and one assumption-flipping insight they'll carry. What it stops short of is the impact it reaches for in its own opening: making the reader feel they'll lose if they ignore it. It earns "you should be curious about this," not yet "you cannot afford to ignore this." The fix is the same one that would take it from 4 to 5 on persuasion — a second data point, especially a negative one, and a quantified cost of inaction. | Memorability: moderate-to-high — the "AX is the new DX" frame and the four-stage model will stick, anchored by one vivid detail (0/3 generic mentions, the hallucinated API-key screen); the scorecard numbers won't. Persuasion-to-action: mixed — well-supported for the reader inclined to self-audit, but structurally undercut because the easy prescribed fixes (llms.txt, MCP, skills) are the ones the article itself admits matter least, while the dominant factor (training-data presence) is named as unactionable. Behavior change: real but narrow — the durable shift is perceptual (seeing agents as evaluators, sanity-checking what they say about your product), not procedural; instituting a recurring audit or re-prioritizing a roadmap is less likely to stick. Conclusion: it makes a lopsided impact — it changes how the reader sees the problem more than what they do about it, handing over a vocabulary, a reproducible method, and one assumption-flipping insight they'll carry. What it stops short of is the urgency it reaches for in its own opening: the reader leaves convinced AX matters and mildly equipped to measure it, but quietly aware the highest-leverage fix is out of their hands. It earns "you should be curious about this," not yet "you cannot afford to ignore this." The fix is the same one that would take it from 4 to 5 on persuasion — a second data point, ideally a negative one, and a quantified cost of inaction. |
There’s a slight shift in certainty for this question: while Claude does rate both articles as having made an impact, it also notes that the impact is diluted by missing or biased information. I’d also go a step further and say that, aside from the very specific parameters given, Claude’s final conclusion didn’t so much ascertain whether the article had an impact on the reader as simply explain to us again what the reader learned from the article. (I’m also not convinced that even more finicky prompting would give us a repeatable way of eliciting a more nuanced explanation of how the reader is impacted by the article beyond the data they gain from it.)
Overall, I’d give Claude a 2/3 on the technical editing front.
Claude will usually (but not always) pick out the most obvious errors. It may or may not miss nuances that a human editor would pick up when assessing an article. You can do your best to guide it with context and refined prompting, but realistically speaking, it’s going to be a gamble every time. So, if you’re happy with a certain degree of unpredictability and the knowledge that you’ll need to double-check every single edit, Claude can be a useful, if somewhat arduous, tool to use.
(But make sure yours are the last eyes on anything you submit to an editor.)