We built a speech-to-canvas agent on tldraw

June 24, 2026 · 12 min read

Software engineer and technical writer

tldraw is an infinite-canvas SDK for React. We recently built 2draw on it, a Drawful-style game where players draw on a shared canvas and race to guess each other's drawings. We started wondering what it would take to put an agent in that loop, as an opponent or a rival guesser, and dug into how an agent could read and draw on a tldraw canvas. That research turned into:

Agent draw, a tldraw tool that lets you drag a rectangle on the canvas, speak what you want, and have an AI agent transcribe your voice and draw a complete result inside that rectangle. Drag a few rectangles in a row and they queue up, drawing one at a time.

You can try it right now, or grab the source:

Live demo: tldraw-agent-draw-demo.james-664.workers.dev
Source: github.com/ritza-co/tldraw-agent-draw-demo

What is tldraw

tldraw is an infinite-canvas SDK for React. You have probably used the canvas itself without knowing the name: it is the whiteboard under a lot of "draw it out" features in other products. You drop a <Tldraw /> component into your app and you get shapes, selection, panning and zooming, multiplayer, undo/redo, and a full editor API for reading and writing the canvas in code.

That editor API is what makes AI integration practical. Anything a user can do with a mouse, an agent can do in code: create a shape, move it, label it, or draw an arrow between two others.

The Agent starter kit we built on

We did not build the agent from scratch. tldraw publishes an official Agent starter kit (MIT licensed), and it already does the hard part. Out of the box it gives you:

A tldraw canvas plus a chat panel. You type a prompt and an LLM agent takes turns issuing actions that draw and arrange real tldraw shapes (create, move, label, align, review, and more).
A Cloudflare Worker backend with a /stream route and a durable object that runs the agent loop.
Several model providers wired up (OpenAI, Anthropic, Google).
Two context-picking tools in the toolbar, "Pick Area" and "Pick Shape", that let you point the agent at part of the canvas.
A lint/review system that flags overlaps and cut-off text.

So the agent and the drawing were already there. What the starter kit could not do is take an area you draw, listen to you describe what goes in it, and draw a complete result inside that region. That is the feature we added, and it turned out to be a small amount of code on top of a lot of capability.

What we added: Agent draw

The whole feature is a new canvas tool, a speech pipeline, a serialized draw queue, and a prompt section. Here is each piece, with the actual code from the repo.

1. A canvas tool that captures a region


Dragging out a region with the Agent draw tool.

tldraw tools are state machines. You subclass StateNode and define child states; tldraw routes pointer events to whichever state is active. Our AreaCaptureTool has three states (idle → pointing → dragging) and does its real work on pointer-up, when the dragged rectangle is final:

class AreaCaptureDragging extends StateNode {
  static override id = 'dragging'

  private bounds: BoxModel | undefined = undefined

  override onPointerUp() {
    this.editor.updateInstanceState({ brush: null })
    if (!this.bounds) throw new Error('Bounds not set')
    // Hand the captured rectangle (in page coordinates) to the capture session.
    startCaptureSession(this.bounds)
    this.parent.transition('idle')
  }

  updateBounds() {
    if (!this.initialPagePoint) return
    const currentPagePoint = this.editor.inputs.getCurrentPagePoint()
    const x = Math.min(this.initialPagePoint.x, currentPagePoint.x)
    const y = Math.min(this.initialPagePoint.y, currentPagePoint.y)
    const w = Math.abs(currentPagePoint.x - this.initialPagePoint.x)
    const h = Math.abs(currentPagePoint.y - this.initialPagePoint.y)
    // Show tldraw's native selection brush while dragging.
    this.editor.updateInstanceState({ brush: { x, y, w, h } })
    this.bounds = { x, y, w, h }
  }
}

We get the live selection-brush rectangle for free by setting brush through editor.updateInstanceState({ brush }), the same instance state tldraw's own select tool uses. The bounds are in page coordinates, so they stay correct no matter how the user has panned or zoomed.
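The bounds arithmetic in updateBounds can be isolated as a pure function. The following is a minimal sketch rather than code from the repo: Point and Box are plain-object stand-ins for tldraw's point and box models, and isUsableRegion (with its 8-unit threshold) is a hypothetical guard that the original does not include.

```typescript
// Plain-object stand-ins for tldraw's point and box models (assumed shapes).
interface Point { x: number; y: number }
interface Box { x: number; y: number; w: number; h: number }

// Normalize a drag between two page points into a box with a top-left
// origin and a non-negative size, regardless of the drag direction.
function boxFromDrag(start: Point, current: Point): Box {
  return {
    x: Math.min(start.x, current.x),
    y: Math.min(start.y, current.y),
    w: Math.abs(current.x - start.x),
    h: Math.abs(current.y - start.y),
  }
}

// Hypothetical guard (not in the original): treat tiny drags as accidental clicks.
function isUsableRegion(box: Box, minSize = 8): boolean {
  return box.w >= minSize && box.h >= minSize
}

// Dragging down-right and dragging up-left describe the same region.
const downRight = boxFromDrag({ x: 10, y: 20 }, { x: 110, y: 80 })
const upLeft = boxFromDrag({ x: 110, y: 80 }, { x: 10, y: 20 })
console.log(downRight, upLeft, isUsableRegion(downRight))
```

Taking the minimum for the origin and the absolute difference for the size is what lets the user drag in any direction and still end up with a well-formed rectangle.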

2. Recording speech

The moment a capture starts, we open the mic. AreaRecorder is a thin wrapper over the browser's MediaRecorder, deliberately with no knowledge of the agent or transcription, just start() and stop():

export class AreaRecorder {
  async start(): Promise<void> {
    this.stream = await navigator.mediaDevices.getUserMedia({ audio: true })
    const recorder = new MediaRecorder(this.stream, { mimeType: this.mimeType })
    this.chunks = []
    recorder.ondataavailable = (event) => {
      if (event.data.size > 0) this.chunks.push(event.data)
    }
    recorder.start()
    this.recorder = recorder
  }

  async stop(): Promise<Blob> {
    // ...stop the recorder, release the mic, resolve with the recorded clip...
    return new Blob(this.chunks, { type: this.mimeType || 'audio/webm' })
  }
}
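The excerpt does not show how this.mimeType is chosen. One plausible approach, sketched here as an assumption rather than the repo's actual logic, is to probe a preference list with the browser's MediaRecorder.isTypeSupported and fall back to the recorder's default when nothing matches. The candidate list and the pickMimeType name are illustrative.

```typescript
// Candidate container/codec strings in order of preference (an assumption).
const CANDIDATE_MIME_TYPES = ['audio/webm;codecs=opus', 'audio/webm', 'audio/mp4']

// Return the first candidate the environment supports, or undefined so the
// caller can let MediaRecorder fall back to its own default format.
function pickMimeType(
  isSupported: (mimeType: string) => boolean,
  candidates: readonly string[] = CANDIDATE_MIME_TYPES
): string | undefined {
  return candidates.find((type) => isSupported(type))
}

// In a browser you would pass the real check:
//   pickMimeType((t) => MediaRecorder.isTypeSupported(t))
// Here a fake predicate stands in for a browser that only records MP4 audio.
const mp4Only = (type: string) => type.startsWith('audio/mp4')
console.log(pickMimeType(mp4Only))
```

Injecting the support check keeps the selection logic testable outside a browser, which fits the recorder's goal of knowing nothing about the rest of the pipeline.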

3. Transcription, on the worker

The audio blob is posted to a new /transcribe route on the same Cloudflare Worker that already serves the agent. The route just forwards the audio to Mistral's Voxtral transcription model and returns the text, so the API key never touches the browser:

export async function transcribe(request: IRequest, env: Environment) {
  const form = await request.formData()
  const file = form.get('file')

  const outForm = new FormData()
  outForm.append('file', file)
  outForm.append('model', 'voxtral-mini-transcribe-2507')

  const mistralResponse = await fetch('https://api.mistral.ai/v1/audio/transcriptions', {
    method: 'POST',
    headers: { Authorization: `Bearer ${env.MISTRAL_API_KEY}` },
    body: outForm,
  })

  const data = (await mistralResponse.json()) as { text?: string }
  return new Response(JSON.stringify({ text: data.text ?? '' }), {
    headers: { 'Content-Type': 'application/json' },
  })
}
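The browser side of this exchange is not shown above. A minimal sketch of what a caller might look like follows, assuming the route is reachable at the relative path /transcribe and reads the clip from a form field named file, as the handler does. The helper names, the clip.webm filename, and the trimming of the result are illustrative choices, not the repo's code.

```typescript
// Build the multipart body the /transcribe route reads via form.get('file').
function buildTranscribeForm(clip: Blob, filename = 'clip.webm'): FormData {
  const form = new FormData()
  form.append('file', clip, filename)
  return form
}

// Defensive parse of the route's JSON reply, which has the shape { text?: string }.
function parseTranscript(payload: unknown): string {
  if (typeof payload === 'object' && payload !== null && 'text' in payload) {
    const text = (payload as { text?: unknown }).text
    return typeof text === 'string' ? text.trim() : ''
  }
  return ''
}

// Hypothetical client helper. fetchImpl is injectable so it can be faked in tests.
async function requestTranscript(clip: Blob, fetchImpl: typeof fetch = fetch): Promise<string> {
  const response = await fetchImpl('/transcribe', {
    method: 'POST',
    body: buildTranscribeForm(clip),
  })
  if (!response.ok) throw new Error(`Transcription failed with status ${response.status}`)
  return parseTranscript(await response.json())
}
```

Checking response.ok before parsing means an upstream failure surfaces as an error on the session instead of silently becoming an empty transcript.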

4. A serialized draw queue


Several captures in flight at once, each with its own status pill.

This is the part that makes multiple captures work. You can drag a second rectangle while the first is still drawing, and the captures draw in order rather than fighting over the canvas. The whole thing is a module-level state machine over a tldraw atom: a single active recorder, a FIFO queue, and a single-consumer worker that handles one session at a time.

Starting a new capture auto-stops the one still recording and queues it:

export function startCaptureSession(bounds: BoxModel): string {
  // Drawing a new capture ends the audio of the one still recording.
  if (recordingId) finalizeRecording(recordingId)

  const id = nextId()
  sessions.set([...sessions.get(), { id, bounds, status: 'recording' }])

  const rec = new AreaRecorder()
  recorder = rec
  recordingId = id
  rec.start().catch(/* surface a mic-permission error on the session */)
  return id
}
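Each session moves through a small set of statuses, which is what the status pills display. The sketch below models that lifecycle as an explicit transition table. The status names recording, queued, transcribing and drawing come from the article; the error status, the table itself, and the helper names are assumptions made for illustration.

```typescript
type CaptureStatus = 'recording' | 'queued' | 'transcribing' | 'drawing' | 'error'

// Allowed forward moves. The 'error' status is an assumption about how
// failures (for example a denied mic permission) might be represented.
const NEXT_STATUS: Record<CaptureStatus, readonly CaptureStatus[]> = {
  recording: ['queued', 'error'],
  queued: ['transcribing', 'error'],
  transcribing: ['drawing', 'error'],
  drawing: ['error'],
  error: [],
}

function canTransition(from: CaptureStatus, to: CaptureStatus): boolean {
  return NEXT_STATUS[from].includes(to)
}

interface Session { id: string; status: CaptureStatus }

// Immutable patch: returns a new array so a reactive store sees the change.
// Illegal transitions leave the session untouched.
function patchStatus(sessions: readonly Session[], id: string, to: CaptureStatus): Session[] {
  return sessions.map((s) =>
    s.id === id && canTransition(s.status, to) ? { ...s, status: to } : s
  )
}

const initial: Session[] = [{ id: 'capture-1', status: 'recording' }]
const afterStop = patchStatus(initial, 'capture-1', 'queued')
console.log(afterStop[0].status)
```

Making the legal moves explicit is one way to keep a stale callback, such as a late recorder event, from pushing a session backwards.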

The consumer drains the queue one session at a time, moving each through transcribing → drawing. Serializing here is the whole point: the agent runs one request at a time, so a newer capture simply waits its turn:

async function processQueue(): Promise<void> {
  if (processing) return            // single consumer
  processing = true
  try {
    while (queue.length > 0) {
      const id = queue.shift() as string
      const session = getSession(id)
      const blob = pendingBlobs.get(id)
      // ...skip if dismissed / no audio / agent not ready...

      patchSession(id, { status: 'transcribing' })
      const text = await transcribe(blob)

      patchSession(id, { status: 'drawing' })
      await requestDrawInArea(agentRef, text, session.bounds)

      removeSession(id)             // clear the overlay for this capture
    }
  } finally {
    processing = false
  }
}
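Stripped of the tldraw and transcription specifics, the single-consumer pattern can be sketched as a small generic class. This is an illustration of the technique, not the repo's module: SerialQueue, whenIdle, and the per-job error handling are all hypothetical additions.

```typescript
type Job = () => Promise<void>

class SerialQueue {
  private jobs: Job[] = []
  private processing = false
  private idleWaiters: Array<() => void> = []

  enqueue(job: Job): void {
    this.jobs.push(job)
    void this.drain()
  }

  // Resolves once the queue is empty and no job is running.
  whenIdle(): Promise<void> {
    if (!this.processing && this.jobs.length === 0) return Promise.resolve()
    return new Promise((resolve) => this.idleWaiters.push(resolve))
  }

  private async drain(): Promise<void> {
    if (this.processing) return // single consumer: a second caller backs off
    this.processing = true
    try {
      while (this.jobs.length > 0) {
        const job = this.jobs.shift() as Job
        try {
          await job()
        } catch (error) {
          // Assumption: one failed job should not block the ones behind it.
          console.error('Job failed, continuing with the next one', error)
        }
      }
    } finally {
      this.processing = false
      this.idleWaiters.splice(0).forEach((resolve) => resolve())
    }
  }
}

// Two jobs enqueued back to back run strictly one after the other.
const log: string[] = []
const pause = () => new Promise<void>((resolve) => setTimeout(resolve, 5))
const captureQueue = new SerialQueue()
captureQueue.enqueue(async () => { log.push('a:start'); await pause(); log.push('a:end') })
captureQueue.enqueue(async () => { log.push('b:start'); await pause(); log.push('b:end') })
```

The processing flag is the whole mechanism: every enqueue tries to start the consumer, but only the first call gets past the guard, so later jobs are picked up by the loop that is already running.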

An earlier version tried to stream the mic live and draw as you talked. It was laggy and the benefit was unclear, so we cut it in favor of this discrete capture-then-speak model. The queue also replaced an earlier design with an overlap bug, in which two captures in flight would cancel each other.

5. Driving the agent to draw inside the box

Each session's transcript and bounds go to requestDrawInArea. The interesting decision here is that we use the starter kit's full agentic loop, agent.prompt, not the single-turn agent.request. prompt keeps taking turns on its own until the model has nothing left to add, so the model finishes the entire drawing in one call instead of drawing one shape and stopping.

export async function requestDrawInArea(
  agent: TldrawAgent,
  text: string,
  bounds: BoxModel
): Promise<number> {
  const area = { type: 'area' as const, bounds, source: 'user' as const }
  ensureMode(agent, 'working')
  const before = agent.editor.getCurrentPageShapeIds().size
  try {
    await agent.prompt({ message: buildAreaMessage(text), contextItems: [area] })
    return agent.editor.getCurrentPageShapeIds().size - before
  } finally {
    ensureMode(agent, 'idling')
  }
}

An earlier version hand-rolled a continue-loop with extra linter passes to force a complete drawing, because a single-turn request only drew one shape. Once we confirmed agent.prompt finishes the whole drawing on its own (with anthropic/claude-haiku-4.5 via OpenRouter), all of that scaffolding got deleted.
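The difference between a single-turn request and a multi-turn loop can be shown with a toy model. This is a conceptual sketch only and not how the starter kit implements agent.prompt or agent.request: the ModelTurn type, the stopping rule (a turn that returns no actions), and the maxTurns cap are all assumptions.

```typescript
interface Action { type: string }

// One model call: given the turn number, return the actions it wants to take.
type ModelTurn = (turn: number) => Promise<Action[]>

// Single-turn: one call, and whatever it returns is all you get.
async function requestOnce(model: ModelTurn): Promise<Action[]> {
  return model(0)
}

// Multi-turn: keep calling until a turn yields no actions or the cap is hit.
async function promptUntilDone(model: ModelTurn, maxTurns = 10): Promise<Action[]> {
  const all: Action[] = []
  for (let turn = 0; turn < maxTurns; turn++) {
    const actions = await model(turn)
    if (actions.length === 0) break // the model has nothing left to add
    all.push(...actions)
  }
  return all
}

// Fake model that wants three turns to finish its drawing.
const fakeModel: ModelTurn = async (turn) =>
  turn < 3 ? [{ type: `create-shape-${turn}` }] : []
```

With this fake model, requestOnce stops after the first shape while promptUntilDone collects all three, which mirrors the behaviour that made the hand-rolled continue-loop unnecessary.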

6. Cutting the round-trips

The starter kit's working mode ships with a fairly broad action list. For a fixed-region draw, two of those actions each cost a full LLM round-trip without improving the result.

setMyView moves the camera to frame the shapes the agent just drew. That makes sense for a canvas chat panel, but for area capture the viewport is irrelevant: the user wants to see what was drawn, not have the camera jump around. Worse, setMyView interrupts the current turn and triggers a re-request, which adds a round-trip.

review runs the starter kit's lint system: it checks for overlapping shapes, text that overflows its box, and similar issues, then schedules a follow-up turn to fix them. Again, useful in a chat panel, but for a one-shot area draw it just adds another model call at the end.

Removing both from the working mode's action list roughly halved the number of model calls per capture. The change is two lines in AgentModeDefinitions.ts:

// review and setMyView removed — each forces an extra round-trip per draw
// without helping the fixed-region result. Utils stay registered for other modes.
actions: [
  ThinkActionUtil.type,
  CreateActionUtil.type,
  PenActionUtil.type,
  // ... rest of actions ...
]

The utilities stay registered so they remain available to the chat panel. They are just not offered to the agent when it is drawing inside a captured area.
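The distinction between registering an action and offering it to a mode can be sketched in isolation. The action names and the ModeDefinition shape below are illustrative stand-ins, not the starter kit's real types or identifiers.

```typescript
// Illustrative action names. The real identifiers come from the starter
// kit's action utils (ThinkActionUtil.type and so on) and may differ.
const REGISTERED_ACTIONS = ['think', 'create', 'pen', 'move', 'label', 'review', 'setMyView'] as const
type ActionType = (typeof REGISTERED_ACTIONS)[number]

interface ModeDefinition {
  name: string
  actions: ActionType[]
}

// Return a copy of the mode that no longer offers the excluded actions.
// The registry is untouched, so other modes can still list them.
function withoutActions(mode: ModeDefinition, excluded: readonly ActionType[]): ModeDefinition {
  return { ...mode, actions: mode.actions.filter((action) => !excluded.includes(action)) }
}

const fullMode: ModeDefinition = { name: 'full', actions: [...REGISTERED_ACTIONS] }
const trimmedMode = withoutActions({ ...fullMode, name: 'trimmed' }, ['review', 'setMyView'])
console.log(trimmedMode.actions)
```

Because the model can only emit actions it is offered, dropping an action from the list removes the extra round-trip without deleting any code.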

7. Teaching the model to draw, not transcribe

Left: the spoken request transcribed as a wall of text. Right: the same request drawn as a labelled diagram.

The last piece is a prompt. Left alone, a model handed a sentence of speech tends to write that sentence on the canvas as a wall of text. We added a "Drawing inside a captured area" section to the starter kit's system prompt that does two things: it forces the model to assess what visual form actually fits the request, and it forbids dumping the transcript verbatim:

Choose the visual form that best fits the request and immediately emit
create/pen actions — do not ask for clarification, do not stop to think
without drawing:

- A specific named shape ("draw a red circle", "a star"): draw exactly that.
- A single object or illustration: draw it well, not a multi-box diagram.
- A definition or explanation: draw a labelled diagram with keyword labels,
  not the spoken sentence as a block of text.
- A diagram or process: labelled nodes with arrows.
- A chart: for quantitative or comparative content.

If the request is ambiguous, make your best interpretation and draw it anyway.
Never stall — always emit at least one shape.

It also insists the model build the complete result in one turn ("the user does not reply between actions, so anything you defer to a follow-up message will never happen"), which pairs with the agent.prompt decision above.
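The system prompt section works together with the per-request message produced by buildAreaMessage, whose body is not shown in this article. The sketch below is a hypothetical stand-in that shows one way such a message could frame the transcript as a request to draw rather than as text to copy. The function name, the wording, and the whitespace normalization are all assumptions.

```typescript
// Hypothetical stand-in for the repo's buildAreaMessage, whose body is not shown.
function buildAreaMessageSketch(transcript: string): string {
  // Collapse the stray whitespace and line breaks that transcription can produce.
  const spoken = transcript.trim().replace(/\s+/g, ' ')
  return [
    'Draw inside the captured area provided as context.',
    `Spoken request: "${spoken}"`,
    'Draw the complete result now. Do not write the request out as text.',
  ].join('\n')
}

console.log(buildAreaMessageSketch('  draw a   red circle\nwith a label '))
```

Keeping the transcript in a clearly delimited field, with the instruction around it, is one way to reduce the chance that the model treats the spoken sentence as content to place on the canvas.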

8. Wiring it into the toolbar

The "Agent draw" tool, added to the toolbar next to tldraw's built-ins.

Finally, the tool is registered through tldraw's standard override API, so "Agent draw" sits in the toolbar next to the built-in tools with its own icon and keyboard shortcut:

const tools = [AreaCaptureTool, TargetShapeTool, TargetAreaTool]

const overrides: TLUiOverrides = {
  tools: (editor, tools) => ({
    ...tools,
    'area-capture': {
      id: 'area-capture',
      label: 'Agent draw',
      kbd: 'a',
      icon: agentDrawIcon,
      onSelect() { editor.setCurrentTool('area-capture') },
    },
    // ...Pick Area / Pick Shape unchanged...
  }),
}

<Tldraw
  licenseKey={import.meta.env.VITE_TLDRAW_LICENSE_KEY}
  tools={tools}
  overrides={overrides}
  components={components}
/>

That is the entire feature: a tool that captures a rectangle, a recorder, a worker route for transcription, a queue, one call into the agent loop, a couple of trimmed actions for speed, and a prompt section. Everything else, the canvas, the agent, the action system, the streaming backend, came from the starter kit.

A note on licensing

Two separate licenses apply here:

The starter-kit code is MIT. Our demo is a derivative of tldraw/agent-template, whose LICENSE.md is MIT, so our source can be MIT too. We keep tldraw's copyright notice alongside ours in our own LICENSE.md.
The tldraw SDK is not. The tldraw npm package is under the proprietary tldraw license. It is free in local development, but any public deployment needs a tldraw license key, passed as the licenseKey prop you saw above.

So the code we publish is MIT and you can run it locally for free, but the live demo needs a tldraw license key, which is why the key lives in a gitignored .env rather than the repo.
