Voice & Camera

Let agents hear the user and see what's in front of them. Two host capabilities, audio_transcribe and camera_photo, tap into the Rush app's native device access.

Capabilities vs tools

Voice and camera are capabilities, not tools. The distinction matters:

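|             | Tools                           | Capabilities                                      |
|-------------|---------------------------------|---------------------------------------------------|
| Run where   | In rush/cli or the proxy        | In the Rush app (microphone, camera, browser)     |
| Why         | Pure logic or server-side HTTP  | Need OS-level device access the CLI can't touch   |
| Declared as | tools:                          | capabilities:                                     |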

When an agent needs a capability, rush/cli emits a capability event. The Rush app handles it natively and sends the result back through stdin.

audio_transcribe

Records user speech through the microphone, streams it to Deepgram, and returns a time-coded transcript. Same pipeline as the voice input button in the Rush app.

Declare it

agent.yaml
capabilities:
  - name: audio_transcribe

Parameters

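| Field    | Default | Notes                                                    |
|----------|---------|----------------------------------------------------------|
| duration | 30      | Max recording length in seconds. Capped at 300 (5 min).  |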

Returns

{
  "transcript": [
    { "text": "Hello, could you ", "timestamp": 0.42, "is_final": false },
    { "text": "Hello, could you remind me about tomorrow?", "timestamp": 1.87, "is_final": true }
  ],
  "duration": 2.1
}

Segments stream in as Deepgram produces them. A segment with is_final: false is still being refined; is_final: true marks the stable version. Use only the final segments when you need clean output.
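
Keeping only the final segments can be sketched as below. This assumes the result has already been parsed into a Python dict shaped like the Returns example. The helper name final_text is illustrative and not part of Rush.

```python
from typing import Any


def final_text(result: dict[str, Any]) -> str:
    """Join the text of all stable (is_final: true) segments, in order."""
    finals = [
        seg["text"].strip()
        for seg in result.get("transcript", [])
        if seg.get("is_final")  # skip interim segments still being refined
    ]
    return " ".join(text for text in finals if text)


# Sample payload copied from the Returns example above.
sample = {
    "transcript": [
        {"text": "Hello, could you ", "timestamp": 0.42, "is_final": False},
        {"text": "Hello, could you remind me about tomorrow?", "timestamp": 1.87, "is_final": True},
    ],
    "duration": 2.1,
}

print(final_text(sample))  # Hello, could you remind me about tomorrow?
```

Interim segments repeat text that a later final segment supersedes, so concatenating everything would duplicate words.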

Good uses: agents that ask the user a quick follow-up question, voice-driven note capture, hands-free workflows (meditation guidance, interview prep, tutor loops).

camera_photo

Captures a photo from the device camera. Uses navigator.mediaDevices.getUserMedia(), so every input the OS knows about is fair game — built-in webcam, external USB cameras, and Apple's Continuity Camera (iPhone as webcam).

Declare it

agent.yaml
capabilities:
  - name: camera_photo

Parameters

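| Field     | Default | Notes                                          |
|-----------|---------|------------------------------------------------|
| quality   | 85      | JPEG quality 0–100.                            |
| maxWidth  | 1920    | Max width in pixels. Aspect ratio preserved.   |
| maxHeight | 1080    | Max height in pixels. Aspect ratio preserved.  |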
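
How the two limits interact when the aspect ratio is preserved can be sketched as a fit-inside-a-box calculation. This is an illustration, not Rush's actual resizing code. The function name is hypothetical, and it assumes that images already within the limits are not upscaled and that results are rounded to the nearest pixel.

```python
def fit_within(
    width: int,
    height: int,
    max_width: int = 1920,
    max_height: int = 1080,
) -> tuple[int, int]:
    """Scale (width, height) down to fit inside the max box, keeping the aspect ratio."""
    # The tighter of the two limits decides the scale; 1.0 means "never upscale"
    # (an assumption, the source does not say what happens to small images).
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(4032, 3024))  # (1440, 1080): a 4:3 frame is limited by height
print(fit_within(1280, 720))   # (1280, 720): already within both limits
```

Under these assumptions, a 4:3 sensor image never reaches the full 1920 width with the default limits, because the height limit is hit first.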

Returns

{
  "image_data": "<base64-encoded JPEG>",
  "width": 1920,
  "height": 1080,
  "mime_type": "image/jpeg"
}

The agent can pass image_data straight into a vision model call, or save it as an artifact for the user.
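
Saving the photo as a file might look like the sketch below. It assumes the result is available as a Python dict shaped like the Returns example. The helper name save_photo and the output path are hypothetical, and the demo payload holds placeholder bytes rather than a real image.

```python
import base64
import tempfile
from pathlib import Path
from typing import Any


def save_photo(result: dict[str, Any], path: str) -> int:
    """Decode image_data and write the JPEG bytes to path. Returns the bytes written."""
    if result.get("mime_type") != "image/jpeg":
        raise ValueError(f"unexpected mime_type: {result.get('mime_type')!r}")
    # validate=True rejects characters outside the base64 alphabet instead of skipping them.
    data = base64.b64decode(result["image_data"], validate=True)
    return Path(path).write_bytes(data)


# Demo with placeholder bytes (JPEG start and end markers only, not a viewable image).
fake = {
    "image_data": base64.b64encode(b"\xff\xd8\xff\xd9").decode("ascii"),
    "width": 1,
    "height": 1,
    "mime_type": "image/jpeg",
}
out = Path(tempfile.gettempdir()) / "rush_photo_demo.jpg"
written = save_photo(fake, str(out))
print(written)  # 4
```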

Good uses: “scan this receipt,” “log what I had for lunch,” “identify this plant,” “read the text on this whiteboard.” Continuity Camera means the user can point their iPhone at anything physical without leaving their Mac.
