Voice & Camera

Let agents hear the user and see what's in front of them. Two host capabilities, audio_transcribe and camera_photo, tap into the Rush app's native device access.

Capabilities vs tools

Voice and camera are capabilities, not tools. The distinction matters:

	Tools	Capabilities
Run where	In `rush/cli` or the proxy	In the Rush app (microphone, camera, browser)
Why	Pure logic or server-side HTTP	Need OS-level device access the CLI can't touch
Declared as	`tools:`	`capabilities:`

When an agent needs a capability, rush/cli emits a capability event. The Rush app handles it natively and sends the result back through stdin.
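The round trip can be sketched as follows. This is only an illustration of the emit-then-read pattern, not the actual rush/cli protocol: the JSON-lines framing and the field names (`type`, `name`, `params`, `result`) are assumptions made for this example, and the host is simulated with in-memory streams.

```python
import io
import json
import sys


def request_capability(name, params, out=sys.stdout, inp=sys.stdin):
    """Emit a capability event, then block until the host replies.

    Hypothetical wire format: one JSON object per line in each direction.
    """
    event = {"type": "capability", "name": name, "params": params}
    out.write(json.dumps(event) + "\n")
    out.flush()

    reply = inp.readline()
    if not reply:
        raise EOFError("host closed the input stream before replying")
    return json.loads(reply)["result"]


# Simulate the host: the "app" has already queued one reply line.
fake_stdout = io.StringIO()
fake_stdin = io.StringIO(
    json.dumps({"result": {"transcript": [], "duration": 0.0}}) + "\n"
)
result = request_capability(
    "audio_transcribe", {"duration": 10}, out=fake_stdout, inp=fake_stdin
)
```

The point of the sketch is the direction of travel: the request leaves on one stream, the device work happens in the app, and the agent only ever sees the returned result.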

audio_transcribe

Records user speech through the microphone, streams it to Deepgram, and returns a time-coded transcript. Same pipeline as the voice input button in the Rush app.

Declare it

agent.yaml

capabilities:
  - name: audio_transcribe

Parameters

Field	Default	Notes
`duration`	30	Max recording length in seconds. Capped at 300 (5 min).
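As a quick illustration of the default and the cap, the helper below computes the recording length that would apply to a request. It assumes over-limit values are clamped rather than rejected, which the table does not state explicitly; `effective_duration` is a hypothetical name, not part of the Rush API.

```python
DEFAULT_DURATION = 30  # seconds, the documented default
MAX_DURATION = 300     # seconds, the documented 5-minute cap


def effective_duration(requested=None):
    """Return the recording length in seconds for a requested duration.

    Assumption: values above the cap are clamped, not rejected.
    """
    if requested is None:
        return DEFAULT_DURATION
    return min(requested, MAX_DURATION)
```

For example, `effective_duration()` gives 30, `effective_duration(45)` gives 45, and `effective_duration(600)` gives 300.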

Returns

{
  "transcript": [
    { "text": "Hello, could you ", "timestamp": 0.42, "is_final": false },
    { "text": "Hello, could you remind me about tomorrow?", "timestamp": 1.87, "is_final": true }
  ],
  "duration": 2.1
}

Segments stream in as Deepgram produces them. A segment with is_final: false is still being refined; is_final: true marks the stable version. Use only the final segments when you need clean output.
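A minimal sketch of keeping only the stable segments, using the result shape shown above. `final_text` is a hypothetical helper name, and joining the final segments with a single space is a choice made for this example.

```python
def final_text(result):
    """Join the text of all segments marked is_final, in order."""
    finals = [
        segment["text"].strip()
        for segment in result["transcript"]
        if segment["is_final"]
    ]
    return " ".join(finals)


sample = {
    "transcript": [
        {"text": "Hello, could you ", "timestamp": 0.42, "is_final": False},
        {
            "text": "Hello, could you remind me about tomorrow?",
            "timestamp": 1.87,
            "is_final": True,
        },
    ],
    "duration": 2.1,
}

clean = final_text(sample)  # "Hello, could you remind me about tomorrow?"
```

Dropping the interim segment matters here because the final segment already contains its text; concatenating everything would repeat "Hello, could you".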

Good uses: agents that ask the user a quick follow-up question, voice-driven note capture, hands-free workflows (meditation guidance, interview prep, tutor loops).

camera_photo

Captures a photo from the device camera. Uses navigator.mediaDevices.getUserMedia(), so every input the OS knows about is fair game: built-in webcams, external USB cameras, and Apple's Continuity Camera (iPhone as webcam).

Declare it

agent.yaml

capabilities:
  - name: camera_photo

Parameters

Field	Default	Notes
`quality`	85	JPEG quality 0–100.
`maxWidth`	1920	Max width in pixels. Aspect ratio preserved.
`maxHeight`	1080	Max height in pixels. Aspect ratio preserved.
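The interaction between maxWidth and maxHeight can be worked through with a little arithmetic. The sketch below assumes both limits apply together and that images smaller than the limits are never upscaled; the page only states that the aspect ratio is preserved, so treat those two points as assumptions. `fit_within` is a hypothetical helper, not a Rush function.

```python
def fit_within(width, height, max_width=1920, max_height=1080):
    """Scale (width, height) down to fit both limits, keeping aspect ratio.

    Assumptions: both limits apply at once, and images are never upscaled.
    """
    scale = min(max_width / width, max_height / height, 1.0)
    return round(width * scale), round(height * scale)
```

With the defaults, a 4032×3024 capture (4:3) is limited by height and comes out at 1440×1080, a 3840×2160 capture (16:9) comes out at 1920×1080, and a 1280×720 capture is left unchanged.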

Returns

{
  "image_data": "<base64-encoded JPEG>",
  "width": 1920,
  "height": 1080,
  "mime_type": "image/jpeg"
}

The agent can pass image_data straight into a vision model call, or save it as an artifact for the user.
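Before saving or forwarding the image, an agent will usually want the raw bytes. The sketch below decodes the result and sanity-checks that it looks like a JPEG. It assumes image_data is plain base64 with no `data:` URI prefix; `decode_photo` is a hypothetical helper name.

```python
import base64

JPEG_MAGIC = b"\xff\xd8\xff"  # first bytes of every JPEG file


def decode_photo(result):
    """Return the raw JPEG bytes from a camera_photo result.

    Assumption: image_data is plain base64 without a data: URI prefix.
    """
    data = base64.b64decode(result["image_data"], validate=True)
    if result.get("mime_type") != "image/jpeg" or not data.startswith(JPEG_MAGIC):
        raise ValueError("camera_photo result does not look like a JPEG")
    return data
```

The returned bytes can then be written to disk, for example with `pathlib.Path("photo.jpg").write_bytes(decode_photo(result))`.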

Good uses: “scan this receipt,” “log what I had for lunch,” “identify this plant,” “read the text on this whiteboard.” Continuity Camera means the user can point their iPhone at anything physical without leaving their Mac.

Next

Browser Use

Generative UI
