Voice & Camera
Let agents hear the user and see what's in front of them. Two host capabilities — audio_transcribe and camera_photo — that tap into the Rush app's native device access.
Capabilities vs tools
Voice and camera are capabilities, not tools. The distinction matters:
| | Tools | Capabilities |
|---|---|---|
| Run where | In rush/cli or the proxy | In the Rush app (microphone, camera, browser) |
| Why | Pure logic or server-side HTTP | Need OS-level device access the CLI can't touch |
| Declared as | tools: | capabilities: |
When an agent needs a capability, rush/cli emits a capability event. The Rush app handles it natively and sends the result back through stdin.
audio_transcribe
Records user speech through the microphone, streams it to Deepgram, and returns a time-coded transcript. Same pipeline as the voice input button in the Rush app.
Declare it
capabilities:
- name: audio_transcribe
Parameters
| Field | Default | Notes |
|---|---|---|
| duration | 30 | Max recording length in seconds. Capped at 300 (5 min). |
Returns
{
"transcript": [
{ "text": "Hello, could you ", "timestamp": 0.42, "is_final": false },
{ "text": "Hello, could you remind me about tomorrow?", "timestamp": 1.87, "is_final": true }
],
"duration": 2.1
}
Segments stream in as Deepgram produces them. is_final: false means the segment is still being refined; is_final: true marks the stable version. Use only the final ones when you need clean output.
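As a rough illustration of keeping only the stable segments, the sketch below filters a result shaped like the Returns sample. It assumes the result has already been parsed into a Python dict; the helper name final_text is hypothetical and not part of Rush.

```python
from typing import TypedDict


class Segment(TypedDict):
    text: str
    timestamp: float
    is_final: bool


def final_text(result: dict) -> str:
    """Join the text of every finalized segment, in the order received.

    Hypothetical helper: assumes `result` matches the documented
    audio_transcribe return shape.
    """
    segments: list[Segment] = result.get("transcript", [])
    finals = [s["text"].strip() for s in segments if s.get("is_final")]
    return " ".join(t for t in finals if t)


# Same payload as the Returns sample.
sample = {
    "transcript": [
        {"text": "Hello, could you ", "timestamp": 0.42, "is_final": False},
        {
            "text": "Hello, could you remind me about tomorrow?",
            "timestamp": 1.87,
            "is_final": True,
        },
    ],
    "duration": 2.1,
}

print(final_text(sample))  # Hello, could you remind me about tomorrow?
```

Dropping the interim segments avoids duplicated text, since an interim segment can repeat the opening words of the final one that replaces it.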
Good uses: agents that ask the user a quick follow-up question, voice-driven note capture, hands-free workflows (meditation guidance, interview prep, tutor loops).
camera_photo
Captures a photo from the device camera. Uses navigator.mediaDevices.getUserMedia(), so every input the OS knows about is fair game — built-in webcam, external USB cameras, and Apple's Continuity Camera (iPhone as webcam).
Declare it
capabilities:
- name: camera_photo
Parameters
| Field | Default | Notes |
|---|---|---|
| quality | 85 | JPEG quality 0–100. |
| maxWidth | 1920 | Max width in pixels. Aspect ratio preserved. |
| maxHeight | 1080 | Max height in pixels. Aspect ratio preserved. |
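To make "aspect ratio preserved" concrete, here is one way the output size could be worked out when both limits apply. This is a sketch under assumptions: the page does not say whether smaller images are upscaled or how dimensions are rounded, so the code assumes downscale-only and nearest-integer rounding. The function name fit_within is hypothetical.

```python
def fit_within(
    width: int,
    height: int,
    max_width: int = 1920,
    max_height: int = 1080,
) -> tuple[int, int]:
    """Scale (width, height) down to fit inside the bounds, keeping aspect ratio.

    Assumption: images already inside the bounds are left untouched
    (no upscaling). The real Rush behavior may differ.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # One scale factor for both axes keeps the aspect ratio intact.
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(4032, 3024))  # (1440, 1080): the height limit is the tighter one
print(fit_within(1280, 720))   # (1280, 720): already within both limits
```

Under these assumptions a 4:3 source never reaches the full 1920 width, because the 1080 height limit binds first.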
Returns
{
"image_data": "<base64-encoded JPEG>",
"width": 1920,
"height": 1080,
"mime_type": "image/jpeg"
}
The agent can pass image_data straight into a vision model call, or save it as an artifact for the user.
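A minimal sketch of both options, assuming the result is available as a parsed dict in the shape shown under Returns. The helpers save_photo and to_data_url are hypothetical names, and the data URL is only one common way to hand an image to a vision API. Check what your model provider expects.

```python
import base64
import tempfile
from pathlib import Path


def save_photo(result: dict, path: Path) -> int:
    """Decode image_data and write it to `path`; return the byte count."""
    if result.get("mime_type") != "image/jpeg":
        raise ValueError(f"unexpected mime_type: {result.get('mime_type')!r}")
    # validate=True rejects characters outside the base64 alphabet.
    data = base64.b64decode(result["image_data"], validate=True)
    path.write_bytes(data)
    return len(data)


def to_data_url(result: dict) -> str:
    """Wrap the base64 payload in a data: URL (RFC 2397)."""
    return f"data:{result['mime_type']};base64,{result['image_data']}"


# Stand-in bytes with a JPEG-style header; not a real photo.
fake_jpeg = b"\xff\xd8\xff\xe0" + b"\x00" * 8
result = {
    "image_data": base64.b64encode(fake_jpeg).decode("ascii"),
    "width": 1920,
    "height": 1080,
    "mime_type": "image/jpeg",
}

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "photo.jpg"
    written = save_photo(result, out)
    print(f"wrote {written} bytes to {out.name}")

print(to_data_url(result).startswith("data:image/jpeg;base64,"))
```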
Good uses: “scan this receipt,” “log what I had for lunch,” “identify this plant,” “read the text on this whiteboard.” Continuity Camera means the user can point their iPhone at anything physical without leaving their Mac.
Consent and privacy
Both capabilities trigger the OS's native permission prompt the first time they run — the standard macOS microphone and camera sheets. The user has to approve at the system level, not just in Rush. After approval, the permission persists at the OS level.
Capability calls are explicit. The agent can't secretly record or photograph — every call routes through the capability event pipeline, which is visible in the session timeline.