openclaw/extensions/browser/plugin-registration.ts
scotthuang 7920af0c9e refactor: route browser screenshot vision through shared media understanding
* feat(browser): add optional vision understanding to screenshot tool

* fix(browser): wrap vision output as external content, enforce maxBytes, forward auth profiles

* fix(browser): remove no-op scope/attachments config, drop profile pass-through lacking runtime support

* feat(media-understanding): add profile/preferredProfile to DescribeImageFileWithModelParams and forward to describeImage

* style(browser): add curly braces to satisfy eslint curly rule

* fix(browser): correct tools.browser.enabled help text to match actual behavior

* fix(browser): thread agentDir/workspaceDir from plugin tool context into browser vision

* refactor(browser): move vision config from tools.browser to browser.models

The browser plugin's vision configuration now lives on the top-level
`browser` config namespace (browser.models, browser.visionEnabled,
browser.visionPrompt, etc.) instead of `tools.browser`. This aligns
with the plugin's existing config location and avoids confusion between
tool-level and plugin-level settings.

- Remove tools.browser from ToolsSchema and ToolsConfig
- Add models/vision* fields to BrowserConfig and its zod schema
- Update getBrowserVisionConfig to read from cfg.browser
- Update schema help, labels, and quality test
- Update vision.test.ts to use new config shape
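
As a rough illustration of the move described above, the following sketch reads vision settings from the top-level `browser` namespace rather than from `tools.browser`. Only the names `browser.models`, `browser.visionEnabled`, `browser.visionPrompt`, and `getBrowserVisionConfig` come from this commit message; the interface shapes, the `Sketch` suffixes, and the enable/disable semantics are assumptions made for the example and may not match the real schema.

```typescript
// Hypothetical shapes: field types and semantics are assumed, not taken from the real zod schema.
interface BrowserVisionModelSketch {
  provider?: string;
  model?: string;
}

interface BrowserConfigSketch {
  models?: BrowserVisionModelSketch[];
  visionEnabled?: boolean;
  visionPrompt?: string;
}

interface ConfigSketch {
  browser?: BrowserConfigSketch;
}

interface BrowserVisionConfigSketch {
  models: BrowserVisionModelSketch[];
  prompt?: string;
}

// Reads vision settings from cfg.browser (the new location), never from cfg.tools.
// Assumption: vision is considered off when no models are listed or visionEnabled is false.
function getBrowserVisionConfigSketch(cfg: ConfigSketch): BrowserVisionConfigSketch | undefined {
  const browser = cfg.browser;
  if (!browser?.models?.length || browser.visionEnabled === false) {
    return undefined;
  }
  return { models: browser.models, prompt: browser.visionPrompt };
}

// Example values are invented placeholders.
const exampleVisionConfig = getBrowserVisionConfigSketch({
  browser: {
    models: [{ provider: "example-provider", model: "example-vision-model" }],
    visionPrompt: "Describe the visible page content.",
  },
});
```
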

* docs(browser): add screenshot vision configuration section

Document the new browser.models config for automatic screenshot
description via vision models, enabling text-only main models to
reason about web page content.

* fix(browser): remove deliverable media markers from vision result, drop unused import

P1: Vision-success path no longer exposes the raw screenshot as
deliverable media (removes MEDIA: line and details.media.mediaUrl).
This prevents channel delivery from auto-sending sensitive page content
when the intended output is a text description.

P2: Remove unused ToolsMediaUnderstandingSchema import that would fail
noUnusedLocals typecheck.

* fix(browser): add command/args fields to browser models schema

The browser vision model schema uses .strict(), so CLI-type entries
with command/args were rejected by TypeScript. Add these fields to
align with MediaUnderstandingModelSchema.

* chore(browser): remove debug console.log statements

* fix(browser): harden screenshot vision result against MEDIA: directive injection and restore image sanitization on failure fallback

ClawSweeper #84247 review round 2:

P1 (security, high): neutralize line-start MEDIA: directives in vision descriptions
before wrapping with wrapExternalContent. The agent media extractor scans every
browser tool-result text block via splitMediaFromOutput which treats line-start
MEDIA: as a trusted local-media delivery directive, and browser is on the
trusted-media allowlist. Without neutralization, page or vision-provider output
containing 'MEDIA:/tmp/secret.png' could synthesize a channel-deliverable media
artifact from untrusted content. wrapExternalContent itself does not strip
line-start directives. Introduce neutralizeMediaDirectives in vision.ts that
prepends '[neutralized] ' to any line whose trimStart() begins with MEDIA:
(case-insensitive), defanging the parser anchor while keeping the original
text human-readable.
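
A minimal sketch of the neutralization step described above, written from the commit text rather than from vision.ts itself. The function name carries a `Sketch` suffix to mark it as a hypothetical re-implementation; the exact fast-path check and line-splitting strategy are assumptions.

```typescript
// Hypothetical re-implementation for illustration; the real helper lives in vision.ts.
function neutralizeMediaDirectivesSketch(text: string): string {
  // Fast path (assumed): skip the per-line scan when "media:" appears nowhere.
  if (!/media:/i.test(text)) {
    return text;
  }
  return text
    .split("\n")
    .map((line) =>
      // Only lines whose first non-whitespace characters are MEDIA: (any case)
      // are defanged; a mid-line MEDIA: is left untouched.
      line.trimStart().toLowerCase().startsWith("media:") ? `[neutralized] ${line}` : line,
    )
    .join("\n");
}

const neutralizedExample = neutralizeMediaDirectivesSketch(
  "see MEDIA: inline\nMEDIA:/tmp/secret.png\n  media:/x",
);
// neutralizedExample is:
// "see MEDIA: inline\n[neutralized] MEDIA:/tmp/secret.png\n[neutralized]   media:/x"
```

Because the prefix is prepended instead of the line being deleted, the original path text stays readable in the tool result while no line matches `/^\s*MEDIA:/i` any longer.
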

P2 (compatibility): pass resolveRuntimeImageSanitization() to imageResultFromFile
in the vision-failure catch fallback. The non-vision screenshot path already
forwards this option (d5cc0d53b7) so configured agents.defaults.imageMaxDimensionPx
takes effect. Without this fix, any provider timeout/error silently bypasses the
sanitization guard and returns a raw full-resolution screenshot.

Regression coverage:
- vision.test.ts: 6 unit cases for neutralizeMediaDirectives (no-op fast path,
  mid-line MEDIA: untouched, line-start defanged, leading-whitespace defanged,
  case-insensitive, multiple directives per blob).
- browser-tool.test.ts: 2 integration cases that drive the full screenshot
  tool execute path:
    - 'neutralizes MEDIA: directives in vision text and does not attach media'
      asserts no line matches /^\s*MEDIA:/i in returned text, secret path text
      is preserved verbatim, details.media is absent, and imageResultFromFile
      is not called on the success path.
    - 'preserves screenshot image sanitization on vision failure fallback'
      mocks describeImageFileWithModel to reject and asserts the fallback
      imageResultFromFile call receives imageSanitization: {maxDimensionPx:1600}
      plus the 'browser screenshot vision failed' extraText.

* fix(browser): apply clawsweeper fallback media fix from PR #84247

* refactor: reuse media image understanding for browser screenshots

* refactor: use structured media delivery

* test: update music completion media instruction expectation

* fix: trim buffered reply directive padding

* test: refresh codex prompt snapshots for message media aliases

---------

Co-authored-by: scotthuang <scotthuang@tencent.com>
Co-authored-by: Peter Steinberger <steipete@gmail.com>
2026-05-31 00:00:19 +01:00


import type {
  AnyAgentTool,
  OpenClawPluginApi,
  OpenClawPluginNodeHostCommand,
  OpenClawPluginSecurityAuditCollector,
  OpenClawPluginService,
  OpenClawPluginToolContext,
  OpenClawPluginToolFactory,
} from "openclaw/plugin-sdk/plugin-entry";
import {
  BROWSER_REQUEST_GATEWAY_METHOD,
  BROWSER_REQUEST_GATEWAY_SCOPE,
} from "./src/browser-gateway-contract.js";
import { BrowserToolSchema } from "./src/browser-tool.schema.js";

const EAGER_BROWSER_CONTROL_SERVICE_ENV = "OPENCLAW_EAGER_BROWSER_CONTROL_SERVER";

let browserRegistrationRuntimeModulePromise: Promise<
  typeof import("./register.runtime.js")
> | null = null;

const loadBrowserRegistrationRuntimeModule = async () => {
  browserRegistrationRuntimeModulePromise ??= import("./register.runtime.js");
  return await browserRegistrationRuntimeModulePromise;
};

function isTruthyEnvValue(value: string | undefined): boolean {
  return /^(?:1|true|yes|on)$/iu.test(value?.trim() ?? "");
}

function deriveChatTypeFromSessionKey(
  sessionKey: string | undefined,
): "direct" | "group" | "channel" | undefined {
  const tokens = new Set(sessionKey?.toLowerCase().split(":").filter(Boolean) ?? []);
  if (tokens.has("group")) {
    return "group";
  }
  if (tokens.has("channel")) {
    return "channel";
  }
  if (tokens.has("direct") || tokens.has("dm")) {
    return "direct";
  }
  return undefined;
}

const BROWSER_CLI_DESCRIPTOR = {
  name: "browser",
  description: "Manage OpenClaw's dedicated browser (Chrome/Chromium)",
  hasSubcommands: true,
};

function createLazyBrowserTool(opts?: {
  sandboxBridgeUrl?: string;
  allowHostControl?: boolean;
  agentSessionKey?: string;
  agentDir?: string;
  workspaceDir?: string;
  activeModel?: {
    provider?: string;
    model?: string;
  };
  mediaScope?: {
    sessionKey?: string;
    channel?: string;
    chatType?: string;
  };
}): AnyAgentTool {
  const targetDefault = opts?.sandboxBridgeUrl ? "sandbox" : "host";
  const hostHint =
    opts?.allowHostControl === false ? "Host target blocked by policy." : "Host target allowed.";
  return {
    label: "Browser",
    name: "browser",
    description: [
      "Control the browser via OpenClaw's browser control server (status/start/stop/profiles/tabs/open/snapshot/screenshot/actions).",
      "Browser choice: omit profile by default for the isolated OpenClaw-managed browser (`openclaw`).",
      'For the logged-in user browser, use profile="user". A supported Chromium-based browser (v144+) must be running on the selected host or browser node. Use only when existing logins/cookies matter and the user is present.',
      'For profile="user" or other existing-session profiles, omit timeoutMs on act:type, evaluate, hover, scrollIntoView, drag, select, and fill; that driver rejects per-call timeout overrides for those actions.',
      'When a node-hosted browser proxy is available, the tool may auto-route to it. Pin a node with node=<id|name> or target="node".',
      "When using refs from snapshot (e.g. e12), keep the same tab: prefer passing targetId from the snapshot response into subsequent actions (act/click/type/etc). For tab operations, targetId also accepts tabId handles (t1) and labels from action=tabs.",
      "For multi-step browser work, login checks, stale refs, duplicate tabs, or Google Meet flows, use the bundled browser-automation skill when it is available.",
      'For stable, self-resolving refs across calls, use snapshot with refs="aria" (Playwright aria-ref ids). Default refs="role" are role+name-based.',
      "Use snapshot+act for UI automation. Avoid act:wait by default; use only in exceptional cases when no reliable UI state exists.",
      `target selects browser location (sandbox|host|node). Default: ${targetDefault}.`,
      hostHint,
    ].join(" "),
    parameters: BrowserToolSchema,
    execute: async (toolCallId, args, signal, onUpdate) => {
      const { createBrowserTool } = await loadBrowserRegistrationRuntimeModule();
      const tool = createBrowserTool(opts);
      return await tool.execute(toolCallId, args, signal, onUpdate);
    },
  };
}

function createBrowserToolOptions(ctx: OpenClawPluginToolContext): {
  sandboxBridgeUrl?: string;
  allowHostControl?: boolean;
  agentSessionKey?: string;
  agentDir?: string;
  workspaceDir?: string;
  activeModel?: {
    provider?: string;
    model?: string;
  };
  mediaScope?: {
    sessionKey?: string;
    channel?: string;
    chatType?: string;
  };
} {
  const mediaChannel = ctx.deliveryContext?.channel ?? ctx.messageChannel;
  const mediaChatType = deriveChatTypeFromSessionKey(ctx.sessionKey);
  return {
    ...(ctx.browser?.sandboxBridgeUrl ? { sandboxBridgeUrl: ctx.browser.sandboxBridgeUrl } : {}),
    ...(ctx.browser?.allowHostControl !== undefined
      ? { allowHostControl: ctx.browser.allowHostControl }
      : {}),
    ...(ctx.sessionKey ? { agentSessionKey: ctx.sessionKey } : {}),
    ...(ctx.agentDir ? { agentDir: ctx.agentDir } : {}),
    ...(ctx.workspaceDir ? { workspaceDir: ctx.workspaceDir } : {}),
    ...(ctx.activeModel?.provider || ctx.activeModel?.modelId
      ? {
          activeModel: {
            provider: ctx.activeModel.provider,
            model: ctx.activeModel.modelId,
          },
        }
      : {}),
    ...(ctx.sessionKey || mediaChannel
      ? {
          mediaScope: {
            ...(ctx.sessionKey ? { sessionKey: ctx.sessionKey } : {}),
            ...(mediaChannel ? { channel: mediaChannel } : {}),
            ...(mediaChatType ? { chatType: mediaChatType } : {}),
          },
        }
      : {}),
  };
}

export const browserPluginReload = { restartPrefixes: ["browser"] };

export const browserPluginNodeHostCommands: OpenClawPluginNodeHostCommand[] = [
  {
    command: "browser.proxy",
    cap: "browser",
    handle: async (paramsJSON) => {
      const { runBrowserProxyCommand } = await loadBrowserRegistrationRuntimeModule();
      return await runBrowserProxyCommand(paramsJSON);
    },
  },
];

export const browserSecurityAuditCollectors: OpenClawPluginSecurityAuditCollector[] = [
  async (ctx) => {
    const { collectBrowserSecurityAuditFindings } = await loadBrowserRegistrationRuntimeModule();
    return collectBrowserSecurityAuditFindings(ctx);
  },
];

function createLazyBrowserPluginService(): OpenClawPluginService {
  let service: OpenClawPluginService | null = null;
  const loadService = async () => {
    if (!service) {
      const { createBrowserPluginService } = await loadBrowserRegistrationRuntimeModule();
      service = createBrowserPluginService();
    }
    return service;
  };
  return {
    id: "browser-control",
    start: async (ctx) => {
      if (!isTruthyEnvValue(process.env[EAGER_BROWSER_CONTROL_SERVICE_ENV])) {
        return;
      }
      const loaded = await loadService();
      await loaded.start(ctx);
    },
    stop: async (ctx) => {
      if (!service) {
        const { stopBrowserControlService } = await import("./src/control-service.js");
        await stopBrowserControlService().catch(() => {});
        return;
      }
      await service.stop?.(ctx);
    },
  };
}

export function registerBrowserPlugin(api: OpenClawPluginApi) {
  api.registerTool(((ctx: OpenClawPluginToolContext) =>
    createLazyBrowserTool(createBrowserToolOptions(ctx))) as OpenClawPluginToolFactory);
  api.registerCli(
    async ({ program }) => {
      const { registerBrowserCli } = await import("./src/cli/browser-cli.js");
      registerBrowserCli(program);
    },
    { commands: ["browser"], descriptors: [BROWSER_CLI_DESCRIPTOR] },
  );
  api.registerGatewayMethod(
    BROWSER_REQUEST_GATEWAY_METHOD,
    async (opts) => {
      const { handleBrowserGatewayRequest } = await loadBrowserRegistrationRuntimeModule();
      return await handleBrowserGatewayRequest(opts);
    },
    {
      scope: BROWSER_REQUEST_GATEWAY_SCOPE,
    },
  );
  api.registerService(createLazyBrowserPluginService());
}
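
To make the token-based chat-type derivation in this file easier to check, here is a standalone sketch that copies the logic of `deriveChatTypeFromSessionKey` under a `Sketch` name and runs it on a few inputs. The session-key strings are invented examples; the real key format is not specified in this file, and the only assumption carried over is that keys are colon-separated.

```typescript
type ChatTypeSketch = "direct" | "group" | "channel" | undefined;

// Same logic as deriveChatTypeFromSessionKey in this file, copied so the example is self-contained.
function deriveChatTypeFromSessionKeySketch(sessionKey: string | undefined): ChatTypeSketch {
  // Lowercase, split on ":", and drop empty segments so "a::b" behaves like "a:b".
  const tokens = new Set(sessionKey?.toLowerCase().split(":").filter(Boolean) ?? []);
  if (tokens.has("group")) {
    return "group";
  }
  if (tokens.has("channel")) {
    return "channel";
  }
  if (tokens.has("direct") || tokens.has("dm")) {
    return "direct";
  }
  return undefined;
}

// Hypothetical session keys, chosen only to exercise each branch.
const chatTypeExamples = {
  group: deriveChatTypeFromSessionKeySketch("agent:main:telegram:GROUP:123"), // "group"
  channel: deriveChatTypeFromSessionKeySketch("agent:main:discord:channel:42"), // "channel"
  direct: deriveChatTypeFromSessionKeySketch("agent:main:dm:alice"), // "direct"
  groupWins: deriveChatTypeFromSessionKeySketch("x:channel:group"), // "group"
  unknown: deriveChatTypeFromSessionKeySketch("agent:main"), // undefined
  missing: deriveChatTypeFromSessionKeySketch(undefined), // undefined
};
```

Matching is on whole tokens, so a key segment such as "groups" or "mygroup" would not count as "group", and when several markers are present the check order (group, then channel, then direct) decides the result.
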