Codex Plugin · Crawlbase Documentation

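What it does

The Crawlbase Codex plugin packages Crawlbase MCP as a native Codex plugin. Once it is installed, you can ask Codex in plain language to crawl a page, extract its content, or take a screenshot. Codex picks the appropriate tool, calls Crawlbase, and returns the result.

It runs on Crawlbase infrastructure: JavaScript rendering, automatic proxy rotation, and built-in anti-bot bypass. You get the same reliability you rely on in production, through a conversational interface in Codex.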

Source code

The plugin is open source: github.com/crawlbase/crawlbase-codex-plugin. Issues and PRs are welcome.

Prerequisites

You need a Crawlbase account and two API tokens:

CRAWLBASE_TOKEN

required

Normal token: used for static pages.

CRAWLBASE_JS_TOKEN

required

JavaScript token: used for JS-rendered pages and for all screenshot operations.

Get both tokens from your dashboard. See Authentication for the difference between them.

Install from the Codex Marketplace

Open Codex and go to Plugins → Browse Marketplace.
Search for Crawlbase Web Scraper.
Click Install.
When prompted, add your CRAWLBASE_TOKEN and CRAWLBASE_JS_TOKEN.

Marketplace listing coming soon

The Marketplace listing is still under review. Until it is approved, use the manual installation below.

Manual installation

Clone the repository into your Codex plugins directory and set the environment variables:

# Clone the plugin into Codex's plugins directory
git clone https://github.com/crawlbase/crawlbase-codex-plugin \
  ~/.codex/plugins/crawlbase-mcp

# Set your tokens
export CRAWLBASE_TOKEN=YOUR_TOKEN
export CRAWLBASE_JS_TOKEN=YOUR_JS_TOKEN

# Restart Codex; it discovers the plugin automatically
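
Before restarting Codex, it can help to confirm that both tokens are visible in the current shell. The following is a minimal sketch; `check_crawlbase_tokens` is a hypothetical helper written for this page, not something the plugin ships.

```shell
# Report whether both tokens named in this guide are set in the environment.
# check_crawlbase_tokens is a hypothetical helper, not part of the plugin.
check_crawlbase_tokens() {
  missing=0
  for name in CRAWLBASE_TOKEN CRAWLBASE_JS_TOKEN; do
    # Indirect lookup that works in POSIX sh: read the variable named by $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name"
      missing=1
    fi
  done
  return "$missing"
}

if check_crawlbase_tokens; then
  echo "both tokens are set"
fi
```

Variables set with `export` last only for the current shell session, so you may want to add the two `export` lines to your shell profile if Codex is started from a different terminal.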

Usage

After installation, just ask Codex in natural language. It picks the appropriate tool and calls Crawlbase behind the scenes.

# Crawling
"Crawl https://example.com and return the HTML"
"Get the markdown content of https://example.com/article"
"Take a screenshot of https://example.com"

# Device emulation
"Fetch the page at https://example.com using a mobile browser"
"Take a full-page screenshot of https://example.com and describe what you see"

Exposed tools

The plugin registers three crawl tools and six storage tools.

Crawl tools

crawl

tool

Fetches any URL and returns the raw HTML. Accepts store: true to push the page to Cloud Storage instead of returning the result inline.

crawl_markdown

tool

Crawls a URL and returns clean Markdown: the content is extracted from the HTML noise and optimized for LLM consumption. Supports store: true.

crawl_screenshot

tool

Renders a URL as a PNG. The screenshot is returned temporarily via screenshot_url: the underlying HTML can be persisted with store: true, but the image itself is not stored.
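
Codex issues these tool calls for you, so you normally never write them by hand. For orientation only, the sketch below builds the kind of JSON-RPC `tools/call` request that MCP clients send for a tool. The tool names and the `store` argument come from the descriptions above; the `url` argument name and the `build_crawl_call` helper are assumptions made for illustration, and the real tool schema may differ.

```shell
# Build an MCP "tools/call" request body for one of the crawl tools.
# Assumption: the tool takes the page address in an argument named "url".
# No JSON escaping is done, so this only suits simple URLs.
build_crawl_call() {
  tool=$1             # crawl, crawl_markdown, or crawl_screenshot
  url=$2
  store=${3:-false}   # true pushes the page to Cloud Storage
  printf '{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"%s","arguments":{"url":"%s","store":%s}}}\n' \
    "$tool" "$url" "$store"
}

build_crawl_call crawl_markdown "https://example.com/article" true
```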

Storage tools

storage_get

tool

Fetches a single stored page by rid or url. Pass as: "json", "html", or "markdown" to choose the response format.

storage_bulk_get

tool

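Fetches up to 100 RIDs in a single call. The optional delete_after flag is intended for send-once pipelines, where pages are deleted after they are retrieved.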

storage_list

tool

Enumerates stored RIDs with scroll pagination, up to 1,000 per call.

storage_count

tool

Returns the total number of documents in your storage silo.

storage_delete

tool

Deletes a single stored page by RID.

storage_bulk_delete

tool

Deletes up to 100 RIDs in a single call.
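
Because storage_bulk_get and storage_bulk_delete each accept at most 100 RIDs per call, a longer list has to be split into batches first. The sketch below shows one way to do that with `xargs`; `batch_rids` is a hypothetical helper, and it assumes RIDs contain no whitespace, quotes, or backslashes.

```shell
# Read RIDs from stdin and print one batch per line,
# with at most 100 space-separated RIDs in each batch.
# Note: GNU xargs may print one blank line when the input is empty.
batch_rids() {
  xargs -n 100 echo
}

# 250 placeholder RIDs split into batches of 100, 100, and 50.
seq 1 250 | sed 's/^/rid/' | batch_rids | wc -l
```

Each output line can then be handed to one bulk call, for example by pasting it into a prompt such as the bulk-retrieve example in the next section.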

Storage usage examples

"Crawl https://example.com and store it in Crawlbase Cloud Storage"
"List all stored pages in Crawlbase"
"Fetch rid abc123 from storage as markdown"
"Bulk-retrieve these 50 rids and delete them afterward"
"How many pages do I have in Crawlbase storage?"

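Storage silos by token

Storage is partitioned by token. Pages crawled with CRAWLBASE_TOKEN live in a different silo from pages crawled with CRAWLBASE_JS_TOKEN (which covers JS-rendered pages and all screenshots).

Every crawl response includes a token_type field, either "normal" or "js", that tells you which silo the result landed in. When calling any storage tool, pass use_js_token: true if the entry lives in the JS silo; otherwise you can omit it.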
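
The rule above amounts to a small mapping from token_type to the storage-tool argument. A minimal sketch, with `silo_argument` as a hypothetical helper name:

```shell
# Map a crawl response's token_type to the extra storage-tool argument.
# Prints the argument to pass, or nothing for the normal silo.
silo_argument() {
  case $1 in
    js)     echo '"use_js_token": true' ;;
    normal) ;;   # normal silo: the flag can be omitted
    *)      echo "unknown token_type: $1" >&2; return 1 ;;
  esac
}

silo_argument js
```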

Querying the wrong silo returns “Not found”

If storage_get returns a not-found error for a RID you know exists, you may have queried the wrong silo. Retry with use_js_token: true (or remove it if it was already set).
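
The retry advice can be sketched as a loop that tries the normal silo first and then the JS silo. `fake_storage_get` below is a hypothetical stand-in that simulates a lookup so the example runs on its own; it is not the real tool, and in practice you would simply ask Codex to retry with the other setting.

```shell
# Hypothetical stand-in for a storage_get lookup: succeeds only when the
# RID is requested from the silo that holds it (here, abc123 in the JS silo).
fake_storage_get() {
  lookup_rid=$1
  lookup_js=$2
  [ "$lookup_rid" = "abc123" ] && [ "$lookup_js" = "true" ]
}

# Try the normal silo first, then retry once with use_js_token set.
# Prints the use_js_token value that worked, or fails if neither did.
get_from_either_silo() {
  for use_js in false true; do
    if fake_storage_get "$1" "$use_js"; then
      echo "$use_js"
      return 0
    fi
  done
  return 1
}

get_from_either_silo abc123
```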

Related

Crawlbase MCP Server - the underlying MCP server this plugin wraps
Cloud Storage - the storage backend
Prompt patterns - battle-tested prompts you can adapt for Codex
