AI | June 18, 2026 | 3 min read

Viscribe in Practice: Helping AI Agents Understand Images the Schema-Driven Way

tinyash

Getting an AI Agent to "understand" an image is not the hard part. The hard part is getting the Agent to return structured results.

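In most AI Agent applications today, the basic pattern for handling images is to hand the image to a multimodal model and then manually parse the needed information out of its free-text reply. The problem is immediate: the same image and the same Agent can produce a differently formatted reply from one day to the next, so the parsing logic becomes the most fragile link in the Agent toolchain.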
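To make that fragility concrete, here is a small standard-library sketch. The two reply strings are invented for illustration; they stand in for what a multimodal model might return for the same receipt on different days.

```python
import re
from typing import Optional

# Two hypothetical free-text replies describing the same receipt.
REPLY_MONDAY = "The total amount is $42.50, paid at Blue Bottle."
REPLY_TUESDAY = "Merchant: Blue Bottle\nTotal (USD): 42.50"

# A parser written against Monday's wording.
TOTAL_PATTERN = re.compile(r"total amount is \$(\d+\.\d{2})")


def parse_total(reply: str) -> Optional[float]:
    """Return the total if the reply matches the expected wording, else None."""
    match = TOTAL_PATTERN.search(reply)
    return float(match.group(1)) if match else None


print(parse_total(REPLY_MONDAY))   # 42.5
print(parse_total(REPLY_TUESDAY))  # None
```

The information is present in both replies, but the parser only survives one of the two phrasings. That is the failure mode a fixed output schema is meant to remove.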

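Viscribe is built to fill this gap. It is an open-source (MIT) Schema-Driven image understanding library for Python and TypeScript, and its core idea is simple: you define an output schema, pass in an image, choose a model, and get back structured data instead of free text.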

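Schema as Contract

Free-text output from traditional image processing is like signing a contract with no fixed form: you know roughly what you will get, but you can never be sure of the exact format.

Viscribe's approach is straightforward: define the output format first, then have the model return data in that format.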

from pydantic import BaseModel, Field
from viscribe.images import extract


class Receipt(BaseModel):
    merchant_name: str | None = Field(description="Store or business name")
    total_amount: float | None = Field(description="Final total on the receipt")
    date: str | None = Field(description="Receipt date if visible")
    line_items: list[str] = Field(description="Visible purchased items")


result = extract(
    image_path="examples/receipt.png",
    output_schema=Receipt,
    instruction="Extract the receipt fields visible in the image.",
    model_config={
        "model": "gpt-5-mini",
        "api_key": "***",
        "temperature": 1,
    },
)

print(result.data.model_dump())

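A single Receipt model defines every field you expect: merchant name, total amount, date, and line items. The request to the model is no longer "describe this image" but "extract according to this structure".

The TypeScript version is just as concise: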

import { images } from "viscribe";

const result = await images.extract({
  imagePath: "examples/receipt.png",
  outputSchema: [
    { name: "merchant_name", type: "text", description: "Store or business name" },
    { name: "total_amount", type: "number", description: "Final total on the receipt" },
    { name: "date", type: "text", description: "Receipt date if visible" },
    { name: "line_items", type: "array_text", description: "Visible purchased items" },
  ],
  instruction: "Extract the receipt fields visible in the image.",
  modelConfig: {
    model: "gpt-5-mini",
    apiKey: "sk-...",
    temperature: 1,
  },
});

console.log(result.data);

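The biggest benefit of treating the schema as a contract is that the Agent's downstream logic no longer has to cope with format drift. Whichever model you use today, the shape of the output is fixed by the schema.

Installation and Configuration

Installing Viscribe takes a single command in either ecosystem: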

pip install viscribe

npm install viscribe

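Model configuration works with vision models that follow the OpenAI API format. You only need to pass model_config when calling, specifying the model name and API key: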

from viscribe.images import extract

result = extract(
    image_path="screenshot.png",
    output_schema=[{"name": "summary", "type": "text", "description": "Screen content"}],
    instruction="Describe what you see.",
    model_config={"model": "gpt-5-mini", "api_key": "sk-..."},
)

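Load the API key from an environment variable rather than hardcoding it in source. Viscribe works with any vision model provider that exposes an OpenAI-compatible API, so Claude and Gemini (through an OpenAI-compatible endpoint) and locally deployed vLLM instances can all be used.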
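A minimal sketch of loading the key from the environment is shown below. The helper name build_model_config and the variable name OPENAI_API_KEY are assumptions made for this example, not part of Viscribe; only the "model" and "api_key" keys come from the calls shown above.

```python
import os


def build_model_config(model: str = "gpt-5-mini", env_var: str = "OPENAI_API_KEY") -> dict:
    """Build a model_config dict, reading the API key from the environment.

    The env_var default is an assumption; use whatever name your deployment defines.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        # Fail early with a clear message instead of sending an empty key.
        raise RuntimeError(f"Set {env_var} before calling extract()")
    return {"model": model, "api_key": api_key}
```

The result can then be passed as model_config=build_model_config() in the extract call, which keeps the key out of version control.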

Core API: Three Ways to Call Extract

Viscribe's API follows a "one function covers every scenario" principle. Alongside model_config, extract takes three core parameters:

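Parameter	Description	Example values
`image_path`	Image source (local path, base64, or remote URL)	`"receipt.png"`, `"data:image/..."`
`output_schema`	Output structure definition (Pydantic, JSON Schema, or simple fields)	`Receipt`, or a simple field list
`instruction`	Natural-language instruction	`"Extract the receipt fields"`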
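For the base64 case, the table's `"data:image/..."` example suggests a data URL. The sketch below builds one with the standard library only. The helper to_data_url is hypothetical, and whether Viscribe expects exactly this form is an assumption to check against its documentation.

```python
import base64
import mimetypes


def to_data_url(image_bytes: bytes, filename: str) -> str:
    """Encode raw image bytes as a data: URL, guessing the MIME type from the file name."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Not a recognized image file name: {filename}")
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# The four bytes below are just a placeholder payload, not a valid PNG.
url = to_data_url(b"\x89PNG", "receipt.png")
print(url.split(",", 1)[0])  # data:image/png;base64
```

This is useful when the image never touches disk, for example when an Agent receives the bytes from an upload or a screenshot tool.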

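Schemas can be written in three ways:

Pydantic model (Python only): the most powerful option, with type annotations, optional fields, and nested models.

JSON Schema dict: a general-purpose format supported in both Python and TypeScript.

Simple field list: a lightweight way to get started quickly: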

result = extract(
    image_path="examples/venice.png",
    output_schema=[
        {"name": "location", "type": "text", "description": "Likely place shown"},
        {"name": "visible_elements", "type": "array_text", "description": "Objects and structures"},
        {"name": "colors", "type": "array_text", "description": "Dominant colors"},
    ],
    instruction="Extract useful scene metadata for a travel catalog.",
)
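The JSON Schema form is mentioned above but not shown. As a rough sketch, the receipt fields from the earlier Receipt model could be written as a standard JSON Schema object like the one below. Which JSON Schema features Viscribe accepts (for example nullable type unions or "required") is an assumption here, so treat this as a starting point rather than a confirmed input format.

```python
import json

# JSON Schema equivalent of the Receipt fields used earlier in the article.
RECEIPT_SCHEMA = {
    "type": "object",
    "properties": {
        "merchant_name": {"type": ["string", "null"], "description": "Store or business name"},
        "total_amount": {"type": ["number", "null"], "description": "Final total on the receipt"},
        "date": {"type": ["string", "null"], "description": "Receipt date if visible"},
        "line_items": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Visible purchased items",
        },
    },
    "required": ["merchant_name", "total_amount", "date", "line_items"],
}

# The schema is plain data, so it can be stored as JSON and shared with a TypeScript service.
print(json.dumps(RECEIPT_SCHEMA, indent=2))
```

Because the schema is plain data, the same definition can live in a JSON file and be loaded by both the Python and the TypeScript side of a project.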

For workloads that need high concurrency, Viscribe provides an async version, aextract:

import asyncio
from viscribe.images import aextract


async def main() -> None:
    result = await aextract(
        image_path="examples/receipt.png",
        output_schema=[{"name": "total_amount", "type": "number"}],
        instruction="Extract the visible receipt total.",
    )
    print(result.data)


asyncio.run(main())
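With an async call available, many images can be processed concurrently while still capping the number of requests in flight. The sketch below shows one common way to do that with asyncio.Semaphore and asyncio.gather. Both extract_many and fake_extract are hypothetical names; fake_extract stands in for a real aextract call so the example runs offline.

```python
import asyncio
from typing import Any, Awaitable, Callable, List


async def extract_many(
    image_paths: List[str],
    extract_one: Callable[[str], Awaitable[Any]],
    max_concurrency: int = 4,
) -> List[Any]:
    """Run extract_one over every path with at most max_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(path: str) -> Any:
        async with semaphore:
            return await extract_one(path)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(p) for p in image_paths))


async def fake_extract(path: str) -> dict:
    """Stand-in for a real aextract call (assumption: one awaitable per image)."""
    await asyncio.sleep(0.01)
    return {"path": path, "total_amount": 1.0}


results = asyncio.run(extract_many(["a.png", "b.png", "c.png"], fake_extract, max_concurrency=2))
print([r["path"] for r in results])  # ['a.png', 'b.png', 'c.png']
```

In real use, extract_one would be a small wrapper that awaits aextract with a fixed output_schema and instruction. The concurrency cap helps stay under provider rate limits.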

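Reusable Client Pattern

An Agent application usually calls the image analysis API many times. Passing model_config on every call is verbose and error-prone, so Viscribe provides a reusable client: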

from viscribe import ViscribeAI

client = ViscribeAI(model_config={"model": "gpt-5-mini", "temperature": 1})

result = client.images.extract(
    image_path="examples/receipt.png",
    output_schema=[{"name": "total_amount", "type": "number"}],
)

import { ViscribeAI } from "viscribe";

const client = new ViscribeAI({
  modelConfig: { model: "gpt-5-mini", temperature: 1 },
});

const result = await client.images.extract({
  imagePath: "examples/receipt.png",
  outputSchema: [{ name: "total_amount", type: "number" }],
});

console.log(result.data);

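This pattern is especially valuable in Agent tooling: you can initialize the client once as global configuration in the Agent's context, and each tool call then only has to care about two parameters, image_path and output_schema.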

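In Practice: An Image Analysis Tool Inside an Agent

Suppose you are building an Agent workflow in which the Agent must read and analyze tables in PDF screenshots. Without Viscribe, you would need to pass the image to a multimodal model, parse the free-text reply, handle inconsistent formats, and manually verify that every field is present.

With Viscribe, the whole flow shrinks to two steps: define a schema, then call extract.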

from pydantic import BaseModel, Field
from viscribe.images import extract


class TableData(BaseModel):
    row_count: int = Field(description="Number of data rows in the table")
    headers: list[str] = Field(description="Column header names")
    summary: str = Field(description="Brief summary of what the table shows")


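def analyze_table_screenshot(image_path: str) -> TableData:
    # Ask the model for exactly the fields declared on TableData.
    result = extract(
        image_path=image_path,
        output_schema=TableData,
        instruction="Extract table metadata from the screenshot.",
        # temperature kept at 1 to match the other gpt-5-mini examples in this article
        model_config={"model": "gpt-5-mini", "temperature": 1},
    )
    return result.data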

The function returns a plain TableData object with complete Python type hints, so downstream logic can use it directly, with no regex matching and no field-name mapping.

Use Cases at a Glance

Viscribe's Schema-Driven approach is particularly effective in the following scenarios:

Scenario	Example schema fields
Receipt/invoice parsing	`merchant_name`, `total_amount`, `date`, `line_items`
Product catalog building	`name`, `brand`, `category`, `attributes`, `tags`
Visual summary/scene description	`summary`, `visible_objects`, `scene_type`
Classification and routing	`category`, `confidence`, `rationale`
Visual inspection/QA	`status`, `issues`, `review_notes`
Agent tool integration	`answer`, `evidence`, `next_action`
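To illustrate how such fields get consumed, here is a sketch of the classification and routing row. It dispatches on `category` and falls back to manual review when `confidence` is low. The threshold, the handler names, and the sample results are all assumptions for this example. A model-reported confidence is also self-assessed rather than calibrated, so a cutoff like this is only a heuristic.

```python
from typing import Callable, Dict

CONFIDENCE_THRESHOLD = 0.8  # assumed cutoff; tune per workload


def route(result: dict, handlers: Dict[str, Callable[[dict], str]]) -> str:
    """Dispatch on category; send unknown or low-confidence results to review."""
    category = result.get("category")
    confidence = result.get("confidence", 0.0)
    handler = handlers.get(category)
    if handler is None or confidence < CONFIDENCE_THRESHOLD:
        return f"needs_review: {result.get('rationale', 'no rationale given')}"
    return handler(result)


# Hypothetical handlers keyed by the categories the schema allows.
handlers = {
    "invoice": lambda r: "sent to accounting",
    "id_document": lambda r: "sent to KYC",
}

print(route({"category": "invoice", "confidence": 0.93, "rationale": "has totals"}, handlers))
# sent to accounting
print(route({"category": "invoice", "confidence": 0.41, "rationale": "blurry scan"}, handlers))
# needs_review: blurry scan
```

Because the field names are fixed by the schema, the routing code can read them directly without first working out how the model phrased its answer.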

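Caveats

A few practical points are worth keeping in mind:

Model choice affects quality: gpt-5-mini is fast and cheap and suits simple field extraction; for complex nested schemas, consider a stronger model (such as Claude Sonnet or GPT-5).

Image input formats: three input forms are supported, namely a local file path, base64-encoded data, and a remote URL. If the Agent fetches images from the web, passing the URL directly is the least work.

Schema design: the more specific a field's description, the more accurate the model's output. Instead of a generic "the thing", write "Business name as it appears on the storefront".

Retries: transient errors are handled automatically by the underlying provider client, so you do not need to implement retry logic yourself.

Unsupported files: native PDF is not supported directly; convert it to an image first (take a screenshot or render the PDF to PNG).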
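The schema design point above can be turned into a quick pre-flight check. The sketch below scans a simple field list for missing or vague descriptions. The lint_fields helper, the word list, and the length cutoff are arbitrary choices made for this example and are not part of Viscribe.

```python
from typing import List

VAGUE_WORDS = {"thing", "stuff", "data", "info", "value"}  # illustrative list


def lint_fields(fields: List[dict], min_words: int = 3) -> List[str]:
    """Return warnings for simple-field entries whose description is missing or vague."""
    warnings = []
    for field in fields:
        name = field.get("name", "<unnamed>")
        description = field.get("description", "").strip()
        words = description.lower().split()
        if not description:
            warnings.append(f"{name}: missing description")
        elif len(words) < min_words:
            warnings.append(f"{name}: description is very short")
        elif VAGUE_WORDS & set(words):
            warnings.append(f"{name}: description uses a vague word")
    return warnings


fields = [
    {"name": "total_amount", "type": "number"},
    {"name": "item", "type": "text", "description": "the thing"},
    {"name": "merchant_name", "type": "text",
     "description": "Business name as it appears on the storefront"},
]
print(lint_fields(fields))
# ['total_amount: missing description', 'item: description is very short']
```

A check like this can run in tests or CI so that vague descriptions are caught before they reduce extraction quality.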

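Summary

The core problem Viscribe solves is this: when an AI Agent needs to extract data from an image, how do you make the output predictable and consumable? The Schema-Driven approach takes the guesswork out of structured output from multimodal models, and the image understanding step in the Agent toolchain finally gets a definite "type signature".

With 23 stars on GitHub at the time of writing, Viscribe is still at an early stage, but its API design and documentation are already at a production-ready level. If your Agent workflow has any "look at the image and fill in the form" need, Viscribe is a lightweight option worth trying.

Project: github.com/itsperini/viscribe | Docs: docs.viscribe.ai | PyPI: pip install viscribe | npm: npm install viscribe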
