Framework Docs

Extract

Create Extract Job
POST /api/v2/extract
List Extract Jobs
GET /api/v2/extract
Get Extract Job
GET /api/v2/extract/{job_id}
Delete Extract Job
DELETE /api/v2/extract/{job_id}
Validate Extraction Schema
POST /api/v2/extract/schema/validation
Generate Extraction Schema
POST /api/v2/extract/schema/generate
Models
ExtractConfiguration object { data_schema, cite_sources, confidence_scores, 8 more }

Extract configuration combining parse and extract settings.

data_schema: map[map[unknown] or array of unknown or string or 2 more]

JSON Schema defining the fields to extract. Validate it with POST /api/v2/extract/schema/validation first.

One of the following:
map[unknown]
array of unknown
string
number
boolean
cite_sources: optional boolean

Include citations in results

confidence_scores: optional boolean

Include confidence scores in results

extraction_target: optional "per_doc" or "per_page" or "per_table_row"

Granularity of extraction: per_doc returns one object per document, per_page returns one object per page, per_table_row returns one object per table row

One of the following:
"per_doc"
"per_page"
"per_table_row"
max_pages: optional number

Maximum number of pages to process. Omit for no limit.

minimum: 1
parse_config_id: optional string

Saved parse configuration ID to control how the document is parsed before extraction

parse_tier: optional string

Parse tier to use before extraction. Defaults to the extract tier if not specified.

system_prompt: optional string

Custom system prompt to guide extraction behavior

target_pages: optional string

Comma-separated page numbers or ranges to process (1-based). Omit to process all pages.

tier: optional "cost_effective" or "agentic"

Extract tier: cost_effective (5 credits/page) or agentic (15 credits/page)

One of the following:
"cost_effective"
"agentic"
version: optional string

Use ‘latest’ for the latest release for the selected tier or a date string (YYYY-MM-DD format) to pin to the nearest release at or before that date.
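
As a rough illustration of how the fields above fit together, the following Python sketch assembles an inline ExtractConfiguration as a plain dict. The helper name `build_extract_configuration` and the invoice schema are hypothetical; only the field names, the allowed values, and the `max_pages` minimum come from the model above.

```python
from typing import Any, Optional

# Allowed values copied from the ExtractConfiguration model above.
EXTRACTION_TARGETS = ("per_doc", "per_page", "per_table_row")
TIERS = ("cost_effective", "agentic")


def build_extract_configuration(
    data_schema: dict[str, Any],
    *,
    tier: Optional[str] = None,
    extraction_target: Optional[str] = None,
    cite_sources: Optional[bool] = None,
    confidence_scores: Optional[bool] = None,
    max_pages: Optional[int] = None,
    target_pages: Optional[str] = None,
) -> dict[str, Any]:
    """Hypothetical helper: return a dict shaped like ExtractConfiguration."""
    if tier is not None and tier not in TIERS:
        raise ValueError(f"tier must be one of {TIERS}")
    if extraction_target is not None and extraction_target not in EXTRACTION_TARGETS:
        raise ValueError(f"extraction_target must be one of {EXTRACTION_TARGETS}")
    if max_pages is not None and max_pages < 1:
        raise ValueError("max_pages must be at least 1")

    optional_fields = {
        "tier": tier,
        "extraction_target": extraction_target,
        "cite_sources": cite_sources,
        "confidence_scores": confidence_scores,
        "max_pages": max_pages,
        "target_pages": target_pages,
    }
    config: dict[str, Any] = {"data_schema": data_schema}
    # Omit unset options rather than sending nulls.
    config.update({k: v for k, v in optional_fields.items() if v is not None})
    return config


# Illustrative schema only; the field names are invented for this example.
invoice_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "total": {"type": "number"},
    },
}
config = build_extract_configuration(
    invoice_schema,
    tier="cost_effective",
    extraction_target="per_doc",
    confidence_scores=True,
)
```

The resulting dict can be passed as the `configuration` field of a create request. The sketch checks values locally only and does not replace the schema validation endpoint.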

ExtractJobMetadata object { field_metadata, parse_job_id, parse_tier }

Extraction metadata.

field_metadata: optional ExtractedFieldMetadata { document_metadata, page_metadata, row_metadata }

Metadata for extracted fields including document, page, and row level info.

document_metadata: optional map[map[unknown] or array of unknown or string or 2 more]

Per-field metadata keyed by field name from your schema. Scalar fields (e.g. vendor) map to a FieldMetadataEntry with citation and confidence. Array fields (e.g. items) map to a list where each element contains per-sub-field FieldMetadataEntry objects, indexed by array position. Nested objects contain sub-field entries recursively.

One of the following:
map[unknown]
array of unknown
string
number
boolean
page_metadata: optional array of map[map[unknown] or array of unknown or string or 2 more]

Per-page metadata when extraction_target is per_page

One of the following:
map[unknown]
array of unknown
string
number
boolean
row_metadata: optional array of map[map[unknown] or array of unknown or string or 2 more]

Per-row metadata when extraction_target is per_table_row

One of the following:
map[unknown]
array of unknown
string
number
boolean
parse_job_id: optional string

Reference to the ParseJob ID used for parsing

parse_tier: optional string

Parse tier used for parsing the document

ExtractJobUsage object { num_pages_extracted }

Extraction usage metrics.

num_pages_extracted: optional number

Number of pages extracted
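
For a back-of-the-envelope cost check, `num_pages_extracted` can be combined with the per-page rates quoted in the `tier` description (5 credits/page for cost_effective, 15 for agentic). The function name below is hypothetical, and the idea that billed credits equal pages times the tier rate, with nothing added for parsing, is an assumption of this sketch rather than something this reference states.

```python
# Per-page rates as quoted in the `tier` field description.
CREDITS_PER_PAGE = {"cost_effective": 5, "agentic": 15}


def estimate_extract_credits(num_pages_extracted: int, tier: str) -> int:
    """Hypothetical estimate: pages extracted multiplied by the tier's rate."""
    if tier not in CREDITS_PER_PAGE:
        raise ValueError(f"unknown tier: {tier!r}")
    return num_pages_extracted * CREDITS_PER_PAGE[tier]


estimate = estimate_extract_credits(12, "agentic")
```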

ExtractV2Job object { id, created_at, file_input, 9 more }

An extraction job.

id: string

Unique job identifier (job_id)

created_at: string

Creation timestamp

format: date-time
file_input: string

File ID or parse job ID that was extracted

project_id: string

Project this job belongs to

status: string

Current job status.

  • PENDING — queued, not yet started
  • RUNNING — actively processing
  • COMPLETED — finished successfully
  • FAILED — terminated with an error
  • CANCELLED — cancelled by user
updated_at: string

Last update timestamp

format: date-time
configuration: optional ExtractConfiguration { data_schema, cite_sources, confidence_scores, 8 more }

Extract configuration combining parse and extract settings.

data_schema: map[map[unknown] or array of unknown or string or 2 more]

JSON Schema defining the fields to extract. Validate it with POST /api/v2/extract/schema/validation first.

One of the following:
map[unknown]
array of unknown
string
number
boolean
cite_sources: optional boolean

Include citations in results

confidence_scores: optional boolean

Include confidence scores in results

extraction_target: optional "per_doc" or "per_page" or "per_table_row"

Granularity of extraction: per_doc returns one object per document, per_page returns one object per page, per_table_row returns one object per table row

One of the following:
"per_doc"
"per_page"
"per_table_row"
max_pages: optional number

Maximum number of pages to process. Omit for no limit.

minimum: 1
parse_config_id: optional string

Saved parse configuration ID to control how the document is parsed before extraction

parse_tier: optional string

Parse tier to use before extraction. Defaults to the extract tier if not specified.

system_prompt: optional string

Custom system prompt to guide extraction behavior

target_pages: optional string

Comma-separated page numbers or ranges to process (1-based). Omit to process all pages.

tier: optional "cost_effective" or "agentic"

Extract tier: cost_effective (5 credits/page) or agentic (15 credits/page)

One of the following:
"cost_effective"
"agentic"
version: optional string

Use ‘latest’ for the latest release for the selected tier or a date string (YYYY-MM-DD format) to pin to the nearest release at or before that date.

configuration_id: optional string

Saved extract configuration ID used for this job, if any

error_message: optional string

Error details when status is FAILED

extract_metadata: optional ExtractJobMetadata { field_metadata, parse_job_id, parse_tier }

Extraction metadata.

field_metadata: optional ExtractedFieldMetadata { document_metadata, page_metadata, row_metadata }

Metadata for extracted fields including document, page, and row level info.

document_metadata: optional map[map[unknown] or array of unknown or string or 2 more]

Per-field metadata keyed by field name from your schema. Scalar fields (e.g. vendor) map to a FieldMetadataEntry with citation and confidence. Array fields (e.g. items) map to a list where each element contains per-sub-field FieldMetadataEntry objects, indexed by array position. Nested objects contain sub-field entries recursively.

One of the following:
map[unknown]
array of unknown
string
number
boolean
page_metadata: optional array of map[map[unknown] or array of unknown or string or 2 more]

Per-page metadata when extraction_target is per_page

One of the following:
map[unknown]
array of unknown
string
number
boolean
row_metadata: optional array of map[map[unknown] or array of unknown or string or 2 more]

Per-row metadata when extraction_target is per_table_row

One of the following:
map[unknown]
array of unknown
string
number
boolean
parse_job_id: optional string

Reference to the ParseJob ID used for parsing

parse_tier: optional string

Parse tier used for parsing the document

extract_result: optional map[map[unknown] or array of unknown or string or 2 more] or array of map[map[unknown] or array of unknown or string or 2 more]

Extracted data conforming to the data_schema. Returns a single object for per_doc, or an array for per_page / per_table_row.

One of the following:
map[map[unknown] or array of unknown or string or 2 more]
One of the following:
map[unknown]
array of unknown
string
number
boolean
array of map[map[unknown] or array of unknown or string or 2 more]
One of the following:
map[unknown]
array of unknown
string
number
boolean
metadata: optional object { usage }

Job-level metadata.

usage: optional ExtractJobUsage { num_pages_extracted }

Extraction usage metrics.

num_pages_extracted: optional number

Number of pages extracted
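
Because `extract_result` is a single object for per_doc and an array for per_page / per_table_row, client code often normalizes it before use. The sketch below works on an ExtractV2Job already decoded from JSON into a dict. The helper names are hypothetical, and treating COMPLETED, FAILED, and CANCELLED as the terminal states is an inference from the status descriptions above.

```python
from typing import Any

# Inferred from the status list: these states are not expected to change again.
TERMINAL_STATUSES = {"COMPLETED", "FAILED", "CANCELLED"}


def is_terminal(job: dict[str, Any]) -> bool:
    """True once the job has stopped and polling can end."""
    return job["status"] in TERMINAL_STATUSES


def result_records(job: dict[str, Any]) -> list[dict[str, Any]]:
    """Return extract_result as a list, whatever the extraction_target was."""
    if job["status"] == "FAILED":
        raise RuntimeError(job.get("error_message") or "extraction failed")
    result = job.get("extract_result")
    if result is None:
        return []  # not finished yet, or nothing was extracted
    if isinstance(result, list):
        return result  # per_page / per_table_row
    return [result]  # per_doc: wrap the single object


# Sample payloads invented for illustration.
per_doc_job = {"status": "COMPLETED", "extract_result": {"vendor": "Acme"}}
per_page_job = {"status": "COMPLETED", "extract_result": [{"total": 1}, {"total": 2}]}
records = result_records(per_doc_job) + result_records(per_page_job)
```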

ExtractV2JobCreate object { file_input, configuration, configuration_id, webhook_configurations }

Request to create an extraction job. Provide configuration_id or inline configuration.

file_input: string

File ID or parse job ID to extract from

maxLength: 200
configuration: optional ExtractConfiguration { data_schema, cite_sources, confidence_scores, 8 more }

Extract configuration combining parse and extract settings.

data_schema: map[map[unknown] or array of unknown or string or 2 more]

JSON Schema defining the fields to extract. Validate it with POST /api/v2/extract/schema/validation first.

One of the following:
map[unknown]
array of unknown
string
number
boolean
cite_sources: optional boolean

Include citations in results

confidence_scores: optional boolean

Include confidence scores in results

extraction_target: optional "per_doc" or "per_page" or "per_table_row"

Granularity of extraction: per_doc returns one object per document, per_page returns one object per page, per_table_row returns one object per table row

One of the following:
"per_doc"
"per_page"
"per_table_row"
max_pages: optional number

Maximum number of pages to process. Omit for no limit.

minimum: 1
parse_config_id: optional string

Saved parse configuration ID to control how the document is parsed before extraction

parse_tier: optional string

Parse tier to use before extraction. Defaults to the extract tier if not specified.

system_prompt: optional string

Custom system prompt to guide extraction behavior

target_pages: optional string

Comma-separated page numbers or ranges to process (1-based). Omit to process all pages.

tier: optional "cost_effective" or "agentic"

Extract tier: cost_effective (5 credits/page) or agentic (15 credits/page)

One of the following:
"cost_effective"
"agentic"
version: optional string

Use ‘latest’ for the latest release for the selected tier or a date string (YYYY-MM-DD format) to pin to the nearest release at or before that date.

configuration_id: optional string

Saved configuration ID

webhook_configurations: optional array of object { webhook_events, webhook_headers, webhook_output_format, webhook_url }

Outbound webhook endpoints to notify on job status changes

webhook_events: optional array of "extract.pending" or "extract.success" or "extract.error" or 20 more

Events to subscribe to (e.g. ‘parse.success’, ‘extract.error’). If null, all events are delivered.

One of the following:
"extract.pending"
"extract.success"
"extract.error"
"extract.partial_success"
"extract.cancelled"
"parse.pending"
"parse.running"
"parse.success"
"parse.error"
"parse.partial_success"
"parse.cancelled"
"classify.pending"
"classify.running"
"classify.success"
"classify.error"
"classify.partial_success"
"classify.cancelled"
"sheets.pending"
"sheets.success"
"sheets.error"
"sheets.partial_success"
"sheets.cancelled"
"unmapped_event"
webhook_headers: optional map[string]

Custom HTTP headers sent with each webhook request (e.g. auth tokens)

webhook_output_format: optional string

Response format sent to the webhook: ‘string’ (default) or ‘json’

webhook_url: optional string

URL to receive webhook POST notifications
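
The constraints on ExtractV2JobCreate can be checked client-side before sending the request. The following sketch builds the JSON body for POST /api/v2/extract. The helper name, the file ID, and the webhook URL are made up for illustration. The sketch only requires that at least one of `configuration` or `configuration_id` is present, since this reference does not say what happens when both are supplied.

```python
from typing import Any, Optional


def build_create_job_body(
    file_input: str,
    *,
    configuration: Optional[dict[str, Any]] = None,
    configuration_id: Optional[str] = None,
    webhook_url: Optional[str] = None,
    webhook_events: Optional[list[str]] = None,
) -> dict[str, Any]:
    """Hypothetical helper: return a dict shaped like ExtractV2JobCreate."""
    if len(file_input) > 200:
        raise ValueError("file_input must be at most 200 characters")
    if configuration is None and configuration_id is None:
        raise ValueError("provide configuration_id or an inline configuration")

    body: dict[str, Any] = {"file_input": file_input}
    if configuration is not None:
        body["configuration"] = configuration
    if configuration_id is not None:
        body["configuration_id"] = configuration_id
    if webhook_url is not None:
        hook: dict[str, Any] = {"webhook_url": webhook_url}
        # Leaving webhook_events out subscribes to all events, per the field docs.
        if webhook_events is not None:
            hook["webhook_events"] = list(webhook_events)
        body["webhook_configurations"] = [hook]
    return body


# Placeholder identifiers, not real values.
body = build_create_job_body(
    "file_123",
    configuration_id="cfg_456",
    webhook_url="https://example.com/hooks/extract",
    webhook_events=["extract.success", "extract.error"],
)
```

Authentication and the base URL are not covered in this section, so the sketch stops at building the body.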

ExtractV2JobQueryResponse object { items, next_page_token, total_size }

Paginated list of extraction jobs.

items: array of ExtractV2Job { id, created_at, file_input, 9 more }

The list of items.

id: string

Unique job identifier (job_id)

created_at: string

Creation timestamp

format: date-time
file_input: string

File ID or parse job ID that was extracted

project_id: string

Project this job belongs to

status: string

Current job status.

  • PENDING — queued, not yet started
  • RUNNING — actively processing
  • COMPLETED — finished successfully
  • FAILED — terminated with an error
  • CANCELLED — cancelled by user
updated_at: string

Last update timestamp

format: date-time
configuration: optional ExtractConfiguration { data_schema, cite_sources, confidence_scores, 8 more }

Extract configuration combining parse and extract settings.

data_schema: map[map[unknown] or array of unknown or string or 2 more]

JSON Schema defining the fields to extract. Validate it with POST /api/v2/extract/schema/validation first.

One of the following:
map[unknown]
array of unknown
string
number
boolean
cite_sources: optional boolean

Include citations in results

confidence_scores: optional boolean

Include confidence scores in results

extraction_target: optional "per_doc" or "per_page" or "per_table_row"

Granularity of extraction: per_doc returns one object per document, per_page returns one object per page, per_table_row returns one object per table row

One of the following:
"per_doc"
"per_page"
"per_table_row"
max_pages: optional number

Maximum number of pages to process. Omit for no limit.

minimum: 1
parse_config_id: optional string

Saved parse configuration ID to control how the document is parsed before extraction

parse_tier: optional string

Parse tier to use before extraction. Defaults to the extract tier if not specified.

system_prompt: optional string

Custom system prompt to guide extraction behavior

target_pages: optional string

Comma-separated page numbers or ranges to process (1-based). Omit to process all pages.

tier: optional "cost_effective" or "agentic"

Extract tier: cost_effective (5 credits/page) or agentic (15 credits/page)

One of the following:
"cost_effective"
"agentic"
version: optional string

Use ‘latest’ for the latest release for the selected tier or a date string (YYYY-MM-DD format) to pin to the nearest release at or before that date.

configuration_id: optional string

Saved extract configuration ID used for this job, if any

error_message: optional string

Error details when status is FAILED

extract_metadata: optional ExtractJobMetadata { field_metadata, parse_job_id, parse_tier }

Extraction metadata.

field_metadata: optional ExtractedFieldMetadata { document_metadata, page_metadata, row_metadata }

Metadata for extracted fields including document, page, and row level info.

document_metadata: optional map[map[unknown] or array of unknown or string or 2 more]

Per-field metadata keyed by field name from your schema. Scalar fields (e.g. vendor) map to a FieldMetadataEntry with citation and confidence. Array fields (e.g. items) map to a list where each element contains per-sub-field FieldMetadataEntry objects, indexed by array position. Nested objects contain sub-field entries recursively.

One of the following:
map[unknown]
array of unknown
string
number
boolean
page_metadata: optional array of map[map[unknown] or array of unknown or string or 2 more]

Per-page metadata when extraction_target is per_page

One of the following:
map[unknown]
array of unknown
string
number
boolean
row_metadata: optional array of map[map[unknown] or array of unknown or string or 2 more]

Per-row metadata when extraction_target is per_table_row

One of the following:
map[unknown]
array of unknown
string
number
boolean
parse_job_id: optional string

Reference to the ParseJob ID used for parsing

parse_tier: optional string

Parse tier used for parsing the document

extract_result: optional map[map[unknown] or array of unknown or string or 2 more] or array of map[map[unknown] or array of unknown or string or 2 more]

Extracted data conforming to the data_schema. Returns a single object for per_doc, or an array for per_page / per_table_row.

One of the following:
map[map[unknown] or array of unknown or string or 2 more]
One of the following:
map[unknown]
array of unknown
string
number
boolean
array of map[map[unknown] or array of unknown or string or 2 more]
One of the following:
map[unknown]
array of unknown
string
number
boolean
metadata: optional object { usage }

Job-level metadata.

usage: optional ExtractJobUsage { num_pages_extracted }

Extraction usage metrics.

num_pages_extracted: optional number

Number of pages extracted

next_page_token: optional string

A token, which can be sent as page_token to retrieve the next page. If this field is omitted, there are no subsequent pages.

total_size: optional number

The total number of items available. This is only populated when specifically requested. The value may be an estimate and can be used for display purposes only.
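
The `next_page_token` contract lends itself to a simple loop: request a page, yield its items, and stop when no token comes back. The sketch below keeps the HTTP call out of scope by taking a `fetch_page` callable, a hypothetical stand-in for a GET /api/v2/extract request that forwards the token as `page_token` and returns the decoded response.

```python
from typing import Any, Callable, Iterator, Optional

Page = dict[str, Any]


def iter_extract_jobs(
    fetch_page: Callable[[Optional[str]], Page],
) -> Iterator[dict[str, Any]]:
    """Yield every job across pages by following next_page_token."""
    page_token: Optional[str] = None
    while True:
        page = fetch_page(page_token)
        yield from page.get("items", [])
        page_token = page.get("next_page_token")
        if not page_token:
            return  # token omitted: no subsequent pages


# In-memory stand-in for the list endpoint, keyed by page token.
fake_pages: dict[Optional[str], Page] = {
    None: {"items": [{"id": "a"}, {"id": "b"}], "next_page_token": "t1"},
    "t1": {"items": [{"id": "c"}]},
}
job_ids = [job["id"] for job in iter_extract_jobs(lambda token: fake_pages[token])]
```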

ExtractV2SchemaGenerateRequest object { data_schema, file_id, name, prompt }

Request schema for generating an extraction schema.

data_schema: optional map[map[unknown] or array of unknown or string or 2 more]

Optional schema to validate, refine, or extend

One of the following:
map[unknown]
array of unknown
string
number
boolean
file_id: optional string

Optional file ID to analyze for schema generation

name: optional string

Name for the generated configuration (auto-generated if omitted)

maxLength: 255
prompt: optional string

Natural language description of the data structure to extract

ExtractV2SchemaValidateRequest object { data_schema }

Request schema for validating an extraction schema.

data_schema: map[map[unknown] or array of unknown or string or 2 more]

JSON Schema to validate for use with extract jobs

One of the following:
map[unknown]
array of unknown
string
number
boolean
ExtractV2SchemaValidateResponse object { data_schema }

Response schema for schema validation.

data_schema: map[map[unknown] or array of unknown or string or 2 more]

Validated JSON Schema, ready for use in extract jobs

One of the following:
map[unknown]
array of unknown
string
number
boolean
ExtractedFieldMetadata object { document_metadata, page_metadata, row_metadata }

Metadata for extracted fields including document, page, and row level info.

document_metadata: optional map[map[unknown] or array of unknown or string or 2 more]

Per-field metadata keyed by field name from your schema. Scalar fields (e.g. vendor) map to a FieldMetadataEntry with citation and confidence. Array fields (e.g. items) map to a list where each element contains per-sub-field FieldMetadataEntry objects, indexed by array position. Nested objects contain sub-field entries recursively.

One of the following:
map[unknown]
array of unknown
string
number
boolean
page_metadata: optional array of map[map[unknown] or array of unknown or string or 2 more]

Per-page metadata when extraction_target is per_page

One of the following:
map[unknown]
array of unknown
string
number
boolean
row_metadata: optional array of map[map[unknown] or array of unknown or string or 2 more]

Per-row metadata when extraction_target is per_table_row

One of the following:
map[unknown]
array of unknown
string
number
boolean
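
The nesting rules described for `document_metadata` (scalar fields map to an entry, array fields to a list, nested objects recurse) can be flattened into path/entry pairs with a short recursive walk. This reference does not spell out the FieldMetadataEntry shape, so the sketch assumes an entry is a dict containing a `citation` or `confidence` key. That key-based test, the function name, and the sample payload are all assumptions, and a schema field that is itself named `citation` or `confidence` would confuse it.

```python
from typing import Any, Iterator

# Assumption: a FieldMetadataEntry is a dict with at least one of these keys.
ENTRY_KEYS = {"citation", "confidence"}


def iter_field_entries(
    node: Any, path: str = ""
) -> Iterator[tuple[str, dict[str, Any]]]:
    """Yield (field_path, entry) pairs from a document_metadata-style tree."""
    if isinstance(node, dict):
        if not ENTRY_KEYS.isdisjoint(node):
            yield path, node  # leaf: looks like a FieldMetadataEntry
            return
        for key, child in node.items():
            child_path = f"{path}.{key}" if path else key
            yield from iter_field_entries(child, child_path)
    elif isinstance(node, list):
        # Array fields: one element per array position.
        for index, child in enumerate(node):
            yield from iter_field_entries(child, f"{path}[{index}]")


# Invented sample shaped after the description above.
sample_metadata = {
    "vendor": {"confidence": 0.9},
    "items": [{"sku": {"confidence": 0.5}}, {"sku": {"confidence": 0.7}}],
    "address": {"city": {"citation": []}},
}
entries = dict(iter_field_entries(sample_metadata))
```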
ExtractDeleteResponse = unknown