# Extract
## Create Extract Job
`$ llamacloud-prod extract create`
**post** `/api/v2/extract`
Create an extraction job.
Extracts structured data from a document using either a saved
configuration or an inline JSON Schema.
## Input
Provide exactly one of:
- `configuration_id` — reference a saved extraction config
- `configuration` — inline configuration with a `data_schema`
## Document input
Set `file_input` to a file ID (`dfl-...`) or a
completed parse job ID (`pjb-...`).
The job runs asynchronously. Poll `GET /extract/{job_id}` or
register a webhook to monitor completion.
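A typical workflow creates the job and polls until it reaches a terminal state. A minimal sketch, assuming the CLI prints the JSON response shown under Response below and that `jq` is available:
```cli
# Create the job and capture its ID
JOB_ID=$(llamacloud-prod extract create \
  --api-key 'My API Key' \
  --file-input dfl-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee | jq -r '.id')

# Poll until the job reaches a terminal state
while true; do
  STATUS=$(llamacloud-prod extract get --api-key 'My API Key' --job-id "$JOB_ID" | jq -r '.status')
  [ "$STATUS" = "COMPLETED" ] || [ "$STATUS" = "FAILED" ] || [ "$STATUS" = "CANCELLED" ] && break
  sleep 5
done
```
For long-running jobs, a webhook (see `--webhook-configuration`) avoids polling entirely.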
### Parameters
- `--file-input: string`
Body param: File ID or parse job ID to extract from
- `--organization-id: optional string`
Query param
- `--project-id: optional string`
Query param
- `--configuration: optional object { data_schema, cite_sources, confidence_scores, 8 more }`
Body param: Extract configuration combining parse and extract settings.
- `--configuration-id: optional string`
Body param: Saved configuration ID
- `--webhook-configuration: optional array of object { webhook_events, webhook_headers, webhook_output_format, webhook_url }`
Body param: Outbound webhook endpoints to notify on job status changes
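The `--webhook-configuration` value is a JSON array of endpoint objects. Its shape follows the field names listed above; the event names, header, and output-format values here are illustrative assumptions, not documented values:
```json
[
  {
    "webhook_url": "https://example.com/hooks/extract",
    "webhook_events": ["COMPLETED", "FAILED"],
    "webhook_headers": { "Authorization": "Bearer <token>" },
    "webhook_output_format": "json"
  }
]
```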
### Returns
- `extract_v2_job: object { id, created_at, file_input, 9 more }`
An extraction job.
- `id: string`
Unique job identifier (job_id)
- `created_at: string`
Creation timestamp
- `file_input: string`
File ID or parse job ID that was extracted
- `project_id: string`
Project this job belongs to
- `status: string`
Current job status.
- `PENDING` — queued, not yet started
- `RUNNING` — actively processing
- `COMPLETED` — finished successfully
- `FAILED` — terminated with an error
- `CANCELLED` — cancelled by user
- `updated_at: string`
Last update timestamp
- `configuration: optional object { data_schema, cite_sources, confidence_scores, 8 more }`
Extract configuration combining parse and extract settings.
- `data_schema: map[map[unknown] or array of unknown or string or 2 more]`
JSON Schema defining the fields to extract. Validate with the /extract/schema/validation endpoint first.
- `union_member_0: map[unknown]`
- `union_member_1: array of unknown`
- `union_member_2: string`
- `union_member_3: number`
- `union_member_4: boolean`
- `cite_sources: optional boolean`
Include citations in results
- `confidence_scores: optional boolean`
Include confidence scores in results
- `extract_version: optional string`
Extract algorithm version. Use 'latest' for the default pipeline or a date string (e.g. '2026-01-08') to pin to a specific release.
- `extraction_target: optional "per_doc" or "per_page" or "per_table_row"`
Granularity of extraction: per_doc returns one object per document, per_page returns one object per page, per_table_row returns one object per table row
- `"per_doc"`
- `"per_page"`
- `"per_table_row"`
- `max_pages: optional number`
Maximum number of pages to process. Omit for no limit.
- `parse_config_id: optional string`
Saved parse configuration ID to control how the document is parsed before extraction
- `parse_tier: optional string`
Parse tier to use before extraction. Defaults to the extract tier if not specified.
- `system_prompt: optional string`
Custom system prompt to guide extraction behavior
- `target_pages: optional string`
Comma-separated page numbers or ranges to process (1-based). Omit to process all pages.
- `tier: optional "cost_effective" or "agentic"`
Extract tier: cost_effective (5 credits/page) or agentic (15 credits/page)
- `"cost_effective"`
- `"agentic"`
- `configuration_id: optional string`
Saved extract configuration ID used for this job, if any
- `error_message: optional string`
Error details when status is FAILED
- `extract_metadata: optional object { field_metadata, parse_job_id, parse_tier }`
Extraction metadata.
- `field_metadata: optional object { document_metadata, page_metadata, row_metadata }`
Metadata for extracted fields including document, page, and row level info.
- `document_metadata: optional map[map[unknown] or array of unknown or string or 2 more]`
Per-field metadata keyed by field name from your schema. Scalar fields (e.g. `vendor`) map to a FieldMetadataEntry with citation and confidence. Array fields (e.g. `items`) map to a list where each element contains per-sub-field FieldMetadataEntry objects, indexed by array position. Nested objects contain sub-field entries recursively.
- `union_member_0: map[unknown]`
- `union_member_1: array of unknown`
- `union_member_2: string`
- `union_member_3: number`
- `union_member_4: boolean`
- `page_metadata: optional array of map[map[unknown] or array of unknown or string or 2 more]`
Per-page metadata when extraction_target is per_page
- `union_member_0: map[unknown]`
- `union_member_1: array of unknown`
- `union_member_2: string`
- `union_member_3: number`
- `union_member_4: boolean`
- `row_metadata: optional array of map[map[unknown] or array of unknown or string or 2 more]`
Per-row metadata when extraction_target is per_table_row
- `union_member_0: map[unknown]`
- `union_member_1: array of unknown`
- `union_member_2: string`
- `union_member_3: number`
- `union_member_4: boolean`
- `parse_job_id: optional string`
Reference to the ParseJob ID used for parsing
- `parse_tier: optional string`
Parse tier used for parsing the document
- `extract_result: optional map[map[unknown] or array of unknown or string or 2 more] or array of map[map[unknown] or array of unknown or string or 2 more]`
Extracted data conforming to the data_schema. Returns a single object for per_doc, or an array for per_page / per_table_row.
- `union_member_0: map[map[unknown] or array of unknown or string or 2 more]`
- `union_member_0: map[unknown]`
- `union_member_1: array of unknown`
- `union_member_2: string`
- `union_member_3: number`
- `union_member_4: boolean`
- `union_member_1: array of map[map[unknown] or array of unknown or string or 2 more]`
- `union_member_0: map[unknown]`
- `union_member_1: array of unknown`
- `union_member_2: string`
- `union_member_3: number`
- `union_member_4: boolean`
- `metadata: optional object { usage }`
Job-level metadata.
- `usage: optional object { num_document_tokens, num_output_tokens, num_pages_extracted }`
Extraction usage metrics.
- `num_document_tokens: optional number`
Number of document tokens
- `num_output_tokens: optional number`
Number of output tokens
- `num_pages_extracted: optional number`
Number of pages extracted
### Example
```cli
llamacloud-prod extract create \
--api-key 'My API Key' \
--file-input dfl-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee
```
#### Response
```json
{
"id": "ext-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
"created_at": "2019-12-27T18:11:19.117Z",
"file_input": "dfl-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
"project_id": "prj-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
"status": "COMPLETED",
"updated_at": "2019-12-27T18:11:19.117Z",
"configuration": {
"data_schema": {
"foo": {
"foo": "bar"
}
},
"cite_sources": true,
"confidence_scores": true,
"extract_version": "latest",
"extraction_target": "per_doc",
"max_pages": 10,
"parse_config_id": "cfg-11111111-2222-3333-4444-555555555555",
"parse_tier": "fast",
"system_prompt": "Extract all monetary values in USD. If a currency is not specified, assume USD.",
"target_pages": "1,3,5-7",
"tier": "cost_effective"
},
"configuration_id": "cfg-11111111-2222-3333-4444-555555555555",
"error_message": "error_message",
"extract_metadata": {
"field_metadata": {
"document_metadata": {
"items": [
{
"amount": {
"citation": [
{
"matching_text": "$10.00",
"page": 1
}
],
"confidence": 1
},
"description": {
"citation": [
{
"matching_text": "$10/month",
"page": 1
}
],
"confidence": 0.998
}
}
],
"total": {
"citation": "bar",
"confidence": "bar"
},
"vendor": {
"citation": "bar",
"confidence": "bar",
"extraction_confidence": "bar",
"parsing_confidence": "bar"
}
},
"page_metadata": [
{
"foo": {
"foo": "bar"
}
}
],
"row_metadata": [
{
"foo": {
"foo": "bar"
}
}
]
},
"parse_job_id": "parse_job_id",
"parse_tier": "parse_tier"
},
"extract_result": {
"foo": {
"foo": "bar"
}
},
"metadata": {
"usage": {
"num_document_tokens": 0,
"num_output_tokens": 0,
"num_pages_extracted": 0
}
}
}
```
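Since `configuration` can be supplied inline instead of `configuration_id`, a create call with an inline schema might look like the following sketch. It assumes the CLI accepts the object param as a JSON string; the schema content is illustrative:
```cli
llamacloud-prod extract create \
  --api-key 'My API Key' \
  --file-input dfl-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee \
  --configuration '{
    "data_schema": {
      "type": "object",
      "properties": {
        "vendor": { "type": "string" },
        "total": { "type": "number" }
      }
    },
    "cite_sources": true,
    "tier": "cost_effective"
  }'
```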
## List Extract Jobs
`$ llamacloud-prod extract list`
**get** `/api/v2/extract`
List extraction jobs with optional filtering and pagination.
Filter by `configuration_id`, `status`, `file_input`,
or creation date range. Results are returned newest-first.
Use `expand=configuration` to include the full configuration used,
and `expand=extract_metadata` for per-field metadata.
### Parameters
- `--configuration-id: optional string`
Filter by configuration ID
- `--created-at-on-or-after: optional string`
Include items created at or after this timestamp (inclusive)
- `--created-at-on-or-before: optional string`
Include items created at or before this timestamp (inclusive)
- `--document-input-type: optional string`
Filter by document input type (file_id or parse_job_id)
- `--document-input-value: optional string`
Deprecated: use file_input instead
- `--expand: optional array of string`
Additional fields to include: configuration, extract_metadata
- `--file-input: optional string`
Filter by file input value
- `--job-id: optional array of string`
Filter by specific job IDs
- `--organization-id: optional string`
- `--page-size: optional number`
Number of items per page
- `--page-token: optional string`
Token for pagination
- `--project-id: optional string`
- `--status: optional "PENDING" or "THROTTLED" or "RUNNING" or 3 more`
Filter by status
### Returns
- `extract_v2_job_query_response: object { items, next_page_token, total_size }`
Paginated list of extraction jobs.
- `items: array of ExtractV2Job`
The list of items.
- `id: string`
Unique job identifier (job_id)
- `created_at: string`
Creation timestamp
- `file_input: string`
File ID or parse job ID that was extracted
- `project_id: string`
Project this job belongs to
- `status: string`
Current job status.
- `PENDING` — queued, not yet started
- `RUNNING` — actively processing
- `COMPLETED` — finished successfully
- `FAILED` — terminated with an error
- `CANCELLED` — cancelled by user
- `updated_at: string`
Last update timestamp
- `configuration: optional object { data_schema, cite_sources, confidence_scores, 8 more }`
Extract configuration combining parse and extract settings.
- `data_schema: map[map[unknown] or array of unknown or string or 2 more]`
JSON Schema defining the fields to extract. Validate with the /extract/schema/validation endpoint first.
- `union_member_0: map[unknown]`
- `union_member_1: array of unknown`
- `union_member_2: string`
- `union_member_3: number`
- `union_member_4: boolean`
- `cite_sources: optional boolean`
Include citations in results
- `confidence_scores: optional boolean`
Include confidence scores in results
- `extract_version: optional string`
Extract algorithm version. Use 'latest' for the default pipeline or a date string (e.g. '2026-01-08') to pin to a specific release.
- `extraction_target: optional "per_doc" or "per_page" or "per_table_row"`
Granularity of extraction: per_doc returns one object per document, per_page returns one object per page, per_table_row returns one object per table row
- `"per_doc"`
- `"per_page"`
- `"per_table_row"`
- `max_pages: optional number`
Maximum number of pages to process. Omit for no limit.
- `parse_config_id: optional string`
Saved parse configuration ID to control how the document is parsed before extraction
- `parse_tier: optional string`
Parse tier to use before extraction. Defaults to the extract tier if not specified.
- `system_prompt: optional string`
Custom system prompt to guide extraction behavior
- `target_pages: optional string`
Comma-separated page numbers or ranges to process (1-based). Omit to process all pages.
- `tier: optional "cost_effective" or "agentic"`
Extract tier: cost_effective (5 credits/page) or agentic (15 credits/page)
- `"cost_effective"`
- `"agentic"`
- `configuration_id: optional string`
Saved extract configuration ID used for this job, if any
- `error_message: optional string`
Error details when status is FAILED
- `extract_metadata: optional object { field_metadata, parse_job_id, parse_tier }`
Extraction metadata.
- `field_metadata: optional object { document_metadata, page_metadata, row_metadata }`
Metadata for extracted fields including document, page, and row level info.
- `document_metadata: optional map[map[unknown] or array of unknown or string or 2 more]`
Per-field metadata keyed by field name from your schema. Scalar fields (e.g. `vendor`) map to a FieldMetadataEntry with citation and confidence. Array fields (e.g. `items`) map to a list where each element contains per-sub-field FieldMetadataEntry objects, indexed by array position. Nested objects contain sub-field entries recursively.
- `union_member_0: map[unknown]`
- `union_member_1: array of unknown`
- `union_member_2: string`
- `union_member_3: number`
- `union_member_4: boolean`
- `page_metadata: optional array of map[map[unknown] or array of unknown or string or 2 more]`
Per-page metadata when extraction_target is per_page
- `union_member_0: map[unknown]`
- `union_member_1: array of unknown`
- `union_member_2: string`
- `union_member_3: number`
- `union_member_4: boolean`
- `row_metadata: optional array of map[map[unknown] or array of unknown or string or 2 more]`
Per-row metadata when extraction_target is per_table_row
- `union_member_0: map[unknown]`
- `union_member_1: array of unknown`
- `union_member_2: string`
- `union_member_3: number`
- `union_member_4: boolean`
- `parse_job_id: optional string`
Reference to the ParseJob ID used for parsing
- `parse_tier: optional string`
Parse tier used for parsing the document
- `extract_result: optional map[map[unknown] or array of unknown or string or 2 more] or array of map[map[unknown] or array of unknown or string or 2 more]`
Extracted data conforming to the data_schema. Returns a single object for per_doc, or an array for per_page / per_table_row.
- `union_member_0: map[map[unknown] or array of unknown or string or 2 more]`
- `union_member_0: map[unknown]`
- `union_member_1: array of unknown`
- `union_member_2: string`
- `union_member_3: number`
- `union_member_4: boolean`
- `union_member_1: array of map[map[unknown] or array of unknown or string or 2 more]`
- `union_member_0: map[unknown]`
- `union_member_1: array of unknown`
- `union_member_2: string`
- `union_member_3: number`
- `union_member_4: boolean`
- `metadata: optional object { usage }`
Job-level metadata.
- `usage: optional object { num_document_tokens, num_output_tokens, num_pages_extracted }`
Extraction usage metrics.
- `num_document_tokens: optional number`
Number of document tokens
- `num_output_tokens: optional number`
Number of output tokens
- `num_pages_extracted: optional number`
Number of pages extracted
- `next_page_token: optional string`
A token, which can be sent as page_token to retrieve the next page. If this field is omitted, there are no subsequent pages.
- `total_size: optional number`
The total number of items available. This is only populated when specifically requested. The value may be an estimate and should be used for display purposes only.
### Example
```cli
llamacloud-prod extract list \
--api-key 'My API Key'
```
#### Response
```json
{
"items": [
{
"id": "ext-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
"created_at": "2019-12-27T18:11:19.117Z",
"file_input": "dfl-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
"project_id": "prj-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
"status": "COMPLETED",
"updated_at": "2019-12-27T18:11:19.117Z",
"configuration": {
"data_schema": {
"foo": {
"foo": "bar"
}
},
"cite_sources": true,
"confidence_scores": true,
"extract_version": "latest",
"extraction_target": "per_doc",
"max_pages": 10,
"parse_config_id": "cfg-11111111-2222-3333-4444-555555555555",
"parse_tier": "fast",
"system_prompt": "Extract all monetary values in USD. If a currency is not specified, assume USD.",
"target_pages": "1,3,5-7",
"tier": "cost_effective"
},
"configuration_id": "cfg-11111111-2222-3333-4444-555555555555",
"error_message": "error_message",
"extract_metadata": {
"field_metadata": {
"document_metadata": {
"items": [
{
"amount": {
"citation": [
{
"matching_text": "$10.00",
"page": 1
}
],
"confidence": 1
},
"description": {
"citation": [
{
"matching_text": "$10/month",
"page": 1
}
],
"confidence": 0.998
}
}
],
"total": {
"citation": "bar",
"confidence": "bar"
},
"vendor": {
"citation": "bar",
"confidence": "bar",
"extraction_confidence": "bar",
"parsing_confidence": "bar"
}
},
"page_metadata": [
{
"foo": {
"foo": "bar"
}
}
],
"row_metadata": [
{
"foo": {
"foo": "bar"
}
}
]
},
"parse_job_id": "parse_job_id",
"parse_tier": "parse_tier"
},
"extract_result": {
"foo": {
"foo": "bar"
}
},
"metadata": {
"usage": {
"num_document_tokens": 0,
"num_output_tokens": 0,
"num_pages_extracted": 0
}
}
}
],
"next_page_token": "next_page_token",
"total_size": 0
}
```
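Combining the filters above, listing failed jobs within a date range might look like this sketch (flag names from the Parameters list; the timestamps are illustrative RFC 3339 values):
```cli
llamacloud-prod extract list \
  --api-key 'My API Key' \
  --status FAILED \
  --created-at-on-or-after 2026-01-01T00:00:00Z \
  --created-at-on-or-before 2026-01-31T23:59:59Z \
  --page-size 50
```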
## Get Extract Job
`$ llamacloud-prod extract get`
**get** `/api/v2/extract/{job_id}`
Get a single extraction job by ID.
Returns the job status and results when complete.
Use `expand=configuration` to include the full configuration used,
and `expand=extract_metadata` for per-field metadata.
### Parameters
- `--job-id: string`
- `--expand: optional array of string`
Additional fields to include: configuration, extract_metadata
- `--organization-id: optional string`
- `--project-id: optional string`
### Returns
- `extract_v2_job: object { id, created_at, file_input, 9 more }`
An extraction job.
- `id: string`
Unique job identifier (job_id)
- `created_at: string`
Creation timestamp
- `file_input: string`
File ID or parse job ID that was extracted
- `project_id: string`
Project this job belongs to
- `status: string`
Current job status.
- `PENDING` — queued, not yet started
- `RUNNING` — actively processing
- `COMPLETED` — finished successfully
- `FAILED` — terminated with an error
- `CANCELLED` — cancelled by user
- `updated_at: string`
Last update timestamp
- `configuration: optional object { data_schema, cite_sources, confidence_scores, 8 more }`
Extract configuration combining parse and extract settings.
- `data_schema: map[map[unknown] or array of unknown or string or 2 more]`
JSON Schema defining the fields to extract. Validate with the /extract/schema/validation endpoint first.
- `union_member_0: map[unknown]`
- `union_member_1: array of unknown`
- `union_member_2: string`
- `union_member_3: number`
- `union_member_4: boolean`
- `cite_sources: optional boolean`
Include citations in results
- `confidence_scores: optional boolean`
Include confidence scores in results
- `extract_version: optional string`
Extract algorithm version. Use 'latest' for the default pipeline or a date string (e.g. '2026-01-08') to pin to a specific release.
- `extraction_target: optional "per_doc" or "per_page" or "per_table_row"`
Granularity of extraction: per_doc returns one object per document, per_page returns one object per page, per_table_row returns one object per table row
- `"per_doc"`
- `"per_page"`
- `"per_table_row"`
- `max_pages: optional number`
Maximum number of pages to process. Omit for no limit.
- `parse_config_id: optional string`
Saved parse configuration ID to control how the document is parsed before extraction
- `parse_tier: optional string`
Parse tier to use before extraction. Defaults to the extract tier if not specified.
- `system_prompt: optional string`
Custom system prompt to guide extraction behavior
- `target_pages: optional string`
Comma-separated page numbers or ranges to process (1-based). Omit to process all pages.
- `tier: optional "cost_effective" or "agentic"`
Extract tier: cost_effective (5 credits/page) or agentic (15 credits/page)
- `"cost_effective"`
- `"agentic"`
- `configuration_id: optional string`
Saved extract configuration ID used for this job, if any
- `error_message: optional string`
Error details when status is FAILED
- `extract_metadata: optional object { field_metadata, parse_job_id, parse_tier }`
Extraction metadata.
- `field_metadata: optional object { document_metadata, page_metadata, row_metadata }`
Metadata for extracted fields including document, page, and row level info.
- `document_metadata: optional map[map[unknown] or array of unknown or string or 2 more]`
Per-field metadata keyed by field name from your schema. Scalar fields (e.g. `vendor`) map to a FieldMetadataEntry with citation and confidence. Array fields (e.g. `items`) map to a list where each element contains per-sub-field FieldMetadataEntry objects, indexed by array position. Nested objects contain sub-field entries recursively.
- `union_member_0: map[unknown]`
- `union_member_1: array of unknown`
- `union_member_2: string`
- `union_member_3: number`
- `union_member_4: boolean`
- `page_metadata: optional array of map[map[unknown] or array of unknown or string or 2 more]`
Per-page metadata when extraction_target is per_page
- `union_member_0: map[unknown]`
- `union_member_1: array of unknown`
- `union_member_2: string`
- `union_member_3: number`
- `union_member_4: boolean`
- `row_metadata: optional array of map[map[unknown] or array of unknown or string or 2 more]`
Per-row metadata when extraction_target is per_table_row
- `union_member_0: map[unknown]`
- `union_member_1: array of unknown`
- `union_member_2: string`
- `union_member_3: number`
- `union_member_4: boolean`
- `parse_job_id: optional string`
Reference to the ParseJob ID used for parsing
- `parse_tier: optional string`
Parse tier used for parsing the document
- `extract_result: optional map[map[unknown] or array of unknown or string or 2 more] or array of map[map[unknown] or array of unknown or string or 2 more]`
Extracted data conforming to the data_schema. Returns a single object for per_doc, or an array for per_page / per_table_row.
- `union_member_0: map[map[unknown] or array of unknown or string or 2 more]`
- `union_member_0: map[unknown]`
- `union_member_1: array of unknown`
- `union_member_2: string`
- `union_member_3: number`
- `union_member_4: boolean`
- `union_member_1: array of map[map[unknown] or array of unknown or string or 2 more]`
- `union_member_0: map[unknown]`
- `union_member_1: array of unknown`
- `union_member_2: string`
- `union_member_3: number`
- `union_member_4: boolean`
- `metadata: optional object { usage }`
Job-level metadata.
- `usage: optional object { num_document_tokens, num_output_tokens, num_pages_extracted }`
Extraction usage metrics.
- `num_document_tokens: optional number`
Number of document tokens
- `num_output_tokens: optional number`
Number of output tokens
- `num_pages_extracted: optional number`
Number of pages extracted
### Example
```cli
llamacloud-prod extract get \
--api-key 'My API Key' \
--job-id job_id
```
#### Response
```json
{
"id": "ext-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
"created_at": "2019-12-27T18:11:19.117Z",
"file_input": "dfl-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
"project_id": "prj-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
"status": "COMPLETED",
"updated_at": "2019-12-27T18:11:19.117Z",
"configuration": {
"data_schema": {
"foo": {
"foo": "bar"
}
},
"cite_sources": true,
"confidence_scores": true,
"extract_version": "latest",
"extraction_target": "per_doc",
"max_pages": 10,
"parse_config_id": "cfg-11111111-2222-3333-4444-555555555555",
"parse_tier": "fast",
"system_prompt": "Extract all monetary values in USD. If a currency is not specified, assume USD.",
"target_pages": "1,3,5-7",
"tier": "cost_effective"
},
"configuration_id": "cfg-11111111-2222-3333-4444-555555555555",
"error_message": "error_message",
"extract_metadata": {
"field_metadata": {
"document_metadata": {
"items": [
{
"amount": {
"citation": [
{
"matching_text": "$10.00",
"page": 1
}
],
"confidence": 1
},
"description": {
"citation": [
{
"matching_text": "$10/month",
"page": 1
}
],
"confidence": 0.998
}
}
],
"total": {
"citation": "bar",
"confidence": "bar"
},
"vendor": {
"citation": "bar",
"confidence": "bar",
"extraction_confidence": "bar",
"parsing_confidence": "bar"
}
},
"page_metadata": [
{
"foo": {
"foo": "bar"
}
}
],
"row_metadata": [
{
"foo": {
"foo": "bar"
}
}
]
},
"parse_job_id": "parse_job_id",
"parse_tier": "parse_tier"
},
"extract_result": {
"foo": {
"foo": "bar"
}
},
"metadata": {
"usage": {
"num_document_tokens": 0,
"num_output_tokens": 0,
"num_pages_extracted": 0
}
}
}
```
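To retrieve per-field citations and the full configuration in one call, pass `--expand` for each field. A sketch; the repeated-flag syntax for array params is an assumption:
```cli
llamacloud-prod extract get \
  --api-key 'My API Key' \
  --job-id ext-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee \
  --expand configuration \
  --expand extract_metadata
```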
## Delete Extract Job
`$ llamacloud-prod extract delete`
**delete** `/api/v2/extract/{job_id}`
Delete an extraction job and its results.
### Parameters
- `--job-id: string`
- `--organization-id: optional string`
- `--project-id: optional string`
### Returns
- `ExtractDeleteResponse: unknown`
### Example
```cli
llamacloud-prod extract delete \
--api-key 'My API Key' \
--job-id job_id
```
#### Response
```json
{}
```
## Validate Extraction Schema
`$ llamacloud-prod extract validate-schema`
**post** `/api/v2/extract/schema/validation`
Validate a JSON schema for extraction.
### Parameters
- `--data-schema: map[map[unknown] or array of unknown or string or 2 more]`
JSON Schema to validate for use with extract jobs
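A more realistic `data_schema` than the placeholder in the example below is a standard JSON Schema object. An illustrative invoice schema, using the same field names (`vendor`, `total`, `items`) that appear in the extract-job response examples:
```json
{
  "type": "object",
  "properties": {
    "vendor": { "type": "string", "description": "Vendor name as printed on the invoice" },
    "total": { "type": "number", "description": "Invoice total" },
    "items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string" },
          "amount": { "type": "number" }
        }
      }
    }
  }
}
```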
### Returns
- `extract_v2_schema_validate_response: object { data_schema }`
Response schema for schema validation.
- `data_schema: map[map[unknown] or array of unknown or string or 2 more]`
Validated JSON Schema, ready for use in extract jobs
- `union_member_0: map[unknown]`
- `union_member_1: array of unknown`
- `union_member_2: string`
- `union_member_3: number`
- `union_member_4: boolean`
### Example
```cli
llamacloud-prod extract validate-schema \
--api-key 'My API Key' \
  --data-schema '{"foo": {"foo": "bar"}}'
```
#### Response
```json
{
"data_schema": {
"foo": {
"foo": "bar"
}
}
}
```
## Generate Extraction Schema
`$ llamacloud-prod extract generate-schema`
**post** `/api/v2/extract/schema/generate`
Generate a JSON Schema from a natural-language prompt and/or a sample file, and return a configuration-create request that can be saved as an extract configuration.
### Parameters
- `--organization-id: optional string`
Query param
- `--project-id: optional string`
Query param
- `--data-schema: optional map[map[unknown] or array of unknown or string or 2 more]`
Body param: Optional schema to validate, refine, or extend
- `--file-id: optional string`
Body param: Optional file ID to analyze for schema generation
- `--name: optional string`
Body param: Name for the generated configuration (auto-generated if omitted)
- `--prompt: optional string`
Body param: Natural language description of the data structure to extract
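Generating a schema from a natural-language description might look like this sketch (the prompt and name values are illustrative):
```cli
llamacloud-prod extract generate-schema \
  --api-key 'My API Key' \
  --prompt 'Invoice with vendor name, line items (description, amount), and total' \
  --name 'invoice-extraction'
```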
### Returns
- `configuration_create: object { name, parameters }`
Request body for creating a product configuration.
- `name: string`
Human-readable name for this configuration.
- `parameters: SplitV1Parameters or ExtractV2Parameters or ClassifyV2Parameters or 3 more`
Product-specific configuration parameters.
- `split_v1_parameters: object { categories, product_type, splitting_strategy }`
Typed parameters for a *split v1* product configuration.
- `categories: array of SplitCategory`
Categories to split documents into.
- `name: string`
Name of the category.
- `description: optional string`
Optional description of what content belongs in this category.
- `product_type: "split_v1"`
Product type.
- `"split_v1"`
- `splitting_strategy: optional object { allow_uncategorized }`
Strategy for splitting documents.
- `allow_uncategorized: optional "include" or "forbid" or "omit"`
Controls handling of pages that don't match any category. 'include': pages can be grouped as 'uncategorized' and included in results. 'forbid': all pages must be assigned to a defined category. 'omit': pages can be classified as 'uncategorized' but are excluded from results.
- `"include"`
- `"forbid"`
- `"omit"`
- `extract_v2_parameters: object { data_schema, product_type, cite_sources, 9 more }`
Typed parameters for an *extract v2* product configuration.
- `data_schema: map[map[unknown] or array of unknown or string or 2 more]`
JSON Schema defining the fields to extract. Validate with the /extract/schema/validation endpoint first.
- `union_member_0: map[unknown]`
- `union_member_1: array of unknown`
- `union_member_2: string`
- `union_member_3: number`
- `union_member_4: boolean`
- `product_type: "extract_v2"`
Product type.
- `"extract_v2"`
- `cite_sources: optional boolean`
Include citations in results
- `confidence_scores: optional boolean`
Include confidence scores in results
- `extract_version: optional string`
Extract algorithm version. Use 'latest' for the default pipeline or a date string (e.g. '2026-01-08') to pin to a specific release.
- `extraction_target: optional "per_doc" or "per_page" or "per_table_row"`
Granularity of extraction: per_doc returns one object per document, per_page returns one object per page, per_table_row returns one object per table row
- `"per_doc"`
- `"per_page"`
- `"per_table_row"`
- `max_pages: optional number`
Maximum number of pages to process. Omit for no limit.
- `parse_config_id: optional string`
Saved parse configuration ID to control how the document is parsed before extraction
- `parse_tier: optional string`
Parse tier to use before extraction. Defaults to the extract tier if not specified.
- `system_prompt: optional string`
Custom system prompt to guide extraction behavior
- `target_pages: optional string`
Comma-separated page numbers or ranges to process (1-based). Omit to process all pages.
- `tier: optional "cost_effective" or "agentic"`
Extract tier: cost_effective (5 credits/page) or agentic (15 credits/page)
- `"cost_effective"`
- `"agentic"`
- `classify_v2_parameters: object { product_type, rules, mode, parsing_configuration }`
Typed parameters for a *classify v2* product configuration.
- `product_type: "classify_v2"`
Product type.
- `"classify_v2"`
- `rules: array of object { description, type }`
Classify rules to evaluate against the document (at least one required)
- `description: string`
Natural language criteria for matching this rule
- `type: string`
Document type to assign when rule matches
- `mode: optional "FAST"`
Classify execution mode
- `"FAST"`
- `parsing_configuration: optional object { lang, max_pages, target_pages }`
Parsing configuration for classify jobs.
- `lang: optional string`
ISO 639-1 language code for the document
- `max_pages: optional number`
Maximum number of pages to process. Omit for no limit.
- `target_pages: optional string`
Comma-separated page numbers or ranges to process (1-based). Omit to process all pages.
- `parse_v2_parameters: object { product_type, tier, version, 11 more }`
Configuration for LlamaParse v2 document parsing.
Includes tier selection, processing options, output formatting,
page targeting, and webhook delivery. Refer to the LlamaParse
documentation for details on each field.
- `product_type: "parse_v2"`
Product type.
- `"parse_v2"`
- `tier: "fast" or "cost_effective" or "agentic" or "agentic_plus"`
Parsing tier: 'fast' (rule-based, cheapest), 'cost_effective' (balanced), 'agentic' (AI-powered with custom prompts), or 'agentic_plus' (premium AI with highest accuracy)
- `"fast"`
- `"cost_effective"`
- `"agentic"`
- `"agentic_plus"`
- `version: "2025-12-11" or "2025-12-18" or "2025-12-31" or 40 more or string`
Tier version. Use 'latest' for the current stable version, or pin a dated release (e.g., '2026-01-08') for reproducible results
- `"2025-12-11"`
- `"2025-12-18"`
- `"2025-12-31"`
- `"2026-01-08"`
- `"2026-01-09"`
- `"2026-01-16"`
- `"2026-01-21"`
- `"2026-01-22"`
- `"2026-01-24"`
- `"2026-01-29"`
- `"2026-01-30"`
- `"2026-02-03"`
- `"2026-02-18"`
- `"2026-02-20"`
- `"2026-02-24"`
- `"2026-02-26"`
- `"2026-03-02"`
- `"2026-03-03"`
- `"2026-03-04"`
- `"2026-03-05"`
- `"2026-03-09"`
- `"2026-03-10"`
- `"2026-03-11"`
- `"2026-03-12"`
- `"2026-03-17"`
- `"2026-03-19"`
- `"2026-03-20"`
- `"2026-03-22"`
- `"2026-03-23"`
- `"2026-03-24"`
- `"2026-03-25"`
- `"2026-03-26"`
- `"2026-03-27"`
- `"2026-03-30"`
- `"2026-03-31"`
- `"2026-04-02"`
- `"2026-04-06"`
- `"2026-04-09"`
- `"2026-04-14"`
- `"2026-04-19"`
- `"2026-04-22"`
- `"2026-04-27"`
- `"latest"`
- `agentic_options: optional object { custom_prompt }`
Options for AI-powered parsing tiers (cost_effective, agentic, agentic_plus).
These options customize how the AI processes and interprets document content.
Only applicable when using non-fast tiers.
- `custom_prompt: optional string`
Custom instructions for the AI parser. Use to guide extraction behavior, specify output formatting, or provide domain-specific context. Example: 'Extract financial tables with currency symbols. Format dates as YYYY-MM-DD.'
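As a sketch, a custom prompt attaches under `agentic_options`; per this reference it only applies to non-`fast` tiers. Field names follow the reference, values are illustrative:

```python
# Sketch: custom instructions for an AI-powered parsing tier.
# Only applicable to non-"fast" tiers per this reference.
config = {
    "product_type": "parse_v2",
    "tier": "agentic",
    "version": "latest",
    "agentic_options": {
        "custom_prompt": (
            "Extract financial tables with currency symbols. "
            "Format dates as YYYY-MM-DD."
        ),
    },
}
```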
- `client_name: optional string`
Identifier for the client/application making the request. Used for analytics and debugging. Example: 'my-app-v2'
- `crop_box: optional object { bottom, left, right, top }`
Crop boundaries to process only a portion of each page. Values are ratios 0-1 from page edges
- `bottom: optional number`
Bottom boundary as ratio (0-1). 0=top edge, 1=bottom edge. Content below this line is excluded
- `left: optional number`
Left boundary as ratio (0-1). 0=left edge, 1=right edge. Content left of this line is excluded
- `right: optional number`
Right boundary as ratio (0-1). 0=left edge, 1=right edge. Content right of this line is excluded
- `top: optional number`
Top boundary as ratio (0-1). 0=top edge, 1=bottom edge. Content above this line is excluded
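Because all four boundaries are ratios measured from the page edges, a crop that strips a 10% header band and a 5% footer band looks like this (a sketch of the JSON shape, with field names from this reference):

```python
# Sketch: crop out a 10% header band and a 5% footer band on every page.
# All values are ratios 0-1 measured from the page edges, as described above.
crop_box = {
    "top": 0.10,     # exclude content above 10% from the top (header)
    "bottom": 0.95,  # exclude content below 95% from the top (footer)
    "left": 0.0,     # keep the full page width
    "right": 1.0,
}

# The retained region is the vertical span between top and bottom.
retained_height = crop_box["bottom"] - crop_box["top"]
```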
- `disable_cache: optional boolean`
Bypass result caching and force re-parsing. Use when document content may have changed or you need fresh results
- `fast_options: optional unknown`
Options for fast tier parsing (rule-based, no AI).
Fast tier uses deterministic algorithms for text extraction without AI enhancement.
It's the fastest and most cost-effective option, best suited for simple documents
with standard layouts. Currently has no configurable options but reserved for
future expansion.
- `input_options: optional object { html, pdf, presentation, spreadsheet }`
Format-specific options (HTML, PDF, spreadsheet, presentation). Applied based on detected input file type
- `html: optional object { make_all_elements_visible, remove_fixed_elements, remove_navigation_elements }`
HTML/web page parsing options (applies to .html, .htm files)
- `make_all_elements_visible: optional boolean`
Force all HTML elements to be visible by overriding CSS display/visibility properties. Useful for parsing pages with hidden content or collapsed sections
- `remove_fixed_elements: optional boolean`
Remove fixed-position elements (headers, footers, floating buttons) that appear on every page render
- `remove_navigation_elements: optional boolean`
Remove navigation elements (nav bars, sidebars, menus) to focus on main content
- `pdf: optional unknown`
PDF-specific parsing options (applies to .pdf files)
- `presentation: optional object { out_of_bounds_content, skip_embedded_data }`
Presentation parsing options (applies to .pptx, .ppt, .odp, .key files)
- `out_of_bounds_content: optional boolean`
Extract content positioned outside the visible slide area. Some presentations have hidden notes or content that extends beyond slide boundaries
- `skip_embedded_data: optional boolean`
Skip extraction of embedded chart data tables. When true, only the visual representation of charts is captured, not the underlying data
- `spreadsheet: optional object { detect_sub_tables_in_sheets, force_formula_computation_in_sheets, include_hidden_sheets }`
Spreadsheet parsing options (applies to .xlsx, .xls, .csv, .ods files)
- `detect_sub_tables_in_sheets: optional boolean`
Detect and extract multiple tables within a single sheet. Useful when spreadsheets contain several data regions separated by blank rows/columns
- `force_formula_computation_in_sheets: optional boolean`
Compute formula results instead of extracting formula text. Use when you need calculated values rather than formula definitions
- `include_hidden_sheets: optional boolean`
Parse hidden sheets in addition to visible ones. By default, hidden sheets are skipped
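The options above are keyed by detected input type. A sketch of the spreadsheet branch only (field names from this reference; the `html`, `pdf`, and `presentation` branches follow the same pattern):

```python
# Sketch: format-specific input options; only the spreadsheet branch is shown.
# Applied automatically when the input file is detected as a spreadsheet.
input_options = {
    "spreadsheet": {
        "detect_sub_tables_in_sheets": True,          # split multi-table sheets
        "force_formula_computation_in_sheets": True,  # values, not formula text
        "include_hidden_sheets": False,               # default: skip hidden sheets
    },
}
```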
- `output_options: optional object { extract_printed_page_number, images_to_save, markdown, 2 more }`
Output formatting options for markdown, text, and extracted images
- `extract_printed_page_number: optional boolean`
Extract the printed page number as it appears in the document (e.g., 'Page 5 of 10', 'v', 'A-3'). Useful for referencing original page numbers
- `images_to_save: optional array of "screenshot" or "embedded" or "layout"`
Image categories to extract and save. Options: 'screenshot' (full page renders useful for visual QA), 'embedded' (images found within the document), 'layout' (cropped regions from layout detection like figures and diagrams). Empty list saves no images
- `"screenshot"`
- `"embedded"`
- `"layout"`
- `markdown: optional object { annotate_links, inline_images, tables }`
Markdown formatting options including table styles and link annotations
- `annotate_links: optional boolean`
Add link annotations to markdown output in the format [text](url). When false, only the link text is included
- `inline_images: optional boolean`
Embed images directly in markdown as base64 data URIs instead of extracting them as separate files. Useful for self-contained markdown output
- `tables: optional object { compact_markdown_tables, markdown_table_multiline_separator, merge_continued_tables, output_tables_as_markdown }`
Table formatting options including markdown vs HTML format and merging behavior
- `compact_markdown_tables: optional boolean`
Remove extra whitespace padding in markdown table cells for more compact output
- `markdown_table_multiline_separator: optional string`
  Separator string for multiline cell content in markdown tables. Example: '\n' to preserve line breaks, ' ' to join with spaces
- `merge_continued_tables: optional boolean`
Automatically merge tables that span multiple pages into a single table. The merged table appears on the first page with merged_from_pages metadata
- `output_tables_as_markdown: optional boolean`
Output tables as markdown pipe tables instead of HTML
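Putting the output options together, here is a sketch of an `output_options` fragment for self-contained markdown with compact, merged pipe tables (field names from this reference; this is a body fragment, not a full request):

```python
# Sketch: self-contained markdown output with merged, compact pipe tables.
# Field names follow this reference; values are illustrative.
output_options = {
    "extract_printed_page_number": True,     # keep original page labels
    "images_to_save": ["screenshot", "layout"],
    "markdown": {
        "annotate_links": True,              # emit [text](url) links
        "inline_images": True,               # base64 data URIs, no separate files
        "tables": {
            "compact_markdown_tables": True,
            "markdown_table_multiline_separator": " ",  # join multiline cells
            "merge_continued_tables": True,  # fold page-spanning tables together
            "output_tables_as_markdown": True,
        },
    },
}
```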