# Extract
## Create Extract Job
`extract.create(**kwargs: Unpack[ExtractCreateParams]) -> ExtractV2Job`
**post** `/api/v2/extract`
Create an extraction job.
Extracts structured data from a document using either a saved
configuration or an inline JSON Schema.
### Configuration input
Provide exactly one of:
- `configuration_id` — reference a saved extraction config
- `configuration` — inline configuration with a `data_schema`
### Document input
Set `file_input` to a file ID (`dfl-...`) or a
completed parse job ID (`pjb-...`).
The job runs asynchronously. Poll `GET /extract/{job_id}` or
register a webhook to monitor completion.
### Parameters
- `file_input: str`
File ID or parse job ID to extract from
- `organization_id: Optional[str]`
- `project_id: Optional[str]`
- `configuration: Optional[ExtractConfigurationParam]`
Extract configuration combining parse and extract settings.
- `data_schema: Dict[str, Union[Dict[str, object], List[object], str, 3 more]]`
JSON Schema defining the fields to extract. Validate it with the schema validation endpoint first.
- `Dict[str, object]`
- `List[object]`
- `str`
- `float`
- `bool`
- `cite_sources: Optional[bool]`
Include citations in results
- `confidence_scores: Optional[bool]`
Include confidence scores in results
- `extract_version: Optional[str]`
Extract algorithm version. Use 'latest' for the default pipeline or a date string (e.g. '2026-01-08') to pin to a specific release.
- `extraction_target: Optional[Literal["per_doc", "per_page", "per_table_row"]]`
Granularity of extraction: per_doc returns one object per document, per_page returns one object per page, per_table_row returns one object per table row
- `"per_doc"`
- `"per_page"`
- `"per_table_row"`
- `max_pages: Optional[int]`
Maximum number of pages to process. Omit for no limit.
- `parse_config_id: Optional[str]`
Saved parse configuration ID to control how the document is parsed before extraction
- `parse_tier: Optional[str]`
Parse tier to use before extraction. Defaults to the extract tier if not specified.
- `system_prompt: Optional[str]`
Custom system prompt to guide extraction behavior
- `target_pages: Optional[str]`
Comma-separated page numbers or ranges to process (1-based). Omit to process all pages.
- `tier: Optional[Literal["cost_effective", "agentic"]]`
Extract tier: cost_effective (5 credits/page) or agentic (15 credits/page)
- `"cost_effective"`
- `"agentic"`
- `configuration_id: Optional[str]`
Saved configuration ID
- `webhook_configurations: Optional[Iterable[WebhookConfiguration]]`
Outbound webhook endpoints to notify on job status changes
- `webhook_events: Optional[List[Literal["extract.pending", "extract.success", "extract.error", 14 more]]]`
Events to subscribe to (e.g. 'parse.success', 'extract.error'). If null, all events are delivered.
- `"extract.pending"`
- `"extract.success"`
- `"extract.error"`
- `"extract.partial_success"`
- `"extract.cancelled"`
- `"parse.pending"`
- `"parse.running"`
- `"parse.success"`
- `"parse.error"`
- `"parse.partial_success"`
- `"parse.cancelled"`
- `"classify.pending"`
- `"classify.success"`
- `"classify.error"`
- `"classify.partial_success"`
- `"classify.cancelled"`
- `"unmapped_event"`
- `webhook_headers: Optional[Dict[str, str]]`
Custom HTTP headers sent with each webhook request (e.g. auth tokens)
- `webhook_output_format: Optional[str]`
Response format sent to the webhook: 'string' (default) or 'json'
- `webhook_url: Optional[str]`
URL to receive webhook POST notifications
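The `target_pages` format ("1,3,5-7") can be illustrated with a small client-side expander. This is purely illustrative of the format — the API parses the value server-side:

```python
def expand_target_pages(spec: str) -> list[int]:
    """Expand a comma-separated, 1-based page spec like '1,3,5-7'
    into an explicit list of page numbers."""
    pages = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = part.split("-", 1)
            pages.extend(range(int(start), int(end) + 1))  # ranges are inclusive
        else:
            pages.append(int(part))
    return pages

# "1,3,5-7" selects pages 1, 3, 5, 6 and 7
```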
### Returns
- `class ExtractV2Job: …`
An extraction job.
- `id: str`
Unique job identifier (job_id)
- `created_at: datetime`
Creation timestamp
- `file_input: str`
File ID or parse job ID that was extracted
- `project_id: str`
Project this job belongs to
- `status: str`
Current job status.
- `PENDING` — queued, not yet started
- `THROTTLED` — queued but rate-limited
- `RUNNING` — actively processing
- `COMPLETED` — finished successfully
- `FAILED` — terminated with an error
- `CANCELLED` — cancelled by user
- `updated_at: datetime`
Last update timestamp
- `configuration: Optional[ExtractConfiguration]`
Extract configuration combining parse and extract settings.
- `data_schema: Dict[str, Union[Dict[str, object], List[object], str, 3 more]]`
JSON Schema defining the fields to extract. Validate it with the schema validation endpoint first.
- `Dict[str, object]`
- `List[object]`
- `str`
- `float`
- `bool`
- `cite_sources: Optional[bool]`
Include citations in results
- `confidence_scores: Optional[bool]`
Include confidence scores in results
- `extract_version: Optional[str]`
Extract algorithm version. Use 'latest' for the default pipeline or a date string (e.g. '2026-01-08') to pin to a specific release.
- `extraction_target: Optional[Literal["per_doc", "per_page", "per_table_row"]]`
Granularity of extraction: per_doc returns one object per document, per_page returns one object per page, per_table_row returns one object per table row
- `"per_doc"`
- `"per_page"`
- `"per_table_row"`
- `max_pages: Optional[int]`
Maximum number of pages to process. Omit for no limit.
- `parse_config_id: Optional[str]`
Saved parse configuration ID to control how the document is parsed before extraction
- `parse_tier: Optional[str]`
Parse tier to use before extraction. Defaults to the extract tier if not specified.
- `system_prompt: Optional[str]`
Custom system prompt to guide extraction behavior
- `target_pages: Optional[str]`
Comma-separated page numbers or ranges to process (1-based). Omit to process all pages.
- `tier: Optional[Literal["cost_effective", "agentic"]]`
Extract tier: cost_effective (5 credits/page) or agentic (15 credits/page)
- `"cost_effective"`
- `"agentic"`
- `configuration_id: Optional[str]`
Saved extract configuration ID used for this job, if any
- `error_message: Optional[str]`
Error details when status is FAILED
- `extract_metadata: Optional[ExtractJobMetadata]`
Extraction metadata.
- `field_metadata: Optional[ExtractedFieldMetadata]`
Metadata for extracted fields including document, page, and row level info.
- `document_metadata: Optional[Dict[str, Union[Dict[str, object], List[object], str, 3 more]]]`
Per-field metadata keyed by field name from your schema. Scalar fields (e.g. `vendor`) map to a FieldMetadataEntry with citation and confidence. Array fields (e.g. `items`) map to a list where each element contains per-sub-field FieldMetadataEntry objects, indexed by array position. Nested objects contain sub-field entries recursively.
- `Dict[str, object]`
- `List[object]`
- `str`
- `float`
- `bool`
- `page_metadata: Optional[List[Dict[str, Union[Dict[str, object], List[object], str, 3 more]]]]`
Per-page metadata when extraction_target is per_page
- `Dict[str, object]`
- `List[object]`
- `str`
- `float`
- `bool`
- `row_metadata: Optional[List[Dict[str, Union[Dict[str, object], List[object], str, 3 more]]]]`
Per-row metadata when extraction_target is per_table_row
- `Dict[str, object]`
- `List[object]`
- `str`
- `float`
- `bool`
- `parse_job_id: Optional[str]`
Reference to the ParseJob ID used for parsing
- `parse_tier: Optional[str]`
Parse tier used for parsing the document
- `extract_result: Optional[Union[Dict[str, Union[Dict[str, object], List[object], str, 3 more]], List[Dict[str, Union[Dict[str, object], List[object], str, 3 more]]], null]]`
Extracted data conforming to the data_schema. Returns a single object for per_doc, or an array for per_page / per_table_row.
- `Dict[str, Union[Dict[str, object], List[object], str, 3 more]]`
- `Dict[str, object]`
- `List[object]`
- `str`
- `float`
- `bool`
- `List[Dict[str, Union[Dict[str, object], List[object], str, 3 more]]]`
- `Dict[str, object]`
- `List[object]`
- `str`
- `float`
- `bool`
- `metadata: Optional[Metadata]`
Job-level metadata.
- `usage: Optional[ExtractJobUsage]`
Extraction usage metrics.
- `num_document_tokens: Optional[int]`
Number of document tokens
- `num_output_tokens: Optional[int]`
Number of output tokens
- `num_pages_extracted: Optional[int]`
Number of pages extracted
### Example
```python
import os
from llama_cloud import LlamaCloud
client = LlamaCloud(
api_key=os.environ.get("LLAMA_CLOUD_API_KEY"), # This is the default and can be omitted
)
extract_v2_job = client.extract.create(
file_input="dfl-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
)
print(extract_v2_job.id)
```
#### Response
```json
{
"id": "ext-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
"created_at": "2019-12-27T18:11:19.117Z",
"file_input": "dfl-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
"project_id": "prj-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
"status": "COMPLETED",
"updated_at": "2019-12-27T18:11:19.117Z",
"configuration": {
"data_schema": {
"foo": {
"foo": "bar"
}
},
"cite_sources": true,
"confidence_scores": true,
"extract_version": "latest",
"extraction_target": "per_doc",
"max_pages": 10,
"parse_config_id": "cfg-11111111-2222-3333-4444-555555555555",
"parse_tier": "fast",
"system_prompt": "Extract all monetary values in USD. If a currency is not specified, assume USD.",
"target_pages": "1,3,5-7",
"tier": "cost_effective"
},
"configuration_id": "cfg-11111111-2222-3333-4444-555555555555",
"error_message": "error_message",
"extract_metadata": {
"field_metadata": {
"document_metadata": {
"items": [
{
"amount": {
"citation": [
{
"matching_text": "$10.00",
"page": 1
}
],
"confidence": 1
},
"description": {
"citation": [
{
"matching_text": "$10/month",
"page": 1
}
],
"confidence": 0.998
}
}
],
"total": {
"citation": "bar",
"confidence": "bar"
},
"vendor": {
"citation": "bar",
"confidence": "bar",
"extraction_confidence": "bar",
"parsing_confidence": "bar"
}
},
"page_metadata": [
{
"foo": {
"foo": "bar"
}
}
],
"row_metadata": [
{
"foo": {
"foo": "bar"
}
}
]
},
"parse_job_id": "parse_job_id",
"parse_tier": "parse_tier"
},
"extract_result": {
"foo": {
"foo": "bar"
}
},
"metadata": {
"usage": {
"num_document_tokens": 0,
"num_output_tokens": 0,
"num_pages_extracted": 0
}
}
}
```
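The generated example above passes only `file_input`; a request with an inline configuration might look like the following. The schema fields (`vendor`, `total`) and option values are illustrative:

```python
# Illustrative request body for extract.create with an inline configuration.
# Field names inside data_schema are examples, not required keys.
create_params = {
    "file_input": "dfl-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
    "configuration": {
        "data_schema": {
            "type": "object",
            "properties": {
                "vendor": {"type": "string"},
                "total": {"type": "number"},
            },
        },
        "cite_sources": True,
        "confidence_scores": True,
        "extraction_target": "per_doc",
        "tier": "cost_effective",
    },
}

# job = client.extract.create(**create_params)  # then poll until COMPLETED
```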
## List Extract Jobs
`extract.list(**kwargs: Unpack[ExtractListParams]) -> SyncPaginatedCursor[ExtractV2Job]`
**get** `/api/v2/extract`
List extraction jobs with optional filtering and pagination.
Filter by `configuration_id`, `status`, `file_input`,
or creation date range. Results are returned newest-first.
Use `expand=configuration` to include the full configuration used,
and `expand=extract_metadata` for per-field metadata.
### Parameters
- `configuration_id: Optional[str]`
Filter by configuration ID
- `created_at_on_or_after: Optional[Union[str, datetime, null]]`
Include items created at or after this timestamp (inclusive)
- `created_at_on_or_before: Optional[Union[str, datetime, null]]`
Include items created at or before this timestamp (inclusive)
- `document_input_type: Optional[str]`
Filter by document input type (file_id or parse_job_id)
- `document_input_value: Optional[str]`
Deprecated: use file_input instead
- `expand: Optional[SequenceNotStr[str]]`
Additional fields to include: configuration, extract_metadata
- `file_input: Optional[str]`
Filter by file input value
- `job_ids: Optional[SequenceNotStr[str]]`
Filter by specific job IDs
- `organization_id: Optional[str]`
- `page_size: Optional[int]`
Number of items per page
- `page_token: Optional[str]`
Token for pagination
- `project_id: Optional[str]`
- `status: Optional[Literal["PENDING", "THROTTLED", "RUNNING", 3 more]]`
Filter by status
- `"PENDING"`
- `"THROTTLED"`
- `"RUNNING"`
- `"COMPLETED"`
- `"FAILED"`
- `"CANCELLED"`
### Returns
- `class ExtractV2Job: …`
An extraction job.
- `id: str`
Unique job identifier (job_id)
- `created_at: datetime`
Creation timestamp
- `file_input: str`
File ID or parse job ID that was extracted
- `project_id: str`
Project this job belongs to
- `status: str`
Current job status.
- `PENDING` — queued, not yet started
- `THROTTLED` — queued but rate-limited
- `RUNNING` — actively processing
- `COMPLETED` — finished successfully
- `FAILED` — terminated with an error
- `CANCELLED` — cancelled by user
- `updated_at: datetime`
Last update timestamp
- `configuration: Optional[ExtractConfiguration]`
Extract configuration combining parse and extract settings.
- `data_schema: Dict[str, Union[Dict[str, object], List[object], str, 3 more]]`
JSON Schema defining the fields to extract. Validate it with the schema validation endpoint first.
- `Dict[str, object]`
- `List[object]`
- `str`
- `float`
- `bool`
- `cite_sources: Optional[bool]`
Include citations in results
- `confidence_scores: Optional[bool]`
Include confidence scores in results
- `extract_version: Optional[str]`
Extract algorithm version. Use 'latest' for the default pipeline or a date string (e.g. '2026-01-08') to pin to a specific release.
- `extraction_target: Optional[Literal["per_doc", "per_page", "per_table_row"]]`
Granularity of extraction: per_doc returns one object per document, per_page returns one object per page, per_table_row returns one object per table row
- `"per_doc"`
- `"per_page"`
- `"per_table_row"`
- `max_pages: Optional[int]`
Maximum number of pages to process. Omit for no limit.
- `parse_config_id: Optional[str]`
Saved parse configuration ID to control how the document is parsed before extraction
- `parse_tier: Optional[str]`
Parse tier to use before extraction. Defaults to the extract tier if not specified.
- `system_prompt: Optional[str]`
Custom system prompt to guide extraction behavior
- `target_pages: Optional[str]`
Comma-separated page numbers or ranges to process (1-based). Omit to process all pages.
- `tier: Optional[Literal["cost_effective", "agentic"]]`
Extract tier: cost_effective (5 credits/page) or agentic (15 credits/page)
- `"cost_effective"`
- `"agentic"`
- `configuration_id: Optional[str]`
Saved extract configuration ID used for this job, if any
- `error_message: Optional[str]`
Error details when status is FAILED
- `extract_metadata: Optional[ExtractJobMetadata]`
Extraction metadata.
- `field_metadata: Optional[ExtractedFieldMetadata]`
Metadata for extracted fields including document, page, and row level info.
- `document_metadata: Optional[Dict[str, Union[Dict[str, object], List[object], str, 3 more]]]`
Per-field metadata keyed by field name from your schema. Scalar fields (e.g. `vendor`) map to a FieldMetadataEntry with citation and confidence. Array fields (e.g. `items`) map to a list where each element contains per-sub-field FieldMetadataEntry objects, indexed by array position. Nested objects contain sub-field entries recursively.
- `Dict[str, object]`
- `List[object]`
- `str`
- `float`
- `bool`
- `page_metadata: Optional[List[Dict[str, Union[Dict[str, object], List[object], str, 3 more]]]]`
Per-page metadata when extraction_target is per_page
- `Dict[str, object]`
- `List[object]`
- `str`
- `float`
- `bool`
- `row_metadata: Optional[List[Dict[str, Union[Dict[str, object], List[object], str, 3 more]]]]`
Per-row metadata when extraction_target is per_table_row
- `Dict[str, object]`
- `List[object]`
- `str`
- `float`
- `bool`
- `parse_job_id: Optional[str]`
Reference to the ParseJob ID used for parsing
- `parse_tier: Optional[str]`
Parse tier used for parsing the document
- `extract_result: Optional[Union[Dict[str, Union[Dict[str, object], List[object], str, 3 more]], List[Dict[str, Union[Dict[str, object], List[object], str, 3 more]]], null]]`
Extracted data conforming to the data_schema. Returns a single object for per_doc, or an array for per_page / per_table_row.
- `Dict[str, Union[Dict[str, object], List[object], str, 3 more]]`
- `Dict[str, object]`
- `List[object]`
- `str`
- `float`
- `bool`
- `List[Dict[str, Union[Dict[str, object], List[object], str, 3 more]]]`
- `Dict[str, object]`
- `List[object]`
- `str`
- `float`
- `bool`
- `metadata: Optional[Metadata]`
Job-level metadata.
- `usage: Optional[ExtractJobUsage]`
Extraction usage metrics.
- `num_document_tokens: Optional[int]`
Number of document tokens
- `num_output_tokens: Optional[int]`
Number of output tokens
- `num_pages_extracted: Optional[int]`
Number of pages extracted
### Example
```python
import os
from llama_cloud import LlamaCloud
client = LlamaCloud(
api_key=os.environ.get("LLAMA_CLOUD_API_KEY"), # This is the default and can be omitted
)
page = client.extract.list()
job = page.items[0]
print(job.id)
```
#### Response
```json
{
"items": [
{
"id": "ext-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
"created_at": "2019-12-27T18:11:19.117Z",
"file_input": "dfl-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
"project_id": "prj-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
"status": "COMPLETED",
"updated_at": "2019-12-27T18:11:19.117Z",
"configuration": {
"data_schema": {
"foo": {
"foo": "bar"
}
},
"cite_sources": true,
"confidence_scores": true,
"extract_version": "latest",
"extraction_target": "per_doc",
"max_pages": 10,
"parse_config_id": "cfg-11111111-2222-3333-4444-555555555555",
"parse_tier": "fast",
"system_prompt": "Extract all monetary values in USD. If a currency is not specified, assume USD.",
"target_pages": "1,3,5-7",
"tier": "cost_effective"
},
"configuration_id": "cfg-11111111-2222-3333-4444-555555555555",
"error_message": "error_message",
"extract_metadata": {
"field_metadata": {
"document_metadata": {
"items": [
{
"amount": {
"citation": [
{
"matching_text": "$10.00",
"page": 1
}
],
"confidence": 1
},
"description": {
"citation": [
{
"matching_text": "$10/month",
"page": 1
}
],
"confidence": 0.998
}
}
],
"total": {
"citation": "bar",
"confidence": "bar"
},
"vendor": {
"citation": "bar",
"confidence": "bar",
"extraction_confidence": "bar",
"parsing_confidence": "bar"
}
},
"page_metadata": [
{
"foo": {
"foo": "bar"
}
}
],
"row_metadata": [
{
"foo": {
"foo": "bar"
}
}
]
},
"parse_job_id": "parse_job_id",
"parse_tier": "parse_tier"
},
"extract_result": {
"foo": {
"foo": "bar"
}
},
"metadata": {
"usage": {
"num_document_tokens": 0,
"num_output_tokens": 0,
"num_pages_extracted": 0
}
}
}
],
"next_page_token": "next_page_token",
"total_size": 0
}
```
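Draining every page can be sketched against the raw response shape shown above (`items` plus `next_page_token`). The `fetch_page` callable is a hypothetical wrapper, not an SDK method:

```python
def collect_all_jobs(fetch_page):
    """Drain a cursor-paginated endpoint.

    fetch_page(page_token) must return a dict shaped like the response
    above: {"items": [...], "next_page_token": <str or None>}.
    """
    jobs, token = [], None
    while True:
        page = fetch_page(token)
        jobs.extend(page["items"])
        token = page.get("next_page_token")
        if not token:
            return jobs
```

With the SDK, `fetch_page` could wrap `client.extract.list(page_token=token)`; depending on the generated cursor, iterating the page object directly may also be supported.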
## Get Extract Job
`extract.get(job_id: str, **kwargs: Unpack[ExtractGetParams]) -> ExtractV2Job`
**get** `/api/v2/extract/{job_id}`
Get a single extraction job by ID.
Returns the job status and results when complete.
Use `expand=configuration` to include the full configuration used,
and `expand=extract_metadata` for per-field metadata.
### Parameters
- `job_id: str`
- `expand: Optional[SequenceNotStr[str]]`
Additional fields to include: configuration, extract_metadata
- `organization_id: Optional[str]`
- `project_id: Optional[str]`
### Returns
- `class ExtractV2Job: …`
An extraction job.
- `id: str`
Unique job identifier (job_id)
- `created_at: datetime`
Creation timestamp
- `file_input: str`
File ID or parse job ID that was extracted
- `project_id: str`
Project this job belongs to
- `status: str`
Current job status.
- `PENDING` — queued, not yet started
- `THROTTLED` — queued but rate-limited
- `RUNNING` — actively processing
- `COMPLETED` — finished successfully
- `FAILED` — terminated with an error
- `CANCELLED` — cancelled by user
- `updated_at: datetime`
Last update timestamp
- `configuration: Optional[ExtractConfiguration]`
Extract configuration combining parse and extract settings.
- `data_schema: Dict[str, Union[Dict[str, object], List[object], str, 3 more]]`
JSON Schema defining the fields to extract. Validate it with the schema validation endpoint first.
- `Dict[str, object]`
- `List[object]`
- `str`
- `float`
- `bool`
- `cite_sources: Optional[bool]`
Include citations in results
- `confidence_scores: Optional[bool]`
Include confidence scores in results
- `extract_version: Optional[str]`
Extract algorithm version. Use 'latest' for the default pipeline or a date string (e.g. '2026-01-08') to pin to a specific release.
- `extraction_target: Optional[Literal["per_doc", "per_page", "per_table_row"]]`
Granularity of extraction: per_doc returns one object per document, per_page returns one object per page, per_table_row returns one object per table row
- `"per_doc"`
- `"per_page"`
- `"per_table_row"`
- `max_pages: Optional[int]`
Maximum number of pages to process. Omit for no limit.
- `parse_config_id: Optional[str]`
Saved parse configuration ID to control how the document is parsed before extraction
- `parse_tier: Optional[str]`
Parse tier to use before extraction. Defaults to the extract tier if not specified.
- `system_prompt: Optional[str]`
Custom system prompt to guide extraction behavior
- `target_pages: Optional[str]`
Comma-separated page numbers or ranges to process (1-based). Omit to process all pages.
- `tier: Optional[Literal["cost_effective", "agentic"]]`
Extract tier: cost_effective (5 credits/page) or agentic (15 credits/page)
- `"cost_effective"`
- `"agentic"`
- `configuration_id: Optional[str]`
Saved extract configuration ID used for this job, if any
- `error_message: Optional[str]`
Error details when status is FAILED
- `extract_metadata: Optional[ExtractJobMetadata]`
Extraction metadata.
- `field_metadata: Optional[ExtractedFieldMetadata]`
Metadata for extracted fields including document, page, and row level info.
- `document_metadata: Optional[Dict[str, Union[Dict[str, object], List[object], str, 3 more]]]`
Per-field metadata keyed by field name from your schema. Scalar fields (e.g. `vendor`) map to a FieldMetadataEntry with citation and confidence. Array fields (e.g. `items`) map to a list where each element contains per-sub-field FieldMetadataEntry objects, indexed by array position. Nested objects contain sub-field entries recursively.
- `Dict[str, object]`
- `List[object]`
- `str`
- `float`
- `bool`
- `page_metadata: Optional[List[Dict[str, Union[Dict[str, object], List[object], str, 3 more]]]]`
Per-page metadata when extraction_target is per_page
- `Dict[str, object]`
- `List[object]`
- `str`
- `float`
- `bool`
- `row_metadata: Optional[List[Dict[str, Union[Dict[str, object], List[object], str, 3 more]]]]`
Per-row metadata when extraction_target is per_table_row
- `Dict[str, object]`
- `List[object]`
- `str`
- `float`
- `bool`
- `parse_job_id: Optional[str]`
Reference to the ParseJob ID used for parsing
- `parse_tier: Optional[str]`
Parse tier used for parsing the document
- `extract_result: Optional[Union[Dict[str, Union[Dict[str, object], List[object], str, 3 more]], List[Dict[str, Union[Dict[str, object], List[object], str, 3 more]]], null]]`
Extracted data conforming to the data_schema. Returns a single object for per_doc, or an array for per_page / per_table_row.
- `Dict[str, Union[Dict[str, object], List[object], str, 3 more]]`
- `Dict[str, object]`
- `List[object]`
- `str`
- `float`
- `bool`
- `List[Dict[str, Union[Dict[str, object], List[object], str, 3 more]]]`
- `Dict[str, object]`
- `List[object]`
- `str`
- `float`
- `bool`
- `metadata: Optional[Metadata]`
Job-level metadata.
- `usage: Optional[ExtractJobUsage]`
Extraction usage metrics.
- `num_document_tokens: Optional[int]`
Number of document tokens
- `num_output_tokens: Optional[int]`
Number of output tokens
- `num_pages_extracted: Optional[int]`
Number of pages extracted
### Example
```python
import os
from llama_cloud import LlamaCloud
client = LlamaCloud(
api_key=os.environ.get("LLAMA_CLOUD_API_KEY"), # This is the default and can be omitted
)
extract_v2_job = client.extract.get(
job_id="job_id",
)
print(extract_v2_job.id)
```
#### Response
```json
{
"id": "ext-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
"created_at": "2019-12-27T18:11:19.117Z",
"file_input": "dfl-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
"project_id": "prj-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
"status": "COMPLETED",
"updated_at": "2019-12-27T18:11:19.117Z",
"configuration": {
"data_schema": {
"foo": {
"foo": "bar"
}
},
"cite_sources": true,
"confidence_scores": true,
"extract_version": "latest",
"extraction_target": "per_doc",
"max_pages": 10,
"parse_config_id": "cfg-11111111-2222-3333-4444-555555555555",
"parse_tier": "fast",
"system_prompt": "Extract all monetary values in USD. If a currency is not specified, assume USD.",
"target_pages": "1,3,5-7",
"tier": "cost_effective"
},
"configuration_id": "cfg-11111111-2222-3333-4444-555555555555",
"error_message": "error_message",
"extract_metadata": {
"field_metadata": {
"document_metadata": {
"items": [
{
"amount": {
"citation": [
{
"matching_text": "$10.00",
"page": 1
}
],
"confidence": 1
},
"description": {
"citation": [
{
"matching_text": "$10/month",
"page": 1
}
],
"confidence": 0.998
}
}
],
"total": {
"citation": "bar",
"confidence": "bar"
},
"vendor": {
"citation": "bar",
"confidence": "bar",
"extraction_confidence": "bar",
"parsing_confidence": "bar"
}
},
"page_metadata": [
{
"foo": {
"foo": "bar"
}
}
],
"row_metadata": [
{
"foo": {
"foo": "bar"
}
}
]
},
"parse_job_id": "parse_job_id",
"parse_tier": "parse_tier"
},
"extract_result": {
"foo": {
"foo": "bar"
}
},
"metadata": {
"usage": {
"num_document_tokens": 0,
"num_output_tokens": 0,
"num_pages_extracted": 0
}
}
}
```
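Per-field confidences under `extract_metadata.field_metadata.document_metadata` (as in the response above) nest recursively: scalars carry a `confidence` directly, array fields hold one entry per element. A sketch of flattening them, assuming leaf entries are dicts with a numeric `confidence` key:

```python
def iter_confidences(node, path=""):
    """Yield (field_path, confidence) pairs from a document_metadata tree.
    Assumes leaf entries are dicts carrying a numeric 'confidence'."""
    if isinstance(node, dict):
        conf = node.get("confidence")
        if isinstance(conf, (int, float)):
            yield path, conf
        for key, child in node.items():
            if key != "confidence":
                yield from iter_confidences(child, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for i, child in enumerate(node):
            yield from iter_confidences(child, f"{path}[{i}]")

# Shape mirrors the document_metadata example above (values illustrative).
meta = {
    "vendor": {"confidence": 0.99},
    "items": [{"amount": {"confidence": 1.0}}],
}
low_confidence = [p for p, c in iter_confidences(meta) if c < 0.995]
```

This is useful for routing low-confidence fields to human review.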
## Delete Extract Job
`extract.delete(job_id: str, **kwargs: Unpack[ExtractDeleteParams]) -> object`
**delete** `/api/v2/extract/{job_id}`
Delete an extraction job and its results.
### Parameters
- `job_id: str`
- `organization_id: Optional[str]`
- `project_id: Optional[str]`
### Returns
- `object`
### Example
```python
import os
from llama_cloud import LlamaCloud
client = LlamaCloud(
api_key=os.environ.get("LLAMA_CLOUD_API_KEY"), # This is the default and can be omitted
)
extract = client.extract.delete(
job_id="job_id",
)
print(extract)
```
#### Response
```json
{}
```
## Validate Extraction Schema
`extract.validate_schema(**kwargs: Unpack[ExtractValidateSchemaParams]) -> ExtractV2SchemaValidateResponse`
**post** `/api/v2/extract/schema/validation`
Validate a JSON schema for extraction.
### Parameters
- `data_schema: Dict[str, Union[Dict[str, object], Iterable[object], str, 3 more]]`
JSON Schema to validate for use with extract jobs
- `Dict[str, object]`
- `Iterable[object]`
- `str`
- `float`
- `bool`
### Returns
- `class ExtractV2SchemaValidateResponse: …`
Response schema for schema validation.
- `data_schema: Dict[str, Union[Dict[str, object], List[object], str, 3 more]]`
Validated JSON Schema, ready for use in extract jobs
- `Dict[str, object]`
- `List[object]`
- `str`
- `float`
- `bool`
### Example
```python
import os
from llama_cloud import LlamaCloud
client = LlamaCloud(
api_key=os.environ.get("LLAMA_CLOUD_API_KEY"), # This is the default and can be omitted
)
extract_v2_schema_validate_response = client.extract.validate_schema(
data_schema={
"foo": {
"foo": "bar"
}
},
)
print(extract_v2_schema_validate_response.data_schema)
```
#### Response
```json
{
"data_schema": {
"foo": {
"foo": "bar"
}
}
}
```
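A more realistic request than the placeholder schema above: validating an invoice schema before using it in a create call. The field names are illustrative:

```python
# Illustrative invoice schema; field names are examples, not required keys.
invoice_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "invoice_date": {"type": "string"},
        "items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "amount": {"type": "number"},
                },
            },
        },
        "total": {"type": "number"},
    },
}

# validated = client.extract.validate_schema(data_schema=invoice_schema)
# client.extract.create(
#     file_input="dfl-...",
#     configuration={"data_schema": validated.data_schema},
# )
```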
## Generate Extraction Schema
`extract.generate_schema(**kwargs: Unpack[ExtractGenerateSchemaParams]) -> ConfigurationCreate`
**post** `/api/v2/extract/schema/generate`
Generate an extraction JSON Schema from a natural-language prompt and/or an example file, returned wrapped in a product configuration create request.
### Parameters
- `organization_id: Optional[str]`
- `project_id: Optional[str]`
- `data_schema: Optional[Dict[str, Union[Dict[str, object], Iterable[object], str, 3 more]]]`
Optional schema to validate, refine, or extend
- `Dict[str, object]`
- `Iterable[object]`
- `str`
- `float`
- `bool`
- `file_id: Optional[str]`
Optional file ID to analyze for schema generation
- `name: Optional[str]`
Name for the generated configuration (auto-generated if omitted)
- `prompt: Optional[str]`
Natural language description of the data structure to extract
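This endpoint has no generated example in this section; a sketch of a prompt-driven request follows. All values are illustrative:

```python
# Illustrative generate_schema request: describe the target structure in
# natural language and optionally point at an example document.
generate_params = {
    "prompt": (
        "Invoices: vendor name, invoice date, line items with "
        "description and amount, grand total."
    ),
    "file_id": "dfl-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",  # optional example file
    "name": "invoice-extractor",
}

# config_request = client.extract.generate_schema(**generate_params)
```

For an extract_v2 configuration, the generated schema lives in `parameters.data_schema` of the returned `ConfigurationCreate`, where it can be reviewed before saving.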
### Returns
- `class ConfigurationCreate: …`
Request body for creating a product configuration.
- `name: str`
Human-readable name for this configuration.
- `parameters: Parameters`
Product-specific configuration parameters.
- `class SplitV1Parameters: …`
Typed parameters for a *split v1* product configuration.
- `categories: List[SplitCategory]`
Categories to split documents into.
- `name: str`
Name of the category.
- `description: Optional[str]`
Optional description of what content belongs in this category.
- `product_type: Literal["split_v1"]`
Product type.
- `"split_v1"`
- `splitting_strategy: Optional[SplittingStrategy]`
Strategy for splitting documents.
- `allow_uncategorized: Optional[Literal["include", "forbid", "omit"]]`
Controls handling of pages that don't match any category. 'include': pages can be grouped as 'uncategorized' and included in results. 'forbid': all pages must be assigned to a defined category. 'omit': pages can be classified as 'uncategorized' but are excluded from results.
- `"include"`
- `"forbid"`
- `"omit"`
- `class ExtractV2Parameters: …`
Typed parameters for an *extract v2* product configuration.
- `data_schema: Dict[str, Union[Dict[str, object], List[object], str, 3 more]]`
JSON Schema defining the fields to extract. Validate it with the schema validation endpoint first.
- `Dict[str, object]`
- `List[object]`
- `str`
- `float`
- `bool`
- `product_type: Literal["extract_v2"]`
Product type.
- `"extract_v2"`
- `cite_sources: Optional[bool]`
Include citations in results
- `confidence_scores: Optional[bool]`
Include confidence scores in results
- `extract_version: Optional[str]`
Extract algorithm version. Use 'latest' for the default pipeline or a date string (e.g. '2026-01-08') to pin to a specific release.
- `extraction_target: Optional[Literal["per_doc", "per_page", "per_table_row"]]`
Granularity of extraction: per_doc returns one object per document, per_page returns one object per page, per_table_row returns one object per table row
- `"per_doc"`
- `"per_page"`
- `"per_table_row"`
- `max_pages: Optional[int]`
Maximum number of pages to process. Omit for no limit.
- `parse_config_id: Optional[str]`
Saved parse configuration ID to control how the document is parsed before extraction
- `parse_tier: Optional[str]`
Parse tier to use before extraction. Defaults to the extract tier if not specified.
- `system_prompt: Optional[str]`
Custom system prompt to guide extraction behavior
- `target_pages: Optional[str]`
Comma-separated page numbers or ranges to process (1-based). Omit to process all pages.
- `tier: Optional[Literal["cost_effective", "agentic"]]`
Extract tier: cost_effective (5 credits/page) or agentic (15 credits/page)
- `"cost_effective"`
- `"agentic"`
- `class ClassifyV2Parameters: …`
Typed parameters for a *classify v2* product configuration.
- `product_type: Literal["classify_v2"]`
Product type.
- `"classify_v2"`
- `rules: List[Rule]`
Classify rules to evaluate against the document (at least one required)
- `description: str`
Natural language criteria for matching this rule
- `type: str`
Document type to assign when rule matches
- `mode: Optional[Literal["FAST"]]`
Classify execution mode
- `"FAST"`
- `parsing_configuration: Optional[ParsingConfiguration]`
Parsing configuration for classify jobs.
- `lang: Optional[str]`
ISO 639-1 language code for the document
- `max_pages: Optional[int]`
Maximum number of pages to process. Omit for no limit.
- `target_pages: Optional[str]`
Comma-separated page numbers or ranges to process (1-based). Omit to process all pages.
- `class ParseV2Parameters: …`
Configuration for LlamaParse v2 document parsing.
Includes tier selection, processing options, output formatting,
page targeting, and webhook delivery. Refer to the LlamaParse
documentation for details on each field.
- `product_type: Literal["parse_v2"]`
Product type.
- `"parse_v2"`
- `tier: Literal["fast", "cost_effective", "agentic", "agentic_plus"]`
Parsing tier: 'fast' (rule-based, cheapest), 'cost_effective' (balanced), 'agentic' (AI-powered with custom prompts), or 'agentic_plus' (premium AI with highest accuracy)
- `"fast"`
- `"cost_effective"`
- `"agentic"`
- `"agentic_plus"`
- `version: Union[Literal["2025-12-11", "2025-12-18", "2025-12-31", 40 more], str]`
Tier version. Use 'latest' for the current stable version, or pin a dated release (e.g. '2026-01-08') for reproducible results
- `Literal["2025-12-11", "2025-12-18", "2025-12-31", 40 more]`
Tier version. Use 'latest' for the current stable version, or pin a dated release (e.g. '2026-01-08') for reproducible results
- `"2025-12-11"`
- `"2025-12-18"`
- `"2025-12-31"`
- `"2026-01-08"`
- `"2026-01-09"`
- `"2026-01-16"`
- `"2026-01-21"`
- `"2026-01-22"`
- `"2026-01-24"`
- `"2026-01-29"`
- `"2026-01-30"`
- `"2026-02-03"`
- `"2026-02-18"`
- `"2026-02-20"`
- `"2026-02-24"`
- `"2026-02-26"`
- `"2026-03-02"`
- `"2026-03-03"`
- `"2026-03-04"`
- `"2026-03-05"`
- `"2026-03-09"`
- `"2026-03-10"`
- `"2026-03-11"`
- `"2026-03-12"`
- `"2026-03-17"`
- `"2026-03-19"`
- `"2026-03-20"`
- `"2026-03-22"`
- `"2026-03-23"`
- `"2026-03-24"`
- `"2026-03-25"`
- `"2026-03-26"`
- `"2026-03-27"`
- `"2026-03-30"`
- `"2026-03-31"`
- `"2026-04-02"`
- `"2026-04-06"`
- `"2026-04-09"`
- `"2026-04-14"`
- `"2026-04-19"`
- `"2026-04-22"`
- `"2026-04-27"`
- `"latest"`
- `str`
- `agentic_options: Optional[AgenticOptions]`
Options for AI-powered parsing tiers (cost_effective, agentic, agentic_plus).
These options customize how the AI processes and interprets document content.
Only applicable when using non-fast tiers.
- `custom_prompt: Optional[str]`
Custom instructions for the AI parser. Use to guide extraction behavior, specify output formatting, or provide domain-specific context. Example: 'Extract financial tables with currency symbols. Format dates as YYYY-MM-DD.'
- `client_name: Optional[str]`
Identifier for the client/application making the request. Used for analytics and debugging. Example: 'my-app-v2'
- `crop_box: Optional[CropBox]`
Crop boundaries to process only a portion of each page. Values are ratios 0-1 from page edges
- `bottom: Optional[float]`
Bottom boundary as ratio (0-1). 0=top edge, 1=bottom edge. Content below this line is excluded
- `left: Optional[float]`
Left boundary as ratio (0-1). 0=left edge, 1=right edge. Content left of this line is excluded
- `right: Optional[float]`
Right boundary as ratio (0-1). 0=left edge, 1=right edge. Content right of this line is excluded
- `top: Optional[float]`
Top boundary as ratio (0-1). 0=top edge, 1=bottom edge. Content above this line is excluded
- `disable_cache: Optional[bool]`
Bypass result caching and force re-parsing. Use when document content may have changed or you need fresh results
- `fast_options: Optional[object]`
  Options for fast tier parsing (rule-based, no AI).
  The fast tier uses deterministic algorithms for text extraction without AI enhancement.
  It is the quickest and cheapest option, best suited for simple documents with
  standard layouts. It currently has no configurable options; this field is reserved
  for future expansion.
- `input_options: Optional[InputOptions]`
Format-specific options (HTML, PDF, spreadsheet, presentation). Applied based on detected input file type
- `html: Optional[InputOptionsHTML]`
HTML/web page parsing options (applies to .html, .htm files)
- `make_all_elements_visible: Optional[bool]`
Force all HTML elements to be visible by overriding CSS display/visibility properties. Useful for parsing pages with hidden content or collapsed sections
- `remove_fixed_elements: Optional[bool]`
Remove fixed-position elements (headers, footers, floating buttons) that appear on every page render
- `remove_navigation_elements: Optional[bool]`
Remove navigation elements (nav bars, sidebars, menus) to focus on main content
- `pdf: Optional[object]`
PDF-specific parsing options (applies to .pdf files)
- `presentation: Optional[InputOptionsPresentation]`
Presentation parsing options (applies to .pptx, .ppt, .odp, .key files)
- `out_of_bounds_content: Optional[bool]`
Extract content positioned outside the visible slide area. Some presentations have hidden notes or content that extends beyond slide boundaries
- `skip_embedded_data: Optional[bool]`
Skip extraction of embedded chart data tables. When true, only the visual representation of charts is captured, not the underlying data
- `spreadsheet: Optional[InputOptionsSpreadsheet]`
Spreadsheet parsing options (applies to .xlsx, .xls, .csv, .ods files)
- `detect_sub_tables_in_sheets: Optional[bool]`
Detect and extract multiple tables within a single sheet. Useful when spreadsheets contain several data regions separated by blank rows/columns
- `force_formula_computation_in_sheets: Optional[bool]`
Compute formula results instead of extracting formula text. Use when you need calculated values rather than formula definitions
- `include_hidden_sheets: Optional[bool]`
Parse hidden sheets in addition to visible ones. By default, hidden sheets are skipped
- `output_options: Optional[OutputOptions]`
Output formatting options for markdown, text, and extracted images
- `extract_printed_page_number: Optional[bool]`
Extract the printed page number as it appears in the document (e.g., 'Page 5 of 10', 'v', 'A-3'). Useful for referencing original page numbers
- `images_to_save: Optional[List[Literal["screenshot", "embedded", "layout"]]]`
Image categories to extract and save. Options: 'screenshot' (full page renders useful for visual QA), 'embedded' (images found within the document), 'layout' (cropped regions from layout detection like figures and diagrams). Empty list saves no images
- `"screenshot"`
- `"embedded"`
- `"layout"`
- `markdown: Optional[OutputOptionsMarkdown]`
Markdown formatting options including table styles and link annotations
- `annotate_links: Optional[bool]`
Add link annotations to markdown output in the format [text](url). When false, only the link text is included
- `inline_images: Optional[bool]`
Embed images directly in markdown as base64 data URIs instead of extracting them as separate files. Useful for self-contained markdown output
- `tables: Optional[OutputOptionsMarkdownTables]`
Table formatting options including markdown vs HTML format and merging behavior
- `compact_markdown_tables: Optional[bool]`
Remove extra whitespace padding in markdown table cells for more compact output
- `markdown_table_multiline_separator: Optional[str]`
  Separator string for multiline cell content in markdown tables. Example: '\n' to preserve line breaks, ' ' to join with spaces
- `merge_continued_tables: Optional[bool]`
Automatically merge tables that span multiple pages into a single table. The merged table appears on the first page with merged_from_pages metadata
- `output_tables_as_markdown: Optional[bool]`
Output tables as markdown pipe tables instead of HTML
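As an illustration, the parameters documented above can be assembled into plain dictionaries before sending a classify request. This is a minimal sketch: the exact request call is omitted, and how the `ParseV2Parameters` block attaches to `parsing_configuration` in the wire format may differ from the nesting shown here; the key names mirror the fields in this reference.

```python
# Sketch of classify parameters mirroring the fields documented above.
# Nesting of the parse_v2 block under parsing_configuration is an assumption.

parse_v2 = {
    "product_type": "parse_v2",
    "tier": "cost_effective",       # one of: fast, cost_effective, agentic, agentic_plus
    "version": "latest",            # or a dated release such as "2026-01-08"
    "output_options": {
        "images_to_save": ["screenshot", "layout"],
        "markdown": {
            "annotate_links": True,
            "tables": {
                "merge_continued_tables": True,
                "output_tables_as_markdown": True,
            },
        },
    },
}

classify_params = {
    "rules": [
        # At least one rule is required; each pairs a target type with
        # natural-language matching criteria.
        {"type": "invoice", "description": "Line items, totals, billing address"},
        {"type": "contract", "description": "Signature blocks and legal clauses"},
    ],
    "mode": "FAST",                 # currently the only documented mode literal
    "parsing_configuration": {
        "lang": "en",
        "max_pages": 20,
        "parameters": parse_v2,
    },
}

assert len(classify_params["rules"]) >= 1
```

Keeping the parse block as a separate dictionary makes it easy to reuse the same parsing configuration across classify jobs while varying only the rules.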