Generate Extraction Schema

ConfigurationCreate extract().generateSchema(, )

POST /api/v2/extract/schema/generate

Generate a JSON schema and return a product configuration request.

Parameters

ExtractGenerateSchemaParams params

Optional<String> organizationId

Optional<String> projectId

ExtractV2SchemaGenerateRequest extractV2SchemaGenerateRequest

Request schema for generating an extraction schema.

Returns

class ConfigurationCreate:

Request body for creating a product configuration.

String name

Human-readable name for this configuration.

maxLength255

minLength1

Parameters parameters

Product-specific configuration parameters.

One of the following:

class ClassifyV2Parameters:

Typed parameters for a classify v2 product configuration.

JsonValue productType

Product type.

List<Rule> rules

Classify rules to evaluate against the document (at least one required)

String description

Natural language criteria for matching this rule

maxLength2000

minLength10

String type

Document type to assign when rule matches

maxLength50

minLength1

Optional<Mode> mode

Classify execution mode

Optional<ParsingConfiguration> parsingConfiguration

Parsing configuration for classify jobs.

Optional<String> lang

ISO 639-1 language code for the document

Optional<Long> maxPages

Maximum number of pages to process. Omit for no limit.

minimum1

Optional<String> targetPages

Comma-separated page numbers or ranges to process (1-based). Omit to process all pages.

class ExtractV2Parameters:

Typed parameters for an extract v2 product configuration.

DataSchema dataSchema

JSON Schema defining the fields to extract. Validate with the /schema/validate endpoint first.

One of the following:

class UnionMember0:

List<JsonValue>

String

double

boolean

JsonValue productType

Product type.

Optional<Boolean> citeSources

Include citations in results

Optional<Boolean> confidenceScores

Include confidence scores in results

Optional<ExtractionTarget> extractionTarget

Granularity of extraction: per_doc returns one object per document, per_page returns one object per page, per_table_row returns one object per table row

One of the following:

PER_DOC("per_doc")

PER_PAGE("per_page")

PER_TABLE_ROW("per_table_row")

Optional<Long> maxPages

Maximum number of pages to process. Omit for no limit.

minimum1

Optional<String> parseConfigId

Saved parse configuration ID to control how the document is parsed before extraction

Optional<String> parseTier

Parse tier to use before extraction. Defaults to the extract tier if not specified.

Optional<String> systemPrompt

Custom system prompt to guide extraction behavior

Optional<String> targetPages

Comma-separated page numbers or ranges to process (1-based). Omit to process all pages.

Optional<Tier> tier

Extract tier: cost_effective (5 credits/page), agentic (15 credits/page), or agentic_plus (50 credits/page)

One of the following:

AGENTIC("agentic")

AGENTIC_PLUS("agentic_plus")

COST_EFFECTIVE("cost_effective")
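The per-page rates quoted for each extract tier make a rough cost estimate a single multiplication. The sketch below assumes credits are charged strictly per processed page with no other fees; `ExtractTierCost` and `estimateCredits` are hypothetical helpers, not part of the SDK.

```java
import java.util.Map;

/** Hypothetical helper: estimates extract credits from the per-page rates quoted above. */
final class ExtractTierCost {
    // Credits per page, copied from the tier description.
    private static final Map<String, Long> CREDITS_PER_PAGE = Map.of(
            "cost_effective", 5L,
            "agentic", 15L,
            "agentic_plus", 50L);

    /** Returns pages * rate. Assumes billing is strictly per processed page. */
    static long estimateCredits(String tier, long pages) {
        Long rate = CREDITS_PER_PAGE.get(tier);
        if (rate == null) {
            throw new IllegalArgumentException("Unknown tier: " + tier);
        }
        if (pages < 0) {
            throw new IllegalArgumentException("pages must be non-negative");
        }
        return Math.multiplyExact(rate, pages);
    }

    public static void main(String[] args) {
        // 40 pages on the agentic tier: 15 credits * 40 pages.
        System.out.println(estimateCredits("agentic", 40));
    }
}
```

Combining this with maxPages or targetPages gives an upper bound on the pages counted, under the same per-page assumption.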

Optional<String> version

Use ‘latest’ for the latest release for the selected tier or a date string (YYYY-MM-DD format) to pin to the nearest release at or before that date. Job responses always report the concrete resolved version the job runs, fixed at job creation; saved configurations keep the value as provided.
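The pinning rule described here (a date resolves to the nearest release at or before it) can be pictured as a floor lookup over sorted release dates. This is only an illustration of the rule, not the service's implementation; `VersionResolver` is a hypothetical name and the release dates in `main` are invented for the example.

```java
import java.time.LocalDate;
import java.util.List;
import java.util.Optional;
import java.util.TreeSet;

/** Hypothetical sketch of "latest" vs. date-pinned version resolution. */
final class VersionResolver {
    /**
     * Resolves "latest" to the newest release, or a YYYY-MM-DD string to the
     * nearest release at or before that date. Empty if no release qualifies.
     */
    static Optional<LocalDate> resolve(String requested, List<LocalDate> releases) {
        TreeSet<LocalDate> sorted = new TreeSet<>(releases);
        if (sorted.isEmpty()) {
            return Optional.empty();
        }
        if ("latest".equals(requested)) {
            return Optional.of(sorted.last());
        }
        LocalDate pin = LocalDate.parse(requested); // ISO-8601, i.e. YYYY-MM-DD
        return Optional.ofNullable(sorted.floor(pin)); // greatest release <= pin
    }

    public static void main(String[] args) {
        // Invented release dates, for illustration only.
        List<LocalDate> releases = List.of(
                LocalDate.of(2026, 1, 10),
                LocalDate.of(2026, 2, 20),
                LocalDate.of(2026, 3, 5));
        System.out.println(resolve("2026-02-25", releases));
        System.out.println(resolve("latest", releases));
    }
}
```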

class ParseV2Parameters:

Configuration for LlamaParse v2 document parsing.

Includes tier selection, processing options, output formatting, page targeting, and webhook delivery. Refer to the LlamaParse documentation for details on each field.

JsonValue productType

Product type.

Tier tier

Parsing tier: ‘fast’ (rule-based, cheapest), ‘cost_effective’ (balanced), ‘agentic’ (AI-powered with custom prompts), or ‘agentic_plus’ (premium AI with highest accuracy)

One of the following:

AGENTIC("agentic")

AGENTIC_PLUS("agentic_plus")

COST_EFFECTIVE("cost_effective")

FAST("fast")

Version version

Version for the selected tier. Use latest, or pin one of that tier’s dated versions.

Current latest by tier:

fast: 2026-06-15
cost_effective: 2026-06-26
agentic: 2026-07-15
agentic_plus: 2026-07-08

Full list: GET /api/v2/parse/versions.

One of the following:

LATEST("latest")

_2026_07_15("2026-07-15")

_2026_07_08("2026-07-08")

_2026_06_26("2026-06-26")

_2026_06_15("2026-06-15")

Optional<AgenticOptions> agenticOptions

Options for AI-powered parsing tiers (cost_effective, agentic, agentic_plus).

These options customize how the AI processes and interprets document content. Only applicable when using non-fast tiers.

Optional<String> customPrompt

Custom instructions for the AI parser. Use to guide extraction behavior, specify output formatting, or provide domain-specific context. Example: ‘Extract financial tables with currency symbols. Format dates as YYYY-MM-DD.’

Optional<String> clientName

Identifier for the client/application making the request. Used for analytics and debugging. Example: ‘my-app-v2’

Optional<CropBox> cropBox

Crop boundaries to process only a portion of each page. Values are ratios 0-1 from page edges

Optional<Double> bottom

Bottom boundary as ratio (0-1). 0=top edge, 1=bottom edge. Content below this line is excluded

maximum1

minimum0

Optional<Double> left

Left boundary as ratio (0-1). 0=left edge, 1=right edge. Content left of this line is excluded

maximum1

minimum0

Optional<Double> right

Right boundary as ratio (0-1). 0=left edge, 1=right edge. Content right of this line is excluded

maximum1

minimum0

Optional<Double> top

Top boundary as ratio (0-1). 0=top edge, 1=bottom edge. Content above this line is excluded

maximum1

minimum0
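To see how the four ratios combine, the sketch below converts a crop box into the rectangle that is kept. It reads the field descriptions literally: every value is a position measured from the top-left corner (left and right along the width, top and bottom along the height). `CropBoxMath` is a hypothetical helper, and the rejection of empty regions is an assumption rather than documented behavior.

```java
/** Hypothetical helper: turns crop ratios into the kept rectangle on a page. */
final class CropBoxMath {
    /**
     * Returns {x, y, width, height} of the kept region, with the origin at the
     * top-left corner of the page and the same units as pageWidth/pageHeight.
     */
    static double[] keptRegion(double pageWidth, double pageHeight,
                               double left, double top, double right, double bottom) {
        for (double ratio : new double[] {left, top, right, bottom}) {
            if (ratio < 0.0 || ratio > 1.0) {
                throw new IllegalArgumentException("ratios must be within 0-1");
            }
        }
        // Assumption: a box whose edges cross leaves nothing to parse.
        if (left >= right || top >= bottom) {
            throw new IllegalArgumentException("crop box is empty");
        }
        return new double[] {
            left * pageWidth,
            top * pageHeight,
            (right - left) * pageWidth,
            (bottom - top) * pageHeight
        };
    }

    public static void main(String[] args) {
        // Keep the central half of a 600 x 800 page in both directions.
        double[] r = keptRegion(600, 800, 0.25, 0.25, 0.75, 0.75);
        System.out.println(r[0] + ", " + r[1] + ", " + r[2] + ", " + r[3]);
    }
}
```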

Optional<Boolean> disableCache

Bypass result caching and force re-parsing. Use when document content may have changed or you need fresh results

Optional<JsonValue> fastOptions

Options for fast tier parsing (rule-based, no AI).

Fast tier uses deterministic algorithms for text extraction without AI enhancement. It’s the fastest and most cost-effective option, best suited for simple documents with standard layouts. Currently has no configurable options but reserved for future expansion.

Optional<InputOptions> inputOptions

Format-specific options (HTML, PDF, spreadsheet, presentation). Applied based on detected input file type

Optional<Html> html

HTML/web page parsing options (applies to .html, .htm files)

Optional<Boolean> makeAllElementsVisible

Force all HTML elements to be visible by overriding CSS display/visibility properties. Useful for parsing pages with hidden content or collapsed sections

Optional<Boolean> removeFixedElements

Remove fixed-position elements (headers, footers, floating buttons) that appear on every page render

Optional<Boolean> removeNavigationElements

Remove navigation elements (nav bars, sidebars, menus) to focus on main content

Optional<Image> image

Image parsing options (applies to .jpg, .jpeg, .png, .webp files)

Optional<Boolean> cameraPhotoCorrection

Detect documents photographed with a camera (e.g. phone scans of receipts or forms), then crop, perspective-correct, and flatten uneven lighting and shadows before parsing. Supports JPEG, PNG, WebP, and HEIC/HEIF inputs. Improves results when the document is tilted or surrounded by background. Images that already look like clean scans are left untouched

Optional<JsonValue> pdf

PDF-specific parsing options (applies to .pdf files)

Optional<Presentation> presentation

Presentation parsing options (applies to .pptx, .ppt, .odp, .key files)

Optional<Boolean> outOfBoundsContent

Extract content positioned outside the visible slide area. Some presentations have hidden notes or content that extends beyond slide boundaries

Optional<Boolean> skipEmbeddedData

Skip extraction of embedded chart data tables. When true, only the visual representation of charts is captured, not the underlying data

Optional<Spreadsheet> spreadsheet

Spreadsheet parsing options (applies to .xlsx, .xls, .csv, .ods files)

Optional<Boolean> detectSubTablesInSheets

Detect and extract multiple tables within a single sheet. Useful when spreadsheets contain several data regions separated by blank rows/columns

Optional<Boolean> forceFormulaComputationInSheets

Compute formula results instead of extracting formula text. Use when you need calculated values rather than formula definitions

Optional<Boolean> includeHiddenSheets

Parse hidden sheets in addition to visible ones. By default, hidden sheets are skipped

Optional<OutputOptions> outputOptions

Output formatting options for markdown, text, and extracted images

Optional<List<String>> additionalOutputs

Optional additional output artifacts to save alongside the primary parse output. Each value opts in to generating and persisting one extra file; the empty list (default) saves none. The three accepted values are: ‘stripped_md’ — per-page markdown stripped of formatting (links, bold/italic, images, HTML), saved as JSON for full-text-search indexing; fetch via expand=stripped_markdown_content_metadata. ‘concatenated_stripped_txt’ — all stripped pages concatenated into a single plain-text file with \n\n---\n\n between pages, useful for feeding the document into search or embedding pipelines as one blob; fetch via expand=concatenated_stripped_markdown_content_metadata. ‘word_bbox’ — raw word-level bounding boxes (one JSON object per word, with page number and x/y/w/h coordinates) saved as JSONL, useful for highlighting or grounding extracted answers back to the source document; fetch via expand=raw_words_content_metadata.
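The page separator used by ‘concatenated_stripped_txt’ is the only structure in that file, so a consumer that wants pages back has to split on it. A minimal sketch, assuming the separator is exactly the string quoted above and never occurs inside a page; `StrippedTextJoiner` is a hypothetical helper.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper mirroring the page separator of concatenated_stripped_txt. */
final class StrippedTextJoiner {
    static final String PAGE_SEPARATOR = "\n\n---\n\n";

    /** Joins per-page text into one blob, as the description of the artifact suggests. */
    static String concatenate(List<String> strippedPages) {
        return String.join(PAGE_SEPARATOR, strippedPages);
    }

    /** Splits a blob back into pages. Mis-splits if a page contains the separator itself. */
    static String[] splitPages(String blob) {
        return blob.split(Pattern.quote(PAGE_SEPARATOR), -1);
    }

    public static void main(String[] args) {
        String blob = concatenate(List.of("First page text", "Second page text"));
        System.out.println(splitPages(blob).length);
    }
}
```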

Optional<Boolean> extractPrintedPageNumber

Extract the printed page number as it appears in the document (e.g., ‘Page 5 of 10’, ‘v’, ‘A-3’). Useful for referencing original page numbers

Optional<List<GranularBbox>> granularBboxes

Bounding-box granularity levels to compute for the parse. ‘word’ computes one bounding box per detected word; ‘line’ computes one per text line; ‘cell’ computes one per table cell. Multiple levels can be requested. Empty list (default) disables granular bboxes — only item-level layout boxes are returned on the result. When set, the computed boxes are not inlined on the result items; they are written to a separate grounded_items sidecar (JSONL, one row per page) and exposed as result_content_metadata.grounded_items (a presigned download URL) on the parse result. Each row matches the GroundedJsonItem shape.

One of the following:

CELL("cell")

LINE("line")

WORD("word")

Optional<List<ImagesToSave>> imagesToSave

Image categories to extract and save. Options: ‘screenshot’ (full page renders useful for visual QA), ‘embedded’ (images found within the document), ‘layout’ (cropped regions from layout detection like figures and diagrams). Empty list saves no images

One of the following:

EMBEDDED("embedded")

LAYOUT("layout")

SCREENSHOT("screenshot")

Optional<Markdown> markdown

Markdown formatting options including table styles and link annotations

Optional<Boolean> annotateLinks

Add link annotations to markdown output in the format [text](url). When false, only the link text is included

Optional<Boolean> inlineImages

Embed images directly in markdown as base64 data URIs instead of extracting them as separate files. Useful for self-contained markdown output

Optional<Tables> tables

Table formatting options including markdown vs HTML format and merging behavior

Optional<Boolean> compactMarkdownTables

Remove extra whitespace padding in markdown table cells for more compact output

Optional<String> markdownTableMultilineSeparator

Separator string for multiline cell content in markdown tables. Example: ‘<br>’ to preserve line breaks, ’ ’ to join with spaces

Optional<Boolean> mergeContinuedTables

Automatically merge tables that span multiple pages into a single table. The merged table appears on the first page with merged_from_pages metadata

Optional<Boolean> outputTablesAsMarkdown

Output tables as markdown pipe tables instead of HTML <table> tags. Markdown tables are simpler but cannot represent complex structures like merged cells

Optional<SpatialText> spatialText

Spatial text output options for preserving document layout structure

Optional<Boolean> doNotUnrollColumns

Keep multi-column layouts intact instead of linearizing columns into sequential text. Automatically enabled for non-fast tiers

Optional<Boolean> preserveLayoutAlignmentAcrossPages

Maintain consistent text column alignment across page boundaries. Automatically enabled for document-level parsing modes

Optional<Boolean> preserveVerySmallText

Include text below the normal size threshold. Useful for footnotes, watermarks, or fine print that might otherwise be filtered out

Optional<TablesAsSpreadsheet> tablesAsSpreadsheet

Options for exporting tables as XLSX spreadsheets

Optional<Boolean> enable

Whether this option is enabled

Optional<Boolean> guessSheetName

Automatically generate descriptive sheet names from table context (headers, surrounding text) instead of using generic names like ‘Table_1’

Optional<PageRanges> pageRanges

Page selection: limit total pages or specify exact pages to process

Optional<Long> maxPages

Maximum number of pages to process. Pages are processed in order starting from page 1. If both max_pages and target_pages are set, target_pages takes precedence

minimum1

Optional<String> targetPages

Comma-separated list of specific pages to process using 1-based indexing. Supports individual pages and ranges. Examples: ‘1,3,5’ (pages 1, 3, 5), ‘1-5’ (pages 1 through 5 inclusive), ‘1,3,5-8,10’ (pages 1, 3, 5-8, and 10). Pages are sorted and deduplicated automatically. Duplicate pages cause an error
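The targetPages syntax can be checked client-side before submitting a job. The sketch below expands a specification into a sorted list of 1-based page numbers. `TargetPages` is a hypothetical helper; it silently drops duplicates, which is an assumption, since the description above mentions both deduplication and an error for duplicate pages and the service may reject them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical helper: expands "1,3,5-8,10" into sorted 1-based page numbers. */
final class TargetPages {
    static List<Integer> parse(String spec) {
        TreeSet<Integer> pages = new TreeSet<>(); // sorted, duplicates collapse
        for (String part : spec.split(",")) {
            String token = part.trim();
            if (token.isEmpty()) {
                throw new IllegalArgumentException("Empty entry in: " + spec);
            }
            int dash = token.indexOf('-');
            int start;
            int end;
            if (dash < 0) {
                start = Integer.parseInt(token);
                end = start;
            } else {
                start = Integer.parseInt(token.substring(0, dash).trim());
                end = Integer.parseInt(token.substring(dash + 1).trim());
            }
            if (start < 1 || end < start) {
                throw new IllegalArgumentException("Invalid page or range: " + token);
            }
            for (int page = start; page <= end; page++) {
                pages.add(page); // ranges are inclusive on both ends
            }
        }
        return new ArrayList<>(pages);
    }

    public static void main(String[] args) {
        System.out.println(parse("1,3,5-8,10"));
    }
}
```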

Optional<ProcessingControl> processingControl

Job execution controls including timeouts and failure thresholds

Optional<JobFailureConditions> jobFailureConditions

Quality thresholds that determine when a job should fail vs complete with partial results

Optional<Double> allowedPageFailureRatio

Maximum ratio of pages allowed to fail before the job fails (0-1). Example: 0.1 means job fails if more than 10% of pages fail. Default is 0.05 (5%)

exclusiveMinimum0

maximum1
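The threshold is a strict comparison in the example given: with a ratio of 0.1, exactly 10% failed pages still passes and anything above fails. A small sketch of that check follows; `PageFailurePolicy` is a hypothetical name, and how the service rounds or counts pages internally is not documented here.

```java
/** Hypothetical sketch of the allowedPageFailureRatio rule. */
final class PageFailurePolicy {
    /** True when the share of failed pages is strictly greater than the allowed ratio. */
    static boolean jobFails(long failedPages, long totalPages, double allowedRatio) {
        if (totalPages <= 0 || failedPages < 0 || failedPages > totalPages) {
            throw new IllegalArgumentException("invalid page counts");
        }
        // Documented bounds: exclusive minimum 0, maximum 1.
        if (!(allowedRatio > 0.0 && allowedRatio <= 1.0)) {
            throw new IllegalArgumentException("allowedRatio must be in (0, 1]");
        }
        return (double) failedPages / totalPages > allowedRatio;
    }

    public static void main(String[] args) {
        System.out.println(jobFails(10, 100, 0.1)); // exactly 10% failed
        System.out.println(jobFails(11, 100, 0.1)); // 11% failed
    }
}
```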

Optional<Boolean> failOnBuggyFont

Fail the job if a problematic font is detected that may cause incorrect text extraction. Buggy fonts can produce garbled or missing characters

Optional<Boolean> failOnImageExtractionError

Fail the entire job if any embedded image cannot be extracted. By default, image extraction errors are logged but don’t fail the job

Optional<Boolean> failOnImageOcrError

Fail the entire job if OCR fails on any image. By default, OCR errors result in empty text for that image

Optional<Boolean> failOnMarkdownReconstructionError

Fail the entire job if markdown cannot be reconstructed for any page. By default, failed pages use fallback text extraction

Optional<Timeouts> timeouts

Timeout settings for job execution. Increase for large or complex documents

Optional<Long> baseInSeconds

Base timeout for the job in seconds (max 7200 = 2 hours). This is the minimum time allowed regardless of document size

exclusiveMinimum0

maximum7200

Optional<Long> extraTimePerPageInSeconds

Additional timeout per page in seconds (max 300 = 5 minutes). Total timeout = base + (this value × page count)

exclusiveMinimum0

maximum300
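The total timeout formula above (base plus the per-page allowance times the page count) is easy to evaluate ahead of time when sizing these two values. The sketch below applies it with the documented bounds; `ParseTimeout` is a hypothetical helper, not an SDK class.

```java
import java.time.Duration;

/** Hypothetical helper: total timeout = base + (extra per page * page count). */
final class ParseTimeout {
    static Duration total(long baseInSeconds, long extraTimePerPageInSeconds, long pageCount) {
        // Bounds taken from the field constraints: exclusive minimum 0, maxima 7200 and 300.
        if (baseInSeconds <= 0 || baseInSeconds > 7200) {
            throw new IllegalArgumentException("baseInSeconds must be in (0, 7200]");
        }
        if (extraTimePerPageInSeconds <= 0 || extraTimePerPageInSeconds > 300) {
            throw new IllegalArgumentException("extraTimePerPageInSeconds must be in (0, 300]");
        }
        if (pageCount < 0) {
            throw new IllegalArgumentException("pageCount must be non-negative");
        }
        long seconds = Math.addExact(baseInSeconds,
                Math.multiplyExact(extraTimePerPageInSeconds, pageCount));
        return Duration.ofSeconds(seconds);
    }

    public static void main(String[] args) {
        // 600 s base plus 30 s for each of 20 pages.
        System.out.println(total(600, 30, 20).getSeconds());
    }
}
```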

Optional<ProcessingOptions> processingOptions

Document processing options including OCR, table extraction, and chart parsing

Optional<Boolean> aggressiveTableExtraction

Use aggressive heuristics to detect table boundaries, even without visible borders. Useful for documents with borderless or complex tables

Optional<List<AutoModeConfiguration>> autoModeConfiguration

Conditional processing rules that apply different parsing options based on page content, document structure, or filename patterns. Each entry defines trigger conditions and the parsing configuration to apply when triggered

ParsingConf parsingConf

Parsing configuration to apply when trigger conditions are met

Optional<Boolean> adaptiveLongTable

Whether to use adaptive long table handling

Optional<Boolean> aggressiveTableExtraction

Whether to use aggressive table extraction

Optional<CropBox> cropBox

Crop box options for auto mode parsing configuration.

Optional<Double> bottom

Bottom boundary of crop box as ratio (0-1)

maximum1

minimum0

Optional<Double> left

Left boundary of crop box as ratio (0-1)

maximum1

minimum0

Optional<Double> right

Right boundary of crop box as ratio (0-1)

maximum1

minimum0

Optional<Double> top

Top boundary of crop box as ratio (0-1)

maximum1

minimum0

Optional<String> customPrompt

Custom AI instructions for matched pages. Overrides the base custom_prompt

Optional<Boolean> extractLayout

Whether to extract layout information

Optional<Boolean> highResOcr

Whether to use high resolution OCR

Optional<Ignore> ignore

Ignore options for auto mode parsing configuration.

Optional<Boolean> ignoreDiagonalText

Whether to ignore diagonal text in the document

Optional<Boolean> ignoreHiddenText

Whether to ignore hidden text in the document

Optional<String> language

Primary language of the document

Optional<Boolean> outlinedTableExtraction

Whether to use outlined table extraction

Optional<Presentation> presentation

Presentation-specific options for auto mode parsing configuration.

Optional<Boolean> outOfBoundsContent

Extract out of bounds content in presentation slides

Optional<Boolean> skipEmbeddedData

Skip extraction of embedded data for charts in presentation slides

Optional<SpatialText> spatialText

Spatial text options for auto mode parsing configuration.

Optional<Boolean> doNotUnrollColumns

Keep column structure intact without unrolling

Optional<Boolean> preserveLayoutAlignmentAcrossPages

Preserve text alignment across page boundaries

Optional<Boolean> preserveVerySmallText

Include very small text in spatial output

Optional<SpecializedChartParsing> specializedChartParsing

Enable specialized chart parsing with the specified mode

One of the following:

AGENTIC("agentic")

AGENTIC_PLUS("agentic_plus")

EFFICIENT("efficient")

Optional<Tier> tier

Override the parsing tier for matched pages. Must be paired with version

One of the following:

AGENTIC("agentic")

AGENTIC_PLUS("agentic_plus")

COST_EFFECTIVE("cost_effective")

FAST("fast")

Optional<Version> version

Version for the override tier. Required when tier is set. Use latest, or pin one of that tier’s dated versions.

Current latest by tier:

fast: 2026-06-15
cost_effective: 2026-06-26
agentic: 2026-07-15
agentic_plus: 2026-07-08

Full list: GET /api/v2/parse/versions.

One of the following:

LATEST("latest")

_2026_07_15("2026-07-15")

_2026_07_08("2026-07-08")

_2026_06_26("2026-06-26")

_2026_06_15("2026-06-15")

Optional<String> filenameMatchGlob

Single glob pattern to match against filename

Optional<List<String>> filenameMatchGlobList

List of glob patterns to match against filename

Optional<String> filenameRegexp

Regex pattern to match against filename

Optional<String> filenameRegexpMode

Regex mode flags (e.g., ‘i’ for case-insensitive)
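Filename triggers can be prototyped locally before saving a rule. The sketch below uses the JDK's own glob and regex engines, which may differ in detail from the service's matching (for example, JDK glob matching follows the platform's case sensitivity). `FilenameTriggers` is a hypothetical helper, and treating the regex as a partial match with only the ‘i’ flag mapped is an assumption.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.regex.Pattern;

/** Hypothetical local approximation of the filename glob and regexp triggers. */
final class FilenameTriggers {
    /** Matches a bare filename against a glob using the JDK's glob syntax. */
    static boolean matchesGlob(String filename, String glob) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(filename));
    }

    /** Matches with java.util.regex; only the 'i' mode flag is interpreted here. */
    static boolean matchesRegexp(String filename, String regexp, String mode) {
        int flags = (mode != null && mode.contains("i")) ? Pattern.CASE_INSENSITIVE : 0;
        // Assumption: a match anywhere in the name counts, so anchor with ^ and $ if needed.
        return Pattern.compile(regexp, flags).matcher(filename).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesGlob("invoice_2024.pdf", "invoice_*.pdf"));
        System.out.println(matchesRegexp("INVOICE-001.PDF", "^invoice-\\d+\\.pdf$", "i"));
    }
}
```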

Optional<Boolean> fullPageImageInPage

Trigger if page contains a full-page image (scanned page detection)

Optional<FullPageImageInPageThreshold> fullPageImageInPageThreshold

Threshold for full page image detection (0.0-1.0, default 0.8)

One of the following:

double

String

Optional<Boolean> imageInPage

Trigger if page contains non-screenshot images

Optional<String> layoutElementInPage

Trigger if page contains this layout element type

Optional<LayoutElementInPageConfidenceThreshold> layoutElementInPageConfidenceThreshold

Confidence threshold for layout element detection

One of the following:

double

String

Optional<PageContainsAtLeastNCharts> pageContainsAtLeastNCharts

Trigger if page has more than N charts

One of the following:

long

String

Optional<PageContainsAtLeastNImages> pageContainsAtLeastNImages

Trigger if page has more than N images

One of the following:

long

String

Optional<PageContainsAtLeastNLayoutElements> pageContainsAtLeastNLayoutElements

Trigger if page has more than N layout elements

One of the following:

long

String

Optional<PageContainsAtLeastNLines> pageContainsAtLeastNLines

Trigger if page has more than N lines

One of the following:

long

String

Optional<PageContainsAtLeastNLinks> pageContainsAtLeastNLinks

Trigger if page has more than N links

One of the following:

long

String

Optional<PageContainsAtLeastNNumbers> pageContainsAtLeastNNumbers

Trigger if page has more than N numeric words

One of the following:

long

String

Optional<PageContainsAtLeastNPercentNumbers> pageContainsAtLeastNPercentNumbers

Trigger if page has more than N% numeric words

One of the following:

long

String

Optional<PageContainsAtLeastNTables> pageContainsAtLeastNTables

Trigger if page has more than N tables

One of the following:

long

String

Optional<PageContainsAtLeastNWords> pageContainsAtLeastNWords

Trigger if page has more than N words

One of the following:

long

String

Optional<PageContainsAtMostNCharts> pageContainsAtMostNCharts

Trigger if page has fewer than N charts

One of the following:

long

String

Optional<PageContainsAtMostNImages> pageContainsAtMostNImages

Trigger if page has fewer than N images

One of the following:

long

String

Optional<PageContainsAtMostNLayoutElements> pageContainsAtMostNLayoutElements

Trigger if page has fewer than N layout elements

One of the following:

long

String

Optional<PageContainsAtMostNLines> pageContainsAtMostNLines

Trigger if page has fewer than N lines

One of the following:

long

String

Optional<PageContainsAtMostNLinks> pageContainsAtMostNLinks

Trigger if page has fewer than N links

One of the following:

long

String

Optional<PageContainsAtMostNNumbers> pageContainsAtMostNNumbers

Trigger if page has fewer than N numeric words

One of the following:

long

String

Optional<PageContainsAtMostNPercentNumbers> pageContainsAtMostNPercentNumbers

Trigger if page has fewer than N% numeric words

One of the following:

long

String

Optional<PageContainsAtMostNTables> pageContainsAtMostNTables

Trigger if page has fewer than N tables

One of the following:

long

String

Optional<PageContainsAtMostNWords> pageContainsAtMostNWords

Trigger if page has fewer than N words

One of the following:

long

String

Optional<PageLongerThanNChars> pageLongerThanNChars

Trigger if page has more than N characters

One of the following:

long

String

Optional<Boolean> pageMdError

Trigger on pages with markdown extraction errors

Optional<PageShorterThanNChars> pageShorterThanNChars

Trigger if page has fewer than N characters

One of the following:

long

String

Optional<String> regexpInPage

Regex pattern to match in page content

Optional<String> regexpInPageMode

Regex mode flags for regexp_in_page

Optional<Boolean> tableInPage

Trigger if page contains a table

Optional<String> textInPage

Trigger if page text/markdown contains this string

Optional<String> triggerMode

How to combine multiple trigger conditions: ‘and’ (all conditions must match, this is the default) or ‘or’ (any single condition can trigger)
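The difference between the two trigger modes is the difference between "all" and "any" over the configured conditions. The sketch below models conditions as predicates over a page's text purely for illustration; `TriggerCombiner` is a hypothetical name, and returning false when no conditions are configured is an assumption.

```java
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch of combining trigger conditions with 'and' / 'or'. */
final class TriggerCombiner {
    static <T> boolean triggers(String triggerMode, List<Predicate<T>> conditions, T page) {
        if (conditions.isEmpty()) {
            return false; // assumption: a rule with no conditions never fires
        }
        if (triggerMode == null || "and".equals(triggerMode)) {
            // Default mode: every condition must match.
            return conditions.stream().allMatch(c -> c.test(page));
        }
        if ("or".equals(triggerMode)) {
            // Any single condition is enough.
            return conditions.stream().anyMatch(c -> c.test(page));
        }
        throw new IllegalArgumentException("Unknown trigger mode: " + triggerMode);
    }

    public static void main(String[] args) {
        Predicate<String> mentionsTotal = text -> text.contains("Total");
        Predicate<String> isLong = text -> text.length() > 100;
        List<Predicate<String>> conditions = List.of(mentionsTotal, isLong);
        System.out.println(triggers("and", conditions, "Total: 42"));
        System.out.println(triggers("or", conditions, "Total: 42"));
    }
}
```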

Optional<ConfidenceScoreEffort> confidenceScoreEffort

Confidence scoring effort. Omit for standard scoring. ‘high’: more accurate assessment of the parsing quality of every page, plus a document-level score in the result metadata; costs an additional 5 credits per page

Optional<CostOptimizer> costOptimizer

Cost optimizer configuration for reducing parsing costs on simpler pages.

When enabled, the parser analyzes each page and routes simpler pages to faster, cheaper processing while preserving quality for complex pages. Only works with ‘agentic’ or ‘agentic_plus’ tiers.

Optional<Boolean> enable

Enable cost-optimized parsing. Routes simpler pages to faster processing while complex pages use full AI analysis. May reduce speed on some documents. IMPORTANT: Only available with ‘agentic’ or ‘agentic_plus’ tiers

Optional<Boolean> disableHeuristics

Disable automatic heuristics including outlined table extraction and adaptive long table handling. Use when heuristics produce incorrect results

Optional<Forms> forms

Beta: set to ‘enrich’ to run an additional AI form-analysis pass on pages detected as forms, producing a structured tree of the form’s sections, fields, and fillable grids. Retrieve the result with expand=forms. ‘default’ (the default) applies standard parsing with no extra pass. Not available on the fast tier

One of the following:

DEFAULT("default")

ENRICH("enrich")

Optional<Ignore> ignore

Options for ignoring specific text types (diagonal, hidden, text in images)

Optional<Boolean> ignoreDiagonalText

Skip text rotated at an angle (not horizontal/vertical). Useful for ignoring watermarks or decorative angled text

Optional<Boolean> ignoreHiddenText

Skip text marked as hidden in the document structure. Some PDFs contain invisible text layers used for accessibility or search indexing

Optional<Boolean> ignoreTextInImage

Skip OCR text extraction from embedded images. Use when images contain irrelevant text (watermarks, logos) that shouldn’t be in the output

Optional<OcrParameters> ocrParameters

OCR configuration including language detection settings

Optional<List<ParsingLanguages>> languages

Languages to use for OCR text recognition. Specify multiple languages if document contains mixed-language content. Order matters - put primary language first. Example: [‘en’, ‘es’] for English with Spanish

One of the following:

ABQ("abq")

ADY("ady")

AF("af")

ANG("ang")

AR("ar")

AS("as")

AVA("ava")

AZ("az")

BE("be")

BG("bg")

BGC("bgc")

BH("bh")

BHO("bho")

BN("bn")

BS("bs")

CH_SIM("ch_sim")

CH_TRA("ch_tra")

CHE("che")

CS("cs")

CY("cy")

DA("da")

DAR("dar")

DE("de")

EN("en")

ES("es")

ET("et")

FA("fa")

FR("fr")

GA("ga")

GOM("gom")

HI("hi")

HR("hr")

HU("hu")

ID("id")

INH("inh")

IS("is")

IT("it")

JA("ja")

KBD("kbd")

KN("kn")

KO("ko")

KU("ku")

LA("la")

LBE("lbe")

LEZ("lez")

LT("lt")

LV("lv")

MAH("mah")

MAI("mai")

MI("mi")

MN("mn")

MNI("mni")

MR("mr")

MS("ms")

MT("mt")

NE("ne")

NEW("new")

NL("nl")

NO("no")

OC("oc")

PI("pi")

PL("pl")

PT("pt")

RO("ro")

RS_CYRILLIC("rs_cyrillic")

RS_LATIN("rs_latin")

RU("ru")

SA("sa")

SCK("sck")

SK("sk")

SL("sl")

SQ("sq")

SV("sv")

SW("sw")

TA("ta")

TAB("tab")

TE("te")

TH("th")

TJK("tjk")

TL("tl")

TR("tr")

UG("ug")

UK("uk")

UR("ur")

UZ("uz")

VI("vi")

Optional<SpecializedChartParsing> specializedChartParsing

Enable AI-powered chart analysis. Modes: ‘efficient’ (fast, lower cost), ‘agentic’ (balanced), ‘agentic_plus’ (highest accuracy). Automatically enables extract_layout and precise_bounding_box when set

One of the following:

AGENTIC("agentic")

AGENTIC_PLUS("agentic_plus")

EFFICIENT("efficient")

Optional<List<String>> webhookConfigurationIds

IDs of saved webhook configurations to notify for this job.

Optional<List<WebhookConfiguration>> webhookConfigurations

Webhook endpoints for job status notifications. Multiple webhooks can be configured for different events or services

Optional<List<String>> webhookEvents

Events that trigger this webhook. Options: ‘parse.success’ (job completed), ‘parse.error’ (job failed), ‘parse.partial_success’ (some pages failed), ‘parse.pending’, ‘parse.running’, ‘parse.cancelled’. If not specified, webhook fires for all events

Optional<WebhookHeaders> webhookHeaders

Custom HTTP headers to include in webhook requests. Use for authentication tokens or custom routing. Example: {‘Authorization’: ‘Bearer xyz’}

Optional<WebhookOutputFormat> webhookOutputFormat

Format of the webhook payload body. ‘string’ (default) sends the payload as a JSON-encoded string; ‘json’ sends it as a JSON object.

One of the following:

JSON("json")

STRING("string")

Optional<String> webhookSigningSecret

Shared signing secret used to sign webhook deliveries. When set, each request includes an HMAC-SHA256 signature of the request body in the ‘LC-Signature’ header (value ‘sha256=’). Recompute the HMAC over the raw request body with this secret to verify the delivery is authentic.
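On the receiving side, verification means recomputing the HMAC over the raw body and comparing it with the header in constant time. The sketch below uses only JDK classes. It assumes the digest follows the ‘sha256=’ prefix as lowercase hexadecimal, which the description does not spell out; `WebhookSignature` is a hypothetical helper.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical verifier for the LC-Signature header described above. */
final class WebhookSignature {
    private static final String PREFIX = "sha256=";

    /** Computes "sha256=" + hex(HMAC-SHA256(secret, rawBody)). Hex encoding is an assumption. */
    static String sign(byte[] rawBody, String secret) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            StringBuilder out = new StringBuilder(PREFIX);
            for (byte b : mac.doFinal(rawBody)) {
                out.append(String.format("%02x", b));
            }
            return out.toString();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HMAC-SHA256 unavailable", e);
        }
    }

    /** Compares the received header with the recomputed value in constant time. */
    static boolean verify(byte[] rawBody, String secret, String headerValue) {
        if (headerValue == null) {
            return false;
        }
        byte[] expected = sign(rawBody, secret).getBytes(StandardCharsets.UTF_8);
        byte[] actual = headerValue.getBytes(StandardCharsets.UTF_8);
        return MessageDigest.isEqual(expected, actual);
    }

    public static void main(String[] args) {
        byte[] body = "{\"event\":\"parse.success\"}".getBytes(StandardCharsets.UTF_8);
        String header = sign(body, "example-secret"); // stands in for a received header
        System.out.println(verify(body, "example-secret", header));
    }
}
```

Verify against the bytes exactly as received; re-serializing the JSON before hashing would change the digest.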

Optional<String> webhookUrl

HTTPS URL to receive webhook POST requests. Must be publicly accessible

class SplitV1Parameters:

Typed parameters for a split v1 product configuration.

List<SplitCategory> categories

Categories to split documents into.

String name

Name of the category.

maxLength200

minLength1

Optional<String> description

Optional description of what content belongs in this category.

maxLength2000

minLength1

JsonValue productType

Product type.

Optional<SplittingStrategy> splittingStrategy

Strategy for splitting documents.

Optional<AllowUncategorized> allowUncategorized

Controls handling of pages that don’t match any category. ‘include’: pages can be grouped as ‘uncategorized’ and included in results. ‘forbid’: all pages must be assigned to a defined category. ‘omit’: pages can be classified as ‘uncategorized’ but are excluded from results.

One of the following:

FORBID("forbid")

INCLUDE("include")

OMIT("omit")

class SpreadsheetV1:

Typed parameters for a spreadsheet v1 product configuration.

JsonValue productType

Product type.

Optional<String> extractionRange

A1 notation of the range to extract a single region from. If omitted, the entire sheet is used.
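For readers unfamiliar with A1 notation, the sketch below converts a range such as B2:D10 into zero-based row and column bounds. `A1Range` is a hypothetical helper that handles only plain cell and cell-range forms; whether the service accepts other variants (sheet-qualified or whole-column ranges) is not stated here.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: parses simple A1 ranges into zero-based bounds. */
final class A1Range {
    private static final Pattern A1 =
            Pattern.compile("^([A-Z]+)([1-9][0-9]*)(?::([A-Z]+)([1-9][0-9]*))?$");

    /** Column letters to a zero-based index: A -> 0, Z -> 25, AA -> 26. */
    static int columnIndex(String letters) {
        int index = 0;
        for (char c : letters.toCharArray()) {
            index = index * 26 + (c - 'A' + 1);
        }
        return index - 1;
    }

    /** Returns {firstRow, firstCol, lastRow, lastCol}, all zero-based and inclusive. */
    static int[] parse(String range) {
        Matcher m = A1.matcher(range.trim().toUpperCase(Locale.ROOT));
        if (!m.matches()) {
            throw new IllegalArgumentException("Unsupported A1 range: " + range);
        }
        int firstCol = columnIndex(m.group(1));
        int firstRow = Integer.parseInt(m.group(2)) - 1;
        // A single cell such as "C3" is treated as a one-cell range.
        int lastCol = m.group(3) == null ? firstCol : columnIndex(m.group(3));
        int lastRow = m.group(4) == null ? firstRow : Integer.parseInt(m.group(4)) - 1;
        if (lastCol < firstCol || lastRow < firstRow) {
            throw new IllegalArgumentException("Range end precedes start: " + range);
        }
        return new int[] {firstRow, firstCol, lastRow, lastCol};
    }

    public static void main(String[] args) {
        int[] b = parse("B2:D10");
        System.out.println(b[0] + "," + b[1] + "," + b[2] + "," + b[3]);
    }
}
```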

Optional<Boolean> flattenHierarchicalTables

Return a flattened dataframe when a detected table is recognized as hierarchical.

Optional<Boolean> generateAdditionalMetadata

Deprecated: controlled by tier. Whether to generate additional metadata (title, description) for each extracted region. Honored only on agentic.

Optional<Boolean> includeHiddenCells

Whether to include hidden cells when extracting regions from the spreadsheet.

Optional<List<String>> sheetNames

The names of the sheets to extract regions from. If empty, all sheets will be processed.

Optional<String> specialization

Deprecated: controlled by tier. Optional specialization mode for domain-specific extraction. Supported values: ‘financial-standard’, ‘financial-enhanced’, ‘financial-precise’. When omitted (the default), the general-purpose pipeline is used. Honored only on agentic.

Optional<TableMergeSensitivity> tableMergeSensitivity

Deprecated: controlled by tier. Influences how likely similar-looking regions are merged into a single table. Honored only on agentic.

One of the following:

STRONG("strong")

WEAK("weak")

Optional<Tier> tier

Spreadsheet extraction tier. cost_effective uses the rule-based/ML-only pipeline; agentic uses the full pipeline.

One of the following:

AGENTIC("agentic")

COST_EFFECTIVE("cost_effective")

Optional<Boolean> useExperimentalProcessing

Deprecated: controlled by tier. Enables experimental processing. Honored only on agentic.

class UntypedParameters:

Catch-all for configurations without a dedicated typed schema.

Accepts arbitrary JSON fields alongside product_type.

JsonValue productType

Product type.

Generate Extraction Schema

package ai.llamaindex.llamacloud.example;

import ai.llamaindex.llamacloud.client.LlamaCloudClient;
import ai.llamaindex.llamacloud.client.okhttp.LlamaCloudOkHttpClient;
import ai.llamaindex.llamacloud.models.configurations.ConfigurationCreate;
import ai.llamaindex.llamacloud.models.extract.ExtractGenerateSchemaParams;
import ai.llamaindex.llamacloud.models.extract.ExtractV2SchemaGenerateRequest;

public final class Main {
    private Main() {}

    public static void main(String[] args) {
        // Build the client from environment configuration.
        LlamaCloudClient client = LlamaCloudOkHttpClient.fromEnv();

        // Request body for generating an extraction schema (no fields are set in this example).
        ExtractV2SchemaGenerateRequest params = ExtractV2SchemaGenerateRequest.builder().build();
        // Calls POST /api/v2/extract/schema/generate and returns a product configuration request.
        ConfigurationCreate configurationCreate = client.extract().generateSchema(params);
    }
}

Returns Examples

{
  "name": "x",
  "parameters": {
    "product_type": "classify_v2",
    "rules": [
      {
        "description": "contains invoice number, line items, and total amount",
        "type": "invoice"
      }
    ],
    "mode": "FAST",
    "parsing_configuration": {
      "lang": "en",
      "max_pages": 10,
      "target_pages": "1,3,5-7"
    }
  }
}