Extract specific fields from documents by providing a JSON schema. Marker parses the document and fills in your schema with extracted values.

Before you begin, make sure you have:
If you already converted a document with save_checkpoint=True using the Convert API, pass the checkpoint_id to ExtractOptions to skip re-parsing. This saves time and cost when running extraction on a previously converted document.
The extract endpoint accepts the following parameters: file, page_schema or schema_id (one is required), schema_version, mode, max_pages, page_range, save_checkpoint, checkpoint_id, and webhook_url.
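As a rough sketch of how these parameters fit together, the helper below assembles the non-file form fields for an extract request and checks that a schema source is present. The function name `build_extract_form` is hypothetical, and sending `page_schema` as a JSON-encoded string is an assumption about the wire format, not something this page states. The `file` upload itself is left to whichever HTTP client you use.

```python
import json
from typing import Any, Optional

# Optional parameters named in the list above (file is uploaded separately).
ALLOWED_OPTIONS = {
    "schema_version", "mode", "max_pages", "page_range",
    "save_checkpoint", "checkpoint_id", "webhook_url",
}

def build_extract_form(
    page_schema: Optional[dict] = None,
    schema_id: Optional[str] = None,
    **options: Any,
) -> dict:
    """Hypothetical helper: build the form fields for an extract request."""
    # The docs say one of page_schema / schema_id is required; what happens
    # when both are sent is not specified here, so only "neither" is rejected.
    if page_schema is None and schema_id is None:
        raise ValueError("provide page_schema or schema_id")

    unknown = set(options) - ALLOWED_OPTIONS
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")

    form: dict = {}
    if page_schema is not None:
        # Assumption: the inline schema travels as a JSON string.
        form["page_schema"] = json.dumps(page_schema)
    if schema_id is not None:
        form["schema_id"] = schema_id

    # Drop unset options so they are not sent at all.
    form.update({k: v for k, v in options.items() if v is not None})
    return form
```

Passing `checkpoint_id="..."` or `schema_id="..."` through the same helper covers the checkpoint and saved-schema cases without changing the call shape.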
Instead of passing page_schema inline, you can save schemas to Datalab and reference them by ID. This avoids repeating the schema in every request and enables versioning.
Extraction scoring is in beta. We'd love your feedback: reach out at support@datalab.to. Scoring is free.
Scoring runs automatically after every extraction. When you poll request_check_url, the response initially contains just the extracted fields and citations. If you continue polling the same URL, the response will eventually include _score fields and an extraction_score_average once scoring completes. No extra parameters or endpoints are needed.

Each _score field is a {"score": int, "reasoning": str} object explaining what evidence was found or missing.
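The polling behaviour described above might look like the following minimal sketch. `poll_until_scored` is a hypothetical helper, and `fetch` stands in for whatever code performs the GET on request_check_url and returns the parsed JSON; the interval and attempt limit are arbitrary illustrative values, not documented defaults.

```python
import time
from typing import Callable

def poll_until_scored(
    fetch: Callable[[], dict],
    interval: float = 2.0,
    max_attempts: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch() until the response carries extraction_score_average.

    Returns the last response seen, scored or not, so the extracted
    fields remain usable if scoring has not finished in time.
    """
    last: dict = {}
    for attempt in range(max_attempts):
        last = fetch()
        # Scoring is complete once the top-level average appears.
        if "extraction_score_average" in last:
            return last
        if attempt < max_attempts - 1:
            sleep(interval)
    return last
```

Because the unscored response already contains the extracted fields and citations, a caller that does not need scores can stop after the first completed response instead.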
Once scoring finishes, each field also gets a _score object, and the top-level response includes an extraction_score_average:
{
  "invoice_number": "INV-2024-001",
  "invoice_number_citations": ["block_123"],
  "invoice_number_score": {
    "score": 5,
    "reasoning": "Value found verbatim in the document header with a matching citation."
  },
  "total_amount": 1500.00,
  "total_amount_citations": ["block_456"],
  "total_amount_score": {
    "score": 4,
    "reasoning": "Amount found in the totals row; minor ambiguity due to a subtotal nearby."
  }
}
The top-level response also includes extraction_score_average (4.5 in this case), averaging all field scores.

Score rubric:
| Score | Meaning |
| --- | --- |
| 5 | High confidence: clear match with strong citation support |
| 4 | Good confidence: match found with minor ambiguity |
| 3 | Moderate confidence: partial match or uncertain citation |
| 2 | Low confidence: match is inferred or weakly supported |
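One way to act on these scores is to flag fields at or below a chosen rubric level for manual review. The sketch below assumes the result is a dict shaped like the example response above; the function names and the default threshold of 3 are illustrative choices, not part of the API.

```python
from typing import Optional

SUFFIX = "_score"

def _field_scores(result: dict) -> dict:
    """Map each field name to its _score object."""
    return {
        key[: -len(SUFFIX)]: value
        for key, value in result.items()
        if key.endswith(SUFFIX) and isinstance(value, dict) and "score" in value
    }

def low_confidence_fields(result: dict, threshold: int = 3) -> dict:
    """Return {field: reasoning} for fields scored at or below threshold."""
    return {
        name: info.get("reasoning", "")
        for name, info in _field_scores(result).items()
        if info["score"] <= threshold
    }

def average_score(result: dict) -> Optional[float]:
    """Mean of all field scores, or None if scoring has not run yet."""
    scores = [info["score"] for info in _field_scores(result).values()]
    return sum(scores) / len(scores) if scores else None
```

With the example response, `average_score` gives 4.5, and `low_confidence_fields(result, threshold=4)` returns only `total_amount` with its reasoning string.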
Don’t want to write schemas by hand? Use the schema generation endpoint to automatically suggest schemas for your document. This requires a checkpoint from a previous conversion: