Structured Extraction - Datalab Documentation

Extract specific fields from a document by providing a JSON schema. Marker parses the document and fills in your schema with the extracted values.

Before you begin, make sure you have:

A Datalab account with an API key (new accounts include $5 in free credits)
Python 3.10+ installed
The Datalab SDK: pip install datalab-python-sdk
Your DATALAB_API_KEY environment variable set

Building for production? Use Pipelines to chain processors, version your configuration, and deploy with a single API call.

Quick Start

import json
from datalab_sdk import DatalabClient, ExtractOptions

client = DatalabClient()

schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string", "description": "Invoice ID or number"},
        "total_amount": {"type": "number", "description": "Total amount due"},
        "vendor_name": {"type": "string", "description": "Company or vendor name"}
    },
    "required": ["invoice_number", "total_amount"]
}

options = ExtractOptions(
    page_schema=json.dumps(schema),
    mode="balanced"
)

result = client.extract("invoice.pdf", options=options)
extracted = json.loads(result.extraction_schema_json)
print(f"Invoice: {extracted['invoice_number']}")
print(f"Total: ${extracted['total_amount']}")

Schema Format

Use JSON Schema format to define what you want to extract:

{
  "type": "object",
  "properties": {
    "field_name": {
      "type": "string",
      "description": "Describe what this field contains"
    },
    "numeric_field": {
      "type": "number",
      "description": "A numeric value"
    },
    "list_field": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "nested_field": {"type": "string"}
        }
      }
    }
  },
  "required": ["field_name"]
}

Tips for Better Extraction

Use descriptive field names - invoice_number is clearer than id
Add descriptions - The description field helps the model understand context
Specify types correctly - Use number for numeric values, string for text
Use arrays for repeating data - Line items, table rows, etc.

Common schema pitfalls:

Using vague field names like data or info — be specific (e.g., invoice_number, total_amount)
Forgetting description fields — these help the model understand what to extract
Setting type: "string" for numeric values — use type: "number" for amounts, quantities, etc.
Deeply nested schemas — keep schemas as flat as possible for better extraction accuracy
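
The pitfalls above can be checked mechanically before a schema is sent. The following is a minimal sketch and not part of the Datalab SDK: lint_schema, VAGUE_NAMES, and the max_depth threshold of 2 are hypothetical names and values chosen for illustration.

```python
# Hypothetical helper: flags the schema pitfalls listed above.
VAGUE_NAMES = {"data", "info", "id"}  # names the tips call out as too vague


def lint_schema(schema, path="", depth=1, max_depth=2):
    """Return a list of warning strings for a JSON Schema object."""
    warnings = []
    for name, spec in schema.get("properties", {}).items():
        field = f"{path}{name}"
        if name.lower() in VAGUE_NAMES:
            warnings.append(f"{field}: vague field name")
        if "description" not in spec:
            warnings.append(f"{field}: missing description")
        # An array of objects nests through "items"; a plain object nests directly.
        child = spec.get("items", spec) if spec.get("type") == "array" else spec
        if child.get("type") == "object":
            if depth >= max_depth:
                warnings.append(f"{field}: nested deeper than {max_depth} levels")
            warnings.extend(lint_schema(child, f"{field}.", depth + 1, max_depth))
    return warnings


schema = {
    "type": "object",
    "properties": {
        "data": {"type": "string"},
        "total_amount": {"type": "number", "description": "Total amount due"},
    },
}
for warning in lint_schema(schema):
    print(warning)
```

The depth limit is a judgment call; raise max_depth if your documents genuinely need nested line items.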

Response

The extracted data is returned in extraction_schema_json as a JSON-encoded string, so decode it before reading fields:

{
  "status": "complete",
  "success": true,
  "json": {...},
  "extraction_schema_json": "{\"invoice_number\": \"INV-2024-001\", \"total_amount\": 1500.00, ...}",
  "page_count": 2
}
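
Because extraction_schema_json is a string that itself contains JSON, a raw API response needs a second decoding step. The sketch below assumes the response has already been parsed into a dict shaped like the example above. parse_extraction is a hypothetical helper, and treating a falsy success value as an error is an assumption rather than documented behavior.

```python
import json


def parse_extraction(response: dict) -> dict:
    """Decode the JSON string stored under extraction_schema_json."""
    # Assumption: a falsy "success" value means the result should not be used.
    if not response.get("success"):
        raise ValueError(f"extraction not successful: status={response.get('status')!r}")
    raw = response.get("extraction_schema_json")
    if raw is None:
        raise KeyError("extraction_schema_json missing from response")
    return json.loads(raw)


sample = {
    "status": "complete",
    "success": True,
    "extraction_schema_json": '{"invoice_number": "INV-2024-001", "total_amount": 1500.00}',
    "page_count": 2,
}
fields = parse_extraction(sample)
print(fields["invoice_number"])  # INV-2024-001
```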

Citation Tracking

Each extracted field includes citations to the source blocks:

{
  "invoice_number": "INV-2024-001",
  "invoice_number_citations": ["block_123", "block_124"],
  "total_amount": 1500.00,
  "total_amount_citations": ["block_456"]
}

Use these block IDs with the json output to trace extracted values back to the source document.
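
Since citations arrive as sibling keys with a _citations suffix, it can help to separate them from the values before further processing. This is a minimal sketch: split_citations is a hypothetical helper, and it only treats a key as a citation list when the matching base field is also present. The structure of the json output is not described here, so looking up block IDs in it is left out.

```python
SUFFIX = "_citations"


def split_citations(extracted: dict):
    """Split decoded extraction output into (values, citations_by_field)."""
    values, citations = {}, {}
    for key, val in extracted.items():
        base = key[: -len(SUFFIX)]
        # Only treat the key as citations if its base field exists too.
        if key.endswith(SUFFIX) and base in extracted:
            citations[base] = val
        else:
            values[key] = val
    return values, citations


extracted = {
    "invoice_number": "INV-2024-001",
    "invoice_number_citations": ["block_123", "block_124"],
    "total_amount": 1500.00,
    "total_amount_citations": ["block_456"],
}
values, citations = split_citations(extracted)
print(citations["invoice_number"])  # ['block_123', 'block_124']
```

Any _score keys that appear once confidence scoring finishes are left in values by this helper.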

Schema Examples

Financial Document

schema = {
    "type": "object",
    "properties": {
        "company_name": {"type": "string", "description": "Company name"},
        "fiscal_year": {"type": "string", "description": "Fiscal year"},
        "total_revenue": {"type": "number", "description": "Total revenue in dollars"},
        "net_income": {"type": "number", "description": "Net income in dollars"},
        "eps": {"type": "number", "description": "Earnings per share"}
    }
}

Scientific Paper

schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string", "description": "Paper title"},
        "authors": {
            "type": "array",
            "items": {"type": "string"},
            "description": "List of author names"
        },
        "abstract": {"type": "string", "description": "Paper abstract"},
        "keywords": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Keywords or tags"
        }
    }
}

Contract

schema = {
    "type": "object",
    "properties": {
        "parties": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "role": {"type": "string"}
                }
            }
        },
        "effective_date": {"type": "string", "description": "Contract start date"},
        "termination_date": {"type": "string", "description": "Contract end date"},
        "total_value": {"type": "number", "description": "Total contract value"}
    }
}

Using Checkpoints

If you already converted a document with save_checkpoint=True using the Convert API, pass the checkpoint_id to ExtractOptions to skip re-parsing. This saves time and cost when running extraction on a previously converted document.

from datalab_sdk import DatalabClient, ConvertOptions, ExtractOptions
import json

client = DatalabClient()

# Step 1: Convert and save checkpoint
convert_result = client.convert("invoice.pdf", options=ConvertOptions(save_checkpoint=True))
checkpoint_id = convert_result.checkpoint_id

# Step 2: Extract using checkpoint (no re-parsing needed)
schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string", "description": "Invoice ID"},
        "total_amount": {"type": "number", "description": "Total due"}
    }
}

options = ExtractOptions(
    page_schema=json.dumps(schema),
    checkpoint_id=checkpoint_id
)
result = client.extract("invoice.pdf", options=options)
extracted = json.loads(result.extraction_schema_json)

The extract endpoint accepts the following parameters: file, page_schema or schema_id (one is required), schema_version, mode, max_pages, page_range, save_checkpoint, checkpoint_id, and webhook_url.

Using Saved Schemas

Instead of passing page_schema inline, you can save schemas to Datalab and reference them by ID. This avoids repeating the schema in every request and enables versioning.

curl -X POST https://www.datalab.to/api/v1/extract \
  -H "X-API-Key: $DATALAB_API_KEY" \
  -F "file=@invoice.pdf" \
  -F "schema_id=sch_k8Hx9mP2nQ4v"

Pass schema_version to pin to a specific schema version; omit it to always use the latest. See Saved Schemas for full CRUD API reference.
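
The rule that either page_schema or schema_id must be supplied can be enforced before a request is made. The sketch below builds only the form fields. build_extract_fields is a hypothetical helper, and what the API does when both page_schema and schema_id are sent is not specified here, so the helper does not forbid it.

```python
import json


def build_extract_fields(page_schema=None, schema_id=None, schema_version=None, mode=None):
    """Build the non-file form fields for an extract request."""
    if page_schema is None and schema_id is None:
        raise ValueError("provide page_schema or schema_id")
    fields = {}
    if page_schema is not None:
        # Accept a dict or an already-serialized JSON string.
        fields["page_schema"] = (
            page_schema if isinstance(page_schema, str) else json.dumps(page_schema)
        )
    if schema_id is not None:
        fields["schema_id"] = schema_id
    if schema_version is not None:
        # Omitting schema_version means the latest version is used.
        fields["schema_version"] = str(schema_version)
    if mode is not None:
        fields["mode"] = mode
    return fields


print(build_extract_fields(schema_id="sch_k8Hx9mP2nQ4v"))
# {'schema_id': 'sch_k8Hx9mP2nQ4v'}
```

The resulting dict can be passed as the form data alongside the file upload, mirroring the -F arguments in the curl example.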

Confidence Scoring

Extraction scoring is in beta. We’d love your feedback: reach out at support@datalab.to. Scoring is free.

Scoring runs automatically after every extraction. When you poll request_check_url, the response initially contains just the extracted fields and citations. If you continue polling the same URL, the response will eventually include _score fields and an extraction_score_average once scoring completes. No extra parameters or endpoints are needed. Each _score field is a {"score": int, "reasoning": str} object explaining what evidence was found or missing.
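
The continued polling described above can be sketched as a bounded loop. poll_until_scored is a hypothetical helper, and fetch stands for any zero-argument callable that returns the parsed response from request_check_url. The interval and attempt limit are illustrative values, not documented ones.

```python
import time


def poll_until_scored(fetch, interval=2.0, max_attempts=30, sleep=time.sleep):
    """Call fetch() until the response has extraction_score_average.

    Returns (last_response, scored). The last response is returned even
    when scoring never appears, since the extracted fields are still usable.
    """
    last = None
    for attempt in range(max_attempts):
        last = fetch()
        if "extraction_score_average" in last:
            return last, True
        if attempt < max_attempts - 1:
            sleep(interval)
    return last, False


# Simulated responses standing in for real HTTP calls.
responses = iter([
    {"status": "complete"},
    {"status": "complete", "extraction_score_average": 4.5},
])
result, scored = poll_until_scored(lambda: next(responses), sleep=lambda s: None)
print(scored, result["extraction_score_average"])  # True 4.5
```

Passing fetch and sleep as arguments keeps the loop testable without network access.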

Score response format

Before scoring completes, extraction_schema_json contains only the fields and their citations:

{
  "invoice_number": "INV-2024-001",
  "invoice_number_citations": ["block_123"],
  "total_amount": 1500.00,
  "total_amount_citations": ["block_456"]
}

Once scoring finishes, each field also gets a _score object, and the top-level response includes an extraction_score_average:

{
  "invoice_number": "INV-2024-001",
  "invoice_number_citations": ["block_123"],
  "invoice_number_score": {
    "score": 5,
    "reasoning": "Value found verbatim in the document header with a matching citation."
  },
  "total_amount": 1500.00,
  "total_amount_citations": ["block_456"],
  "total_amount_score": {
    "score": 4,
    "reasoning": "Amount found in the totals row; minor ambiguity due to a subtotal nearby."
  }
}

In this example, extraction_score_average is 4.5, the mean of the two field scores (5 and 4).

Score rubric:

Score	Meaning
5	High confidence — clear match with strong citation support
4	Good confidence — match found with minor ambiguity
3	Moderate confidence — partial match or uncertain citation
2	Low confidence — match is inferred or weakly supported
1	Very low confidence — no clear evidence found
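
One way to act on the rubric is to route low-scoring fields to manual review. The sketch below works on the decoded extraction_schema_json. collect_scores and needs_review are hypothetical helpers, and the default threshold of 3 is an assumption to tune for your own tolerance.

```python
def collect_scores(extracted: dict) -> dict:
    """Map each field name to its integer score, if scoring has finished."""
    scores = {}
    for key, val in extracted.items():
        if key.endswith("_score") and isinstance(val, dict) and "score" in val:
            scores[key[: -len("_score")]] = val["score"]
    return scores


def needs_review(extracted: dict, threshold: int = 3) -> list:
    """Return the fields whose score is at or below the threshold."""
    return sorted(f for f, s in collect_scores(extracted).items() if s <= threshold)


extracted = {
    "invoice_number": "INV-2024-001",
    "invoice_number_score": {"score": 5, "reasoning": "Value found verbatim."},
    "total_amount": 1500.00,
    "total_amount_score": {"score": 4, "reasoning": "Subtotal nearby."},
}
scores = collect_scores(extracted)
print(sum(scores.values()) / len(scores))   # 4.5
print(needs_review(extracted, threshold=4))  # ['total_amount']
```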

See Confidence Scoring for a full walkthrough with code examples.

Auto-Generate Schemas

Don’t want to write schemas by hand? Use the schema generation endpoint to automatically suggest schemas for your document. This requires a checkpoint from a previous conversion:

import os, requests, json, time

headers = {"X-API-Key": os.getenv("DATALAB_API_KEY")}

# Step 1: Convert with checkpoint
with open("invoice.pdf", "rb") as f:
    resp = requests.post(
        "https://www.datalab.to/api/v1/convert",
        files={"file": ("invoice.pdf", f, "application/pdf")},
        data={"save_checkpoint": "true", "output_format": "markdown"},
        headers=headers
    )
check_url = resp.json()["request_check_url"]

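# Poll until complete, giving up after about 5 minutes
for _ in range(150):
    result = requests.get(check_url, headers=headers).json()
    if result["status"] == "complete":
        checkpoint_id = result["checkpoint_id"]
        break
    time.sleep(2)
else:
    raise TimeoutError("Conversion did not complete in time")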

# Step 2: Generate schemas
resp = requests.post(
    "https://www.datalab.to/api/v1/marker/extraction/gen_schemas",
    json={"checkpoint_id": checkpoint_id},
    headers=headers
)
gen_check_url = resp.json()["request_check_url"]

# Poll until complete, giving up after about 5 minutes
for _ in range(150):
    result = requests.get(gen_check_url, headers=headers).json()
    if result["status"] == "complete":
        suggestions = result["suggestions"]
        print("Simple schema:", suggestions["simple_schema"])
        print("Moderate schema:", suggestions["moderate_schema"])
        print("Complex schema:", suggestions["complex_schema"])
        break
    time.sleep(2)
else:
    raise TimeoutError("Schema generation did not complete in time")

The endpoint returns three schema options at different complexity levels — use the one that best matches your needs, then customize it.
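
Customizing a suggested schema usually means dropping fields you do not need and marking the important ones as required. The sketch below is one possible approach. customize_schema is a hypothetical helper, and because the exact shape of each suggestion (dict or JSON string) is not stated here, it accepts both and assumes a JSON Schema object with a properties key.

```python
import copy
import json


def customize_schema(suggestion, required=(), drop=()):
    """Return an edited copy of a suggested schema; the input is not modified."""
    # Assumption: a suggestion is either a dict or a JSON-encoded string.
    schema = json.loads(suggestion) if isinstance(suggestion, str) else copy.deepcopy(suggestion)
    props = schema.setdefault("properties", {})
    for name in drop:
        props.pop(name, None)
    missing = [name for name in required if name not in props]
    if missing:
        raise KeyError(f"fields not present in schema: {missing}")
    # Keep existing required entries that still exist, then add the new ones.
    merged = [name for name in schema.get("required", []) if name in props]
    merged += [name for name in required if name not in merged]
    if merged:
        schema["required"] = merged
    else:
        schema.pop("required", None)
    return schema


suggestion = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string", "description": "Invoice ID"},
        "notes": {"type": "string", "description": "Free-form notes"},
    },
    "required": ["notes"],
}
custom = customize_schema(suggestion, required=["invoice_number"], drop=["notes"])
print(json.dumps(custom["required"]))  # ["invoice_number"]
```

The customized schema can then be serialized with json.dumps and passed as page_schema.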

Using Forge Playground

Create and test schemas visually in Forge Playground:

Upload a sample document
Define fields in the visual editor
Switch to JSON Editor to copy the schema
Test extraction before deploying

Next Steps

Saved Schemas

Create reusable schemas and reference them by ID — no need to repeat the schema in each request

Confidence Scoring

Score extraction results with per-field confidence ratings

Handling Long Documents

Strategies for extracting from 100+ page documents

Document Segmentation

Split documents by section before extraction
