API Overview - Datalab Documentation

Datalab provides REST APIs for document conversion, structured extraction, form filling, and file management. All APIs use the same authentication and follow similar patterns.

For the simplest integration, use the Python SDK. The SDK handles authentication and polling, and it returns typed responses.

Authentication

All requests require an API key in the X-API-Key header:

curl -X POST https://www.datalab.to/api/v1/convert \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "file=@document.pdf"

Get your API key from the API Keys dashboard.
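To keep the key out of source code, the header can be built from an environment variable. This is a minimal sketch: the helper name `auth_headers` and the variable name `DATALAB_API_KEY` are assumptions made for illustration, not names this page defines.

```python
import os


def auth_headers(env_var="DATALAB_API_KEY"):
    """Build the X-API-Key header from an environment variable.

    The variable name is an assumption; use whatever your deployment sets.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"set the {env_var} environment variable first")
    return {"X-API-Key": api_key}
```

The returned dict can be passed as `headers=` to any of the requests shown on this page.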

Request Pattern

All processing endpoints follow this pattern:

1. Submit a document for processing (returns immediately with a request_id)
2. Poll the status endpoint until processing completes
3. Retrieve results from the completed response

Submit Request

POST /api/v1/{endpoint}

Response:

{
  "success": true,
  "request_id": "abc123",
  "request_check_url": "https://www.datalab.to/api/v1/{endpoint}/abc123"
}

Poll for Results

GET /api/v1/{endpoint}/{request_id}

Response while processing:

{
  "status": "processing"
}

Response when complete:

{
  "status": "complete",
  "success": true,
  ...results...
}

Results are deleted from Datalab servers one hour after processing completes. Retrieve your results promptly.
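The submit-then-poll pattern can be wrapped in a small helper that also stops on failure and gives up after a deadline. This is a sketch, not part of the Datalab SDK: `poll_until_done` and its `fetch_status` callable are hypothetical names, and the interval and timeout defaults are arbitrary. It relies on the `failed` status and `error` field listed in the conversion response fields.

```python
import time


def poll_until_done(fetch_status, interval=2.0, timeout=300.0, sleep=time.sleep):
    """Call fetch_status() until the job reaches a terminal status.

    fetch_status is any zero-argument callable that returns the decoded
    JSON body of GET /api/v1/{endpoint}/{request_id} as a dict.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = fetch_status()
        status = result.get("status")
        if status == "complete":
            return result
        if status == "failed":
            raise RuntimeError(result.get("error", "processing failed"))
        if time.monotonic() >= deadline:
            raise TimeoutError("gave up waiting for results")
        sleep(interval)
```

With the requests library, `fetch_status` could be `lambda: requests.get(check_url, headers=headers).json()`. Passing the fetch step in as a callable keeps the loop independent of the HTTP client.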

Document Conversion

Convert documents to Markdown, HTML, JSON, or chunks. Endpoint: POST /api/v1/convert

Request

import requests

url = "https://www.datalab.to/api/v1/convert"
headers = {"X-API-Key": "YOUR_API_KEY"}

with open("document.pdf", "rb") as f:
    response = requests.post(
        url,
        files={"file": ("document.pdf", f, "application/pdf")},
        data={
            "output_format": "markdown",
            "mode": "balanced",
        },
        headers=headers
    )

data = response.json()
check_url = data["request_check_url"]

Parameters

Parameter	Type	Default	Description
`file`	file	-	Document file (multipart upload)
`file_url`	string	-	URL to document (alternative to file upload)
`output_format`	string	`markdown`	Output format: `markdown`, `html`, `json`, `chunks`
`mode`	string	`fast`	Processing mode: `fast`, `balanced`, `accurate`
`max_pages`	int	-	Maximum pages to process
`page_range`	string	-	Specific pages (e.g., `"0-5,10"`, 0-indexed). For spreadsheets, filters by sheet index.
`paginate`	bool	`false`	Add page delimiters to output
`skip_cache`	bool	`false`	Skip cached results
`disable_image_extraction`	bool	`false`	Don’t extract images
`disable_image_captions`	bool	`false`	Don’t generate image captions
`save_checkpoint`	bool	`false`	Save checkpoint for reuse
`extras`	string	-	Comma-separated: `track_changes`, `chart_understanding`, `extract_links`, `table_row_bboxes`, `infographic`, `new_block_types`
`add_block_ids`	bool	`false`	Add block IDs to HTML for citations
`include_markdown_in_chunks`	bool	`false`	Include markdown content in chunks output
`token_efficient_markdown`	bool	`false`	Optimize markdown for LLM token efficiency
`fence_synthetic_captions`	bool	`false`	Wrap synthetic image captions in HTML comments
`additional_config`	string	-	JSON with extra config options
`webhook_url`	string	-	Override webhook URL for this request
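The `page_range` value is a compact string of 0-indexed pages and ranges. When the pages of interest are held as a list of integers, a helper along these lines can produce that string. `to_page_range` is a hypothetical name, and the sketch assumes the service accepts ascending, non-overlapping ranges in the form of the `"0-5,10"` example.

```python
def to_page_range(pages):
    """Collapse 0-indexed page numbers into a string such as "0-5,10"."""
    ordered = sorted(set(pages))
    if not ordered:
        return ""
    parts = []
    start = prev = ordered[0]
    for page in ordered[1:]:
        if page == prev + 1:
            # Still inside a consecutive run; extend it.
            prev = page
            continue
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = page
    parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)
```

For example, `to_page_range([0, 1, 2, 3, 4, 5, 10])` gives `"0-5,10"`, which can be sent as the `page_range` form field.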

Processing Modes

Mode	Description
`fast`	Lowest latency, good for simple documents (default)
`balanced`	Balance of speed and accuracy
`accurate`	Highest accuracy, best for complex layouts

Response

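Poll request_check_url until status is complete or failed:

import time

while True:
    response = requests.get(check_url, headers=headers)
    result = response.json()

    # Stop on either terminal status; otherwise a failed job would loop forever.
    if result["status"] in ("complete", "failed"):
        break
    time.sleep(2)

if result["status"] == "failed":
    raise RuntimeError(result.get("error", "conversion failed"))

print(result["markdown"])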

Response fields:

Field	Type	Description
`status`	string	`processing`, `complete`, or `failed`
`success`	bool	Whether conversion succeeded
`markdown`	string	Markdown output (if format is markdown)
`html`	string	HTML output (if format is html)
`json`	object	JSON output (if format is json)
`chunks`	object	Chunked output (if format is chunks)
`images`	object	Extracted images as `{filename: base64}`
`metadata`	object	Document metadata
`page_count`	int	Number of pages processed
`parse_quality_score`	float	Quality score (0-5)
`cost_breakdown`	object	Cost in cents
`error`	string	Error message if failed
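Because `images` arrives as a `{filename: base64}` mapping, saving the extracted images means decoding each value. A minimal sketch follows; `save_images` is a hypothetical helper, and it assumes the values are plain base64 strings without a data-URI prefix.

```python
import base64
from pathlib import Path


def save_images(result, out_dir):
    """Write each entry of result["images"] into out_dir and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for filename, encoded in (result.get("images") or {}).items():
        # Keep only the final path component so a name cannot escape out_dir.
        target = out / Path(filename).name
        target.write_bytes(base64.b64decode(encoded))
        written.append(target)
    return written
```

Call it with the completed poll response, for example `save_images(result, "images")`.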

For structured data extraction, see the Extract endpoint. For document segmentation, see the Segment endpoint.

Structured Extraction

Extract structured data from documents using a JSON schema. Endpoint: POST /api/v1/extract

Request

import requests
import json

headers = {"X-API-Key": "YOUR_API_KEY"}

schema = {
    "invoice_number": {"type": "string", "description": "Invoice ID"},
    "total": {"type": "number", "description": "Total amount"},
    "line_items": {
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "description": {"type": "string"},
                "amount": {"type": "number"}
            }
        }
    }
}

# Open the file in a with-block so the handle is closed after the upload.
with open("invoice.pdf", "rb") as f:
    response = requests.post(
        "https://www.datalab.to/api/v1/extract",
        files={"file": ("invoice.pdf", f, "application/pdf")},
        data={
            "page_schema": json.dumps(schema),
            "mode": "balanced"
        },
        headers=headers
    )

data = response.json()
check_url = data["request_check_url"]

Parameters

Parameter	Type	Default	Description
`file`	file	-	Document file (multipart upload)
`file_url`	string	-	URL to document (alternative to file upload)
`page_schema`	string	-	JSON schema defining the data to extract. Required unless `schema_id` is provided.
`schema_id`	string	-	ID of a saved extraction schema (e.g. `sch_k8Hx9mP2nQ4v`). Mutually exclusive with `page_schema`.
`schema_version`	int	-	Version of the saved schema to use. Only valid with `schema_id`; defaults to the latest version.
`checkpoint_id`	string	-	Checkpoint ID from a previous `/convert` call (with `save_checkpoint=true`). Skips re-parsing.
`mode`	string	`fast`	Processing mode: `fast`, `balanced`, `accurate`
`output_format`	string	`markdown`	Output format: `markdown`, `html`, `json`, `chunks`
`max_pages`	int	-	Maximum pages to process
`page_range`	string	-	Specific pages (e.g., `"0-5,10"`, 0-indexed). For spreadsheets, filters by sheet index.
`save_checkpoint`	bool	`false`	Save a checkpoint after processing for reuse with subsequent calls
`webhook_url`	string	-	Override webhook URL for this request

The extracted data is returned in extraction_schema_json in the poll response. See Structured Extraction for detailed examples.
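The parameter table implies a few rules: exactly one of `page_schema` or `schema_id` is needed, and `schema_version` only makes sense with `schema_id`. These can be checked on the client before sending. The sketch below is illustrative only; `build_extract_fields` is a hypothetical helper, and the server remains the authority on validation.

```python
import json


def build_extract_fields(page_schema=None, schema_id=None, schema_version=None, mode="fast"):
    """Return the form fields for POST /api/v1/extract as a dict of strings."""
    if (page_schema is None) == (schema_id is None):
        raise ValueError("provide exactly one of page_schema or schema_id")
    if schema_version is not None and schema_id is None:
        raise ValueError("schema_version is only valid with schema_id")
    fields = {"mode": mode}
    if page_schema is not None:
        # page_schema is sent as a JSON string, as in the request example.
        fields["page_schema"] = json.dumps(page_schema)
    else:
        fields["schema_id"] = schema_id
        if schema_version is not None:
            fields["schema_version"] = str(schema_version)
    return fields
```

The result can be passed as `data=` alongside the `files=` upload in the request shown earlier.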

Document Segmentation

Segment documents into structured sections using a JSON schema. Endpoint: POST /api/v1/segment

Parameters

Parameter	Type	Default	Description
`file`	file	-	Document file (multipart upload)
`file_url`	string	-	URL to document (alternative to file upload)
`segmentation_schema`	string	required	JSON schema defining the segments to extract
`checkpoint_id`	string	-	Checkpoint ID from a previous `/convert` call (with `save_checkpoint=true`). Skips re-parsing.
`mode`	string	`fast`	Processing mode: `fast`, `balanced`, `accurate`

See Document Segmentation for detailed examples.

Track Changes

Extract tracked changes (insertions and deletions) from DOCX files. Endpoint: POST /api/v1/track-changes

import requests

headers = {"X-API-Key": "YOUR_API_KEY"}

with open("document.docx", "rb") as f:
    response = requests.post(
        "https://www.datalab.to/api/v1/track-changes",
        files={"file": ("document.docx", f, "application/vnd.openxmlformats-officedocument.wordprocessingml.document")},
        headers=headers
    )

See Track Changes for detailed examples.

Custom Processor

This feature is currently in beta. The API may change.

Execute custom AI-powered processors on documents. Endpoint: POST /api/v1/custom-processor

POST /api/v1/custom-pipeline is deprecated (sunset: September 30, 2026). Migrate to POST /api/v1/custom-processor.

Parameters

Parameter	Type	Default	Description
`file`	file	-	Document file (multipart upload)
`file_url`	string	-	URL to document
`pipeline_id`	string	required	Custom processor ID (`cp_XXXXX`)
`version`	int	-	Processor version to run (default: active version)
`run_eval`	bool	`false`	Run evaluation rules defined for the processor
`mode`	string	`fast`	Processing mode: `fast`, `balanced`, `accurate`
`output_format`	string	`markdown`	Output format: `markdown`, `html`, `json`, `chunks`
`webhook_url`	string	-	URL to POST when complete

Form Filling

Fill forms in PDFs and images. Endpoint: POST /api/v1/fill

Request

import json
import requests

headers = {"X-API-Key": "YOUR_API_KEY"}

field_data = {
    "full_name": {"value": "John Doe", "description": "Full legal name"},
    "date": {"value": "2024-01-15", "description": "Today's date"},
    "signature": {"value": "John Doe", "description": "Signature field"}
}

# Open the form in a with-block so the handle is closed after the upload.
with open("form.pdf", "rb") as f:
    response = requests.post(
        "https://www.datalab.to/api/v1/fill",
        files={"file": ("form.pdf", f, "application/pdf")},
        data={
            "field_data": json.dumps(field_data),
            "confidence_threshold": "0.5"
        },
        headers=headers
    )

Parameters

Parameter	Type	Default	Description
`file`	file	-	Form file (PDF or image)
`file_url`	string	-	URL to form
`field_data`	string	-	JSON mapping field names to values
`context`	string	-	Additional context for field matching
`confidence_threshold`	float	`0.5`	Minimum confidence for matching (0-1)
`page_range`	string	-	Specific pages to process
`skip_cache`	bool	`false`	Skip cached results

Field Data Format

{
  "field_key": {
    "value": "The value to fill",
    "description": "Description to help match the field"
  }
}

Response

Field	Type	Description
`status`	string	Processing status
`success`	bool	Whether filling succeeded
`output_format`	string	`pdf` or `png`
`output_base64`	string	Base64-encoded filled form
`fields_filled`	array	Successfully filled field names
`fields_not_found`	array	Unmatched field names
`page_count`	int	Pages processed
`cost_breakdown`	object	Cost details
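The filled form comes back base64-encoded, with `output_format` indicating whether it is a PDF or a PNG. A sketch of writing it to disk follows; `save_filled_form` is a hypothetical helper, and it assumes `output_base64` is a plain base64 string.

```python
import base64
from pathlib import Path


def save_filled_form(result, stem="filled_form"):
    """Decode output_base64 and write it as <stem>.pdf or <stem>.png."""
    if not result.get("success"):
        raise ValueError("form filling did not succeed")
    fmt = result["output_format"]
    if fmt not in ("pdf", "png"):
        raise ValueError(f"unexpected output_format: {fmt!r}")
    path = Path(f"{stem}.{fmt}")
    path.write_bytes(base64.b64decode(result["output_base64"]))
    return path
```

It is worth checking `fields_not_found` on the same response as well, since unmatched fields are not treated as an error here.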

See Form Filling for more examples.

File Management

Upload and manage files for use in pipelines.

Upload File

Step 1: Request an upload URL

POST /api/v1/files/upload
Content-Type: application/json

{
  "filename": "document.pdf",
  "content_type": "application/pdf"
}

Response:

{
  "file_id": 123,
  "upload_url": "https://...",
  "reference": "datalab://file-abc123"
}

Step 2: Upload directly to the presigned URL

PUT {upload_url}
Content-Type: application/pdf

<file contents>

Step 3: Confirm upload

GET /api/v1/files/{file_id}/confirm
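The three steps can be sketched with the standard library by building one request object per step. The function names are hypothetical, and the sketch assumes the presigned URL needs no API key header. The confirm step uses the method and path shown above.

```python
import json
import urllib.request

BASE_URL = "https://www.datalab.to/api/v1"


def upload_url_request(api_key, filename, content_type):
    """Step 1: ask for a presigned upload URL."""
    body = json.dumps({"filename": filename, "content_type": content_type}).encode()
    return urllib.request.Request(
        f"{BASE_URL}/files/upload",
        data=body,
        method="POST",
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
    )


def put_file_request(upload_url, content, content_type):
    """Step 2: send the raw file bytes to the presigned URL."""
    return urllib.request.Request(
        upload_url,
        data=content,
        method="PUT",
        headers={"Content-Type": content_type},
    )


def confirm_request(api_key, file_id):
    """Step 3: confirm the upload for the file_id returned in step 1."""
    return urllib.request.Request(
        f"{BASE_URL}/files/{file_id}/confirm",
        method="GET",
        headers={"X-API-Key": api_key},
    )
```

Each request can be sent with `urllib.request.urlopen(req)`. The JSON body returned by step 1 supplies the `upload_url` and `file_id` used in steps 2 and 3.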

List Files

GET /api/v1/files?limit=50&offset=0

Get File Metadata

GET /api/v1/files/{file_id}

Get Download URL

GET /api/v1/files/{file_id}/download?expires_in=3600

Delete File

DELETE /api/v1/files/{file_id}

See File Management for detailed examples.

Thumbnails

Generate page thumbnails from a previously processed document:

GET /api/v1/thumbnails/{lookup_key}?thumb_width=300&page_range=0-2

Parameter	Type	Default	Description
`lookup_key`	string	Required	The request ID from a previous conversion
`thumb_width`	int	300	Thumbnail width in pixels
`page_range`	string	All pages	Pages to generate (e.g., `"0,2-4"`)

Response:

{
  "success": true,
  "thumbnails": ["base64_encoded_jpg_1", "base64_encoded_jpg_2"]
}

Thumbnails are returned as base64-encoded JPG images.
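As a rough sketch, the request URL can be assembled from the parameters above and the returned thumbnails decoded to `.jpg` files. Both helper names and the output file naming are hypothetical.

```python
import base64
from pathlib import Path
from urllib.parse import quote, urlencode


def thumbnails_url(lookup_key, thumb_width=300, page_range=None):
    """Build the GET URL for /api/v1/thumbnails/{lookup_key}."""
    params = {"thumb_width": thumb_width}
    if page_range is not None:
        params["page_range"] = page_range
    key = quote(lookup_key, safe="")
    return f"https://www.datalab.to/api/v1/thumbnails/{key}?{urlencode(params)}"


def save_thumbnails(response_json, out_dir):
    """Decode each base64 JPG in the response and write it to out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, encoded in enumerate(response_json["thumbnails"]):
        # index is the position in the returned list, not necessarily the page number.
        path = out / f"thumb_{index}.jpg"
        path.write_bytes(base64.b64decode(encoded))
        paths.append(path)
    return paths
```

Commas in `page_range` are percent-encoded by `urlencode`, which is the standard form for query strings.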

Create Document

Generate DOCX files from markdown with track changes support:

POST /api/v1/create-document
Content-Type: application/json

{
  "markdown": "# Title\n\nThis is <ins data-revision-author=\"Editor\">newly added</ins> text.",
  "output_format": "docx"
}
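The request body above can be produced programmatically. This is a sketch with hypothetical helper names; it assumes that HTML-escaping the inserted text and author is acceptable to the service, which this page does not state.

```python
import json
from html import escape


def mark_insertion(text, author):
    """Wrap text in an <ins> tag carrying the revision author."""
    return f'<ins data-revision-author="{escape(author, quote=True)}">{escape(text)}</ins>'


def create_document_body(markdown):
    """Serialize the JSON body for POST /api/v1/create-document."""
    return json.dumps({"markdown": markdown, "output_format": "docx"})


body = create_document_body(
    "# Title\n\nThis is " + mark_insertion("newly added", "Editor") + " text."
)
```

`body` can then be sent with `Content-Type: application/json` and the `X-API-Key` header.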

See Create Document for detailed examples.

Webhooks

Configure webhooks to receive notifications when processing completes instead of polling. Set a default webhook URL in your account settings, or override per-request with the webhook_url parameter. See Webhooks for configuration details.

Rate Limits

Default rate limits apply per API key. If you exceed limits, you’ll receive a 429 response. See Rate Limits for details and how to request higher limits.
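A common way to handle a 429 is to retry with exponential backoff. The sketch below is generic rather than Datalab-specific: `call_with_backoff` and its `send` callable are hypothetical, the delays are arbitrary, and it does not assume the response carries any retry hint.

```python
import random
import time


def call_with_backoff(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a status other than 429.

    send is a zero-argument callable returning (status_code, body).
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            # Exponential backoff (1x, 2x, 4x base_delay) plus up to base_delay of jitter.
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
    raise RuntimeError("still rate limited after retries")
```

With the requests library, `send` could wrap a call and return `(response.status_code, response.json())`. If the request uploads a file, reopen or rewind the file inside `send` so each retry sends the full content.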

Next Steps

SDK Reference

Use the Python SDK for a simpler integration with typed responses.

Webhooks

Receive notifications when processing completes instead of polling.

API Limits

Understand file size limits, page limits, and rate limiting.

Document Conversion

Detailed guide to converting documents to Markdown, HTML, or JSON.
