Datalab provides REST APIs for document conversion, structured extraction, form filling, and file management. All APIs use the same authentication and follow similar patterns.
For the simplest integration, use the Python SDK. The SDK handles authentication and polling, and it returns typed responses.
Authentication
All requests require an API key in the X-API-Key header:
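A minimal sketch of attaching the key with the Python standard library. The `BASE_URL` value and the `build_request` helper are assumptions for illustration: this section does not state the API host, so confirm it before use.

```python
import urllib.request
from typing import Optional

# Assumption: the API host is not given in this section; confirm before use.
BASE_URL = "https://www.datalab.to"


def build_request(path: str, api_key: str,
                  data: Optional[bytes] = None,
                  method: str = "GET") -> urllib.request.Request:
    """Build a request that carries the API key in the X-API-Key header."""
    return urllib.request.Request(
        BASE_URL + path,
        data=data,
        method=method,
        headers={"X-API-Key": api_key},
    )
```

Reading the key from an environment variable rather than hard-coding it keeps it out of source control.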
Request Pattern
All processing endpoints follow this pattern:
- Submit a document for processing (returns immediately with a request_id)
- Poll the status endpoint until processing completes
- Retrieve results from the completed response
Submit Request
Poll for Results
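The polling step can be sketched as a small loop, assuming the status URL returns JSON with a `status` field that takes the values `processing`, `complete`, or `failed`. The helper names, the two-second interval, and the poll limit are illustrative choices, not values specified by the API.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_json(url: str, api_key: str) -> dict:
    """GET a URL with the X-API-Key header and decode the JSON body."""
    req = urllib.request.Request(url, headers={"X-API-Key": api_key})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def poll_until_done(check_url: str, api_key: str,
                    fetch: Callable[[str, str], dict] = fetch_json,
                    interval: float = 2.0,
                    max_polls: int = 150,
                    sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll check_url until status is 'complete' or 'failed'."""
    for _ in range(max_polls):
        result = fetch(check_url, api_key)
        if result.get("status") in ("complete", "failed"):
            return result
        sleep(interval)  # still processing: wait before the next poll
    raise TimeoutError(f"no final status after {max_polls} polls")
```

The `fetch` and `sleep` parameters are injectable so the loop can be exercised without network access.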
Document Conversion
Convert documents to Markdown, HTML, JSON, or chunks.
Endpoint: POST /api/v1/convert
Request
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| file | file | - | Document file (multipart upload) |
| file_url | string | - | URL to document (alternative to file upload) |
| output_format | string | markdown | Output format: markdown, html, json, chunks |
| mode | string | fast | Processing mode: fast, balanced, accurate |
| max_pages | int | - | Maximum pages to process |
| page_range | string | - | Specific pages (e.g., "0-5,10", 0-indexed). For spreadsheets, filters by sheet index. |
| paginate | bool | false | Add page delimiters to output |
| skip_cache | bool | false | Skip cached results |
| disable_image_extraction | bool | false | Don’t extract images |
| disable_image_captions | bool | false | Don’t generate image captions |
| save_checkpoint | bool | false | Save checkpoint for reuse |
| extras | string | - | Comma-separated: track_changes, chart_understanding, extract_links, table_row_bboxes, infographic, new_block_types |
| add_block_ids | bool | false | Add block IDs to HTML for citations |
| include_markdown_in_chunks | bool | false | Include markdown content in chunks output |
| token_efficient_markdown | bool | false | Optimize markdown for LLM token efficiency |
| fence_synthetic_captions | bool | false | Wrap synthetic image captions in HTML comments |
| additional_config | string | - | JSON with extra config options |
| webhook_url | string | - | Override webhook URL for this request |
Processing Modes
| Mode | Description |
|---|---|
| fast | Lowest latency, good for simple documents (default) |
| balanced | Balance of speed and accuracy |
| accurate | Highest accuracy, best for complex layouts |
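As a rough sketch, the conversion parameters can be validated and serialized before sending. The helpers `convert_fields` and `encode_multipart` are hypothetical, and two details are assumptions rather than documented behavior: that booleans are sent as the strings `true`/`false`, and that `file_url` requests may be sent as multipart form fields like file uploads.

```python
import uuid
from typing import Iterable, Optional, Tuple

OUTPUT_FORMATS = {"markdown", "html", "json", "chunks"}
MODES = {"fast", "balanced", "accurate"}


def convert_fields(file_url: str, output_format: str = "markdown",
                   mode: str = "fast", page_range: Optional[str] = None,
                   paginate: bool = False,
                   extras: Optional[Iterable[str]] = None) -> dict:
    """Validate options against the tables above and return form fields."""
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unknown output_format: {output_format!r}")
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    fields = {"file_url": file_url, "output_format": output_format, "mode": mode}
    if page_range:
        fields["page_range"] = page_range
    if paginate:
        fields["paginate"] = "true"  # assumption: booleans sent as strings
    if extras:
        fields["extras"] = ",".join(extras)  # comma-separated, per the table
    return fields


def encode_multipart(fields: dict) -> Tuple[bytes, str]:
    """Encode text fields as multipart/form-data; returns (body, content type)."""
    boundary = uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines += [f"--{boundary}",
                  f'Content-Disposition: form-data; name="{name}"',
                  "",
                  str(value)]
    lines += [f"--{boundary}--", ""]
    return "\r\n".join(lines).encode("utf-8"), f"multipart/form-data; boundary={boundary}"
```

The returned body and content type can be passed as the `data` argument and `Content-Type` header of a POST to /api/v1/convert.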
Response
Poll request_check_url until status is complete:
| Field | Type | Description |
|---|---|---|
| status | string | processing, complete, or failed |
| success | bool | Whether conversion succeeded |
| markdown | string | Markdown output (if format is markdown) |
| html | string | HTML output (if format is html) |
| json | object | JSON output (if format is json) |
| chunks | object | Chunked output (if format is chunks) |
| images | object | Extracted images as {filename: base64} |
| metadata | object | Document metadata |
| page_count | int | Number of pages processed |
| parse_quality_score | float | Quality score (0-5) |
| cost_breakdown | object | Cost in cents |
| error | string | Error message if failed |
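One way to consume a completed response, sketched under the assumption that the output lives under the key named after the requested format, as the table suggests. The helper names are illustrative.

```python
import base64
from typing import Dict


def extract_output(result: dict, output_format: str = "markdown"):
    """Return the converted content, or raise if the job did not succeed."""
    if result.get("status") != "complete":
        raise RuntimeError(f"job not complete: status={result.get('status')!r}")
    if not result.get("success"):
        raise RuntimeError(result.get("error") or "conversion failed")
    return result[output_format]


def decode_images(result: dict) -> Dict[str, bytes]:
    """Decode the {filename: base64} image map into raw bytes."""
    images = result.get("images") or {}
    return {name: base64.b64decode(data) for name, data in images.items()}
```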
For structured data extraction, see the Extract endpoint. For document segmentation, see the Segment endpoint.
Structured Extraction
Extract structured data from documents using a JSON schema.
Endpoint: POST /api/v1/extract
Request
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| file | file | - | Document file (multipart upload) |
| file_url | string | - | URL to document (alternative to file upload) |
| page_schema | string | - | JSON schema defining the data to extract. Required unless schema_id is provided. |
| schema_id | string | - | ID of a saved extraction schema (e.g. sch_k8Hx9mP2nQ4v). Mutually exclusive with page_schema. |
| schema_version | int | - | Version of the saved schema to use. Only valid with schema_id; defaults to the latest version. |
| checkpoint_id | string | - | Checkpoint ID from a previous /convert call (with save_checkpoint=true). Skips re-parsing. |
| mode | string | fast | Processing mode: fast, balanced, accurate |
| output_format | string | markdown | Output format: markdown, html, json, chunks |
| max_pages | int | - | Maximum pages to process |
| page_range | string | - | Specific pages (e.g., "0-5,10", 0-indexed). For spreadsheets, filters by sheet index. |
| save_checkpoint | bool | false | Save a checkpoint after processing for reuse with subsequent calls |
| webhook_url | string | - | Override webhook URL for this request |
Extracted data is returned in extraction_schema_json in the poll response.
See Structured Extraction for detailed examples.
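The schema rules in the table (exactly one of `page_schema` or `schema_id`, and `schema_version` only alongside `schema_id`) can be checked client-side before submitting. This is a sketch: `schema_fields` is a hypothetical helper, and the invoice schema in the usage example is made up.

```python
import json
from typing import Optional


def schema_fields(page_schema: Optional[dict] = None,
                  schema_id: Optional[str] = None,
                  schema_version: Optional[int] = None) -> dict:
    """Build the schema-related form fields for an extraction request."""
    if (page_schema is None) == (schema_id is None):
        raise ValueError("provide exactly one of page_schema or schema_id")
    if schema_version is not None and schema_id is None:
        raise ValueError("schema_version is only valid with schema_id")
    if page_schema is not None:
        # page_schema is typed as a string, so serialize the JSON schema.
        return {"page_schema": json.dumps(page_schema)}
    fields = {"schema_id": schema_id}
    if schema_version is not None:
        fields["schema_version"] = str(schema_version)
    return fields


# Hypothetical schema for illustration only.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
    },
}
fields = schema_fields(page_schema=invoice_schema)
```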
Document Segmentation
Segment documents into structured sections using a JSON schema.
Endpoint: POST /api/v1/segment
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| file | file | - | Document file (multipart upload) |
| file_url | string | - | URL to document (alternative to file upload) |
| segmentation_schema | string | required | JSON schema defining the segments to extract |
| checkpoint_id | string | - | Checkpoint ID from a previous /convert call (with save_checkpoint=true). Skips re-parsing. |
| mode | string | fast | Processing mode: fast, balanced, accurate |
Track Changes
Extract tracked changes (insertions and deletions) from DOCX files.
Endpoint: POST /api/v1/track-changes
Custom Processor
This feature is currently in beta. The API may change.
Endpoint: POST /api/v1/custom-processor
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| file | file | - | Document file (multipart upload) |
| file_url | string | - | URL to document |
| pipeline_id | string | required | Custom processor ID (cp_XXXXX) |
| version | int | - | Processor version to run (default: active version) |
| run_eval | bool | false | Run evaluation rules defined for the processor |
| mode | string | fast | Processing mode: fast, balanced, accurate |
| output_format | string | markdown | Output format: markdown, html, json, chunks |
| webhook_url | string | - | URL to POST when complete |
Form Filling
Fill forms in PDFs and images.
Endpoint: POST /api/v1/fill
Request
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| file | file | - | Form file (PDF or image) |
| file_url | string | - | URL to form |
| field_data | string | - | JSON mapping field names to values |
| context | string | - | Additional context for field matching |
| confidence_threshold | float | 0.5 | Minimum confidence for matching (0-1) |
| page_range | string | - | Specific pages to process |
| skip_cache | bool | false | Skip cached results |
Field Data Format
Response
| Field | Type | Description |
|---|---|---|
| status | string | Processing status |
| success | bool | Whether filling succeeded |
| output_format | string | pdf or png |
| output_base64 | string | Base64-encoded filled form |
| fields_filled | array | Successfully filled field names |
| fields_not_found | array | Unmatched field names |
| page_count | int | Pages processed |
| cost_breakdown | object | Cost details |
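A short sketch of handling the fill response: decode `output_base64` and use `output_format` as the file extension. The helper names are illustrative, and the code relies only on the fields listed in the table above.

```python
import base64
from typing import List, Tuple


def decode_filled_form(result: dict) -> Tuple[str, bytes]:
    """Return (file extension, raw bytes) for a successful fill response."""
    if not result.get("success"):
        raise RuntimeError("form filling did not succeed")
    extension = result["output_format"]
    if extension not in ("pdf", "png"):
        raise ValueError(f"unexpected output_format: {extension!r}")
    return extension, base64.b64decode(result["output_base64"])


def unmatched_fields(result: dict) -> List[str]:
    """List field names the service could not match to the form."""
    return list(result.get("fields_not_found") or [])
```

Writing the returned bytes to a file named with the returned extension yields the filled form. Checking `unmatched_fields` afterwards helps catch field names that need more context or a lower confidence_threshold.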
File Management
Upload and manage files for use in pipelines.
Upload File
Step 1: Request an upload URL
List Files
Get File Metadata
Get Download URL
Delete File
Thumbnails
Generate page thumbnails from a previously processed document:
| Parameter | Type | Default | Description |
|---|---|---|---|
| lookup_key | string | Required | The request ID from a previous conversion |
| thumb_width | int | 300 | Thumbnail width in pixels |
| page_range | string | All pages | Pages to generate (e.g., "0,2-4") |
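To make the page_range syntax concrete, here is a small client-side parser for strings such as "0,2-4". It assumes ranges are inclusive at both ends, which the examples suggest but this page does not state. `parse_page_range` is a hypothetical helper, not part of the API.

```python
from typing import List


def parse_page_range(spec: str) -> List[int]:
    """Expand a page_range string into sorted 0-indexed page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))  # assumption: inclusive end
        else:
            pages.add(int(part))
    return sorted(pages)
```

Validating the string locally before submitting avoids spending a request on a malformed range.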
Create Document
Generate DOCX files from markdown with track changes support:
Webhooks
Configure webhooks to receive notifications when processing completes instead of polling. Set a default webhook URL in your account settings, or override per-request with the webhook_url parameter.
See Webhooks for configuration details.
Rate Limits
Default rate limits apply per API key. If you exceed limits, you’ll receive a 429 response.
See Rate Limits for details and how to request higher limits.
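A client can absorb occasional 429 responses by retrying with exponential backoff. The sketch below is one possible approach, not a documented requirement. It assumes the server may send a `Retry-After` header in whole seconds and falls back to doubling delays when it does not. The attempt count and base delay are arbitrary.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def send_with_retry(req, max_attempts: int = 5, base_delay: float = 1.0,
                    opener: Callable = urllib.request.urlopen,
                    sleep: Callable[[float], None] = time.sleep) -> bytes:
    """Send a request, retrying on HTTP 429 with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            with opener(req) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            # Re-raise anything other than 429, and the final 429 as well.
            if err.code != 429 or attempt == max_attempts - 1:
                raise
            # Assumption: Retry-After, if present, is a whole number of seconds.
            retry_after = err.headers.get("Retry-After") if err.headers else None
            if retry_after and retry_after.isdigit():
                delay = float(retry_after)
            else:
                delay = base_delay * 2 ** attempt
            sleep(delay)
    raise ValueError("max_attempts must be at least 1")
```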
Next Steps
SDK Reference
Use the Python SDK for a simpler integration with typed responses.
Webhooks
Receive notifications when processing completes instead of polling.
API Limits
Understand file size limits, page limits, and rate limiting.
Document Conversion
Detailed guide to converting documents to Markdown, HTML, or JSON.