File Management - Datalab Documentation

Overview

Datalab provides file storage for documents you want to process with pipelines or reuse across multiple API calls. Uploaded files get a reference URL (datalab://file-xxx) that you can use in pipelines.

Upload Files

Upload one or more files to Datalab storage:

from datalab_sdk import DatalabClient

client = DatalabClient()

# Upload a single file
file = client.upload_files("document.pdf")
print(f"Uploaded: {file.original_filename}")
print(f"Reference: {file.reference}")  # datalab://file-abc123

# Upload multiple files
files = client.upload_files(["doc1.pdf", "doc2.pdf", "doc3.pdf"])
for f in files:
    print(f"{f.original_filename}: {f.reference}")

Upload Result

The UploadedFileMetadata object contains:

Field	Type	Description
`file_id`	int	Unique file ID
`original_filename`	str	Original filename
`content_type`	str	MIME type
`reference`	str	Datalab reference URL (`datalab://file-xxx`)
`upload_status`	str	Status: `"pending"`, `"completed"`, `"failed"`
`file_size`	int	File size in bytes
`created`	str	Upload timestamp
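
Because `upload_status` can be `"pending"` or `"failed"` as well as `"completed"`, it may be worth checking it before passing a reference on to a pipeline. The following is a minimal sketch, not part of the SDK: `FileRecord` is a hypothetical stand-in that mirrors two of the fields in the table above, and `split_by_status` is an illustrative helper name.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class FileRecord:
    """Hypothetical stand-in exposing two of the documented metadata fields."""
    original_filename: str
    upload_status: str  # "pending", "completed", or "failed"


def split_by_status(files: Iterable[FileRecord]) -> Tuple[List[FileRecord], List[FileRecord]]:
    """Return (ready, not_ready): completed uploads vs. pending or failed ones."""
    ready: List[FileRecord] = []
    not_ready: List[FileRecord] = []
    for f in files:
        (ready if f.upload_status == "completed" else not_ready).append(f)
    return ready, not_ready


sample = [
    FileRecord("a.pdf", "completed"),
    FileRecord("b.pdf", "pending"),
    FileRecord("c.pdf", "failed"),
]
ready, not_ready = split_by_status(sample)
print([f.original_filename for f in ready])      # ['a.pdf']
print([f.original_filename for f in not_ready])  # ['b.pdf', 'c.pdf']
```

The helper only reads the `upload_status` attribute, so it should work with any object that exposes that field.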

List Files

List all uploaded files with pagination:

# List first 50 files
result = client.list_files(limit=50, offset=0)

print(f"Total files: {result['total']}")
for file in result['files']:
    print(f"  {file.original_filename} ({file.file_size} bytes)")
    print(f"    Reference: {file.reference}")
    print(f"    Status: {file.upload_status}")

Pagination

# Page through all files
offset = 0
limit = 50

while True:
    result = client.list_files(limit=limit, offset=offset)

    for file in result['files']:
        print(file.original_filename)

    if offset + limit >= result['total']:
        break

    offset += limit

Get File Metadata

Get details for a specific file:

# By file ID (integer)
file = client.get_file_metadata(123)

# By hashid (string from reference URL)
file = client.get_file_metadata("abc123")

print(f"Filename: {file.original_filename}")
print(f"Size: {file.file_size} bytes")
print(f"Type: {file.content_type}")
print(f"Created: {file.created}")
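
If you only have a stored reference, the hashid can be derived from it before calling `get_file_metadata`. The sketch below assumes that references always have the form `datalab://file-<hashid>`, as in the examples on this page; `hashid_from_reference` is a hypothetical helper, not an SDK function.

```python
REFERENCE_PREFIX = "datalab://file-"  # Assumed from the examples on this page


def hashid_from_reference(reference: str) -> str:
    """Extract the hashid from a reference such as 'datalab://file-abc123'."""
    if not reference.startswith(REFERENCE_PREFIX):
        raise ValueError(f"Not a Datalab file reference: {reference!r}")
    hashid = reference[len(REFERENCE_PREFIX):]
    if not hashid:
        raise ValueError("Reference has an empty hashid")
    return hashid


print(hashid_from_reference("datalab://file-abc123"))  # abc123
```

The returned string can then be passed wherever the examples above use `"abc123"`.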

Get Download URL

Generate a presigned URL to download a file:

result = client.get_file_download_url(
    file_id=123,
    expires_in=3600  # URL valid for 1 hour (default)
)

print(f"Download URL: {result['download_url']}")
print(f"Expires in: {result['expires_in']} seconds")

# Download the file
import requests

response = requests.get(result['download_url'], timeout=30)
response.raise_for_status()  # Fail loudly on an expired or invalid URL
with open("downloaded.pdf", "wb") as f:
    f.write(response.content)
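
For large files, reading the whole response into memory may be undesirable. The sketch below streams the body to disk in chunks. It swaps in the standard library's `urllib.request` in place of `requests` so that it has no third-party dependency, and it assumes the presigned URL can be fetched with a plain GET and no extra headers. `download_to_path` is a hypothetical helper name.

```python
import shutil
import urllib.request
from pathlib import Path


def download_to_path(url: str, destination: str, timeout: float = 30.0) -> int:
    """Stream the body at `url` to `destination` and return the bytes written."""
    dest = Path(destination)
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses,
    # which would include an expired presigned URL.
    with urllib.request.urlopen(url, timeout=timeout) as response, dest.open("wb") as out:
        shutil.copyfileobj(response, out)  # Copies in fixed-size chunks
    return dest.stat().st_size
```

A call would look like `download_to_path(result['download_url'], "downloaded.pdf")`, using the `result` from the example above.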

Expiration Options

The expires_in parameter accepts values from 60 to 86400 seconds (1 minute to 24 hours):

# Short-lived URL (1 minute)
result = client.get_file_download_url(file_id=123, expires_in=60)

# Long-lived URL (24 hours)
result = client.get_file_download_url(file_id=123, expires_in=86400)
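
When the expiry comes from user input or configuration, it can be normalized into the accepted range before the call. This is an illustrative sketch only: `clamp_expires_in` is a hypothetical helper, and the bounds are taken from the range stated above. How the API responds to out-of-range values is not documented here.

```python
MIN_EXPIRES_IN = 60       # 1 minute, lower bound stated above
MAX_EXPIRES_IN = 86_400   # 24 hours, upper bound stated above


def clamp_expires_in(seconds: int) -> int:
    """Clamp a requested lifetime into the accepted 60..86400 second range."""
    return max(MIN_EXPIRES_IN, min(MAX_EXPIRES_IN, seconds))


print(clamp_expires_in(10))       # 60
print(clamp_expires_in(3600))     # 3600
print(clamp_expires_in(100_000))  # 86400
```

Clamping silently changes the caller's value. Raising a `ValueError` for out-of-range input is an equally reasonable design if silent adjustment would be surprising.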

Delete File

Delete an uploaded file:

result = client.delete_file(123)

if result['success']:
    print(f"Deleted: {result['message']}")

Using Files in Pipelines

File references can be used as input to pipelines:

from datalab_sdk import DatalabClient

client = DatalabClient()

# Upload files
files = client.upload_files(["invoice1.pdf", "invoice2.pdf"])

# Run pipeline on each uploaded file
for f in files:
    execution = client.run_pipeline(
        "pl_abc123",
        file_url=f.reference  # e.g., 'datalab://file-abc123'
    )
    print(f"{f.original_filename}: {execution.execution_id}")

See Pipelines for more details.

Async Usage

import asyncio
from datalab_sdk import AsyncDatalabClient

async def manage_files():
    async with AsyncDatalabClient() as client:
        # Upload
        files = await client.upload_files(["doc.pdf"])

        # List
        result = await client.list_files(limit=10)

        # Get metadata
        file = await client.get_file_metadata(files[0].file_id)

        # Download URL
        url = await client.get_file_download_url(files[0].file_id)

        # Delete
        await client.delete_file(files[0].file_id)

asyncio.run(manage_files())

Example: Batch Upload and Process

from datalab_sdk import DatalabClient
from pathlib import Path

client = DatalabClient()

# Find all PDFs in a directory
pdf_files = list(Path("./documents").glob("*.pdf"))

# Upload all files
uploaded = client.upload_files([str(p) for p in pdf_files])

print(f"Uploaded {len(uploaded)} files:")
for file in uploaded:
    print(f"  {file.original_filename}: {file.reference}")

# Store references for later use
references = {f.original_filename: f.reference for f in uploaded}
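
To reuse the references in a later session, the mapping can be written to disk. A minimal sketch using JSON follows; `save_references` and `load_references` are hypothetical helper names, and the file path is up to you.

```python
import json
from pathlib import Path
from typing import Dict


def save_references(references: Dict[str, str], path: str) -> None:
    """Write a filename -> reference mapping to a JSON file."""
    Path(path).write_text(
        json.dumps(references, indent=2, sort_keys=True), encoding="utf-8"
    )


def load_references(path: str) -> Dict[str, str]:
    """Read back a mapping written by save_references."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Note that keying by `original_filename` keeps only one entry when two uploaded files share a name. Keying by full path or by reference avoids that collision.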

Supported File Types

See Supported File Types for a complete list of supported formats.

Next Steps

File Upload Recipe: Step-by-step guide for uploading and managing files via the API.

Pipelines: Chain processors into versioned, reusable pipelines.

Conversion SDK: Convert documents to Markdown, HTML, JSON, or chunks.

API Limits: Understand rate limits and file size constraints.
