Getting Started

Lexa is Cerevox’s enterprise-grade document parsing API that delivers 10x better performance and accuracy compared to traditional solutions. Unlike other APIs that struggle with complex layouts and structured data, Lexa uses state-of-the-art AI models to extract content with 99.9% accuracy while maintaining native async support and vector database optimization.
Key differentiators:
  • SOTA accuracy with advanced ML models
  • 10x faster processing with sub-second response times
  • Vector DB ready chunks optimized for RAG applications
  • 12+ file formats with consistent results
  • Enterprise-grade reliability with 99.9% SLA
You can start parsing documents in under 5 minutes:
pip install cerevox

from cerevox import Lexa

client = Lexa(api_key="your-api-key")
documents = client.parse(["document.pdf"])

# Get vector DB ready chunks
chunks = documents.get_all_text_chunks(target_size=500)
Our Python SDK handles authentication, retries, and error handling automatically. We also provide comprehensive examples for Django, Flask, FastAPI, and async applications.
Lexa supports 12+ file formats with consistent, high-accuracy parsing:
Documents: PDF, DOCX, PPTX, TXT, HTML, RTF
Spreadsheets: XLSX, CSV, TSV
Google Workspace: Google Docs, Sheets, Slides
Data: JSON, Parquet
All formats support advanced table extraction, image detection, and metadata preservation. File size limits range from 100MB for complex documents to 1GB+ for simple text files.
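Before uploading, it can help to check a file's extension against the formats listed above. The sketch below is a hypothetical client-side helper, not part of the Lexa SDK: the SUPPORTED_EXTENSIONS mapping and the classify_file name are illustrative assumptions built from the list in this section. Google Workspace files are omitted because they have no local file extension.

```python
from pathlib import Path
from typing import Optional

# Extensions taken from the format list above. The category names are
# illustrative and are not part of the Lexa SDK.
SUPPORTED_EXTENSIONS = {
    "documents": {".pdf", ".docx", ".pptx", ".txt", ".html", ".rtf"},
    "spreadsheets": {".xlsx", ".csv", ".tsv"},
    "data": {".json", ".parquet"},
}


def classify_file(path: str) -> Optional[str]:
    """Return the format category for a path, or None if it is not listed."""
    suffix = Path(path).suffix.lower()  # case-insensitive match
    for category, extensions in SUPPORTED_EXTENSIONS.items():
        if suffix in extensions:
            return category
    return None


if __name__ == "__main__":
    for name in ["report.PDF", "data.tsv", "archive.zip"]:
        print(name, "->", classify_file(name))
```

Filtering locally like this only avoids obviously unsupported uploads; the API remains the authority on what it accepts.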

Technical Implementation

Lexa provides native async support with the AsyncLexa client:
import asyncio
from cerevox import AsyncLexa

async def main():
    async with AsyncLexa(api_key="your-api-key") as client:
        # Process multiple documents concurrently
        files = ["report1.pdf", "report2.docx", "data.xlsx"]
        documents = await client.parse(files)

        # Batch process with progress tracking
        async for status in client.parse_with_progress(files):
            print(f"Progress: {status.progress}")

asyncio.run(main())
This enables concurrent processing of multiple documents, significantly improving throughput for batch operations.
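To illustrate why concurrent requests improve throughput, here is a minimal sketch of bounded concurrency with asyncio. It does not call Lexa: fake_parse is a stand-in coroutine and parse_bounded is a hypothetical helper, both introduced only for this example.

```python
import asyncio


async def fake_parse(path: str) -> str:
    """Stand-in for a real parse call; sleeps briefly to simulate network I/O."""
    await asyncio.sleep(0.01)
    return f"parsed:{path}"


async def parse_bounded(paths, limit=5, parse=fake_parse):
    """Run parse() over paths with at most `limit` calls in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def worker(path):
        async with semaphore:
            return await parse(path)

    # gather returns results in the same order as the input paths
    return await asyncio.gather(*(worker(p) for p in paths))


if __name__ == "__main__":
    paths = [f"doc{i}.pdf" for i in range(12)]
    results = asyncio.run(parse_bounded(paths, limit=4))
    print(len(results), results[0])
```

The semaphore caps the number of simultaneous calls, which is one common way to stay under a rate limit while still overlapping waits.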
Lexa is designed specifically for RAG workflows with built-in vector database optimization:
# Get optimally sized chunks for embeddings
chunks = documents.get_all_text_chunks(
    target_size=500,  # Optimal for most embedding models
    overlap=50,       # Maintain context between chunks
    include_metadata=True  # Rich metadata for filtering
)

# Direct integration with vector databases
for chunk in chunks:
    # Each chunk includes source document, page numbers,
    # element types, and confidence scores
    vector_db.upsert(
        id=chunk.id,
        vector=embed(chunk.content),
        metadata=chunk.metadata
    )
We provide pre-built integration examples for Pinecone, Weaviate, ChromaDB, and Qdrant.
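For readers new to chunking, the sketch below shows what target_size and overlap mean in the simplest case: fixed-size character windows where each chunk repeats the tail of the previous one. This is not Lexa's chunking algorithm, which is not described here; chunk_text is a hypothetical helper for illustration only.

```python
from typing import List


def chunk_text(text: str, target_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into windows of target_size characters that overlap by `overlap`."""
    if target_size <= 0:
        raise ValueError("target_size must be positive")
    if not 0 <= overlap < target_size:
        raise ValueError("overlap must be in [0, target_size)")

    step = target_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + target_size])
        if start + target_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


if __name__ == "__main__":
    print(chunk_text("abcdefghij", target_size=4, overlap=1))
```

With a 10-character input, target_size=4 and overlap=1, each chunk starts with the last character of the one before it, which is how overlap keeps context across chunk boundaries.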
Lexa integrates with 7+ major cloud storage platforms:
  • Amazon S3: Direct parsing from S3 buckets and folders
  • Microsoft SharePoint: Sites, drives, and document libraries
  • Google Drive: Files and folders with permission management
  • Box: Enterprise file storage with advanced metadata
  • Dropbox: Personal and business accounts
  • Salesforce: Document attachments and files
  • Coming Soon: Azure Blob, OneDrive, Notion
# Parse entire S3 folder
documents = client.parse_s3_folder(
    bucket="my-bucket",
    folder="documents/",
    recursive=True
)

Performance and Scaling

Lexa delivers industry-leading performance across all metrics:
Speed:
  • Simple PDFs: < 1 second
  • Complex documents (100+ pages): 15-45 seconds
  • Batch processing: 10-50 documents/minute
  • Concurrent async: 100+ documents/minute
Accuracy:
  • Text extraction: 99.9%
  • Table structure: 92.5%
  • Metadata extraction: 99.2%
  • Multi-format consistency: 99.7%
Reliability:
  • API uptime: 99.9% SLA
  • Auto-retry on failures: 3 attempts with exponential backoff
  • Rate limiting: 1000 requests/minute (enterprise plans)
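The SDK performs retries for you, but the general pattern behind "3 attempts with exponential backoff" can be sketched as follows. The base delay of one second and the doubling factor are assumptions for illustration, and retry_with_backoff is a hypothetical helper rather than an SDK function.

```python
import time


def retry_with_backoff(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**n seconds, then try again.

    The final failure is re-raised so the caller can handle it.
    """
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts
            sleep(base_delay * 2 ** attempt)


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky():
        # Fails twice, then succeeds on the third call.
        calls["count"] += 1
        if calls["count"] < 3:
            raise ConnectionError("transient failure")
        return "ok"

    delays = []  # record the waits instead of actually sleeping
    print(retry_with_backoff(flaky, sleep=delays.append), delays)
```

Passing sleep as a parameter keeps the example fast to run and easy to test; production code would use the default time.sleep.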
Lexa is built for enterprise scale with several optimization strategies:
Horizontal Scaling: Our API automatically scales to handle spikes in demand
Batch Processing: Process up to 100 documents per API call
Async Processing: Non-blocking operations with progress callbacks
Caching: Intelligent caching reduces processing time for similar documents
Load Balancing: Global infrastructure ensures low latency worldwide
# Batch processing example (await requires an async context, e.g. inside async def main() with an AsyncLexa client)
large_batch = ["doc1.pdf", "doc2.docx", ...] # 100+ files
documents = await client.parse(
    large_batch,
    progress_callback=lambda status: print(f"Progress: {status.completed}/{status.total}")
)
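If a file list is longer than the 100-documents-per-call limit mentioned above, one option is to split it into batches before calling the API. The helper below is a hypothetical sketch; whether the SDK already splits large lists internally is not stated here, so treat this as a client-side precaution.

```python
def batched(items, batch_size=100):
    """Yield consecutive slices of items, each at most batch_size long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


if __name__ == "__main__":
    files = [f"doc{i}.pdf" for i in range(250)]
    for batch in batched(files):
        # Each batch could be passed to a single parse call.
        print(len(batch), batch[0], "...", batch[-1])
```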
See pricing.
Free Plan (Free):
  • 1000 Documents Parsed
  • Community support
Dev Plan ($5/month):
  • Start with 100 pages
  • $0.05 per additional page
  • 100 requests/minute
  • Email support
  • Vector DB integrations
Pro Plan ($99/month):
  • Start with 10,000 pages
  • $0.01 per additional page
  • 3x cost for advanced processing
  • 100 requests/minute
  • Email support
  • Vector DB integrations
Enterprise Plan (Custom):
  • Unlimited pages
  • 1000+ requests/minute
  • Dedicated support
  • On-premise deployment
  • Custom integrations
All plans include the same high accuracy and all file format support.
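As a worked example of the Dev and Pro figures above, the sketch below estimates a monthly bill for standard processing. It assumes the only charges are the base price plus per-page overage. The 3x multiplier for advanced processing is left out because its exact scope is not specified here. The PLANS table and monthly_cost_cents are illustrative names, and amounts are kept in cents to avoid floating-point rounding.

```python
# Figures copied from the plan list above, expressed in cents.
PLANS = {
    "dev": {"base_cents": 500, "included_pages": 100, "overage_cents": 5},
    "pro": {"base_cents": 9900, "included_pages": 10_000, "overage_cents": 1},
}


def monthly_cost_cents(plan: str, pages: int) -> int:
    """Base price plus per-page overage beyond the included pages."""
    p = PLANS[plan]
    extra_pages = max(0, pages - p["included_pages"])
    return p["base_cents"] + extra_pages * p["overage_cents"]


if __name__ == "__main__":
    for pages in (100, 1_000, 1_980, 3_000):
        dev = monthly_cost_cents("dev", pages) / 100
        pro = monthly_cost_cents("pro", pages) / 100
        print(f"{pages:>5} pages: Dev ${dev:.2f}, Pro ${pro:.2f}")
```

Under these assumptions the two plans cost the same at 1,980 pages per month; above that volume the Pro plan is cheaper.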

Advanced Features

Lexa uses advanced computer vision and ML models to extract tables with high fidelity:
documents = client.parse("financial_report.pdf")

# Access extracted tables
for doc in documents:
    for table in doc.tables:
        print(f"Table on page {table.page_number}")
        print(f"Dimensions: {table.rows}x{table.columns}")
        
        # Export to pandas DataFrame
        df = table.to_pandas()
        
        # Or get raw structured data
        data = table.to_dict()
Features include:
  • Structure preservation: Maintains cell relationships and formatting
  • Multi-page tables: Automatically combines split tables
  • Header detection: Identifies and preserves table headers
  • Data type inference: Automatically detects numbers, dates, etc.
Lexa offers several processing modes and customization options:
from cerevox import ProcessingMode

# Standard processing (fastest)
docs = client.parse("document.pdf", mode=ProcessingMode.DEFAULT)

# Advanced processing (highest accuracy)
docs = client.parse("document.pdf", mode=ProcessingMode.ADVANCED)

# Custom chunking parameters
chunks = docs.get_all_text_chunks(
    target_size=1000,      # Larger chunks for long-form content
    tolerance=0.2,         # 20% size variance allowed
    respect_boundaries=True, # Don't break sentences/paragraphs
    include_tables=True    # Include table content in chunks
)
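One way to read the tolerance parameter is as a symmetric band around target_size. The sketch below computes that band under this assumed interpretation; size_bounds is a hypothetical helper, and the SDK's exact handling of tolerance may differ.

```python
def size_bounds(target_size: int, tolerance: float):
    """Return (min_size, max_size) for a symmetric tolerance around target_size."""
    if not 0 <= tolerance < 1:
        raise ValueError("tolerance must be in [0, 1)")
    low = round(target_size * (1 - tolerance))
    high = round(target_size * (1 + tolerance))
    return low, high


if __name__ == "__main__":
    # target_size=1000 with 20% tolerance
    print(size_bounds(1000, 0.2))
```

Under this reading, a target of 1000 with a tolerance of 0.2 allows chunks between 800 and 1200 units long.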
Contact our team for specialized processing modes for specific document types or industries.
Security is built into every aspect of Lexa:
Data Security:
  • TLS 1.3 encryption for all API communications
  • Documents processed in isolated environments
  • No document storage - processed and deleted immediately
  • SOC 2 Type II certified infrastructure
Access Control:
  • API key authentication with rotation support
  • Role-based access control (enterprise plans)
  • IP whitelisting and VPC connectivity options
  • Audit logging for all API operations
Compliance:
  • GDPR compliant data processing
  • HIPAA compliance available (enterprise)
  • Regional data processing options (US, EU, Asia)
  • On-premise deployment for maximum security
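Key rotation is simpler when the API key is not hard-coded in source files. A minimal sketch, assuming the key is supplied through an environment variable: the LEXA_API_KEY name and the load_api_key helper are hypothetical choices for this example, not SDK conventions.

```python
import os


def load_api_key(env_var="LEXA_API_KEY", environ=None):
    """Read the API key from the environment and fail loudly if it is missing."""
    environ = os.environ if environ is None else environ
    key = environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable")
    return key


if __name__ == "__main__":
    # A plain dict stands in for os.environ so the example runs anywhere.
    print(load_api_key(environ={"LEXA_API_KEY": "example-key"}))
```

The returned value can then be passed as api_key when constructing the client, so rotating a key only requires changing the environment.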

Support and Community

We provide comprehensive support across multiple channels, including community support and direct support.
Direct Support:
  • Email support for Pro and Enterprise customers
  • Video calls for Enterprise customers
  • Dedicated Slack channels for large deployments
  • 24/7 support for mission-critical applications
Stay connected with the Cerevox developer community. We ship new features and improvements every 2-3 weeks based on community feedback.