File Operations
Upload and manage documents that power your RAG Q&A system.
Upload Files
Upload from Local File
```python
from cerevox import Hippo

hippo = Hippo(api_key="your-api-key")

# Upload a local file
file = hippo.upload_file(
    folder_id="folder_123",
    file_path="documents/user-guide.pdf"
)

print(f"Uploaded: {file.name}")
print(f"File ID: {file.id}")
print(f"Status: {file.status}")
```
Upload from URL
```python
# Upload directly from a URL
file = hippo.upload_file_from_url(
    folder_id="folder_123",
    file_url="https://example.com/whitepaper.pdf",
    file_name="whitepaper.pdf"  # Optional custom name
)

print(f"Uploaded from URL: {file.name}")
```
Batch Upload
Upload files one at a time with the sync client, or concurrently with the async client (see Use Async for Batch Uploads below):
```python
files = []
for file_path in ["doc1.pdf", "doc2.docx", "doc3.pptx"]:
    file = hippo.upload_file(folder_id, file_path)
    files.append(file)
    print(f"Uploaded: {file.name}")
```
Supported file types:

Documents:
- PDF (.pdf)
- Word (.docx, .doc)
- PowerPoint (.pptx, .ppt)
- Text (.txt)
- RTF (.rtf)

Spreadsheets:
- Excel (.xlsx, .xls)
- CSV (.csv)
- TSV (.tsv)

Web & Other:
- HTML (.html)
- MHTML (.mhtml)
- Markdown (.md)
File size limits:
- Max file size: 100MB per file
- Contact support for larger files or custom formats
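These constraints can be checked client-side before calling the API, which gives faster and clearer failures than a rejected upload. A minimal sketch based on the format list and 100MB limit above; `SUPPORTED_EXTENSIONS` and `validate_for_upload` are illustrative names, not part of the SDK:

```python
from pathlib import Path

# Illustrative pre-upload validation; these names are not part of the SDK.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".pptx", ".ppt", ".txt", ".rtf",
    ".xlsx", ".xls", ".csv", ".tsv", ".html", ".mhtml", ".md",
}
MAX_FILE_SIZE = 100 * 1024 * 1024  # 100MB per-file limit

def validate_for_upload(file_path):
    """Return None if the file looks uploadable, else a reason string."""
    path = Path(file_path)
    if not path.is_file():
        return "file not found"
    if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
        return f"unsupported format: {path.suffix}"
    if path.stat().st_size > MAX_FILE_SIZE:
        return "file exceeds 100MB limit"
    return None
```

Run this check before `hippo.upload_file` and skip or report any file that returns a reason string.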
List Files
```python
# Get all files in a folder
files = hippo.get_files(folder_id="folder_123")

for file in files:
    print(f"{file.name} - {file.status} - {file.size_bytes} bytes")
```
File status values:
- uploading: File is being uploaded
- processing: File is being indexed
- completed: File is ready for Q&A
- failed: Processing failed
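These statuses make it easy to triage a folder at a glance. A small sketch assuming file objects expose `.status` and `.name` as shown above; both helpers are illustrative, not SDK methods:

```python
from collections import Counter

def summarize_statuses(files):
    """Count files in each processing state.

    Expects objects with a .status attribute, as returned by
    hippo.get_files(); returns e.g. {"completed": 3, "failed": 1}.
    """
    return dict(Counter(f.status for f in files))

def find_failed(files):
    """Return the files whose processing failed."""
    return [f for f in files if f.status == "failed"]
```

A quick `summarize_statuses(hippo.get_files(folder_id=...))` shows whether a folder is fully indexed before you start asking questions.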
Get File Details
```python
# Get specific file information
file = hippo.get_file(file_id="file_456")

print(f"Name: {file.name}")
print(f"Status: {file.status}")
print(f"Size: {file.size_bytes} bytes")
print(f"Pages: {file.page_count}")
print(f"Uploaded: {file.created_at}")
```
Delete Files
```python
# Delete a file
hippo.delete_file(file_id="file_456")
print("File deleted successfully")
```
Deleted files cannot be recovered. The file will be removed from all chats and answers that referenced it.
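One common cleanup task is clearing out files whose processing failed so they can be re-uploaded. A sketch combining the documented get_files and delete_file calls; the helper function itself is hypothetical, not part of the SDK:

```python
def delete_failed_files(hippo, folder_id):
    """Delete every file in the folder whose processing failed.

    Uses the documented get_files/delete_file calls; returns the
    names of the files that were removed.
    """
    deleted = []
    for file in hippo.get_files(folder_id=folder_id):
        if file.status == "failed":
            hippo.delete_file(file_id=file.id)
            deleted.append(file.name)
    return deleted
```

Because deletion is irreversible, consider logging the returned names before re-uploading replacements.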
File Processing
Processing Time
Files are automatically processed after upload:

1. Upload: File is uploaded to Cerevox (a few seconds)
2. Parsing: Document is parsed for text and structure (10 seconds to 2 minutes)
3. Chunking: Content is split into semantic chunks (a few seconds)
4. Indexing: Chunks are indexed for search (10 seconds to 1 minute)
5. Ready: File is ready for Q&A!
Total processing time:
- Small files (< 10 pages): 10-30 seconds
- Medium files (10-100 pages): 30-120 seconds
- Large files (> 100 pages): 2-5 minutes
Monitor Processing Status
```python
import time

# Upload file
file = hippo.upload_file(folder_id, "large-document.pdf")

# Poll until processing completes (or fails)
while file.status not in ("completed", "failed"):
    time.sleep(5)
    file = hippo.get_file(file.id)
    print(f"Status: {file.status}")

if file.status == "completed":
    print("File ready for Q&A!")
else:
    print("Processing failed")
```
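In production code the polling loop is usually wrapped in a helper that also enforces a timeout, so a stuck file cannot hang the caller forever. A sketch assuming only the documented get_file call; `wait_for_file` and its parameters are illustrative:

```python
import time

def wait_for_file(hippo, file_id, timeout=300, interval=5, sleep=time.sleep):
    """Poll get_file until the file completes, fails, or times out.

    Illustrative helper; only get_file is an SDK call. Returns the
    final file object, or raises TimeoutError after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        file = hippo.get_file(file_id=file_id)
        if file.status in ("completed", "failed"):
            return file
        sleep(interval)
    raise TimeoutError(f"file {file_id} still processing after {timeout}s")
```

The injectable `sleep` parameter keeps the helper easy to test without real delays.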
Best Practices
Before uploading:
- Ensure PDFs are text-based (not scanned images)
- Check that documents aren't password-protected
- Verify the file isn't corrupted

Tip: OCR (scanned) PDFs work but may have lower accuracy. Use text-based PDFs when possible.
Reduce processing time:
- Remove unnecessary pages (covers, blanks, ads)
- Compress images in PDFs
- Split very large documents (500+ pages)

Smaller, focused documents process faster and produce better search results.
Upload Related Documents Together
Use Descriptive Filenames
Good: product-api-authentication-guide.pdf
Bad: doc1.pdf, untitled.pdf

Descriptive names help with source citations and debugging.
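Renaming can be automated at upload time. A hypothetical normalizer (not part of the SDK) that lowercases, hyphenates, and optionally prefixes a topic:

```python
import re
from pathlib import Path

def descriptive_name(path, prefix=""):
    """Normalize a filename: lowercase, hyphen-separated, optional topic prefix.

    Illustrative helper, not part of the SDK.
    """
    p = Path(path)
    stem = re.sub(r"[^a-z0-9]+", "-", p.stem.lower()).strip("-")
    suffix = p.suffix.lower()
    return f"{prefix}-{stem}{suffix}" if prefix else f"{stem}{suffix}"
```

Pass the result as the file_name argument of upload_file_from_url, or rename local files before uploading.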
Use Async for Batch Uploads
```python
import asyncio
from cerevox import AsyncHippo

# Concurrent uploads are much faster than sequential for multiple files
async with AsyncHippo(api_key="your-api-key") as hippo:
    tasks = [hippo.upload_file(folder_id, f) for f in files]
    results = await asyncio.gather(*tasks)
```
Async concurrent uploads are significantly faster than sequential.
Complete Example: Batch Upload
```python
import asyncio
from pathlib import Path
from cerevox import AsyncHippo

async def batch_upload_directory(folder_id, directory_path):
    """Upload all PDFs from a directory"""
    async with AsyncHippo(api_key="your-api-key") as hippo:
        # Get all PDF files
        pdf_files = list(Path(directory_path).glob("*.pdf"))
        print(f"Found {len(pdf_files)} PDF files")

        # Upload concurrently
        tasks = [
            hippo.upload_file(folder_id, str(pdf))
            for pdf in pdf_files
        ]
        files = await asyncio.gather(*tasks)

        # Report results
        print(f"\n✅ Uploaded {len(files)} files:")
        for file in files:
            print(f"  - {file.name} ({file.status})")
        return files

# Usage
async def main():
    async with AsyncHippo(api_key="your-api-key") as hippo:
        folder = await hippo.create_folder("Uploaded Docs")
    files = await batch_upload_directory(folder.id, "./documents")

asyncio.run(main())
```
Error Handling
```python
from cerevox import Hippo, HippoError

hippo = Hippo()

try:
    file = hippo.upload_file(folder_id, "document.pdf")
    print(f"Uploaded: {file.name}")
except FileNotFoundError:
    print("Error: File not found")
except HippoError as e:
    if "unsupported format" in str(e).lower():
        print("Error: File format not supported")
    elif "too large" in str(e).lower():
        print("Error: File exceeds size limit")
    else:
        print(f"Error: {e}")
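Transient failures such as network blips are often worth retrying with backoff rather than surfacing immediately. A generic sketch; `upload_with_retry` is illustrative, and in real code you would narrow the except clause to HippoError:

```python
import time

def upload_with_retry(upload, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Retry a flaky zero-argument callable with exponential backoff.

    Example: upload_with_retry(lambda: hippo.upload_file(folder_id, "doc.pdf"))
    Illustrative pattern, not part of the SDK.
    """
    for attempt in range(attempts):
        try:
            return upload()
        except Exception:  # narrow to HippoError in real code
            if attempt == attempts - 1:
                raise  # out of attempts; let the caller handle it
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

Keep retries for transient errors only; a file that is too large or in an unsupported format will fail on every attempt.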
Next Steps