
Batch Processing with LLM APIs (2026)

Quick Answer

The Batch API processes LLM requests in bulk with a 24-hour SLA at 50% of real-time pricing. For any offline workload — nightly data enrichment, bulk classification, document processing, eval runs — batch processing cuts costs in half. The tradeoff: no streaming, no real-time response, and partial failures require retry logic. For workloads that don't need immediate results, batch is almost always the right choice.

When to Use

  • Nightly ETL jobs that enrich data with LLM-generated labels, summaries, or classifications
  • One-time bulk processing of a large document corpus (thousands to millions of documents)
  • Running evaluation sets against multiple models for offline comparison
  • Generating embeddings, summaries, or structured extractions for all records in a database
  • Any LLM task where results can wait up to 24 hours

How It Works

  1. Create a batch request file in JSONL format: one JSON object per line, each with a custom_id, the model, and the messages payload. Submit the file to the Batch API endpoint.
  2. The API returns a batch_id immediately. Poll the status endpoint until the batch completes (typically 30 minutes to 6 hours for most batches). Download the results file when the status indicates completion.
  3. Results are returned in JSONL format with the same custom_id, allowing you to match results to inputs. Failed requests include error details; successful ones include the full response.
  4. Implement retry logic for failed items: extract failed custom_ids from the results, create a new batch with just the failed items, and re-submit. Most failures are transient (rate limit, timeout).
  5. Cost calculation: batch pricing is exactly 50% of real-time pricing. A $3/M token prompt costs $1.50/M in batch. For a 10M token daily job, that's $15/day vs. $30/day — meaningful at scale.
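
The request-line format in step 1 can be sketched as a small builder; the custom_id value and prompt below are illustrative, but the custom_id/params schema matches the Anthropic-style example later in this article:

```python
import json

def build_request_line(custom_id: str, text: str) -> str:
    """Serialize one batch request as a JSONL line (Anthropic-style schema)."""
    request = {
        'custom_id': custom_id,  # your key for matching results back to inputs
        'params': {
            'model': 'claude-3-5-haiku-20241022',
            'max_tokens': 200,
            'messages': [{'role': 'user', 'content': f'Classify: {text}'}],
        },
    }
    return json.dumps(request)  # one JSON object per line, no pretty-printing

line = build_request_line('doc-001', 'Great product, fast shipping.')
parsed = json.loads(line)  # round-trips cleanly
```

Keeping custom_id as your own stable record key (a database primary key, a file path hash) is what makes the later result-matching step trivial.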

Examples

Anthropic Batch API complete workflow
import anthropic
import jsonlines
import time

client = anthropic.Anthropic()

# Step 1: Prepare batch requests
def create_batch_file(documents: list[dict], output_path: str):
    with jsonlines.open(output_path, 'w') as writer:
        for doc in documents:
            writer.write({
                'custom_id': doc['id'],
                'params': {
                    'model': 'claude-3-5-haiku-20241022',
                    'max_tokens': 200,
                    'messages': [{
                        'role': 'user',
                        'content': f'Classify this review as positive/negative/neutral. Return JSON.\n\n{doc["text"]}'
                    }]
                }
            })

# Step 2: Submit batch (after create_batch_file has written batch_requests.jsonl).
# The Python SDK takes a list of request dicts, not a file handle.
with jsonlines.open('batch_requests.jsonl') as reader:
    batch = client.beta.messages.batches.create(requests=list(reader))

print(f'Batch ID: {batch.id}, Status: {batch.processing_status}')

# Step 3: Poll until complete
while True:
    batch = client.beta.messages.batches.retrieve(batch.id)
    if batch.processing_status == 'ended':
        break
    print(f'Status: {batch.processing_status}, succeeded: {batch.request_counts.succeeded}, still processing: {batch.request_counts.processing}')
    time.sleep(60)

# Step 4: Download and parse results
results = {}
for result in client.beta.messages.batches.results(batch.id):
    if result.result.type == 'succeeded':
        results[result.custom_id] = result.result.message.content[0].text
    else:
        print(f'Failed: {result.custom_id}: {result.result.error}')
Complete batch workflow: prepare JSONL → submit → poll → parse results. For 10,000 documents, this takes 30-90 minutes and costs 50% of real-time. custom_id maps results back to source documents.
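
The retry step (step 4 above) can be sketched as a pure helper: given the parsed results dict from the workflow above and the original request list, it returns just the items to resubmit. The field names follow the example's schema:

```python
def collect_failed(results: dict[str, str], requests: list[dict]) -> list[dict]:
    """Return the original request dicts whose custom_id has no successful result."""
    succeeded = set(results)  # keys of the results dict are succeeded custom_ids
    return [req for req in requests if req['custom_id'] not in succeeded]

# Usage: after parsing a batch, build the follow-up batch from the failures.
requests = [{'custom_id': f'doc-{i}'} for i in range(5)]
results = {'doc-0': 'positive', 'doc-2': 'neutral', 'doc-4': 'negative'}
retry_batch = collect_failed(results, requests)  # doc-1 and doc-3 remain
```

Resubmitting `retry_batch` as a fresh batch (with a retry cap of two or three rounds) handles the typical 1-5% transient failure rate without manual intervention.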

Common Mistakes

  • No retry logic for failed items — batch jobs regularly have 1-5% failure rates from transient errors. Without retry logic, you silently lose that percentage of your data. Always check result.result.type and reprocess failed items.
  • Mixing time-sensitive and batch-appropriate queries — using batch for interactive features breaks user experience. Only use batch for truly offline workflows. If a feature can be moved to async/offline without impacting UX, it's a batch candidate.
  • Not chunking very large batches — the Batch API has limits on request file size. For very large jobs (millions of requests), split into chunks of 50,000-100,000 requests each and submit as separate batch jobs.
  • Polling too frequently — polling every 5 seconds for a 2-hour batch job wastes API calls and risks rate limiting. Poll every 30-60 seconds; most batch jobs complete in 30 minutes to 4 hours.
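
The chunking advice above is a one-liner in practice: split the request list before submission, with the chunk size set to your provider's per-batch cap:

```python
def chunk_requests(requests: list, size: int = 50_000) -> list[list]:
    """Split a request list into per-batch chunks no larger than `size`."""
    return [requests[i:i + size] for i in range(0, len(requests), size)]

chunks = chunk_requests(list(range(125_000)), size=50_000)
# → 3 chunks: 50,000 + 50,000 + 25,000 — submit each as its own batch job
```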

FAQ

What's the maximum batch size?

Anthropic: 100,000 requests per batch, 256MB file size limit. OpenAI: 50,000 requests per batch, 100MB file size limit. For larger jobs, split into multiple sequential or parallel batches. There's no limit on the number of batches you can submit.

How do I estimate batch completion time?

Typical completion times: under 10K requests → 15-30 minutes; 10K-100K requests → 1-4 hours; 100K+ requests → 4-12 hours. Times vary with API load. Don't design systems that require exact batch completion times — build for asynchronous delivery with a reasonable SLA (4 hours for most production use cases).

Can I use batch processing for embeddings?

Yes — OpenAI's Batch API supports the text-embedding-3 models. Anthropic's Batch API covers messages only; Anthropic does not currently offer an embeddings endpoint. For large-scale embedding jobs, batch processing through a provider that supports it (such as OpenAI) is the recommended approach for bulk indexing.
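
An embeddings batch request line follows OpenAI's documented batch schema (custom_id/method/url/body); the id and input text below are illustrative:

```python
import json

def embedding_request_line(custom_id: str, text: str) -> str:
    """One JSONL line for OpenAI's Batch API targeting the embeddings endpoint."""
    return json.dumps({
        'custom_id': custom_id,
        'method': 'POST',
        'url': '/v1/embeddings',
        'body': {'model': 'text-embedding-3-small', 'input': text},
    })

line = embedding_request_line('rec-42', 'batch processing cuts costs in half')
```

The file is then uploaded with purpose 'batch' and submitted with the endpoint set to /v1/embeddings, mirroring the chat-completions batch flow.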

What happens if a batch fails partially?

Each request in a batch is independent. A partial failure means some requests succeeded and some failed. The results file includes status for each request. Extract failed custom_ids, determine the failure reason, fix if necessary (e.g., malformed input), and resubmit just the failed items in a new batch.

Is there a streaming option in batch mode?

No — batch processing is entirely asynchronous. There's no streaming or progressive result delivery. The entire batch completes, then you download all results at once. If you need progressive results (process as they complete), run parallel real-time requests with asyncio instead of batch.
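
The parallel real-time alternative can be sketched with asyncio.as_completed, which hands you each result the moment it finishes; call_model below is a stand-in for your client's real async API method:

```python
import asyncio

async def call_model(doc_id: str) -> tuple[str, str]:
    """Stand-in for a real async API call; swap in your client's async method."""
    await asyncio.sleep(0.01)  # simulated network latency
    return doc_id, f'result-for-{doc_id}'

async def process_progressively(doc_ids: list[str]) -> dict[str, str]:
    results = {}
    tasks = [asyncio.create_task(call_model(d)) for d in doc_ids]
    for finished in asyncio.as_completed(tasks):  # yields tasks as they finish
        doc_id, result = await finished
        results[doc_id] = result  # handle each result immediately, not at the end
    return results

results = asyncio.run(process_progressively(['a', 'b', 'c']))
```

This pattern pays full real-time pricing, so reserve it for the workloads where progressive delivery actually matters.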
