Pipeline Parallelism

Quick Answer

Splitting a model's layers across multiple GPUs so that different GPUs can process different batches simultaneously.

Pipeline parallelism splits a model's layers into stages and assigns each stage to a different GPU. Batches flow through the stages like an assembly line: while later GPUs are still finishing one batch, earlier GPUs have already started the next, so all stages work concurrently. This overlap hides latency, but the idle time at the start and end of each pass (the "bubble") can be significant when there are few batches in flight. As a result, pipeline parallelism improves throughput more than per-request latency, since each batch still traverses every stage in sequence. It is complementary to tensor parallelism, which splits individual layers across GPUs rather than assigning whole layers to each, and the optimal configuration depends on the hardware and interconnect.
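The bubble effect above can be quantified with the standard GPipe-style estimate: with `p` pipeline stages and `m` microbatches, a full pass occupies `m + p - 1` time slots per stage, of which `p - 1` are idle. A minimal sketch (the function name and the printed scenario are illustrative, not from any particular library):

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Fraction of slots a GPU sits idle in a GPipe-style pipeline schedule.

    With p stages and m microbatches, a full pass takes m + p - 1 slots
    per stage, of which p - 1 are bubble (idle) slots.
    """
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)

# More microbatches shrink the bubble, improving throughput -- but each
# microbatch still traverses every stage, so per-request latency is unchanged.
for m in (1, 4, 16, 64):
    print(f"4 stages, {m:>2} microbatches -> bubble {bubble_fraction(4, m):.0%}")
```

This is why serving systems favor many small microbatches per pipeline pass: at 4 stages, going from 1 to 64 microbatches cuts the idle fraction from 75% to under 5%.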

Last verified: 2026-04-08
