Training
Synthetic Data
Quick Answer
Training data generated by models or algorithms rather than manually created.
Synthetic data is produced by a model or algorithm rather than written or labeled by hand. Because examples can be generated on demand, it lets teams scale data creation and address data scarcity, for instance when building a fine-tuning set. The main risk is that the generator's errors and biases carry over into the model trained on its outputs and can be amplified there. Best practices: generate with a high-quality model, filter outputs before use, and use synthetic data to augment (not replace) real data. Careful evaluation of the resulting model, ideally on real held-out data, is necessary to confirm that the synthetic examples help. Recent work shows synthetic data can effectively augment training.
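The best practices above (generate, filter, then augment real data) can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not a reference implementation: `generate_synthetic` is a hypothetical template-based stand-in for a real model call, and the `min_chars` threshold and `max_synthetic_ratio` cap are illustrative parameters, not recommended values.

```python
import random


def generate_synthetic(n: int, seed: int = 0) -> list[str]:
    """Hypothetical stand-in for a generator model: fills templates at random."""
    rng = random.Random(seed)
    templates = ["Translate '{w}' to French.", "Define the word '{w}'.", "{w}"]
    words = ["apple", "river", "cloud"]
    return [rng.choice(templates).format(w=rng.choice(words)) for _ in range(n)]


def filter_outputs(examples: list[str], min_chars: int = 15) -> list[str]:
    """Crude quality filter: drop very short outputs and case-insensitive duplicates."""
    seen = set()
    kept = []
    for example in examples:
        text = example.strip()
        key = text.lower()
        if len(text) < min_chars or key in seen:
            continue
        seen.add(key)
        kept.append(text)
    return kept


def augment(real: list[str], synthetic: list[str],
            max_synthetic_ratio: float = 1.0) -> list[str]:
    """Append synthetic examples to real ones, capped relative to the real count."""
    cap = int(len(real) * max_synthetic_ratio)
    return real + synthetic[:cap]


# Usage: real data stays first and intact; synthetic data only tops it up.
real = ["What is the capital of France?", "Summarize this paragraph."]
synthetic = filter_outputs(generate_synthetic(20))
train_set = augment(real, synthetic, max_synthetic_ratio=1.0)
```

Capping synthetic examples relative to the real set is one simple way to keep synthetic data in an augmenting role. In practice the filter would likely be stronger (for example, a classifier or a second model scoring each output), and the mix would be validated against real held-out data.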
Last verified: 2026-04-08