I’ve been thinking about the role of synthetic datasets in data projects, especially now that LLMs and generative models make data generation much easier.

On one hand, synthetic data can help with privacy, class imbalance, rare cases, benchmarking, and testing pipelines when real data is limited or sensitive. On the other hand, I’m not sure how people evaluate whether a synthetic dataset is actually useful rather than just plausible-looking. Distribution shift, hidden bias, leakage from source data, and weak evaluation seem like real risks.

For people who have used synthetic datasets in practice: when did they work well, and when did they fail? Also, what checks or metrics do you use before trusting a synthetic dataset for training, evaluation, or analysis?

Thanks in advance for any thoughts.…