Systems note

How does table statistics affect query performance?

Recently I noticed something surprisingly impactful: simply running TABLE ANALYZE right after a CTAS (CREATE TABLE AS SELECT) can make subsequent queries on that table 40–50% faster in both Spark and Trino.

CTAS is a common pattern when materializing intermediate results or shipping curated tables to downstream users. The problem is that immediately after CTAS, the table’s statistics are often missing or badly outdated. The query engine has no idea about basic properties like row counts, value distributions, or data skew, so the optimizer falls back to conservative defaults and often picks suboptimal join orders, join strategies, or partition pruning plans.

TABLE ANALYZE (or the engine-specific equivalent) fills in those statistics. Once the engine has real row counts and column stats, it can choose better join orders, more appropriate shuffle strategies, and smarter filters. In workloads with skew or large joins, this shows up as noticeably faster queries and more predictable runtimes.

[Add your latency or throughput graph comparing CTAS-only vs CTAS+TABLE ANALYZE here]

Figure: Queries run 40–50% faster after collecting table statistics.

When to use this pattern

In practice, the pattern is simple: after the CTAS finishes, immediately trigger TABLE ANALYZE on the new table (possibly in an asynchronous job) so that every downstream query benefits. It’s a tiny addition to the pipeline, but in my experiments it consistently delivered 40–50% faster queries on Spark and Trino.