SQL Query and performance tuning - DB stats and identifying slow queries

 


📊 Understanding Table Statistics in PostgreSQL (With Simple Examples)

Efficient queries depend on knowing your data. PostgreSQL uses table statistics to build better query plans, and you can use them too: to diagnose slow queries, analyze data distributions, and optimize performance.

Let’s explore how to gather and use these statistics using tools like:

  • ANALYZE

  • pg_stat_statements

  • auto_explain

  • pg_stats


📦 Why Do Table Statistics Matter?

PostgreSQL keeps metadata about:

  • 🔢 Number of rows in each table

  • 💾 Storage usage

  • 🔄 Value frequency (e.g., most common values)

  • 📉 Data distribution (e.g., histograms, skew)

  • ⚖️ Cardinality (how many distinct values a column has)

This info helps the query planner decide how to execute your queries efficiently.
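You can see these statistics at work in any query plan: the planner's row estimates come directly from them. A quick illustration (the table and column names here are only placeholders):

```sql
-- The "rows=" figure in the resulting plan is the planner's estimate,
-- derived from the statistics that ANALYZE collects. If it is wildly
-- off from the real row count, your statistics are stale or too coarse.
EXPLAIN SELECT * FROM users WHERE country = 'US';
```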


⚙️ 1. Collecting Stats with ANALYZE

Run ANALYZE to gather statistics on tables:

ANALYZE your_table;

You can also combine it with VACUUM to clean and analyze at the same time:

VACUUM ANALYZE your_table;

📌 Note: ANALYZE produces no human-readable output; its results are stored internally for the planner. You can inspect them through the pg_stats view.
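If estimates for one column stay poor, you can ask ANALYZE to sample it in more detail by raising its statistics target (the table and column names below are placeholders):

```sql
-- The default target is 100; a higher value means a larger sample,
-- more histogram buckets, and more most-common-value entries.
ALTER TABLE your_table ALTER COLUMN your_column SET STATISTICS 500;
ANALYZE your_table;
```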


🧼 What’s the Role of VACUUM?

The VACUUM command cleans up dead rows left behind after updates/deletes.

✅ Regular VACUUM

  • Reclaims space

  • Does not lock the table

  • Can be run automatically via the autovacuum daemon

VACUUM your_table;

💣 VACUUM FULL

  • Locks the table during operation

  • Rewrites the table entirely to reclaim more space

  • Requires extra disk space (old + new copy during operation)

VACUUM FULL your_table;

🧪 Combine with ANALYZE

Want to clean up and update stats together?

VACUUM FULL ANALYZE your_table;

🔧 Rebuilding Indexes: REINDEX

Normally, you don’t need to manually rebuild indexes. But if you suspect corruption or you’re fixing bloat, you can use:

REINDEX INDEX index_name;
REINDEX TABLE table_name;
REINDEX SCHEMA schema_name;

This rebuilds the indexes from scratch (B-tree or otherwise), freeing unused pages and correcting any inconsistencies.
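One caveat: plain REINDEX takes a lock that blocks writes to the table while it runs. On PostgreSQL 12 and later you can rebuild without blocking writers:

```sql
-- Builds a new index alongside the old one, then swaps them.
-- Slower and needs extra disk space, but writes are not blocked.
REINDEX INDEX CONCURRENTLY index_name;
```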


✅ Best Practices Summary

Task | Command | When to Use
Gather stats | ANALYZE | After bulk inserts or deletes
Clean dead rows | VACUUM | Regular maintenance (or autovacuum)
Reclaim more space | VACUUM FULL | Only when lots of space can be reclaimed
Clean + analyze together | VACUUM FULL ANALYZE | After large deletes + need updated stats
Rebuild indexes | REINDEX | If indexes are corrupted or bloated


🚀 About ANALYZE and VACUUM

Statistics are the silent engine behind PostgreSQL performance. Without them, the planner guesses — and guesses can be wrong.

By regularly using ANALYZE, VACUUM, and REINDEX, you’re giving PostgreSQL the visibility it needs to run queries faster and smarter.


📊 2. View Stats with pg_stats

pg_stats is a readable view into PostgreSQL’s internal statistics.

SELECT * FROM pg_stats WHERE tablename = 'users';

๐Ÿ” Key Columns:

ColumnDescription
attname                 Column name
null_frac                 Fraction of NULLs
n_distinct                 Number of distinct values
most_common_vals                 Most frequent values
histogram_bounds               
                Distribution spread (like a bell curve)


🧪 Example:

SELECT attname, null_frac, n_distinct, most_common_vals
FROM pg_stats
WHERE tablename = 'users' AND attname = 'country';

This helps you answer questions like:

  • Are there too many NULLs?

  • Are most users from one country?

  • Should I create an index on this column?
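To see how dominant the top values are, also query most_common_freqs, which gives the fraction of rows each common value accounts for (the users/country example is again illustrative):

```sql
-- If the top frequency is close to 1.0, a single value dominates the
-- column, and the planner may skip an index for queries matching it.
SELECT most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'users' AND attname = 'country';
```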


📈 3. Track SQL Performance with pg_stat_statements

Want to know which queries are the slowest?

Enable pg_stat_statements in postgresql.conf:

shared_preload_libraries = 'pg_stat_statements'

Then restart PostgreSQL and run:

CREATE EXTENSION pg_stat_statements;

Query the view:

SELECT query, calls, total_time, rows
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 5;

(On PostgreSQL 13 and later, the timing column is named total_exec_time.)

🧠 You’ll See:

  • The exact SQL statement

  • Number of times it ran (calls)

  • Total time spent

  • Rows returned

📌 Use this to identify frequent or slow queries to tune.
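Total time favors queries that simply run often. To surface queries that are slow per execution, order by the mean instead, and reset the counters when you want a fresh measurement window (column name as in PostgreSQL 13+):

```sql
-- Slowest queries by average execution time.
SELECT query, calls, mean_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 5;

-- Clear accumulated statistics to start a new measurement window.
SELECT pg_stat_statements_reset();
```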


๐Ÿข 4. Find Slow Queries with auto_explain

You can automatically log query plans for slow queries — without manually running EXPLAIN.

Enable auto_explain:

shared_preload_libraries = 'auto_explain'

Set minimum execution time (e.g., 200ms):

auto_explain.log_min_duration = '200ms'

📄 This logs the plan for any query slower than 200ms in your PostgreSQL log.

Use this in production to catch:

  • Missing indexes

  • Table scans

  • Bad joins
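A minimal sketch of the relevant postgresql.conf settings (values are illustrative; auto_explain can also be loaded for a single session with LOAD 'auto_explain'):

```
shared_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = '200ms'   # log plans for queries slower than this
auto_explain.log_analyze = on             # include actual row counts and timings
auto_explain.log_buffers = on             # include buffer usage in the plan
```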


📉 Bonus: Understand Data Skew

Skew = when values aren’t evenly spread. Here are 3 common types:

Skew Type | What It Means
Uniform | Values spread evenly
Negative skew | More values on the higher end
Positive skew | More values on the lower end


💡 Why it matters: if the planner assumes uniform data but the real data is skewed, its row estimates can be far off, leading to bad query plans.

You can spot skew via histograms in pg_stats.
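As an informal illustration (plain Python, not PostgreSQL), you can gauge the direction of skew in a sample by comparing its mean and median: a mean above the median suggests positive skew, below it suggests negative skew:

```python
from statistics import mean, median

def skew_direction(values):
    """Rough skew indicator: compare mean to median."""
    m, med = mean(values), median(values)
    if m > med:
        return "positive"   # long tail toward higher values
    if m < med:
        return "negative"   # long tail toward lower values
    return "roughly symmetric"

# A few large outliers pull the mean above the median: positive skew.
print(skew_direction([1, 1, 2, 2, 3, 50]))   # positive
```

The same intuition applies to the histogram_bounds in pg_stats: bucket boundaries crowded at one end of the range mean the values pile up there.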


✅ Summary: Tools for PostgreSQL Statistics

Tool | Use Case
ANALYZE | Collect or refresh table statistics
pg_stats | View per-column data distribution
pg_stat_statements | Track query performance
auto_explain | Log slow query plans automatically


🚀 Next Steps

Want to improve query speed and understand your data better?

  1. Start by running ANALYZE

  2. Explore pg_stats to learn about your data

  3. Enable pg_stat_statements to find slow queries

  4. Use auto_explain in production to diagnose bad plans




