What is Document Indexing & How to automate it?

From Declarative SQL to Procedural Execution Plans

Understanding the Journey Behind Your Queries

🧠 Why SQL Is So Widely Used

SQL (Structured Query Language) has come a long way from its roots as a tool for querying relational databases. Today, it powers data analytics in platforms like Apache Spark, and even in streaming tools like Apache Kafka.

So, what makes SQL so universal?

✅ It’s Ideal for Tabular Data

Most data in the real world naturally fits into tables:

🛍 Retailers track customer purchases in structured tables.
🏥 Healthcare systems manage patient records in relational databases.
💳 Financial institutions store transaction histories in tables.

It's tough to find an industry that doesn't benefit from tabular data, and by extension, SQL.

📣 SQL Is Declarative — But What Does That Mean?

SQL is a declarative language, meaning you state what you want, not how to get it.

Example:

SELECT name, email FROM customers WHERE city = 'Tallinn';

You don’t specify how the database should fetch this data, you just declare the result you want.

This makes SQL different from procedural programming languages like:

Java
Python
C++

In those, you have to define the exact steps to perform operations, such as iterating through arrays or handling logic flow.

🧰 Under the Hood: From SQL to Execution Plan

Although SQL is declarative, the database still needs to figure out how to execute your request. That’s where execution plans come in.

An execution plan is the procedural version of your SQL query. It’s a sequence of low-level operations like:

🔍 Scans
⚙️ Filters
🔗 Joins

Think of it like this:

You say:

“I want a list of customers in Austin.”

The database replies:

“Got it. I’ll scan the customers table, apply a filter for city = 'Austin', then return the relevant columns.”

🚀 Query Optimization: Why Efficiency Matters

Just like you'd write efficient code in a procedural language, execution plans need optimization. Different plans can yield the same result, but some are much faster.

🧪 Example: Filtering Customer Purchases

Let’s say you want to retrieve customer purchases where the ID starts with 3014:

Inefficient Plan:
Scan all 10 million rows to find the matching records.

Optimized Plan:
Use an index to jump straight to the relevant rows, much faster!

🔍 Beyond Tables: Advanced SQL Features

Although SQL’s foundation is tabular, it also supports more advanced features like:

ROLLUPs
CUBEs
Window functions

These let you create rich summaries and multidimensional reports. While not strictly tabular in appearance, they still operate on table-like structures.

🔍 Scanning Tables and Indexes

How Databases Read Your Data — Efficiently or Not

📦 What Is a Table Scan?

One of the most fundamental operations in a database execution plan is a scan. It's exactly what it sounds like: going row by row through a table to find relevant data.

🧠 Simple Definition:

Table Scan = Read each row → Apply filter → Move to next row.

This is known as a linear operation, and it happens whether the data is being read from persistent storage (like disk) or in-memory cache.

🧪 Example: Table Scan in Action

Let’s say you want to retrieve all orders placed in 2025:

SELECT * FROM orders WHERE order_year = 2025;

If there’s no index on order_year, the database will scan every single row in the orders table, checking the order_year value one at a time.

That’s a full table scan—and it can be expensive for large tables.

📘 Full Table Scans: When Are They Okay?

✅ Good for:

Small tables
Queries that touch most rows (e.g., aggregate reports)

❌ Inefficient when:

You only need a few rows
The query is run frequently on a large dataset

If you spot a full table scan in your query plan, it’s worth asking:

“Is there a better way to retrieve this data?”

🌿 Enter Indexes: The Fast Lane of Data Access

Indexes are special data structures that make it faster to look up rows without scanning the entire table.

⚙️ How Indexes Work:

An index entry stores:
- A key value (e.g., customer ID, order date)
- A pointer to the row location in the table
Indexes are typically stored in trees, which allow for fast searching.

🧱 Types of Indexes (and When to Use Them)

Index Type	Best For	Description
B-Tree	Equality & Range Queries	Most common type. Used when searching values like `age > 30` or `name = 'John'`.
Hash Index	Equality Queries Only	Fast lookup for exact matches, like `email = 'abc@xyz.com'`.
Bitmap Index	Columns with Few Unique Values	Great for gender, status, or flags (`is_active = true`). Used for set operations.
Specialized	Custom Data (e.g., Geospatial)	PostgreSQL and others support GIN, GiST, etc., for non-standard data types.

🌳 Index Structures: Why Trees?

Trees (especially balanced trees like B-Trees) are efficient for:

Keeping data sorted
Allowing quick traversal
Minimizing disk reads

Think of a B-Tree like an efficient phone book—you don’t flip through every page to find a name. You jump to the right section and narrow down quickly.

🧠 Columnar Storage: A Special Case

In analytical databases (like Amazon Redshift or Google BigQuery), data is stored in columns rather than rows. This makes it:

Highly efficient for queries filtering on just a few columns
Better suited for OLAP workloads (e.g., dashboards, BI)

⚠️ However, the content discussed here assumes row-oriented storage, which is the norm in transactional systems like MySQL, PostgreSQL, and Oracle.

📌 Best Practices

✅ Use Indexes When:

You're filtering on a column that’s used often in WHERE clauses
The table is large and scanned frequently
The column has sufficient cardinality (many unique values)

❌ Avoid Index Overuse:

Too many indexes slow down writes (INSERT/UPDATE/DELETE)
Use query execution plans to confirm index usage

🧭 Summary: Scan or Index?

Operation	Speed	Best For
Full Table Scan	Slow on large tables	Small tables or broad queries
Index Scan	Fast (when applicable)	Narrow filters, large datasets

Always check your execution plan. If you're scanning more than necessary, an index might save the day.

Search This Blog

Simply Everything I want to share

SQL Query and performance tuning - Indexes