Post

Interview_Index, query execution plan, hint

βœ… What is the difference between random I/O and sequential I/O?

  • random I/O: access data at arbitary locations on storage device
  • no order
  • WHERE, find with conditions
  • need to skip several rows
  • πŸ‘ŽπŸ» higher seek time
  • πŸ‘ŽπŸ» need to move disk head more times
  • πŸ‘ŽπŸ» read/write head has to move more

  • sequential I/O: access data in a continuous, ordered way
  • read all data
  • find sequential data
  • find with index
  • πŸ‘πŸ» more efficient
  • πŸ‘πŸ» minimize head movement

βœ… What is an index in a database?

data structure to help speed up data retrieval operation

  • like a look up table
  • like index of dictionary

  • key + value structure
  • store data from column βž• pointer to corresponding rows

    • value from colum: actual data(empID)
    • pointer: address to the actual row in the table(row ID, physical address)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
- Employee table
| EmpID | Name    | Department |
| ----- | ------- | ---------- |
| 101   | Alice   | HR         |
| 102   | Bob     | Sales      |
| 103   | Charlie | HR         |

- create Index on EmpID
Key   | Value (RowID / Pointer)
------+------------------------
101   | β†’ Row 1
102   | β†’ Row 2
103   | β†’ Row 3

- SQL
SELECT * FROM Employees WHERE EmpID = 102;
  • index is always SORTED
  • πŸ‘πŸ» SELECT, scan entire table ❌, quickly locate data ⭕️
  • πŸ‘ŽπŸ» INSERT, UPDATE, DELETE

βœ… How does an index work?

  • index allow DB to skip full table scan

  • βœ”οΈ B-Tree index: maintain balanced binary tree structure
  • tree with sorted keys
  • DB navigates through the tree to find target key, get value, then access the associated data
  • B+Tree is used in MySQL, PostgreSQL

  • βœ”οΈ Hash index: use hash function to map keys to specific buckets
  • key-value hash table

  • βœ”οΈ Bitmap index: bitmap for each value
  • πŸ‘πŸ» efficient for low-cardinality columns

βœ… What factors should you consider when creating an index?

  • 1️⃣ Query patterns: add index to WHERE, JOIN, ORDER BY, GROUP BY
  • 2️⃣ Selectivity: high selectivity = many unique values
  • 3️⃣ Read vs. Write workload: index speed up reads, slow down writes
  • 4️⃣ Table size: index is better for large tables
  • 5️⃣ Sort, Group: index help sort, grouping columns
  • 6️⃣ uniqueness: use UNIQUE index to enforce data integrity
1
2
3
4
5
When is index most efficient?
- πŸ‘πŸ» GROUP BY, ORDER BY
- πŸ‘πŸ» high cardinality, lots of unique values
- πŸ‘πŸ» more read, not write
- πŸ‘πŸ» big table

βœ… What factors should we be aware when creating index?

  • 1️⃣ aware of Range conditions like BETWEEN, LIKE, <, >
  • index is only used up to the column where range starts
  • columns after that in multi-column index are ignored
-- composite index in month, city

WHERE month BEWTEEN 202301 AND 202312
  AND city = Seoul

-- month use index
-- city might NOT use index
  • 2️⃣ Use Equality Conditions
  • use = and IN
-- replace range with equality

-- not efficient
AND month BETWEEN 200801 AND 200812

-- more efficient
AND month IN ('200801','200802','200803','200804','200805','200806','200807','200808','200809','200810','200811','200812')
  • 3️⃣ aware of OR
  • AND: narrow down rows
  • OR: increase number of rows to scan
  • 🚫 avoid using too many OR

  • 4️⃣ Do not CAST
  • do not modify or cast indexed columns
-- column is String
WHERE CAST(col AS INT) = 123
-- πŸ‘ŽπŸ» disable index due to data mismatch
  • πŸ’Š Match data type

βœ… What should you be careful about when using indexes?

  • ⚠️ overuse of index slow down write operations
  • ⚠️ index maintenance overhead: changing indexed columns often lead to constant re-balancing of index structure
  • ⚠️ using index in low cardinality columns: if use index in gender column with only a few distinct values, index will not be effective!
  • ⚠️ correct order in composite(multi-level) index: composite index query should start with first column. Put high-cardinality columns first

βœ… Is it good to create many indexes on a table?

  • πŸ‘πŸ» faster read, faster query performance
  • πŸ‘πŸ» optimized sort, grouping

  • index also has trade-offs
  • πŸ‘ŽπŸ» slower write
  • πŸ‘ŽπŸ» increased storage usage
  • πŸ‘ŽπŸ» complexity in index management
  • πŸ‘ŽπŸ» risk of suboptimal execution plans due to too many choices
  • πŸ’‘ suggested less than 3 indexes per table

πŸ’‘ What is a covering index?

covering index: index that contains all the columns needed to satisfy a query
DB can answer the query just using index, WO accessing table

  • index holds required data
  • DB DOES NOT NEED to access actual table rows
  • no table lookup
  • πŸ‘πŸ» reduce I/O, just get data from index, not real table
  • πŸ‘πŸ» faster read, do not have to access table
  • πŸ‘πŸ» efficient for read
1
2
3
4
5
6
7
8
9
10
11
12
13
CREATE TABLE users (
  id INT PRIMARY KEY,
  email VARCHAR(255),
  age INT
);

-- create covering index
CREATE INDEX idx_email_age ON users (email, age);

SELECT age FROM users WHERE email = 'john@example.com';
-- WHERE clause knows wmail
-- SELECT needs only age
-- index idx_email_age already has both email and age

πŸ’‘ What is a multi-column index (or composite index)?

multi-column index: single index that includes multiple columns from table
useful when queries filter or sort using more than one column

  • when queries involve more than one column
  • ⭐️ order of column matters
  • index is most effective when query filters based on leftmost columns first
  • better to set from broad to narrow categories
  • λŒ€λΆ„λ₯˜ ➑️ 쀑뢄λ₯˜ ➑️ μ†ŒλΆ„λ₯˜
  • food ➑️ vegetable ➑️ carrot
1
an index on (first_name, last_name) can efficiently support queries filtering on first_name or both, but not just last_name
  • example of composite index
1
2
3
4
5
6
7
8
9
10
CREATE INDEX idx_email_age ON users (email, age);

-- ⭕️ efficient, use leftmost column
SELECT * FROM users WHERE email = 'john@example.com';

-- ⭕️ efficient, use both index columns
SELECT * FROM users WHERE email = 'john@example.com' AND age = 30;

-- ❌ not efficient, skips first column
SELECT * FROM users WHERE age = 30;

πŸ’‘ What are nodes in B-Tree and B+Tree indexes?

  • root node: topmost node, one node per tree, search starts here
  • internal node: node that routes search paths, between root and leaf node
  • leaf node: most bottom node, node that stores actual data(same leaf node, same depth)
  • linked leaf nodes: B+Tree only, connected from left ➑️ right for range scanning
1
2
3
4
5
        [50]       ← Root
        /    \
   [20, 40] [60, 80]   ← Internal nodes
     |
[10] [20] [30] [40]   ← Leaf nodes

πŸ’‘ Can you explain the difference between B-Tree and B+Tree indexes?

  • 곡톡점: balanced tree structures used in indexing
  • 차이점: how/where they store data

  • βœ”οΈ B-Tree: store key and actual data in both internal and leaf nodes
  • tree with sorted keys
  • DB navigates through the tree to find target key, get value, then access the associated data

  • πŸ‘πŸ» fast point lookup
  • πŸ‘ŽπŸ» increase node size and I/O
  • πŸ‘ŽπŸ» need more memory, data is saved both in root, branch, leaf node
1
2
3
4
5
6
internal node: store key βž• data
leaf node: store key βž• data
leaf connection: not linked
point lookup: fast
range query: slow(internal nodes are not linked, has to go through multiple levels)
use: old DBs
  • example
1
2
3
4
5
6
7
if DB is [5, 10, 15, 20, 25, 30, 35, 40]

           [15] /*internal node*/
          /    \
      [5,10]  [20,25,30] /*leaf nodes*/
                   \
                [35,40]
  • βœ”οΈ B+Tree: store data in only in leaf node
  • store only key and pointer in root, internal node
  • actual data resides in leaf nodes
  • leaf nodes are linked by LinkedList together
  • πŸ‘πŸ» fast range query, support sequential scans BETWEEN, <, >, LIKE %abc%
  • πŸ‘πŸ» memory efficiency, only has to save data in leaf node
  • used as default in modern databases
  • used in MySQL, PostgreSQL
1
2
3
4
5
6
internal node: only keys
leaf node: store key βž• data
leaf connection: linked
point lookup: fast
range query:fast(leaces are linked)
use: modern RDBMS
  • example
1
2
3
4
5
6
7
8
9
10
11
if DB is [5, 10, 15, 20, 25, 30, 35, 40]

           [15] /*internal node: store only keys*/
          /    \
        [5,10]  [20,25]  /*leaf node: store key and data*/
                   \
                  [30,35]
                     \
                    [40]

Leaf node chain: [5,10] β†’ [20,25] β†’ [30,35] β†’ [40] /*linked*/
  • example of real-world query
1
2
3
4
5
6
CREATE INDEX idx_age ON users (age);

SELECT id, email
FROM users
WHERE age BETWEEN 20 AND 30
ORDER BY id;

πŸ’‘ What is a Hash index, and when would you use it?

use hash function to map search keys into fixed-sized bucket number

  • hash function maps the key to bucket number
  • id = 15 ➑️ hash function ➑️ Hash table bucket number 3

  • πŸ‘πŸ» equality search WHERE id = 13
  • retrieve search data in constant time O(1)
  • πŸ‘ŽπŸ» not suitable for range query ORDER BY
  • Why? hash function destroy natural ordering of data
  • πŸ‘ŽπŸ» not suitable for prefix match LIKE 'abc%'

  • used in Memory tables or special DB
  • Hash index is NOT used in InnoDB(MySQL)

πŸ’‘ What is a clustering index?

determine physical order of data in the table

  • rows are stored on disk in the same order as index
  • only ONE clustering index per table is possible
  • Why? only one row can be clustering index
  • (id순으둜 μ •λ ¬ν•΄λ‘˜μ§€, λ‚˜μ΄μˆœμœΌλ‘œ μ •λ ¬ν•΄λ‘˜μ§€, λ‚ μ§œ 순으둜 μ •λ ¬ν•΄ λ‘˜μ§€ ν•˜λ‚˜λ§Œ κ³ λ₯Ό 수 μžˆμœΌλ‹ˆκΉŒ)
  • in InnoDB, primary key is clustering index by default
1
2
3
4
5
6
7
8
9
10
11
12
13
clustering index: created_at

+-----------------------------+
| posts Table (Physical Rows)|
+-----------------------------+
| id=9  | created_at=2023-01-01 |
| id=4  | created_at=2023-01-02 |
| id=12 | created_at=2023-01-03 |
| id=5  | created_at=2023-01-04 |
| id=7  | created_at=2023-01-05 |
+-----------------------------+
        ↑
   (physically sorted)
  • πŸ‘πŸ» RANGE queries like ORDER BY
  • πŸ‘πŸ» ordered scans are fast
  • πŸ‘ŽπŸ» if many INSERT, DELETE, UPDATE, lot of re-ordering has to take place
1
2
SELECT * FROM posts
WHERE created_at BETWEEN '2023-01-02' AND '2023-01-04';
  • leaf node of the clustering index store the actual data rows and pointers

βœ… What is a non-clustering index?

  • index that DOES NOT affect the physical order of data in the table
  • instead, use index columns
  • and pointer: point to where to find data in actual table row
  • thus, only store index and pointer
1
2
3
4
5
email               β†’ row location
------------------------------
alice@example.com   β†’ row ID 1
bob@example.com     β†’ row ID 2
carol@example.com   β†’ row ID 3
  • so when I query
1
2
3
4
5
SELECT * FROM users WHERE email = 'bob@example.com';

-- find email
-- then go to row ID 2 with pointer
-- then get user data

βœ… What is a full-text index?

  • specialized index used for searching large text data
  • such as paragraph, documents hash- support advanced text search
  • search words inside text
  • πŸ‘πŸ» MATCH AGAINST
1
2
SELECT * FROM posts
WHERE MATCH(content) AGAINST('database indexing');
  • this query will return posts that mention database or indexing
  • then, sort by relevance

Screenshot-2025-06-30-at-01-22-20.png

βœ… Can you explain different types of index scans in a database?

  • Index scan: retrieving data using index, instead of scanning the entire table

  • βœ”οΈ Index Full Scan: scan all entries in index
  • query access all rows in index
  • can be faster than table full scan if index has fewer columns
1
SELECT name FROM users;
  • βœ”οΈ Index Range Scan: scan specific range based on conditions
  • BETWEEN, <, >, LIKE 'A%
1
SELECT * FROM employees WHERE age BETWEEN 30 AND 40;
  • βœ”οΈ Index Unique Scan: retrieve single row with unique index
1
SELECT * FROM users WHERE id = 123;
  • βœ”οΈ Index Skip Scan: when you are using composite index(multi-column index) but your query skips the first column
1
2
3
4
5
6
7
8
9
10
-- have composite index on department, job_title
CREATE INDEX idx_dept_job ON employees(department, job_title);

-- however, run query on job_title
-- the query DOES NOT filter the first column
SELECT * FROM employees WHERE job_title = 'Manager';

-- in composite index, need to use first or both index in query
-- thus, normally, index will not be used
-- 🟒 with index skip scan, DB will use index smartly
  • DB will check department per job_title
  • smth like Teacher + Manager, Developer + Manager…
  • DB will do a subtree scan within the index

  • example of better query
  • use both index columns, no skipping is needed
1
2
SELECT * FROM employees
WHERE department = 'Sales' AND job_title = 'Manager';

βœ… Have you ever checked a query execution plan? Can you explain what it is and how you analyze it?

  • checked a query execution plan using EXPLAIN in MySQL
  • query execution plan: strategy chosen by DB optimizer to execute SQL query

  • no index Screenshot-2025-06-30-at-00-46-36.png

  • use index Screenshot-2025-06-30-at-00-52-17.png

  • type: join type or scan method, important for performance
    • ALL: full table scan
    • index: full index scan
    • range, ref, eq-ref: efficient indexed access
  • key: which index is used
  • rows: number of rows DB expects to scan
  • filtered: percentage of rows expected to pass the filtering condition

  • also used show profile CPU to check CPU usage

Screenshot-2025-06-30-at-00-47-01.png

βœ… What is a hint in SQL, and when would you use it?

  • hint: directive you include in SQL query to influence the behavior of the query optimizer
  • override default optimization, enforce your strategy

  • tell optimizer to use idx_department index explicitly
1
2
3
SELECT /*+ INDEX(employees idx_department) */ *
FROM employees
WHERE department = 'Engineering';
  • ⚠️ overusing hints can reduce maintainability

βœ… How do you check if an index is being used by a query?

  • check by examining query execution plan
  • use EXPLAIN in MySQL

  • look for key column in the EXPLAIN output
  • if key shows index, that index is being used
  • if key is null, query is not using index

  • can also run SHOW INDEX;
1
SHOW INDEX FROM user;

Screenshot-2025-06-30-at-00-58-37.png

βœ… How does GROUP BY affect index usage in queries?

  • index can be used in GROUP BY queries only when the grouping column match the index structure

    1. leading column of index must appear in GROUP BY
    1. column order in GROUP BY must match index order
    1. extra columns in GROUP BY that are NOT part of index can prevent index use
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
-- Index: (country, gender, name)

-- ⭕️ index is used
SELECT country, COUNT(*)
FROM users
GROUP BY country;

SELECT country, gender, COUNT(*)
FROM users
GROUP BY country, gender;

-- ❌ index is NOT used
-- order is wrong
SELECT gender, country
FROM users
GROUP BY gender, country;
1
2
3
4
5
6
7
8
9
10
11
12
13
예) 인덱슀 (a, b, c)κ°€ μžˆμ„ λ•Œ

1. μΈλ±μŠ€κ°€ μ μš©λ˜λŠ” 경우:

- GROUP BY a : 인덱슀 첫 번째 컬럼인 a만 μ‚¬μš©ν•˜λ―€λ‘œ, 인덱슀 적용
- GROUP BY a, b : a와 bλŠ” 인덱슀의 첫 λ²ˆμ§Έμ™€ 두 번째 μ»¬λŸΌμ— ν•΄λ‹Ήν•˜λ―€λ‘œ, 인덱슀 적용
- GROUP BY a, b, c : λͺ¨λ“  컬럼이 μΈλ±μŠ€μ— ν¬ν•¨λ˜μ–΄ μžˆμ–΄ 인덱슀 적용

2. μΈλ±μŠ€κ°€ μ μš©λ˜μ§€ μ•ŠλŠ” 경우:

- GROUP BY b : bλŠ” 두 번째 μ»¬λŸΌμ΄μ§€λ§Œ 첫 번째 컬럼인 aκ°€ ν¬ν•¨λ˜μ§€ μ•Šμ•„μ„œ, μΈλ±μŠ€κ°€ μ μš©λ˜μ§€ μ•ŠμŒ
- GROUP BY b, a : bκ°€ 첫 번째둜 λ‚˜μ˜€κ³ , aλŠ” 두 번째둜 λ‚˜μ˜€κΈ° λ•Œλ¬Έμ—, μˆœμ„œκ°€ λ§žμ§€ μ•Šμ•„ μΈλ±μŠ€κ°€ μ μš©λ˜μ§€ μ•ŠμŒ
- GROUP BY a, c, b, d : μΈλ±μŠ€μ— ν¬ν•¨λ˜μ§€ μ•Šμ€ d 컬럼이 있기 λ•Œλ¬Έμ—, μΈλ±μŠ€κ°€ μ μš©λ˜μ§€ μ•ŠμŒ

βœ… How would you create an index on a table with name, country, and gender?

    1. Analyze cardinality: how unique is this column?
  • name would have higher cardinality
  • country or gender has low cardinality

    1. Create composite(multi-column) index
1
CREATE INDEX idx_name_country_gender ON users(name, country, gender);
    1. ⚠️ Be careful, in composite index including leading column is important!
  • always include leading column
1
2
3
4
5
6
7
8
-- ⭕️ use index
SELECT * FROM users WHERE name = 'John';
SELECT * FROM users WHERE name = 'John' AND country = 'Korea';

-- ❌ index will not be used
-- unless Index Skip Scan is enabled
SELECT * FROM users WHERE country = 'Korea';

    1. ⚠️ In GROUP-BY, leading column and order is very important
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
-- ⭕️ use index
SELECT name, COUNT(*) FROM users
GROUP BY name;

SELECT name, country, COUNT(*) FROM users
GROUP BY name, country;

SELECT name, country, gender, COUNT(*) FROM users
GROUP BY name, country, gender;

-- ❌ index will not be used
-- leading column is not included
SELECT country, COUNT(*) FROM users
GROUP BY country;

-- ❌ index will not be used
-- order is wrong
SELECT country, name FROM users
GROUP BY country, name;
This post is licensed under CC BY 4.0 by the author.