Lecture Notes for cs581

Overview

Overview
Lecture <2025-08-25 Mon>
Lecture <2025-08-27 Wed>
- Random vs. Sequential Access (HDD)
- Pipelining
  - columnar storage
  - row-based storage
Lecture <2025-09-03 Wed>
Lecture <2025-09-08 Mon>
Lecture <2025-09-15 Mon>
Lecture <2025-09-24 Wed>
Lecture <2025-09-29 Mon>
Lecture <2025-10-01 Wed>
Lecture <2025-10-08 Wed>
Lecture <2025-10-13 Mon>
Lecture <2025-10-15 Wed>
Lecture <2025-10-22 Wed>
Lecture <2025-10-27 Mon>
Lecture <2025-10-29 Wed>
Lecture <2025-11-03 Mon>
- Merging subqueries (subquery pull-up of SPJ queries)
- Lack of cross optimization between UDFs written in a procedural language or procedural extension of SQL and the remaining of a query
Lecture <2025-11-05 Wed>
- Rewrite argmax using window functions
Lecture <2025-11-24 Mon>
Lecture <2025-12-03 Wed>

Lecture <2025-08-25 Mon>

course webpage: https://www.cs.uic.edu/~bglavic/cs581/2025-fall/
syllabus: https://www.cs.uic.edu/~bglavic/cs581/2025-fall/syllabus/
slides: https://www.cs.uic.edu/~bglavic/cs581/2025-fall/schedule/schedule/
additional resources: https://www.cs.uic.edu/~bglavic/cs581/2025-fall/resources/

1
2
3


SELECT s.name
FROM students s, registration r
WHERE s.uin = r.uin AND r.coursenr = 'cs581';

Lecture <2025-08-27 Wed>

Random vs. Sequential Access (HDD)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18


# DISK STATS
rpm = 7200
avg_seektime_ms = 15
transfer_rate_sec_mb = 100
block_size_byte = (1024 * 4)  # 4KB blocks

# DERIVED STATS
avg_rotational_delay_ms = 0.5 * (float(1) / (rpm / (60 * 1000))) # translate rpm into
block_size_mb = float(block_size_byte) / float(1024 * 1024)
transfer_rate_ms_mb = transfer_rate_sec_mb / 1000
transfer_one_block_ms = block_size_mb / transfer_rate_ms_mb

# RANDOM AND SEQUENTIAL ACCESS TIMES
random_access_block = round(avg_seektime_ms + avg_rotational_delay_ms + transfer_one_block_ms,3)
sequential_access_block = round(transfer_one_block_ms,3)
speed_up = round(random_access_block / sequential_access_block,3)

print(f"random access (ms per block):        {random_access_block:9.3f}\nsequential access (ms per block):    {sequential_access_block:9.3f}\nspeed-up sequential:                 {speed_up:9.3f}")

random access (ms per block):           19.206
sequential access (ms per block):        0.039
speed-up sequential:                   492.462

Pipelining

columnar storage

1
2
3
4
5
6


int a[1024 * 1024];
int sum = 0;
for(int i = 0; i++; i < 1024 * 1024)
{
    sum += a[i];
}

row-based storage

1
2
3
4
5
6


int* a[1024 * 1024];
int columnpos = 3;
for(int i = 0; i++; i < 1024 * 1024)
{
    sum += a[i][columnpos];
}

Lecture <2025-09-03 Wed>

branchy code in loops

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


for(int i = 0; i < 10000; i++)
{
    if (colchunk[i] < 3)
    {
        colchunk[i] += 3;
    }
    else
    {
        colchunk[i] -= 2;
    }
}

1
2
3
4


for(int i = 0; i < 10000; i++)
{
    colchunk[i] += (colchunk[i] < 3) * 3 + (colchunk[i] >= 3) * -2;
}

 1
 2
 3
 4
 5
 6
 7
 8
 9
10


struct A {
  int a;
  char b;
};

struct B {
  int c;
  char d;
  int e;
};

Lecture <2025-09-08 Mon>

Lecture <2025-09-15 Mon>

can use B-tree on lastname (lexicographical)

1
2
3


SELECT *
FROM student
WHERE lastname LIKE 'A%';

can't use B-tree

1
2
3


SELECT *
FROM student
WHERE lastname LIKE '%b%';

Lecture <2025-09-24 Wed>

Lecture <2025-09-29 Mon>

Lecture <2025-10-01 Wed>

Lecture <2025-10-08 Wed>

1
2


s = 3 + 4 + 5 + 6 + 7 + 8
s2 = (3 + 4 + 5) + (6 + 7 + 8)

associativity and commutativity

Lecture <2025-10-13 Mon>

1
2
3
4
5


CREATE VIEW cntstd AS (
   SELECT count(*) AS numstd, major
   FROM students
   GROUP BY major
);

1

SELECT * FROM cntstd;

1
2
3
4
5


SELECT * FROM (
   SELECT count(*) AS numstd, major
   FROM students
   GROUP BY major
) cntstd;

1
2
3
4


SELECT name
FROM (SELECT send_email(email, 'Hey Student, ...'), name
      FROM student)
WHERE name LIKE 'A%';

1

SELECT sum(common_gpa()) AS avg_gpa FROM students HAVING acg_gpa > 4.5;

1

SELECT NULL = 3 OR 3 = 3

AND	FALSE	TRUE	NULL
FALSE	FALSE	FALSE	FALSE
TRUE		TRUE	NULL
NULL			NULL

OR	FALSE	TRUE	NULL
FALSE	FALSE	TRUE	NULL
TRUE		TRUE	TRUE
NULL			NULL

1
2
3


SELECT A + NULL

SELECT sum(A)

Lecture <2025-10-15 Wed>

window functions

1
2
3
4


WITH RECURSIVE setstuff (a,b) AS (
  SELECT * FROM edge
  UNION
  SELECT ...);

Lecture <2025-10-22 Wed>

1

SELECT 1 AS tester, * FROM R

\[ \sigma_{A=3 \vee A=4}(R) = \sigma_{A=3}(R) \cup \sigma_{A=4}(R) \]

Lecture <2025-10-27 Mon>

Lecture <2025-10-29 Wed>

S = { R_1, R_4, R_6 }

O = { R_4 } S / O = { R_1, R_6 }

1
2
3
4
5
6


from math import pow

numtables = 40
print("Number of plans")
print(f"Dynamic Programming: {pow(3,numtables)}")
print(f"Greedy Join enumeration: {pow(numtables,3)}")

Number of plans
Dynamic Programming: 1.2157665459056929e+19
Greedy Join enumeration: 64000.0

Lecture <2025-11-03 Mon>

Merging subqueries (subquery pull-up of SPJ queries)

1
2
3


SELECT c * c AS D
FROM (SELECT a + b AS c, e
      FROM R(a,b,e)) sub

pullup subqueries

1
2


SELECT (a+b) * (a+b) AS D
FROM R

what about side effects (we have to be careful as we may change the number of times the function is executed)
1 2 3

SELECT c * c AS D FROM (SELECT send_email(a,b) AS c, e FROM R(a,b,e)) sub
in worst case, we get an exponential blowup

1
2
3
4


SELECT an-1 * an-1 AS an
FROM (SELECT an-2 * an-2 AS an-1
     ...
        SELECT a1 * a1 AS a2 FROM R

1
2
3
4
5
6


CREATE FUNCTION f(x,y) {
   RETURN x + y
};

SELECT f(a,c)
FROM R;

1
2


SELECT fresult
FROM R, LATERAL (SELECT a + c AS fresult) AS f;

1
2


SELECT a + c AS fresult
FROM R;

 1
 2
 3
 4
 5
 6
 7
 8
 9
10


CREATE FUNCTION divide(x,y) AS (
  IF y = 0
  THEN
     RETURN NULL
  ELSE
     RETURN x / y;
)

-- ->
SELECT CASE WHEN y = 0 THEN NULL ELSE x / y;

1
2
3


CREATE FUNction
   FOR x in (SELECT * FROM ..)
      loop_body

Lack of cross optimization between UDFs written in a procedural language or procedural extension of SQL and the remaining of a query

in general this is hard as in general optimizing beyond the "function boundary" would require lifting the function to the declarative / batch level of relational operations (such as joins) which is akin to automatic parallelization of programs (an undecidable computational problem)
- the net result is that functions are "optimization boundaries", e.g., if your UDF runs a query with a join and you call your UDF from another query with joins, there will be no join reordering between these two
there are some research prototypes that translate procedural code into declarative SQL:
- Froid: Optimization of Imperative Programs in a Relational Database. Karthik Ramachandra, Kwanghyun Park, K. Venkatesh Emani, Alan Halverson, César A. Galindo-Legaria, Conor Cunningham. PVLDB, vol. 11 (4), 432–444, 2017.
- Blackmagic: Automatic Inlining of Scalar Udfs into SQL Queries with Froid. Karthik Ramachandra, Kwanghyun Park. PVLDB, vol. 12 (12), 1810–1813, 2019.
- One with RECURSIVE Is Worth Many GOTOs. Denis Hirn, Torsten Grust. SIGMOD '21: International Conference on Management of Data, Virtual Event, China, June 20-25, 2021, pp. 723–735 (2021).
- Dear User-Defined Functions, Inlining Isn'T Working out so Great for Us. Let'S Try Batching to Make Our Relationship Work. Sincerely, SQL. Kai Franz, Samuel Arch, Denis Hirn, Torsten Grust, Todd C. Mowry, Andrew Pavlo. 14th Conference on Innovative Data Systems Research, CIDR 2024, Chaminade, HI, USA, January 14-17, 2024 (2024).

Lecture <2025-11-05 Wed>

Rewrite argmax using window functions

1
2
3
4
5
6
7


SELECT name
FROM orders o,
    (SELECT max(amount) AS ma, i.cust
    FROM orders i
    GROUP BY i.cust) sub
WHERE sub.cust = o.cust
      AND o.amount = sub.ma

1
2
3
4


SELECT name
FROM (Select max(amount) OVER (PARTITION BY cust) AS ma, amount, name
      FROM orders)
WHERE ma = amount

Enhanced Subquery Optimizations in Oracle. Srikanth Bellamkonda, Rafi Ahmed, Andrew Witkowski, Angela Amor, Mohamed Zait, Chun-Chieh Lin. Proceedings of the VLDB Endowment, vol. 2 (2), 1366–1377, 2009.