CS480 - Database Systems - 2024 spring

Course webpage for CS480 - 2024 spring taught by Boris Glavic

Lecture Notes for CS480

Lecture <2024-01-09 Tue>

https://www.cs.uic.edu/~bglavic/cs480/2024-spring/homework/

Lecture <2024-01-11 Thu>

Crossproduct

\(S_1 = \{e_1, e_2 \}\) \(S_2 = \{a_1, a_2\}\)

\(S_1 \times S_2\)

S1 S2
e1 a1
e1 a2
e2 a1
e2 a2

Relations

\(\mathbb{D}_1 = \{ 0,1 \}\) \(\mathbb{D}_2 = \{ red, green, blue \}\)

\(\mathbf{Shirts}(color: \mathbb{D}_2, is-muscle: \mathbb{D}_1)\)

\(\mathbb{D}_2 \times \mathbb{D}_1 = \{ (red, 0), (green, 0), (blue, 1), \dlots \}\)

color is-muscle
green 1
red 0
blue 1
blue 0

Candidate Keys

  • Lastname + Firstname is unique
  • Lastname or Firstname by itself is not
Lastname Firstname Age
  • ID is unique
  • Lastname + Firstname is unique
  • Lastname or Firstname by itself is not
ID Lastname Firstname Age

Lecture <2024-01-20 Sat>

Relational algebra

Selection example

$\sigma_{city = 'Chicago'}(Add)$

  • table Add
state city zip
IL Chicago 66014
IL Schaumberg 66077
CA San Francisco 11555
  • result
state city zip
IL Chicago 66014

Lecture <2024-01-23 Tue>

Lecture <2024-01-25 Thu>

Project Use Case

Ideas

  1. library (books) - 9 (14)
  2. sports statistics - 9 (9)
  3. public transportation - 7
  4. inventory management - 2
  5. game ratings / store - 7
  6. online retailer - 7
  7. crime statistics - 2
  8. hospital management - 8
  9. stock market - 5

relational algebra

1
SELECT * FROM student;
id name dept_name tot_cred
00128 Zhang Comp. Sci. 102
12345 Shankar Comp. Sci. 32
19991 Brandt History 80
23121 Chavez Finance 110
44553 Peltier Physics 56
98988 Tanaka Biology 120

\(\gamma_{dept\_name; count(id)}(student)\)

\(\gamma_{count(id)}(\sigma_{dept\_name=CS}(student))\) \(\sigma_{tot\_cred > 100}(student)\)

\(\sigma_{dept\_name = History \lor dept\_name = Finance}(student)\)

\(\sigma_{dept\_name = History \land tot\_cred < 20}(student)\)

\(\pi_{tot\_cred / 10, name}(student)\)

\(\pi_{name}(\sigma_{dept\_name = History}(student))\)

\[ \pi_{name, course\_id}(student \bowtie_{id = id'} \rho_{id' \gets id}(takes)) \]

1
SELECT * FROM takes LIMIT 3;
id course_id sec_id semester year grade
00128 CS-101 1 Fall 2009 A
00128 CS-347 1 Fall 2009 A-
12345 CS-101 1 Fall 2009 C
1
SELECT * FROM course LIMIT 3;
course_id title dept_name credits
BIO-101 Intro. to Biology Biology 4
BIO-301 Genetics Biology 4
BIO-399 Computational Biology Biology 3
1
SELECT * FROM instructor LIMIT 3;
id name dept_name salary
10101 Srinivasan Comp. Sci. 65000.00
12121 Wu Finance 90000.00
15151 Mozart Music 40000.00

\(\pi_{name}(student) \cup \pi_{name}(instructor)\)

\(\pi_{name}(instructor) - \pi_{name}(student)\)

\(\pi_{name}(instructor) \cap \pi_{name}(student)\)

Lecture <2024-01-30 Tue>

1
SELECT * FROM time_slot;
time_slot_id day start_hr start_min end_hr end_min
A M 8 0 8 50
A W 8 0 8 50
A F 8 0 8 50
B M 9 0 9 50
B W 9 0 9 50
B F 9 0 9 50
C M 11 0 11 50
C W 11 0 11 50
C F 11 0 11 50
D M 13 0 13 50
D W 13 0 13 50
D F 13 0 13 50
E T 10 30 11 45
E R 10 30 11 45
F T 14 30 15 45
F R 14 30 15 45
G M 16 0 16 50
G W 16 0 16 50
G F 16 0 16 50
H W 10 0 12 30
1
2
3
4
SELECT count(*),
       (((end_hr * 60) + end_min) - (start_hr * 60 + start_min)) AS length
  FROM time_slot
 GROUP BY (((end_hr * 60) + end_min) - (start_hr * 60 + start_min));
count length
4 75
15 50
1 150
  • return all student
1
2
SELECT *
FROM student s
id name dept_name tot_cred
00128 Zhang Comp. Sci. 102
12345 Shankar Comp. Sci. 32
19991 Brandt History 80
23121 Chavez Finance 110
44553 Peltier Physics 56
45678 Levy Physics 46
54321 Williams Comp. Sci. 54
55739 Sanchez Music 38
70557 Snow Physics 0
76543 Brown Comp. Sci. 58
76653 Aoi Elec. Eng. 60
98765 Bourikas Elec. Eng. 98
98988 Tanaka Biology 120
  • give all info for student with id 12345
1
2
3
SELECT *
FROM student
WHERE id = '12345';
id name dept_name tot_cred
12345 Shankar Comp. Sci. 32
  • names of students with tot_cred smaller than 10
1
2
3
SELECT name
FROM student
WHERE tot_cred < 10;
name
Snow
  • number of students that have taken more than 50 credits
1
2
3
SELECT count(*)
FROM student
WHERE tot_cred > 50;
count
9
  • for each credit amount, the number of students with this many credits (per 10 credits)
1
2
3
SELECT (tot_cred / 10)::int * 10, count(*)
FROM student
GROUP BY (tot_cred / 10)::int;
?column? count
110 1
30 1
50 2
40 1
0 1
100 2
60 3
120 1
80 1
  • the smallest, average, and maximum number of credits over all student
1
2
SELECT min(tot_cred), avg(tot_cred), max(tot_cred)
FROM student;
min avg max
0 65.6923076923076923 120
  • the smallest, average, and maximum number of credits over all student per dept
1
2
3
SELECT dept_name, min(tot_cred), avg(tot_cred), max(tot_cred)
FROM student
GROUP BY dept_name;
dept_name min avg max
Finance 110 110.0000000000000000 110
History 80 80.0000000000000000 80
Physics 0 34.0000000000000000 56
Music 38 38.0000000000000000 38
Comp. Sci. 32 61.5000000000000000 102
Biology 120 120.0000000000000000 120
Elec. Eng. 60 79.0000000000000000 98
1
2
3
4
5
SELECT dept_name,
       max(tot_cred) - min(tot_cred) AS diff
FROM student
GROUP BY dept_name
ORDER BY diff ASC;
dept_name diff
Music 0
History 0
Finance 0
Biology 0
Elec. Eng. 38
Physics 56
Comp. Sci. 70
1
2
SELECT cmax, *
FROM student;
cmax id name dept_name tot_cred
0 00128 Zhang Comp. Sci. 102
0 12345 Shankar Comp. Sci. 32
0 19991 Brandt History 80
0 23121 Chavez Finance 110
0 44553 Peltier Physics 56
0 45678 Levy Physics 46
0 54321 Williams Comp. Sci. 54
0 55739 Sanchez Music 38
0 70557 Snow Physics 0
0 76543 Brown Comp. Sci. 58
0 76653 Aoi Elec. Eng. 60
0 98765 Bourikas Elec. Eng. 98
0 98988 Tanaka Biology 120
1
2
3
SELECT tot_cred AS a
FROM student
ORDER BY a - a;
a
102
32
80
110
56
46
54
38
0
58
60
98
120
1
2
3
SELECT name
FROM student
WHERE tot_cred= (SELECT min(tot_cred) FROM student);
name
Snow

Lecture <2024-02-01 Thu>

1
2
3
4
SELECT DISTINCT s.*
FROM student s,
     takes t
WHERE s.id = t.id;
id name dept_name tot_cred
98988 Tanaka Biology 120
98765 Bourikas Elec. Eng. 98
55739 Sanchez Music 38
45678 Levy Physics 46
19991 Brandt History 80
76653 Aoi Elec. Eng. 60
00128 Zhang Comp. Sci. 102
76543 Brown Comp. Sci. 58
54321 Williams Comp. Sci. 54
12345 Shankar Comp. Sci. 32
23121 Chavez Finance 110
44553 Peltier Physics 56
1
2
3
SELECT *
FROM student s
WHERE NOT EXISTS (SELECT * FROM takes t WHERE s.id = t.id);
id name dept_name tot_cred
70557 Snow Physics 0
1
2
3
(SELECT * FROM student)
EXCEPT
(SELECT s.* FROM student s, takes t WHERE s.id = t.id)
id name dept_name tot_cred
70557 Snow Physics 0
1
2
3
4
5
SELECT i.*, ns.numst
FROM instructor i,
     (SELECT count(*) numst, dept_name FROM department
     GROUP BY dept_name) AS ns
WHERE i.dept_name = ns.dept_name;
id name dept_name salary numst
10101 Srinivasan Comp. Sci. 65000.00 1
12121 Wu Finance 90000.00 1
15151 Mozart Music 40000.00 1
22222 Einstein Physics 95000.00 1
32343 El Said History 60000.00 1
33456 Gold Physics 87000.00 1
45565 Katz Comp. Sci. 75000.00 1
58583 Califieri History 62000.00 1
76543 Singh Finance 80000.00 1
76766 Crick Biology 72000.00 1
83821 Brandt Comp. Sci. 92000.00 1
98345 Kim Elec. Eng. 80000.00 1
1
2
3
(SELECT * FROM instructor)
EXCEPT
(SELECT i.* FROM instructor i, student s WHERE i.dept_name = s.dept_name);
id name dept_name salary

Lecture <2024-02-06 Tue>

Relational Algebra

orders

custid item quantity price shippingfee
1 Toothpaste 100 5.99 30
1 Shampoo 1 12.99 0
2 Shampoo 3 12.99 3
  • per customer compute to the total money they owe us

    • for each order the quantity times price plus the shipping fee (\(\pi\))
    • then summed up across all orders (\(\gamma\)) - group by custid
  • show only customers where the total is smaller 500 (\(\sigma\))

\(orderprices \leftarrow \rho_{custid, orderprice}(\pi_{custid, quantity * price + shippingfee}(orders))\) \(orderprices \leftarrow \rho_{orderprice \leftarrow quantity * price + shippingfee}(\pi_{custid, quantity * price + shippingfee}(orders))\)

custid orderprice
1 629.00
1 12.99
2 38.97

\(totalcustorders \leftarrow \rho_{totalorders \leftarrow sum(orderprice)}(\gamma_{custid; sum(orderprice)}(orderprices))\)

custid totalorders
1 641.99
2 38.97

\(\sigma_{totalorders < 500}(totalcustorders)\)

custid totalorders
2 38.97
  • get largest and average totalorders (maximum totalorders across all operators) (\(\gamma\))

\(maxttl \leftarrow \rho_{maxtotal,avgtotal}(\gamma_{max(totalorders),avg(totalorders)}(totalcustorders))\)

maxtotal avgtotal
641.99 340.48

\(\pi_{maxtotal,avgtotal,custid}(\sigma_{maxtotal = totalorders}(maxttl \times totalcustorders))\)

\(\pi_{maxtotal,avgtotal,custid}(maxttl \bowtie_{maxtotal = totalorders} totalcustorders)\)

maxtotal avgtotal custid
641.99 340.48 1
  • get customer with maximum totalorders
  • equivalences

\(R \bowtie (S \bowtie T) \equiv (R \bowtie S) \bowtie T\)

Lecture <2024-02-08 Thu>

1
2
3
CREATE TABLE edge (f TEXT, t TEXT);

INSERT INTO edge VALUES ('a','b'), ('b','c'), ('b','d'), ('d','e');
CREATE TABLE
INSERT 0 4
1
SELECT * FROM edge;
f t
a b
b c
b d
d e
  • paths of length 2
1
2
3
SELECT e1.f, e2.t
FROM edge e1, edge e2
WHERE e1.f != e2.f AND e1.t != e2.t AND e1.t = e2.f;
f t
a d
a c
b e
  • paths of length up to 2
1
2
3
4
5
6
WITH twopaths AS (SELECT e1.f, e2.t
                FROM edge e1, edge e2
                WHERE e1.f != e2.f AND e1.t != e2.t AND e1.t = e2.f)
SELECT * FROM edge
UNION
(SELECT * FROM twopaths);
f t
a b
b c
a d
a c
d e
b d
b e
  • paths of length up to 3
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
WITH twopaths AS (SELECT e1.f, e2.t
                FROM edge e1, edge e2
                WHERE e1.f != e2.f AND e1.t != e2.t AND e1.t = e2.f),
  threepaths AS (
    (SELECT e1.f, e2.t
                FROM twopaths e1, edge e2
                WHERE e1.f != e2.f AND e1.t != e2.t AND e1.t = e2.f)
    )
SELECT * FROM edge
UNION
(SELECT * FROM twopaths)
UNION
(SELECT * FROM threepaths);
f t
b c
b e
d e
a e
a c
a b
a d
b d
1
2
3
4
5
6
7
WITH RECURSIVE inpre AS (
  (SELECT *, 1 AS cnt FROM prereq) -- direct prereqs
  UNION
  (SELECT i.course_id, p.prereq_id, cnt + 1 AS cnt -- recursive step
     FROM inpre i, -- reference to previous iteration result
          prereq p WHERE i.prereq_id = p.course_id))
SELECT * FROM inpre;
course_id prereq_id cnt
BIO-301 BIO-101 1
BIO-399 BIO-101 1
CS-190 CS-101 1
CS-315 CS-101 1
CS-319 CS-101 1
CS-347 CS-101 1
EE-181 PHY-101 1
1
2
3
4
5
6
7
8
WITH RECURSIVE tc AS (
  SELECT *, 1 cnt FROM edge
  UNION
  SELECT e1.f, e2.t, cnt + 1 AS cnt
    FROM tc e1, edge e2
   WHERE e1.f != e2.f AND e1.t != e2.t AND e1.t = e2.f
         AND cnt <= 1)
SELECT * FROM tc;
f t cnt
a b 1
b c 1
b d 1
d e 1
a c 2
a d 2
b e 2
1
INSERT INTO edge VALUES ('a', 'a');
INSERT 0 1
1
2
3
4
5
6
7
WITH RECURSIVE tc AS (
  SELECT *, 1 cnt FROM edge
  UNION
  SELECT e1.f, e2.t, cnt + 1 AS cnt
    FROM tc e1, edge e2
   WHERE e1.t = e2.f AND cnt < 5)
SELECT * FROM tc;
f t cnt
a b 1
b c 1
b d 1
d e 1
a a 1
a b 2
a a 2
a c 2
a d 2
b e 2
a b 3
a a 3
a c 3
a d 3
a e 3
a b 4
a a 4
a c 4
a d 4
a e 4
a b 5
a a 5
a c 5
a d 5
a e 5

Lecture <2024-02-13 Tue>

1
SELECT schemaname, count(*) FROM pg_tables GROUP BY schemaname;
schemaname count
public 21
pg_catalog 64
information_schema 4

Lecture <2024-02-15 Thu>

  • all student of the biology department
1
2
3
SELECT *
FROM student
WHERE dept_name = 'Biology';
id name dept_name tot_cred
98988 Tanaka Biology 120
  • all student of the biology department with at least 20 credits
1
2
3
SELECT *
FROM student
WHERE dept_name = 'Biology' AND tot_cred >= 20;
id name dept_name tot_cred
98988 Tanaka Biology 120
  • count student of the biology department with at least 20 credits
1
2
3
SELECT count(*)
FROM student
WHERE dept_name = 'Biology' AND tot_cred >= 20;
count
1
  • find student with the least credits (return credits)
1
2
SELECT max(tot_cred) AS maxttl
FROM student
maxttl
120
  • find all student information with the least credits (return credits)
1
2
3
4
5
WITH maxc AS (SELECT max(tot_cred) AS maxttl
FROM student)
SELECT *
FROM student s, maxc m
WHERE s.tot_cred = m.maxttl;
id name dept_name tot_cred maxttl
98988 Tanaka Biology 120 120
1
2
3
4
5
WITH maxc AS (SELECT max(tot_cred) AS maxttl
FROM student)
SELECT *
FROM student s
WHERE s.tot_cred = (SELECT * FROM maxc);
id name dept_name tot_cred
98988 Tanaka Biology 120
1
2
3
4
SELECT *
FROM student s
WHERE s.tot_cred = (SELECT max(tot_cred) AS maxttl
FROM student);
id name dept_name tot_cred
98988 Tanaka Biology 120
  • every student that has the max credits in their department
1
2
3
4
5
6
7
WITH perdmax AS (SELECT max(tot_cred) AS mc,
                     dept_name
                FROM student
               GROUP BY dept_name)
SELECT s.*
  FROM student s, perdmax m
WHERE s.tot_cred = m.mc AND s.dept_name = m.dept_name;
id name dept_name tot_cred
00128 Zhang Comp. Sci. 102
19991 Brandt History 80
23121 Chavez Finance 110
44553 Peltier Physics 56
55739 Sanchez Music 38
98765 Bourikas Elec. Eng. 98
98988 Tanaka Biology 120
1
2
3
SELECT *
  FROM student s1
WHERE tot_cred = (SELECT max(tot_cred) FROM student s2 WHERE s1.dept_name = s2.dept_name);
id name dept_name tot_cred
00128 Zhang Comp. Sci. 102
19991 Brandt History 80
23121 Chavez Finance 110
44553 Peltier Physics 56
55739 Sanchez Music 38
98765 Bourikas Elec. Eng. 98
98988 Tanaka Biology 120
1
2
3
4
EXPLAIN ANALYZE
SELECT *
  FROM student s1
WHERE tot_cred = (SELECT max(tot_cred) FROM student s2 WHERE s1.dept_name = s2.dept_name);
QUERY PLAN
Seq Scan on student s1 (cost=0.00..7053.62 rows=2 width=152) (actual time=0.062..0.142 rows=7 loops=1)
Filter: (tot_cred = (SubPlan 1))
Rows Removed by Filter: 6
SubPlan 1
-> Aggregate (cost=15.63..15.64 rows=1 width=32) (actual time=0.007..0.007 rows=1 loops=13)
-> Seq Scan on student s2 (cost=0.00..15.62 rows=2 width=12) (actual time=0.003..0.005 rows=3 loops=13)
Filter: ((s1.dept_name)::text = (dept_name)::text)
Rows Removed by Filter: 10
Planning Time: 1.681 ms
Execution Time: 0.255 ms
1
2
3
4
EXPLAIN ANALYZE
SELECT *
  FROM student s1
WHERE (tot_cred,dept_name) IN (SELECT max(tot_cred), dept_name FROM student s2 GROUP BY dept_name);
QUERY PLAN
Hash Join (cost=21.75..38.61 rows=2 width=152) (actual time=0.069..0.075 rows=7 loops=1)
Hash Cond: ((s1.tot_cred = (max(s2.tot_cred))) AND ((s1.dept_name)::text = (s2.dept_name)::text))
-> Seq Scan on student s1 (cost=0.00..14.50 rows=450 width=152) (actual time=0.031..0.032 rows=13 loops=1)
-> Hash (cost=18.75..18.75 rows=200 width=90) (actual time=0.023..0.024 rows=7 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> HashAggregate (cost=16.75..18.75 rows=200 width=90) (actual time=0.017..0.019 rows=7 loops=1)
Group Key: s2.dept_name
Batches: 1 Memory Usage: 40kB
-> Seq Scan on student s2 (cost=0.00..14.50 rows=450 width=70) (actual time=0.001..0.002 rows=13 loops=1)
Planning Time: 1.001 ms
Execution Time: 0.140 ms
  • get for each students the buildings and room number they are taking classes sorted by student
1
2
3
4
5
6
7
8
SELECT DISTINCT s.id, building, room_number
FROM student s, section se, takes t
WHERE s.ID = t.ID
  AND t.course_id = se.course_id
  AND t.sec_id = se.sec_id
  AND t.semester = se.semester
  AND t.year = se.year
ORDER BY s.id;
id building room_number
00128 Packard 101
00128 Taylor 3128
12345 Packard 101
12345 Taylor 3128
12345 Watson 120
19991 Painter 514
23121 Packard 101
44553 Watson 100
45678 Packard 101
45678 Watson 100
54321 Packard 101
54321 Taylor 3128
55739 Packard 101
76543 Packard 101
76543 Taylor 3128
76653 Taylor 3128
98765 Packard 101
98765 Watson 120
98988 Painter 514
  • courses (get all information) without prerequisites
1
2
3
4
5
6
SELECT *
FROM course c
EXCEPT
SELECT c.*
FROM course c, prereq p
WHERE c.course_id = p.course_id;
course_id title dept_name credits
MU-199 Music Video Production Music 3
BIO-101 Intro. to Biology Biology 4
CS-101 Intro. to Computer Science Comp. Sci. 4
FIN-201 Investment Banking Finance 3
PHY-101 Physical Principles Physics 4
HIS-351 World History History 3
1
2
3
4
5
6
SELECT *
  FROM course c
       LEFT OUTER JOIN
       (SELECT 1 AS indicator, * FROM prereq) p
           ON (c.course_id = p.course_id)
WHERE p.indicator IS NULL;
course_id title dept_name credits indicator course_id prereq_id
MU-199 Music Video Production Music 3
HIS-351 World History History 3
FIN-201 Investment Banking Finance 3
PHY-101 Physical Principles Physics 4
BIO-101 Intro. to Biology Biology 4
CS-101 Intro. to Computer Science Comp. Sci. 4
1
2
3
SELECT *
  FROM course c
WHERE NOT EXISTS (SELECT * FROM prereq p WHERE p.course_id = c.course_id);
course_id title dept_name credits
MU-199 Music Video Production Music 3
HIS-351 World History History 3
FIN-201 Investment Banking Finance 3
PHY-101 Physical Principles Physics 4
BIO-101 Intro. to Biology Biology 4
CS-101 Intro. to Computer Science Comp. Sci. 4
1
2
3
SELECT *
  FROM course c
WHERE NOT c.course_id IN (SELECT p.course_id FROM prereq p);
course_id title dept_name credits
BIO-101 Intro. to Biology Biology 4
CS-101 Intro. to Computer Science Comp. Sci. 4
FIN-201 Investment Banking Finance 3
HIS-351 World History History 3
MU-199 Music Video Production Music 3
PHY-101 Physical Principles Physics 4
1
2
3
SELECT *
  FROM course c
WHERE (SELECT count(*) FROM prereq p WHERE p.course_id = c.course_id) = 0;
course_id title dept_name credits
BIO-101 Intro. to Biology Biology 4
CS-101 Intro. to Computer Science Comp. Sci. 4
FIN-201 Investment Banking Finance 3
HIS-351 World History History 3
MU-199 Music Video Production Music 3
PHY-101 Physical Principles Physics 4

Lecture <2024-02-20 Tue>

1
2
3
4
5
CREATE FUNCTION deptst(deptn TEXT) RETURNS SETOF student AS
$$
  SELECT * FROM student WHERE dept_name = deptn;
$$
LANGUAGE SQL;
CREATE FUNCTION
1
SELECT * FROM deptst('Biology');
id name dept_name tot_cred
98988 Tanaka Biology 120
1
2
3
4
5
CREATE FUNCTION qtable(tablename TEXT) RETURNS SETOF student AS
$$
  SELECT * FROM tablename;
$$
LANGUAGE SQL;
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
import psycopg2

# define connection parameters
connection = {
    'dbname': 'cs480_slides',
    'user': 'postgres',
    'host': '127.0.0.1',
    'password': 'test',
    'port': 5450
}

# open connection
conn = psycopg2.connect(**connection)
print(conn)
<connection object at 0x1099a8c10; dsn: 'user=postgres password=xxx dbname=cs480_slides host=127.0.0.1 port=5450', closed: 0>

None

Cursor

  • Cursors allow execution of SQL code
1
2
cur = conn.cursor()
print(cur)
<cursor object at 0x10ada3040; closed: 0>
1
2
3
4
5
6
7
8
cur = conn.cursor()

# run query
cur.execute("SELECT name, deptname FROM student")

# fetch results
rows = cur.fetchall()
print(rows[0])
('Zhang', 'Comp. Sci.')
1
2
3
4
5
6
7
# run query
cur.execute("SELECT count(*), deptname FROM student GROUP BY deptname")

# fetch results
rows = cur.fetchall()
for r in rows:
    print(f"department {r[1]} has {r[0]} students")
department Physics has 3 students
department Biology has 1 students
department Elec. Eng. has 2 students
department Finance has 1 students
department Comp. Sci. has 6 students
department History has 1 students
department Music has 1 students
1
2
3
4
5
6
# run query
cur.execute("SELECT name, deptname FROM student")

# fetch results
rows = cur.fetchall()
print(rows[0])
('Zhang', 'Comp. Sci.')
1
2
cur.close()
conn.close()

Lecture <2024-02-22 Thu>

Music collection - music player

  • albums: title, artist (one more), release date, genre (one or more),
  • tracks: title, tracks belong to albums and have a track number within this album, length, need artists, writing credits
  • artists: name, birthdate, wikipedia entry url
  • album collections contain one or more albums
  • playlists: have name, and contain songs (with order)
  • constraints:

    • each album should have at least one tracks
    • each track should have a name
    • tracks should positive consecutive numbers

Lecture <2024-02-27 Tue>

Lecture <2024-02-29 Thu>

Lecture <2024-03-05 Tue>

Relational algebra to SQL

R(A,B,C) S(B,D,E)

\[ \sigma_{A=5}(\pi_{A,B}(R)) \bowtie \pi_{B,D}(S) \]

  • UNION and UNION ALL
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
SELECT DISTINCT *
FROM (SELECT *
     FROM (
       SELECT A, B
       FROM R
       ) i1
     WHERE A = 5) l
     NATURAL JOIN
     (SELECT B, D
      FROM S) r

Testing FDs

  • test $A \to C$
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
DROP TABLE R;
CREATE TABLE R (
  id TEXT,
  A INT,
  B INT,
  C INT
  );

INSERT INTO R
VALUES
   ('x1',1,1,1),
   ('x2',1,2,3),
   ('x3',1,3,4),
   ('x4',2,3,5);
DROP TABLE
CREATE TABLE
INSERT 0 4
1
SELECT * FROM r;
id a b c
x1 1 1 1
x2 1 2 3
x3 1 3 4
x4 2 3 5
  • $\sigma_1: A \to C$
1
2
3
4
SELECT NOT EXISTS(
SELECT *
FROM R r1, R r2
WHERE r1.A = r2.A AND r1.C < r2.C);
?column?
f
  • $\sigma_2: A,B \to C,D$
1
2
3
SELECT *
FROM R r1, R r2
WHERE r1.A = r2.A AND r1.B=r2.B AND (r1.C != r2.C OR r1.D != r2.D)
  • $\sigma_3: A \to B,C$
1
2
3
SELECT NOT EXISTS (SELECT *
FROM R r1, R r2
WHERE r1.A = r2.A AND (r1.B != r2.B OR r1.C != r2.C));
?column?
f

Lecture <2024-03-14 Thu>

Lecture <2024-03-26 Tue>

Lecture <2024-03-28 Thu>

Lecture <2024-04-04 Thu>

Postgres page layout

1
2
3
DROP TABLE IF EXISTS testpage;
CREATE TABLE testpage (a int, b int);
INSERT INTO testpage (SELECT a, round(random() * 100) AS b FROM generate_series(1,1000) g(a));
DROP TABLE
CREATE TABLE
INSERT 0 1000
1
SELECT count(*) FROM testpage;
count
1000
  • install the extension that allows for inspection
1
CREATE EXTENSION pageinspect;
  • get page header data for page 0
1
SELECT * FROM page_header(get_raw_page('testpage', 0));
lsn checksum flags lower upper special pagesize version prune_xid
0/1668F7A8 0 0 928 960 8192 8192 4 0
  • get tuples on the page 0
1
SELECT * FROM heap_page_items(get_raw_page('testpage', 0)) LIMIT 10;
lp lp_off lp_flags lp_len t_xmin t_xmax t_field3 t_ctid t_infomask2 t_infomask t_hoff t_bits t_oid t_data
1 8160 1 32 4461 4463 0 (4,97) 2 256 24 \x010000002f000000
2 8128 1 32 4461 4463 0 (4,98) 2 256 24 \x0200000006000000
3 8096 1 32 4461 0 0 (0,3) 2 2304 24 \x0300000007000000
4 8064 1 32 4461 0 0 (0,4) 2 2304 24 \x0400000024000000
5 8032 1 32 4461 0 0 (0,5) 2 2304 24 \x0500000063000000
6 8000 1 32 4461 0 0 (0,6) 2 2304 24 \x060000002b000000
7 7968 1 32 4461 0 0 (0,7) 2 2304 24 \x070000005e000000
8 7936 1 32 4461 0 0 (0,8) 2 2304 24 \x0800000041000000
9 7904 1 32 4461 0 0 (0,9) 2 2304 24 \x090000003c000000
10 7872 1 32 4461 0 0 (0,10) 2 2304 24 \x0a00000034000000
1
UPDATE testpage SET b = 10000 WHERE A < 3;
UPDATE 2
1
UPDATE testpage SET b = 10000 WHERE A BETWEEN 3 AND 5;
UPDATE 3
1
SELECT * FROM heap_page_items(get_raw_page('testpage', 0)) LIMIT 10;
lp lp_off lp_flags lp_len t_xmin t_xmax t_field3 t_ctid t_infomask2 t_infomask t_hoff t_bits t_oid t_data
1 0 3 0
2 0 3 0
3 8160 1 32 4461 4464 0 (0,227) 16386 256 24 \x0300000007000000
4 8128 1 32 4461 4464 0 (0,228) 16386 256 24 \x0400000024000000
5 8096 1 32 4461 4464 0 (4,99) 2 256 24 \x0500000063000000
6 8064 1 32 4461 0 0 (0,6) 2 2304 24 \x060000002b000000
7 8032 1 32 4461 0 0 (0,7) 2 2304 24 \x070000005e000000
8 8000 1 32 4461 0 0 (0,8) 2 2304 24 \x0800000041000000
9 7968 1 32 4461 0 0 (0,9) 2 2304 24 \x090000003c000000
10 7936 1 32 4461 0 0 (0,10) 2 2304 24 \x0a00000034000000

Lecture <2024-04-11 Thu>

1
2
3
4
5
6
CREATE TABLE unc (a int, b int, c int, d int);

INSERT INTO unc (
SELECT a, round(random() * 100000) AS b, round(random() * 100000) AS c, round(random() * 100) AS d
FROM generate_series(1,1000000) g(a))
;
CREATE TABLE
INSERT 0 1000000
1
2
CREATE INDEX unc_idx ON unc (b);
CREATE INDEX clust_idx ON unc (a);
CREATE INDEX
CREATE INDEX
1
ANALYZE unc;
ANALYZE
1
EXPLAIN ANALYZE SELECT sum(d) FROM unc WHERE c < 1000;
QUERY PLAN
Finalize Aggregate (cost=11625.11..11625.12 rows=1 width=8) (actual time=15.214..16.839 rows=1 loops=1)
-> Gather (cost=11624.89..11625.10 rows=2 width=8) (actual time=15.139..16.835 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (cost=10624.89..10624.90 rows=1 width=8) (actual time=11.086..11.086 rows=1 loops=3)
-> Parallel Seq Scan on unc (cost=0.00..10614.33 rows=4224 width=4) (actual time=0.007..10.962 rows=3339 loops=3)
Filter: (c < 1000)
Rows Removed by Filter: 329995
Planning Time: 0.503 ms
Execution Time: 16.879 ms
1
EXPLAIN ANALYZE SELECT sum(d) FROM unc WHERE a < 200000;
QUERY PLAN
Finalize Aggregate (cost=6990.86..6990.87 rows=1 width=8) (actual time=30.605..33.521 rows=1 loops=1)
-> Gather (cost=6990.64..6990.85 rows=2 width=8) (actual time=30.475..33.513 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (cost=5990.64..5990.65 rows=1 width=8) (actual time=23.652..23.653 rows=1 loops=3)
-> Parallel Index Scan using clust_idx on unc (cost=0.42..5776.36 rows=85712 width=4) (actual time=0.053..18.336 rows=66666 loops=3)
Index Cond: (a < 200000)
Planning Time: 1.492 ms
Execution Time: 33.609 ms
1
SELECT count(*) FROM unc WHERE b < 1000;
count
99931
1
EXPLAIN ANALYZE SELECT sum(d) FROM unc WHERE b < 100;
QUERY PLAN
Aggregate (cost=2570.54..2570.55 rows=1 width=8) (actual time=7.501..7.503 rows=1 loops=1)
-> Bitmap Heap Scan on unc (cost=16.21..2568.03 rows=1004 width=4) (actual time=0.347..7.333 rows=919 loops=1)
Recheck Cond: (b < 100)
Heap Blocks: exact=846
-> Bitmap Index Scan on unc_idx (cost=0.00..15.96 rows=1004 width=0) (actual time=0.182..0.183 rows=919 loops=1)
Index Cond: (b < 100)
Planning Time: 2.479 ms
Execution Time: 7.604 ms
1
EXPLAIN ANALYZE SELECT sum(d) FROM unc WHERE b < 50000;
QUERY PLAN
Finalize Aggregate (cost=12131.02..12131.03 rows=1 width=8) (actual time=49.272..52.012 rows=1 loops=1)
-> Gather (cost=12130.80..12131.01 rows=2 width=8) (actual time=49.154..52.007 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (cost=11130.80..11130.81 rows=1 width=8) (actual time=42.669..42.669 rows=1 loops=3)
-> Parallel Seq Scan on unc (cost=0.00..10614.33 rows=206588 width=4) (actual time=0.012..31.747 rows=166694 loops=3)
Filter: (b < 50000)
Rows Removed by Filter: 166639
Planning Time: 2.867 ms
Execution Time: 52.157 ms
1
EXPLAIN ANALYZE SELECT sum(d) FROM unc WHERE b < 10 AND a < 100000;
QUERY PLAN
Aggregate (cost=365.90..365.91 rows=1 width=8) (actual time=0.944..0.945 rows=1 loops=1)
-> Bitmap Heap Scan on unc (cost=5.18..365.88 rows=10 width=4) (actual time=0.041..0.938 rows=12 loops=1)
Recheck Cond: (b < 10)
Filter: (a < 100000)
Rows Removed by Filter: 82
Heap Blocks: exact=92
-> Bitmap Index Scan on unc_idx (cost=0.00..5.17 rows=100 width=0) (actual time=0.017..0.017 rows=94 loops=1)
Index Cond: (b < 10)
Planning Time: 1.716 ms
Execution Time: 1.019 ms
1
UPDATE unc SET a = a + 1, b = b + 1 WHERE MOD(a,5) = round(random() * 5);
UPDATE 174266
1
ANALYZE;
ANALYZE
1
EXPLAIN ANALYZE SELECT sum(d) FROM unc WHERE a < 200000;
QUERY PLAN
Finalize Aggregate (cost=13166.77..13166.78 rows=1 width=8) (actual time=31.870..34.994 rows=1 loops=1)
-> Gather (cost=13166.56..13166.77 rows=2 width=8) (actual time=31.764..34.989 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (cost=12166.56..12166.57 rows=1 width=8) (actual time=26.906..26.906 rows=1 loops=3)
-> Parallel Seq Scan on unc (cost=0.00..11961.33 rows=82090 width=4) (actual time=0.012..22.929 rows=66666 loops=3)
Filter: (a < 200000)
Rows Removed by Filter: 266667
Planning Time: 1.070 ms
Execution Time: 35.060 ms

Lecture <2024-04-18 Thu>

Join algorithms

Hash join and nested loop (small table r)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
DROP TABLE IF EXISTS r;
CREATE TABLE r (a int, b int, c int, d int);

INSERT INTO r (
SELECT a, round(random() * 100000) AS b, round(random() * 100000) AS c, round(random() * 100) AS d
FROM generate_series(1,1000) g(a))
;

DROP TABLE IF EXISTS s;
CREATE TABLE s (a int, b int, c int, d int);

INSERT INTO s (
SELECT a, round(random() * 100000) AS b, round(random() * 100000) AS c, round(random() * 100) AS d
FROM generate_series(1,1000000) g(a))
;

ANALYZE;
DROP TABLE
CREATE TABLE
INSERT 0 1000
DROP TABLE
CREATE TABLE
INSERT 0 1000000
ANALYZE
1
2
3
4
EXPLAIN (ANALYZE, VERBOSE, BUFFERS)
SELECT r.*, s.*
FROM r, s
WHERE r.a = s.a;
QUERY PLAN
Gather (cost=1028.50..12267.84 rows=1000 width=32) (actual time=0.582..50.892 rows=1000 loops=1)
Output: r.a, r.b, r.c, r.d, s.a, s.b, s.c, s.d
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=5616
-> Hash Join (cost=28.50..11167.84 rows=417 width=32) (actual time=27.469..43.645 rows=333 loops=3)
Output: r.a, r.b, r.c, r.d, s.a, s.b, s.c, s.d
Hash Cond: (s.a = r.a)
Buffers: shared hit=5616
Worker 0: actual time=41.046..41.047 rows=0 loops=1
Buffers: shared hit=1987
Worker 1: actual time=41.084..41.085 rows=0 loops=1
Buffers: shared hit=1926
-> Parallel Seq Scan on public.s (cost=0.00..9572.67 rows=416667 width=16) (actual time=0.013..19.361 rows=333333 loops=3)
Output: s.a, s.b, s.c, s.d
Buffers: shared hit=5406
Worker 0: actual time=0.014..18.685 rows=348615 loops=1
Buffers: shared hit=1885
Worker 1: actual time=0.015..18.182 rows=337440 loops=1
Buffers: shared hit=1824
-> Hash (cost=16.00..16.00 rows=1000 width=16) (actual time=0.328..0.328 rows=1000 loops=3)
Output: r.a, r.b, r.c, r.d
Buckets: 1024 Batches: 1 Memory Usage: 55kB
Buffers: shared hit=18
Worker 0: actual time=0.352..0.352 rows=1000 loops=1
Buffers: shared hit=6
Worker 1: actual time=0.381..0.381 rows=1000 loops=1
Buffers: shared hit=6
-> Seq Scan on public.r (cost=0.00..16.00 rows=1000 width=16) (actual time=0.118..0.201 rows=1000 loops=3)
Output: r.a, r.b, r.c, r.d
Buffers: shared hit=18
Worker 0: actual time=0.157..0.229 rows=1000 loops=1
Buffers: shared hit=6
Worker 1: actual time=0.187..0.262 rows=1000 loops=1
Buffers: shared hit=6
Planning:
Buffers: shared hit=141
Planning Time: 2.009 ms
Execution Time: 50.986 ms
1
2
3
4
EXPLAIN (ANALYZE, VERBOSE, BUFFERS)
SELECT r.*, s.*
FROM r, s
WHERE r.a BETWEEN s.a - 10 AND s.a + 10;
QUERY PLAN
Nested Loop (cost=0.00..22515424.50 rows=111111111 width=32) (actual time=76.346..50025.108 rows=20945 loops=1)
Output: r.a, r.b, r.c, r.d, s.a, s.b, s.c, s.d
Join Filter: ((r.a >= (s.a - 10)) AND (r.a <= (s.a + 10)))
Rows Removed by Join Filter: 999979055
Buffers: shared hit=5412
-> Seq Scan on public.s (cost=0.00..15406.00 rows=1000000 width=16) (actual time=0.024..68.647 rows=1000000 loops=1)
Output: s.a, s.b, s.c, s.d
Buffers: shared hit=5406
-> Materialize (cost=0.00..21.00 rows=1000 width=16) (actual time=0.000..0.023 rows=1000 loops=1000000)
Output: r.a, r.b, r.c, r.d
Buffers: shared hit=6
-> Seq Scan on public.r (cost=0.00..16.00 rows=1000 width=16) (actual time=0.006..0.063 rows=1000 loops=1)
Output: r.a, r.b, r.c, r.d
Buffers: shared hit=6
Planning:
Buffers: shared hit=113
Planning Time: 0.750 ms
JIT:
Functions: 6
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 0.307 ms, Inlining 47.066 ms, Optimization 16.492 ms, Emission 12.743 ms, Total 76.609 ms
Execution Time: 50043.921 ms
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
DROP TABLE IF EXISTS r;
CREATE TABLE r (a int, b int, c int, d int);

INSERT INTO r (
SELECT a, round(random() * 100000) AS b, round(random() * 100000) AS c, round(random() * 100) AS d
FROM generate_series(1,1000000) g(a))
;

DROP TABLE IF EXISTS s;
CREATE TABLE s (a int, b int, c int, d int);

INSERT INTO s (
SELECT a, round(random() * 100000) AS b, round(random() * 100000) AS c, round(random() * 100) AS d
FROM generate_series(1,1000000) g(a))
;

ANALYZE;
DROP TABLE
CREATE TABLE
INSERT 0 1000000
DROP TABLE
CREATE TABLE
INSERT 0 1000000
ANALYZE
less memory
1
2
3
4
EXPLAIN (ANALYZE)
SELECT r1.*, s.*
FROM r r1, s
WHERE r1.b = s.b
QUERY PLAN
Merge Join (cost=230130.67..399330.10 rows=10947027 width=32) (actual time=290.547..1346.343 rows=9999389 loops=1)
Merge Cond: (r1.b = s.b)
-> Sort (cost=115063.84..117563.84 rows=1000000 width=16) (actual time=146.832..238.683 rows=1000000 loops=1)
Sort Key: r1.b
Sort Method: quicksort Memory: 63639kB
-> Seq Scan on r r1 (cost=0.00..15406.00 rows=1000000 width=16) (actual time=0.136..46.769 rows=1000000 loops=1)
-> Sort (cost=115063.84..117563.84 rows=1000000 width=16) (actual time=140.882..390.812 rows=9999395 loops=1)
Sort Key: s.b
Sort Method: quicksort Memory: 63639kB
-> Seq Scan on s (cost=0.00..15406.00 rows=1000000 width=16) (actual time=0.129..42.435 rows=1000000 loops=1)
Planning Time: 8.273 ms
JIT:
Functions: 7
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 0.176 ms, Inlining 0.000 ms, Optimization 0.194 ms, Emission 2.616 ms, Total 2.986 ms
Execution Time: 1543.345 ms
1
2
SHOW work_mem;
SHOW effective_cache_size;
work_mem
2MB
effective_cache_size
2MB

Merge join (small table r)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
DROP TABLE IF EXISTS r;
CREATE TABLE r (a int, b int, c int, d int);

INSERT INTO r (
SELECT a, round(random() * 100000) AS b, round(random() * 100000) AS c, round(random() * 100) AS d
FROM generate_series(1,1000000) g(a))
;

DROP TABLE IF EXISTS s;
CREATE TABLE s (a int, b int, c int, d int);

INSERT INTO s (
SELECT a, round(random() * 100000) AS b, round(random() * 100000) AS c, round(random() * 100) AS d
FROM generate_series(1,1000000) g(a))
;

ANALYZE;
DROP TABLE
CREATE TABLE
INSERT 0 1000000
DROP TABLE
CREATE TABLE
INSERT 0 1000000
ANALYZE
1
2
3
4
EXPLAIN (ANALYZE)
SELECT r.*, s.*
FROM r, s
WHERE r.a = s.a;
QUERY PLAN
Hash Join (cost=27906.00..57062.00 rows=1000000 width=32) (actual time=151.138..503.430 rows=1000000 loops=1)
Hash Cond: (r.a = s.a)
-> Seq Scan on r (cost=0.00..15406.00 rows=1000000 width=16) (actual time=0.014..35.329 rows=1000000 loops=1)
-> Hash (cost=15406.00..15406.00 rows=1000000 width=16) (actual time=149.873..149.874 rows=1000000 loops=1)
Buckets: 1048576 Batches: 1 Memory Usage: 55067kB
-> Seq Scan on s (cost=0.00..15406.00 rows=1000000 width=16) (actual time=0.015..38.646 rows=1000000 loops=1)
Planning Time: 7.518 ms
Execution Time: 521.277 ms
1
2
3
4
EXPLAIN (ANALYZE)
SELECT r.*, s.*
FROM r, s
WHERE r.b = s.b;
QUERY PLAN
Hash Join (cost=27906.00..168891.05 rows=10932905 width=32) (actual time=174.887..2121.716 rows=9997000 loops=1)
Hash Cond: (r.b = s.b)
-> Seq Scan on r (cost=0.00..15406.00 rows=1000000 width=16) (actual time=0.024..41.629 rows=1000000 loops=1)
-> Hash (cost=15406.00..15406.00 rows=1000000 width=16) (actual time=174.398..174.399 rows=1000000 loops=1)
Buckets: 1048576 Batches: 1 Memory Usage: 55067kB
-> Seq Scan on s (cost=0.00..15406.00 rows=1000000 width=16) (actual time=0.004..34.842 rows=1000000 loops=1)
Planning Time: 2.192 ms
JIT:
Functions: 10
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 0.214 ms, Inlining 0.000 ms, Optimization 0.663 ms, Emission 11.283 ms, Total 12.161 ms
Execution Time: 2420.994 ms
less memory
  • 5MB for sorting and hashing and 10MB for buffer pool size
1
2
3
4
EXPLAIN (ANALYZE, BUFFERS)
SELECT r.*, s.*
FROM r, s
WHERE r.a = s.a;
QUERY PLAN
Gather (cost=1028.50..12267.84 rows=1000 width=32) (actual time=0.869..50.570 rows=1000 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=5616
-> Hash Join (cost=28.50..11167.84 rows=417 width=32) (actual time=27.042..42.886 rows=333 loops=3)
Hash Cond: (s.a = r.a)
Buffers: shared hit=5616
-> Parallel Seq Scan on s (cost=0.00..9572.67 rows=416667 width=16) (actual time=0.012..19.056 rows=333333 loops=3)
Buffers: shared hit=5406
-> Hash (cost=16.00..16.00 rows=1000 width=16) (actual time=0.359..0.359 rows=1000 loops=3)
Buckets: 1024 Batches: 1 Memory Usage: 55kB
Buffers: shared hit=18
-> Seq Scan on r (cost=0.00..16.00 rows=1000 width=16) (actual time=0.107..0.213 rows=1000 loops=3)
Buffers: shared hit=18
Planning:
Buffers: shared hit=141
Planning Time: 1.244 ms
Execution Time: 50.697 ms

index join

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
DROP TABLE IF EXISTS r;
CREATE TABLE r (a int PRIMARY KEY, b int, c int, d int);

INSERT INTO r (
SELECT a, round(random() * 100000) AS b, round(random() * 100000) AS c, round(random() * 100) AS d
FROM generate_series(900000,1900000) g(a))
;

DROP TABLE IF EXISTS s;
CREATE TABLE s (a int PRIMARY KEY, b int, c int, d int);

INSERT INTO s (
SELECT a, round(random() * 100000) AS b, round(random() * 100000) AS c, round(random() * 100) AS d
FROM generate_series(1,1000000) g(a))
;

ANALYZE;
DROP TABLE
CREATE TABLE
INSERT 0 1000001
DROP TABLE
CREATE TABLE
INSERT 0 1000000
ANALYZE
1
2
3
4
EXPLAIN
SELECT r1.*, s.*
FROM r r1, s
WHERE r1.a = s.a
QUERY PLAN
Merge Join (cost=30539.45..47292.07 rows=1000000 width=32)
Merge Cond: (r1.a = s.a)
-> Index Scan using r_pkey on r r1 (cost=0.42..31389.44 rows=1000001 width=16)
-> Index Scan using s_pkey on s (cost=0.42..31389.42 rows=1000000 width=16)
1
2
3
4
EXPLAIN
SELECT r1.a, s.a
FROM r r1, s
WHERE r1.a = s.a
QUERY PLAN
Merge Join (cost=25665.20..41340.04 rows=1000000 width=8)
Merge Cond: (r1.a = s.a)
-> Index Only Scan using r_pkey on r r1 (cost=0.42..25980.44 rows=1000001 width=4)
-> Index Only Scan using s_pkey on s (cost=0.42..25980.42 rows=1000000 width=4)
1
2
SELECT correlation FROM pg_stats
WHERE tablename = 'r' AND attname = 'a';
correlation
1
only one table clustered
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
DROP TABLE IF EXISTS r;
CREATE TABLE r (a int PRIMARY KEY, b int, c int, d int);

INSERT INTO r (
SELECT a, round(random() * 100000) AS b, round(random() * 100000) AS c, round(random() * 100) AS d
FROM generate_series(900000,1900000) g(a)
ORDER BY random()
)
;

DROP TABLE IF EXISTS s;
CREATE TABLE s (a int PRIMARY KEY, b int, c int, d int);

INSERT INTO s (
SELECT a, round(random() * 100000) AS b, round(random() * 100000) AS c, round(random() * 100) AS d
FROM generate_series(1,1000000) g(a))
;

ANALYZE;
DROP TABLE
CREATE TABLE
INSERT 0 1000001
DROP TABLE
CREATE TABLE
INSERT 0 1000000
ANALYZE
1
2
SELECT correlation FROM pg_stats
WHERE tablename = 'r' AND attname = 'a';
correlation
0.012891567
1
2
3
4
EXPLAIN
SELECT r1.*, s.*
FROM r r1, s
WHERE r1.a = s.a
QUERY PLAN
Hash Join (cost=32789.00..65469.02 rows=1000000 width=32)
Hash Cond: (r1.a = s.a)
-> Seq Scan on r r1 (cost=0.00..15406.01 rows=1000001 width=16)
-> Hash (cost=15406.00..15406.00 rows=1000000 width=16)
-> Seq Scan on s (cost=0.00..15406.00 rows=1000000 width=16)
1
2
3
4
EXPLAIN
SELECT r1.*, s.*
FROM r r1, s
WHERE r1.a = s.a AND r1.B < s.b
QUERY PLAN
Hash Join (cost=32789.00..65469.01 rows=333333 width=32)
Hash Cond: (r1.a = s.a)
Join Filter: (r1.b < s.b)
-> Seq Scan on r r1 (cost=0.00..15406.01 rows=1000001 width=16)
-> Hash (cost=15406.00..15406.00 rows=1000000 width=16)
-> Seq Scan on s (cost=0.00..15406.00 rows=1000000 width=16)

Lecture <2024-04-23 Tue>

  1. distributed databases (5)
  2. NoSQL (3)
  3. query compilation + main memory databases (vector-databases) 3)
  4. statistics - cost estimate (2)
  5. query optimization - join reordering (1)
  6. disk-resident hash index (1)
  7. MVCC
  8. Aries (1)
  9. light-weight indexes (zone maps) + partitioning

Lecture <2024-04-25 Thu>

relational algebra -> SQL transaction

  • $\Pi_{a_1, \ldots, a_n}(Q)$
1
2
SELECT a1, ..., an
FROM R
  • $\sigma_C(R)$
1
2
3
SELECT *
FROM R
WHERE C
  • $R \cup S$
1
2
3
SELECT * FROM R
UNION
SELECT * FROM S
  • $R \bowtie S$
1
2
SELECT *
FROM R NATURAL JOIN S
  • $\gamma_{f(a); G}(R)$
1
2
3
SELECT f(A)
FROM R
[ GROUP BY G ]
  • R(A,B,C), S(D,E,F), T(G,H)
  • $(\gamma_{sum(A), E}(R \bowtie_{C=D} S)) \bowtie_{E = G} T$
1
2
3
4
5
6
7
8
SELECT *
FROM (
 SELECT sum(A)
 FROM (SELECT *
       FROM R JOIN S ON (C = D)
) sub2
 GROUP BY E
) l JOIN  T ON (E = G);

ER to SQL

1
2
3
4
5
CREATE TABLE course (
courseid INT PRIMARY KEY,
title TEXT,
credits INT
);

ER-diagram (create one from requirements)

Last updated on 24 Aug 2020
Published on 24 Aug 2020