05-advanced-sql-queries/advanced-sql-queries.ipynb


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bac3442e",
   "metadata": {},
   "outputs": [],
   "source": [
    "import agh_db_lectures\n",
    "agh_db_lectures.prepare_notebook_for_sql()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6430579f",
   "metadata": {},
   "source": [
    "# Advanced SQL queries"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bc6d5613",
   "metadata": {},
   "source": [
    "_notes_\n",
    "\n",
    "- aggregating selected data (sum, count…)\n",
    "- order of query results\n",
    "- combining data from multipe tables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f79c3828",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Open the diagram in an image viewer for more convenience.\n",
    "agh_db_lectures.nw_diagram.download_open()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "59ffd66d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Connect to the (locally-hosted) database.\n",
    "%sql postgresql://demo_user:demo_pwd@localhost:25432/agh_it_northwind"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "44553e37",
   "metadata": {},
   "outputs": [],
   "source": [
    "agh_db_lectures.download_restore_nw_postgres_dump()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dc15108a",
   "metadata": {},
   "source": [
    "## The \"ORDER BY\" and \"LIMIT\" clauses"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f8865659",
   "metadata": {},
   "source": [
    "Task: find X first order shipments."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "95ab5146",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql\n",
    "SELECT order_id, shipped_date, ship_city, ship_country\n",
    "FROM orders"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "08a740f6",
   "metadata": {},
   "source": [
    "_notes_\n",
    "\n",
    "- Theodore, what would you like \"X\" to be?\n",
    "- Normally, the order of result rows is completely arbitrary.\n",
    "- Here we see that it does not match our requirement.\n",
    "- Add `ORDER BY shipped_date` — we can impose ordering this way.\n",
    "- The order inside groups with the same date is still arbitrary.\n",
    "  - We can break these ties as well.\n",
    "  - Make it `ORDER BY shipped_data, ship_country` (or analogous) — a comma separated list of sort keys can be used.\n",
    "- But there are still ties, e.g. on 1996-09-10.\n",
    "- We can add `, ship_city` to break it.\n",
    "- Change to `ship_country || ':' || ship_city`.\n",
    "  - Sort keys can be expressions rather than just column names.\n",
    "  - No need for the sort key expressions to actually be selected.\n",
    "- Add `, order_id` to break tie on 1996-08-21.\n",
    "- Now we have a deterministic result sequence, which is no longer (just) a set.\n",
    "- Add `LIMIT BY 100` — we can have the query only return the first 100 rows.\n",
    "- This is useful for paging when an application presents a sublist of entries and the user can click to see the next page.\n",
    "  - How to get data of the next page?\n",
    "  - Add `OFFSET 100` before (or after) `LIMIT`.\n",
    "  - Change to `OFFSET 2*100`.\n",
    "  - Theodore, what application that you know uses this kind of paging?\n",
    "  - When a query is performed, the effective order of computation is:\n",
    "    - `FROM`,\n",
    "    - `WHERE`,\n",
    "    - `ORDER BY`,\n",
    "    - `OFFSET`, and\n",
    "    - `LIMIT`.\n",
    "  - `OFFSET` works, but what if entries at page borders get modified before the user clicks the \"next page\" button?\n",
    "    - The user could see the same row again at the beginning of the subsequent page.\n",
    "    - Or worse, the user could miss a row.\n",
    "  - Remove `OFFSET` and add `WHERE (shipped_date, ship_country || ':' || ship_city, order_id) > ('1996-11-11', 'Germany:Cunewalde', 10345)`.\n",
    "- We are using 2 row expressions to perform comparison on their resulting tuples :)\n",
    "- Application, when requesting a page, could provide a (date,country+city,id) tuple instead of a page number.\n",
    "- This approach to paging\n",
    "  - is more robust,\n",
    "  - but somewhat harder to get right\n",
    "  - and requires more queries to additionally\n",
    "    - compute the current position (aka page number) of the user, and\n",
    "    - allow easily navigating to, e.g., the middle page.\n",
    "- Add `ASC` to every sort key — this is the default.\n",
    "- What if we instead wanted to sort from newest dates?\n",
    "  - Use `shipped_date DESC,`.\n",
    "  - No `WHERE` initially.\n",
    "  - Use `COALESCE(shipped_date, DATE('now') + 1) DESC`.\n",
    "  - Add `WHERE ('1998-04-06' - COALESCE(shipped_date, DATE('now') + 1), ship_country || ':' || ship_city, order_id) > (0, 'UK:London', 10987)`.\n",
    "- What if we wanted to implement paging when, e.g., sorting ascending by country and decending by city?\n",
    "  - Give up tuple comparisons and express it as a logical expression with more `AND`s and `OR`s.\n",
    "  - Verbose but doable.\n",
    "- Now, if we had instead interpreted the initial tast to possibly return a 101st row if it has a date tie with 100th, then we'd need to use a subquery which we'll do later today."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f456b374",
   "metadata": {},
   "source": [
    "## Cartesian product"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "da32a14b",
   "metadata": {},
   "outputs": [],
   "source": [
    "%sql --file dbmses-languages.sql"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a2a13c8d",
   "metadata": {},
   "source": [
    "_notes_\n",
    "\n",
    "- We shall start with simpler tables and move to Northwind later.\n",
    "- The `dbmses-languages.sql` script creates and populates 2 tables with\n",
    "  - programming languages, and\n",
    "  - Database Management Systems."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "95dc98e7",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql\n",
    "SELECT * FROM languages"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4b7feee1",
   "metadata": {},
   "source": [
    "_notes_\n",
    "\n",
    "- We have just 5 languages in the table to keep the example simple.\n",
    "- Erlang is developed and used by Ericsson, a company that students of teleinformatics here probably know well ;)\n",
    "- The other langs are probably known by everyone.  Yup, Java and JavaScript are different things."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3725f897",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql\n",
    "SELECT *\n",
    "FROM DBMSes"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "060b8d83",
   "metadata": {},
   "source": [
    "_notes_\n",
    "\n",
    "- Here, we just have entries for a few selected DBs for simplicity.\n",
    "- It is good to have a multi-column table for the examples, so we include information on\n",
    "  - whether the DBMB is specific to a single programming language (if so, which one, otherwise, we store NULL) and\n",
    "  - whether the DBMS is available under a free/libre software license.\n",
    "- Mnesia is a distributed database engine written in Erlang.\n",
    "  - It is part of a typical Erlang installation.\n",
    "  - It is a relational DBMS, except for value atomicity requirement, but SQL DBs commonly violate that one as well.\n",
    "  - Unusually for a relational DBMS, Mnesia doesn't use SQL.\n",
    "  - Erlang data structures are used to to interface with Mnesia.\n",
    "- IndexedDB is available to scripts executing on websites to store data in.\n",
    "  - It is actually a database API in JavaScript rather than a single DBMS, I admit.\n",
    "  - IndexedDB data is kept on disk by the web browser on behalf of websites, similarly to cookies.\n",
    "  - It is another example of a relational database that is not queried with SQL.\n",
    "- Keep in mind these are toy tables — we would design the schema differently for a serious database.\n",
    "- Let's write a query that will produce possible DB-language combinations, for example for a software project.\n",
    "- Make it `FROM DBMSes, languages`.\n",
    "- There can be multiple comma-separated tables (well, table expressions) listed after the `FROM` keyword.\n",
    "- Our query result is a table that is a cartesian product of all the command-separated table names.\n",
    "  - Each row with each row.\n",
    "  - We had $4$ and $5$ rows, now we have $4\\cdot5 = 20$ rows.\n",
    "- Our cartesian product includes combinations of Mnesia and languages other than Erlang — these would not be practical for a software project.\n",
    "  - Same issue with IndexedDB.\n",
    "  - We can add `WHERE DBMSes.language_name IS NULL OR DBMSes.language_name = languages.language_name` to filter the 20 result rows that we have.\n",
    "  - As you see, we can use the syntax `*table_name*.*column_name*` to refer to column values of a pecific table.\n",
    "  - The result now makes sense WRT Mnesia and IndexedDB.\n",
    "- The `DBMSes.language_name` column is useful for filtering, but it is not necessary to have it in the result table.\n",
    "- Use `SELECT DBMSes.DBMS_name, DBMSes.DBMS_available_as_FLOSS, languages.language_name`.\n",
    "- The `*table_name*.*column_name*` syntax works as well in `SELECT` clause.\n",
    "- We can give aliases to input table names to make the query less verbose.\n",
    "  - Use `FROM DBMSes D, languages l`.\n",
    "  - Change `DBMSes.` to `D.` everywhere.\n",
    "  - Analogously with `languages.`.\n",
    "- We could also use the `AS` keyword in `FROM`.\n",
    "  - Make it `DBMSes AS D`.\n",
    "  - Works the same.\n",
    "  - Notably, Oracle would not allow `AS` in `FROM`.\n",
    "  - A foolproof approach would be to\n",
    "    - use `AS` for column aliases in `SELECT`, like `D.DBMS_available_as_FLOSS AS FLOSS_stack`, and\n",
    "    - omit `AS` for table aliases in `FROM`.\n",
    "- Additionally, we can omit `*table_name*.` whenever the column name is unambigious.\n",
    "  - Change `D.DBMS_name` → `DBMS_name`, and\n",
    "  - `D.DBMS_available_as_FLOSS` → `DBMS_available_as_FLOSS`.\n",
    "- This is a basic example of how we can combine the data from multiple (here, 2) tables.\n",
    "- Let's apply this knowledge to querying the Northwind DB."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7a82cf92",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql\n",
    "SELECT -- fill in\n",
    "FROM , -- fill in\n",
    "--WHERE -- fill in"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "38ae646f",
   "metadata": {},
   "source": [
    "_notes_\n",
    "\n",
    "- As you see in the Northwind ER diagram, certain ids appear in more than one table.\n",
    "- They are there to allow correlating rows from one table to rows from another table.\n",
    "- Theodore, as our example, would you like us to display\n",
    "  - products with their category names and supplier names, or\n",
    "  - orders with their customer names and responsible employee names?\n",
    "- `SELECT p.product_id, p.product_name, c.category_name FROM products p, categories c` — get a cartesian product.\n",
    "- `WHERE p.category_id = c.category_id` — limit it to interesting rows.\n",
    "- Add\n",
    "  - `, s.company_name AS supplier_name` to `SELECT` clause,\n",
    "  - `, suppliers s` to `FROM` clause, and\n",
    "  - ` AND p.supplier_id = s.supplier_id` to `WHERE` clause.\n",
    "- We get a pretty human-readable listing of products."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6b1b7d6e",
   "metadata": {},
   "source": [
    "## Basic \"JOIN\"s"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eaafb4f7",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql\n",
    "SELECT D.DBMS_name,\n",
    "       D.DBMS_available_as_FLOSS AS FLOSS_stack,\n",
    "       l.language_name\n",
    "FROM DBMSes AS D, languages AS l\n",
    "WHERE D.language_name IS NULL OR\n",
    "      D.language_name = l.language_name"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "52b65be7",
   "metadata": {},
   "source": [
    "_notes_\n",
    "\n",
    "- Let's get back to the programming languages example.\n",
    "- Modern SQL standard revisions allow using `CROSS JOIN` instead of a comma (\",\") to produce a cartesian product of tables.\n",
    "- SQL also provides an alternative syntax for working with a **subset of** a cartesian product.\n",
    "- Replace `FROM` and `WHERE` clauses with `FROM DBMSes AS D INNER JOIN languages AS l ON D.language_name IS NULL OR D.language_name = l.language_name`.\n",
    "- The result is the same.\n",
    "- The logic used to be \"Get the cartesian product, then filter it according to \\*condition\\*.\"\n",
    "- Now the logic is \"Get combined rows according to \\*condition\\*.\"\n",
    "- It does **not** mean that the latter is more efficient!  A DBMS is free to optimize both the old and the new query.\n",
    "- The semantic difference is that, when combining data from more than 2 tables, `JOIN` takes precedence over `,` in the `FROM` clause.\n",
    "- Other than that, the major difference is that `INNER JOIN` might be more readable than a cartesian product (unless a cartesian product is what we want conceptually).\n",
    "- We can still use `WHERE` to additionally filter the combined rows.\n",
    "  - Let's say I don't like relying on proprietary systems.\n",
    "  - Add `WHERE D.DBMS_available_as_FLOSS`.\n",
    "  - Technically, it would also be possible to include this in the `INNER JOIN` condition.\n",
    "  - Add parentheses around `INNER JOIN` condition and change `WHERE` to `AND`.\n",
    "  - Note the use of parentheses.\n",
    "  - The result is the same, but it is arguably less readable, go back to using `WHERE` after `INNER JOIN`.\n",
    "- As a **digression**, note that result column name (`FLOSS_stack`) cannot be used in `WHERE`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "360eb35d",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql\n",
    "-- paste in the code from the Northwind cartesian product example"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "506a4681",
   "metadata": {},
   "source": [
    "_notes_\n",
    "- Use `products p INNER JOIN categories c ON p.category_id = c.category_id` and remove part of the `WHERE` condition.\n",
    "- Note that we have 2 comma-separated table expressions in the `FROM` clause now.\n",
    "- Change\n",
    "  - `, suppliers s` → `INNER JOIN suppliers s`, and\n",
    "  - `WHERE` to `ON`.\n",
    "- One might guess that `table_A.some_id = table_B.some_id` is a common type of an `INNER JOIN` condition.\n",
    "- If the `some_id` column is named the same in both tables, we can use an alternative syntax.\n",
    "- Comment out `ON p.category_id = c.category_id` and put `USING(category_id)` there.\n",
    "- In case of our particular query, the behavior of this `INNER JOIN` is the same.\n",
    "- The differences are as follows.\n",
    "  - The result of the `INNER JOIN` does not contain the duplicate column.\n",
    "    - Add `, category_id` to the `SELECT` clause.\n",
    "    - It works.\n",
    "    - Temporarily revert to `ON` instead of `USING` in the `FROM` clause.\n",
    "    - It now complains about column name `category_id` being ambigious.\n",
    "  - The `USING` form of `INNER JOIN` can be more readable.\n",
    "- We could also require the equivalence of multiple columns between 2 tables.\n",
    "  - It works by specifying multiple comma-separated column names in `USING`.\n",
    "  - Hard to demonstrate on Northwind DB.\n",
    "- Note that Mocrosoft's SQL Server notably does not support `USING` with `JOIN`s.\n",
    "- Analogously, change to `USING` in the second `INNER JOIN`.\n",
    "- Looking at the ER diagram, note that `category_id` is the only common column name between `products` and `categories`\n",
    "  - In such cases, under Postgres we can use the `NATURAL INNER JOIN` syntactic construct.\n",
    "  - Change `INNER JOIN categories c USING(category_id)` → `NATURAL INNER JOIN categories c`.\n",
    "  - Works as `USING(*list_of_commonly_named_table_columns*)`.\n",
    "  - Analogously with `supplier_id`, but note that this is actually dangerous.\n",
    "  - The `NATURAL INNER JOIN` construct relies on the columns names of input tables.\n",
    "  - What if\n",
    "    - I change the second `INNER JOIN` in this query to `NATURAL INNER JOIN`, and\n",
    "    - the `suppliers` table later gets modified to include a new `description` column?\n",
    "  - I advise against using `NATURAL INNER JOIN`.\n",
    "  - Similarly to `USING`, the `NATURAL INNER JOIN` is supported by Postgres, Oracle, MySQL and SQLite but not the SQL Server.\n",
    "- We shall get back to `JOIN`s later on."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bbb59dc9",
   "metadata": {},
   "source": [
    "## The \"COUNT\" function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eec00a18",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql\n",
    "SELECT *\n",
    "FROM customers;"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4cf93e85",
   "metadata": {},
   "source": [
    "_notes_\n",
    "\n",
    "- We see data of different companies, where some columns have unique values (e.g., `customer_id`) and some have non-unique values (e.g., `country`).\n",
    "-Theodore, which attributes from the customers table would you like to work with now?\n",
    "- Jupyter tells us the number of selected rows, but that number won't be readily available to a program that executes the query through an API.\n",
    "- A program could fetch the result and count rows by itself, but… what a waste of resources :c\n",
    "- Replace with `COUNT(*chosen_attribute*)`.\n",
    "- Note that use of aggregation changed the number of result rows.\n",
    "- We get the same result with `COUNT(city)` and `COUNT(country)`.\n",
    "- Replace with `COUNT(fax)`, result is different.\n",
    "  - One could conclude that `COUNT` is not merely used to count rows.\n",
    "  - It does not count unique values, either (countries are not unique in this table).\n",
    "  - It counts non-null values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a108bae1",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql\n",
    "SELECT 'any string'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7325c3a8",
   "metadata": {},
   "source": [
    "_notes_\n",
    "\n",
    "- Wrap in `COUNT()`.\n",
    "  - It counted the single row that tableless `SELECT` could return.\n",
    "- Try\n",
    "  - `COUNT(1)`,\n",
    "  - `COUNT(0)`,\n",
    "  - `COUNT(TRUE)`, and\n",
    "  - `COUNT(FALSE)` — it is easy to forget that this one gives 1.\n",
    "- `COUNT(NULL)` — the only one that gives 0."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8873abd4",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql\n",
    "SELECT 100 - COUNT(customer_id)\n",
    "FROM customers"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c4cbb903",
   "metadata": {},
   "source": [
    "_notes_\n",
    "\n",
    "- We can perform further computation on the result of aggregation functions.\n",
    "- Actually, there is one more elegant way to count rows — `COUNT(*)`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "062bda6a",
   "metadata": {},
   "source": [
    "## The \"GROUP BY\" clause"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "865298df",
   "metadata": {},
   "source": [
    "Task: using orders table, get the date of last order of company 🥕."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e7eecff9",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql\n",
    "SELECT company_name, customer_id\n",
    "FROM customers\n",
    "\n",
    "--SELECT -- fill in\n",
    "--FROM orders\n",
    "--WHERE customer_id = 🥕"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7045ed55",
   "metadata": {},
   "source": [
    "_notes_\n",
    "\n",
    "- Theodore, which company would you like the carrot to stand for?\n",
    "- Comment the current query command and uncomment the second one.\n",
    "- Fill in the `SELECT` clause with `order_date` to show order dates.\n",
    "- Change it to `MAX(order_date)`.\n",
    "- `MAX` happens to be another aggregate function, line `COUNT`.\n",
    "- `MIN` works analogously (change to it, but go back to `MAX`).\n",
    "- The query is now good… for getting the last order date of a single customer.\n",
    "- What if we wanted to get last order dates of all customers?\n",
    "- A loop would be useful, but\n",
    "  - we have not learned the procedural extensions to SQL yet, and\n",
    "  - these extensions are nonstandard, anyway."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "244f9dbb",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql\n",
    "SELECT customer_id, MAX(order_date) AS last_order_date\n",
    "FROM orders\n",
    "GROUP BY customer_id"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "514e9f48",
   "metadata": {},
   "source": [
    "_notes_\n",
    "\n",
    "- `GROUP BY` - makes aggragation functions like `COUNT` and `MAX` operate on smaller **groups** rather than all tuples.\n",
    "- Add\n",
    "  - `JOIN customers USING(customer_id)` to the `FROM` clause,\n",
    "  - `, company_name` to the `GROUP BY` clause, and\n",
    "  - `company_name, ` as the first expression in the `SELECT` clause.\n",
    "- So, `GROUP BY` allowed us to solve the earlier task for all possible values of carrot at once.\n",
    "- Note that you **can**\n",
    "    - perform operations on values before an aggregate function is applied, e.g., `MAX(EXTRACT(YEAR FROM order_date))` in Postgres and Oracle\n",
    "      - (`YEAR(order_date)` in MySQL/MariaDB, SQL Server and some others),\n",
    "    - perform operations on an aggregate function's result, e.g., `EXTRACT(YEAR FROM MAX(order_date))`,\n",
    "    - group by a column without directly using it in the `SELECT` clause,\n",
    "    - use expressions in the `GROUP BY` clause, e.g., `EXTRACT(YEAR FROM order_date)` to group by year and get the last order date in each year by each customer, and\n",
    "    - use aggregate functions in the `ORDER BY` clause, e.g., `ORDER BY MAX(order_date) DESC`.\n",
    "- Note that you **can't**\n",
    "  - use aggregate functions in `WHERE`, `FROM`, and — obviously — the `GROUP BY` clause, and\n",
    "  - omit in `GROUP BY` a column name that is referenced outside of aggregate functions in query's `SELECT` clause.\n",
    "- Note that the effective order of operations is\n",
    "  - `FROM`\n",
    "  - `WHERE`\n",
    "  - `GROUP BY`, and\n",
    "  - `ORDER BY`.\n",
    "- So, if we want to get each company's last order in 1997, we can add `WHERE EXTRACT(YEAR FROM order_date) = 1997`.\n",
    "- What if we instead wanted to get companies that made their last orders in 1997?\n",
    "  - I.e., we want to filter rows **after** grouping is performed.\n",
    "  - Add `HAVING EXTRACT(YEAR FROM MAX(order_date)) = 1997` after the `GROUP BY` clause.\n",
    "  - `HAVING` is similar to `WHERE` but affects the result of grouping.\n",
    "    - Note that, similarly to `WHERE`, the `HAVING` clause is not aware of result table's column names.\n",
    "    - E.g., `HAVING EXTRACT(YEAR FROM last_order_date) = 1997` gives an error.\n",
    "  - `HAVING` can be used together with `WHERE` in a single query.\n",
    "  - As in `SELECT`, is you reference a column name in `HAVING`, it has to be either\n",
    "    - inside an aggregate function call, or\n",
    "    - present in the `GROUP BY` clause.\n",
    "  - E.g.,\n",
    "    - `HAVING country = 'Poland'` only works after you add `country` to the `GROUP BY` clause, and\n",
    "    - `HAVING EXTRACT(YEAR FROM order_date) = 1997` des not work."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "26c69093",
   "metadata": {},
   "source": [
    "## Other aggregate functions"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b9c902ec",
   "metadata": {},
   "source": [
    "Try out `AVG()`, `SUM()`, and `EVERY()` (on boolean values)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "67f5d3b3",
   "metadata": {},
   "source": [
    "## Subqueries and temporary tables"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "880c7bd7",
   "metadata": {},
   "source": [
    "Task: find 104 first order shipments.  If there are more shipments made on the same day as the 104th one, include all of them in the result."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "438fae47",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql\n",
    "SELECT order_id, shipped_date, ship_city, ship_country\n",
    "FROM orders o"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "428d9efc",
   "metadata": {},
   "source": [
    "_notes_\n",
    "\n",
    "- One approach would be to compute the number of orders that shipped before a specific tested order.\n",
    "- If the number of earlier orders is less than 104, then the tested order is among the first 104 ones (even if ex-equo on the 104th place).\n",
    "- Theodore, is our current knowledge enough to compute the number of orders shipped before date X for certain X?\n",
    "  - Could you choose an X from the dates you see?\n",
    "- Comment out the current query.\n",
    "- Add `SELECT COUNT(*) FROM orders WHERE shipped_date < '1997-01-15' -- some X` below.\n",
    "- We get the number of orders shipped before.\n",
    "- We can change `COUNT(*)` → `COUNT(*) < 104` to instead get a boolean value telling us if there were less than 104 orders before.\n",
    "- Notice that the query result is a 1-row, 1-column table.\n",
    "- This allows us to blatantly do this.\n",
    "  - Uncomment the initial query above.\n",
    "  - Add `WHERE` right after it.\n",
    "  - Wrap the second query in parentheses (\"()\").\n",
    "  - Change the chosen \"X\" date to `COALESCE(o.shipped_date, DATE('now'))`.\n",
    "- We've just had Postgres execute a `SELECT` command that involved executing another `SELECT` command for every row of the `orders` table.\n",
    "- Let's make it slightly more readable.\n",
    "  - Change alias `outer_orders` to `o`.\n",
    "  - In the inner `SELECT`, use alias `inner_orders` analogously.\n",
    "  - Have the inner `SELECT` return just number with `COUNT(*)` and compare the entire subquery to 104.\n",
    "- Add `ORDER BY shipped_date`.\n",
    "- We can also add a similar subquery in the `SELECT` clause.\n",
    "- The subquery depends on a value from the outer query.\n",
    "  - I.e., the `outer_orders.shipped_date`.\n",
    "  - Such subquery is called **correlated** and gives us a lot of flexibility.\n",
    "  - This also means the DMBS **might** need to compute the subquery separately for each row.\n",
    "  - That can be a performance killer.\n",
    "- Let's\n",
    "  - drop the subquery on `SELECT`, and\n",
    "  - rewrite the subquery in `WHERE` as a non-correlated one,\n",
    "    - e.g., use `WHERE shipped_date IN (SELECT shipped_date FROM orders ORDER BY shipped_date LIMIT 104)`.\n",
    "- If subquery result is a 1-column table, it can be used as the right side of the `IN` operator.\n",
    "- Another _operator_ commonly used with subqueries is `EXISTS` (`NOT EXISTS`),\n",
    "  - e.g. `NOT EXISTS (SELECT shipped_date FROM orders WHERE shipped_date < COALESCE(outer_orders.shipped_date, DATE('now') + 1) ORDER BY shipped_date OFFSET 104)`,\n",
    "  - which tests if subquery result is empty, and\n",
    "  - once again gives us a correlated query :c — get back to the non-correlated one.\n",
    "- A non-correlated subquery can also be used as a table expression in the `FROM` clause,\n",
    "  - e.g. `FROM orders INNER JOIN (SELECT shipped_date FROM orders ORDER BY shipped_date LIMIT 104) USING(shipped_date)`,\n",
    "  - with no `WHERE` in the outer query, and\n",
    "  - with `DISTINCT` added to the outer `SELECT`.\n",
    "- Non-correlated subqueries can be replaced with temporary tables that are perhaps more readable (and more obviously non-correlated!).\n",
    "- Use `WITH initial_dates AS (*our_subquery*) SELECT *all_like_before* FROM orders INNER JOIN initial_dates USING(shipped_date) ORDER BY shipped_date`.\n",
    "- More comma-separated `*table_name* AS (*subquery*)` temporary table terms can come after `WITH`.\n",
    "- Note that in this construct the `AS` keyword is required.\n",
    "- Later temporary table declarations can reference temporary tables specified earlier in the same `WITH` clause.\n",
    "  - Add `, initial_dates_distinct AS (SELECT DISTINCT shipped_date FROM initial_dates)`.\n",
    "  - Use it in the final query instead of `initial_dates`.\n",
    "  - Remove `DISTINCT` from the final query."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "02bb8636",
   "metadata": {},
   "source": [
    "## Outer \"JOIN\"s"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8e8b1e69",
   "metadata": {},
   "source": [
    "Count the number of orders of each customer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d4c37abc",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql\n",
    "SELECT company_name, COUNT(order_id) AS order_count\n",
    "FROM customers INNER JOIN orders USING(customer_id)\n",
    "GROUP BY company_name, customer_id\n",
    "ORDER BY company_name"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "064c37da",
   "metadata": {},
   "source": [
    "_notes_\n",
    "\n",
    "- Theodore, is it clear what this query does?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "730e3207",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql\n",
    "SELECT company_name, 0\n",
    "FROM customers outer_c\n",
    "WHERE NOT EXISTS (SELECT order_id\n",
    "                  FROM orders\n",
    "                  WHERE customer_id = outer_c.customer_id)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2a9c3978",
   "metadata": {},
   "source": [
    "_notes_\n",
    "\n",
    "- Theodore, is it clear what this one does?\n",
    "- These 2 companies do not appear in the result of the former query.\n",
    "- That is because their rows can never meet the `INNER JOIN`'s condition.\n",
    "- We can solve the problem using `UNION`.\n",
    "  - Add `, 0` to `SELECT`.\n",
    "  - Copy the previous query to this cell.\n",
    "  - Add `UNION` between the 2 queries.\n",
    "  - Remove the `ORDER BY` clause.\n",
    "- We are done, now let's see how the same thing is typically done."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ead9b998",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql\n",
    "SELECT company_name, order_id, order_date\n",
    "FROM customers INNER JOIN orders USING(customer_id)\n",
    "WHERE company_name LIKE 'P%'\n",
    "ORDER BY company_name"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b4b149f0",
   "metadata": {},
   "source": [
    "_notes_\n",
    "\n",
    "- Let's look closer at how our `JOIN`'s result looks like before grouping.\n",
    "- We shall only take into account the companies whose names start with \"P\".  The result table is going to be smaller and more comprehensible.\n",
    "- \"Paris spécialités\", which had no orders, obviously does not appead here.\n",
    "- So far we've been only using `INNER JOIN` (and `CROSS JOIN` once).\n",
    "- There is also another type of `JOIN`, the `OUTER JOIN`.\n",
    "- `OUTER JOIN` is similar but causes all rows of the input table to **appear at least once** in the result — even if `OUTER JOIN`'s condition was never satisfied for them.\n",
    "- If needed, it uses `NULL`s in place of nonexistent values that would come from the other input table.\n",
    "- Change `INNER JOIN` to `LEFT OUTER JOIN`.\n",
    "- See?  The special row with `NULL`s is not added for other customers, only for \"Paris spécialités\", which would otherwise not appear in the result.\n",
    "- Recall now that `COUNT` counted non-`NULL` values.\n",
    "- It is perfect for the task of counting the actual number of orders of each company here.\n",
    "  - Add `, COUNT(order_id) AS order_count` and `GROUP BY company_name, customer_id`.\n",
    "  - Remove `, order_date`.\n",
    "- Note that\n",
    "  - `COUNT(order_id)` works properly, because `order_id` is `NULL` in case of the special row, but\n",
    "  - `COUNT(*)` would be bad because it counts the rows in the group without regard to `NULL`s.\n",
    "- What is the purpose of `LEFT` in `LEFT OUTER JOIN`?\n",
    "  - Comment out the `GROUP BY` and `COUNT` parts.\n",
    "  - `LEFT OUTER JOIN`, as name suggests, ensures that all rows of the left table appear in its result.\n",
    "  - Swap table names around `LEFT OUTER JOIN`.\n",
    "  - \"Paris spécialités\" is gone again.\n",
    "  - We could now instead use `RIGHT OUTER JOIN` which functions as you'd expect.\n",
    "  - \"Paris spécialités\" is back there."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "43d48810",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql\n",
    "SELECT employee_id, first_name, last_name, reports_to\n",
    "FROM employees"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aafb595c",
   "metadata": {},
   "source": [
    "_notes_\n",
    "\n",
    "- Let's now look at the `employees` table.\n",
    "- An employee can have another employee as a boss.\n",
    "  - This is recorded in the `reports_to` attribute.\n",
    "  - The value of `reports_to`, if present, is another employee's id."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "73e9440f",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql\n",
    "SELECT se.first_name || ' ' || se.last_name AS subordinate,\n",
    "       me.first_name || ' ' || me.last_name AS manager\n",
    "FROM                 employees se\n",
    "     LEFT OUTER JOIN employees me\n",
    "         ON se.reports_to = me.employee_id"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d08cea04",
   "metadata": {},
   "source": [
    "_notes_\n",
    "\n",
    "- We'd like to list all emploees together with their managers.\n",
    "- Theodore, is this query comprehensible?\n",
    "- We have employees and their bosses in the 2 columns of the result.\n",
    "  - Andrew Fuller is the only employee without real boss.\n",
    "  - Notice how concatenating a string with `NULL` yields a `NULL` value.\n",
    "- `LEFT OUTER JOIN` allowed us to have every employee appear in the left column, with possible `NULL`s in the right one.\n",
    "- We can instead use `RIGHT OUTER JOIN` to have every employee appear in the right column, with possible `NULL`s in the left one.\n",
    "- SQL also allows us to get both of those behaviors at once.\n",
    "  - I.e., do an `OUTER JOIN` that guarantees that each row from both the left and the right table appears in the result.\n",
    "  - It is, in a sense, an `OUTER JOIN` that is `RIGHT` and `LEFT` at the same time.\n",
    "  - Taking inspiration from boss' surname, change the code to use `FULL OUTER JOIN`.\n",
    "  - This one is definitely less frequently used than one-sided `OUTER JOIN`s.\n",
    "  - However, it can be argued that a table as produced by `FULL OUTER JOIN` allows more things to be looked up easily by a human reader.\n",
    "- Now, a useful hint.  The `OUTER` and `INNER` keywords can be dropped when writing queries :)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "/gnu/store/q35bxk1i5blx0rcgxphhrq9xqmgha9bz-python-3.11.11/bin/python3",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}