AGH-db-lectures - Lecture materials for databases course at AGH

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bac3442e",
   "metadata": {},
   "outputs": [],
   "source": [
    "import agh_db_lectures\n",
    "agh_db_lectures.prepare_notebook_for_sql()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6430579f",
   "metadata": {},
   "source": [
    "# Advanced SQL queries"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bc6d5613",
   "metadata": {},
   "source": [
    "_notes_\n",
    "\n",
    "- aggregating selected data (sum, count…)\n",
    "- order of query results\n",
    "- combining data from multipe tables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f79c3828",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Open the diagram in an image viewer for more convenience.\n",
    "agh_db_lectures.nw_diagram.download_open()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "59ffd66d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Connect to the (locally-hosted) database.\n",
    "%sql postgresql://demo_user:demo_pwd@localhost:25432/agh_it_northwind"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "44553e37",
   "metadata": {},
   "outputs": [],
   "source": [
    "agh_db_lectures.download_restore_nw_postgres_dump()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dc15108a",
   "metadata": {},
   "source": [
    "## The \"ORDER BY\" and \"LIMIT\" clauses"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f8865659",
   "metadata": {},
   "source": [
    "Task: find X first order shipments."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "95ab5146",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql\n",
    "SELECT order_id, shipped_date, ship_city, ship_country\n",
    "FROM orders"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "08a740f6",
   "metadata": {},
   "source": [
    "_notes_\n",
    "\n",
    "- Theodore, what would you like \"X\" to be?\n",
    "- Normally, the order of result rows is completely arbitrary.\n",
    "- Here we see that it does not match our requirement.\n",
    "- Add `ORDER BY shipped_date` — we can impose ordering this way.\n",
    "- The order inside groups with the same date is still arbitrary.\n",
    "  - We can break these ties as well.\n",
    "  - Make it `ORDER BY shipped_data, ship_country` (or analogous) — a comma separated list of sort keys can be used.\n",
    "- But there are still ties, e.g. on 1996-09-10.\n",
    "- We can add `, ship_city` to break it.\n",
    "- Change to `ship_country || ':' || ship_city`.\n",
    "  - Sort keys can be expressions rather than just column names.\n",
    "  - No need for the sort key expressions to actually be selected.\n",
    "- Add `, order_id` to break tie on 1996-08-21.\n",
    "- Now we have a deterministic result sequence, which is no longer (just) a set.\n",
    "- Add `LIMIT BY 100` — we can have the query only return the first 100 rows.\n",
    "- This is useful for paging when an application presents a sublist of entries and the user can click to see the next page.\n",
    "  - How to get data of the next page?\n",
    "  - Add `OFFSET 100` before (or after) `LIMIT`.\n",
    "  - Change to `OFFSET 2*100`.\n",
    "  - Theodore, what application that you know uses this kind of paging?\n",
    "  - When a query is performed, the effective order of computation is:\n",
    "    - `FROM`,\n",
    "    - `WHERE`,\n",
    "    - `ORDER BY`,\n",
    "    - `OFFSET`, and\n",
    "    - `LIMIT`.\n",
    "  - `OFFSET` works, but what if entries at page borders get modified before the user clicks the \"next page\" button?\n",
    "    - The user could see the same row again at the beginning of the subsequent page.\n",
    "    - Or worse, the user could miss a row.\n",
    "  - Remove `OFFSET` and add `WHERE (shipped_date, ship_country || ':' || ship_city, order_id) > ('1996-11-11', 'Germany:Cunewalde', 10345)`.\n",
    "- We are using 2 row expressions to perform comparison on their resulting tuples :)\n",
    "- Application, when requesting a page, could provide a (date,country+city,id) tuple instead of a page number.\n",
    "- This approach to paging\n",
    "  - is more robust,\n",
    "  - but somewhat harder to get right\n",
    "  - and requires more queries to additionally\n",
    "    - compute the current position (aka page number) of the user, and\n",
    "    - allow easily navigating to, e.g., the middle page.\n",
    "- Add `ASC` to every sort key — this is the default.\n",
    "- What if we instead wanted to sort from newest dates?\n",
    "  - Use `shipped_date DESC,`.\n",
    "  - No `WHERE` initially.\n",
    "  - Use `COALESCE(shipped_date, DATE('now') + 1) DESC`.\n",
    "  - Add `WHERE ('1998-04-06' - COALESCE(shipped_date, DATE('now') + 1), ship_country || ':' || ship_city, order_id) > (0, 'UK:London', 10987)`.\n",
    "- What if we wanted to implement paging when, e.g., sorting ascending by country and decending by city?\n",
    "  - Give up tuple comparisons and express it as a logical expression with more `AND`s and `OR`s.\n",
    "  - Verbose but doable.\n",
    "- Now, if we had instead interpreted the initial tast to possibly return a 101st row if it has a date tie with 100th, then we'd need to use a subquery which we'll do later today."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f456b374",
   "metadata": {},
   "source": [
    "## Cartesian product"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "da32a14b",
   "metadata": {},
   "outputs": [],
   "source": [
    "%sql --file dbmses-languages.sql"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a2a13c8d",
   "metadata": {},
   "source": [
    "_notes_\n",
    "\n",
    "- We shall start with simpler tables and move to Northwind later.\n",
    "- The `dbmses-languages.sql` script creates and populates 2 tables with\n",
    "  - programming languages, and\n",
    "  - Database Management Systems."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "95dc98e7",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql\n",
    "SELECT * FROM languages"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4b7feee1",
   "metadata": {},
   "source": [
    "_notes_\n",
    "\n",
    "- We have just 5 languages in the table to keep the example simple.\n",
    "- Erlang is developed and used by Ericsson, a company that students of teleinformatics here probably know well ;)\n",
    "- The other langs are probably known by everyone.  Yup, Java and JavaScript are different things."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3725f897",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql\n",
    "SELECT *\n",
    "FROM DBMSes"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "060b8d83",
   "metadata": {},
   "source": [
    "_notes_\n",
    "\n",
    "- Here, we just have entries for a few selected DBs for simplicity.\n",
    "- It is good to have a multi-column table for the examples, so we include information on\n",
    "  - whether the DBMB is specific to a single programming language (if so, which one, otherwise, we store NULL) and\n",
    "  - whether the DBMS is available under a free/libre software license.\n",
    "- Mnesia is a distributed database engine written in Erlang.\n",
    "  - It is part of a typical Erlang installation.\n",
    "  - It is a relational DBMS, except for value atomicity requirement, but SQL DBs commonly violate that one as well.\n",
    "  - Unusually for a relational DBMS, Mnesia doesn't use SQL.\n",
    "  - Erlang data structures are used to to interface with Mnesia.\n",
    "- IndexedDB is available to scripts executing on websites to store data in.\n",
    "  - It is actually a database API in JavaScript rather than a single DBMS, I admit.\n",
    "  - IndexedDB data is kept on disk by the web browser on behalf of websites, similarly to cookies.\n",
    "  - It is another example of a relational database that is not queried with SQL.\n",
    "- Keep in mind these are toy tables — we would design the schema differently for a serious database.\n",
    "- Let's write a query that will produce possible DB-language combinations, for example for a software project.\n",
    "- Make it `FROM DBMSes, languages`.\n",
    "- There can be multiple comma-separated tables (well, table expressions) listed after the `FROM` keyword.\n",
    "- Our query result is a table that is a cartesian product of all the command-separated table names.\n",
    "  - Each row with each row.\n",
    "  - We had $4$ and $5$ rows, now we have $4\\cdot5 = 20$ rows.\n",
    "- Our cartesian product includes combinations of Mnesia and languages other than Erlang — these would not be practical for a software project.\n",
    "  - Same issue with IndexedDB.\n",
    "  - We can add `WHERE DBMSes.language_name IS NULL OR DBMSes.language_name = languages.language_name` to filter the 20 result rows that we have.\n",
    "  - As you see, we can use the syntax `*table_name*.*column_name*` to refer to column values of a pecific table.\n",
    "  - The result now makes sense WRT Mnesia and IndexedDB.\n",
    "- The `DBMSes.language_name` column is useful for filtering, but it is not necessary to have it in the result table.\n",
    "- Use `SELECT DBMSes.DBMS_name, DBMSes.DBMS_available_as_FLOSS, languages.language_name`.\n",
    "- The `*table_name*.*column_name*` syntax works as well in `SELECT` clause.\n",
    "- We can give aliases to input table names to make the query less verbose.\n",
    "  - Use `FROM DBMSes D, languages l`.\n",
    "  - Change `DBMSes.` to `D.` everywhere.\n",
    "  - Analogously with `languages.`.\n",
    "- We could also use the `AS` keyword in `FROM`.\n",
    "  - Make it `DBMSes AS D`.\n",
    "  - Works the same.\n",
    "  - Notably, Oracle would not allow `AS` in `FROM`.\n",
    "  - A foolproof approach would be to\n",
    "    - use `AS` for column aliases in `SELECT`, like `D.DBMS_available_as_FLOSS AS FLOSS_stack`, and\n",
    "    - omit `AS` for table aliases in `FROM`.\n",
    "- Additionally, we can omit `*table_name*.` whenever the column name is unambigious.\n",
    "  - Change `D.DBMS_name` → `DBMS_name`, and\n",
    "  - `D.DBMS_available_as_FLOSS` → `DBMS_available_as_FLOSS`.\n",
    "- This is a basic example of how we can combine the data from multiple (here, 2) tables.\n",
    "- Let's apply this knowledge to querying the Northwind DB."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7a82cf92",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql\n",
    "SELECT -- fill in\n",
    "FROM , -- fill in\n",
    "--WHERE -- fill in"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "38ae646f",
   "metadata": {},
   "source": [
    "_notes_\n",
    "\n",
    "- As you see in the Northwind ER diagram, certain ids appear in more than one table.\n",
    "- They are there to allow correlating rows from one table to rows from another table.\n",
    "- Theodore, as our example, would you like us to display\n",
    "  - products with their category names and supplier names, or\n",
    "  - orders with their customer names and responsible employee names?\n",
    "- `SELECT p.product_id, p.product_name, c.category_name FROM products p, categories c` — get a cartesian product.\n",
    "- `WHERE p.category_id = c.category_id` — limit it to interesting rows.\n",
    "- Add\n",
    "  - `, s.company_name AS supplier_name` to `SELECT` clause,\n",
    "  - `, suppliers s` to `FROM` clause, and\n",
    "  - ` AND p.supplier_id = s.supplier_id` to `WHERE` clause.\n",
    "- We get a pretty human-readable listing of products."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6b1b7d6e",
   "metadata": {},
   "source": [
    "## Basic \"JOIN\"s"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eaafb4f7",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql\n",
    "SELECT D.DBMS_name,\n",
    "       D.DBMS_available_as_FLOSS AS FLOSS_stack,\n",
    "       l.language_name\n",
    "FROM DBMSes AS D, languages AS l\n",
    "WHERE D.language_name IS NULL OR\n",
    "      D.language_name = l.language_name"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "52b65be7",
   "metadata": {},
   "source": [
    "_notes_\n",
    "\n",
    "- Let's get back to the programming languages example.\n",
    "- Modern SQL standard revisions allow using `CROSS JOIN` instead of a comma (\",\") to produce a cartesian product of tables.\n",
    "- SQL also provides an alternative syntax for working with a **subset of** a cartesian product.\n",
    "- Replace `FROM` and `WHERE` clauses with `FROM DBMSes AS D INNER JOIN languages AS l ON D.language_name IS NULL OR D.language_name = l.language_name`.\n",
    "- The result is the same.\n",
    "- The logic used to be \"Get the cartesian product, then filter it according to \\*condition\\*.\"\n",
    "- Now the logic is \"Get combined rows according to \\*condition\\*.\"\n",
    "- It does **not** mean that the latter is more efficient!  A DBMS is free to optimize both the old and the new query.\n",
    "- The semantic difference is that, when combining data from more than 2 tables, `JOIN` takes precedence over `,` in the `FROM` clause.\n",
    "- Other than that, the major difference is that `INNER JOIN` might be more readable than a cartesian product (unless a cartesian product is what we want conceptually).\n",
    "- We can still use `WHERE` to additionally filter the combined rows.\n",
    "  - Let's say I don't like relying on proprietary systems.\n",
    "  - Add `WHERE D.DBMS_available_as_FLOSS`.\n",
    "  - Technically, it would also be possible to include this in the `INNER JOIN` condition.\n",
    "  - Add parentheses around `INNER JOIN` condition and change `WHERE` to `AND`.\n",
    "  - Note the use of parentheses.\n",
    "  - The result is the same, but it is arguably less readable, go back to using `WHERE` after `INNER JOIN`.\n",
    "- As a **digression**, note that result column name (`FLOSS_stack`) cannot be used in `WHERE`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "360eb35d",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql\n",
    "-- paste in the code from the Northwind cartesian product example"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "506a4681",
   "metadata": {},
   "source": [
    "_notes_\n",
    "- Use `products p INNER JOIN categories c ON p.category_id = c.category_id` and remove part of the `WHERE` condition.\n",
    "- Note that we have 2 comma-separated table expressions in the `FROM` clause now.\n",
    "- Change\n",
    "  - `, suppliers s` → `INNER JOIN suppliers s`, and\n",
    "  - `WHERE` to `ON`.\n",
    "- One might guess that `table_A.some_id = table_B.some_id` is a common type of an `INNER JOIN` condition.\n",
    "- If the `some_id` column is named the same in both tables, we can use an alternative syntax.\n",
    "- Comment out `ON p.category_id = c.category_id` and put `USING(category_id)` there.\n",
    "- In case of our particular query, the behavior of this `INNER JOIN` is the same.\n",
    "- The differences are as follows.\n",
    "  - The result of the `INNER JOIN` does not contain the duplicate column.\n",
    "    - Add `, category_id` to the `SELECT` clause.\n",
    "    - It works.\n",
    "    - Temporarily revert to `ON` instead of `USING` in the `FROM` clause.\n",
    "    - It now complains about column name `category_id` being ambigious.\n",
    "  - The `USING` form of `INNER JOIN` can be more readable.\n",
    "- We could also require the equivalence of multiple columns between 2 tables.\n",
    "  - It works by specifying multiple comma-separated column names in `USING`.\n",
    "  - Hard to demonstrate on Northwind DB.\n",
    "- Note that Mocrosoft's SQL Server notably does not support `USING` with `JOIN`s.\n",
    "- Analogously, change to `USING` in the second `INNER JOIN`.\n",
    "- Looking at the ER diagram, note that `category_id` is the only common column name between `products` and `categories`\n",
    "  - In such cases, under Postgres we can use the `NATURAL INNER JOIN` syntactic construct.\n",
    "  - Change `INNER JOIN categories c USING(category_id)` → `NATURAL INNER JOIN categories c`.\n",
    "  - Works as `USING(*list_of_commonly_named_table_columns*)`.\n",
    "  - Analogously with `supplier_id`, but note that this is actually dangerous.\n",
    "  - The `NATURAL INNER JOIN` construct relies on the columns names of input tables.\n",
    "  - What if\n",
    "    - I change the second `INNER JOIN` in this query to `NATURAL INNER JOIN`, and\n",
    "    - the `suppliers` table later gets modified to include a new `description` column?\n",
    "  - I advise against using `NATURAL INNER JOIN`.\n",
    "  - Similarly to `USING`, the `NATURAL INNER JOIN` is supported by Postgres, Oracle, MySQL and SQLite but not the SQL Server.\n",
    "- We shall get back to `JOIN`s later on."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bbb59dc9",
   "metadata": {},
   "source": [
    "## The \"COUNT\" function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eec00a18",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql\n",
    "SELECT *\n",
    "FROM customers;"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4cf93e85",
   "metadata": {},
   "source": [
    "_notes_\n",
    "\n",
    "- We see data of different companies, where some columns have unique values (e.g., `customer_id`) and some have non-unique values (e.g., `country`).\n",
    "-Theodore, which attributes from the customers table would you like to work with now?\n",
    "- Jupyter tells us the number of selected rows, but that number won't be readily available to a program that executes the query through an API.\n",
    "- A program could fetch the result and count rows by itself, but… what a waste of resources :c\n",
    "- Replace with `COUNT(*chosen_attribute*)`.\n",
    "- Note that use of aggregation changed the number of result rows.\n",
    "- We get the same result with `COUNT(city)` and `COUNT(country)`.\n",
    "- Replace with `COUNT(fax)`, result is different.\n",
    "  - One could conclude that `COUNT` is not merely used to count rows.\n",
    "  - It does not count unique values, either (countries are not unique in this table).\n",
    "  - It counts non-null values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a108bae1",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql\n",
    "SELECT 'any string'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7325c3a8",
   "metadata": {},
   "source": [
    "_notes_\n",
    "\n",
    "- Wrap in `COUNT()`.\n",
    "  - It counted the single row that tableless `SELECT` could return.\n",
    "- Try\n",
    "  - `COUNT(1)`,\n",
    "  - `COUNT(0)`,\n",
    "  - `COUNT(TRUE)`, and\n",
    "  - `COUNT(FALSE)` — it is easy to forget that this one gives 1.\n",
    "- `COUNT(NULL)` — the only one that gives 0."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8873abd4",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql\n",
    "SELECT 100 - COUNT(customer_id)\n",
    "FROM customers"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c4cbb903",
   "metadata": {},
   "source": [
    "_notes_\n",
    "\n",
    "- We can perform further computation on the result of aggregation functions.\n",
    "- Actually, there is one more elegant way to count rows — `COUNT(*)`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "062bda6a",
   "metadata": {},
   "source": [
    "## The \"GROUP BY\" clause"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "865298df",
   "metadata": {},
   "source": [
    "Task: using orders table, get the date of last order of company 🥕."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e7eecff9",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql\n",
    "SELECT company_name, customer_id\n",
    "FROM customers\n",
    "\n",
    "--SELECT -- fill in\n",
    "--FROM orders\n",
    "--WHERE customer_id = 🥕"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7045ed55",
   "metadata": {},
   "source": [
    "_notes_\n",
    "\n",
    "- Theodore, which company would you like the carrot to stand for?\n",
    "- Comment the current query command and uncomment the second one.\n",
    "- Fill in the `SELECT` clause with `order_date` to show order dates.\n",
    "- Change it to `MAX(order_date)`.\n",
    "- `MAX` happens to be another aggregate function, line `COUNT`.\n",
    "- `MIN` works analogously (change to it, but go back to `MAX`).\n",
    "- The query is now good… for getting the last order date of a single customer.\n",
    "- What if we wanted to get last order dates of all customers?\n",
    "- A loop would be useful, but\n",
    "  - we have not learned the procedural extensions to SQL yet, and\n",
    "  - these extensions are nonstandard, anyway."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "244f9dbb",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql\n",
    "SELECT customer_id, MAX(order_date) AS last_order_date\n",
    "FROM orders\n",
    "GROUP BY customer_id"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "514e9f48",
   "metadata": {},
   "source": [
    "_notes_\n",
    "\n",
    "- `GROUP BY` - makes aggragation functions like `COUNT` and `MAX` operate on smaller **groups** rather than all tuples.\n",
    "- Add\n",
    "  - `JOIN customers USING(customer_id)` to the `FROM` clause,\n",
    "  - `, company_name` to the `GROUP BY` clause, and\n",
    "  - `company_name, ` as the first expression in the `SELECT` clause.\n",
    "- So, `GROUP BY` allowed us to solve the earlier task for all possible values of carrot at once.\n",
    "- Note that you **can**\n",
    "    - perform operations on values before an aggregate function is applied, e.g., `MAX(EXTRACT(YEAR FROM order_date))` in Postgres and Oracle\n",
    "      - (`YEAR(order_date)` in MySQL/MariaDB, SQL Server and some others),\n",
    "    - perform operations on an aggregate function's result, e.g., `EXTRACT(YEAR FROM MAX(order_date))`,\n",
    "    - group by a column without directly using it in the `SELECT` clause,\n",
    "    - use expressions in the `GROUP BY` clause, e.g., `EXTRACT(YEAR FROM order_date)` to group by year and get the last order date in each year by each customer, and\n",
    "    - use aggregate functions in the `ORDER BY` clause, e.g., `ORDER BY MAX(order_date) DESC`.\n",
    "- Note that you **can't**\n",
    "  - use aggregate functions in `WHERE`, `FROM`, and — obviously — the `GROUP BY` clause, and\n",
    "  - omit in `GROUP BY` a column name that is referenced outside of aggregate functions in query's `SELECT` clause.\n",
    "- Note that the effective order of operations is\n",
    "  - `FROM`\n",
    "  - `WHERE`\n",
    "  - `GROUP BY`, and\n",
    "  - `ORDER BY`.\n",
    "- So, if we want to get each company's last order in 1997, we can add `WHERE EXTRACT(YEAR FROM order_date) = 1997`.\n",
    "- What if we instead wanted to get companies that made their last orders in 1997?\n",
    "  - I.e., we want to filter rows **after** grouping is performed.\n",
    "  - Add `HAVING EXTRACT(YEAR FROM MAX(order_date)) = 1997` after the `GROUP BY` clause.\n",
    "  - `HAVING` is similar to `WHERE` but affects the result of grouping.\n",
    "    - Note that, similarly to `WHERE`, the `HAVING` clause is not aware of result table's column names.\n",
    "    - E.g., `HAVING EXTRACT(YEAR FROM last_order_date) = 1997` gives an error.\n",
    "  - `HAVING` can be used together with `WHERE` in a single query.\n",
    "  - As in `SELECT`, is you reference a column name in `HAVING`, it has to be either\n",
    "    - inside an aggregate function call, or\n",
    "    - present in the `GROUP BY` clause.\n",
    "  - E.g.,\n",
    "    - `HAVING country = 'Poland'` only works after you add `country` to the `GROUP BY` clause, and\n",
    "    - `HAVING EXTRACT(YEAR FROM order_date) = 1997` des not work."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "26c69093",
   "metadata": {},
   "source": [
    "## Other aggregate functions"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b9c902ec",
   "metadata": {},
   "source": [
    "Try out `AVG()`, `SUM()`, and `EVERY()` (on boolean values)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "67f5d3b3",
   "metadata": {},
   "source": [
    "## Subqueries and Common Table Expressions (CTEs)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "880c7bd7",
   "metadata": {},
   "source": [
    "Task: find 104 first order shipments.  If there are more shipments made on the same day as the 104th one, include all of them in the result."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "438fae47",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql\n",
    "SELECT order_id, shipped_date, ship_city, ship_country\n",
    "FROM orders o"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "428d9efc",
   "metadata": {},
   "source": [
    "_notes_\n",
    "\n",
    "- One approach would be to compute the number of orders that shipped before a specific tested order.\n",
    "- If the number of earlier orders is less than 104, then the tested order is among the first 104 ones (even if ex-equo on the 104th place).\n",
    "- Theodore, is our current knowledge enough to compute the number of orders shipped before date X for certain X?\n",
    "  - Could you choose an X from the dates you see?\n",
    "- Comment out the current query.\n",
    "- Add `SELECT COUNT(*) FROM orders WHERE shipped_date < '1997-01-15' -- some X` below.\n",
    "- We get the number of orders shipped before.\n",
    "- We can change `COUNT(*)` → `COUNT(*) < 104` to instead get a boolean value telling us if there were less than 104 orders before.\n",
    "- Notice that the query result is a 1-row, 1-column table.\n",
    "- This allows us to blatantly do this.\n",
    "  - Uncomment the initial query above.\n",
    "  - Add `WHERE` right after it.\n",
    "  - Wrap the second query in parentheses (\"()\").\n",
    "  - Change the chosen \"X\" date to `COALESCE(o.shipped_date, DATE('now'))`.\n",
    "- We've just had Postgres execute a `SELECT` command that involved executing another `SELECT` command for every row of the `orders` table.\n",
    "- Let's make it slightly more readable.\n",
    "  - Change alias `outer_orders` to `o`.\n",
    "  - In the inner `SELECT`, use alias `inner_orders` analogously.\n",
    "  - Have the inner `SELECT` return just number with `COUNT(*)` and compare the entire subquery to 104.\n",
    "- Add `ORDER BY shipped_date`.\n",
    "- We can also add a similar subquery in the `SELECT` clause.\n",
    "- The subquery depends on a value from the outer query.\n",
    "  - I.e., the `outer_orders.shipped_date`.\n",
    "  - Such subquery is called **correlated** and gives us a lot of flexibility.\n",
    "  - This also means the DMBS **might** need to compute the subquery separately for each row.\n",
    "  - That can be a performance killer.\n",
    "- Let's\n",
    "  - drop the subquery on `SELECT`, and\n",
    "  - rewrite the subquery in `WHERE` as a non-correlated one,\n",
    "    - e.g., use `WHERE shipped_date IN (SELECT shipped_date FROM orders ORDER BY shipped_date LIMIT 104)`.\n",
    "- If subquery result is a 1-column table, it can be used as the right side of the `IN` operator.\n",
    "- Another _operator_ commonly used with subqueries is `EXISTS` (`NOT EXISTS`),\n",
    "  - e.g. `NOT EXISTS (SELECT shipped_date FROM orders WHERE shipped_date < COALESCE(outer_orders.shipped_date, DATE('now') + 1) ORDER BY shipped_date OFFSET 104)`,\n",
    "  - which tests if subquery result is empty, and\n",
    "  - once again gives us a correlated query :c — get back to the non-correlated one.\n",
    "- A non-correlated subquery can also be used as a table expression in the `FROM` clause,\n",
    "  - e.g. `FROM orders INNER JOIN (SELECT shipped_date FROM orders ORDER BY shipped_date LIMIT 104) USING(shipped_date)`,\n",
    "  - with no `WHERE` in the outer query, and\n",
    "  - with `DISTINCT` added to the outer `SELECT`.\n",
    "- Non-correlated subqueries can be replaced with common table expressions that are perhaps more readable (and more obviously non-correlated!).\n",
    "- Use `WITH initial_dates AS (*our_subquery*) SELECT *all_like_before* FROM orders INNER JOIN initial_dates USING(shipped_date) ORDER BY shipped_date`.\n",
    "- More comma-separated `*table_name* AS (*subquery*)` common table expressions can come after `WITH`.\n",
    "- Note that in this construct the `AS` keyword is required.\n",
    "- Later common table expressions can reference common table expressions specified earlier in the same `WITH` clause.\n",
    "  - Add `, initial_dates_distinct AS (SELECT DISTINCT shipped_date FROM initial_dates)`.\n",
    "  - Use it in the final query instead of `initial_dates`.\n",
    "  - Remove `DISTINCT` from the final query."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "02bb8636",
   "metadata": {},
   "source": [
    "## Outer \"JOIN\"s"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8e8b1e69",
   "metadata": {},
   "source": [
    "Count the number of orders of each customer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d4c37abc",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql\n",
    "SELECT company_name, COUNT(order_id) AS order_count\n",
    "FROM customers INNER JOIN orders USING(customer_id)\n",
    "GROUP BY company_name, customer_id\n",
    "ORDER BY company_name"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "064c37da",
   "metadata": {},
   "source": [
    "_notes_\n",
    "\n",
    "- Theodore, is it clear what this query does?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "730e3207",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql\n",
    "SELECT company_name, 0\n",
    "FROM customers outer_c\n",
    "WHERE NOT EXISTS (SELECT order_id\n",
    "                  FROM orders\n",
    "                  WHERE customer_id = outer_c.customer_id)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2a9c3978",
   "metadata": {},
   "source": [
    "_notes_\n",
    "\n",
    "- Theodore, is it clear what this one does?\n",
    "- These 2 companies do not appear in the result of the former query.\n",
    "- That is because their rows can never meet the `INNER JOIN`'s condition.\n",
    "- We can solve the problem using `UNION`.\n",
    "  - Add `, 0` to `SELECT`.\n",
    "  - Copy the previous query to this cell.\n",
    "  - Add `UNION` between the 2 queries.\n",
    "  - Remove the `ORDER BY` clause.\n",
    "- We are done, now let's see how the same thing is typically done."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ead9b998",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql\n",
    "SELECT company_name, order_id, order_date\n",
    "FROM customers INNER JOIN orders USING(customer_id)\n",
    "WHERE company_name LIKE 'P%'\n",
    "ORDER BY company_name"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b4b149f0",
   "metadata": {},
   "source": [
    "_notes_\n",
    "\n",
    "- Let's look closer at how our `JOIN`'s result looks like before grouping.\n",
    "- We shall only take into account the companies whose names start with \"P\".  The result table is going to be smaller and more comprehensible.\n",
    "- \"Paris spécialités\", which had no orders, obviously does not appead here.\n",
    "- So far we've been only using `INNER JOIN` (and `CROSS JOIN` once).\n",
    "- There is also another type of `JOIN`, the `OUTER JOIN`.\n",
    "- `OUTER JOIN` is similar but causes all rows of the input table to **appear at least once** in the result — even if `OUTER JOIN`'s condition was never satisfied for them.\n",
    "- If needed, it uses `NULL`s in place of nonexistent values that would come from the other input table.\n",
    "- Change `INNER JOIN` to `LEFT OUTER JOIN`.\n",
    "- See?  The special row with `NULL`s is not added for other customers, only for \"Paris spécialités\", which would otherwise not appear in the result.\n",
    "- Recall now that `COUNT` counted non-`NULL` values.\n",
    "- It is perfect for the task of counting the actual number of orders of each company here.\n",
    "  - Add `, COUNT(order_id) AS order_count` and `GROUP BY company_name, customer_id`.\n",
    "  - Remove `, order_date`.\n",
    "- Note that\n",
    "  - `COUNT(order_id)` works properly, because `order_id` is `NULL` in case of the special row, but\n",
    "  - `COUNT(*)` would be bad because it counts the rows in the group without regard to `NULL`s.\n",
    "- What is the purpose of `LEFT` in `LEFT OUTER JOIN`?\n",
    "  - Comment out the `GROUP BY` and `COUNT` parts.\n",
    "  - `LEFT OUTER JOIN`, as name suggests, ensures that all rows of the left table appear in its result.\n",
    "  - Swap table names around `LEFT OUTER JOIN`.\n",
    "  - \"Paris spécialités\" is gone again.\n",
    "  - We could now instead use `RIGHT OUTER JOIN` which functions as you'd expect.\n",
    "  - \"Paris spécialités\" is back there."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "43d48810",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql\n",
    "SELECT employee_id, first_name, last_name, reports_to\n",
    "FROM employees"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aafb595c",
   "metadata": {},
   "source": [
    "_notes_\n",
    "\n",
    "- Let's now look at the `employees` table.\n",
    "- An employee can have another employee as a boss.\n",
    "  - This is recorded in the `reports_to` attribute.\n",
    "  - The value of `reports_to`, if present, is another employee's id."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "73e9440f",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql\n",
    "SELECT se.first_name || ' ' || se.last_name AS subordinate,\n",
    "       me.first_name || ' ' || me.last_name AS manager\n",
    "FROM                 employees se\n",
    "     LEFT OUTER JOIN employees me\n",
    "         ON se.reports_to = me.employee_id"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d08cea04",
   "metadata": {},
   "source": [
    "_notes_\n",
    "\n",
    "- We'd like to list all emploees together with their managers.\n",
    "- Theodore, is this query comprehensible?\n",
    "- We have employees and their bosses in the 2 columns of the result.\n",
    "  - Andrew Fuller is the only employee without real boss.\n",
    "  - Notice how concatenating a string with `NULL` yields a `NULL` value.\n",
    "- `LEFT OUTER JOIN` allowed us to have every employee appear in the left column, with possible `NULL`s in the right one.\n",
    "- We can instead use `RIGHT OUTER JOIN` to have every employee appear in the right column, with possible `NULL`s in the left one.\n",
    "- SQL also allows us to get both of those behaviors at once.\n",
    "  - I.e., do an `OUTER JOIN` that guarantees that each row from both the left and the right table appears in the result.\n",
    "  - It is, in a sense, an `OUTER JOIN` that is `RIGHT` and `LEFT` at the same time.\n",
    "  - Taking inspiration from boss' surname, change the code to use `FULL OUTER JOIN`.\n",
    "  - This one is definitely less frequently used than one-sided `OUTER JOIN`s.\n",
    "  - However, it can be argued that a table as produced by `FULL OUTER JOIN` allows more things to be looked up easily by a human reader.\n",
    "- Now, a useful hint.  The `OUTER` and `INNER` keywords can be dropped when writing queries :)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "/gnu/store/q35bxk1i5blx0rcgxphhrq9xqmgha9bz-python-3.11.11/bin/python3",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}