{ "cells": [ { "cell_type": "code", "execution_count": null, "id": "bac3442e", "metadata": {}, "outputs": [], "source": [ "import agh_db_lectures\n", "agh_db_lectures.prepare_notebook_for_sql()" ] }, { "cell_type": "markdown", "id": "6430579f", "metadata": {}, "source": [ "# Advanced SQL queries" ] }, { "cell_type": "markdown", "id": "bc6d5613", "metadata": {}, "source": [ "_notes_\n", "\n", "- aggregating selected data (sum, count…)\n", "- order of query results\n", "- combining data from multipe tables" ] }, { "cell_type": "code", "execution_count": null, "id": "f79c3828", "metadata": {}, "outputs": [], "source": [ "# Open the diagram in an image viewer for more convenience.\n", "agh_db_lectures.nw_diagram.download_open()" ] }, { "cell_type": "code", "execution_count": null, "id": "59ffd66d", "metadata": {}, "outputs": [], "source": [ "# Connect to the (locally-hosted) database.\n", "%sql postgresql://demo_user:demo_pwd@localhost:25432/agh_it_northwind" ] }, { "cell_type": "code", "execution_count": null, "id": "44553e37", "metadata": {}, "outputs": [], "source": [ "agh_db_lectures.download_restore_nw_postgres_dump()" ] }, { "cell_type": "markdown", "id": "dc15108a", "metadata": {}, "source": [ "## The \"ORDER BY\" and \"LIMIT\" clauses" ] }, { "cell_type": "markdown", "id": "f8865659", "metadata": {}, "source": [ "Task: find X first order shipments." ] }, { "cell_type": "code", "execution_count": null, "id": "95ab5146", "metadata": {}, "outputs": [], "source": [ "%%sql\n", "SELECT order_id, shipped_date, ship_city, ship_country\n", "FROM orders" ] }, { "cell_type": "markdown", "id": "08a740f6", "metadata": {}, "source": [ "_notes_\n", "\n", "- Theodore, what would you like \"X\" to be?\n", "- Normally, the order of result rows is completely arbitrary.\n", "- Here we see that it does not match our requirement.\n", "- Add `ORDER BY shipped_date` — we can impose ordering this way.\n", "- The order inside groups with the same date is still arbitrary.\n", " - We can break these ties as well.\n", " - Make it `ORDER BY shipped_data, ship_country` (or analogous) — a comma separated list of sort keys can be used.\n", "- But there are still ties, e.g. on 1996-09-10.\n", "- We can add `, ship_city` to break it.\n", "- Change to `ship_country || ':' || ship_city`.\n", " - Sort keys can be expressions rather than just column names.\n", " - No need for the sort key expressions to actually be selected.\n", "- Add `, order_id` to break tie on 1996-08-21.\n", "- Now we have a deterministic result sequence, which is no longer (just) a set.\n", "- Add `LIMIT BY 100` — we can have the query only return the first 100 rows.\n", "- This is useful for paging when an application presents a sublist of entries and the user can click to see the next page.\n", " - How to get data of the next page?\n", " - Add `OFFSET 100` before (or after) `LIMIT`.\n", " - Change to `OFFSET 2*100`.\n", " - Theodore, what application that you know uses this kind of paging?\n", " - When a query is performed, the effective order of computation is:\n", " - `FROM`,\n", " - `WHERE`,\n", " - `ORDER BY`,\n", " - `OFFSET`, and\n", " - `LIMIT`.\n", " - `OFFSET` works, but what if entries at page borders get modified before the user clicks the \"next page\" button?\n", " - The user could see the same row again at the beginning of the subsequent page.\n", " - Or worse, the user could miss a row.\n", " - Remove `OFFSET` and add `WHERE (shipped_date, ship_country || ':' || ship_city, order_id) > ('1996-11-11', 'Germany:Cunewalde', 10345)`.\n", "- We are using 2 row expressions to perform comparison on their resulting tuples :)\n", "- Application, when requesting a page, could provide a (date,country+city,id) tuple instead of a page number.\n", "- This approach to paging\n", " - is more robust,\n", " - but somewhat harder to get right\n", " - and requires more queries to additionally\n", " - compute the current position (aka page number) of the user, and\n", " - allow easily navigating to, e.g., the middle page.\n", "- Add `ASC` to every sort key — this is the default.\n", "- What if we instead wanted to sort from newest dates?\n", " - Use `shipped_date DESC,`.\n", " - No `WHERE` initially.\n", " - Use `COALESCE(shipped_date, DATE('now') + 1) DESC`.\n", " - Add `WHERE ('1998-04-06' - COALESCE(shipped_date, DATE('now') + 1), ship_country || ':' || ship_city, order_id) > (0, 'UK:London', 10987)`.\n", "- What if we wanted to implement paging when, e.g., sorting ascending by country and decending by city?\n", " - Give up tuple comparisons and express it as a logical expression with more `AND`s and `OR`s.\n", " - Verbose but doable.\n", "- Now, if we had instead interpreted the initial tast to possibly return a 101st row if it has a date tie with 100th, then we'd need to use a subquery which we'll do later today." ] }, { "cell_type": "markdown", "id": "f456b374", "metadata": {}, "source": [ "## Cartesian product" ] }, { "cell_type": "code", "execution_count": null, "id": "da32a14b", "metadata": {}, "outputs": [], "source": [ "%sql --file dbmses-languages.sql" ] }, { "cell_type": "markdown", "id": "a2a13c8d", "metadata": {}, "source": [ "_notes_\n", "\n", "- We shall start with simpler tables and move to Northwind later.\n", "- The `dbmses-languages.sql` script creates and populates 2 tables with\n", " - programming languages, and\n", " - Database Management Systems." ] }, { "cell_type": "code", "execution_count": null, "id": "95dc98e7", "metadata": {}, "outputs": [], "source": [ "%%sql\n", "SELECT * FROM languages" ] }, { "cell_type": "markdown", "id": "4b7feee1", "metadata": {}, "source": [ "_notes_\n", "\n", "- We have just 5 languages in the table to keep the example simple.\n", "- Erlang is developed and used by Ericsson, a company that students of teleinformatics here probably know well ;)\n", "- The other langs are probably known by everyone. Yup, Java and JavaScript are different things." ] }, { "cell_type": "code", "execution_count": null, "id": "3725f897", "metadata": {}, "outputs": [], "source": [ "%%sql\n", "SELECT *\n", "FROM DBMSes" ] }, { "cell_type": "markdown", "id": "060b8d83", "metadata": {}, "source": [ "_notes_\n", "\n", "- Here, we just have entries for a few selected DBs for simplicity.\n", "- It is good to have a multi-column table for the examples, so we include information on\n", " - whether the DBMB is specific to a single programming language (if so, which one, otherwise, we store NULL) and\n", " - whether the DBMS is available under a free/libre software license.\n", "- Mnesia is a distributed database engine written in Erlang.\n", " - It is part of a typical Erlang installation.\n", " - It is a relational DBMS, except for value atomicity requirement, but SQL DBs commonly violate that one as well.\n", " - Unusually for a relational DBMS, Mnesia doesn't use SQL.\n", " - Erlang data structures are used to to interface with Mnesia.\n", "- IndexedDB is available to scripts executing on websites to store data in.\n", " - It is actually a database API in JavaScript rather than a single DBMS, I admit.\n", " - IndexedDB data is kept on disk by the web browser on behalf of websites, similarly to cookies.\n", " - It is another example of a relational database that is not queried with SQL.\n", "- Keep in mind these are toy tables — we would design the schema differently for a serious database.\n", "- Let's write a query that will produce possible DB-language combinations, for example for a software project.\n", "- Make it `FROM DBMSes, languages`.\n", "- There can be multiple comma-separated tables (well, table expressions) listed after the `FROM` keyword.\n", "- Our query result is a table that is a cartesian product of all the command-separated table names.\n", " - Each row with each row.\n", " - We had $4$ and $5$ rows, now we have $4\\cdot5 = 20$ rows.\n", "- Our cartesian product includes combinations of Mnesia and languages other than Erlang — these would not be practical for a software project.\n", " - Same issue with IndexedDB.\n", " - We can add `WHERE DBMSes.language_name IS NULL OR DBMSes.language_name = languages.language_name` to filter the 20 result rows that we have.\n", " - As you see, we can use the syntax `*table_name*.*column_name*` to refer to column values of a pecific table.\n", " - The result now makes sense WRT Mnesia and IndexedDB.\n", "- The `DBMSes.language_name` column is useful for filtering, but it is not necessary to have it in the result table.\n", "- Use `SELECT DBMSes.DBMS_name, DBMSes.DBMS_available_as_FLOSS, languages.language_name`.\n", "- The `*table_name*.*column_name*` syntax works as well in `SELECT` clause.\n", "- We can give aliases to input table names to make the query less verbose.\n", " - Use `FROM DBMSes D, languages l`.\n", " - Change `DBMSes.` to `D.` everywhere.\n", " - Analogously with `languages.`.\n", "- We could also use the `AS` keyword in `FROM`.\n", " - Make it `DBMSes AS D`.\n", " - Works the same.\n", " - Notably, Oracle would not allow `AS` in `FROM`.\n", " - A foolproof approach would be to\n", " - use `AS` for column aliases in `SELECT`, like `D.DBMS_available_as_FLOSS AS FLOSS_stack`, and\n", " - omit `AS` for table aliases in `FROM`.\n", "- Additionally, we can omit `*table_name*.` whenever the column name is unambigious.\n", " - Change `D.DBMS_name` → `DBMS_name`, and\n", " - `D.DBMS_available_as_FLOSS` → `DBMS_available_as_FLOSS`.\n", "- This is a basic example of how we can combine the data from multiple (here, 2) tables.\n", "- Let's apply this knowledge to querying the Northwind DB." ] }, { "cell_type": "code", "execution_count": null, "id": "7a82cf92", "metadata": {}, "outputs": [], "source": [ "%%sql\n", "SELECT -- fill in\n", "FROM , -- fill in\n", "--WHERE -- fill in" ] }, { "cell_type": "markdown", "id": "38ae646f", "metadata": {}, "source": [ "_notes_\n", "\n", "- As you see in the Northwind ER diagram, certain ids appear in more than one table.\n", "- They are there to allow correlating rows from one table to rows from another table.\n", "- Theodore, as our example, would you like us to display\n", " - products with their category names and supplier names, or\n", " - orders with their customer names and responsible employee names?\n", "- `SELECT p.product_id, p.product_name, c.category_name FROM products p, categories c` — get a cartesian product.\n", "- `WHERE p.category_id = c.category_id` — limit it to interesting rows.\n", "- Add\n", " - `, s.company_name AS supplier_name` to `SELECT` clause,\n", " - `, suppliers s` to `FROM` clause, and\n", " - ` AND p.supplier_id = s.supplier_id` to `WHERE` clause.\n", "- We get a pretty human-readable listing of products." ] }, { "cell_type": "markdown", "id": "6b1b7d6e", "metadata": {}, "source": [ "## Basic \"JOIN\"s" ] }, { "cell_type": "code", "execution_count": null, "id": "eaafb4f7", "metadata": {}, "outputs": [], "source": [ "%%sql\n", "SELECT D.DBMS_name,\n", " D.DBMS_available_as_FLOSS AS FLOSS_stack,\n", " l.language_name\n", "FROM DBMSes AS D, languages AS l\n", "WHERE D.language_name IS NULL OR\n", " D.language_name = l.language_name" ] }, { "cell_type": "markdown", "id": "52b65be7", "metadata": {}, "source": [ "_notes_\n", "\n", "- Let's get back to the programming languages example.\n", "- Modern SQL standard revisions allow using `CROSS JOIN` instead of a comma (\",\") to produce a cartesian product of tables.\n", "- SQL also provides an alternative syntax for working with a **subset of** a cartesian product.\n", "- Replace `FROM` and `WHERE` clauses with `FROM DBMSes AS D INNER JOIN languages AS l ON D.language_name IS NULL OR D.language_name = l.language_name`.\n", "- The result is the same.\n", "- The logic used to be \"Get the cartesian product, then filter it according to \\*condition\\*.\"\n", "- Now the logic is \"Get combined rows according to \\*condition\\*.\"\n", "- It does **not** mean that the latter is more efficient! A DBMS is free to optimize both the old and the new query.\n", "- The semantic difference is that, when combining data from more than 2 tables, `JOIN` takes precedence over `,` in the `FROM` clause.\n", "- Other than that, the major difference is that `INNER JOIN` might be more readable than a cartesian product (unless a cartesian product is what we want conceptually).\n", "- We can still use `WHERE` to additionally filter the combined rows.\n", " - Let's say I don't like relying on proprietary systems.\n", " - Add `WHERE D.DBMS_available_as_FLOSS`.\n", " - Technically, it would also be possible to include this in the `INNER JOIN` condition.\n", " - Add parentheses around `INNER JOIN` condition and change `WHERE` to `AND`.\n", " - Note the use of parentheses.\n", " - The result is the same, but it is arguably less readable, go back to using `WHERE` after `INNER JOIN`.\n", "- As a **digression**, note that result column name (`FLOSS_stack`) cannot be used in `WHERE`." ] }, { "cell_type": "code", "execution_count": null, "id": "360eb35d", "metadata": {}, "outputs": [], "source": [ "%%sql\n", "-- paste in the code from the Northwind cartesian product example" ] }, { "cell_type": "markdown", "id": "506a4681", "metadata": {}, "source": [ "_notes_\n", "- Use `products p INNER JOIN categories c ON p.category_id = c.category_id` and remove part of the `WHERE` condition.\n", "- Note that we have 2 comma-separated table expressions in the `FROM` clause now.\n", "- Change\n", " - `, suppliers s` → `INNER JOIN suppliers s`, and\n", " - `WHERE` to `ON`.\n", "- One might guess that `table_A.some_id = table_B.some_id` is a common type of an `INNER JOIN` condition.\n", "- If the `some_id` column is named the same in both tables, we can use an alternative syntax.\n", "- Comment out `ON p.category_id = c.category_id` and put `USING(category_id)` there.\n", "- In case of our particular query, the behavior of this `INNER JOIN` is the same.\n", "- The differences are as follows.\n", " - The result of the `INNER JOIN` does not contain the duplicate column.\n", " - Add `, category_id` to the `SELECT` clause.\n", " - It works.\n", " - Temporarily revert to `ON` instead of `USING` in the `FROM` clause.\n", " - It now complains about column name `category_id` being ambigious.\n", " - The `USING` form of `INNER JOIN` can be more readable.\n", "- We could also require the equivalence of multiple columns between 2 tables.\n", " - It works by specifying multiple comma-separated column names in `USING`.\n", " - Hard to demonstrate on Northwind DB.\n", "- Note that Mocrosoft's SQL Server notably does not support `USING` with `JOIN`s.\n", "- Analogously, change to `USING` in the second `INNER JOIN`.\n", "- Looking at the ER diagram, note that `category_id` is the only common column name between `products` and `categories`\n", " - In such cases, under Postgres we can use the `NATURAL INNER JOIN` syntactic construct.\n", " - Change `INNER JOIN categories c USING(category_id)` → `NATURAL INNER JOIN categories c`.\n", " - Works as `USING(*list_of_commonly_named_table_columns*)`.\n", " - Analogously with `supplier_id`, but note that this is actually dangerous.\n", " - The `NATURAL INNER JOIN` construct relies on the columns names of input tables.\n", " - What if\n", " - I change the second `INNER JOIN` in this query to `NATURAL INNER JOIN`, and\n", " - the `suppliers` table later gets modified to include a new `description` column?\n", " - I advise against using `NATURAL INNER JOIN`.\n", " - Similarly to `USING`, the `NATURAL INNER JOIN` is supported by Postgres, Oracle, MySQL and SQLite but not the SQL Server.\n", "- We shall get back to `JOIN`s later on." ] }, { "cell_type": "markdown", "id": "bbb59dc9", "metadata": {}, "source": [ "## The \"COUNT\" function" ] }, { "cell_type": "code", "execution_count": null, "id": "eec00a18", "metadata": {}, "outputs": [], "source": [ "%%sql\n", "SELECT *\n", "FROM customers;" ] }, { "cell_type": "markdown", "id": "4cf93e85", "metadata": {}, "source": [ "_notes_\n", "\n", "- We see data of different companies, where some columns have unique values (e.g., `customer_id`) and some have non-unique values (e.g., `country`).\n", "-Theodore, which attributes from the customers table would you like to work with now?\n", "- Jupyter tells us the number of selected rows, but that number won't be readily available to a program that executes the query through an API.\n", "- A program could fetch the result and count rows by itself, but… what a waste of resources :c\n", "- Replace with `COUNT(*chosen_attribute*)`.\n", "- Note that use of aggregation changed the number of result rows.\n", "- We get the same result with `COUNT(city)` and `COUNT(country)`.\n", "- Replace with `COUNT(fax)`, result is different.\n", " - One could conclude that `COUNT` is not merely used to count rows.\n", " - It does not count unique values, either (countries are not unique in this table).\n", " - It counts non-null values." ] }, { "cell_type": "code", "execution_count": null, "id": "a108bae1", "metadata": {}, "outputs": [], "source": [ "%%sql\n", "SELECT 'any string'" ] }, { "cell_type": "markdown", "id": "7325c3a8", "metadata": {}, "source": [ "_notes_\n", "\n", "- Wrap in `COUNT()`.\n", " - It counted the single row that tableless `SELECT` could return.\n", "- Try\n", " - `COUNT(1)`,\n", " - `COUNT(0)`,\n", " - `COUNT(TRUE)`, and\n", " - `COUNT(FALSE)` — it is easy to forget that this one gives 1.\n", "- `COUNT(NULL)` — the only one that gives 0." ] }, { "cell_type": "code", "execution_count": null, "id": "8873abd4", "metadata": {}, "outputs": [], "source": [ "%%sql\n", "SELECT 100 - COUNT(customer_id)\n", "FROM customers" ] }, { "cell_type": "markdown", "id": "c4cbb903", "metadata": {}, "source": [ "_notes_\n", "\n", "- We can perform further computation on the result of aggregation functions.\n", "- Actually, there is one more elegant way to count rows — `COUNT(*)`." ] }, { "cell_type": "markdown", "id": "062bda6a", "metadata": {}, "source": [ "## The \"GROUP BY\" clause" ] }, { "cell_type": "markdown", "id": "865298df", "metadata": {}, "source": [ "Task: using orders table, get the date of last order of company 🥕." ] }, { "cell_type": "code", "execution_count": null, "id": "e7eecff9", "metadata": {}, "outputs": [], "source": [ "%%sql\n", "SELECT company_name, customer_id\n", "FROM customers\n", "\n", "--SELECT -- fill in\n", "--FROM orders\n", "--WHERE customer_id = 🥕" ] }, { "cell_type": "markdown", "id": "7045ed55", "metadata": {}, "source": [ "_notes_\n", "\n", "- Theodore, which company would you like the carrot to stand for?\n", "- Comment the current query command and uncomment the second one.\n", "- Fill in the `SELECT` clause with `order_date` to show order dates.\n", "- Change it to `MAX(order_date)`.\n", "- `MAX` happens to be another aggregate function, line `COUNT`.\n", "- `MIN` works analogously (change to it, but go back to `MAX`).\n", "- The query is now good… for getting the last order date of a single customer.\n", "- What if we wanted to get last order dates of all customers?\n", "- A loop would be useful, but\n", " - we have not learned the procedural extensions to SQL yet, and\n", " - these extensions are nonstandard, anyway." ] }, { "cell_type": "code", "execution_count": null, "id": "244f9dbb", "metadata": {}, "outputs": [], "source": [ "%%sql\n", "SELECT customer_id, MAX(order_date) AS last_order_date\n", "FROM orders\n", "GROUP BY customer_id" ] }, { "cell_type": "markdown", "id": "514e9f48", "metadata": {}, "source": [ "_notes_\n", "\n", "- `GROUP BY` - makes aggragation functions like `COUNT` and `MAX` operate on smaller **groups** rather than all tuples.\n", "- Add\n", " - `JOIN customers USING(customer_id)` to the `FROM` clause,\n", " - `, company_name` to the `GROUP BY` clause, and\n", " - `company_name, ` as the first expression in the `SELECT` clause.\n", "- So, `GROUP BY` allowed us to solve the earlier task for all possible values of carrot at once.\n", "- Note that you **can**\n", " - perform operations on values before an aggregate function is applied, e.g., `MAX(EXTRACT(YEAR FROM order_date))` in Postgres and Oracle\n", " - (`YEAR(order_date)` in MySQL/MariaDB, SQL Server and some others),\n", " - perform operations on an aggregate function's result, e.g., `EXTRACT(YEAR FROM MAX(order_date))`,\n", " - group by a column without directly using it in the `SELECT` clause,\n", " - use expressions in the `GROUP BY` clause, e.g., `EXTRACT(YEAR FROM order_date)` to group by year and get the last order date in each year by each customer, and\n", " - use aggregate functions in the `ORDER BY` clause, e.g., `ORDER BY MAX(order_date) DESC`.\n", "- Note that you **can't**\n", " - use aggregate functions in `WHERE`, `FROM`, and — obviously — the `GROUP BY` clause, and\n", " - omit in `GROUP BY` a column name that is referenced outside of aggregate functions in query's `SELECT` clause.\n", "- Note that the effective order of operations is\n", " - `FROM`\n", " - `WHERE`\n", " - `GROUP BY`, and\n", " - `ORDER BY`.\n", "- So, if we want to get each company's last order in 1997, we can add `WHERE EXTRACT(YEAR FROM order_date) = 1997`.\n", "- What if we instead wanted to get companies that made their last orders in 1997?\n", " - I.e., we want to filter rows **after** grouping is performed.\n", " - Add `HAVING EXTRACT(YEAR FROM MAX(order_date)) = 1997` after the `GROUP BY` clause.\n", " - `HAVING` is similar to `WHERE` but affects the result of grouping.\n", " - Note that, similarly to `WHERE`, the `HAVING` clause is not aware of result table's column names.\n", " - E.g., `HAVING EXTRACT(YEAR FROM last_order_date) = 1997` gives an error.\n", " - `HAVING` can be used together with `WHERE` in a single query.\n", " - As in `SELECT`, is you reference a column name in `HAVING`, it has to be either\n", " - inside an aggregate function call, or\n", " - present in the `GROUP BY` clause.\n", " - E.g.,\n", " - `HAVING country = 'Poland'` only works after you add `country` to the `GROUP BY` clause, and\n", " - `HAVING EXTRACT(YEAR FROM order_date) = 1997` des not work." ] }, { "cell_type": "markdown", "id": "26c69093", "metadata": {}, "source": [ "## Other aggregate functions" ] }, { "cell_type": "markdown", "id": "b9c902ec", "metadata": {}, "source": [ "Try out `AVG()`, `SUM()`, and `EVERY()` (on boolean values)." ] }, { "cell_type": "markdown", "id": "67f5d3b3", "metadata": {}, "source": [ "## Subqueries and temporary tables" ] }, { "cell_type": "markdown", "id": "880c7bd7", "metadata": {}, "source": [ "Task: find 104 first order shipments. If there are more shipments made on the same day as the 104th one, include all of them in the result." ] }, { "cell_type": "code", "execution_count": null, "id": "438fae47", "metadata": {}, "outputs": [], "source": [ "%%sql\n", "SELECT order_id, shipped_date, ship_city, ship_country\n", "FROM orders o" ] }, { "cell_type": "markdown", "id": "428d9efc", "metadata": {}, "source": [ "_notes_\n", "\n", "- One approach would be to compute the number of orders that shipped before a specific tested order.\n", "- If the number of earlier orders is less than 104, then the tested order is among the first 104 ones (even if ex-equo on the 104th place).\n", "- Theodore, is our current knowledge enough to compute the number of orders shipped before date X for certain X?\n", " - Could you choose an X from the dates you see?\n", "- Comment out the current query.\n", "- Add `SELECT COUNT(*) FROM orders WHERE shipped_date < '1997-01-15' -- some X` below.\n", "- We get the number of orders shipped before.\n", "- We can change `COUNT(*)` → `COUNT(*) < 104` to instead get a boolean value telling us if there were less than 104 orders before.\n", "- Notice that the query result is a 1-row, 1-column table.\n", "- This allows us to blatantly do this.\n", " - Uncomment the initial query above.\n", " - Add `WHERE` right after it.\n", " - Wrap the second query in parentheses (\"()\").\n", " - Change the chosen \"X\" date to `COALESCE(o.shipped_date, DATE('now'))`.\n", "- We've just had Postgres execute a `SELECT` command that involved executing another `SELECT` command for every row of the `orders` table.\n", "- Let's make it slightly more readable.\n", " - Change alias `outer_orders` to `o`.\n", " - In the inner `SELECT`, use alias `inner_orders` analogously.\n", " - Have the inner `SELECT` return just number with `COUNT(*)` and compare the entire subquery to 104.\n", "- Add `ORDER BY shipped_date`.\n", "- We can also add a similar subquery in the `SELECT` clause.\n", "- The subquery depends on a value from the outer query.\n", " - I.e., the `outer_orders.shipped_date`.\n", " - Such subquery is called **correlated** and gives us a lot of flexibility.\n", " - This also means the DMBS **might** need to compute the subquery separately for each row.\n", " - That can be a performance killer.\n", "- Let's\n", " - drop the subquery on `SELECT`, and\n", " - rewrite the subquery in `WHERE` as a non-correlated one,\n", " - e.g., use `WHERE shipped_date IN (SELECT shipped_date FROM orders ORDER BY shipped_date LIMIT 104)`.\n", "- If subquery result is a 1-column table, it can be used as the right side of the `IN` operator.\n", "- Another _operator_ commonly used with subqueries is `EXISTS` (`NOT EXISTS`),\n", " - e.g. `NOT EXISTS (SELECT shipped_date FROM orders WHERE shipped_date < COALESCE(outer_orders.shipped_date, DATE('now') + 1) ORDER BY shipped_date OFFSET 104)`,\n", " - which tests if subquery result is empty, and\n", " - once again gives us a correlated query :c — get back to the non-correlated one.\n", "- A non-correlated subquery can also be used as a table expression in the `FROM` clause,\n", " - e.g. `FROM orders INNER JOIN (SELECT shipped_date FROM orders ORDER BY shipped_date LIMIT 104) USING(shipped_date)`,\n", " - with no `WHERE` in the outer query, and\n", " - with `DISTINCT` added to the outer `SELECT`.\n", "- Non-correlated subqueries can be replaced with temporary tables that are perhaps more readable (and more obviously non-correlated!).\n", "- Use `WITH initial_dates AS (*our_subquery*) SELECT *all_like_before* FROM orders INNER JOIN initial_dates USING(shipped_date) ORDER BY shipped_date`.\n", "- More comma-separated `*table_name* AS (*subquery*)` temporary table terms can come after `WITH`.\n", "- Note that in this construct the `AS` keyword is required.\n", "- Later temporary table declarations can reference temporary tables specified earlier in the same `WITH` clause.\n", " - Add `, initial_dates_distinct AS (SELECT DISTINCT shipped_date FROM initial_dates)`.\n", " - Use it in the final query instead of `initial_dates`.\n", " - Remove `DISTINCT` from the final query." ] }, { "cell_type": "markdown", "id": "02bb8636", "metadata": {}, "source": [ "## Outer \"JOIN\"s" ] }, { "cell_type": "markdown", "id": "8e8b1e69", "metadata": {}, "source": [ "Count the number of orders of each customer." ] }, { "cell_type": "code", "execution_count": null, "id": "d4c37abc", "metadata": {}, "outputs": [], "source": [ "%%sql\n", "SELECT company_name, COUNT(order_id) AS order_count\n", "FROM customers INNER JOIN orders USING(customer_id)\n", "GROUP BY company_name, customer_id\n", "ORDER BY company_name" ] }, { "cell_type": "markdown", "id": "064c37da", "metadata": {}, "source": [ "_notes_\n", "\n", "- Theodore, is it clear what this query does?" ] }, { "cell_type": "code", "execution_count": null, "id": "730e3207", "metadata": {}, "outputs": [], "source": [ "%%sql\n", "SELECT company_name, 0\n", "FROM customers outer_c\n", "WHERE NOT EXISTS (SELECT order_id\n", " FROM orders\n", " WHERE customer_id = outer_c.customer_id)" ] }, { "cell_type": "markdown", "id": "2a9c3978", "metadata": {}, "source": [ "_notes_\n", "\n", "- Theodore, is it clear what this one does?\n", "- These 2 companies do not appear in the result of the former query.\n", "- That is because their rows can never meet the `INNER JOIN`'s condition.\n", "- We can solve the problem using `UNION`.\n", " - Add `, 0` to `SELECT`.\n", " - Copy the previous query to this cell.\n", " - Add `UNION` between the 2 queries.\n", " - Remove the `ORDER BY` clause.\n", "- We are done, now let's see how the same thing is typically done." ] }, { "cell_type": "code", "execution_count": null, "id": "ead9b998", "metadata": {}, "outputs": [], "source": [ "%%sql\n", "SELECT company_name, order_id, order_date\n", "FROM customers INNER JOIN orders USING(customer_id)\n", "WHERE company_name LIKE 'P%'\n", "ORDER BY company_name" ] }, { "cell_type": "markdown", "id": "b4b149f0", "metadata": {}, "source": [ "_notes_\n", "\n", "- Let's look closer at how our `JOIN`'s result looks like before grouping.\n", "- We shall only take into account the companies whose names start with \"P\". The result table is going to be smaller and more comprehensible.\n", "- \"Paris spécialités\", which had no orders, obviously does not appead here.\n", "- So far we've been only using `INNER JOIN` (and `CROSS JOIN` once).\n", "- There is also another type of `JOIN`, the `OUTER JOIN`.\n", "- `OUTER JOIN` is similar but causes all rows of the input table to **appear at least once** in the result — even if `OUTER JOIN`'s condition was never satisfied for them.\n", "- If needed, it uses `NULL`s in place of nonexistent values that would come from the other input table.\n", "- Change `INNER JOIN` to `LEFT OUTER JOIN`.\n", "- See? The special row with `NULL`s is not added for other customers, only for \"Paris spécialités\", which would otherwise not appear in the result.\n", "- Recall now that `COUNT` counted non-`NULL` values.\n", "- It is perfect for the task of counting the actual number of orders of each company here.\n", " - Add `, COUNT(order_id) AS order_count` and `GROUP BY company_name, customer_id`.\n", " - Remove `, order_date`.\n", "- Note that\n", " - `COUNT(order_id)` works properly, because `order_id` is `NULL` in case of the special row, but\n", " - `COUNT(*)` would be bad because it counts the rows in the group without regard to `NULL`s.\n", "- What is the purpose of `LEFT` in `LEFT OUTER JOIN`?\n", " - Comment out the `GROUP BY` and `COUNT` parts.\n", " - `LEFT OUTER JOIN`, as name suggests, ensures that all rows of the left table appear in its result.\n", " - Swap table names around `LEFT OUTER JOIN`.\n", " - \"Paris spécialités\" is gone again.\n", " - We could now instead use `RIGHT OUTER JOIN` which functions as you'd expect.\n", " - \"Paris spécialités\" is back there." ] }, { "cell_type": "code", "execution_count": null, "id": "43d48810", "metadata": {}, "outputs": [], "source": [ "%%sql\n", "SELECT employee_id, first_name, last_name, reports_to\n", "FROM employees" ] }, { "cell_type": "markdown", "id": "aafb595c", "metadata": {}, "source": [ "_notes_\n", "\n", "- Let's now look at the `employees` table.\n", "- An employee can have another employee as a boss.\n", " - This is recorded in the `reports_to` attribute.\n", " - The value of `reports_to`, if present, is another employee's id." ] }, { "cell_type": "code", "execution_count": null, "id": "73e9440f", "metadata": {}, "outputs": [], "source": [ "%%sql\n", "SELECT se.first_name || ' ' || se.last_name AS subordinate,\n", " me.first_name || ' ' || me.last_name AS manager\n", "FROM employees se\n", " LEFT OUTER JOIN employees me\n", " ON se.reports_to = me.employee_id" ] }, { "cell_type": "markdown", "id": "d08cea04", "metadata": {}, "source": [ "_notes_\n", "\n", "- We'd like to list all emploees together with their managers.\n", "- Theodore, is this query comprehensible?\n", "- We have employees and their bosses in the 2 columns of the result.\n", " - Andrew Fuller is the only employee without real boss.\n", " - Notice how concatenating a string with `NULL` yields a `NULL` value.\n", "- `LEFT OUTER JOIN` allowed us to have every employee appear in the left column, with possible `NULL`s in the right one.\n", "- We can instead use `RIGHT OUTER JOIN` to have every employee appear in the right column, with possible `NULL`s in the left one.\n", "- SQL also allows us to get both of those behaviors at once.\n", " - I.e., do an `OUTER JOIN` that guarantees that each row from both the left and the right table appears in the result.\n", " - It is, in a sense, an `OUTER JOIN` that is `RIGHT` and `LEFT` at the same time.\n", " - Taking inspiration from boss' surname, change the code to use `FULL OUTER JOIN`.\n", " - This one is definitely less frequently used than one-sided `OUTER JOIN`s.\n", " - However, it can be argued that a table as produced by `FULL OUTER JOIN` allows more things to be looked up easily by a human reader.\n", "- Now, a useful hint. The `OUTER` and `INNER` keywords can be dropped when writing queries :)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "/gnu/store/q35bxk1i5blx0rcgxphhrq9xqmgha9bz-python-3.11.11/bin/python3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.11" } }, "nbformat": 4, "nbformat_minor": 5 }