{ "cells": [ { "cell_type": "markdown", "id": "ced9fab5", "metadata": {}, "source": [ "# NoSQL Databases" ] }, { "cell_type": "markdown", "id": "3627a50a", "metadata": {}, "source": [ "## Document (JSON) Databases" ] }, { "cell_type": "markdown", "id": "5c070777", "metadata": {}, "source": [ "_notes_\n", "\n", "Why base a database on a data format like JSON?\n", "\n", "This might be convenient because even if internally many systems use SQL, they often use JSON (or XML) to exchange data.\n", "\n", "Data in SQL databases must match the schema. With JSON and similar it is easy to add new keys to the documents and have older systems simply ignore these keys.\n", "\n", "Because it is fashionable?" ] }, { "cell_type": "markdown", "id": "7936b6a3", "metadata": {}, "source": [ "|database|description|\n", "|:-|:-|\n", "|MongoDB|A database system using JSON and JavaScript, FLOSS until 2018.|\n", "|Amazon DocumentDB|Amazon's SaaSS with some degree of compatibility with MongoDB.|\n", "|Azure Cosmos DB|Microsoft's SaaSS with some degree of compatibility with MongoDB, named \"DocumentDB\" before 2017.|\n", "|DocumentDB|A Microsoft-maintained Expat-licensed database system \"powering the […] Azure Cosmos DB, built on PostgreSQL,\" publicly released in 2025.|\n", "|FerretDB|An Apache-2.0-licensed database system with some degree of compatibility with MongoDB, utilizes Postgres and (since 2025) DocumentDB under the hood.|\n", "|CouchDB|An independent database system using JSON and JavaScript.|" ] }, { "cell_type": "markdown", "id": "019db03f", "metadata": {}, "source": [ "_notes_\n", "\n", "MongoDB Inc. wanted to prevent other tech companies from offering MongoDB as an API. So it moved from the GNU Affero General Public License to one that forbids offering a MongoDB database as a service.\n", "\n", "No prominent forks of MongoDB emerged, but several API-compatible or even protocol-compatible systems have been created.\n", "\n", "This lecture relies on an older release of FerretDB, which is nonetheless enough for our demo purposes." ] }, { "cell_type": "code", "execution_count": null, "id": "12d29ab2", "metadata": {}, "outputs": [], "source": [ "import pymongo\n", "\n", "client = pymongo.MongoClient('localhost', 27017)" ] }, { "cell_type": "markdown", "id": "095256b1", "metadata": {}, "source": [ "### Creating and Populating a Database" ] }, { "cell_type": "code", "execution_count": null, "id": "0c823a6e", "metadata": {}, "outputs": [], "source": [ "for db_name in client.list_database_names():\n", " client.drop_database(db_name)" ] }, { "cell_type": "code", "execution_count": null, "id": "bbd0afb3", "metadata": {}, "outputs": [], "source": [ "client.list_database_names()" ] }, { "cell_type": "markdown", "id": "4127fc0b", "metadata": {}, "source": [ "_notes_\n", "\n", "We shall start with an empty database." ] }, { "cell_type": "code", "execution_count": null, "id": "6ef74d9d", "metadata": {}, "outputs": [], "source": [ "client.my_simple_database" ] }, { "cell_type": "markdown", "id": "56d4264b", "metadata": {}, "source": [ "_notes_\n", "\n", "We do not need to explicitly create a database, it gets created automatically when we try to use it." ] }, { "cell_type": "code", "execution_count": null, "id": "180df3cf", "metadata": {}, "outputs": [], "source": [ "client.list_database_names()" ] }, { "cell_type": "markdown", "id": "500b418f", "metadata": {}, "source": [ "_notes_\n", "\n", "Database creation happens only after we add something to it." ] }, { "cell_type": "code", "execution_count": null, "id": "467ff7e5", "metadata": {}, "outputs": [], "source": [ "client.my_simple_database.create_collection('my_docs')" ] }, { "cell_type": "markdown", "id": "6f671ebc", "metadata": {}, "source": [ "_notes_\n", "\n", "\"Collection\" is where we can keep (JSON) documents." ] }, { "cell_type": "code", "execution_count": null, "id": "99a4067c", "metadata": {}, "outputs": [], "source": [ "client.list_database_names()" ] }, { "cell_type": "code", "execution_count": null, "id": "c8179c50", "metadata": {}, "outputs": [], "source": [ "client.my_simple_database.list_collection_names()" ] }, { "cell_type": "code", "execution_count": null, "id": "97f333f7", "metadata": {}, "outputs": [], "source": [ "client.my_simple_database.my_docs.insert_one({\n", " \"my_key\": \"my_value\",\n", " \"other_key\": [\"list\", \"with\", 4, \"elements\"],\n", " \"yet_another_one\": {\n", " \"can-we-have-nested-documents?\": True\n", " }\n", "})" ] }, { "cell_type": "markdown", "id": "a2784a3f", "metadata": {}, "source": [ "_notes_\n", "\n", "Although the native languages of the database are JS and JSON, Python syntax can be fairly similar to JSON and we'll use Python in our examples.\n", "\n", "Note that we received a database-assigned id of the document that we've just inserted.\n", "\n", "Note that in Python and JS we would say that we have a dict inside a dict and an object inside an object, respectively. In the context of MongoDB / FerretDB we typically use the term \"document\" instead. We can say about those nested \"objects\" that they are **subdocuments** within a document." ] }, { "cell_type": "code", "execution_count": null, "id": "9e9ff65c", "metadata": {}, "outputs": [], "source": [ "client.my_simple_database.my_docs.count_documents({})" ] }, { "cell_type": "markdown", "id": "79a57436", "metadata": {}, "source": [ "_notes_\n", "\n", "The empty dict is a filter. We use it to tell the database to count all documents. We'll see some examples of other filtrs shortly." ] }, { "cell_type": "code", "execution_count": null, "id": "566a9d5e", "metadata": {}, "outputs": [], "source": [ "client.my_simple_database.my_docs.insert_many(\n", " {\"some_key\": i} for i in range(10) # Iterables work here.\n", ")" ] }, { "cell_type": "markdown", "id": "fe336a37", "metadata": {}, "source": [ "_notes_\n", "\n", "Re-run the cell that shows document count (now 11).\n", "\n", "Notice that the structure of the documents added to the same collection can be totally different." ] }, { "cell_type": "code", "execution_count": null, "id": "9c39bb06", "metadata": {}, "outputs": [], "source": [ "!head -15 nosql-databases.ipynb" ] }, { "cell_type": "markdown", "id": "4e3acae5", "metadata": {}, "source": [ "_notes_\n", "\n", "The notebooks with our lectures are also JSON files.\n", "\n", "Let's add the cells from this notebook to a new collection. Save this notebook (`Ctrl+s`) and then run the cell below." ] }, { "cell_type": "code", "execution_count": null, "id": "4b7655f0", "metadata": { "scrolled": true }, "outputs": [], "source": [ "import subprocess\n", "import json\n", "\n", "def get_notebook_cells():\n", " freshest_notebook_name = subprocess.getoutput('ls -cht *.ipynb | head -1')\n", " with open(freshest_notebook_name, 'rt') as _in:\n", " return json.load(_in)['cells']\n", "\n", "get_notebook_cells()" ] }, { "cell_type": "markdown", "id": "529cc666", "metadata": {}, "source": [ "_notes_\n", "\n", "Ok, we are able to load a notebook and make its cells into Pytho dicts. Let's also add an `_id` field to each and then insert them into a collection." ] }, { "cell_type": "code", "execution_count": null, "id": "f7549910", "metadata": {}, "outputs": [], "source": [ "def get_notebook_cells():\n", " freshest_notebook_name = subprocess.getoutput('ls -cht *.ipynb | head -1')\n", "\n", " with open(freshest_notebook_name, 'rt') as _in:\n", " cells = json.load(_in)['cells']\n", "\n", " for _id, cell in enumerate(cells):\n", " cell['_id'] = _id\n", " \n", " return cells\n", "\n", "client.db_lectures.nb_cells.insert_many(get_notebook_cells())" ] }, { "cell_type": "markdown", "id": "623bf5a4", "metadata": {}, "source": [ "_notes_\n", "\n", "As you see, we don't even need to explicitly create the collection.\n", "\n", "Also, we can put our own `_id` in the document that we insert and the database shall pick it up." ] }, { "cell_type": "markdown", "id": "98bbe078", "metadata": {}, "source": [ "### Simple Queries on Documents" ] }, { "cell_type": "code", "execution_count": null, "id": "cba47bca", "metadata": {}, "outputs": [], "source": [ "client.my_simple_database.my_docs.find_one()" ] }, { "cell_type": "markdown", "id": "a78e67a4", "metadata": {}, "source": [ "_notes_\n", "\n", "We see the database-assigned id under a special `_id` key in the document." ] }, { "cell_type": "code", "execution_count": null, "id": "cdb6c674", "metadata": {}, "outputs": [], "source": [ "client.my_simple_database.my_docs.find()" ] }, { "cell_type": "code", "execution_count": null, "id": "5cf05cc5", "metadata": {}, "outputs": [], "source": [ "list(client.my_simple_database.my_docs.find())" ] }, { "cell_type": "markdown", "id": "c89c597e", "metadata": {}, "source": [ "_notes_\n", "\n", "We can retrieve multiple documents. It happens through a **cursor** object (as in the case of rows fetched from an SQL database). We shall leverage the fact that pymongo cursors are Python iterables and collect our fetched documents into lists." ] }, { "cell_type": "code", "execution_count": null, "id": "2e1ce539", "metadata": {}, "outputs": [], "source": [ "list(client.db_lectures.nb_cells.find(\n", " {\"_id\": 0}\n", "))" ] }, { "cell_type": "markdown", "id": "2d3731c3", "metadata": {}, "source": [ "_notes_\n", "\n", "We can query for documents with numeric value `0` under key `_id`." ] }, { "cell_type": "code", "execution_count": null, "id": "23125b62", "metadata": {}, "outputs": [], "source": [ "list(client.db_lectures.nb_cells.find(\n", " {\"cell_type\": \"markdown\"}\n", "))" ] }, { "cell_type": "markdown", "id": "ee09b36d", "metadata": {}, "source": [ "_notes_\n", "\n", "We can query for documents with string value `\"markdown\"` under key `cell_type`." ] }, { "cell_type": "code", "execution_count": null, "id": "8ccce167", "metadata": {}, "outputs": [], "source": [ "list(client.db_lectures.nb_cells.find(\n", " {\"source\": \"\\n\"}\n", "))" ] }, { "cell_type": "markdown", "id": "b3950084", "metadata": {}, "source": [ "_notes_\n", "\n", "We can query for documents with string value `\"\\n\"` inside a list that is under key `source` (these repreent cells with at least one empty line of input)." ] }, { "cell_type": "code", "execution_count": null, "id": "57c8f70a", "metadata": {}, "outputs": [], "source": [ "list(client.db_lectures.nb_cells.find(\n", " {\"source.0\": \"list(client.db_lectures.nb_cells.find(\\n\"}\n", "))" ] }, { "cell_type": "markdown", "id": "78f77a01", "metadata": {}, "source": [ "_notes_\n", "\n", "We can query for documents with string value `\"list(client.my_simple_database.my_docs.find(\\n\"` at index 0 in a list that is under key `source`.\n", "\n", "As you see, we do have a few examples involving the `find()` API method." ] }, { "cell_type": "code", "execution_count": null, "id": "6d9cca6e", "metadata": {}, "outputs": [], "source": [ "list(client.db_lectures.nb_cells.find(\n", " {\"_id\": {\"$lt\": 3}}\n", "))" ] }, { "cell_type": "markdown", "id": "3bd9d353", "metadata": {}, "source": [ "_notes_\n", "\n", "We can query for documents with value smaller than 3 under key `_id`." ] }, { "cell_type": "code", "execution_count": null, "id": "e15c9914", "metadata": {}, "outputs": [], "source": [ "list(client.db_lectures.nb_cells.find(\n", " {\n", " \"_id\": {\n", " \"$gt\": 4,\n", " \"$lte\": 8\n", " }\n", " }\n", "))" ] }, { "cell_type": "markdown", "id": "34684a87", "metadata": {}, "source": [ "_notes_\n", "\n", "We can specify multiple conditions together." ] }, { "cell_type": "code", "execution_count": null, "id": "25e24f34", "metadata": {}, "outputs": [], "source": [ "list(client.db_lectures.nb_cells.find(\n", " {\n", " \"$or\": [\n", " {\"_id\": {\"$lt\": 4}},\n", " {\"_id\": {\"$gt\": 8}}\n", " ]\n", " }\n", "))" ] }, { "cell_type": "markdown", "id": "3740c8f9", "metadata": {}, "source": [ "_notes_\n", "\n", "Note that we can also perform a logical `OR`." ] }, { "cell_type": "markdown", "id": "d0013b83", "metadata": {}, "source": [ "### Updates to Documents" ] }, { "cell_type": "code", "execution_count": null, "id": "d5a31ce9", "metadata": {}, "outputs": [], "source": [ "client.db_lectures.nb_cells.update_one(\n", " # A filter.\n", " {\n", " \"_id\": {\n", " \"$gt\": 4,\n", " \"$lte\": 8\n", " }\n", " },\n", " # An update specification.\n", " {\n", " \"$set\": {\"extra_key\": [\"a\", \"b\", \"c\"]}\n", " }\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "6b66272f", "metadata": {}, "outputs": [], "source": [ "list(client.db_lectures.nb_cells.find(\n", " {\n", " \"_id\": {\n", " \"$gt\": 4,\n", " \"$lte\": 8\n", " }\n", " }\n", "))" ] }, { "cell_type": "markdown", "id": "2a186ff6", "metadata": {}, "source": [ "_notes_\n", "\n", "`update_one()` only updates the first of the matching documents." ] }, { "cell_type": "code", "execution_count": null, "id": "d569d794", "metadata": {}, "outputs": [], "source": [ "client.db_lectures.nb_cells.update_many(\n", " {\n", " \"_id\": {\n", " \"$gt\": 4,\n", " \"$lte\": 6\n", " }\n", " },\n", " {\n", " \"$set\": {\"outerkey.innerkey.innerinnerkey\": None}\n", " }\n", ")" ] }, { "cell_type": "markdown", "id": "eb474d68", "metadata": {}, "source": [ "_notes_\n", "\n", "Note that we can re-run this one to see that the `nModified` property of the reported result holds the number of documents actually altered." ] }, { "cell_type": "code", "execution_count": null, "id": "d4ff1f75", "metadata": {}, "outputs": [], "source": [ "list(client.db_lectures.nb_cells.find(\n", " {\n", " \"_id\": {\n", " \"$gt\": 4,\n", " \"$lte\": 8\n", " }\n", " }\n", "))" ] }, { "cell_type": "markdown", "id": "f274a974", "metadata": {}, "source": [ "_notes_\n", "\n", "See that adding a value under some path created the necessary intermediate subdocuments." ] }, { "cell_type": "code", "execution_count": null, "id": "46b5613e", "metadata": {}, "outputs": [], "source": [ "client.db_lectures.nb_cells.replace_one(\n", " {\"_id\": 0},\n", " {\n", " \"_id\": 0,\n", " \"cell_type\": \"markdown\",\n", " \"id\": \"deadbeef\",\n", " \"metadata\": {},\n", " \"source\": [\"This is a replaced document.\\n\",\n", " \"\\n\",\n", " \"# NoSQL Databases\\n\",\n", " \"\\n\",\n", " \"This is a replaced document.\"]\n", " }\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "d9ee4cdc", "metadata": {}, "outputs": [], "source": [ "list(client.db_lectures.nb_cells.find(\n", " {\"_id\": {\"$lt\": 2}}\n", "))" ] }, { "cell_type": "markdown", "id": "82937c32", "metadata": {}, "source": [ "_notes_\n", "\n", "See that we were able to replace a single document. Note that the same `_id` keeps being used for the document.\n", "\n", "Note that the order of the fetched documents is nondeterministic (and we're likely to witness that now, as we make updates to the documents)." ] }, { "cell_type": "code", "execution_count": null, "id": "a46b073d", "metadata": {}, "outputs": [], "source": [ "client.db_lectures.nb_cells.update_one(\n", " {\"_id\": 0},\n", " {\n", " \"$pop\": {\"source\": 1}\n", " }\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "31713f3f", "metadata": {}, "outputs": [], "source": [ "list(client.db_lectures.nb_cells.find(\n", " {\"_id\": {\"$lt\": 2}}\n", "))" ] }, { "cell_type": "markdown", "id": "fbef035f", "metadata": {}, "source": [ "_notes_\n", "\n", "We have removed the last element, string `\"This is a replaced document.\"`, of the array under `source`." ] }, { "cell_type": "code", "execution_count": null, "id": "b7100e76", "metadata": {}, "outputs": [], "source": [ "client.db_lectures.nb_cells.update_one(\n", " {\"_id\": 0},\n", " {\n", " \"$pop\": {\"source\": -1} # Note the -1 sign!\n", " }\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "5294966f", "metadata": { "scrolled": true }, "outputs": [], "source": [ "list(client.db_lectures.nb_cells.find(\n", " {\"_id\": {\"$lt\": 2}}\n", "))" ] }, { "cell_type": "markdown", "id": "3e21f20d", "metadata": {}, "source": [ "_notes_\n", "\n", "We have removed the first element, string `\"This is a replaced document.\\n\"`, of the array under `source`.\n", "\n", "We can performs further `$pop` operations and if the list is already empty — they act as no-ops." ] }, { "cell_type": "code", "execution_count": null, "id": "089923ef", "metadata": {}, "outputs": [], "source": [ "client.db_lectures.nb_cells.update_many(\n", " {\n", " \"outputs.0\": {\"$exists\": True}\n", " },\n", " {\n", " \"$unset\": {\"source\": 1}\n", " }\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "14bfc2ef", "metadata": {}, "outputs": [], "source": [ "list(client.db_lectures.nb_cells.find())" ] }, { "cell_type": "markdown", "id": "e0e0671d", "metadata": {}, "source": [ "_notes_\n", "\n", "We can remove keys.\n", "\n", "Note the use of the `$exists` operator." ] }, { "cell_type": "markdown", "id": "d03d7ca8", "metadata": {}, "source": [ "### Deletions from a Document" ] }, { "cell_type": "code", "execution_count": null, "id": "07e28906", "metadata": {}, "outputs": [], "source": [ "client.db_lectures.nb_cells.delete_one(\n", " {\n", " \"_id\": {\n", " \"$gte\": 3,\n", " \"$lt\": 8\n", " }\n", " }\n", ")" ] }, { "cell_type": "markdown", "id": "4c543615", "metadata": {}, "source": [ "_notes_\n", "\n", "Note that there is no failure if there's no document to delete." ] }, { "cell_type": "markdown", "id": "995853a6", "metadata": {}, "source": [ "### Aggregation" ] }, { "cell_type": "code", "execution_count": null, "id": "36cd15f7", "metadata": {}, "outputs": [], "source": [ "client.db_lectures.nb_cells.delete_many({})\n", "client.db_lectures.nb_cells.insert_many(get_notebook_cells())" ] }, { "cell_type": "markdown", "id": "760d2b24", "metadata": {}, "source": [ "_notes_\n", "\n", "We shall start with a fresh collection of cells." ] }, { "cell_type": "code", "execution_count": null, "id": "6a3f7e86", "metadata": {}, "outputs": [], "source": [ "list(client.db_lectures.nb_cells.aggregate([\n", "# Empty. For now.\n", "]))" ] }, { "cell_type": "markdown", "id": "243d5fb4", "metadata": {}, "source": [ "_notes_\n", "\n", "Documents can be processed in a pipeline. The above shows the documents returned by an empty pipeline fed with our notebook cell documents." ] }, { "cell_type": "code", "execution_count": null, "id": "2bd857e2", "metadata": {}, "outputs": [], "source": [ "list(client.db_lectures.nb_cells.aggregate([\n", " # Pick non-code cells and code cells with nonempty `outputs' array.\n", " {\n", " \"$match\": {\n", " \"$or\": [\n", " {\"outputs.0\": {\"$exists\": True}},\n", " {\"cell_type\": {\"$not\": {\"$eq\": \"code\"}}}\n", " ]\n", " }\n", " }\n", "]))" ] }, { "cell_type": "markdown", "id": "f0ea1749", "metadata": {}, "source": [ "_notes_\n", "\n", "The pipeline can involve filtering. " ] }, { "cell_type": "code", "execution_count": null, "id": "ecb44c7c", "metadata": { "scrolled": true }, "outputs": [], "source": [ "list(client.db_lectures.nb_cells.aggregate([\n", " # Pick non-code cells and code cells with nonempty `outputs' array.\n", " {\n", " \"$match\": {\n", " \"$or\": [\n", " {\"outputs.0\": {\"$exists\": True}},\n", " {\"cell_type\": {\"$not\": {\"$eq\": \"code\"}}}\n", " ]\n", " },\n", " },\n", " # Sort by `_id'.\n", " {\"$sort\": {\"_id\": 1}} # Use -1 for descending order.\n", "]))" ] }, { "cell_type": "markdown", "id": "2b4843fd", "metadata": {}, "source": [ "_notes_\n", "\n", "We can mandate the order of the returned documents." ] }, { "cell_type": "code", "execution_count": null, "id": "ef6f2600", "metadata": {}, "outputs": [], "source": [ "list(client.db_lectures.nb_cells.aggregate([\n", " # Pick non-code cells and code cells with nonempty `outputs' array.\n", " {\n", " \"$match\": {\n", " \"$or\": [\n", " {\"outputs.0\": {\"$exists\": True}},\n", " {\"cell_type\": {\"$not\": {\"$eq\": \"code\"}}}\n", " ]\n", " },\n", " },\n", " # Sort by `_id'.\n", " {\"$sort\": {\"_id\": 1}},\n", " # Get rid of unneeded keys.\n", " {\"$unset\": [\"id\", \"_id\"]}\n", "]))" ] }, { "cell_type": "markdown", "id": "34e117d0", "metadata": {}, "source": [ "_notes_\n", "\n", "We can remove, add and modify fields of the documents." ] }, { "cell_type": "code", "execution_count": null, "id": "c1528ce8", "metadata": {}, "outputs": [], "source": [ "list(client.db_lectures.nb_cells.aggregate([\n", " # Pick non-code cells and code cells with nonempty `outputs' array.\n", " {\n", " \"$match\": {\n", " \"$or\": [\n", " {\"outputs.0\": {\"$exists\": True}},\n", " {\"cell_type\": {\"$not\": {\"$eq\": \"code\"}}}\n", " ]\n", " }\n", " },\n", " # Group by cell type and get cell counts in each group.\n", " {\n", " \"$group\": {\n", " \"_id\": \"$cell_type\",\n", " \"count\": {\"$count\": {}}\n", " }\n", " }\n", "]))" ] }, { "cell_type": "markdown", "id": "3989be5f", "metadata": {}, "source": [ "_notes_\n", "\n", "We can do grouping and (although not necessarily in the old FerretDB version used in this demo):\n", "\n", "- count documents,\n", "- sum field values,\n", "- compute average, standard deviation, etc.,\n", "- collect field values into lists, and\n", "- more…" ] }, { "cell_type": "markdown", "id": "c462a12c", "metadata": {}, "source": [ "Try creating a \"raw\" cell, saving the notebook, re-populating the collection and running the grouping query again." ] }, { "cell_type": "markdown", "id": "9a05ca45", "metadata": {}, "source": [ "_notes_\n", "\n", "We shall now additionally sort the final documents." ] }, { "cell_type": "code", "execution_count": null, "id": "540722cb", "metadata": {}, "outputs": [], "source": [ "client.db_lectures.nb_cells.delete_many({})\n", "client.db_lectures.nb_cells.insert_many(get_notebook_cells())" ] }, { "cell_type": "code", "execution_count": null, "id": "9db64a97", "metadata": {}, "outputs": [], "source": [ "list(client.db_lectures.nb_cells.aggregate([\n", " # Pick non-code cells and code cells with nonempty `outputs' array.\n", " {\n", " \"$match\": {\n", " \"$or\": [\n", " {\"outputs.0\": {\"$exists\": True}},\n", " {\"cell_type\": {\"$not\": {\"$eq\": \"code\"}}}\n", " ]\n", " }\n", " },\n", " # Group by cell type and get cell counts in each group.\n", " {\n", " \"$group\": {\n", " \"_id\": \"$cell_type\",\n", " \"count\": {\"$count\": {}}\n", " }\n", " },\n", " # Sort types by cell count.\n", " {\n", " \"$sort\": {\"count\": -1}\n", " }\n", "]))" ] }, { "cell_type": "markdown", "id": "fc3da0d9", "metadata": {}, "source": [ "_notes_\n", "\n", "There are many more possible aggregation steps supported by MongoDB-compatible databases, including `$lookup`, which is the equivalent of SQL's `JOIN` — it can combine the aggregated documents with ones from another collection." ] }, { "cell_type": "markdown", "id": "8603db3a", "metadata": {}, "source": [ "## Key-Value Stores and In-Memory Databases" ] }, { "cell_type": "markdown", "id": "bfd15b19", "metadata": {}, "source": [ "|database|description|\n", "|:-|:-|\n", "|Redis|A popular key-value store, in-memory database and message broker.|\n", "|Valkey|A fork of Redis by several big companies, also over relicensing issues.|\n", "|localStorage|Browser API of a key-value store.|\n", "|BerkeleyDB|A key-value store from the 90s Unix world, familliar to all old hackers.|\n", "|Memcached|Another popular, in-memory key-value store.|\n", "|LMDB|A key-value store heavily leveraging memory-mapped files.|\n", "|_And dozens more…_|" ] }, { "cell_type": "markdown", "id": "96751bf0", "metadata": {}, "source": [ "_notes_\n", "\n", "We shall use Valkey in our examples." ] }, { "cell_type": "markdown", "id": "f1cb4122", "metadata": {}, "source": [ "### The Basics" ] }, { "cell_type": "code", "execution_count": null, "id": "3759a0fc", "metadata": {}, "outputs": [], "source": [ "import valkey\n", "\n", "vk = valkey.Valkey(host='localhost', port=20637, db=0)" ] }, { "cell_type": "markdown", "id": "17b1c248", "metadata": {}, "source": [ "Keys and values are strings." ] }, { "cell_type": "code", "execution_count": null, "id": "0ea5498d", "metadata": {}, "outputs": [], "source": [ "vk.set('bar', 'foo')" ] }, { "cell_type": "markdown", "id": "77b787a7", "metadata": {}, "source": [ "_notes_\n", "\n", "Note that this is idempotent." ] }, { "cell_type": "code", "execution_count": null, "id": "81ed3c9b", "metadata": {}, "outputs": [], "source": [ "vk.get('bar')" ] }, { "cell_type": "markdown", "id": "eb06f405", "metadata": {}, "source": [ "Keys and values are **bytestrings**." ] }, { "cell_type": "code", "execution_count": null, "id": "61454016", "metadata": {}, "outputs": [], "source": [ "with open('/dev/random', 'rb') as bytes_stream:\n", " val = bytes_stream.read(10)\n", " key = bytes_stream.read(10)\n", "\n", "(val, key)" ] }, { "cell_type": "code", "execution_count": null, "id": "43fdecb6", "metadata": {}, "outputs": [], "source": [ "vk.set(key, val)" ] }, { "cell_type": "code", "execution_count": null, "id": "76380dfd", "metadata": {}, "outputs": [], "source": [ "vk.get(key)" ] }, { "cell_type": "code", "execution_count": null, "id": "1e353828", "metadata": {}, "outputs": [], "source": [ "vk.set(b'lectures:17',\n", " b'{\"id\": \"nosql-databases\", \"title\": \"NoSQL Databases\"}')" ] }, { "cell_type": "markdown", "id": "2cef10bd", "metadata": {}, "source": [ "_notes_\n", "\n", "It is a Radis / Valkey convention to use key names with colons.\n", "\n", "It is common to store values in some format that the key-value database is agnostic of. For example, values that are JSON documents.\n", "\n", "Nonetheless, **Valkey and Redis support Lua server-side scripting and include Lua libraries for JSON and the more efficient [MessagePack](https://msgpack.org/index.html)**. The line between key-value and document-oriented databases is blurred." ] }, { "cell_type": "code", "execution_count": null, "id": "aa6a908f", "metadata": { "scrolled": true }, "outputs": [], "source": [ "import json\n", "\n", "json.loads(vk.get('lectures:17'))" ] }, { "cell_type": "markdown", "id": "d2d12b06", "metadata": {}, "source": [ "_notes_\n", "\n", "Note that using a colon notation for key names is a recommended convention in Valkey and Redis." ] }, { "cell_type": "markdown", "id": "de3a6c56", "metadata": {}, "source": [ "### One Typical Valkey Use-Case" ] }, { "cell_type": "markdown", "id": "37532bb8", "metadata": {}, "source": [ "```\n", " Classical database\n", " |\n", " |\n", " |\n", "Valkey (cache of frequently used values)\n", " |\n", " |\n", " |\n", " Application\n", "```" ] }, { "cell_type": "markdown", "id": "988e4795", "metadata": {}, "source": [ "_notes_\n", "\n", "This would also be a typical use-case of, for example, Memcached.\n", "\n", "OTOH, key-value stores like LMDB and BerkeleyDB are typically used as the primary place of storing data, not as caches." ] }, { "cell_type": "markdown", "id": "e09fbe06", "metadata": {}, "source": [ "### Key Expiration and Eviction" ] }, { "cell_type": "code", "execution_count": null, "id": "1171bfd9", "metadata": {}, "outputs": [], "source": [ "vk.set(b'lectures:17',\n", " b'{\"id\": \"nosql-databases\", \"title\": \"NoSQL Databases\"}',\n", " ex=10) # TTL set to 10 seconds." ] }, { "cell_type": "code", "execution_count": null, "id": "bcae6e41", "metadata": {}, "outputs": [], "source": [ "vk.get(b'lectures:17')" ] }, { "cell_type": "markdown", "id": "73bc07a1", "metadata": {}, "source": [ "_notes_\n", "\n", "Once 10 seconds pass, the `vk.get()` call shall be returning `None` values as the key is no longer kept by Valkey." ] }, { "cell_type": "markdown", "id": "a8f27ce7", "metadata": {}, "source": [ "The following eviction types are supported by Valkey.\n", "\n", "|type|description|\n", "|:-|:-|\n", "|noeviction|Cannot add new keys upon reaching memory limit.|\n", "|allkeys-lru|Evict the least recently used keys.|\n", "|allkeys-lfu|Evict the least frequently used keys.|\n", "|volatile-lru|Evict the least recently used keys out of those with TTL set.|\n", "|volatile-lfu|Evict the least frequently used keys out of those with TTL set.|\n", "|allkeys-random|Evict whichever keys.|\n", "|volatile-random|Evict whichever keys out of those with TTL set.|\n", "|volatile-ttl|Evict keys with shortest remaining time-to-live.|" ] }, { "cell_type": "markdown", "id": "6e9e3d4b", "metadata": {}, "source": [ "### Valkey Hashes" ] }, { "cell_type": "code", "execution_count": null, "id": "97775774", "metadata": {}, "outputs": [], "source": [ "vk.hset(b'user:theodore', b'session_id', b'BL1ZxQKfCO6g')" ] }, { "cell_type": "code", "execution_count": null, "id": "fac6b281", "metadata": {}, "outputs": [], "source": [ "vk.hset(b'user:theodore', b'csrf_token', b'QwGESPE4OeUe')" ] }, { "cell_type": "code", "execution_count": null, "id": "af4d81b7", "metadata": {}, "outputs": [], "source": [ "vk.hget(b'user:theodore', b'session_id')" ] }, { "cell_type": "code", "execution_count": null, "id": "1009a4fd", "metadata": {}, "outputs": [], "source": [ "vk.hget(b'user:theodore', b'csrf_token')" ] }, { "cell_type": "markdown", "id": "a98436bc", "metadata": {}, "source": [ "_notes_\n", "\n", "Instead of storing a simple value under a key, we created a hash under the key `my_mapping`. The hash can contain multiple key-value pairs. The typical way of storing related pieces of information (e.g., attributes of a single user) is putting them as key-value pairs under a single hash." ] }, { "cell_type": "code", "execution_count": null, "id": "da3f9924", "metadata": { "scrolled": true }, "outputs": [], "source": [ "vk.get(b'user:theodore')" ] }, { "cell_type": "markdown", "id": "b4a72597", "metadata": {}, "source": [ "_notes_\n", "\n", "Note that if a key stores a hash, we cannot fetch its value using the `GET` operation." ] }, { "cell_type": "code", "execution_count": null, "id": "54a96e45", "metadata": {}, "outputs": [], "source": [ "vk.expire(b'user:theodore', 10)" ] }, { "cell_type": "markdown", "id": "3946fb76", "metadata": {}, "source": [ "_notes_\n", "\n", "The canonical way of expiring hashes is expiring the entire mapping at once. Above we set a TTL of 10 seconds on our hash.\n", "\n", "Note that since 2025 it is also possible to set expiration on individual hash keys in Valkey." ] }, { "cell_type": "code", "execution_count": null, "id": "7bb87089", "metadata": { "scrolled": true }, "outputs": [], "source": [ "vk.hget(b'user:theodore', 'session_id')" ] }, { "cell_type": "markdown", "id": "51c223cf", "metadata": {}, "source": [ "Let's re-add our now-expired hash." ] }, { "cell_type": "code", "execution_count": null, "id": "fa6ff1cd", "metadata": {}, "outputs": [], "source": [ "vk.hset(b'user:theodore', items={\n", " b'session_id': b'BL1ZxQKfCO6g',\n", " b'csrf_token': b'QwGESPE4OeUe'\n", "})" ] }, { "cell_type": "markdown", "id": "bd2c3ddd", "metadata": {}, "source": [ "_notes_\n", "\n", "We've added multiple hash keys at once." ] }, { "cell_type": "code", "execution_count": null, "id": "4a7b3983", "metadata": {}, "outputs": [], "source": [ "vk.hgetall(b'my_set')" ] }, { "cell_type": "markdown", "id": "fb41a530", "metadata": {}, "source": [ "_notes_\n", "\n", "We've fetched all hash keys at once." ] }, { "cell_type": "markdown", "id": "1dab93ea", "metadata": {}, "source": [ "### Valkey Lists" ] }, { "cell_type": "code", "execution_count": null, "id": "02520fba", "metadata": {}, "outputs": [], "source": [ "vk.lpush(b'messages', 'Let\\'s get back to times…')" ] }, { "cell_type": "code", "execution_count": null, "id": "28986a96", "metadata": {}, "outputs": [], "source": [ "vk.lpush(b'messages', '…when we used to talk about…')\n", "vk.lpush(b'messages', '…simple algorithmic data structures…')\n", "vk.lpush(b'messages', '…rather than microservices.')" ] }, { "cell_type": "markdown", "id": "4480cc64", "metadata": {}, "source": [ "_notes_\n", "\n", "A list in valkey is literally a list, as taught in beginner C courses. Access to a specific index has O(n) time. Addition of an element makes it stored in a new head." ] }, { "cell_type": "code", "execution_count": null, "id": "993fe4ad", "metadata": {}, "outputs": [], "source": [ "vk.lindex(b'messages', 2) # Get the element at index 2." ] }, { "cell_type": "code", "execution_count": null, "id": "838f82e0", "metadata": {}, "outputs": [], "source": [ "vk.lrange(b'messages', 0, 1) # Get elements from index 0 to 1." ] }, { "cell_type": "code", "execution_count": null, "id": "8ebe228b", "metadata": {}, "outputs": [], "source": [ "vk.lrange(b'messages', 0, -1) # Get elements from index 0 to end." ] }, { "cell_type": "code", "execution_count": null, "id": "ea3ff810", "metadata": {}, "outputs": [], "source": [ "vk.lpop(b'messages') # Remove the first element and return it." ] }, { "cell_type": "markdown", "id": "8dc99ecd", "metadata": {}, "source": [ "_notes_\n", "\n", "If the list is empty, `LPOP` yields nothing (`None` in case of a Python client)." ] }, { "cell_type": "code", "execution_count": null, "id": "6e9af2b7", "metadata": {}, "outputs": [], "source": [ "vk.blpop(b'messages', 10)" ] }, { "cell_type": "markdown", "id": "6b1533de", "metadata": {}, "source": [ "_notes_\n", "\n", "The above is a blocking variant of `LPOP`. If the list is empty, it waits for up to the specified number of seconds for some other client to add an element that could be popped from it." ] }, { "cell_type": "code", "execution_count": null, "id": "ff986097", "metadata": {}, "outputs": [], "source": [ "from threading import Thread\n", "\n", "class MessageAdderThread(Thread):\n", " def run(self):\n", " import time\n", " adder_vk = valkey.Valkey(host='localhost', port=20637, db=0)\n", " time.sleep(10)\n", " adder_vk.lpush(b'messages', 'Hello from thread!')\n", "\n", "MessageAdderThread().start()" ] }, { "cell_type": "markdown", "id": "af5039ca", "metadata": {}, "source": [ "_notes_\n", "\n", "The above cell shall push to the list 10 seconds after it is executed. Run the cell below in the meantime to see it wait for the list element and finally get one." ] }, { "cell_type": "code", "execution_count": null, "id": "da518032", "metadata": {}, "outputs": [], "source": [ "vk.blpop(b'messages', 10)" ] }, { "cell_type": "markdown", "id": "6a9f8941", "metadata": {}, "source": [ "### Another Typical Valkey Use-Case" ] }, { "cell_type": "markdown", "id": "f423fe80", "metadata": {}, "source": [ "```\n", "Producer1 Producer2 Producer3\n", " | | |\n", " | | |\n", " | | |\n", "Valkey (used as a message broker)\n", " | |\n", " | |\n", " | |\n", " Consumer1 Consumer2\n", "```" ] }, { "cell_type": "markdown", "id": "4a2a56b5", "metadata": {}, "source": [ "_notes_\n", "\n", "Although we're diverging away from the databases topic, it is worth mentioning that Valkey can be used for exchanging messages between various players in a distributed system.\n", "\n", "There are more opertions to use here than just blocking `LPOP`. Valkey also has **Pub/Sub** channels and **streams** that can can be more suitable, depending on the context." ] }, { "cell_type": "markdown", "id": "ce8c6956", "metadata": {}, "source": [ "### Valkey Sorted Sets" ] }, { "cell_type": "code", "execution_count": null, "id": "23cfcee6", "metadata": {}, "outputs": [], "source": [ "vk.zadd(b'players:ranking', {'theodore': 120})" ] }, { "cell_type": "markdown", "id": "20e02fc9", "metadata": {}, "source": [ "_notes_\n", "\n", "Sorted sets are sets in which each element has a _score_ associated with it. The score is used to keep the elements sorted (Valkey uses a skiplist for this and a hash to keep the actual values)." ] }, { "cell_type": "code", "execution_count": null, "id": "4f58369b", "metadata": {}, "outputs": [], "source": [ "vk.zadd(b'players:ranking', {\n", " 'pancratius': 56,\n", " 'eugenia': 120,\n", " 'alexis': 119,\n", "})" ] }, { "cell_type": "code", "execution_count": null, "id": "6d253af2", "metadata": {}, "outputs": [], "source": [ "vk.zrange(b'players:ranking', 0, 2)" ] }, { "cell_type": "markdown", "id": "d2613ae5", "metadata": {}, "source": [ "_notes_\n", "\n", "We can easily find the first elements of a sorted set (i.e., ones with lowest scores).\n", "\n", "Change the numbers to `-3, -1` to get the last few elements instead (i.e., ones with highest scores)." ] }, { "cell_type": "code", "execution_count": null, "id": "40a16980", "metadata": {}, "outputs": [], "source": [ "vk.zcard(b'players:ranking')" ] }, { "cell_type": "markdown", "id": "f5a8265c", "metadata": {}, "source": [ "_notes_\n", "\n", "We can also check the number of set's elements." ] }, { "cell_type": "code", "execution_count": null, "id": "9bd02191", "metadata": {}, "outputs": [], "source": [ "vk.zadd(b'lectures:dbs', {\n", " '2026-01-09': 0,\n", " '2026-01-27': 0,\n", " '2026-01-20': 0,\n", " '2026-01-13': 0,\n", "})" ] }, { "cell_type": "markdown", "id": "fe322b20", "metadata": {}, "source": [ "_notes_\n", "\n", "Note that elements with the same score are sorted lexicographically (by byte values).\n", "\n", "Since ISO dates' lexicographical order matches the chronological order, we can now easily retrieve the dates of two last lectures in January 2026." ] }, { "cell_type": "code", "execution_count": null, "id": "b82f3417", "metadata": {}, "outputs": [], "source": [ "vk.zrange(b'lectures:dbs', -2, -1)" ] }, { "cell_type": "code", "execution_count": null, "id": "726ee90d", "metadata": {}, "outputs": [], "source": [ "help(vk.zrangebylex)" ] }, { "cell_type": "code", "execution_count": null, "id": "532bbe22", "metadata": {}, "outputs": [], "source": [ "vk.zrangebylex(b'lectures:dbs',\n", " b'(2026-01-09', # Start string, `(' for \"exclusive\".\n", " b'[2026-01-20') # End string, `[' for \"inclusive\"." ] }, { "cell_type": "markdown", "id": "15ede032", "metadata": {}, "source": [ "LMDB is another example of a key-value store that allows keys to be queried in lexicographical byte order." ] }, { "cell_type": "markdown", "id": "ce269774", "metadata": {}, "source": [ "## XML Querying Support in Databases" ] }, { "cell_type": "markdown", "id": "4e09d809", "metadata": {}, "source": [ "- BaseX\n", "- SQL/XML\n", "- XQuery" ] }, { "cell_type": "markdown", "id": "74debd5f", "metadata": {}, "source": [ "_notes_\n", "\n", "BaseX is a flagship document-oriented database based on XML.\n", "\n", "Since 2003, the SQL standard includes XML-Related Specifications (SQL/XML) in part 14. It defined the `XML` data type and functions / syntax used to work with it.\n", "\n", "XQuery is a query language for XML data, used both by BaseX and by SQL databases implementing support for XML.\n", "\n", "We shall show some examples of XQuery." ] }, { "cell_type": "code", "execution_count": null, "id": "2a538e78", "metadata": {}, "outputs": [], "source": [ "import subprocess\n", "\n", "def run_xquery(program):\n", " subprocess.run(['qexo', '-e', program + '\\n'])" ] }, { "cell_type": "code", "execution_count": null, "id": "b47a8940", "metadata": {}, "outputs": [], "source": [ "run_xquery('

7 * 5 is {7*5}

')" ] }, { "cell_type": "markdown", "id": "fe0f0221", "metadata": {}, "source": [ "_notes_\n", "\n", "XQuery can has many features of a programming language.\n", "\n", "We can write expressions to perform computation." ] }, { "cell_type": "code", "execution_count": null, "id": "902172b6", "metadata": {}, "outputs": [], "source": [ "run_xquery('''\n", " children(

this is text with spans.

)\n", "''')" ] }, { "cell_type": "markdown", "id": "cfbb178f", "metadata": {}, "source": [ "_notes_\n", "\n", "We can use some readily available functions to process XML." ] }, { "cell_type": "code", "execution_count": null, "id": "8f01b59d", "metadata": {}, "outputs": [], "source": [ "run_xquery('''\n", " let $document :=

this is text\n", " with spans.\n", "

return\n", " $document/span\n", "''')" ] }, { "cell_type": "markdown", "id": "9d1bd20e", "metadata": {}, "source": [ "_notes_\n", "\n", "We can use variables and we can query XML using using several operations.\n", "\n", "\"`/`\" means \"find matching tags directly below the root tag\"." ] }, { "cell_type": "code", "execution_count": null, "id": "bf0df57e", "metadata": {}, "outputs": [], "source": [ "run_xquery('''\n", " let $document :=

this is text\n", " with spans.\n", "

return\n", " $document//span[@class=\"bold\"]\n", "''')" ] }, { "cell_type": "markdown", "id": "6fafa0fa", "metadata": {}, "source": [ "_notes_\n", "\n", "\"`//`\" means \"find matching tags anywhere in the tree\"." ] }, { "cell_type": "code", "execution_count": null, "id": "ad531238", "metadata": {}, "outputs": [], "source": [ "run_xquery('''\n", " let $document :=

this is text\n", " with\n", "

\n", " spans\n", "
.\n", "

return\n", " $document//span[@class=\"bold\"]\n", "''')" ] }, { "cell_type": "markdown", "id": "3d44bea4", "metadata": {}, "source": [ "_notes_\n", "\n", "\"`//`\" does, of course, retrieve multiple matching tags if present." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "/gnu/store/pv50hmdxs15c32laa1vn03wkfl023wwk-python-3.11.14/bin/python3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.14" } }, "nbformat": 4, "nbformat_minor": 5 }