{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Guided Project Solution: Building a database for crime reports\n", "## Apply what you have learned to set up a database for storing crime reports data\n", "\n", "## François Aubry\n", "\n", "The goal of this guided project is to setup a database from scratch and the Boston crime data into it.\n", "\n", "We will create two user groups:\n", "\n", "* `readonly`: Users in this group will have permission to read data only.\n", "* `readwrite`: Users in this group will have permissions to read and alter data but not to delete tables." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating the database and the schema\n", "\n", "Create a database named `crime_db` and a schema named `crimes` for storing the tables for containing the crime data.\n", "\n", "The database `crime_db` does not exist yet so we connect to `dq`." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import psycopg2\n", "conn = psycopg2.connect(dbname=\"dq\", user=\"dq\")\n", "# set autocommit to True bacause this is required for creating databases\n", "conn.autocommit = True\n", "cur = conn.cursor()\n", "# create the crime_db database\n", "cur.execute(\"CREATE DATABASE crime_db;\")\n", "conn.close()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "ename": "NameError", "evalue": "name 'psycopg2' is not defined", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# now the crime_db database exists to we can connect to it\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mconn\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpsycopg2\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mconnect\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdbname\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m\"crime_db\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0muser\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m\"dq\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3\u001b[0m \u001b[0mconn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mautocommit\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mTrue\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0mcur\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mconn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcursor\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0;31m# create he crimes schema\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mNameError\u001b[0m: name 'psycopg2' is not defined" ] } ], "source": [ "# now the crime_db database exists to we can connect to it\n", "conn = psycopg2.connect(dbname=\"crime_db\", user=\"dq\")\n", "conn.autocommit = True\n", "cur = conn.cursor()\n", "# create he crimes schema\n", "cur.execute(\"CREATE SCHEMA crimes;\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Obtaining the Column Names and Sample\n", " \n", "Obtain the header row and assign it to a variable named `col_headers` and obtain the first data row and assign it to a variable named `first_row`." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import csv\n", "with open('boston.csv') as file:\n", " reader = csv.reader(file)\n", " col_headers = next(reader)\n", " first_row = next(reader)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating a function for analyzing column values\n", "\n", "Create a function `get_col_set` that given a CSV file name and a column index computes the set of all distinct values in that column.\n", "\n", "Use the function on each column to evaluate which columns have a lot of different values. Columns with a limited set of possible values are good candidates for enumerated datatypes." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "incident_number\t298329\n", "offense_code\t219\n", "description\t239\n", "date\t1177\n", "day_of_the_week\t7\n", "lat\t18177\n", "long\t18177\n" ] } ], "source": [ "def get_col_set(csv_file, col_index):\n", " import csv\n", " values = set()\n", " with open(csv_file, 'r') as f:\n", " next(f)\n", " reader = csv.reader(f)\n", " for row in reader:\n", " values.add(row[col_index])\n", " return values\n", "\n", "for i in range(len(col_headers)):\n", " values = get_col_set(\"boston.csv\", i)\n", " print(col_headers[i], len(values), sep='\\t')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Analyzing the maximum length of the description column\n", "\n", "Use the `get_col_set` function to compute the maximum description length to decide an appropriate length for that field." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['incident_number', 'offense_code', 'description', 'date', 'day_of_the_week', 'lat', 'long']\n" ] } ], "source": [ "print(col_headers)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "58\n" ] } ], "source": [ "descriptions = get_col_set(\"boston.csv\", 2) # description is at index number 2\n", "max_len = 0\n", "for description in descriptions:\n", " max_len = max(max_len, len(description))\n", "print(max_len)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating the table\n", "\n", "We have create an enumerated datatype named `weekday` for the `day_of_the_week` since there there only seven possible values.\n", "\n", "For the `incident_number` we have decided to user the type `INTEGER` and set it as the primary key. The same datatype was also used to represent the `offense_code`.\n", "\n", "Since the description has at most `58` character we decided to use the datatype `VARCHAR(100)` for representing it. This leave some margin while not being so big that we will waste a lot of memory.\n", "\n", "The date was represented as the `DATE` datatype. Finally, for the latitude and longitude we used `DECIMAL` datatypes." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['incident_number', 'offense_code', 'description', 'date', 'day_of_the_week', 'lat', 'long']\n", "['1', '619', 'LARCENY ALL OTHERS', '2018-09-02', 'Sunday', '42.35779134', '-71.13937053']\n" ] } ], "source": [ "print(col_headers)\n", "print(first_row)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use the same names for the column headers.\n", "\n", "The number of different values of each column was:\n", "\n", "```\n", "incident_number 298329\n", "offense_code 219\n", "description 239\n", "date\t 1177\n", "day_of_the_week 7\n", "lat 18177\n", "long\t 18177\n", "```\n", "\n", "From the result of printing `first_row` we see that kind of data that we have are:\n", "\n", "```\n", "integer numbers\n", "integer numbers\n", "string\n", "date\n", "string\n", "decimal number\n", "decimal number\n", "```\n", "\n", "Only column `day_of_the_week` has a small range of values so we will only create an enumerated datatype for this column. Column `offense_code` is also a good candidate since there is probably a limited set of possible offense codes.\n", "\n", "We saw that the `offense_code` column has size at most 59. To be on the safe side we will limit the size of the description to 100 and use the `VARCHAR(100)` datatype.\n", "\n", "The `lat` and `long` column see to need to hold quite a lot of precision so we will use the `decimal` type." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "ename": "ProgrammingError", "evalue": "type \"weekday\" already exists\n", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mProgrammingError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 4\u001b[0m cur.execute(\"\"\"\n\u001b[1;32m 5\u001b[0m \u001b[0mCREATE\u001b[0m \u001b[0mTYPE\u001b[0m \u001b[0mweekday\u001b[0m \u001b[0mAS\u001b[0m \u001b[0mENUM\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0;34m'Monday'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'Tuesday'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'Wednesday'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'Thursday'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'Friday'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'Saturday'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'Sunday'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \"\"\")\n\u001b[0m\u001b[1;32m 7\u001b[0m \u001b[0;31m# create the table\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 8\u001b[0m cur.execute(\"\"\"\n", "\u001b[0;31mProgrammingError\u001b[0m: type \"weekday\" already exists\n" ] } ], "source": [ "# create the enumerated datatype for representing the weekday\n", "cur.execute(\"\"\"\n", " CREATE TYPE weekday AS ENUM ('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday');\n", "\"\"\")\n", "# create the table\n", "cur.execute(\"\"\"\n", " CREATE TABLE crimes.boston_crimes (\n", " incident_number INTEGER PRIMARY KEY,\n", " offense_code INTEGER,\n", " description VARCHAR(100),\n", " date DATE,\n", " day_of_the_week weekday,\n", " lat decimal,\n", " long decimal\n", " );\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load the data into the table\n", "\n", "We used the `copy_expert` to load the data as it is very fast and very succinct to use." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "ename": "NameError", "evalue": "name 'psycopg2' is not defined", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mconn\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpsycopg2\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mconnect\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdbname\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m\"crime_db\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0muser\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m\"dq\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mpassword\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m\"dq\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2\u001b[0m \u001b[0mcur\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mconn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcursor\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3\u001b[0m \u001b[0;31m# load the data from boston.csv into the table boston_crimes that is in the crimes schema\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0;32mwith\u001b[0m \u001b[0mopen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"boston.csv\"\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mf\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0mcur\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcopy_expert\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"COPY crimes.boston_crimes FROM STDIN WITH CSV HEADER\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mf\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mNameError\u001b[0m: name 'psycopg2' is not defined" ] } ], "source": [ "# load the data from boston.csv into the table boston_crimes that is in the crimes schema\n", "with open(\"boston.csv\") as f:\n", " cur.copy_expert(\"COPY crimes.boston_crimes FROM STDIN WITH CSV HEADER;\", f)\n", "cur.execute(\"SELECT * FROM crimes.boston_crimes\")\n", "# print the number of rows to ensure that they were loaded\n", "print(len(cur.fetchall()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Revoke public privileges\n", "\n", "We revoke all privileges of the public `public` group on the `public` schema to ensure that users will not inherit privileges on that schema such as the ability to create tables in the `public` schema.\n", "\n", "We also need to revoke all privileges in the newly created schema. Doing this also makes it so that we do not need to revoke the privileges when we create users and groups because unless specified otherwise, privileges are not granted by default." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "cur.execute(\"REVOKE ALL ON SCHEMA public FROM public;\")\n", "cur.execute(\"REVOKE ALL ON DATABASE crime_db FROM public;\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating the read only group\n", "\n", "We create a `readonly` group with `NOLOGIN` because it is a group and not a user. We grant the group the ability to connect to the `crime_db` and the ability to use the `crimes` schema.\n", "\n", "Then we deal wit tables privileges by granting `SELECT`. We also add an extra line compared with what was asked. This extra line changes the way that privileges are given by default to the `readonly` group on new table that are created on the `crimes` schema. As we mentioned, by default not privileges are given. However we change is so that by default any user in the `readonly` group can issue select commands." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "cur.execute(\"CREATE GROUP readonly NOLOGIN;\")\n", "cur.execute(\"GRANT CONNECT ON DATABASE crime_db TO readonly;\")\n", "cur.execute(\"GRANT USAGE ON SCHEMA crimes TO readonly;\")\n", "cur.execute(\"GRANT SELECT ON ALL TABLES IN SCHEMA crimes TO readonly;\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating the read-write group\n", "\n", "We create a `readwrite` group with `NOLOGIN` because it is a group and not a user. We grant the group the ability to connect to the `crime_db` and the ability to use the `crimes` schema.\n", "\n", "Then we deal wit tables privileges by granting `SELECT`, `INSERT`, `UPDATE` and `DELETE`. As before we change the default privileges so that user in the `readwrite` group have these privileges if we ever create a new table on the `crimes` schema." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "cur.execute(\"CREATE GROUP readwrite NOLOGIN;\")\n", "cur.execute(\"GRANT CONNECT ON DATABASE crime_db TO readwrite;\")\n", "cur.execute(\"GRANT USAGE ON SCHEMA crimes TO readwrite;\")\n", "cur.execute(\"GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA crimes TO readwrite;\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating one user for each group\n", "\n", "We create a user named `data_analyst` with password `secret1` in the `readonly` group.\n", "\n", "We create a user named `data_scientist` with password `secret2` in the `readwrite` group.\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "cur.execute(\"CREATE USER data_analyst WITH PASSWORD 'secret1';\")\n", "cur.execute(\"GRANT readonly TO data_analyst;\")\n", "\n", "cur.execute(\"CREATE USER data_scientist WITH PASSWORD 'secret2';\")\n", "cur.execute(\"GRANT readwrite TO data_scientist;\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Test the database setup\n", "\n", "Test the database setup using SQL queries on the `pg_roles` table and `information_schema.table_privileges`.\n", "\n", "In the `pg_roles` table we will check database related privileges and for that we will look at the following columns: \n", "\n", "* `rolname`: The name of the user / group that the privilege refers to.\n", "* `rolsuper`: Whether this user / group is a super user. It should be set to `False` on every user / group that we have created.\n", "* `rolcreaterole`: Whether user / group can create users, groups or roles. It should be `False` on every user / group that we have created.\n", "* `rolcreatedb`: Whether user / group can create databases. It should be `False` on every user / group that we have created.\n", "* `rolcanlogin`: Whether user / group can login. It should be `True` on the users and `False` on the groups that we have created.\n", "\n", "In the `information_schema.table_privileges` we will check privileges related to SQL queries on tables. We will list the privileges of each group that we have created." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('readonly', False, False, False, False)\n", "('readwrite', False, False, False, False)\n", "('data_analyst', False, False, False, True)\n", "('data_scientist', False, False, False, True)\n", "\n", "('readonly', 'SELECT')\n", "('readwrite', 'INSERT')\n", "('readwrite', 'SELECT')\n", "('readwrite', 'UPDATE')\n", "('readwrite', 'DELETE')\n" ] } ], "source": [ "# close the old connection to test with a brand new connection\n", "conn.close()\n", "\n", "conn = psycopg2.connect(dbname=\"crime_db\", user=\"dq\")\n", "cur = conn.cursor()\n", "# check users and groups\n", "cur.execute(\"\"\"\n", " SELECT rolname, rolsuper, rolcreaterole, rolcreatedb, rolcanlogin FROM pg_roles\n", " WHERE rolname IN ('readonly', 'readwrite', 'data_analyst', 'data_scientist');\n", "\"\"\")\n", "for user in cur:\n", " print(user)\n", "print()\n", "# check privileges\n", "cur.execute(\"\"\"\n", " SELECT grantee, privilege_type\n", " FROM information_schema.table_privileges\n", " WHERE grantee IN ('readonly', 'readwrite');\n", "\"\"\")\n", "for user in cur:\n", " print(user)\n", "conn.close()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }