
Merge branch 'master' of github.com:dataquestio/solutions

Christian Pascual committed 5 years ago
parent
commit
170d083987
3 changed files with 910 additions and 121 deletions
  1. Mission155Solutions.ipynb (121 additions, 121 deletions)
  2. Mission251Solution.ipynb (509 additions, 0 deletions)
  3. Mission410Solutions.Rmd (280 additions, 0 deletions)

The diff of this file has been suppressed because it is too large
+ 121 - 121
Mission155Solutions.ipynb


+ 509 - 0
Mission251Solution.ipynb

@@ -0,0 +1,509 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Guided Project Solution: Building a database for crime reports\n",
+    "## Apply what you have learned to set up a database for storing crime reports data\n",
+    "\n",
+    "## François Aubry\n",
+    "\n",
+    "The goal of this guided project is to setup a database from scratch and the Boston crime data into it.\n",
+    "\n",
+    "We will create two user groups:\n",
+    "\n",
+    "* `readonly`: Users in this group will have permission to read data only.\n",
+    "* `readwrite`:  Users in this group will have permissions to read and alter data but not to delete tables."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Creating the database and the schema\n",
+    "\n",
+    "Create a database named `crime_db` and a schema named `crimes` for storing the tables for containing the crime data.\n",
+    "\n",
+    "The database `crime_db` does not exist yet so we connect to `dq`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import psycopg2\n",
+    "conn = psycopg2.connect(dbname=\"dq\", user=\"dq\")\n",
+    "# set autocommit to True bacause this is required for creating databases\n",
+    "conn.autocommit = True\n",
+    "cur = conn.cursor()\n",
+    "# create the crime_db database\n",
+    "cur.execute(\"CREATE DATABASE crime_db;\")\n",
+    "conn.close()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "metadata": {},
+   "outputs": [
+    {
+     "ename": "NameError",
+     "evalue": "name 'psycopg2' is not defined",
+     "output_type": "error",
+     "traceback": [
+      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+      "\u001b[0;31mNameError\u001b[0m                                 Traceback (most recent call last)",
+      "\u001b[0;32m<ipython-input-14-cf0881223d2f>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;31m# now the crime_db database exists to we can connect to it\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mconn\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpsycopg2\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mconnect\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdbname\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m\"crime_db\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0muser\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m\"dq\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m      3\u001b[0m \u001b[0mconn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mautocommit\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mTrue\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0mcur\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mconn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcursor\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0;31m# create he crimes schema\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;31mNameError\u001b[0m: name 'psycopg2' is not defined"
+     ]
+    }
+   ],
+   "source": [
+    "# now the crime_db database exists to we can connect to it\n",
+    "conn = psycopg2.connect(dbname=\"crime_db\", user=\"dq\")\n",
+    "conn.autocommit = True\n",
+    "cur = conn.cursor()\n",
+    "# create he crimes schema\n",
+    "cur.execute(\"CREATE SCHEMA crimes;\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Obtaining the Column Names and Sample\n",
+    " \n",
+    "Obtain the header row and assign it to a variable named `col_headers` and obtain the first data row and assign it to a variable named `first_row`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import csv\n",
+    "with open('boston.csv') as file:\n",
+    "    reader = csv.reader(file)\n",
+    "    col_headers = next(reader)\n",
+    "    first_row = next(reader)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Creating a function for analyzing column values\n",
+    "\n",
+    "Create a function `get_col_set` that given a CSV file name and a column index computes the set of all distinct values in that column.\n",
+    "\n",
+    "Use the function on each column to evaluate which columns have a lot of different values. Columns with a limited set of possible values are good candidates for enumerated datatypes."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "incident_number\t298329\n",
+      "offense_code\t219\n",
+      "description\t239\n",
+      "date\t1177\n",
+      "day_of_the_week\t7\n",
+      "lat\t18177\n",
+      "long\t18177\n"
+     ]
+    }
+   ],
+   "source": [
+    "def get_col_set(csv_file, col_index):\n",
+    "    import csv\n",
+    "    values = set()\n",
+    "    with open(csv_file, 'r') as f:\n",
+    "        next(f)\n",
+    "        reader = csv.reader(f)\n",
+    "        for row in reader:\n",
+    "            values.add(row[col_index])\n",
+    "    return values\n",
+    "\n",
+    "for i in range(len(col_headers)):\n",
+    "    values = get_col_set(\"boston.csv\", i)\n",
+    "    print(col_headers[i], len(values), sep='\\t')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Analyzing the maximum length of the description column\n",
+    "\n",
+    "Use the `get_col_set` function to compute the maximum description length to decide an appropriate length for that field."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "['incident_number', 'offense_code', 'description', 'date', 'day_of_the_week', 'lat', 'long']\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(col_headers)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "58\n"
+     ]
+    }
+   ],
+   "source": [
+    "descriptions = get_col_set(\"boston.csv\", 2) # description is at index number 2\n",
+    "max_len = 0\n",
+    "for description in descriptions:\n",
+    "    max_len = max(max_len, len(description))\n",
+    "print(max_len)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Creating the table\n",
+    "\n",
+    "We have create an enumerated datatype named `weekday` for the `day_of_the_week` since there there only seven possible values.\n",
+    "\n",
+    "For the `incident_number` we have decided to user the type `INTEGER` and set it as the primary key. The same datatype was also used to represent the `offense_code`.\n",
+    "\n",
+    "Since the description has at most `58` character we decided to use the datatype `VARCHAR(100)` for representing it. This leave some margin while not being so big that we will waste a lot of memory.\n",
+    "\n",
+    "The date was represented as the `DATE` datatype. Finally, for the latitude and longitude we used `DECIMAL` datatypes."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "['incident_number', 'offense_code', 'description', 'date', 'day_of_the_week', 'lat', 'long']\n",
+      "['1', '619', 'LARCENY ALL OTHERS', '2018-09-02', 'Sunday', '42.35779134', '-71.13937053']\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(col_headers)\n",
+    "print(first_row)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We will use the same names for the column headers.\n",
+    "\n",
+    "The number of different values of each column was:\n",
+    "\n",
+    "```\n",
+    "incident_number 298329\n",
+    "offense_code       219\n",
+    "description        239\n",
+    "date\t          1177\n",
+    "day_of_the_week      7\n",
+    "lat              18177\n",
+    "long\t         18177\n",
+    "```\n",
+    "\n",
+    "From the result of printing `first_row` we see that kind of data that we have are:\n",
+    "\n",
+    "```\n",
+    "integer numbers\n",
+    "integer numbers\n",
+    "string\n",
+    "date\n",
+    "string\n",
+    "decimal number\n",
+    "decimal number\n",
+    "```\n",
+    "\n",
+    "Only column `day_of_the_week` has a small range of values so we will only create an enumerated datatype for this column. Column `offense_code` is also a good candidate since there is probably a limited set of possible offense codes.\n",
+    "\n",
+    "We saw that the `offense_code` column has size at most 59. To be on the safe side we will limit the size of the description to 100 and use the `VARCHAR(100)` datatype.\n",
+    "\n",
+    "The `lat` and `long` column see to need to hold quite a lot of precision so we will use the `decimal` type."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "ename": "ProgrammingError",
+     "evalue": "type \"weekday\" already exists\n",
+     "output_type": "error",
+     "traceback": [
+      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+      "\u001b[0;31mProgrammingError\u001b[0m                          Traceback (most recent call last)",
+      "\u001b[0;32m<ipython-input-5-f43f3cb51339>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      4\u001b[0m cur.execute(\"\"\"\n\u001b[1;32m      5\u001b[0m     \u001b[0mCREATE\u001b[0m \u001b[0mTYPE\u001b[0m \u001b[0mweekday\u001b[0m \u001b[0mAS\u001b[0m \u001b[0mENUM\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0;34m'Monday'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'Tuesday'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'Wednesday'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'Thursday'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'Friday'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'Saturday'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'Sunday'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \"\"\")\n\u001b[0m\u001b[1;32m      7\u001b[0m \u001b[0;31m# create the table\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      8\u001b[0m cur.execute(\"\"\"\n",
+      "\u001b[0;31mProgrammingError\u001b[0m: type \"weekday\" already exists\n"
+     ]
+    }
+   ],
+   "source": [
+    "# create the enumerated datatype for representing the weekday\n",
+    "cur.execute(\"\"\"\n",
+    "    CREATE TYPE weekday AS ENUM ('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday');\n",
+    "\"\"\")\n",
+    "# create the table\n",
+    "cur.execute(\"\"\"\n",
+    "    CREATE TABLE crimes.boston_crimes (\n",
+    "        incident_number INTEGER PRIMARY KEY,\n",
+    "        offense_code INTEGER,\n",
+    "        description VARCHAR(100),\n",
+    "        date DATE,\n",
+    "        day_of_the_week weekday,\n",
+    "        lat decimal,\n",
+    "        long decimal\n",
+    "    );\n",
+    "\"\"\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Load the data into the table\n",
+    "\n",
+    "We used the `copy_expert` to load the data as it is very fast and very succinct to use."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "metadata": {},
+   "outputs": [
+    {
+     "ename": "NameError",
+     "evalue": "name 'psycopg2' is not defined",
+     "output_type": "error",
+     "traceback": [
+      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+      "\u001b[0;31mNameError\u001b[0m                                 Traceback (most recent call last)",
+      "\u001b[0;32m<ipython-input-13-0d8c83600488>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mconn\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpsycopg2\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mconnect\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdbname\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m\"crime_db\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0muser\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m\"dq\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mpassword\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m\"dq\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m      2\u001b[0m \u001b[0mcur\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mconn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcursor\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      3\u001b[0m \u001b[0;31m# load the data from boston.csv into the table boston_crimes that is in the crimes schema\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0;32mwith\u001b[0m \u001b[0mopen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"boston.csv\"\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mf\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m     \u001b[0mcur\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcopy_expert\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"COPY crimes.boston_crimes FROM STDIN WITH CSV HEADER\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mf\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;31mNameError\u001b[0m: name 'psycopg2' is not defined"
+     ]
+    }
+   ],
+   "source": [
+    "# load the data from boston.csv into the table boston_crimes that is in the crimes schema\n",
+    "with open(\"boston.csv\") as f:\n",
+    "    cur.copy_expert(\"COPY crimes.boston_crimes FROM STDIN WITH CSV HEADER;\", f)\n",
+    "cur.execute(\"SELECT * FROM crimes.boston_crimes\")\n",
+    "# print the number of rows to ensure that they were loaded\n",
+    "print(len(cur.fetchall()))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Revoke public privileges\n",
+    "\n",
+    "We revoke all privileges of the public `public` group on the `public` schema to ensure that users will not inherit privileges on that schema such as the ability to create tables in the `public` schema.\n",
+    "\n",
+    "We also need to revoke all privileges in the newly created schema. Doing this also makes it so that we do not need to revoke the privileges when we create users and groups because unless specified otherwise, privileges are not granted by default."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "cur.execute(\"REVOKE ALL ON SCHEMA public FROM public;\")\n",
+    "cur.execute(\"REVOKE ALL ON DATABASE crime_db FROM public;\")"
+   ]
+  },
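+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As a quick sanity check (a small sketch using the `pg_namespace` system catalog), we can inspect the access control list of the `public` schema; after the revoke, the `public` group should no longer be listed with any privileges:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# inspect the public schema's ACL; the 'public' group should no longer appear\n",
+    "cur.execute(\"SELECT nspacl FROM pg_namespace WHERE nspname = 'public';\")\n",
+    "print(cur.fetchone())"
+   ]
+  },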
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Creating the read only group\n",
+    "\n",
+    "We create a `readonly` group with `NOLOGIN` because it is a group and not a user. We grant the group the ability to connect to the `crime_db` and the ability to use the `crimes` schema.\n",
+    "\n",
+    "Then we deal wit tables privileges by granting `SELECT`. We also add an extra line compared with what was asked. This extra line changes the way that privileges are given by default to the `readonly` group on new table that are created on the `crimes` schema. As we mentioned, by default not privileges are given. However we change is so that by default any user in the `readonly` group can issue select commands."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "cur.execute(\"CREATE GROUP readonly NOLOGIN;\")\n",
+    "cur.execute(\"GRANT CONNECT ON DATABASE crime_db TO readonly;\")\n",
+    "cur.execute(\"GRANT USAGE ON SCHEMA crimes TO readonly;\")\n",
+    "cur.execute(\"GRANT SELECT ON ALL TABLES IN SCHEMA crimes TO readonly;\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Creating the read-write group\n",
+    "\n",
+    "We create a `readwrite` group with `NOLOGIN` because it is a group and not a user. We grant the group the ability to connect to the `crime_db` and the ability to use the `crimes` schema.\n",
+    "\n",
+    "Then we deal wit tables privileges by granting `SELECT`, `INSERT`, `UPDATE` and `DELETE`. As before we change the default privileges so that user in the `readwrite` group have these privileges if we ever create a new table on the `crimes` schema."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "cur.execute(\"CREATE GROUP readwrite NOLOGIN;\")\n",
+    "cur.execute(\"GRANT CONNECT ON DATABASE crime_db TO readwrite;\")\n",
+    "cur.execute(\"GRANT USAGE ON SCHEMA crimes TO readwrite;\")\n",
+    "cur.execute(\"GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA crimes TO readwrite;\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Creating one user for each group\n",
+    "\n",
+    "We create a user named `data_analyst` with password `secret1` in the `readonly` group.\n",
+    "\n",
+    "We create a user named `data_scientist` with password `secret2` in the `readwrite` group.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "cur.execute(\"CREATE USER data_analyst WITH PASSWORD 'secret1';\")\n",
+    "cur.execute(\"GRANT readonly TO data_analyst;\")\n",
+    "\n",
+    "cur.execute(\"CREATE USER data_scientist WITH PASSWORD 'secret2';\")\n",
+    "cur.execute(\"GRANT readwrite TO data_scientist;\")"
+   ]
+  },
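+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As a quick sanity check (a minimal sketch, assuming the server accepts local password logins), we can connect as `data_analyst` and confirm that a `SELECT` on the crimes table works:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# connect as the new read-only user and count the rows it can see\n",
+    "test_conn = psycopg2.connect(dbname=\"crime_db\", user=\"data_analyst\", password=\"secret1\")\n",
+    "test_cur = test_conn.cursor()\n",
+    "test_cur.execute(\"SELECT COUNT(*) FROM crimes.boston_crimes;\")\n",
+    "print(test_cur.fetchone())\n",
+    "test_conn.close()"
+   ]
+  },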
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Test the database setup\n",
+    "\n",
+    "Test the database setup using SQL queries on the `pg_roles` table and `information_schema.table_privileges`.\n",
+    "\n",
+    "In the `pg_roles` table we will check database related privileges and for that we will look at the following columns: \n",
+    "\n",
+    "* `rolname`: The name of the user / group that the privilege refers to.\n",
+    "* `rolsuper`: Whether this user / group is a super user. It should be set to `False` on every user / group that we have created.\n",
+    "* `rolcreaterole`: Whether user / group can create users, groups or roles. It should be `False` on every user / group that we have created.\n",
+    "* `rolcreatedb`: Whether user / group can create databases. It should be `False` on every user / group that we have created.\n",
+    "* `rolcanlogin`: Whether user / group can login. It should be `True` on the users and `False` on the groups that we have created.\n",
+    "\n",
+    "In the `information_schema.table_privileges` we will check privileges related to SQL queries on tables. We will list the privileges of each group that we have created."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "('readonly', False, False, False, False)\n",
+      "('readwrite', False, False, False, False)\n",
+      "('data_analyst', False, False, False, True)\n",
+      "('data_scientist', False, False, False, True)\n",
+      "\n",
+      "('readonly', 'SELECT')\n",
+      "('readwrite', 'INSERT')\n",
+      "('readwrite', 'SELECT')\n",
+      "('readwrite', 'UPDATE')\n",
+      "('readwrite', 'DELETE')\n"
+     ]
+    }
+   ],
+   "source": [
+    "# close the old connection to test with a brand new connection\n",
+    "conn.close()\n",
+    "\n",
+    "conn = psycopg2.connect(dbname=\"crime_db\", user=\"dq\")\n",
+    "cur = conn.cursor()\n",
+    "# check users and groups\n",
+    "cur.execute(\"\"\"\n",
+    "    SELECT rolname, rolsuper, rolcreaterole, rolcreatedb, rolcanlogin FROM pg_roles\n",
+    "    WHERE rolname IN ('readonly', 'readwrite', 'data_analyst', 'data_scientist');\n",
+    "\"\"\")\n",
+    "for user in cur:\n",
+    "    print(user)\n",
+    "print()\n",
+    "# check privileges\n",
+    "cur.execute(\"\"\"\n",
+    "    SELECT grantee, privilege_type\n",
+    "    FROM information_schema.table_privileges\n",
+    "    WHERE grantee IN ('readonly', 'readwrite');\n",
+    "\"\"\")\n",
+    "for user in cur:\n",
+    "    print(user)\n",
+    "conn.close()"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.7.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}

+ 280 - 0
Mission410Solutions.Rmd

@@ -0,0 +1,280 @@
+---
+title: 'Statistics Fundamentals in R: Guided Project Solutions'
+author: "Dataquest"
+date: "8/13/2019"
+output: html_document
+---
+
+# Is Fandango Still Inflating Ratings?
+In October 2015, Walt Hickey from FiveThirtyEight published a [popular article](https://fivethirtyeight.com/features/fandango-movies-ratings/) presenting strong evidence that Fandango's movie rating system was biased and dishonest. In this project, we'll analyze more recent movie ratings data to determine whether there has been any change in Fandango's rating system after Hickey's analysis.
+
+# Understanding the Data
+We'll work with two samples of movie ratings: the data in one sample was collected *previous* to Hickey's analysis, while the other sample was collected *after*. Let's start by reading in the two samples (which are stored as CSV files) and getting familiar with their structure.
+
+```{r message=FALSE}
+library(readr)
+
+previous <- read_csv('fandango_score_comparison.csv')
+after <- read_csv('movie_ratings_16_17.csv')
+
+head(previous)
+```
+
+```{r}
+head(after)
+```
+
+Below we isolate only the columns that provide information about Fandango, making the relevant data more readily available for later use.
+
+```{r message=FALSE}
+library(dplyr)
+fandango_previous <- previous %>% 
+  select(FILM, Fandango_Stars, Fandango_Ratingvalue, 
+         Fandango_votes, Fandango_Difference)
+
+fandango_after <- after %>% 
+  select(movie, year, fandango)
+
+head(fandango_previous)
+```
+
+```{r}
+head(fandango_after)
+```
+
+Our goal is to determine whether there has been any change in Fandango's rating system after Hickey's analysis. The population of interest for our analysis is made up of all the movie ratings stored on Fandango's website, regardless of release year.
+
+Because we want to find out whether the parameters of this population changed after Hickey's analysis, we're interested in sampling the population at two different periods in time (before and after Hickey's analysis) so we can compare the two states.
+
+The data we're working with was sampled at the moments we want: one sample was taken prior to the analysis, and the other after. We want to describe the population, so we need to make sure that the samples are representative; otherwise we should expect a large sampling error and, ultimately, wrong conclusions.
+
+From Hickey's article and from the README.md of [the data set's repository](https://github.com/fivethirtyeight/data/tree/master/fandango), we can see that he used the following sampling criteria:
+
+* The movie must have had at least 30 fan ratings on Fandango's website at the time of sampling (Aug. 24, 2015).
+* The movie must have had tickets on sale in 2015.
+
+The sampling was clearly not random because not every movie had the same chance to be included in the sample. Some movies didn't have a chance at all (like those with fewer than 30 fan ratings or those without tickets on sale in 2015). It's questionable whether this sample is representative of the entire population we're interested in describing. It seems more likely that it isn't, mostly because this sample is subject to temporal trends: movies in 2015 might have been outstandingly good or bad compared to other years.
+
+The sampling conditions for our other sample were (as can be read in the README.md of [the data set's repository](https://github.com/mircealex/Movie_ratings_2016_17)):
+
+* The movie must have been released in 2016 or later.
+* The movie must have had a considerable number of votes and reviews (it's unclear how many from the README.md or from the data).
+
+This second sample is also subject to temporal trends and is unlikely to be representative of our population of interest.
+
+Both these authors had certain research questions in mind when they sampled the data, and they used a set of criteria to get a sample that would fit their questions. Their sampling method is called [purposive sampling](https://www.youtube.com/watch?v=CdK7N_kTzHI&feature=youtu.be) (or judgmental/selective/subjective sampling). While these samples were good enough for their research, they don't seem too useful for us.
+
+# Changing the Goal of our Analysis
+At this point, we can either collect new data or change the goal of our analysis. We choose the latter and place some limitations on our initial goal.
+
+Instead of trying to determine whether there has been any change in Fandango's rating system after Hickey's analysis, our new goal is to determine whether there's any difference between Fandango's ratings for popular movies in 2015 and Fandango's ratings for popular movies in 2016. This new goal should also be a fairly good proxy for our initial goal.
+
+# Isolating the Samples We Need
+With this new research goal, we have two populations of interest:
+
+1. All Fandango's ratings for popular movies released in 2015.
+1. All Fandango's ratings for popular movies released in 2016.
+
+We need to be clear about what counts as a popular movie. We'll use Hickey's benchmark of 30 fan ratings and count a movie as popular only if it has 30 fan ratings or more on Fandango's website.
+
+Although one of the sampling criteria in our second sample is movie popularity, the `fandango_after` dataframe doesn't provide information about the number of fan ratings. We should be skeptical once more and ask whether this sample is truly representative and contains popular movies (movies with 30 fan ratings or more).
+
+One quick way to check the representativeness of this sample might be to randomly sample 10 movies from it and then check the number of fan ratings ourselves on Fandango's website.
+
+```{r}
+set.seed(1)
+sample_n(fandango_after, size = 10)
+```
+
+Above we used a value of 1 as the random seed. This is good practice because it suggests that we weren't trying out various random seeds just to get a favorable sample.
+
+After checking the number of fan ratings for the movies above, we discover that as of August 2019, Fandango no longer uses the 5-Star Fan Ratings described above. Instead, Fandango now uses the [Rotten Tomatoes verified Audience Score](https://editorial.rottentomatoes.com/article/introducing-verified-audience-score/). These are the fan rating counts we found on [Rotten Tomatoes](https://www.rottentomatoes.com/):
+
+```{r}
+set.seed(1)
+sampled <- sample_n(fandango_after, size = 10)
+# Create a single-column tibble of Rotten Tomatoes review counts
+reviews <- tibble(reviews = c(13569, 74904, 24293, 4141, 30183, 48952, 14328, 59359, 54765, 82222))
+bind_cols(sampled, reviews)
+```
+
+All ten movies sampled have well above 30 fan ratings, but it is possible that the Rotten Tomatoes Verified Audience user base is larger than the Fandango user base. We cannot really say with confidence whether these review numbers are comparable to the Fandango fan ratings. In addition, time has passed since Hickey's analysis, giving more fans an opportunity to submit reviews. So even if we did still have access to Fandango's 5-star fan ratings, we would have no way to compare the number of fan ratings we see to the number that Hickey observed. 
+
+Let's move on to the `fandango_previous` dataframe, which does include the number of fan ratings for each movie. The documentation states clearly that there are only movies with at least 30 fan ratings, but it should take only a couple of seconds to double-check here.
+
+```{r}
+sum(fandango_previous$Fandango_votes < 30)
+```
+
+If you explore the two data sets, you'll notice that there are movies with a release year other than 2015 or 2016.
+
+```{r}
+head(fandango_previous$FILM, n = 10)
+```
+
+
+```{r}
+unique(fandango_after$year)
+```
+
+
+For our purposes, we'll need to isolate only the movies released in 2015 and 2016.
+
+```{r}
+library(stringr)
+fandango_previous <- fandango_previous %>% 
+  mutate(year = str_sub(FILM, -5, -2))
+```
+
+Let's examine the frequency distribution for the `year` column and then isolate the movies released in 2015.
+
+```{r}
+fandango_previous %>% 
+  group_by(year) %>% 
+  summarize(Freq = n())
+```
+
+Alternatively, we can use the base R `table()` function because we only need to get a quick view of the distribution.
+```{r}
+table(fandango_previous$year)
+```
+
+```{r}
+fandango_2015 <- fandango_previous %>% 
+  filter(year == 2015)
+table(fandango_2015$year)
+```
+Great, now let's isolate the movies in the other data set.
+```{r}
+head(fandango_after)
+```
+
+```{r}
+table(fandango_after$year)
+```
+
+```{r}
+fandango_2016 <- fandango_after %>% 
+  filter(year == 2016)
+table(fandango_2016$year)
+```
+
+
+# Comparing Distribution Shapes for 2015 and 2016
+
+Our aim is to figure out whether there's any difference between Fandango's ratings for popular movies in 2015 and Fandango's ratings for popular movies in 2016. One way to go about this is to analyze and compare the distributions of movie ratings for the two samples.
+
+We'll start with comparing the shape of the two distributions using kernel density plots.
+
+```{r}
+library(ggplot2)
+# 2015 dataframe is specified in the ggplot call
+ggplot(data = fandango_2015, 
+               aes(x = Fandango_Stars)) +
+  geom_density() +
+  # 2016 dataframe is specified in the second geom_density() call
+  geom_density(data = fandango_2016, 
+               aes(x = fandango), color = "blue") +
+  labs(title = "Comparing distribution shapes for Fandango's ratings\n(2015 vs 2016)",
+       x = "Stars",
+       y = "Density") +
+  scale_x_continuous(breaks = seq(0, 5, by = 0.5), 
+                     limits = c(0, 5))
+```
+
+
+
+Two aspects are striking in the figure above:
+
+* Both distributions are strongly left skewed.
+* The 2016 distribution is slightly shifted to the left relative to the 2015 distribution.
+
+The left skew suggests that movies on Fandango are given mostly high and very high fan ratings. Coupled with the fact that Fandango sells tickets, the high ratings are a bit dubious. It'd be really interesting to investigate this further, ideally in a separate project, since it falls outside the current goal of our analysis.
+
+The slight left shift of the 2016 distribution is very interesting for our analysis. It shows that ratings were slightly lower in 2016 compared to 2015. This suggests that there was indeed a difference between Fandango's ratings for popular movies in 2015 and those for popular movies in 2016, and we can also see the direction of the difference: the 2016 ratings were slightly lower.
+
+```{r}
+fandango_2015 %>% 
+  group_by(Fandango_Stars) %>% 
+  summarize(Percentage = n() / nrow(fandango_2015) * 100)
+```
+
+```{r}
+fandango_2016 %>% 
+  group_by(fandango) %>% 
+  summarize(Percentage = n() / nrow(fandango_2016) * 100)
+```
+
+In 2016, very high ratings (4.5 and 5 stars) had lower percentages compared to 2015. In 2016, under 1% of the movies had a perfect rating of 5 stars, compared to 2015, when the percentage was close to 7%. Ratings of 4.5 were also more common in 2015: the percentage of movies rated 4.5 was roughly 13 percentage points higher in 2015 than in 2016.
+
+The minimum rating is also lower in 2016: 2.5 instead of the 3 stars of 2015. There clearly is a difference between the two frequency distributions.
+
+For some other ratings, the percentage went up in 2016. There was a greater percentage of movies in 2016 that received 3.5 and 4 stars, compared to 2015. Since 3.5 and 4.0 are also high ratings, this challenges the direction of the change we saw in the kernel density plot. The comparison below makes this easier to read.
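+
+As a small sketch reusing the dataframes from above (`pct_2015` and `pct_2016` are new helper names), we can put the two percentage distributions side by side by joining them on the star value:
+
+```{r}
+# side-by-side percentage of movies at each star rating, 2015 vs 2016
+pct_2015 <- fandango_2015 %>% 
+  group_by(stars = Fandango_Stars) %>% 
+  summarize(pct_2015 = n() / nrow(fandango_2015) * 100)
+
+pct_2016 <- fandango_2016 %>% 
+  group_by(stars = fandango) %>% 
+  summarize(pct_2016 = n() / nrow(fandango_2016) * 100)
+
+full_join(pct_2015, pct_2016, by = "stars")
+```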
+
+# Determining the Direction of the Change
+
+Let's compute a couple of summary statistics to get a more precise picture of the direction of the change. In what follows, we'll compute the mean, the median, and the mode for both distributions and then use a bar graph to plot the values.
+
+```{r}
+library(tidyr)
+
+# Mode function from stackoverflow
+mode <- function(x) {
+  ux <- unique(x)
+  ux[which.max(tabulate(match(x, ux)))]
+}
+
+summary_2015 <- fandango_2015 %>% 
+  summarize(year = "2015",
+    mean = mean(Fandango_Stars),
+    median = median(Fandango_Stars),
+    mode = mode(Fandango_Stars))
+
+summary_2016 <- fandango_2016 %>% 
+  summarize(year = "2016",
+            mean = mean(fandango),
+            median = median(fandango),
+            mode = mode(fandango))
+
+# Combine 2015 & 2016 summary dataframes
+summary_df <- bind_rows(summary_2015, summary_2016)
+
+# Gather combined dataframe into a format ready for ggplot
+summary_df <- summary_df %>% 
+  gather(key = "statistic", value = "value", - year)
+
+summary_df
+```
+
+```{r}
+ggplot(data = summary_df, aes(x = statistic, y = value, fill = year)) +
+  geom_bar(stat = "identity", position = "dodge") +
+  labs(title = "Comparing summary statistics: 2015 vs 2016",
+       x = "",
+       y = "Stars")
+```
+
+The mean rating was approximately 0.2 stars lower in 2016. This amounts to a drop of almost 5% relative to the mean rating in 2015.
+
+```{r}
+means <- summary_df %>% 
+  filter(statistic == "mean")
+
+means %>% 
+  summarize(change = (value[1] - value[2]) / value[1])
+```
+
+
+
+While the median is the same for both distributions, the mode is lower in 2016 by 0.5. Coupled with what we saw for the mean, the direction of the change we saw on the kernel density plot is confirmed: on average, popular movies released in 2016 were rated slightly lower than popular movies released in 2015.
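+
+To read the median and mode comparison directly, we can spread the summary table so the two years sit side by side (a quick sketch using `tidyr`, which is already loaded):
+
+```{r}
+# one row per statistic, one column per year
+summary_df %>% 
+  spread(key = year, value = value)
+```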
+
+# Conclusion
+
+Our analysis showed that there's indeed a slight difference between Fandango's ratings for popular movies in 2015 and Fandango's ratings for popular movies in 2016. We also determined that, on average, popular movies released in 2016 were rated lower on Fandango than popular movies released in 2015.
+
+We cannot be completely sure what caused the change, but the chances are very high that it was caused by Fandango fixing the biased rating system after Hickey's analysis.
+
+
+
+
+

Some files were not shown because too many files changed in this diff