
Merge branch 'master' of github.com:dataquestio/solutions

Christian Pascual 5 years ago
parent commit 90d1bb1602
7 changed files with 2187 additions and 22 deletions
  1. Mission310Solutions.ipynb (+1 -1)
  2. Mission350Solutions.ipynb (+2 -2)
  3. Mission382Solutions.ipynb (+23 -19)
  4. Mission433Solutions.ipynb (+1311 -0)
  5. Mission449Solutions.Rmd (+421 -0)
  6. Mission459Solutions.Rmd (+428 -0)
  7. README.md (+1 -0)

+ 1 - 1
Mission310Solutions.ipynb

@@ -6483,7 +6483,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.6.4"
+   "version": "3.7.4"
   }
  },
  "nbformat": 4,

+ 2 - 2
Mission350Solutions.ipynb

@@ -24,8 +24,8 @@
     "\n",
     "Collecting data for over four million apps requires a significant amount of time and money, so we'll try to analyze a sample of data instead. To avoid spending resources with collecting new data ourselves, we should first try to see whether we can find any relevant existing data at no cost. Luckily, these are two data sets that seem suitable for our purpose:\n",
     "\n",
-    "- [A data set](https://www.kaggle.com/lava18/google-play-store-apps/home) containing data about approximately ten thousand Android apps from Google Play\n",
-    "- [A data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) containing data about approximately seven thousand iOS apps from the App Store\n",
+    "- [A data set](https://www.kaggle.com/lava18/google-play-store-apps) containing data about approximately ten thousand Android apps from Google Play. You can download the data set directly from [this link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).\n",
+    "- [A data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately seven thousand iOS apps from the App Store. You can download the data set directly from [this link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).\n",
     "\n",
     "Let's start by opening the two data sets and then continue with exploring the data."
    ]
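
The cell above ends by saying we'll open the two data sets next; that code is not part of this hunk. As a hedged illustration only, the two downloaded files could be read along these lines. The file names come from the download links above, while the `open_dataset` helper is an assumption, not code from the notebook:

```python
# Illustrative sketch (not part of this commit): read the two CSV files
# downloaded from the links above, assuming they sit in the working directory.
from csv import reader

def open_dataset(file_name, header=True):
    # Return (header_row, data_rows) for a CSV file.
    with open(file_name, encoding='utf8') as opened_file:
        rows = list(reader(opened_file))
    if header:
        return rows[0], rows[1:]
    return None, rows

android_header, android = open_dataset('googleplaystore.csv')
ios_header, ios = open_dataset('AppleStore.csv')

print(len(android))  # roughly ten thousand Android apps
print(len(ios))      # roughly seven thousand iOS apps
```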

+ 23 - 19
Mission382Solutions.ipynb

@@ -605,7 +605,9 @@
     "    - an integer between 2 and 5 that represents the number of winning numbers expected\n",
     "- Our function prints information about the probability of having a certain number of winning numbers\n",
     "\n",
-    "To calculate the probabilities, we tell the engineering team that the specific combination on the ticket is irrelevant and we only need the integer between 2 and 5 representing the number of winning numbers expected. Consequently, we will write a function named `probability_less_6()` which takes in an integer and prints information about the chances of winning depending on the value of that integer."
+    "To calculate the probabilities, we tell the engineering team that the specific combination on the ticket is irrelevant and we only need the integer between 2 and 5 representing the number of winning numbers expected. Consequently, we will write a function named `probability_less_6()` which takes in an integer and prints information about the chances of winning depending on the value of that integer.\n",
+    "\n",
+    "The function below calculates the probability that a player's ticket matches exactly the given number of winning numbers. If the player wants to find out the probability of having five winning numbers, the function will return the probability of having five winning numbers exactly (no more and no less). The function will not return the probability of having _at least_ five winning numbers."
    ]
   },
   {
@@ -617,16 +619,14 @@
     "def probability_less_6(n_winning_numbers):\n",
     "    \n",
     "    n_combinations_ticket = combinations(6, n_winning_numbers)\n",
-    "    n_combinations_remaining = combinations(49 - n_winning_numbers,\n",
-    "                                           6 - n_winning_numbers)\n",
+    "    n_combinations_remaining = combinations(43, 6 - n_winning_numbers)\n",
     "    successful_outcomes = n_combinations_ticket * n_combinations_remaining\n",
-    "    n_combinations_total = combinations(49, 6)\n",
     "    \n",
+    "    n_combinations_total = combinations(49, 6)    \n",
     "    probability = successful_outcomes / n_combinations_total\n",
-    "    probability_percentage = probability * 100\n",
-    "    \n",
-    "    combinations_simplified = round(n_combinations_total/successful_outcomes)\n",
     "    \n",
+    "    probability_percentage = probability * 100    \n",
+    "    combinations_simplified = round(n_combinations_total/successful_outcomes)    \n",
     "    print('''Your chances of having {} winning numbers with this ticket are {:.6f}%.\n",
     "In other words, you have a 1 in {:,} chances to win.'''.format(n_winning_numbers, probability_percentage,\n",
     "                                                               int(combinations_simplified)))"
@@ -648,17 +648,17 @@
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "Your chances of having 2 winning numbers with this ticket are 19.132653%.\n",
-      "In other words, you have a 1 in 5 chances to win.\n",
+      "Your chances of having 2 winning numbers with this ticket are 13.237803%.\n",
+      "In other words, you have a 1 in 8 chances to win.\n",
       "--------------------------\n",
-      "Your chances of having 3 winning numbers with this ticket are 2.171081%.\n",
-      "In other words, you have a 1 in 46 chances to win.\n",
+      "Your chances of having 3 winning numbers with this ticket are 1.765040%.\n",
+      "In other words, you have a 1 in 57 chances to win.\n",
       "--------------------------\n",
-      "Your chances of having 4 winning numbers with this ticket are 0.106194%.\n",
-      "In other words, you have a 1 in 942 chances to win.\n",
+      "Your chances of having 4 winning numbers with this ticket are 0.096862%.\n",
+      "In other words, you have a 1 in 1,032 chances to win.\n",
       "--------------------------\n",
-      "Your chances of having 5 winning numbers with this ticket are 0.001888%.\n",
-      "In other words, you have a 1 in 52,969 chances to win.\n",
+      "Your chances of having 5 winning numbers with this ticket are 0.001845%.\n",
+      "In other words, you have a 1 in 54,201 chances to win.\n",
       "--------------------------\n"
      ]
     }
@@ -680,12 +680,16 @@
     "- `one_ticket_probability()` — calculates the probability of winning the big prize with a single ticket\n",
     "- `check_historical_occurrence()` — checks whether a certain combination has occurred in the Canada lottery data set\n",
     "- `multi_ticket_probability()` — calculates the probability for any number of of tickets between 1 and 13,983,816\n",
-    "- `probability_less_6()` — calculates the probability of having two, three, four or five winning numbers\n",
+    "- `probability_less_6()` — calculates the probability of having two, three, four or five winning numbers exactly\n",
     "\n",
     "Possible features for a second version of the app include:\n",
     "\n",
     "- Making the outputs even easier to understand by adding fun analogies (for example, we can find probabilities for strange events and compare with the chances of winning in lottery; for instance, we can output something along the lines \"You are 100 times more likely to be the victim of a shark attack than winning the lottery\")\n",
-    "- Combining the `one_ticket_probability()`  and `check_historical_occurrence()` to output information on probability and historical occurrence at the same time"
+    "- Combining the `one_ticket_probability()`  and `check_historical_occurrence()` to output information on probability and historical occurrence at the same time\n",
+    "- Create a function similar to `probability_less_6()` which calculates the probability of having _at least_ two, three, four or five winning numbers. Hint: the number of successful outcomes for having at least four winning numbers is the sum of these three numbers:\n",
+    "    - The number of successful outcomes for having four winning numbers exactly\n",
+    "    - The number of successful outcomes for having five winning numbers exactly\n",
+    "    - The number of successful outcomes for having six winning numbers exactly"
    ]
   }
  ],
@@ -705,9 +709,9 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.6.8"
+   "version": "3.7.3"
   }
  },
  "nbformat": 4,
- "nbformat_minor": 2
+ "nbformat_minor": 4
 }
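
The last bullet in the diff above hints at an "at least n winning numbers" variant of `probability_less_6()`. Below is a minimal sketch of what such a function could look like for the 6/49 lottery discussed in the notebook; it is an illustration only, not part of the committed solution, and it re-defines `combinations()` so the snippet is self-contained:

```python
# Illustrative sketch: probability of having AT LEAST n winning numbers
# in a 6/49 lottery, summing the successful outcomes for exactly k = n..6.
from math import factorial

def combinations(n, k):
    return factorial(n) // (factorial(k) * factorial(n - k))

def probability_at_least(n_winning_numbers):
    total_outcomes = combinations(49, 6)
    successful_outcomes = 0
    for k in range(n_winning_numbers, 7):
        successful_outcomes += combinations(6, k) * combinations(43, 6 - k)
    return successful_outcomes / total_outcomes

for n in range(2, 6):
    print('P(at least {} winning numbers) = {:.6f}%'.format(
        n, probability_at_least(n) * 100))
```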

+ 1311 - 0
Mission433Solutions.ipynb

@@ -0,0 +1,1311 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Building a Spam Filter with Naive Bayes\n",
+    "\n",
+    "In this project, we're going to build a spam filter for SMS messages using the multinomial Naive Bayes algorithm. Our goal is to write a program that classifies new messages with an accuracy greater than 80% — so we expect that more than 80% of the new messages will be classified correctly as spam or ham (non-spam).\n",
+    "\n",
+    "To train the algorithm, we'll use a dataset of 5,572 SMS messages that are already classified by humans. The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the [The UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). The data collection process is described in more details on [this page](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/#composition), where you can also find some of the papers authored by Tiago A. Almeida and José María Gómez Hidalgo.\n",
+    "\n",
+    "\n",
+    "## Exploring the Dataset\n",
+    "\n",
+    "We'll now start by reading in the dataset."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "(5572, 2)\n"
+     ]
+    },
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>Label</th>\n",
+       "      <th>SMS</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>ham</td>\n",
+       "      <td>Go until jurong point, crazy.. Available only ...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>ham</td>\n",
+       "      <td>Ok lar... Joking wif u oni...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>spam</td>\n",
+       "      <td>Free entry in 2 a wkly comp to win FA Cup fina...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>ham</td>\n",
+       "      <td>U dun say so early hor... U c already then say...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>ham</td>\n",
+       "      <td>Nah I don't think he goes to usf, he lives aro...</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "  Label                                                SMS\n",
+       "0   ham  Go until jurong point, crazy.. Available only ...\n",
+       "1   ham                      Ok lar... Joking wif u oni...\n",
+       "2  spam  Free entry in 2 a wkly comp to win FA Cup fina...\n",
+       "3   ham  U dun say so early hor... U c already then say...\n",
+       "4   ham  Nah I don't think he goes to usf, he lives aro..."
+      ]
+     },
+     "execution_count": 1,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "sms_spam = pd.read_csv('SMSSpamCollection', sep='\\t', header=None, names=['Label', 'SMS'])\n",
+    "\n",
+    "print(sms_spam.shape)\n",
+    "sms_spam.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Below, we see that about 87% of the messages are ham, and the remaining 13% are spam. This sample looks representative, since in practice most messages that people receive are ham."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "ham     0.865937\n",
+       "spam    0.134063\n",
+       "Name: Label, dtype: float64"
+      ]
+     },
+     "execution_count": 2,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "sms_spam['Label'].value_counts(normalize=True)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Training and Test Set\n",
+    "\n",
+    "We're now going to split our dataset into a training and a test set, where the training set accounts for 80% of the data, and the test set for the remaining 20%."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "(4458, 2)\n",
+      "(1114, 2)\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Randomize the dataset\n",
+    "data_randomized = sms_spam.sample(frac=1, random_state=1)\n",
+    "\n",
+    "# Calculate index for split\n",
+    "training_test_index = round(len(data_randomized) * 0.8)\n",
+    "\n",
+    "# Training/Test split\n",
+    "training_set = data_randomized[:training_test_index].reset_index(drop=True)\n",
+    "test_set = data_randomized[training_test_index:].reset_index(drop=True)\n",
+    "\n",
+    "print(training_set.shape)\n",
+    "print(test_set.shape)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We'll now analyze the percentage of spam and ham messages in the training and test sets. We expect the percentages to be close to what we have in the full dataset, where about 87% of the messages are ham, and the remaining 13% are spam."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "ham     0.86541\n",
+       "spam    0.13459\n",
+       "Name: Label, dtype: float64"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "training_set['Label'].value_counts(normalize=True)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "ham     0.868043\n",
+       "spam    0.131957\n",
+       "Name: Label, dtype: float64"
+      ]
+     },
+     "execution_count": 5,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "test_set['Label'].value_counts(normalize=True)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The results look good! We'll now move on to cleaning the dataset.\n",
+    "\n",
+    "## Data Cleaning\n",
+    "\n",
+    "To calculate all the probabilities required by the algorithm, we'll first need to perform a bit of data cleaning to bring the data in a format that will allow us to extract easily all the information we need.\n",
+    "\n",
+    "Essentially, we want to bring data to this format:\n",
+    "\n",
+    "![img](https://dq-content.s3.amazonaws.com/433/cpgp_dataset_3.png)\n",
+    "\n",
+    "\n",
+    "### Letter Case and Punctuation\n",
+    "\n",
+    "We'll begin with removing all the punctuation and bringing every letter to lower case."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>Label</th>\n",
+       "      <th>SMS</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>ham</td>\n",
+       "      <td>Yep, by the pretty sculpture</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>ham</td>\n",
+       "      <td>Yes, princess. Are you going to make me moan?</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>ham</td>\n",
+       "      <td>Welp apparently he retired</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>ham</td>\n",
+       "      <td>Havent.</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>ham</td>\n",
+       "      <td>I forgot 2 ask ü all smth.. There's a card on ...</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "  Label                                                SMS\n",
+       "0   ham                       Yep, by the pretty sculpture\n",
+       "1   ham      Yes, princess. Are you going to make me moan?\n",
+       "2   ham                         Welp apparently he retired\n",
+       "3   ham                                            Havent.\n",
+       "4   ham  I forgot 2 ask ü all smth.. There's a card on ..."
+      ]
+     },
+     "execution_count": 6,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "# Before cleaning\n",
+    "training_set.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>Label</th>\n",
+       "      <th>SMS</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>ham</td>\n",
+       "      <td>yep  by the pretty sculpture</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>ham</td>\n",
+       "      <td>yes  princess  are you going to make me moan</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>ham</td>\n",
+       "      <td>welp apparently he retired</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>ham</td>\n",
+       "      <td>havent</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>ham</td>\n",
+       "      <td>i forgot 2 ask ü all smth   there s a card on ...</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "  Label                                                SMS\n",
+       "0   ham                       yep  by the pretty sculpture\n",
+       "1   ham      yes  princess  are you going to make me moan \n",
+       "2   ham                         welp apparently he retired\n",
+       "3   ham                                            havent \n",
+       "4   ham  i forgot 2 ask ü all smth   there s a card on ..."
+      ]
+     },
+     "execution_count": 7,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "# After cleaning\n",
+    "training_set['SMS'] = training_set['SMS'].str.replace('\\W', ' ')\n",
+    "training_set['SMS'] = training_set['SMS'].str.lower()\n",
+    "training_set.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Creating the Vocabulary\n",
+    "\n",
+    "Let's now move to creating the vocabulary, which in this context means a list with all the unique words in our training set."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "training_set['SMS'] = training_set['SMS'].str.split()\n",
+    "\n",
+    "vocabulary = []\n",
+    "for sms in training_set['SMS']:\n",
+    "    for word in sms:\n",
+    "        vocabulary.append(word)\n",
+    "        \n",
+    "vocabulary = list(set(vocabulary))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "It looks like there are 7,783 unique words in all the messages of our training set."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "7783"
+      ]
+     },
+     "execution_count": 9,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "len(vocabulary)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### The Final Training Set\n",
+    "\n",
+    "We're now going to use the vocabulary we just created to make the data transformation we want."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "word_counts_per_sms = {unique_word: [0] * len(training_set['SMS']) for unique_word in vocabulary}\n",
+    "\n",
+    "for index, sms in enumerate(training_set['SMS']):\n",
+    "    for word in sms:\n",
+    "        word_counts_per_sms[word][index] += 1"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>ticket</th>\n",
+       "      <th>kappa</th>\n",
+       "      <th>too</th>\n",
+       "      <th>abdomen</th>\n",
+       "      <th>unhappy</th>\n",
+       "      <th>hoody</th>\n",
+       "      <th>start</th>\n",
+       "      <th>die</th>\n",
+       "      <th>wild</th>\n",
+       "      <th>195</th>\n",
+       "      <th>...</th>\n",
+       "      <th>09058095201</th>\n",
+       "      <th>chase</th>\n",
+       "      <th>thru</th>\n",
+       "      <th>ru</th>\n",
+       "      <th>xclusive</th>\n",
+       "      <th>fellow</th>\n",
+       "      <th>red</th>\n",
+       "      <th>entitled</th>\n",
+       "      <th>auto</th>\n",
+       "      <th>bothering</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>...</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>...</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>...</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>...</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>...</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "<p>5 rows × 7783 columns</p>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "   ticket  kappa  too  abdomen  unhappy  hoody  start  die  wild  195  ...  \\\n",
+       "0       0      0    0        0        0      0      0    0     0    0  ...   \n",
+       "1       0      0    0        0        0      0      0    0     0    0  ...   \n",
+       "2       0      0    0        0        0      0      0    0     0    0  ...   \n",
+       "3       0      0    0        0        0      0      0    0     0    0  ...   \n",
+       "4       0      0    0        0        0      0      0    0     0    0  ...   \n",
+       "\n",
+       "   09058095201  chase  thru  ru  xclusive  fellow  red  entitled  auto  \\\n",
+       "0            0      0     0   0         0       0    0         0     0   \n",
+       "1            0      0     0   0         0       0    0         0     0   \n",
+       "2            0      0     0   0         0       0    0         0     0   \n",
+       "3            0      0     0   0         0       0    0         0     0   \n",
+       "4            0      0     0   0         0       0    0         0     0   \n",
+       "\n",
+       "   bothering  \n",
+       "0          0  \n",
+       "1          0  \n",
+       "2          0  \n",
+       "3          0  \n",
+       "4          0  \n",
+       "\n",
+       "[5 rows x 7783 columns]"
+      ]
+     },
+     "execution_count": 11,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "word_counts = pd.DataFrame(word_counts_per_sms)\n",
+    "word_counts.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>Label</th>\n",
+       "      <th>SMS</th>\n",
+       "      <th>ticket</th>\n",
+       "      <th>kappa</th>\n",
+       "      <th>too</th>\n",
+       "      <th>abdomen</th>\n",
+       "      <th>unhappy</th>\n",
+       "      <th>hoody</th>\n",
+       "      <th>start</th>\n",
+       "      <th>die</th>\n",
+       "      <th>...</th>\n",
+       "      <th>09058095201</th>\n",
+       "      <th>chase</th>\n",
+       "      <th>thru</th>\n",
+       "      <th>ru</th>\n",
+       "      <th>xclusive</th>\n",
+       "      <th>fellow</th>\n",
+       "      <th>red</th>\n",
+       "      <th>entitled</th>\n",
+       "      <th>auto</th>\n",
+       "      <th>bothering</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>ham</td>\n",
+       "      <td>[yep, by, the, pretty, sculpture]</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>...</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>ham</td>\n",
+       "      <td>[yes, princess, are, you, going, to, make, me,...</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>...</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>ham</td>\n",
+       "      <td>[welp, apparently, he, retired]</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>...</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>ham</td>\n",
+       "      <td>[havent]</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>...</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>ham</td>\n",
+       "      <td>[i, forgot, 2, ask, ü, all, smth, there, s, a,...</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>...</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "<p>5 rows × 7785 columns</p>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "  Label                                                SMS  ticket  kappa  \\\n",
+       "0   ham                  [yep, by, the, pretty, sculpture]       0      0   \n",
+       "1   ham  [yes, princess, are, you, going, to, make, me,...       0      0   \n",
+       "2   ham                    [welp, apparently, he, retired]       0      0   \n",
+       "3   ham                                           [havent]       0      0   \n",
+       "4   ham  [i, forgot, 2, ask, ü, all, smth, there, s, a,...       0      0   \n",
+       "\n",
+       "   too  abdomen  unhappy  hoody  start  die  ...  09058095201  chase  thru  \\\n",
+       "0    0        0        0      0      0    0  ...            0      0     0   \n",
+       "1    0        0        0      0      0    0  ...            0      0     0   \n",
+       "2    0        0        0      0      0    0  ...            0      0     0   \n",
+       "3    0        0        0      0      0    0  ...            0      0     0   \n",
+       "4    0        0        0      0      0    0  ...            0      0     0   \n",
+       "\n",
+       "   ru  xclusive  fellow  red  entitled  auto  bothering  \n",
+       "0   0         0       0    0         0     0          0  \n",
+       "1   0         0       0    0         0     0          0  \n",
+       "2   0         0       0    0         0     0          0  \n",
+       "3   0         0       0    0         0     0          0  \n",
+       "4   0         0       0    0         0     0          0  \n",
+       "\n",
+       "[5 rows x 7785 columns]"
+      ]
+     },
+     "execution_count": 12,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "training_set_clean = pd.concat([training_set, word_counts], axis=1)\n",
+    "training_set_clean.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Calculating Constants First\n",
+    "\n",
+    "We're now done with cleaning the training set, and we can begin creating the spam filter. The Naive Bayes algorithm will need to answer these two probability questions to be able to classify new messages:\n",
+    "\n",
+    "\\begin{equation}\n",
+    "P(Spam | w_1,w_2, ..., w_n) \\propto P(Spam) \\cdot \\prod_{i=1}^{n}P(w_i|Spam)\n",
+    "\\end{equation}\n",
+    "\n",
+    "\\begin{equation}\n",
+    "P(Ham | w_1,w_2, ..., w_n) \\propto P(Ham) \\cdot \\prod_{i=1}^{n}P(w_i|Ham)\n",
+    "\\end{equation}\n",
+    "\n",
+    "\n",
+    "Also, to calculate P(w<sub>i</sub>|Spam) and P(w<sub>i</sub>|Ham) inside the formulas above, we'll need to use these equations:\n",
+    "\n",
+    "\\begin{equation}\n",
+    "P(w_i|Spam) = \\frac{N_{w_i|Spam} + \\alpha}{N_{Spam} + \\alpha \\cdot N_{Vocabulary}}\n",
+    "\\end{equation}\n",
+    "\n",
+    "\\begin{equation}\n",
+    "P(w_i|Ham) = \\frac{N_{w_i|Ham} + \\alpha}{N_{Ham} + \\alpha \\cdot N_{Vocabulary}}\n",
+    "\\end{equation}\n",
+    "\n",
+    "\n",
+    "Some of the terms in the four equations above will have the same value for every new message. We can calculate the value of these terms once and avoid doing the computations again when a new messages comes in. Below, we'll use our training set to calculate:\n",
+    "\n",
+    "- P(Spam) and P(Ham)\n",
+    "- N<sub>Spam</sub>, N<sub>Ham</sub>, N<sub>Vocabulary</sub>\n",
+    "\n",
+    "We'll also use Laplace smoothing and set $\\alpha = 1$."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Isolating spam and ham messages first\n",
+    "spam_messages = training_set_clean[training_set_clean['Label'] == 'spam']\n",
+    "ham_messages = training_set_clean[training_set_clean['Label'] == 'ham']\n",
+    "\n",
+    "# P(Spam) and P(Ham)\n",
+    "p_spam = len(spam_messages) / len(training_set_clean)\n",
+    "p_ham = len(ham_messages) / len(training_set_clean)\n",
+    "\n",
+    "# N_Spam\n",
+    "n_words_per_spam_message = spam_messages['SMS'].apply(len)\n",
+    "n_spam = n_words_per_spam_message.sum()\n",
+    "\n",
+    "# N_Ham\n",
+    "n_words_per_ham_message = ham_messages['SMS'].apply(len)\n",
+    "n_ham = n_words_per_ham_message.sum()\n",
+    "\n",
+    "# N_Vocabulary\n",
+    "n_vocabulary = len(vocabulary)\n",
+    "\n",
+    "# Laplace smoothing\n",
+    "alpha = 1"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Calculating Parameters\n",
+    "\n",
+    "Now that we have the constant terms calculated above, we can move on with calculating the parameters $P(w_i|Spam)$ and $P(w_i|Ham)$. Each parameter will thus be a conditional probability value associated with each word in the vocabulary.\n",
+    "\n",
+    "The parameters are calculated using the formulas:\n",
+    "\n",
+    "\\begin{equation}\n",
+    "P(w_i|Spam) = \\frac{N_{w_i|Spam} + \\alpha}{N_{Spam} + \\alpha \\cdot N_{Vocabulary}}\n",
+    "\\end{equation}\n",
+    "\n",
+    "\\begin{equation}\n",
+    "P(w_i|Ham) = \\frac{N_{w_i|Ham} + \\alpha}{N_{Ham} + \\alpha \\cdot N_{Vocabulary}}\n",
+    "\\end{equation}"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Initiate parameters\n",
+    "parameters_spam = {unique_word:0 for unique_word in vocabulary}\n",
+    "parameters_ham = {unique_word:0 for unique_word in vocabulary}\n",
+    "\n",
+    "# Calculate parameters\n",
+    "for word in vocabulary:\n",
+    "    n_word_given_spam = spam_messages[word].sum()   # spam_messages already defined in a cell above\n",
+    "    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha*n_vocabulary)\n",
+    "    parameters_spam[word] = p_word_given_spam\n",
+    "    \n",
+    "    n_word_given_ham = ham_messages[word].sum()   # ham_messages already defined in a cell above\n",
+    "    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha*n_vocabulary)\n",
+    "    parameters_ham[word] = p_word_given_ham"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Classifying A New Message\n",
+    "\n",
+    "Now that we have all our parameters calculated, we can start creating the spam filter. The spam filter can be understood as a function that:\n",
+    "\n",
+    "- Takes in as input a new message (w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>).\n",
+    "- Calculates P(Spam|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>) and P(Ham|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>).\n",
+    "- Compares the values of P(Spam|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>) and P(Ham|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>), and:\n",
+    "    - If P(Ham|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>) > P(Spam|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>), then the message is classified as ham.\n",
+    "    - If P(Ham|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>) < P(Spam|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>), then the message is classified as spam.\n",
+    "    -  If P(Ham|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>) = P(Spam|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>), then the algorithm may request human help."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import re\n",
+    "\n",
+    "def classify(message):\n",
+    "    '''\n",
+    "    message: a string\n",
+    "    '''\n",
+    "    \n",
+    "    message = re.sub('\\W', ' ', message)\n",
+    "    message = message.lower().split()\n",
+    "    \n",
+    "    p_spam_given_message = p_spam\n",
+    "    p_ham_given_message = p_ham\n",
+    "\n",
+    "    for word in message:\n",
+    "        if word in parameters_spam:\n",
+    "            p_spam_given_message *= parameters_spam[word]\n",
+    "            \n",
+    "        if word in parameters_ham:\n",
+    "            p_ham_given_message *= parameters_ham[word]\n",
+    "            \n",
+    "    print('P(Spam|message):', p_spam_given_message)\n",
+    "    print('P(Ham|message):', p_ham_given_message)\n",
+    "    \n",
+    "    if p_ham_given_message > p_spam_given_message:\n",
+    "        print('Label: Ham')\n",
+    "    elif p_ham_given_message < p_spam_given_message:\n",
+    "        print('Label: Spam')\n",
+    "    else:\n",
+    "        print('Equal proabilities, have a human classify this!')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "P(Spam|message): 1.3481290211300841e-25\n",
+      "P(Ham|message): 1.9368049028589875e-27\n",
+      "Label: Spam\n"
+     ]
+    }
+   ],
+   "source": [
+    "classify('WINNER!! This is the secret code to unlock the money: C3421.')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 17,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "P(Spam|message): 2.4372375665888117e-25\n",
+      "P(Ham|message): 3.687530435009238e-21\n",
+      "Label: Ham\n"
+     ]
+    }
+   ],
+   "source": [
+    "classify(\"Sounds good, Tom, then see u there\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Measuring the Spam Filter's Accuracy\n",
+    "\n",
+    "The two results above look promising, but let's see how well the filter does on our test set, which has 1,114 messages.\n",
+    "\n",
+    "We'll start by writing a function that returns classification labels instead of printing them."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 18,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def classify_test_set(message):    \n",
+    "    '''\n",
+    "    message: a string\n",
+    "    '''\n",
+    "    \n",
+    "    message = re.sub('\\W', ' ', message)\n",
+    "    message = message.lower().split()\n",
+    "    \n",
+    "    p_spam_given_message = p_spam\n",
+    "    p_ham_given_message = p_ham\n",
+    "\n",
+    "    for word in message:\n",
+    "        if word in parameters_spam:\n",
+    "            p_spam_given_message *= parameters_spam[word]\n",
+    "            \n",
+    "        if word in parameters_ham:\n",
+    "            p_ham_given_message *= parameters_ham[word]\n",
+    "    \n",
+    "    if p_ham_given_message > p_spam_given_message:\n",
+    "        return 'ham'\n",
+    "    elif p_spam_given_message > p_ham_given_message:\n",
+    "        return 'spam'\n",
+    "    else:\n",
+    "        return 'needs human classification'"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now that we have a function that returns labels instead of printing them, we can use it to create a new column in our test set."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 19,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>Label</th>\n",
+       "      <th>SMS</th>\n",
+       "      <th>predicted</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>ham</td>\n",
+       "      <td>Later i guess. I needa do mcat study too.</td>\n",
+       "      <td>ham</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>ham</td>\n",
+       "      <td>But i haf enuff space got like 4 mb...</td>\n",
+       "      <td>ham</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>spam</td>\n",
+       "      <td>Had your mobile 10 mths? Update to latest Oran...</td>\n",
+       "      <td>spam</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>ham</td>\n",
+       "      <td>All sounds good. Fingers . Makes it difficult ...</td>\n",
+       "      <td>ham</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>ham</td>\n",
+       "      <td>All done, all handed in. Don't know if mega sh...</td>\n",
+       "      <td>ham</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "  Label                                                SMS predicted\n",
+       "0   ham          Later i guess. I needa do mcat study too.       ham\n",
+       "1   ham             But i haf enuff space got like 4 mb...       ham\n",
+       "2  spam  Had your mobile 10 mths? Update to latest Oran...      spam\n",
+       "3   ham  All sounds good. Fingers . Makes it difficult ...       ham\n",
+       "4   ham  All done, all handed in. Don't know if mega sh...       ham"
+      ]
+     },
+     "execution_count": 19,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "test_set['predicted'] = test_set['SMS'].apply(classify_test_set)\n",
+    "test_set.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now, we'll write a function to measure the accuracy of our spam filter to find out how well our spam filter does."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 20,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Correct: 1100\n",
+      "Incorrect: 14\n",
+      "Accuracy: 0.9874326750448833\n"
+     ]
+    }
+   ],
+   "source": [
+    "correct = 0\n",
+    "total = test_set.shape[0]\n",
+    "    \n",
+    "for row in test_set.iterrows():\n",
+    "    row = row[1]\n",
+    "    if row['Label'] == row['predicted']:\n",
+    "        correct += 1\n",
+    "        \n",
+    "print('Correct:', correct)\n",
+    "print('Incorrect:', total - correct)\n",
+    "print('Accuracy:', correct/total)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The accuracy is close to 98.74%, which is really good. Our spam filter looked at 1,114 messages that it hasn't seen in training, and classified 1,100 correctly."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Next Steps\n",
+    "\n",
+    "In this project, we managed to build a spam filter for SMS messages using the multinomial Naive Bayes algorithm. The filter had an accuracy of 98.74% on the test set we used, which is a pretty good result. Our initial goal was an accuracy of over 80%, and we managed to do way better than that.\n",
+    "\n",
+    "Next steps include:\n",
+    "\n",
+    "- Analyze the 14 messages that were classified incorrectly and try to figure out why the algorithm classified them incorrectly\n",
+    "- Make the filtering process more complex by making the algorithm sensitive to letter case"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.7.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
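
One of the next steps listed in the final cell of the notebook above is analyzing the 14 misclassified messages. As a hedged sketch only (not part of the committed notebook), they could be isolated from the `test_set` DataFrame built earlier like this:

```python
# Illustrative sketch: pull out the test messages the filter got wrong,
# assuming test_set already has the 'predicted' column from the cells above.
misclassified = test_set[test_set['Label'] != test_set['predicted']]
print(misclassified.shape[0], 'misclassified messages')

for _, row in misclassified.iterrows():
    print(row['Label'], '|', row['predicted'], '|', row['SMS'])
```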

+ 421 - 0
Mission449Solutions.Rmd

@@ -0,0 +1,421 @@
+---
+title: 'Guided Project: Finding the Best Markets to Advertise In'
+author: "Dataquest"
+date: "11/19/2019"
+output: html_document
+---
+
+# Finding the Two Best Markets to Advertise in an E-learning Product
+
+In this project, we'll aim to find the two best markets to advertise our product in — we're working for an e-learning company that offers courses on programming. Most of our courses are on web and mobile development, but we also cover many other domains, like data science, game development, etc.
+
+# Understanding the Data
+
+To avoid spending money on organizing a survey, we'll first try to make use of existing data to determine whether we can reach any reliable result.
+
+One good candidate for our purpose is [freeCodeCamp's 2017 New Coder Survey](https://medium.freecodecamp.org/we-asked-20-000-people-who-they-are-and-how-theyre-learning-to-code-fff5d668969). [freeCodeCamp](https://www.freecodecamp.org/) is a free e-learning platform that offers courses on web development. Because they run [a popular Medium publication](https://medium.freecodecamp.org/) (over 400,000 followers), their survey attracted new coders with varying interests (not only web development), which is ideal for the purpose of our analysis.
+
+The survey data is publicly available in [this GitHub repository](https://github.com/freeCodeCamp/2017-new-coder-survey). Below, we'll do a quick exploration of the `2017-fCC-New-Coders-Survey-Data.csv` file stored in the `clean-data` folder of the repository we just mentioned. We'll read in the file using the direct link [here](https://raw.githubusercontent.com/freeCodeCamp/2017-new-coder-survey/master/clean-data/2017-fCC-New-Coders-Survey-Data.csv).
+
+```{r}
+library(readr)
+fcc <- read_csv("2017-fCC-New-Coders-Survey-Data.csv")
+dim(fcc)
+head(fcc, 5)
+```
+
+# Checking for Sample Representativity
+
+As we mentioned in the introduction, most of our courses are on web and mobile development, but we also cover many other domains, like data science, game development, etc. For the purpose of our analysis, we want to answer questions about a population of new coders that are interested in the subjects we teach. We'd like to know:
+
+* Where these new coders are located.
+* What locations have the greatest densities of new coders.
+* How much money they're willing to spend on learning.
+
+So we first need to clarify whether the data set has the right categories of people for our purpose. The `JobRoleInterest` column describes for every participant the role(s) they'd be interested in working in. If a participant is interested in working in a certain domain, it means that they're also interested in learning about that domain. So let's take a look at the frequency distribution table of this column [^1] and determine whether the data we have is relevant.
+
+
+```{r}
+#split-and-combine workflow
+library(dplyr)
+fcc %>%
+  group_by(JobRoleInterest) %>%
+  summarise(freq = n()*100/nrow(fcc)) %>%
+  arrange(desc(freq))
+```
+
+
+The information in the table above is quite granular, but from a quick scan it looks like:
+
+* A lot of people are interested in web development (full-stack _web development_, front-end _web development_ and back-end _web development_).
+* A few people are interested in mobile development.
+* A few people are interested in domains other than web and mobile development.
+
+It's also interesting to note that many respondents are interested in more than one subject. It'd be useful to get a better picture of how many people are interested in a single subject and how many have mixed interests. Consequently, in the next code block, we'll:
+
+- Split each string in the `JobRoleInterest` column to find the number of options for each participant.
+    - We'll first drop the NA values [^2] because we cannot split NA values.
+- Generate a frequency table for the variable describing the number of options [^3].
+
+```{r}
+# Split each string in the 'JobRoleInterest' column
+splitted_interests <- fcc %>%
+  select(JobRoleInterest) %>%
+  tidyr::drop_na() %>%
+  rowwise() %>% # tidyverse operates over columns by default; rowwise() changes this behavior.
+  mutate(opts = length(stringr::str_split(JobRoleInterest, ",")[[1]]))
+
+# Frequency table for the var describing the number of options
+n_of_options <- splitted_interests %>%
+  ungroup() %>%  # this is needed because we used the rowwise() function before
+  group_by(opts) %>%
+  summarize(freq = n()*100/nrow(splitted_interests))
+
+n_of_options
+```
+
+It turns out that only 31.65% of the participants have a clear idea about what programming niche they'd like to work in, while the vast majority of students have mixed interests. But given that we offer courses on various subjects, the fact that new coders have mixed interests might actually be good for us.
+
+The focus of our courses is on web and mobile development, so let's find out how many respondents chose at least one of these two options.
+
+```{r}
+# Frequency table (we can also use split-and-combine) 
+web_or_mobile <- stringr::str_detect(fcc$JobRoleInterest, "Web Developer|Mobile Developer")
+freq_table <- table(web_or_mobile)
+freq_table <- freq_table * 100 / sum(freq_table)
+freq_table
+
+# Graph for the frequency table above
+df <- tibble::tibble(x = c("Other Subject","Web or Mobile Development"),
+                       y = freq_table)
+
+library(ggplot2)
+
+ggplot(data = df, aes(x = x, y = y, fill = x)) +
+  geom_histogram(stat = "identity")
+
+
+```
+
+It turns out that most people in this survey (roughly 86%) are interested in either web or mobile development. These figures offer us a strong reason to consider this sample representative of our population of interest. We want to advertise our courses to people interested in all sorts of programming niches, but mostly web and mobile development.
+
+Now we need to figure out what are the best markets to invest money in for advertising our courses. We'd like to know:
+
+* Where these new coders are located.
+* Which locations have the greatest number of new coders.
+* How much money new coders are willing to spend on learning.
+
+# New Coders - Locations and Densities
+
+Let's begin by finding out where these new coders are located and what the density of new coders is for each location. This should be a good start for finding the best two markets to run our ad campaign in.
+
+The data set provides information about the location of each participant at a country level. We can think of each country as an individual market, so we can frame our goal as finding the two best countries to advertise in.
+
+We can start by examining the frequency distribution table of the `CountryLive` variable, which describes what country each participant lives in (not their origin country). We'll only consider those participants who answered what role(s) they're interested in, to make sure we work with a representative sample.
+
+```{r}
+# Isolate the participants that answered what role they'd be interested in
+fcc_good <- fcc %>%
+  tidyr::drop_na(JobRoleInterest) 
+
+# Frequency tables with absolute and relative frequencies
+# Display the frequency tables in a more readable format
+fcc_good %>%
+group_by(CountryLive) %>%
+summarise(`Absolute frequency` = n(),
+          `Percentage` = n() * 100 /  nrow(fcc_good) ) %>%
+  arrange(desc(Percentage))
+```
+
+44.69% of our potential customers are located in the US, which definitely seems like the most interesting market. India has the second highest customer density, but at just 7.55% it's not far ahead of the United Kingdom (4.50%) or Canada (3.71%).
+
+This is useful information, but we need to go more in depth than this and figure out how much money people are actually willing to spend on learning. Advertising in high-density markets where most people are only willing to learn for free is extremely unlikely to be profitable for us.
+
+# Spending Money for Learning
+
+The `MoneyForLearning` column describes in American dollars the amount of money spent by participants from the moment they started coding until the moment they completed the survey. Our company sells subscriptions at a price of \$59 per month, and for this reason we're interested in finding out how much money each student spends per month.
+
+We'll narrow down our analysis to only four countries: the US, India, the United Kingdom, and Canada. We do this for two reasons:
+
+* These are the countries having the highest frequency in the frequency table above, which means we have a decent amount of data for each.
+* Our courses are written in English, and English is an official language in all these four countries. The more people know English, the better our chances to target the right people with our ads.
+
+Let's start with creating a new column that describes the amount of money a student has spent per month so far. To do that, we'll need to divide the `MoneyForLearning` column by the `MonthsProgramming` column. The problem is that some students answered that they have been learning to code for 0 months (it might be that they had just started). To avoid dividing by 0, we'll replace 0 with 1 in the `MonthsProgramming` column.
+
+
+```{r}
+# Replace 0s with 1s to avoid division by 0
+fcc_good <- fcc_good %>%
+  mutate(MonthsProgramming = replace(MonthsProgramming,  MonthsProgramming == 0, 1) )
+
+# New column for the amount of money each student spends each month
+fcc_good <- fcc_good %>%
+  mutate(money_per_month = MoneyForLearning/MonthsProgramming) 
+
+fcc_good %>%
+  summarise(na_count = sum(is.na(money_per_month)) ) %>%
+  pull(na_count)
+```
+
+Let's keep only the rows that don't have NA values for the `money_per_month` column.
+
+```{r}
+# Keep only the rows with non-NAs in the `money_per_month` column 
+fcc_good  <-  fcc_good %>% tidyr::drop_na(money_per_month)
+```
+
+We want to group the data by country, and then measure the average amount of money that students spend per month in each country. First, let's remove the rows having `NA` values for the `CountryLive` column, and check out if we still have enough data for the four countries that interest us.
+
+```{r}
+# Remove the rows with NA values in 'CountryLive'
+fcc_good  <-  fcc_good %>% tidyr::drop_na(CountryLive)
+
+# Frequency table to check if we still have enough data
+fcc_good %>% group_by(CountryLive) %>%
+  summarise(freq = n() ) %>%
+  arrange(desc(freq)) %>%
+  head()
+```
+
+This should be enough, so let's compute the average amount of money spent per month by a student in each country, using the mean as our measure of the average.
+
+```{r}
+# Mean sum of money spent by students each month
+countries_mean  <-  fcc_good %>% 
+  filter(CountryLive == 'United States of America' | CountryLive == 'India' | CountryLive == 'United Kingdom'|CountryLive == 'Canada') %>%
+  group_by(CountryLive) %>%
+  summarize(mean = mean(money_per_month)) %>%
+  arrange(desc(mean))
+
+countries_mean
+```
+
+The results for the United Kingdom and Canada are a bit surprising relative to the values we see for India. If we considered a few socio-economical metrics (like [GDP per capita](https://bit.ly/2I3cukh)), we'd intuitively expect people in the UK and Canada to spend more on learning than people in India.
+
+It might be that we don't have enough representative data for the United Kingdom and Canada, or that we have some outliers (perhaps coming from wrong survey answers) making the mean too large for India, or too low for the UK and Canada. Or it might be that the results are correct.
+
+# Dealing with Extreme Outliers
+
+Let's use box plots to visualize the distribution of the `money_per_month` variable for each country.
+
+```{r}
+# Isolate only the countries of interest
+only_4  <-  fcc_good %>% 
+  filter(CountryLive == 'United States of America' | CountryLive == 'India' | CountryLive == 'United Kingdom'|CountryLive == 'Canada')
+
+# Since we may remove rows later, we add an index column containing each row's number
+# so we can match rows back to the original data frame if needed.
+only_4 <- only_4 %>%
+  mutate(index = row_number())
+
+# Box plots to visualize distributions
+ggplot( data = only_4, aes(x = CountryLive, y = money_per_month)) +
+  geom_boxplot() +
+  ggtitle("Money Spent Per Month Per Country\n(Distributions)") +
+  xlab("Country") +
+  ylab("Money per month (US dollars)") +
+  theme_bw()
+
+```
+
+It's hard to tell from the plot above whether there's anything wrong with the data for the United Kingdom, India, or Canada, but we can see immediately that something is really off for the US: two persons spend \$50,000 or more per month on learning. This is not impossible, but it seems extremely unlikely, so we'll remove every value above \$20,000 per month.
+
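+To see exactly which records these are, here is a quick look at the values above \$50,000 per month (a small sanity check before the filtering step below):
+
+```{r}
+# Inspect the extreme US values (at or above $50,000 per month)
+only_4 %>%
+  filter(money_per_month >= 50000) %>%
+  select(CountryLive, AttendedBootcamp, MonthsProgramming, money_per_month)
+```
+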
+```{r}
+# Isolate only those participants who spend less than $20,000 per month
+fcc_good  <- fcc_good %>% 
+  filter(money_per_month < 20000)
+```
+
+Now let's recompute the mean values and plot the box plots again.
+
+```{r}
+# Mean sum of money spent by students each month
+countries_mean = fcc_good %>% 
+  filter(CountryLive == 'United States of America' | CountryLive == 'India' | CountryLive == 'United Kingdom'|CountryLive == 'Canada') %>%
+  group_by(CountryLive) %>%
+  summarize(mean = mean(money_per_month)) %>%
+  arrange(desc(mean))
+
+countries_mean
+```
+
+```{r}
+# Isolate only the countries of interest
+only_4  <-  fcc_good %>% 
+  filter(CountryLive == 'United States of America' | CountryLive == 'India' | CountryLive == 'United Kingdom'|CountryLive == 'Canada') %>%
+  mutate(index = row_number())
+
+# Box plots to visualize distributions
+ggplot( data = only_4, aes(x = CountryLive, y = money_per_month)) +
+  geom_boxplot() +
+  ggtitle("Money Spent Per Month Per Country\n(Distributions)") +
+  xlab("Country") +
+  ylab("Money per month (US dollars)") +
+  theme_bw()
+
+```
+
+We can see a few extreme outliers for India (values over \$2,500 per month), but it's unclear whether this is good data or not. Maybe these persons attended several bootcamps, which tend to be very expensive. Let's examine these data points to see if we can find anything relevant.
+
+```{r}
+# Inspect the extreme outliers for India
+india_outliers  <-  only_4 %>%
+  filter(CountryLive == 'India' & 
+           money_per_month >= 2500)
+
+india_outliers
+```
+
+It seems that none of these participants attended a bootcamp. Overall, it's really hard to figure out from the data whether these persons really spent that much money on learning. The actual survey question was _"Aside from university tuition, about how much money have you spent on learning to code so far (in US dollars)?"_, so they might have misunderstood it and thought university tuition was included. It seems safer to remove these six rows.
+
+```{r}
+# Remove the outliers for India
+only_4 <-  only_4 %>% 
+  filter(!(index %in% india_outliers$index))
+```
+
+Looking back at the box plot above, we can also see more extreme outliers for the US (values over \$6,000 per month). Let's examine these participants in more detail.
+
+```{r}
+# Examine the extreme outliers for the US
+us_outliers = only_4 %>%
+  filter(CountryLive == 'United States of America' & 
+           money_per_month >= 6000)
+
+us_outliers
+
+only_4  <-  only_4 %>% 
+  filter(!(index %in% us_outliers$index))
+```
+
+Out of these 11 extreme outliers, six people attended bootcamps, which justifies the large sums of money spent on learning. For the other five, it's hard to figure out from the data where they could have spent that much money on learning. Consequently, we'll remove those rows where participants reported that they spend \$6,000 or more each month but have never attended a bootcamp.
+
+Also, the data shows that eight respondents had been programming for no more than three months when they completed the survey. They most likely paid a large sum of money for a bootcamp that was going to last for several months, so the amount of money spent per month is unrealistic and should be significantly lower (because they probably didn't spend anything for the next couple of months after the survey). As a consequence, we'll remove these eight outliers.
+
+In the next code block, we'll remove respondents that:
+
+- Didn't attend bootcamps.
+- Had been programming for three months or less at the time they completed the survey.
+
+```{r}
+# Remove the respondents who didn't attend a bootcamp
+no_bootcamp = only_4 %>%
+    filter(CountryLive == 'United States of America' & 
+           money_per_month >= 6000 &
+             AttendedBootcamp == 0)
+
+only_4  <-  only_4 %>% 
+  filter(!(index %in% no_bootcamp$index))
+
+
+# Remove the respondents that had been programming for less than 3 months
+less_than_3_months = only_4 %>%
+    filter(CountryLive == 'United States of America' & 
+           money_per_month >= 6000 &
+           MonthsProgramming <= 3)
+
+only_4  <-  only_4 %>% 
+  filter(!(index %in% less_than_3_months$index))
+```
+
+
+Looking again at the last box plot above, we can also see an extreme outlier for Canada — a person who spends roughly \$5,000 per month. Let's examine this person in more depth.
+
+```{r}
+# Examine the extreme outliers for Canada
+canada_outliers = only_4 %>%
+  filter(CountryLive == 'Canada' & 
+           money_per_month >= 4500 &
+           MonthsProgramming <= 3)
+
+canada_outliers
+```
+
+Here, the situation is similar to some of the US respondents: this participant had been programming for no more than two months when they completed the survey. They seem to have paid a large sum of money up front to enroll in a bootcamp, and then probably didn't spend anything for the next couple of months after the survey. We'll take the same approach here as for the US and remove this outlier.
+
+```{r}
+# Remove the extreme outliers for Canada
+only_4  <-  only_4 %>% 
+  filter(!(index %in% canada_outliers$index))
+```
+
+Let's recompute the mean values and generate the final box plots.
+
+```{r}
+# Mean sum of money spent by students each month
+countries_mean = only_4 %>%
+  group_by(CountryLive) %>%
+  summarize(mean = mean(money_per_month)) %>%
+  arrange(desc(mean))
+
+countries_mean
+```
+
+
+```{r}
+# Box plots to visualize distributions
+ggplot( data = only_4, aes(x = CountryLive, y = money_per_month)) +
+  geom_boxplot() +
+  ggtitle("Money Spent Per Month Per Country\n(Distributions)") +
+  xlab("Country") +
+  ylab("Money per month (US dollars)") +
+  theme_bw()
+
+```
+
+
+## Choosing the Two Best Markets
+
+Obviously, one country we should advertise in is the US. Lots of new coders live there and they are willing to pay a good amount of money each month (roughly \$143).
+
+We sell subscriptions at a price of \$59 per month, and Canada seems to be the best second choice because people there are willing to pay roughly \$93 per month, compared to India (\$66) and the United Kingdom (\$45).
+
+The data suggests strongly that we shouldn't advertise in the UK, but let's take a second look at India before deciding to choose Canada as our second best choice:
+
+* \$59 doesn't seem like an expensive sum for people in India since they spend on average \$66 each month.
+* We have almost twice as many potential customers in India as we have in Canada:
+
+```{r}
+# Frequency table for the 'CountryLive' column
+only_4 %>% group_by(CountryLive) %>%
+  summarise(freq = n() * 100 / nrow(only_4) ) %>%
+  arrange(desc(freq)) %>%
+  head()
+```
+
+```{r}
+# Frequency table to check if we still have enough data
+only_4 %>% group_by(CountryLive) %>%
+  summarise(freq = n() ) %>%
+  arrange(desc(freq)) %>%
+  head()
+```
+
+
+So it's not crystal clear what to choose between Canada and India. Although it seems more tempting to choose Canada, there's a good chance that India might actually be a better choice because of the large number of potential customers.
+
+At this point, it seems that we have several options (a rough dollar illustration of these splits follows the list):
+
+1. Advertise in the US, India, and Canada by splitting the advertisement budget in various combinations:
+    - 60% for the US, 25% for India, 15% for Canada.
+    - 50% for the US, 30% for India, 20% for Canada; etc.
+
+2. Advertise only in the US and India, or the US and Canada. Again, it makes sense to split the advertisement budget unequally. For instance:
+    - 70% for the US, and 30% for India.
+    - 65% for the US, and 35% for Canada; etc.
+
+3. Advertise only in the US.
+
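+As a rough illustration of what these splits would mean in practice, the sketch below assumes a hypothetical total advertising budget of \$10,000 (the total is made up purely for illustration):
+
+```{r}
+# Hypothetical dollar allocation for the splitting options above
+# (the $10,000 total budget is an assumption for illustration only)
+budget <- 10000
+
+splits <- tibble::tibble(
+  scenario = c("US/India/Canada (60/25/15)", "US/India/Canada (50/30/20)",
+               "US/India (70/30)", "US/Canada (65/35)"),
+  US     = c(0.60, 0.50, 0.70, 0.65) * budget,
+  India  = c(0.25, 0.30, 0.30, 0.00) * budget,
+  Canada = c(0.15, 0.20, 0.00, 0.35) * budget
+)
+
+splits
+```
+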
+At this point, it's probably best to send our analysis to the marketing team and let them use their domain knowledge to decide. They might want to do some extra surveys in India and Canada and then get back to us for analyzing the new survey data.
+
+# Conclusion
+
+In this project, we analyzed survey data from new coders to find the best two markets to advertise in. The only solid conclusion we reached is that the US would be a good market to advertise in.
+
+For the second best market, it wasn't clear-cut what to choose between India and Canada. We decided to send the results to the marketing team so they can use their domain knowledge to make the best decision.
+
+# Documentation
+[^1]: We can use the [Split-and-Combine workflow](https://app.dataquest.io/m/339/a/5).
+[^2]: We can use the [`drop_na()` function](https://app.dataquest.io/m/326/a/6).
+[^3]: We can use the [`stringr::str_split()` function](https://app.dataquest.io/m/342/a/6).

+ 428 - 0
Mission459Solutions.Rmd

@@ -0,0 +1,428 @@
+---
+title: 'Linear Modeling in R: Guided Project Solutions'
+author: "Dataquest"
+date: "12/10/2019"
+output:
+  pdf_document: default
+  html_document: default
+---
+
+# How well does the size of a condominium in New York City explain sale price?
+
+In this project we'll explore how well the size of a condominium (measured in gross square feet) explains, or predicts, sale price in New York City. We will also explore how well the size of a condominium predicts sale price in each of the five boroughs of New York City: the Bronx, Brooklyn, Manhattan, Staten Island, and Queens. 
+
+Before we build linear regression models we will plot sale price versus gross square feet to see if the data exhibits any obvious visual patterns. Plotting the data will also allow us to visualize outliers, and we will investigate some of the outliers to determine if the data was recorded correctly. This property sales data is [publicly available](https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page) and contains sales records from a twelve-month period (November, 2018 through October, 2019). 
+
+# Understanding the Data
+
+The data used for this project originates from five separate Microsoft Excel files, one for each borough in New York City. The data structure is identical for all five files, which makes it possible to combine all of the data into a single file. The code below outlines the steps taken to load each dataset into R, combine the datasets, format the data for ease of use, and export the result as a csv file for later use. Because we are predicting sale price on the basis of size, we deleted sale records with a `sale_price` of \$10,000 or less (we assumed these deals to be between family members), and deleted records with `gross_square_feet` values of 0.
+
+```{r eval=FALSE}
+# Set `eval=FALSE` so that this code chunk is not run multiple times
+# Load packages required for New York City property sales data linear modeling
+library(readxl) # Load Excel files
+library(magrittr) # Make all colnames lower case with no spaces
+library(stringr) # String formatting and replacement
+library(dplyr) # Data wrangling and manipulation
+library(readr) # Load and write csv files
+library(ggplot2) # Data visualization
+library(tidyr) # Nesting and unnesting dataframes
+
+# Data accessed November, 2019 from: 
+# https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page
+# Data Used for this Guided Project is from November 2018 to October 2019
+brooklyn <- read_excel("rollingsales_brooklyn_Oct2019.xls", skip = 4)
+bronx <- read_excel("rollingsales_bronx_Oct2019.xls", skip = 4)
+manhattan <- read_excel("rollingsales_manhattan_Oct2019.xls", skip = 4)
+staten_island <- read_excel("rollingsales_statenisland_Oct2019.xls", skip = 4)
+queens <- read_excel("rollingsales_queens_Oct2019.xls", skip = 4)
+
+# Numeric codes for each borough - to be replaced with names
+# Manhattan = 1
+# Bronx = 2
+# Brooklyn = 3
+# Queens = 4 
+# Staten Island = 5
+
+# Bind all dataframes into one
+NYC_property_sales <- bind_rows(manhattan, bronx, brooklyn, queens, staten_island)
+
+# Remove individual dataframes for each neighborhood
+rm(brooklyn, bronx, manhattan, staten_island, queens)
+
+# Replace borough number with borough name, for clarity
+NYC_property_sales <- NYC_property_sales %>% 
+  mutate(BOROUGH = 
+  case_when(BOROUGH == 1 ~ "Manhattan",
+            BOROUGH == 2 ~ "Bronx",
+            BOROUGH == 3 ~ "Brooklyn",
+            BOROUGH == 4 ~ "Queens",
+            BOROUGH == 5 ~ "Staten Island"))
+
+# Convert all colnames to lower case with no spaces (use underscores instead of spaces)
+colnames(NYC_property_sales) %<>% str_replace_all("\\s", "_") %>% tolower()
+
+# Convert CAPITALIZED columns to Title Case
+NYC_property_sales <- NYC_property_sales %>% 
+  mutate(neighborhood = str_to_title(neighborhood)) %>% 
+  mutate(building_class_category = 
+           str_to_title(building_class_category)) %>% 
+  mutate(address = str_to_title(address)) 
+
+NYC_property_sales <- NYC_property_sales %>% 
+  # Drop ease-ment column that contains no data
+  select(-`ease-ment`) %>%
+  # Select only distinct observations (drop duplicates)
+  distinct()
+
+NYC_property_sales <- NYC_property_sales %>% 
+  # Filter out property exchanges between family members
+  # We assume here that the threshold is $10,000 US dollars
+  filter(sale_price > 10000) %>% 
+  # Remove observations with gross square footage of zero
+  # NOTE: We are only doing this here because we are analyzing condominium sales
+  # If analyzing single family homes, we would also consider "land_square_feet"
+  filter(gross_square_feet > 0) %>% 
+  # Drop na values in columns of interest
+  drop_na(c(gross_square_feet, sale_price))
+
+# Arrange observations alphabetically by borough and neighborhood
+NYC_property_sales <- NYC_property_sales %>% 
+  arrange(borough, neighborhood)
+
+# Save results to csv file for future use
+# The code below is commented-out to avoid accidental overwriting of the file later on
+# write_csv(NYC_property_sales, "NYC_property_sales.csv")
+```
+
+The `readr` package is loaded so that the csv file can be read into R.
+
+```{r message=FALSE}
+library(readr)
+# Read in the CSV file we generated above
+NYC_property_sales <- read_csv('NYC_property_sales.csv')
+```
+
+A first glimpse of the data reveals that there are currently over 38,000 sale records in the dataset. 
+
+```{r message=FALSE}
+library(dplyr) # Data wrangling and manipulation
+glimpse(NYC_property_sales)
+```
+
+For this project we will only work with a single type of building class ("R4"), a condominium residential unit in a building with an elevator. This building class is the most common building class in this `NYC_property_sales` dataframe. 
+
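+As a quick check of that claim, we can count the building classes at the time of sale (a small verification step added here; the exact counts depend on the data pull):
+
+```{r}
+# Count building classes to confirm that "R4" is the most common
+NYC_property_sales %>% 
+  count(building_class_at_time_of_sale, sort = TRUE) %>% 
+  head()
+```
+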
+```{r}
+NYC_condos <- NYC_property_sales %>% 
+  # Filter to include only property type: CONDO; RESIDENTIAL UNIT IN ELEVATOR BLDG.
+  # https://www1.nyc.gov/assets/finance/jump/hlpbldgcode.html
+  filter(building_class_at_time_of_sale == "R4")
+```
+
+# Explore Bivariate Relationships with Scatterplots
+
+Now that the data is loaded, processed, and ready to analyze we will use scatterplots to visualize the relationships between condominium sale price and size. The scatterplot below depicts sale price versus size for all five New York City boroughs, combined. In general we see a trend that larger condominiums are associated with a higher sale price. The data follows a somewhat linear pattern. There is no obvious curvature with the shape of the data, but there is a fair amount of spread. The strength of the bivariate relationship is moderate. 
+
+```{r}
+library(ggplot2)
+ggplot(data = NYC_condos, 
+       aes(x = gross_square_feet, y = sale_price)) +
+  geom_point(aes(color = borough), alpha = 0.3) +
+  scale_y_continuous(labels = scales::comma, limits = c(0, 75000000)) +
+  xlim(0, 10000) +
+  geom_smooth(method = "lm", se = FALSE) +
+  theme_minimal() +
+  labs(title = "Condominium Sale Price in NYC Generally Increases with Size",
+       x = "Size (Gross Square Feet)",
+       y = "Sale Price (USD)")
+```
+
+Zooming in on a smaller subset of the data, we observe the same trend below that in general, as the size of a condominium increases, so does the sale price. The pattern is somewhat linear, but there is a fair amount of spread, or dispersion, that becomes more pronounced with an increase in condominium size.
+
+```{r}
+library(ggplot2)
+ggplot(data = NYC_condos, 
+       aes(x = gross_square_feet, y = sale_price)) +
+  geom_point(aes(color = borough), alpha = 0.3) +
+  scale_y_continuous(labels = scales::comma, limits = c(0, 20000000)) +
+  xlim(0, 5000) +
+  geom_smooth(method = "lm", se = FALSE) +
+  theme_minimal() +
+  labs(title = "Condominium Sale Price in NYC Generally Increases with Size",
+       x = "Size (Gross Square Feet)",
+       y = "Sale Price (USD)")
+```
+
+To better visualize the spread of data for each borough, we use y-axis and x-axis scales that are specific to each borough. What neighborhoods have outliers that we should investigate? 
+
+```{r}
+ggplot(data = NYC_condos, 
+       aes(x = gross_square_feet, y = sale_price)) +
+  geom_point(alpha = 0.3) +
+  facet_wrap(~ borough, scales = "free", ncol = 2) +
+  scale_y_continuous(labels = scales::comma) +
+  geom_smooth(method = "lm", se = FALSE) +
+  theme_minimal() +
+  labs(title = "Condominium Sale Price in NYC Generally Increases with Size",
+       x = "Size (Gross Square Feet)",
+       y = "Sale Price (USD)")
+```
+
+Looking at the plot above, we see that, in general, larger condominiums are associated with a higher sale price in each borough. The data follows a somewhat linear pattern in each plot, and there is no obvious curvature in the shape of the data for any borough. The spread is difficult to see in the Manhattan scatterplot, however, potentially because the property sale of around \$200 million visible at the far right is distorting the scale. The strength of the bivariate relationship is moderate for most boroughs, except for Queens, which looks to have a weaker relationship between sale price and size. 
+
+# Outliers and Data Integrity Issues
+
+We begin our investigation of outliers by sorting all sale records by sale price, from high to low.
+
+```{r}
+NYC_condos %>% 
+  arrange(desc(sale_price)) %>% 
+  head()
+```
+
+Research of the highest price listing in the dataset reveals that this property sale was actually the [most expensive home ever sold in the United States](https://www.6sqft.com/billionaire-ken-griffin-buys-238m-nyc-penthouse-the-most-expensive-home-sold-in-the-u-s/) at the time of the sale. The luxurious building that this particular unit is located in even has its own [Wikipedia page](https://en.wikipedia.org/wiki/220_Central_Park_South). 
+
+The real estate transaction with the second-highest sale price in this dataset was also [news worthy](https://therealdeal.com/2019/04/12/cim-group-acquires-resi-portion-of-ues-luxury-rental-for-200m/).
+
+These two most expensive property sales also happen to be the two largest in terms of gross square footage. We will remove the second-highest listing at 165 East 66th Street because this transaction looks to be for an entire block of residences. We would like to limit this analysis to transactions of single units, if possible.
+
+```{r}
+# Make copy of dataframe before removing any sale records
+NYC_condos_original <- NYC_condos
+
+# Remove 165 East 66th Street sale record
+NYC_condos <- NYC_condos %>% 
+  filter(address != "165 East 66th St, Resi")
+```
+
+We will leave the record-setting home sale observation in the dataset for now because we confirmed the sale price to be legitimate. 
+
+# How well does gross square feet explain sale price for all records combined?
+
+Next we'll take a look at the highest sale price observations in Brooklyn. There are a number of sale records at a sale price of around \$30 million, but there is only a single observation in the range of \$10 to \$30 million. Could this be correct?
+
+```{r}
+NYC_condos %>% 
+  filter(borough == "Brooklyn") %>% 
+  arrange(desc(sale_price))
+```
+
+Looking through the results we see that there are 40 sale records with a price of \$29,620,207. This price point appears to be unusual for Brooklyn. Scrolling through the results using the viewer in RStudio, we also see that all 40 property sales took place on the same day, 2019-04-08. This indicates that a transaction took place on that date in which all 40 units were purchased for a TOTAL price of \$29,620,207, not \$29,620,207 per unit. 
+
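+Rather than relying on scrolling, we can also confirm this programmatically with a quick count of the Brooklyn records that share that exact sale price (a small check added here):
+
+```{r}
+# Confirm how many Brooklyn sale records share the $29,620,207 sale price, by date
+NYC_condos %>% 
+  filter(borough == "Brooklyn", sale_price == 29620207) %>% 
+  count(sale_date)
+```
+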
+Thanks to the internet it does not take long for us to find [information about this new building](https://streeteasy.com/building/554-4-avenue-brooklyn). Sure enough, this building contains 40 total units. But according to the website, the average price *per unit* for the 26 "active sales" is around \$990,000 and the average price for the 14 previous sales is around \$816,000, per unit. 
+
+For our purposes we will remove all 40 observations from the dataset because the sale price recorded for each unit is erroneous. We could consider other ways of correcting the data. One option is to determine the price per square foot by dividing the \$29M total sale price by the total number of square feet sold across all 40 units, and then using this number to assign a price to each unit based on its size. But that is not worth our time, and we can't be certain that method would yield valid results. 
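+
+For reference only, a sketch of that alternative correction might look like the code below (we did not apply it, and the per-unit prices it produces would only be approximations):
+
+```{r eval=FALSE}
+# Sketch (not applied): spread the total deal price across the 40 units
+# in proportion to each unit's gross square footage
+adjusted_brooklyn_deal <- NYC_condos %>% 
+  filter(borough == "Brooklyn", sale_price == 29620207) %>% 
+  mutate(adjusted_sale_price = sale_price * gross_square_feet / sum(gross_square_feet))
+```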
+
+Fortunately, we have a programmatic option for surfacing potential multi-unit sales where each sale record contains the sale price for the entire real estate deal, not the price for the individual unit. Below we build a grouped filter that returns all sale records with three or more observations that have the same sale price and sale date. In general, multi-unit sales contain the same price and sale date across many sale records. When building a grouped filter we want to be careful not to "over-filter" by making the criteria too specific. In our case it looks like the filter effectively surfaces multi-sale transactions using only two grouping parameters: `sale_price` and `sale_date`.  
+
+```{r}
+multi_unit_sales <- NYC_condos %>% 
+  group_by(sale_price, sale_date) %>% 
+  filter(n() >= 3) %>% 
+  arrange(desc(sale_price))
+```
+
+We researched many of the addresses listed in the `multi_unit_sales` dataframe and confirmed that most of the sale records included here are part of a multi-unit transaction. We do not expect this filter to be 100 percent accurate; for example, there may be a few property sales included here that are not part of a multi-unit sale. But overall, this grouped filter appears to be effective. 
+
+There are many ways to remove the multi-unit sales from the `NYC_condos` dataframe. Below are two equivalent methods: (1) filter for only the sale records we wish to *retain*, those with two or fewer instances of the same `sale_price` and `sale_date`, or (2) use an anti-join to drop all records from `NYC_condos` found in `multi_unit_sales`. 
+
+```{r}
+NYC_condos <- NYC_condos %>%
+  group_by(sale_price, sale_date) %>%
+  filter(n() <= 2) %>%
+  ungroup()
+
+# Alternative method
+NYC_condos <- NYC_condos %>% 
+  anti_join(multi_unit_sales)
+```
+
+# Linear Regression Model for Boroughs in New York City Combined
+
+Now that we've removed many multi-unit sales from the dataset, let's generate a linear regression model for all New York City neighborhoods combined. As a reminder, we are predicting `sale_price` on the basis of `gross_square_feet`.
+
+```{r}
+NYC_condos_lm <- lm(sale_price ~ gross_square_feet, data = NYC_condos)  
+
+summary(NYC_condos_lm)
+```
+
+How does this compare to the `NYC_condos_original` dataframe that includes multi-unit sales? 
+
+```{r}
+NYC_condos_original_lm <- lm(sale_price ~ gross_square_feet, data = NYC_condos_original)  
+
+summary(NYC_condos_original_lm)
+```
+
+## Comparison of linear modeling results
+
+A bivariate linear regression of `sale_price` (price) explained by `gross_square_feet` (size) was performed on two different datasets containing condominium sale records for New York City. One dataset, `NYC_condos`, was cleaned to remove multi-unit sale records (where the same sale price is recorded for many units). The other dataset, `NYC_condos_original`, remained unaltered and contained all original sale records. In each case, the hypothesis is that  there is a relationship between the size of a condominium (`gross_square_feet`) and the price (`sale_price`). We can declare there is a relationship between condominium size and price when the slope is sufficiently far from zero. 
+
+For each model, the t-statistic was high enough, and the p-value was low enough, to declare that there is, in fact, a relationship between `gross_square_feet` and `sale_price`. The t-statistic for the cleaned dataset (`NYC_condos`) was nearly double that of the original dataset (`NYC_condos_original`) at 113.04 versus 61.39. In each case the p-value was well below the 0.05 cutoff for significance meaning that it is extremely unlikely that the relationship between condominium size and sale price is due to random chance. 
+
+The confidence interval for the slope is [4384.254, 4538.999] for the `NYC_condos` dataset compared to only [1154.636, 1230.802] for the `NYC_condos_original` dataset. This difference can likely be attributed to the removal of many multi-million dollar sale records for smaller units, which impacted price predictions in the original dataset. The measure of *lack of fit*, or residual standard error (RSE), was lower for the cleaned dataset at 2,945,000 compared to 4,745,000 for the original dataset. However, it must be noted that the `NYC_condos` dataset is smaller than `NYC_condos_original` by 150 observations. Finally, the R-squared, or the proportion of the variability in `sale_price` that can be explained by `gross_square_feet`, is 0.6166 for the cleaned `NYC_condos` dataset. This is nearly double the R-squared value estimated for the `NYC_condos_original` dataset at 0.3177. 
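+
+For reference, the slope confidence intervals quoted above can be reproduced by applying base R's `confint()` to each fitted model:
+
+```{r}
+# 95% confidence intervals for the intercept and slope of each model
+confint(NYC_condos_lm)
+confint(NYC_condos_original_lm)
+```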
+
+Below is the updated scatterplot that uses the cleaned `NYC_condos` data. For the Brooklyn borough we are better able to see the spread of the data and how the trend line fits the data because we removed the \$30 million outliers. The same is true for the Manhattan borough because the $200 million multi-unit sale was removed.
+
+```{r}
+ggplot(data = NYC_condos, 
+       aes(x = gross_square_feet, y = sale_price)) +
+  geom_point(alpha = 0.3) +
+  facet_wrap(~ borough, scales = "free", ncol = 2) +
+  scale_y_continuous(labels = scales::comma) +
+  geom_smooth(method = "lm", se = FALSE) +
+  theme_minimal() +
+  labs(title = "Condominium Sale Price in NYC Generally Increases with Size",
+       x = "Size (Gross Square Feet)",
+       y = "Sale Price (USD)")
+```
+
+# Linear Regression Models for each Borough - Coefficient Estimates
+
+Now let's apply the `broom` workflow to compare coefficient estimates across the five boroughs. The general workflow using broom and tidyverse tools to generate many models involves 4 steps:
+
+1. Nest a dataframe by a categorical variable with the `nest()` function from the `tidyr` package - we will nest by `borough`.
+2. Fit models to nested dataframes with the `map()` function from the `purrr` package.
+3. Apply the `broom` functions `tidy()`, `augment()`, and/or `glance()` using each nested model - we'll work with `tidy()` first.
+4. Return a tidy dataframe with the `unnest()` function - this allows us to see the results.
+
+```{r}
+# Step 1: nest by the borough categorical variable
+library(broom)
+library(tidyr)
+library(purrr)
+NYC_nested <- NYC_condos %>% 
+  group_by(borough) %>% 
+  nest()
+```
+
+In the previous step, the `NYC_condos` dataframe was collapsed from 7,946 observations to only 5. The nesting process isolated the sale records for each borough into separate dataframes. 
+
+```{r}
+# Inspect the format
+print(NYC_nested)
+```
+
+We can extract and inspect the values of any nested dataframe. Below is a look at the first six rows for Manhattan.
+
+```{r}
+# View first few rows for Manhattan
+head(NYC_nested$data[[3]])
+```
+
+The next step in the process is to fit a linear model to each individual dataframe. What this means is that we are generating separate linear models for each borough individually.
+
+```{r}
+# Step 2: fit linear models to each borough, individually
+NYC_coefficients <- NYC_condos %>% 
+  group_by(borough) %>% 
+  nest() %>% 
+  mutate(linear_model = map(.x = data, 
+                            .f = ~lm(sale_price ~ gross_square_feet, 
+                                     data = .)))
+```
+
+Taking a look at the data structure we see that we have a new list-column called `linear_model` that contains a linear model object for each borough.
+
+```{r}
+# Inspect the data structure
+print(NYC_coefficients)
+```
+
+We can view the linear modeling results for any one of the nested objects using the `summary()` function. Below are the linear regression statistics for Manhattan.
+
+```{r}
+# Verify model results for Manhattan
+summary(NYC_coefficients$linear_model[[3]])
+```
+
+A quick look at the R-squared value for the Manhattan linear model indicates that `gross_square_feet` looks to be a fairly good single predictor of `sale_price`. Almost two-thirds of the variability in `sale_price` is explained by `gross_square_feet`.
+
+The next step is to transform these linear model summary statistics into a tidy format.
+
+```{r}
+# Step 3: generate a tidy dataframe of coefficient estimates that includes confidence intervals
+NYC_coefficients <- NYC_condos %>% 
+  group_by(borough) %>% 
+  nest() %>% 
+  mutate(linear_model = map(.x = data, 
+                            .f = ~lm(sale_price ~ gross_square_feet, 
+                                     data = .))) %>%
+  mutate(tidy_coefficients = map(.x = linear_model, 
+                                 .f = tidy, 
+                                 conf.int = TRUE))
+NYC_coefficients
+```
+
+Now we have a new variable called `tidy_coefficients` that contains tidy coefficient estimates for each of the five boroughs. These tidy statistics are currently stored in five separate dataframes. Below are the coefficient estimates for Manhattan.
+
+```{r}
+# Inspect the results for Manhattan
+print(NYC_coefficients$tidy_coefficients[[3]])
+```
+
+Now we can unnest the `tidy_coefficients` variable into a single dataframe that includes coefficient estimates for each of New York City's five boroughs. 
+
+```{r}
+# Step 4: Unnest to a tidy dataframe of coefficient estimates
+NYC_coefficients_tidy <- NYC_coefficients %>% 
+  select(borough, tidy_coefficients) %>% 
+  unnest(cols = tidy_coefficients)
+print(NYC_coefficients_tidy)
+```
+
+We're mainly interested in the slope which explains the change in y (sale price) for each unit change in x (square footage). We can filter for the slope estimate only as follows.
+
+```{r}
+# Filter to return the slope estimate only 
+NYC_slope <- NYC_coefficients_tidy %>%   
+  filter(term == "gross_square_feet") %>% 
+  arrange(estimate)
+print(NYC_slope)
+```
+
+We've arranged the results in ascending order by the slope estimate. For each of the five boroughs, the t-statistic and p-value indicate that there is a relationship between `sale_price` and `gross_square_feet`. In Staten Island, an increase in square footage by one unit is estimated to increase the sale price by about \$288, on average. In contrast, in Manhattan, an increase in total square footage by one unit is estimated to result in an increase in sale price of about \$4,728, on average.
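+
+To make these slopes more concrete, we can convert each estimate into the expected change in sale price for an additional 100 square feet (a small illustration based on the tidy slope table above):
+
+```{r}
+# Expected change in sale price for an additional 100 square feet, by borough
+NYC_slope %>% 
+  mutate(price_change_per_100_sqft = estimate * 100) %>% 
+  select(borough, price_change_per_100_sqft)
+```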
+
+# Linear Regression Models for each Borough - Regression Summary Statistics
+
+Now we will apply the same workflow using `broom` tools to generate tidy regression summary statistics for each of the five boroughs. Below we follow the same process as we saw previously with the `tidy()` function, but instead we use the `glance()` function.
+
+```{r}
+# Generate a tidy dataframe of regression summary statistics
+NYC_summary_stats <- NYC_condos %>% 
+  group_by(borough) %>% 
+  nest() %>% 
+  mutate(linear_model = map(.x = data, 
+                            .f = ~lm(sale_price ~ gross_square_feet, 
+                                     data = .))) %>%
+  mutate(tidy_summary_stats = map(.x = linear_model,
+                                  .f = glance))
+print(NYC_summary_stats)
+```
+
+Now we have a new variable called `tidy_summary_stats` that contains tidy regression summary statistics for each of the five boroughs in New York City. These tidy statistics are currently stored in five separate dataframes. Below we unnest the five dataframes to a single, tidy dataframe arranged by R-squared value.
+
+```{r}
+# Unnest to a tidy dataframe of
+NYC_summary_stats_tidy <- NYC_summary_stats %>% 
+  select(borough, tidy_summary_stats) %>% 
+  unnest(cols = tidy_summary_stats) %>% 
+  arrange(r.squared)
+print(NYC_summary_stats_tidy)
+```
+
+These results will be summarized in our conclusion paragraph below. 
+
+# Conclusion
+
+Our analysis showed that, in general, the `gross_square_feet` variable is useful for explaining, or estimating, `sale_price` for condominiums in New York City. We observed that removing multi-unit sales from the dataset increased model accuracy. With linear models generated for New York City as a whole, and with linear models generated for each borough individually, we observed in all cases that the t-statistic was high enough, and the p-value was low enough, to declare that there is a relationship between `gross_square_feet` and `sale_price`.
+
+For the linear models that we generated for each individual borough, we observed a wide range in slope estimates. The slope estimate for Manhattan was much higher than the estimate for any of the other boroughs. We did not remove the record-setting \$240 million property sale from the dataset, but future analysis should investigate the impacts that this single listing has on modeling results. 
+
+Finally, regression summary statistics indicate that `gross_square_feet` is a better single predictor of `sale_price` in some boroughs than in others. For example, the R-squared value was estimated at approximately 0.63 in Manhattan, and 0.59 in Brooklyn, compared to an estimate of only 0.35 in Queens. These differences in R-squared correspond with the scatterplots generated for each borough; the relationship between sale price and gross square feet was stronger, and the dispersion (spread) lower, for Manhattan and Brooklyn as compared to Queens, where the relationship was noticeably weaker because the data was more spread out.
+
+
+
+
+

+ 1 - 0
README.md

@@ -31,3 +31,4 @@ Of course, there are always going to be multiple ways to solve any one problem,
 - [Guided Project: Forest Fires Data](https://github.com/dataquestio/solutions/blob/master/Mission277Solutions.Rmd)
 - [Guided Project: NYC Schools Perceptions](https://github.com/dataquestio/solutions/blob/master/Mission327Solutions.Rmd)
 - [Guided Project: Clean and Analyze Employee Exit Surveys](https://github.com/dataquestio/solutions/blob/master/Mission348Solutions.ipynb)
+- [Guided Project: Finding the Best Markets to Advertise In](https://github.com/dataquestio/solutions/blob/master/Mission449Solutions.Rmd)