
"data set" becomes "dataset"

Alex 5 years ago
parent
commit
7a0d70b180
1 changed file with 7 additions and 7 deletions

+ 7 - 7
Mission433Solutions.ipynb

@@ -8,12 +8,12 @@
     "\n",
     "In this project, we're going to build a spam filter for SMS messages using the multinomial Naive Bayes algorithm. Our goal is to write a program that classifies new messages with an accuracy greater than 80% — so we expect that more than 80% of the new messages will be classified correctly as spam or ham (non-spam).\n",
     "\n",
-    "To train the algorithm, we'll use a data set of 5,572 SMS messages that are already classified by humans. The data set was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the [The UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). The data collection process is described in more details on [this page](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/#composition), where you can also find some of the papers authored by Tiago A. Almeida and José María Gómez Hidalgo.\n",
+    "To train the algorithm, we'll use a dataset of 5,572 SMS messages that are already classified by humans. The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the [The UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). The data collection process is described in more details on [this page](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/#composition), where you can also find some of the papers authored by Tiago A. Almeida and José María Gómez Hidalgo.\n",
     "\n",
     "\n",
-    "## Exploring the Data Set\n",
+    "## Exploring the Dataset\n",
     "\n",
-    "We'll now start by reading in the data set."
+    "We'll now start by reading in the dataset."
    ]
   },
   {
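For context on the cell this hunk edits: the notebook goes on to read in the UCI SMS Spam Collection with pandas. A minimal sketch of that read-in step is shown below; the file name `SMSSpamCollection` and the column names `Label`/`SMS` are assumptions based on the UCI distribution, not something shown in this commit.

```python
import pandas as pd

# Hypothetical read-in of the UCI SMS Spam Collection (tab-separated, no header row).
# The file name and column names are assumptions, not taken from this diff.
sms_spam = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])

print(sms_spam.shape)                                   # the dataset described above has 5,572 rows
print(sms_spam['Label'].value_counts(normalize=True))   # roughly 87% ham, 13% spam
```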
@@ -141,7 +141,7 @@
    "source": [
     "## Training and Test Set\n",
     "\n",
-    "We're now going to split our data set into a training and a test set, where the training set accounts for 80% of the data, and the test set for the remaining 20%."
+    "We're now going to split our dataset into a training and a test set, where the training set accounts for 80% of the data, and the test set for the remaining 20%."
    ]
   },
   {
@@ -159,7 +159,7 @@
     }
    ],
    "source": [
-    "# Randomize the data set\n",
+    "# Randomize the dataset\n",
     "data_randomized = sms_spam.sample(frac=1, random_state=1)\n",
     "\n",
     "# Calculate index for split\n",
@@ -177,7 +177,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "We'll now analyze the percentage of spam and ham messages in the training and test sets. We expect the percentages to be close to what we have in the full data set, where about 87% of the messages are ham, and the remaining 13% are spam."
+    "We'll now analyze the percentage of spam and ham messages in the training and test sets. We expect the percentages to be close to what we have in the full dataset, where about 87% of the messages are ham, and the remaining 13% are spam."
    ]
   },
   {
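A short sketch of the percentage check this cell describes; it assumes the label column is named `Label` and holds the strings `'ham'` and `'spam'`, which this diff does not confirm.

```python
# Hypothetical check that both splits mirror the full dataset (~87% ham, ~13% spam).
print(training_set['Label'].value_counts(normalize=True))
print(test_set['Label'].value_counts(normalize=True))
```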
@@ -228,7 +228,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "The results look good! We'll now move on to cleaning the data set.\n",
+    "The results look good! We'll now move on to cleaning the dataset.\n",
     "\n",
     "## Data Cleaning\n",
     "\n",