|
@@ -0,0 +1,1314 @@
|
|
|
+{
|
|
|
+ "cells": [
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "# Building a Spam Filter with Naive Bayes\n",
|
|
|
+ "\n",
|
|
|
+ "In this project, we're going to build a spam filter for SMS messages using the multinomial Naive Bayes algorithm. Our goal is to write a program that classifies new messages with an accuracy greater than 80% — so we expect that more than 80% of the new messages will be classified correctly as spam or ham (non-spam).\n",
|
|
|
+ "\n",
|
|
|
+ "To train the algorithm, we'll use a data set of 5,572 SMS messages that are already classified by humans. The data set was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the [The UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). The data collection process is described in more details on [this page](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/#composition), where you can also find some of the papers authored by Tiago A. Almeida and José María Gómez Hidalgo.\n",
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ "## Exploring the Data Set\n",
|
|
|
+ "\n",
|
|
|
+ "We'll now start by reading in the data set."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 1,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [
|
|
|
+ {
|
|
|
+ "name": "stdout",
|
|
|
+ "output_type": "stream",
|
|
|
+ "text": [
|
|
|
+ "(5572, 2)\n"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "data": {
|
|
|
+ "text/html": [
|
|
|
+ "<div>\n",
|
|
|
+ "<style scoped>\n",
|
|
|
+ " .dataframe tbody tr th:only-of-type {\n",
|
|
|
+ " vertical-align: middle;\n",
|
|
|
+ " }\n",
|
|
|
+ "\n",
|
|
|
+ " .dataframe tbody tr th {\n",
|
|
|
+ " vertical-align: top;\n",
|
|
|
+ " }\n",
|
|
|
+ "\n",
|
|
|
+ " .dataframe thead th {\n",
|
|
|
+ " text-align: right;\n",
|
|
|
+ " }\n",
|
|
|
+ "</style>\n",
|
|
|
+ "<table border=\"1\" class=\"dataframe\">\n",
|
|
|
+ " <thead>\n",
|
|
|
+ " <tr style=\"text-align: right;\">\n",
|
|
|
+ " <th></th>\n",
|
|
|
+ " <th>Label</th>\n",
|
|
|
+ " <th>SMS</th>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " </thead>\n",
|
|
|
+ " <tbody>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>0</th>\n",
|
|
|
+ " <td>ham</td>\n",
|
|
|
+ " <td>Go until jurong point, crazy.. Available only ...</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>1</th>\n",
|
|
|
+ " <td>ham</td>\n",
|
|
|
+ " <td>Ok lar... Joking wif u oni...</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>2</th>\n",
|
|
|
+ " <td>spam</td>\n",
|
|
|
+ " <td>Free entry in 2 a wkly comp to win FA Cup fina...</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>3</th>\n",
|
|
|
+ " <td>ham</td>\n",
|
|
|
+ " <td>U dun say so early hor... U c already then say...</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>4</th>\n",
|
|
|
+ " <td>ham</td>\n",
|
|
|
+ " <td>Nah I don't think he goes to usf, he lives aro...</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " </tbody>\n",
|
|
|
+ "</table>\n",
|
|
|
+ "</div>"
|
|
|
+ ],
|
|
|
+ "text/plain": [
|
|
|
+ " Label SMS\n",
|
|
|
+ "0 ham Go until jurong point, crazy.. Available only ...\n",
|
|
|
+ "1 ham Ok lar... Joking wif u oni...\n",
|
|
|
+ "2 spam Free entry in 2 a wkly comp to win FA Cup fina...\n",
|
|
|
+ "3 ham U dun say so early hor... U c already then say...\n",
|
|
|
+ "4 ham Nah I don't think he goes to usf, he lives aro..."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ "execution_count": 1,
|
|
|
+ "metadata": {},
|
|
|
+ "output_type": "execute_result"
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "source": [
|
|
|
+ "import pandas as pd\n",
|
|
|
+ "\n",
|
|
|
+ "sms_spam = pd.read_csv('SMSSpamCollection', sep='\\t', header=None, names=['Label', 'SMS'])\n",
|
|
|
+ "\n",
|
|
|
+ "print(sms_spam.shape)\n",
|
|
|
+ "sms_spam.head()"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "Below, we see that about 87% of the messages are ham, and the remaining 13% are spam. This sample looks representative, since in practice most messages that people receive are ham."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 2,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [
|
|
|
+ {
|
|
|
+ "data": {
|
|
|
+ "text/plain": [
|
|
|
+ "ham 0.865937\n",
|
|
|
+ "spam 0.134063\n",
|
|
|
+ "Name: Label, dtype: float64"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ "execution_count": 2,
|
|
|
+ "metadata": {},
|
|
|
+ "output_type": "execute_result"
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "source": [
|
|
|
+ "sms_spam['Label'].value_counts(normalize=True)"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "## Training and Test Set\n",
|
|
|
+ "\n",
|
|
|
+ "We're now going to split our data set into a training and a test set, where the training set accounts for 80% of the data, and the test set for the remaining 20%."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 3,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [
|
|
|
+ {
|
|
|
+ "name": "stdout",
|
|
|
+ "output_type": "stream",
|
|
|
+ "text": [
|
|
|
+ "(4458, 2)\n",
|
|
|
+ "(1114, 2)\n"
|
|
|
+ ]
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "source": [
|
|
|
+ "# Randomize the data set\n",
|
|
|
+ "data_randomized = sms_spam.sample(frac=1, random_state=1)\n",
|
|
|
+ "\n",
|
|
|
+ "# Calculate index for split\n",
|
|
|
+ "training_test_index = round(len(data_randomized) * 0.8)\n",
|
|
|
+ "\n",
|
|
|
+ "# Training/Test split\n",
|
|
|
+ "training_set = data_randomized[:training_test_index].reset_index(drop=True)\n",
|
|
|
+ "test_set = data_randomized[training_test_index:].reset_index(drop=True)\n",
|
|
|
+ "\n",
|
|
|
+ "print(training_set.shape)\n",
|
|
|
+ "print(test_set.shape)"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "We'll now analyze the percentage of spam and ham messages in the training and test sets. We expect the percentages to be close to what we have in the full data set, where about 87% of the messages are ham, and the remaining 13% are spam."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 4,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [
|
|
|
+ {
|
|
|
+ "data": {
|
|
|
+ "text/plain": [
|
|
|
+ "ham 0.86541\n",
|
|
|
+ "spam 0.13459\n",
|
|
|
+ "Name: Label, dtype: float64"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ "execution_count": 4,
|
|
|
+ "metadata": {},
|
|
|
+ "output_type": "execute_result"
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "source": [
|
|
|
+ "training_set['Label'].value_counts(normalize=True)"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 5,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [
|
|
|
+ {
|
|
|
+ "data": {
|
|
|
+ "text/plain": [
|
|
|
+ "ham 0.868043\n",
|
|
|
+ "spam 0.131957\n",
|
|
|
+ "Name: Label, dtype: float64"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ "execution_count": 5,
|
|
|
+ "metadata": {},
|
|
|
+ "output_type": "execute_result"
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "source": [
|
|
|
+ "test_set['Label'].value_counts(normalize=True)"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "The results look good! We'll now move on to cleaning the data set.\n",
|
|
|
+ "\n",
|
|
|
+ "## Data Cleaning\n",
|
|
|
+ "\n",
|
|
|
+ "To calculate all the probabilities required by the algorithm, we'll first need to perform a bit of data cleaning to bring the data in a format that will allow us to extract easily all the information we need.\n",
|
|
|
+ "\n",
|
|
|
+ "Essentially, we want to bring data to this format:\n",
|
|
|
+ "\n",
|
|
|
+ "<center>\n",
|
|
|
+ "\n",
|
|
|
+ "</center>\n",
|
|
|
+ "\n",
|
|
|
+ "### Letter Case and Punctuation\n",
|
|
|
+ "\n",
|
|
|
+ "We'll begin with removing all the punctuation and bringing every letter to lower case."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 6,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [
|
|
|
+ {
|
|
|
+ "data": {
|
|
|
+ "text/html": [
|
|
|
+ "<div>\n",
|
|
|
+ "<style scoped>\n",
|
|
|
+ " .dataframe tbody tr th:only-of-type {\n",
|
|
|
+ " vertical-align: middle;\n",
|
|
|
+ " }\n",
|
|
|
+ "\n",
|
|
|
+ " .dataframe tbody tr th {\n",
|
|
|
+ " vertical-align: top;\n",
|
|
|
+ " }\n",
|
|
|
+ "\n",
|
|
|
+ " .dataframe thead th {\n",
|
|
|
+ " text-align: right;\n",
|
|
|
+ " }\n",
|
|
|
+ "</style>\n",
|
|
|
+ "<table border=\"1\" class=\"dataframe\">\n",
|
|
|
+ " <thead>\n",
|
|
|
+ " <tr style=\"text-align: right;\">\n",
|
|
|
+ " <th></th>\n",
|
|
|
+ " <th>Label</th>\n",
|
|
|
+ " <th>SMS</th>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " </thead>\n",
|
|
|
+ " <tbody>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>0</th>\n",
|
|
|
+ " <td>ham</td>\n",
|
|
|
+ " <td>Yep, by the pretty sculpture</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>1</th>\n",
|
|
|
+ " <td>ham</td>\n",
|
|
|
+ " <td>Yes, princess. Are you going to make me moan?</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>2</th>\n",
|
|
|
+ " <td>ham</td>\n",
|
|
|
+ " <td>Welp apparently he retired</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>3</th>\n",
|
|
|
+ " <td>ham</td>\n",
|
|
|
+ " <td>Havent.</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>4</th>\n",
|
|
|
+ " <td>ham</td>\n",
|
|
|
+ " <td>I forgot 2 ask ü all smth.. There's a card on ...</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " </tbody>\n",
|
|
|
+ "</table>\n",
|
|
|
+ "</div>"
|
|
|
+ ],
|
|
|
+ "text/plain": [
|
|
|
+ " Label SMS\n",
|
|
|
+ "0 ham Yep, by the pretty sculpture\n",
|
|
|
+ "1 ham Yes, princess. Are you going to make me moan?\n",
|
|
|
+ "2 ham Welp apparently he retired\n",
|
|
|
+ "3 ham Havent.\n",
|
|
|
+ "4 ham I forgot 2 ask ü all smth.. There's a card on ..."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ "execution_count": 6,
|
|
|
+ "metadata": {},
|
|
|
+ "output_type": "execute_result"
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "source": [
|
|
|
+ "# Before cleaning\n",
|
|
|
+ "training_set.head()"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 7,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [
|
|
|
+ {
|
|
|
+ "data": {
|
|
|
+ "text/html": [
|
|
|
+ "<div>\n",
|
|
|
+ "<style scoped>\n",
|
|
|
+ " .dataframe tbody tr th:only-of-type {\n",
|
|
|
+ " vertical-align: middle;\n",
|
|
|
+ " }\n",
|
|
|
+ "\n",
|
|
|
+ " .dataframe tbody tr th {\n",
|
|
|
+ " vertical-align: top;\n",
|
|
|
+ " }\n",
|
|
|
+ "\n",
|
|
|
+ " .dataframe thead th {\n",
|
|
|
+ " text-align: right;\n",
|
|
|
+ " }\n",
|
|
|
+ "</style>\n",
|
|
|
+ "<table border=\"1\" class=\"dataframe\">\n",
|
|
|
+ " <thead>\n",
|
|
|
+ " <tr style=\"text-align: right;\">\n",
|
|
|
+ " <th></th>\n",
|
|
|
+ " <th>Label</th>\n",
|
|
|
+ " <th>SMS</th>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " </thead>\n",
|
|
|
+ " <tbody>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>0</th>\n",
|
|
|
+ " <td>ham</td>\n",
|
|
|
+ " <td>yep by the pretty sculpture</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>1</th>\n",
|
|
|
+ " <td>ham</td>\n",
|
|
|
+ " <td>yes princess are you going to make me moan</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>2</th>\n",
|
|
|
+ " <td>ham</td>\n",
|
|
|
+ " <td>welp apparently he retired</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>3</th>\n",
|
|
|
+ " <td>ham</td>\n",
|
|
|
+ " <td>havent</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>4</th>\n",
|
|
|
+ " <td>ham</td>\n",
|
|
|
+ " <td>i forgot 2 ask ü all smth there s a card on ...</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " </tbody>\n",
|
|
|
+ "</table>\n",
|
|
|
+ "</div>"
|
|
|
+ ],
|
|
|
+ "text/plain": [
|
|
|
+ " Label SMS\n",
|
|
|
+ "0 ham yep by the pretty sculpture\n",
|
|
|
+ "1 ham yes princess are you going to make me moan \n",
|
|
|
+ "2 ham welp apparently he retired\n",
|
|
|
+ "3 ham havent \n",
|
|
|
+ "4 ham i forgot 2 ask ü all smth there s a card on ..."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ "execution_count": 7,
|
|
|
+ "metadata": {},
|
|
|
+ "output_type": "execute_result"
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "source": [
|
|
|
+ "# After cleaning\n",
|
|
|
+ "training_set['SMS'] = training_set['SMS'].str.replace('\\W', ' ')\n",
|
|
|
+ "training_set['SMS'] = training_set['SMS'].str.lower()\n",
|
|
|
+ "training_set.head()"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "## Creating the Vocabulary\n",
|
|
|
+ "\n",
|
|
|
+ "Let's now move to creating the vocabulary, which in this context means a list with all the unique words in our training set."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 8,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "training_set['SMS'] = training_set['SMS'].str.split()\n",
|
|
|
+ "\n",
|
|
|
+ "vocabulary = []\n",
|
|
|
+ "for sms in training_set['SMS']:\n",
|
|
|
+ " for word in sms:\n",
|
|
|
+ " vocabulary.append(word)\n",
|
|
|
+ " \n",
|
|
|
+ "vocabulary = list(set(vocabulary))"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "It looks like there are 7,783 unique words in all the messages of our training set."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 9,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [
|
|
|
+ {
|
|
|
+ "data": {
|
|
|
+ "text/plain": [
|
|
|
+ "7783"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ "execution_count": 9,
|
|
|
+ "metadata": {},
|
|
|
+ "output_type": "execute_result"
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "source": [
|
|
|
+ "len(vocabulary)"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "### The Final Training Set\n",
|
|
|
+ "\n",
|
|
|
+ "We're now going to use the vocabulary we just created to make the data transformation we want."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 10,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "word_counts_per_sms = {unique_word:[] for unique_word in vocabulary}\n",
|
|
|
+ "\n",
|
|
|
+ "for unique_word in vocabulary:\n",
|
|
|
+ " \n",
|
|
|
+ " for sms in training_set['SMS']: \n",
|
|
|
+ " counter = 0\n",
|
|
|
+ " \n",
|
|
|
+ " for word in sms: \n",
|
|
|
+ " if word == unique_word:\n",
|
|
|
+ " counter += 1\n",
|
|
|
+ " \n",
|
|
|
+ " word_counts_per_sms[unique_word].append(counter)"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 11,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [
|
|
|
+ {
|
|
|
+ "data": {
|
|
|
+ "text/html": [
|
|
|
+ "<div>\n",
|
|
|
+ "<style scoped>\n",
|
|
|
+ " .dataframe tbody tr th:only-of-type {\n",
|
|
|
+ " vertical-align: middle;\n",
|
|
|
+ " }\n",
|
|
|
+ "\n",
|
|
|
+ " .dataframe tbody tr th {\n",
|
|
|
+ " vertical-align: top;\n",
|
|
|
+ " }\n",
|
|
|
+ "\n",
|
|
|
+ " .dataframe thead th {\n",
|
|
|
+ " text-align: right;\n",
|
|
|
+ " }\n",
|
|
|
+ "</style>\n",
|
|
|
+ "<table border=\"1\" class=\"dataframe\">\n",
|
|
|
+ " <thead>\n",
|
|
|
+ " <tr style=\"text-align: right;\">\n",
|
|
|
+ " <th></th>\n",
|
|
|
+ " <th>should</th>\n",
|
|
|
+ " <th>huiming</th>\n",
|
|
|
+ " <th>double</th>\n",
|
|
|
+ " <th>freshers</th>\n",
|
|
|
+ " <th>here</th>\n",
|
|
|
+ " <th>tea</th>\n",
|
|
|
+ " <th>running</th>\n",
|
|
|
+ " <th>seat</th>\n",
|
|
|
+ " <th>mb</th>\n",
|
|
|
+ " <th>safely</th>\n",
|
|
|
+ " <th>...</th>\n",
|
|
|
+ " <th>nike</th>\n",
|
|
|
+ " <th>ache</th>\n",
|
|
|
+ " <th>08000930705</th>\n",
|
|
|
+ " <th>printing</th>\n",
|
|
|
+ " <th>450</th>\n",
|
|
|
+ " <th>shanghai</th>\n",
|
|
|
+ " <th>purple</th>\n",
|
|
|
+ " <th>bday</th>\n",
|
|
|
+ " <th>nd</th>\n",
|
|
|
+ " <th>ignorant</th>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " </thead>\n",
|
|
|
+ " <tbody>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>0</th>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>...</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>1</th>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>...</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>2</th>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>...</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>3</th>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>...</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>4</th>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>...</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " </tbody>\n",
|
|
|
+ "</table>\n",
|
|
|
+ "<p>5 rows × 7783 columns</p>\n",
|
|
|
+ "</div>"
|
|
|
+ ],
|
|
|
+ "text/plain": [
|
|
|
+ " should huiming double freshers here tea running seat mb safely \\\n",
|
|
|
+ "0 0 0 0 0 0 0 0 0 0 0 \n",
|
|
|
+ "1 0 0 0 0 0 0 0 0 0 0 \n",
|
|
|
+ "2 0 0 0 0 0 0 0 0 0 0 \n",
|
|
|
+ "3 0 0 0 0 0 0 0 0 0 0 \n",
|
|
|
+ "4 0 0 0 0 0 0 0 0 0 0 \n",
|
|
|
+ "\n",
|
|
|
+ " ... nike ache 08000930705 printing 450 shanghai purple bday nd \\\n",
|
|
|
+ "0 ... 0 0 0 0 0 0 0 0 0 \n",
|
|
|
+ "1 ... 0 0 0 0 0 0 0 0 0 \n",
|
|
|
+ "2 ... 0 0 0 0 0 0 0 0 0 \n",
|
|
|
+ "3 ... 0 0 0 0 0 0 0 0 0 \n",
|
|
|
+ "4 ... 0 0 0 0 0 0 0 0 0 \n",
|
|
|
+ "\n",
|
|
|
+ " ignorant \n",
|
|
|
+ "0 0 \n",
|
|
|
+ "1 0 \n",
|
|
|
+ "2 0 \n",
|
|
|
+ "3 0 \n",
|
|
|
+ "4 0 \n",
|
|
|
+ "\n",
|
|
|
+ "[5 rows x 7783 columns]"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ "execution_count": 11,
|
|
|
+ "metadata": {},
|
|
|
+ "output_type": "execute_result"
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "source": [
|
|
|
+ "word_counts = pd.DataFrame(word_counts_per_sms)\n",
|
|
|
+ "word_counts.head()"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 12,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [
|
|
|
+ {
|
|
|
+ "data": {
|
|
|
+ "text/html": [
|
|
|
+ "<div>\n",
|
|
|
+ "<style scoped>\n",
|
|
|
+ " .dataframe tbody tr th:only-of-type {\n",
|
|
|
+ " vertical-align: middle;\n",
|
|
|
+ " }\n",
|
|
|
+ "\n",
|
|
|
+ " .dataframe tbody tr th {\n",
|
|
|
+ " vertical-align: top;\n",
|
|
|
+ " }\n",
|
|
|
+ "\n",
|
|
|
+ " .dataframe thead th {\n",
|
|
|
+ " text-align: right;\n",
|
|
|
+ " }\n",
|
|
|
+ "</style>\n",
|
|
|
+ "<table border=\"1\" class=\"dataframe\">\n",
|
|
|
+ " <thead>\n",
|
|
|
+ " <tr style=\"text-align: right;\">\n",
|
|
|
+ " <th></th>\n",
|
|
|
+ " <th>Label</th>\n",
|
|
|
+ " <th>SMS</th>\n",
|
|
|
+ " <th>should</th>\n",
|
|
|
+ " <th>huiming</th>\n",
|
|
|
+ " <th>double</th>\n",
|
|
|
+ " <th>freshers</th>\n",
|
|
|
+ " <th>here</th>\n",
|
|
|
+ " <th>tea</th>\n",
|
|
|
+ " <th>running</th>\n",
|
|
|
+ " <th>seat</th>\n",
|
|
|
+ " <th>...</th>\n",
|
|
|
+ " <th>nike</th>\n",
|
|
|
+ " <th>ache</th>\n",
|
|
|
+ " <th>08000930705</th>\n",
|
|
|
+ " <th>printing</th>\n",
|
|
|
+ " <th>450</th>\n",
|
|
|
+ " <th>shanghai</th>\n",
|
|
|
+ " <th>purple</th>\n",
|
|
|
+ " <th>bday</th>\n",
|
|
|
+ " <th>nd</th>\n",
|
|
|
+ " <th>ignorant</th>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " </thead>\n",
|
|
|
+ " <tbody>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>0</th>\n",
|
|
|
+ " <td>ham</td>\n",
|
|
|
+ " <td>[yep, by, the, pretty, sculpture]</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>...</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>1</th>\n",
|
|
|
+ " <td>ham</td>\n",
|
|
|
+ " <td>[yes, princess, are, you, going, to, make, me,...</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>...</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>2</th>\n",
|
|
|
+ " <td>ham</td>\n",
|
|
|
+ " <td>[welp, apparently, he, retired]</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>...</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>3</th>\n",
|
|
|
+ " <td>ham</td>\n",
|
|
|
+ " <td>[havent]</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>...</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>4</th>\n",
|
|
|
+ " <td>ham</td>\n",
|
|
|
+ " <td>[i, forgot, 2, ask, ü, all, smth, there, s, a,...</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>...</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " </tbody>\n",
|
|
|
+ "</table>\n",
|
|
|
+ "<p>5 rows × 7785 columns</p>\n",
|
|
|
+ "</div>"
|
|
|
+ ],
|
|
|
+ "text/plain": [
|
|
|
+ " Label SMS should huiming \\\n",
|
|
|
+ "0 ham [yep, by, the, pretty, sculpture] 0 0 \n",
|
|
|
+ "1 ham [yes, princess, are, you, going, to, make, me,... 0 0 \n",
|
|
|
+ "2 ham [welp, apparently, he, retired] 0 0 \n",
|
|
|
+ "3 ham [havent] 0 0 \n",
|
|
|
+ "4 ham [i, forgot, 2, ask, ü, all, smth, there, s, a,... 0 0 \n",
|
|
|
+ "\n",
|
|
|
+ " double freshers here tea running seat ... nike ache 08000930705 \\\n",
|
|
|
+ "0 0 0 0 0 0 0 ... 0 0 0 \n",
|
|
|
+ "1 0 0 0 0 0 0 ... 0 0 0 \n",
|
|
|
+ "2 0 0 0 0 0 0 ... 0 0 0 \n",
|
|
|
+ "3 0 0 0 0 0 0 ... 0 0 0 \n",
|
|
|
+ "4 0 0 0 0 0 0 ... 0 0 0 \n",
|
|
|
+ "\n",
|
|
|
+ " printing 450 shanghai purple bday nd ignorant \n",
|
|
|
+ "0 0 0 0 0 0 0 0 \n",
|
|
|
+ "1 0 0 0 0 0 0 0 \n",
|
|
|
+ "2 0 0 0 0 0 0 0 \n",
|
|
|
+ "3 0 0 0 0 0 0 0 \n",
|
|
|
+ "4 0 0 0 0 0 0 0 \n",
|
|
|
+ "\n",
|
|
|
+ "[5 rows x 7785 columns]"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ "execution_count": 12,
|
|
|
+ "metadata": {},
|
|
|
+ "output_type": "execute_result"
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "source": [
|
|
|
+ "training_set_clean = pd.concat([training_set, word_counts], axis=1)\n",
|
|
|
+ "training_set_clean.head()"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "## Calculating Constants First\n",
|
|
|
+ "\n",
|
|
|
+ "We're now done with cleaning the training set, and we can begin creating the spam filter. The Naive Bayes algorithm will need to answer these two probabilities questions to be able to classify new messages:\n",
|
|
|
+ "\n",
|
|
|
+ "\\begin{equation}\n",
|
|
|
+ "P(Spam | w_1,w_2, ..., w_n) = P(Spam) \\cdot \\prod_{i=1}^{n}P(w_i|Spam)\n",
|
|
|
+ "\\end{equation}\n",
|
|
|
+ "\n",
|
|
|
+ "\\begin{equation}\n",
|
|
|
+ "P(Ham | w_1,w_2, ..., w_n) = P(Ham) \\cdot \\prod_{i=1}^{n}P(w_i|Ham)\n",
|
|
|
+ "\\end{equation}\n",
|
|
|
+ "\n",
|
|
|
+ "Also, to calculate $P(w_i|Spam)$ and $P(w_i|Ham)$ inside the formulas above (where $w_i$ can be any word), we'll need to use:\n",
|
|
|
+ "\n",
|
|
|
+ "\\begin{equation}\n",
|
|
|
+ "P(w_i|Spam) = \\frac{card(w_i|Spam) + \\alpha}{card(Spam) + \\alpha \\cdot card(Vocabulary)}\n",
|
|
|
+ "\\end{equation}\n",
|
|
|
+ "\n",
|
|
|
+ "\\begin{equation}\n",
|
|
|
+ "P(w_i|Ham) = \\frac{card(w_i|Ham) + \\alpha}{card(Ham) + \\alpha \\cdot card(Vocabulary)}\n",
|
|
|
+ "\\end{equation}\n",
|
|
|
+ "\n",
|
|
|
+ "Some of the terms in the four equations above will have the same value for every new message or word. We can calculate the value of these terms once and avoid doing the computations again when a new messages comes in. Below, we'll use our training set to calculate:\n",
|
|
|
+ "\n",
|
|
|
+ "- P(Spam) and P(Ham)\n",
|
|
|
+ "- card(Spam), card(Ham), and card(Vocabulary)\n",
|
|
|
+ "\n",
|
|
|
+ "We'll also use Laplace smoothing and set $\\alpha = 1$."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 13,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "# P(Spam) and P(Ham)\n",
|
|
|
+ "p_spam = training_set_clean['Label'].value_counts(normalize=True)['spam']\n",
|
|
|
+ "p_ham = training_set_clean['Label'].value_counts(normalize=True)['ham']\n",
|
|
|
+ "\n",
|
|
|
+ "# card(Spam), card(Ham), and card(Vocabulary)\n",
|
|
|
+ "card_spam = training_set_clean[training_set_clean['Label'] == 'spam'].sum(\n",
|
|
|
+ " axis=1).sum()\n",
|
|
|
+ "card_ham = training_set_clean[training_set_clean['Label'] == 'ham'].sum(\n",
|
|
|
+ " axis=1).sum()\n",
|
|
|
+ "card_vocabulary = len(vocabulary)\n",
|
|
|
+ "\n",
|
|
|
+ "# Laplace smoothing\n",
|
|
|
+ "alpha = 1"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "## Calculating Parameters\n",
|
|
|
+ "\n",
|
|
|
+ "Now that we have the constant terms calculated above, we can move on with calculating the parameters $P(w_i|Spam)$ and $P(w_i|Ham)$. Each parameter will thus be a conditional probability value associated with each word in the vocabulary.\n",
|
|
|
+ "\n",
|
|
|
+ "The parameters are calculated using the formulas:\n",
|
|
|
+ "\n",
|
|
|
+ "\\begin{equation}\n",
|
|
|
+ "P(w_i|Spam) = \\frac{card(w_i|Spam) + \\alpha}{card(Spam) + \\alpha \\cdot card(Vocabulary)}\n",
|
|
|
+ "\\end{equation}\n",
|
|
|
+ "\n",
|
|
|
+ "\\begin{equation}\n",
|
|
|
+ "P(w_i|Ham) = \\frac{card(w_i|Ham) + \\alpha}{card(Ham) + \\alpha \\cdot card(Vocabulary)}\n",
|
|
|
+ "\\end{equation}"
|
|
|
+ ]
|
|
|
+ },
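+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To make the smoothing concrete, here's a toy example with made-up numbers: if a word occurred 10 times across all spam messages, and the spam messages contained 15,000 words in total, then with $\\alpha = 1$ and our 7,783-word vocabulary we'd get:\n",
+    "\n",
+    "\\begin{equation}\n",
+    "P(w_i|Spam) = \\frac{10 + 1}{15000 + 1 \\cdot 7783} = \\frac{11}{22783} \\approx 0.00048\n",
+    "\\end{equation}\n",
+    "\n",
+    "Notice that even a word that never occurs in spam gets a small non-zero probability. That's the whole point of the smoothing: it keeps a single unseen word from zeroing out the entire product."
+   ]
+  },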
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 14,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "# Initiate parameters\n",
|
|
|
+ "parameters_spam = {unique_word:0 for unique_word in vocabulary}\n",
|
|
|
+ "parameters_ham = {unique_word:0 for unique_word in vocabulary}\n",
|
|
|
+ "\n",
|
|
|
+ "# Isolate spam and ham messages for the loop below\n",
|
|
|
+ "# Don't do this inside the loop, it'll add to code running time significantly\n",
|
|
|
+ "spam_messages = training_set_clean[training_set_clean['Label'] == 'spam']\n",
|
|
|
+ "ham_messages = training_set_clean[training_set_clean['Label'] == 'ham']\n",
|
|
|
+ "\n",
|
|
|
+ "# Calculate parameters\n",
|
|
|
+ "for word in vocabulary:\n",
|
|
|
+ " card_word_given_spam = spam_messages[word].sum()\n",
|
|
|
+ " p_word_given_spam = (card_word_given_spam + alpha) / (card_spam + alpha*card_vocabulary)\n",
|
|
|
+ " parameters_spam[word] = p_word_given_spam\n",
|
|
|
+ " \n",
|
|
|
+ " card_word_given_ham = ham_messages[word].sum()\n",
|
|
|
+ " p_word_given_ham = (card_word_given_ham + alpha) / (card_ham + alpha*card_vocabulary)\n",
|
|
|
+ " parameters_ham[word] = p_word_given_ham"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "## Classifying A New Message\n",
|
|
|
+ "\n",
|
|
|
+ "Now that we have all our parameters calculate, we can start creating our spam filter. The spam filter can be understood as a function that:\n",
|
|
|
+ "\n",
|
|
|
+ "- Takes in as input a new message (w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>).\n",
|
|
|
+ "- Calculates P(Spam|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>) and P(Ham|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>).\n",
|
|
|
+ "- Compares the values of P(Spam|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>) and P(Ham|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>), and:\n",
|
|
|
+ " - If P(Ham|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>) > P(Spam|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>), then the message is classified as ham.\n",
|
|
|
+ " - If P(Ham|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>) < P(Spam|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>), then the message is classified as spam.\n",
|
|
|
+ " - If P(Ham|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>) = P(Spam|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>), then the algorithm may request human help."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 15,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "import re\n",
|
|
|
+ "\n",
|
|
|
+ "def classify(message):\n",
|
|
|
+ " '''\n",
|
|
|
+ " message: a string\n",
|
|
|
+ " '''\n",
|
|
|
+ " \n",
|
|
|
+ " message = re.sub('\\W', ' ', message)\n",
|
|
|
+ " message = message.lower().split()\n",
|
|
|
+ " \n",
|
|
|
+ " p_spam_given_message = p_spam\n",
|
|
|
+ " p_ham_given_message = p_ham\n",
|
|
|
+ "\n",
|
|
|
+ " for word in message:\n",
|
|
|
+ " if word in parameters_spam:\n",
|
|
|
+ " p_spam_given_message *= parameters_spam[word]\n",
|
|
|
+ " \n",
|
|
|
+ " if word in parameters_ham:\n",
|
|
|
+ " p_ham_given_message *= parameters_ham[word]\n",
|
|
|
+ " \n",
|
|
|
+ " print('P(Spam|message):', p_spam_given_message)\n",
|
|
|
+ " print('P(Ham|message):', p_ham_given_message)\n",
|
|
|
+ " \n",
|
|
|
+ " if p_ham_given_message > p_spam_given_message:\n",
|
|
|
+ " print('Label: Ham')\n",
|
|
|
+ " elif p_ham_given_message < p_spam_given_message:\n",
|
|
|
+ " print('Label: Spam')\n",
|
|
|
+ " else:\n",
|
|
|
+ " print('Equal proabilities, have a human classify this!')"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 16,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [
|
|
|
+ {
|
|
|
+ "name": "stdout",
|
|
|
+ "output_type": "stream",
|
|
|
+ "text": [
|
|
|
+ "P(Spam|message): 1.3481290211300841e-25\n",
|
|
|
+ "P(Ham|message): 1.9368049028589875e-27\n",
|
|
|
+ "Label: Spam\n"
|
|
|
+ ]
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "source": [
|
|
|
+ "classify('WINNER!! This is the secret code to unlock the money: C3421.')"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 17,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [
|
|
|
+ {
|
|
|
+ "name": "stdout",
|
|
|
+ "output_type": "stream",
|
|
|
+ "text": [
|
|
|
+ "P(Spam|message): 2.4372375665888117e-25\n",
|
|
|
+ "P(Ham|message): 3.687530435009238e-21\n",
|
|
|
+ "Label: Ham\n"
|
|
|
+ ]
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "source": [
|
|
|
+ "classify(\"Sounds good, Tom, then see u there\")"
|
|
|
+ ]
|
|
|
+ },
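+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A quick aside on the tiny numbers printed above: since every $P(w_i|Spam)$ is far below 1, the products shrink very fast, and a long enough message could underflow to 0.0 for both classes. A common workaround is to compare sums of logarithms instead of products of probabilities. The sketch below shows one way to do that; it's an optional variant that the rest of the project doesn't rely on, and it reuses `p_spam`, `p_ham`, `parameters_spam`, and `parameters_ham` from the cells above."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from math import log\n",
+    "\n",
+    "def classify_log(message):\n",
+    "    '''Log-space variant of classify(): summing logs avoids the\n",
+    "    floating-point underflow that multiplying many probabilities\n",
+    "    smaller than 1 can cause on long messages.'''\n",
+    "    message = re.sub(r'\\\\W', ' ', message)\n",
+    "    message = message.lower().split()\n",
+    "\n",
+    "    # Laplace smoothing guarantees every parameter is > 0,\n",
+    "    # so taking logarithms is always safe\n",
+    "    log_p_spam = log(p_spam)\n",
+    "    log_p_ham = log(p_ham)\n",
+    "\n",
+    "    for word in message:\n",
+    "        if word in parameters_spam:\n",
+    "            log_p_spam += log(parameters_spam[word])\n",
+    "        if word in parameters_ham:\n",
+    "            log_p_ham += log(parameters_ham[word])\n",
+    "\n",
+    "    if log_p_ham > log_p_spam:\n",
+    "        return 'ham'\n",
+    "    elif log_p_spam > log_p_ham:\n",
+    "        return 'spam'\n",
+    "    else:\n",
+    "        return 'needs human classification'"
+   ]
+  },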
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "## Measuring the Spam Filter's Accuracy\n",
|
|
|
+ "\n",
|
|
|
+ "The two results above look promising, but let's see how well the filter does on our test set, which has 1,114 messages.\n",
|
|
|
+ "\n",
|
|
|
+ "We'll start by writing a function that returns classification labels instead of print them."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 18,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "def classify_test_set(message): \n",
|
|
|
+ " '''\n",
|
|
|
+ " message: a string\n",
|
|
|
+ " '''\n",
|
|
|
+ " \n",
|
|
|
+ " message = re.sub('\\W', ' ', message)\n",
|
|
|
+ " message = message.lower().split()\n",
|
|
|
+ " \n",
|
|
|
+ " p_spam_given_message = p_spam\n",
|
|
|
+ " p_ham_given_message = p_ham\n",
|
|
|
+ "\n",
|
|
|
+ " for word in message:\n",
|
|
|
+ " if word in parameters_spam:\n",
|
|
|
+ " p_spam_given_message *= parameters_spam[word]\n",
|
|
|
+ " \n",
|
|
|
+ " if word in parameters_ham:\n",
|
|
|
+ " p_ham_given_message *= parameters_ham[word]\n",
|
|
|
+ " \n",
|
|
|
+ " if p_ham_given_message > p_spam_given_message:\n",
|
|
|
+ " return 'ham'\n",
|
|
|
+ " elif p_spam_given_message > p_ham_given_message:\n",
|
|
|
+ " return 'spam'\n",
|
|
|
+ " else:\n",
|
|
|
+ " return 'needs human classification'"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "Now that we have a function that returns labels instead of printing them, we can use it to create a new column in our test set."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 19,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [
|
|
|
+ {
|
|
|
+ "data": {
|
|
|
+ "text/html": [
|
|
|
+ "<div>\n",
|
|
|
+ "<style scoped>\n",
|
|
|
+ " .dataframe tbody tr th:only-of-type {\n",
|
|
|
+ " vertical-align: middle;\n",
|
|
|
+ " }\n",
|
|
|
+ "\n",
|
|
|
+ " .dataframe tbody tr th {\n",
|
|
|
+ " vertical-align: top;\n",
|
|
|
+ " }\n",
|
|
|
+ "\n",
|
|
|
+ " .dataframe thead th {\n",
|
|
|
+ " text-align: right;\n",
|
|
|
+ " }\n",
|
|
|
+ "</style>\n",
|
|
|
+ "<table border=\"1\" class=\"dataframe\">\n",
|
|
|
+ " <thead>\n",
|
|
|
+ " <tr style=\"text-align: right;\">\n",
|
|
|
+ " <th></th>\n",
|
|
|
+ " <th>Label</th>\n",
|
|
|
+ " <th>SMS</th>\n",
|
|
|
+ " <th>predicted</th>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " </thead>\n",
|
|
|
+ " <tbody>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>0</th>\n",
|
|
|
+ " <td>ham</td>\n",
|
|
|
+ " <td>Later i guess. I needa do mcat study too.</td>\n",
|
|
|
+ " <td>ham</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>1</th>\n",
|
|
|
+ " <td>ham</td>\n",
|
|
|
+ " <td>But i haf enuff space got like 4 mb...</td>\n",
|
|
|
+ " <td>ham</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>2</th>\n",
|
|
|
+ " <td>spam</td>\n",
|
|
|
+ " <td>Had your mobile 10 mths? Update to latest Oran...</td>\n",
|
|
|
+ " <td>spam</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>3</th>\n",
|
|
|
+ " <td>ham</td>\n",
|
|
|
+ " <td>All sounds good. Fingers . Makes it difficult ...</td>\n",
|
|
|
+ " <td>ham</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>4</th>\n",
|
|
|
+ " <td>ham</td>\n",
|
|
|
+ " <td>All done, all handed in. Don't know if mega sh...</td>\n",
|
|
|
+ " <td>ham</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " </tbody>\n",
|
|
|
+ "</table>\n",
|
|
|
+ "</div>"
|
|
|
+ ],
|
|
|
+ "text/plain": [
|
|
|
+ " Label SMS predicted\n",
|
|
|
+ "0 ham Later i guess. I needa do mcat study too. ham\n",
|
|
|
+ "1 ham But i haf enuff space got like 4 mb... ham\n",
|
|
|
+ "2 spam Had your mobile 10 mths? Update to latest Oran... spam\n",
|
|
|
+ "3 ham All sounds good. Fingers . Makes it difficult ... ham\n",
|
|
|
+ "4 ham All done, all handed in. Don't know if mega sh... ham"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ "execution_count": 19,
|
|
|
+ "metadata": {},
|
|
|
+ "output_type": "execute_result"
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "source": [
|
|
|
+ "test_set['predicted'] = test_set['SMS'].apply(classify_test_set)\n",
|
|
|
+ "test_set.head()"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "Now, we'll write a function to measure the accuracy of our spam filter to find out how well our spam filter does."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 20,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [
|
|
|
+ {
|
|
|
+ "name": "stdout",
|
|
|
+ "output_type": "stream",
|
|
|
+ "text": [
|
|
|
+ "Correct: 1100\n",
|
|
|
+ "Incorrect: 14\n",
|
|
|
+ "Accuracy: 0.9874326750448833\n"
|
|
|
+ ]
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "source": [
|
|
|
+ "correct = 0\n",
|
|
|
+ "total = test_set.shape[0]\n",
|
|
|
+ " \n",
|
|
|
+ "for row in test_set.iterrows():\n",
|
|
|
+ " row = row[1]\n",
|
|
|
+ " if row['Label'] == row['predicted']:\n",
|
|
|
+ " correct += 1\n",
|
|
|
+ " \n",
|
|
|
+ "print('Correct:', correct)\n",
|
|
|
+ "print('Incorrect:', total - correct)\n",
|
|
|
+ "print('Accuracy:', correct/total)"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "The accuracy is close to 98.74%, which is really good. Our spam filter looked at 1,114 messages that it hasn't seen in training, and classified 1,100 correctly."
|
|
|
+ ]
|
|
|
+ },
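+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Before wrapping up, it's worth knowing how to pull out the 14 misclassified messages, since inspecting them is the first item in the next steps below. A boolean filter like the sketch in the next cell is all it takes."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Rows where the true label and the prediction disagree\n",
+    "misclassified = test_set[test_set['Label'] != test_set['predicted']]\n",
+    "misclassified"
+   ]
+  },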
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "## Next Steps\n",
|
|
|
+ "\n",
|
|
|
+ "In this project, we managed to build a spam filter for SMS messages using the multinomial Naive Bayes algorithm. The filter had an accuracy of 98.74% on the test set we used, which is a pretty good result. Our initial goal was an accuracy of over 80%, and we managed to do way better than that.\n",
|
|
|
+ "\n",
|
|
|
+ "Next steps include:\n",
|
|
|
+ "\n",
|
|
|
+ "- Analyze the 14 messages that were classified incorrectly and try to figure out why the algorithm classified them incorrectly\n",
|
|
|
+ "- Make the filtering process more complex by making the algorithm sensitive to letter case"
|
|
|
+ ]
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "metadata": {
|
|
|
+ "kernelspec": {
|
|
|
+ "display_name": "Python 3",
|
|
|
+ "language": "python",
|
|
|
+ "name": "python3"
|
|
|
+ },
|
|
|
+ "language_info": {
|
|
|
+ "codemirror_mode": {
|
|
|
+ "name": "ipython",
|
|
|
+ "version": 3
|
|
|
+ },
|
|
|
+ "file_extension": ".py",
|
|
|
+ "mimetype": "text/x-python",
|
|
|
+ "name": "python",
|
|
|
+ "nbconvert_exporter": "python",
|
|
|
+ "pygments_lexer": "ipython3",
|
|
|
+ "version": "3.7.3"
|
|
|
+ }
|
|
|
+ },
|
|
|
+ "nbformat": 4,
|
|
|
+ "nbformat_minor": 4
|
|
|
+}
|