{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "Mission730Solutions.ipynb", "provenance": [] }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "source": [ "# Purpose of Notebook\n", "\n", "The purpose of this notebook is to offer an example answer to the Guided Project for the Linear Regression in Python course. Since the choice of model predictors is up to the student, results can differ. Use this solution as a guide for structuring your own answer." ], "metadata": { "id": "OsSvDQ0Y3yNk" } }, { "cell_type": "code", "source": [ "import pandas as pd\n", "import numpy as np\n", "from sklearn.linear_model import LinearRegression\n", "from sklearn.metrics import mean_squared_error, r2_score\n", "from sklearn.model_selection import train_test_split" ], "metadata": { "id": "lmpsg5fi4t8d" }, "execution_count": 1, "outputs": [] }, { "cell_type": "code", "source": [ "# Load in the insurance dataset\n", "insurance = pd.read_csv(\"insurance.csv\")" ], "metadata": { "id": "y6aqBHg-5cJT" }, "execution_count": 2, "outputs": [] }, { "cell_type": "markdown", "source": [ "# Exploring The Dataset" ], "metadata": { "id": "wS1GbH1g7ta5" } }, { "cell_type": "code", "source": [ "# Columns in the dataset\n", "insurance.columns" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "JuqpqwPd5l6d", "outputId": "470c6b14-184b-4489-e151-0a6cfe235d08" }, "execution_count": 3, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'], dtype='object')" ] }, "metadata": {}, "execution_count": 3 } ] }, { "cell_type": "markdown", "source": [ "The `charges` column is our outcome, while all other columns are potential predictors to use in the model." ], "metadata": { "id": "M5zR7HAr8tsb" } }, { "cell_type": "code", "source": [ "insurance.hist(\"charges\")" ], 
"metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 316 }, "id": "SMObDGAlOci0", "outputId": "82673471-fdbe-48fe-fc14-5974fab5f95f" }, "execution_count": 4, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "array([[]],\n", " dtype=object)" ] }, "metadata": {}, "execution_count": 4 }, { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXcAAAEICAYAAACktLTqAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAVTklEQVR4nO3df7BndX3f8edLVpCyyoLQ2xW2LlZqxkiDcIM4puldGRNAGpyOsVgTgZLZToKNGWnjmszUJtMfaGqMTFLjRoxrJ8lKUQMFbSTopnE6oGxUFkTK8qvuDrKiQFz8UdF3/7ifNd+93N3vvdyf55PnY+Y795zP+ZzzfX/u9/C63/2c7/eQqkKS1JdnrHQBkqTFZ7hLUocMd0nqkOEuSR0y3CWpQ4a7JHXIcFe3klyS5DMrXYe0Egx3SeqQ4S7NQZI1K12DNB+Gu7qQZEOSjyb5WpKvJ/ndkW3/JcmjSe5Pct5I+6VJ7kryzST3JflXI9umkuxJ8tYkXwX+MMnRSba1Y92V5FeT7BnZ53lJPtJquD/JL49sOyvJbUn+OsnDSX57GX4t+lvMcNfgJTkCuAF4ENgInARsb5tfBtwNnAC8E7g6Sdq2fcAFwHOAS4F3Jzlj5NB/DzgeeD6wGXh7O/4LgFcBPzdSwzOA/wF8sT3/OcCvJPnp1uU9wHuq6jnAPwCuWZTBS4dguKsHZwHPA/5tVT1RVd+pqgMXUh+sqj+oqu8D24D1wARAVd1YVffWtL8APgn845Hj/gB4e1V9t6q+DbwO+E9V9WhV7QGuGun748CJVfWbVfX/quo+4A+Ai9r27wEvTHJCVe2vqluW5DchNYa7erCB6RB/cpZtXz2wUFXfaotrAZKcl+SWJN9I8hhwPtPv8A/4WlV9Z2T9ecBXRtZHl58PPC/JYwcewK/R/pAAlwH/EPhyks8luWD+w5TmzotE6sFXgL+fZM0hAv4pkhwFfAR4I3BdVX0vyZ8CGek285apDwEnA19q6xtm1HB/VZ062/NV1T3A69v0zT8Drk3y3Kp6Yi71SvPlO3f14LNMB++VSY5J8qwkrxizz5HAUcDXgCfbhdafGrPPNcDbkhyX5CTgTTNq+Ga7AHt0kiOSvCTJjwMk+bkkJ1bVD4DH2j4/mN8wpbkz3DV4bT79nwIvBP4vsAf452P2+Sbwy0wH9qPAvwCuH/NUv9mOfT/w58C1wHdHargAOL1tfwR4P3Bs2/dc4M4k+5m+uHpRm8eXlkT8n3VIT0+SX2Q6pP/JStcizeQ7d2mOkqxP8ookz0jyIuAK4GMrXZc0Gy+oSnN3JPA+4BSm5823A/91RSuSDsFpGUnqkNMyktShVTEtc8IJJ9TGjRvH9nviiSc45phjlr6gJTL0+sExrAZDrx+GP4bVUv/OnTsfqaoTZ9u2KsJ948aN3HbbbWP77dixg6mpqaUvaIkMvX5wDKvB0OuH4Y9htdSf5MFDbXNaRpI6ZLhLUocMd0nqkOEuSR0y3CWpQ4a7JHXIcJekDhnuktQhw12SOrQqvqG6EBu33Lhiz/3Ala9eseeWpMPxnbskdchwl6QOGe6S1CHDXZI6ZLhLUocMd0nqkOEuSR0y3CWpQ4a7JHVoTuGe5IEku5J8Icltre34JDcluaf9PK61J8lVSXYnuT3JGUs5AEnSU83nnfumqjq9qibb+hbg5qo6Fbi5rQOcB5zaHpuB9y5WsZKkuVnItMyFwLa2vA14zUj7h2raLcC6JOsX8DySpHmaa7gX8MkkO5Nsbm0TVfVQW/4qMNGWTwK+MrLvntYmSVomqarxnZKTqmpvkr8L3AT8a+D6qlo30ufRqjouyQ3AlVX1mdZ+M/DWqrptxjE3Mz1tw8TExJnbt28fW8f+/ftZu3btQW279j4+dr+lctpJx86r/2z1D41jWHlDrx+GP4bVUv+mTZt2jkyVH2ROt/ytqr3t574kHwPOAh5Os
r6qHmrTLvta973AhpHdT25tM4+5FdgKMDk5WVNTU2Pr2LFjBzP7XbKSt/x9w9S8+s9W/9A4hpU39Pph+GMYQv1jp2WSHJPk2QeWgZ8C7gCuBy5u3S4GrmvL1wNvbJ+aORt4fGT6RpK0DObyzn0C+FiSA/3/uKr+Z5LPAdckuQx4EHhd6/9x4HxgN/At4NJFr1qSdFhjw72q7gN+bJb2rwPnzNJewOWLUp0k6WnxG6qS1CHDXZI6ZLhLUocMd0nqkOEuSR0y3CWpQ4a7JHXIcJekDhnuktQhw12SOmS4S1KHDHdJ6pDhLkkdMtwlqUOGuyR1yHCXpA4Z7pLUIcNdkjpkuEtShwx3SeqQ4S5JHTLcJalDhrskdchwl6QOGe6S1CHDXZI6ZLhLUocMd0nqkOEuSR0y3CWpQ3MO9yRHJPl8khva+ilJbk2yO8mHkxzZ2o9q67vb9o1LU7ok6VDm8879zcBdI+vvAN5dVS8EHgUua+2XAY+29ne3fpKkZTSncE9yMvBq4P1tPcArgWtbl23Aa9ryhW2dtv2c1l+StExSVeM7JdcC/xl4NvBvgEuAW9q7c5JsAD5RVS9JcgdwblXtadvuBV5WVY/MOOZmYDPAxMTEmdu3bx9bx/79+1m7du1Bbbv2Pj52v6Vy2knHzqv/bPUPjWNYeUOvH4Y/htVS/6ZNm3ZW1eRs29aM2znJBcC+qtqZZGqxiqqqrcBWgMnJyZqaGn/oHTt2MLPfJVtuXKyS5m/XE/PqfsVp3+ddn5nfPrN54MpXL/gYT9dsr8HQDH0MQ68fhj+GIdQ/NtyBVwA/k+R84FnAc4D3AOuSrKmqJ4GTgb2t/15gA7AnyRrgWODri165JOmQxs65V9XbqurkqtoIXAR8qqreAHwaeG3rdjFwXVu+vq3Ttn+q5jL3I0laNAv5nPtbgbck2Q08F7i6tV8NPLe1vwXYsrASJUnzNZdpmR+qqh3AjrZ8H3DWLH2+A/zsItQmSXqa/IaqJHXIcJekDhnuktQhw12SOmS4S1KHDHdJ6pDhLkkdMtwlqUOGuyR1yHCXpA4Z7pLUIcNdkjpkuEtShwx3SeqQ4S5JHTLcJalDhrskdchwl6QOGe6S1CHDXZI6ZLhLUocMd0nqkOEuSR0y3CWpQ4a7JHXIcJekDhnuktQhw12SOmS4S1KHDHdJ6tDYcE/yrCSfTfLFJHcm+Y3WfkqSW5PsTvLhJEe29qPa+u62fePSDkGSNNNc3rl/F3hlVf0YcDpwbpKzgXcA766qFwKPApe1/pcBj7b2d7d+kqRlNDbca9r+tvrM9ijglcC1rX0b8Jq2fGFbp20/J0kWrWJJ0lipqvGdkiOAncALgd8Dfgu4pb07J8kG4BNV9ZIkdwDnVtWetu1e4GVV9ciMY24GNgNMTEycuX379rF17N+/n7Vr1x7Utmvv42P3Wy0mjoaHv73w45x20rELP8jTNNtrMDRDH8PQ64fhj2G11L9p06adVTU527Y1czlAVX0fOD3JOuBjwI8stKiq2gpsBZicnKypqamx++zYsYOZ/S7ZcuNCS1k2V5z2JO/aNadf+WE98IaphRfzNM32GgzN0Mcw9Pph+GMYQv3z+rRMVT0GfBp4ObAuyYGkOhnY25b3AhsA2vZjga8vSrWSpDmZy6dlTmzv2ElyNPAq4C6mQ/61rdvFwHVt+fq2Ttv+qZrL3I8kadHMZY5gPbCtzbs/A7imqm5I8iVge5L/AHweuLr1vxr4b0l2A98ALlqCuiVJhzE23KvqduCls7TfB5w1S/t3gJ9dlOokSU+L31CVpA4Z7pLUIcNdkjpkuEtShwx3SeqQ4S5JHTLcJalDhrskdchwl6QOGe6S1CHDXZI6ZLhLUocMd0nqkOEuSR0y3CWpQ4a7JHXIcJekDhnuktQhw12SOmS4S1KHDHdJ6pDhLkkdMtwlqUOGuyR1yHCXpA4Z7pLUIcNdkjpkuEtShwx3SerQ2HBPsiHJp5N8KcmdSd7c2
o9PclOSe9rP41p7klyVZHeS25OcsdSDkCQdbC7v3J8ErqiqFwNnA5cneTGwBbi5qk4Fbm7rAOcBp7bHZuC9i161JOmwxoZ7VT1UVX/Vlr8J3AWcBFwIbGvdtgGvacsXAh+qabcA65KsX/TKJUmHNK859yQbgZcCtwITVfVQ2/RVYKItnwR8ZWS3Pa1NkrRMUlVz65isBf4C+I9V9dEkj1XVupHtj1bVcUluAK6sqs+09puBt1bVbTOOt5npaRsmJibO3L59+9ga9u/fz9q1aw9q27X38TnVvxpMHA0Pf3vhxzntpGMXfpCnabbXYGiGPoah1w/DH8NqqX/Tpk07q2pytm1r5nKAJM8EPgL8UVV9tDU/nGR9VT3Upl32tfa9wIaR3U9ubQepqq3AVoDJycmampoaW8eOHTuY2e+SLTfOZQirwhWnPcm7ds3pV35YD7xhauHFPE2zvQZDM/QxDL1+GP4YhlD/XD4tE+Bq4K6q+u2RTdcDF7fli4HrRtrf2D41czbw+Mj0jSRpGczlbeQrgJ8HdiX5Qmv7NeBK4JoklwEPAq9r2z4OnA/sBr4FXLqoFUuSxhob7m3uPIfYfM4s/Qu4fIF1SZIWwG+oSlKHDHdJ6pDhLkkdMtwlqUOGuyR1yHCXpA4Z7pLUIcNdkjpkuEtShwx3SeqQ4S5JHTLcJalDhrskdchwl6QOGe6S1CHDXZI6ZLhLUocMd0nqkOEuSR0y3CWpQ4a7JHXIcJekDq1Z6QI0fxu33Lhiz/3Bc49ZseeWNHe+c5ekDhnuktQhw12SOmS4S1KHDHdJ6pDhLkkdMtwlqUNjwz3JB5LsS3LHSNvxSW5Kck/7eVxrT5KrkuxOcnuSM5ayeEnS7Obyzv2DwLkz2rYAN1fVqcDNbR3gPODU9tgMvHdxypQkzcfYcK+q/wV8Y0bzhcC2trwNeM1I+4dq2i3AuiTrF6tYSdLcpKrGd0o2AjdU1Uva+mNVta4tB3i0qtYluQG4sqo+07bdDLy1qm6b5ZibmX53z8TExJnbt28fW8f+/ftZu3btQW279j4+dr/VYuJoePjbK13Fwpxy7BFPeQ2GZrbzaEiGXj8Mfwyrpf5NmzbtrKrJ2bYt+N4yVVVJxv+FeOp+W4GtAJOTkzU1NTV2nx07djCz3yUreJ+V+britCd5165h387ng+ce85TXYGhmO4+GZOj1w/DHMIT6n+6nZR4+MN3Sfu5r7XuBDSP9Tm5tkqRl9HTD/Xrg4rZ8MXDdSPsb26dmzgYer6qHFlijJGmexs4RJPkTYAo4Icke4O3AlcA1SS4DHgRe17p/HDgf2A18C7h0CWqWJI0xNtyr6vWH2HTOLH0LuHyhRUmSFmbYV/e07HbtfXxFLmI/cOWrl/05pSHz9gOS1CHDXZI6ZLhLUocMd0nqkOEuSR0y3CWpQ4a7JHXIcJekDhnuktQhv6EqHcbGFbyltN/K1UIY7tIqdag/LFec9uSS3gLCPyp9cFpGkjpkuEtShwx3SeqQ4S5JHTLcJalDhrskdchwl6QO+Tl3DcJifploqT8nLq0GhrukgyzHt3IP9QfWL1AtHqdlJKlDhrskdchwl6QOGe6S1CHDXZI6ZLhLUocMd0nqkOEuSR0y3CWpQ0sS7knOTXJ3kt1JtizFc0iSDm3Rbz+Q5Ajg94BXAXuAzyW5vqq+tNjPJUmLYb63XFjM+xMt1S0XluLeMmcBu6vqPoAk24ELAcNd0mEtx31t/rZIVS3uAZPXAudW1S+09Z8HXlZVb5rRbzOwua2+CLh7Doc/AXhkEctdbkOvHxzDajD0+mH4Y1gt9T+/qk6cbcOK3RWyqrYCW+ezT5LbqmpyiUpackOvHxzDajD0+mH4YxhC/UtxQXUvsGFk/eTWJklaJksR7p8DTk1ySpIjgYuA65fgeSRJh7Do0zJV9WSSNwF/BhwBfKCq7lykw89rGmcVGnr94BhWg6HXD8Mfw6qvf9EvqEqSVp7fUJWkDhnuktShQ
YT7arudQZIPJNmX5I6RtuOT3JTknvbzuNaeJFe12m9PcsbIPhe3/vckuXik/cwku9o+VyXJIte/Icmnk3wpyZ1J3jzAMTwryWeTfLGN4Tda+ylJbm3P++F2UZ8kR7X13W37xpFjva21353kp0fal/y8S3JEks8nuWGg9T/QXucvJLmttQ3pPFqX5NokX05yV5KXD6n+w6qqVf1g+qLsvcALgCOBLwIvXuGafhI4A7hjpO2dwJa2vAV4R1s+H/gEEOBs4NbWfjxwX/t5XFs+rm37bOubtu95i1z/euCMtvxs4P8ALx7YGAKsbcvPBG5tz3cNcFFr/33gF9vyLwG/35YvAj7cll/czqmjgFPauXbEcp13wFuAPwZuaOtDq/8B4IQZbUM6j7YBv9CWjwTWDan+w45tuZ5oAb/8lwN/NrL+NuBtq6CujRwc7ncD69vyeuDutvw+4PUz+wGvB9430v6+1rYe+PJI+0H9lmgs1zF9L6BBjgH4O8BfAS9j+luDa2aeO0x/euvlbXlN65eZ59OBfstx3jH9HZCbgVcCN7R6BlN/O+4DPDXcB3EeAccC99M+WDK0+sc9hjAtcxLwlZH1Pa1ttZmoqofa8leBibZ8qPoP175nlvYl0f55/1Km3/kOagxtSuMLwD7gJqbfqT5WVU/O8rw/rLVtfxx47pgxLPV59zvArwI/aOvPHVj9AAV8MsnOTN9SBIZzHp0CfA34wzY19v4kxwyo/sMaQrgPTk3/mV71nzFNshb4CPArVfXXo9uGMIaq+n5Vnc70O+CzgB9Z4ZLmLMkFwL6q2rnStSzQT1TVGcB5wOVJfnJ04yo/j9YwPb363qp6KfAE09MwP7TK6z+sIYT7UG5n8HCS9QDt577Wfqj6D9d+8iztiyrJM5kO9j+qqo8OcQwHVNVjwKeZnopYl+TAl/NGn/eHtbbtxwJfZ/5jWyyvAH4myQPAdqanZt4zoPoBqKq97ec+4GNM/5Edynm0B9hTVbe29WuZDvuh1H94yzX/s4B5sTVMX6A4hb+5MPSjq6CujRw85/5bHHwR5p1t+dUcfBHms639eKbn+45rj/uB49u2mRdhzl/k2gN8CPidGe1DGsOJwLq2fDTwl8AFwH/n4AuSv9SWL+fgC5LXtOUf5eALkvcxfTFy2c47YIq/uaA6mPqBY4Bnjyz/b+DcgZ1Hfwm8qC3/+1b7YOo/7NiW64kW+AKcz/QnOu4Ffn0V1PMnwEPA95j+638Z0/OfNwP3AH8+8uKG6f95yb3ALmBy5Dj/EtjdHpeOtE8Cd7R9fpcZF3wWof6fYPqfmrcDX2iP8wc2hn8EfL6N4Q7g37X2F7T/oHYzHZRHtfZntfXdbfsLRo71663Ouxn5NMNynXccHO6Dqb/V+sX2uPPAcwzsPDoduK2dR3/KdDgPpv7DPbz9gCR1aAhz7pKkeTLcJalDhrskdchwl6QOGe6S1CHDXZI6ZLhLUof+P+/jGHQlmobrAAAAAElFTkSuQmCC\n" }, "metadata": { "needs_background": "light" } } ] }, { "cell_type": "markdown", "source": [ "The `charges` column is highly skewed to the right. Extremely costly insurance charges are more common than extremely small ones. This makes it unlikely that the errors in the model will truly be centered at zero. It might be worth it to log-transform the outcome." 
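, "\n", "\n", "A minimal sketch of the idea, where $y$ stands for `charges`, $y'$ for the transformed outcome, and the base $b$ is a free choice (these symbols are introduced here for illustration only):\n", "\n", "$$y' = \\log_b(y), \\qquad y = b^{y'}$$\n", "\n", "Whatever base is chosen, the same base has to be used when mapping predictions back to the original scale." 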
], "metadata": { "id": "5rSOAfQHOr_8" } }, { "cell_type": "code", "source": [ "# Log-transform the outcome (base 2) to reduce the right skew\n", "insurance[\"log_charges\"] = np.log2(insurance[\"charges\"])\n", "\n", "insurance.hist(\"log_charges\")" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 316 }, "id": "FG2IyzGFPHF2", "outputId": "c0f8e3d8-57fd-4467-e164-6797960c871f" }, "execution_count": 5, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "array([[]],\n", " dtype=object)" ] }, "metadata": {}, "execution_count": 5 }, { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXcAAAEICAYAAACktLTqAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAATr0lEQVR4nO3df7DldX3f8edLiQTZFKSLt7hsc60saZCNTLgl2KQzd0MmIjpZaBMGyxhQM+uk2Il1U7NqapxY2m0USTM2pOvAQKJxZaIGIpiIjLfUmVAFhrggWrZ6UTbLMhQEVi3twrt/nC/Jye79tfece8+5H5+PmTP3+/18v+f7fb+5e1/3ez/new6pKiRJbXnBqAuQJA2f4S5JDTLcJalBhrskNchwl6QGGe6S1CDDXWMpyWySn1vlc04mqSTHrOZ5pZVguEtSgwx3aQV49a9RM9w11pIcm+R3k/x19/jdJMf2bX9nkv3dtl/pplVOW+SYxyW5KslDSZ5M8sUkx/XtcmmSbyV5LMl7+p53TpK/TPKd7pwfTvKivu2V5IokDwIPLlZf19sHu3MdSPIHz9eRZH2Sz3TnejzJf0/iz6uWzH8sGnfvAc4FzgJeBZwD/CZAkvOBdwA/B5wGTC/xmB8Ezgb+KXAS8E7gub7tPwP8GHAe8N4kP96NPwv8G2A98Opu+7867NgXAj8FnLGE+nYCp3e9nQZsAN7bbdsOPAycDEwA7wb8rBAtXVX58DF2D2CWXij+L+CCvvHXALPd8nXAf+zbdhq9ADxtgeO+APg+8Ko5tk12zz+1b+xLwCXzHOvtwKf71gv42b71eesDAnwXeEXf9lcD3+yWfxu4aaFefPhY6OGVu8bdy4CH+tYf6sae3/btvm39y/NZD/wwvV8a83mkb/l7wDqAJKd3UyWPJHkK+A/d8fr117BQfScDLwbu7qZevgP8eTcO8AFgL/C5JN9IsmMJvUl/w3DXuPtr4Ef71v9hNwawHzi1b9vGJRzvMeD/AK9YRi3XAF8DNlXV36M3VZLD9umfOlmovsfo/QXxyqo6sXucUFXrAKrq6araXlX/CPgF4B1JzltGzfoBZbhr3H0c+M0kJydZT29O+qPdthuBNyX58SQvBv7dYgerqufoTZd8KMnLkrwwyav7X6RdwI8ATwEHk/xj4FcX2X/e+ro6PgJcneSlAEk2JHlNt/z6JKclCfAkvfn+5444gzQPw13j7t8DdwFfAfYA93RjVNVngd8DvkBvCuPO7jnPLHLMX++O9WXgceA/sbSfhV8H/iXwNL1g/sRCOy+hvt94fryb5vk8vRdyATZ16weBvwR+v6q+sIQaJQBS5QvwakN3V8t9wLFVdWjU9Rxu3OtTW7xy15qW5KLufvGX0LsC/7NxCs5xr0/tMty11r0VeJTe3S/P0s2DJ7k/ycE5HpeOQ33SSnNaRpIa5JW7JDVoLD7caP369TU5ObngPt/97nc5/vjjV6egVWA/481+xl9rPS2nn7vvvvuxqjp5rm1jEe6Tk5PcddddC+4zMzPD9PT06hS0CuxnvNnP+Gutp+X0k+Sh+bY5LSNJDTLcJalBhrskNchwl6QGGe6S1CDDXZIaZLhLUoMMd0lqkOEuSQ0ai3eoSuNscsctIznv7M7XjeS8aoNX7pLUIMNdkhpkuEtSgwx3SWqQ4S5JDTLcJalBhrskNchwl6QGGe6S1CDDXZIaZLhLUoMMd0lqkOEuSQ0y3CWpQYa7JDXIcJekBhnuktQgw12SGmS4S1KDDHdJatCi4Z5kY5IvJPlqkvuT/Fo3/r4k+5Lc2z0u6HvOu5LsTfL1JK9ZyQYkSUc6Zgn7HAK2V9U9SX4EuDvJbd22q6vqg/07JzkDuAR4JfAy4PNJTq+qZ4dZuCRpfoteuVfV/qq6p1t+GngA2LDAU7YCu6vqmar6JrAXOGcYxUqSliZVtfSdk0ngDuBM4B3A5
cBTwF30ru6fSPJh4M6q+mj3nGuBz1bVnxx2rG3ANoCJiYmzd+/eveC5Dx48yLp165Zc67izn/HW38+efU+OpIbNG04Y2rFa+/5Aez0tp58tW7bcXVVTc21byrQMAEnWAZ8E3l5VTyW5Bng/UN3Xq4A3L/V4VbUL2AUwNTVV09PTC+4/MzPDYvusJfYz3vr7uXzHLSOpYfbS6aEdq7XvD7TX07D7WdLdMkl+iF6wf6yqPgVQVQeq6tmqeg74CH879bIP2Nj39FO7MUnSKlnK3TIBrgUeqKoP9Y2f0rfbRcB93fLNwCVJjk3ycmAT8KXhlSxJWsxSpmV+GngjsCfJvd3Yu4E3JDmL3rTMLPBWgKq6P8mNwFfp3WlzhXfKSNLqWjTcq+qLQObYdOsCz7kSuHKAuiRJA/AdqpLUIMNdkhpkuEtSgwx3SWqQ4S5JDTLcJalBhrskNchwl6QGGe6S1CDDXZIaZLhLUoMMd0lqkOEuSQ0y3CWpQYa7JDXIcJekBhnuktQgw12SGmS4S1KDDHdJapDhLkkNMtwlqUGGuyQ1yHCXpAYZ7pLUIMNdkhpkuEtSgxYN9yQbk3whyVeT3J/k17rxk5LcluTB7utLuvEk+b0ke5N8JclPrnQTkqS/aylX7oeA7VV1BnAucEWSM4AdwO1VtQm4vVsHeC2wqXtsA64ZetWSpAUtGu5Vtb+q7umWnwYeADYAW4Ebut1uAC7slrcCf1g9dwInJjll6JVLkuaVqlr6zskkcAdwJvCtqjqxGw/wRFWdmOQzwM6q+mK37XbgN6rqrsOOtY3elT0TExNn7969e8FzHzx4kHXr1i251nFnP+Otv589+54cSQ2bN5wwtGO19v2B9npaTj9btmy5u6qm5tp2zFIPkmQd8Eng7VX1VC/Pe6qqkiz9t0TvObuAXQBTU1M1PT294P4zMzMsts9aYj/jrb+fy3fcMpIaZi+dHtqxWvv+QHs9DbufJYV7kh+iF+wfq6pPdcMHkpxSVfu7aZdHu/F9wMa+p5/ajUnLNrnKAbt986GRhbo0DEu5WybAtcADVfWhvk03A5d1y5cBN/WN/3J318y5wJNVtX+INUuSFrGUK/efBt4I7Elybzf2bmAncGOStwAPARd3224FLgD2At8D3jTUiiVJi1o03LsXRjPP5vPm2L+AKwasS5I0AN+hKkkNMtwlqUGGuyQ1yHCXpAYZ7pLUIMNdkhpkuEtSgwx3SWqQ4S5JDTLcJalBhrskNchwl6QGGe6S1CDDXZIaZLhLUoMMd0lqkOEuSQ0y3CWpQYa7JDXIcJekBhnuktQgw12SGmS4S1KDDHdJapDhLkkNMtwlqUGGuyQ16JhRF6C1ZXLHLXOOb998iMvn2ablme+/9XIczfdndufrhnZejc6iV+5JrkvyaJL7+sbel2Rfknu7xwV9296VZG+Sryd5zUoVLkma31KmZa4Hzp9j/OqqOqt73AqQ5AzgEuCV3XN+P8kLh1WsJGlpFg33qroDeHyJx9sK7K6qZ6rqm8Be4JwB6pMkLUOqavGdkkngM1V1Zrf+PuBy4CngLmB7VT2R5MPAnVX10W6/a4HPVtWfzHHMbcA2gImJibN37969YA0HDx5k3bp1S+1r7K3Vfvbse3LO8Ynj4MD3V7mYFfSD3M/mDSesbDFDslZ/huaznH62bNlyd1VNzbVtuS+oXgO8H6ju61XAm4/mAFW1C9gFMDU1VdPT0wvuPzMzw2L7rCVrtZ/5XpTbvvkQV+1p5/X5H+R+Zi+dXtlihmSt/gzNZ9j9LOtWyKo6UFXPVtVzwEf426mXfcDGvl1P7cYkSatoWeGe5JS+1YuA5++kuRm4JMmxSV4ObAK+NFiJkqSjtejfaUk+DkwD65M8DPwWMJ3kLHrTMrPAWwGq6v4kNwJfBQ4BV1TVsytTuiRpPouGe1W9YY7haxfY/0rgykGKkiQNxo8fkKQGGe6S1CDDXZIaZLhLUoMMd0lqkOEuSQ0y3CWpQ
Ya7JDXIcJekBhnuktQgw12SGmS4S1KDDHdJapDhLkkNauf/IyZpKCbn+V8probZna8b2blb45W7JDXIcJekBhnuktQgw12SGmS4S1KDvFtG0tg4mjt1tm8+xOVDurOnxbt0vHKXpAYZ7pLUIMNdkhpkuEtSgwx3SWqQ4S5JDVr0Vsgk1wGvBx6tqjO7sZOATwCTwCxwcVU9kSTAfwYuAL4HXF5V96xM6aM3yAcsDXIbV4u3bUkarqVcuV8PnH/Y2A7g9qraBNzerQO8FtjUPbYB1wynTEnS0Vg03KvqDuDxw4a3Ajd0yzcAF/aN/2H13AmcmOSUYRUrSVqa5b5DdaKq9nfLjwAT3fIG4Nt9+z3cje1HQzPKz9uWtDYM/PEDVVVJ6mifl2QbvakbJiYmmJmZWXD/gwcPLrrPatu++dCynztx3GDPHzf2M95a6weG29M4ZMuwM2654X4gySlVtb+bdnm0G98HbOzb79Ru7AhVtQvYBTA1NVXT09MLnnBmZobF9lltg3yuxfbNh7hqTzsf7WM/4621fmC4Pc1eOj2U4wxi2Bm33FshbwYu65YvA27qG//l9JwLPNk3fSNJWiVLuRXy48A0sD7Jw8BvATuBG5O8BXgIuLjb/VZ6t0HupXcr5JtWoGZJ0iIWDfeqesM8m86bY98Crhi0KEnSYHyHqiQ1yHCXpAYZ7pLUIMNdkhpkuEtSgwx3SWqQ4S5JDTLcJalBhrskNchwl6QGGe6S1CDDXZIaZLhLUoMMd0lqkOEuSQ0y3CWpQYa7JDXIcJekBhnuktQgw12SGmS4S1KDDHdJapDhLkkNMtwlqUGGuyQ1yHCXpAYZ7pLUoGNGXYAkjdrkjltGdu7Zna9bkeMOFO5JZoGngWeBQ1U1leQk4BPAJDALXFxVTwxWpiTpaAxjWmZLVZ1VVVPd+g7g9qraBNzerUuSVtFKzLlvBW7olm8ALlyBc0iSFpCqWv6Tk28CTwAF/Neq2pXkO1V1Yrc9wBPPrx/23G3ANoCJiYmzd+/eveC5Dh48yLp165Zd60rYs+/JZT934jg48P0hFjNi9jPeWusH2ulp84YTgOVl3JYtW+7umzX5OwZ9QfVnqmpfkpcCtyX5Wv/Gqqokc/72qKpdwC6Aqampmp6eXvBEMzMzLLbPart8gBdhtm8+xFV72nk9237GW2v9QDs9zV46DQw/4waalqmqfd3XR4FPA+cAB5KcAtB9fXTQIiVJR2fZv/aSHA+8oKqe7pZ/Hvht4GbgMmBn9/WmYRQ6n1HewiRJ42qQv2kmgE/3ptU5BvjjqvrzJF8GbkzyFuAh4OLBy5QkHY1lh3tVfQN41Rzj/xs4b5CiJEmD8eMHJKlBhrskNchwl6QGGe6S1CDDXZIaZLhLUoMMd0lqkOEuSQ0y3CWpQYa7JDXIcJekBhnuktQgw12SGmS4S1KDDHdJapDhLkkNMtwlqUGGuyQ1yHCXpAYZ7pLUIMNdkhpkuEtSgwx3SWqQ4S5JDTLcJalBhrskNchwl6QGGe6S1KAVC/ck5yf5epK9SXas1HkkSUdakXBP8kLgvwCvBc4A3pDkjJU4lyTpSCt15X4OsLeqvlFV/xfYDWxdoXNJkg6Tqhr+QZNfBM6vql/p1t8I/FRVva1vn23Atm71x4CvL3LY9cBjQy92dOxnvNnP+Gutp+X086NVdfJcG44ZvJ7lqapdwK6l7p/krqqaWsGSVpX9jDf7GX+t9TTsflZqWmYfsLFv/dRuTJK0ClYq3L8MbEry8iQvAi4Bbl6hc0mSDrMi0zJVdSjJ24C/AF4IXFdV9w942CVP4awR9jPe7Gf8tdbTUPtZkRdUJUmj5TtUJalBhrskNWjswj3JdUkeTXJf39hJSW5L8mD39SWjrPFozdPTLyW5P8lzSdbU7Vzz9POBJF9L8pUkn05y4ihrPBrz9PP+rpd7k3wuyctGWePRmKufvm3bk1SS9aOobTnm+f68L8m+7vtzb5ILRlnj0Zrve
5TkX3c/R/cn+Z1BzjF24Q5cD5x/2NgO4Paq2gTc3q2vJddzZE/3Af8cuGPVqxnc9RzZz23AmVX1E8D/BN612kUN4HqO7OcDVfUTVXUW8Bngvate1fJdz5H9kGQj8PPAt1a7oAFdzxz9AFdX1Vnd49ZVrmlQ13NYT0m20Hsn/6uq6pXABwc5wdiFe1XdATx+2PBW4IZu+QbgwlUtakBz9VRVD1TVYu/KHUvz9PO5qjrUrd5J770Na8I8/TzVt3o8sGbuPJjnZwjgauCdrKFeYMF+1qx5evpVYGdVPdPt8+gg5xi7cJ/HRFXt75YfASZGWYwW9Wbgs6MuYlBJrkzybeBS1taV+xGSbAX2VdVfjbqWIXpbN3V23Vqbqp3H6cA/S/I/kvy3JP9kkIOtlXD/G9W7d3NNXXn8IEnyHuAQ8LFR1zKoqnpPVW2k18vbFtt/XCV5MfBu1vgvqMNcA7wCOAvYD1w12nKG4hjgJOBc4N8CNybJcg+2VsL9QJJTALqvA/25opWR5HLg9cCl1dYbKD4G/ItRFzGAVwAvB/4qySy9KbN7kvyDkVY1gKo6UFXPVtVzwEfofRLtWvcw8Knq+RLwHL0PE1uWtRLuNwOXdcuXATeNsBbNIcn59OZzf6GqvjfqegaVZFPf6lbga6OqZVBVtaeqXlpVk1U1SS9EfrKqHhlxacv2/MVe5yJ6NyisdX8KbAFIcjrwIgb51MuqGqsH8HF6f2b9P3r/CN8C/H16d8k8CHweOGnUdQ6hp4u65WeAA8BfjLrOAfvZC3wbuLd7/MGo6xywn0/SC4yvAH8GbBh1nYP0c9j2WWD9qOsc8PvzR8Ce7vtzM3DKqOscQk8vAj7a/bu7B/jZQc7hxw9IUoPWyrSMJOkoGO6S1CDDXZIaZLhLUoMMd0lqkOEuSQ0y3CWpQf8fj+ORkw6jsrgAAAAASUVORK5CYII=\n" }, "metadata": { "needs_background": "light" } } ] }, { "cell_type": "markdown", "source": [ "The log-transformed `charges` values are more centered, which is what we wanted. This makes it more likely that the errors will be unbiased." ], "metadata": { "id": "sFiFbsCUQHv8" } }, { "cell_type": "code", "source": [ "# Checking the correlation between the continuous columns in the insurance data\n", "insurance.corr()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "id": "uSRq96HH8AAR", "outputId": "ace3b1fc-0cbc-4cce-b57e-38b26b1dbfe7" }, "execution_count": 6, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " age bmi children charges log_charges\n", "age 1.000000 0.109272 0.042469 0.299008 0.527834\n", "bmi 0.109272 1.000000 0.012759 0.198341 0.132669\n", "children 0.042469 0.012759 1.000000 0.067998 0.161336\n", "charges 0.299008 0.198341 0.067998 1.000000 0.892964\n", "log_charges 0.527834 0.132669 0.161336 0.892964 1.000000" ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {}, "execution_count": 6 } ] }, { "cell_type": "markdown", "source": [ "## Comments on correlation\n", "\n", "`age` has a correlation of about 0.30 with `charges`, `bmi` about 0.20, and `children` about 0.07. Against `log_charges`, the correlations are about 0.53 for `age`, 0.13 for `bmi`, and 0.16 for `children`." ], "metadata": { "id": "-ghJAB_59B9Z" } }, { "cell_type": "code", "source": [ "insurance.boxplot(column = [\"log_charges\"], by = \"sex\")" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 373 }, "id": "amZ_91bU8352", "outputId": "dec66ac2-8799-432a-9aa5-344d5a6eaac6" }, "execution_count": 7, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "/usr/local/lib/python3.7/dist-packages/matplotlib/cbook/__init__.py:1376: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.\n", " X = np.atleast_1d(X.T if isinstance(X, np.ndarray) else np.asarray(X))\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ "" ] }, "metadata": {}, "execution_count": 7 }, { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": { "needs_background": "light" } } ] }, { "cell_type": "code", "source": [ "insurance.boxplot(column = [\"log_charges\"], by = \"smoker\")" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 373 }, "id": "9_gIr-7AJ1Uo", "outputId": "61cfc34b-4718-429c-ff27-2e9ee9e39d22" }, "execution_count": 8, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "/usr/local/lib/python3.7/dist-packages/matplotlib/cbook/__init__.py:1376: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.\n", " X = np.atleast_1d(X.T if isinstance(X, np.ndarray) else np.asarray(X))\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ "" ] }, "metadata": {}, "execution_count": 8 }, { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": { "needs_background": "light" } } ] }, { "cell_type": "code", "source": [ "insurance.boxplot(column = [\"log_charges\"], by = \"region\")" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 373 }, "id": "Z5-n30hmKH_Y", "outputId": "aeabd663-0781-432c-954f-7ce33dc5c4f0" }, "execution_count": 9, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "/usr/local/lib/python3.7/dist-packages/matplotlib/cbook/__init__.py:1376: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.\n", " X = np.atleast_1d(X.T if isinstance(X, np.ndarray) else np.asarray(X))\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ "" ] }, "metadata": {}, "execution_count": 9 }, { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": { "needs_background": "light" } } ] }, { "cell_type": "code", "source": [ "insurance.boxplot(column = [\"log_charges\"], by = \"region\")" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 373 }, "id": "lVogyvIOKK6o", "outputId": "30499162-adaf-4260-f26d-b650759d4b73" }, "execution_count": 10, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "/usr/local/lib/python3.7/dist-packages/matplotlib/cbook/__init__.py:1376: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.\n", " X = np.atleast_1d(X.T if isinstance(X, np.ndarray) else np.asarray(X))\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ "" ] }, "metadata": {}, "execution_count": 10 }, { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": { "needs_background": "light" } } ] }, { "cell_type": "markdown", "source": [ "## Comments on plots\n", "\n", "Males seem to have a wider distribution of charges compared to women. Smokers have much higher costs than non-smokers. There doesn't seem tobe many appreciable differences between regions. " ], "metadata": { "id": "EcQUy2RaKPNh" } }, { "cell_type": "markdown", "source": [ "# Dividing The Data\n", "\n", "Based on the univariate relationships shown above, `age`, `bmi` and `smoker` are positively associated with higher `charges`. We'll include these predictors in our final model." ], "metadata": { "id": "Q4f6SIK8Kv-k" } }, { "cell_type": "code", "source": [ "# Splitting the data up into a training and test set\n", "insurance[\"is_smoker\"] = (insurance[\"smoker\"] == \"yes\")\n", "X = insurance[[\"age\", \"bmi\", \"is_smoker\"]]\n", "y = insurance[\"log_charges\"]\n", "\n", "# 75% for training set, 25% for test set\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, \n", " random_state = 1)" ], "metadata": { "id": "lihP2kTALKL-" }, "execution_count": 11, "outputs": [] }, { "cell_type": "markdown", "source": [ "# Build The Model" ], "metadata": { "id": "YE-XKUr_428I" } }, { "cell_type": "code", "source": [ "# Training and checking model performance on training set\n", "insurance_model = LinearRegression()\n", "insurance_model.fit(X_train, y_train)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "jCHcD-PcLfxu", "outputId": "93957439-f797-4f85-e5fc-1149ae619dd0" }, "execution_count": 12, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "LinearRegression()" ] }, "metadata": {}, "execution_count": 12 } ] }, { "cell_type": "markdown", "source": [ "" ], "metadata": { "id": "r9pR9V2TOYdT" } }, { "cell_type": "code", "source": [ "# Get predicted values by model\n", "y_pred = insurance_model.predict(X_train)\n", "\n", "# MSE on the log scale for the 
insurance charges\n", "mean_squared_error(y_train, y_pred)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "WY4gnJJpMqNN", "outputId": "ac7676db-6ccf-4b35-aa30-d81905dac57d" }, "execution_count": 13, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "0.4546665339270644" ] }, "metadata": {}, "execution_count": 13 } ] }, { "cell_type": "code", "source": [ "# Exponentiated log-scale MSE (a rough summary, not an MSE in the original charge units)\n", "np.exp(mean_squared_error(y_train, y_pred))" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "YvUvASEyQ0aN", "outputId": "5f99b5a4-317f-47cc-93b1-8d22464612a6" }, "execution_count": 14, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "1.575647870310887" ] }, "metadata": {}, "execution_count": 14 } ] }, { "cell_type": "code", "source": [ "# Coefficient of determination\n", "r2_score(y_train, y_pred)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "tOFi1oxFRFPk", "outputId": "351d9bd3-2b78-4ff0-fec0-f5ca559b5d1c" }, "execution_count": 15, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "0.7421118855283421" ] }, "metadata": {}, "execution_count": 15 } ] }, { "cell_type": "markdown", "source": [ "## Comments\n", "\n", "The training MSE for the model is about 0.455 on the log scale. Exponentiating it gives about 1.58, but this value is not an MSE in the original charge units and should be read with care, especially since the outcome was transformed with `np.log2` rather than the natural log. The $R^2$ indicates that the model explains about 74% of the variation in the log-transformed charges. These preliminary results are promising, but they are computed on the training data and are therefore optimistic." 
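, "\n", "\n", "As a rough worked example of reading this error on a more familiar scale (assuming the `np.log2` transform applied to `charges` earlier, and using the training MSE printed above), one option is to take the square root of the log-scale MSE to get an RMSE in log2 units, then raise 2 to that power:\n", "\n", "$$\\text{RMSE} = \\sqrt{0.4547} \\approx 0.674, \\qquad 2^{0.674} \\approx 1.60$$\n", "\n", "Under that reading, training predictions are typically off by a factor of roughly 1.6 in either direction. This is an illustrative interpretation only, not an MSE in the original charge units." 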
], "metadata": { "id": "70BciwrTQucW" } }, { "cell_type": "markdown", "source": [ "# Residual Diagnostics" ], "metadata": { "id": "Op2iO4grSZBZ" } }, { "cell_type": "code", "source": [ "# Quick visual check of residuals\n", "check = pd.DataFrame()\n", "check[\"residuals\"] = y_train - y_pred\n", "check[\"fitted\"] = y_pred\n", "\n", "check.plot.scatter(x = \"fitted\", y = \"residuals\")" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 296 }, "id": "dLCBNJMuNb-x", "outputId": "626a7509-6890-4932-cd42-8b443001d99f" }, "execution_count": 16, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "" ] }, "metadata": {}, "execution_count": 16 }, { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": { "needs_background": "light" } } ] }, { "cell_type": "markdown", "source": [ "The residuals suggest some violations to the assumptions of linear regression. As fitted values get larger, the residuals trend downward. We expect an even band, centered around zero. This does not necessarily make the model predictions unusable, but it puts into question the linear regression assumptions." ], "metadata": { "id": "PW19C6k5TgUS" } }, { "cell_type": "markdown", "source": [ "# Interpreting The Model" ], "metadata": { "id": "2NpUAgXgUI0Z" } }, { "cell_type": "code", "source": [ "# Getting the non-intercept coefficients\n", "insurance_model.coef_" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "YmfjKyIjSqr-", "outputId": "7e388b83-ffb1-4116-c711-7aabea9312d5" }, "execution_count": 17, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "array([0.04892865, 0.01523672, 2.23063344])" ] }, "metadata": {}, "execution_count": 17 } ] }, { "cell_type": "markdown", "source": [ "- A year increase in the subject is associated with a 0.04 increase in the log charges, holding smoking status and bmi constant. About a 4% increase in the charges on the regular scale.\n", "- A unit increase in the subject BMI is associated with a 0.01 increase in the log charges, holding smoking status and age constant. About a 1.5% increase in the charges on the regular scale.\n", "- A smoker is associated with a 2.23 increase in the log charges, holding age and bmi constant. About a 930% increase in the charges on the regular scale.\n", "\n", "Note: we are not concerned about if these changes are *statistically significant*, so we don't know if these associations are truly non-zero. Our primary goal is prediction." 
], "metadata": { "id": "guf4YyvQVSfq" } }, { "cell_type": "markdown", "source": [ "# Final Model Evaluation" ], "metadata": { "id": "aYOK6CucWXBB" } }, { "cell_type": "code", "source": [ "# Getting MSE on the test set\n", "test_pred = insurance_model.predict(X_test)\n", "\n", "mean_squared_error(y_test, test_pred)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "tXrcpejXUNDv", "outputId": "fbbab0a6-4446-4b35-8626-dc4a5d47dd76" }, "execution_count": 18, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "0.4355350875308211" ] }, "metadata": {}, "execution_count": 18 } ] }, { "cell_type": "code", "source": [ "# Exponentiated log-scale test MSE (a rough summary, not an MSE in the original charge units)\n", "np.exp(mean_squared_error(y_test, test_pred))" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "NjH8j9ZPXDHv", "outputId": "001d7c0a-70db-4be3-8a1d-84521e71ac8b" }, "execution_count": 20, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "1.545789970635098" ] }, "metadata": {}, "execution_count": 20 } ] }, { "cell_type": "markdown", "source": [ "# Drawing Conclusions\n", "\n", "The test MSE was about 0.436, while the training MSE was about 0.455. The two errors match up well, so there is no evidence that the model is overfit. The residuals suggest that the model is predicting much lower costs for subjects who were actually charged much higher. Therefore the model struggles with these higher costs. As a whole, the model predictions are too conservative.\n", "\n", "We might improve the model by including more complex terms in the regression, such as interactions or quadratic terms." ], "metadata": { "id": "iZcGg2tgWpGx" } }, { "cell_type": "code", "source": [ "" ], "metadata": { "id": "sMKpzdDG5B7_" }, "execution_count": 19, "outputs": [] } ] }