{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "fihFj3tMwplr" }, "source": [ "# Purpose of Notebook\n", "\n", "The purpose of this notebook is to offer an example answer to the guided project for the Optimizing Models course. The reference model will be the same for all students, but any other models are constructed by the student. Results may vary." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "RfWzhN9BaIK_" }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "from sklearn.linear_model import LinearRegression" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "vfAO-T-5l4h4" }, "outputs": [], "source": [ "# Load in the forest fires dataset\n", "fires = pd.read_csv(\"/content/drive/MyDrive/fires.csv\")\n", "\n", "fires_reference = fires[[\"wind\", \"temp\", \"area\"]].dropna()\n", "reference_X = fires_reference[[\"wind\", \"temp\"]]\n", "\n", "reference = LinearRegression()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "7Ea_gbpU1iG-", "outputId": "d6dea921-85cb-44b3-cfc4-3d5a087ddef7" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The Unnamed: 0 column has 0 missing values.\n", "The X column has 0 missing values.\n", "The Y column has 0 missing values.\n", "The month column has 0 missing values.\n", "The day column has 0 missing values.\n", "The FFMC column has 48 missing values.\n", "The DMC column has 21 missing values.\n", "The DC column has 43 missing values.\n", "The ISI column has 2 missing values.\n", "The temp column has 21 missing values.\n", "The RH column has 30 missing values.\n", "The wind column has 35 missing values.\n", "The rain column has 32 missing values.\n", "The area column has 0 missing values.\n" ] } ], "source": [ "for col in fires.columns:\n", " num_na = sum(pd.isna(fires[col]))\n", " print(f\"The {col} column has {num_na} missing values.\")" ] }, { 
"cell_type": "markdown", "metadata": { "id": "CK7iWUZX4R25" }, "source": [ "# Data Processing\n", "\n", "First, we'll convert the `month` column into a categorical feature. Instead of using the strings, we'll convert it into an indicator for the summer months in the northern hemisphere.\n", "\n", "For the sake of completion, we'll impute all of the features so that we can have the biggest set to choose from for sequential feature selection. We'll go with K-nearest neighbors imputation since we expect area damage to be similar among similar fires. " ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 316 }, "id": "Rcqnhb76PpUk", "outputId": "feddd9c6-103b-424c-df3f-69c11ea2e150" }, "outputs": [ { "data": { "text/plain": [ "array([[]],\n", " dtype=object)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXcAAAEICAYAAACktLTqAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAASfElEQVR4nO3dcaxed33f8fenMYQQs9gh7MqzU5yKlCojIyRXJBGd5JCVhlA1TEoRUdSYzJP/STc6IjVmk8aYNslMTUNgE8MjtGmXYiiF2XIRNDW+qlqVlHhkSUjIcpMaYivEkDimBjrV8N0fz8/prbn2vX58r6/Pz++X9Oie8zu/5zy/7/1Zn+fc85znOFWFJKkvP7XUA5AkLTzDXZI6ZLhLUocMd0nqkOEuSR0y3CWpQ4a7JHXIcJekDhnu0jEkWbbUY5DGZbjrjJNkU5Knkvx1kseS/PPW/p4kf57kriTPA/8hydlJfjPJt5I8l+S/Jzmn9V+ZZEeS7yQ50JbXLGlxUmO460z0FPBPgfOADwL/M8mqtu1K4GlgAvjPwGbgZ4HLgNcBq4F/3/r+FPDbwGuBnwZ+CPzXU1OCdHzx3jI60yV5CPgAsBL4j1X10609wCHgn1TVU63tauD3q+qiWfZzGbCrqlaessFLx+A5RZ1xktwCvA9Y25qWAxcAPwKemdH1NcArgd2jnB89HTir7eeVwF3AdYzeGABeleSsqvrRIpYgzclw1xklyWuB/wFcC/xFVf2oHbkfSe+Zf8p+l9Gpln9cVftm2d3twOuBK6vq2+3I/Wsz9iUtGc+560xzLqMA/w5AkluBN8zWsap+zOiN4K4k/7D1X53kF1uXVzEK/xeTnM/o1I50WjDcdUapqseAO4G/AJ4DLgX+/DhPuQOYBr6S5HvAnzA6Wgf4MHAOoyP8rwBfXKRhSyfMD1QlqUMeuUtShwx3SeqQ4S5JHZpXuCfZk+SRJA8lebC1nZ/k/iRPtp8rW3uSfCTJdJKHk1y+mAVIkn7SvD5QTbIHmKyq785o+y/AC1W1Ock
mYGVV3ZHkeuBfAdcz+ir33VV15fH2f8EFF9TatWvHKuD73/8+55577ljPHQLrGzbrG64h1LZ79+7vVtVrZt1YVXM+gD3ABUe1PQGsasurgCfa8seBm2brd6zHFVdcUePatWvX2M8dAusbNusbriHUBjxYx8jV+Z5zL+CPk+xOsrG1TVTVs23524xutASjGyvN/Ar33tYmSTpF5nv7gZ+vqn3tW3r3J/nGzI1VVUlO6IL59iaxEWBiYoKpqakTefpLDh06NPZzh8D6hs36hmvotc0r3KvdV6Oq9if5PPBm4Lkkq6rq2Xa71P2t+z7gwhlPX9Pajt7nFmALwOTkZK1bt26sAqamphj3uUNgfcNmfcM19NrmPC2T5NwkrzqyDLwNeBTYDqxv3dYD29ryduCWdtXMVcDBGadvJEmnwHyO3CeAz7dbni5jdC/rLyb5KvCZJBuAbwLvav2/wOhKmWngB8CtCz5qSdJxzRnuVfU08MZZ2p9ndNvUo9sLuG1BRidJGovfUJWkDhnuktQhw12SOjT4/2bvkX0Hec+mP5qz357N7zgFo5Gk04NH7pLUIcNdkjpkuEtShwx3SeqQ4S5JHTLcJalDhrskdchwl6QOGe6S1CHDXZI6ZLhLUocMd0nqkOEuSR0y3CWpQ4a7JHXIcJekDhnuktQhw12SOmS4S1KHDHdJ6pDhLkkdMtwlqUOGuyR1yHCXpA4Z7pLUIcNdkjpkuEtShwx3SeqQ4S5JHTLcJalD8w73JGcl+VqSHW39oiQPJJlO8ukkL2/tZ7f16bZ97eIMXZJ0LCdy5P5e4PEZ6x8C7qqq1wEHgA2tfQNwoLXf1fpJkk6heYV7kjXAO4BPtPUAbwU+27rcC7yzLd/Q1mnbr239JUmnyHyP3D8M/Abw47b+auDFqjrc1vcCq9vyauAZgLb9YOsvSTpFls3VIckvAfuraneSdQv1wkk2AhsBJiYmmJqaGms/E+fA7ZcenrPfuPtfaocOHRrs2OfD+oat5/qGXtuc4Q68BfjlJNcDrwD+AXA3sCLJsnZ0vgbY1/rvAy4E9iZZBpwHPH/0TqtqC7AFYHJystatWzdWAR+9bxt3PjJ3GXtuHm//S21qaopxfzdDYH3D1nN9Q69tztMyVfX+qlpTVWuBdwNfrqqbgV3Aja3bemBbW97e1mnbv1xVtaCjliQd18lc534H8L4k04zOqd/T2u8BXt3a3wdsOrkhSpJO1HxOy7ykqqaAqbb8NPDmWfr8DfArCzA2SdKY/IaqJHXIcJekDhnuktQhw12SOmS4S1KHDHdJ6pDhLkkdMtwlqUOGuyR1yHCXpA4Z7pLUIcNdkjpkuEtShwx3SeqQ4S5JHTLcJalDhrskdchwl6QOGe6S1CHDXZI6ZLhLUocMd0nqkOEuSR0y3CWpQ4a7JHXIcJekDhnuktQhw12SOmS4S1KHDHdJ6pDhLkkdMtwlqUOGuyR1yHCXpA7NGe5JXpHkL5P8nyRfT/LB1n5RkgeSTCf5dJKXt/az2/p02752cUuQJB1tPkfu/w94a1W9EbgMuC7JVcCHgLuq6nXAAWBD678BONDa72r9JEmn0JzhXiOH2urL2qOAtwKfbe33Au9syze0ddr2a5NkwUYsSZrTvM65JzkryUPAfuB+4Cngxao63LrsBVa35dXAMwBt+0Hg1Qs5aEnS8S2bT6eq+hFwWZIVwOeBnzvZF06yEdgIMDExwdTU1Fj7mTgHbr/08Jz9xt3/Ujt06NBgxz4f1jdsPdc39NrmFe5HVNWLSXYBVwMrkixrR+drgH2t2z7gQmBvkmXAecDzs+xrC7AFYHJystatWzdWAR+9bxt3PjJ3GXtuHm//S21qaopxfzdDYH3D1nN9Q69tPlfLvKYdsZPkHOAXgMeBXcCNrdt6YFtb3t7Wadu/XFW1kIOWJB3ffI7cVwH3JjmL0ZvBZ6pqR5LHgK1J/hPwNeCe1v8e4PeSTAMvAO9ehHFLko5jznCvqoeBN83
S/jTw5lna/wb4lQUZnSRpLH5DVZI6ZLhLUocMd0nqkOEuSR0y3CWpQ4a7JHXIcJekDhnuktQhw12SOmS4S1KHDHdJ6pDhLkkdMtwlqUOGuyR1yHCXpA4Z7pLUIcNdkjpkuEtShwx3SeqQ4S5JHTLcJalDhrskdchwl6QOGe6S1CHDXZI6ZLhLUocMd0nqkOEuSR0y3CWpQ4a7JHXIcJekDhnuktQhw12SOmS4S1KHDHdJ6tCc4Z7kwiS7kjyW5OtJ3tvaz09yf5In28+VrT1JPpJkOsnDSS5f7CIkSX/ffI7cDwO3V9UlwFXAbUkuATYBO6vqYmBnWwd4O3Bxe2wEPrbgo5YkHdec4V5Vz1bV/27Lfw08DqwGbgDubd3uBd7Zlm8AfrdGvgKsSLJqwUcuSTqmVNX8OydrgT8F3gB8q6pWtPYAB6pqRZIdwOaq+rO2bSdwR1U9eNS+NjI6smdiYuKKrVu3jlXA/hcO8twP5+536erzxtr/Ujt06BDLly9f6mEsGusbtp7rG0Jt11xzze6qmpxt27L57iTJcuAPgV+vqu+N8nykqirJ/N8lRs/ZAmwBmJycrHXr1p3I01/y0fu2cecjc5ex5+bx9r/UpqamGPd3MwTWN2w91zf02uZ1tUySlzEK9vuq6nOt+bkjp1vaz/2tfR9w4Yynr2ltkqRTZD5XywS4B3i8qn5rxqbtwPq2vB7YNqP9lnbVzFXAwap6dgHHLEmaw3xOy7wF+FXgkSQPtbZ/C2wGPpNkA/BN4F1t2xeA64Fp4AfArQs6YknSnOYM9/bBaI6x+dpZ+hdw20mOS5J0EvyGqiR1yHCXpA4Z7pLUIcNdkjpkuEtShwx3SeqQ4S5JHTLcJalDhrskdchwl6QOGe6S1CHDXZI6ZLhLUocMd0nqkOEuSR0y3CWpQ4a7JHXIcJekDhnuktQhw12SOmS4S1KHDHdJ6pDhLkkdMtwlqUOGuyR1yHCXpA4Z7pLUIcNdkjpkuEtShwx3SeqQ4S5JHTLcJalDhrskdchwl6QOzRnuST6ZZH+SR2e0nZ/k/iRPtp8rW3uSfCTJdJKHk1y+mIOXJM1uPkfuvwNcd1TbJmBnVV0M7GzrAG8HLm6PjcDHFmaYkqQTMWe4V9WfAi8c1XwDcG9bvhd454z2362RrwArkqxaqMFKkuYnVTV3p2QtsKOq3tDWX6yqFW05wIGqWpFkB7C5qv6sbdsJ3FFVD86yz42Mju6ZmJi4YuvWrWMVsP+Fgzz3w7n7Xbr6vLH2v9QOHTrE8uXLl3oYi8b6hq3n+oZQ2zXXXLO7qiZn27bsZHdeVZVk7neIn3zeFmALwOTkZK1bt26s1//ofdu485G5y9hz83j7X2pTU1OM+7sZAusbtp7rG3pt414t89yR0y3t5/7Wvg+4cEa/Na1NknQKjRvu24H1bXk9sG1G+y3tqpmrgINV9exJjlGSdILmPJ+R5FPAOuCCJHuBDwCbgc8k2QB8E3hX6/4F4HpgGvgBcOsijFmSNIc5w72qbjrGpmtn6VvAbSc7KEnSyfEbqpLUIcNdkjpkuEtShwx3SeqQ4S5JHTLcJalDhrskdchwl6QOGe6S1CHDXZI6ZLhLUocMd0nqkOEuSR0y3CWpQ4a7JHXIcJekDhnuktQhw12SOmS4S1KHDHdJ6pDhLkkdMtwlqUOGuyR1yHCXpA4Z7pLUIcNdkjpkuEtShwx3SerQsqUewKmydtMfzbvvns3vWMSRSNLi88hdkjpkuEtShwx3SeqQ4S5JHTLcJalDi3K1TJLrgLuBs4BPVNXmxXidxTLfK2u8qkbS6WrBwz3JWcB/A34B2At8Ncn2qnpsoV+rN7O9qdx+6WHec1S7byqS5rIYR+5vBqar6mmAJFuBG4AzNtxP5Br7pXhd3yzODP57OLWW+vedqlrYHSY3AtdV1b9s678KXFlVv3ZUv43Axrb6euCJMV/yAuC7Yz53CKxv2KxvuIZQ22ur6jWzbViyb6hW1RZ
gy8nuJ8mDVTW5AEM6LVnfsFnfcA29tsW4WmYfcOGM9TWtTZJ0iixGuH8VuDjJRUleDrwb2L4IryNJOoYFPy1TVYeT/BrwJUaXQn6yqr6+0K8zw0mf2jnNWd+wWd9wDbq2Bf9AVZK09PyGqiR1yHCXpA4NOtyTXJfkiSTTSTYt9XhOVJILk+xK8liSryd5b2s/P8n9SZ5sP1e29iT5SKv34SSXL20F85PkrCRfS7KjrV+U5IFWx6fbB+8kObutT7fta5dy3PORZEWSzyb5RpLHk1zd0/wl+Tft3+ajST6V5BVDnr8kn0yyP8mjM9pOeL6SrG/9n0yyfilqmctgw33GbQ7eDlwC3JTkkqUd1Qk7DNxeVZcAVwG3tRo2ATur6mJgZ1uHUa0Xt8dG4GOnfshjeS/w+Iz1DwF3VdXrgAPAhta+ATjQ2u9q/U53dwNfrKqfA97IqM4u5i/JauBfA5NV9QZGF0i8m2HP3+8A1x3VdkLzleR84APAlYy+kf+BI28Ip5WqGuQDuBr40oz19wPvX+pxnWRN2xjdk+cJYFVrWwU80ZY/Dtw0o/9L/U7XB6PvOewE3grsAMLoW3/Ljp5HRldYXd2Wl7V+WeoajlPbecBfHT3GXuYPWA08A5zf5mMH8ItDnz9gLfDouPMF3AR8fEb73+t3ujwGe+TO3/3DO2Jvaxuk9ifsm4AHgImqerZt+jYw0ZaHWPOHgd8AftzWXw28WFWH2/rMGl6qr20/2Pqfri4CvgP8djvt9Ikk59LJ/FXVPuA3gW8BzzKaj930M39HnOh8DWIehxzu3UiyHPhD4Ner6nszt9Xo0GCQ16sm+SVgf1XtXuqxLJJlwOXAx6rqTcD3+bs/6YHBz99KRjf9uwj4R8C5/OQpja4Meb6ONuRw7+I2B0lexijY76uqz7Xm55KsattXAftb+9Bqfgvwy0n2AFsZnZq5G1iR5MgX6GbW8FJ9bft5wPOncsAnaC+wt6oeaOufZRT2vczfPwP+qqq+U1V/C3yO0Zz2Mn9HnOh8DWIehxzug7/NQZIA9wCPV9Vvzdi0HTjyCfx6Rufij7Tf0j7Fvwo4OOPPydNOVb2/qtZU1VpG8/PlqroZ2AXc2LodXd+Rum9s/U/bo6iq+jbwTJLXt6ZrGd3auov5Y3Q65qokr2z/Vo/U18X8zXCi8/Ul4G1JVra/bt7W2k4vS33S/yQ/GLke+L/AU8C/W+rxjDH+n2f0J+DDwEPtcT2j85Q7gSeBPwHOb/3D6Aqhp4BHGF3FsOR1zLPWdcCOtvwzwF8C08AfAGe39le09em2/WeWetzzqOsy4ME2h/8LWNnT/AEfBL4BPAr8HnD2kOcP+BSjzw/+ltFfXhvGmS/gX7Q6p4Fbl7qu2R7efkCSOjTk0zKSpGMw3CWpQ4a7JHXIcJekDhnuktQhw12SOmS4S1KH/j8OdimohsLZwgAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "fires.hist(\"area\", bins=30)" ] }, { "cell_type": "markdown", "metadata": { "id": "apVd2ZvbQCie" }, "source": [ "The outcome is highly right-skewed with extremely damaging fires. Furthermore, many of the rows have outcome values that are zero or near-zero. It might be worth it to log-transform the data. Note though that some of the outcomes are actually 0, so we can add `1` to prevent any errors. Recall that $log(0)$ is undefined." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 316 }, "id": "wXwiZuoDQh87", "outputId": "2e49a72f-c5f1-4814-ea0b-4ad38605b827" }, "outputs": [ { "data": { "text/plain": [ "array([[]],\n", " dtype=object)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXcAAAEICAYAAACktLTqAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAATqUlEQVR4nO3df/BldX3f8edLsSZhDYsD/c52oa5TqSPKBOU7xA5t+t2SRJBMwE7KQKnir1mboqMNHQM2jbYpM3SmGCNpTTZCxRH5hqAMlKCGErbG6RBlCXH5EeNGl8CG7GoXF76EmIG8+8f3rL3id/d79/7Y+z0fno+ZO/fec84953UX5nXP93POPTdVhSSpLS+YdQBJ0uRZ7pLUIMtdkhpkuUtSgyx3SWqQ5S5JDbLc1WtJdiX5yVnnkNYay12SGmS5SxOQ5KhZZ5AGWe5qQpIXJ/lIkr/obh9J8uKB+e9P8lg3751JKskrVlnnOUn+KMkTSR5J8qGBeZu6dbwjyZ8Dv99Nf3uSh5I8nuQLSV428Jpf69bzRJLtSf7J5P8lpGWWu1rx74HXA6cCPwacDvwSQJKzgF8AfhJ4BbAw5DqfAt4CrAfOAX4+yXnPWeafAq8C3pDkXOADwD8Hjgf+ALhhYNmvdPleCnwa+J0kP3Q4b1IaVry2jPosyS7gncBvAu+pqtu76W8AfrOqNiW5FthTVZd3814BfB04qap2Hsa2PgJUVf3bJJuAbwL/oKq+0c3/HHBTVV3TPX8BsAS8qqoeXmF9jwMLVfXHI7156RDcc1cr/h4wWKAPd9MOzHtkYN7g44NK8uNJ7kryrST7gX8NHPecxQbX9TLg15J8J8l3gH1AgI3d+v5dN2Szv5t/zArrkybCclcr/oLlcj3g73fTAB4DThiYd+KQ6/w0cCtwYlUdA/wGy2U9aPBP30eAd1XV+oHbD1fV/+nG198PnA8cW1Xrgf0rrE+aCMtdrbgB+KUkxyc5Dvhl4FPdvBuBtyV5VZIfAf7DkOt8CbCvqv46yenAv1xl+d8ALk/yaoAkxyT5FwPregb4F
nBUkl8GfnTYNycdLstdrfjPwD3AV4EdwL3dNKrqc8BHgbuAncDd3Wu+u8o6/w3wn5I8yfKHxY2HWriqbgb+C7CY5AngfuDsbvYXgM8Df8rykNFfM+TwkDQKD6jqeSfJq1gu3hdX1TOzziNNg3vuel5I8qbuXPhjWd67/p8Wu1pmuev54l3AXuDPgGeBnwdI8kCSpRVuF80yrDQuh2UkqUHuuUtSg9bExY6OO+642rRp00ivfeqppzj66KMnG2iK+pS3T1mhX3n7lBX6lbdPWWG8vNu3b/92VR2/4syqmvnttNNOq1HdddddI792FvqUt09Zq/qVt09Zq/qVt09Zq8bLC9xTB+lVh2UkqUGWuyQ1yHKXpAZZ7pLUIMtdkhpkuUtSg1Yt9yQndj9Y8GD3Ve33dtM/lGR3kvu62xsHXnN5kp1Jvtb9Io4k6Qga5ktMzwCXVtW9SV4CbE9yRzfvV6vqvw4unORk4ALg1Sz/As7/SvIPq+rZSQaXJB3cqnvuVfVYVd3bPX4SeIjuZ8MO4lxgsaq+W1XfZPn62adPIqwkaTiHdfmB7keBXwv8IXAG8O4kb2H5RxIurarHWS7+uwde9iiH/jAYy47d+3nrZb871LK7rjxnWjEkaU0Z+qqQSdYB/xu4oqo+m2QO+DbLvyH5K8CGqnp7kl8H7q6qT3Wvuwb4XFXd9Jz1bQG2AMzNzZ22uLg40hvYu28/e54ebtlTNh4z0jYmaWlpiXXr1s06xlD6lBX6lbdPWaFfefuUFcbLu3nz5u1VNb/SvKH23JO8CPgMcH1VfRagqvYMzP8t4Lbu6W6+/weIT+imfZ+q2gpsBZifn6+FhYVhovyAq6+/hat2DPcHyK6LRtvGJG3bto1R3+uR1qes0K+8fcoK/crbp6wwvbzDnC0T4Brgoar68MD0DQOLvYnlny2D5V+Lv6D71ZuXAycBX55cZEnSaobZ5T0DeDOwI8l93bQPABcmOZXlYZldLP/SDVX1QJIbgQdZPtPmEs+UkaQja9Vyr6ovAVlh1u2HeM0VwBVj5JIkjcFvqEpSgyx3SWqQ5S5JDbLcJalBlrskNchyl6QGWe6S1CDLXZIaZLlLUoMsd0lqkOUuSQ2y3CWpQZa7JDXIcpekBlnuktQgy12SGmS5S1KDLHdJapDlLkkNstwlqUGWuyQ1yHKXpAZZ7pLUIMtdkhpkuUtSgyx3SWqQ5S5JDbLcJalBlrskNchyl6QGWe6S1CDLXZIaZLlLUoNWLfckJya5K8mDSR5I8t5u+kuT3JHk6939sd30JPlokp1JvprkddN+E5Kk7zfMnvszwKVVdTLweuCSJCcDlwF3VtVJwJ3dc4CzgZO62xbgYxNPLUk6pFXLvaoeq6p7u8dPAg8BG4Fzgeu6xa4Dzusenwt8spbdDaxPsmHiySVJB5WqGn7hZBPwReA1wJ9X1fpueoDHq2p9ktuAK6vqS928O4FfrKp7nrOuLSzv2TM3N3fa4uLiSG9g77797Hl6uGVP2XjMSNuYpKWlJdatWzfrGEPpU1boV94+ZYV+5e1TVhgv7+bNm7dX1fxK844adiVJ1gGfAd5XVU8s9/myqqokw39KLL9mK7AVYH5+vhYWFg7n5d9z9fW3cNWO4d7GrotG28Ykbdu2jVHf65HWp6zQr7x9ygr9ytunrDC9vEOdLZPkRSwX+/VV9dlu8p4Dwy3d/d5u+m7gxIGXn9BNkyQdIcOcLRPgGuChqvrwwKxbgYu7xxcDtwxMf0t31szrgf1V9dgEM0uSVjHMeMYZwJuBHUnu66Z9ALgSuDHJO4CHgfO7ebcDbwR2An8FvG2iiSVJq1q13LsDoznI7DNXWL6AS8bMJUkag99QlaQGWe6S1CDLXZIaZLlLUoMsd0lqkOUuSQ2y3CWpQZa7JDXIcpekBlnuktQgy12SGmS5S1KDLHdJapDlLkkNstwlqUGWuyQ1yHKXpAZZ7pLUIMtdkhpkuUtSgyx3SWqQ5S5JDbLcJ
alBlrskNchyl6QGWe6S1CDLXZIaZLlLUoMsd0lqkOUuSQ2y3CWpQZa7JDVo1XJPcm2SvUnuH5j2oSS7k9zX3d44MO/yJDuTfC3JG6YVXJJ0cMPsuX8COGuF6b9aVad2t9sBkpwMXAC8unvNf0/ywkmFlSQNZ9Vyr6ovAvuGXN+5wGJVfbeqvgnsBE4fI58kaQSpqtUXSjYBt1XVa7rnHwLeCjwB3ANcWlWPJ/l14O6q+lS33DXA56rqphXWuQXYAjA3N3fa4uLiSG9g77797Hl6uGVP2XjMSNuYpKWlJdatWzfrGEPpU1boV94+ZYV+5e1TVhgv7+bNm7dX1fxK844aMc/HgF8Bqru/Cnj74aygqrYCWwHm5+drYWFhpCBXX38LV+0Y7m3sumi0bUzStm3bGPW9Hml9ygr9ytunrNCvvH3KCtPLO9LZMlW1p6qeraq/BX6L/z/0shs4cWDRE7ppkqQjaKRyT7Jh4OmbgANn0twKXJDkxUleDpwEfHm8iJKkw7XqeEaSG4AF4LgkjwIfBBaSnMrysMwu4F0AVfVAkhuBB4FngEuq6tnpRJckHcyq5V5VF64w+ZpDLH8FcMU4oSRJ4/EbqpLUIMtdkhpkuUtSgyx3SWqQ5S5JDbLcJalBlrskNchyl6QGWe6S1CDLXZIaZLlLUoMsd0lqkOUuSQ2y3CWpQZa7JDXIcpekBlnuktQgy12SGmS5S1KDLHdJapDlLkkNstwlqUGWuyQ1yHKXpAZZ7pLUIMtdkhpkuUtSgyx3SWqQ5S5JDbLcJalBlrskNchyl6QGrVruSa5NsjfJ/QPTXprkjiRf7+6P7aYnyUeT7Ezy1SSvm2Z4SdLKhtlz/wRw1nOmXQbcWVUnAXd2zwHOBk7qbluAj00mpiTpcKxa7lX1RWDfcyafC1zXPb4OOG9g+idr2d3A+iQbJhVWkjScVNXqCyWbgNuq6jXd8+9U1frucYDHq2p9ktuAK6vqS928O4FfrKp7VljnFpb37pmbmzttcXFxpDewd99+9jw93LKnbDxmpG1M0tLSEuvWrZt1jKH0KSv0K2+fskK/8vYpK4yXd/Pmzduran6leUeNlQqoqkqy+ifED75uK7AVYH5+vhYWFkba/tXX38JVO4Z7G7suGm0bk7Rt2zZGfa9HWp+yQr/y9ikr9Ctvn7LC9PKOerbMngPDLd393m76buDEgeVO6KZJko6gUcv9VuDi7vHFwC0D09/SnTXzemB/VT02ZkZJ0mFadTwjyQ3AAnBckkeBDwJXAjcmeQfwMHB+t/jtwBuBncBfAW+bQmZJ0ipWLfequvAgs85cYdkCLhk3lCRpPH5DVZIaZLlLUoMsd0lqkOUuSQ2y3CWpQZa7JDXIcpekBlnuktQgy12SGmS5S1KDLHdJapDlLkkNstwlqUGWuyQ1yHKXpAZZ7pLUIMtdkhpkuUtSgyx3SWqQ5S5JDbLcJalBlrskNchyl6QGWe6S1CDLXZIaZLlLUoMsd0lqkOUuSQ2y3CWpQZa7JDXIcpekBlnuktSgo8Z5cZJdwJPAs8AzVTWf5KXAbwObgF3A+VX1+HgxJUmHYxJ77pur6tSqmu+eXwbcWVUnAXd2zyVJR9A0hmXOBa7rHl8HnDeFbUiSDmHcci/g95JsT7KlmzZXVY91j/8SmBtzG5Kkw5SqGv3Fycaq2p3k7wJ3AO8Bbq2q9QPLPF5Vx67w2i3AFoC5ubnTFhcXR8qwd99+9jw93LKnbDxmpG1M0tLSEuvWrZt1jKH0KSv0K2+fskK/8vYpK4yXd/PmzdsHhsS/z1gHVKtqd3e/N8nNwOnAniQbquqxJBuAvQd57VZgK8D8/HwtLCyMlOHq62/hqh3DvY1dF422jUnatm0bo77XI61PWaFfefuUFfqVt09ZYXp5Rx6WSXJ0kpcceAz8NHA/cCtwcbfYxcAt44aUJB2ecfbc54CbkxxYz6er6vNJvgLcmOQdwMPA+ePHlCQdjpHLvaq+AfzYC
tP/L3DmOKEkSePxG6qS1CDLXZIaZLlLUoMsd0lqkOUuSQ2y3CWpQZa7JDXIcpekBlnuktQgy12SGmS5S1KDLHdJatBY13PXcDZd9rvfe3zpKc/w1oHng3Zdec6RiiSpce65S1KDLHdJapDlLkkNstwlqUEeUBXw/Qd9D1jp4K8HfaV+sNx1WFb6EBiHHxbSdDgsI0kNcs+9h4bde3avWHr+cs9dkhrknvsaMunx7EmvT1J/WO5jsDwlrVWW+wosbUl9Z7nreelwPsA9MK0+8oCqJDXIPXfNlKd1StPxvCp3x9L7a5j/dpee8gwLM9o2+AGktcVhGUlq0PNqz13t868zaZnlLk3I4XywHOrnFg9wmEfjsNylNaoPV+D0eMTaNbUx9yRnJflakp1JLpvWdiRJP2gq5Z7khcB/A84GTgYuTHLyNLYlSfpB0xqWOR3YWVXfAEiyCJwLPDil7UlaxTjDPMMcI5jWtlcyq2GeaRyw/8RZR098nQCpqsmvNPk54Kyqemf3/M3Aj1fVuweW2QJs6Z6+EvjaiJs7Dvj2GHGPtD7l7VNW6FfePmWFfuXtU1YYL+/Lqur4lWbM7IBqVW0Fto67niT3VNX8BCIdEX3K26es0K+8fcoK/crbp6wwvbzTOqC6Gzhx4PkJ3TRJ0hEwrXL/CnBSkpcn+TvABcCtU9qWJOk5pjIsU1XPJHk38AXghcC1VfXANLbFBIZ2jrA+5e1TVuhX3j5lhX7l7VNWmFLeqRxQlSTNlhcOk6QGWe6S1KBel3ufLnGQ5Noke5PcP+ssq0lyYpK7kjyY5IEk7511pkNJ8kNJvpzkj7u8/3HWmVaT5IVJ/ijJbbPOspoku5LsSHJfkntmnedQkqxPclOSP0nyUJJ/NOtMB5Pkld2/6YHbE0neN7H193XMvbvEwZ8CPwU8yvIZOhdW1Zr8FmySnwCWgE9W1WtmnedQkmwANlTVvUleAmwHzlvD/7YBjq6qpSQvAr4EvLeq7p5xtINK8gvAPPCjVfUzs85zKEl2AfNVtea/GJTkOuAPqurj3Zl6P1JV35l1rtV0fbab5S97PjyJdfZ5z/17lzioqr8BDlziYE2qqi8C+2adYxhV9VhV3ds9fhJ4CNg421QHV8uWuqcv6m5rdq8lyQnAOcDHZ52lJUmOAX4CuAagqv6mD8XeORP4s0kVO/S73DcCjww8f5Q1XEB9lWQT8FrgD2eb5NC6YY77gL3AHVW1lvN+BHg/8LezDjKkAn4vyfbusiFr1cuBbwH/oxvy+niS6Vy4ZfIuAG6Y5Ar7XO6asiTrgM8A76uqJ2ad51Cq6tmqOpXlb0OfnmRNDn0l+Rlgb1Vtn3WWw/CPq+p1LF/l9ZJuiHEtOgp4HfCxqnot8BSwpo/FAXTDRz8L/M4k19vncvcSB1PUjV1/Bri+qj476zzD6v4Mvws4a9ZZDuIM4Ge7cexF4J8l+dRsIx1aVe3u7vcCN7M8JLoWPQo8OvBX200sl/1adzZwb1XtmeRK+1zuXuJgSroDlNcAD1XVh2edZzVJjk+yvnv8wywfZP+T2aZaWVVdXlUnVNUmlv+f/f2q+lczjnVQSY7uDqrTDXH8NLAmz/iqqr8EHknyym7SmfTjMuMXMuEhGejxz+wd4UscjC3JDcACcFySR4EPVtU1s011UGcAbwZ2dOPYAB+oqttnmOlQNgDXdWccvAC4sarW/CmGPTEH3Lz8ec9RwKer6vOzjXRI7wGu73b4vgG8bcZ5Dqn7wPwp4F0TX3dfT4WUJB1cn4dlJEkHYblLUoMsd0lqkOUuSQ2y3CWpQZa7JDXIcpekBv0/UwKZScNNSFUAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "fires[\"log_area\"] = np.log(fires[\"area\"] + 1)\n", "\n", "fires.hist(\"log_area\", bins=30)" ] }, { "cell_type": "markdown", "metadata": { "id": "cGXK7TxAReM9" }, "source": [ "We can see that performing a log-transformation doesn't produce a bell-shaped distribution, but it does spread out the data a bit more than without the transformation. It's probably the case that most fires do not appreciably damage the forest, so we would be mistaken in removing all of these rows. \n", "\n", "Instead of using `month` directly, we'll derive another feature called `summer` that takes a value of 1 when the fire occurred during the summer. The idea here is that summer months are typically hotter, so fires are more likely. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "id": "2JQ88NLeGJlt" }, "outputs": [], "source": [ "def is_summer_month(month):\n", " if month in [\"jun\", \"jul\", \"aug\"]:\n", " return 1\n", " else:\n", " return 0\n", "\n", "fires[\"summer\"] = [is_summer_month(m) for m in fires[\"month\"]]" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 424 }, "id": "59Xkc88p2wrY", "outputId": "bbfe51c4-e278-4778-8176-d91d76b06805" }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
FFMCDMCDCISItempRHwindrain
086.226.20000094.3000005.116.651.06.7000000.0
190.656.433333669.1000006.718.033.00.9000000.0
290.643.700000470.8333336.714.633.01.3000000.0
391.733.30000077.5000009.08.397.04.0000000.2
489.351.300000102.2000009.611.499.04.3333330.0
...........................
51281.656.700000665.6000001.927.832.02.7000000.0
51381.656.700000665.6000001.921.971.05.8000000.0
51481.656.700000665.6000001.921.270.06.7000000.0
51594.4146.000000614.70000011.325.642.04.0000000.0
51679.53.000000106.7000001.111.831.04.5000000.0
\n", "

517 rows × 8 columns

\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ], "text/plain": [ " FFMC DMC DC ISI temp RH wind rain\n", "0 86.2 26.200000 94.300000 5.1 16.6 51.0 6.700000 0.0\n", "1 90.6 56.433333 669.100000 6.7 18.0 33.0 0.900000 0.0\n", "2 90.6 43.700000 470.833333 6.7 14.6 33.0 1.300000 0.0\n", "3 91.7 33.300000 77.500000 9.0 8.3 97.0 4.000000 0.2\n", "4 89.3 51.300000 102.200000 9.6 11.4 99.0 4.333333 0.0\n", ".. ... ... ... ... ... ... ... ...\n", "512 81.6 56.700000 665.600000 1.9 27.8 32.0 2.700000 0.0\n", "513 81.6 56.700000 665.600000 1.9 21.9 71.0 5.800000 0.0\n", "514 81.6 56.700000 665.600000 1.9 21.2 70.0 6.700000 0.0\n", "515 94.4 146.000000 614.700000 11.3 25.6 42.0 4.000000 0.0\n", "516 79.5 3.000000 106.700000 1.1 11.8 31.0 4.500000 0.0\n", "\n", "[517 rows x 8 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.impute import KNNImputer\n", "\n", "imp = KNNImputer(missing_values = np.nan, n_neighbors=3)\n", "\n", "fires_missing = fires[fires.columns[5:13]] # FFMC to rain\n", "imputed = pd.DataFrame(imp.fit_transform(fires_missing), \n", " columns = fires.columns[5:13])\n", "imputed" ] }, { "cell_type": "markdown", "metadata": { "id": "-A9-irrtAnD8" }, "source": [ "We'll examine the data for outliers using boxplots:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 282 }, "id": "NKD1Jgul6_95", "outputId": "4c5096c2-c9a3-4aec-b40c-f004a78aa5ef" }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "imputed.boxplot(column=[\"FFMC\", \"DMC\", \"DC\", \"ISI\", \"temp\", \"RH\", \"wind\", \"rain\"])" ] }, { "cell_type": "markdown", "metadata": { "id": "lGOBnfyZCG60" }, "source": [ "The dots indicate that there are some outliers in the data. Let's examine the number of outliers in each of the columns." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "tPKtl1FfBsLa", "outputId": "fdb12867-9693-4dae-be06-7203af4f1c4a" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The FFMC column has 0 according to the boxplot method.\n", "The DMC column has 0 according to the boxplot method.\n", "The DC column has 0 according to the boxplot method.\n", "The ISI column has 0 according to the boxplot method.\n", "The temp column has 0 according to the boxplot method.\n", "The RH column has 0 according to the boxplot method.\n", "The wind column has 0 according to the boxplot method.\n", "The rain column has 0 according to the boxplot method.\n" ] } ], "source": [ "for col in imputed:\n", "\n", " quartiles = np.percentile(fires[col], [25, 50, 75])\n", " iqr = quartiles[2] - quartiles[0]\n", " lower_bound = quartiles[0] - (1.5 * iqr)\n", " upper_bound = quartiles[2] + (1.5 * iqr)\n", " num_outliers =sum((imputed[col] < lower_bound) | (imputed[col] > upper_bound))\n", "\n", " print(f\"The {col} column has {num_outliers} according to the boxplot method.\")" ] }, { "cell_type": "markdown", "metadata": { "id": "7CUialCuDUSx" }, "source": [ "Despite the visual cue in the boxplots, based on the actual calculations, there don't seem to be any outliers. In this case, we'll leave the dataset as-is. " ] }, { "cell_type": "markdown", "metadata": { "id": "kzs2mQR_OWoO" }, "source": [ "Now that the dataset has been inspected for missing values and outliers, we can proceed to standardize it. 
Standardizing puts the features on a common scale, which helps with regularization. Afterwards, we'll append the `summer` feature back into the dataset. " ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 424 }, "id": "N1Sxq-Pe5Vin", "outputId": "ea66385f-b731-44aa-9cde-d9ef1664c491" }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
summerFFMCDMCDCISItempRHwindrain
00-0.812283-1.335942-1.846711-0.860187-0.3981870.4187261.514159-0.073268
10-0.010735-0.8590090.509582-0.508736-0.155493-0.715565-1.761003-0.073268
20-0.010735-1.059878-0.303178-0.508736-0.744894-0.715565-1.535130-0.073268
300.189652-1.223939-1.915580-0.003526-1.8370213.317471-0.0104850.603155
40-0.247556-0.939988-1.8143270.128267-1.2996253.4435030.177742-0.073268
..............................
5121-1.650265-0.8548030.495235-1.5630871.543370-0.778581-0.744573-0.073268
5131-1.650265-0.8548030.495235-1.5630870.5205851.6790501.005944-0.073268
5141-1.650265-0.8548030.495235-1.5630870.3992381.6160341.514159-0.073268
51510.6815110.5539120.2865790.5016831.161993-0.148419-0.010485-0.073268
5160-2.032821-1.701924-1.795880-1.738812-1.230284-0.8415970.271856-0.073268
\n", "

517 rows × 9 columns

\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ], "text/plain": [ " summer FFMC DMC DC ISI temp RH \\\n", "0 0 -0.812283 -1.335942 -1.846711 -0.860187 -0.398187 0.418726 \n", "1 0 -0.010735 -0.859009 0.509582 -0.508736 -0.155493 -0.715565 \n", "2 0 -0.010735 -1.059878 -0.303178 -0.508736 -0.744894 -0.715565 \n", "3 0 0.189652 -1.223939 -1.915580 -0.003526 -1.837021 3.317471 \n", "4 0 -0.247556 -0.939988 -1.814327 0.128267 -1.299625 3.443503 \n", ".. ... ... ... ... ... ... ... \n", "512 1 -1.650265 -0.854803 0.495235 -1.563087 1.543370 -0.778581 \n", "513 1 -1.650265 -0.854803 0.495235 -1.563087 0.520585 1.679050 \n", "514 1 -1.650265 -0.854803 0.495235 -1.563087 0.399238 1.616034 \n", "515 1 0.681511 0.553912 0.286579 0.501683 1.161993 -0.148419 \n", "516 0 -2.032821 -1.701924 -1.795880 -1.738812 -1.230284 -0.841597 \n", "\n", " wind rain \n", "0 1.514159 -0.073268 \n", "1 -1.761003 -0.073268 \n", "2 -1.535130 -0.073268 \n", "3 -0.010485 0.603155 \n", "4 0.177742 -0.073268 \n", ".. ... ... \n", "512 -0.744573 -0.073268 \n", "513 1.005944 -0.073268 \n", "514 1.514159 -0.073268 \n", "515 -0.010485 -0.073268 \n", "516 0.271856 -0.073268 \n", "\n", "[517 rows x 9 columns]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.preprocessing import StandardScaler\n", "\n", "scaler = StandardScaler()\n", "scaled = scaler.fit_transform(imputed)\n", "scaled = pd.DataFrame(scaled, columns = fires.columns[5:13])\n", "\n", "final = pd.concat([fires[\"summer\"], scaled], axis=1)\n", "\n", "final" ] }, { "cell_type": "markdown", "metadata": { "id": "r2VHEBeRUS-G" }, "source": [ "# Subset Selection" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "x5pquOswH1fE", "outputId": "1b554434-91ba-49e0-b5cb-b621255d9b33" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Features selected in 2 feature model: ['FFMC' 'DC']\n", "Features selected in 4 feature model: ['FFMC' 'DC' 
'RH' 'wind']\n", "Features selected in 6 feature model: ['summer' 'FFMC' 'DC' 'ISI' 'RH' 'wind']\n" ] } ], "source": [ "from sklearn.feature_selection import SequentialFeatureSelector\n", "\n", "y = fires[\"log_area\"]\n", "\n", "sfs_model = LinearRegression()\n", "sfs_model2 = LinearRegression()\n", "sfs_model3 = LinearRegression()\n", "\n", "forward2 = SequentialFeatureSelector(estimator=sfs_model,\n", " n_features_to_select=2, \n", " direction=\"forward\")\n", "\n", "forward4 = SequentialFeatureSelector(estimator=sfs_model2,\n", " n_features_to_select=4, \n", " direction=\"forward\")\n", "\n", "forward6 = SequentialFeatureSelector(estimator=sfs_model3,\n", " n_features_to_select=6, \n", " direction=\"forward\")\n", "\n", "forward2.fit(final, y)\n", "forward4.fit(final, y)\n", "forward6.fit(final, y)\n", "\n", "print(\"Features selected in 2 feature model:\", forward2.get_feature_names_out())\n", "print(\"Features selected in 4 feature model:\", forward4.get_feature_names_out())\n", "print(\"Features selected in 6 feature model:\", forward6.get_feature_names_out())" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "krNnlYx7YG0J", "outputId": "a94f6c39-81a9-46be-f1f4-3f73cc8b3fd5" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Features selected in 2 feature model: ['DC' 'wind']\n", "Features selected in 4 feature model: ['FFMC' 'DC' 'RH' 'wind']\n", "Features selected in 6 feature model: ['summer' 'FFMC' 'DC' 'ISI' 'RH' 'wind']\n" ] } ], "source": [ "backward2 = SequentialFeatureSelector(estimator=sfs_model,\n", " n_features_to_select=2, \n", " direction=\"backward\")\n", "\n", "backward4 = SequentialFeatureSelector(estimator=sfs_model,\n", " n_features_to_select=4, \n", " direction=\"backward\")\n", "\n", "backward6 = SequentialFeatureSelector(estimator=sfs_model,\n", " n_features_to_select=6, \n", " direction=\"backward\")\n", "\n", "backward2.fit(final, y)\n", 
"backward4.fit(final, y)\n", "backward6.fit(final, y)\n", "\n", "print(\"Features selected in 2 feature model:\", backward2.get_feature_names_out())\n", "print(\"Features selected in 4 feature model:\", backward4.get_feature_names_out())\n", "print(\"Features selected in 6 feature model:\", backward6.get_feature_names_out())" ] }, { "cell_type": "markdown", "metadata": { "id": "lsujk2bVY9K2" }, "source": [ "Based on the features chosen by forward and backward selection, it seems like `DC`, `wind` and `FFMC` seem to be the most impactful on predicting `log_area`." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "id": "CgfBlupqJLad" }, "outputs": [], "source": [ "fw2_model = LinearRegression() # .fit(final[forward2.get_feature_names_out()], y)\n", "fw4_model = LinearRegression() # .fit(final[forward4.get_feature_names_out()], y)\n", "fw6_model = LinearRegression() # .fit(final[forward6.get_feature_names_out()], y)\n", "\n", "bw2_model = LinearRegression() # .fit(final[backward2.get_feature_names_out()], y)\n", "bw4_model = LinearRegression() # .fit(final[backward4.get_feature_names_out()], y)\n", "bw6_model = LinearRegression() # .fit(final[backward6.get_feature_names_out()], y)" ] }, { "cell_type": "markdown", "metadata": { "id": "vku4dsrH1H4B" }, "source": [ "# More Candidate Models\n", "\n", "Another approach we might consider taking is using regularized versions of linear regression. Fires have many factors that can increase the damaage they have, so it seems unhelpful to restrict our model to a univariate, non-linear model. There are such models; however, they were beyond the scope of the course, but they might be plausible candidates for further next steps." 
] }, { "cell_type": "code", "execution_count": 14, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Z0FTDs2FYrZ4", "outputId": "75520a53-6672-4da1-9fed-3322892e1f0a" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Ridge tuning parameter: 1372.2342342342342\n", "LASSO tuning parameter: 10000.0\n", "Ridge coefficients: [-0.01455017 0.01311215 0.02006457 0.02004741 -0.01073465 0.01297049\n", " -0.01489714 0.02670554 0.00816103]\n", "LASSO coefficients: [-0. 0. 0. 0. -0. 0. -0. 0. 0.]\n" ] } ], "source": [ "from sklearn.linear_model import LassoCV, RidgeCV\n", "\n", "ridge = RidgeCV(alphas = np.linspace(1, 10000, num=1000))\n", "lasso = LassoCV(alphas = np.linspace(1, 10000, num=1000))\n", "\n", "ridge.fit(final, y)\n", "lasso.fit(final, y)\n", "\n", "print(\"Ridge tuning parameter: \", ridge.alpha_)\n", "print(\"LASSO tuning parameter: \", lasso.alpha_)\n", "\n", "print(\"Ridge coefficients: \", ridge.coef_)\n", "print(\"LASSO coefficients: \", lasso.coef_)" ] }, { "cell_type": "markdown", "metadata": { "id": "rLItzlCRVmVC" }, "source": [ "LASSO selects the largest tuning parameter in the grid (10000), which shrinks every coefficient to zero. Given that the outcome has many small values, this suggests that a model with no features does at least as well as one with any. For ridge, we'll try to home in on a better tuning parameter value below by searching a narrower range."
] }, { "cell_type": "code", "execution_count": 15, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "7LtiwEd4VmHW", "outputId": "7984bf0e-28dc-4de2-aa61-e609cabc7b80" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Ridge tuning parameter: 1371.3713713713714\n" ] } ], "source": [ "ridge = RidgeCV(alphas = np.linspace(1000, 1500, num=1000))\n", "ridge.fit(final, y)\n", "print(\"Ridge tuning parameter: \", ridge.alpha_)" ] }, { "cell_type": "markdown", "metadata": { "id": "8lCKgqRaWWvd" }, "source": [ "We'll carry this value (about 1371.37, rounded to the hundredths place) into k-fold cross-validation. We'll use ridge regression and leave out the LASSO model, since LASSO shrank every coefficient to zero and so offers no useful predictors." ] }, { "cell_type": "markdown", "metadata": { "id": "KUbG_WwI879s" }, "source": [ "# K-Fold Cross-Validation" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "id": "icmdHtm_9VaV" }, "outputs": [], "source": [ "from sklearn.model_selection import cross_val_score \n", "\n", "# neg_mean_squared_error returns the negated MSE, so scores closer to zero are better.\n", "reference_cv = cross_val_score(reference, final[[\"wind\", \"temp\"]], y, cv = 5, scoring = \"neg_mean_squared_error\")\n", "fw2_cv = cross_val_score(fw2_model, final[forward2.get_feature_names_out()], y, cv = 5, scoring = \"neg_mean_squared_error\")\n", "fw4_cv = cross_val_score(fw4_model, final[forward4.get_feature_names_out()], y, cv = 5, scoring = \"neg_mean_squared_error\")\n", "fw6_cv = cross_val_score(fw6_model, final[forward6.get_feature_names_out()], y, cv = 5, scoring = \"neg_mean_squared_error\")\n", "bw2_cv = cross_val_score(bw2_model, final[backward2.get_feature_names_out()], y, cv = 5, scoring = \"neg_mean_squared_error\")\n", "bw4_cv = cross_val_score(bw4_model, final[backward4.get_feature_names_out()], y, cv = 5, scoring = \"neg_mean_squared_error\")\n", "bw6_cv = cross_val_score(bw6_model, final[backward6.get_feature_names_out()], y, cv = 5, scoring = \"neg_mean_squared_error\")\n", "ridge_cv = cross_val_score(ridge, final,
y, cv = 5, scoring = \"neg_mean_squared_error\")" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Q-hz_HaG_-Y6", "outputId": "5e328f7f-f31b-4d12-b5b1-e3d1345fe49b" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Reference Model, Avg Test MSE: -2.204650013004116 SD: 1.0600403553786375\n", "Forward-2 Model, Avg Test MSE: -2.1735431721198535 SD: 1.0208083278697586\n", "Forward-4 Model, Avg Test MSE: -2.193528106772711 SD: 1.0004774710977682\n", "Forward-6 Model, Avg Test MSE: -2.239722553934875 SD: 1.0123323877770343\n", "Backward-2 Model, Avg Test MSE: -2.173357302739327 SD: 1.0038109503795956\n", "Backward-4 Model, Avg Test MSE: -2.193528106772711 SD: 1.0004774710977682\n", "Backward-6 Model, Avg Test MSE: -2.239722553934875 SD: 1.0123323877770343\n", "Ridge Model, Avg Test MSE: -2.239722553934875 SD: 1.0123323877770343\n" ] } ], "source": [ "print(\"Reference Model, Avg Test MSE: \", np.mean(reference_cv), \" SD: \", np.std(reference_cv))\n", "print(\"Forward-2 Model, Avg Test MSE: \", np.mean(fw2_cv), \" SD: \", np.std(fw2_cv))\n", "print(\"Forward-4 Model, Avg Test MSE: \", np.mean(fw4_cv), \" SD: \", np.std(fw4_cv))\n", "print(\"Forward-6 Model, Avg Test MSE: \", np.mean(fw6_cv), \" SD: \", np.std(fw6_cv))\n", "print(\"Backward-2 Model, Avg Test MSE: \", np.mean(bw2_cv), \" SD: \", np.std(bw2_cv))\n", "print(\"Backward-4 Model, Avg Test MSE: \", np.mean(bw4_cv), \" SD: \", np.std(bw4_cv))\n", "print(\"Backward-6 Model, Avg Test MSE: \", np.mean(bw6_cv), \" SD: \", np.std(bw6_cv))\n", "print(\"Ridge Model, Avg Test MSE: \", np.mean(ridge_cv), \" SD: \", np.std(ridge_cv))" ] }, { "cell_type": "markdown", "metadata": { "id": "J6BPP8qmW7_x" }, "source": [ "Among the candidate models reported above, the backward selection model using two features performs the best, with an average test MSE of about 2.17 (scikit-learn reports it negated, as -2.17).
Note that this is on the log scale: it corresponds to an RMSE of roughly 1.47, and errors of that size in the log translate into sizable multiplicative errors on the original `area` scale. On the surface, this suggests that the models overall are not good predictors. \n", "\n", "However, this problem is known to be a difficult one. The extreme skew in the outcome undermines many of the assumptions needed by linear models. We hope this showcases that machine learning is not a universal fix: many problems have characteristics that make prediction difficult." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "id": "v6Sq9R9FRDqC" }, "outputs": [], "source": [] } ], "metadata": { "colab": { "collapsed_sections": [], "provenance": [] }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 1 }