
Update Mission764Solutions.ipynb

darinbradley 2 years ago
parent
commit
b3725444a1
1 changed file with 11 additions and 11 deletions

+ 11 - 11
Mission764Solutions.ipynb

@@ -87,9 +87,9 @@
    "source": [
     "# Data Processing\n",
     "\n",
-    "First, we'll convert the `month` column into a categorical features. Instead of using the strings, we'll convert it into an indicator for the summer months in the northern hemisphere.\n",
+    "First, we'll convert the `month` column into a categorical feature. Instead of using the strings, we'll convert it into an indicator for the summer months in the northern hemisphere.\n",
     "\n",
-    "For completeness, we'll impute all of the features so that we can have the biggest set to choose from for sequential feature selection. We'll go with K-nearest neighbors imputation since we expect area damage to be similar among similar fires. "
+    "For the sake of completion, we'll impute all of the features so that we can have the biggest set to choose from for sequential feature selection. We'll go with K-nearest neighbors imputation since we expect area damage to be similar among similar fires. "
    ]
   },
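A minimal sketch of the imputation step described in this cell, assuming a pandas DataFrame named `fires` loaded from a hypothetical `forestfires.csv` (the variable name, file name, and neighbor count are illustrative, not the notebook's actual values):

```python
import pandas as pd
from sklearn.impute import KNNImputer

fires = pd.read_csv("forestfires.csv")  # hypothetical file name

# Impute every numeric feature with K-nearest neighbors so the widest
# possible feature set is available for sequential feature selection.
numeric_cols = fires.select_dtypes("number").columns
fires[numeric_cols] = KNNImputer(n_neighbors=5).fit_transform(fires[numeric_cols])
```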
   {
@@ -138,7 +138,7 @@
     "id": "apVd2ZvbQCie"
    },
    "source": [
-    "The outcome is highly right-skewed with extremely damaging fires. Furthermore, many of the rows have outcome values that are zero or near-zero. It might be worth it to log-transform the data. Note though that some of the outcomes are actually 0, so we can add `1` beforehand to prevent any errors. Recall that $log(0)$ is undefined."
+    "The outcome is highly right-skewed with extremely damaging fires. Furthermore, many of the rows have outcome values that are zero or near-zero. It might be worth it to log-transform the data. Note though that some of the outcomes are actually 0, so we can add `1` to prevent any errors. Recall that $log(0)$ is undefined."
    ]
   },
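Continuing the sketch above, one way to express the shift-then-log transform is NumPy's `log1p`, which computes $log(1 + x)$ in a single step; `area` as the outcome column is an assumption based on the standard forest-fires dataset:

```python
import numpy as np

# log(area + 1): the +1 shift keeps the zero-damage fires defined,
# since log(0) is undefined.
fires["log_area"] = np.log1p(fires["area"])
```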
   {
@@ -189,9 +189,9 @@
     "id": "cGXK7TxAReM9"
    },
    "source": [
-    "We can see that performing a log-transformation doesn't produce a bell-shaped distribution, but it does spread out a data a bit more than without the transformation. It's probably the case that most fires do not appreciably damage the forest, so we would be mistaken in removing all of these rows. \n",
+    "We can see that performing a log-transformation doesn't produce a bell-shaped distribution, but it does spread out the data a bit more than without the transformation. It's probably the case that most fires do not appreciably damage the forest, so we would be mistaken in removing all of these rows. \n",
     "\n",
-    "Instead of using `month` directly, we'll derive another feature called `summer` that takes a value of 1 when the fire occurred during the summer. The intuition here is that summer months are typically hotter, so fires are more likely. "
+    "Instead of using `month` directly, we'll derive another feature called `summer` that takes a value of 1 when the fire occurred during the summer. The idea here is that summer months are typically hotter, so fires are more likely. "
    ]
   },
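A short sketch of the `summer` derivation, assuming the lowercase three-letter month abbreviations used by the UCI forest-fires data:

```python
# 1 if the fire occurred during a northern-hemisphere summer month.
summer_months = {"jun", "jul", "aug"}
fires["summer"] = fires["month"].isin(summer_months).astype(int)
```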
   {
@@ -594,7 +594,7 @@
     "id": "7CUialCuDUSx"
    },
    "source": [
-    "Despite the visual cue in the boxplots, based on the actual calculations, there don't seem to be any outliers. In this case, we'll leave the dataset as is. "
+    "Despite the visual cue in the boxplots, based on the actual calculations, there don't seem to be any outliers. In this case, we'll leave the dataset as-is. "
    ]
   },
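The "actual calculations" likely refer to an interquartile-range rule; the sketch below uses a 3 × IQR fence for extreme outliers, though the notebook's exact threshold isn't shown here:

```python
# Count values outside the k * IQR fences for each numeric feature.
k = 3.0  # extreme-outlier fence; the notebook's threshold may differ
q1 = fires[numeric_cols].quantile(0.25)
q3 = fires[numeric_cols].quantile(0.75)
iqr = q3 - q1
outliers = fires[numeric_cols].lt(q1 - k * iqr) | fires[numeric_cols].gt(q3 + k * iqr)
print(outliers.sum())  # per-column counts; all zeros matches the text
```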
   {
@@ -1051,7 +1051,7 @@
    "source": [
     "# More Candidate Models\n",
     "\n",
-    "Another approach we might consider taking is using regularized versions of linear regression. Fires have many factors that can increase the damaage they have, so it seems unhelpful to restrict our model to a univariate, non-linear model. There are such models, but they were out of the scope of the course, but they might be plausible candidates for further next steps."
+    "Another approach we might consider taking is using regularized versions of linear regression. Fires have many factors that can increase the damaage they have, so it seems unhelpful to restrict our model to a univariate, non-linear model. There are such models; however, they were beyond the scope of the course, but they might be plausible candidates for further next steps."
    ]
   },
   {
@@ -1099,7 +1099,7 @@
     "id": "rLItzlCRVmVC"
    },
    "source": [
-    "The LASSO tuning parameter always seems to be on the extreme. Given the outcome has many small values, it suggests that having no features at all is better than having any. We'll try to hone in on a better tuning parameter value below by choosing a smaller range to pick from."
+    "The LASSO tuning parameter always seems to be on the extreme. Given that the outcome has many small values, it suggests that having no features at all is better than having any. We'll try to home in on a better tuning parameter value below by choosing a smaller range to pick from."
    ]
   },
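A sketch of that narrower second search, continuing the earlier sketches; the alpha range, fold count, and feature setup here are illustrative assumptions, not the notebook's values:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Imputed numeric features and the log-transformed outcome.
X = fires.select_dtypes("number").drop(columns=["area", "log_area"])
y = fires["log_area"]

# Second pass over a narrower band of alphas, since the first search
# kept landing on an extreme of its grid.
search = GridSearchCV(
    Lasso(max_iter=10_000),
    param_grid={"alpha": np.linspace(0.01, 0.5, 50)},  # illustrative range
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)
print(search.best_params_)
```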
   {
@@ -1133,7 +1133,7 @@
     "id": "8lCKgqRaWWvd"
    },
    "source": [
-    "We'll use this value in k-fold cross-validation, rounded to the hundredths place. We'll use a ridge regression and choose not to use a LASSO model here since the regularization results aren't helpful here."
+    "We'll use this value in k-fold cross-validation, rounded to the hundredths place. We'll use a ridge regression and choose not to use a LASSO model here since the regularization results aren't helpful."
    ]
   },
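Continuing the sketch, k-fold cross-validation of a ridge model at the tuned, rounded penalty (the alpha shown is a placeholder, not the notebook's value):

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

ridge = Ridge(alpha=0.05)  # placeholder: substitute the tuned value, rounded
scores = cross_val_score(ridge, X, y, scoring="neg_mean_squared_error", cv=5)
print(scores.mean())
```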
   {
@@ -1208,9 +1208,9 @@
     "id": "J6BPP8qmW7_x"
    },
    "source": [
-    "Among our candidate models, the backward selection model using 2 features performs the best, with an average MSE of -2.17. However, note that this is on the log-scale, so this suggests that the predictions are off by a magnitude of about 2. On the surface, this suggests that the models overall are not good predictors. \n",
+    "Among our candidate models, the backward selection model using two features performs the best, with an average MSE of -2.17. However, note that this is on the log-scale, so this suggests that the predictions are off by a magnitude of about 2. On the surface, this suggests that the models overall are not good predictors. \n",
     "\n",
-    "However, this problem is known to be a difficult one. The extreme skew in the outcome hurts many of the assumptions needed by linear models. We hope that this showcases that machine learning is not a silver bullet. Several problems have characteristics that make prediction difficult."
+    "However, this problem is known to be a difficult one. The extreme skew in the outcome hurts many of the assumptions needed by linear models. We hope that this showcases that machine learning is not a universal fix. Several problems have characteristics that make prediction difficult."
    ]
   },
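For reference, a backward selection model like the one described can be sketched with scikit-learn's `SequentialFeatureSelector`; the base estimator and fold count are assumptions:

```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Backward selection down to two features, scored by negative MSE.
selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=2,
    direction="backward",
    scoring="neg_mean_squared_error",
    cv=5,
)
selector.fit(X, y)
chosen = X.columns[selector.get_support()]
print(chosen)

# Cross-validate the two-feature model on the log outcome.
print(cross_val_score(LinearRegression(), X[chosen], y,
                      scoring="neg_mean_squared_error", cv=5).mean())
```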
   {