{ "cells": [ { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas\n", "\n", "bike_rentals = pandas.read_csv(\"bike_rental_hour.csv\")\n", "bike_rentals.head()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "import matplotlib.pyplot as plt\n", "\n", "plt.hist(bike_rentals[\"cnt\"])" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [], "source": [ "bike_rentals.corr()[\"cnt\"]" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def assign_label(hour):\n", " if hour >=0 and hour < 6:\n", " return 4\n", " elif hour >=6 and hour < 12:\n", " return 1\n", " elif hour >= 12 and hour < 18:\n", " return 2\n", " elif hour >= 18 and hour <=24:\n", " return 3\n", "\n", "bike_rentals[\"time_label\"] = bike_rentals[\"hr\"].apply(assign_label)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Error metric\n", "\n", "The mean squared error metric makes the most sense to evaluate our error. MSE works on continuous numeric data, which fits our data quite well." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [], "source": [ "train = bike_rentals.sample(frac=.8)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [], "source": [ "test = bike_rentals.loc[~bike_rentals.index.isin(train.index)]" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression\n", "\n", "predictors = list(train.columns)\n", "predictors.remove(\"cnt\")\n", "predictors.remove(\"casual\")\n", "predictors.remove(\"registered\")\n", "predictors.remove(\"dteday\")\n", "\n", "reg = LinearRegression()\n", "\n", "reg.fit(train[predictors], train[\"cnt\"])" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy\n", "predictions = reg.predict(test[predictors])\n", "\n", "numpy.mean((predictions - test[\"cnt\"]) ** 2)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [], "source": [ "actual" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [], "source": [ "test[\"cnt\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Error\n", "\n", "The error is very high, which may be due to the fact that the data has a few extremely high rental counts, but otherwise mostly low counts. Larger errors are penalized more with MSE, which leads to a higher total error." 
] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.tree import DecisionTreeRegressor\n", "\n", "reg = DecisionTreeRegressor(min_samples_leaf=5)\n", "\n", "reg.fit(train[predictors], train[\"cnt\"])" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [], "source": [ "predictions = reg.predict(test[predictors])\n", "\n", "numpy.mean((predictions - test[\"cnt\"]) ** 2)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [], "source": [ "reg = DecisionTreeRegressor(min_samples_leaf=2)\n", "\n", "reg.fit(train[predictors], train[\"cnt\"])\n", "\n", "predictions = reg.predict(test[predictors])\n", "\n", "numpy.mean((predictions - test[\"cnt\"]) ** 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Decision tree error\n", "\n", "By taking the nonlinear predictors into account, the decision tree regressor appears to have much higher accuracy than linear regression." ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestRegressor\n", "\n", "reg = RandomForestRegressor(min_samples_leaf=5)\n", "reg.fit(train[predictors], train[\"cnt\"])" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [], "source": [ "predictions = reg.predict(test[predictors])\n", "\n", "numpy.mean((predictions - test[\"cnt\"]) ** 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Random forest error\n", "\n", "By removing some of the sources of overfitting, the random forest accuracy is improved over the decision tree accuracy." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.0" } }, "nbformat": 4, "nbformat_minor": 0 }