{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import pandas\n",
    "\n",
    "board_games = pandas.read_csv(\"board_games.csv\")\n",
    "board_games = board_games.dropna(axis=0)\n",
    "board_games = board_games[board_games[\"users_rated\"] > 0]\n",
    "\n",
    "board_games.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "plt.hist(board_games[\"average_rating\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(board_games[\"average_rating\"].std())\n",
    "print(board_games[\"average_rating\"].mean())\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Error metric\n",
    "\n",
    "In this data set, using mean squared error as an error metric makes sense.  This is because the data is continuous, and follows a somewhat normal distribution.  We'll be able to compare our error to the standard deviation to see how good the model is at predictions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.cluster import KMeans\n",
    "\n",
    "clus = KMeans(n_clusters=5)\n",
    "cols = list(board_games.columns)\n",
    "cols.remove(\"name\")\n",
    "cols.remove(\"id\")\n",
    "cols.remove(\"type\")\n",
    "numeric = board_games[cols]\n",
    "\n",
    "clus.fit(numeric)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import numpy\n",
    "game_mean = numeric.apply(numpy.mean, axis=1)\n",
    "game_std = numeric.apply(numpy.std, axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "labels = clus.labels_\n",
    "\n",
    "plt.scatter(x=game_mean, y=game_std, c=labels)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Game clusters\n",
    "\n",
    "It looks like most of the games are similar, but as the game attributes tend to increase in value (such as number of users who rated), there are fewer high quality games.  So most games don't get played much, but a few get a lot of players."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "correlations = numeric.corr()\n",
    "\n",
    "correlations[\"average_rating\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Correlations\n",
    "\n",
    "The `yearpublished` column is surprisingly highly correlated with `average_rating`, showing that more recent games tend to be rated more highly.  Games for older players (`minage` is high) tend to be more highly rated.  The more \"weighty\" a game is (`average_weight` is high), the more highly it tends to be rated."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LinearRegression\n",
    "\n",
    "reg = LinearRegression()\n",
    "cols.remove(\"average_rating\")\n",
    "cols.remove(\"bayes_average_rating\")\n",
    "reg.fit(board_games[cols], board_games[\"average_rating\"])\n",
    "predictions = reg.predict(board_games[cols])\n",
    "\n",
    "numpy.mean((predictions - board_games[\"average_rating\"]) ** 2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Game clusters\n",
    "\n",
    "The error rate is close to the standard deviation of all board game ratings.  This indicates that our model may not have high predictive power.  We'll need to dig more into which games were scored well, and which ones weren't."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}