added 294

srinify 6 years ago
parent
commit
1cc5a7ef7a
1 changed files with 2253 additions and 0 deletions
  1. 2253 0
      Mission294Solutions.ipynb

+ 2253 - 0
Mission294Solutions.ipynb

@@ -0,0 +1,2253 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Analyzing Used Car Listings on eBay Kleinanzeigen"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We will be working on a dataset of used cars from *eBay Kleinanzeigen*, a [classifieds](https://en.wikipedia.org/wiki/Classified_advertising) section of the German eBay website.\n",
+    "\n",
+    "The dataset was originally scraped and uploaded to [Kaggle](https://www.kaggle.com/orgesleka/used-cars-database/data). The version we are working with is a sample of 50,000 data points prepared by [Dataquest](https://www.dataquest.io), which also modified it to simulate a less-cleaned version of the data.\n",
+    "\n",
+    "The data dictionary provided with the data is as follows:\n",
+    "\n",
+    "- `dateCrawled` - When this ad was first crawled. All field-values are taken from this date.\n",
+    "- `name` - Name of the car.\n",
+    "- `seller` - Whether the seller is private or a dealer.\n",
+    "- `offerType` - The type of listing.\n",
+    "- `price` - The price on the ad to sell the car.\n",
+    "- `abtest` - Whether the listing is included in an A/B test.\n",
+    "- `vehicleType` - The vehicle type.\n",
+    "- `yearOfRegistration` - The year in which the car was first registered.\n",
+    "- `gearbox` - The transmission type.\n",
+    "- `powerPS` - The power of the car in PS.\n",
+    "- `model` - The car model name.\n",
+    "- `kilometer` - How many kilometers the car has driven.\n",
+    "- `monthOfRegistration` - The month in which the car was first registered.\n",
+    "- `fuelType` - What type of fuel the car uses.\n",
+    "- `brand` - The brand of the car.\n",
+    "- `notRepairedDamage` - Whether the car has damage that has not yet been repaired.\n",
+    "- `dateCreated` - The date on which the eBay listing was created.\n",
+    "- `nrOfPictures` - The number of pictures in the ad.\n",
+    "- `postalCode` - The postal code for the location of the vehicle.\n",
+    "- `lastSeenOnline` - When the crawler saw this ad last online.\n",
+    "\n",
+    "\n",
+    "The aim of this project is to clean the data and analyze the included used car listings."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "import numpy as np"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<class 'pandas.core.frame.DataFrame'>\n",
+      "RangeIndex: 50000 entries, 0 to 49999\n",
+      "Data columns (total 20 columns):\n",
+      "dateCrawled            50000 non-null object\n",
+      "name                   50000 non-null object\n",
+      "seller                 50000 non-null object\n",
+      "offerType              50000 non-null object\n",
+      "price                  50000 non-null object\n",
+      "abtest                 50000 non-null object\n",
+      "vehicleType            44905 non-null object\n",
+      "yearOfRegistration     50000 non-null int64\n",
+      "gearbox                47320 non-null object\n",
+      "powerPS                50000 non-null int64\n",
+      "model                  47242 non-null object\n",
+      "odometer               50000 non-null object\n",
+      "monthOfRegistration    50000 non-null int64\n",
+      "fuelType               45518 non-null object\n",
+      "brand                  50000 non-null object\n",
+      "notRepairedDamage      40171 non-null object\n",
+      "dateCreated            50000 non-null object\n",
+      "nrOfPictures           50000 non-null int64\n",
+      "postalCode             50000 non-null int64\n",
+      "lastSeen               50000 non-null object\n",
+      "dtypes: int64(5), object(15)\n",
+      "memory usage: 7.6+ MB\n"
+     ]
+    },
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>dateCrawled</th>\n",
+       "      <th>name</th>\n",
+       "      <th>seller</th>\n",
+       "      <th>offerType</th>\n",
+       "      <th>price</th>\n",
+       "      <th>abtest</th>\n",
+       "      <th>vehicleType</th>\n",
+       "      <th>yearOfRegistration</th>\n",
+       "      <th>gearbox</th>\n",
+       "      <th>powerPS</th>\n",
+       "      <th>model</th>\n",
+       "      <th>odometer</th>\n",
+       "      <th>monthOfRegistration</th>\n",
+       "      <th>fuelType</th>\n",
+       "      <th>brand</th>\n",
+       "      <th>notRepairedDamage</th>\n",
+       "      <th>dateCreated</th>\n",
+       "      <th>nrOfPictures</th>\n",
+       "      <th>postalCode</th>\n",
+       "      <th>lastSeen</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>2016-03-26 17:47:46</td>\n",
+       "      <td>Peugeot_807_160_NAVTECH_ON_BOARD</td>\n",
+       "      <td>privat</td>\n",
+       "      <td>Angebot</td>\n",
+       "      <td>$5,000</td>\n",
+       "      <td>control</td>\n",
+       "      <td>bus</td>\n",
+       "      <td>2004</td>\n",
+       "      <td>manuell</td>\n",
+       "      <td>158</td>\n",
+       "      <td>andere</td>\n",
+       "      <td>150,000km</td>\n",
+       "      <td>3</td>\n",
+       "      <td>lpg</td>\n",
+       "      <td>peugeot</td>\n",
+       "      <td>nein</td>\n",
+       "      <td>2016-03-26 00:00:00</td>\n",
+       "      <td>0</td>\n",
+       "      <td>79588</td>\n",
+       "      <td>2016-04-06 06:45:54</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>2016-04-04 13:38:56</td>\n",
+       "      <td>BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik</td>\n",
+       "      <td>privat</td>\n",
+       "      <td>Angebot</td>\n",
+       "      <td>$8,500</td>\n",
+       "      <td>control</td>\n",
+       "      <td>limousine</td>\n",
+       "      <td>1997</td>\n",
+       "      <td>automatik</td>\n",
+       "      <td>286</td>\n",
+       "      <td>7er</td>\n",
+       "      <td>150,000km</td>\n",
+       "      <td>6</td>\n",
+       "      <td>benzin</td>\n",
+       "      <td>bmw</td>\n",
+       "      <td>nein</td>\n",
+       "      <td>2016-04-04 00:00:00</td>\n",
+       "      <td>0</td>\n",
+       "      <td>71034</td>\n",
+       "      <td>2016-04-06 14:45:08</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>2016-03-26 18:57:24</td>\n",
+       "      <td>Volkswagen_Golf_1.6_United</td>\n",
+       "      <td>privat</td>\n",
+       "      <td>Angebot</td>\n",
+       "      <td>$8,990</td>\n",
+       "      <td>test</td>\n",
+       "      <td>limousine</td>\n",
+       "      <td>2009</td>\n",
+       "      <td>manuell</td>\n",
+       "      <td>102</td>\n",
+       "      <td>golf</td>\n",
+       "      <td>70,000km</td>\n",
+       "      <td>7</td>\n",
+       "      <td>benzin</td>\n",
+       "      <td>volkswagen</td>\n",
+       "      <td>nein</td>\n",
+       "      <td>2016-03-26 00:00:00</td>\n",
+       "      <td>0</td>\n",
+       "      <td>35394</td>\n",
+       "      <td>2016-04-06 20:15:37</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>2016-03-12 16:58:10</td>\n",
+       "      <td>Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...</td>\n",
+       "      <td>privat</td>\n",
+       "      <td>Angebot</td>\n",
+       "      <td>$4,350</td>\n",
+       "      <td>control</td>\n",
+       "      <td>kleinwagen</td>\n",
+       "      <td>2007</td>\n",
+       "      <td>automatik</td>\n",
+       "      <td>71</td>\n",
+       "      <td>fortwo</td>\n",
+       "      <td>70,000km</td>\n",
+       "      <td>6</td>\n",
+       "      <td>benzin</td>\n",
+       "      <td>smart</td>\n",
+       "      <td>nein</td>\n",
+       "      <td>2016-03-12 00:00:00</td>\n",
+       "      <td>0</td>\n",
+       "      <td>33729</td>\n",
+       "      <td>2016-03-15 03:16:28</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>2016-04-01 14:38:50</td>\n",
+       "      <td>Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...</td>\n",
+       "      <td>privat</td>\n",
+       "      <td>Angebot</td>\n",
+       "      <td>$1,350</td>\n",
+       "      <td>test</td>\n",
+       "      <td>kombi</td>\n",
+       "      <td>2003</td>\n",
+       "      <td>manuell</td>\n",
+       "      <td>0</td>\n",
+       "      <td>focus</td>\n",
+       "      <td>150,000km</td>\n",
+       "      <td>7</td>\n",
+       "      <td>benzin</td>\n",
+       "      <td>ford</td>\n",
+       "      <td>nein</td>\n",
+       "      <td>2016-04-01 00:00:00</td>\n",
+       "      <td>0</td>\n",
+       "      <td>39218</td>\n",
+       "      <td>2016-04-01 14:38:50</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "           dateCrawled                                               name  \\\n",
+       "0  2016-03-26 17:47:46                   Peugeot_807_160_NAVTECH_ON_BOARD   \n",
+       "1  2016-04-04 13:38:56         BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik   \n",
+       "2  2016-03-26 18:57:24                         Volkswagen_Golf_1.6_United   \n",
+       "3  2016-03-12 16:58:10  Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...   \n",
+       "4  2016-04-01 14:38:50  Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...   \n",
+       "\n",
+       "   seller offerType   price   abtest vehicleType  yearOfRegistration  \\\n",
+       "0  privat   Angebot  $5,000  control         bus                2004   \n",
+       "1  privat   Angebot  $8,500  control   limousine                1997   \n",
+       "2  privat   Angebot  $8,990     test   limousine                2009   \n",
+       "3  privat   Angebot  $4,350  control  kleinwagen                2007   \n",
+       "4  privat   Angebot  $1,350     test       kombi                2003   \n",
+       "\n",
+       "     gearbox  powerPS   model   odometer  monthOfRegistration fuelType  \\\n",
+       "0    manuell      158  andere  150,000km                    3      lpg   \n",
+       "1  automatik      286     7er  150,000km                    6   benzin   \n",
+       "2    manuell      102    golf   70,000km                    7   benzin   \n",
+       "3  automatik       71  fortwo   70,000km                    6   benzin   \n",
+       "4    manuell        0   focus  150,000km                    7   benzin   \n",
+       "\n",
+       "        brand notRepairedDamage          dateCreated  nrOfPictures  \\\n",
+       "0     peugeot              nein  2016-03-26 00:00:00             0   \n",
+       "1         bmw              nein  2016-04-04 00:00:00             0   \n",
+       "2  volkswagen              nein  2016-03-26 00:00:00             0   \n",
+       "3       smart              nein  2016-03-12 00:00:00             0   \n",
+       "4        ford              nein  2016-04-01 00:00:00             0   \n",
+       "\n",
+       "   postalCode             lastSeen  \n",
+       "0       79588  2016-04-06 06:45:54  \n",
+       "1       71034  2016-04-06 14:45:08  \n",
+       "2       35394  2016-04-06 20:15:37  \n",
+       "3       33729  2016-03-15 03:16:28  \n",
+       "4       39218  2016-04-01 14:38:50  "
+      ]
+     },
+     "execution_count": 2,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "autos = pd.read_csv('autos.csv', encoding='Latin-1')\n",
+    "autos.info()\n",
+    "autos.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Our dataset contains 20 columns: 15 are stored as strings and 5 as integers. Five columns contain null values, but none has more than ~20% nulls. Three columns (`dateCrawled`, `dateCreated`, and `lastSeen`) contain dates stored as strings.\n",
+    "\n",
+    "We'll start by cleaning the column names to make the data easier to work with."
+   ]
+  },
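+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The null-value estimate above can be checked directly. The following is a minimal sketch, not part of the original solution: it uses a small hypothetical frame (`sample`) in place of `autos` and computes the share of missing values per column with `isnull().mean()`.\n",
+    "\n",
+    "```python\n",
+    "import pandas as pd\n",
+    "\n",
+    "# Hypothetical stand-in for `autos`; values are made up for illustration\n",
+    "sample = pd.DataFrame({\n",
+    "    'vehicleType': ['bus', None, 'kombi', 'limousine'],\n",
+    "    'gearbox': ['manuell', 'automatik', None, None],\n",
+    "    'price': ['$5,000', '$8,500', '$8,990', '$4,350'],\n",
+    "})\n",
+    "\n",
+    "# isnull() gives a boolean frame; the column mean is the fraction of nulls\n",
+    "null_pct = sample.isnull().mean() * 100\n",
+    "print(null_pct.sort_values(ascending=False))\n",
+    "```\n",
+    "\n",
+    "Running the same two lines on `autos` would list each column's null percentage, with `notRepairedDamage` expected at the top given the counts printed by `autos.info()`."
+   ]
+  },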
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Clean Columns"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',\n",
+       "       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',\n",
+       "       'odometer', 'monthOfRegistration', 'fuelType', 'brand',\n",
+       "       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',\n",
+       "       'lastSeen'],\n",
+       "      dtype='object')"
+      ]
+     },
+     "execution_count": 3,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "autos.columns"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We'll make a few changes here:\n",
+    "\n",
+    "- Change the column names from camelCase to snake_case.\n",
+    "- Reword a few names to describe the columns more accurately."
+   ]
+  },
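+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The renaming described above could also be done programmatically. This is a minimal sketch, not the approach this notebook takes (the next cell assigns the new names by hand). `camel_to_snake` and `overrides` are hypothetical helper names; the override values are the reworded names used in this notebook.\n",
+    "\n",
+    "```python\n",
+    "import re\n",
+    "\n",
+    "def camel_to_snake(name):\n",
+    "    # Put an underscore between a lowercase letter or digit and the capital after it\n",
+    "    return re.sub('([a-z0-9])([A-Z])', r'\\1_\\2', name).lower()\n",
+    "\n",
+    "# Names that are reworded, which a pure case conversion cannot produce\n",
+    "overrides = {\n",
+    "    'abtest': 'ab_test',\n",
+    "    'yearOfRegistration': 'registration_year',\n",
+    "    'monthOfRegistration': 'registration_month',\n",
+    "    'notRepairedDamage': 'unrepaired_damage',\n",
+    "    'dateCreated': 'ad_created',\n",
+    "    'nrOfPictures': 'num_photos',\n",
+    "}\n",
+    "\n",
+    "original = ['dateCrawled', 'offerType', 'powerPS', 'yearOfRegistration', 'nrOfPictures']\n",
+    "cleaned = [overrides.get(col, camel_to_snake(col)) for col in original]\n",
+    "print(cleaned)\n",
+    "# ['date_crawled', 'offer_type', 'power_ps', 'registration_year', 'num_photos']\n",
+    "```\n",
+    "\n",
+    "Only a subset of the columns is shown; the same list comprehension would work on `autos.columns`."
+   ]
+  },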
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>date_crawled</th>\n",
+       "      <th>name</th>\n",
+       "      <th>seller</th>\n",
+       "      <th>offer_type</th>\n",
+       "      <th>price</th>\n",
+       "      <th>ab_test</th>\n",
+       "      <th>vehicle_type</th>\n",
+       "      <th>registration_year</th>\n",
+       "      <th>gearbox</th>\n",
+       "      <th>power_ps</th>\n",
+       "      <th>model</th>\n",
+       "      <th>odometer</th>\n",
+       "      <th>registration_month</th>\n",
+       "      <th>fuel_type</th>\n",
+       "      <th>brand</th>\n",
+       "      <th>unrepaired_damage</th>\n",
+       "      <th>ad_created</th>\n",
+       "      <th>num_photos</th>\n",
+       "      <th>postal_code</th>\n",
+       "      <th>last_seen</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>2016-03-26 17:47:46</td>\n",
+       "      <td>Peugeot_807_160_NAVTECH_ON_BOARD</td>\n",
+       "      <td>privat</td>\n",
+       "      <td>Angebot</td>\n",
+       "      <td>$5,000</td>\n",
+       "      <td>control</td>\n",
+       "      <td>bus</td>\n",
+       "      <td>2004</td>\n",
+       "      <td>manuell</td>\n",
+       "      <td>158</td>\n",
+       "      <td>andere</td>\n",
+       "      <td>150,000km</td>\n",
+       "      <td>3</td>\n",
+       "      <td>lpg</td>\n",
+       "      <td>peugeot</td>\n",
+       "      <td>nein</td>\n",
+       "      <td>2016-03-26 00:00:00</td>\n",
+       "      <td>0</td>\n",
+       "      <td>79588</td>\n",
+       "      <td>2016-04-06 06:45:54</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>2016-04-04 13:38:56</td>\n",
+       "      <td>BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik</td>\n",
+       "      <td>privat</td>\n",
+       "      <td>Angebot</td>\n",
+       "      <td>$8,500</td>\n",
+       "      <td>control</td>\n",
+       "      <td>limousine</td>\n",
+       "      <td>1997</td>\n",
+       "      <td>automatik</td>\n",
+       "      <td>286</td>\n",
+       "      <td>7er</td>\n",
+       "      <td>150,000km</td>\n",
+       "      <td>6</td>\n",
+       "      <td>benzin</td>\n",
+       "      <td>bmw</td>\n",
+       "      <td>nein</td>\n",
+       "      <td>2016-04-04 00:00:00</td>\n",
+       "      <td>0</td>\n",
+       "      <td>71034</td>\n",
+       "      <td>2016-04-06 14:45:08</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>2016-03-26 18:57:24</td>\n",
+       "      <td>Volkswagen_Golf_1.6_United</td>\n",
+       "      <td>privat</td>\n",
+       "      <td>Angebot</td>\n",
+       "      <td>$8,990</td>\n",
+       "      <td>test</td>\n",
+       "      <td>limousine</td>\n",
+       "      <td>2009</td>\n",
+       "      <td>manuell</td>\n",
+       "      <td>102</td>\n",
+       "      <td>golf</td>\n",
+       "      <td>70,000km</td>\n",
+       "      <td>7</td>\n",
+       "      <td>benzin</td>\n",
+       "      <td>volkswagen</td>\n",
+       "      <td>nein</td>\n",
+       "      <td>2016-03-26 00:00:00</td>\n",
+       "      <td>0</td>\n",
+       "      <td>35394</td>\n",
+       "      <td>2016-04-06 20:15:37</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>2016-03-12 16:58:10</td>\n",
+       "      <td>Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...</td>\n",
+       "      <td>privat</td>\n",
+       "      <td>Angebot</td>\n",
+       "      <td>$4,350</td>\n",
+       "      <td>control</td>\n",
+       "      <td>kleinwagen</td>\n",
+       "      <td>2007</td>\n",
+       "      <td>automatik</td>\n",
+       "      <td>71</td>\n",
+       "      <td>fortwo</td>\n",
+       "      <td>70,000km</td>\n",
+       "      <td>6</td>\n",
+       "      <td>benzin</td>\n",
+       "      <td>smart</td>\n",
+       "      <td>nein</td>\n",
+       "      <td>2016-03-12 00:00:00</td>\n",
+       "      <td>0</td>\n",
+       "      <td>33729</td>\n",
+       "      <td>2016-03-15 03:16:28</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>2016-04-01 14:38:50</td>\n",
+       "      <td>Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...</td>\n",
+       "      <td>privat</td>\n",
+       "      <td>Angebot</td>\n",
+       "      <td>$1,350</td>\n",
+       "      <td>test</td>\n",
+       "      <td>kombi</td>\n",
+       "      <td>2003</td>\n",
+       "      <td>manuell</td>\n",
+       "      <td>0</td>\n",
+       "      <td>focus</td>\n",
+       "      <td>150,000km</td>\n",
+       "      <td>7</td>\n",
+       "      <td>benzin</td>\n",
+       "      <td>ford</td>\n",
+       "      <td>nein</td>\n",
+       "      <td>2016-04-01 00:00:00</td>\n",
+       "      <td>0</td>\n",
+       "      <td>39218</td>\n",
+       "      <td>2016-04-01 14:38:50</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "          date_crawled                                               name  \\\n",
+       "0  2016-03-26 17:47:46                   Peugeot_807_160_NAVTECH_ON_BOARD   \n",
+       "1  2016-04-04 13:38:56         BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik   \n",
+       "2  2016-03-26 18:57:24                         Volkswagen_Golf_1.6_United   \n",
+       "3  2016-03-12 16:58:10  Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...   \n",
+       "4  2016-04-01 14:38:50  Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...   \n",
+       "\n",
+       "   seller offer_type   price  ab_test vehicle_type  registration_year  \\\n",
+       "0  privat    Angebot  $5,000  control          bus               2004   \n",
+       "1  privat    Angebot  $8,500  control    limousine               1997   \n",
+       "2  privat    Angebot  $8,990     test    limousine               2009   \n",
+       "3  privat    Angebot  $4,350  control   kleinwagen               2007   \n",
+       "4  privat    Angebot  $1,350     test        kombi               2003   \n",
+       "\n",
+       "     gearbox  power_ps   model   odometer  registration_month fuel_type  \\\n",
+       "0    manuell       158  andere  150,000km                   3       lpg   \n",
+       "1  automatik       286     7er  150,000km                   6    benzin   \n",
+       "2    manuell       102    golf   70,000km                   7    benzin   \n",
+       "3  automatik        71  fortwo   70,000km                   6    benzin   \n",
+       "4    manuell         0   focus  150,000km                   7    benzin   \n",
+       "\n",
+       "        brand unrepaired_damage           ad_created  num_photos  postal_code  \\\n",
+       "0     peugeot              nein  2016-03-26 00:00:00           0        79588   \n",
+       "1         bmw              nein  2016-04-04 00:00:00           0        71034   \n",
+       "2  volkswagen              nein  2016-03-26 00:00:00           0        35394   \n",
+       "3       smart              nein  2016-03-12 00:00:00           0        33729   \n",
+       "4        ford              nein  2016-04-01 00:00:00           0        39218   \n",
+       "\n",
+       "             last_seen  \n",
+       "0  2016-04-06 06:45:54  \n",
+       "1  2016-04-06 14:45:08  \n",
+       "2  2016-04-06 20:15:37  \n",
+       "3  2016-03-15 03:16:28  \n",
+       "4  2016-04-01 14:38:50  "
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',\n",
+    "       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',\n",
+    "       'odometer', 'registration_month', 'fuel_type', 'brand',\n",
+    "       'unrepaired_damage', 'ad_created', 'num_photos', 'postal_code',\n",
+    "       'last_seen']\n",
+    "autos.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Initial Data Exploration and Cleaning"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We'll start by exploring the data to find obvious candidates for cleaning."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>date_crawled</th>\n",
+       "      <th>name</th>\n",
+       "      <th>seller</th>\n",
+       "      <th>offer_type</th>\n",
+       "      <th>price</th>\n",
+       "      <th>ab_test</th>\n",
+       "      <th>vehicle_type</th>\n",
+       "      <th>registration_year</th>\n",
+       "      <th>gearbox</th>\n",
+       "      <th>power_ps</th>\n",
+       "      <th>model</th>\n",
+       "      <th>odometer</th>\n",
+       "      <th>registration_month</th>\n",
+       "      <th>fuel_type</th>\n",
+       "      <th>brand</th>\n",
+       "      <th>unrepaired_damage</th>\n",
+       "      <th>ad_created</th>\n",
+       "      <th>num_photos</th>\n",
+       "      <th>postal_code</th>\n",
+       "      <th>last_seen</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>count</th>\n",
+       "      <td>50000</td>\n",
+       "      <td>50000</td>\n",
+       "      <td>50000</td>\n",
+       "      <td>50000</td>\n",
+       "      <td>50000</td>\n",
+       "      <td>50000</td>\n",
+       "      <td>44905</td>\n",
+       "      <td>50000.000000</td>\n",
+       "      <td>47320</td>\n",
+       "      <td>50000.000000</td>\n",
+       "      <td>47242</td>\n",
+       "      <td>50000</td>\n",
+       "      <td>50000.000000</td>\n",
+       "      <td>45518</td>\n",
+       "      <td>50000</td>\n",
+       "      <td>40171</td>\n",
+       "      <td>50000</td>\n",
+       "      <td>50000.0</td>\n",
+       "      <td>50000.000000</td>\n",
+       "      <td>50000</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>unique</th>\n",
+       "      <td>48213</td>\n",
+       "      <td>38754</td>\n",
+       "      <td>2</td>\n",
+       "      <td>2</td>\n",
+       "      <td>2357</td>\n",
+       "      <td>2</td>\n",
+       "      <td>8</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>2</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>245</td>\n",
+       "      <td>13</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>7</td>\n",
+       "      <td>40</td>\n",
+       "      <td>2</td>\n",
+       "      <td>76</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>39481</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>top</th>\n",
+       "      <td>2016-03-09 11:54:38</td>\n",
+       "      <td>Ford_Fiesta</td>\n",
+       "      <td>privat</td>\n",
+       "      <td>Angebot</td>\n",
+       "      <td>$0</td>\n",
+       "      <td>test</td>\n",
+       "      <td>limousine</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>manuell</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>golf</td>\n",
+       "      <td>150,000km</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>benzin</td>\n",
+       "      <td>volkswagen</td>\n",
+       "      <td>nein</td>\n",
+       "      <td>2016-04-03 00:00:00</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>2016-04-07 06:17:27</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>freq</th>\n",
+       "      <td>3</td>\n",
+       "      <td>78</td>\n",
+       "      <td>49999</td>\n",
+       "      <td>49999</td>\n",
+       "      <td>1421</td>\n",
+       "      <td>25756</td>\n",
+       "      <td>12859</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>36993</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>4024</td>\n",
+       "      <td>32424</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>30107</td>\n",
+       "      <td>10687</td>\n",
+       "      <td>35232</td>\n",
+       "      <td>1946</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>8</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>mean</th>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>2005.073280</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>116.355920</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>5.723360</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>0.0</td>\n",
+       "      <td>50813.627300</td>\n",
+       "      <td>NaN</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>std</th>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>105.712813</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>209.216627</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>3.711984</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>0.0</td>\n",
+       "      <td>25779.747957</td>\n",
+       "      <td>NaN</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>min</th>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>1000.000000</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>0.000000</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>0.000000</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>0.0</td>\n",
+       "      <td>1067.000000</td>\n",
+       "      <td>NaN</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>25%</th>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>1999.000000</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>70.000000</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>3.000000</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>0.0</td>\n",
+       "      <td>30451.000000</td>\n",
+       "      <td>NaN</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>50%</th>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>2003.000000</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>105.000000</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>6.000000</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>0.0</td>\n",
+       "      <td>49577.000000</td>\n",
+       "      <td>NaN</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>75%</th>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>2008.000000</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>150.000000</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>9.000000</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>0.0</td>\n",
+       "      <td>71540.000000</td>\n",
+       "      <td>NaN</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>max</th>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>9999.000000</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>17700.000000</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>12.000000</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>0.0</td>\n",
+       "      <td>99998.000000</td>\n",
+       "      <td>NaN</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "               date_crawled         name  seller offer_type  price ab_test  \\\n",
+       "count                 50000        50000   50000      50000  50000   50000   \n",
+       "unique                48213        38754       2          2   2357       2   \n",
+       "top     2016-03-09 11:54:38  Ford_Fiesta  privat    Angebot     $0    test   \n",
+       "freq                      3           78   49999      49999   1421   25756   \n",
+       "mean                    NaN          NaN     NaN        NaN    NaN     NaN   \n",
+       "std                     NaN          NaN     NaN        NaN    NaN     NaN   \n",
+       "min                     NaN          NaN     NaN        NaN    NaN     NaN   \n",
+       "25%                     NaN          NaN     NaN        NaN    NaN     NaN   \n",
+       "50%                     NaN          NaN     NaN        NaN    NaN     NaN   \n",
+       "75%                     NaN          NaN     NaN        NaN    NaN     NaN   \n",
+       "max                     NaN          NaN     NaN        NaN    NaN     NaN   \n",
+       "\n",
+       "       vehicle_type  registration_year  gearbox      power_ps  model  \\\n",
+       "count         44905       50000.000000    47320  50000.000000  47242   \n",
+       "unique            8                NaN        2           NaN    245   \n",
+       "top       limousine                NaN  manuell           NaN   golf   \n",
+       "freq          12859                NaN    36993           NaN   4024   \n",
+       "mean            NaN        2005.073280      NaN    116.355920    NaN   \n",
+       "std             NaN         105.712813      NaN    209.216627    NaN   \n",
+       "min             NaN        1000.000000      NaN      0.000000    NaN   \n",
+       "25%             NaN        1999.000000      NaN     70.000000    NaN   \n",
+       "50%             NaN        2003.000000      NaN    105.000000    NaN   \n",
+       "75%             NaN        2008.000000      NaN    150.000000    NaN   \n",
+       "max             NaN        9999.000000      NaN  17700.000000    NaN   \n",
+       "\n",
+       "         odometer  registration_month fuel_type       brand unrepaired_damage  \\\n",
+       "count       50000        50000.000000     45518       50000             40171   \n",
+       "unique         13                 NaN         7          40                 2   \n",
+       "top     150,000km                 NaN    benzin  volkswagen              nein   \n",
+       "freq        32424                 NaN     30107       10687             35232   \n",
+       "mean          NaN            5.723360       NaN         NaN               NaN   \n",
+       "std           NaN            3.711984       NaN         NaN               NaN   \n",
+       "min           NaN            0.000000       NaN         NaN               NaN   \n",
+       "25%           NaN            3.000000       NaN         NaN               NaN   \n",
+       "50%           NaN            6.000000       NaN         NaN               NaN   \n",
+       "75%           NaN            9.000000       NaN         NaN               NaN   \n",
+       "max           NaN           12.000000       NaN         NaN               NaN   \n",
+       "\n",
+       "                 ad_created  num_photos   postal_code            last_seen  \n",
+       "count                 50000     50000.0  50000.000000                50000  \n",
+       "unique                   76         NaN           NaN                39481  \n",
+       "top     2016-04-03 00:00:00         NaN           NaN  2016-04-07 06:17:27  \n",
+       "freq                   1946         NaN           NaN                    8  \n",
+       "mean                    NaN         0.0  50813.627300                  NaN  \n",
+       "std                     NaN         0.0  25779.747957                  NaN  \n",
+       "min                     NaN         0.0   1067.000000                  NaN  \n",
+       "25%                     NaN         0.0  30451.000000                  NaN  \n",
+       "50%                     NaN         0.0  49577.000000                  NaN  \n",
+       "75%                     NaN         0.0  71540.000000                  NaN  \n",
+       "max                     NaN         0.0  99998.000000                  NaN  "
+      ]
+     },
+     "execution_count": 5,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "autos.describe(include='all')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Our initial observations:\n",
+    "\n",
+    "- There are a number of text columns where all (or nearly all) of the values are the same:\n",
+    "    - `seller`\n",
+    "    - `offer_type`\n",
+    "- The `num_photos` column looks odd; we'll need to investigate it further."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "0    50000\n",
+       "Name: num_photos, dtype: int64"
+      ]
+     },
+     "execution_count": 6,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "autos[\"num_photos\"].value_counts()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "It looks like the `num_photos` column has `0` for every row.  We'll drop this column, plus the other two we noted as mostly one value."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "autos = autos.drop([\"num_photos\", \"seller\", \"offer_type\"], axis=1)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "There are two columns, `price` and `odometer`, which are numeric values with extra characters being stored as text.  We'll clean and convert these."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "0    5000\n",
+       "1    8500\n",
+       "2    8990\n",
+       "3    4350\n",
+       "4    1350\n",
+       "Name: price, dtype: int64"
+      ]
+     },
+     "execution_count": 8,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "autos[\"price\"] = (autos[\"price\"]\n",
+    "                          .str.replace(\"$\",\"\", regex=False)\n",
+    "                          .str.replace(\",\",\"\")\n",
+    "                          .astype(int)\n",
+    "                          )\n",
+    "autos[\"price\"].head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "0    150000\n",
+       "1    150000\n",
+       "2     70000\n",
+       "3     70000\n",
+       "4    150000\n",
+       "Name: odometer_km, dtype: int64"
+      ]
+     },
+     "execution_count": 9,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "autos[\"odometer\"] = (autos[\"odometer\"]\n",
+    "                             .str.replace(\"km\",\"\")\n",
+    "                             .str.replace(\",\",\"\")\n",
+    "                             .astype(int)\n",
+    "                             )\n",
+    "autos.rename({\"odometer\": \"odometer_km\"}, axis=1, inplace=True)\n",
+    "autos[\"odometer_km\"].head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Exploring Odometer and Price"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "150000    32424\n",
+       "125000     5170\n",
+       "100000     2169\n",
+       "90000      1757\n",
+       "80000      1436\n",
+       "70000      1230\n",
+       "60000      1164\n",
+       "50000      1027\n",
+       "5000        967\n",
+       "40000       819\n",
+       "30000       789\n",
+       "20000       784\n",
+       "10000       264\n",
+       "Name: odometer_km, dtype: int64"
+      ]
+     },
+     "execution_count": 10,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "autos[\"odometer_km\"].value_counts()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We can see that the values in this field are rounded, which might indicate that sellers had to choose from pre-set options for this field.  Additionally, there are more high-mileage than low-mileage vehicles."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "(2357,)\n",
+      "count    5.000000e+04\n",
+      "mean     9.840044e+03\n",
+      "std      4.811044e+05\n",
+      "min      0.000000e+00\n",
+      "25%      1.100000e+03\n",
+      "50%      2.950000e+03\n",
+      "75%      7.200000e+03\n",
+      "max      1.000000e+08\n",
+      "Name: price, dtype: float64\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "0       1421\n",
+       "500      781\n",
+       "1500     734\n",
+       "2500     643\n",
+       "1000     639\n",
+       "1200     639\n",
+       "600      531\n",
+       "800      498\n",
+       "3500     498\n",
+       "2000     460\n",
+       "999      434\n",
+       "750      433\n",
+       "900      420\n",
+       "650      419\n",
+       "850      410\n",
+       "700      395\n",
+       "4500     394\n",
+       "300      384\n",
+       "2200     382\n",
+       "950      379\n",
+       "Name: price, dtype: int64"
+      ]
+     },
+     "execution_count": 11,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "print(autos[\"price\"].unique().shape)\n",
+    "print(autos[\"price\"].describe())\n",
+    "autos[\"price\"].value_counts().head(20)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Again, the prices in this column seem rounded. However, given that there are 2,357 unique values in the column, that may just reflect people's tendency to round prices on the site.\n",
+    "\n",
+    "\n",
+    "There are 1,421 cars listed with a \\$0 price. Given that this is under 3% of the cars, we might consider removing these rows.  The maximum price is nearly one hundred million dollars, which seems implausibly high, so let's look at the highest prices further."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "99999999    1\n",
+       "27322222    1\n",
+       "12345678    3\n",
+       "11111111    2\n",
+       "10000000    1\n",
+       "3890000     1\n",
+       "1300000     1\n",
+       "1234566     1\n",
+       "999999      2\n",
+       "999990      1\n",
+       "350000      1\n",
+       "345000      1\n",
+       "299000      1\n",
+       "295000      1\n",
+       "265000      1\n",
+       "259000      1\n",
+       "250000      1\n",
+       "220000      1\n",
+       "198000      1\n",
+       "197000      1\n",
+       "Name: price, dtype: int64"
+      ]
+     },
+     "execution_count": 12,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "autos[\"price\"].value_counts().sort_index(ascending=False).head(20)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "0     1421\n",
+       "1      156\n",
+       "2        3\n",
+       "3        1\n",
+       "5        2\n",
+       "8        1\n",
+       "9        1\n",
+       "10       7\n",
+       "11       2\n",
+       "12       3\n",
+       "13       2\n",
+       "14       1\n",
+       "15       2\n",
+       "17       3\n",
+       "18       1\n",
+       "20       4\n",
+       "25       5\n",
+       "29       1\n",
+       "30       7\n",
+       "35       1\n",
+       "Name: price, dtype: int64"
+      ]
+     },
+     "execution_count": 13,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "autos[\"price\"].value_counts().sort_index(ascending=True).head(20)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "There are a number of listings with prices below \\$30, including 1,421 at \\$0.  There are also a small number of listings with very high values, including 14 at around or over \\$1 million.\n",
+    "\n",
+    "Given that eBay is an auction site, there could legitimately be items where the opening bid is \\$1.  We will keep the \\$1 items, but remove anything above \\$350,000, since it seems that prices increase steadily to that number and then jump up to less realistic numbers."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "count     48565.000000\n",
+       "mean       5888.935591\n",
+       "std        9059.854754\n",
+       "min           1.000000\n",
+       "25%        1200.000000\n",
+       "50%        3000.000000\n",
+       "75%        7490.000000\n",
+       "max      350000.000000\n",
+       "Name: price, dtype: float64"
+      ]
+     },
+     "execution_count": 14,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "autos = autos[autos[\"price\"].between(1,351000)]\n",
+    "autos[\"price\"].describe()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Exploring the date columns"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "There are a number of columns with date information:\n",
+    "\n",
+    "- `date_crawled`\n",
+    "- `registration_month`\n",
+    "- `registration_year`\n",
+    "- `ad_created`\n",
+    "- `last_seen`\n",
+    "\n",
+    "These are a combination of dates that were crawled, and dates with meta-information from the crawler. The non-registration dates are stored as strings.\n",
+    "\n",
+    "We'll explore each of these columns to learn more about the listings."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 28,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>date_crawled</th>\n",
+       "      <th>ad_created</th>\n",
+       "      <th>last_seen</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>2016-03-26 17:47:46</td>\n",
+       "      <td>2016-03-26 00:00:00</td>\n",
+       "      <td>2016-04-06 06:45:54</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>2016-04-04 13:38:56</td>\n",
+       "      <td>2016-04-04 00:00:00</td>\n",
+       "      <td>2016-04-06 14:45:08</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>2016-03-26 18:57:24</td>\n",
+       "      <td>2016-03-26 00:00:00</td>\n",
+       "      <td>2016-04-06 20:15:37</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>2016-03-12 16:58:10</td>\n",
+       "      <td>2016-03-12 00:00:00</td>\n",
+       "      <td>2016-03-15 03:16:28</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>2016-04-01 14:38:50</td>\n",
+       "      <td>2016-04-01 00:00:00</td>\n",
+       "      <td>2016-04-01 14:38:50</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "          date_crawled           ad_created            last_seen\n",
+       "0  2016-03-26 17:47:46  2016-03-26 00:00:00  2016-04-06 06:45:54\n",
+       "1  2016-04-04 13:38:56  2016-04-04 00:00:00  2016-04-06 14:45:08\n",
+       "2  2016-03-26 18:57:24  2016-03-26 00:00:00  2016-04-06 20:15:37\n",
+       "3  2016-03-12 16:58:10  2016-03-12 00:00:00  2016-03-15 03:16:28\n",
+       "4  2016-04-01 14:38:50  2016-04-01 00:00:00  2016-04-01 14:38:50"
+      ]
+     },
+     "execution_count": 28,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "autos[['date_crawled','ad_created','last_seen']][0:5]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 30,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "2016-03-05    0.025327\n",
+       "2016-03-06    0.014043\n",
+       "2016-03-07    0.036014\n",
+       "2016-03-08    0.033296\n",
+       "2016-03-09    0.033090\n",
+       "2016-03-10    0.032184\n",
+       "2016-03-11    0.032575\n",
+       "2016-03-12    0.036920\n",
+       "2016-03-13    0.015670\n",
+       "2016-03-14    0.036549\n",
+       "2016-03-15    0.034284\n",
+       "2016-03-16    0.029610\n",
+       "2016-03-17    0.031628\n",
+       "2016-03-18    0.012911\n",
+       "2016-03-19    0.034778\n",
+       "2016-03-20    0.037887\n",
+       "2016-03-21    0.037373\n",
+       "2016-03-22    0.032987\n",
+       "2016-03-23    0.032225\n",
+       "2016-03-24    0.029342\n",
+       "2016-03-25    0.031607\n",
+       "2016-03-26    0.032204\n",
+       "2016-03-27    0.031092\n",
+       "2016-03-28    0.034860\n",
+       "2016-03-29    0.034099\n",
+       "2016-03-30    0.033687\n",
+       "2016-03-31    0.031834\n",
+       "2016-04-01    0.033687\n",
+       "2016-04-02    0.035478\n",
+       "2016-04-03    0.038608\n",
+       "2016-04-04    0.036487\n",
+       "2016-04-05    0.013096\n",
+       "2016-04-06    0.003171\n",
+       "2016-04-07    0.001400\n",
+       "Name: date_crawled, dtype: float64"
+      ]
+     },
+     "execution_count": 30,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "(autos[\"date_crawled\"]\n",
+    "        .str[:10]\n",
+    "        .value_counts(normalize=True, dropna=False)\n",
+    "        .sort_index()\n",
+    "        )"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 29,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "2016-04-07    0.001400\n",
+       "2016-04-06    0.003171\n",
+       "2016-03-18    0.012911\n",
+       "2016-04-05    0.013096\n",
+       "2016-03-06    0.014043\n",
+       "2016-03-13    0.015670\n",
+       "2016-03-05    0.025327\n",
+       "2016-03-24    0.029342\n",
+       "2016-03-16    0.029610\n",
+       "2016-03-27    0.031092\n",
+       "2016-03-25    0.031607\n",
+       "2016-03-17    0.031628\n",
+       "2016-03-31    0.031834\n",
+       "2016-03-10    0.032184\n",
+       "2016-03-26    0.032204\n",
+       "2016-03-23    0.032225\n",
+       "2016-03-11    0.032575\n",
+       "2016-03-22    0.032987\n",
+       "2016-03-09    0.033090\n",
+       "2016-03-08    0.033296\n",
+       "2016-03-30    0.033687\n",
+       "2016-04-01    0.033687\n",
+       "2016-03-29    0.034099\n",
+       "2016-03-15    0.034284\n",
+       "2016-03-19    0.034778\n",
+       "2016-03-28    0.034860\n",
+       "2016-04-02    0.035478\n",
+       "2016-03-07    0.036014\n",
+       "2016-04-04    0.036487\n",
+       "2016-03-14    0.036549\n",
+       "2016-03-12    0.036920\n",
+       "2016-03-21    0.037373\n",
+       "2016-03-20    0.037887\n",
+       "2016-04-03    0.038608\n",
+       "Name: date_crawled, dtype: float64"
+      ]
+     },
+     "execution_count": 29,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "(autos[\"date_crawled\"]\n",
+    "        .str[:10]\n",
+    "        .value_counts(normalize=True, dropna=False)\n",
+    "        .sort_values()\n",
+    "        )"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "It looks like the site was crawled daily over roughly a one-month period in March and April 2016.  The distribution of listings crawled on each day is roughly uniform."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 150,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "2016-03-05    0.001071\n",
+       "2016-03-06    0.004324\n",
+       "2016-03-07    0.005395\n",
+       "2016-03-08    0.007413\n",
+       "2016-03-09    0.009595\n",
+       "2016-03-10    0.010666\n",
+       "2016-03-11    0.012375\n",
+       "2016-03-12    0.023783\n",
+       "2016-03-13    0.008895\n",
+       "2016-03-14    0.012602\n",
+       "2016-03-15    0.015876\n",
+       "2016-03-16    0.016452\n",
+       "2016-03-17    0.028086\n",
+       "2016-03-18    0.007351\n",
+       "2016-03-19    0.015834\n",
+       "2016-03-20    0.020653\n",
+       "2016-03-21    0.020632\n",
+       "2016-03-22    0.021373\n",
+       "2016-03-23    0.018532\n",
+       "2016-03-24    0.019767\n",
+       "2016-03-25    0.019211\n",
+       "2016-03-26    0.016802\n",
+       "2016-03-27    0.015649\n",
+       "2016-03-28    0.020859\n",
+       "2016-03-29    0.022341\n",
+       "2016-03-30    0.024771\n",
+       "2016-03-31    0.023783\n",
+       "2016-04-01    0.022794\n",
+       "2016-04-02    0.024915\n",
+       "2016-04-03    0.025203\n",
+       "2016-04-04    0.024483\n",
+       "2016-04-05    0.124761\n",
+       "2016-04-06    0.221806\n",
+       "2016-04-07    0.131947\n",
+       "Name: last_seen, dtype: float64"
+      ]
+     },
+     "execution_count": 150,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "(autos[\"last_seen\"]\n",
+    "        .str[:10]\n",
+    "        .value_counts(normalize=True, dropna=False)\n",
+    "        .sort_index()\n",
+    "        )"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The crawler recorded the date it last saw each listing, which allows us to determine on what day a listing was removed, presumably because the car was sold.\n",
+    "\n",
+    "The last three days contain a disproportionate number of 'last seen' values. Given that these are roughly 5-9x the values from the previous days, it's unlikely that there was a massive spike in sales, and more likely that these values reflect the crawling period ending rather than car sales."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 33,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "(76,)\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "2015-06-11    0.000021\n",
+       "2015-08-10    0.000021\n",
+       "2015-09-09    0.000021\n",
+       "2015-11-10    0.000021\n",
+       "2015-12-05    0.000021\n",
+       "2015-12-30    0.000021\n",
+       "2016-01-03    0.000021\n",
+       "2016-01-07    0.000021\n",
+       "2016-01-10    0.000041\n",
+       "2016-01-13    0.000021\n",
+       "2016-01-14    0.000021\n",
+       "2016-01-16    0.000021\n",
+       "2016-01-22    0.000021\n",
+       "2016-01-27    0.000062\n",
+       "2016-01-29    0.000021\n",
+       "2016-02-01    0.000021\n",
+       "2016-02-02    0.000041\n",
+       "2016-02-05    0.000041\n",
+       "2016-02-07    0.000021\n",
+       "2016-02-08    0.000021\n",
+       "2016-02-09    0.000021\n",
+       "2016-02-11    0.000021\n",
+       "2016-02-12    0.000041\n",
+       "2016-02-14    0.000041\n",
+       "2016-02-16    0.000021\n",
+       "2016-02-17    0.000021\n",
+       "2016-02-18    0.000041\n",
+       "2016-02-19    0.000062\n",
+       "2016-02-20    0.000041\n",
+       "2016-02-21    0.000062\n",
+       "                ...   \n",
+       "2016-03-09    0.033151\n",
+       "2016-03-10    0.031895\n",
+       "2016-03-11    0.032904\n",
+       "2016-03-12    0.036755\n",
+       "2016-03-13    0.017008\n",
+       "2016-03-14    0.035190\n",
+       "2016-03-15    0.034016\n",
+       "2016-03-16    0.030125\n",
+       "2016-03-17    0.031278\n",
+       "2016-03-18    0.013590\n",
+       "2016-03-19    0.033687\n",
+       "2016-03-20    0.037949\n",
+       "2016-03-21    0.037579\n",
+       "2016-03-22    0.032801\n",
+       "2016-03-23    0.032060\n",
+       "2016-03-24    0.029280\n",
+       "2016-03-25    0.031751\n",
+       "2016-03-26    0.032266\n",
+       "2016-03-27    0.030989\n",
+       "2016-03-28    0.034984\n",
+       "2016-03-29    0.034037\n",
+       "2016-03-30    0.033501\n",
+       "2016-03-31    0.031875\n",
+       "2016-04-01    0.033687\n",
+       "2016-04-02    0.035149\n",
+       "2016-04-03    0.038855\n",
+       "2016-04-04    0.036858\n",
+       "2016-04-05    0.011819\n",
+       "2016-04-06    0.003253\n",
+       "2016-04-07    0.001256\n",
+       "Name: ad_created, Length: 76, dtype: float64"
+      ]
+     },
+     "execution_count": 33,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "print(autos[\"ad_created\"].str[:10].unique().shape)\n",
+    "(autos[\"ad_created\"]\n",
+    "        .str[:10]\n",
+    "        .value_counts(normalize=True, dropna=False)\n",
+    "        .sort_index()\n",
+    "        )"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "There is a wide variety of ad creation dates.  Most fall within 1-2 months of the crawl period, but a few are quite old, with the oldest at around 9 months before the crawl began."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 34,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "count    48565.000000\n",
+       "mean      2004.755421\n",
+       "std         88.643887\n",
+       "min       1000.000000\n",
+       "25%       1999.000000\n",
+       "50%       2004.000000\n",
+       "75%       2008.000000\n",
+       "max       9999.000000\n",
+       "Name: registration_year, dtype: float64"
+      ]
+     },
+     "execution_count": 34,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "autos[\"registration_year\"].describe()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The year that the car was first registered will likely indicate the age of the car.  Looking at this column, we note some odd values.  The minimum value is `1000`, long before cars were invented, and the maximum is `9999`, many years into the future."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Dealing with Incorrect Registration Year Data"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Because a car can't be first registered after the listing was seen, any vehicle with a registration year above 2016 is definitely inaccurate.  Determining the earliest valid year is more difficult.  Realistically, it could be somewhere in the first few decades of the 1900s.\n",
+    "\n",
+    "One option is to remove the listings with these values.  Let's determine what percentage of our data has invalid values in this column:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 35,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "0.038793369710697002"
+      ]
+     },
+     "execution_count": 35,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "(~autos[\"registration_year\"].between(1900,2016)).sum() / autos.shape[0]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Given that this is less than 4% of our data, we will remove these rows."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 47,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "2000    0.067608\n",
+       "2005    0.062895\n",
+       "1999    0.062060\n",
+       "2004    0.057904\n",
+       "2003    0.057818\n",
+       "2006    0.057197\n",
+       "2001    0.056468\n",
+       "2002    0.053255\n",
+       "1998    0.050620\n",
+       "2007    0.048778\n",
+       "Name: registration_year, dtype: float64"
+      ]
+     },
+     "execution_count": 47,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "# Many ways to select rows in a dataframe that fall within a value range for a column.\n",
+    "# Using `Series.between()` is one way.\n",
+    "autos = autos[autos[\"registration_year\"].between(1900,2016)]\n",
+    "autos[\"registration_year\"].value_counts(normalize=True).head(10)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "It appears that most of the vehicles were first registered in the past 20 years."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Exploring Price by Brand"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 48,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "volkswagen        0.211264\n",
+       "bmw               0.110045\n",
+       "opel              0.107581\n",
+       "mercedes_benz     0.096463\n",
+       "audi              0.086566\n",
+       "ford              0.069900\n",
+       "renault           0.047150\n",
+       "peugeot           0.029841\n",
+       "fiat              0.025642\n",
+       "seat              0.018273\n",
+       "skoda             0.016409\n",
+       "nissan            0.015274\n",
+       "mazda             0.015188\n",
+       "smart             0.014160\n",
+       "citroen           0.014010\n",
+       "toyota            0.012703\n",
+       "hyundai           0.010025\n",
+       "sonstige_autos    0.009811\n",
+       "volvo             0.009147\n",
+       "mini              0.008762\n",
+       "mitsubishi        0.008226\n",
+       "honda             0.007840\n",
+       "kia               0.007069\n",
+       "alfa_romeo        0.006641\n",
+       "porsche           0.006127\n",
+       "suzuki            0.005934\n",
+       "chevrolet         0.005698\n",
+       "chrysler          0.003513\n",
+       "dacia             0.002635\n",
+       "daihatsu          0.002506\n",
+       "jeep              0.002271\n",
+       "subaru            0.002142\n",
+       "land_rover        0.002099\n",
+       "saab              0.001649\n",
+       "jaguar            0.001564\n",
+       "daewoo            0.001500\n",
+       "trabant           0.001392\n",
+       "rover             0.001328\n",
+       "lancia            0.001071\n",
+       "lada              0.000578\n",
+       "Name: brand, dtype: float64"
+      ]
+     },
+     "execution_count": 48,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "autos[\"brand\"].value_counts(normalize=True)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "All of the top five brands (Volkswagen, BMW, Opel, Mercedes-Benz and Audi) are German manufacturers, together accounting for about 61% of the overall listings.  Volkswagen is by far the most popular brand, with roughly as many cars for sale as the next two brands combined.\n",
+    "\n",
+    "Many brands account for only a small share of listings, so we will limit our analysis to brands representing more than 5% of total listings."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 49,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')\n"
+     ]
+    }
+   ],
+   "source": [
+    "brand_counts = autos[\"brand\"].value_counts(normalize=True)\n",
+    "common_brands = brand_counts[brand_counts > .05].index\n",
+    "print(common_brands)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 39,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'audi': 9336,\n",
+       " 'bmw': 8332,\n",
+       " 'ford': 3749,\n",
+       " 'mercedes_benz': 8628,\n",
+       " 'opel': 2975,\n",
+       " 'volkswagen': 5402}"
+      ]
+     },
+     "execution_count": 39,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "brand_mean_prices = {}\n",
+    "\n",
+    "for brand in common_brands:\n",
+    "    brand_only = autos[autos[\"brand\"] == brand]\n",
+    "    mean_price = brand_only[\"price\"].mean()\n",
+    "    brand_mean_prices[brand] = int(mean_price)\n",
+    "\n",
+    "brand_mean_prices"
+   ]
+  },
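+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The loop above could also be written with `groupby`. The following is only a sketch on a small made-up frame (`toy` and its values are illustrative assumptions, not data from the listings); it shows an alternative idiom rather than the approach used in this notebook.\n",
+    "\n",
+    "```python\n",
+    "import pandas as pd\n",
+    "\n",
+    "# Hypothetical toy data, not taken from the autos dataset.\n",
+    "toy = pd.DataFrame({\n",
+    "    'brand': ['audi', 'audi', 'ford', 'ford'],\n",
+    "    'price': [9000, 11000, 3000, 5000],\n",
+    "})\n",
+    "\n",
+    "# Mean price per brand, truncated to int as in the loop above.\n",
+    "mean_prices = toy.groupby('brand')['price'].mean().astype(int)\n",
+    "\n",
+    "assert mean_prices['audi'] == 10000\n",
+    "assert mean_prices['ford'] == 4000\n",
+    "```\n",
+    "\n",
+    "To restrict such a result to the common brands, one option would be to filter the frame with `Series.isin()` before grouping."
+   ]
+  },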
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Among the six most common brands, there is a distinct price gap:\n",
+    "\n",
+    "- Audi, BMW and Mercedes-Benz are more expensive\n",
+    "- Ford and Opel are less expensive\n",
+    "- Volkswagen is in between, which may explain its popularity: it may be a 'best of both worlds' option."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Exploring Mileage"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 66,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>mean_price</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>audi</th>\n",
+       "      <td>9336</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>bmw</th>\n",
+       "      <td>8332</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>ford</th>\n",
+       "      <td>3749</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>mercedes_benz</th>\n",
+       "      <td>8628</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>opel</th>\n",
+       "      <td>2975</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>volkswagen</th>\n",
+       "      <td>5402</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "               mean_price\n",
+       "audi                 9336\n",
+       "bmw                  8332\n",
+       "ford                 3749\n",
+       "mercedes_benz        8628\n",
+       "opel                 2975\n",
+       "volkswagen           5402"
+      ]
+     },
+     "execution_count": 66,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "# These values are mean prices, so label the column accordingly.\n",
+    "bmp_series = pd.Series(brand_mean_prices)\n",
+    "pd.DataFrame(bmp_series, columns=[\"mean_price\"])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 44,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "brand_mean_mileage = {}\n",
+    "\n",
+    "for brand in common_brands:\n",
+    "    brand_only = autos[autos[\"brand\"] == brand]\n",
+    "    mean_mileage = brand_only[\"odometer_km\"].mean()\n",
+    "    brand_mean_mileage[brand] = int(mean_mileage)\n",
+    "\n",
+    "mean_mileage = pd.Series(brand_mean_mileage).sort_values(ascending=False)\n",
+    "mean_prices = pd.Series(brand_mean_prices).sort_values(ascending=False)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 51,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>mean_mileage</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>bmw</th>\n",
+       "      <td>132572</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>mercedes_benz</th>\n",
+       "      <td>130788</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>opel</th>\n",
+       "      <td>129310</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>audi</th>\n",
+       "      <td>129157</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>volkswagen</th>\n",
+       "      <td>128707</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>ford</th>\n",
+       "      <td>124266</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "               mean_mileage\n",
+       "bmw                  132572\n",
+       "mercedes_benz        130788\n",
+       "opel                 129310\n",
+       "audi                 129157\n",
+       "volkswagen           128707\n",
+       "ford                 124266"
+      ]
+     },
+     "execution_count": 51,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "brand_info = pd.DataFrame(mean_mileage,columns=['mean_mileage'])\n",
+    "brand_info"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 52,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>mean_mileage</th>\n",
+       "      <th>mean_price</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>bmw</th>\n",
+       "      <td>132572</td>\n",
+       "      <td>8332</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>mercedes_benz</th>\n",
+       "      <td>130788</td>\n",
+       "      <td>8628</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>opel</th>\n",
+       "      <td>129310</td>\n",
+       "      <td>2975</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>audi</th>\n",
+       "      <td>129157</td>\n",
+       "      <td>9336</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>volkswagen</th>\n",
+       "      <td>128707</td>\n",
+       "      <td>5402</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>ford</th>\n",
+       "      <td>124266</td>\n",
+       "      <td>3749</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "               mean_mileage  mean_price\n",
+       "bmw                  132572        8332\n",
+       "mercedes_benz        130788        8628\n",
+       "opel                 129310        2975\n",
+       "audi                 129157        9336\n",
+       "volkswagen           128707        5402\n",
+       "ford                 124266        3749"
+      ]
+     },
+     "execution_count": 52,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "brand_info[\"mean_price\"] = mean_prices\n",
+    "brand_info"
+   ]
+  },
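+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Note that `mean_mileage` and `mean_prices` were each sorted separately, so their rows are in different orders. Assigning one as a new column still works because pandas matches values by index label rather than by position. A minimal sketch with made-up numbers (the names `mileage`, `price` and `info` are illustrative assumptions):\n",
+    "\n",
+    "```python\n",
+    "import pandas as pd\n",
+    "\n",
+    "# Hypothetical values for illustration only.\n",
+    "mileage = pd.Series({'bmw': 130, 'ford': 120})\n",
+    "price = pd.Series({'ford': 4, 'bmw': 8})  # deliberately in a different order\n",
+    "\n",
+    "# to_frame() is used here as one way to turn a Series into a one-column frame.\n",
+    "info = mileage.to_frame(name='mean_mileage')\n",
+    "info['mean_price'] = price  # aligned on the index labels\n",
+    "\n",
+    "assert info.loc['bmw', 'mean_price'] == 8\n",
+    "assert info.loc['ford', 'mean_price'] == 4\n",
+    "```\n",
+    "\n",
+    "A label present in one series but missing from the other would produce `NaN` in the new column rather than an error."
+   ]
+  },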
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Mean mileage varies much less by brand than mean price does: the values for the top brands all fall within 10% of one another.  There is a slight trend for the more expensive brands to have higher mileage and the less expensive brands to have lower mileage, although Opel, the least expensive brand, has the third-highest mean mileage."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}