|
@@ -0,0 +1,2253 @@
|
|
|
+{
|
|
|
+ "cells": [
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "## Analyzing Used Car Listings on eBay Kleinanzeigen"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "We will be working on a dataset of used cars from *eBay Kleinanzeigen*, a [classifieds](https://en.wikipedia.org/wiki/Classified_advertising) section of the German eBay website.\n",
|
|
|
+ "\n",
|
|
|
+ "The dataset was originally scraped and uploaded to [Kaggle](https://www.kaggle.com/orgesleka/used-cars-database/data). The version of the dataset we are working with is a sample of 50,000 data points that was prepared by [Dataquest](https://www.dataquest.io) including simulating a less-cleaned version of the data.\n",
|
|
|
+ "\n",
|
|
|
+ "The data dictionary provided with data is as follows:\n",
|
|
|
+ "\n",
|
|
|
+ "- `dateCrawled` - When this ad was first crawled. All field-values are taken from this date.\n",
|
|
|
+ "- `name` - Name of the car.\n",
|
|
|
+ "- `seller` - Whether the seller is private or a dealer.\n",
|
|
|
+ "- `offerType` - The type of listing\n",
|
|
|
+ "- `price` - The price on the ad to sell the car.\n",
|
|
|
+ "- `abtest` - Whether the listing is included in an A/B test.\n",
|
|
|
+ "- `vehicleType` - The vehicle Type.\n",
|
|
|
+ "- `yearOfRegistration` - The year in which which year the car was first registered.\n",
|
|
|
+ "- `gearbox` - The transmission type.\n",
|
|
|
+ "- `powerPS` - The power of the car in PS.\n",
|
|
|
+ "- `model` - The car model name.\n",
|
|
|
+ "- `kilometer` - How many kilometers the car has driven.\n",
|
|
|
+ "- `monthOfRegistration` - The month in which which year the car was first registered.\n",
|
|
|
+ "- `fuelType` - What type of fuel the car uses.\n",
|
|
|
+ "- `brand` - The brand of the car.\n",
|
|
|
+ "- `notRepairedDamage` - If the car has a damage which is not yet repaired.\n",
|
|
|
+ "- `dateCreated` - The date on which the eBay listing was created.\n",
|
|
|
+ "- `nrOfPictures` - The number of pictures in the ad.\n",
|
|
|
+ "- `postalCode` - The postal code for the location of the vehicle.\n",
|
|
|
+ "- `lastSeenOnline` - When the crawler saw this ad last online.\n",
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ "The aim of this project is to clean the data and analyze the included used car listings."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 1,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "import pandas as pd\n",
|
|
|
+ "import numpy as np"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 2,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [
|
|
|
+ {
|
|
|
+ "name": "stdout",
|
|
|
+ "output_type": "stream",
|
|
|
+ "text": [
|
|
|
+ "<class 'pandas.core.frame.DataFrame'>\n",
|
|
|
+ "RangeIndex: 50000 entries, 0 to 49999\n",
|
|
|
+ "Data columns (total 20 columns):\n",
|
|
|
+ "dateCrawled 50000 non-null object\n",
|
|
|
+ "name 50000 non-null object\n",
|
|
|
+ "seller 50000 non-null object\n",
|
|
|
+ "offerType 50000 non-null object\n",
|
|
|
+ "price 50000 non-null object\n",
|
|
|
+ "abtest 50000 non-null object\n",
|
|
|
+ "vehicleType 44905 non-null object\n",
|
|
|
+ "yearOfRegistration 50000 non-null int64\n",
|
|
|
+ "gearbox 47320 non-null object\n",
|
|
|
+ "powerPS 50000 non-null int64\n",
|
|
|
+ "model 47242 non-null object\n",
|
|
|
+ "odometer 50000 non-null object\n",
|
|
|
+ "monthOfRegistration 50000 non-null int64\n",
|
|
|
+ "fuelType 45518 non-null object\n",
|
|
|
+ "brand 50000 non-null object\n",
|
|
|
+ "notRepairedDamage 40171 non-null object\n",
|
|
|
+ "dateCreated 50000 non-null object\n",
|
|
|
+ "nrOfPictures 50000 non-null int64\n",
|
|
|
+ "postalCode 50000 non-null int64\n",
|
|
|
+ "lastSeen 50000 non-null object\n",
|
|
|
+ "dtypes: int64(5), object(15)\n",
|
|
|
+ "memory usage: 7.6+ MB\n"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "data": {
|
|
|
+ "text/html": [
|
|
|
+ "<div>\n",
|
|
|
+ "<style scoped>\n",
|
|
|
+ " .dataframe tbody tr th:only-of-type {\n",
|
|
|
+ " vertical-align: middle;\n",
|
|
|
+ " }\n",
|
|
|
+ "\n",
|
|
|
+ " .dataframe tbody tr th {\n",
|
|
|
+ " vertical-align: top;\n",
|
|
|
+ " }\n",
|
|
|
+ "\n",
|
|
|
+ " .dataframe thead th {\n",
|
|
|
+ " text-align: right;\n",
|
|
|
+ " }\n",
|
|
|
+ "</style>\n",
|
|
|
+ "<table border=\"1\" class=\"dataframe\">\n",
|
|
|
+ " <thead>\n",
|
|
|
+ " <tr style=\"text-align: right;\">\n",
|
|
|
+ " <th></th>\n",
|
|
|
+ " <th>dateCrawled</th>\n",
|
|
|
+ " <th>name</th>\n",
|
|
|
+ " <th>seller</th>\n",
|
|
|
+ " <th>offerType</th>\n",
|
|
|
+ " <th>price</th>\n",
|
|
|
+ " <th>abtest</th>\n",
|
|
|
+ " <th>vehicleType</th>\n",
|
|
|
+ " <th>yearOfRegistration</th>\n",
|
|
|
+ " <th>gearbox</th>\n",
|
|
|
+ " <th>powerPS</th>\n",
|
|
|
+ " <th>model</th>\n",
|
|
|
+ " <th>odometer</th>\n",
|
|
|
+ " <th>monthOfRegistration</th>\n",
|
|
|
+ " <th>fuelType</th>\n",
|
|
|
+ " <th>brand</th>\n",
|
|
|
+ " <th>notRepairedDamage</th>\n",
|
|
|
+ " <th>dateCreated</th>\n",
|
|
|
+ " <th>nrOfPictures</th>\n",
|
|
|
+ " <th>postalCode</th>\n",
|
|
|
+ " <th>lastSeen</th>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " </thead>\n",
|
|
|
+ " <tbody>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>0</th>\n",
|
|
|
+ " <td>2016-03-26 17:47:46</td>\n",
|
|
|
+ " <td>Peugeot_807_160_NAVTECH_ON_BOARD</td>\n",
|
|
|
+ " <td>privat</td>\n",
|
|
|
+ " <td>Angebot</td>\n",
|
|
|
+ " <td>$5,000</td>\n",
|
|
|
+ " <td>control</td>\n",
|
|
|
+ " <td>bus</td>\n",
|
|
|
+ " <td>2004</td>\n",
|
|
|
+ " <td>manuell</td>\n",
|
|
|
+ " <td>158</td>\n",
|
|
|
+ " <td>andere</td>\n",
|
|
|
+ " <td>150,000km</td>\n",
|
|
|
+ " <td>3</td>\n",
|
|
|
+ " <td>lpg</td>\n",
|
|
|
+ " <td>peugeot</td>\n",
|
|
|
+ " <td>nein</td>\n",
|
|
|
+ " <td>2016-03-26 00:00:00</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>79588</td>\n",
|
|
|
+ " <td>2016-04-06 06:45:54</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>1</th>\n",
|
|
|
+ " <td>2016-04-04 13:38:56</td>\n",
|
|
|
+ " <td>BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik</td>\n",
|
|
|
+ " <td>privat</td>\n",
|
|
|
+ " <td>Angebot</td>\n",
|
|
|
+ " <td>$8,500</td>\n",
|
|
|
+ " <td>control</td>\n",
|
|
|
+ " <td>limousine</td>\n",
|
|
|
+ " <td>1997</td>\n",
|
|
|
+ " <td>automatik</td>\n",
|
|
|
+ " <td>286</td>\n",
|
|
|
+ " <td>7er</td>\n",
|
|
|
+ " <td>150,000km</td>\n",
|
|
|
+ " <td>6</td>\n",
|
|
|
+ " <td>benzin</td>\n",
|
|
|
+ " <td>bmw</td>\n",
|
|
|
+ " <td>nein</td>\n",
|
|
|
+ " <td>2016-04-04 00:00:00</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>71034</td>\n",
|
|
|
+ " <td>2016-04-06 14:45:08</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>2</th>\n",
|
|
|
+ " <td>2016-03-26 18:57:24</td>\n",
|
|
|
+ " <td>Volkswagen_Golf_1.6_United</td>\n",
|
|
|
+ " <td>privat</td>\n",
|
|
|
+ " <td>Angebot</td>\n",
|
|
|
+ " <td>$8,990</td>\n",
|
|
|
+ " <td>test</td>\n",
|
|
|
+ " <td>limousine</td>\n",
|
|
|
+ " <td>2009</td>\n",
|
|
|
+ " <td>manuell</td>\n",
|
|
|
+ " <td>102</td>\n",
|
|
|
+ " <td>golf</td>\n",
|
|
|
+ " <td>70,000km</td>\n",
|
|
|
+ " <td>7</td>\n",
|
|
|
+ " <td>benzin</td>\n",
|
|
|
+ " <td>volkswagen</td>\n",
|
|
|
+ " <td>nein</td>\n",
|
|
|
+ " <td>2016-03-26 00:00:00</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>35394</td>\n",
|
|
|
+ " <td>2016-04-06 20:15:37</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>3</th>\n",
|
|
|
+ " <td>2016-03-12 16:58:10</td>\n",
|
|
|
+ " <td>Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...</td>\n",
|
|
|
+ " <td>privat</td>\n",
|
|
|
+ " <td>Angebot</td>\n",
|
|
|
+ " <td>$4,350</td>\n",
|
|
|
+ " <td>control</td>\n",
|
|
|
+ " <td>kleinwagen</td>\n",
|
|
|
+ " <td>2007</td>\n",
|
|
|
+ " <td>automatik</td>\n",
|
|
|
+ " <td>71</td>\n",
|
|
|
+ " <td>fortwo</td>\n",
|
|
|
+ " <td>70,000km</td>\n",
|
|
|
+ " <td>6</td>\n",
|
|
|
+ " <td>benzin</td>\n",
|
|
|
+ " <td>smart</td>\n",
|
|
|
+ " <td>nein</td>\n",
|
|
|
+ " <td>2016-03-12 00:00:00</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>33729</td>\n",
|
|
|
+ " <td>2016-03-15 03:16:28</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>4</th>\n",
|
|
|
+ " <td>2016-04-01 14:38:50</td>\n",
|
|
|
+ " <td>Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...</td>\n",
|
|
|
+ " <td>privat</td>\n",
|
|
|
+ " <td>Angebot</td>\n",
|
|
|
+ " <td>$1,350</td>\n",
|
|
|
+ " <td>test</td>\n",
|
|
|
+ " <td>kombi</td>\n",
|
|
|
+ " <td>2003</td>\n",
|
|
|
+ " <td>manuell</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>focus</td>\n",
|
|
|
+ " <td>150,000km</td>\n",
|
|
|
+ " <td>7</td>\n",
|
|
|
+ " <td>benzin</td>\n",
|
|
|
+ " <td>ford</td>\n",
|
|
|
+ " <td>nein</td>\n",
|
|
|
+ " <td>2016-04-01 00:00:00</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>39218</td>\n",
|
|
|
+ " <td>2016-04-01 14:38:50</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " </tbody>\n",
|
|
|
+ "</table>\n",
|
|
|
+ "</div>"
|
|
|
+ ],
|
|
|
+ "text/plain": [
|
|
|
+ " dateCrawled name \\\n",
|
|
|
+ "0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD \n",
|
|
|
+ "1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik \n",
|
|
|
+ "2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United \n",
|
|
|
+ "3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... \n",
|
|
|
+ "4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... \n",
|
|
|
+ "\n",
|
|
|
+ " seller offerType price abtest vehicleType yearOfRegistration \\\n",
|
|
|
+ "0 privat Angebot $5,000 control bus 2004 \n",
|
|
|
+ "1 privat Angebot $8,500 control limousine 1997 \n",
|
|
|
+ "2 privat Angebot $8,990 test limousine 2009 \n",
|
|
|
+ "3 privat Angebot $4,350 control kleinwagen 2007 \n",
|
|
|
+ "4 privat Angebot $1,350 test kombi 2003 \n",
|
|
|
+ "\n",
|
|
|
+ " gearbox powerPS model odometer monthOfRegistration fuelType \\\n",
|
|
|
+ "0 manuell 158 andere 150,000km 3 lpg \n",
|
|
|
+ "1 automatik 286 7er 150,000km 6 benzin \n",
|
|
|
+ "2 manuell 102 golf 70,000km 7 benzin \n",
|
|
|
+ "3 automatik 71 fortwo 70,000km 6 benzin \n",
|
|
|
+ "4 manuell 0 focus 150,000km 7 benzin \n",
|
|
|
+ "\n",
|
|
|
+ " brand notRepairedDamage dateCreated nrOfPictures \\\n",
|
|
|
+ "0 peugeot nein 2016-03-26 00:00:00 0 \n",
|
|
|
+ "1 bmw nein 2016-04-04 00:00:00 0 \n",
|
|
|
+ "2 volkswagen nein 2016-03-26 00:00:00 0 \n",
|
|
|
+ "3 smart nein 2016-03-12 00:00:00 0 \n",
|
|
|
+ "4 ford nein 2016-04-01 00:00:00 0 \n",
|
|
|
+ "\n",
|
|
|
+ " postalCode lastSeen \n",
|
|
|
+ "0 79588 2016-04-06 06:45:54 \n",
|
|
|
+ "1 71034 2016-04-06 14:45:08 \n",
|
|
|
+ "2 35394 2016-04-06 20:15:37 \n",
|
|
|
+ "3 33729 2016-03-15 03:16:28 \n",
|
|
|
+ "4 39218 2016-04-01 14:38:50 "
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ "execution_count": 2,
|
|
|
+ "metadata": {},
|
|
|
+ "output_type": "execute_result"
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "source": [
|
|
|
+ "autos = pd.read_csv('autos.csv', encoding='Latin-1')\n",
|
|
|
+ "autos.info()\n",
|
|
|
+ "autos.head()"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "Our dataset contains 20 columns, most of which are stored as strings. There are a few columns with null values, but no columns have more than ~20% null values. There are some columns that contain dates stored as strings.\n",
|
|
|
+ "\n",
|
|
|
+ "We'll start by cleaning the column names to make the data easier to work with."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "## Clean Columns"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 3,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [
|
|
|
+ {
|
|
|
+ "data": {
|
|
|
+ "text/plain": [
|
|
|
+ "Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',\n",
|
|
|
+ " 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',\n",
|
|
|
+ " 'odometer', 'monthOfRegistration', 'fuelType', 'brand',\n",
|
|
|
+ " 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',\n",
|
|
|
+ " 'lastSeen'],\n",
|
|
|
+ " dtype='object')"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ "execution_count": 3,
|
|
|
+ "metadata": {},
|
|
|
+ "output_type": "execute_result"
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "source": [
|
|
|
+ "autos.columns"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "We'll make a few changes here:\n",
|
|
|
+ "\n",
|
|
|
+ "- Change the columns from camelcase to snakecase.\n",
|
|
|
+ "- Change a few wordings to more accurately describe the columns."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 4,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [
|
|
|
+ {
|
|
|
+ "data": {
|
|
|
+ "text/html": [
|
|
|
+ "<div>\n",
|
|
|
+ "<style scoped>\n",
|
|
|
+ " .dataframe tbody tr th:only-of-type {\n",
|
|
|
+ " vertical-align: middle;\n",
|
|
|
+ " }\n",
|
|
|
+ "\n",
|
|
|
+ " .dataframe tbody tr th {\n",
|
|
|
+ " vertical-align: top;\n",
|
|
|
+ " }\n",
|
|
|
+ "\n",
|
|
|
+ " .dataframe thead th {\n",
|
|
|
+ " text-align: right;\n",
|
|
|
+ " }\n",
|
|
|
+ "</style>\n",
|
|
|
+ "<table border=\"1\" class=\"dataframe\">\n",
|
|
|
+ " <thead>\n",
|
|
|
+ " <tr style=\"text-align: right;\">\n",
|
|
|
+ " <th></th>\n",
|
|
|
+ " <th>date_crawled</th>\n",
|
|
|
+ " <th>name</th>\n",
|
|
|
+ " <th>seller</th>\n",
|
|
|
+ " <th>offer_type</th>\n",
|
|
|
+ " <th>price</th>\n",
|
|
|
+ " <th>ab_test</th>\n",
|
|
|
+ " <th>vehicle_type</th>\n",
|
|
|
+ " <th>registration_year</th>\n",
|
|
|
+ " <th>gearbox</th>\n",
|
|
|
+ " <th>power_ps</th>\n",
|
|
|
+ " <th>model</th>\n",
|
|
|
+ " <th>odometer</th>\n",
|
|
|
+ " <th>registration_month</th>\n",
|
|
|
+ " <th>fuel_type</th>\n",
|
|
|
+ " <th>brand</th>\n",
|
|
|
+ " <th>unrepaired_damage</th>\n",
|
|
|
+ " <th>ad_created</th>\n",
|
|
|
+ " <th>num_photos</th>\n",
|
|
|
+ " <th>postal_code</th>\n",
|
|
|
+ " <th>last_seen</th>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " </thead>\n",
|
|
|
+ " <tbody>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>0</th>\n",
|
|
|
+ " <td>2016-03-26 17:47:46</td>\n",
|
|
|
+ " <td>Peugeot_807_160_NAVTECH_ON_BOARD</td>\n",
|
|
|
+ " <td>privat</td>\n",
|
|
|
+ " <td>Angebot</td>\n",
|
|
|
+ " <td>$5,000</td>\n",
|
|
|
+ " <td>control</td>\n",
|
|
|
+ " <td>bus</td>\n",
|
|
|
+ " <td>2004</td>\n",
|
|
|
+ " <td>manuell</td>\n",
|
|
|
+ " <td>158</td>\n",
|
|
|
+ " <td>andere</td>\n",
|
|
|
+ " <td>150,000km</td>\n",
|
|
|
+ " <td>3</td>\n",
|
|
|
+ " <td>lpg</td>\n",
|
|
|
+ " <td>peugeot</td>\n",
|
|
|
+ " <td>nein</td>\n",
|
|
|
+ " <td>2016-03-26 00:00:00</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>79588</td>\n",
|
|
|
+ " <td>2016-04-06 06:45:54</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>1</th>\n",
|
|
|
+ " <td>2016-04-04 13:38:56</td>\n",
|
|
|
+ " <td>BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik</td>\n",
|
|
|
+ " <td>privat</td>\n",
|
|
|
+ " <td>Angebot</td>\n",
|
|
|
+ " <td>$8,500</td>\n",
|
|
|
+ " <td>control</td>\n",
|
|
|
+ " <td>limousine</td>\n",
|
|
|
+ " <td>1997</td>\n",
|
|
|
+ " <td>automatik</td>\n",
|
|
|
+ " <td>286</td>\n",
|
|
|
+ " <td>7er</td>\n",
|
|
|
+ " <td>150,000km</td>\n",
|
|
|
+ " <td>6</td>\n",
|
|
|
+ " <td>benzin</td>\n",
|
|
|
+ " <td>bmw</td>\n",
|
|
|
+ " <td>nein</td>\n",
|
|
|
+ " <td>2016-04-04 00:00:00</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>71034</td>\n",
|
|
|
+ " <td>2016-04-06 14:45:08</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>2</th>\n",
|
|
|
+ " <td>2016-03-26 18:57:24</td>\n",
|
|
|
+ " <td>Volkswagen_Golf_1.6_United</td>\n",
|
|
|
+ " <td>privat</td>\n",
|
|
|
+ " <td>Angebot</td>\n",
|
|
|
+ " <td>$8,990</td>\n",
|
|
|
+ " <td>test</td>\n",
|
|
|
+ " <td>limousine</td>\n",
|
|
|
+ " <td>2009</td>\n",
|
|
|
+ " <td>manuell</td>\n",
|
|
|
+ " <td>102</td>\n",
|
|
|
+ " <td>golf</td>\n",
|
|
|
+ " <td>70,000km</td>\n",
|
|
|
+ " <td>7</td>\n",
|
|
|
+ " <td>benzin</td>\n",
|
|
|
+ " <td>volkswagen</td>\n",
|
|
|
+ " <td>nein</td>\n",
|
|
|
+ " <td>2016-03-26 00:00:00</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>35394</td>\n",
|
|
|
+ " <td>2016-04-06 20:15:37</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>3</th>\n",
|
|
|
+ " <td>2016-03-12 16:58:10</td>\n",
|
|
|
+ " <td>Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...</td>\n",
|
|
|
+ " <td>privat</td>\n",
|
|
|
+ " <td>Angebot</td>\n",
|
|
|
+ " <td>$4,350</td>\n",
|
|
|
+ " <td>control</td>\n",
|
|
|
+ " <td>kleinwagen</td>\n",
|
|
|
+ " <td>2007</td>\n",
|
|
|
+ " <td>automatik</td>\n",
|
|
|
+ " <td>71</td>\n",
|
|
|
+ " <td>fortwo</td>\n",
|
|
|
+ " <td>70,000km</td>\n",
|
|
|
+ " <td>6</td>\n",
|
|
|
+ " <td>benzin</td>\n",
|
|
|
+ " <td>smart</td>\n",
|
|
|
+ " <td>nein</td>\n",
|
|
|
+ " <td>2016-03-12 00:00:00</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>33729</td>\n",
|
|
|
+ " <td>2016-03-15 03:16:28</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>4</th>\n",
|
|
|
+ " <td>2016-04-01 14:38:50</td>\n",
|
|
|
+ " <td>Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...</td>\n",
|
|
|
+ " <td>privat</td>\n",
|
|
|
+ " <td>Angebot</td>\n",
|
|
|
+ " <td>$1,350</td>\n",
|
|
|
+ " <td>test</td>\n",
|
|
|
+ " <td>kombi</td>\n",
|
|
|
+ " <td>2003</td>\n",
|
|
|
+ " <td>manuell</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>focus</td>\n",
|
|
|
+ " <td>150,000km</td>\n",
|
|
|
+ " <td>7</td>\n",
|
|
|
+ " <td>benzin</td>\n",
|
|
|
+ " <td>ford</td>\n",
|
|
|
+ " <td>nein</td>\n",
|
|
|
+ " <td>2016-04-01 00:00:00</td>\n",
|
|
|
+ " <td>0</td>\n",
|
|
|
+ " <td>39218</td>\n",
|
|
|
+ " <td>2016-04-01 14:38:50</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " </tbody>\n",
|
|
|
+ "</table>\n",
|
|
|
+ "</div>"
|
|
|
+ ],
|
|
|
+ "text/plain": [
|
|
|
+ " date_crawled name \\\n",
|
|
|
+ "0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD \n",
|
|
|
+ "1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik \n",
|
|
|
+ "2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United \n",
|
|
|
+ "3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... \n",
|
|
|
+ "4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... \n",
|
|
|
+ "\n",
|
|
|
+ " seller offer_type price ab_test vehicle_type registration_year \\\n",
|
|
|
+ "0 privat Angebot $5,000 control bus 2004 \n",
|
|
|
+ "1 privat Angebot $8,500 control limousine 1997 \n",
|
|
|
+ "2 privat Angebot $8,990 test limousine 2009 \n",
|
|
|
+ "3 privat Angebot $4,350 control kleinwagen 2007 \n",
|
|
|
+ "4 privat Angebot $1,350 test kombi 2003 \n",
|
|
|
+ "\n",
|
|
|
+ " gearbox power_ps model odometer registration_month fuel_type \\\n",
|
|
|
+ "0 manuell 158 andere 150,000km 3 lpg \n",
|
|
|
+ "1 automatik 286 7er 150,000km 6 benzin \n",
|
|
|
+ "2 manuell 102 golf 70,000km 7 benzin \n",
|
|
|
+ "3 automatik 71 fortwo 70,000km 6 benzin \n",
|
|
|
+ "4 manuell 0 focus 150,000km 7 benzin \n",
|
|
|
+ "\n",
|
|
|
+ " brand unrepaired_damage ad_created num_photos postal_code \\\n",
|
|
|
+ "0 peugeot nein 2016-03-26 00:00:00 0 79588 \n",
|
|
|
+ "1 bmw nein 2016-04-04 00:00:00 0 71034 \n",
|
|
|
+ "2 volkswagen nein 2016-03-26 00:00:00 0 35394 \n",
|
|
|
+ "3 smart nein 2016-03-12 00:00:00 0 33729 \n",
|
|
|
+ "4 ford nein 2016-04-01 00:00:00 0 39218 \n",
|
|
|
+ "\n",
|
|
|
+ " last_seen \n",
|
|
|
+ "0 2016-04-06 06:45:54 \n",
|
|
|
+ "1 2016-04-06 14:45:08 \n",
|
|
|
+ "2 2016-04-06 20:15:37 \n",
|
|
|
+ "3 2016-03-15 03:16:28 \n",
|
|
|
+ "4 2016-04-01 14:38:50 "
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ "execution_count": 4,
|
|
|
+ "metadata": {},
|
|
|
+ "output_type": "execute_result"
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "source": [
|
|
|
+ "autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',\n",
|
|
|
+ " 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',\n",
|
|
|
+ " 'odometer', 'registration_month', 'fuel_type', 'brand',\n",
|
|
|
+ " 'unrepaired_damage', 'ad_created', 'num_photos', 'postal_code',\n",
|
|
|
+ " 'last_seen']\n",
|
|
|
+ "autos.head()"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "## Initial Data Exploration and Cleaning"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "We'll start by exploring the data to find obvious areas where we can clean the data."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 5,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [
|
|
|
+ {
|
|
|
+ "data": {
|
|
|
+ "text/html": [
|
|
|
+ "<div>\n",
|
|
|
+ "<style scoped>\n",
|
|
|
+ " .dataframe tbody tr th:only-of-type {\n",
|
|
|
+ " vertical-align: middle;\n",
|
|
|
+ " }\n",
|
|
|
+ "\n",
|
|
|
+ " .dataframe tbody tr th {\n",
|
|
|
+ " vertical-align: top;\n",
|
|
|
+ " }\n",
|
|
|
+ "\n",
|
|
|
+ " .dataframe thead th {\n",
|
|
|
+ " text-align: right;\n",
|
|
|
+ " }\n",
|
|
|
+ "</style>\n",
|
|
|
+ "<table border=\"1\" class=\"dataframe\">\n",
|
|
|
+ " <thead>\n",
|
|
|
+ " <tr style=\"text-align: right;\">\n",
|
|
|
+ " <th></th>\n",
|
|
|
+ " <th>date_crawled</th>\n",
|
|
|
+ " <th>name</th>\n",
|
|
|
+ " <th>seller</th>\n",
|
|
|
+ " <th>offer_type</th>\n",
|
|
|
+ " <th>price</th>\n",
|
|
|
+ " <th>ab_test</th>\n",
|
|
|
+ " <th>vehicle_type</th>\n",
|
|
|
+ " <th>registration_year</th>\n",
|
|
|
+ " <th>gearbox</th>\n",
|
|
|
+ " <th>power_ps</th>\n",
|
|
|
+ " <th>model</th>\n",
|
|
|
+ " <th>odometer</th>\n",
|
|
|
+ " <th>registration_month</th>\n",
|
|
|
+ " <th>fuel_type</th>\n",
|
|
|
+ " <th>brand</th>\n",
|
|
|
+ " <th>unrepaired_damage</th>\n",
|
|
|
+ " <th>ad_created</th>\n",
|
|
|
+ " <th>num_photos</th>\n",
|
|
|
+ " <th>postal_code</th>\n",
|
|
|
+ " <th>last_seen</th>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " </thead>\n",
|
|
|
+ " <tbody>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>count</th>\n",
|
|
|
+ " <td>50000</td>\n",
|
|
|
+ " <td>50000</td>\n",
|
|
|
+ " <td>50000</td>\n",
|
|
|
+ " <td>50000</td>\n",
|
|
|
+ " <td>50000</td>\n",
|
|
|
+ " <td>50000</td>\n",
|
|
|
+ " <td>44905</td>\n",
|
|
|
+ " <td>50000.000000</td>\n",
|
|
|
+ " <td>47320</td>\n",
|
|
|
+ " <td>50000.000000</td>\n",
|
|
|
+ " <td>47242</td>\n",
|
|
|
+ " <td>50000</td>\n",
|
|
|
+ " <td>50000.000000</td>\n",
|
|
|
+ " <td>45518</td>\n",
|
|
|
+ " <td>50000</td>\n",
|
|
|
+ " <td>40171</td>\n",
|
|
|
+ " <td>50000</td>\n",
|
|
|
+ " <td>50000.0</td>\n",
|
|
|
+ " <td>50000.000000</td>\n",
|
|
|
+ " <td>50000</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>unique</th>\n",
|
|
|
+ " <td>48213</td>\n",
|
|
|
+ " <td>38754</td>\n",
|
|
|
+ " <td>2</td>\n",
|
|
|
+ " <td>2</td>\n",
|
|
|
+ " <td>2357</td>\n",
|
|
|
+ " <td>2</td>\n",
|
|
|
+ " <td>8</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>2</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>245</td>\n",
|
|
|
+ " <td>13</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>7</td>\n",
|
|
|
+ " <td>40</td>\n",
|
|
|
+ " <td>2</td>\n",
|
|
|
+ " <td>76</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>39481</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>top</th>\n",
|
|
|
+ " <td>2016-03-09 11:54:38</td>\n",
|
|
|
+ " <td>Ford_Fiesta</td>\n",
|
|
|
+ " <td>privat</td>\n",
|
|
|
+ " <td>Angebot</td>\n",
|
|
|
+ " <td>$0</td>\n",
|
|
|
+ " <td>test</td>\n",
|
|
|
+ " <td>limousine</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>manuell</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>golf</td>\n",
|
|
|
+ " <td>150,000km</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>benzin</td>\n",
|
|
|
+ " <td>volkswagen</td>\n",
|
|
|
+ " <td>nein</td>\n",
|
|
|
+ " <td>2016-04-03 00:00:00</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>2016-04-07 06:17:27</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>freq</th>\n",
|
|
|
+ " <td>3</td>\n",
|
|
|
+ " <td>78</td>\n",
|
|
|
+ " <td>49999</td>\n",
|
|
|
+ " <td>49999</td>\n",
|
|
|
+ " <td>1421</td>\n",
|
|
|
+ " <td>25756</td>\n",
|
|
|
+ " <td>12859</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>36993</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>4024</td>\n",
|
|
|
+ " <td>32424</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>30107</td>\n",
|
|
|
+ " <td>10687</td>\n",
|
|
|
+ " <td>35232</td>\n",
|
|
|
+ " <td>1946</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>8</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>mean</th>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>2005.073280</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>116.355920</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>5.723360</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>0.0</td>\n",
|
|
|
+ " <td>50813.627300</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>std</th>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>105.712813</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>209.216627</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>3.711984</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>0.0</td>\n",
|
|
|
+ " <td>25779.747957</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>min</th>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>1000.000000</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>0.000000</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>0.000000</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>0.0</td>\n",
|
|
|
+ " <td>1067.000000</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>25%</th>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>1999.000000</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>70.000000</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>3.000000</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>0.0</td>\n",
|
|
|
+ " <td>30451.000000</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>50%</th>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>2003.000000</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>105.000000</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>6.000000</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>0.0</td>\n",
|
|
|
+ " <td>49577.000000</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>75%</th>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>2008.000000</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>150.000000</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>9.000000</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>0.0</td>\n",
|
|
|
+ " <td>71540.000000</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>max</th>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>9999.000000</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>17700.000000</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>12.000000</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " <td>0.0</td>\n",
|
|
|
+ " <td>99998.000000</td>\n",
|
|
|
+ " <td>NaN</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " </tbody>\n",
|
|
|
+ "</table>\n",
|
|
|
+ "</div>"
|
|
|
+ ],
|
|
|
+ "text/plain": [
|
|
|
+ " date_crawled name seller offer_type price ab_test \\\n",
|
|
|
+ "count 50000 50000 50000 50000 50000 50000 \n",
|
|
|
+ "unique 48213 38754 2 2 2357 2 \n",
|
|
|
+ "top 2016-03-09 11:54:38 Ford_Fiesta privat Angebot $0 test \n",
|
|
|
+ "freq 3 78 49999 49999 1421 25756 \n",
|
|
|
+ "mean NaN NaN NaN NaN NaN NaN \n",
|
|
|
+ "std NaN NaN NaN NaN NaN NaN \n",
|
|
|
+ "min NaN NaN NaN NaN NaN NaN \n",
|
|
|
+ "25% NaN NaN NaN NaN NaN NaN \n",
|
|
|
+ "50% NaN NaN NaN NaN NaN NaN \n",
|
|
|
+ "75% NaN NaN NaN NaN NaN NaN \n",
|
|
|
+ "max NaN NaN NaN NaN NaN NaN \n",
|
|
|
+ "\n",
|
|
|
+ " vehicle_type registration_year gearbox power_ps model \\\n",
|
|
|
+ "count 44905 50000.000000 47320 50000.000000 47242 \n",
|
|
|
+ "unique 8 NaN 2 NaN 245 \n",
|
|
|
+ "top limousine NaN manuell NaN golf \n",
|
|
|
+ "freq 12859 NaN 36993 NaN 4024 \n",
|
|
|
+ "mean NaN 2005.073280 NaN 116.355920 NaN \n",
|
|
|
+ "std NaN 105.712813 NaN 209.216627 NaN \n",
|
|
|
+ "min NaN 1000.000000 NaN 0.000000 NaN \n",
|
|
|
+ "25% NaN 1999.000000 NaN 70.000000 NaN \n",
|
|
|
+ "50% NaN 2003.000000 NaN 105.000000 NaN \n",
|
|
|
+ "75% NaN 2008.000000 NaN 150.000000 NaN \n",
|
|
|
+ "max NaN 9999.000000 NaN 17700.000000 NaN \n",
|
|
|
+ "\n",
|
|
|
+ " odometer registration_month fuel_type brand unrepaired_damage \\\n",
|
|
|
+ "count 50000 50000.000000 45518 50000 40171 \n",
|
|
|
+ "unique 13 NaN 7 40 2 \n",
|
|
|
+ "top 150,000km NaN benzin volkswagen nein \n",
|
|
|
+ "freq 32424 NaN 30107 10687 35232 \n",
|
|
|
+ "mean NaN 5.723360 NaN NaN NaN \n",
|
|
|
+ "std NaN 3.711984 NaN NaN NaN \n",
|
|
|
+ "min NaN 0.000000 NaN NaN NaN \n",
|
|
|
+ "25% NaN 3.000000 NaN NaN NaN \n",
|
|
|
+ "50% NaN 6.000000 NaN NaN NaN \n",
|
|
|
+ "75% NaN 9.000000 NaN NaN NaN \n",
|
|
|
+ "max NaN 12.000000 NaN NaN NaN \n",
|
|
|
+ "\n",
|
|
|
+ " ad_created num_photos postal_code last_seen \n",
|
|
|
+ "count 50000 50000.0 50000.000000 50000 \n",
|
|
|
+ "unique 76 NaN NaN 39481 \n",
|
|
|
+ "top 2016-04-03 00:00:00 NaN NaN 2016-04-07 06:17:27 \n",
|
|
|
+ "freq 1946 NaN NaN 8 \n",
|
|
|
+ "mean NaN 0.0 50813.627300 NaN \n",
|
|
|
+ "std NaN 0.0 25779.747957 NaN \n",
|
|
|
+ "min NaN 0.0 1067.000000 NaN \n",
|
|
|
+ "25% NaN 0.0 30451.000000 NaN \n",
|
|
|
+ "50% NaN 0.0 49577.000000 NaN \n",
|
|
|
+ "75% NaN 0.0 71540.000000 NaN \n",
|
|
|
+ "max NaN 0.0 99998.000000 NaN "
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ "execution_count": 5,
|
|
|
+ "metadata": {},
|
|
|
+ "output_type": "execute_result"
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "source": [
|
|
|
+ "autos.describe(include='all')"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "Our initial observations:\n",
|
|
|
+ "\n",
|
|
|
+ "- There are a number of text columns where all (or nearly all) of the values are the same:\n",
|
|
|
+ " - `seller`\n",
|
|
|
+ " - `offer_type`\n",
|
|
|
+ "- The `num_photos` column looks odd, we'll need to investigate this further."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 6,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [
|
|
|
+ {
|
|
|
+ "data": {
|
|
|
+ "text/plain": [
|
|
|
+ "0 50000\n",
|
|
|
+ "Name: num_photos, dtype: int64"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ "execution_count": 6,
|
|
|
+ "metadata": {},
|
|
|
+ "output_type": "execute_result"
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "source": [
|
|
|
+ "autos[\"num_photos\"].value_counts()"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "It looks like the `num_photos` column has `0` for every column. We'll drop this column, plus the other two we noted as mostly one value."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 7,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "autos = autos.drop([\"num_photos\", \"seller\", \"offer_type\"], axis=1)"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "There are two columns, `price` and `auto`, which are numeric values with extra characters being stored as text. We'll clean and convert these."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 8,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [
|
|
|
+ {
|
|
|
+ "data": {
|
|
|
+ "text/plain": [
|
|
|
+ "0 5000\n",
|
|
|
+ "1 8500\n",
|
|
|
+ "2 8990\n",
|
|
|
+ "3 4350\n",
|
|
|
+ "4 1350\n",
|
|
|
+ "Name: price, dtype: int64"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ "execution_count": 8,
|
|
|
+ "metadata": {},
|
|
|
+ "output_type": "execute_result"
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "source": [
|
|
|
+ "autos[\"price\"] = (autos[\"price\"]\n",
|
|
|
+ " .str.replace(\"$\",\"\")\n",
|
|
|
+ " .str.replace(\",\",\"\")\n",
|
|
|
+ " .astype(int)\n",
|
|
|
+ " )\n",
|
|
|
+ "autos[\"price\"].head()"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 9,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [
|
|
|
+ {
|
|
|
+ "data": {
|
|
|
+ "text/plain": [
|
|
|
+ "0 150000\n",
|
|
|
+ "1 150000\n",
|
|
|
+ "2 70000\n",
|
|
|
+ "3 70000\n",
|
|
|
+ "4 150000\n",
|
|
|
+ "Name: odometer_km, dtype: int64"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ "execution_count": 9,
|
|
|
+ "metadata": {},
|
|
|
+ "output_type": "execute_result"
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "source": [
|
|
|
+ "autos[\"odometer\"] = (autos[\"odometer\"]\n",
|
|
|
+ " .str.replace(\"km\",\"\")\n",
|
|
|
+ " .str.replace(\",\",\"\")\n",
|
|
|
+ " .astype(int)\n",
|
|
|
+ " )\n",
|
|
|
+ "autos.rename({\"odometer\": \"odometer_km\"}, axis=1, inplace=True)\n",
|
|
|
+ "autos[\"odometer_km\"].head()"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "## Exploring Odometer and Price"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 10,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [
|
|
|
+ {
|
|
|
+ "data": {
|
|
|
+ "text/plain": [
|
|
|
+ "150000 32424\n",
|
|
|
+ "125000 5170\n",
|
|
|
+ "100000 2169\n",
|
|
|
+ "90000 1757\n",
|
|
|
+ "80000 1436\n",
|
|
|
+ "70000 1230\n",
|
|
|
+ "60000 1164\n",
|
|
|
+ "50000 1027\n",
|
|
|
+ "5000 967\n",
|
|
|
+ "40000 819\n",
|
|
|
+ "30000 789\n",
|
|
|
+ "20000 784\n",
|
|
|
+ "10000 264\n",
|
|
|
+ "Name: odometer_km, dtype: int64"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ "execution_count": 10,
|
|
|
+ "metadata": {},
|
|
|
+ "output_type": "execute_result"
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "source": [
|
|
|
+ "autos[\"odometer_km\"].value_counts()"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "We can see that the values in this field are rounded, which might indicate that sellers had to choose from pre-set options for this field. Additionally, there are more high mileage than low mileage vehicles."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 11,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [
|
|
|
+ {
|
|
|
+ "name": "stdout",
|
|
|
+ "output_type": "stream",
|
|
|
+ "text": [
|
|
|
+ "(2357,)\n",
|
|
|
+ "count 5.000000e+04\n",
|
|
|
+ "mean 9.840044e+03\n",
|
|
|
+ "std 4.811044e+05\n",
|
|
|
+ "min 0.000000e+00\n",
|
|
|
+ "25% 1.100000e+03\n",
|
|
|
+ "50% 2.950000e+03\n",
|
|
|
+ "75% 7.200000e+03\n",
|
|
|
+ "max 1.000000e+08\n",
|
|
|
+ "Name: price, dtype: float64\n"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "data": {
|
|
|
+ "text/plain": [
|
|
|
+ "0 1421\n",
|
|
|
+ "500 781\n",
|
|
|
+ "1500 734\n",
|
|
|
+ "2500 643\n",
|
|
|
+ "1000 639\n",
|
|
|
+ "1200 639\n",
|
|
|
+ "600 531\n",
|
|
|
+ "800 498\n",
|
|
|
+ "3500 498\n",
|
|
|
+ "2000 460\n",
|
|
|
+ "999 434\n",
|
|
|
+ "750 433\n",
|
|
|
+ "900 420\n",
|
|
|
+ "650 419\n",
|
|
|
+ "850 410\n",
|
|
|
+ "700 395\n",
|
|
|
+ "4500 394\n",
|
|
|
+ "300 384\n",
|
|
|
+ "2200 382\n",
|
|
|
+ "950 379\n",
|
|
|
+ "Name: price, dtype: int64"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ "execution_count": 11,
|
|
|
+ "metadata": {},
|
|
|
+ "output_type": "execute_result"
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "source": [
|
|
|
+ "print(autos[\"price\"].unique().shape)\n",
|
|
|
+ "print(autos[\"price\"].describe())\n",
|
|
|
+ "autos[\"price\"].value_counts().head(20)"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "Again, the prices in this column seem rounded, however given there are 2357 unique values in the column, that may just be people's tendency to round prices on the site.\n",
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ "There are 1,421 cars listed with $0 price - given that this is only 2% of the of the cars, we might consider removing these rows. The maximum price is one hundred million dollars, which seems a lot, let's look at the highest prices further."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 12,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [
|
|
|
+ {
|
|
|
+ "data": {
|
|
|
+ "text/plain": [
|
|
|
+ "99999999 1\n",
|
|
|
+ "27322222 1\n",
|
|
|
+ "12345678 3\n",
|
|
|
+ "11111111 2\n",
|
|
|
+ "10000000 1\n",
|
|
|
+ "3890000 1\n",
|
|
|
+ "1300000 1\n",
|
|
|
+ "1234566 1\n",
|
|
|
+ "999999 2\n",
|
|
|
+ "999990 1\n",
|
|
|
+ "350000 1\n",
|
|
|
+ "345000 1\n",
|
|
|
+ "299000 1\n",
|
|
|
+ "295000 1\n",
|
|
|
+ "265000 1\n",
|
|
|
+ "259000 1\n",
|
|
|
+ "250000 1\n",
|
|
|
+ "220000 1\n",
|
|
|
+ "198000 1\n",
|
|
|
+ "197000 1\n",
|
|
|
+ "Name: price, dtype: int64"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ "execution_count": 12,
|
|
|
+ "metadata": {},
|
|
|
+ "output_type": "execute_result"
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "source": [
|
|
|
+ "autos[\"price\"].value_counts().sort_index(ascending=False).head(20)"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 13,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [
|
|
|
+ {
|
|
|
+ "data": {
|
|
|
+ "text/plain": [
|
|
|
+ "0 1421\n",
|
|
|
+ "1 156\n",
|
|
|
+ "2 3\n",
|
|
|
+ "3 1\n",
|
|
|
+ "5 2\n",
|
|
|
+ "8 1\n",
|
|
|
+ "9 1\n",
|
|
|
+ "10 7\n",
|
|
|
+ "11 2\n",
|
|
|
+ "12 3\n",
|
|
|
+ "13 2\n",
|
|
|
+ "14 1\n",
|
|
|
+ "15 2\n",
|
|
|
+ "17 3\n",
|
|
|
+ "18 1\n",
|
|
|
+ "20 4\n",
|
|
|
+ "25 5\n",
|
|
|
+ "29 1\n",
|
|
|
+ "30 7\n",
|
|
|
+ "35 1\n",
|
|
|
+ "Name: price, dtype: int64"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ "execution_count": 13,
|
|
|
+ "metadata": {},
|
|
|
+ "output_type": "execute_result"
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "source": [
|
|
|
+ "autos[\"price\"].value_counts().sort_index(ascending=True).head(20)"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "There are a number of listings with prices below \\$30, including about 1,500 at \\$0. There are also a small number of listings with very high values, including 14 at around or over $1 million.\n",
|
|
|
+ "\n",
|
|
|
+ "Given that eBay is an auction site, there could legitimately be items where the opening bid is \\$1. We will keep the \\$1 items, but remove anything above \\$350,000, since it seems that prices increase steadily to that number and then jump up to less realistic numbers."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 14,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [
|
|
|
+ {
|
|
|
+ "data": {
|
|
|
+ "text/plain": [
|
|
|
+ "count 48565.000000\n",
|
|
|
+ "mean 5888.935591\n",
|
|
|
+ "std 9059.854754\n",
|
|
|
+ "min 1.000000\n",
|
|
|
+ "25% 1200.000000\n",
|
|
|
+ "50% 3000.000000\n",
|
|
|
+ "75% 7490.000000\n",
|
|
|
+ "max 350000.000000\n",
|
|
|
+ "Name: price, dtype: float64"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ "execution_count": 14,
|
|
|
+ "metadata": {},
|
|
|
+ "output_type": "execute_result"
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "source": [
|
|
|
+ "autos = autos[autos[\"price\"].between(1,351000)]\n",
|
|
|
+ "autos[\"price\"].describe()"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "## Exploring the date columns"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "There are a number of columns with date information:\n",
|
|
|
+ "\n",
|
|
|
+ "- `date_crawled`\n",
|
|
|
+ "- `registration_month`\n",
|
|
|
+ "- `registration_year`\n",
|
|
|
+ "- `ad_created`\n",
|
|
|
+ "- `last_seen`\n",
|
|
|
+ "\n",
|
|
|
+ "These are a combination of dates that were crawled, and dates with meta-information from the crawler. The non-registration dates are stored as strings.\n",
|
|
|
+ "\n",
|
|
|
+ "We'll explore each of these columns to learn more about the listings."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 28,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [
|
|
|
+ {
|
|
|
+ "data": {
|
|
|
+ "text/html": [
|
|
|
+ "<div>\n",
|
|
|
+ "<style scoped>\n",
|
|
|
+ " .dataframe tbody tr th:only-of-type {\n",
|
|
|
+ " vertical-align: middle;\n",
|
|
|
+ " }\n",
|
|
|
+ "\n",
|
|
|
+ " .dataframe tbody tr th {\n",
|
|
|
+ " vertical-align: top;\n",
|
|
|
+ " }\n",
|
|
|
+ "\n",
|
|
|
+ " .dataframe thead th {\n",
|
|
|
+ " text-align: right;\n",
|
|
|
+ " }\n",
|
|
|
+ "</style>\n",
|
|
|
+ "<table border=\"1\" class=\"dataframe\">\n",
|
|
|
+ " <thead>\n",
|
|
|
+ " <tr style=\"text-align: right;\">\n",
|
|
|
+ " <th></th>\n",
|
|
|
+ " <th>date_crawled</th>\n",
|
|
|
+ " <th>ad_created</th>\n",
|
|
|
+ " <th>last_seen</th>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " </thead>\n",
|
|
|
+ " <tbody>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>0</th>\n",
|
|
|
+ " <td>2016-03-26 17:47:46</td>\n",
|
|
|
+ " <td>2016-03-26 00:00:00</td>\n",
|
|
|
+ " <td>2016-04-06 06:45:54</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>1</th>\n",
|
|
|
+ " <td>2016-04-04 13:38:56</td>\n",
|
|
|
+ " <td>2016-04-04 00:00:00</td>\n",
|
|
|
+ " <td>2016-04-06 14:45:08</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>2</th>\n",
|
|
|
+ " <td>2016-03-26 18:57:24</td>\n",
|
|
|
+ " <td>2016-03-26 00:00:00</td>\n",
|
|
|
+ " <td>2016-04-06 20:15:37</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>3</th>\n",
|
|
|
+ " <td>2016-03-12 16:58:10</td>\n",
|
|
|
+ " <td>2016-03-12 00:00:00</td>\n",
|
|
|
+ " <td>2016-03-15 03:16:28</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>4</th>\n",
|
|
|
+ " <td>2016-04-01 14:38:50</td>\n",
|
|
|
+ " <td>2016-04-01 00:00:00</td>\n",
|
|
|
+ " <td>2016-04-01 14:38:50</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " </tbody>\n",
|
|
|
+ "</table>\n",
|
|
|
+ "</div>"
|
|
|
+ ],
|
|
|
+ "text/plain": [
|
|
|
+ " date_crawled ad_created last_seen\n",
|
|
|
+ "0 2016-03-26 17:47:46 2016-03-26 00:00:00 2016-04-06 06:45:54\n",
|
|
|
+ "1 2016-04-04 13:38:56 2016-04-04 00:00:00 2016-04-06 14:45:08\n",
|
|
|
+ "2 2016-03-26 18:57:24 2016-03-26 00:00:00 2016-04-06 20:15:37\n",
|
|
|
+ "3 2016-03-12 16:58:10 2016-03-12 00:00:00 2016-03-15 03:16:28\n",
|
|
|
+ "4 2016-04-01 14:38:50 2016-04-01 00:00:00 2016-04-01 14:38:50"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ "execution_count": 28,
|
|
|
+ "metadata": {},
|
|
|
+ "output_type": "execute_result"
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "source": [
|
|
|
+ "autos[['date_crawled','ad_created','last_seen']][0:5]"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 30,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [
|
|
|
+ {
|
|
|
+ "data": {
|
|
|
+ "text/plain": [
|
|
|
+ "2016-03-05 0.025327\n",
|
|
|
+ "2016-03-06 0.014043\n",
|
|
|
+ "2016-03-07 0.036014\n",
|
|
|
+ "2016-03-08 0.033296\n",
|
|
|
+ "2016-03-09 0.033090\n",
|
|
|
+ "2016-03-10 0.032184\n",
|
|
|
+ "2016-03-11 0.032575\n",
|
|
|
+ "2016-03-12 0.036920\n",
|
|
|
+ "2016-03-13 0.015670\n",
|
|
|
+ "2016-03-14 0.036549\n",
|
|
|
+ "2016-03-15 0.034284\n",
|
|
|
+ "2016-03-16 0.029610\n",
|
|
|
+ "2016-03-17 0.031628\n",
|
|
|
+ "2016-03-18 0.012911\n",
|
|
|
+ "2016-03-19 0.034778\n",
|
|
|
+ "2016-03-20 0.037887\n",
|
|
|
+ "2016-03-21 0.037373\n",
|
|
|
+ "2016-03-22 0.032987\n",
|
|
|
+ "2016-03-23 0.032225\n",
|
|
|
+ "2016-03-24 0.029342\n",
|
|
|
+ "2016-03-25 0.031607\n",
|
|
|
+ "2016-03-26 0.032204\n",
|
|
|
+ "2016-03-27 0.031092\n",
|
|
|
+ "2016-03-28 0.034860\n",
|
|
|
+ "2016-03-29 0.034099\n",
|
|
|
+ "2016-03-30 0.033687\n",
|
|
|
+ "2016-03-31 0.031834\n",
|
|
|
+ "2016-04-01 0.033687\n",
|
|
|
+ "2016-04-02 0.035478\n",
|
|
|
+ "2016-04-03 0.038608\n",
|
|
|
+ "2016-04-04 0.036487\n",
|
|
|
+ "2016-04-05 0.013096\n",
|
|
|
+ "2016-04-06 0.003171\n",
|
|
|
+ "2016-04-07 0.001400\n",
|
|
|
+ "Name: date_crawled, dtype: float64"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ "execution_count": 30,
|
|
|
+ "metadata": {},
|
|
|
+ "output_type": "execute_result"
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "source": [
|
|
|
+ "(autos[\"date_crawled\"]\n",
|
|
|
+ " .str[:10]\n",
|
|
|
+ " .value_counts(normalize=True, dropna=False)\n",
|
|
|
+ " .sort_index()\n",
|
|
|
+ " )"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 29,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [
|
|
|
+ {
|
|
|
+ "data": {
|
|
|
+ "text/plain": [
|
|
|
+ "2016-04-07 0.001400\n",
|
|
|
+ "2016-04-06 0.003171\n",
|
|
|
+ "2016-03-18 0.012911\n",
|
|
|
+ "2016-04-05 0.013096\n",
|
|
|
+ "2016-03-06 0.014043\n",
|
|
|
+ "2016-03-13 0.015670\n",
|
|
|
+ "2016-03-05 0.025327\n",
|
|
|
+ "2016-03-24 0.029342\n",
|
|
|
+ "2016-03-16 0.029610\n",
|
|
|
+ "2016-03-27 0.031092\n",
|
|
|
+ "2016-03-25 0.031607\n",
|
|
|
+ "2016-03-17 0.031628\n",
|
|
|
+ "2016-03-31 0.031834\n",
|
|
|
+ "2016-03-10 0.032184\n",
|
|
|
+ "2016-03-26 0.032204\n",
|
|
|
+ "2016-03-23 0.032225\n",
|
|
|
+ "2016-03-11 0.032575\n",
|
|
|
+ "2016-03-22 0.032987\n",
|
|
|
+ "2016-03-09 0.033090\n",
|
|
|
+ "2016-03-08 0.033296\n",
|
|
|
+ "2016-03-30 0.033687\n",
|
|
|
+ "2016-04-01 0.033687\n",
|
|
|
+ "2016-03-29 0.034099\n",
|
|
|
+ "2016-03-15 0.034284\n",
|
|
|
+ "2016-03-19 0.034778\n",
|
|
|
+ "2016-03-28 0.034860\n",
|
|
|
+ "2016-04-02 0.035478\n",
|
|
|
+ "2016-03-07 0.036014\n",
|
|
|
+ "2016-04-04 0.036487\n",
|
|
|
+ "2016-03-14 0.036549\n",
|
|
|
+ "2016-03-12 0.036920\n",
|
|
|
+ "2016-03-21 0.037373\n",
|
|
|
+ "2016-03-20 0.037887\n",
|
|
|
+ "2016-04-03 0.038608\n",
|
|
|
+ "Name: date_crawled, dtype: float64"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ "execution_count": 29,
|
|
|
+ "metadata": {},
|
|
|
+ "output_type": "execute_result"
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "source": [
|
|
|
+ "(autos[\"date_crawled\"]\n",
|
|
|
+ " .str[:10]\n",
|
|
|
+ " .value_counts(normalize=True, dropna=False)\n",
|
|
|
+ " .sort_values()\n",
|
|
|
+ " )"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "Looks like the site was crawled daily over roughly a one month period in March and April 2016. The distribution of listings crawled on each day is roughly uniform."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 150,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [
|
|
|
+ {
|
|
|
+ "data": {
|
|
|
+ "text/plain": [
|
|
|
+ "2016-03-05 0.001071\n",
|
|
|
+ "2016-03-06 0.004324\n",
|
|
|
+ "2016-03-07 0.005395\n",
|
|
|
+ "2016-03-08 0.007413\n",
|
|
|
+ "2016-03-09 0.009595\n",
|
|
|
+ "2016-03-10 0.010666\n",
|
|
|
+ "2016-03-11 0.012375\n",
|
|
|
+ "2016-03-12 0.023783\n",
|
|
|
+ "2016-03-13 0.008895\n",
|
|
|
+ "2016-03-14 0.012602\n",
|
|
|
+ "2016-03-15 0.015876\n",
|
|
|
+ "2016-03-16 0.016452\n",
|
|
|
+ "2016-03-17 0.028086\n",
|
|
|
+ "2016-03-18 0.007351\n",
|
|
|
+ "2016-03-19 0.015834\n",
|
|
|
+ "2016-03-20 0.020653\n",
|
|
|
+ "2016-03-21 0.020632\n",
|
|
|
+ "2016-03-22 0.021373\n",
|
|
|
+ "2016-03-23 0.018532\n",
|
|
|
+ "2016-03-24 0.019767\n",
|
|
|
+ "2016-03-25 0.019211\n",
|
|
|
+ "2016-03-26 0.016802\n",
|
|
|
+ "2016-03-27 0.015649\n",
|
|
|
+ "2016-03-28 0.020859\n",
|
|
|
+ "2016-03-29 0.022341\n",
|
|
|
+ "2016-03-30 0.024771\n",
|
|
|
+ "2016-03-31 0.023783\n",
|
|
|
+ "2016-04-01 0.022794\n",
|
|
|
+ "2016-04-02 0.024915\n",
|
|
|
+ "2016-04-03 0.025203\n",
|
|
|
+ "2016-04-04 0.024483\n",
|
|
|
+ "2016-04-05 0.124761\n",
|
|
|
+ "2016-04-06 0.221806\n",
|
|
|
+ "2016-04-07 0.131947\n",
|
|
|
+ "Name: last_seen, dtype: float64"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ "execution_count": 150,
|
|
|
+ "metadata": {},
|
|
|
+ "output_type": "execute_result"
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "source": [
|
|
|
+ "(autos[\"last_seen\"]\n",
|
|
|
+ " .str[:10]\n",
|
|
|
+ " .value_counts(normalize=True, dropna=False)\n",
|
|
|
+ " .sort_index()\n",
|
|
|
+ " )"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "The crawler recorded the date it last saw any listing, which allows us to determine on what day a listing was removed, presumably because the car was sold.\n",
|
|
|
+ "\n",
|
|
|
+ "The last three days contain a disproportionate amount of 'last seen' values. Given that these are 6-10x the values from the previous days, it's unlikely that there was a massive spike in sales, and more likely that these values are to do with the crawling period ending and don't indicate car sales."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 33,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [
|
|
|
+ {
|
|
|
+ "name": "stdout",
|
|
|
+ "output_type": "stream",
|
|
|
+ "text": [
|
|
|
+ "(76,)\n"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "data": {
|
|
|
+ "text/plain": [
|
|
|
+ "2015-06-11 0.000021\n",
|
|
|
+ "2015-08-10 0.000021\n",
|
|
|
+ "2015-09-09 0.000021\n",
|
|
|
+ "2015-11-10 0.000021\n",
|
|
|
+ "2015-12-05 0.000021\n",
|
|
|
+ "2015-12-30 0.000021\n",
|
|
|
+ "2016-01-03 0.000021\n",
|
|
|
+ "2016-01-07 0.000021\n",
|
|
|
+ "2016-01-10 0.000041\n",
|
|
|
+ "2016-01-13 0.000021\n",
|
|
|
+ "2016-01-14 0.000021\n",
|
|
|
+ "2016-01-16 0.000021\n",
|
|
|
+ "2016-01-22 0.000021\n",
|
|
|
+ "2016-01-27 0.000062\n",
|
|
|
+ "2016-01-29 0.000021\n",
|
|
|
+ "2016-02-01 0.000021\n",
|
|
|
+ "2016-02-02 0.000041\n",
|
|
|
+ "2016-02-05 0.000041\n",
|
|
|
+ "2016-02-07 0.000021\n",
|
|
|
+ "2016-02-08 0.000021\n",
|
|
|
+ "2016-02-09 0.000021\n",
|
|
|
+ "2016-02-11 0.000021\n",
|
|
|
+ "2016-02-12 0.000041\n",
|
|
|
+ "2016-02-14 0.000041\n",
|
|
|
+ "2016-02-16 0.000021\n",
|
|
|
+ "2016-02-17 0.000021\n",
|
|
|
+ "2016-02-18 0.000041\n",
|
|
|
+ "2016-02-19 0.000062\n",
|
|
|
+ "2016-02-20 0.000041\n",
|
|
|
+ "2016-02-21 0.000062\n",
|
|
|
+ " ... \n",
|
|
|
+ "2016-03-09 0.033151\n",
|
|
|
+ "2016-03-10 0.031895\n",
|
|
|
+ "2016-03-11 0.032904\n",
|
|
|
+ "2016-03-12 0.036755\n",
|
|
|
+ "2016-03-13 0.017008\n",
|
|
|
+ "2016-03-14 0.035190\n",
|
|
|
+ "2016-03-15 0.034016\n",
|
|
|
+ "2016-03-16 0.030125\n",
|
|
|
+ "2016-03-17 0.031278\n",
|
|
|
+ "2016-03-18 0.013590\n",
|
|
|
+ "2016-03-19 0.033687\n",
|
|
|
+ "2016-03-20 0.037949\n",
|
|
|
+ "2016-03-21 0.037579\n",
|
|
|
+ "2016-03-22 0.032801\n",
|
|
|
+ "2016-03-23 0.032060\n",
|
|
|
+ "2016-03-24 0.029280\n",
|
|
|
+ "2016-03-25 0.031751\n",
|
|
|
+ "2016-03-26 0.032266\n",
|
|
|
+ "2016-03-27 0.030989\n",
|
|
|
+ "2016-03-28 0.034984\n",
|
|
|
+ "2016-03-29 0.034037\n",
|
|
|
+ "2016-03-30 0.033501\n",
|
|
|
+ "2016-03-31 0.031875\n",
|
|
|
+ "2016-04-01 0.033687\n",
|
|
|
+ "2016-04-02 0.035149\n",
|
|
|
+ "2016-04-03 0.038855\n",
|
|
|
+ "2016-04-04 0.036858\n",
|
|
|
+ "2016-04-05 0.011819\n",
|
|
|
+ "2016-04-06 0.003253\n",
|
|
|
+ "2016-04-07 0.001256\n",
|
|
|
+ "Name: ad_created, Length: 76, dtype: float64"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ "execution_count": 33,
|
|
|
+ "metadata": {},
|
|
|
+ "output_type": "execute_result"
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "source": [
|
|
|
+ "print(autos[\"ad_created\"].str[:10].unique().shape)\n",
|
|
|
+ "(autos[\"ad_created\"]\n",
|
|
|
+ " .str[:10]\n",
|
|
|
+ " .value_counts(normalize=True, dropna=False)\n",
|
|
|
+ " .sort_index()\n",
|
|
|
+ " )"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "There is a large variety of ad created dates. Most fall within 1-2 months of the listing date, but a few are quite old, with the oldest at around 9 months."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 34,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [
|
|
|
+ {
|
|
|
+ "data": {
|
|
|
+ "text/plain": [
|
|
|
+ "count 48565.000000\n",
|
|
|
+ "mean 2004.755421\n",
|
|
|
+ "std 88.643887\n",
|
|
|
+ "min 1000.000000\n",
|
|
|
+ "25% 1999.000000\n",
|
|
|
+ "50% 2004.000000\n",
|
|
|
+ "75% 2008.000000\n",
|
|
|
+ "max 9999.000000\n",
|
|
|
+ "Name: registration_year, dtype: float64"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ "execution_count": 34,
|
|
|
+ "metadata": {},
|
|
|
+ "output_type": "execute_result"
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "source": [
|
|
|
+ "autos[\"registration_year\"].describe()"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "The year that the car was first registered will likely indicate the age of the car. Looking at this column, we note some odd values. The minimum value is `1000`, long before cars were invented and the maximum is `9999`, many years into the future."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "## Dealing with Incorrect Registration Year Data"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "Because a car can't be first registered before the listing was seen, any vehicle with a registration year above 2016 is definitely inaccurate. Determining the earliest valid year is more difficult. Realistically, it could be somewhere in the first few decades of the 1900s.\n",
|
|
|
+ "\n",
|
|
|
+ "One option is to remove the listings with these values. Let's determine what percentage of our data has invalid values in this column:"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 35,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [
|
|
|
+ {
|
|
|
+ "data": {
|
|
|
+ "text/plain": [
|
|
|
+ "0.038793369710697002"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ "execution_count": 35,
|
|
|
+ "metadata": {},
|
|
|
+ "output_type": "execute_result"
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "source": [
|
|
|
+ "(~autos[\"registration_year\"].between(1900,2016)).sum() / autos.shape[0]"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "Given that this is less than 4% of our data, we will remove these rows."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 47,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [
|
|
|
+ {
|
|
|
+ "data": {
|
|
|
+ "text/plain": [
|
|
|
+ "2000 0.067608\n",
|
|
|
+ "2005 0.062895\n",
|
|
|
+ "1999 0.062060\n",
|
|
|
+ "2004 0.057904\n",
|
|
|
+ "2003 0.057818\n",
|
|
|
+ "2006 0.057197\n",
|
|
|
+ "2001 0.056468\n",
|
|
|
+ "2002 0.053255\n",
|
|
|
+ "1998 0.050620\n",
|
|
|
+ "2007 0.048778\n",
|
|
|
+ "Name: registration_year, dtype: float64"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ "execution_count": 47,
|
|
|
+ "metadata": {},
|
|
|
+ "output_type": "execute_result"
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "source": [
|
|
|
+ "# Many ways to select rows in a dataframe that fall within a value range for a column.\n",
|
|
|
+ "# Using `Series.between()` is one way.\n",
|
|
|
+ "autos = autos[autos[\"registration_year\"].between(1900,2016)]\n",
|
|
|
+ "autos[\"registration_year\"].value_counts(normalize=True).head(10)"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "It appears that most of the vehicles were first registered in the past 20 years."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "## Exploring Price by Brand"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 48,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [
|
|
|
+ {
|
|
|
+ "data": {
|
|
|
+ "text/plain": [
|
|
|
+ "volkswagen 0.211264\n",
|
|
|
+ "bmw 0.110045\n",
|
|
|
+ "opel 0.107581\n",
|
|
|
+ "mercedes_benz 0.096463\n",
|
|
|
+ "audi 0.086566\n",
|
|
|
+ "ford 0.069900\n",
|
|
|
+ "renault 0.047150\n",
|
|
|
+ "peugeot 0.029841\n",
|
|
|
+ "fiat 0.025642\n",
|
|
|
+ "seat 0.018273\n",
|
|
|
+ "skoda 0.016409\n",
|
|
|
+ "nissan 0.015274\n",
|
|
|
+ "mazda 0.015188\n",
|
|
|
+ "smart 0.014160\n",
|
|
|
+ "citroen 0.014010\n",
|
|
|
+ "toyota 0.012703\n",
|
|
|
+ "hyundai 0.010025\n",
|
|
|
+ "sonstige_autos 0.009811\n",
|
|
|
+ "volvo 0.009147\n",
|
|
|
+ "mini 0.008762\n",
|
|
|
+ "mitsubishi 0.008226\n",
|
|
|
+ "honda 0.007840\n",
|
|
|
+ "kia 0.007069\n",
|
|
|
+ "alfa_romeo 0.006641\n",
|
|
|
+ "porsche 0.006127\n",
|
|
|
+ "suzuki 0.005934\n",
|
|
|
+ "chevrolet 0.005698\n",
|
|
|
+ "chrysler 0.003513\n",
|
|
|
+ "dacia 0.002635\n",
|
|
|
+ "daihatsu 0.002506\n",
|
|
|
+ "jeep 0.002271\n",
|
|
|
+ "subaru 0.002142\n",
|
|
|
+ "land_rover 0.002099\n",
|
|
|
+ "saab 0.001649\n",
|
|
|
+ "jaguar 0.001564\n",
|
|
|
+ "daewoo 0.001500\n",
|
|
|
+ "trabant 0.001392\n",
|
|
|
+ "rover 0.001328\n",
|
|
|
+ "lancia 0.001071\n",
|
|
|
+ "lada 0.000578\n",
|
|
|
+ "Name: brand, dtype: float64"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ "execution_count": 48,
|
|
|
+ "metadata": {},
|
|
|
+ "output_type": "execute_result"
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "source": [
|
|
|
+ "autos[\"brand\"].value_counts(normalize=True)"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "German manufacturers represent four out of the top five brands, almost 50% of the overall listings. Volkswagen is by far the most popular brand, with approximately double the cars for sale of the next two brands combined.\n",
|
|
|
+ "\n",
|
|
|
+ "There are lots of brands that don't have a significant percentage of listings, so we will limit our analysis to brands representing more than 5% of total listings."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 49,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [
|
|
|
+ {
|
|
|
+ "name": "stdout",
|
|
|
+ "output_type": "stream",
|
|
|
+ "text": [
|
|
|
+ "Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')\n"
|
|
|
+ ]
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "source": [
|
|
|
+ "brand_counts = autos[\"brand\"].value_counts(normalize=True)\n",
|
|
|
+ "common_brands = brand_counts[brand_counts > .05].index\n",
|
|
|
+ "print(common_brands)"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 39,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [
|
|
|
+ {
|
|
|
+ "data": {
|
|
|
+ "text/plain": [
|
|
|
+ "{'audi': 9336,\n",
|
|
|
+ " 'bmw': 8332,\n",
|
|
|
+ " 'ford': 3749,\n",
|
|
|
+ " 'mercedes_benz': 8628,\n",
|
|
|
+ " 'opel': 2975,\n",
|
|
|
+ " 'volkswagen': 5402}"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ "execution_count": 39,
|
|
|
+ "metadata": {},
|
|
|
+ "output_type": "execute_result"
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "source": [
|
|
|
+ "brand_mean_prices = {}\n",
|
|
|
+ "\n",
|
|
|
+ "for brand in common_brands:\n",
|
|
|
+ " brand_only = autos[autos[\"brand\"] == brand]\n",
|
|
|
+ " mean_price = brand_only[\"price\"].mean()\n",
|
|
|
+ " brand_mean_prices[brand] = int(mean_price)\n",
|
|
|
+ "\n",
|
|
|
+ "brand_mean_prices"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "Of the top 5 brands, there is a distinct price gap:\n",
|
|
|
+ "\n",
|
|
|
+ "- Audi, BMW and Mercedes Benz are more expensive\n",
|
|
|
+ "- Ford and Opel are less expensive\n",
|
|
|
+ "- Volkswagen is in between - this may explain its popularity, it may be a 'best of 'both worlds' option."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "## Exploring Mileage"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 66,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [
|
|
|
+ {
|
|
|
+ "data": {
|
|
|
+ "text/html": [
|
|
|
+ "<div>\n",
|
|
|
+ "<style scoped>\n",
|
|
|
+ " .dataframe tbody tr th:only-of-type {\n",
|
|
|
+ " vertical-align: middle;\n",
|
|
|
+ " }\n",
|
|
|
+ "\n",
|
|
|
+ " .dataframe tbody tr th {\n",
|
|
|
+ " vertical-align: top;\n",
|
|
|
+ " }\n",
|
|
|
+ "\n",
|
|
|
+ " .dataframe thead th {\n",
|
|
|
+ " text-align: right;\n",
|
|
|
+ " }\n",
|
|
|
+ "</style>\n",
|
|
|
+ "<table border=\"1\" class=\"dataframe\">\n",
|
|
|
+ " <thead>\n",
|
|
|
+ " <tr style=\"text-align: right;\">\n",
|
|
|
+ " <th></th>\n",
|
|
|
+ " <th>mean_mileage</th>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " </thead>\n",
|
|
|
+ " <tbody>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>audi</th>\n",
|
|
|
+ " <td>9336</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>bmw</th>\n",
|
|
|
+ " <td>8332</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>ford</th>\n",
|
|
|
+ " <td>3749</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>mercedes_benz</th>\n",
|
|
|
+ " <td>8628</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>opel</th>\n",
|
|
|
+ " <td>2975</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>volkswagen</th>\n",
|
|
|
+ " <td>5402</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " </tbody>\n",
|
|
|
+ "</table>\n",
|
|
|
+ "</div>"
|
|
|
+ ],
|
|
|
+ "text/plain": [
|
|
|
+ " mean_mileage\n",
|
|
|
+ "audi 9336\n",
|
|
|
+ "bmw 8332\n",
|
|
|
+ "ford 3749\n",
|
|
|
+ "mercedes_benz 8628\n",
|
|
|
+ "opel 2975\n",
|
|
|
+ "volkswagen 5402"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ "execution_count": 66,
|
|
|
+ "metadata": {},
|
|
|
+ "output_type": "execute_result"
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "source": [
|
|
|
+ "bmp_series = pd.Series(brand_mean_prices)\n",
|
|
|
+ "pd.DataFrame(bmp_series, columns=[\"mean_mileage\"])"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 44,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "brand_mean_mileage = {}\n",
|
|
|
+ "\n",
|
|
|
+ "for brand in common_brands:\n",
|
|
|
+ " brand_only = autos[autos[\"brand\"] == brand]\n",
|
|
|
+ " mean_mileage = brand_only[\"odometer_km\"].mean()\n",
|
|
|
+ " brand_mean_mileage[brand] = int(mean_mileage)\n",
|
|
|
+ "\n",
|
|
|
+ "mean_mileage = pd.Series(brand_mean_mileage).sort_values(ascending=False)\n",
|
|
|
+ "mean_prices = pd.Series(brand_mean_prices).sort_values(ascending=False)"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 51,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [
|
|
|
+ {
|
|
|
+ "data": {
|
|
|
+ "text/html": [
|
|
|
+ "<div>\n",
|
|
|
+ "<style scoped>\n",
|
|
|
+ " .dataframe tbody tr th:only-of-type {\n",
|
|
|
+ " vertical-align: middle;\n",
|
|
|
+ " }\n",
|
|
|
+ "\n",
|
|
|
+ " .dataframe tbody tr th {\n",
|
|
|
+ " vertical-align: top;\n",
|
|
|
+ " }\n",
|
|
|
+ "\n",
|
|
|
+ " .dataframe thead th {\n",
|
|
|
+ " text-align: right;\n",
|
|
|
+ " }\n",
|
|
|
+ "</style>\n",
|
|
|
+ "<table border=\"1\" class=\"dataframe\">\n",
|
|
|
+ " <thead>\n",
|
|
|
+ " <tr style=\"text-align: right;\">\n",
|
|
|
+ " <th></th>\n",
|
|
|
+ " <th>mean_mileage</th>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " </thead>\n",
|
|
|
+ " <tbody>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>bmw</th>\n",
|
|
|
+ " <td>132572</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>mercedes_benz</th>\n",
|
|
|
+ " <td>130788</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>opel</th>\n",
|
|
|
+ " <td>129310</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>audi</th>\n",
|
|
|
+ " <td>129157</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>volkswagen</th>\n",
|
|
|
+ " <td>128707</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>ford</th>\n",
|
|
|
+ " <td>124266</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " </tbody>\n",
|
|
|
+ "</table>\n",
|
|
|
+ "</div>"
|
|
|
+ ],
|
|
|
+ "text/plain": [
|
|
|
+ " mean_mileage\n",
|
|
|
+ "bmw 132572\n",
|
|
|
+ "mercedes_benz 130788\n",
|
|
|
+ "opel 129310\n",
|
|
|
+ "audi 129157\n",
|
|
|
+ "volkswagen 128707\n",
|
|
|
+ "ford 124266"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ "execution_count": 51,
|
|
|
+ "metadata": {},
|
|
|
+ "output_type": "execute_result"
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "source": [
|
|
|
+ "brand_info = pd.DataFrame(mean_mileage,columns=['mean_mileage'])\n",
|
|
|
+ "brand_info"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 52,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [
|
|
|
+ {
|
|
|
+ "data": {
|
|
|
+ "text/html": [
|
|
|
+ "<div>\n",
|
|
|
+ "<style scoped>\n",
|
|
|
+ " .dataframe tbody tr th:only-of-type {\n",
|
|
|
+ " vertical-align: middle;\n",
|
|
|
+ " }\n",
|
|
|
+ "\n",
|
|
|
+ " .dataframe tbody tr th {\n",
|
|
|
+ " vertical-align: top;\n",
|
|
|
+ " }\n",
|
|
|
+ "\n",
|
|
|
+ " .dataframe thead th {\n",
|
|
|
+ " text-align: right;\n",
|
|
|
+ " }\n",
|
|
|
+ "</style>\n",
|
|
|
+ "<table border=\"1\" class=\"dataframe\">\n",
|
|
|
+ " <thead>\n",
|
|
|
+ " <tr style=\"text-align: right;\">\n",
|
|
|
+ " <th></th>\n",
|
|
|
+ " <th>mean_mileage</th>\n",
|
|
|
+ " <th>mean_price</th>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " </thead>\n",
|
|
|
+ " <tbody>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>bmw</th>\n",
|
|
|
+ " <td>132572</td>\n",
|
|
|
+ " <td>8332</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>mercedes_benz</th>\n",
|
|
|
+ " <td>130788</td>\n",
|
|
|
+ " <td>8628</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>opel</th>\n",
|
|
|
+ " <td>129310</td>\n",
|
|
|
+ " <td>2975</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>audi</th>\n",
|
|
|
+ " <td>129157</td>\n",
|
|
|
+ " <td>9336</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>volkswagen</th>\n",
|
|
|
+ " <td>128707</td>\n",
|
|
|
+ " <td>5402</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " <tr>\n",
|
|
|
+ " <th>ford</th>\n",
|
|
|
+ " <td>124266</td>\n",
|
|
|
+ " <td>3749</td>\n",
|
|
|
+ " </tr>\n",
|
|
|
+ " </tbody>\n",
|
|
|
+ "</table>\n",
|
|
|
+ "</div>"
|
|
|
+ ],
|
|
|
+ "text/plain": [
|
|
|
+ " mean_mileage mean_price\n",
|
|
|
+ "bmw 132572 8332\n",
|
|
|
+ "mercedes_benz 130788 8628\n",
|
|
|
+ "opel 129310 2975\n",
|
|
|
+ "audi 129157 9336\n",
|
|
|
+ "volkswagen 128707 5402\n",
|
|
|
+ "ford 124266 3749"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ "execution_count": 52,
|
|
|
+ "metadata": {},
|
|
|
+ "output_type": "execute_result"
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "source": [
|
|
|
+ "brand_info[\"mean_price\"] = mean_prices\n",
|
|
|
+ "brand_info"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "The range of car mileages does not vary as much as the prices do by brand, instead all falling within 10% for the top brands. There is a slight trend to the more expensive vehicles having higher mileage, with the less expensive vehicles having lower mileage."
|
|
|
+ ]
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "metadata": {
|
|
|
+ "kernelspec": {
|
|
|
+ "display_name": "Python 3",
|
|
|
+ "language": "python",
|
|
|
+ "name": "python3"
|
|
|
+ },
|
|
|
+ "language_info": {
|
|
|
+ "codemirror_mode": {
|
|
|
+ "name": "ipython",
|
|
|
+ "version": 3
|
|
|
+ },
|
|
|
+ "file_extension": ".py",
|
|
|
+ "mimetype": "text/x-python",
|
|
|
+ "name": "python",
|
|
|
+ "nbconvert_exporter": "python",
|
|
|
+ "pygments_lexer": "ipython3",
|
|
|
+ "version": "3.6.3"
|
|
|
+ }
|
|
|
+ },
|
|
|
+ "nbformat": 4,
|
|
|
+ "nbformat_minor": 2
|
|
|
+}
|