{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd\n", "pd.options.display.max_columns = 99\n", "first_five = pd.read_csv('loans_2007.csv', nrows=5)\n", "first_five" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [], "source": [ "thousand_chunk = pd.read_csv('loans_2007.csv', nrows=1000)\n", "thousand_chunk.memory_usage(deep=True).sum()/(1024*1024)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Let's try tripling to 3000 rows and calculate the memory footprint for each chunk." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [], "source": [ "chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)\n", "for chunk in chunk_iter:\n", " print(chunk.memory_usage(deep=True).sum()/(1024*1024))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How many rows in the data set?" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [], "source": [ "chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)\n", "total_rows = 0\n", "for chunk in chunk_iter:\n", " total_rows += len(chunk)\n", "print(total_rows)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exploring the Data in Chunks\n", "\n", "## How many columns have a numeric type? How many have a string type?" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "# Numeric columns\n", "chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)\n", "for chunk in chunk_iter:\n", " print(chunk.dtypes.value_counts())" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Are string columns consistent across chunks?\n", "obj_cols = []\n", "chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)\n", "\n", "for chunk in chunk_iter:\n", " chunk_obj_cols = chunk.select_dtypes(include=['object']).columns.tolist()\n", " if len(obj_cols) > 0:\n", " is_same = obj_cols == chunk_obj_cols\n", " if not is_same:\n", " print(\"overall obj cols:\", obj_cols, \"\\n\")\n", " print(\"chunk obj cols:\", chunk_obj_cols, \"\\n\") \n", " else:\n", " obj_cols = chunk_obj_cols" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "### Observation 1: By default -- 31 numeric columns and 21 string columns.\n", "\n", "### Observation 2: It seems like one column in particular (the `id` column) is being cast to int64 in the last 2 chunks but not in the earlier chunks. Since the `id` column won't be useful for analysis, visualization, or predictive modelling let's ignore this column.\n", "\n", "## How many unique values are there in each string column? How many of the string columns contain values that are less than 50% unique?" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "## Create dictionary (key: column, value: list of Series objects representing each chunk's value counts)\n", "chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)\n", "str_cols_vc = {}\n", "for chunk in chunk_iter:\n", " str_cols = chunk.select_dtypes(include=['object'])\n", " for col in str_cols.columns:\n", " current_col_vc = str_cols[col].value_counts()\n", " if col in str_cols_vc:\n", " str_cols_vc[col].append(current_col_vc)\n", " else:\n", " str_cols_vc[col] = [current_col_vc]" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## Combine the value counts.\n", "combined_vcs = {}\n", "\n", "for col in str_cols_vc:\n", " combined_vc = pd.concat(str_cols_vc[col])\n", " final_vc = combined_vc.groupby(combined_vc.index).sum()\n", " combined_vcs[col] = final_vc" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [], "source": [ "combined_vcs.keys()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Optimizing String Columns\n", "\n", "### Determine which string columns you can convert to a numeric type if you clean them. Let's focus on columns that would actually be useful for analysis and modelling." ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [], "source": [ "obj_cols" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "collapsed": true }, "outputs": [], "source": [ "useful_obj_cols = ['term', 'sub_grade', 'emp_title', 'home_ownership', 'verification_status', 'issue_d', 'purpose', 'earliest_cr_line', 'revol_util', 'last_pymnt_d', 'last_credit_pull_d']" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "collapsed": false }, "outputs": [], "source": [ "for col in useful_obj_cols:\n", " print(col)\n", " print(combined_vcs[col])\n", " print(\"-----------\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Convert to category" ] }, { "cell_type": "code", "execution_count": 94, "metadata": { "collapsed": true }, "outputs": [], "source": [ "convert_col_dtypes = {\n", " \"sub_grade\": \"category\", \"home_ownership\": \"category\", \n", " \"verification_status\": \"category\", \"purpose\": \"category\"\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Convert `term` and `revol_util` to numerical by data cleaning.\n", "### Convert `issue_d`, `earliest_cr_line`, `last_pymnt_d`, and `last_credit_pull_d` to datetime." ] }, { "cell_type": "code", "execution_count": 97, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "chunk[useful_obj_cols]" ] }, { "cell_type": "code", "execution_count": 96, "metadata": { "collapsed": false }, "outputs": [], "source": [ "chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000, dtype=convert_col_dtypes, parse_dates=[\"issue_d\", \"earliest_cr_line\", \"last_pymnt_d\", \"last_credit_pull_d\"])\n", "\n", "for chunk in chunk_iter:\n", " term_cleaned = chunk['term'].str.lstrip(\" \").str.rstrip(\" months\")\n", " revol_cleaned = chunk['revol_util'].str.rstrip(\"%\")\n", " chunk['term'] = pd.to_numeric(term_cleaned)\n", " chunk['revol_util'] = pd.to_numeric(revol_cleaned)\n", " \n", "chunk.dtypes" ] }, { "cell_type": "code", "execution_count": 118, "metadata": { "collapsed": false }, "outputs": [], "source": [ "chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000, dtype=convert_col_dtypes, parse_dates=[\"issue_d\", \"earliest_cr_line\", \"last_pymnt_d\", \"last_credit_pull_d\"])\n", "mv_counts = {}\n", "for chunk in chunk_iter:\n", " term_cleaned = chunk['term'].str.lstrip(\" \").str.rstrip(\" months\")\n", " revol_cleaned = chunk['revol_util'].str.rstrip(\"%\")\n", " chunk['term'] = pd.to_numeric(term_cleaned)\n", " chunk['revol_util'] = pd.to_numeric(revol_cleaned)\n", " float_cols = chunk.select_dtypes(include=['float'])\n", " for col in float_cols.columns:\n", " missing_values = len(chunk) - chunk[col].count()\n", " if col in mv_counts:\n", " mv_counts[col] = mv_counts[col] + missing_values\n", " else:\n", " mv_counts[col] = missing_values\n", "mv_counts" ] }, { "cell_type": "code", "execution_count": 123, "metadata": { "collapsed": false }, "outputs": [], "source": [ "chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000, dtype=convert_col_dtypes, parse_dates=[\"issue_d\", \"earliest_cr_line\", \"last_pymnt_d\", \"last_credit_pull_d\"])\n", "mv_counts = {}\n", "for chunk in chunk_iter:\n", " term_cleaned = chunk['term'].str.lstrip(\" \").str.rstrip(\" months\")\n", " revol_cleaned = chunk['revol_util'].str.rstrip(\"%\")\n", " chunk['term'] = pd.to_numeric(term_cleaned)\n", " chunk['revol_util'] = pd.to_numeric(revol_cleaned)\n", " chunk = chunk.dropna(how='all')\n", " float_cols = chunk.select_dtypes(include=['float'])\n", " for col in float_cols.columns:\n", " missing_values = len(chunk) - chunk[col].count()\n", " if col in mv_counts:\n", " mv_counts[col] = mv_counts[col] + missing_values\n", " else:\n", " mv_counts[col] = missing_values\n", "mv_counts" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.0" } }, "nbformat": 4, "nbformat_minor": 2 }