{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/plain": [ "'Connected: None@factbook.db'" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%capture\n", "%load_ext sql\n", "%sql sqlite:///factbook.db" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview of the Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll begin by getting a sense of what the data looks like." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idcodenameareaarea_landarea_waterpopulationpopulation_growthbirth_ratedeath_ratemigration_rate
1afAfghanistan6522306522300325643422.3238.5713.891.51
2alAlbania2874827398135030292780.312.926.583.3
3agAlgeria238174123817410395421661.8423.674.310.92
4anAndorra4684680855800.128.136.960.0
5aoAngola124670012467000196253532.7838.7811.490.46
" ], "text/plain": [ "[(1, 'af', 'Afghanistan', 652230, 652230, 0, 32564342, 2.32, 38.57, 13.89, 1.51),\n", " (2, 'al', 'Albania', 28748, 27398, 1350, 3029278, 0.3, 12.92, 6.58, 3.3),\n", " (3, 'ag', 'Algeria', 2381741, 2381741, 0, 39542166, 1.84, 23.67, 4.31, 0.92),\n", " (4, 'an', 'Andorra', 468, 468, 0, 85580, 0.12, 8.13, 6.96, 0.0),\n", " (5, 'ao', 'Angola', 1246700, 1246700, 0, 19625353, 2.78, 38.78, 11.49, 0.46)]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "SELECT *\n", " FROM facts\n", " LIMIT 5;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are the descriptions for some of the columns:\n", "\n", "* `name` - The name of the country.\n", "* `area` - The total land and sea area of the country.\n", "* `population` - The country's population.\n", "* `population_growth`- The country's population growth as a percentage.\n", "* `birth_rate` - The country's birth rate, or the number of births a year per 1,000 people.\n", "* `death_rate` - The country's death rate, or the number of death a year per 1,000 people.\n", "* `area`- The country's total area (both land and water).\n", "* `area_land` - The country's land area in [square kilometers](https://www.cia.gov/library/publications/the-world-factbook/rankorder/2147rank.html).\n", "* `area_water` - The country's waterarea in square kilometers.\n", "\n", "Let's start by calculating some summary statistics and see what they tell us." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary Statistics" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
min_popmax_popmin_pop_growthmax_pop_growth
072564900110.04.02
" ], "text/plain": [ "[(0, 7256490011, 0.0, 4.02)]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "SELECT MIN(population) AS min_pop,\n", " MAX(population) AS max_pop,\n", " MIN(population_growth) AS min_pop_growth,\n", " MAX(population_growth) max_pop_growth \n", " FROM facts;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A few things stick out from the summary statistics in the last screen:\n", "\n", "- There's a country with a population of `0`\n", "- There's a country with a population of `7256490011` (or more than 7.2 billion people) \n", "\n", "Let's use subqueries to zoom in on just these countries _without_ using the specific values." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploring Outliers" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idcodenameareaarea_landarea_waterpopulationpopulation_growthbirth_ratedeath_ratemigration_rate
250ayAntarcticaNone280000None0NoneNoneNoneNone
" ], "text/plain": [ "[(250, 'ay', 'Antarctica', None, 280000, None, 0, None, None, None, None)]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "SELECT *\n", " FROM facts\n", " WHERE population == (SELECT MIN(population)\n", " FROM facts\n", " );" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It seems like the table contains a row for Antarctica, which explains the population of 0. This seems to match the CIA Factbook [page for Antarctica](https://www.cia.gov/library/publications/the-world-factbook/geos/ay.html):\n", "\n", "" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idcodenameareaarea_landarea_waterpopulationpopulation_growthbirth_ratedeath_ratemigration_rate
261xxWorldNoneNoneNone72564900111.0818.67.8None
" ], "text/plain": [ "[(261, 'xx', 'World', None, None, None, 7256490011, 1.08, 18.6, 7.8, None)]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "SELECT *\n", " FROM facts\n", " WHERE population == (SELECT MAX(population)\n", " FROM facts\n", " );" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We also see that the table contains a row for the whole world, which explains the maximum population of over 7.2 billion we found earlier.\n", "\n", "Now that we know this, we should recalculate the summary statistics we calculated earlier, while excluding the row for the whole world." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary Statistics Revisited" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
min_popmax_popmin_pop_growthmax_pop_growth
013674853880.04.02
" ], "text/plain": [ "[(0, 1367485388, 0.0, 4.02)]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "SELECT MIN(population) AS min_pop,\n", " MAX(population) AS max_pop,\n", " MIN(population_growth) AS min_pop_growth,\n", " MAX(population_growth) AS max_pop_growth \n", " FROM facts\n", " WHERE name <> 'World';" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There's a country whose population closes in on 1.4 billion!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploring Average Population and Area" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's explore density. Density depends on the population and the country's area. Let's look at the average values for these two columns.\n", "\n", "We should take care of discarding the row for the whole planet." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
avg_populationavg_area
32242666.56846473555093.546184739
" ], "text/plain": [ "[(32242666.56846473, 555093.546184739)]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "SELECT AVG(population) AS avg_population, AVG(area) AS avg_area\n", " FROM facts\n", " WHERE name <> 'World';" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that the average population is around 32 million and the average area is 555 thousand square kilometers." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Finding Densely Populated Countries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To finish, we'll build on the query above to find countries that are densely populated. We'll identify countries that have:\n", "\n", "- Above average values for population.\n", "- Below average values for area." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idcodenameareaarea_landarea_waterpopulationpopulation_growthbirth_ratedeath_ratemigration_rate
14bgBangladesh148460130170182901689577451.621.145.610.46
65gmGermany3570223486728350808544080.178.4711.421.24
80izIraq438317437367950370561692.9331.453.771.62
83itItaly3013402941407200618551200.278.7410.194.1
85jaJapan377915364485134301269196590.167.939.510.0
91ksKorea, South99720969202800491151960.148.196.750.0
120moMorocco446550446300250333226991.018.24.813.36
138rpPhilippines30000029817018301009983761.6124.276.112.09
139plPoland3126853042558430385621890.099.7410.190.46
163spSpain5053704989806390481461340.899.649.048.31
173thThailand5131205108902230679764050.3411.197.80.0
182ugUganda24103819710043938371017453.2443.7910.690.74
185ukUnited Kingdom2436102419301680640882220.5412.179.352.54
192vmVietnam33121031007021140943488350.9715.965.930.3
" ], "text/plain": [ "[(14, 'bg', 'Bangladesh', 148460, 130170, 18290, 168957745, 1.6, 21.14, 5.61, 0.46),\n", " (65, 'gm', 'Germany', 357022, 348672, 8350, 80854408, 0.17, 8.47, 11.42, 1.24),\n", " (80, 'iz', 'Iraq', 438317, 437367, 950, 37056169, 2.93, 31.45, 3.77, 1.62),\n", " (83, 'it', 'Italy', 301340, 294140, 7200, 61855120, 0.27, 8.74, 10.19, 4.1),\n", " (85, 'ja', 'Japan', 377915, 364485, 13430, 126919659, 0.16, 7.93, 9.51, 0.0),\n", " (91, 'ks', 'Korea, South', 99720, 96920, 2800, 49115196, 0.14, 8.19, 6.75, 0.0),\n", " (120, 'mo', 'Morocco', 446550, 446300, 250, 33322699, 1.0, 18.2, 4.81, 3.36),\n", " (138, 'rp', 'Philippines', 300000, 298170, 1830, 100998376, 1.61, 24.27, 6.11, 2.09),\n", " (139, 'pl', 'Poland', 312685, 304255, 8430, 38562189, 0.09, 9.74, 10.19, 0.46),\n", " (163, 'sp', 'Spain', 505370, 498980, 6390, 48146134, 0.89, 9.64, 9.04, 8.31),\n", " (173, 'th', 'Thailand', 513120, 510890, 2230, 67976405, 0.34, 11.19, 7.8, 0.0),\n", " (182, 'ug', 'Uganda', 241038, 197100, 43938, 37101745, 3.24, 43.79, 10.69, 0.74),\n", " (185, 'uk', 'United Kingdom', 243610, 241930, 1680, 64088222, 0.54, 12.17, 9.35, 2.54),\n", " (192, 'vm', 'Vietnam', 331210, 310070, 21140, 94348835, 0.97, 15.96, 5.93, 0.3)]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "SELECT *\n", " FROM facts\n", " WHERE population > (SELECT AVG(population)\n", " FROM facts\n", " WHERE name <> 'World'\n", " )\n", " AND area < (SELECT AVG(area)\n", " FROM facts\n", " WHERE name <> 'World'\n", ");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some of these countries are generally known to be densely populated, so we have confidence in our results!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.3" } }, "nbformat": 4, "nbformat_minor": 4 }