
Merge pull request #172 from dataquestio/darin-solutions-120922

Darin solutions 120922
darinbradley 2 years ago
Parent
Commit
a5f2216720

+ 27 - 27
Mission217Solutions.ipynb

@@ -4,7 +4,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Read in the data"
+    "# Read in the Data"
    ]
   },
   {
@@ -37,7 +37,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Read in the surveys"
+    "# Read in the Surveys"
    ]
   },
   {
@@ -85,7 +85,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Add DBN columns"
+    "# Add DBN Columns"
    ]
   },
   {
@@ -111,7 +111,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Convert columns to numeric"
+    "# Convert Columns to Numeric"
    ]
   },
   {
@@ -147,7 +147,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Condense datasets"
+    "# Condense Datasets"
    ]
   },
   {
@@ -174,7 +174,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Convert AP scores to numeric"
+    "# Convert AP Scores to Numeric"
    ]
   },
   {
@@ -193,7 +193,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Combine the datasets"
+    "# Combine the Datasets"
    ]
   },
   {
@@ -220,7 +220,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Add a school district column for mapping"
+    "# Add a School District Column for Mapping"
    ]
   },
   {
@@ -239,7 +239,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Find correlations"
+    "# Find Correlations"
    ]
   },
   {
@@ -276,7 +276,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Plotting survey correlations"
+    "# Plotting Survey Correlations"
    ]
   },
   {
@@ -326,11 +326,11 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "There are high correlations between `N_s`, `N_t`, `N_p` and `sat_score`.  Since these columns are correlated with `total_enrollment`, it makes sense that they would be high.  \n",
+    "There are high correlations between `N_s`, `N_t`, `N_p`, and `sat_score`. Since these columns are correlated with `total_enrollment`, it makes sense that they would be high.  \n",
     "\n",
-    "It is more interesting that `rr_s`, the student response rate, or the percentage of students that completed the survey, correlates with `sat_score`.  This might make sense because students who are more likely to fill out surveys may be more likely to also be doing well academically.\n",
+    "It is more interesting that `rr_s`, the student response rate, or the percentage of students that completed the survey, correlates with `sat_score`. This might make sense because students who are more likely to fill out surveys may be more likely to also be doing well academically.\n",
     "\n",
-    "How students and teachers percieved safety (`saf_t_11` and `saf_s_11`) correlate with `sat_score`.  This make sense, as it's hard to teach or learn in an unsafe environment.\n",
+    "How students and teachers percieved safety (`saf_t_11` and `saf_s_11`) correlate with `sat_score`. This make sense — it's difficult to teach or learn in an unsafe environment.\n",
     "\n",
     "The last interesting correlation is the `aca_s_11`, which indicates how the student perceives academic standards, correlates with `sat_score`, but this is not true for `aca_t_11`, how teachers perceive academic standards, or `aca_p_11`, how parents perceive academic standards."
    ]
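For reference, correlations like these come from correlating every numeric column of the combined dataset against `sat_score`. A minimal sketch, assuming a pandas DataFrame named `combined` that holds the survey fields alongside `sat_score` (the values below are illustrative, not from the data):

```python
import pandas as pd

# Illustrative stand-in for the combined NYC schools dataset.
combined = pd.DataFrame({
    "sat_score": [1200, 1350, 980, 1500],
    "saf_s_11": [6.8, 7.2, 5.9, 8.1],   # student-perceived safety
    "rr_s": [70, 85, 55, 90],           # student survey response rate
})

# Correlate every numeric column with sat_score and rank the results.
correlations = combined.corr()["sat_score"]
print(correlations.sort_values(ascending=False))
```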
@@ -339,7 +339,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Exploring safety"
+    "# Exploring Safety"
    ]
   },
   {
@@ -376,14 +376,14 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "There appears to be a correlation between SAT scores and safety, although it isn't thatstrong.  It looks like there are a few schools with extremely high SAT scores and high safety scores.  There are a few schools with low safety scores and low SAT scores.  No school with a safety score lower than `6.5` has an average SAT score higher than 1500 or so."
+    "There appears to be a correlation between SAT scores and safety, although it isn't very strong. It looks like there are a few schools with extremely high SAT scores and high safety scores. There are a few schools with low safety scores and low SAT scores. No school with a safety score lower than `6.5` has an average SAT score higher than 1500 or so."
    ]
   },
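A scatter plot like the one this finding describes might be produced along these lines (a sketch; the `combined` frame and column names follow the mission's conventions, and the values are illustrative):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative safety scores and SAT averages.
combined = pd.DataFrame({
    "saf_s_11": [5.5, 6.5, 7.0, 8.2],
    "sat_score": [1100, 1250, 1400, 1800],
})

# Student-perceived safety against average SAT score.
combined.plot.scatter(x="saf_s_11", y="sat_score")
plt.show()
```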
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Borough safety"
+    "# Borough Safety"
    ]
   },
   {
@@ -421,7 +421,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Racial differences in SAT scores"
+    "# Racial Differences in SAT Scores"
    ]
   },
   {
@@ -459,7 +459,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "It looks like a higher percentage of white or asian students at a school correlates positively with sat score, whereas a higher percentage of black or hispanic students correlates negatively with sat score.  This may be due to a lack of funding for schools in certain areas, which are more likely to have a higher percentage of black or hispanic students."
+    "It looks like a higher percentage of white or Asian students at a school correlates positively with SAT scores, whereas a higher percentage of black or Hispanic students correlates negatively with SAT score. This may be due to a lack of funding for schools in certain areas, which are more likely to have a higher percentage of black or Hispanic students."
    ]
   },
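The racial correlations discussed above are easiest to compare in a bar plot of the correlation values. A sketch, with hypothetical correlation numbers standing in for the computed ones:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical correlations of each racial-percentage column with sat_score.
race_fields = ["white_per", "asian_per", "black_per", "hispanic_per"]
correlations = pd.Series([0.62, 0.57, -0.28, -0.40], index=race_fields)

# A bar plot makes the sign and strength of each correlation easy to compare.
correlations.plot.bar()
plt.show()
```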
   {
@@ -521,7 +521,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "The schools listed above appear to primarily be geared towards recent immigrants to the US.  These schools have a lot of students who are learning English, which would explain the lower SAT scores."
+    "The schools listed above appear to primarily serve recent immigrants to the U.S. These schools have many students who are learning English, which would explain the lower SAT scores."
    ]
   },
   {
@@ -550,14 +550,14 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Many of the schools above appear to be specialized science and technology schools that receive extra funding, and only admit students who pass an entrance exam.  This doesn't explain the low `hispanic_per`, but it does explain why their students tend to do better on the SAT -- they are students from all over New York City who did well on a standardized test."
+    "Many of the schools above appear to be specialized science and technology schools that receive extra funding and only admit students who pass an entrance exam. This doesn't explain the low `hispanic_per`, but it does explain why their students tend to do better on the SAT  they are students from all over New York City who did well on a standardized test."
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Gender differences in SAT scores"
+    "# Gender Differences in SAT Scores"
    ]
   },
   {
@@ -595,7 +595,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "In the plot above, we can see that a high percentage of females at a school positively correlates with SAT score, whereas a high percentage of males at a school negatively correlates with SAT score.  Neither correlation is extremely strong."
+    "In the plot above, we can see that a high percentage of females at a school positively correlates with SAT scores, whereas a high percentage of males at a school negatively correlates with SAT scores. Neither correlation is extremely strong."
    ]
   },
   {
@@ -632,7 +632,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Based on the scatterplot, there doesn't seem to be any real correlation between `sat_score` and `female_per`.  However, there is a cluster of schools with a high percentage of females (`60` to `80`), and high SAT scores."
+    "Based on the scatter plot, there doesn't seem to be any real correlation between `sat_score` and `female_per`.  However, there is a cluster of schools with a high percentage of females (`60` to `80`) and high SAT scores."
    ]
   },
   {
@@ -661,14 +661,14 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "These schools appears to be very selective liberal arts schools that have high academic standards."
+    "These schools appear to be very selective liberal arts schools that have high academic standards."
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# AP Exam Scores vs SAT Scores"
+    "# AP Exam Scores vs. SAT Scores"
    ]
   },
   {
@@ -707,7 +707,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "It looks like there is a relationship between the percentage of students in a school who take the AP exam, and their average SAT scores.  It's not an extremely strong correlation, though."
+    "It looks like there is a relationship between the percentage of students in a school who take the AP exam and their average SAT scores. It's not a very strong correlation, however."
    ]
   }
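The AP-versus-SAT relationship rests on a derived column: the share of each school's students who took an AP exam. A sketch, with hypothetical column names and values:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical enrollment and AP figures.
combined = pd.DataFrame({
    "ap_test_takers": [100, 250, 60],
    "total_enrollment": [800, 1200, 400],
    "sat_score": [1150, 1400, 1050],
})

# Percentage of students at each school who took an AP exam.
combined["ap_per"] = combined["ap_test_takers"] / combined["total_enrollment"]
combined.plot.scatter(x="ap_per", y="sat_score")
plt.show()
```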
  ],
@@ -727,7 +727,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.8.2"
+   "version": "3.8.5"
   }
  },
  "nbformat": 4,

+ 16 - 16
Mission218Solution.ipynb

@@ -4,14 +4,14 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# US Gun Deaths Guided Project Solutions"
+    "# U.S. Gun Deaths Guided Project Solutions"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Introducing US Gun Deaths Data"
+    "# Introducing U.S. Gun Deaths Data"
    ]
   },
   {
@@ -48,7 +48,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Removing Headers From A List Of Lists"
+    "# Removing Headers from a List of Lists"
    ]
   },
   {
@@ -76,7 +76,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Counting Gun Deaths By Year"
+    "# Counting Gun Deaths by Year"
    ]
   },
   {
@@ -112,7 +112,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Exploring Gun Deaths By Month And Year"
+    "# Exploring Gun Deaths by Month and Year"
    ]
   },
   {
@@ -208,7 +208,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Exploring Gun Deaths By Race And Sex"
+    "# Exploring Gun Deaths by Race and Sex"
    ]
   },
   {
@@ -271,9 +271,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Findings so far\n",
+    "## Findings So Far\n",
     "\n",
-    "Gun deaths in the US seem to disproportionately affect men vs women.  They also seem to disproportionately affect minorities, although having some data on the percentage of each race in the overall US population would help.\n",
+    "Gun deaths in the U.S. seem to disproportionately affect men. They also seem to disproportionately affect minorities, although having some data on the percentage of each race in the overall U.S. population would help.\n",
     "\n",
     "There appears to be a minor seasonal correlation, with gun deaths peaking in the summer and declining in the winter.  It might be useful to filter by intent, to see if different categories of intent have different correlations with season, race, or gender."
    ]
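The counts behind these findings come from tallying rows of the list-of-lists into dictionaries. A minimal sketch; the rows and the column indices for sex and race are assumptions for illustration:

```python
# Illustrative rows; in the real data, sex and race sit at fixed indices.
data = [
    ["1", "2012", "01", "Suicide", "0", "M", "34", "Asian/Pacific Islander"],
    ["2", "2012", "01", "Suicide", "0", "F", "21", "White"],
    ["3", "2012", "02", "Homicide", "0", "M", "60", "Black"],
]

sex_counts = {}
race_counts = {}
for row in data:
    sex = row[5]
    race = row[7]
    # Increment a tally for each category, creating keys on first sight.
    sex_counts[sex] = sex_counts.get(sex, 0) + 1
    race_counts[race] = race_counts.get(race, 0) + 1

print(sex_counts)
print(race_counts)
```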
@@ -282,7 +282,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Reading In A Second Dataset"
+    "# Reading in a Second Dataset"
    ]
   },
   {
@@ -348,7 +348,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Computing Rates Of Gun Deaths Per Race"
+    "# Computing Rates of Gun Deaths Per Race"
    ]
   },
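Computing a rate per race amounts to dividing each race's death count by its census population and scaling to deaths per 100,000. A sketch with illustrative numbers; the real values come from the two datasets:

```python
# Illustrative counts and census totals.
race_counts = {"White": 66237, "Black": 23296}
census_mapping = {"White": 197318956, "Black": 40250635}

race_per_hundredk = {}
for race, count in race_counts.items():
    # Deaths per 100,000 people in each racial category.
    race_per_hundredk[race] = count / census_mapping[race] * 100000

print(race_per_hundredk)
```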
   {
@@ -436,14 +436,14 @@
    "source": [
     "## Findings\n",
     "\n",
-    "It appears that gun related homicides in the US disproportionately affect people in the `Black` and `Hispanic` racial categories.\n",
+    "It appears that gun-related homicides in the U.S. disproportionately affect people in the `Black` and `Hispanic` racial categories.\n",
     "\n",
     "Some areas to investigate further:\n",
     "\n",
-    "* The link between month and homicide rate.\n",
-    "* Homicide rate by gender.\n",
-    "* The rates of other intents by gender and race.\n",
-    "* Gun death rates by location and education."
+    "* The link between month and homicide rate\n",
+    "* Homicide rate by gender\n",
+    "* The rates of other intents by gender and race\n",
+    "* Gun death rates by location and education"
    ]
   }
  ],
@@ -463,7 +463,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.6.1"
+   "version": "3.8.5"
   },
   "widgets": {
    "state": {},

+ 11 - 11
Mission219Solution.ipynb

@@ -403,7 +403,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Filtering Out Rows From A DataFrame"
+    "# Filtering out Rows from A DataFrame"
    ]
   },
   {
@@ -443,7 +443,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Using value_counts To Explore Main Dishes"
+    "# Using value_counts to Explore Main Dishes"
    ]
   },
   {
@@ -518,7 +518,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Figuring Out What Pies People Eat"
+    "# Determining Which Pies People Eat"
    ]
   },
   {
@@ -575,7 +575,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Converting Age To Numeric"
+    "# Converting Age to Numeric"
    ]
   },
   {
@@ -644,14 +644,14 @@
    "source": [
     "# Findings\n",
     "\n",
-    "Although we only have a rough approximation of age, and it skews downward because we took the first value in each string (the lower bound), we can see that that age groups of respondents are fairly evenly distributed."
+    "Although we only have a rough approximation of age, and it skews downward because we took the first value in each string (the lower bound), we can see that the age groups of respondents are fairly evenly distributed."
    ]
   },
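The lower-bound extraction mentioned above can be sketched as follows, assuming an `Age` column of strings such as `"18 - 29"` and `"60+"`:

```python
import pandas as pd

# Illustrative survey responses.
data = pd.DataFrame({"Age": ["18 - 29", "30 - 44", "60+", None]})

def extract_age(age_str):
    if pd.isnull(age_str):
        return None
    # Take the first token (the lower bound) and strip any trailing "+".
    return int(age_str.split(" ")[0].replace("+", ""))

data["int_age"] = data["Age"].apply(extract_age)
print(data)
```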
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Converting Income To Numeric"
+    "# Converting Income to Numeric"
    ]
   },
   {
@@ -737,7 +737,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Correlating Travel Distance And Income"
+    "# Correlating Travel Distance and Income"
    ]
   },
   {
@@ -794,14 +794,14 @@
    "source": [
     "# Findings\n",
     "\n",
-    "It appears that more people with high income have Thanksgiving at home than people with low income.  This may be because younger students, who don't have a high income, tend to go home, whereas parents, who have higher incomes, don't."
+    "It appears that more people with high income have Thanksgiving at home than people with low income. This may be because younger students, who don't have a high income, tend to go home, whereas parents, who have higher incomes, don't."
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Linking Friendship And Age"
+    "# Linking Friendship and Age"
    ]
   },
   {
@@ -950,7 +950,7 @@
    "source": [
     "# Findings\n",
     "\n",
-    "It appears that people who are younger are more likely to attend a Friendsgiving, and try to meet up with friends on Thanksgiving."
+    "It appears that people who are younger are more likely to attend a Friendsgiving and try to meet up with friends on Thanksgiving."
    ]
   }
  ],
@@ -970,7 +970,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.6.1"
+   "version": "3.8.5"
   },
   "widgets": {
    "state": {},

+ 10 - 26
Mission227Solutions.ipynb

@@ -10,9 +10,7 @@
   {
    "cell_type": "code",
    "execution_count": 2,
-   "metadata": {
-    "collapsed": false
-   },
+   "metadata": {},
    "outputs": [
     {
      "data": {
@@ -1034,9 +1032,7 @@
   {
    "cell_type": "code",
    "execution_count": 3,
-   "metadata": {
-    "collapsed": false
-   },
+   "metadata": {},
    "outputs": [
     {
      "data": {
@@ -1056,9 +1052,7 @@
   {
    "cell_type": "code",
    "execution_count": 4,
-   "metadata": {
-    "collapsed": false
-   },
+   "metadata": {},
    "outputs": [
     {
      "name": "stdout",
@@ -1333,15 +1327,13 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Reading In The Data"
+    "# Reading in the Data"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 12,
-   "metadata": {
-    "collapsed": false
-   },
+   "metadata": {},
    "outputs": [
     {
      "name": "stdout",
@@ -1389,9 +1381,7 @@
   {
    "cell_type": "code",
    "execution_count": 26,
-   "metadata": {
-    "collapsed": false
-   },
+   "metadata": {},
    "outputs": [
     {
      "name": "stdout",
@@ -1420,9 +1410,7 @@
   {
    "cell_type": "code",
    "execution_count": 27,
-   "metadata": {
-    "collapsed": false
-   },
+   "metadata": {},
    "outputs": [
     {
      "data": {
@@ -1456,9 +1444,7 @@
   {
    "cell_type": "code",
    "execution_count": 33,
-   "metadata": {
-    "collapsed": false
-   },
+   "metadata": {},
    "outputs": [
     {
      "name": "stdout",
@@ -1600,9 +1586,7 @@
   {
    "cell_type": "code",
    "execution_count": 40,
-   "metadata": {
-    "collapsed": false
-   },
+   "metadata": {},
    "outputs": [
     {
      "name": "stdout",
@@ -2677,7 +2661,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.4.2"
+   "version": "3.8.5"
   }
  },
  "nbformat": 4,

+ 13 - 13
Mission240Solutions.ipynb

@@ -108,7 +108,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "1: All columns: Drop any with 5% or more missing values **for now**."
+    "1: All columns: drop any with 5% or more missing values **for now**."
    ]
   },
   {
@@ -140,7 +140,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "2: Text columns: Drop any with 1 or more missing values **for now**."
+    "2: Text columns: drop any with 1 or more missing values **for now**."
    ]
   },
   {
@@ -164,7 +164,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "3: Numerical columns: For columns with missing values, fill in with the most common value in that column"
+    "3: Numerical columns: for columns with missing values, fill in with the most common value in that column"
    ]
   },
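Filling numeric gaps with the most common value might look like this sketch (the column name is illustrative):

```python
import pandas as pd

# Illustrative frame with a missing numeric value.
df = pd.DataFrame({"Lot Frontage": [60.0, None, 80.0, 60.0]})

# mode() returns a Series; take its first entry as the fill value.
for col in df.select_dtypes(include=["float", "integer"]).columns:
    most_common = df[col].mode()[0]
    df[col] = df[col].fillna(most_common)

print(df)
```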
   {
@@ -412,7 +412,7 @@
     "    test = df[1460:]\n",
     "    \n",
     "    ## You can use `pd.DataFrame.select_dtypes()` to specify column types\n",
-    "    ## and return only those columns as a data frame.\n",
+    "    ## and return only those columns as a DataFrame.\n",
     "    numeric_train = train.select_dtypes(include=['integer', 'float'])\n",
     "    numeric_test = test.select_dtypes(include=['integer', 'float'])\n",
     "    \n",
@@ -844,7 +844,7 @@
     }
    ],
    "source": [
-    "## Let's only keep columns with a correlation coefficient of larger than 0.4 (arbitrary, worth experimenting later!)\n",
+    "## Let's only keep columns with a correlation coefficient larger than 0.4 (arbitrary — worth experimenting later!).\n",
     "abs_corr_coeffs[abs_corr_coeffs > 0.4]"
    ]
   },
@@ -856,7 +856,7 @@
    },
    "outputs": [],
    "source": [
-    "## Drop columns with less than 0.4 correlation with SalePrice\n",
+    "## Drop columns with less than 0.4 correlation with SalePrice.\n",
     "transform_df = transform_df.drop(abs_corr_coeffs[abs_corr_coeffs < 0.4].index, axis=1)"
    ]
   },
@@ -875,7 +875,7 @@
    },
    "outputs": [],
    "source": [
-    "## Create a list of column names from documentation that are *meant* to be categorical\n",
+    "## Create a list of column names from documentation that are *meant* to be categorical.\n",
     "nominal_features = [\"PID\", \"MS SubClass\", \"MS Zoning\", \"Street\", \"Alley\", \"Land Contour\", \"Lot Config\", \"Neighborhood\", \n",
     "                    \"Condition 1\", \"Condition 2\", \"Bldg Type\", \"House Style\", \"Roof Style\", \"Roof Matl\", \"Exterior 1st\", \n",
     "                    \"Exterior 2nd\", \"Mas Vnr Type\", \"Foundation\", \"Heating\", \"Central Air\", \"Garage Type\", \n",
@@ -887,7 +887,7 @@
    "metadata": {},
    "source": [
     "- Which columns are currently numerical but need to be encoded as categorical instead (because the numbers don't have any semantic meaning)?\n",
-    "- If a categorical column has hundreds of unique values (or categories), should we keep it? When we dummy code this column, hundreds of columns will need to be added back to the data frame."
+    "- If a categorical column has hundreds of unique values (or categories), should we keep it? When we dummy-code this column, hundreds of columns will need to be added back to the DataFrame."
    ]
   },
   {
@@ -898,7 +898,7 @@
    },
    "outputs": [],
    "source": [
-    "## Which categorical columns have we still carried with us? We'll test these \n",
+    "## Which categorical columns have we still carried with us? We'll test these. \n",
     "transform_cat_cols = []\n",
     "for col in nominal_features:\n",
     "    if col in transform_df.columns:\n",
@@ -906,7 +906,7 @@
     "\n",
     "## How many unique values in each categorical column?\n",
     "uniqueness_counts = transform_df[transform_cat_cols].apply(lambda col: len(col.value_counts())).sort_values()\n",
-    "## Aribtrary cutoff of 10 unique values (worth experimenting)\n",
+    "## Aribtrary cutoff of 10 unique values (worth experimenting).\n",
     "drop_nonuniq_cols = uniqueness_counts[uniqueness_counts > 10].index\n",
     "transform_df = transform_df.drop(drop_nonuniq_cols, axis=1)"
    ]
@@ -919,12 +919,12 @@
    },
    "outputs": [],
    "source": [
-    "## Select just the remaining text columns and convert to categorical\n",
+    "## Select only the remaining text columns, and convert to categorical.\n",
     "text_cols = transform_df.select_dtypes(include=['object'])\n",
     "for col in text_cols:\n",
     "    transform_df[col] = transform_df[col].astype('category')\n",
     "    \n",
-    "## Create dummy columns and add back to the dataframe!\n",
+    "## Create dummy columns, and add back to the DataFrame!\n",
     "transform_df = pd.concat([\n",
     "    transform_df, \n",
     "    pd.get_dummies(transform_df.select_dtypes(include=['category']))\n",
@@ -1089,7 +1089,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.8.3"
+   "version": "3.8.5"
   }
  },
  "nbformat": 4,

+ 7 - 7
Mission244Solutions.ipynb

@@ -4,7 +4,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Working With Image Data"
+    "## Working with Image Data"
    ]
   },
   {
@@ -324,7 +324,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Neural Network With One Hidden Layer"
+    "## Neural Network with One Hidden Layer"
    ]
   },
   {
@@ -474,7 +474,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Neural Network With Two Hidden Layers"
+    "## Neural Network with Two Hidden Layers"
    ]
   },
   {
@@ -549,14 +549,14 @@
    "source": [
     "### Summary\n",
     "\n",
-    "Using 2 hidden layers improved our simple accuracy to `98%`. While I'd traditionally be worried about overfitting, using 4-fold cross validation also gives me a bit more assurance that the model is generalizing to achieve the extra `1%` in simple accuracy over the single hidden layer networks we tried earlier."
+    "Using two hidden layers improved our simple accuracy to `98%`. While, traditionally, we might worry about overfitting, using four-fold cross validation also gives us a bit more assurance that the model is generalizing to achieve the extra `1%` in simple accuracy over the single hidden layer networks we tried earlier."
    ]
   },
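The two-hidden-layer experiment with k-fold cross-validation could be sketched with scikit-learn; the layer sizes here are assumptions, and `load_digits` stands in for the mission's image data:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)

# Two hidden layers, evaluated with 4-fold cross-validation.
mlp = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=1000)
accuracies = cross_val_score(mlp, X, y, cv=4, scoring="accuracy")
print(accuracies.mean())
```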
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Neural Network With Three Hidden Layers"
+    "## Neural Network with Three Hidden Layers"
    ]
   },
   {
@@ -698,7 +698,7 @@
    "source": [
     "### Summary\n",
     "\n",
-    "Using 3 hidden layers returned a simple accuracy of nearly `98%`, even with 6-fold cross validation."
+    "Using three hidden layers returned a simple accuracy of nearly `98%`, even with six-fold cross validation."
    ]
   }
  ],
@@ -718,7 +718,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.8.2"
+   "version": "3.8.5"
   }
  },
  "nbformat": 4,

+ 47 - 47
Mission251Solution.ipynb

@@ -4,28 +4,28 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Guided Project Solution: Building a database for crime reports\n",
-    "## Apply what you have learned to set up a database for storing crime reports data\n",
+    "# Guided Project Solution: Building a Database for Crime Reports\n",
+    "## Apply what you have learned to set up a database to store crime reports data.\n",
     "\n",
     "## François Aubry\n",
     "\n",
-    "The goal of this guided project is to setup a database from scratch and the Boston crime data into it.\n",
+    "The goal of this guided project is to setup a database of Boston crime data from scratch.\n",
     "\n",
     "We will create two user groups:\n",
     "\n",
-    "* `readonly`: Users in this group will have permission to read data only.\n",
-    "* `readwrite`:  Users in this group will have permissions to read and alter data but not to delete tables."
+    "* `readonly`: users in this group will have permission to read data only.\n",
+    "* `readwrite`:  users in this group will have permissions to read and alter data but not to delete tables."
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Creating the database and the schema\n",
+    "## Creating the Database and the Schema\n",
     "\n",
     "Create a database named `crime_db` and a schema named `crimes` for storing the tables for containing the crime data.\n",
     "\n",
-    "The database `crime_db` does not exist yet so we connect to `dq`."
+    "The database `crime_db` does not exist yet, so we connect to `dq`."
    ]
   },
   {
@@ -88,7 +88,7 @@
    "source": [
     "## Obtaining the Column Names and Sample\n",
     " \n",
-    "Obtain the header row and assign it to a variable named `col_headers` and obtain the first data row and assign it to a variable named `first_row`."
+    "Obtain the header row, and assign it to a variable named `col_headers`. Obtain the first data row, and assign it to a variable named `first_row`."
    ]
   },
   {
@@ -108,11 +108,11 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Creating a function for analyzing column values\n",
+    "## Creating a Function for Analyzing Column Values\n",
     "\n",
-    "Create a function `get_col_set` that given a CSV file name and a column index computes the set of all distinct values in that column.\n",
+    "Create a function `get_col_set` that, given a CSV filename and a column index, computes the set of all distinct values in that column.\n",
     "\n",
-    "Use the function on each column to evaluate which columns have a lot of different values. Columns with a limited set of possible values are good candidates for enumerated datatypes."
+    "Use the function on each column to evaluate which columns have many different values. Columns with a limited set of possible values are good candidates for enumerated datatypes."
    ]
   },
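A sketch of `get_col_set` using the standard library's `csv` module (it assumes the file has a header row, as `boston.csv` does):

```python
import csv

def get_col_set(csv_filename, col_index):
    """Return the set of all distinct values in the given column."""
    values = set()
    with open(csv_filename) as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row in reader:
            values.add(row[col_index])
    return values

# Usage: report how many distinct values each of the seven columns has.
# for i in range(7):
#     print(i, len(get_col_set("boston.csv", i)))
```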
   {
@@ -154,7 +154,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Analyzing the maximum length of the description column\n",
+    "## Analyzing the Maximum Length of the Description Column\n",
     "\n",
     "Use the `get_col_set` function to compute the maximum description length to decide an appropriate length for that field."
    ]
@@ -201,15 +201,15 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Creating the table\n",
+    "## Creating the Table\n",
     "\n",
-    "We have create an enumerated datatype named `weekday` for the `day_of_the_week` since there there only seven possible values.\n",
+    "We have created an enumerated datatype named `weekday` for the `day_of_the_week` since there there are only seven possible values.\n",
     "\n",
-    "For the `incident_number` we have decided to user the type `INTEGER` and set it as the primary key. The same datatype was also used to represent the `offense_code`.\n",
+    "For the `incident_number`, we have decided to user the type `INTEGER` and set it as the primary key. The same datatype was also used to represent the `offense_code`.\n",
     "\n",
-    "Since the description has at most `58` character we decided to use the datatype `VARCHAR(100)` for representing it. This leave some margin while not being so big that we will waste a lot of memory.\n",
+    "Since the description has at most `58` characters, we decided to use the datatype `VARCHAR(100)` for representing it. This leaves some margin while not being so big that we will waste a lot of memory.\n",
     "\n",
-    "The date was represented as the `DATE` datatype. Finally, for the latitude and longitude we used `DECIMAL` datatypes."
+    "The date was represented as the `DATE` datatype. Finally, for the latitude and longitude, we used `DECIMAL` datatypes."
    ]
   },
   {
@@ -237,7 +237,7 @@
    "source": [
     "We will use the same names for the column headers.\n",
     "\n",
-    "The number of different values of each column was:\n",
+    "The number of different values of each column was the following:\n",
     "\n",
     "```\n",
     "incident_number 298329\n",
@@ -249,7 +249,7 @@
     "long\t         18177\n",
     "```\n",
     "\n",
-    "From the result of printing `first_row` we see that kind of data that we have are:\n",
+    "From the result of printing `first_row`, we see which kind of data we have:\n",
     "\n",
     "```\n",
     "integer numbers\n",
@@ -261,11 +261,11 @@
     "decimal number\n",
     "```\n",
     "\n",
-    "Only column `day_of_the_week` has a small range of values so we will only create an enumerated datatype for this column. Column `offense_code` is also a good candidate since there is probably a limited set of possible offense codes.\n",
+    "Only column `day_of_the_week` has a small range of values, so we will only create an enumerated datatype for this column. Column `offense_code` is also a good candidate since there is probably a limited set of possible offense codes.\n",
     "\n",
-    "We saw that the `offense_code` column has size at most 59. To be on the safe side we will limit the size of the description to 100 and use the `VARCHAR(100)` datatype.\n",
+    "We saw that the `offense_code` column has size at most 59. To be safe, we will limit the size of the description to 100 and use the `VARCHAR(100)` datatype.\n",
     "\n",
-    "The `lat` and `long` column see to need to hold quite a lot of precision so we will use the `decimal` type."
+    "The `lat` and `long` columns need to hold quite a lot of precision, so we will use the `decimal` type."
    ]
   },
   {
@@ -286,11 +286,11 @@
     }
    ],
    "source": [
-    "# create the enumerated datatype for representing the weekday\n",
+    "# Create the enumerated datatype for representing the weekday.\n",
     "cur.execute(\"\"\"\n",
     "    CREATE TYPE weekday AS ENUM ('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday');\n",
     "\"\"\")\n",
-    "# create the table\n",
+    "# Create the table.\n",
     "cur.execute(\"\"\"\n",
     "    CREATE TABLE crimes.boston_crimes (\n",
     "        incident_number INTEGER PRIMARY KEY,\n",
@@ -308,9 +308,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Load the data into the table\n",
+    "## Load the Data into the Table\n",
     "\n",
-    "We used the `copy_expert` to load the data as it is very fast and very succinct to use."
+    "We used the `copy_expert` to load the data because it is very fast and very succinct."
    ]
   },
   {
@@ -331,11 +331,11 @@
     }
    ],
    "source": [
-    "# load the data from boston.csv into the table boston_crimes that is in the crimes schema\n",
+    "# Load the data from boston.csv into the table boston_crimes that is in the crimes schema.\n",
     "with open(\"boston.csv\") as f:\n",
     "    cur.copy_expert(\"COPY crimes.boston_crimes FROM STDIN WITH CSV HEADER;\", f)\n",
     "cur.execute(\"SELECT * FROM crimes.boston_crimes\")\n",
-    "# print the number of rows to ensure that they were loaded\n",
+    "# Print the number of rows to ensure that they were loaded.\n",
     "print(len(cur.fetchall()))"
    ]
   },
@@ -343,11 +343,11 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Revoke public privileges\n",
+    "## Revoke Public Privileges\n",
     "\n",
-    "We revoke all privileges of the public `public` group on the `public` schema to ensure that users will not inherit privileges on that schema such as the ability to create tables in the `public` schema.\n",
+    "We revoke all privileges of the public `public` group on the `public` schema to ensure that users will not inherit privileges on that schema, such as the ability to create tables in the `public` schema.\n",
     "\n",
-    "We also need to revoke all privileges in the newly created schema. Doing this also makes it so that we do not need to revoke the privileges when we create users and groups because unless specified otherwise, privileges are not granted by default."
+    "We also need to revoke all privileges in the newly created schema. Doing this means we do not need to revoke the privileges when we create users and groups because, unless specified otherwise, privileges are not granted by default."
    ]
   },
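The revocation boils down to two statements issued through the notebook's `psycopg2` cursor. A sketch; the connection setup repeats what earlier cells established:

```python
import psycopg2

conn = psycopg2.connect(dbname="crime_db", user="dq")
cur = conn.cursor()

# Stop users from inheriting any default rights via the public group.
cur.execute("REVOKE ALL ON SCHEMA public FROM public;")
cur.execute("REVOKE ALL ON DATABASE crime_db FROM public;")
```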
   {
@@ -364,11 +364,11 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Creating the read only group\n",
+    "## Creating the Read Only Group\n",
     "\n",
     "We create a `readonly` group with `NOLOGIN` because it is a group and not a user. We grant the group the ability to connect to the `crime_db` and the ability to use the `crimes` schema.\n",
     "\n",
-    "Then we deal wit tables privileges by granting `SELECT`. We also add an extra line compared with what was asked. This extra line changes the way that privileges are given by default to the `readonly` group on new table that are created on the `crimes` schema. As we mentioned, by default not privileges are given. However we change is so that by default any user in the `readonly` group can issue select commands."
+    "Then we deal with tables privileges by granting `SELECT`. We also add an extra line over what was asked. This extra line changes the way that privileges are given by default to the `readonly` group on new table that are created on the `crimes` schema. As we mentioned, by default *not privileges* are given. However, we change it so that, by default, any user in the `readonly` group can issue select commands."
    ]
   },
   {
@@ -399,11 +399,11 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Creating the read-write group\n",
+    "## Creating the Read Write Group\n",
     "\n",
     "We create a `readwrite` group with `NOLOGIN` because it is a group and not a user. We grant the group the ability to connect to the `crime_db` and the ability to use the `crimes` schema.\n",
     "\n",
-    "Then we deal wit tables privileges by granting `SELECT`, `INSERT`, `UPDATE` and `DELETE`. As before we change the default privileges so that user in the `readwrite` group have these privileges if we ever create a new table on the `crimes` schema."
+    "Then we deal with tables privileges by granting `SELECT`, `INSERT`, `UPDATE`, and `DELETE`. As before, we change the default privileges so that users in the `readwrite` group have these privileges if we ever create a new table on the `crimes` schema."
    ]
   },
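A sketch of the statements both group sections describe, reusing the `cur` cursor from the previous sketch (standard Postgres DDL; the privilege lists follow the text):

```python
# Read-only group: connect, use the schema, and SELECT on current and future tables.
cur.execute("CREATE GROUP readonly NOLOGIN;")
cur.execute("GRANT CONNECT ON DATABASE crime_db TO readonly;")
cur.execute("GRANT USAGE ON SCHEMA crimes TO readonly;")
cur.execute("GRANT SELECT ON ALL TABLES IN SCHEMA crimes TO readonly;")
cur.execute("ALTER DEFAULT PRIVILEGES IN SCHEMA crimes GRANT SELECT ON TABLES TO readonly;")

# Read-write group: the same, plus INSERT, UPDATE, and DELETE.
cur.execute("CREATE GROUP readwrite NOLOGIN;")
cur.execute("GRANT CONNECT ON DATABASE crime_db TO readwrite;")
cur.execute("GRANT USAGE ON SCHEMA crimes TO readwrite;")
cur.execute("GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA crimes TO readwrite;")
cur.execute("ALTER DEFAULT PRIVILEGES IN SCHEMA crimes GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO readwrite;")
```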
   {
@@ -434,7 +434,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Creating one user for each group\n",
+    "## Creating One User for Each Group\n",
     "\n",
     "We create a user named `data_analyst` with password `secret1` in the `readonly` group.\n",
     "\n",
@@ -470,19 +470,19 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Test the database setup\n",
+    "## Test the Database Setup\n",
     "\n",
     "Test the database setup using SQL queries on the `pg_roles` table and `information_schema.table_privileges`.\n",
     "\n",
-    "In the `pg_roles` table we will check database related privileges and for that we will look at the following columns: \n",
+    "In the `pg_roles` table, we will check database-related privileges, and for that we will look at the following columns: \n",
     "\n",
-    "* `rolname`: The name of the user / group that the privilege refers to.\n",
-    "* `rolsuper`: Whether this user / group is a super user. It should be set to `False` on every user / group that we have created.\n",
-    "* `rolcreaterole`: Whether user / group can create users, groups or roles. It should be `False` on every user / group that we have created.\n",
-    "* `rolcreatedb`: Whether user / group can create databases. It should be `False` on every user / group that we have created.\n",
-    "* `rolcanlogin`: Whether user / group can login. It should be `True` on the users and `False` on the groups that we have created.\n",
+    "* `rolname`: the name of the user/group to which the privilege refers.\n",
+    "* `rolsuper`: whether or not this user/group is a super user. It should be set to `False` on every user/group that we have created.\n",
+    "* `rolcreaterole`: whether or not user/group can create users, groups, or roles. It should be `False` on every user/group that we have created.\n",
+    "* `rolcreatedb`: whether or not user/group can create databases. It should be `False` on every user/group that we have created.\n",
+    "* `rolcanlogin`: whether or not user/group can log in. It should be `True` on the users and `False` on the groups that we have created.\n",
     "\n",
-    "In the `information_schema.table_privileges` we will check privileges related to SQL queries on tables. We will list the privileges of each group that we have created."
+    "In the `information_schema.table_privileges`, we will check privileges related to SQL queries on tables. We will list the privileges of each group that we have created."
    ]
   },
   {
@@ -508,12 +508,12 @@
     }
    ],
    "source": [
-    "# close the old connection to test with a brand new connection\n",
+    "# Close the old connection to test with a brand new connection.\n",
     "conn.close()\n",
     "\n",
     "conn = psycopg2.connect(dbname=\"crime_db\", user=\"dq\")\n",
     "cur = conn.cursor()\n",
-    "# check users and groups\n",
+    "# Check users and groups.\n",
     "cur.execute(\"\"\"\n",
     "    SELECT rolname, rolsuper, rolcreaterole, rolcreatedb, rolcanlogin FROM pg_roles\n",
     "    WHERE rolname IN ('readonly', 'readwrite', 'data_analyst', 'data_scientist');\n",
@@ -549,7 +549,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.8.2"
+   "version": "3.8.5"
   }
  },
  "nbformat": 4,

+ 20 - 29
Mission257Solutions.ipynb

@@ -11,7 +11,6 @@
    "cell_type": "code",
    "execution_count": 1,
    "metadata": {
-    "collapsed": false,
     "jupyter": {
      "outputs_hidden": false
     }
@@ -45,14 +44,13 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "We'll begin by getting a sense of what the data looks like."
+    "We'll begin by exploring the data."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 2,
    "metadata": {
-    "collapsed": false,
     "jupyter": {
      "outputs_hidden": false
     }
@@ -175,15 +173,15 @@
    "source": [
     "Here are the descriptions for some of the columns:\n",
     "\n",
-    "* `name` - The name of the country.\n",
-    "* `area` - The total land and sea area of the country.\n",
-    "* `population` - The country's population.\n",
-    "* `population_growth`- The country's population growth as a percentage.\n",
-    "* `birth_rate` - The country's birth rate, or the number of births a year per 1,000 people.\n",
-    "* `death_rate` - The country's death rate, or the number of death a year per 1,000 people.\n",
-    "* `area`- The country's total area (both land and water).\n",
-    "* `area_land` - The country's land area in [square kilometers](https://www.cia.gov/library/publications/the-world-factbook/rankorder/2147rank.html).\n",
-    "* `area_water` - The country's waterarea in square kilometers.\n",
+    "* `name` — the name of the country.\n",
+    "* `area` — the total land and sea area of the country.\n",
+    "* `population` — the country's population.\n",
+    "* `population_growth`— the country's population growth as a percentage.\n",
+    "* `birth_rate` — the country's birth rate, or the number of births a year per 1,000 people.\n",
+    "* `death_rate` — the country's death rate, or the number of death a year per 1,000 people.\n",
+    "* `area`— the country's total area (both land and water).\n",
+    "* `area_land` — the country's land area in [square kilometers](https://www.cia.gov/library/publications/the-world-factbook/rankorder/2147rank.html).\n",
+    "* `area_water` — the country's water area in square kilometers.\n",
     "\n",
     "Let's start by calculating some summary statistics and see what they tell us."
    ]
@@ -199,7 +197,6 @@
    "cell_type": "code",
    "execution_count": 3,
    "metadata": {
-    "collapsed": false,
     "jupyter": {
      "outputs_hidden": false
     }
@@ -252,12 +249,12 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "A few things stick out from the summary statistics in the last screen:\n",
+    "A few things are interesting in the summary statistics on the previous screen:\n",
     "\n",
-    "- There's a country with a population of `0`\n",
-    "- There's a country with a population of `7256490011` (or more than 7.2 billion people) \n",
+    "- There's a country with a population of `0`.\n",
+    "- There's a country with a population of `7256490011` (or more than 7.2 billion people).\n",
     "\n",
-    "Let's use subqueries to zoom in on just these countries _without_ using the specific values."
+    "Let's use subqueries to concentrate on these countries _without_ using the specific values."
    ]
   },
   {
@@ -271,7 +268,6 @@
    "cell_type": "code",
    "execution_count": 4,
    "metadata": {
-    "collapsed": false,
     "jupyter": {
      "outputs_hidden": false
     }
@@ -347,7 +343,6 @@
    "cell_type": "code",
    "execution_count": 5,
    "metadata": {
-    "collapsed": false,
     "jupyter": {
      "outputs_hidden": false
     }
@@ -429,9 +424,7 @@
   {
    "cell_type": "code",
    "execution_count": 6,
-   "metadata": {
-    "collapsed": false
-   },
+   "metadata": {},
    "outputs": [
     {
      "name": "stdout",
@@ -497,14 +490,13 @@
    "source": [
     "Let's explore density. Density depends on the population and the country's area. Let's look at the average values for these two columns.\n",
     "\n",
-    "We should take care of discarding the row for the whole planet."
+    "We should discard the row for the whole planet."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 7,
    "metadata": {
-    "collapsed": false,
     "jupyter": {
      "outputs_hidden": false
     }
@@ -565,17 +557,16 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "To finish, we'll build on the query above to find countries that are densely populated.  We'll identify countries that have:\n",
+    "To finish, we'll build on the query above to find countries that are densely populated. We'll identify countries that have the following:\n",
     "\n",
-    "- Above average values for population.\n",
-    "- Below average values for area."
+    "- Above-average values for population.\n",
+    "- Below-average values for area."
    ]
   },
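A sketch of such a query, run here through `sqlite3` (the `facts` table and the `factbook.db` filename follow the mission; excluding the planet-wide row by the name `'World'` is an assumption):

```python
import sqlite3

conn = sqlite3.connect("factbook.db")
query = """
SELECT name, population, area
  FROM facts
 WHERE population > (SELECT AVG(population) FROM facts WHERE name <> 'World')
   AND area < (SELECT AVG(area) FROM facts WHERE name <> 'World');
"""
# Each matching row is densely populated: many people, little land.
for row in conn.execute(query):
    print(row)
```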
   {
    "cell_type": "code",
    "execution_count": 8,
    "metadata": {
-    "collapsed": false,
     "jupyter": {
      "outputs_hidden": false
     }
@@ -849,7 +840,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.4.3"
+   "version": "3.8.5"
   }
  },
  "nbformat": 4,

+ 1 - 1
Mission267Solutions.ipynb

@@ -116,7 +116,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.8.2"
+   "version": "3.8.5"
   }
  },
  "nbformat": 4,

+ 8 - 8
Mission280Solutions.ipynb

@@ -332,7 +332,7 @@
     "- Slim Jim Bites (Blues)\n",
     "- Meteor and the Girls (Pop)\n",
     "\n",
-    "It's worth keeping in mind that combined, these three genres only make up only 17% of total sales, so we should be on the lookout for artists and albums from the 'rock' genre, which accounts for 53% of sales."
+    "It's worth keeping in mind that, combined, these three genres only make up 17% of total sales, so we should watch for artists and albums from the \"rock\" genre, which accounts for 53% of sales."
    ]
   },
   {
@@ -463,7 +463,7 @@
     "hidden": true
    },
    "source": [
-    "While there is a 20% difference in sales between Jane (the top employee) and Steve (the bottom employee), the difference roughly corresponds with the differences in their hiring dates."
+    "While there is a 20% difference in sales between Jane (the top employee) and Steve (the bottom employee), the difference approximately corresponds to the differences in their hiring dates."
    ]
   },
   {
@@ -771,7 +771,7 @@
     "- United Kingdom\n",
     "- India\n",
     "\n",
-    "It's worth keeping in mind that because the amount of data from each of these countries is relatively low.  Because of this, we should be cautious spending too much money on new marketing campaigns, as the sample size is not large enough to give us high confidence.  A better approach would be to run small campaigns in these countries, collecting and analyzing the new customers to make sure that these trends hold with new customers."
+    "Because the amount of data from each of these countries is relatively low, we should not spend too much money on new marketing campaigns because the sample size is not large enough to give us high confidence. A better approach would be to run small campaigns in these countries, collecting and analyzing the new customers to make sure that these trends hold with new customers."
    ]
   },
   {
@@ -780,7 +780,7 @@
     "heading_collapsed": true
    },
    "source": [
-    "## Albums vs Individual Tracks"
+    "## Albums vs. Individual Tracks"
    ]
   },
   {
@@ -897,15 +897,15 @@
     "hidden": true
    },
    "source": [
-    "Album purchases account for 18.6% of purchases.  Based on this data, I would recommend against purchasing only select tracks from albums from record companies, since there is potential to lose one fifth of revenue."
+    "Album purchases account for 18.6% of purchases. Based on this data, we should not purchase only select tracks from albums from record companies, since there is potential to lose one fifth of revenue."
    ]
   }
  ],
  "metadata": {
   "kernelspec": {
-   "display_name": "dscontent",
+   "display_name": "Python 3",
    "language": "python",
-   "name": "dscontent"
+   "name": "python3"
   },
   "language_info": {
    "codemirror_mode": {
@@ -917,7 +917,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.4.4"
+   "version": "3.8.5"
   },
   "notify_time": "30"
  },

+ 27 - 27
Mission288Solutions.ipynb

@@ -6,11 +6,11 @@
    "source": [
     "# Is Fandango Still Inflating Ratings?\n",
     "\n",
-    "In October 2015, Walt Hickey from FiveThirtyEight published [a popular article](https://fivethirtyeight.com/features/fandango-movies-ratings/) where he presented strong evidence which suggest that Fandango's movie rating system was biased and dishonest. In this project, we'll analyze more recent movie ratings data to determine whether there has been any change in Fandango's rating system after Hickey's analysis.\n",
+    "In October 2015, Walt Hickey from FiveThirtyEight published [a popular article](https://fivethirtyeight.com/features/fandango-movies-ratings/) where he presented strong evidence that suggests that Fandango's movie rating system was biased and dishonest. In this project, we'll analyze more recent movie ratings data to determine whether there has been any change in Fandango's rating system following Hickey's analysis.\n",
     "\n",
     "# Understanding the Data\n",
     "\n",
-    "We'll work with two samples of movie ratings:the data in one sample was collected _previous_ to Hickey's analysis, while the other sample was collected _after_. Let's start by reading in the two samples (which are stored as CSV files) and getting familiar with their structure."
+    "We'll work with two samples of movie ratings: the data in one sample was collected _prior_ to Hickey's analysis, while the other sample was collected _after_. Let's start by reading in the two samples (which are stored as CSV files) and exploring their structure."
    ]
   },
   {
@@ -322,7 +322,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Below we isolate only the columns that provide information about Fandango so we make the relevant data more readily available for later use. We'll make copies [to avoid any `SettingWithCopyWarning`](https://www.dataquest.io/blog/settingwithcopywarning/) later on. "
+    "Below we isolate only the columns that provide information about Fandango to make the relevant data more readily available for later use. We'll make copies [to avoid any `SettingWithCopyWarning`](https://www.dataquest.io/blog/settingwithcopywarning/) later on. "
    ]
   },
   {
@@ -486,25 +486,25 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Our goal is to determine whether there has been any change in Fandango's rating system after Hickey's analysis. The population of interest for our analysis is made of all the movie ratings stored on Fandango's website, regardless of the releasing year.\n",
+    "Our goal is to determine whether there has been any change in Fandango's rating system following Hickey's analysis. The population of interest for our analysis comprises all the movie ratings stored on Fandango's website, regardless of the releasing year.\n",
     "\n",
-    "Because we want to find out whether the parameters of this population changed after Hickey's analysis, we're interested in sampling the population at two different periods in time — previous and after Hickey's analysis — so we can compare the two states.\n",
+    "Because we want to determine if the parameters of this population changed after Hickey's analysis, we're interested in sampling the population at two different periods in time — before and after Hickey's analysis — so we can compare the two states.\n",
     "\n",
-    "The data we're working with was sampled at the moments we want: one sample was taken previous to the analysis, and the other after the analysis. We want to describe the population, so we need to make sure that the samples are representative, otherwise we should expect a large sampling error and, ultimately, wrong conclusions.\n",
+    "The data we're working with was sampled at the moments we want: one sample was taken prior to the analysis, and the other was taken after the analysis. We want to describe the population, so we need to make sure that the samples are representative; otherwise, we should expect a large sampling error and, ultimately, inaccurate conclusions.\n",
     "\n",
     "From Hickey's article and from the `README.md` of [the data set's repository](https://github.com/fivethirtyeight/data/tree/master/fandango), we can see that he used the following sampling criteria:\n",
     "\n",
     "* The movie must have had at least 30 fan ratings on Fandango's website at the time of sampling (Aug. 24, 2015).\n",
     "* The movie must have had tickets on sale in 2015.\n",
     "\n",
-    "The sampling was clearly not random because not every movie had the same chance to be included in the sample — some movies didn't have a chance at all (like those having under 30 fan ratings or those without tickets on sale in 2015). It's questionable whether this sample is representative of the entire population we're interested to describe. It seems more likely that it isn't, mostly because this sample is subject to *temporal trends* — e.g. movies in 2015 might have been outstandingly good or bad compared to other years.\n",
+    "The sampling was clearly not random because not every movie had the same chance to be included in the sample — some movies didn't have a chance at all (like those having under 30 fan ratings or those without tickets on sale in 2015). It's questionable whether this sample is representative of the entire population we're interested in describing. It seems more likely that it isn't, mostly because this sample is subject to *temporal trends* (e.g., movies in 2015 might have been outstandingly good or bad compared to other years).\n",
     "\n",
-    "The sampling conditions for our other sample were (as it can be read in the `README.md` of [the data set's repository](https://github.com/mircealex/Movie_ratings_2016_17)):\n",
+    "The sampling conditions for our other sample were the following (as it can be read in the `README.md` of [the data set's repository](https://github.com/mircealex/Movie_ratings_2016_17)):\n",
     "\n",
     "* The movie must have been released in 2016 or later.\n",
-    "* The movie must have had a considerable number of votes and reviews (unclear how many from the `README.md` or from the data).\n",
+    "* The movie must have had a considerable number of votes and reviews (it's unclear how many from the `README.md` or from the data).\n",
     "\n",
-    "This second sample is also subject to temporal trends and it's unlikely to be representative of our population of interest.\n",
+    "This second sample is also subject to temporal trends, and it's unlikely to be representative of our population of interest.\n",
     "\n",
     "Both these authors had certain research questions in mind when they sampled the data, and they used a set of criteria to get a sample that would fit their questions. Their sampling method is called [**purposive sampling**](https://youtu.be/CdK7N_kTzHI) (or judgmental/selective/subjective sampling). While these samples were good enough for their research, they don't seem too useful for us.\n",
     "\n",
@@ -512,7 +512,7 @@
     "\n",
     "At this point, we can either collect new data or change our the goal of our analysis. We choose the latter and place some limitations on our initial goal.\n",
     "\n",
-    "Instead of trying to determine whether there has been any change in Fandango's rating system after Hickey's analysis, our new goal is to determine whether there's any difference between Fandango's ratings for popular movies in 2015 and Fandango's ratings for popular movies in 2016. This new goal should also be a fairly good proxy for our initial goal.\n",
+    "Instead of trying to determine whether there has been any change in Fandango's rating system following Hickey's analysis, our new goal is to determine whether there's any difference between Fandango's ratings for popular movies in 2015 and Fandango's ratings for popular movies in 2016. This new goal should also be a fairly good proxy for our initial goal.\n",
     "\n",
     "# Isolating the Samples We Need\n",
     "\n",
@@ -525,7 +525,7 @@
     "\n",
     "Although one of the sampling criteria in our second sample is movie popularity, the sample doesn't provide information about the number of fan ratings. We should be skeptical once more and ask whether this sample is truly representative and contains popular movies (movies with over 30 fan ratings).\n",
     "\n",
-    "One quick way to check the representativity of this sample is to sample randomly 10 movies from it and then check the number of fan ratings ourselves on Fandango's website. Ideally, at least 8 out of the 10 movies have 30 fan ratings or more."
+    "One quick way to check the representativity of this sample is to randomly sample 10 movies from it and then check the number of fan ratings ourselves on Fandango's website. Ideally, at least 8 out of the 10 movies have 30 fan ratings or more."
    ]
   },
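+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As a quick sketch of that check (the file name and the `fandango_after` variable are placeholders, not names confirmed by the cells above):\n",
+    "\n",
+    "```python\n",
+    "import pandas as pd\n",
+    "\n",
+    "# Placeholder file name for the 2016-17 ratings data\n",
+    "fandango_after = pd.read_csv('movie_ratings_16_17.csv')\n",
+    "\n",
+    "# Draw 10 titles at random; a fixed seed keeps the check reproducible\n",
+    "fandango_after.sample(10, random_state=1)\n",
+    "```"
+   ]
+  },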
   {
@@ -711,9 +711,9 @@
     "</table>\n",
     "\n",
     "\n",
-    "90% of the movies in our sample are popular. This is enough and we move forward with a bit more confidence.\n",
+    "90% of the movies in our sample are popular. This is enough for us to move forward with a bit more confidence.\n",
     "\n",
-    "Let's also double-check the other data set for popular movies. The documentation states clearly that there're only movies with at least 30 fan ratings, but it should take only a couple of seconds to double-check here."
+    "Let's also double-check the other dataset for popular movies. The documentation states clearly that there are only movies with at least 30 fan ratings, but it should take only a couple of seconds to double-check here."
    ]
   },
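+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A minimal sketch of this double-check, assuming Hickey's data sits in a DataFrame called `fandango_previous` with a `Fandango_votes` column (both names are assumptions):\n",
+    "\n",
+    "```python\n",
+    "import pandas as pd\n",
+    "\n",
+    "# Placeholder file name for Hickey's dataset\n",
+    "fandango_previous = pd.read_csv('fandango_score_comparison.csv')\n",
+    "\n",
+    "# Count the movies that fall below the popularity benchmark\n",
+    "(fandango_previous['Fandango_votes'] < 30).sum()\n",
+    "```"
+   ]
+  },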
   {
@@ -740,9 +740,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "If you explore the two data sets, you'll notice that there are movies with a releasing year different than 2015 or 2016. For our purposes, we'll need to isolate only the movies released in 2015 and 2016.\n",
+    "If you explore the two datasets, you'll notice that there are movies with a release year different than 2015 or 2016. For our purposes, we'll need to isolate only the movies released in 2015 and 2016.\n",
     "\n",
-    "Let's start with Hickey's data set and isolate only the movies released in 2015. There's no special column for the releasing year, but we should be able to extract it from the strings in the `FILM` column."
+    "Let's start with Hickey's dataset and isolate only the movies released in 2015. There's no special column for the releasing year, but we should be able to extract it from the strings in the `FILM` column."
    ]
   },
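+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "One possible sketch, reusing the assumed names from above. Titles in the `FILM` column end with the year in parentheses (e.g., `Cinderella (2015)`), so a regex capture group recovers it:\n",
+    "\n",
+    "```python\n",
+    "# Capture the four digits between parentheses at the end of each title\n",
+    "fandango_previous['Year'] = fandango_previous['FILM'].str.extract(\n",
+    "    r'\\((\\d{4})\\)', expand=False)\n",
+    "\n",
+    "# Keep only the 2015 releases (the extracted years are strings)\n",
+    "fandango_2015 = fandango_previous[fandango_previous['Year'] == '2015'].copy()\n",
+    "fandango_2015['Year'].value_counts()\n",
+    "```"
+   ]
+  },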
   {
@@ -950,7 +950,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Great, now let's isolate the movies in the other data set."
+    "Great! Now, let's isolate the movies in the other dataset."
    ]
   },
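+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A sketch of the same step for the second sample, assuming the 2016-17 dataset has a numeric `year` column (an assumption about the column name):\n",
+    "\n",
+    "```python\n",
+    "# Keep only the 2016 releases\n",
+    "fandango_2016 = fandango_after[fandango_after['year'] == 2016].copy()\n",
+    "fandango_2016['year'].value_counts()\n",
+    "```"
+   ]
+  },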
   {
@@ -1066,7 +1066,7 @@
    "source": [
     "# Comparing Distribution Shapes for 2015 and 2016\n",
     "\n",
-    "Our aim is to figure out whether there's any difference between Fandango's ratings for popular movies in 2015 and Fandango's ratings for popular movies in 2016. One way to go about is to analyze and compare the distributions of movie ratings for the two samples.\n",
+    "Our goal is to determine whether or not there is any difference between Fandango's ratings for popular movies in 2015 and Fandango's ratings for popular movies in 2016. One way to do this is to analyze and compare the distributions of movie ratings for the two samples.\n",
     "\n",
     "We'll start with comparing the shape of the two distributions using kernel density plots. We'll use [the FiveThirtyEight style](https://www.dataquest.io/blog/making-538-plots/) for the plots."
    ]
@@ -1108,18 +1108,18 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Two aspects are striking on the figure above:\n",
-    "* Both distributions are strongly left skewed.\n",
+    "Two aspects are interesting in the figure above:\n",
+    "* Both distributions are strongly left-skewed.\n",
     "* The 2016 distribution is slightly shifted to the left relative to the 2015 distribution.\n",
     "\n",
-    "The left skew suggests that movies on Fandango are given mostly high and very high fan ratings. Coupled with the fact that Fandango sells tickets, the high ratings are a bit dubious. It'd be really interesting to investigate this further — ideally in a separate project, since this is quite irrelevant for the current goal of our analysis.\n",
+    "The left skew suggests that movies on Fandango are given mostly high and very high fan ratings. Coupled with the fact that Fandango sells tickets, the high ratings are a bit dubious. It'd be really interesting to investigate this further — ideally in a separate project, since this is irrelevant for the current goal of our analysis.\n",
     "\n",
     "The slight left shift of the 2016 distribution is very interesting for our analysis. It shows that ratings were slightly lower in 2016 compared to 2015. This suggests that there was a difference indeed between Fandango's ratings for popular movies in 2015 and Fandango's ratings for popular movies in 2016. We can also see the direction of the difference: the ratings in 2016 were slightly lower compared to 2015.\n",
     "\n",
     "\n",
     "# Comparing Relative Frequencies\n",
     "\n",
-    "It seems we're following a good thread so far, but we need to analyze more granular information. Let's examine the frequency tables of the two distributions to analyze some numbers. Because the data sets have different numbers of movies, we normalize the tables and show percentages instead."
+    "It seems we're following a good thread so far, but we need to analyze more granular information. Let's examine the frequency tables of the two distributions to analyze some numbers. Because the datasets have different numbers of movies, we normalize the tables and show percentages instead."
    ]
   },
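+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A sketch of those normalized tables, under the same naming assumptions as above:\n",
+    "\n",
+    "```python\n",
+    "# value_counts(normalize=True) gives proportions; * 100 turns them into percentages\n",
+    "print(fandango_2015['Fandango_Stars'].value_counts(normalize=True).sort_index() * 100)\n",
+    "print(fandango_2016['fandango'].value_counts(normalize=True).sort_index() * 100)\n",
+    "```"
+   ]
+  },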
   {
@@ -1200,11 +1200,11 @@
     "\n",
     "The minimum rating is also lower in 2016 — 2.5 instead of 3 stars, the minimum of 2015. There clearly is a difference between the two frequency distributions.\n",
     "\n",
-    "For some other ratings, the percentage went up in 2016. There was a greater percentage of movies in 2016 that received 3.5 and 4 stars, compared to 2015. 3.5 and 4.0 are high ratings and this challenges the direction of the change we saw on the kernel density plots.\n",
+    "For some other ratings, the percentage went up in 2016. There was a greater percentage of movies in 2016 that received 3.5 and 4 stars, compared to 2015. 3.5 and 4.0 are high ratings, and this challenges the direction of the change we saw on the kernel density plots.\n",
     "\n",
     "# Determining the Direction of the Change\n",
     "\n",
-    "Let's take a couple of summary metrics to get a more precise picture about the direction of the change. In what follows, we'll compute the mean, the median, and the mode for both distributions and then use a bar graph to plot the values."
+    "Let's take a couple of summary metrics for more precise information about the direction of the change. In what follows, we'll compute the mean, the median, and the mode for both distributions, and then we'll use a bar graph to plot the values."
    ]
   },
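+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A sketch under the same naming assumptions. Note that `mode()` returns a Series, since a distribution can have several modes, so we take its first element:\n",
+    "\n",
+    "```python\n",
+    "import pandas as pd\n",
+    "import matplotlib.pyplot as plt\n",
+    "\n",
+    "summary = pd.DataFrame(\n",
+    "    {'2015': [fandango_2015['Fandango_Stars'].mean(),\n",
+    "              fandango_2015['Fandango_Stars'].median(),\n",
+    "              fandango_2015['Fandango_Stars'].mode()[0]],\n",
+    "     '2016': [fandango_2016['fandango'].mean(),\n",
+    "              fandango_2016['fandango'].median(),\n",
+    "              fandango_2016['fandango'].mode()[0]]},\n",
+    "    index=['mean', 'median', 'mode'])\n",
+    "\n",
+    "# Grouped bar graph of the three summary statistics\n",
+    "summary.plot.bar(rot=0, ylim=(0, 5))\n",
+    "plt.title('Comparing summary statistics: 2015 vs 2016')\n",
+    "plt.ylabel('Stars')\n",
+    "plt.show()\n",
+    "```"
+   ]
+  },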
   {
@@ -1351,9 +1351,9 @@
     "\n",
     "# Conclusion\n",
     "\n",
-    "Our analysis showed that there's indeed a slight difference between Fandango's ratings for popular movies in 2015 and Fandango's ratings for popular movies in 2016. We also determined that, on average, popular movies released in 2016 were rated lower on Fandango than popular movies released in 2015.\n",
+    "Our analysis showed that there is indeed a slight difference between Fandango's ratings for popular movies in 2015 and Fandango's ratings for popular movies in 2016. We also determined that, on average, popular movies released in 2016 were rated lower on Fandango than popular movies released in 2015.\n",
     "\n",
-    "We cannot be completely sure what caused the change, but the chances are very high that it was caused by Fandango fixing the biased rating system after Hickey's analysis."
+    "We cannot be completely sure what caused the change, but the chances are very high that it was caused by Fandango fixing the biased rating system following Hickey's analysis."
    ]
   }
  ],
@@ -1373,7 +1373,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.6.4"
+   "version": "3.8.5"
   }
  },
  "nbformat": 4,