Queer European MD passionate about IT

Added a few improvements

Alex 6 years ago
parent
commit
b6bc06edbd
1 changed file with 24 additions and 21 deletions
+ 24 - 21
Mission288Solutions.ipynb

@@ -8,10 +8,9 @@
     "\n",
     "In October 2015, Walt Hickey from Fandango published [a popular article](https://fivethirtyeight.com/features/fandango-movies-ratings/) where he presented strong evidence which suggest that Fandango's movie rating system was biased and dishonest. In this project, we'll analyze more recent movie ratings data to determine whether there has been any change in Fandango's rating system after Hickey's analysis.\n",
     "\n",
-    "\n",
     "# Understanding the Data\n",
     "\n",
-    "We'll work with two samples of movie ratings: the data in one sample was collected previous to Hickey's analysis, while the other sample was collected after. Let's start by reading in the two samples (which are stored as CSV files) and getting familiar with their structure."
+    "We'll work with two samples of movie ratings: the data in one sample was collected _previous_ to Hickey's analysis, while the other sample was collected _after_. Let's start by reading in the two samples (which are stored as CSV files) and getting familiar with their structure."
    ]
   },
   {
@@ -487,7 +486,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Our goal is to determine whether there has been change in Fandango's rating system after Hickey's analysis. The population of interest is made of all the movie ratings stored on Fandango's website, regardless of the releasing year. Because we're interested about whether the parameters of this population changed after Hickey's analysis, we're interested to sample it at two different periods in time — previous and after Hickey's analysis — so we can compare the two states.\n",
+    "Our goal is to determine whether there has been any change in Fandango's rating system after Hickey's analysis. The population of interest for our analysis consists of all the movie ratings stored on Fandango's website, regardless of the release year.\n",
+    "\n",
+    "Because we want to find out whether the parameters of this population changed after Hickey's analysis, we're interested in sampling the population at two different periods in time (before and after Hickey's analysis) so we can compare the two states.\n",
     "\n",
     "The data we're working with was sampled at the moments we want: one sample was taken previous to the analysis, and the other after the analysis. We want to describe the population, so we need to make sure that the samples are representative, otherwise we should expect a large sampling error and, ultimately, wrong conclusions.\n",
     "\n",
@@ -496,7 +497,7 @@
     "* The movie must have had at least 30 fan ratings on Fandango's website at the time of sampling (Aug. 24, 2015).\n",
     "* The movie must have had tickets on sale in 2015.\n",
     "\n",
-    "The sampling was clearly not random because not every movie had the same chance to be included in the sample — some movies didn't have a chance at all (like those with under 30 fan reviews or those without tickets on sale in 2015). It's questionable whether this sample is representative of the entire population we're interested to describe. It's much more likely that it isn't, mostly because this sample is subject to *temporal trends* — e.g. movies in 2015 might have been outstandingly good or bad compared to other years.\n",
+    "The sampling was clearly not random because not every movie had the same chance to be included in the sample: some movies didn't have a chance at all (like those with fewer than 30 fan ratings or those without tickets on sale in 2015). It's questionable whether this sample is representative of the entire population we want to describe. It seems more likely that it isn't, mostly because this sample is subject to *temporal trends* (e.g. movies in 2015 might have been outstandingly good or bad compared to other years).\n",
     "\n",
     "The sampling conditions for our other sample were (as it can be read in the `README.md` of [the data set's repository](https://github.com/mircealex/Movie_ratings_2016_17)):\n",
     "\n",
@@ -505,7 +506,7 @@
     "\n",
     "This second sample is also subject to temporal trends and it's unlikely to be representative of our population of interest.\n",
     "\n",
-    "Both these authors had certain research questions in mind when they sampled the data, and they used a set of criteria to get a sample that would fit their questions. Their sampling method is called [**purposive sampling**](https://youtu.be/CdK7N_kTzHI) (or judgmental/selective/subjective sampling)]. While these samples were good enough for their research, they don't seem too useful for us.\n",
+    "Both these authors had certain research questions in mind when they sampled the data, and they used a set of criteria to get a sample that would fit their questions. Their sampling method is called [**purposive sampling**](https://youtu.be/CdK7N_kTzHI) (or judgmental/selective/subjective sampling). While these samples were good enough for their research, they don't seem too useful for us.\n",
     "\n",
     "# Changing the Goal of our Analysis\n",
     "\n",
@@ -522,9 +523,9 @@
     "\n",
     "We need to be clear about what counts as popular movies. We'll use Hickey's benchmark of 30 fan ratings and count a movie as popular only if it has 30 fan ratings or more on Fandango's website.\n",
     "\n",
-    "Although one of the sampling criteria in our second sample is movie popularity, the sample doesn't provide information about the number of fan ratings. We should be skeptical once more and ask whether this sample is truly representative and contains popular movies as we defined them.\n",
+    "Although one of the sampling criteria in our second sample is movie popularity, the sample doesn't provide information about the number of fan ratings. We should be skeptical once more and ask whether this sample is truly representative and contains popular movies (movies with 30 fan ratings or more).\n",
     "\n",
-    "One quick way to check the representativity of this sample is to sample randomly 10 movies from it and then check the number of fan ratings ourselves on Fandango's website. Ideally, at least 8 out of the 10 movies have 30 fan ratings or over."
+    "One quick way to check the representativeness of this sample is to randomly sample 10 movies from it and then check the number of fan ratings ourselves on Fandango's website. Ideally, at least 8 out of the 10 movies have 30 fan ratings or more."
    ]
   },
   {
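A minimal sketch of the spot check described in the cell above, assuming the second sample is held in a pandas DataFrame. The name `fandango_after`, the `movie` column, and the toy rows are placeholders for illustration, not the notebook's actual data. The `random_state` value is arbitrary and only makes the draw repeatable.

```python
import pandas as pd

# Toy stand-in for the second sample; titles and ratings are made up.
fandango_after = pd.DataFrame({
    "movie": [f"Movie {i}" for i in range(1, 51)],
    "fandango": [3.5, 4.0, 4.5, 4.0, 5.0] * 10,
})

# Draw 10 movies without replacement; random_state makes the draw repeatable.
spot_check = fandango_after.sample(10, random_state=1)
print(spot_check["movie"].tolist())
```

The fan-rating counts for the drawn titles would then be looked up by hand on Fandango's website, as the text suggests.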
@@ -739,7 +740,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "If you explore the data sets, you'll notice that there are movies that weren't released in 2015 or 2016. For our purposes, we'll need to isolate only the movies released in 2015 and 2016."
+    "If you explore the two data sets, you'll notice that there are movies with a release year other than 2015 or 2016. For our purposes, we'll need to isolate only the movies released in 2015 and 2016.\n",
+    "\n",
+    "Let's start with Hickey's data set and isolate only the movies released in 2015. There's no special column for the release year, but we should be able to extract it from the strings in the `FILM` column."
    ]
   },
   {
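One possible way to pull the year out of the `FILM` strings, sketched under the assumption that every title ends with a four-digit year in parentheses. The toy titles below are hypothetical, and the regular expression is only one option (fixed-position slicing would also work if the format is uniform); the notebook's own cell may do this differently.

```python
import pandas as pd

# Toy rows; assumes each title ends with "(YYYY)". Titles are hypothetical.
fandango_previous = pd.DataFrame({
    "FILM": ["Film A (2015)", "Film B (2015)", "Film C (2014)"],
})

# Capture the four digits inside the trailing parentheses.
fandango_previous["Year"] = fandango_previous["FILM"].str.extract(
    r"\((\d{4})\)\s*$", expand=False
)
print(fandango_previous)
```

Note that the extracted years are strings, so any later comparison has to use `"2015"` rather than `2015` unless the column is converted.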
@@ -815,13 +818,6 @@
     "fandango_previous.head(2)"
    ]
   },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "There's no special column for the releasing year, but we can extract it from the strings in the `FILM` column."
-   ]
-  },
   {
    "cell_type": "code",
    "execution_count": 8,
@@ -899,6 +895,13 @@
     "fandango_previous.head(2)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's examine the frequency distribution for the `Year` column and then isolate the movies released in 2015."
+   ]
+  },
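A short sketch of the step this new cell announces, assuming a `Year` column of strings already exists. The toy rows and the name `fandango_2015` are illustrative assumptions, not taken from the notebook.

```python
import pandas as pd

# Toy data with a string-typed Year column; titles are hypothetical.
fandango_previous = pd.DataFrame({
    "FILM": ["Film A (2015)", "Film B (2015)", "Film C (2014)"],
    "Year": ["2015", "2015", "2014"],
})

# Frequency distribution of release years.
year_counts = fandango_previous["Year"].value_counts()
print(year_counts)

# Keep only the 2015 releases; .copy() gives an independent DataFrame.
fandango_2015 = fandango_previous[fandango_previous["Year"] == "2015"].copy()
```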
   {
    "cell_type": "code",
    "execution_count": 9,
@@ -1063,9 +1066,9 @@
    "source": [
     "# Comparing Distribution Shapes for 2015 and 2016\n",
     "\n",
-    "Once again, our aim is to figure out whether there's any difference between Fandango's ratings for popular movies in 2015 and Fandango's ratings for popular movies in 2016. One way to go about is to analyze and compare the distributions of movie ratings for the two samples.\n",
+    "Our aim is to figure out whether there's any difference between Fandango's ratings for popular movies in 2015 and Fandango's ratings for popular movies in 2016. One way to go about it is to analyze and compare the distributions of movie ratings for the two samples.\n",
     "\n",
-    "We'll start with comparing the shape of the two distributions using kernel density plots."
+    "We'll start by comparing the shape of the two distributions using kernel density plots. We'll use [the FiveThirtyEight style](https://www.dataquest.io/blog/making-538-plots/) for the plots."
    ]
   },
   {
@@ -1107,11 +1110,11 @@
    "source": [
     "Two aspects are striking on the figure above:\n",
     "* Both distributions are strongly left skewed.\n",
-    "* The 2016 distribution is slightly shifted to the left.\n",
+    "* The 2016 distribution is slightly shifted to the left relative to the 2015 distribution.\n",
     "\n",
     "The left skew suggests that movies on Fandango are given mostly high and very high fan ratings. Coupled with the fact that Fandango sells tickets, the high ratings are a bit dubious. It'd be really interesting to investigate this further — ideally in a separate project, since this is quite irrelevant for the current goal of our analysis.\n",
     "\n",
-    "The slight left shift of the 2016 distribution is very interesting for our analysis. It shows that ratings were slightly lower in 2016 compared to 2015. This confirms that there was a difference indeed between Fandango's ratings for popular movies in 2015 and Fandango's ratings for popular movies in 2016. We can also see the direction of the difference: the ratings in 2016 were slightly lower than in 2015.\n",
+    "The slight left shift of the 2016 distribution is very interesting for our analysis. It shows that ratings were slightly lower in 2016 compared to 2015. This suggests that there was indeed a difference between Fandango's ratings for popular movies in 2015 and Fandango's ratings for popular movies in 2016. We can also see the direction of the difference: the ratings in 2016 were slightly lower compared to 2015.\n",
     "\n",
     "\n",
     "# Comparing Relative Frequencies\n",
@@ -1193,7 +1196,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "In 2016, very high ratings (4.5 and 5 stars) had significantly lower percentages compared to 2015. In 2016, under 1% of the movies were given a perfect rating of 5 stars, compared to 2015 when the percentage was close to 7%. Ratings of 4.5 were also more popular in 2015 — there were approximately 13% more movies rated with a 4.5 in 2015 compared to 2016.\n",
+    "In 2016, very high ratings (4.5 and 5 stars) had significantly lower percentages compared to 2015. In 2016, under 1% of the movies had a perfect rating of 5 stars, compared to 2015 when the percentage was close to 7%. Ratings of 4.5 were also more popular in 2015 — there were approximately 13% more movies rated with a 4.5 in 2015 compared to 2016.\n",
     "\n",
     "The minimum rating is also lower in 2016 — 2.5 instead of 3 stars, the minimum of 2015. There clearly is a difference between the two frequency distributions.\n",
     "\n",
@@ -1201,7 +1204,7 @@
     "\n",
     "# Determining the Direction of the Change\n",
     "\n",
-    "Let's take a couple of summary metrics to get a more precise picture about the direction of the change. In what follows, we'll compute the mean, the median, and the mode for both distributions and then plot the values with a bar graph."
+    "Let's take a couple of summary metrics to get a more precise picture of the direction of the change. In what follows, we'll compute the mean, the median, and the mode for both distributions and then use a bar graph to plot the values."
    ]
   },
   {
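A rough sketch of how the three summary metrics could be gathered into one table, assuming each year's ratings are available as a pandas Series. The names `ratings_2015`, `ratings_2016`, and `summary`, and the rating values themselves, are made up for illustration and do not reproduce the notebook's results.

```python
import pandas as pd

# Made-up ratings for illustration only.
ratings_2015 = pd.Series([4.5, 4.5, 4.0, 5.0, 3.5])
ratings_2016 = pd.Series([4.0, 4.0, 3.5, 4.5, 3.0])

# mode() returns a Series (ties are possible), so take its first value.
summary = pd.DataFrame(
    {
        "2015": [ratings_2015.mean(), ratings_2015.median(), ratings_2015.mode()[0]],
        "2016": [ratings_2016.mean(), ratings_2016.median(), ratings_2016.mode()[0]],
    },
    index=["mean", "median", "mode"],
)
print(summary)
```

With matplotlib installed, `summary.plot.bar()` would draw the grouped bar graph the text mentions.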