
tweaking the gp

johnaoga, 6 years ago
Parent
Current commit
a32d02f798
1 file changed, 59 insertions and 45 deletions
  1. 59 45
      Mission449Solutions.Rmd

+ 59 - 45
Mission449Solutions.Rmd

@@ -1,6 +1,6 @@
 ---
 title: 'Guided Project: Finding the Best Markets to Advertise In'
-author: "John Aoga"
+author: "Dataquest"
 date: "11/19/2019"
 output: html_document
 ---
@@ -19,7 +19,7 @@ The survey data is publicly available in [this GitHub repository](https://github
 
 ```{r}
 library(readr)
-fcc <- read_csv("../../content/449/2017-fCC-New-Coders-Survey-Data.csv")
+fcc <- read_csv("2017-fCC-New-Coders-Survey-Data.csv")
 dim(fcc)
 head(fcc, 5)
 ```
@@ -32,11 +32,11 @@ As we mentioned in the introduction, most of our courses are on web and mobile d
 * What locations have the greatest densities of new coders.
 * How much money they're willing to spend on learning.
 
-So we first need to clarify whether the data set has the right categories of people for our purpose. The `JobRoleInterest` column describes for every participant the role(s) they'd be interested in working in. If a participant is interested in working in a certain domain, it means that they're also interested in learning about that domain. So let's take a look at the frequency distribution table of this column and determine whether the data we have is relevant.
+So we first need to clarify whether the data set has the right categories of people for our purpose. The `JobRoleInterest` column describes for every participant the role(s) they'd be interested in working in. If a participant is interested in working in a certain domain, it means that they're also interested in learning about that domain. So let's take a look at the frequency distribution table of this column [^1] and determine whether the data we have is relevant.
 
 
 ```{r}
-#split-and-combine
+#split-and-combine workflow
 library(dplyr)
 fcc %>%
   group_by(JobRoleInterest) %>%
@@ -54,39 +54,32 @@ The information in the table above is quite granular, but from a quick scan it l
 It's also interesting to note that many respondents are interested in more than one subject. It'd be useful to get a better picture of how many people are interested in a single subject and how many have mixed interests. Consequently, in the next code block, we'll:
 
 - Split each string in the `JobRoleInterest` column to find the number of options for each participant.
-    - We'll first drop the null values because we can't split `NA` values.
-- Generate a frequency table for the variable describing the number of options.
+    - We'll first drop the NA values [^2] because we cannot split NA values.
+- Generate a frequency table for the variable describing the number of options [^3].
 
 ```{r}
 # Split each string in the 'JobRoleInterest' column
-splitted_interests = fcc %>%
+splitted_interests <- fcc %>%
   select(JobRoleInterest) %>%
   tidyr::drop_na() %>%
-  rowwise %>%
+  rowwise() %>% # dplyr verbs operate on whole columns by default; rowwise() makes mutate() work one row at a time
   mutate(opts = length(stringr::str_split(JobRoleInterest, ",")[[1]]))
 
-# (alternative)
-splitted_interests <- purrr::map_int(stringr::str_split(fcc$JobRoleInterest, ","), length)
-
-
-
 # Frequency table for the var describing the number of options
 n_of_options <- splitted_interests %>%
-  ungroup() %>%
+  ungroup() %>%  # needed because we used rowwise() above
   group_by(opts) %>%
   summarize(freq = n()*100/nrow(splitted_interests))
 
-n_of_options <- table(splitted_interests)
-
 n_of_options
 ```
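
The row-by-row count above can also be written without `rowwise()`. The following is a minimal sketch in base R, assuming a small made-up vector (`interests`) in place of `fcc$JobRoleInterest`; it is an alternative to the chunk in the diff, not a replacement the commit itself makes.

```r
# Toy stand-in for fcc$JobRoleInterest (hypothetical values, not survey data)
interests <- c("Full-Stack Web Developer",
               "Front-End Web Developer, Mobile Developer",
               NA,
               "Data Scientist, Game Developer, Mobile Developer")

# Drop the NA values first, mirroring tidyr::drop_na() in the chunk above
interests <- interests[!is.na(interests)]

# strsplit() is vectorised: it returns one character vector per input string,
# and lengths() counts the pieces in each of those vectors
opts <- lengths(strsplit(interests, ",", fixed = TRUE))
opts

# Relative frequencies (in percent) of the number of options
freq <- prop.table(table(opts)) * 100
freq
```

Because `strsplit()` and `lengths()` already work element by element, no grouping or ungrouping step is needed in this variant.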
 
-It turns out that only 31.7% of the participants have a clear idea about what programming niche they'd like to work in, while the vast majority of students have mixed interests. But given that we offer courses on various subjects, the fact that new coders have mixed interest might be actually good for us.
+It turns out that only 31.65% of the participants have a clear idea about what programming niche they'd like to work in, while the vast majority of students have mixed interests. But given that we offer courses on various subjects, the fact that new coders have mixed interests might actually be good for us.
 
 The focus of our courses is on web and mobile development, so let's find out how many respondents chose at least one of these two options.
 
 ```{r}
-# Frequency table
+# Frequency table (we can also use split-and-combine) 
 web_or_mobile <- stringr::str_detect(fcc$JobRoleInterest, "Web Developer|Mobile Developer")
 freq_table <- table(web_or_mobile)
 freq_table <- freq_table * 100 / sum(freq_table)
@@ -96,6 +89,8 @@ freq_table
 df <- tibble::tibble(x = c("Other Subject","Web or Mobile Development"),
                        y = freq_table)
 
+library(ggplot2)
+
 ggplot(data = df, aes(x = x, y = y, fill = x)) +
   geom_histogram(stat = "identity")
 
@@ -125,14 +120,14 @@ fcc_good <- fcc %>%
 
 # Frequency tables with absolute and relative frequencies
 # Display the frequency tables in a more readable format
-fcc_good%>%
+fcc_good %>%
 group_by(CountryLive) %>%
 summarise(`Absolute frequency` = n(),
           `Percentage` = n() * 100 /  nrow(fcc_good) ) %>%
   arrange(desc(Percentage))
 ```
 
-44.7% of our potential customers are located in the US, and this definitely seems like the most interesting market. India has the second customer density, but it's just 7.7%, which is not too far from the United Kingdom (4.6%) or Canada (3.8%).
+44.69% of our potential customers are located in the US, and this definitely seems like the most interesting market. India has the second-highest customer density, but it's just 7.55%, which is not too far from the United Kingdom (4.50%) or Canada (3.71%).
 
 This is useful information, but we need to go more in depth than this and figure out how much money people are actually willing to spend on learning. Advertising in high-density markets where most people are only willing to learn for free is extremely unlikely to be profitable for us.
 
@@ -151,7 +146,7 @@ Let's start with creating a new column that describes the amount of money a stud
 ```{r}
 # Replace 0s with 1s to avoid division by 0
 fcc_good <- fcc_good %>%
-  mutate(MonthsProgramming = replace(MonthsProgramming,  MonthsProgramming==0, 1) )
+  mutate(MonthsProgramming = replace(MonthsProgramming,  MonthsProgramming == 0, 1) )
 
 # New column for the amount of money each student spends each month
 fcc_good <- fcc_good %>%
@@ -162,17 +157,17 @@ fcc_good %>%
   pull(na_count)
 ```
 
-Let's keep only the rows that don't have null values for the `money_per_month` column.
+Let's keep only the rows that don't have NA values for the `money_per_month` column.
 
 ```{r}
-# Keep only the rows with non-nulls in the `money_per_month` column 
+# Keep only the rows with non-NAs in the `money_per_month` column 
 fcc_good  <-  fcc_good %>% tidyr::drop_na(money_per_month)
 ```
 
 We want to group the data by country, and then measure the average amount of money that students spend per month in each country. First, let's remove the rows having `NA` values for the `CountryLive` column, and check out if we still have enough data for the four countries that interest us.
 
 ```{r}
-# Remove the rows with null values in 'CountryLive'
+# Remove the rows with NA values in 'CountryLive'
 fcc_good  <-  fcc_good %>% tidyr::drop_na(CountryLive)
 
 # Frequency table to check if we still have enough data
@@ -186,7 +181,7 @@ This should be enough, so let's compute the average value spent per month in eac
 
 ```{r}
 # Mean sum of money spent by students each month
-countries_mean = fcc_good %>% 
+countries_mean  <-  fcc_good %>% 
   filter(CountryLive == 'United States of America' | CountryLive == 'India' | CountryLive == 'United Kingdom'|CountryLive == 'Canada') %>%
   group_by(CountryLive) %>%
   summarize(mean = mean(money_per_month)) %>%
@@ -205,11 +200,16 @@ Let's use box plots to visualize the distribution of the `money_per_month` varia
 
 ```{r}
 # Isolate only the countries of interest
-only_4 = fcc_good %>% 
+only_4  <-  fcc_good %>% 
   filter(CountryLive == 'United States of America' | CountryLive == 'India' | CountryLive == 'United Kingdom'|CountryLive == 'Canada')
 
-# Box plots to visualize distributions
+# We may remove rows from this data frame later,
+# so we add an index column holding each row's number.
+# This lets us match the rows we inspect back to only_4 by index.
+only_4 <- only_4 %>%
+  mutate(index = row_number())
 
+# Box plots to visualize distributions
 ggplot( data = only_4, aes(x = CountryLive, y = money_per_month)) +
   geom_boxplot() +
   ggtitle("Money Spent Per Month Per Country\n(Distributions)") +
@@ -219,10 +219,10 @@ ggplot( data = only_4, aes(x = CountryLive, y = money_per_month)) +
 
 ```
 
-It's hard to see on the plot above if there's anything wrong with the data for the United Kingdom, India, or Canada, but we can see immediately that there's something really off for the US: two persons spend each month \$50000 or more for learning. This is not impossible, but it seems extremely unlikely, so we'll remove every value that goes over \$20,000 per month.
+It's hard to see on the plot above if there's anything wrong with the data for the United Kingdom, India, or Canada, but we can see immediately that there's something really off for the US: two people spend \$50,000 or more each month on learning. This is not impossible, but it seems extremely unlikely, so we'll remove every value that goes over \$20,000 per month.
 
 ```{r}
-# Isolate only those participants who spend less than 10000 per month
+# Isolate only those participants who spend less than 20,000 per month
 fcc_good  <- fcc_good %>% 
   filter(money_per_month < 20000)
 ```
@@ -242,11 +242,11 @@ countries_mean
 
 ```{r}
 # Isolate only the countries of interest
-only_4 = fcc_good %>% 
-  filter(CountryLive == 'United States of America' | CountryLive == 'India' | CountryLive == 'United Kingdom'|CountryLive == 'Canada')
+only_4  <-  fcc_good %>% 
+  filter(CountryLive == 'United States of America' | CountryLive == 'India' | CountryLive == 'United Kingdom'|CountryLive == 'Canada') %>%
+  mutate(index = row_number())
 
 # Box plots to visualize distributions
-
 ggplot( data = only_4, aes(x = CountryLive, y = money_per_month)) +
   geom_boxplot() +
   ggtitle("Money Spent Per Month Per Country\n(Distributions)") +
@@ -256,26 +256,26 @@ ggplot( data = only_4, aes(x = CountryLive, y = money_per_month)) +
 
 ```
 
-We can see a few extreme outliers for India (values over \$2500 per month), but it's unclear whether this is good data or not. Maybe these persons attended several bootcamps, which tend to be very expensive. Let's examine these two data points to see if we can find anything relevant.
+We can see a few extreme outliers for India (values over \$2,500 per month), but it's unclear whether this is good data or not. Maybe these persons attended several bootcamps, which tend to be very expensive. Let's examine these data points to see if we can find anything relevant.
 
 ```{r}
 # Inspect the extreme outliers for India
-india_outliers = only_4 %>%
+india_outliers  <-  only_4 %>%
   filter(CountryLive == 'India' & 
            money_per_month >= 2500)
 
 india_outliers
 ```
 
-It seems that neither participant attended a bootcamp. Overall, it's really hard to figure out from the data whether these persons really spent that much money with learning. The actual question of the survey was _"Aside from university tuition, about how much money have you spent on learning to code so far (in US dollars)?"_, so they might have misunderstood and thought university tuition is included. It seems safer to remove these two rows.
+It seems that none of these participants attended a bootcamp. Overall, it's really hard to figure out from the data whether these persons really spent that much money on learning. The actual question of the survey was _"Aside from university tuition, about how much money have you spent on learning to code so far (in US dollars)?"_, so they might have misunderstood and thought university tuition is included. It seems safer to remove these six rows.
 
 ```{r}
 # Remove the outliers for India
-only_4  <-  only_4 %>% 
-  drop(india_outliers.index) # using the row labels
+only_4 <-  only_4 %>% 
+  filter(!(index %in% india_outliers$index))
 ```
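
The index-based removal introduced in this commit can be tried out on a small example. The sketch below assumes a made-up tibble (`only_4_toy`, hypothetical values) and also shows `dplyr::anti_join()` as an equivalent way to drop the matched rows; the commit itself uses only the `filter(!(index %in% ...))` form.

```r
library(dplyr)

# Hypothetical miniature of only_4 (made-up values, not survey data)
only_4_toy <- tibble::tibble(
  CountryLive = c("India", "India", "Canada", "India"),
  money_per_month = c(5000, 20, 100, 3000)
) %>%
  mutate(index = row_number())

# Same selection rule as india_outliers in the diff
outliers_toy <- only_4_toy %>%
  filter(CountryLive == "India" & money_per_month >= 2500)

# Variant used in the diff: keep rows whose index is not among the outliers
kept_filter <- only_4_toy %>%
  filter(!(index %in% outliers_toy$index))

# Equivalent variant: anti_join() keeps the rows of x that have no match in y
kept_anti <- only_4_toy %>%
  anti_join(outliers_toy, by = "index")

kept_filter
kept_anti
```

Both variants rely on `index` being unique per row, which holds here because it is created with `row_number()` on the ungrouped data frame.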
 
-Looking back at the box plot above, we can also see more extreme outliers for the US (values over \$6000 per month). Let's examine these participants in more detail.
+Looking back at the box plot above, we can also see more extreme outliers for the US (values over \$6,000 per month). Let's examine these participants in more detail.
 
 ```{r}
 # Examine the extreme outliers for the US
@@ -284,9 +284,12 @@ us_outliers = only_4 %>%
            money_per_month >= 6000)
 
 us_outliers
+
+only_4  <-  only_4 %>% 
+  filter(!(index %in% us_outliers$index))
 ```
 
-Out of these 11 extreme outliers, six people attended bootcamps, which justify the large sums of money spent on learning. For the other five, it's hard to figure out from the data where they could have spent that much money on learning. Consequently, we'll remove those rows where participants reported thay they spend \$6000 each month, but they have never attended a bootcamp.
+Out of these 11 extreme outliers, six people attended bootcamps, which justifies the large sums of money spent on learning. For the other five, it's hard to figure out from the data where they could have spent that much money on learning. Consequently, we'll remove those rows where participants reported that they spend \$6,000 each month, but they have never attended a bootcamp.
 
 Also, the data shows that eight respondents had been programming for no more than three months when they completed the survey. They most likely paid a large sum of money for a bootcamp that was going to last for several months, so the amount of money spent per month is unrealistic and should be significantly lower (because they probably didn't spend anything for the next couple of months after the survey). As a consequence, we'll remove these eight outliers.
 
@@ -302,7 +305,8 @@ no_bootcamp = only_4 %>%
            money_per_month >= 6000 &
              AttendedBootcamp == 0)
 
-only_4  <-  only_4.drop(no_bootcamp.index)
+only_4  <-  only_4 %>% 
+  filter(!(index %in% no_bootcamp$index))
 
 
 # Remove the respondents that had been programming for 3 months or less
@@ -311,11 +315,12 @@ less_than_3_months = only_4 %>%
            money_per_month >= 6000 &
            MonthsProgramming <= 3)
 
-only_4  <-  only_4.drop(less_than_3_months.index)
+only_4  <-  only_4 %>% 
+  filter(!(index %in% less_than_3_months$index))
 ```
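
The two US removal rules described above (spending of \$6,000 or more per month with no bootcamp, or with three months of programming or fewer) can also be expressed as one condition. This is a sketch on a made-up tibble (`us_toy`, hypothetical values), not the approach the commit takes.

```r
library(dplyr)

# Hypothetical rows standing in for US respondents (made-up values)
us_toy <- tibble::tibble(
  CountryLive = "United States of America",
  money_per_month = c(8000, 9000, 7000, 50),
  AttendedBootcamp = c(1, 0, 1, 0),
  MonthsProgramming = c(12, 24, 2, 6)
)

# Drop a row only when it is a US outlier (>= 6000 per month) AND
# either no bootcamp was attended or the respondent had programmed
# for 3 months or less
cleaned <- us_toy %>%
  filter(!(CountryLive == "United States of America" &
             money_per_month >= 6000 &
             (AttendedBootcamp == 0 | MonthsProgramming <= 3)))

cleaned
```

One caveat with this form: `filter()` drops rows where the condition evaluates to `NA`, so rows with missing `AttendedBootcamp` or `MonthsProgramming` values may be removed as well, which the index-based version in the diff would keep.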
 
 
-Looking again at the last box plot above, we can also see an extreme outlier for Canada — a person who spends roughly \$5000 per month. Let's examine this person in more depth.
+Looking again at the last box plot above, we can also see an extreme outlier for Canada — a person who spends roughly \$5,000 per month. Let's examine this person in more depth.
 
 ```{r}
 # Examine the extreme outliers for Canada
@@ -331,7 +336,8 @@ Here, the situation is similar to some of the US respondents — this participan
 
 ```{r}
 # Remove the extreme outliers for Canada
-only_4  <-  only_4.drop(canada_outliers.index)
+only_4  <-  only_4 %>% 
+  filter(!(index %in% canada_outliers$index))
 ```
 
 Let's recompute the mean values and generate the final box plots.
@@ -372,7 +378,10 @@ The data suggests strongly that we shouldn't advertise in the UK, but let's take
 
 ```{r}
 # Frequency table for the 'CountryLive' column
-only_4['CountryLive'].value_counts(normalize = True) * 100
+only_4 %>% group_by(CountryLive) %>%
+  summarise(freq = n() * 100 / nrow(only_4) ) %>%
+  arrange(desc(freq)) %>%
+  head()
 ```
 
 ```{r}
@@ -404,4 +413,9 @@ At this point, it's probably best to send our analysis to the marketing team and
 
 In this project, we analyzed survey data from new coders to find the best two markets to advertise in. The only solid conclusion we reached is that the US would be a good market to advertise in.
 
-For the second best market, it wasn't clear-cut what to choose between India and Canada. We decided to send the results to the marketing team so they can use their domain knowledge to take the best decision.
+For the second best market, it wasn't clear-cut what to choose between India and Canada. We decided to send the results to the marketing team so they can use their domain knowledge to make the best decision.
+
+# Documentation
+[^1]: We can use the [Split-and-Combine workflow](https://app.dataquest.io/m/339/a/5).
+[^2]: We can use the [`drop_na()` function](https://app.dataquest.io/m/326/a/6).
+[^3]: We can use the [`stringr::str_split()` function](https://app.dataquest.io/m/342/a/6).