Queer European MD passionate about IT

Mission516Solutions.Rmd 4.1 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147
  1. ---
  2. title: "Guided Project: Creating An Efficient Data Analysis Workflow, Part 2"
  3. output: html_document
  4. ---
  5. ```{r}
  6. library(tidyverse)
  7. library(lubridate)
  8. sales <- read_csv("sales2019.csv")
  9. ```
  10. # Data Exploration
  11. ```{r}
  12. # How big is the dataset?
  13. dim(sales)
  14. ```
  15. ```{r}
  16. # What are the column names?
  17. colnames(sales)
  18. ```
  19. The `date` column shows the date that the order of books was made. This will help us distinguish between orders that were made before and after the new program was implemented. `quantity` describes how many books were made, and `user_submitted_review` looks like it's a hand typed review of the books themselves. `customer_type` indicates whether or not the customer was an individual or a business. It seems that the company has started selling in bulk to other business too.
  20. ```{r}
  21. # What are the types of all the columns?
  22. for (col in colnames(sales)) {
  23. paste0(col, " : ", typeof(sales[[col]])) %>% print
  24. }
  25. ```
  26. ```{r}
  27. # Is there missing data anywhere?
  28. for (col in colnames(sales)) {
  29. paste0(col,
  30. ", number of missing data rows: ",
  31. is.na(sales[[col]]) %>% sum) %>% print
  32. }
  33. ```
  34. The `user_submitted_review` column has some missing data in it. We'll have to handle this later in the data cleaning, but at least we know about it ahead of time. The `total_purchased` column also has missing data, which we'll handle with imputation.
  35. # Handling Missing Data
  36. ```{r}
  37. # Remove the rows with no user_submitted_review
  38. complete_sales <- sales %>%
  39. filter(
  40. !is.na(user_submitted_review)
  41. )
  42. # Calculate the mean of the total_purchased column, without the missing values
  43. purchase_mean <- complete_sales %>%
  44. filter(!is.na(total_purchased)) %>%
  45. pull(total_purchased) %>%
  46. mean
  47. # Assign this mean to all of the rows where total_purchased was NA
  48. complete_sales <- complete_sales %>%
  49. mutate(
  50. imputed_purchases = if_else(is.na(total_purchased),
  51. purchase_mean,
  52. total_purchased)
  53. )
  54. ```
  55. # Processing Review Data
  56. ```{r}
  57. complete_sales %>% pull(user_submitted_review) %>% unique
  58. ```
  59. The reviews range from outright hate ("Hated it") to positive ("Awesome!"). We'll create a function that uses a `case_when()` function to produce the output. `case_when()` functions can be incredibly bulky in cases where there's many options, but housing it in a function to `map` can make our code cleaner.
  60. ```{r}
  61. is_positive <- function(review) {
  62. review_positive = case_when(
  63. str_detect(review, "Awesome") ~ TRUE,
  64. str_detect(review, "OK") ~ TRUE,
  65. str_detect(review, "Never") ~ TRUE,
  66. str_detect(review, "a lot") ~ TRUE,
  67. TRUE ~ FALSE # The review did not contain any of the above phrases
  68. )
  69. }
  70. complete_sales <- complete_sales %>%
  71. mutate(
  72. is_positive = unlist(map(user_submitted_review, is_positive))
  73. )
  74. ```
  75. # Comparing Book Sales Between Pre- and Post-Program Sales
  76. ```{r}
  77. complete_sales <- complete_sales %>%
  78. mutate(
  79. date_status = if_else(mdy(date) < ymd("2019/07/01"), "Pre", "Post")
  80. )
  81. complete_sales %>%
  82. group_by(date_status) %>%
  83. summarize(
  84. books_purchased = sum(imputed_purchases)
  85. )
  86. ```
  87. It doesn't seem that the program has increased sales. Maybe there were certain books that increased in sales?
  88. ```{r}
  89. complete_sales %>%
  90. group_by(date_status, title) %>%
  91. summarize(
  92. books_purchased = sum(imputed_purchases)
  93. ) %>%
  94. arrange(title, date_status)
  95. ```
  96. It turns out that certain books actually got more popular after the program started! R For Dummies and Secrets of R For Advanced Students got more popular.
  97. # Comparing Book Sales Within Customer Type
  98. ```{r}
  99. complete_sales %>%
  100. group_by(date_status, customer_type) %>%
  101. summarize(
  102. books_purchased = sum(imputed_purchases)
  103. ) %>%
  104. arrange(customer_type, date_status)
  105. ```
  106. Baserd on the table, it looks like businesses started purchasing more books after the program! There was actually a drop in individual sales.
  107. # Comparing Review Sentiment Between Pre- and Post-Program Sales
  108. ```{r}
  109. complete_sales %>%
  110. group_by(date_status) %>%
  111. summarize(
  112. num_positive_reviews = sum(is_positive)
  113. )
  114. ```
  115. There's slightly more reviews before the program, but this difference seems negigible.