123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147 |
- ---
- title: "Guided Project: Creating An Efficient Data Analysis Workflow, Part 2"
- output: html_document
- ---
- ```{r}
- library(tidyverse)
- library(lubridate)
- sales = read_csv("sales2019.csv")
- ```
- # Data Exploration
- ```{r}
- # How big is the dataset?
- dim(sales)
- ```
- ```{r}
- # What are the column names?
- colnames(sales)
- ```
- The `date` column shows the date that the order of books was made. This will help us distinguish between orders that were made before and after the new program was implemented. `quantity` describes how many books were made, and `user_submitted_review` looks like it's a hand typed review of the books themselves. `customer_type` indicates whether or not the customer was an individual or a business. It seems that the company has started selling in bulk to other business too.
- ```{r}
- # What are the types of all the columns?
- for (col in colnames(sales)) {
- paste0(col, " : ", typeof(sales[[col]])) %>% print
- }
- ```
- ```{r}
- # Is there missing data anywhere?
- for (col in colnames(sales)) {
- paste0(col,
- ", number of missing data rows: ",
- is.na(sales[[col]]) %>% sum) %>% print
- }
- ```
- The `user_submitted_review` column has some missing data in it. We'll have to handle this later in the data cleaning, but at least we know about it ahead of time. The `total_purchased` column also has missing data, which we'll handle with imputation.
- # Handling Missing Data
- ```{r}
- # Remove the rows with no user_submitted_review
- complete_sales = sales %>%
- filter(
- !is.na(user_submitted_review)
- )
- # Calculate the mean of the total_purchased column, without the missing values
- purchase_mean = complete_sales %>%
- filter(!is.na(total_purchased)) %>%
- pull(total_purchased) %>%
- mean
- # Assign this mean to all of the rows where total_purchased was NA
- complete_sales = complete_sales %>%
- mutate(
- imputed_purchases = if_else(is.na(total_purchased),
- purchase_mean,
- total_purchased)
- )
- ```
- # Processing Review Data
- ```{r}
- complete_sales %>% pull(user_submitted_review) %>% unique
- ```
- The reviews range from outright hate ("Hated it") to positive ("Awesome!"). We'll create a function that uses a `case_when()` function to produce the output. `case_when()` functions can be incredibly bulky in cases where there's many options, but housing it in a function to `map` can make our code cleaner.
- ```{r}
- is_positive = function(review) {
- review_positive = case_when(
- str_detect(review, "Awesome") ~ TRUE,
- str_detect(review, "OK") ~ TRUE,
- str_detect(review, "Never") ~ TRUE,
- str_detect(review, "a lot") ~ TRUE,
- TRUE ~ FALSE # The review did not contain any of the above phrases
- )
- }
- complete_sales = complete_sales %>%
- mutate(
- is_positive = unlist(map(user_submitted_review, is_positive))
- )
- ```
- # Comparing Book Sales Between Pre- and Post-Program Sales
- ```{r}
- complete_sales = complete_sales %>%
- mutate(
- date_status = if_else(mdy(date) < ymd("2019/07/01"), "Pre", "Post")
- )
- complete_sales %>%
- group_by(date_status) %>%
- summarize(
- books_purchased = sum(imputed_purchases)
- )
- ```
- It doesn't seem that the program has increased sales. Maybe there were certain books that increased in sales?
- ```{r}
- complete_sales %>%
- group_by(date_status, title) %>%
- summarize(
- books_purchased = sum(imputed_purchases)
- ) %>%
- arrange(title, date_status)
- ```
- It turns out that certain books actually got more popular after the program started! R For Dummies and Secrets of R For Advanced Students got more popular.
- # Comparing Book Sales Within Customer Type
- ```{r}
- complete_sales %>%
- group_by(date_status, customer_type) %>%
- summarize(
- books_purchased = sum(imputed_purchases)
- ) %>%
- arrange(customer_type, date_status)
- ```
- Baserd on the table, it looks like businesses started purchasing more books after the program! There was actually a drop in individual sales.
- # Comparing Review Sentiment Between Pre- and Post-Program Sales
- ```{r}
- complete_sales %>%
- group_by(date_status) %>%
- summarize(
- num_positive_reviews = sum(is_positive)
- )
- ```
- There's slightly more reviews before the program, but this difference seems negigible.
|