|  | @@ -0,0 +1,147 @@
 | 
	
		
			
				|  |  | +---
 | 
	
		
			
				|  |  | +title: "Guided Project: Creating An Efficient Data Analysis Workflow, Part 2"
 | 
	
		
			
				|  |  | +output: html_document
 | 
	
		
			
				|  |  | +---
 | 
	
		
			
				|  |  | +
 | 
	
		
			
				|  |  | +```{r}
 | 
	
		
			
				|  |  | +library(tidyverse)
 | 
	
		
			
				|  |  | +library(lubridate)
 | 
	
		
			
				|  |  | +
 | 
	
		
			
				|  |  | +sales = read_csv("sales2019.csv")
 | 
	
		
			
				|  |  | +```
 | 
	
		
			
				|  |  | +
 | 
	
		
			
				|  |  | +# Data Exploration
 | 
	
		
			
				|  |  | +
 | 
	
		
			
				|  |  | +```{r}
 | 
	
		
			
				|  |  | +# How big is the dataset?
 | 
	
		
			
				|  |  | +dim(sales)
 | 
	
		
			
				|  |  | +```
 | 
	
		
			
				|  |  | +
 | 
	
		
			
				|  |  | +```{r}
 | 
	
		
			
				|  |  | +# What are the column names?
 | 
	
		
			
				|  |  | +colnames(sales)
 | 
	
		
			
				|  |  | +```
 | 
	
		
			
				|  |  | +
 | 
	
		
			
				|  |  | +The `date` column shows the date that the order of books was made. This will help us distinguish between orders that were made before and after the new program was implemented. `quantity` describes how many books were made, and `user_submitted_review` looks like it's a hand typed review of the books themselves. `customer_type` indicates whether or not the customer was an individual or a business. It seems that the company has started selling in bulk to other business too.
 | 
	
		
			
				|  |  | +
 | 
	
		
			
				|  |  | +```{r}
 | 
	
		
			
				|  |  | +# What are the types of all the columns?
 | 
	
		
			
				|  |  | +for (col in colnames(sales)) {
 | 
	
		
			
				|  |  | +  paste0(col, " : ", typeof(sales[[col]])) %>% print
 | 
	
		
			
				|  |  | +}
 | 
	
		
			
				|  |  | +```
 | 
	
		
			
				|  |  | +
 | 
	
		
			
				|  |  | +
 | 
	
		
			
				|  |  | +```{r}
 | 
	
		
			
				|  |  | +# Is there missing data anywhere?
 | 
	
		
			
				|  |  | +for (col in colnames(sales)) {
 | 
	
		
			
				|  |  | +  paste0(col, 
 | 
	
		
			
				|  |  | +         ", number of missing data rows: ", 
 | 
	
		
			
				|  |  | +         is.na(sales[[col]]) %>% sum) %>% print
 | 
	
		
			
				|  |  | +}
 | 
	
		
			
				|  |  | +```
 | 
	
		
			
				|  |  | +
 | 
	
		
			
				|  |  | +The `user_submitted_review` column has some missing data in it. We'll have to handle this later in the data cleaning, but at least we know about it ahead of time. The `total_purchased` column also has missing data, which we'll handle with imputation.
 | 
	
		
			
				|  |  | +
 | 
	
		
			
				|  |  | +# Handling Missing Data
 | 
	
		
			
				|  |  | +
 | 
	
		
			
				|  |  | +```{r}
 | 
	
		
			
				|  |  | +# Remove the rows with no user_submitted_review
 | 
	
		
			
				|  |  | +complete_sales = sales %>% 
 | 
	
		
			
				|  |  | +  filter(
 | 
	
		
			
				|  |  | +    !is.na(user_submitted_review)
 | 
	
		
			
				|  |  | +  )
 | 
	
		
			
				|  |  | +
 | 
	
		
			
				|  |  | +# Calculate the mean of the total_purchased column, without the missing values
 | 
	
		
			
				|  |  | +purchase_mean = complete_sales %>% 
 | 
	
		
			
				|  |  | +  filter(!is.na(total_purchased)) %>% 
 | 
	
		
			
				|  |  | +  pull(total_purchased) %>% 
 | 
	
		
			
				|  |  | +  mean
 | 
	
		
			
				|  |  | +
 | 
	
		
			
				|  |  | +# Assign this mean to all of the rows where total_purchased was NA
 | 
	
		
			
				|  |  | +complete_sales = complete_sales %>% 
 | 
	
		
			
				|  |  | +  mutate(
 | 
	
		
			
				|  |  | +    imputed_purchases = if_else(is.na(total_purchased), 
 | 
	
		
			
				|  |  | +                                purchase_mean,
 | 
	
		
			
				|  |  | +                                total_purchased)
 | 
	
		
			
				|  |  | +  )
 | 
	
		
			
				|  |  | +```
 | 
	
		
			
				|  |  | +
 | 
	
		
			
				|  |  | +# Processing Review Data
 | 
	
		
			
				|  |  | +
 | 
	
		
			
				|  |  | +```{r}
 | 
	
		
			
				|  |  | +complete_sales %>% pull(user_submitted_review) %>% unique
 | 
	
		
			
				|  |  | +```
 | 
	
		
			
				|  |  | +
 | 
	
		
			
				|  |  | +The reviews range from outright hate ("Hated it") to positive ("Awesome!"). We'll create a function that uses a `case_when()` function to produce the output. `case_when()` functions can be incredibly bulky in cases where there's many options, but housing it in a function to `map` can make our code cleaner.
 | 
	
		
			
				|  |  | +
 | 
	
		
			
				|  |  | +```{r}
 | 
	
		
			
				|  |  | +is_positive = function(review) {
 | 
	
		
			
				|  |  | +  review_positive = case_when(
 | 
	
		
			
				|  |  | +  str_detect(review, "Awesome") ~ TRUE,
 | 
	
		
			
				|  |  | +  str_detect(review, "OK") ~ TRUE,
 | 
	
		
			
				|  |  | +  str_detect(review, "Never") ~ TRUE,
 | 
	
		
			
				|  |  | +  str_detect(review, "a lot") ~ TRUE,
 | 
	
		
			
				|  |  | +  TRUE ~ FALSE # The review did not contain any of the above phrases
 | 
	
		
			
				|  |  | +  )
 | 
	
		
			
				|  |  | +}
 | 
	
		
			
				|  |  | +
 | 
	
		
			
				|  |  | +complete_sales = complete_sales %>% 
 | 
	
		
			
				|  |  | +  mutate(
 | 
	
		
			
				|  |  | +    is_positive = unlist(map(user_submitted_review, is_positive))
 | 
	
		
			
				|  |  | +  )
 | 
	
		
			
				|  |  | +```
 | 
	
		
			
				|  |  | +
 | 
	
		
			
				|  |  | +# Comparing Book Sales Between Pre- and Post-Program Sales
 | 
	
		
			
				|  |  | +
 | 
	
		
			
				|  |  | +```{r}
 | 
	
		
			
				|  |  | +complete_sales = complete_sales %>% 
 | 
	
		
			
				|  |  | +  mutate(
 | 
	
		
			
				|  |  | +    date_status = if_else(mdy(date) < ymd("2019/07/01"), "Pre", "Post")
 | 
	
		
			
				|  |  | +  )
 | 
	
		
			
				|  |  | +
 | 
	
		
			
				|  |  | +complete_sales %>% 
 | 
	
		
			
				|  |  | +  group_by(date_status) %>% 
 | 
	
		
			
				|  |  | +  summarize(
 | 
	
		
			
				|  |  | +    books_purchased = sum(imputed_purchases)
 | 
	
		
			
				|  |  | +  )
 | 
	
		
			
				|  |  | +```
 | 
	
		
			
				|  |  | +
 | 
	
		
			
				|  |  | +It doesn't seem that the program has increased sales. Maybe there were certain books that increased in sales?
 | 
	
		
			
				|  |  | +
 | 
	
		
			
				|  |  | +```{r}
 | 
	
		
			
				|  |  | +complete_sales %>% 
 | 
	
		
			
				|  |  | +  group_by(date_status, title) %>% 
 | 
	
		
			
				|  |  | +  summarize(
 | 
	
		
			
				|  |  | +    books_purchased = sum(imputed_purchases)
 | 
	
		
			
				|  |  | +  ) %>% 
 | 
	
		
			
				|  |  | +  arrange(title, date_status)
 | 
	
		
			
				|  |  | +```
 | 
	
		
			
				|  |  | +
 | 
	
		
			
				|  |  | +It turns out that certain books actually got more popular after the program started! R For Dummies and Secrets of R For Advanced Students got more popular.
 | 
	
		
			
				|  |  | +
 | 
	
		
			
				|  |  | +# Comparing Book Sales Within Customer Type
 | 
	
		
			
				|  |  | +
 | 
	
		
			
				|  |  | +```{r}
 | 
	
		
			
				|  |  | +complete_sales %>% 
 | 
	
		
			
				|  |  | +  group_by(date_status, customer_type) %>% 
 | 
	
		
			
				|  |  | +  summarize(
 | 
	
		
			
				|  |  | +    books_purchased = sum(imputed_purchases)
 | 
	
		
			
				|  |  | +  ) %>% 
 | 
	
		
			
				|  |  | +  arrange(customer_type, date_status)
 | 
	
		
			
				|  |  | +```
 | 
	
		
			
				|  |  | +
 | 
	
		
			
				|  |  | +Baserd on the table, it looks like businesses started purchasing more books after the program! There was actually a drop in individual sales.
 | 
	
		
			
				|  |  | +
 | 
	
		
			
				|  |  | +# Comparing Review Sentiment Between Pre- and Post-Program Sales
 | 
	
		
			
				|  |  | +
 | 
	
		
			
				|  |  | +```{r}
 | 
	
		
			
				|  |  | +complete_sales %>% 
 | 
	
		
			
				|  |  | +  group_by(date_status) %>% 
 | 
	
		
			
				|  |  | +  summarize(
 | 
	
		
			
				|  |  | +    num_positive_reviews = sum(is_positive)
 | 
	
		
			
				|  |  | +  )
 | 
	
		
			
				|  |  | +```
 | 
	
		
			
				|  |  | +
 | 
	
		
			
				|  |  | +There's slightly more reviews before the program, but this difference seems negigible.
 | 
	
		
			
				|  |  | +
 |