6 years ago · ee27b5fee5
--- a/Mission498Solutions.Rmd
+++ b/Mission498Solutions.Rmd
@@ -0,0 +1,98 @@
 
				+---
			
 
				+title: "Guided Project Solutions: Creating An Efficient Data Analysis Workflow"
			
 
				+output: html_document
			
 
				+---
			
 
				+
			
 
				+```{r}
			
 
				+library(tidyverse)
			
 
				+reviews = read_csv("book_reviews.csv")
			
 
				+```
			
 
				+
			
 
				+
			
 
				+# Getting Familiar With The Data
			
 
				+
			
 
				+```{r}
			
 
				+# How big is the dataset?
			
 
				+dim(reviews)
			
 
				+
			
 
				+# What are the column names?
			
 
				+colnames(reviews)
			
 
				+
			
 
				+# What are the column types?
			
 
				+for (c in colnames(reviews)) {
			
 
				+  print(typeof(reviews[[c]]))
			
 
				+}
			
 
				+```
			
 
				+
			
 
				+```{r}
			
 
				+# What are the unique values in each column?
			
 
				+for (c in colnames(reviews)) {
			
 
				+  print("Unique values in the column:")
			
 
				+  print(c)
			
 
				+  print(unique(reviews[[c]]))
			
 
				+  print("")
			
 
				+}
			
 
				+```
			
 
				+
			
 
				+All of the columns seem to contain strings. The `review` column represents the score that the reviewer gave the book. The `book` column indicates which textbook was purchased. The `state` column represents the state where the book was purchased. The `price` column represents the price the book was purchased for.
			
 
				+
			
 
				+# Handling Missing Data
			
 
				+
			
 
				+From the previous exercise, it's apparent that the `review` column contains some `NA` values. We don't want any missing values in the dataset, so we need to remove the rows that contain them.
			
 
				+
			
 
				+```{r}
			
 
				+complete_reviews = reviews %>% 
			
 
				+  filter(!is.na(review))
			
 
				+
			
 
				+dim(complete_reviews)
			
 
				+```
			
 
				+
			
 
				+About 200 reviews were removed from the dataset, which is about 10% of the original data. That's a small enough share that we feel comfortable continuing with the analysis.
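				+
				+The count and share of removed rows can be checked directly instead of being read off two `dim()` outputs. The chunk below is a minimal sketch: `removal_summary()` is a hypothetical helper that is not part of the original solution, and the toy data frames exist only to make the chunk runnable on its own. With the real data the call would be `removal_summary(reviews, complete_reviews)`.
				+
				+```{r}
				+# Hypothetical helper: reports how many rows a filtering step dropped
				+# and what share of the original row count that represents.
				+removal_summary = function(before, after) {
				+  removed = nrow(before) - nrow(after)
				+  c(removed = removed, share = removed / nrow(before))
				+}
				+
				+# Toy data for illustration only (not taken from book_reviews.csv)
				+toy_before = data.frame(review = c("Good", NA, "Poor", "Great", NA))
				+toy_after = toy_before[!is.na(toy_before$review), , drop = FALSE]
				+
				+removal_summary(toy_before, toy_after)
				+```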
			
 
				+
			
 
				+# Dealing With Inconsistent Labels
			
 
				+
			
 
				+The `state` column mixes full state names with two-letter postal codes. We'll standardize on the postal codes since they're shorter.
			
 
				+
			
 
				+```{r}
			
 
				+complete_reviews = complete_reviews %>% 
			
 
				+  mutate(
			
 
				+    state = case_when(
			
 
				+      state == "California" ~ "CA",
			
 
				+      state == "New York" ~ "NY",
			
 
				+      state == "Texas" ~ "TX",
			
 
				+      state == "Florida" ~ "FL",
			
 
				+      TRUE ~ state # ignore cases where it's already postal code
			
 
				+    )
			
 
				+  )
			
 
				+```
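				+
				+After recoding, it can be worth confirming that no full state names remain. The chunk below is a sketch under assumptions: `non_postal_states()` is a hypothetical helper, and it treats any label that is not exactly two uppercase letters as inconsistent. With the real data the call would be `non_postal_states(complete_reviews$state)`, which should return an empty vector if the recoding covered every case.
				+
				+```{r}
				+# Hypothetical helper: returns the distinct non-missing labels that do
				+# not look like a two-letter uppercase postal code.
				+non_postal_states = function(x) {
				+  unique(x[!is.na(x) & !grepl("^[A-Z]{2}$", x)])
				+}
				+
				+# Toy labels for illustration only
				+toy_states = c("CA", "Texas", "NY", "Florida", "Texas")
				+non_postal_states(toy_states)
				+```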
			
 
				+
			
 
				+# Transforming the Review Data
			
 
				+
			
 
				+```{r}
			
 
				+complete_reviews = complete_reviews %>% 
			
 
				+  mutate(
			
 
				+    review_num = case_when(
			
 
				+      review == "Poor" ~ 1,
			
 
				+      review == "Fair" ~ 2,
			
 
				+      review == "Good" ~ 3,
			
 
				+      review == "Great" ~ 4,
			
 
				+      review == "Excellent" ~ 5
			
 
				+    ),
			
 
				+    is_high_review = review_num >= 4
			
 
				+  )
			
 
				+```
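				+
				+Because `case_when()` returns `NA` for any value that matches none of its conditions, a misspelled or unexpected review label would silently become a missing score. The chunk below is a small sketch of how to list such labels. The toy data, including the label "Okay", is made up for illustration. On the real data the same `filter()` and `distinct()` steps would be applied to `complete_reviews`.
				+
				+```{r}
				+library(dplyr)
				+
				+# Toy data for illustration only; "Okay" is deliberately not mapped
				+toy_reviews = data.frame(review = c("Poor", "Great", "Excellent", "Okay"))
				+
				+toy_reviews = toy_reviews %>%
				+  mutate(
				+    review_num = case_when(
				+      review == "Poor" ~ 1,
				+      review == "Great" ~ 4,
				+      review == "Excellent" ~ 5
				+    )
				+  )
				+
				+# Labels that were not matched by any condition
				+unmapped = toy_reviews %>%
				+  filter(is.na(review_num)) %>%
				+  distinct(review)
				+
				+unmapped
				+```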
			
 
				+
			
 
				+# Analyzing The Data
			
 
				+
			
 
				+We'll define the most profitable book in terms of how many copies were sold.
			
 
				+
			
 
				+```{r}
			
 
				+complete_reviews %>% 
			
 
				+  group_by(book) %>% 
			
 
				+  summarize(
			
 
				+    purchased = n()
			
 
				+  ) %>% 
			
 
				+  arrange(desc(purchased))
			
 
				+```
			
 
				+
			
 
				+The books are relatively well matched in terms of purchases, but "Fundamentals of R For Beginners" has a slight edge over the others.