
Mission498Solutions.Rmd 2.5 KB

---
title: "Guided Project Solutions: Creating An Efficient Data Analysis Workflow"
output: html_document
---
```{r}
library(tidyverse)
reviews = read_csv("book_reviews.csv")
```
# Getting Familiar With The Data
```{r}
# How big is the dataset?
dim(reviews)
# What are the column names?
colnames(reviews)
# What are the column types?
for (col in colnames(reviews)) {
  print(typeof(reviews[[col]]))
}
```
```{r}
# What are the unique values in each column?
for (col in colnames(reviews)) {
  print("Unique values in the column:")
  print(col)
  print(unique(reviews[[col]]))
  print("")
}
```
All of the columns seem to contain strings. The `review` column records the score that the reviewer gave the book. The `book` column indicates which textbook was purchased. The `state` column records the state where the book was purchased. The `price` column records the price paid for the book.
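As a compact alternative to the loops above, the type and the number of distinct values of each column can be collected into one table. This is only a sketch: `toy_reviews` is a made-up stand-in for `book_reviews.csv`, and its values and column types are assumptions that may differ from the real file.

```{r}
library(dplyr)
library(tibble)

# Hypothetical toy data standing in for book_reviews.csv
toy_reviews <- tibble(
  book   = c("R Made Easy", "R For Dummies", "R Made Easy"),
  review = c("Good", NA, "Poor"),
  state  = c("TX", "Texas", "NY"),
  price  = c(19.99, 15.99, 19.99)
)

# One row per column: its storage type and its number of distinct values
# (NA counts as a distinct value here)
column_summary <- tibble(
  column     = colnames(toy_reviews),
  type       = vapply(toy_reviews, typeof, character(1)),
  n_distinct = vapply(toy_reviews, function(x) length(unique(x)), integer(1))
)
column_summary
```

Swapping `toy_reviews` for `reviews` would give the same overview for the real data.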
# Handling Missing Data
From the previous exercise, it's apparent that the `review` column contains some `NA` values. We don't want any missing values in the dataset, so we'll remove the rows where `review` is missing.
```{r}
complete_reviews = reviews %>%
  filter(!is.na(review))
dim(complete_reviews)
```
About 200 reviews were removed from the dataset, roughly 10% of the original rows. That's a small enough share that we feel comfortable continuing with the analysis.
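The number and share of removed rows can be computed directly instead of being read off two `dim()` calls. The sketch below uses a hypothetical `toy_reviews` table so that it runs on its own; with the real data, `reviews` and `complete_reviews` would take the place of the toy tables.

```{r}
library(dplyr)
library(tibble)

# Hypothetical toy data; two of the five reviews are missing
toy_reviews <- tibble(
  book   = c("A", "B", "A", "C", "B"),
  review = c("Good", NA, "Poor", "Great", NA)
)

toy_complete <- toy_reviews %>%
  filter(!is.na(review))

# Rows dropped by the filter, as a count and as a percentage of the original
n_removed   <- nrow(toy_reviews) - nrow(toy_complete)
pct_removed <- 100 * n_removed / nrow(toy_reviews)
n_removed
pct_removed
```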
# Dealing With Inconsistent Labels
The `state` column mixes full state names with two-letter postal codes. We'll standardize on the postal codes since they're shorter.
```{r}
complete_reviews = complete_reviews %>%
  mutate(
    state = case_when(
      state == "California" ~ "CA",
      state == "New York" ~ "NY",
      state == "Texas" ~ "TX",
      state == "Florida" ~ "FL",
      TRUE ~ state # ignore cases where it's already postal code
    )
  )
```
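The same relabeling can also be written with a named lookup vector, followed by a quick `count()` to confirm that only postal codes remain. This is an alternative sketch rather than the method used above; `toy_states` and `state_codes` are hypothetical names introduced for illustration.

```{r}
library(dplyr)
library(tibble)

# Hypothetical toy data mixing full names and postal codes
toy_states <- tibble(
  state = c("California", "CA", "New York", "NY", "Texas", "TX", "Florida", "FL")
)

# Full state name -> postal code
state_codes <- c(
  "California" = "CA",
  "New York"   = "NY",
  "Texas"      = "TX",
  "Florida"    = "FL"
)

toy_states_clean <- toy_states %>%
  mutate(
    # Look up each label; labels with no match (already postal codes) give NA
    # and fall back to their original value via coalesce()
    state = coalesce(unname(state_codes[state]), state)
  )

# Each postal code should now appear twice, with no full names left over
state_counts <- toy_states_clean %>%
  count(state)
state_counts
```

Adding another state then only means adding one entry to `state_codes`.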
# Transforming the Review Data
```{r}
complete_reviews = complete_reviews %>%
  mutate(
    # Map the text ratings onto a 1-5 numeric scale
    review_num = case_when(
      review == "Poor" ~ 1,
      review == "Fair" ~ 2,
      review == "Good" ~ 3,
      review == "Great" ~ 4,
      review == "Excellent" ~ 5
    ),
    # Flag reviews rated "Great" or "Excellent"
    is_high_review = review_num >= 4
  )
```
# Analyzing The Data
We'll define the most profitable book in terms of how many copies were sold.
```{r}
complete_reviews %>%
  group_by(book) %>%
  summarize(
    purchased = n()
  ) %>%
  arrange(desc(purchased))
```
The books are fairly evenly matched in terms of purchases, but "Fundamentals of R For Beginners" has a slight edge over the others.
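Counting purchases is only one way to define "most profitable". As a hedged sketch of an alternative, the summary below also totals revenue per book. It assumes `price` is stored as a number (a string column would need converting first), and `toy_sales` is hypothetical data chosen so that the two definitions disagree.

```{r}
library(dplyr)
library(tibble)

# Hypothetical toy data: Book A sells more copies, Book B earns more in total
toy_sales <- tibble(
  book  = c("Book A", "Book A", "Book A", "Book B", "Book B"),
  price = c(10, 10, 10, 40, 40)
)

book_summary <- toy_sales %>%
  group_by(book) %>%
  summarize(
    purchased = n(),        # number of copies sold
    revenue   = sum(price)  # total money taken in
  ) %>%
  arrange(desc(revenue))
book_summary
```

Sorting by `purchased` instead of `revenue` reverses the ranking in this toy example, which is why the definition is worth stating up front.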