---
title: "Solutions for Guided Project: Exploring NYC Schools Survey Data"
author: "Rose Martin"
date: "January 22, 2019"
output: html_document
---

**Here are suggested solutions to the questions in the Data Cleaning With R Guided Project: Exploring NYC Schools Survey Data.**

Load the packages you'll need for your analysis.

```{r}
library(readr)
library(dplyr)
library(stringr)
library(purrr)
library(tidyr)
library(ggplot2)
```

Import the data into R.

```{r}
combined <- read_csv("combined.csv")
survey <- read_tsv("survey_all.txt")
survey_d75 <- read_tsv("survey_d75.txt")
```

Filter the `survey` data to include only high schools, and select the columns needed for analysis based on the data dictionary.

```{r}
survey_select <- survey %>%
  filter(schooltype == "High School") %>%
  select(dbn:aca_tot_11)
```
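
As a small illustration of the filter-then-select step, the sketch below uses a hypothetical toy tibble (`toy_survey`, with made-up values and an invented `extra_col`) rather than the real survey file. It shows that a range such as `dbn:aca_tot_11` keeps both endpoints and every column positioned between them.

```{r}
library(dplyr)

# Hypothetical toy data; the column names only mimic the survey layout
toy_survey <- tibble(
  dbn = c("01M001", "01M002", "01M003"),
  schooltype = c("High School", "Elementary School", "High School"),
  saf_p_11 = c(7.8, 8.1, 6.9),
  aca_tot_11 = c(7.5, 7.9, 7.1),
  extra_col = c("a", "b", "c")
)

toy_select <- toy_survey %>%
  filter(schooltype == "High School") %>%  # keep high school rows only
  select(dbn:aca_tot_11)                   # keep dbn, aca_tot_11, and the columns between them

toy_select
```

Because the range is positional, it depends on the column order in the file, so it is worth checking the result with `names()` after selecting.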

Select the columns needed for analysis from `survey_d75`.

```{r}
survey_d75_select <- survey_d75 %>%
  select(dbn:aca_tot_11)
```

Combine the `survey` and `survey_d75` data frames.

```{r}
survey_total <- survey_select %>%
  bind_rows(survey_d75_select)
```

Rename the `survey_total` variable `dbn` to `DBN` so it can be used as the key to join with the `combined` data frame.

```{r}
survey_total <- survey_total %>%
  rename(DBN = dbn)
```

Join the `combined` and `survey_total` data frames. `left_join()` keeps every school in `combined` and attaches survey data only where the `DBN` matches, so survey rows for schools that are not in `combined` are dropped.

```{r}
combined_survey <- combined %>%
  left_join(survey_total, by = "DBN")
```

Create a correlation matrix to look for interesting relationships between pairs of variables in `combined_survey`, and convert it to a tibble so it's easier to work with using tidyverse tools.

```{r}
# Correlations between average SAT score and the survey scores
cor_mat <- combined_survey %>%
  select(avg_sat_score, saf_p_11:aca_tot_11) %>%
  cor(use = "pairwise.complete.obs")

# Keep the matrix row names as a regular column called `variable`
cor_tib <- cor_mat %>%
  as_tibble(rownames = "variable")
```

Look for correlations of other variables with `avg_sat_score` that are greater than 0.25 or less than -0.25 (treated here as strong correlations).

```{r}
strong_cors <- cor_tib %>%
  select(variable, avg_sat_score) %>%
  filter(avg_sat_score > 0.25 | avg_sat_score < -0.25)
```
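
Make scatter plots of those variables with `avg_sat_score` to examine relationships more closely.

```{r}
# Scatter plot of two columns of `combined_survey`, given their names as strings
create_scatter <- function(x, y) {
  ggplot(data = combined_survey) +
    # `.data[[...]]` looks columns up by name; it replaces the deprecated aes_string()
    aes(x = .data[[x]], y = .data[[y]]) +
    geom_point(alpha = 0.3) +
    labs(x = x, y = y) +
    theme(panel.background = element_rect(fill = "white"))
}

# Row 1 of `strong_cors` is avg_sat_score itself (correlation of 1), so start from row 2
x_var <- strong_cors$variable[2:5]
y_var <- "avg_sat_score"

map2(x_var, y_var, create_scatter)
```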

Reshape the data so that you can investigate differences in student, parent, and teacher responses to survey questions.

```{r}
# Older equivalent using gather():
# combined_survey_gather <- combined_survey %>%
#   gather(key = "survey_question", value = score, saf_p_11:aca_tot_11)

combined_survey_gather <- combined_survey %>%
  pivot_longer(cols = saf_p_11:aca_tot_11,
               names_to = "survey_question",
               values_to = "score")
```

Use `str_sub()` to create new variables, `response_type` and `question`, from the `survey_question` variable.

```{r}
combined_survey_gather <- combined_survey_gather %>%
  mutate(response_type = str_sub(survey_question, 4, 6)) %>%  # e.g. "_p_" from "saf_p_11"
  mutate(question = str_sub(survey_question, 1, 3))           # e.g. "saf" from "saf_p_11"
```

Replace the `response_type` variable values with the names "parent", "teacher", "student", and "total" using the `if_else()` function.

```{r}
combined_survey_gather <- combined_survey_gather %>%
  mutate(response_type = if_else(response_type == "_p_", "parent",
                         if_else(response_type == "_t_", "teacher",
                         if_else(response_type == "_s_", "student",
                         # any unmatched code becomes a missing value
                         if_else(response_type == "_to", "total", NA_character_)))))
```

Make a boxplot to see if there appear to be differences in how the three groups of responders (parents, students, and teachers) answered the four questions.

```{r}
combined_survey_gather %>%
  filter(response_type != "total") %>%
  ggplot() +
  aes(x = question, y = score, fill = response_type) +
  geom_boxplot()
```
  105. ```