Is there a way to use `pivot_wider()` to summarize survey data? - r

I have some survey data, let's say it's about how often respondents tackle various daily routines:
survey <- tribble(
~Q1_toothbrush, ~Q1_bathe, ~Q1_brush_hair, ~Q1_make_bed,
"Always","Sometimes","Often","Never",
"Never","Never","Always","Sometimes",
"Often","Sometimes","Sometimes","Often",
"Sometimes","Always","Often","Never"
)
I want to arrange it into a table that shows how many people selected "Often" or "Always".
I can create a new tibble and update it, taking each question one at a time, e.g.:
habits <- tribble(
~Habit, ~Description, ~Count,
"Q1_toothbrush", "Brushes teeth for two minutes twice each day.", 0,
"Q1_bathe", "Bathes with soap and water every morning or evening", 0,
"Q1_brush_hair", "Attends to daily hair health", 0,
"Q1_make_bed", "Tidies bed covers daily", 0
)
top_two <- c("Always", "Often")
tmp <- survey %>%
filter(Q1_toothbrush %in% top_two) %>%
count() %>%
pull(n)
habits <- habits %>%
mutate(Count = ifelse(Habit == "Q1_toothbrush", tmp, Count))
knitr::kable(habits)
But I'm struggling to consolidate this into a single function.

If we need to do this for each row (that is, per respondent), an option is c_across() after rowwise():
library(dplyr) # >= 1.0.0
survey %>%
rowwise() %>%
mutate(count = sum(c_across(everything()) %in% top_two)) %>%
ungroup()
Or we can reshape to 'long' format and then count per question:
library(dplyr)
library(tidyr)
pivot_longer(survey, everything()) %>%
filter(value %in% top_two) %>%
dplyr::count(name)
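To get all the way to the table described in the question, the per-question counts can be joined back onto the descriptions. This is a minimal sketch, assuming the Habit keys in the lookup table match the survey column names exactly (so Q1_brush_hair rather than Q1_hair); the names counts and summary_tbl are made up for illustration. It uses summarise() with sum() instead of filter() plus count() so that a question nobody answered "Often" or "Always" would still appear with a count of 0.

```r
library(dplyr)
library(tidyr)

survey <- tribble(
  ~Q1_toothbrush, ~Q1_bathe, ~Q1_brush_hair, ~Q1_make_bed,
  "Always", "Sometimes", "Often", "Never",
  "Never", "Never", "Always", "Sometimes",
  "Often", "Sometimes", "Sometimes", "Often",
  "Sometimes", "Always", "Often", "Never"
)

# Lookup table of descriptions; keys assumed to equal the survey column names
habits <- tribble(
  ~Habit, ~Description,
  "Q1_toothbrush", "Brushes teeth for two minutes twice each day.",
  "Q1_bathe", "Bathes with soap and water every morning or evening",
  "Q1_brush_hair", "Attends to daily hair health",
  "Q1_make_bed", "Tidies bed covers daily"
)

top_two <- c("Always", "Often")

# One row per (respondent, question), then count top-two answers per question
counts <- survey %>%
  pivot_longer(everything(), names_to = "Habit", values_to = "answer") %>%
  group_by(Habit) %>%
  summarise(Count = sum(answer %in% top_two), .groups = "drop")

# Attach the counts to the descriptions, keeping the order of `habits`
summary_tbl <- left_join(habits, counts, by = "Habit")
summary_tbl
```

The result can be passed to knitr::kable() as in the question.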

Related

Pulling data in R into separate data frames, merging and want to make code more condensed

I'm trying to pull 2 data frames from a larger data set based off of multiple conditions, group it by a variable, and then merge the subsetted data together.
In this example, I'm pulling from a dataset with individual voters in Florida, including information on whether they voted in 2018 and voted in 2020. I'm trying to look at the % of people who voted in 2018 not 2020, and the people who voted in 2018 and 2020 and compare it side by side.
My question: do I have to create 2 data frames and then merge them to do this? Or is there a shorter/cleaner way than the code I have now?
This is the code I've been using.
d <- FL %>%
filter(voted_2018 == 1, voted_2020 == 0) %>%
group_by(county) %>%
count() %>%
ungroup() %>%
mutate(prop = n / sum(n) * 100)
e <- FL %>%
filter(voted_2018 == 1, voted_2020 == 1) %>%
group_by(county) %>%
count() %>%
ungroup() %>%
mutate(prop = n / sum(n) * 100)
f <- merge(d, e, by = "county")
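One possible way to avoid the two intermediate data frames is to count both groups in a single pipeline and spread them side by side with pivot_wider(). This is only a sketch: the small FL tibble below is invented stand-in data, and the group labels "only_2018" and "both" are made-up names. Note one difference from merge(): counties that appear in only one group are kept (with NA in the other columns) rather than dropped.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the FL voter file (one row per voter)
FL <- tibble(
  county     = c("A", "A", "A", "B", "B", "B", "B", "C"),
  voted_2018 = c(1, 1, 1, 1, 1, 1, 0, 1),
  voted_2020 = c(0, 1, 1, 0, 0, 1, 1, 1)
)

side_by_side <- FL %>%
  filter(voted_2018 == 1) %>%
  # Label the two groups being compared
  mutate(group = if_else(voted_2020 == 1, "both", "only_2018")) %>%
  count(county, group) %>%
  # Each county's share of its own group, as in the original code
  group_by(group) %>%
  mutate(prop = n / sum(n) * 100) %>%
  ungroup() %>%
  # One row per county, with n_* and prop_* columns for each group
  pivot_wider(names_from = group, values_from = c(n, prop))

side_by_side
```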

After grouping, cannot get dplyr's slice to select top 3 of each grouping

I am trying to retain only the top 3 records from each grouping based on l5_ppg_max. This code sets up the table correctly, but when I add the slice code it does not select the top 3 records of each group.
#library(reticulate)
library(tidyverse)
library(plotly)
library(janitor)
library(readxl)
library(reprex)
player_projection <- read_csv("DFF_NHL_cheatsheet.csv", col_names = TRUE)
team__reg_line <- player_projection %>%
clean_names() %>%
mutate_if(is.numeric, ~replace_na(., 0)) %>%
filter(!position == "G") %>%
filter(!reg_line == 0) %>%
select(team, reg_line, l5_ppg_max, salary) %>%
arrange(team, reg_line, desc(l5_ppg_max)) %>%
group_by(team, reg_line, salary, l5_ppg_max)
When I add this line:
slice_head(n = 3)
It still returns all records.
I also tried top_n(3), but read it was deprecated, so I stayed with the dplyr slice functions. This is quite easy to do manually in Excel, but I need to do it in R for ggplot outputs.
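A likely cause, going only by the code shown: group_by() includes salary and l5_ppg_max, so almost every row forms its own group and slice_head(n = 3) has nothing to trim. The sketch below groups by team and reg_line only and uses slice_max(), which does not depend on the rows having been sorted first. The players tibble is invented data, not the DFF_NHL_cheatsheet.csv file.

```r
library(dplyr)

# Hypothetical data standing in for the cleaned projection table
players <- tibble(
  team       = rep(c("BOS", "NYR"), each = 4),
  reg_line   = 1,
  l5_ppg_max = c(5, 3, 9, 1, 2, 8, 4, 6),
  salary     = c(5000, 4000, 9000, 3000, 3500, 8000, 4500, 6000)
)

top3 <- players %>%
  # Group only by the columns that define a "grouping"
  group_by(team, reg_line) %>%
  # Keep the 3 highest l5_ppg_max rows per group; ties broken by row order
  slice_max(l5_ppg_max, n = 3, with_ties = FALSE) %>%
  ungroup()

top3
```

With the original pipeline, keeping arrange() and changing the last step to group_by(team, reg_line) before slice_head(n = 3) should give the same rows, since the data are already sorted by descending l5_ppg_max within each team and line.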

Writing a for loop to count similar keys in two data frames

I have panel data where I split the whole data set into multiple data frames by year and match unique keys across years. For example, if you have 6,000 observations in 2000 and 7000 observations in, I'm trying to match the overlap between each year for every year from 2000 to 2017.
I have a brute forced solution that's about 350 lines of copy and pasted code, but I'm looking for a more efficient and elegant solution using loops.
I'm working with for loops and looking into map() functions at the moment, but I haven't found a solution. I'm using R4DS.
#1989
b1989 <- b %>% filter(year == 1989) %>% select(key, V7, z9, z11, z13, z15)
a1990 <- a %>% select(key,year) %>% filter(year == 1990) %>% distinct()
br1989 <- inner_join(b1989, a1990, by = "key")
#1990
b1990 <- b %>% filter(year == 1990) %>% select(key, V7, z9, z11, z13, z15)
a1991 <- a %>% select(key,year) %>% filter(year == 1991) %>% distinct()
br1990 <- inner_join(b1990, a1991, by = "key")
#1991
b1991 <- b %>% filter(year == 1991) %>% select(key, V7, z9, z11, z13, z15)
a1992 <- a %>% select(key,year) %>% filter(year == 1992) %>% distinct()
br1991 <- inner_join(b1991, a1992, by = "key")
busrescount_t1 <- c(nrow(br1989),nrow(br1990),nrow(br1991))
busrescount_t1
[1] 4366 4956 4768
It currently works, but is simply bad code and cumbersome. Also, doing it at scale for 2-year, 3-year, 4-year differences is a nightmare and will be 1000+ lines of copy/pasted code.
The goal is to have a loop that produces a vector of these matches that can be placed into a data frame. I'm trying to do this for 20+ years.
How about something like this? (I'd love to be able to verify this works using a sample of your data.)
In theory, we should be able to join b to a version of a where the year is shifted back by one, so that a's record for the following year lines up with b's year. If a row in b has a match in a with the same key in the following year, the join completes and a_match is TRUE. Note that year has to be kept in b for the join to work.
b %>%
select(key, year, V7, z9, z11, z13, z15) %>%
left_join(a %>% distinct(key, year) %>%
mutate(year = year - 1, a_match = TRUE),
by = c("key", "year")) %>%
filter(!is.na(a_match))
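To get the vector of match counts per year, and to reuse the same logic for 2-year, 3-year, or longer gaps, the join can be wrapped in a small function. This is a sketch on invented toy data; count_matches() and its lag argument are hypothetical names, and it assumes a and b both have key and year columns. It subtracts lag from a's year so that a's record from lag years later lines up with b's year, and uses semi_join() so each row of b is counted at most once.

```r
library(dplyr)

# Hypothetical toy panels
b <- tibble(key  = c(1, 2, 3, 1, 2),
            year = c(1989, 1989, 1989, 1990, 1990),
            V7   = 1:5)
a <- tibble(key  = c(1, 2, 2, 1, 3),
            year = c(1990, 1990, 1990, 1991, 1992))

count_matches <- function(b, a, lag = 1) {
  # Move each record in `a` back by `lag` years so it aligns with `b`
  a_shifted <- a %>%
    distinct(key, year) %>%
    mutate(year = year - lag)

  b %>%
    # Keep rows of `b` whose key also appears in `a` `lag` years later
    semi_join(a_shifted, by = c("key", "year")) %>%
    count(year, name = "matches")
}

one_year <- count_matches(b, a, lag = 1)

# All gaps at once, stacked into one data frame
all_lags <- bind_rows(lapply(1:2, function(k) {
  count_matches(b, a, lag = k) %>% mutate(lag = k)
}))
all_lags
```

Years with no matches at all do not appear in the output, because count() only reports groups that exist after the join.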

transform() to add rows with dplyr()

I've got a data frame (df) with two variables, site and purchase.
I'd like to use dplyr() to group my data by site and purchase, and get the counts and percentages for the grouped data. I'd however also like the tibble to feature rows called ALLSITES, representing the data of all the sites grouped by purchase, so that I end up with a tibble looking similar to dfgoal.
The problem's that my current code doesn't get me the ALLSITES rows. I've tried adding a base R function into dplyr(), which doesn't work.
Any help would be much appreciated.
Starting point (df):
df <- data.frame(site=c("LON","MAD","PAR","MAD","PAR","MAD","PAR","MAD","PAR","LON","MAD","LON","MAD","MAD","MAD"),purchase=c("a1","a2","a1","a1","a1","a1","a1","a1","a1","a2","a1","a2","a1","a2","a1"))
Desired outcome:
dfgoal <- data.frame(site=c("LON","LON","MAD","MAD","PAR","ALLSITES","ALLSITES"),purchase=c("a1","a2","a1","a2","a1","a1","a2"),bin=c(1,2,6,2,4,11,4),bin_per=c(33.33333,66.66667,75.00000,25.00000,100.00000,73.33333,26.66667))
Current code:
library(dplyr)
df %>%
group_by(site, purchase) %>%
summarize(bin = sum(purchase==purchase)) %>%
group_by(site) %>%
mutate(bin_per = (bin/sum(bin)*100))
df %>%
rbind(df, transform(df, site = "ALLSITES") %>%
group_by(site, purchase) %>%
summarize(bin = sum(purchase==purchase)) %>%
group_by(site) %>%
mutate(bin_per = (bin/sum(bin)*100))
We can start from the output of the first code block, saved as df1. Group it by a constant 'site' of 'ALLSITES' together with 'purchase', sum 'bin', recompute 'bin_per', and then row-bind the result to df1 with bind_rows:
df1 %>%
ungroup() %>%
group_by(site = 'ALLSITES', purchase) %>%
summarise(bin = sum(bin)) %>%
ungroup() %>%
mutate(bin_per = 100*(bin/sum(bin))) %>%
bind_rows(df1, .)
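For reference, here is a self-contained version of the same idea that runs from the raw df in the question. It is a sketch rather than the only way to do it: it uses count() in place of sum(purchase==purchase), and the names by_site, all_sites, and result are made up for illustration.

```r
library(dplyr)

df <- data.frame(
  site = c("LON","MAD","PAR","MAD","PAR","MAD","PAR","MAD","PAR","LON",
           "MAD","LON","MAD","MAD","MAD"),
  purchase = c("a1","a2","a1","a1","a1","a1","a1","a1","a1","a2",
               "a1","a2","a1","a2","a1"),
  stringsAsFactors = FALSE
)

# Counts and within-site percentages
by_site <- df %>%
  count(site, purchase, name = "bin") %>%
  group_by(site) %>%
  mutate(bin_per = bin / sum(bin) * 100) %>%
  ungroup()

# The same summary over all sites combined
all_sites <- df %>%
  count(purchase, name = "bin") %>%
  mutate(site = "ALLSITES", bin_per = bin / sum(bin) * 100)

# Stack the two; bind_rows() matches columns by name
result <- bind_rows(by_site, all_sites)
result
```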

Creating Iterated Variables in R

I've looked around and seen some questions similar to mine, but none directly on point. I have a series of presidential election results for various states from 1940 to 2012. They are labeled, in sequence, r1940, d1940, r1944, d1944, r1948, d1948, and so forth.
I want to create a series of two-party vote variables, which are calculated by dividing the number of Democratic votes by the number of republican and democratic votes. So in a df called votes:
d2pv1940 <- (votes$d1940/(votes$d1940+votes$r1940))
Obviously I can do this 18 more times by hand, e.g., d2pv1944<-(votes$d1944/(votes$d1944+votes$r1944)) but obviously that is time consuming and invites errors. I've seen some solutions to similar problems using lapply or for loops, but I'm not really sure how I'd iterate the four variable names in the commands above.
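Try something like this (it assumes the r and d columns appear in the same year order, and that no other column names start with "r" or "d"):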
namest=colnames(votes)
rep=which(substr(namest, 1,1)=="r")
dem=which(substr(namest, 1,1)=="d")
res=votes[,dem]/(votes[,dem]+votes[,rep])
colnames(res)=paste("d2pv",substring(colnames(votes[,dem]),2),sep="")
res
Here's a tidy way to do it:
library(dplyr)
library(tidyr) # gather(), extract(), spread()
library(rex)
data =
c(1, 2, 2, 1) %>%
setNames(
c("r1940", "d1940", "r1944", "d1944") ) %>%
as.list %>%
as.data.frame
regex_1 =
rex(capture(letter),
capture(digits) )
abbreviations = tibble(
abbreviation = c("d", "r"),
party = c("democrat", "republican") )
data %>%
gather(variable, value) %>%
extract(variable,
c("abbreviation", "year"),
regex_1) %>%
left_join(abbreviations) %>%
group_by(year) %>%
mutate(total = sum(value),
proportion = value / total ) %>%
select(-abbreviation, -value) %>%
spread(party, proportion)
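With current tidyr, the same reshaping can be written with pivot_longer() and pivot_wider(), which have superseded gather() and spread(). This is a sketch on invented data: the state column and the vote counts are assumptions, since the question does not show the layout of votes. It produces one d2pv column per election year, named as in the question.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: one row per state, r<year>/d<year> vote counts
votes <- tibble(
  state = c("X", "Y"),
  r1940 = c(40, 30), d1940 = c(60, 70),
  r1944 = c(50, 20), d1944 = c(50, 60)
)

two_party <- votes %>%
  # Split names like "d1940" into party = "d" and year = "1940"
  pivot_longer(-state,
               names_to = c("party", "year"),
               names_pattern = "([rd])(\\d{4})",
               values_to = "n") %>%
  # One row per state and year, with columns r and d
  pivot_wider(names_from = party, values_from = n) %>%
  mutate(d2pv = d / (d + r)) %>%
  select(state, year, d2pv) %>%
  # Back to wide: d2pv1940, d2pv1944, ...
  pivot_wider(names_from = year, values_from = d2pv, names_prefix = "d2pv")

two_party
```

Stopping before the final pivot_wider() keeps the data in long form, which is usually easier to plot or model.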
