I have survey data in R that looks like this: I presented people with two groups of actions, High and Low, and asked them to rank all of the actions together. Each group contains unique actions, marked by a letter (6 actions in total).
id A_High B_High C_High D_Low E_Low F_Low
001 5 2 1 6 4 3
002 6 4 3 5 2 1
003 3 1 6 2 4 5
004 6 5 2 1 3 4
I need a new df that looks like the one below, where each High action is assigned a new numeric rank (between 0 and 3) corresponding to the number of Low action items that were ranked below that High action.
For example, the person with id 001 ranked A_High at number 5, B_High at 2, and C_High at 1. A_High's new rank would be 1 (since only one Low action, D_Low, is ranked below A_High), B_High's new rank would be 3 (since all 3 Low actions were ranked below B_High), and C_High's new rank would also be 3 (since all 3 Low actions were ranked below C_High).
id A_High_rank B_High_rank C_High_rank
001 1 3 3
002 0 1 1
003 2 3 0
004 0 0 2
I have a sense that this can be done with if/else statements, but I suspect there is a far more efficient way of achieving this with the tidyverse. In the real dataset, I have 1000+ rows and 12 actions (6 High and 6 Low). I would appreciate any help on this.
Thanks!
Data:
"id A_High B_High C_High D_Low E_Low F_Low
001 5 2 1 6 4 3
002 6 4 3 5 2 1
003 3 1 6 2 4 5
004 6 5 2 1 3 4"
A base R option would be to loop over the 'High' columns, get the rowSums of the logical matrix created by checking whether each is less than the 'Low' columns, and rename the output columns by appending _rank as a suffix:
# for each High column, count how many Low values exceed it (i.e. rank below it)
out <- cbind(df1[1], sapply(df1[2:4], function(x)
  rowSums(x < df1[endsWith(names(df1), 'Low')])))
# append _rank to the new column names, leaving id untouched
names(out)[-1] <- paste0(names(out)[-1], "_rank")
-output
out
# id A_High_rank B_High_rank C_High_rank
#1 1 1 3 3
#2 2 0 1 1
#3 3 2 3 0
#4 4 0 0 2
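To see what rowSums() is counting, here is the logical matrix for A_High alone (a check on the example data); its row sums are exactly the A_High_rank values:
df1$A_High < df1[endsWith(names(df1), 'Low')]
#   D_Low E_Low F_Low
# 1  TRUE FALSE FALSE
# 2 FALSE FALSE FALSE
# 3 FALSE  TRUE  TRUE
# 4 FALSE FALSE FALSE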
Or using dplyr
library(dplyr)
df1 %>%
  transmute(id, across(ends_with('High'),
                       ~ rowSums(. < select(df1, ends_with('Low'))),
                       .names = '{.col}_rank'))
# id A_High_rank B_High_rank C_High_rank
#1 1 1 3 3
#2 2 0 1 1
#3 3 2 3 0
#4 4 0 0 2
data
df1 <- structure(list(id = 1:4, A_High = c(5L, 6L, 3L, 6L), B_High = c(2L,
4L, 1L, 5L), C_High = c(1L, 3L, 6L, 2L), D_Low = c(6L, 5L, 2L,
1L), E_Low = c(4L, 2L, 4L, 3L), F_Low = c(3L, 1L, 5L, 4L)),
class = "data.frame", row.names = c(NA,
-4L))
After much suffering, this is the tidyverse solution I came up with. This was fun!
library(tidyverse)
data %>%
  pivot_longer(cols = ends_with("_High"), names_to = "High Variables", values_to = "High") %>%
  pivot_longer(cols = ends_with("_Low"), names_to = "Low Variables", values_to = "Low") %>%
  filter(High < Low) %>%
  group_by(`High Variables`, id) %>%
  summarise(Count = n()) %>%
  pivot_wider(names_from = `High Variables`, values_from = Count) %>%
  arrange(id)
Translation:
The first two lines create two pairs of columns and leave id untouched. Each pair has two columns: one holding the original column names and the other holding the values. Each pair represents either High or Low.
Then I filtered the rows, keeping only those where Low was greater than High. Then I counted how many rows were left for each id and pivoted back to the wide format.
Now I just have to figure out how to turn those NAs into 0s.
Here's the output:
> data %>%
+ pivot_longer(cols = ends_with("_High"), names_to = "High Variables", values_to = "High") %>%
+ pivot_longer(cols = ends_with("_Low"), names_to = "Low Variables", values_to = "Low") %>%
+ filter(High < Low) %>%
+ group_by(`High Variables`, `id`) %>%
+ summarise(Count = n()) %>%
+ pivot_wider(names_from = `High Variables`, values_from = Count) %>%
+ arrange(id)
`summarise()` regrouping output by 'High Variables' (override with `.groups` argument)
# A tibble: 4 x 4
id A_High B_High C_High
<int> <int> <int> <int>
1 1 1 3 3
2 2 NA 1 1
3 3 2 3 NA
4 4 NA NA 2
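Update: pivot_wider() can fill those NAs directly through its values_fill argument (available in tidyr >= 1.0.0), so the last steps become (a sketch; it still assumes every id keeps at least one row after the filter):
data %>%
  pivot_longer(cols = ends_with("_High"), names_to = "High Variables", values_to = "High") %>%
  pivot_longer(cols = ends_with("_Low"), names_to = "Low Variables", values_to = "Low") %>%
  filter(High < Low) %>%
  group_by(`High Variables`, id) %>%
  summarise(Count = n(), .groups = "drop") %>%
  pivot_wider(names_from = `High Variables`, values_from = Count, values_fill = 0) %>%
  arrange(id)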
I have data containing names within an ID, along with a number of associated values. It looks something like this:
structure(list(id = c("a", "a", "b", "b"), name = c("bob", "jane",
"mark", "brittney"), number = c(1L, 2L, 1L, 2L), value = c(1L,
2L, 1L, 2L)), class = "data.frame", row.names = c(NA, -4L))
# id name number value
# 1 a bob 1 1
# 2 a jane 2 2
# 3 b mark 1 1
# 4 b brittney 2 2
I would like to create all combinations of name within each id, regardless of how many there are, paste them together separated by commas, and sum their number and value. The desired output from the example above is:
structure(list(id = c("a", "a", "a", "b", "b", "b"), name = c("bob",
"jane", "bob, jane", "mark", "brittney", "mark, brittney"), number = c(1L,
2L, 3L, 1L, 2L, 3L), value = c(1L, 2L, 3L, 1L, 2L, 3L)), class = "data.frame", row.names = c(NA, -6L))
# id name number value
# 1 a bob 1 1
# 2 a jane 2 2
# 3 a bob, jane 3 3
# 4 b mark 1 1
# 5 b brittney 2 2
# 6 b mark, brittney 3 3
Thanks all!
You could use group_modify() + add_row():
library(dplyr)
df %>%
  group_by(id) %>%
  group_modify(~ .x %>%
    summarise(name = toString(name), across(c(number, value), sum)) %>%
    add_row(.x, .)) %>%
  ungroup()
# # A tibble: 6 × 4
# id name number value
# <chr> <chr> <int> <int>
# 1 a bob 1 1
# 2 a jane 2 2
# 3 a bob, jane 3 3
# 4 b mark 1 1
# 5 b brittney 2 2
# 6 b mark, brittney 3 3
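If you are on dplyr >= 1.1.0, a sketch of the same idea without group_modify(): summarise per id with the .by argument, then bind the summary rows back onto the original rows and sort.
df %>%
  summarise(name = toString(name), across(c(number, value), sum), .by = id) %>%
  bind_rows(df, .) %>%  # original rows first, then the per-id summary rows
  arrange(id)           # stable sort places each summary row after its group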
You can create pairwise indices using combn() and expand the data frame with these using slice(). Then just group by these row pairs and summarise. I'm assuming you want pairwise combinations, but this can be adapted for larger sets if needed. Some code to handle groups of fewer than 2 rows is included but can be removed if these don't exist in your data.
library(dplyr)
library(purrr)
df1 %>%
  group_by(id) %>%
  slice(c(combn(seq(n()), min(n(), 2)))) %>%
  mutate(id2 = (row_number() - 1) %/% 2) %>%
  group_by(id, id2) %>%
  summarise(name = toString(name),
            across(where(is.numeric), sum), .groups = "drop") %>%
  select(-id2) %>%
  bind_rows(df1 %>%
              group_by(id) %>%
              filter(n() > 1), .) %>%
  arrange(id) %>%
  ungroup()
# A tibble: 6 × 4
id name number value
<chr> <chr> <int> <int>
1 a bob 1 1
2 a jane 2 2
3 a bob, jane 3 3
4 b mark 1 1
5 b brittney 2 2
6 b mark, brittney 3 3
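For intuition, here is what the slice() indices look like for a hypothetical three-row group: combn() enumerates the index pairs and c() flattens them column by column, so slice() repeats the rows pair by pair.
combn(seq(3), 2)
#      [,1] [,2] [,3]
# [1,]    1    1    2
# [2,]    2    3    3
c(combn(seq(3), 2))
# [1] 1 2 1 3 2 3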
Edit:
To adapt for all possible combinations, you can iterate over the combination sizes up to the max group size. Using edited data which has a couple of rows added to the first group:
map_df(seq(max(table(df2$id))), ~
  df2 %>%
    group_by(id) %>%
    slice(c(combn(seq(n()), .x * (.x <= n())))) %>%
    mutate(id2 = (row_number() - 1) %/% .x) %>%
    group_by(id, id2) %>%
    summarise(name = toString(name),
              across(where(is.numeric), sum), .groups = "drop")) %>%
  select(-id2) %>%
  arrange(id)
# A tibble: 18 × 4
id name number value
<chr> <chr> <int> <int>
1 a bob 1 1
2 a jane 2 2
3 a sophie 1 1
4 a jeremy 2 2
5 a bob, jane 3 3
6 a bob, sophie 2 2
7 a bob, jeremy 3 3
8 a jane, sophie 3 3
9 a jane, jeremy 4 4
10 a sophie, jeremy 3 3
11 a bob, jane, sophie 4 4
12 a bob, jane, jeremy 5 5
13 a bob, sophie, jeremy 4 4
14 a jane, sophie, jeremy 5 5
15 a bob, jane, sophie, jeremy 6 6
16 b mark 3 5
17 b brittney 4 6
18 b mark, brittney 7 11
Data for df2:
df2 <- structure(list(id = c("a", "a", "a", "a", "b", "b"), name = c("bob",
"jane", "sophie", "jeremy", "mark", "brittney"), number = c(1L,
2L, 1L, 2L, 3L, 4L), value = c(1L, 2L, 1L, 2L, 5L, 6L)), class = "data.frame", row.names = c(NA,
-6L))
A data.table option, picking toString() or sum() per column with if()/else:
library(data.table)
setDT(df)[
  ,
  lapply(.SD, function(x) {
    # build all combinations of sizes 1..n of each column and collapse
    # each one: toString() for character columns, sum() for numeric ones
    unlist(lapply(seq_along(x), combn, x = x, function(v) {
      if (is.character(v)) toString(v) else sum(v)
    }))
  }),
  id
]
gives
id name number value
1: a bob 1 1
2: a jane 2 2
3: a bob, jane 3 3
4: b mark 1 1
5: b brittney 2 2
6: b mark, brittney 3 3
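The inner lapply() walks the combination sizes 1..n, and combn() applies the supplied function to every combination of that size, e.g.:
combn(c("bob", "jane"), 2, toString)
# [1] "bob, jane"
combn(c(1L, 2L), 2, sum)
# [1] 3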
I have a dataset with various "chunks" of columns with different prefixes, but the same suffix:
ID  A034  B034  C034  D034  A099  B099  A123  B123  ...
 1    NA     1    NA    NA    NA     3     1    NA  ...
 2     2    NA    NA    NA     2    NA    NA     2  ...
 3    NA    NA     2    NA    NA     2     1    NA  ...
The number of columns within each "chunk" also varies. Is there any way (other than manually, which is what I have been painstakingly doing with coalesce(!!! select(., contains("XXX")))) to automatically coalesce by chunk based on the shared suffix? That is, the result should resemble:
ID  034  099  123  ...
 1    1    3    1  ...
 2    2    2    2  ...
 3    2    2    1  ...
I'm not sure how to begin doing something like this, so any suggestions would be very helpful.
We reshape the data into 'long' format with pivot_longer, using the .value sentinel so that the digits captured from the column names become the new column names. Then we group by 'ID', loop across the other columns, and apply na.omit to remove the NA elements (we assume there is exactly one non-NA per column per group):
library(dplyr)
library(tidyr)
df1 %>%
  pivot_longer(cols = -ID, names_to = ".value",
               names_pattern = "[A-Z](\\d+)") %>%
  group_by(ID) %>%
  summarise(across(everything(), na.omit), .groups = 'drop')
-output
# A tibble: 3 x 4
ID `034` `099` `123`
<int> <int> <int> <int>
1 1 1 3 1
2 2 2 2 2
3 3 2 2 1
Or, to be safe, use complete.cases to create a logical vector of non-NA elements and extract the first one (assuming we need only a single non-NA; if the non-NA counts differ, we may need to return a list):
df1 %>%
  pivot_longer(cols = -ID, names_to = ".value",
               names_pattern = "[A-Z](\\d+)") %>%
  group_by(ID) %>%
  summarise(across(everything(), ~ .[complete.cases(.)][1]))
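The names_pattern regex is what maps old names to new ones: the digits captured by (\d+) become the column names via the .value sentinel. A quick check of the capture:
gsub("^[A-Z](\\d+)$", "\\1", c("A034", "B034", "B099", "A123"))
# [1] "034" "034" "099" "123"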
data
df1 <- structure(list(ID = 1:3, A034 = c(NA, 2L, NA), B034 = c(1L, NA,
NA), C034 = c(NA, NA, 2L), D034 = c(NA, NA, NA), A099 = c(NA,
2L, NA), B099 = c(3L, NA, 2L), A123 = c(1L, NA, 1L), B123 = c(NA,
2L, NA)), class = "data.frame", row.names = c(NA, -3L))
One more approach:
library(tidyverse)
split(names(df1)[-1], gsub('^\\D*(\\d+)$', '\\1', names(df1)[-1])) %>%
  map(~ df1[c('ID', .x)]) %>%
  imap(~ .x %>%
         group_by(ID) %>%
         rowwise() %>%
         transmute(!!.y := first(na.omit(c_across(everything())))) %>%
         ungroup()) %>%
  reduce(left_join, by = 'ID')
#> # A tibble: 3 x 4
#> ID `034` `099` `123`
#> <int> <int> <int> <int>
#> 1 1 1 3 1
#> 2 2 2 2 2
#> 3 3 2 2 1
Created on 2021-06-20 by the reprex package (v2.0.0)
I've come up against a wall trying to resolve this and hope somebody can help. I'm trying to implement a way to filter this dataset, which reflects time-stamped bike station occupancy data.
ID Time Bike.Availability
1 2 01/04/2020 04:31:16 11
2 2 01/04/2020 04:40:07 11
3 2 01/04/2020 04:50:15 10
4 2 01/04/2020 04:57:10 10
5 2 01/04/2020 05:07:19 9
6 2 01/04/2020 05:19:38 10
7 2 01/04/2020 05:29:47 10
8 2 01/04/2020 06:43:54 11
I want to remove the rows where there is no change in Bike.Availability and only keep the first instance.
I would like the resulting dataset to look as follows:
ID Time Bike.Availability
1 2 01/04/2020 04:31:16 11
2 2 01/04/2020 04:50:15 10
3 2 01/04/2020 05:07:19 9
4 2 01/04/2020 05:19:38 10
5 2 01/04/2020 06:43:54 11
I've converted the timestamp:
bike_data$Time <- as.POSIXct(bike_data$Time, format = "%d/%m/%Y %H:%M:%S")
And I've tried different variations of:
library(dplyr)
bike_data %>%
  group_by(Time) %>%
  arrange(Bike.Availability) %>%
  top_n(1)
Any help or feedback would be greatly appreciated.
We group by 'ID' and the run-length id of 'Bike.Availability': rleid() creates a grouping index based on runs of adjacent equal elements of 'Bike.Availability'. Then we slice the first row of each group with slice_head, specifying n = 1:
library(dplyr)
library(data.table)
bike_data %>%
  group_by(ID, grp = rleid(Bike.Availability)) %>%
  slice_head(n = 1) %>%
  ungroup %>%
  select(-grp)
-output
# A tibble: 5 x 3
# ID Time Bike.Availability
# <int> <chr> <int>
#1 2 01/04/2020 04:31:16 11
#2 2 01/04/2020 04:50:15 10
#3 2 01/04/2020 05:07:19 9
#4 2 01/04/2020 05:19:38 10
#5 2 01/04/2020 06:43:54 11
Grouping by the 'Time' column would create groups with a single observation each (the 'Time' values are all distinct), and therefore top_n(1) returns the original dataset instead of subsetting it.
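To see what rleid() produces here, applied to the Bike.Availability values from the example:
library(data.table)
rleid(c(11, 11, 10, 10, 9, 10, 10, 11))
# [1] 1 1 2 2 3 4 4 5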
data
bike_data <- structure(list(ID = c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
Time = c("01/04/2020 04:31:16",
"01/04/2020 04:40:07", "01/04/2020 04:50:15", "01/04/2020 04:57:10",
"01/04/2020 05:07:19", "01/04/2020 05:19:38", "01/04/2020 05:29:47",
"01/04/2020 06:43:54"), Bike.Availability = c(11L, 11L, 10L,
10L, 9L, 10L, 10L, 11L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8"))
A dplyr-only solution: use lag() to check whether each row equals the row above it, recode the leading NA to 0, and then filter.
library(dplyr)
bike_data %>%
  # flag rows whose Bike.Availability matches the previous row
  mutate(same = ifelse(Bike.Availability == lag(Bike.Availability), 1, 0)) %>%
  # the first row has no predecessor, so recode its NA to 0 (i.e. keep it)
  mutate(same = ifelse(is.na(same), 0, same)) %>%
  filter(same == 0) %>%
  select(-same)
Output:
ID Time Bike.Availability
1 2 01/04/2020 04:31:16 11
3 2 01/04/2020 04:50:15 10
5 2 01/04/2020 05:07:19 9
6 2 01/04/2020 05:19:38 10
8 2 01/04/2020 06:43:54 11
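For comparison, a base R one-liner performs the same keep-first-of-each-run filter (a sketch, assuming a single ID as in the example data):
# keep row 1, then every row whose value differs from the previous row
bike_data[c(TRUE, diff(bike_data$Bike.Availability) != 0), ]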
I have a question for the community and hoping for some help.
I am trying to duplicate a data frame like the one below:
ID Time Solve
1 0 1
1 2 2
1 4 3
1 6 1
I am trying to duplicate the above data frame 100 times, so it would read as below:
ID Time Solve
1 0 1
1 2 2
1 4 3
1 6 1
2 0 1
2 2 2
2 4 3
2 6 1
3 0 1
3 2 2
3 4 3
3 6 1
4 0 1
4 2 2
4 4 3
4 6 1
.....
100 0 1
100 2 2
100 4 3
100 6 1
Does anyone have a good solution for this or a resource to read up on this?
Thanks!
We can use replicate
out <- do.call(rbind, replicate(100, df1, simplify = FALSE))
out$ID <- as.integer(gl(nrow(out), nrow(df1), nrow(out)))
Or another option is rep
out <- df1[rep(seq_len(nrow(df1)), 100),]
out$ID <- as.integer(gl(nrow(out), nrow(df1), nrow(out)))
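In both options, gl(n, k, length) builds the block ID pattern: n factor levels, each repeated k times, recycled to the given length. A small demonstration:
as.integer(gl(3, 4, 12))
# [1] 1 1 1 1 2 2 2 2 3 3 3 3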
Or make use of uncount
library(tidyr)
library(dplyr)
uncount(df1, 100) %>%                            # repeat each row 100 times (row 1 x100, row 2 x100, ...)
  mutate(ID = rep(seq_len(100), nrow(df1))) %>%  # label the copies 1..100 within each original row
  arrange(ID)                                    # stable sort regroups them into 100 blocks
Or another option is
df1 %>%
  nest_by(ID) %>%
  ungroup() %>%
  uncount(100) %>%
  mutate(ID = row_number()) %>%
  unnest(c(data))
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L), Time = c(0L, 2L, 4L, 6L
), Solve = c(1L, 2L, 3L, 1L)), class = "data.frame", row.names = c(NA,
-4L))
I have 200 columns and want to calculate sums and a rank and then generate new columns. Here is an example of the data:
df<-read.table(text="Q1a Q2a Q3b Q4c Q5a Q6c Q7b
1 2 4 2 2 0 1
3 2 1 2 2 1 1
4 3 2 1 1 1 1",h=T)
I want to sum the a, b and c columns for each row, then add those sums together into a Total. Next I want to calculate each row's rank by Total. I want to generate the following table:
Q1a Q2a Q3b Q4c Q5a Q6c Q7b a b c Total Rank
1 2 4 2 2 0 1 5 5 2 12 2
3 2 1 2 2 1 1 7 2 3 12 2
4 3 2 1 1 1 1 8 3 2 13 1
library(dplyr)
df %>%
  cbind(sapply(c('a', 'b', 'c'), function(x)
    rowSums(.[, grep(x, names(.)), drop = FALSE]))) %>%
  mutate(Total = a + b + c,
         Rank = match(Total, sort(Total, decreasing = T)))
Output is:
Q1a Q2a Q3b Q4c Q5a Q6c Q7b a b c Total Rank
1 1 2 4 2 2 0 1 5 5 2 12 2
2 3 2 1 2 2 1 1 7 2 3 12 2
3 4 3 2 1 1 1 1 8 3 2 13 1
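match() against the sorted totals yields competition-style ("min") ranks, so ties share the best position:
Total <- c(12, 12, 13)
match(Total, sort(Total, decreasing = TRUE))
# [1] 2 2 1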
Sample data:
df <- structure(list(Q1a = c(1L, 3L, 4L), Q2a = c(2L, 2L, 3L), Q3b = c(4L,
1L, 2L), Q4c = c(2L, 2L, 1L), Q5a = c(2L, 2L, 1L), Q6c = c(0L,
1L, 1L), Q7b = c(1L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-3L))
You can also go with a tidyverse approach; however, it is longer.
library(tidyverse)
df %>%
  rownames_to_column(var = "ID") %>%
  gather(question, value, -ID) %>%
  mutate(type = substr(question, 3, 3)) %>%
  group_by(ID, type) %>%
  summarise(sumType = sum(value, na.rm = TRUE)) %>%
  as.data.frame() %>%
  spread(type, sumType) %>%
  mutate(Total = a + b + c,
         Rank = match(Total, sort(Total, decreasing = T)))
Results:
ID a b c Total Rank
1 1 5 5 2 12 2
2 2 7 2 3 12 2
3 3 8 3 2 13 1
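gather() and spread() are superseded in current tidyr; here is a sketch of the same pipeline with pivot_longer()/pivot_wider(), taking the last character of each question name as the type:
library(tidyverse)
df %>%
  rownames_to_column(var = "ID") %>%
  pivot_longer(-ID, names_to = "question", values_to = "value") %>%
  mutate(type = str_sub(question, -1)) %>%   # last character: a, b or c
  group_by(ID, type) %>%
  summarise(sumType = sum(value, na.rm = TRUE), .groups = "drop") %>%
  pivot_wider(names_from = type, values_from = sumType) %>%
  mutate(Total = a + b + c,
         Rank = match(Total, sort(Total, decreasing = TRUE)))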