I am working with a dataframe with thousands of responses to questions about interest in a set of resources. I want to summarize how many participants are interested in a given resource by counting the number of positive responses (coded as "1").
As a final step, I would like to suppress any answer where <5 participants responded.
I've created code that works, but it's clunky when I'm dealing with dozens of fields. So I'm looking for suggestions for a more streamlined approach, perhaps using piping or dplyr?
Example Input

ID  Resource1  Resource2  Resource3  Resource4
 1          1          0          1          1
 2          0          0          0          1
 3          1          0          0          0
 4          0          0          0          0
 5          1          1          1          1

Desired output

            Interested  Not Interested
Resource1            3               2
Resource2            1               4
Resource3            2               3
Resource4            3               2
My (ugly) code
### Load the packages the code relies on (plyr first so dplyr's verbs are not masked)
library(plyr)   # join_all()
library(dplyr)
library(tidyr)

### Select and summarise relevant columns
resource1 <- df %>% drop_na(resource1) %>% group_by(resource1) %>% summarise(n = n()) %>% rename(resp = resource1, r1 = n)
resource2 <- df %>% drop_na(resource2) %>% group_by(resource2) %>% summarise(n = n()) %>% rename(resp = resource2, r2 = n)
resource3 <- df %>% drop_na(resource3) %>% group_by(resource3) %>% summarise(n = n()) %>% rename(resp = resource3, r3 = n)
resource4 <- df %>% drop_na(resource4) %>% group_by(resource4) %>% summarise(n = n()) %>% rename(resp = resource4, r4 = n)

### Merge summarised data
resource_sum <- join_all(list(resource1, resource2, resource3, resource4), by = "resp")

### Replace all values less than 5 with NA per suppression rules
resource_sum <- apply(resource_sum, 2, function(x) ifelse(x < 5, NA, x))
resource_sum <- as.data.frame(resource_sum)
We may reshape into 'long' format with pivot_longer and then do a group-by summarise to get the counts of 1s and 0s.
library(dplyr)
library(tidyr)
library(tibble)
df %>%
  pivot_longer(cols = -ID) %>%
  group_by(name) %>%
  summarise(Interested = sum(value), NotInterested = n() - Interested) %>%
  column_to_rownames('name')
-output
Interested NotInterested
Resource1 3 2
Resource2 1 4
Resource3 2 3
Resource4 3 2
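The question also asks to suppress any count below 5. One way to do that with the same pipeline is to blank out small cells before setting the row names. This is a minimal sketch, assuming dplyr >= 1.0 for across() and that NA is an acceptable suppression marker (on this 5-row example every cell would be suppressed, but on the real data only the small cells are affected):
df %>%
  pivot_longer(cols = -ID) %>%
  group_by(name) %>%
  summarise(Interested = sum(value), NotInterested = n() - Interested) %>%
  # suppression rule: counts below 5 are replaced with NA
  mutate(across(c(Interested, NotInterested), ~ ifelse(.x < 5, NA, .x))) %>%
  column_to_rownames('name')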
Or using base R
v1 <- colSums(df[-1])
cbind(Interested = v1, NotInterested = nrow(df) - v1)
-output
Interested NotInterested
Resource1 3 2
Resource2 1 4
Resource3 2 3
Resource4 3 2
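The suppression rule can be applied directly to this result too, since cbind() returns a numeric matrix; a minimal sketch (res is just an illustrative name):
v1 <- colSums(df[-1])
res <- cbind(Interested = v1, NotInterested = nrow(df) - v1)
# suppression rule: any count below 5 becomes NA
res[res < 5] <- NA
res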
data
df <- structure(list(ID = 1:5, Resource1 = c(1L, 0L, 1L, 0L, 1L),
Resource2 = c(0L,
0L, 0L, 0L, 1L), Resource3 = c(1L, 0L, 0L, 0L, 1L), Resource4 = c(1L,
1L, 0L, 0L, 1L)), class = "data.frame", row.names = c(NA, -5L
))
You can use table to get counts of the 0 and 1 values. To apply the function (table) to multiple columns you can use sapply -
t(sapply(df[-1], table))
# 0 1
#Resource1 2 3
#Resource2 4 1
#Resource3 3 2
#Resource4 2 3
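One caveat (worth checking against the real data, which isn't shown here): if a resource column contains only 0s or only 1s, table() returns vectors of different lengths and sapply() falls back to a list instead of a matrix. Fixing the factor levels avoids that; a small sketch:
# Force both levels so every column yields a length-2 count vector
t(sapply(df[-1], function(x) table(factor(x, levels = c(0, 1)))))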
I have survey data in R that looks like this, where I've presented people with two groups of actions - High and Low - and asked them to rank each action. Each group contains unique actions, marked by the letter (6 actions in total).
id A_High B_High C_High D_Low E_Low F_Low
001 5 2 1 6 4 3
002 6 4 3 5 2 1
003 3 1 6 2 4 5
004 6 5 2 1 3 4
I need a new df that looks like the one below, where each High action is assigned a new numeric rank (between 0 and 3) corresponding to the number of Low action items that were ranked below that High action.
For example, a person with id 001 ranked A_High at number 5, B_High at 2, and C_High at 1. A_High's new rank would be 1 (since only 1 Low action, D_Low is ranked below A_High), B_High's new rank would be 3 (since all 3 Low actions were ranked below B_High), and C_High's new rank would be 3 (since all 3 Low actions were ranked below C_High).
id A_High_rank B_High_rank C_High_rank
001 1 3 3
002 0 1 1
003 2 3 0
004 0 0 2
I have a sense that this can be done with if/else statements but suspect that there should be a far more efficient way of achieving this with tidyverse. In the real dataset, I have 1000+ rows and 12 actions (6 High and 6 Low). I would appreciate any help on this.
Thanks!
Data:
"id A_High B_High C_High D_Low E_Low F_Low
001 5 2 1 6 4 3
002 6 4 3 5 2 1
003 3 1 6 2 4 5
004 6 5 2 1 3 4"
A base R option would be to loop over the 'High' columns, get the rowSums of the logical matrix created by checking whether each is less than the 'Low' columns, and rename the output by appending _rank as a suffix
out <- cbind(df1[1], sapply(df1[2:4],
function(x) rowSums(x < df1[endsWith(names(df1), 'Low')])))
names(out)[-1] <- paste0(names(out)[-1], "_rank")
-output
out
# id A_High_rank B_High_rank C_High_rank
#1 1 1 3 3
#2 2 0 1 1
#3 3 2 3 0
#4 4 0 0 2
Or using dplyr
library(dplyr)
df1 %>%
  transmute(id, across(ends_with('High'),
                       ~ rowSums(. < select(df1, ends_with('Low'))), .names = '{.col}_rank'))
# id A_High_rank B_High_rank C_High_rank
#1 1 1 3 3
#2 2 0 1 1
#3 3 2 3 0
#4 4 0 0 2
data
df1 <- structure(list(id = 1:4, A_High = c(5L, 6L, 3L, 6L), B_High = c(2L,
4L, 1L, 5L), C_High = c(1L, 3L, 6L, 2L), D_Low = c(6L, 5L, 2L,
1L), E_Low = c(4L, 2L, 4L, 3L), F_Low = c(3L, 1L, 5L, 4L)),
class = "data.frame", row.names = c(NA,
-4L))
After much suffering, this is the tidyverse solution I came up with. This was fun!
library(tidyverse)
data %>%
  pivot_longer(cols = ends_with("_High"), names_to = "High Variables", values_to = "High") %>%
  pivot_longer(cols = ends_with("_Low"), names_to = "Low Variables", values_to = "Low") %>%
  filter(High - Low < 0) %>%
  group_by(`High Variables`, `id`) %>%
  summarise(Count = n()) %>%
  pivot_wider(names_from = `High Variables`, values_from = Count) %>%
  arrange(id)
Translation:
The first two lines create two pairs of columns and leave id untouched. Each pair has two columns, one with the original column names and the other with the values. Each pair of columns represents either High or Low.
Then I filtered all the rows, keeping only those where Low was greater than High, counted how many were left for each id, and reshaped back to the wide format.
Now I just have to figure out how to turn those NAs into 0s (see the sketch after the output below).
Here's the output:
> data %>%
+ pivot_longer(cols = ends_with("_High"), names_to = "High Variables", values_to = "High") %>%
+ pivot_longer(cols = ends_with("_Low"), names_to = "Low Variables", values_to = "Low") %>%
+ filter(High < Low) %>%
+ group_by(`High Variables`, `id`) %>%
+ summarise(Count = n()) %>%
+ pivot_wider(names_from = `High Variables`, values_from = Count) %>%
+ arrange(id)
`summarise()` regrouping output by 'High Variables' (override with `.groups` argument)
# A tibble: 4 x 4
id A_High B_High C_High
<int> <int> <int> <int>
1 1 1 3 3
2 2 NA 1 1
3 3 2 3 NA
4 4 NA NA 2
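One way to turn those NAs into 0s is to fill them in during the reshape itself. This is a minimal sketch, assuming a recent tidyr where pivot_wider() accepts a scalar values_fill (older versions spell it values_fill = list(Count = 0)):
data %>%
  pivot_longer(cols = ends_with("_High"), names_to = "High Variables", values_to = "High") %>%
  pivot_longer(cols = ends_with("_Low"), names_to = "Low Variables", values_to = "Low") %>%
  filter(High < Low) %>%
  group_by(`High Variables`, `id`) %>%
  summarise(Count = n()) %>%
  # missing High/id combinations become 0 rather than NA when widening
  pivot_wider(names_from = `High Variables`, values_from = Count, values_fill = 0) %>%
  arrange(id)
Alternatively, the NAs in the existing result can be replaced afterwards with mutate(across(everything(), ~ replace_na(., 0))).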
I'm new to programming in R and I have the following dataframe:
A B C D E
1 3 0 4 5 0
2 0 0 5 1 0
3 2 1 2 0 3
I would like to get a new dataframe containing the indices of the n max values of each row, e.g. if I wanted the column indices of the 3 biggest values in each row (n = 3), the new dataframe should look like this:
F G H
1 1 3 4
2 1 3 4
3 1 3 5
So the first row of this dataframe contains the column indices of the 3 biggest values of row 1 in the original dataframe, and so on.
My original idea was to write a loop with which.max, but that seems far too long and inefficient. Does anyone have a better idea?
We can use apply
t(apply(df1, 1, function(x) sort(head(seq_along(x)[order(-x)], 3))))
# [,1] [,2] [,3]
#1 1 3 4
#2 1 3 4
#3 1 3 5
Or using tidyverse
library(dplyr)
library(tidyr)
df1 %>%
  mutate(rn = row_number()) %>%
  pivot_longer(cols = -rn) %>%
  group_by(rn) %>%
  mutate(ind = row_number()) %>%
  arrange(rn, desc(value)) %>%
  slice(1:3) %>%
  select(-name, -value) %>%
  arrange(rn, ind) %>%
  mutate(nm1 = c("F", "G", "H")) %>%
  ungroup %>%
  pivot_wider(names_from = nm1, values_from = ind)
data
df1 <- structure(list(A = c(3L, 0L, 2L), B = c(0L, 0L, 1L), C = c(4L,
5L, 2L), D = c(5L, 1L, 0L), E = c(0L, 0L, 3L)), class = "data.frame",
row.names = c("1",
"2", "3"))
I'd like to find consecutive months by client. I thought this would be easy, but I still can't find a solution.
My goal is to count consecutive monthly purchases for each client. Any help would be appreciated.
My data
Client Month consecutive
A 1 1
A 1 2
A 2 3
A 5 1
A 6 2
A 8 1
B 8 1
In base R, we can use ave
df$consecutive <- with(df, ave(Month, Client, cumsum(c(TRUE, diff(Month) > 1)),
FUN = seq_along))
df
# Client Month consecutive
#1 A 1 1
#2 A 1 2
#3 A 2 3
#4 A 5 1
#5 A 6 2
#6 A 8 1
#7 B 8 1
In dplyr, we can create a new group with lag to compare the current month with the previous month and assign row_number() in each group.
library(dplyr)
df %>%
  group_by(Client, group = cumsum(Month - lag(Month, default = first(Month)) > 1)) %>%
  mutate(consecutive = row_number()) %>%
  ungroup %>%
  select(-group)
We can create a grouping variable based on the difference in adjacent 'Month' for each 'Client' and use that to create the sequence
library(dplyr)
df1 %>%
  group_by(Client) %>%
  group_by(grp = cumsum(c(TRUE, diff(Month) > 1)), add = TRUE) %>%
  mutate(consec = row_number()) %>%
  ungroup %>%
  select(-grp)
# A tibble: 7 x 4
# Client Month consecutive consec
# <chr> <int> <int> <int>
#1 A 1 1 1
#2 A 1 2 2
#3 A 2 3 3
#4 A 5 1 1
#5 A 6 2 2
#6 A 8 1 1
#7 B 8 1 1
Or using data.table
library(data.table)
setDT(df1)[, grp := cumsum(c(TRUE, diff(Month) > 1)), Client
][, consec := seq_len(.N), .(Client, grp)
][, grp := NULL][]
data
df1 <- structure(list(Client = c("A", "A", "A", "A", "A", "A", "B"),
Month = c(1L, 1L, 2L, 5L, 6L, 8L, 8L), consecutive = c(1L,
2L, 3L, 1L, 2L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-7L))
Solution
I went with the solutions provided by #MauritsEvers and #akrun below.
Question
For a data frame, I want to keep only 1 column of each set of duplicate columns. In addition, the column that is kept takes on a name that is a concatenation of all column names in the set of duplicate columns. There are multiple sets of duplicate columns in the data frame. The data frame contains tens of thousands of columns, so using a for loop might take too much time.
I have tried combinations of duplicated(), summary(), aggregate(), lapply(), apply(), and for loops.
Input data frame (df_in):
0 1 2 3 4 5 6 7
0 1 0 0 1 0 1 1
0 1 0 1 1 0 0 0
1 0 1 0 0 1 1 0
Output data frame (df_out):
0-2-5 1-4 3 6 7
0 1 0 1 1
0 1 1 0 0
1 0 0 1 0
Here is an option with tidyverse. We gather the data into 'long' format, collapse each column's values into a single string, then, grouped by that string, paste the 'key' (column name) elements together, separate the rows of 'value' back out, and spread the 'value' column to get the expected output.
library(tidyverse)
gather(df_in) %>%
  group_by(key) %>%
  summarise(value = toString(value)) %>%
  group_by(value) %>%
  summarise(key = paste(key, collapse="-")) %>%
  separate_rows(value) %>%
  group_by(key) %>%
  mutate(n = row_number()) %>%
  spread(key, value) %>%
  select(-n)
# A tibble: 3 x 5
# `0-2-5` `1-4` `3` `6` `7`
# <chr> <chr> <chr> <chr> <chr>
#1 0 1 0 1 1
#2 0 1 1 0 0
#3 1 0 0 1 0
Or another option with tidyverse would be
t(df_in) %>%
  as.data.frame %>%
  mutate(grp = group_indices(., V1, V2, V3)) %>%
  mutate(rn = row_number() - 1) %>%
  group_split(grp, keep = FALSE) %>%
  map_dfc(~ .x %>%
            mutate(rn = str_c(rn, collapse="-")) %>%
            slice(1) %>%
            gather(key, val, -rn) %>%
            rename(!! .$rn[1] := val) %>%
            select(ncol(.)))
# A tibble: 3 x 5
# `0-2-5` `3` `7` `6` `1-4`
# <int> <int> <int> <int> <int>
#1 0 0 1 1 1
#2 0 1 0 0 1
#3 1 0 0 1 0
Or we can also do this with data.table methods
library(data.table)
dcast(melt(as.data.table(t(df_in))[, grp := .GRP, .(V1, V2, V3)][,
c(.SD[1], cn = paste(.I-1, collapse="-")) , .(grp)],
id.var = c('cn', 'grp')), variable ~ cn, value.var = 'value')[,
variable := NULL][]
# 0-2-5 1-4 3 6 7
#1: 0 1 0 1 1
#2: 0 1 1 0 0
#3: 1 0 0 1 0
data
df_in <- structure(list(`0` = c(0L, 0L, 1L), `1` = c(1L, 1L, 0L), `2` = c(0L,
0L, 1L), `3` = c(0L, 1L, 0L), `4` = c(1L, 1L, 0L), `5` = c(0L,
0L, 1L), `6` = c(1L, 0L, 1L), `7` = c(1L, 0L, 0L)),
class = "data.frame", row.names = c(NA, -3L))
You can do the following in base R
Get indices of identical columns
idx <- split(seq_along(names(df)), apply(df, 2, paste, collapse = "_"))
Sort indices from low to high
idx <- idx[order(sapply(idx, function(x) x[1]))]
Names of idx as concatenation of column names
names(idx) <- sapply(idx, function(x) paste(names(df)[x], collapse = "_"))
Create final matrix
sapply(idx, function(x) df[, x[1]])
# col0_col2_col5 col1_col4 col3_col6 col7
#[1,] 0 1 1 1
#[2,] 0 1 0 0
#[3,] 1 0 1 0
Note that the resulting object is a matrix, so if you need a data.frame simply cast as.data.frame.
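For example (a one-line sketch reusing the idx built above; df_out is just an illustrative name):
# Cast the matrix result to a data.frame
df_out <- as.data.frame(sapply(idx, function(x) df[, x[1]]))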
Sample data
I've changed your sample data slightly to not have numbers as column names.
df <- read.table(text =
"col0 col1 col2 col3 col4 col5 col6 col7
0 1 0 1 1 0 1 1
0 1 0 0 1 0 0 0
1 0 1 1 0 1 1 0", header = T)
I am trying to iterate through the columns, and if a column is a whole year, it should be duplicated four times and renamed to quarters.
So this
2000 Q1-01 Q2-01 Q3-01
1 2 3 3
Should become this:
Q1-00 Q2-00 Q3-00 Q4-00 Q1-01 Q2-01 Q3-01
1 1 1 1 2 3 3
Any ideas?
We can use stringr::str_detect to look for column names with 4 digits and then take the last two digits from those columns.
library(dplyr)
library(tidyr)
library(stringr)
df %>%
  gather(key, value) %>%
  group_by(key) %>%
  mutate(key_new = ifelse(str_detect(key, '\\d{4}'),
                          paste0('Q', 1:4, '-', str_extract(key, '\\d{2}$'), collapse = ','), key)) %>%
  ungroup() %>% select(-key) %>%
  separate_rows(key_new, sep = ',') %>%
  spread(key_new, value)
PS: I hope you don't have a large dataset
Since you want repeated columns, you can just re-index your data frame and then update the column names
df <- structure(list(`2000` = 1L, Q1.01 = 2L, Q2.01 = 3L, Q3.01 = 3L,
`2002` = 1L, Q1.03 = 2L, Q2.03 = 3L, Q3.03 = 3L), row.names = c(NA,
-1L), class = "data.frame")
#> df
#2000 Q1.01 Q2.01 Q3.01 2002 Q1.03 Q2.03 Q3.03
#1 1 2 3 3 1 2 3 3
# Get indices of columns that consist of 4 numbers
col.ids <- grep('^[0-9]{4}$', names(df))
# For each of those, create new names, and for the rest preserve the old names
new.names <- lapply(seq_along(df), function(i) {
  if (i %in% col.ids)
    return(paste(substr(names(df)[i], 3, 4), c('Q1', 'Q2', 'Q3', 'Q4'), sep = '.'))
  return(names(df)[i])
})
# Now repeat each of those columns 4 times
df <- df[rep(seq_along(df), ifelse(seq_along(df) %in% col.ids, 4, 1))]
# ...and finally set the column names to the desired new names
names(df) <- unlist(new.names)
#> df
#00.Q1 00.Q2 00.Q3 00.Q4 Q1.01 Q2.01 Q3.01 02.Q1 02.Q2 02.Q3 02.Q4 Q1.03 Q2.03 Q3.03
#1 1 1 1 1 2 3 3 1 1 1 1 2 3 3
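If the Q1-00 style of names shown in the question is preferred over 00.Q1, only the name construction needs to change; a small sketch of that variant, run in place of the new.names step above (i.e. before df is re-indexed):
# Variant of the renaming step that builds "Q1-00"-style names instead of "00.Q1"
new.names <- lapply(seq_along(df), function(i) {
  if (i %in% col.ids)
    return(paste(c('Q1', 'Q2', 'Q3', 'Q4'), substr(names(df)[i], 3, 4), sep = '-'))
  return(names(df)[i])
})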