Extracting unique combinations from a long list of binary variables in R

I have a dataframe containing a long list of binary variables. Each row represents a participant, and columns represent whether a participant made a certain choice (1) or not (0). For the sake of simplicity, let's say there are only four binary variables and six participants.
df <- data.frame(a = c(0,1,0,1,0,1),
                 b = c(1,1,1,1,0,1),
                 c = c(0,0,0,1,1,1),
                 d = c(1,1,0,0,0,0))
> df
# a b c d
# 1 0 1 0 1
# 2 1 1 0 1
# 3 0 1 0 0
# 4 1 1 1 0
# 5 0 0 1 0
# 6 1 1 1 0
In the dataframe, I want to create a set of columns that reflect each unique combination of variables in df (i.e., abc, abd, acd, bcd). Then, for each row, I want to assign the value 1 if the row contains the particular combination corresponding to that column. So, if a participant scored 1 on "a", "b", and "c", and 0 on "d", they would score 1 in the newly created column "abc" but 0 in the other columns. Ideally, it would look something like this:
>df_updated
# a b c d abc abd acd bcd
# 1 0 1 0 1 0 0 0 0
# 2 1 1 0 1 0 1 0 0
# 3 0 1 0 0 0 0 0 0
# 4 1 1 1 0 1 0 0 0
# 5 0 0 1 0 0 0 0 0
# 6 1 1 1 0 1 0 0 0
The ultimate goal is to have an idea of the frequency of each of the combinations, so I can order them from the most frequently chosen to the least frequently chosen. I've been thinking about this issue for days now, but couldn't find an appropriate answer. I would very much appreciate the help.

Something like this?
funCombn <- function(data){
  # for one combination of column names x, build its label and a 0/1
  # indicator of whether all of those columns equal 1 in each row
  f <- function(x, data){
    data <- data[x]
    list(
      name = paste(x, collapse = ""),
      vec = apply(data, 1, function(x) +all(as.logical(x)))
    )
  }
  res <- combn(names(data), 3, f, simplify = FALSE, data = data)
  out <- do.call(cbind.data.frame, lapply(res, '[[', 'vec'))
  names(out) <- sapply(res, '[[', 'name')
  cbind(data, out)
}
funCombn(df)
# a b c d abc abd acd bcd
#1 0 1 0 1 0 0 0 0
#2 1 1 0 1 0 1 0 0
#3 0 1 0 0 0 0 0 0
#4 1 1 1 0 1 0 0 0
#5 0 0 1 0 0 0 0 0
#6 1 1 1 0 1 0 0 0
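Since the ultimate goal is to rank the combinations by frequency, a natural follow-up (my addition, not part of the original answer) is to sum the new indicator columns and sort:
res <- funCombn(df)
# drop the original columns, count each combination, most frequent first
sort(colSums(res[setdiff(names(res), names(df))]), decreasing = TRUE)
# abc abd acd bcd
#   2   1   0   0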

Base R option using combn:
n <- 3
cbind(df, do.call(cbind, combn(names(df), n, function(x) {
  setNames(data.frame(as.integer(rowSums(df[x] == 1) == n)),
           paste0(x, collapse = ''))
}, simplify = FALSE))) -> result
result
# a b c d abc abd acd bcd
#1 0 1 0 1 0 0 0 0
#2 1 1 0 1 0 1 0 0
#3 0 1 0 0 0 0 0 0
#4 1 1 1 0 1 0 0 0
#5 0 0 1 0 0 0 0 0
#6 1 1 1 0 1 0 0 0
Using combn, create all combinations of column names taking n columns at a time. For each of those combinations, assign 1 to the rows where all n of the columns are 1, and 0 otherwise.
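To see what combn iterates over, here is its raw output for the column names (each column of the matrix is one combination):
combn(names(df), 3)
#      [,1] [,2] [,3] [,4]
# [1,] "a"  "a"  "a"  "b"
# [2,] "b"  "b"  "c"  "c"
# [3,] "c"  "d"  "d"  "d"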

If you are just looking for a frequency of the combinations (and they don't need to be back in the original data), then you could use something like this:
df <- data.frame(a = c(0,1,0,1,0,1),
b = c(1,1,1,1,0,1),
c = c(0,0,0,1,1,1),
d = c(1,1,0,0,0,0))
n <- names(df)
out <- sapply(n, function(x) ifelse(df[[x]] == 1, x, ""))
combs <- apply(out, 1, paste, collapse="")
sort(table(combs))
# combs
#   abd     b    bd     c   abc
#     1     1     1     1     2
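To order from most to least frequently chosen, as the question asks, sort in decreasing order instead:
sort(table(combs), decreasing = TRUE)
# combs
#   abc   abd     b    bd     c
#     2     1     1     1     1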

Ok, so let's use your data, including one row without any 1's:
df <- data.frame(
  a = c(0,1,0,1,0,1,0),
  b = c(1,1,1,1,0,1,0),
  c = c(0,0,0,1,1,1,0),
  d = c(1,1,0,0,0,0,0)
)
Now I want to paste all column names together if they have a 1, and then make that a wide table (so that each combination gets its own column). Of course, I fill all resulting NAs with 0's.
df2 <- df %>%
  dplyr::mutate(
    combination = paste0(
      ifelse(a == 1, "a", ""), # There is possibly a way to automate this as well using across()
      ifelse(b == 1, "b", ""),
      ifelse(c == 1, "c", ""),
      ifelse(d == 1, "d", "")
    ),
    combination = ifelse(
      combination == "",
      "nothing",
      paste0("comb_", combination)
    ),
    value = ifelse(
      is.na(combination),
      0,
      1
    ),
    i = dplyr::row_number()
  ) %>%
  tidyr::pivot_wider(
    names_from = combination,
    values_from = value,
    names_repair = "unique"
  ) %>%
  replace(., is.na(.), 0) %>%
  dplyr::select(-i)
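For reference, df2 should now look roughly like this (a sketch of the intermediate result; pivot_wider() orders the new columns by first appearance):
#   a b c d comb_bd comb_abd comb_b comb_abc comb_c nothing
# 1 0 1 0 1       1        0      0        0      0       0
# 2 1 1 0 1       0        1      0        0      0       0
# 3 0 1 0 0       0        0      1        0      0       0
# 4 1 1 1 0       0        0      0        1      0       0
# 5 0 0 1 0       0        0      0        0      1       0
# 6 1 1 1 0       0        0      0        1      0       0
# 7 0 0 0 0       0        0      0        0      0       1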
Since you want to order the original df by frequency, you can create a summary of all combinations (excluding rows with nothing filled in). Then you make it a long table and pull the combination names, arranged by frequency.
comb_in_order <- df2 %>%
  dplyr::select(
    -tidyselect::any_of(
      c(
        names(df),
        "nothing" # I think you want these last.
      )
    )
  ) %>%
  dplyr::summarise(
    dplyr::across(
      .cols = tidyselect::everything(),
      .fns = sum
    )
  ) %>%
  tidyr::pivot_longer(
    cols = tidyselect::everything(),
    names_to = "combination",
    values_to = "frequency"
  ) %>%
  dplyr::arrange(
    dplyr::desc(frequency)
  ) %>%
  dplyr::pull(combination)
The only thing left to do is reconstruct the original df after arranging by these columns.
df2 %>%
  dplyr::arrange(
    dplyr::across(
      tidyselect::any_of(comb_in_order),
      dplyr::desc
    )
  ) %>%
  dplyr::select(
    tidyselect::any_of(names(df))
  )
This should work for all possible combinations.
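Regarding the across() automation hinted at in the comment above, here is one hedged sketch (my addition; it assumes every column of df is a 0/1 indicator):
# purrr::pmap_chr() walks the rows exposed by dplyr::across() and pastes
# together the names of the columns that equal 1 in each row
df %>%
  dplyr::mutate(
    combination = purrr::pmap_chr(
      dplyr::across(dplyr::everything()),
      ~ paste0(names(df)[c(...) == 1], collapse = "")
    )
  )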

Related

Calculate the number of times a streak of categories changes in a row in R

I have the following data frame in R:
Row number   A   B   C  D  E  F  G  H  I  J
1            NA  1   1  0  0  1  0  0  1  1
2            NA  NA  1  0  0  0  1  0  0  1
3            NA  1   0  0  0  1  0  0  1  1
I am trying to calculate the number of times the value changes between 1 and 0 in each row, excluding the NAs.
The result I am expecting is this
Row Number No of changes
---------- --------------
1 4
2 4
3 4
An explanation for row 1:
In row 1, A has a null, so we exclude it.
B and C have 1, which is our first set of values.
D and E have 0, which is our second set of values. Now Change = 1.
F has 1, which is our third set of values. Now Change = 1+1.
G and H have 0, which is our fourth set of values. Now Change = 1+1+1.
I and J have 1, which is our fifth set of values. Now Change = 1+1+1+1 = 4.
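The per-row count can be sanity-checked with run-length encoding, which is the same idea the rle()/rleid() answers below rely on:
# row 1 with the NA dropped: five runs (11, 00, 1, 00, 11) means 4 changes
x <- c(1, 1, 0, 0, 1, 0, 0, 1, 1)
length(rle(x)$lengths) - 1
# [1] 4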
Here's a tidyverse approach.
I reshape to long format (with tidyr::pivot_longer), then add a helper column noting when the value changes from 0 to 1 or from 1 to 0, and then sum those changes by row.
library(tidyverse)
df %>%
  # before tidyr 1.0, this would be gather(col, value, -1)
  pivot_longer(-1, "col") %>%
  group_by(Row.number) %>%
  mutate(chg = value == 1 & lag(value) == 0 |
               value == 0 & lag(value) == 1) %>%
  summarize(no_chgs = sum(chg, na.rm = TRUE))
# A tibble: 3 x 2
  Row.number no_chgs
       <int>   <int>
1          1       4
2          2       4
3          3       4
Sample data:
df <- read.table(
  header = TRUE,
  stringsAsFactors = FALSE,
  text = "'Row number' A B C D E F G H I J
          1 NA 1  1 0 0 1 0 0 1 1
          2 NA NA 1 0 0 0 1 0 0 1
          3 NA 1  0 0 0 1 0 0 1 1")
Here's a data.table solution:
library(data.table)
dt <- as.data.table(df)
dt[,
   no_change := max(rleid(na.omit(t(.SD)))) - 1,
   by = Row.number
]
dt
Alternatively, here's a base version:
apply(df[, -1],
      1,
      function(x) {
        complete_case <- complete.cases(x)
        if (sum(complete_case) > 0) {
          return(length(rle(x[complete_case])$lengths) - 1)
        } else {
          return(0)
        }
      })
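For the sample data above, this returns one count per row, matching the expected output:
# [1] 4 4 4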

Concatenate column names based on row-wise values

I have an R dataframe with 3 columns containing values 0 or 1. I need to create a column that concatenates the names of the columns whose value is 1, separated by '&'. The following code works with the empty string '' as the separator but fails when I change it to '&'.
Code:
A = c(1,0,1,0,0,1)
B = c(1,1,1,0,1,0)
C = c(0,0,0,1,1,1)
data = data.frame(A, B, C)
data$New = paste(ifelse(data$A == 1, "A", ""),
                 ifelse(data$B == 1, "B", ""),
                 ifelse(data$C == 1, "C", ""), sep = '')
data
Output:
A B C New
1 1 1 0 AB
2 0 1 0 B
3 1 1 0 AB
4 0 0 1 C
5 0 1 1 BC
6 1 0 1 AC
Code & Output with '&' Separator:
A = c(1,0,1,0,0,1)
B = c(1,1,1,0,1,0)
C = c(0,0,0,1,1,1)
data = data.frame(A, B, C)
data$New = paste(ifelse(data$A == 1, "A", ""),
                 ifelse(data$B == 1, "B", ""),
                 ifelse(data$C == 1, "C", ""), sep = '&')
data
A B C New
1 1 1 0 A&B&
2 0 1 0 &B&
3 1 1 0 A&B&
4 0 0 1 &&C
5 0 1 1 &B&C
6 1 0 1 A&&C
Expected Output:
A B C New
1 1 1 0 A&B
2 0 1 0 B
3 1 1 0 A&B
4 0 0 1 C
5 0 1 1 B&C
6 1 0 1 A&C
Is there a way to do this in R?
In the case of a large number of columns, is there a way to do the same without writing an explicit ifelse condition for each column?
We can subset the names by looping through the rows
data$New <- apply(data[1:3], 1, function(x) paste(names(x[x!=0]), collapse="&"))
data$New
#[1] "A&B" "B" "A&B" "C" "B&C" "A&C"
It can also be done column-wise:
library(tidyverse)
data[1:3] %>%
  na_if(0) %>%
  `*`(col(.)) %>%
  imap(~ rep(.y, length(.x))[.x]) %>%
  reduce(paste, sep = "&") %>%
  str_remove("(NA&)+|(&NA)+") %>%
  str_remove("&NA")
#[1] "A&B" "B"   "A&B" "C"   "B&C" "A&C"
You can use apply with paste to do it.
nms <- names(data)
data$New <- apply(data, 1, function(x){
  paste(nms[as.logical(x)], collapse = "&")
})
data
# A B C New
#1 1 1 0 A&B
#2 0 1 0 B
#3 1 1 0 A&B
#4 0 0 1 C
#5 0 1 1 B&C
#6 1 0 1 A&C
Using which with arr.ind = TRUE, and then aggregate:
cbind(data,
      new = aggregate(col ~ row, data = which(data == 1, arr.ind = TRUE),
                      function(x) paste(names(data)[x], collapse = "&"))[, "col"])
# A B C new
# 1 1 1 0 A&B
# 2 0 1 0 B
# 3 1 1 0 A&B
# 4 0 0 1 C
# 5 0 1 1 B&C
# 6 1 0 1 A&C
Similar, using tapply:
ix <- which(data == 1, arr.ind = TRUE)
cbind(data,
      new = tapply(ix[, "col"], ix[, "row"],
                   function(x) paste(names(data)[x], collapse = "&")))
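As an aside, the question's original '&'-separated output can also be repaired after the fact. A small sketch (my addition), assuming data$New still holds the strings from the question and that column names never contain '&':
# collapse runs of separators left by the empty strings, then trim the ends
data$New <- gsub("^&+|&+$", "", gsub("&{2,}", "&", data$New))
data$New
#[1] "A&B" "B"   "A&B" "C"   "B&C" "A&C"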

Mutate over every possible combination of columns

I have a data frame of binary variables:
df <- data.frame(a = c(0,1,0,1,0), b = c(1, 1, 0, 0, 1), c = c(1,0,1,1,0))
And I'd like to create a column for each possible combination of my pre-existing columns:
library(tidyverse)
df %>%
  mutate(d = case_when(a == 1 & b == 1 & c == 1 ~ 1),
         e = case_when(a == 1 & b == 1 & c != 1 ~ 1),
         f = case_when(a == 1 & b != 1 & c == 1 ~ 1),
         g = case_when(a != 1 & b == 1 & c == 1 ~ 1))
But my real dataset has too many columns to do this without a function or loop. Is there an easy way to do this in R?
First note that do.call(paste0, df) will combine all of your columns into one string per row, however many columns there are (the output below is for the six-row data defined further down):
do.call(paste0, df)
# [1] "011" "110" "001" "101" "010" "011"
Then you can use spread() from the tidyr package to give each combination its own column. Note that you have to add an extra row column so that spread() knows to keep the rows separate (instead of trying to combine them).
# I added a sixth row that copied the first to make the effect clear
df <- data.frame(a = c(0,1,0,1,0,0), b = c(1, 1, 0, 0, 1, 1), c = c(1,0,1,1,0,1))
# this assumes you want `type_` at the start of each new column,
# but you could use a different convention
df %>%
  mutate(type = paste0("type_", do.call(paste0, df)),
         value = 1,
         row = row_number()) %>%
  spread(type, value, fill = 0) %>%
  select(-row)
Result:
a b c type_001 type_010 type_011 type_101 type_110
1 0 0 1 1 0 0 0 0
2 0 1 0 0 1 0 0 0
3 0 1 1 0 0 1 0 0
4 0 1 1 0 0 1 0 0
5 1 0 1 0 0 0 1 0
6 1 1 0 0 0 0 0 1
An alternative to David's answer, but I recognize it's a little awkward:
df %>%
  unite(comb, a:c, remove = FALSE) %>%
  spread(key = comb, value = comb) %>%
  mutate_if(is.character, funs(if_else(is.na(.), 0, 1)))
#> a b c 0_0_1 0_1_0 0_1_1 1_0_1 1_1_0
#> 1 0 0 1 1 0 0 0 0
#> 2 0 1 0 0 1 0 0 0
#> 3 0 1 1 0 0 1 0 0
#> 4 1 0 1 0 0 0 1 0
#> 5 1 1 0 0 0 0 0 1
EDIT: funs() is being deprecated as of version 0.8.0 of dplyr, so the last line should be revised to:
mutate_if(is.character, list(~ if_else(is.na(.), 0, 1)))
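For completeness, a sketch of the same line in current dplyr (>= 1.0) style, where mutate_if() is superseded by across() with where():
mutate(across(where(is.character), ~ if_else(is.na(.x), 0, 1)))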

Multicolumn CSV text into presence/absence matrix

I have a csv file like this:
     col1    col2    col3
r1   a,b,c   e,f     g
r2   h,i             j,k
r3   l       m,n,o
Some cells contain multiple comma-separated values, some a single value, and some none. I want to convert this to:
col1 col2 col3
a 1 0 0
b 1 0 0
c 1 0 0
e 0 1 0
f 0 1 0
g 0 0 1
h 1 0 0
i 1 0 0
j 0 0 1
k 0 0 1
l 1 0 0
m 0 1 0
n 0 1 0
o 0 1 0
Any suggestions? I tried a pivot table in Excel but couldn't get the desired output.
Thanks in advance.
Best regards,
Zillur
Not sure whether this is the shortest solution (probably not), but it produces the desired output. Basically, we go through all three columns, count the occurrences of the strings, and get a long-format data frame that we then flip to the wide format you want.
library(tidyr)
library(purrr)
library(stringr) # needed for str_split()
library(tibble)  # needed for data_frame()
df <- data_frame(col1 = c("a,b,c", "h,i", "l"),
                 col2 = c("e,f", "", "m,n,o"),
                 col3 = c("g", "j,k", ""))
let_df <- map_df(df, function(col){
  # map_df applies the function to each column of df
  # split strings at "," and unlist to get a vector of letters
  letters <- unlist(str_split(col, ","))
  # delete ""
  letters <- letters[nchar(letters) > 0]
  # count occurrences of each letter
  tab <- table(letters)
  # replace with 1 if a letter occurs more often
  tab[tab > 1] <- 1
  # create a data frame from the table
  df <- data_frame(letter = names(tab), count = tab)
  return(df)
}, .id = "col") # .id adds a column "col" that contains col1 - col3
# bring the data frame into wide format
let_df %>%
  spread(col, count, fill = 0)
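With current tidyr, spread() is superseded; a hedged sketch of the equivalent reshaping (assumes tidyr >= 1.1 for the scalar values_fill):
let_df %>%
  pivot_wider(names_from = col, values_from = count, values_fill = 0)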
Such a great problem to solve. Here is my take on it in base R:
col1 <- c("a,b,c", "h,i", "l")
col2 <- c("e,f", "", "m,n,o")
col3 <- c("g", "j,k", "")
data <- data.frame(col1, col2, col3, stringsAsFactors = FALSE)

restructure <- function(df){
  df[df == ""] <- "missing"
  result_rows <- as.character()
  l <- list()
  for (i in seq_along(colnames(df))){
    df_col <- sort(unique(unlist(strsplit(gsub(" ", "", toString(df[[i]])), ","))))
    df_col <- df_col[!df_col %in% "missing"]
    result_rows <- sort(unique(c(result_rows, df_col)))
    l[i] <- list(df_col)
  }
  result <- data.frame(result_rows)
  for (j in seq_along(l)){
    result$temp <- NA
    result$temp[match(l[[j]], result_rows)] <- 1
    colnames(result)[colnames(result) == "temp"] <- colnames(df)[j]
  }
  result[is.na(result)] <- 0
  return(result)
}
> restructure(data)
# result_rows col1 col2 col3
#1 a 1 0 0
#2 b 1 0 0
#3 c 1 0 0
#4 e 0 1 0
#5 f 0 1 0
#6 g 0 0 1
#7 h 1 0 0
#8 i 1 0 0
#9 j 0 0 1
#10 k 0 0 1
#11 l 1 0 0
#12 m 0 1 0
#13 n 0 1 0
#14 o 0 1 0

Faster alternative to nested for loops in R

Here is the scenario: I have a sample in which subjects are placed into any of three groups. Next, subjects from each group are grouped together, resulting in several "triplets" consisting of one subject from each group. I would like to count the number of times each subject from a given group (1, 2, or 3) is grouped with each subject i from a different original group.
Here is a simple code example:
data <- cbind(c(1:9), c(rep("Group 1", 3), rep("Group 2", 3), rep("Group 3", 3)))
data <- data.frame(data)
names(data) <- c("ID", "Group")
groups.of.3 <- data.frame(rbind(c(1,4,7),c(2,4,7),c(2,5,7),c(3,6,8),c(3,6,9)))
N <- nrow(data)
n1 <- nrow(data[data$Group == "Group 1", ])
n2 <- nrow(data[data$Group == "Group 2", ])
n3 <- nrow(data[data$Group == "Group 3", ])
# Check the number of times a subject from a group is grouped with a subject i
# from another group
M1 <- matrix(0, nrow = N, ncol = n1)
M2 <- matrix(0, nrow = N, ncol = n2)
M3 <- matrix(0, nrow = N, ncol = n3)
for (i in 1:N){
  if (data$Group[i] != "Group 1"){
    for (j in 1:n1){
      M1[i,j] <- nrow(groups.of.3[groups.of.3[,1] == j &
                                    (groups.of.3[,2] == i |
                                     groups.of.3[,3] == i), ])
    }
  }
  if (data$Group[i] != "Group 2"){
    for (j in 1:n2){
      M2[i,j] <- nrow(groups.of.3[groups.of.3[,2] == (n1 + j) &
                                    (groups.of.3[,1] == i |
                                     groups.of.3[,3] == i), ])
    }
  }
  if (data$Group[i] != "Group 3"){
    for (j in 1:n3){
      M3[i,j] <- nrow(groups.of.3[groups.of.3[,3] == (n1 + n2 + j) &
                                    (groups.of.3[,1] == i |
                                     groups.of.3[,2] == i), ])
    }
  }
}
So I have 9 subjects, with three from each group. And then subjects from each group are subsequently grouped together (allowing for repetition of placement). This takes a lot longer with more subjects, and I am wondering if there is a faster alternative that avoids using for loops.
For instance, the matrix M1 consists of how many times subjects in Group 1 were subsequently grouped with other subjects from any other group:
M1
      [,1] [,2] [,3]
 [1,]    0    0    0
 [2,]    0    0    0
 [3,]    0    0    0
 [4,]    1    1    0
 [5,]    0    1    0
 [6,]    0    0    2
 [7,]    1    2    0
 [8,]    0    0    1
 [9,]    0    0    1
So the 3 columns represent the three subjects from Group 1, and the rows represent all subjects - the entries are how many times each subject from Group 1 is grouped with any of the other subjects (e.g., according to groups.of.3, subject 3 appears in a group with subject 6 twice, and subject 1 with subject 7 once).
Thanks for any help!
Something like this?
library(tidyr)
library(dplyr)
data <- data %>%
  mutate(ID = as.numeric(levels(ID))[ID])
tmp <- groups.of.3 %>%
  add_rownames() %>%
  gather("X", "Person", -rowname) %>%
  inner_join(data, by = c("Person" = "ID"))
tmp %>%
  inner_join(tmp, by = c("rowname")) %>%
  filter(Group.x != Group.y) %>%
  group_by(Person.x, Group.x, Group.y) %>%
  summarise(N = n()) %>%
  spread(key = Group.y, value = N, fill = 0)
  Person.x Group.x Group 1 Group 2 Group 3
     (dbl)  (fctr)   (dbl)   (dbl)   (dbl)
1        1 Group 1       0       1       1
2        2 Group 1       0       2       2
3        3 Group 1       0       2       2
4        4 Group 2       2       0       2
5        5 Group 2       1       0       1
6        6 Group 2       2       0       2
7        7 Group 3       3       3       0
8        8 Group 3       1       1       0
9        9 Group 3       1       1       0
For loops aren't inherently slow:
# coerce the fields in groups.of.3 to factors with the subject IDs as
# common levels, so each table() below is a full 9 x 9 cross-tabulation
for(i in 1:3)
  groups.of.3[, i] <- factor(groups.of.3[, i], levels = data$ID)
M <- matrix(0, N, N)
for(i in 1:(3-1))
  for(j in (i+1):3) {
    tab <- table(groups.of.3[, i], groups.of.3[, j])
    M <- M + tab + t(tab) # count both directions so M is symmetric
  }
M1 <- M[, as.integer(data$Group) == 1]
M2 <- M[, as.integer(data$Group) == 2]
M3 <- M[, as.integer(data$Group) == 3]
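A quick spot-check against the M1 shown in the question:
M1[4, 1] # 1: subjects 1 and 4 share one group, (1,4,7)
M1[7, 2] # 2: subjects 2 and 7 share two groups, (2,4,7) and (2,5,7)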
I'll answer my own question, using a very slight modification of Thierry's answer:
library(tidyr)
library(dplyr)
data <- data %>%
  mutate(ID = as.numeric(levels(ID))[ID])
tmp <- groups.of.3 %>%
  add_rownames() %>%
  gather("X", "Person", -rowname) %>%
  inner_join(data, by = c("Person" = "ID"))
tmp %>%
  inner_join(tmp, by = c("rowname")) %>%
  filter(Group.x != Group.y) %>%
  group_by(Person.x, Group.x, Person.y) %>%
  summarise(N = n()) %>%
  spread(key = Person.y, value = N, fill = 0)
This gives the following output, which is M1, M2, and M3 from the previous for loop joined together.
Source: local data frame [9 x 11]
  Person.x Group.x     1     2     3     4     5     6     7     8     9
     (dbl)  (fctr) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
1        1 Group 1     0     0     0     1     0     0     1     0     0
2        2 Group 1     0     0     0     1     1     0     2     0     0
3        3 Group 1     0     0     0     0     0     2     0     1     1
4        4 Group 2     1     1     0     0     0     0     2     0     0
5        5 Group 2     0     1     0     0     0     0     1     0     0
6        6 Group 2     0     0     2     0     0     0     0     1     1
7        7 Group 3     1     2     0     2     1     0     0     0     0
8        8 Group 3     0     0     1     0     0     1     0     0     0
9        9 Group 3     0     0     1     0     0     1     0     0     0
