Is there a simple approach to converting a data frame of dummies (binary coded, indicating whether an aspect is present) into a co-occurrence matrix containing the counts of two aspects co-occurring?
E.g. going from this
X <- data.frame(rbind(c(1,0,1,0), c(0,1,1,0), c(0,1,1,1), c(0,0,1,0)))
X
X1 X2 X3 X4
1 1 0 1 0
2 0 1 1 0
3 0 1 1 1
4 0 0 1 0
to this
X1 X2 X3 X4
X1 0 0 1 0
X2 0 0 2 1
X3 1 2 0 1
X4 0 1 1 0
This will do the trick:
X <- as.matrix(X)
out <- crossprod(X) # Same as: t(X) %*% X
diag(out) <- 0 # (b/c you don't count co-occurrences of an aspect with itself)
out
# [,1] [,2] [,3] [,4]
# [1,] 0 0 1 0
# [2,] 0 0 2 1
# [3,] 1 2 0 1
# [4,] 0 1 1 0
To get the results into a data.frame exactly like the one you showed, you can then do something like:
nms <- paste("X", 1:4, sep="")
dimnames(out) <- list(nms, nms)
out <- as.data.frame(out)
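Why this works: entry (i, j) of t(X) %*% X is the sum over rows of X[, i] * X[, j], and for 0/1 data that product is 1 only when both aspects are present in the same row. A minimal sketch using the data from the question (the name co is only illustrative); in current R versions as.matrix() on a data frame should keep the column names, so the entries can be looked up by name:

```r
X <- data.frame(rbind(c(1,0,1,0), c(0,1,1,0), c(0,1,1,1), c(0,0,1,0)))

# as.matrix() keeps the column names X1..X4, and crossprod() carries them
# over to both dimensions of the result
co <- crossprod(as.matrix(X))
diag(co) <- 0   # drop each aspect's count with itself

co["X2", "X3"]  # rows 2 and 3 contain both X2 and X3
```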
Though nothing can match the simplicity of the answer above, here is a tidyverse approach for future reference:
library(tidyverse)
Y <- X %>% mutate(id = row_number()) %>%
pivot_longer(-id) %>% filter(value !=0)
merge(Y, Y, by = "id", all = T) %>%
filter(name.x != name.y) %>%
group_by(name.x, name.y) %>%
summarise(val = n()) %>%
pivot_wider(names_from = name.y, values_from = val, values_fill = 0, names_sort = T) %>%
column_to_rownames("name.x")
X1 X2 X3 X4
X1 0 0 1 0
X2 0 0 2 1
X3 1 2 0 1
X4 0 1 1 0
What I do is to create dummies to indicate whether a continuous variable exceeds a certain threshold (1) or is below this threshold (0). I achieved this via several repetitive mutates, which I would like to substitute with a loop.
# load tidyverse
library(tidyverse)
# create data
data <- data.frame(x = runif(100, min=0, max=100))
# What I do
data <- data %>%
mutate(x20 = ifelse(x >= 20, 1, 0)) %>%
mutate(x40 = ifelse(x >= 40, 1, 0)) %>%
mutate(x60 = ifelse(x >= 60, 1, 0)) %>%
mutate(x80 = ifelse(x >= 80, 1, 0))
# What I would like to do
for (i in seq(from=0, to=100, by=20)){
data %>% mutate(paste(x,i) = ifelse(x >= i, 1,0))
}
Thank you.
You can use map_dfc here:
library(dplyr)
library(purrr)
breaks <- seq(from=0, to=100, by=20)
bind_cols(data, map_dfc(breaks, ~
  data %>% transmute(!!paste0('x', .x) := as.integer(x >= .x))))
# x x0 x20 x40 x60 x80 x100
#1 6.2772517 1 0 0 0 0 0
#2 16.3520358 1 0 0 0 0 0
#3 25.8958212 1 1 0 0 0 0
#4 78.9354970 1 1 1 1 0 0
#5 35.7731737 1 1 0 0 0 0
#6 5.7395139 1 0 0 0 0 0
#7 49.7069551 1 1 1 0 0 0
#8 53.5134559 1 1 1 0 0 0
#...
#....
Although I think it is much simpler in base R:
data[paste0('x', breaks)] <- lapply(breaks, function(b) as.integer(data$x >= b))
You can use reduce() in purrr.
library(dplyr)
library(purrr)
reduce(seq(0, 100, by = 20), .init = data,
~ mutate(.x, !!paste0('x', .y) := as.integer(x >= .y)))
# x x0 x20 x40 x60 x80 x100
# 1 61.080545 1 1 1 1 0 0
# 2 63.036673 1 1 1 1 0 0
# 3 71.064322 1 1 1 1 0 0
# 4 1.821416 1 0 0 0 0 0
# 5 24.721454 1 1 0 0 0 0
The corresponding base way with Reduce():
Reduce(function(df, y){ df[paste0('x', y)] <- as.integer(df$x >= y); df },
seq(0, 100, by = 20), data)
Ronak's base R is probably the best, but for completeness here's another way similar to how you were originally doing it, just with dplyr:
for (i in seq(from=0, to=100, by=20)){
var <- paste0('x',i)
data <- mutate(data, !!var := ifelse(x >= i, 1,0))
}
x x0 x20 x40 x60 x80 x100
1 99.735037 1 1 1 1 1 0
2 9.075226 1 0 0 0 0 0
3 73.786282 1 1 1 1 0 0
4 89.744719 1 1 1 1 1 0
5 34.139207 1 1 0 0 0 0
6 88.138611 1 1 1 1 1 0
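Another base R option, not taken from the answers above, is to build all threshold dummies in a single call with outer(). This is only a sketch: the fixed seed and the name dummies are assumptions made for the example.

```r
set.seed(1)  # assumption: fixed seed, only so the example is reproducible
data <- data.frame(x = runif(100, min = 0, max = 100))
breaks <- seq(from = 0, to = 100, by = 20)

# outer() compares every x with every break, giving a 100 x 6 logical matrix;
# multiplying by 1L turns TRUE/FALSE into 1/0
dummies <- outer(data$x, breaks, ">=") * 1L
colnames(dummies) <- paste0("x", breaks)

# cbind() on a data frame adds one column per matrix column
data <- cbind(data, dummies)
head(data)
```

The matrix form avoids both the explicit loop and the tidy-evaluation syntax (!! and :=), at the cost of building the whole matrix before attaching it.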
I have a dataframe in R with 4 variables and would like to create a new variable based on any 2 conditions being true on those variables.
I tried if/else statements, but that would require enumerating every combination of conditions being true. I would also need it to scale so that I can create a new variable based on any 3 conditions being true. Is there a more efficient method than if/else statements?
My example:
I have a dataframe X with following column variables
x1 = c(1,0,1,0)
X2 = c(0,0,0,0)
X3 = c(1,1,0,0)
X4 = c(0,0,1,0)
I would like to create a new variable X5 that is 1 if any 2 of the variables are true (e.g. == 1).
The new variable based on the above dataframe would produce X5 (1,0,1,0)
This can easily be done by using the apply function:
x1 = c(1,0,1,0)
x2 = c(0,0,0,0)
x3 = c(1,1,0,0)
x4 = c(0,0,1,0)
df <- data.frame(x1,x2,x3,x4)
df$x5 <- apply(df,1,function(row) ifelse(sum(row != 0) == 2, 1, 0))
x1 x2 x3 x4 x5
1 1 0 1 0 1
2 0 0 1 0 0
3 1 0 0 1 1
4 0 0 0 0 0
The second argument to apply (MARGIN = 1) means the function is applied to every row. To require 3, ..., N true values instead, just change the number in the ifelse statement.
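To make the threshold easy to change, the count can be wrapped in a small function. This is a sketch only: count_flag is a hypothetical helper name, not something from the answer, and it uses rowSums() rather than apply().

```r
df <- data.frame(x1 = c(1,0,1,0), x2 = c(0,0,0,0),
                 x3 = c(1,1,0,0), x4 = c(0,0,1,0))

# hypothetical helper: 1 where exactly k of the columns equal 1, else 0
count_flag <- function(d, k) as.integer(rowSums(d == 1) == k)

vars <- c("x1", "x2", "x3", "x4")
df$x5 <- count_flag(df[vars], 2)  # exactly two conditions true
df$x6 <- count_flag(df[vars], 3)  # exactly three conditions true
df
```

Selecting the input columns by name keeps newly created flags such as x5 out of later counts. If "at least k" is wanted rather than "exactly k", replace == k with >= k.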
You can try this:
#Data
df <- data.frame(x1,X2,X3,X4)
#Code
df$X5 <- ifelse(rowSums(df,na.rm=T)==2,1,0)
x1 X2 X3 X4 X5
1 1 0 1 0 1
2 0 0 1 0 0
3 1 0 0 1 1
4 0 0 0 0 0
You can use:
df$X5 <- 1*(apply(df == 1, 1, sum) == 2)
or
df$X5 <- 1*(do.call(mapply, c(list(FUN = sum), df)) == 2)
Output
> df
X1 X2 X3 X4 X5
1 1 0 1 0 1
2 0 0 1 0 0
3 1 0 0 1 1
4 0 0 0 0 0
Data
df <- data.frame(X1,X2,X3,X4)
How can I create a matrix of 0's and 1's from a data set with three columns labelled as hosp (i.e. hospital), pid (i.e. patient id) and treatment, as shown below
df<-
structure(list(
hosp=c(1L,1L,1L,1L,1L,1L,2L,2L,2L),
pid=c(1L,1L,1L,2L,3L,3L,4L,5L,5L),
treatment=c(0L,0L,0L,1L,1L,1L,0L,1L,1L)
),
.Names=c("hosp","pid","treatment"),
class="data.frame",row.names=c(NA,-9))
The rows and columns of the matrix should be the number of observations (in this case 9) and the unique number of hospitals, respectively. The entries in the matrix should be the treatment values, that is, it is 1 for a given hospital if the corresponding patient received treatment 1 in that hospital and 0 otherwise. The matrix should look like
matrix(c(0,0,
0,0,
0,0,
1,0,
1,0,
1,0,
0,0,
0,1,
0,1),nrow=9,byrow=TRUE)
Any help would be much appreciated, thanks.
1) Create a model matrix from hosp as a factor with no intercept term and multiply that by treatment:
hosp <- factor(df$hosp)
model.matrix(~ hosp + 0) * df$treatment
giving:
hosp1 hosp2
1 0 0
2 0 0
3 0 0
4 1 0
5 1 0
6 1 0
7 0 0
8 0 1
9 0 1
attr(,"assign")
[1] 1 1
attr(,"contrasts")
attr(,"contrasts")$hosp
[1] "contr.treatment"
2) outer(hosp, unique(hosp), "==") is the model matrix of hosp except using TRUE/FALSE in place of 1/0. Multiply that by treatment.
with(df, outer(hosp, unique(hosp), "==") * treatment)
giving
[,1] [,2]
[1,] 0 0
[2,] 0 0
[3,] 0 0
[4,] 1 0
[5,] 1 0
[6,] 1 0
[7,] 0 0
[8,] 0 1
[9,] 0 1
Update: Added (1) and simplified (2).
Here's my workaround for this. Not the cleanest, but it works!
library(dplyr)
library(tidyr) # spread() and gather() come from tidyr
df2 <- df %>%
mutate(x = row_number()) %>%
select(-pid) %>%
spread(x, treatment)
df3 <- df2 %>%
gather("keys", "value", 2:10) %>%
spread(hosp, value) %>%
select(-keys)
df3[is.na(df3)] <- 0
df3 <- as.matrix(df3)
Step by Step:
Take original df and add a row_number to it so we can spread without duplication. We'll also remove pid since you're changing this to a matrix.
library(dplyr)
library(tidyr) # spread() and gather() come from tidyr
df2 <- df %>%
mutate(x = row_number()) %>%
select(-pid) %>%
spread(x, treatment)
Then we want to change it back to long form:
df3 <- df2 %>%
gather("keys", "value", 2:10) %>%
spread(hosp, value) %>%
select(-keys)
Some of the values are still NA, so we convert them into 0s, and then turn the result into a matrix using `as.matrix()`:
df3[is.na(df3)] <- 0
df3 <- as.matrix(df3)
1 2
1 0 0
2 0 0
3 0 0
4 1 0
5 1 0
6 1 0
7 0 0
8 0 1
9 0 1
How about:
> sapply(unique(df$hosp),function(x) ifelse(df$hosp==x&df$treatment==1,1,0))
[,1] [,2]
[1,] 0 0
[2,] 0 0
[3,] 0 0
[4,] 1 0
[5,] 1 0
[6,] 1 0
[7,] 0 0
[8,] 0 1
[9,] 0 1
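For comparison, the same matrix can be filled directly with matrix indexing: start from all zeros and write each treatment value into the cell for (observation, hospital). This is a sketch of a different technique from the answers above; the names hosp_levels, m and idx are only illustrative.

```r
df <- data.frame(hosp = c(1L,1L,1L,1L,1L,1L,2L,2L,2L),
                 pid = c(1L,1L,1L,2L,3L,3L,4L,5L,5L),
                 treatment = c(0L,0L,0L,1L,1L,1L,0L,1L,1L))

hosp_levels <- unique(df$hosp)
m <- matrix(0L, nrow = nrow(df), ncol = length(hosp_levels))

# a two-column index matrix: row = observation, column = position of its hospital
idx <- cbind(seq_len(nrow(df)), match(df$hosp, hosp_levels))
m[idx] <- df$treatment
m
```

Because only one cell per row is written, this avoids building the full comparison matrix that outer() creates, which may matter with many hospitals.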
I have a dataframe:
> df <- data.frame(x = c('x1','x1','x2','x2','x2','x3','x3','x3'),
+ y = c(0,0,1,1,1,0,0,0),
+ z = c(1,1,0,0,0,0,0,0))
> df
x y z
1 x1 0 1
2 x1 0 1
3 x2 1 0
4 x2 1 0
5 x2 1 0
6 x3 0 0
7 x3 0 0
8 x3 0 0
I would like to subset the rows where the y column equals 1, keep the x and y columns for those rows, and set the 1s in y to 0.
I have only worked out the first step:
> length(which(df$y == 1))
[1] 3
How could I get a final output like this:
x y
x2 0
x2 0
x2 0
require(dplyr)
df %>%
filter(y == 1) %>%
select(x, y) %>%
mutate(y = 0)
transform(subset(df[1:2],y==1),y=0)
x y
3 x2 0
4 x2 0
5 x2 0
If you're open to using other packages, data.table is another option:
library(data.table)
setDT(df)[y == 1, .(x, y = 0)]
# x y
#1: x2 0
#2: x2 0
#3: x2 0
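The same three steps (filter rows, keep columns, overwrite y) can also be spelled out with plain base R indexing. A minimal sketch; res is just an illustrative name:

```r
df <- data.frame(x = c('x1','x1','x2','x2','x2','x3','x3','x3'),
                 y = c(0,0,1,1,1,0,0,0),
                 z = c(1,1,0,0,0,0,0,0))

res <- df[df$y == 1, c("x", "y")]  # rows where y is 1, columns x and y only
res$y <- 0                         # replace the 1s with 0
res
```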
Here is the scenario: I have a sample in which subjects are placed into any of three groups. Next, subjects from each group are grouped together, resulting in several "triplets" consisting of a subject from each group. I would like to count the number of times a subject from a given group (1, 2, or 3) is grouped with a subject i of a different original group.
Here is a simple code example:
data <- cbind(c(1:9), c(rep("Group 1", 3), rep("Group 2", 3), rep("Group 3", 3)))
data <- data.frame(data)
names(data) <- c("ID", "Group")
groups.of.3 <- data.frame(rbind(c(1,4,7),c(2,4,7),c(2,5,7),c(3,6,8),c(3,6,9)))
N <- nrow(data)
n1 <- nrow(data[data$Group == "Group 1", ])
n2 <- nrow(data[data$Group == "Group 2", ])
n3 <- nrow(data[data$Group == "Group 3", ])
# Check the number of times a subject from a group is grouped with a subject i
# from another group
M1 <- matrix(0, nrow = N, ncol = n1)
M2 <- matrix(0, nrow = N, ncol = n2)
M3 <- matrix(0, nrow = N, ncol = n3)
for (i in 1:N){
if (data$Group[i] != "Group 1"){
for (j in 1:n1){
M1[i,j] <- nrow(groups.of.3[groups.of.3[,1] == j &
(groups.of.3[,2] == i |
groups.of.3[,3] == i), ])
}
}
if (data$Group[i] != "Group 2"){
for (j in 1:n2){
M2[i,j] <- nrow(groups.of.3[groups.of.3[,2] == (n1 + j) &
(groups.of.3[,1] == i |
groups.of.3[,3] == i), ])
}
}
if (data$Group[i] != "Group 3"){
for (j in 1:n3){
M3[i,j] <- nrow(groups.of.3[groups.of.3[,3] == (n1 + n2 + j) &
(groups.of.3[,1] == i |
groups.of.3[,2] == i), ])
}
}
}
So I have 9 subjects, with three from each group. And then subjects from each group are subsequently grouped together (allowing for repetition of placement). This takes a lot longer with more subjects, and I am wondering if there is a faster alternative that avoids using for loops.
For instance, the matrix M1 consists of how many times subjects in Group 1 were subsequently grouped with other subjects from any other group:
M1
[,1] [,2] [,3]
[1,] 0 0 0
[2,] 0 0 0
[3,] 0 0 0
[4,] 1 1 0
[5,] 0 1 0
[6,] 0 0 2
[7,] 1 2 0
[8,] 0 0 1
[9,] 0 0 1
So the 3 columns represent the three subjects from Group 1, and the rows represent all subjects - the entries are how many times each subject from Group 1 is grouped with any of the other subjects (e.g., according to groups.of.3, subject 3 appears in a group with subject 6 twice, and subject 1 with subject 7 once).
Thanks for any help!
Something like this?
library(tidyr)
library(dplyr)
data <- data %>%
mutate(ID = as.numeric(as.character(ID)))
tmp <- groups.of.3 %>%
tibble::rownames_to_column() %>%
gather("X", "Person", -rowname) %>%
inner_join(data, by = c("Person" = "ID"))
tmp %>%
inner_join(tmp, by = c("rowname")) %>%
filter(Group.x != Group.y) %>%
group_by(Person.x, Group.x, Group.y) %>%
summarise(N = n()) %>%
spread(key = Group.y, value = N, fill = 0)
Person.x Group.x Group 1 Group 2 Group 3
(dbl) (fctr) (dbl) (dbl) (dbl)
1 1 Group 1 0 1 1
2 2 Group 1 0 2 2
3 3 Group 1 0 2 2
4 4 Group 2 2 0 2
5 5 Group 2 1 0 1
6 6 Group 2 2 0 2
7 7 Group 3 3 3 0
8 8 Group 3 1 1 0
9 9 Group 3 1 1 0
For loops aren't inherently slow:
# coerce the columns of groups.of.3 to factors that share the same levels
for(i in 1:3)
  groups.of.3[,i] <- factor(groups.of.3[,i], levels = as.character(data$ID))
M <- matrix(0, N, N)
# add up the pairwise tables for every pair of columns
for(i in 1:(3-1))
  for(j in (i+1):3)
    M <- M + table(groups.of.3[,i], groups.of.3[,j])
# the tables only count pairs in one direction, so add the transpose
M <- M + t(M)
M1 <- M[, data$Group == "Group 1"]
M2 <- M[, data$Group == "Group 2"]
M3 <- M[, data$Group == "Group 3"]
I'll answer my own question, using a very slight modification of Thierry's answer:
library(tidyr)
library(dplyr)
data <- data %>%
mutate(ID = as.numeric(as.character(ID)))
tmp <- groups.of.3 %>%
tibble::rownames_to_column() %>%
gather("X", "Person", -rowname) %>%
inner_join(data, by = c("Person" = "ID"))
tmp %>%
inner_join(tmp, by = c("rowname")) %>%
filter(Group.x != Group.y) %>%
group_by(Person.x, Group.x, Person.y) %>%
summarise(N = n()) %>%
spread(key = Person.y, value = N, fill = 0)
This gives the following output, which includes M1, M2, and M3 from the previous for loop, adjoined together.
Source: local data frame [9 x 11]
Person.x Group.x 1 2 3 4 5 6 7 8 9
(dbl) (fctr) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
1 1 Group 1 0 0 0 1 0 0 1 0 0
2 2 Group 1 0 0 0 1 1 0 2 0 0
3 3 Group 1 0 0 0 0 0 2 0 1 1
4 4 Group 2 1 1 0 0 0 0 2 0 0
5 5 Group 2 0 1 0 0 0 0 1 0 0
6 6 Group 2 0 0 2 0 0 0 0 1 1
7 7 Group 3 1 2 0 2 1 0 0 0 0
8 8 Group 3 0 0 1 0 0 1 0 0 0
9 9 Group 3 0 0 1 0 0 1 0 0 0
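The same counts can also be obtained without loops or joins from a cross product: build a 0/1 incidence matrix with one row per triplet and one column per subject, and take crossprod() of it. This is a sketch of an alternative, not the method used above; the names group, inc and M are illustrative, and subject IDs are assumed to be 1..N.

```r
groups.of.3 <- data.frame(rbind(c(1,4,7), c(2,4,7), c(2,5,7), c(3,6,8), c(3,6,9)))
group <- rep(c("Group 1", "Group 2", "Group 3"), each = 3)  # group of subjects 1..9
N <- length(group)
n.trip <- nrow(groups.of.3)

# incidence matrix: inc[t, s] is 1 if subject s is a member of triplet t
inc <- matrix(0, nrow = n.trip, ncol = N)
inc[cbind(rep(seq_len(n.trip), 3), unname(unlist(groups.of.3)))] <- 1

# M[a, b] = number of triplets that contain both subject a and subject b
M <- crossprod(inc)
M[outer(group, group, "==")] <- 0  # ignore pairs from the same original group

M1 <- M[, group == "Group 1"]
M1
```

M2 and M3 follow by selecting the columns for "Group 2" and "Group 3" in the same way.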