Cross comparison of columns of the same data.frame - r

I have a data.frame that looks like this:
> DF1
A B C D E
a x c h p
c d q t w
s e r p a
w l t s i
p i y a f
I would like to compare each column of my data.frame with the remaining columns in order to count the number of common elements. For example, I would like to compare column A with all the remaining columns (B, C, D, E) and count the common entities in this way:
A versus the remaining:
A vs B: 0 (because they have 0 common elements)
A vs C: 1 (c in common)
A vs D: 3 (a, p and s in common)
A vs E: 3 (p, w and a in common)
Then the same: B versus C,D,E columns and so on.
How can I implement this?

We can loop over the column names, intersect each column with every other column, and take the length of each intersection:
sapply(names(DF1), function(x) {
x1 <- lengths(Map(intersect, DF1[setdiff(names(DF1), x)], DF1[x]))
c(x1, setNames(0, setdiff(names(DF1), names(x1))))[names(DF1)]})
# A B C D E
#A 0 0 1 3 3
#B 0 0 0 0 1
#C 1 0 0 1 0
#D 3 0 1 0 2
#E 3 1 0 2 0
This can be done more compactly by melting the data to long format, tabulating column against value, and taking the cross product of that table:
library(reshape2)
tcrossprod(table(melt(as.matrix(DF1))[-1])) * !diag(5)
# Var2
#Var2 A B C D E
# A 0 0 1 3 3
# B 0 0 0 0 1
# C 1 0 0 1 0
# D 3 0 1 0 2
# E 3 1 0 2 0
NOTE: The crossprod step can also be implemented with RcppEigen, which would make this faster.
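The cross-product idea can also be sketched in base R without reshape2. This is a minimal illustration rather than the answerer's code: DF1 is rebuilt from the question, and the names inc and common are made up for the example.

```r
# Example data from the question, rebuilt so the sketch is self-contained
DF1 <- data.frame(
  A = c("a", "c", "s", "w", "p"),
  B = c("x", "d", "e", "l", "i"),
  C = c("c", "q", "r", "t", "y"),
  D = c("h", "t", "p", "s", "a"),
  E = c("p", "w", "a", "i", "f"),
  stringsAsFactors = FALSE
)

# Incidence matrix: one row per column of DF1, one column per distinct value.
# Entries are 0/1, so a value repeated within a column is not double counted.
vals <- sort(unique(unlist(DF1)))
inc <- sapply(vals, function(v) colSums(DF1 == v) > 0) * 1

# Cross product of the rows counts the values two columns share
common <- tcrossprod(inc)
diag(common) <- 0   # a column compared with itself is not of interest
common
```

Capping the incidence entries at 1 is a design choice: it keeps the counts equal to the size of the set intersection even if a column contains duplicates.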

An alternative is to use combn twice: once to get the column combinations and once to find the length of each pairwise intersection.
cbind.data.frame returns a data.frame, and setNames adds the column names.
setNames(cbind.data.frame(t(combn(names(DF1), 2)),
combn(names(DF1), 2, function(x) length(intersect(DF1[, x[1]], DF1[, x[2]])))),
c("col1", "col2", "count"))
col1 col2 count
1 A B 0
2 A C 1
3 A D 3
4 A E 3
5 B C 0
6 B D 0
7 B E 1
8 C D 1
9 C E 0
10 D E 2
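A variant that calls combn only once and then filters the comparisons involving one column might look like the sketch below. DF1 is rebuilt from the question; pairs, counts and res are illustrative names, not part of the original answer.

```r
# Example data from the question
DF1 <- data.frame(
  A = c("a", "c", "s", "w", "p"),
  B = c("x", "d", "e", "l", "i"),
  C = c("c", "q", "r", "t", "y"),
  D = c("h", "t", "p", "s", "a"),
  E = c("p", "w", "a", "i", "f"),
  stringsAsFactors = FALSE
)

# All unordered column pairs, one pair per matrix column
pairs <- combn(names(DF1), 2)

# Size of the intersection for each pair
counts <- apply(pairs, 2, function(p) {
  length(intersect(DF1[[p[1]]], DF1[[p[2]]]))
})

res <- data.frame(col1 = pairs[1, ], col2 = pairs[2, ], count = counts)

# "A versus the remaining": keep the rows that involve column A
res[res$col1 == "A" | res$col2 == "A", ]
```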

Related

Generate pairwise movement data from sequence

I have a sequence which looks like this
SEQENCE
1 A
2 B
3 B
4 C
5 A
Now from this sequence, I want to get a matrix like this, where the element in the ith row and jth column counts how many times a movement occurred from the ith row node to the jth column node:
A B C
A 0 1 0
B 0 1 1
C 1 0 0
How can I get this in R?
1) Use table like this (rows are the destination node, columns the origin node):
s <- DF[, 1]
table(tail(s, -1), head(s, -1))
giving:
A B C
A 0 0 1
B 1 1 0
C 0 1 0
2) Or use embed. Since embed does not work with factors, we first convert the factor to character:
s <- as.character(DF[, 1])
do.call(table, data.frame(embed(s, 2)))
giving:
X2
X1 A B C
A 0 0 1
B 1 1 0
C 0 1 0
3) xtabs also works:
s <- as.character(DF[, 1])
xtabs(data = data.frame(embed(s, 2)))
giving:
X2
X1 A B C
A 0 0 1
B 1 1 0
C 0 1 0
Note: The input DF in reproducible form is:
Lines <- " SEQENCE
1 A
2 B
3 B
4 C
5 A"
DF <- read.table(text = Lines, header = TRUE)
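The outputs above have the destination in the rows, which is the transpose of the matrix requested in the question. A minimal sketch that puts the origin in the rows, assuming the sequence is available as a character vector s (the names lev and moves are illustrative):

```r
# The sequence from the question
s <- c("A", "B", "B", "C", "A")

# Shared levels keep the table square even if some node never
# appears as an origin or as a destination
lev <- sort(unique(s))

# head(s, -1) is where each move starts, tail(s, -1) is where it ends
moves <- table(from = factor(head(s, -1), levels = lev),
               to   = factor(tail(s, -1), levels = lev))
moves
```

Here moves["B", "C"] counts the moves from B to C, matching the row-to-column reading used in the question.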

Convert list of individuals to occurrence of pairs in R

I need a specific data.frame format for social structure analysis. How can I convert a data.frame containing a list of individuals occurring together at multiple events:
my.df <- data.frame(individual = c("A","B","C","B","C","D"),
time = rep(c("event_01","event_02"), each = 3))
individual time
1 A event_01
2 B event_01
3 C event_01
4 B event_02
5 C event_02
6 D event_02
into a data.frame containing the number of co-occurrences for each pair (including the [A,A], [B,B], etc. pairs):
ind_1 ind_2 times
A A 0
A B 1
A C 1
A D 0
B A 1
B B 0
B C 2
B D 1
C A 1
C B 2
C C 0
C D 1
D A 0
D B 1
D C 1
D D 0
In base R, you could do the following:
data.frame(as.table(`diag<-`(tcrossprod(table(my.df)), 0)))
# individual individual.1 Freq
# 1 A A 0
# 2 B A 1
# 3 C A 1
# 4 D A 0
# 5 A B 1
# 6 B B 0
# 7 C B 2
# 8 D B 1
# 9 A C 1
# 10 B C 2
# 11 C C 0
# 12 D C 1
# 13 A D 0
# 14 B D 1
# 15 C D 1
# 16 D D 0
tcrossprod gives you the following:
> tcrossprod(table(my.df))
individual
individual A B C D
A 1 1 1 0
B 1 2 2 1
C 1 2 2 1
D 0 1 1 1
That's essentially all the information you are looking for, but you want it in a slightly different form, without the diagonal values.
We can set the diagonals to zero with:
`diag<-`(theOutputFromAbove, 0)
Then, to get the long form, trick R into thinking that the resulting matrix is a table by using as.table, and make use of the data.frame method for tables.
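The same steps can be spelled out one at a time, ending with the column names and row order requested in the question. This is a sketch under the assumption that every individual appears at most once per event; inc, co and res are illustrative names.

```r
my.df <- data.frame(individual = c("A", "B", "C", "B", "C", "D"),
                    time = rep(c("event_01", "event_02"), each = 3),
                    stringsAsFactors = FALSE)

# Individuals x events incidence table (1 if the individual attended)
inc <- table(my.df$individual, my.df$time)

# Individuals x individuals: number of shared events
co <- tcrossprod(inc)
diag(co) <- 0   # pairs of an individual with itself are set to 0

# Long format with the requested column names
res <- as.data.frame(as.table(co), stringsAsFactors = FALSE)
names(res) <- c("ind_1", "ind_2", "times")
res <- res[order(res$ind_1, res$ind_2), ]
rownames(res) <- NULL
res
```

Because co is symmetric, it does not matter which of the two dimensions is labelled ind_1.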
You can do it step by step.
Create the first two variables of the new data.frame (factor() is wrapped around the column so that levels() also works when it is stored as character):
df2 <- expand.grid(ind_2=levels(factor(my.df$individual)), ind_1=levels(factor(my.df$individual)))[, 2:1]
Set the value to 0 for pairs of the same individual:
df2$times[df2[, 1]==df2[, 2]] <- 0
List the other unique combinations:
comb_diff <- combn(levels(factor(my.df$individual)), 2)
Compute the number of times each unique combination is found together:
times_uni <- apply(comb_diff, 2, function(inds){
sum(table(my.df$time[my.df$individual %in% inds])==2)
})
Finally, fill the new data.frame:
df2$times[match(c(paste0(comb_diff[1,], comb_diff[2,]), paste0(comb_diff[2, ], comb_diff[1, ])), paste0(df2[, 1],df2[, 2]))] <- rep(times_uni, 2)
df2
# ind_1 ind_2 times
#1 A A 0
#2 A B 1
#3 A C 1
#4 A D 0
#5 B A 1
#6 B B 0
#7 B C 2
#8 B D 1
#9 C A 1
#10 C B 2
#11 C C 0
#12 C D 1
#13 D A 0
#14 D B 1
#15 D C 1
#16 D D 0
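You can do it using data.table (my.dt is assumed to be my.df converted to a data.table):
library(data.table)
my.dt <- as.data.table(my.df)
dt_combs <- my.dt[,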
list(ind_1 = combn(individual, 2)[1, ],
ind_2 = combn(individual, 2)[2, ]),
by = time]
dt_ncombs <- dt_combs[, .N, by = c("ind_1", "ind_2")]
dt_ncombs_inverted <- copy(dt_ncombs)
dt_ncombs_inverted[, temp := ind_1]
dt_ncombs_inverted[, ind_1 := ind_2]
dt_ncombs_inverted[, ind_2 := temp]
dt_ncombs_inverted[, temp := NULL]
dt_ncombs <- rbind(dt_ncombs, dt_ncombs_inverted)
dt_allcombs <- data.table(expand.grid(
ind_1 = my.dt[, unique(individual)],
ind_2 = my.dt[, unique(individual)]
))
dt_final <- merge(dt_allcombs,
dt_ncombs,
all.x = TRUE,
by = c("ind_1", "ind_2"))
dt_final[is.na(N), N := 0]
dt_final
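The pair-then-count idea can also be sketched in base R with a self-merge on time. This is an illustrative alternative, not a translation of the data.table code; pairs, inds, counts and res are made-up names.

```r
my.df <- data.frame(individual = c("A", "B", "C", "B", "C", "D"),
                    time = rep(c("event_01", "event_02"), each = 3),
                    stringsAsFactors = FALSE)

# Self-merge: every ordered pair of individuals present at the same event
pairs <- merge(my.df, my.df, by = "time", suffixes = c("_1", "_2"))
pairs <- pairs[pairs$individual_1 != pairs$individual_2, ]

# Tabulate with fixed levels so pairs that never meet still get a 0
inds <- sort(unique(my.df$individual))
counts <- table(ind_1 = factor(pairs$individual_1, levels = inds),
                ind_2 = factor(pairs$individual_2, levels = inds))

res <- as.data.frame(counts, responseName = "times")
res[order(res$ind_1, res$ind_2), ]
```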

table function in R

I'm using the table function in R to create a table of two of my variables in R.
I have a data.frame like this
mytable
V1 V2 V3
1 a c
2 c d
3 b b
4 d a
5 d c
when I use the table function table(mytable$V2, mytable$V3) I get the following
a b c d
a 0 0 1 0
b 0 1 0 0
c 0 0 0 1
d 1 0 1 0
Now I actually want to treat 'a-b' the same as 'b-a', and 'b-c' the same as 'c-b'. Therefore I want a table where everything above the diagonal is empty, with the values from above the diagonal added to the corresponding values below the diagonal. How can I do this in R?
And furthermore I want this table to be represented as a heatmap, which I normally do with ggplot2. But I don't know if this also works for a table that will have empty values above the diagonal.
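Here is a manual way of fixing it. Each cell below the diagonal is paired with its mirror cell through t(tab), because lower.tri and upper.tri do not enumerate mirrored cells in the same order:
tab <- table(mytable[,2:3])
tab[lower.tri(tab)] <- tab[lower.tri(tab)] + t(tab)[lower.tri(tab)]
tab[upper.tri(tab)] <- NA
tab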
# V3
#V2 a b c d
# a 0
# b 0 1
# c 1 0 0
# d 1 0 2 0
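Folding only makes sense when rows and columns share the same categories in the same order. A sketch that enforces this with common factor levels and folds by adding the transpose is shown here; the two columns are rebuilt from the question, and lev, m, folded and long are illustrative names.

```r
mytable <- data.frame(V2 = c("a", "c", "b", "d", "d"),
                      V3 = c("c", "d", "b", "a", "c"),
                      stringsAsFactors = FALSE)

# Common levels guarantee a square table with matching row/column order
lev <- sort(union(mytable$V2, mytable$V3))
tab <- table(factor(mytable$V2, levels = lev),
             factor(mytable$V3, levels = lev))

m <- unclass(tab)                 # plain integer matrix
folded <- m + t(m)                # 'a-c' and 'c-a' now share one count
diag(folded) <- diag(m)           # the diagonal must not be doubled
folded[upper.tri(folded)] <- NA   # blank out the redundant upper half
folded

# Long format (Var1, Var2, Freq), a shape plotting functions usually expect
long <- as.data.frame(as.table(folded))
```

For the heatmap, long could presumably be passed to ggplot2 with geom_tile; cells holding NA would then be drawn in the fill scale's na.value colour, but that part is not tested here.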

Creating subgroups from categorical data by using lapply in R

I was wondering if you kind folks could answer a question I have. In the sample data I've provided below, in column 1 I have a categorical variable, and in column 2 p-values.
x <- c(rep("A",0.1*10000),rep("B",0.2*10000),rep("C",0.65*10000),rep("D",0.05*10000))
categorical_data=as.matrix(sample(x,10000))
p_val=as.matrix(runif(10000,0,1))
combi=as.data.frame(cbind(categorical_data,p_val))
head(combi)
V1 V2
1 A 0.484525170875713
2 C 0.48046557046473
3 C 0.228440979029983
4 B 0.216991128632799
5 C 0.521497668232769
6 D 0.358560319757089
I want to now take one of the categorical variables, let's say "C", and create another variable if it is C (print 1 in column 3, or 0 if it isn't).
combi$NEWVAR[combi$V1=="C"] <-1
combi$NEWVAR[combi$V1!="C"] <-0
V1 V2 NEWVAR
1 A 0.484525170875713 0
2 C 0.48046557046473 1
3 C 0.228440979029983 1
4 B 0.216991128632799 0
5 C 0.521497668232769 1
6 D 0.358560319757089 0
I'd like to do this for each of the variables in V1, and then loop over using lapply:
variables=unique(combi$V1)
loopeddata=lapply(variables,function(x){
combi$NEWVAR[combi$V1==x] <-1
combi$NEWVAR[combi$V1!=x]<-0
}
)
My output however looks like this:
[[1]]
[1] 0
[[2]]
[1] 0
[[3]]
[1] 0
[[4]]
[1] 0
My desired output would look like the table in the second block of code, but with one version per category: first NEWVAR is 1 for A and 0 for B, C and D; then 1 for B and 0 for A, C and D; and so on.
If anyone could help me out that would be very much appreciated.
How about something like this:
model.matrix(~ -1 + V1, data=combi)
Then you can cbind it to combi if you desire:
combi <- cbind(combi, model.matrix(~ -1 + V1, data=combi))
model.matrix is definitely the way to do this in R. You can, however, also consider using table.
Here's an example using the result I get when using set.seed(1) (always use a seed when sharing example problems with random data).
LoopedData <- table(sequence(nrow(combi)), combi$V1)
head(LoopedData)
#
# A B C D
# 1 0 1 0 0
# 2 0 0 1 0
# 3 0 0 1 0
# 4 0 0 1 0
# 5 0 1 0 0
# 6 0 0 1 0
## If you want to bind it back with the original data
combi <- cbind(combi, as.data.frame.matrix(LoopedData))
head(combi)
# V1 V2 A B C D
# 1 B 0.0647124934475869 0 1 0 0
# 2 C 0.676612401846796 0 0 1 0
# 3 C 0.735371692571789 0 0 1 0
# 4 C 0.111299667274579 0 0 1 0
# 5 B 0.0466546178795397 0 1 0 0
# 6 C 0.130910312291235 0 0 1 0
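As for why the lapply attempt in the question prints 0 four times: a function returns the value of its last expression, and the last expression there is an assignment whose value is 0. A sketch of the loop with an explicit return value, using a smaller data set than the question (100 rows instead of 10000; out is an illustrative name):

```r
set.seed(1)
x <- rep(c("A", "B", "C", "D"), times = c(10, 20, 65, 5))
combi <- data.frame(V1 = sample(x), V2 = runif(length(x)),
                    stringsAsFactors = FALSE)

variables <- sort(unique(combi$V1))

loopeddata <- lapply(variables, function(v) {
  out <- combi                           # work on a local copy
  out$NEWVAR <- as.integer(out$V1 == v)  # 1 for category v, 0 otherwise
  out                                    # return the data.frame, not the 0
})
names(loopeddata) <- variables

head(loopeddata[["C"]])
```

This yields a list of four data.frames, one per category, each shaped like the desired table in the question.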

How can I reshape my dataframe using reshape package?

I have a dataframe that looks like this:
step var1 score1 score2
1 a 0 0
2 b 1 1
3 d 1 1
4 e 0 0
5 g 0 0
1 b 1 1
2 a 1 0
3 d 1 0
4 e 0 1
5 f 1 1
1 g 0 1
2 d 1 1
etc.
Because I need to correlate variables a-g (their scores are in score1) with score2 at step 5 only, I think I need to change my dataset into this first:
a b c d e f g score2_step5
0 1   1 0   0 0
1 1   1 0 1   1
      1     0
etc.
I am pretty sure that the Reshape package should be able to help me to do the job, but I haven't been able to make it work yet.
Can anyone help me? Many thanks in advance!
Here's another version. If a block has no step = 5 row, score2_step5 is set to 0. Assuming your data.frame is df:
require(reshape2)
out <- do.call(rbind, lapply(seq(1, nrow(df), by=5), function(ix) {
iy <- min(ix+4, nrow(df))
df.b <- df[ix:iy, ]
tt <- dcast(df.b, 1 ~ var1, fill = 0, value.var = "score1", drop=F)
tt$score2_step5 <- 0
if (any(df.b$step == 5)) {
tt$score2_step5 <- df.b$score2[df.b$step == 5]
}
tt[,-1]
}))
> out
a b d e f g score2_step5
2 0 1 1 0 0 0 0
21 1 1 1 0 1 0 1
22 0 0 1 0 0 0 0
It looks like you want 7 correlations between the variables a-g and score2_step5--is that correct? First, you're going to need another variable. I'm assuming that step repeats continuously from 1 to 5; if not, this is going to be more complicated. I'm assuming your data is called df. I also prefer the newer reshape2 package, so I'm using that.
df$block <- rep(1:(nrow(df)/5),each=5)
df.molten <- melt(df,id.vars=c("var1", "step", "block"),measure.vars=c("score1"))
df2 <- dcast(df.molten, block ~ var1)
score2_step5 <- df$score2[df$step==5]
and then finally
cor(df2, score2_step5, use='pairwise')
There's an extra column (block) in df2 that you can get rid of or just ignore.
I added another row to your example data because my code doesn't work unless there is a step-5 observation in every block.
dat <- read.table(textConnection("
step var1 score1 score2
1 a 0 0
2 b 1 1
3 d 1 1
4 e 0 0
5 g 0 0
1 b 1 1
2 a 1 0
3 d 1 0
4 e 0 1
5 f 1 1
1 g 0 1
2 d 1 1
5 a 1 0"),header=TRUE)
Like @JonathanChristensen, I made another variable (I called it rep instead of block), and I made var1 into a factor (since there are no c values in the example data set given and I wanted a placeholder).
dat <- transform(dat,var1=factor(var1,levels=letters[1:7]),
rep=cumsum(step==1))
tapply makes the table of score1 values:
tab <- with(dat,tapply(score1,list(rep,var1),identity))
Then add the score2 values from step 5:
data.frame(tab,subset(dat,step==5,select=score2))
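If some block has no step-5 row, as in the original 12-row example, the step-5 scores can be aligned by block with match so that the missing one becomes NA. A base R sketch under that assumption; block, vars, wide, s5 and out are illustrative names.

```r
dat <- read.table(text = "
step var1 score1 score2
1 a 0 0
2 b 1 1
3 d 1 1
4 e 0 0
5 g 0 0
1 b 1 1
2 a 1 0
3 d 1 0
4 e 0 1
5 f 1 1
1 g 0 1
2 d 1 1", header = TRUE)

# A new block starts every time step returns to 1
dat$block <- cumsum(dat$step == 1)

# Wide table of score1: one row per block, one column per variable a-g;
# combinations that never occur are left as NA
vars <- factor(dat$var1, levels = letters[1:7])
wide <- tapply(dat$score1, list(block = dat$block, var = vars),
               function(v) v[1])

# score2 at step 5, matched by block; NA where a block has no step 5
s5 <- dat[dat$step == 5, ]
score2_step5 <- s5$score2[match(rownames(wide), s5$block)]

out <- data.frame(wide, score2_step5)
out
```

Using NA rather than 0 for a missing step-5 score is a design choice: it keeps incomplete blocks out of a later cor(..., use = "pairwise") instead of treating them as a real score of 0.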

Resources