Efficiently augment a data frame by values found in a matrix - r

I have the following data frame (called cp):
v1 v2 v3 v4
1 1 2 3 4
2 3 1 2 4
3 4 2 1 3
Where 1, 2, 3 and 4 are nodes on a directed graph. The distance between the nodes is given by the following weighted adjacency matrix (let's call it B):
0 3 1 2
3 0 1 4
1 1 0 2
2 4 2 0
I need to augment the data frame with the distances between the nodes, looked up from the rows and columns of the adjacency matrix (again data frame cp):
v1 v2 v3 v4 V5 V6 V7
1 1 2 3 4 3 1 2
2 3 1 2 4 1 3 4
3 4 2 1 3 4 3 1
That is, the values in columns V5, V6 and V7 come from looking up the distance between adjacent pairs of nodes in columns v1 to v4. For example, the 3 in column V5 is the distance from node 1 to node 2, which is found in the first row and second column of matrix B (that is, 3), and so on.
I have written the following code in R to achieve this:
for (i in 1:3) {
  for (j in 5:7) {
    cp[i, j] <- B[cp[i, j - 4], cp[i, j - 3]]
  }
}
The code works fine with a data frame of just a few observations. The problem is that it takes many hours to process a data frame of 9 columns and 11 million observations. Can you please help me find a more efficient way to do this without the for loops?

In R, a matrix is stored as a one-dimensional vector in column-major order, so element (i, j) of a matrix with n rows sits at linear position i + n * (j - 1). You can exploit this to look up many (row, column) pairs at once.
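For instance, with the 4 x 4 matrix B from the question, element (2, 3) sits at linear position 2 + 4 * (3 - 1) = 10 (a quick illustration):
B <- matrix(c(0, 3, 1, 2,
              3, 0, 1, 4,
              1, 1, 0, 2,
              2, 4, 2, 0), nrow = 4)
B[2, 3]                   # 1
B[2 + nrow(B) * (3 - 1)]  # 1, the same element via its linear index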
Hence you can do this:
B <- as.matrix(B)
cp <- cbind(cp, sapply(1:(ncol(cp) - 1), function(ii) {
  B[cp[, ii] + nrow(B) * (cp[, ii + 1] - 1)]
}))
cp
# v1 v2 v3 v4 1 2 3
# 1 1 2 3 4 3 1 2
# 2 3 1 2 4 1 3 4
# 3 4 2 1 3 4 3 1
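If you want the new columns named V5 to V7 as in the desired output, you can rename them afterwards (a small addition):
colnames(cp)[5:7] <- paste0("V", 5:7)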

You can try this, which should be way faster since your data frame cp has 9 columns:
res <- apply(mapply(seq, seq(ncol(cp)-1), 2:ncol(cp)), 2, function(i) B[cp[,i]])
# [,1] [,2] [,3]
#[1,] 3 1 2
#[2,] 1 3 4
#[3,] 4 3 1
cbind(cp, res) will give your desired output.
You can convert your data frame cp to a matrix with as.matrix(cp). The matrix type is used here because it allows vectorized (row, column) indexing.
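This relies on R's matrix indexing: when the index passed to [ is itself a two-column matrix, each of its rows is treated as a (row, column) pair. A small illustration with B and cp from the question:
cp <- as.matrix(cp)          # pair indexing needs a matrix, not a data frame
B[cbind(c(1, 3), c(2, 1))]   # B[1, 2] and B[3, 1], i.e. 3 and 1
B[cp[, c(1, 2)]]             # 3 1 4, the V5 column in the desired output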
Benchmarking (cp of dim 1e+6 x 9)
library(microbenchmark)
set.seed(1)
cp <- t(replicate(1e+6, sample(9)))
B <- t(replicate(9, sample(9)-1))
f989 <- function() apply(mapply(seq, seq(ncol(cp) - 1), 2:ncol(cp)), 2,
                         function(i) B[cp[, i]])
fikop <- function() sapply(1:(ncol(cp) - 1), function(ii) {
  B[cp[, ii] + nrow(B) * (cp[, ii + 1] - 1)]
})
all(f989()==fikop())
# [1] TRUE
microbenchmark(f989(), fikop())
# Unit: milliseconds
#     expr      min       lq     mean   median       uq     max neval
#   f989() 157.4025 165.0029 190.5306 200.8816 204.6907 239.720   100
#  fikop() 212.2289 255.1914 259.2568 261.1330 266.3382 310.974   100

Related

Count instances of overlap in two vectors in R

I am hoping to create a matrix that shows a count of instances of overlapping values for a grouping variable based on a second variable. Specifically, I am hoping to determine the degree to which primary studies overlap across meta-analyses in order to create a network diagram.
So, in this example, I have three meta-analyses that include some portion of three primary studies.
df <- data.frame(metas = c(1,1,1,2,3,3), studies = c(1,3,2,1,2,3))
metas studies
1 1 1
2 1 3
3 1 2
4 2 1
5 3 2
6 3 3
I would like it to return:
v1 v2 v3
1 3 1 2
2 1 1 0
3 2 0 2
The value in row 1, column 1 indicates that Meta-analysis 1 had three studies in common with itself (i.e., it included three studies). Row 1, column 2 indicates that Meta-analysis 1 had one study in common with Meta-analysis 2. Row 1, column 3 indicates that Meta-analysis 1 had two studies in common with Meta-analysis 3.
I believe you are looking for a symmetric matrix of intersecting studies.
dfspl <- split(df$studies, df$metas)
out <- outer(seq_along(dfspl), seq_along(dfspl),
             function(a, b) lengths(Map(intersect, dfspl[a], dfspl[b])))
out
# [,1] [,2] [,3]
# [1,] 3 1 2
# [2,] 1 1 0
# [3,] 2 0 2
If you need names on them, you can go with the names as defined by df$metas:
rownames(out) <- colnames(out) <- names(dfspl)
out
# 1 2 3
# 1 3 1 2
# 2 1 1 0
# 3 2 0 2
If you need the names defined as v plus the meta name, go with
rownames(out) <- colnames(out) <- paste0("v", names(dfspl))
out
# v1 v2 v3
# v1 3 1 2
# v2 1 1 0
# v3 2 0 2
If you need to understand what this is doing, outer creates an expansion of the two argument vectors, and passes them all at once to the function. For instance,
outer(seq_along(dfspl), seq_along(dfspl), function(a, b) { browser(); 1; })
# Called from: FUN(X, Y, ...)
# debug at #1: [1] 1
# Browse[2]>
a
# [1] 1 2 3 1 2 3 1 2 3
# Browse[2]>
b
# [1] 1 1 1 2 2 2 3 3 3
# Browse[2]>
What we ultimately want to do is find the intersection of each pair of studies.
dfspl[[1]]
# [1] 1 3 2
dfspl[[3]]
# [1] 2 3
intersect(dfspl[[1]], dfspl[[3]])
# [1] 3 2
length(intersect(dfspl[[1]], dfspl[[3]]))
# [1] 2
Granted, we are computing each intersection twice (once for 1 and 3, once for 3 and 1, which gives the same result), so this is a little inefficient; it would be better to compute only the upper or lower triangle and mirror it to the other.
Edited for a more efficient process (calculating each intersection pair only once, and never calculating self-intersections):
eg <- expand.grid(a = seq_along(dfspl), b = seq_along(dfspl))
eg <- eg[ eg$a < eg$b, ]
eg
# a b
# 4 1 2
# 7 1 3
# 8 2 3
lens <- lengths(Map(intersect, dfspl[eg$a], dfspl[eg$b]))
lens
# 1 1 2 ## btw, these are just names, from eg$a
# 1 2 0
out <- matrix(nrow = length(dfspl), ncol = length(dfspl))
out[ cbind(eg$a, eg$b) ] <- lens
out
# [,1] [,2] [,3]
# [1,] NA 1 2
# [2,] NA NA 0
# [3,] NA NA NA
out[ lower.tri(out) ] <- t(out)[ lower.tri(out) ]  # mirror the upper triangle (correct for any size)
diag(out) <- lengths(dfspl)
out
# [,1] [,2] [,3]
# [1,] 3 1 2
# [2,] 1 1 0
# [3,] 2 0 2
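As a side note (my addition, not part of the answer above): when each study appears at most once per meta-analysis, the same symmetric count matrix drops out of a cross-product of the incidence table in one step:
inc <- table(df$metas, df$studies)  # metas x studies incidence matrix
tcrossprod(inc)                     # pairwise counts of shared studies
#   1 2 3
# 1 3 1 2
# 2 1 1 0
# 3 2 0 2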
Same idea as @r2evans, also base R (and a bit less eloquent) (edited as required):
# Create df using sample data:
df <- data.frame(metas = c(1,1,1,2,3,3), studies = c(1,7,2,1,2,3))
# Count how often each value in the metas vector matches the unique values in
# the data frame, then build a symmetric matrix from that count vector
# (v1 has to be computed before it is used in diag()):
v1 <- rowSums(sapply(df$metas, `==`, unique(unlist(df))))
m1 <- diag(v1)
m1[, 1] <- m1[1, ] <- v1
# Coerce the matrix to a data frame, dropping all-zero rows/columns and
# setting the names as desired:
keep <- which(rowSums(m1) > 0)
df_2 <- setNames(data.frame(m1[keep, keep]), paste0("v", seq_along(keep)))

Select unique values from a list of 3

I would like to list all unique combinations of vectors of length 3 where each element of the vector can range from 1 to 9.
First I list all such combinations:
df <- expand.grid(1:9, 1:9, 1:9)
Then I would like to remove the rows that contain repetitions.
For example:
1 1 9
9 1 1
1 9 1
should only be included once.
In other words, if two rows contain the same numbers with the same multiplicities, only one of them should be included.
Note that
8 8 8 or
9 9 9 is fine as long as it only appears once.
Based on your approach, with the idea of removing the repetitions afterwards:
df <- expand.grid(1:2, 1:2, 1:2)
# Var1 Var2 Var3
# 1 1 1 1
# 2 2 1 1
# 3 1 2 1
# 4 2 2 1
# 5 1 1 2
# 6 2 1 2
# 7 1 2 2
# 8 2 2 2
df2 <- unique(t(apply(df, 1, sort))) #class matrix
# [,1] [,2] [,3]
# [1,] 1 1 1
# [2,] 1 1 2
# [3,] 1 2 2
# [4,] 2 2 2
df2 <- as.data.frame(df2) #class data.frame
There are probably more efficient methods, but if I understand you correctly, this is the result you want.
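As a quick sanity check on the full 1:9 problem (my addition): the number of multisets of size 3 drawn from 9 values is choose(9 + 3 - 1, 3) = 165, and the unique/sort approach returns exactly that many rows:
df <- expand.grid(1:9, 1:9, 1:9)
nrow(unique(t(apply(df, 1, sort))))  # 165
choose(9 + 3 - 1, 3)                 # 165, multisets of size 3 from 9 values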
Maybe something like this (since your data frame is not large, performance is not a concern):
len <- apply(df, 1, function(x) length(unique(x)))
twos <- df[len == 2, ]  # rows with exactly two distinct values
res <- rbind(df[len != 2, ], twos[!duplicated(apply(twos, 1, prod)), ])
Here is what is done:
1. Get the number of distinct elements per row (len).
2. Build the result with rbind in two steps:
First argument of rbind: rows whose number of distinct elements is either 1 (e.g. 1 1 1, 7 7 7) or 3 (e.g. 5 8 7, 2 4 9) are included in the final result res directly.
Second argument of rbind: for rows with exactly 2 distinct elements (e.g. 1 1 9, 3 5 3), we compute the product of each row and keep one row per distinct product (because, for example, the products of 3 3 5, 3 5 3 and 5 3 3 are all the same).
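Two caveats with the product shortcut (my note, not in the original answer): rows with three distinct values are kept once per permutation (e.g. 5 8 7 and 7 8 5 both survive), and a product does not uniquely identify a multiset:
prod(c(1, 9, 9))  # 81
prod(c(3, 3, 9))  # 81, a different multiset with the same product, so one would be lost
The unique(t(apply(df, 1, sort))) approach above avoids both issues.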

Repeat vector to fill down column in data frame

Seems like this very simple maneuver used to work for me, and now it simply doesn't. A dummy version of the problem:
df <- data.frame(x = 1:5) # create simple dataframe
df
x
1 1
2 2
3 3
4 4
5 5
df$y <- c(1:5) # adding a new column with a vector of the exact same length. Works out like it should
df
x y
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
df$z <- c(1:4) # trying to add a new column, this time with a vector with fewer elements than there are rows in the data frame.
Error in `$<-.data.frame`(`*tmp*`, "z", value = 1:4) :
replacement has 4 rows, data has 5
I was expecting this to work with the following result:
x y z
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 1
I.e. the shorter vector should just start repeating itself automatically. I'm pretty certain this used to work for me (it's in a script that I've been running a hundred times before without problems). Now I can't even get the above dummy example to work like I want to. What am I missing?
If the vector can be recycled evenly into the data.frame, you do not get an error or a warning:
df <- data.frame(x = 1:10)
df$z <- 1:5
This may be what you were experiencing before.
You can get your vector to fit as you mention with rep_len:
df$y <- rep_len(1:3, length.out=10)
This results in
df
x z y
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 1
5 5 5 2
6 6 1 3
7 7 2 1
8 8 3 2
9 9 4 3
10 10 5 1
Note that in place of rep_len, you could use the more common rep function:
df$y <- rep(1:3,len=10)
From the help file for rep:
rep.int and rep_len are faster simplified versions for two common cases. They are not generic.
If the total number of rows is a multiple of the length of your new vector, it works fine. When it is not, the behavior depends on the function used. In particular, you have probably used this type of recycling with matrices:
data.frame(1:6, 1:3, 1:4) # not a multiple
# Error in data.frame(1:6, 1:3, 1:4) :
# arguments imply differing number of rows: 6, 3, 4
data.frame(1:6, 1:3) # a multiple
# X1.6 X1.3
# 1 1 1
# 2 2 2
# 3 3 3
# 4 4 1
# 5 5 2
# 6 6 3
cbind(1:6, 1:3, 1:4) # works even with not a multiple
# [,1] [,2] [,3]
# [1,] 1 1 1
# [2,] 2 2 2
# [3,] 3 3 3
# [4,] 4 1 4
# [5,] 5 2 1
# [6,] 6 3 2
# Warning message:
# In cbind(1:6, 1:3, 1:4) :
# number of rows of result is not a multiple of vector length (arg 3)

finding pairs of duplicate columns in R

Thank you for viewing this post. I am a newbie to the R language.
I want to find out whether any column (not a specified one) is a duplicate of another, and return a matrix with dimensions num.duplicates x 2, each row giving the indices of a pair of duplicated variables. The matrix is organized so that the first column holds the lower index of each pair, in increasing order.
Let say I have a dataset
v1 v2 v3 v4 v5 v6
1 1 1 2 4 2 1
2 2 2 3 5 3 2
3 3 3 4 6 4 3
and I want this
[,1] [,2]
[1,] 1 2
[2,] 1 6
[3,] 2 6
[4,] 3 5
Please help, thank you!
Something like this I suppose:
out <- data.frame(t(combn(1:ncol(dd), 2)))  # all pairs of column indices
out[combn(1:ncol(dd), 2, FUN = function(x) all(dd[x[1]] == dd[x[2]])), ]  # keep pairs whose columns are identical
# X1 X2
#1 1 2
#5 1 6
#9 2 6
#11 3 5
I feel like I'm missing something simpler, but this seems to work.
Here's the sample data.
dd <- data.frame(
  v1 = 1:3, v2 = 1:3, v3 = 2:4,
  v4 = 4:6, v5 = 2:4, v6 = 1:3
)
Now I'll assign each column to a group, using ave() to look for duplicates; each group of identical columns is labelled with the lowest column index it contains:
groups <- ave(1:ncol(dd), as.list(as.data.frame(t(dd))), FUN=min, drop=T)
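With the sample dd above, columns v1, v2 and v6 land in one group and v3 and v5 in another (my illustration):
groups
# [1] 1 1 3 4 3 1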
Now that I have the groups, I'll split the column indexes up by group; for each group with more than one column, I'll grab all pairwise combinations. That creates a wide matrix, which I flip to the tall format you want with t():
morethanone <- function(x) length(x)>1
dups <- t(do.call(cbind,
  lapply(Filter(morethanone, split(1:ncol(dd), groups)), combn, 2)
))
That returns
[,1] [,2]
[1,] 1 2
[2,] 1 6
[3,] 2 6
[4,] 3 5
as desired
First, generate all possible combinations with expand.grid. Second, remove duplicates and sort in the desired order. Third, use sapply to find the indexes of repeated columns:
kk <- expand.grid(1:ncol(df), 1:ncol(df))
nn <- kk[kk[, 1] > kk[, 2], 2:1]
nn[sapply(1:nrow(nn),
          function(i) all(df[, nn[i, 1]] == df[, nn[i, 2]])), ]
Var2 Var1
2 1 2
6 1 6
12 2 6
17 3 5
The approach I propose is R-ish, but I suppose writing a simple double loop is justified for this case, especially if you recently started learning the language.

Efficient way to count the change of values between 2 or more matrices or vectors

I am checking the changes that occur between different datasets; for now I am using a simple loop that gives me the counts for each change. The datasets are numeric (sequences of numbers) and I count how many times each change occurs (1 changed to 5 XX times):
n <- 100
tmp1 <- sample(1:25, n, replace = TRUE)
tmp2 <- sample(1:25, n, replace = TRUE)
values_tmp1 <- sort(unique(tmp1))
values_tmp2 <- sort(unique(tmp2))
count <- c()
i <- 1
for (m in 1:length(values_tmp1)) {
  for (j in 1:length(values_tmp2)) {
    count[i] <- length(which(tmp1 == values_tmp1[m] & tmp2 == values_tmp2[j]))
    i <- i + 1
  }
}
However, my data is much bigger, with n = 2000000, and the loop gets extremely slow.
Can anyone help me improve this calculation?
Like this?
tmp1 <- c(1:5,3)
tmp2 <- c(1,3,3,1,5,3)
aggregate(tmp1,list(tmp1,tmp2),length)
# Group.1 Group.2 x
# 1 1 1 1
# 2 4 1 1
# 3 2 3 1
# 4 3 3 2
# 5 5 5 1
This might be faster for a big dataset:
library(data.table)
DT <- data.table(cbind(tmp1,tmp2),key=c("tmp1","tmp2"))
DT[,.N,by=key(DT)]
# tmp1 tmp2 N
# 1: 1 1 1
# 2: 2 3 1
# 3: 3 3 2
# 4: 4 1 1
# 5: 5 5 1
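For completeness (my addition, not from the answers above): base R's table() computes the same cross-tabulation of the two vectors directly, which is often all that is needed. Unlike aggregate, it also keeps zero counts, matching the question's loop:
counts <- table(tmp1, tmp2)  # one cell per (tmp1, tmp2) value pair
as.data.frame(counts)        # long format with columns tmp1, tmp2, Freq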
