Getting all possible two column subsets - r

I am a relative newbie to R and I am now very close to being finished with a rather long script with many thanks to everyone who helped me thus far at various steps. I have another point I am stuck on. I have simplified the issue to this:
Dataset1
ax ay
1 3
2 4
Dataset2
bx by
5 7
6 8
A <- dataset1
B <- dataset2
a <- 2 #number of columns
b <- 1:2
(my datasets will vary in number of columns and so I need to be able to vary this factor)
I want this answer in any order (i.e. all possible combinations of two columns one from each of the two datasets) like this or equivalent.
[[1]]
1 5
2 6
[[2]]
1 7
2 8
[[3]]
3 5
4 6
[[4]]
3 7
4 8
But I am not getting it.
I tried a bunch of things and the closest to what I want was with this:
i <- 1
for( i in 1:a )
{
e <- lapply(B, function(x) as.data.frame(cbind(A, x)))
print(e)
i <- i+1
}
Close, yes. I can take the answer and do some fiddling and subsetting, but it's not right and there must be an easy way to do this. I have not seen anything like this in my online searches. Any help is much appreciated.

Does something like this work for you?
Dataset1 <- data.frame(ax=1:2,ay=3:4)
Dataset2 <- data.frame(bx=5:6,by=7:8)
apply(
expand.grid(seq_along(Dataset1),seq_along(Dataset2)),
1,
function(x) cbind(Dataset1[x[1]],Dataset2[x[2]])
)
Result:
[[1]]
ax bx
1 1 5
2 2 6
[[2]]
ay bx
1 3 5
2 4 6
[[3]]
ax by
1 1 7
2 2 8
[[4]]
ay by
1 3 7
2 4 8

I think the easiest way to do this is very similar to what you tried: use two explicit loops. However, there are still some things I would do differently:
Pre allocate the list space
Use an explicit counter
Use drop=FALSE
Then you can do the following.
A <- read.table(text = "ax ay
1 3
2 4", header = TRUE)
B <- read.table(text = "bx by
5 7
6 8", header = TRUE)
out <- vector("list", length = ncol(A) * ncol(B))
counter <- 1
for (i in seq_len(ncol(A))) {
for (j in seq_len(ncol(B))) {
out[[counter]] <- cbind(A[,i, drop = FALSE], B[,j, drop = FALSE])
counter <- counter + 1
}
}
out
## [[1]]
## ax bx
## 1 1 5
## 2 2 6
##
## [[2]]
## ax by
## 1 1 7
## 2 2 8
##
## [[3]]
## ay bx
## 1 3 5
## 2 4 6
##
## [[4]]
## ay by
## 1 3 7
## 2 4 8

If I understand the question, I think you can use combn to select the columns you want. For instance, if you wanted all combinations of 8 columns taken 2 at a time, you could do:
combn(1:8, 2)
Which gives (in part for readability):
combn(1:8,2)[,c(1:5, 15:18)]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 1 1 1 1 1 3 3 3 3
[2,] 2 3 4 5 6 5 6 7 8
So then columns of this matrix can be used as the indices you want.
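As a rough sketch of how those index columns could be applied to the original two-dataset problem: bind the two data frames together, enumerate all column pairs with combn, and keep only the pairs that take one column from each dataset. The names AB, idx and keep are made up for this illustration, and the data frames are rebuilt from the question's example.

```r
A <- data.frame(ax = 1:2, ay = 3:4)
B <- data.frame(bx = 5:6, by = 7:8)

AB  <- cbind(A, B)          # columns of A first, then columns of B
idx <- combn(ncol(AB), 2)   # every pair of column positions, one pair per column

# keep only pairs whose first index falls in A and whose second falls in B
keep <- idx[1, ] <= ncol(A) & idx[2, ] > ncol(A)
idx  <- idx[, keep, drop = FALSE]

# one two-column data frame per remaining pair
out <- lapply(seq_len(ncol(idx)), function(k) AB[, idx[, k], drop = FALSE])
length(out)
```

The filter is needed because combn on its own also returns pairs drawn from within the same dataset (for example ax with ay).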

Related

adding value from previous row to subsequent ones, cumulatively in R

I am trying to add the last value from the previous row to the subsequent ones. For example
tmat = rbind(c(1,2,3), c(1,2,3), c(1,2,5))
tmat = as.data.frame(tmat)
tmat
V1 V2 V3
1 1 2 3
2 1 2 3
3 1 2 5
changed to
V1 V2 V3
1 1 2 3
2 4 5 6
3 7 8 11
I have tried various ways but I have a blind spot to this one.
new=list()
for(i in 2:nrow(tmat)){
new[[i]] = cumsum(tmat[i,]+tmat[i-1,3])
}
do.call(rbind, new)
Thanks for any help.
I'd use a loop since you need to compute the rows step by step...
a <- 1:3
aa <- rbind(a,a,a)
aa[3,3] <- 6
for(i in 1:(nrow(aa)-1)) {
toadd <- aa[i,ncol(aa)] # last value of the (already updated) previous row
aa[i+1,] <- aa[i+1,] + toadd
}
aa
[,1] [,2] [,3]
a 1 2 3
a 4 5 6
a 7 8 12
As a matrix reduction:
do.call(rbind, Reduce(function(a0, a1) (a1 + a0[length(a0)]),
split(as.matrix(tmat), seq_len(nrow(tmat))),
accumulate = TRUE))
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 11
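A loop-free variant is also possible, sketched here under the assumption that only the last column feeds the next row, as in the question. The updated last column is then just the cumulative sum of the original last column, so the amount added to each row is that cumulative sum shifted down by one. The names last, offset and res are made up for this illustration.

```r
tmat <- as.data.frame(rbind(c(1, 2, 3), c(1, 2, 3), c(1, 2, 5)))

last   <- cumsum(tmat[[ncol(tmat)]])   # updated last column: 3, 6, 11
offset <- c(0, head(last, -1))         # amount added to each row: 0, 3, 6
res    <- tmat + offset                # offset has one entry per row
res
```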

Count instances of overlap in two vectors in R

I am hoping to create a matrix that shows a count of instances of overlapping values for a grouping variable based on a second variable. Specifically, I am hoping to determine the degree to which primary studies overlap across meta-analyses in order to create a network diagram.
So, in this example, I have three meta-analyses that include some portion of three primary studies.
df <- data.frame(metas = c(1,1,1,2,3,3), studies = c(1,3,2,1,2,3))
metas studies
1 1 1
2 1 3
3 1 2
4 2 1
5 3 2
6 3 3
I would like it to return:
v1 v2 v3
1 3 1 2
2 1 1 0
3 2 0 2
The value in row 1, column 1 indicates that Meta-analysis 1 had three studies in common with itself (i.e., it included three studies). Row 1, column 2 indicates that Meta-analysis 1 had one study in common with Meta-analysis 2. Row 1, column 3 indicates that Meta-analysis 1 had two studies in common with Meta-analysis 3.
I believe you are looking for a symmetric matrix of intersecting studies.
dfspl <- split(df$studies, df$metas)
out <- outer(seq_along(dfspl), seq_along(dfspl),
function(a, b) lengths(Map(intersect, dfspl[a], dfspl[b])))
out
# [,1] [,2] [,3]
# [1,] 3 1 2
# [2,] 1 1 0
# [3,] 2 0 2
If you need names on them, you can go with the names as defined by df$metas:
rownames(out) <- colnames(out) <- names(dfspl)
out
# 1 2 3
# 1 3 1 2
# 2 1 1 0
# 3 2 0 2
If you need the names defined as v plus the meta name, go with
rownames(out) <- colnames(out) <- paste0("v", names(dfspl))
out
# v1 v2 v3
# v1 3 1 2
# v2 1 1 0
# v3 2 0 2
If you need to understand what this is doing, outer creates an expansion of the two argument vectors, and passes them all at once to the function. For instance,
outer(seq_along(dfspl), seq_along(dfspl), function(a, b) { browser(); 1; })
# Called from: FUN(X, Y, ...)
# debug at #1: [1] 1
# Browse[2]>
a
# [1] 1 2 3 1 2 3 1 2 3
# Browse[2]>
b
# [1] 1 1 1 2 2 2 3 3 3
# Browse[2]>
What we ultimately want to do is find the intersection of each pair of studies.
dfspl[[1]]
# [1] 1 3 2
dfspl[[3]]
# [1] 2 3
intersect(dfspl[[1]], dfspl[[3]])
# [1] 3 2
length(intersect(dfspl[[1]], dfspl[[3]]))
# [1] 2
Granted, we are doing it twice (once for 1 and 3, once for 3 and 1, which is the same result), so this is a little inefficient ... it would be better to compute only the upper or lower half and transfer it to the other.
Edited for a more efficient process (only calculating each intersection pair once, and never calculating self-intersection.)
eg <- expand.grid(a = seq_along(dfspl), b = seq_along(dfspl))
eg <- eg[ eg$a < eg$b, ]
eg
# a b
# 4 1 2
# 7 1 3
# 8 2 3
lens <- lengths(Map(intersect, dfspl[eg$a], dfspl[eg$b]))
lens
# 1 1 2 ## btw, these are just names, from eg$a
# 1 2 0
out <- matrix(nrow = length(dfspl), ncol = length(dfspl))
out[ cbind(eg$a, eg$b) ] <- lens
out
# [,1] [,2] [,3]
# [1,] NA 1 2
# [2,] NA NA 0
# [3,] NA NA NA
# mirror via the transpose so that each (i, j) lands in (j, i) for any matrix size
out[ lower.tri(out) ] <- t(out)[ lower.tri(out) ]
diag(out) <- lengths(dfspl)
out
# [,1] [,2] [,3]
# [1,] 3 1 2
# [2,] 1 1 0
# [3,] 2 0 2
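A different technique from the intersect-based one above, offered only as a sketch: cross-tabulate studies against metas and take the cross-product of that incidence table. This assumes each (meta, study) pair should count once, which unique() enforces here. The names tab and overlap are made up for this illustration.

```r
df <- data.frame(metas = c(1, 1, 1, 2, 3, 3), studies = c(1, 3, 2, 1, 2, 3))

# rows = studies, columns = metas; entries are 0/1 after de-duplication
tab <- unclass(table(unique(df[c("studies", "metas")])))

# t(tab) %*% tab: entry (i, j) counts the studies shared by meta i and meta j
overlap <- crossprod(tab)
overlap
```

The diagonal holds the number of studies in each meta-analysis, and the result is symmetric by construction.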
Same idea as @r2evans, also base R (and a bit less elegant) (edited as required):
# Create df using sample data:
df <- data.frame(metas = c(1,1,1,2,3,3), studies = c(1,7,2,1,2,3))
# Count how often each unique value in the data frame occurs in metas.
# v1 has to exist before diag(v1) is called, so compute it first:
v1 <- rowSums(data.frame(sapply(df$metas, `==`, unique(unlist(df)))))
# Construct a symmetric matrix from the vector (diagonal, first row and first
# column). This matches the expected output for this example; it does not
# compute pairwise intersections for arbitrary data:
m1 <- diag(v1); m1[,1] <- m1[1,] <- v1
# Coerce matrix to dataframe setting the names as desired; dropping non matches:
df_2 <- setNames(data.frame(m1[which(rowSums(m1) > 0), which(colSums(m1) > 0)]),
paste0("v", 1:ncol(m1[which(rowSums(m1) > 0), which(colSums(m1) > 0)])))

multiple loops on matrix

I have below information:
coordinate <- read.table(text = " 18.915 13.462 31.598
17.898 14.453 32.160
18.220 15.420 32.853
19.208 12.313 32.573
20.393 11.524 32.110
20.344 10.809 31.085
21.595 16.610 29.912")
amnumber <- c(1,1,2,3,3,3,4)
atname <-as.data.frame( c("A","B","A","C","D","C","H"),stringsAsFactors = F)
library(geometry)
tri <- delaunayn(coordinate)
tri
[,1] [,2] [,3] [,4]
[1,] 1 3 7 2
[2,] 4 1 6 2
[3,] 4 1 3 2
[4,] 4 1 3 7
[5,] 5 4 3 7
[6,] 5 1 6 7
[7,] 5 4 1 7
[8,] 5 4 1 6
tridmatrix
I want to perform two loops on the tri matrix such that value 1 in the first row is related to each of the following values in that row, i.e. 3, 7 and 2. So, in the output matrix of the loops, a 1 is placed at these index pairs. Then, value 3 of the first row is related to the two remaining values, 7 and 2, and so on. The output would be a matrix that contains only 0 and 1 values. To this end I wrote the loops below:
for (k in 1:nrow(tri)){
for (i in 1:4){
for (j in i+1){
c <- abs(amnumber[tri[k,i]]-amnumber[tri[k,j]])
if (c>=1){
if (!((atname[tri[k,i],]%in%"N")&&(atname[tri[k,j],]%in%"C")&&(c%in%1)||
(atname[tri[k,i],]%in%"C")&&(atname[tri[k,j],]%in%"N")&&(c%in%1))){
d <- sqrt(sum((coordinate[tri[k,i],]-coordinate[tri[k,j],])^2))
if (d<=tridist){
adj_tri[tri[k,i],tri[k,j]] <- 1
adj_tri[tri[k,j],tri[k,i]] <- 1
adj_tri[is.na(adj_tri)] <- 0
}
}
}
}
}
}
But it did not work, and I got an error. The i index runs up to the number of columns in the tri matrix, and I think the problem is in the third loop. However, I could not fix it. Any help would be appreciated.
Besides, this is too slow. Would you please help me change it to lapply to speed it up?
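One likely cause of the error is that for (j in i+1) visits only the single value i+1, which is 5 when i is 4 and so falls outside the four columns of tri. A minimal sketch of just the pairing step follows, with the tri matrix copied from the output above so that it runs without the geometry package. The filters on amnumber, atname and tridist are deliberately left out (tridist and the initial adj_tri are not defined in the question), and the name pairs is made up for this illustration.

```r
# simplices copied from the delaunayn() output shown in the question
tri <- matrix(c(1, 3, 7, 2,
                4, 1, 6, 2,
                4, 1, 3, 2,
                4, 1, 3, 7,
                5, 4, 3, 7,
                5, 1, 6, 7,
                5, 4, 1, 7,
                5, 4, 1, 6), ncol = 4, byrow = TRUE)

n       <- max(tri)
adj_tri <- matrix(0, n, n)      # start from all zeros, so no NA clean-up is needed
pairs   <- combn(ncol(tri), 2)  # the 6 column pairs (i < j) within a simplex

for (p in seq_len(ncol(pairs))) {
  from <- tri[, pairs[1, p]]    # one endpoint for every simplex at once
  to   <- tri[, pairs[2, p]]
  adj_tri[cbind(from, to)] <- 1 # matrix indexing sets all these cells together
  adj_tri[cbind(to, from)] <- 1 # keep the matrix symmetric
}
adj_tri
```

Because each pass handles a whole column pair across all rows, the loop runs 6 times regardless of how many simplices there are, which should help with speed; the distance and name conditions could then be applied as logical masks on from and to.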

how to create numeral combinations

I have 6 digits (1, 2, 3, 4, 5, 6), and I need to create all possible combinations (i.e. 6*5*4*3*2*1 = 720 combinations) in which no number can be used twice and 0 is not allowed. I would like to obtain combinations like: 123456, 246135, 314256, etc.
Is there a way to create them with Matlab or R? Thank you.
In Matlab you can use
y = perms(1:6);
This gives a numerical 720×6 array y, where each row is a permutation:
y =
6 5 4 3 2 1
6 5 4 3 1 2
6 5 4 2 3 1
6 5 4 2 1 3
6 5 4 1 2 3
···
If you want the result as a char array:
y = char(perms(1:6)+'0');
which produces
y =
654321
654312
654231
654213
654123
···
In R:
library(combinat)
p <- permn(1:6)
gives you a list; do.call(rbind, p) or matrix(unlist(p), ncol=6, byrow=TRUE) will give a numeric array; sapply(p,paste,collapse="") gives a vector of strings.
Here's a base R 'solution':
p <- unique(t(replicate(100000, sample(6,6), simplify="vector")))
nrow(p)
#> [1] 720
head(p)
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] 3 5 4 2 1 6
#> [2,] 6 3 5 4 1 2
#> [3,] 5 1 6 2 3 4
#> [4,] 6 5 3 2 4 1
#> [5,] 5 2 3 6 4 1
#> [6,] 1 4 2 5 6 3
It's a hack of course, and this potentially only applies to the example given, but sometimes it's useful to do things in silly ways... this takes an excessive number of samples (without replacement) of the vector 1:6, then removes any duplicates. It does indeed produce the unique 720 results, but they're not sorted.
A base R approach is
x <- do.call(expand.grid, rep(list(1:6), 6))
x <- x[apply(x, MARGIN = 1, function(r) length(unique(r)) == 6), ]
which creates a data frame with 6^6 rows, then retains only the rows that contain all 6 numbers.
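If building all 6^6 rows first feels wasteful, a small recursive function in base R can generate the permutations directly. This is only a sketch; perms is a made-up helper name, not a function from base R or from combinat.

```r
# put each element first in turn and permute the remaining ones behind it
perms <- function(v) {
  if (length(v) <= 1) return(matrix(v, nrow = 1))
  do.call(rbind, lapply(seq_along(v), function(i) cbind(v[i], perms(v[-i]))))
}

p <- perms(1:6)
nrow(p)                                    # 720
s <- apply(p, 1, paste, collapse = "")     # one string per permutation
head(s, 3)
```

Unlike the sampling approach, this produces every permutation exactly once and in a fixed order.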

Repeat vector to fill down column in data frame

Seems like this very simple maneuver used to work for me, and now it simply doesn't. A dummy version of the problem:
df <- data.frame(x = 1:5) # create simple dataframe
df
x
1 1
2 2
3 3
4 4
5 5
df$y <- c(1:5) # adding a new column with a vector of the exact same length. Works out like it should
df
x y
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
df$z <- c(1:4) # trying to add a new column, this time with a vector with fewer elements than there are rows in the dataframe.
Error in `$<-.data.frame`(`*tmp*`, "z", value = 1:4) :
replacement has 4 rows, data has 5
I was expecting this to work with the following result:
x y z
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 1
I.e. the shorter vector should just start repeating itself automatically. I'm pretty certain this used to work for me (it's in a script that I've been running a hundred times before without problems). Now I can't even get the above dummy example to work like I want to. What am I missing?
If the vector can be evenly recycled into the data.frame, you do not get an error or a warning:
df <- data.frame(x = 1:10)
df$z <- 1:5
This may be what you were experiencing before.
You can get your vector to fit as you mention with rep_len:
df$y <- rep_len(1:3, length.out=10)
This results in
df
x z y
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 1
5 5 5 2
6 6 1 3
7 7 2 1
8 8 3 2
9 9 4 3
10 10 5 1
Note that in place of rep_len, you could use the more common rep function:
df$y <- rep(1:3,len=10)
From the help file for rep:
rep.int and rep_len are faster simplified versions for two common cases. They are not generic.
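Applied to the 5-row example from the question, a minimal sketch would pass nrow(df) as the target length so that the call does not depend on a hard-coded row count:

```r
df <- data.frame(x = 1:5)
df$z <- rep_len(1:4, nrow(df))  # recycle 1:4 until it fills all rows
df$z                            # 1 2 3 4 1
```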
If the total number of rows is a multiple of the length of your new vector, it works fine. When it is not, it does not work everywhere. In particular, you have probably used this type of recycling with matrices:
data.frame(1:6, 1:3, 1:4) # not a multiple
# Error in data.frame(1:6, 1:3, 1:4) :
# arguments imply differing number of rows: 6, 3, 4
data.frame(1:6, 1:3) # a multiple
# X1.6 X1.3
# 1 1 1
# 2 2 2
# 3 3 3
# 4 4 1
# 5 5 2
# 6 6 3
cbind(1:6, 1:3, 1:4) # works even with not a multiple
# [,1] [,2] [,3]
# [1,] 1 1 1
# [2,] 2 2 2
# [3,] 3 3 3
# [4,] 4 1 4
# [5,] 5 2 1
# [6,] 6 3 2
# Warning message:
# In cbind(1:6, 1:3, 1:4) :
# number of rows of result is not a multiple of vector length (arg 3)
