Merge all possible combinations of multiple data frames


I would like to merge by columns all the possible pair combinations of these three data frames (i.e. nine combinations):
frame1 = data.frame(a=c(1,2,3), b=c(1,2,3), c=c(1,2,3))
frame2 = data.frame(a=c(2,1,3), b=c(2,1,3), c=c(2,1,3))
frame3 = data.frame(a=c(3,2,1), b=c(3,2,1), c=c(3,2,1))
These frames contain the same three rows each, but not in the same order, so the merge should match rows on the pair of values in columns a and b of the two frames being merged. Example:
a b c
1 1 1
2 2 2
3 3 3
+
a b c
2 2 2
1 1 1
3 3 3
=
a.x b.x c.x a.y b.y c.y
1 1 1 1 1 1
2 2 2 2 2 2
3 3 3 3 3 3
I then want to take the absolute difference between each pair of values in columns c.x and c.y of each merged frame, and sum these differences to obtain a "score" (zero in this example, of course). The score should be stored in the corresponding cell of an empty 3x3 matrix (i.e., the score of frame1 vs. frame2 should be located in cell [2,1], etc.):
nframes = 3
frames = c(frame1,frame2,frame3)
matrix = matrix(, nrow = nframes, ncol = nframes)
matrix_scores = data.frame(matrix)
for (i in frames){
for (j in frames)
{
x = merge(i, j, by=c("a","b"))
score = sum(abs(x$c.x - x$c.y))
matrix_scores[j,i] <- score
}
}
However, when I run the loop I obtain the following message:
Error in fix.by(by.x, x) : 'by' must specify uniquely valid columns
Also, I understand that the line
matrix_scores[j,i] <- score
will give an error, too, but I do not know how to express that I want the score to be stored in cell [1,1], for the first iteration of the loop (frame1 vs. frame1).
The resulting matrix should be a 3x3 matrix containing all zeros:
f1 f2 f3
frame1 0 0 0
frame2 0 0 0
frame3 0 0 0

You can do:
# Put all frames in a list
d <- list(frame1, frame2, frame3)
# get all merge-combinations
gr <- expand.grid(1:length(d), 1:length(d))
# function to merge and get the sum diff:
foo <- function(i, x, gr){
tmp <- merge(x[[gr[i, 1]]], x[[gr[i, 2]]], by=c("a", "b"))
sum(abs(tmp$c.x - tmp$c.y))
}
# result matrix
matrix(sapply(1:nrow(gr), foo, d, gr), length(d), length(d), byrow = T)
[,1] [,2] [,3]
[1,] 0 0 0
[2,] 0 0 0
[3,] 0 0 0
# The scores are arranged as follows:
matrix(apply(gr, 1, paste, collapse="_"), 3, 3, byrow = T)
[,1] [,2] [,3]
[1,] "1_1" "2_1" "3_1"
[2,] "1_2" "2_2" "3_2"
[3,] "1_3" "2_3" "3_3"
# alternative using apply:
# function to merge and get the sum diff:
foo <- function(y, x){
tmp <- merge(x[[ y[1] ]], x[[ y[2] ]], by=c("a", "b"))
sum(abs(tmp$c.x - tmp$c.y))
}
# result matrix
matrix(apply(gr, 1, foo, d), length(d), length(d), byrow = T)
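
As for why the original loop fails: c(frame1, frame2, frame3) concatenates the data frames into a single list of nine column vectors, so i and j are plain vectors without columns a and b, which is what merge complains about. A minimal sketch of the loop version, assuming the frames are kept in a list() and addressed by position (the names frames, n and scores are illustrative, not from the question):

```r
frame1 <- data.frame(a = c(1, 2, 3), b = c(1, 2, 3), c = c(1, 2, 3))
frame2 <- data.frame(a = c(2, 1, 3), b = c(2, 1, 3), c = c(2, 1, 3))
frame3 <- data.frame(a = c(3, 2, 1), b = c(3, 2, 1), c = c(3, 2, 1))

# list() keeps three data frames; c() would flatten them into nine columns
frames <- list(frame1 = frame1, frame2 = frame2, frame3 = frame3)
n <- length(frames)

# empty score matrix, labelled with the frame names
scores <- matrix(NA_real_, nrow = n, ncol = n,
                 dimnames = list(names(frames), names(frames)))

# loop over positions, so i and j can also index the matrix
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    x <- merge(frames[[i]], frames[[j]], by = c("a", "b"))
    scores[j, i] <- sum(abs(x$c.x - x$c.y))
  }
}
scores
```

Looping over indices rather than over the frames themselves is what makes scores[j, i] a valid assignment.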

Related

Distribute a sum randomly across columns

Suppose I have a vector of length n. Let's say n=3 and the vector is v=c(3,5,4).
I have a matrix of zeros with n rows and m columns. Let's say m=5
mymatrix <- matrix(rep(0, 3*5), nrow=3)
What I want to do is randomly distribute the values of v across columns, for each row. So in this example, the first row would sum to 3, the second row would sum to 5, and the third row would sum to 4. E.g. this would be one possible random assignment:
0 1 2 0 0
3 1 0 0 1
0 2 2 0 0
The sums of the rows are 3,5,4, which are the values of v.
How can I accomplish this? My thought was to start with
sapply(v, function(i){sample(1:m, i, replace=TRUE)})
and go from there but this gives me a list, since each result is of a different length, and I'm not sure how to proceed from there.
EDIT: the intention is no negative numbers, so 0 9 1 1 -8 summing to 3 would not be a valid row.

Using purrr:
library(tibble)
library(dplyr)
library(purrr)
n <- 3
m <- 5
v <- c(3,5,4)
# lapply always returns a list; sapply would return a matrix if all elements of v were equal
y <- lapply(v, function(i){sample(1:m, i, replace=TRUE)})
do_it <- function(x) {
tmp <- tibble(
index = x,
cnt = 1
) %>%
group_by(index) %>%
summarise(cnt = sum(cnt))
out <- rep(0, m)
out[tmp$index] <- tmp$cnt
return(out)
}
y %>%
map(~do_it(.x)) %>%
unlist() %>%
matrix(nrow = n, byrow = TRUE)
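
The same counting step can probably be done in base R alone. This is a sketch, assuming the goal is simply to count how often each column index is drawn; it swaps the group_by/summarise step for tabulate(), and the name M is illustrative:

```r
set.seed(123)        # only to make the sketch reproducible
m <- 5
v <- c(3, 5, 4)

# For each target sum k: draw k column indices with replacement,
# then count how many times each of the m columns was drawn.
M <- t(sapply(v, function(k) tabulate(sample(m, k, replace = TRUE), nbins = m)))

M
rowSums(M)           # equals v by construction
```

Because tabulate() always returns a vector of length nbins, sapply() can safely simplify the result to a matrix here.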

Maybe parts() from the partitions package is the tool you need for your objective, e.g.,
library(partitions)
# define your custom function `f` to generate random combinations and positions, with row sum subject to the given value
f <- function(k) replace(rep(0,ncol(mymatrix)),
sample(ncol(mymatrix),k),
as.data.frame.matrix(parts(k))[sample(k),sample(k,1)])
such that
set.seed(1)
M <- t(sapply(v, f))
> M
[,1] [,2] [,3] [,4] [,5]
[1,] 0 0 3 0 0
[2,] 1 0 0 4 0
[3,] 0 4 0 0 0

conditional which.min function

I have two sets of data, one is coordinates of machines, one is coordinates of the nearest repair shop.
I have a working model that has assigned each machine to the nearest store. However one store only has 1 machine and another has 7 machines assigned to it.
What I want is to add a condition so that each store is assigned at least 2 machines but no more than 4.
library(geosphere)
library(ggplot2)
#machine Locations
machine.x <- c(-122.37, -111.72, -111.87, -112.05, -87.17, -86.57, -86.54, -88.04, -86.61, -88.04, -86.61)
machine.y <- c(37.56, 35.23, 33.38, 33.57, 30.36, 30.75, 30.46, 30.68, 30.42, 30.68, 30.42)
machines <- data.frame(machine.x, machine.y)
#store locations
store.x <- c(-121.98, -112.17, -86.57)
store.y <- c(37.56, 33.59, 30.75)
stores <- data.frame(store.x, store.y)
centers<-data.frame(x=stores$store.x, y=stores$store.y)
pts<-data.frame(x=(machines$machine.x), y=(machines$machine.y))
#allocate space
distance<-matrix(-1, nrow = length(pts$x), ncol= length(centers$x))
#calculate the distance matrix - the distance from each center to each point
#columns represent centers and the rows are the data points
dm<-apply(data.frame(1:length(centers$x)), 1, function(x){ replace(distance[,x], 1:length(pts$x), distGeo(centers[x,], pts))})
#find the column with the smallest distance
closestcenter<-apply(dm, 1, which.min)
#color code the original data for verification
colors<-c(stores)
#create a scatter plot of assets color coded by which fe they belong to
plot(pts, col=closestcenter, pch=9)
So what I want is for each group to have a minimum count of 2 and a maximum count of 4. I tried adding an if/else statement to the closestcenter variable, but it didn't come close to working the way I thought it would. I've also looked around online but can't find any way to add a counting condition to the which.min statement.
Note: My actual data set has several thousand machines and over 100 stores.

Let M be an 11 x 3 zero-one matrix where M[i,j] = 1 if machine i is assigned to store j and 0 otherwise. Then each row of M must sum to 1, each column must sum to between 2 and 4 inclusive, and we want to choose the M that minimizes the sum of the distances, sum(M * dm). This gives the 0-1 linear program shown below. Below, A is such that A %*% c(M) is the same as rowSums(M), and B is such that B %*% c(M) is the same as colSums(M).
library(lpSolve)
k <- 3
n <- 11
dir <- "min"
objective.in <- c(dm)
A <- t(rep(1, k)) %x% diag(n)
B <- diag(k) %x% t(rep(1, n))
const.mat <- rbind(A, B, B)
const.dir <- c(rep("==", n), rep(">=", k), rep("<=", k))
const.rhs <- c(rep(1, n), rep(2, k), rep(4, k))
res <- lp(dir, objective.in, const.mat, const.dir, const.rhs, all.bin = TRUE)
res
## Success: the objective function is 9025807
soln <- matrix(res$solution, n, k)
and this solution:
> soln
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 1 0 0
[3,] 0 1 0
[4,] 0 1 0
[5,] 0 1 0
[6,] 0 0 1
[7,] 0 0 1
[8,] 1 0 0
[9,] 0 0 1
[10,] 0 1 0
[11,] 0 0 1
or in terms of the vector of store numbers assigned to each machine:
c(soln %*% (1:k))
## [1] 1 1 2 2 2 3 3 1 3 2 3
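
The claim that A %*% c(M) reproduces rowSums(M) and B %*% c(M) reproduces colSums(M) can be checked on a small example. This is only a sketch: the sizes and the random 0/1 matrix M below are made up for illustration and are not the machine/store data.

```r
n <- 4   # hypothetical number of machines
k <- 3   # hypothetical number of stores

set.seed(42)
M <- matrix(sample(0:1, n * k, replace = TRUE), n, k)

# c(M) stacks the columns of M, so:
A <- t(rep(1, k)) %x% diag(n)   # k identity blocks side by side -> row sums
B <- diag(k) %x% t(rep(1, n))   # block-diagonal rows of ones     -> column sums

row_check <- all(A %*% c(M) == rowSums(M))
col_check <- all(B %*% c(M) == colSums(M))
c(row_check, col_check)
```

The same construction works for any n and k, which matters for the larger real data set mentioned in the question.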

Generate matrices using positive integer solutions of the indefinite equation

I asked a similar question previously, but this one is a little trickier. I have a matrix (say A) of POSITIVE INTEGER solutions (previously NON-NEGATIVE solutions) to the indefinite equation x1+x2+x3 = 8. I also have another matrix (say B) with columns
0 1 0 1
0 0 1 1
I want to generate matrices using rows of A and the columns of B.
For example, let (2,2,4) be one solution (one row) of the matrix A. In this case, I cannot just use rep. So I tried to generate all the three-column matrices from matrix B and then apply rep, but couldn't figure that out. I use the following lines to generate a list of all the three-column matrices.
cols <- combn(ncol(B), 3, simplify=F, FUN=as.numeric)
M3 <- lapply(cols, function(x) cbind(B[,x]))
For example, cols[[1]]
[1] 1 2 3
Then, the columns of my new matrix would be
0 0 1 1 0 0 0 0
0 0 0 0 1 1 1 1
The columns of this new matrix are repeats of the columns of B, i.e., the first column 2 times, the second column 2 times and the third column 4 times. I want to apply this procedure to all the rows of matrix A. How do I do this?

The documentation for rep(x, times) (see ?rep) says:
if times is a vector of the same length as x (after replication by
each), the result consists of x[1] repeated times[1] times, x[2]
repeated times[2] times and so on.
The basic idea is:
B <- matrix(c(0, 1, 0, 1, 0, 0, 1, 1), byrow = T, nrow = 2)
cols <- combn(ncol(B), 3, simplify=F, FUN=as.numeric)
a1 <- c(2, 2, 4)
cols[[1]] # [1] 1 2 3
rep(cols[[1]], a1) # [1] 1 1 2 2 3 3 3 3
B[, rep(cols[[1]], a1)]
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
# [1,] 0 0 1 1 0 0 0 0
# [2,] 0 0 0 0 1 1 1 1
testA <- rbind(c(2,2,4), c(2,1,5), c(2,3,3))
## apply(..., lapply(...)) approach (output is list in list)
apply(testA, 1, function(x) lapply(cols, function(y) B[, rep(y, x)]))
## other approach using combination of indices
ind <- expand.grid(ind_cols = 1:length(cols), ind_A = 1:nrow(testA))
col_ind <- apply(ind, 1, function(x) rep(cols[[x[1]]], testA[x[2],]))
lapply(1:ncol(col_ind), function(x) B[, col_ind[,x]]) # output is list
library(dplyr)
apply(col_ind, 2, function(x) t(B[, x])) %>% matrix(ncol = 8, byrow=T) # output is matrix
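
If a flat list with one matrix per (row of A, column triple) pair is more convenient than a list in a list, something along these lines may work. It is a sketch built on the same rep() idea; expand_cols and res are hypothetical names, not part of the code above.

```r
B <- matrix(c(0, 1, 0, 1,
              0, 0, 1, 1), nrow = 2, byrow = TRUE)
cols <- combn(ncol(B), 3, simplify = FALSE)
testA <- rbind(c(2, 2, 4), c(2, 1, 5), c(2, 3, 3))

# Hypothetical helper: repeat the chosen columns of B, column idx[i] times[i] times
expand_cols <- function(B, idx, times) B[, rep(idx, times), drop = FALSE]

# Outer lapply over rows of A, inner lapply over column triples,
# then flatten one level so every element of res is a matrix
res <- unlist(
  lapply(seq_len(nrow(testA)), function(r)
    lapply(cols, function(idx) expand_cols(B, idx, testA[r, ]))),
  recursive = FALSE
)

length(res)   # number of rows of A times number of column triples
res[[1]]      # row c(2, 2, 4) with columns 1, 2, 3 of B
```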

Combining vectors of unequal length and non-unique values

I would like to do the following:
combine into a data frame, two vectors that
have different length
contain sequences found also in the other vector
contain sequences not found in the other vector
sequences that are not found in other vector are never longer than 3 elements
always have same first element
The data frame should show the equal sequences in the two vectors aligned, with NA in the column if a vector lacks a sequence present in the other vector.
For example:
vector 1  vector 2        vector 1  vector 2
1         1               a         a
2         2               g         g
3         3               b         b
4         1         or    h         a
1         2               a         g
2         3               g         b
5         4               c         h
          5                         c
should be combined into data frame
1         1               a         a
2         2               g         g
3         3               b         b
4         NA              h         NA
1         1         or    a         a
2         2               g         g
NA        3               NA        b
NA        4               NA        h
5         5               c         c
I searched for merge, combine, cbind and plyr examples but was not able to find a solution. I am afraid I will need to write a function with nested for loops to solve this problem.

Note - this was proposed as an answer to the first version of the OP. The question has been modified since then but the problem is still not well-defined in my opinion.
Here is a solution that works with your integer example and would also work with numeric vectors. I am also assuming that:
both vectors contain the same number of sequences
a new sequence starts where value[i+1] <= value[i]
If your vectors are non-numeric or if one of my assumptions does not fit your problem, you'll have to clarify.
v1 <- c(1,2,3,4,1,2,5)
v2 <- c(1,2,3,1,2,3,4,5)
v1.sequences <- split(v1, cumsum(c(TRUE, diff(v1) <= 0)))
v2.sequences <- split(v2, cumsum(c(TRUE, diff(v2) <= 0)))
align.fun <- function(s1, s2) { #aligns two sequences
s12 <- sort(unique(c(s1, s2)))
cbind(ifelse(s12 %in% s1, s12, NA),
ifelse(s12 %in% s2, s12, NA))
}
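# SIMPLIFY = FALSE keeps a list of matrices even when all aligned sequences have the same length
do.call(rbind, mapply(align.fun, v1.sequences, v2.sequences, SIMPLIFY = FALSE))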
# [,1] [,2]
# [1,] 1 1
# [2,] 2 2
# [3,] 3 3
# [4,] 4 NA
# [5,] 1 1
# [6,] 2 2
# [7,] NA 3
# [8,] NA 4
# [9,] 5 5
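
The split(..., cumsum(...)) idiom used above may be easier to follow step by step. A small sketch on the first example vector (starts, run_id and pieces are illustrative names):

```r
v <- c(1, 2, 3, 4, 1, 2, 5)

# TRUE wherever a new sequence starts: the first element, and every
# element that is not larger than its predecessor
starts <- c(TRUE, diff(v) <= 0)

# running count of starts gives each element the id of its sequence
run_id <- cumsum(starts)

# one vector per sequence
pieces <- split(v, run_id)
pieces
```

This relies on the stated assumption that a new sequence begins where value[i+1] <= value[i]; it would not apply to the character example without a different rule for detecting sequence starts.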

I maintain that your problem might be solved in terms of the shortest common supersequence. It assumes that your two vectors each represent one sequence. Please give the code below a try.
If it still does not solve your problem, you'll have to explain exactly what you mean by "my vector contains not one but many sequences": define what you mean by a sequence and tell us how sequences can be identified by scanning through your two vectors.
Part I: given two sequences, find the longest common subsequence
LongestCommonSubsequence <- function(X, Y) {
m <- length(X)
n <- length(Y)
C <- matrix(0, 1 + m, 1 + n)
for (i in seq_len(m)) {
for (j in seq_len(n)) {
if (X[i] == Y[j]) {
C[i + 1, j + 1] = C[i, j] + 1
} else {
C[i + 1, j + 1] = max(C[i + 1, j], C[i, j + 1])
}
}
}
backtrack <- function(C, X, Y, i, j) {
if (i == 1 | j == 1) {
return(data.frame(I = c(), J = c(), LCS = c()))
} else if (X[i - 1] == Y[j - 1]) {
return(rbind(backtrack(C, X, Y, i - 1, j - 1),
data.frame(LCS = X[i - 1], I = i - 1, J = j - 1)))
} else if (C[i, j - 1] > C[i - 1, j]) {
return(backtrack(C, X, Y, i, j - 1))
} else {
return(backtrack(C, X, Y, i - 1, j))
}
}
return(backtrack(C, X, Y, m + 1, n + 1))
}
Part II: given two sequences, find the shortest common supersequence
ShortestCommonSupersequence <- function(X, Y) {
LCS <- LongestCommonSubsequence(X, Y)[c("I", "J")]
X.df <- data.frame(X = X, I = seq_along(X), stringsAsFactors = FALSE)
Y.df <- data.frame(Y = Y, J = seq_along(Y), stringsAsFactors = FALSE)
ALL <- merge(LCS, X.df, by = "I", all = TRUE)
ALL <- merge(ALL, Y.df, by = "J", all = TRUE)
ALL <- ALL[order(pmax(ifelse(is.na(ALL$I), 0, ALL$I),
ifelse(is.na(ALL$J), 0, ALL$J))), ]
ALL$SCS <- ifelse(is.na(ALL$X), ALL$Y, ALL$X)
ALL
}
Your Example:
ShortestCommonSupersequence(X = c("a","g","b","h","a","g","c"),
Y = c("a","g","b","a","g","b","h","c"))
# J I X Y SCS
# 1 1 1 a a a
# 2 2 2 g g g
# 3 3 3 b b b
# 9 NA 4 h <NA> h
# 4 4 5 a a a
# 5 5 6 g g g
# 6 6 NA <NA> b b
# 7 7 NA <NA> h h
# 8 8 7 c c c
(where the two updated vectors are in columns X and Y.)

How to fill off-diagonals and ignore diagonals in matrix in R?

I am trying to fill a matrix in R where the final result will ignore the diagonal entries and the values will be filled in around the diagonal. A simple example of what I mean is, if I take a simple 3x3 matrix like the one shown below:
ab <- c(1:9)
mat <- matrix(ab,nrow=3,ncol=3)
colnames(mat)<- paste0("x", 1:3)
rownames(mat)<- paste0("y", 1:3)
mat
x1 x2 x3
y1 1 4 7
y2 2 5 8
y3 3 6 9
What I want to achieve is to fill the diagonals with 0 and shift all the other values around the diagonal. So, for example if I just use diag(mat)<-0 that results in this:
x1 x2 x3
y1 0 4 7
y2 2 0 8
y3 3 6 0
Whereas, the result I'm looking for is something like this (where the values get wrapped around the diagonal):
x1 x2 x3
y1 0 3 5
y2 1 0 6
y3 2 4 0
I'm not worried about the values that are pushed out of the matrix (i.e., 7,8,9).
Any suggestions?
Thanks
EDIT: The upvoted solution below, seems to have solved the problem

One solution that works for your example is to first declare a matrix full of ones except on the diagonal:
M <- 1 - diag(3)
And then to replace all the ones by the desired off-diagonal values
M[M == 1] <- 1:6
M
# [,1] [,2] [,3]
# [1,] 0 3 5
# [2,] 1 0 6
# [3,] 2 4 0
A more complicated scenario (e.g. diagonal coefficients that are not 0, or an unknown number of off-diagonal elements) might need a little bit of additional work.
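
One possible way to handle those more general cases is to build the off-diagonal mask with row() and col() instead of relying on the values already in the matrix. This is a sketch; fill_off_diagonal and its arguments are hypothetical names.

```r
# Fill the off-diagonal cells of an n x n matrix in column-major order,
# leaving diag_value on the diagonal; surplus values are ignored.
fill_off_diagonal <- function(values, n, diag_value = 0) {
  M <- matrix(diag_value, nrow = n, ncol = n)
  off <- row(M) != col(M)               # TRUE for every off-diagonal cell
  M[off] <- values[seq_len(sum(off))]   # use only as many values as there are cells
  M
}

res3 <- fill_off_diagonal(1:9, 3)
res3
```

Unlike M[M == 1] <- ..., this does not break if the diagonal value happens to equal the placeholder used for the off-diagonal cells.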

You may need a loop:
n <- 9
seqs <- seq(1:n)
mats <- matrix(0, nrow = 3, ncol = 3)
ind <- 0
for(i in 1:nrow(mats)){
for(j in 1:nrow(mats)){
if(i == j) {
mats[i,j] <- 0 }
else {
ind <- ind + 1
mats[j,i] <- seqs[ind]
}
}
}
Resulting in:
>mats
[,1] [,2] [,3]
[1,] 0 3 5
[2,] 1 0 6
[3,] 2 4 0

This will work for your example. I'm not sure both n1 and n2 are needed; they could be replaced by a single value if the matrix is always square.
# original data
ab <- c(1:9)
n1 <- 3
n2 <- 3
# You could add the 0's to the diagonal, by adding a 0 before every n1 split
# of the data e.g. 0,1,2,3 & 0,4,5,6 & 0,7,8,9
split_ab <- split(ab, ceiling((1:length(ab))/n1))
update_split_ab <- lapply(split_ab, function(x){
c(0, x)
})
new_ab <- unlist(update_split_ab)
mat <- matrix(new_ab, nrow=n1, ncol=n2)
colnames(mat)<- paste0("x", 1:n2)
rownames(mat)<- paste0("y", 1:n1)
mat
# turn this in to a function
makeShiftedMatrix <- function(ab=1:9, n1=3, n2=3){
split_ab <- split(ab, ceiling((1:length(ab))/n1))
update_split_ab <- lapply(split_ab, function(x){
c(0, x)
})
new_ab <- unlist(update_split_ab)
mat <- matrix(new_ab, nrow=n1, ncol=n2)
colnames(mat)<- paste0("x", 1:n2)
rownames(mat)<- paste0("y", 1:n1)
mat
return(mat)
}
# default
makeShiftedMatrix()
# to read in original matrix and shift:
old_mat <- matrix(ab, nrow=n1, ncol=n2)
makeShiftedMatrix(ab=unlist(old_mat))
