(For the following, I could use either an R data.frame or an R data.table. Both are OK.)
I have the following data.table:
library(data.table)
dt = data.table(V1=c("dog", "dog", "cat", "cat", "cat", "bird","bird","bird","bird"),
V2=rep(42, 9), V3=c(1, 2, 4, 5, 7, 1, 2, 5, 8))
> print(dt)
V1 V2 V3
1: dog 42 1
2: dog 42 2
3: cat 42 4
4: cat 42 5
5: cat 42 7
6: bird 42 1
7: bird 42 2
8: bird 42 5
9: bird 42 8
Column V3 contains integers from 1 to 8. My goal is to populate an 8 by 8 zero matrix with the count of each combination "pair" of V3 values that occurs within the same category in column V1.
So, the combination pairs for dog, cat, and bird are:
dog: (1, 2)
cat: (4, 5), (4, 7), (5, 7)
bird: (1, 2), (1, 5), (1, 8), (2, 5), (2, 8), (5, 8)
For each pair, I add +1 to the corresponding entry in the zero matrix. For this matrix, (n, m) = (m, n). The matrix given dt would be:
1 2 3 4 5 6 7 8
1: 0 2 0 0 1 0 0 1
2: 2 0 0 0 1 0 0 1
3: 0 0 0 0 0 0 0 0
4: 0 0 0 0 1 0 1 0
5: 1 1 0 1 0 0 1 1
6: 0 0 0 0 0 0 0 0
7: 0 0 0 1 1 0 0 0
8: 1 1 0 0 1 0 0 0
Note that (1,2)=(2,1) has a count 2, from the dog combination and the bird combination.
(1) Is there a method to calculate the combinations of values in an R data.table/data.frame column, given the unique value in another column?
Perhaps it would make sense to output an R list, with vector "pairs", e.g.
list(c(1, 2), c(2, 1), c(4, 5), c(4, 7), c(5, 7), c(5, 4), c(7, 4), c(7, 5),
c(1, 2), c(1, 5), c(1, 8), c(2, 5), c(2, 8), c(5, 8), c(2, 1), c(5, 1),
c(8, 1), c(5, 2), c(8, 2), c(8, 5))
However, I'm not sure how I would use this to populate a matrix...
(2) Given the input data.table/data.frame, what would be the most efficient data structure to use to write out a matrix, as shown above?
Here's a data.table solution that seems to be efficient. We are basically doing a non-equi self join to create the combinations and then counting them. Then, similar to what @coldspeed did with NumPy, we just update a zero matrix at those locations with the counts.
# a self join
tmp <- dt[dt,
.(V1, id = x.V3, id2 = V3),
on = .(V1, V3 < V3),
nomatch = 0L,
allow.cartesian = TRUE
][, .N, by = .(id, id2)]
## Create a zero matrix and update by locations
m <- array(0L, rep(max(dt$V3), 2L))
m[cbind(tmp$id, tmp$id2)] <- tmp$N
m + t(m)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
# [1,] 0 2 0 0 1 0 0 1
# [2,] 2 0 0 0 1 0 0 1
# [3,] 0 0 0 0 0 0 0 0
# [4,] 0 0 0 0 1 0 1 0
# [5,] 1 1 0 1 0 0 1 1
# [6,] 0 0 0 0 0 0 0 0
# [7,] 0 0 0 1 1 0 0 0
# [8,] 1 1 0 0 1 0 0 0
Alternatively, we could create tmp using data.table::CJ, but that could potentially be less memory efficient (thanks to @Frank for the tip), as it creates all possible combinations first, e.g.
tmp <- dt[, CJ(V3, V3)[V1 < V2], by = .(g = V1)][, .N, by = .(V1, V2)]
## Then, as previously
m <- array(0L, rep(max(dt$V3), 2L))
m[cbind(tmp$V1, tmp$V2)] <- tmp$N
m + t(m)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
# [1,] 0 2 0 0 1 0 0 1
# [2,] 2 0 0 0 1 0 0 1
# [3,] 0 0 0 0 0 0 0 0
# [4,] 0 0 0 0 1 0 1 0
# [5,] 1 1 0 1 0 0 1 1
# [6,] 0 0 0 0 0 0 0 0
# [7,] 0 0 0 1 1 0 0 0
# [8,] 1 1 0 0 1 0 0 0
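For comparison, here is a minimal base R sketch of the same count that needs no join. It assumes each V3 value appears at most once per V1 group (true for the example data); the names df, inc and m are only illustrative. The idea is to build a group-by-value incidence matrix and take its cross-product.

```r
# Same example data as in the question, as a plain data.frame.
df <- data.frame(
  V1 = c("dog", "dog", "cat", "cat", "cat", "bird", "bird", "bird", "bird"),
  V3 = c(1, 2, 4, 5, 7, 1, 2, 5, 8)
)

# Group-by-value incidence matrix: inc[g, v] is 1 if value v occurs in group g.
n <- max(df$V3)
inc <- unclass(table(df$V1, factor(df$V3, levels = seq_len(n))))

# crossprod(inc) is t(inc) %*% inc, so entry (i, j) counts the groups that
# contain both i and j. The diagonal counts groups per value, so zero it out.
m <- crossprod(inc)
diag(m) <- 0
dimnames(m) <- NULL
m
```

Because the cross-product is symmetric by construction, there is no need for the m + t(m) step here.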
Not sure this is the most elegant approach, but it works:
myfun <- function(x, matsize=8) {
# get all (i,j) pairs but in an unfortunate text format
pairs_all <- outer(x, x, paste)
# "drop" all self-pairs like (1,1)
diag(pairs_all) <- "0 0"
# convert these text-pairs into numeric pairs and store in matrix
ij <- do.call(rbind, lapply(strsplit(pairs_all, " "), as.numeric))
# create "empty" matrix of zeros
mat <- matrix(0, nrow=matsize, ncol=matsize)
# replace each spot of empty matrix with a 1 if that pair exists
mat[ij] <- 1
# return 0/1 matrix
return(mat)
}
# split your data by group
# lapply the custom function to each group
# add each group's 0/1 matrix together for final result
Reduce('+', lapply(split(dt$V3, dt$V1), myfun))
If anyone has a more direct way to implement the first 3 (non-comment) lines of myfun, I would happily incorporate them.
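One possibly more direct way to write those first lines, sketched here under the assumption that the values in x are positive integers no larger than matsize, is to let expand.grid build the numeric (i, j) pairs so that no text round trip is needed. The name myfun2 is hypothetical.

```r
myfun2 <- function(x, matsize = 8) {
  # all ordered (i, j) pairs as a numeric two-column matrix
  ij <- as.matrix(expand.grid(i = x, j = x))
  # drop self-pairs such as (1, 1)
  ij <- ij[ij[, "i"] != ij[, "j"], , drop = FALSE]
  # create the zero matrix and mark every remaining pair with a 1
  mat <- matrix(0, nrow = matsize, ncol = matsize)
  mat[ij] <- 1
  mat
}

V1 <- c("dog", "dog", "cat", "cat", "cat", "bird", "bird", "bird", "bird")
V3 <- c(1, 2, 4, 5, 7, 1, 2, 5, 8)

# split by group, build one 0/1 matrix per group, and add them up
res <- Reduce(`+`, lapply(split(V3, V1), myfun2))
res
```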
Related
I would like to transform a vector of integers such as:
vector = c(0,6,1,8,5,4,2)
length(vector) = 7
max(vector) = 8
into a matrix m of nrow = length(vector) and ncol = max(vector) :
m =
0 0 0 0 0 0 0 0
1 1 1 1 1 1 0 0
1 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1
1 1 1 1 1 0 0 0
1 1 1 1 0 0 0 0
1 1 0 0 0 0 0 0
It's just an example of what I am trying to do. I want the function to work with any vector of integers.
I tried to use mapply(rep, 1, vector), but I obtained a list and did not succeed in converting it into a matrix...
It would be very useful for me if someone can help me.
Best Regards,
Maxime
If you use c(rep(1, x), rep(0, max(vector) - x)) on each element x of your variable vector, you get the desired binary rows. Looping that with sapply even returns a matrix. You only need to transpose it afterwards to get your result.
vector = c(0,6,1,8,5,4,2)
result <- t(sapply(vector, function(x) c(rep(1, x), rep(0, max(vector)-x))))
is.matrix(result)
#> [1] TRUE
result
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#> [1,] 0 0 0 0 0 0 0 0
#> [2,] 1 1 1 1 1 1 0 0
#> [3,] 1 0 0 0 0 0 0 0
#> [4,] 1 1 1 1 1 1 1 1
#> [5,] 1 1 1 1 1 0 0 0
#> [6,] 1 1 1 1 0 0 0 0
#> [7,] 1 1 0 0 0 0 0 0
Putting that into a function is easy:
binaryMatrix <- function(v) {
t(sapply(v, function(x) c(rep(1, x), rep(0, max(v)-x))))
}
binaryMatrix(vector)
# same result as before
Created on 2021-02-14 by the reprex package (v1.0.0)
Another straightforward approach would be to exploit matrix sub-assignment using row/column indices in a matrix form (see, also, ?Extract).
Define a matrix of 0s:
x = c(0, 6, 1, 8, 5, 4, 2)
m = matrix(0L, nrow = length(x), ncol = max(x))
And fill with 1s:
i = rep(seq_along(x), x) ## row indices of 1s
j = sequence(x) ## column indices of 1s
ij = cbind(i, j)
m[ij] = 1L
m
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#[1,] 0 0 0 0 0 0 0 0
#[2,] 1 1 1 1 1 1 0 0
#[3,] 1 0 0 0 0 0 0 0
#[4,] 1 1 1 1 1 1 1 1
#[5,] 1 1 1 1 1 0 0 0
#[6,] 1 1 1 1 0 0 0 0
#[7,] 1 1 0 0 0 0 0 0
Assuming that all values in the vector are non-negative integers, you can define the following function
transformVectorToMatrix <- function(v) {
nrOfCols <- max(v)
zeroRow <- integer(nrOfCols)
do.call("rbind",lapply(v,function(nrOfOnes) {
if(nrOfOnes==0) return(zeroRow)
if(nrOfOnes==nrOfCols) return(zeroRow+1)
c(integer(nrOfOnes)+1,integer(nrOfCols-nrOfOnes))
}))
}
and finally do
m = transformVectorToMatrix(vector)
to get your desired binary matrix.
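As a further base R sketch (the function name binary_outer is hypothetical), the same matrix can be written as a single comparison: entry (i, j) is 1 exactly when j <= v[i]. This assumes, like the answers above, that v holds non-negative integers.

```r
binary_outer <- function(v) {
  # compare every element of v with every column index 1..max(v);
  # multiplying by 1L turns the logical matrix into 0/1 integers
  outer(v, seq_len(max(v)), ">=") * 1L
}

v <- c(0, 6, 1, 8, 5, 4, 2)
m <- binary_outer(v)
m
```

Each row sum equals the corresponding element of v, which is a quick way to check the result.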
I have a contingency table (ct) like this:
ct <- read.table(text = "
1 2 3 4 5 6
1 0 0 1 0 2 0
2 0 0 2 0 0 0
70 0 0 3 0 0 0
76 15 13 19 2 9 8
85 0 0 2 0 0 0
109 0 0 0 0 1 2
479 0 0 0 0 2 0
491 2 0 0 0 0 0
1127 0 1 0 1 6 0
1131 0 1 1 1 2 0
1206 1 3 1 0 0 1
1208 1 0 1 0 0 1
1210 0 1 0 0 0 1
1225 2 0 1 0 0 0
1232 0 0 0 0 1 1
1242 0 0 0 1 0 1
1243 1 0 0 0 1 1
1251 0 0 2 0 1 2
1267 0 2 1 0 0 0
4415 0 2 0 0 0 0
4431 0 0 0 2 0 0
4808 0 0 0 0 2 0
4823 0 2 0 0 0 0", header = TRUE, check.names = FALSE)
Where rows represent cluster, columns represent hospitals and numbers in the table the count of isolates.
For example: Cluster 1 has 3 isolates, 1 in hospital 3 and 2 in hospital 5.
I now want to check whether clusters and hospitals are dependent on each other. For that, I would like to create 1000 randomly distributed tables, in which every isolate in a cluster has a chance to fall into any hospital.
For example: The 3 isolates in cluster 1 might then be distributed over 3 hospitals, so that I get the values: 0 1 1 1 0 0.
Combinations can occur multiple times.
I tried this:
replicates <- 1000
permutations <- lapply(seq(replicates), function(i, ct){
list <- lapply(apply(ct,1,list),unlist)
list <- lapply(list, function(x)as.numeric(x))
z <- as.data.frame(do.call(rbind, lapply(list, function(x) sample(x))))
}, ct = ct)
But that only shuffles the existing values in the data frame to other positions within the row.
Can someone help me with that?
I concur with Maurits Evers' answer: at full rank you get binomial combinations per line, so n variables mean 2^n combinations, and if you add m-1 columns this yields 2^(n+m) possibilities.
Here's an alternative using partitions::compositions.
library(partitions)
# smaller toy data
d <- data.frame(x1 = c(0, 1, 1), x2 = c(2, 2, 0), x3 = c(0, 1, 1))
# calculate row sums
rs <- rowSums(d)
# for each unique row sum, partition the value with order m = number of columns
# this avoids repeating calculation of partitions on duplicate row sums
l <- lapply(unique(rs), compositions, m = ncol(d))
# name list elements with row sums
names(l) <- unique(rs)
# set number of samples
n <- 4
# to reproduce sample in this example
set.seed(1)
# loop over rows in data frame
lapply(1:nrow(d), function(i){
# index list of partitions using row sums
m <- l[[as.character(rs[i])]]
# number of columns to sample from
nc <- ncol(m)
# select columns from matrix using a sample of n column indexes
m[ , sample(nc, n, replace = TRUE)]
})
The result is a list where each element is a matrix for each row of the original data. Each matrix column is one (sampled) partition.
# [[1]]
# [,1] [,2] [,3] [,4]
# [1,] 1 0 1 0
# [2,] 1 2 0 0
# [3,] 0 0 1 2
#
# [[2]]
# [,1] [,2] [,3] [,4]
# [1,] 1 0 0 2
# [2,] 3 1 0 0
# [3,] 0 3 4 2
#
# [[3]]
# [,1] [,2] [,3] [,4]
# [1,] 1 2 1 1
# [2,] 0 0 1 1
# [3,] 1 0 0 0
I tried to partition the largest row sum in your example data (66), and it runs pretty quickly. Thus, if your row sums are not very much larger and the number of columns is small (like here), the code above may be a viable option.
system.time(p <- compositions(66, 6))
# user system elapsed
# 1.53 0.16 1.68
str(p)
# 'partition' int [1:6, 1:13019909] 66 0 0 0 0 0 65 1 0 0 ...
Note that it 'explodes' rapidly if the number of columns increases:
system.time(p <- compositions(66, 7))
# user system elapsed
# 14.11 1.61 15.72
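The growth can be estimated before calling compositions at all. As a small sketch (the helper name n_compositions is hypothetical), the stars-and-bars formula gives the number of ways to write n as an ordered sum of m non-negative integers as choose(n + m - 1, m - 1), which is the number of columns compositions(n, m) produces when zeros are allowed.

```r
# number of compositions of n into m non-negative parts (stars and bars)
n_compositions <- function(n, m) choose(n + m - 1, m - 1)

n_compositions(66, 6)
# 13019909, the same as the column count shown by str(p) above

# going from 6 to 7 columns multiplies the count by 12
n_compositions(66, 7) / n_compositions(66, 6)

# rough memory need in bytes for the integer matrix (4 bytes per entry)
4 * 6 * n_compositions(66, 6)
```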
Sorry @Henrik for the late response. Your code worked out quite well for me! However, with the help of a colleague of mine, I figured out this code (I'll just show it using your sample data):
#data
d <- data.frame(x1 = c(0, 1, 1), x2 = c(2, 2, 0), x3 = c(0, 1, 1))
#Number of replicates I want
replicates <- 1000
#Number of columns in the table
k<- 3
l <- NULL
#unlist the dataframe
list <- lapply(apply(d,1,list),unlist)
#Calculate replicates of the dataframe, where numbers are permuted within rows
permutations <- lapply(seq(replicates), function(j){
l_sampled <- lapply(list, function(x){
pos.random <- sample(k, sum(x), replace = T)
x.random <- rep(0,k)
for (i in 1:k){
x.random[i] <- sum(pos.random==i)
}
l = rbind(l, data.frame(x.random))
})
df <- data.frame(matrix(unlist(l_sampled), nrow=length(l_sampled), byrow=T))
})
#Example for results:
> permutations[[8]]
X1 X2 X3
1 2 0 0
2 1 2 1
3 1 0 1
> permutations[[10]]
X1 X2 X3
1 0 1 1
2 2 0 2
3 0 2 0
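Dropping each isolate of a row into one of k equally likely columns is a multinomial draw, so the inner loop can probably be replaced by rmultinom. The following is only a sketch under that assumption (equal probability for every column); it uses 5 replicates instead of 1000 to keep the example small.

```r
d <- data.frame(x1 = c(0, 1, 1), x2 = c(2, 2, 0), x3 = c(0, 1, 1))
replicates <- 5
k <- ncol(d)
rs <- rowSums(d)   # number of isolates per row (cluster)

set.seed(1)        # only to make the example reproducible
permutations <- lapply(seq_len(replicates), function(r) {
  # one multinomial draw per row: rs[i] isolates over k equally likely columns
  draws <- vapply(
    rs,
    function(n) rmultinom(1, size = n, prob = rep(1, k))[, 1],
    integer(k)
  )
  # vapply returns one column per row of d, so transpose back
  as.data.frame(t(draws))
})

permutations[[1]]
```

Every replicate keeps the original row sums, while the column totals are free to vary.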
How can I automatically build a matrix that converts the combinations of a vector <- c(1,2,3) into a kind of binary format? Like this:
x <- matrix(c(1,1,0,0,0,1,0,1,1,0,1,1,0,0,0,1,1,1), ncol = 3)
rownames(x) <- c("1", "1,2", "2", "3", "2,3", "1,2,3")
colnames(x) <- c("1", "2", "3")
x
1 2 3
1 1 0 0
1,2 1 1 0
2 0 1 0
3 0 0 1
2,3 0 1 1
1,2,3 1 1 1
Though I would like to have not 3, but 7 values.
Here is a one-liner that is really fast:
vector <- c(1,2,3)
library(RcppAlgos)
toBinary <- function(v) permuteGeneral(0:1, length(v), TRUE)[-1,]
toBinary(vector)
[,1] [,2] [,3]
[1,] 0 0 1
[2,] 0 1 0
[3,] 0 1 1
[4,] 1 0 0
[5,] 1 0 1
[6,] 1 1 0
[7,] 1 1 1
The [-1, ] is to remove the row of all zeros. This row would represent the empty set in a power set. In fact, what you are asking for is technically a mapping from the power set of a vector (minus the empty set of course) to a binary matrix.
If you really want the row.names to be the actual permutations, you can use the powerSet function from the rje package. Observe:
library(rje)
nameTest <- toBinary(vector)
row.names(nameTest) <- lapply(powerSet(rev(vector))[-1], sort)
nameTest
[,1] [,2] [,3]
3 0 0 1
2 0 1 0
c(2, 3) 0 1 1
1 1 0 0
c(1, 3) 1 0 1
c(1, 2) 1 1 0
c(1, 2, 3) 1 1 1
* Disclaimer: I am the author of RcppAlgos
One approach would be
x <- c(2, 4, 5)
combs <- sapply(1:length(x), combn, x = x)
M <- do.call(rbind, sapply(combs, function(u)
t(apply(u, 2, function(v) 1 * x %in% v))))
dimnames(M) <- list(unlist(sapply(combs, apply, 2, paste, collapse = ",")), x)
M
# 2 4 5
# 2 1 0 0
# 4 0 1 0
# 5 0 0 1
# 2,4 1 1 0
# 2,5 1 0 1
# 4,5 0 1 1
# 2,4,5 1 1 1
Here is a function that will turn any vector to the appropriate binary matrix,
get_binary <- function(x){
v1 <- unlist(sapply(seq_along(x), function(i) combn(x, i, toString)))
# match whole values rather than substrings, so that e.g. 1 is not found inside 11
mat <- t(sapply(v1, function(i) as.integer(x %in% strsplit(i, ", ", fixed = TRUE)[[1]])))
colnames(mat) <- x
return(mat)
}
get_binary(c(2, 8, 9))
which gives,
2 8 9
2 1 0 0
8 0 1 0
9 0 0 1
2, 8 1 1 0
2, 9 1 0 1
8, 9 0 1 1
2, 8, 9 1 1 1
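For the 7-value case from the question, a package-free sketch along the same lines could use expand.grid to enumerate every 0/1 pattern. The function name subset_matrix is hypothetical, and the rows are ordered by subset size, which may differ from the order the other answers produce.

```r
subset_matrix <- function(v) {
  n <- length(v)
  # every 0/1 pattern of length n; the first row (all zeros) is the empty set
  bits <- as.matrix(expand.grid(rep(list(0:1), n)))[-1, , drop = FALSE]
  # order rows by subset size
  bits <- bits[order(rowSums(bits)), , drop = FALSE]
  # label each row with the values it selects, and each column with its value
  dimnames(bits) <- list(
    apply(bits, 1, function(r) paste(v[r == 1], collapse = ",")),
    as.character(v)
  )
  bits
}

subset_matrix(c(2, 4, 5))
dim(subset_matrix(1:7))   # 2^7 - 1 = 127 rows, 7 columns
```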
Example code here:
> temp2
a b c d e f g h
i 1 1 0 0 0 1 0 1
j 0 1 0 0 0 1 0 1
k 0 1 1 0 0 1 1 1
l 0 0 0 0 1 0 0 1
m 0 0 1 1 0 0 1 1
n 0 0 1 1 0 0 1 1
o 0 0 0 1 0 0 1 1
p 0 0 0 0 1 0 0 1
> dput(temp2)
structure(list(a = c(1, 0, 0, 0, 0, 0, 0, 0), b = c(1, 1, 1,
0, 0, 0, 0, 0), c = c(0, 0, 1, 0, 1, 1, 0, 0), d = c(0, 0, 0,
0, 1, 1, 1, 0), e = c(0, 0, 0, 1, 0, 0, 0, 1), f = c(1, 1, 1,
0, 0, 0, 0, 0), g = c(0, 0, 1, 0, 1, 1, 1, 0), h = c(1, 1, 1,
1, 1, 1, 1, 1)), .Names = c("a", "b", "c", "d", "e", "f", "g",
"h"), class = "data.frame", row.names = c("i", "j", "k", "l",
"m", "n", "o", "p"))
I have this 8x8 grid of 1s and 0s. I need to solve for some grid where each row and each column has exactly one 1 and the rest 0s, but the 1 has to be in a place where the original grid has a 1. It's almost like a sudoku question but not exactly. Any thoughts on how to get started?
I would need some function that can do this for a general grid, not simply this specific one. We can assume that there's always a solution grid, given some starting grid.
Thanks!
Edit: a valid solution
> temp3
a b c d e f g h
i 1 0 0 0 0 0 0 0
j 0 1 0 0 0 0 0 0
k 0 0 0 0 0 1 0 0
l 0 0 0 0 1 0 0 0
m 0 0 0 1 0 0 0 0
n 0 0 1 0 0 0 0 0
o 0 0 0 0 0 0 1 0
p 0 0 0 0 0 0 0 1
EDIT2: given that there are at most 8! candidate arrangements for any 8x8 grid, I may attempt a brute force / matching approach.
This can be solved as a transportation problem or as an integer programming problem. We also show a one-line solution using only base R, which generates random matrices in which each row and each column sums to 1, then filters them and returns the ones satisfying the additional constraint that each element of the solution matrix be less than or equal to the corresponding element of temp2.
1) transportation problem Using lp.transport in lpSolve we can solve it in one statement:
library(lpSolve)
res <- lp.transport(as.matrix(temp2), "max",
rep("=", 8), rep(1, 8), rep("=", 8), rep(1, 8), integers = 0:1)
res
## Success: the objective function is 8
soln <- array(res$solution, dim(temp2))
# verify
all(colSums(soln)==1) && all(rowSums(soln)==1) && all(temp2>=soln) && all(soln %in% 0:1)
## [1] TRUE
2) integer programming
If X is the solution we have specified the row and column constraints but have not specified the X <= temp2 constraints since they will be satisfied automatically as no solution putting a 1 where a temp2 0 is can have the maximum objective of 8.
library(lpSolve)
n <- nrow(temp2)
obj <- unlist(temp2)
const_row <- t(sapply(1:n, function(i) c(row(temp2)) == i)) # each row sums to 1
const_col <- t(sapply(1:n, function(i) c(col(temp2)) == i)) # each col sums to 1
const.mat <- rbind(const_row, const_col)
res <- lp("max", obj, const.mat, "=", 1, all.bin = TRUE)
res
## Success: the objective function is 8
soln <- array(res$solution, dim(temp2))
# verify
all(colSums(soln)==1) && all(rowSums(soln)==1) && all(temp2>=soln) && all(soln %in% 0:1)
## [1] TRUE
(Note that by the same argument we could have relaxed the problem to a linear programming problem provided we add 0 <= soln[i, j] <= 1 constraints since by the same argument that allowed us to omit the soln[i, j] <= temp2[i, j] constraints the maximization will force the soln elements to be 0 or 1 anyways.)
2a) This approach is longer but does spell out the X <= temp2 constraints explicitly:
n <- nrow(temp2)
obj <- numeric(n*n)
const1 <- diag(n*n) # soln[i,j] <= temp2[i,j]
const2 <- t(sapply(1:n, function(i) c(row(temp2)) == i)) # each row sums to 1
const3 <- t(sapply(1:n, function(i) c(col(temp2)) == i)) # each col sums to 1
const.mat <- rbind(const1, const2, const3)
const.dir <- rep(c("<=", "="), c(n*n, 2*n))
const.rhs <- c(unlist(temp2), rep(1, 2*n))
res <- lp("max", obj, const.mat, const.dir, const.rhs, all.bin = TRUE)
res
## Success: the objective function is 0
soln <- array(res$solution, dim(temp2))
# verify
all(colSums(soln)==1) && all(rowSums(soln)==1) && all(temp2>=soln) && all(soln %in% 0:1)
## [1] TRUE
2b) Note that if X is the solution matrix then in X <= temp2 only the positions of X corresponding to zeros in temp2 actually constrain so we could eliminate any constraint corresponding to a 1 in temp2 in the (2a) solution. With this change all constraints become equality constraints.
n <- nrow(temp2)
obj <- numeric(n*n)
const1 <- diag(n*n)[unlist(temp2) == 0, ]
const2 <- t(sapply(1:n, function(i) c(row(temp2)) == i)) # each row sums to 1
const3 <- t(sapply(1:n, function(i) c(col(temp2)) == i)) # each col sums to 1
const.mat <- rbind(const1, const2, const3)
const.dir <- "="
const.rhs <- c(numeric(nrow(const1)), rep(1, 2*n))
res <- lp("max", obj, const.mat, const.dir, const.rhs, all.bin = TRUE)
res
## Success: the objective function is 0
soln <- array(res$solution, dim(temp2))
# verify
all(colSums(soln)==1) && all(rowSums(soln)==1) && all(temp2>=soln) && all(soln %in% 0:1)
## [1] TRUE
In fact, we could go further and remove the variables that correspond to zero elements of temp2.
3) r2dtable Here we use r2dtable to generate 10,000 8x8 tables whose rows and columns sum to 1 and then filter them to pick out only those satisfying the X <= temp2 constraints. With temp2 from the question and the random seed shown, it found 3 solutions. If with different inputs it finds no solutions, then try generating a higher number of random proposals. This approach does not use any packages.
set.seed(123) # for reproducibility
Filter(function(x) all(x <= temp2), r2dtable(10000, rep(1, 8), rep(1, 8)))
giving:
[[1]]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 1 0 0 0 0 0 0 0
[2,] 0 0 0 0 0 1 0 0
[3,] 0 1 0 0 0 0 0 0
[4,] 0 0 0 0 0 0 0 1
[5,] 0 0 0 0 0 0 1 0
[6,] 0 0 1 0 0 0 0 0
[7,] 0 0 0 1 0 0 0 0
[8,] 0 0 0 0 1 0 0 0
[[2]]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 1 0 0 0 0 0 0 0
[2,] 0 0 0 0 0 1 0 0
[3,] 0 1 0 0 0 0 0 0
[4,] 0 0 0 0 1 0 0 0
[5,] 0 0 0 1 0 0 0 0
[6,] 0 0 1 0 0 0 0 0
[7,] 0 0 0 0 0 0 1 0
[8,] 0 0 0 0 0 0 0 1
[[3]]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 1 0 0 0 0 0 0 0
[2,] 0 1 0 0 0 0 0 0
[3,] 0 0 0 0 0 1 0 0
[4,] 0 0 0 0 1 0 0 0
[5,] 0 0 1 0 0 0 0 0
[6,] 0 0 0 0 0 0 1 0
[7,] 0 0 0 1 0 0 0 0
[8,] 0 0 0 0 0 0 0 1
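How many proposals are enough can be estimated roughly. The sketch below assumes that r2dtable with all-ones margins draws each of the 8! permutation matrices with equal probability, and it takes the count of 16 valid grids from the brute-force answer in this thread; both are assumptions rather than something this answer states.

```r
n_solutions <- 16                     # valid grids found by the brute-force answer
p_hit <- n_solutions / factorial(8)   # chance that one random proposal is valid
n_draws <- 10000

# expected number of valid proposals (repeats of the same grid count each time)
expected_hits <- n_draws * p_hit
expected_hits

# probability that none of the proposals is valid
p_no_hit <- (1 - p_hit)^n_draws
p_no_hit
```

With only a handful of hits expected, finding 3 with one particular seed is unsurprising, and a grid with fewer valid arrangements would need proportionally more proposals.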
A brute-force way:
m = as.matrix(temp2)
w = data.frame(which(m == 1, arr.ind = TRUE))
combos = as.matrix(do.call(expand.grid, with(w, split(col, row))))
combos[ apply(combos, 1, function(x) !anyDuplicated(x)), ]
1 2 3 4 5 6 7 8
[1,] 1 6 2 8 7 3 4 5
[2,] 1 2 6 8 7 3 4 5
[3,] 1 6 2 8 3 7 4 5
[4,] 1 2 6 8 3 7 4 5
[5,] 1 6 2 8 4 3 7 5
[6,] 1 2 6 8 4 3 7 5
[7,] 1 6 2 8 3 4 7 5
[8,] 1 2 6 8 3 4 7 5
[9,] 1 6 2 5 7 3 4 8
[10,] 1 2 6 5 7 3 4 8
[11,] 1 6 2 5 3 7 4 8
[12,] 1 2 6 5 3 7 4 8
[13,] 1 6 2 5 4 3 7 8
[14,] 1 2 6 5 4 3 7 8
[15,] 1 6 2 5 3 4 7 8
[16,] 1 2 6 5 3 4 7 8
OP claims to only ever need to handle an 8x8 grid, so I guess this performs well enough. Each row of the result is a solution. The first row says that (1,1), (2,6), (3,2) ... is a solution.
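To turn one of those rows back into a 0/1 grid like temp3, the column positions can be used as a two-column index matrix. This is a small sketch; temp2 is rebuilt here as a matrix from the question's dput output, and the names sol and soln are only illustrative.

```r
# the grid from the question, as a matrix
temp2 <- cbind(
  a = c(1, 0, 0, 0, 0, 0, 0, 0), b = c(1, 1, 1, 0, 0, 0, 0, 0),
  c = c(0, 0, 1, 0, 1, 1, 0, 0), d = c(0, 0, 0, 0, 1, 1, 1, 0),
  e = c(0, 0, 0, 1, 0, 0, 0, 1), f = c(1, 1, 1, 0, 0, 0, 0, 0),
  g = c(0, 0, 1, 0, 1, 1, 1, 0), h = c(1, 1, 1, 1, 1, 1, 1, 1)
)
rownames(temp2) <- letters[9:16]   # "i" to "p"

# first row of the result above: the column chosen for each row
sol <- c(1, 6, 2, 8, 7, 3, 4, 5)

# place a 1 at (row, chosen column) for every row
soln <- matrix(0, nrow(temp2), ncol(temp2), dimnames = dimnames(temp2))
soln[cbind(seq_along(sol), sol)] <- 1
soln
```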
A variation using data.table:
library(data.table)
m = as.matrix(temp2)
comboDT = setDT(melt(m))[ value == 1, do.call(CJ, split(Var2, Var1)) ][,
rid := .I ][, melt(.SD, id="rid", variable.name="row", value.name="col")]
setkey(comboDT, rid)
comboDT[ .( comboDT[, !anyDuplicated(col), by=rid][(V1), rid]) ]
This works. Let grid be my grid as a matrix (as.matrix(temp2) from above). If the loop exits before the counter reaches 10000, rand_grid holds a grid that works.
# create random sufficient grid
counter = 0
while(2 > 1) {
counter = counter + 1
if(counter == 10000) {
break
}
rand_grid = matrix(0, nrow = 8, ncol = 8)
indices_avail = seq(1,8,by=1)
for(i in 1:8) {
k = sample(indices_avail, 1)
rand_grid[i, k] = 1
indices_avail = indices_avail[indices_avail != k]
}
if(sum(grid[which(rand_grid == 1)]) == 8) {
break
}
print(counter)
}
This approach will return all valid combinations. First find all matrix row combinations. Then search through exhaustively. This method would have to be improved if your matrix size increased. One simple improvement would be to run the diag test in parallel.
st<-as.matrix(temp2) # make sure we are working with matrices
x<-seq_len(nrow(st)) # row indices to permute (assumed: x was not defined in the original post)
## This method will return all possible matrices of combinations
## in essence if you have diag(matr) = width matrix then you have
## a valid choice
## Helper function to build all combinations, there may be better way to
## do this but it gets the job done
allCombinationsAux<-function(z,nreg,x){
if(sum(nreg)>1){
## recurse on the remaining indices (the original post called an undefined
## test1 here, presumably an earlier name of this function)
innerLoop<-do.call(rbind,lapply(x[nreg&(z!=x)], allCombinationsAux,nreg&(z!=x),x))
ret<-cbind(z,innerLoop )
}
else{
ret<-x[nreg]
}
ret
}
## Build all of the combinations of possible matrices
combs<-do.call(rbind,lapply(x,function(y) allCombinationsAux(y,y!=x,x)))
## iterate through all the possible combinations of matrices, to find out
## which ones have 1s throughout the diag
inds<-which(apply(combs,1,function(x) sum(diag(st[x,]))==8))
lapply(inds,function(x) st[combs[x,],])
While there are great answers already here for the brute-force approach and actually using math, just for kicks, here's a version that guesses and checks lags of the non-matching columns. For the example in question, it actually turns out to be quite quick, and as a bonus, you could find a new answer on any particular run! How fun! To the code:
set.seed(47) # remove this to have more fun
mat.in <- as.matrix(temp2) # we'll work in matrices
mat.out <- diag(8) # a starting guess
dimnames(mat.out) <- dimnames(mat.in) # make our answer pretty
iteration <- 1 # for kicks, a loop counter
while (any((mat.out != mat.in)[as.logical(mat.out)])) {
mat.ref <- mat.out
mat.out <- mat.out[, sample(8)] # make this deterministic if you like
inner <- 1 # don't repeat yourself (too much)
while (any(mat.out != mat.ref) & inner <= 8) {
mat.ref <- mat.out
# find non-matching indices and lag those columns
to.lag <- which((mat.out != mat.in)[as.logical(mat.out)])
i <- 1:8
i[to.lag] <- c(to.lag[length(to.lag)], to.lag[-length(to.lag)])
mat.out <- mat.out[, i]
cat(inner, " ") # let's see what it does
inner <- inner + 1
}
print(iteration) # whoo, scrolling numbers
iteration <- iteration + 1
}
## 1 2 3 [1] 1
## 1 2 3 4 5 6 7 8 [1] 2
## 1 2 [1] 3
## 1 2 3 [1] 4
which, for this particular seed returns
mat.out
## a c e g d b f h
## i 1 0 0 0 0 0 0 0
## j 0 0 0 0 0 1 0 0
## k 0 1 0 0 0 0 0 0
## l 0 0 0 0 1 0 0 0
## m 0 0 1 0 0 0 0 0
## n 0 0 0 0 0 0 1 0
## o 0 0 0 1 0 0 0 0
## p 0 0 0 0 0 0 0 1
It could certainly be optimized further, but it's already pretty quick (without the printing, which slows it down):
Unit: microseconds
expr min lq mean median uq max neval
let's guess 137.796 383.6445 838.2327 693.819 1163.08 2510.436 100
running all 100 times in a fraction of a second. It's quite a bit faster than actual guessing (chopping out the inner loop):
Unit: microseconds
expr min lq mean median uq max neval cld
guess smart 148.997 349.916 848.6314 588.162 1085.841 3117.78 100 a
actually guess 322.458 7341.961 31197.1237 20012.969 47677.501 160250.02 100 b
Note, though, that luck plays a role here, and if there are fewer solutions, it will take longer. If there are no solutions, it will run forever. It could, of course, be optimized to avoid such a fate by making sure it doesn't reuse the same starting permutation provided by sample(8) (a good idea regardless, which I deemed superfluous here as it only runs through a handful of permutations each run anyway). Hack away.
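If running forever is a concern, one alternative that always terminates is to treat the grid as a bipartite graph (rows on one side, columns on the other) and look for a perfect matching with augmenting paths, often called Kuhn's algorithm. This is a different technique from the guess-and-check loop above, sketched in base R under the assumption of a square 0/1 grid; find_assignment is a hypothetical name, and it returns NULL when no valid grid exists.

```r
find_assignment <- function(grid) {
  grid <- as.matrix(grid)
  col_owner <- rep(NA_integer_, ncol(grid))  # row currently holding each column
  visited <- logical(ncol(grid))             # columns tried in the current search

  # try to give row i a column, moving earlier rows elsewhere if needed
  augment <- function(i) {
    for (j in which(grid[i, ] == 1)) {
      if (visited[j]) next
      visited[j] <<- TRUE
      if (is.na(col_owner[j]) || augment(col_owner[j])) {
        col_owner[j] <<- i
        return(TRUE)
      }
    }
    FALSE
  }

  for (i in seq_len(nrow(grid))) {
    visited[] <- FALSE
    if (!augment(i)) return(NULL)            # row i cannot be placed: no solution
  }

  # build the 0/1 solution grid from the column owners
  soln <- matrix(0, nrow(grid), ncol(grid), dimnames = dimnames(grid))
  ok <- !is.na(col_owner)
  soln[cbind(col_owner[ok], which(ok))] <- 1
  soln
}

# the grid from the question
temp2 <- data.frame(
  a = c(1, 0, 0, 0, 0, 0, 0, 0), b = c(1, 1, 1, 0, 0, 0, 0, 0),
  c = c(0, 0, 1, 0, 1, 1, 0, 0), d = c(0, 0, 0, 0, 1, 1, 1, 0),
  e = c(0, 0, 0, 1, 0, 0, 0, 1), f = c(1, 1, 1, 0, 0, 0, 0, 0),
  g = c(0, 0, 1, 0, 1, 1, 1, 0), h = c(1, 1, 1, 1, 1, 1, 1, 1),
  row.names = letters[9:16]
)
soln <- find_assignment(temp2)
soln
```

The search is deterministic, so it returns the same valid grid on every run rather than a random one.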
I have a dataframe of 9 columns consisting of an inventory of factors. Each row can have all 9 columns filled (as in that row is holding 9 "things"), but most don't (most have between 3-4). The columns aren't specific either, as in if item 200 shows up in columns 1 and 3, it's the same thing. I'd like to create a matrix that is binary for each row that includes all factors.
Ex (shortened to 4 columns just to get point across)
R1 3 4 5 8
R2 4 6 7 NA
R3 1 5 NA NA
R4 2 6 8 9
Should turn into
1 2 3 4 5 6 7 8 9
r1 0 0 1 1 1 0 0 1 0
r2 0 0 0 1 0 1 1 0 0
r3 1 0 0 0 1 0 0 0 0
r4 0 1 0 0 0 1 0 1 1
I've looked into writeBin/readBin, K-clustering (which is something I'd like to do, but I need to get rid of the NAs first), fuzzy clustering, tag clustering. Just kinda lost about what direction to go.
I've tried writing two for loops that pull the data from the matrix by column/row and then save 0s and 1s respectively in a new matrix, but I think there were scope issues.
You guys are the best. Thanks!
Here's a base R solution:
# Read in the data, and convert to matrix form
df <- read.table(text = "
3 4 5 8
4 6 7 NA
1 5 NA NA
2 6 8 9", header = FALSE)
m <- as.matrix(df)
# Create a two column matrix containing row/column indices of cells to be filled
# with 'one's
id <- cbind(rowid = as.vector(t(row(m))),
colid = as.vector(t(m)))
id <- id[complete.cases(id), ]
# Create output matrix
out <- matrix(0, nrow = nrow(m), ncol = max(m, na.rm = TRUE))
out[id] <- 1
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
# [1,] 0 0 1 1 1 0 0 1 0
# [2,] 0 0 0 1 0 1 1 0 0
# [3,] 1 0 0 0 1 0 0 0 0
# [4,] 0 1 0 0 0 1 0 1 1
This should do the trick:
# The Incantation
options(stringsAsFactors = FALSE)
library(reshape2)
# Your example data
dat <- data.frame(id = c("R1", "R2", "R3", "R4"),
col1 = c(3, 4, 1, 2),
col2 = c(4, 6, 5, 6),
col3 = c(5, 7, NA, 7),
col4 = c(8, NA, NA, 9)
)
# Melt it down
dat.melt <- melt(dat, id.var = "id")
# Cast it back out, with the row IDs remaining the row IDs
# and the values of the columns becoming the columns themselves.
# dcast() will default to length to aggregate records - which means
# that the values in this data.frame are a count of how many times
# each value occurs in each row's columns (which, based on this data,
# seems to be capped at just once).
dat.cast <- dcast(dat.melt, id ~ value)
The result:
dat.cast
id 1 2 3 4 5 6 7 8 9 NA
1 R1 0 0 1 1 1 0 0 1 0 0
2 R2 0 0 0 1 0 1 1 0 0 1
3 R3 1 0 0 0 1 0 0 0 0 2
4 R4 0 1 0 0 0 1 1 0 1 0
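A base R cross-tabulation gives a similar result without the extra NA column, because table drops missing values by default. This is only a sketch using the four-column example from the question; the names m, items, tab and out are illustrative.

```r
# example data from the question, one inventory row per matrix row
m <- matrix(c(3, 4, 5, 8,
              4, 6, 7, NA,
              1, 5, NA, NA,
              2, 6, 8, 9), nrow = 4, byrow = TRUE)

# treat the cell values as a factor so every item 1..max gets a column
items <- factor(m, levels = seq_len(max(m, na.rm = TRUE)))

# count how often each item occurs in each row; NA cells are ignored
tab <- table(row = as.vector(row(m)), item = items)

# reduce counts to presence/absence
out <- (unclass(tab) > 0) * 1L
out
```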
These are all great answers. Thought I'd contribute the original solution I wrote that a friend of mine modified to actually work.
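# assumed setup: x is the inventory as a numeric matrix, y starts as all zeros
y <- matrix(0, nrow = nrow(x), ncol = max(x, na.rm = TRUE))
for(i in seq(nrow(x)))
for(j in seq(ncol(x)))
if(!is.na(x[i,j])) { y[i, x[i,j]] = 1 }
The two for loops work after setting up some parameters first, but they are incredibly slow. It looks like the other solutions are much faster!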