How to take pairwise combinations of the variables of each row?

I am working in R and am trying to take a dataframe with 7 columns and create a 2 column dataframe with all the combinations of each row's responses stacked on top of each other.
For example if I had:
1,0,1,0
I would want to transform that row to the rows
1,0
1,1
1,0
0,1
0,0
1,0
And do that to every row in the dataframe and stack them.
I know how to do this for 1 row at a time
df2<-combn(df[1,],2)
That code gets me the combinations of one row, like my example above; however, I can't figure out how to apply it to all rows. My best guess would be something along the lines of
df3<-apply(1:nrow(df), 1, function(x) combn(df[x,],2))
However, I am getting the "dim(x) must have a positive length" error. Does anyone know what my problem is, and can you explain what I am doing wrong and why I need to do it a certain way? I am new to coding in R beyond base functions. As far as data goes, it's just binary data.

Since you didn't provide sample data, I'll use some made-up data. You can use lapply to store each row's combinations as a list element and then use do.call with rbind to bind them all together:
set.seed(1)
df <- data.frame(col1 = sample(1:4), col2 = sample(1:4), col3 = sample(1:4), col4 = sample(1:4))
df3 <- do.call(rbind, lapply(1:nrow(df), function(i) t(combn(df[i,],2))))
head(df3)
# [,1] [,2]
#[1,] 2 1
#[2,] 2 3
#[3,] 2 3
#[4,] 1 3
#[5,] 1 3
#[6,] 3 3
Note: df3 is returned as a matrix, not a data frame.
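As for the error in the question: apply() expects an object with dimensions (a matrix or array), and 1:nrow(df) is a plain vector whose dim() is NULL. The following is a minimal sketch of that point and of the lapply approach, using made-up binary data; the names pairs_per_row and stacked are only for illustration, and unlist() is used here so that combn() receives an ordinary numeric vector.

```r
# Made-up binary data: two rows, four columns.
df <- data.frame(a = c(1, 0), b = c(0, 1), c = c(1, 0), d = c(0, 1))

# A plain vector has no dimensions, which is what apply() complains about.
is.null(dim(1:nrow(df)))

# One 6 x 2 matrix of pairs per row of df.
pairs_per_row <- lapply(seq_len(nrow(df)), function(i) {
  t(combn(unlist(df[i, ]), 2))
})

# Stack the per-row matrices on top of each other.
stacked <- do.call(rbind, pairs_per_row)
dim(stacked)
```

With 4 columns there are choose(4, 2) = 6 pairs per row, so two rows give a 12 x 2 result.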

Something like this:
df <- data.frame(a = c(1,0), b = c(0,1), c = c(1,0), d = c(0,1))
matrix(c(apply(df[1:2,], 1, function(x) combn(x, 2))), ncol = 2, byrow = TRUE)
[,1] [,2]
[1,] 1 0
[2,] 1 1
[3,] 1 0
[4,] 0 1
[5,] 0 0
[6,] 1 0
[7,] 0 1
[8,] 0 0
[9,] 0 1
[10,] 1 0
[11,] 1 1
[12,] 0 1

Related

Subset assignment of multidimensional array in R

I am trying to assign rows of a 3D array, but I don't know exactly how.
I have a 2D index array where each row corresponds to the first and second index of the 3D array, and a 2D value array which I want to insert into the 3D array. The simplest way I found to do this was
indexes <- cbind(1:30, rep(c(1, 2), 15))
rows <- cbind(1:20, 31:50, 71:90)
for (i in 1:nrow(indexes)) for (j in 1:3)
data[indexes[i,1], indexes[i,2], j] <- rows[i, j]
But this is hard to read, because it uses nested indexing, so I was hoping there was a simpler way, like
data[indexes,] <- rows
(this does not work)
What I've tried:
this question shows how to index the array (without assignment)
apply(data, 3, `[`, indexes)
but this doesn't allow assignment
apply(data, 3, `[`, indexes) <- rows #: could not find function "apply<-"
nor does using [<- work:
apply(data, 3, `[<-`, indexes, rows)
because it treats rows as a vector.
Neither of the following works either
data[indexes[1], indexes[2],] <- rows #: subscript out of bounds
data[indexes,] <- rows #: incorrect number of subscripts on matrix
So is there a simpler way of assigning to a multidimensional array?
Your indexes variable implies that data has first dim of 30, but rows[30,j] doesn't exist. So your problem isn't well posed, and I'll change it.
The basic idea is that you can index a 3 way array by an n x 3 matrix. Each row of the matrix corresponds to a location in the 3 way array, so if you want to set entry data[1,2,3] to 4, and entry data[5,6,7] to 8, you'd use
index <- rbind(c(1,2,3), c(5,6,7))
data[index] <- c(4,8)
You will need to expand your indexes variable to replicate each row 3 times, then read the rows matrix as a vector, and then this works:
data <- array(NA, dim=c(30, 2, 3))
indexes <- cbind(1:30, rep(c(1, 2), 15))
rows <- cbind(1:30, 31:60, 71:100)
indexes1 <- indexes[rep(1:nrow(indexes), each = 3),]
indexes2 <- cbind(indexes1, 1:3)
data[indexes2] <- t(rows) # Transpose because R reads down columns first
I don't think this is any simpler than what you had with the for loops, but maybe you'll find it preferable.
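One way to convince yourself that the matrix-index assignment matches the nested loops is to run both on the same inputs and compare. A minimal sketch, reusing the corrected dimensions from this answer (the names data_loop, data_mat and idx are only for illustration):

```r
indexes <- cbind(1:30, rep(c(1, 2), 15))
rows <- cbind(1:30, 31:60, 71:100)

# Version 1: nested loops, as in the question.
data_loop <- array(NA, dim = c(30, 2, 3))
for (i in seq_len(nrow(indexes))) for (j in 1:3)
  data_loop[indexes[i, 1], indexes[i, 2], j] <- rows[i, j]

# Version 2: a single assignment through an n x 3 index matrix.
data_mat <- array(NA, dim = c(30, 2, 3))
idx <- cbind(indexes[rep(seq_len(nrow(indexes)), each = 3), ], 1:3)
data_mat[idx] <- t(rows)  # t() so values are read row by row

# Both versions should fill exactly the same cells.
isTRUE(all.equal(data_loop, data_mat))
```

Cells that no index row points at, such as data_mat[1, 2, 1], stay NA in both versions.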
After reading @user2554330's answer, I found a slightly simpler solution:
# initialize as in user2554330's answer
data <- ...
indexes <- ...
rows <- ...
indexes3 <- as.matrix(merge(indexes, 1:3))
data[indexes3] <- rows
Comparison of indexes2 and indexes3 (using fewer elements):
# print(indexes2)
[,1] [,2] [,3]
[1,] 1 1 1
[2,] 1 1 2
[3,] 1 1 3
[4,] 2 2 1
[5,] 2 2 2
[6,] 2 2 3
[7,] 3 1 1
[8,] 3 1 2
[9,] 3 1 3
[10,] 4 2 1
[11,] 4 2 2
[12,] 4 2 3
# print(indexes3)
[,1] [,2] [,3]
[1,] 1 1 1
[2,] 2 2 1
[3,] 3 1 1
[4,] 4 2 1
[5,] 1 1 2
[6,] 2 2 2
[7,] 3 1 2
[8,] 4 2 2
[9,] 1 1 3
[10,] 2 2 3
[11,] 3 1 3
[12,] 4 2 3

How to use which() on a matrix to get unique indices

Suppose I have a symmetric matrix:
> mat <- matrix(c(1,0,1,0,0,0,1,0,1,1,0,0,0,0,0,0), ncol=4, nrow=4)
> mat
[,1] [,2] [,3] [,4]
[1,] 1 0 1 0
[2,] 0 0 1 0
[3,] 1 1 0 0
[4,] 0 0 0 0
which I would like to analyse:
> which(mat==1, arr.ind=T)
row col
[1,] 1 1
[2,] 3 1
[3,] 3 2
[4,] 1 3
[5,] 2 3
Now the question is: how can I avoid counting duplicated cells? As the resulting index matrix shows, rows 2 and 4 point to (3,1) and (1,3) respectively, which are the same cell by symmetry.
How do I avoid such a situation? I only need one reference for each cell, even though the matrix is symmetric. Is there an easy way to deal with such situations?
EDIT:
I was thinking about using upper.tri or lower.tri, but in this case what I get is a vector version of the matrix, and I am not able to get back to the (row, col) notation.
> which(mat[upper.tri(mat)]==1, arr.ind=T)
[1] 2 3
EDIT II
The expected output would be something like a unique set over the pairs (row, col) and (col, row):
row col
[1,] 1 1
[2,] 3 1
[3,] 3 2
Since you have a symmetric matrix, you could do
which(mat == 1 & upper.tri(mat, diag = TRUE), arr.ind = TRUE)
# row col
#[1,] 1 1
#[2,] 1 3
#[3,] 2 3
OR
which(mat == 1 & lower.tri(mat, diag = TRUE), arr.ind = TRUE)
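The reason this keeps the (row, col) notation is that upper.tri() and lower.tri() return logical matrices of the same shape as mat, so combining them with mat == 1 still gives a matrix that which(..., arr.ind = TRUE) can index. A minimal sketch on the matrix from the question (the name lower_idx is only for illustration):

```r
mat <- matrix(c(1,0,1,0, 0,0,1,0, 1,1,0,0, 0,0,0,0), ncol = 4, nrow = 4)

# The trick only makes sense if the matrix really is symmetric.
isSymmetric(mat)

# The mask is a logical matrix with the same dimensions as mat.
dim(lower.tri(mat, diag = TRUE))

# Keep only cells on or below the diagonal that equal 1.
lower_idx <- which(mat == 1 & lower.tri(mat, diag = TRUE), arr.ind = TRUE)
lower_idx
```

The lower-triangle variant gives the cells (1,1), (3,1) and (3,2), which is the expected output listed in the question.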

Filling a matrix in R

I am trying to fill some rows of a (500,2) matrix with the row vector (1,0) using this code, last line is to verify the result:
data<-matrix(ncol=2,nrow=500)
data[41:150,]<-matrix(c(1,0),nrow=1,ncol=2,byrow=TRUE)
data[41:45,]
But the result is
> data[41:45,]
[,1] [,2]
[1,] 1 1
[2,] 0 0
[3,] 1 1
[4,] 0 0
[5,] 1 1
instead of
> data[41:45,]
[,1] [,2]
[1,] 1 0
[2,] 1 0
[3,] 1 0
[4,] 1 0
[5,] 1 0
(1) What am I doing wrong?
(2) Why aren't the row indices in the result 41, 42, 43, 44 and 45?
You're trying to fill a part of the matrix, so the block you're trying to drop in there should be of the right size:
data[41:150,]<-matrix(c(1,0),nrow=110,ncol=2,byrow=TRUE)
# nrow = 110, instead of 1 !!!!
Otherwise your piece-to-be-added will be treated as a plain vector and recycled column-wise. Try, for example, this:
data[41:150,] <- matrix(c(1,2,3,4,5), nrow=5, ncol=2, byrow=TRUE)
data[41:45,]
[,1] [,2]
[1,] 1 1
[2,] 3 3
[3,] 5 5
[4,] 2 2
[5,] 4 4
Can one complain? Yes and no. No, because R behaves as documented (matrices are vectors with dimension attributes, and recycling works on vectors). Yes, because although recycling can be convenient, it may create false expectations.
Why aren't row indices 41,42,43,... ? I don't know, that's just the way matrices and vectors behave.
> (1:10)[5:6]
[1] 5 6
(Notice there's [1] in the output, not [5].)
Data frames behave differently, so you would see the original line numbers for slices:
as.data.frame(data)[45:50,]
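The recycling behaviour is easier to see on a tiny matrix. A minimal sketch, using a made-up 4 x 2 matrix rather than the 500 x 2 one from the question (the names small and fixed are only for illustration):

```r
# A 1 x 2 block is flattened to c(1, 0) and recycled down the columns.
small <- matrix(NA, nrow = 4, ncol = 2)
small[, ] <- matrix(c(1, 0), nrow = 1, ncol = 2, byrow = TRUE)
small   # both columns alternate 1, 0, 1, 0

# Supplying the values in column order gives the intended rows of (1, 0).
fixed <- matrix(NA, nrow = 4, ncol = 2)
fixed[, ] <- rep(c(1, 0), each = nrow(fixed))
fixed   # first column all 1, second column all 0
```

Because the target is filled column by column, the replacement values have to arrive in column order, which is what rep(..., each = ...) arranges.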
It will be cleaner to just do this column-wise:
data[41:150, 1L] = 1
data[41:150, 2L] = 0
You could also accomplish this in one line with matrix indexing like so:
data[cbind(rep(41:150, each = 2L), 1:2)] = 1:0
You could use rep.
data[41:150,] <- rep(1:0, each=150-41+1)
#> data[41:45,]
# [,1] [,2]
#[1,] 1 0
#[2,] 1 0
#[3,] 1 0
#[4,] 1 0
#[5,] 1 0
I think MichaelChirico's approach is the cleanest/safest to use.

what does rbind.fill.matrix really do?

I have this code and can't understand how rbind.fill.matrix is used.
dtmat is a matrix with the documents on rows and words on columns.
word <- do.call(rbind.fill.matrix,lapply(1:ncol(dtmat), function(i) {
t(rep(1:length(dtmat[,i]), dtmat[,i]))
}))
I read the description of the function. It says that it binds matrices and fills missing columns with NA, but I cannot understand which matrices it binds.
From what I understand, the function fills columns that are missing from a matrix with NA.
Let's say I have two matrices: A with two columns (col1 and col2) and B with three columns (col1, col2 and colA). rbind only binds matrices with the same number of columns, so it cannot stack A and B. rbind.fill.matrix stacks them anyway and puts NA wherever a matrix lacks one of the columns. The code below explains it more clearly.
a <- matrix(c(1,1,2,2), nrow = 2, byrow = T)
> a
[,1] [,2]
[1,] 1 1
[2,] 2 2
>
> b <- matrix(c(1,1,1,2,2,2,3,3,3), nrow = 3, byrow = T)
> b
[,1] [,2] [,3]
[1,] 1 1 1
[2,] 2 2 2
[3,] 3 3 3
>
> library(plyr)
> r <- rbind.fill.matrix(a,b)
> r
1 2 3
[1,] 1 1 NA
[2,] 2 2 NA
[3,] 1 1 1
[4,] 2 2 2
[5,] 3 3 3
>
>
The documentation also mentions column names, which I think you can also understand from the example.
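To connect this back to the code in the question: each call inside lapply produces a 1-row matrix whose length depends on the word counts, so the rows generally have different widths and need padding before they can be stacked. The sketch below uses a made-up dtmat and a base R stand-in for the padding step; it is not plyr's implementation, only an illustration of the effect when the matrices have no column names.

```r
# Hypothetical 3-document x 2-word count matrix (made up for illustration).
dtmat <- matrix(c(2, 0, 1,
                  0, 1, 0), nrow = 3,
                dimnames = list(NULL, c("apple", "pear")))

# One 1-row matrix per word: each document index repeated by its count.
per_word <- lapply(seq_len(ncol(dtmat)), function(i) {
  t(rep(seq_len(nrow(dtmat)), dtmat[, i]))
})
per_word[[1]]  # documents 1, 1, 3
per_word[[2]]  # document 2

# Base R stand-in for the fill step: pad each row with NA to the widest row.
width <- max(vapply(per_word, ncol, integer(1)))
word <- do.call(rbind, lapply(per_word, function(m) {
  c(m, rep(NA, width - ncol(m)))
}))
word
```

Each row of word then lists, for one word, the documents in which it occurs (once per occurrence), with NA filling the unused positions.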

Obtaining connected components of neighboring values

I have a matrix with values 0 or 1 and I would like to obtain a list of groups of adjacent 1's. Vertical and horizontal neighbors of each 1 are considered when defining the connected groups.
For example, the matrix
mat = rbind(c(1,0,0,0,0),
c(1,0,0,1,0),
c(0,0,1,0,0),
c(0,0,0,0,0),
c(1,1,1,1,1))
> mat
[,1] [,2] [,3] [,4] [,5]
[1,] 1 0 0 0 0
[2,] 1 0 0 1 0
[3,] 0 0 1 0 0
[4,] 0 0 0 0 0
[5,] 1 1 1 1 1
should return the following 4 connected components:
C1 = {(1,1);(2,1)}
C2 = {(2,4)}
C3 = {(3,3)}
C4 = {(5,1);(5,2);(5,3);(5,4);(5,5)}
Does anybody have an idea of how to do it fast in R? My real matrix is rather large, around 2000x2000 (but I expect the number of connected components to be reasonably small, around 200).
You can turn your binary matrix into a raster object and use the raster::clump function to "Detect clumps (patches) of connected cells. Each clump gets a unique ID". Then it is just data management to return the exact format you want. Example below:
library(igraph)
library(raster)
mat = rbind(c(1,0,0,0,0),
c(1,0,0,1,0),
c(0,0,1,0,0),
c(0,0,0,0,0),
c(1,1,1,1,1))
Rmat <- raster(mat)
Clumps <- as.matrix(clump(Rmat, directions=4))
#turn the clumps into a list
tot <- max(Clumps, na.rm=TRUE)
res <- vector("list",tot)
for (i in seq_len(tot)) {
res[[i]] <- which(Clumps == i, arr.ind = TRUE)
}
res then prints at the console:
> res
[[1]]
row col
[1,] 1 1
[2,] 2 1
[[2]]
row col
[1,] 2 4
[[3]]
row col
[1,] 3 3
[[4]]
row col
[1,] 5 1
[2,] 5 2
[3,] 5 3
[4,] 5 4
[5,] 5 5
I wouldn't be surprised if there is a better way to go from the raster object to your end goal though. Again a 2000 by 2000 matrix should not be a big deal for this.
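If installing raster is not an option, the same kind of grouping can be sketched in base R with a breadth-first flood fill over the 4-neighbourhood. This is an illustrative alternative, not what raster::clump does internally; the function name label_components is hypothetical, and a plain R loop like this will likely be much slower than the raster approach on a 2000 x 2000 matrix.

```r
mat <- rbind(c(1,0,0,0,0),
             c(1,0,0,1,0),
             c(0,0,1,0,0),
             c(0,0,0,0,0),
             c(1,1,1,1,1))

# Hypothetical helper: label 4-connected groups of 1's with integer IDs.
label_components <- function(m) {
  labels <- matrix(0L, nrow(m), ncol(m))
  steps <- rbind(c(-1, 0), c(1, 0), c(0, -1), c(0, 1))  # up, down, left, right
  current <- 0L
  for (start in which(m == 1)) {
    if (labels[start] != 0L) next        # already part of a component
    current <- current + 1L
    labels[start] <- current
    queue <- start
    while (length(queue) > 0) {
      cell <- queue[1]
      queue <- queue[-1]
      pos <- arrayInd(cell, dim(m))      # linear index -> (row, col)
      for (k in seq_len(nrow(steps))) {
        r <- pos[1] + steps[k, 1]
        cc <- pos[2] + steps[k, 2]
        if (r < 1 || r > nrow(m) || cc < 1 || cc > ncol(m)) next
        if (m[r, cc] == 1 && labels[r, cc] == 0L) {
          labels[r, cc] <- current
          queue <- c(queue, (cc - 1) * nrow(m) + r)
        }
      }
    }
  }
  labels
}

labels <- label_components(mat)
res <- lapply(seq_len(max(labels)), function(i) which(labels == i, arr.ind = TRUE))
length(res)  # number of connected components
```

The component IDs follow the order in which cells are first reached, so the numbering need not match the C1 to C4 order in the question.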
Old (wrong answer) but should be useful for people who want connected components of a graph.
You can use the igraph package to turn your adjacency matrix into a network and return the components. Your example graph is one component, so I removed one edge for illustration.
library(igraph)
mat = rbind(c(1,0,0,0,0),
c(1,0,0,1,0),
c(0,0,1,0,0),
c(0,0,0,0,0),
c(1,1,1,1,1))
g <- graph.adjacency(mat) %>% delete_edges("5|3")
plot(g)
clu <- components(g)
groups(clu)
The final line then returns at the prompt:
> groups(clu)
$`1`
[1] 1 2 4 5
$`2`
[1] 3
In my experience this algorithm is pretty fast, so I don't think 2,000 by 2,000 will be a problem.
