Quickly generate the Cartesian product of a matrix in R

Let's say I have a matrix x which contains 10 rows and 2 columns. I want to generate a new matrix M that contains each unique pair of rows from x (including each row paired with itself), that is, a new matrix with 55 rows and 4 columns.
E.g.,
x <- matrix(1:20, nrow = 10, ncol = 2)
M <- data.frame(matrix(ncol = 4, nrow = 55))
k <- 1
for (i in 1:nrow(x)) {
  for (j in i:nrow(x)) {
    M[k, ] <- unlist(cbind(x[i, ], x[j, ]))
    k <- k + 1
  }
}
So, x is:
      [,1] [,2]
 [1,]    1   11
 [2,]    2   12
 [3,]    3   13
 [4,]    4   14
 [5,]    5   15
 [6,]    6   16
 [7,]    7   17
 [8,]    8   18
 [9,]    9   19
[10,]   10   20
And then M has 4 columns; the first two are one row from x and the next two are another row from x:
> head(M,10)
   X1 X2 X3 X4
1   1 11  1 11
2   1 11  2 12
3   1 11  3 13
4   1 11  4 14
5   1 11  5 15
6   1 11  6 16
7   1 11  7 17
8   1 11  8 18
9   1 11  9 19
10  1 11 10 20
Is there either a faster or simpler (or both) way of doing this in R?

The expand.grid() function is useful for this:
R> GG <- expand.grid(1:10,1:10)
R> GG <- GG[GG[,1]>=GG[,2],] # trim it to your 55 pairs
R> dim(GG)
[1] 55 2
R> head(GG)
  Var1 Var2
1    1    1
2    2    1
3    3    1
4    4    1
5    5    1
6    6    1
R>
Now you have the n*(n+1)/2 subsets and you can simply index your original matrix.

I'm not quite grokking what you are doing, so I'll just throw out something that may or may not help.
Here's what I think of as the Cartesian product of the two columns:
expand.grid(x[,1],x[,2])

You can also try the "relations" package. Here is the vignette. It should work like this:
relation_table(x %><% x)

Using Dirk's answer:
idx <- expand.grid(1:nrow(x), 1:nrow(x))
idx <- idx[idx[, 1] >= idx[, 2], ]
N <- cbind(x[idx[, 2], ], x[idx[, 1], ])
> all(M == N)
[1] TRUE
Thanks everyone!

Inspired by the other answers, here is a function implementing the Cartesian product of two matrices: given two matrices, it returns the full Cartesian product; given only one argument, it omits one of each symmetric pair:
cartesian_prod <- function(M1, M2) {
  if (missing(M2)) {
    # One argument: pair the rows of M1 with themselves,
    # keeping only one of each symmetric pair
    M2 <- M1
    ind <- expand.grid(1:NROW(M1), 1:NROW(M2))
    ind <- ind[ind[, 1] >= ind[, 2], ]
  } else {
    ind <- expand.grid(1:NROW(M1), 1:NROW(M2))
  }
  cbind(M1[ind[, 1], ], M2[ind[, 2], ])
}
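For example, checking it against the M built in the question (P is just a scratch name): with one argument the function puts the higher-indexed row in the first two columns, so the column pairs come out swapped relative to M.
P <- cartesian_prod(x)
dim(P)                        # 55 rows, 4 columns
all(M == P[, c(3, 4, 1, 2)])  # TRUE once the pair order is swapped back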

Related

Crosschecking numbers of a matrix in R

I'm currently working with a large matrix of two columns, and I want to check whether every row (a combination of the two columns) is also present in a loaded data frame (two columns as well).
Example,
(obj_design <- matrix(c(2,5,4,7,6,6,20,12,4,0), nrow = 5, ncol = 2))
     [,1] [,2]
[1,]    2    6
[2,]    5   20
[3,]    4   12
[4,]    7    4
[5,]    6    0
(refined_grid <- data.frame(i=1:4, j=1:12))
   i  j
1  1  1
2  2  2
3  3  3
4  4  4
5  1  5
6  2  6
7  3  7
8  4  8
9  1  9
10 2 10
11 3 11
12 4 12
In the reproducible example, (2,6) and (4,12) would be selected.
I'm wondering if there's a function I can use to check the whole matrix, see whether a specific row is in the data frame, and (if possible) write out separately (as a new dataset) which rows of the matrix were found.
Any assistance would be wonderful.
Here is an option with match:
i1 <- match(do.call(paste, as.data.frame(obj_design)),
            do.call(paste, refined_grid), nomatch = 0)
refined_grid[i1, ]
This code will give you which rows of the matrix exist in the dataframe.
which(paste(obj_design[,1], obj_design[,2]) %in%
paste(refined_grid$i, refined_grid$j)
)
Then you can just assign it to a vector!
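Putting the two pieces together, a minimal sketch (hits and matched are names of my own choosing):
# Which rows of obj_design also occur in refined_grid
hits <- which(paste(obj_design[, 1], obj_design[, 2]) %in%
                paste(refined_grid$i, refined_grid$j))
hits
# [1] 1 3

# A new dataset holding just the matching rows
matched <- obj_design[hits, , drop = FALSE]
matched
#      [,1] [,2]
# [1,]    2    6
# [2,]    4   12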

Get specific column value for each row

Given an m x n matrix, I want to get a length-m vector that, for each row, gives the value in the column identified by another column (say column "Z").
I made it using a for loop:
for (i in 1:dim(data.frame)[1]){vector[i] <- data.frame[i,data.frame$Z[i]]}
Do you see a simpler way to code this that avoids the loop?
"apply" is a possibility:
> M <- cbind(matrix(1:15, 3, 5), "Z" = c(3, 1, 2))
> M
                 Z
[1,] 1 4 7 10 13 3
[2,] 2 5 8 11 14 1
[3,] 3 6 9 12 15 2
> v <- apply(M, 1, function(x) { x[x["Z"]] })
> v
[1] 7 2 6
>
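As a base-R alternative to apply(), indexing a matrix with a two-column (row, column) index matrix extracts one element per row in a single vectorized step; for the M above this should give the same v:
# cbind(row, col) builds one (row, column) index pair per row of M
v <- M[cbind(seq_len(nrow(M)), M[, "Z"])]
v
# [1] 7 2 6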

Cbind/Rbind With Ifelse Condition

Here is the code that I am working with:
x <- c("Yes","No","No","Yes","Maybe")
y <- t(1:10)
z <- t(11:20)
rbind.data.frame(ifelse(x == "Yes",y,z))
This produces:
  X1L X12L X13L X4L X15L
1   1   12   13   4   15
The desired outcome is:
      x
1   Yes  1  2  3  4  5  6  7  8  9 10
2    No 11 12 13 14 15 16 17 18 19 20
3    No 11 12 13 14 15 16 17 18 19 20
4   Yes  1  2  3  4  5  6  7  8  9 10
5 Maybe 11 12 13 14 15 16 17 18 19 20
I was thinking that I could use an ifelse statement with the rbind.data.frame() or cbind.data.frame() function: if x == "Yes", that element would be combined with the vector y, as shown in the first row of the desired output; conversely, if x != "Yes", it would be combined with the vector z. How would I go about doing this? I also thought indexing with the which() function might be possible, but I could not think of how I would use it.
UPDATE: ANOTHER QUESTION
Here is the code I am working with :
a <- c(1,0,1,0,0,0,0)
b <- 1:7
t(sapply(a, function(test) if(test==0) b else 0))
Which produces:
     [,1] [,2]      [,3] [,4]      [,5]      [,6]      [,7]
[1,] 0    Integer,7 0    Integer,7 Integer,7 Integer,7 Integer,7
Can anyone explain this? The code below works, but I was wondering why the sapply call does not. Also, how would I save the matrix that is created below?
`row.names<-`(t(t(rbind(b,0))[,(a!='1')+1L]),x)
Answer To My Most Recent Question
For the sapply function to simplify the result into a matrix, every branch needs to return a vector of the same length, so:
a <- c(1,0,1,0,0,0,0)
b <- 1:7
c <- rep(0, times = length(a))
t(sapply(a, function(test) if(test==0) b else c))
Now this produces the proper output.
Try
t(sapply(x, function(x) if(x=='Yes') y else z))
Or
`row.names<-`(t(t(rbind(y,z))[,(x!='Yes')+1L]),x)
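To reproduce the exact data frame shown in the desired outcome, the sapply() version can be wrapped together with the x column; a minimal sketch (USE.NAMES = FALSE avoids duplicate row names from the repeated values of x; res is my own name):
x <- c("Yes", "No", "No", "Yes", "Maybe")
y <- t(1:10)
z <- t(11:20)

# Pick y or z per element of x, then bind x on as the first column
res <- data.frame(x, t(sapply(x, function(v) if (v == "Yes") y else z,
                              USE.NAMES = FALSE)))
res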

How to use logical values to access elements of data frame

Say I have this data frame:
x <- data.frame(matrix(rep(1:5, each=5), nrow=5))
Say I want to square all values that are greater than 3 and put those values back into x.
I identify the values that are greater than 3 by:
x > 3
Then how can I reference these values in x? Doing x[x>3] returns a vector of integers, not a data frame.
Note that I am more so interested in this particular problem of x[x>3] and not as much the actual application that I included simply as motivation.
Just use matrix indexing:
ind <- which(x > 3, arr.ind = TRUE)
x[ind] <- x[ind] * x[ind] ## or x[ind] <- x[ind]^2
x
#   X1 X2 X3 X4 X5
# 1  1  2  3 16 25
# 2  1  2  3 16 25
# 3  1  2  3 16 25
# 4  1  2  3 16 25
# 5  1  2  3 16 25
Alternatively, you can do replace(x, x > 3, x[x > 3]^2), but remember that this doesn't actually modify your "x" object, so the result needs to be reassigned.
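That is, a one-line sketch of the reassignment:
x <- replace(x, x > 3, x[x > 3]^2)  # replace() returns a modified copy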
Or,
> x[x>3] <- (x[x>3])^2
> x
  X1 X2 X3 X4 X5
1  1  2  3 16 25
2  1  2  3 16 25
3  1  2  3 16 25
4  1  2  3 16 25
5  1  2  3 16 25

Quickly remove zero variance variables from a data.frame

I have a large data.frame that was generated by a process outside my control, which may or may not contain variables with zero variance (i.e. all the observations are the same). I would like to build a predictive model based on this data, and obviously these variables are of no use.
Here's the function I'm currently using to remove such variables from the data.frame. It's currently based on apply, and I was wondering if there are any obvious ways to speed this function up, so that it works quickly on very large datasets, with a large number (400 or 500) of variables?
set.seed(1)
dat <- data.frame(
  A = factor(rep("X", 10), levels = c('X', 'Y')),
  B = round(runif(10) * 10),
  C = rep(10, 10),
  D = c(rep(10, 9), 1),
  E = factor(rep("A", 10)),
  F = factor(rep(c("I", "J"), 5)),
  G = c(rep(10, 9), NA)
)
zeroVar <- function(data, useNA = 'ifany') {
  out <- apply(data, 2, function(x) length(table(x, useNA = useNA)))
  which(out == 1)
}
And here's the result of the process:
> dat
   A B  C  D E F  G
1  X 3 10 10 A I 10
2  X 4 10 10 A J 10
3  X 6 10 10 A I 10
4  X 9 10 10 A J 10
5  X 2 10 10 A I 10
6  X 9 10 10 A J 10
7  X 9 10 10 A I 10
8  X 7 10 10 A J 10
9  X 6 10 10 A I 10
10 X 1 10  1 A J NA
> dat[,-zeroVar(dat)]
    B  D F  G
1   3 10 I 10
2   4 10 J 10
3   6 10 I 10
4   9 10 J 10
5   2 10 I 10
6   9 10 J 10
7   9 10 I 10
8   7 10 J 10
9   6 10 I 10
10  1  1 J NA
> dat[,-zeroVar(dat, useNA = 'no')]
    B  D F
1   3 10 I
2   4 10 J
3   6 10 I
4   9 10 J
5   2 10 I
6   9 10 J
7   9 10 I
8   7 10 J
9   6 10 I
10  1  1 J
You may also want to look into the nearZeroVar() function in the caret package.
If you have one event out of 1000, it might be a good idea to discard these data (but this depends on the model). nearZeroVar() can do that.
Don't use table() - it is very slow for such things. One option is length(unique(x)):
foo <- function(dat) {
  out <- lapply(dat, function(x) length(unique(x)))
  want <- which(!out > 1)
  unlist(want)
}
system.time(replicate(1000, zeroVar(dat)))
system.time(replicate(1000, foo(dat)))
This is an order of magnitude faster than yours on the example data set, whilst giving similar output:
> system.time(replicate(1000, zeroVar(dat)))
   user  system elapsed
  3.334   0.000   3.335
> system.time(replicate(1000, foo(dat)))
   user  system elapsed
  0.324   0.000   0.324
Simon's solution here is similarly quick on this example:
> system.time(replicate(1000, which(!unlist(lapply(dat,
+   function(x) 0 == var(if (is.factor(x)) as.integer(x) else x))))))
   user  system elapsed
  0.392   0.000   0.395
but you'll have to see if they scale similarly to real problem sizes.
Simply don't use table - it's extremely slow on numeric vectors since it converts them to strings. I would probably use something like
var0 <- unlist(lapply(df, function(x)
  0 == var(if (is.factor(x)) as.integer(x) else x)))
It will be TRUE for zero-variance columns, NA for columns containing NAs, and FALSE for columns with non-zero variance.
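One way to use that result, keeping the NA-flagged columns as a cautious default (this subsetting idiom is my own, not part of the answer):
# TRUE %in% TRUE is TRUE; FALSE and NA both fail the test, so those columns stay
df[, !(var0 %in% TRUE)]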
Use the caret package and the function nearZeroVar():
require(caret)
NZV <- nearZeroVar(dataset, saveMetrics = TRUE)
NZV[NZV[, "zeroVar"] > 0, ]
NZV[NZV[, "zeroVar"] + NZV[, "nzv"] > 0, ]
Well, save yourself some coding time:
Rgames: foo
      [,1]  [,2] [,3]
 [1,]    1 1e+00    1
 [2,]    1 2e+00    1
 [3,]    1 3e+00    1
 [4,]    1 4e+00    1
 [5,]    1 5e+00    1
 [6,]    1 6e+00    2
 [7,]    1 7e+00    3
 [8,]    1 8e+00    1
 [9,]    1 9e+00    1
[10,]    1 1e+01    1
Rgames: sd(foo)
[1] 0.000000e+00 3.027650e+00 6.749486e-01
Warning message:
sd(<matrix>) is deprecated.
Use apply(*, 2, sd) instead.
To avoid nasty floating-point roundoff, take that output vector, which I'll call bar, and do something like bar[bar < 2 * .Machine$double.eps] <- 0; then, finally, your data frame dat[, as.logical(bar)] should do the trick.
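A minimal sketch of that recipe, assuming an all-numeric, NA-free data set like the foo from the transcript above (sd() would fail on factor columns and propagate NAs otherwise):
bar <- apply(foo, 2, sd)                  # per-column standard deviations
bar[bar < 2 * .Machine$double.eps] <- 0   # squash floating-point dust to exact zero
foo[, as.logical(bar)]                    # drops the constant first column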
How about using factor to count the number of unique elements and looping with sapply:
dat[sapply(dat, function(x) length(levels(factor(x)))>1)]
    B  D F
1   3 10 I
2   4 10 J
3   6 10 I
4   9 10 J
5   2 10 I
6   9 10 J
7   9 10 I
8   7 10 J
9   6 10 I
10  1  1 J
NAs are excluded by default, but this can be changed with the exclude parameter of factor:
dat[sapply(dat, function(x) length(levels(factor(x,exclude=NULL)))>1)]
    B  D F  G
1   3 10 I 10
2   4 10 J 10
3   6 10 I 10
4   9 10 J 10
5   2 10 I 10
6   9 10 J 10
7   9 10 I 10
8   7 10 J 10
9   6 10 I 10
10  1  1 J NA
Because I'm an idiot who keeps googling the same question, let me leave a tidyverse approach that I've settled on:
library(tidyverse)
df <- df %>%
  select(
    -{
      df %>%
        map_dbl(~ length(table(.x, useNA = "ifany"))) %>%
        {which(. == 1)} %>%
        names()
    }
  )
I think this could be made shorter but I'm too tired!
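One possible shortening, assuming dplyr 1.0+ for where(); n_distinct() counts NA as a distinct value by default, which mirrors useNA = "ifany":
library(dplyr)

df <- df %>% select(where(~ n_distinct(.x) > 1))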
I think having zero variance is equivalent to being constant, and one can get by without doing any arithmetic operations at all. I would expect range() to outperform var(), but I have not verified this:
removeConstantColumns <- function(a_dataframe, verbose = FALSE) {
  notConstant <- function(x) {
    if (is.factor(x)) x <- as.integer(x)
    return(0 != diff(range(x, na.rm = TRUE)))
  }
  bkeep <- sapply(a_dataframe, notConstant)
  if (verbose) {
    cat('removeConstantColumns: ',
        ifelse(all(bkeep),
               'nothing',
               paste(names(a_dataframe)[!bkeep], collapse = ',')),
        ' removed', '\n')
  }
  return(a_dataframe[, bkeep])
}
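Applied to the dat from the question (note that with na.rm = TRUE, column G, which is constant apart from its NA, counts as constant too):
removeConstantColumns(dat, verbose = TRUE)
# removeConstantColumns:  A,C,E,G  removed
# returns dat with only B, D and F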
Check out this custom function. I did not try it on data frames with 100+ variables. Note that it needs dplyr, purrr and tidyr loaded:
library(dplyr)
library(purrr)
library(tidyr)

remove_low_variance_cols <- function(df, threshold = 0) {
  n <- Sys.time()  # see how long this takes to run
  remove_cols <- df %>%
    select_if(is.numeric) %>%
    map_dfr(var) %>%
    gather() %>%
    filter(value <= threshold) %>%
    spread(key, value) %>%
    names()
  if (length(remove_cols)) {
    print("Removing the following columns: ")
    print(remove_cols)
  } else {
    print("There are no low variance columns with this threshold")
  }
  # How long did this script take?
  print(paste("Time Consumed: ", Sys.time() - n, "Secs."))
  return(df[, setdiff(names(df), remove_cols)])
}
