Subset a data.table, altering the subset limit between factor levels

I am having a hard time trying to subset a data.table (from the data.table package) in R. Given the following example:
library(data.table)
x = c(rep("a", 6), rep("b", 5))
y = c(0,2,1,0,1,2, 0,1,0,2,1)
z = c(1:6,1:5) + rnorm(11, 0.02, 0.1)
DT = data.table(ind = x, cond = y, dist = z)
ind cond dist
[1,] a 0 1.078966
[2,] a 2 1.987159
[3,] a 1 3.143391
[4,] a 0 3.937058
[5,] a 1 5.037681
[6,] a 2 6.036432
[7,] b 0 1.057809
[8,] b 1 2.144755
[9,] b 0 3.010903
[10,] b 2 3.937765
[11,] b 1 4.976273
I want to subset everything after the first 1 in the cond column; in other words, everything with dist larger than 3.143391 for a and 2.144755 for b (in this example).
DT.sub <- DT[cond == 1, ]                   # Please combine this row
DT.sub[, .SD[dist == min(dist)], by = ind]  # with this one, to make the code shorter, if you can.
ind cond dist
[1,] a 1 3.143391
[2,] b 1 2.144755
The result should look like this:
ind cond dist
[1,] a 0 3.937058
[2,] a 1 5.037681
[3,] a 2 6.036432
[4,] b 0 3.010903
[5,] b 2 3.937765
[6,] b 1 4.976273

How about:
DT[, .SD[seq(match(1, cond) + 1, .N)], by = ind]
ind cond dist
[1,] a 0 3.937058
[2,] a 1 5.037681
[3,] a 2 6.036432
[4,] b 0 3.010903
[5,] b 2 3.937765
[6,] b 1 4.976273
Btw, it's good to set.seed(1) first so we can work with the same random data.
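For reference, an equivalent idiom (a sketch, not from the original answer) does the per-group selection with .I, and it also avoids the backwards seq() you would get if the first 1 happened to be the last row of a group:
# row indices of everything after the first 1, per group; assumes each group
# contains at least one 1, as in the example
idx <- DT[, .I[seq_len(.N) > match(1, cond)], by = ind]$V1
DT[idx]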

How to use which() on a matrix to get unique indices

Suppose I have a symmetric matrix:
> mat <- matrix(c(1,0,1,0,0,0,1,0,1,1,0,0,0,0,0,0), ncol=4, nrow=4)
> mat
[,1] [,2] [,3] [,4]
[1,] 1 0 1 0
[2,] 0 0 1 0
[3,] 1 1 0 0
[4,] 0 0 0 0
which I would like to analyse:
> which(mat==1, arr.ind=T)
row col
[1,] 1 1
[2,] 3 1
[3,] 3 2
[4,] 1 3
[5,] 2 3
Now the question is: how do I avoid considering duplicated cells? As the resulting index matrix shows, rows 2 and 4 point respectively to (3,1) and (1,3), which for a symmetric matrix is the same cell.
How do I avoid such a situation? I only need one reference for each cell, even though the matrix is symmetric. Is there an easy way to deal with such situations?
EDIT:
I was thinking about using upper.tri or lower.tri, but in this case what I get is a vector version of the matrix and I am not able to get back to the (row, col) notation.
> which(mat[upper.tri(mat)]==1, arr.ind=T)
[1] 2 3
EDIT II
The expected output would be something like a unique over the pairs (row, col) and (col, row):
row col
[1,] 1 1
[2,] 3 1
[3,] 3 2
Since you have a symmetric matrix, you could do
which(mat == 1 & upper.tri(mat, diag = TRUE), arr.ind = TRUE)
# row col
#[1,] 1 1
#[2,] 1 3
#[3,] 2 3
OR
which(mat == 1 & lower.tri(mat, diag = TRUE), arr.ind = TRUE)
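Regarding the EDIT: the (row, col) pairs can also be recovered from the upper.tri() extraction itself, because the logical mask and the [ extraction walk the matrix in the same column-major order. A small sketch of that idea:
ut <- which(upper.tri(mat, diag = TRUE), arr.ind = TRUE)   # (row, col) of all upper-triangle cells
ut[mat[upper.tri(mat, diag = TRUE)] == 1, , drop = FALSE]  # keep only the cells equal to 1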

Changes of sign of the rows of a matrix

I have a matrix with three columns (say), for example:
M0 <- rbind(
c(1, 2, 3),
c(4, 5, 6)
)
I want to generate all changes of sign of every row of the matrix. Here the desired output is:
[,1] [,2] [,3]
[1,] -1 -2 -3
[2,] 1 -2 -3
[3,] -1 2 -3
[4,] 1 2 -3
[5,] -1 -2 3
[6,] 1 -2 3
[7,] -1 2 3
[8,] 1 2 3
[9,] -4 -5 -6
[10,] 4 -5 -6
[11,] -4 5 -6
[12,] 4 5 -6
[13,] -4 -5 6
[14,] 4 -5 6
[15,] -4 5 6
[16,] 4 5 6
Here is my solution:
signs <- as.matrix(expand.grid(c(-1,1),c(-1,1),c(-1,1)))
M1 <- vapply(1:nrow(M0),
             function(i) t(signs %*% diag(M0[i, ])),
             array(0, dim = c(3, 8)))
t(array(M1, dim = c(3, 8 * dim(M1)[3])))
# [,1] [,2] [,3]
# [1,] -1 -2 -3
# [2,] 1 -2 -3
# [3,] -1 2 -3
# [4,] 1 2 -3
# [5,] -1 -2 3
# [6,] 1 -2 3
# [7,] -1 2 3
# [8,] 1 2 3
# [9,] -4 -5 -6
# [10,] 4 -5 -6
# [11,] -4 5 -6
# [12,] 4 5 -6
# [13,] -4 -5 6
# [14,] 4 -5 6
# [15,] -4 5 6
# [16,] 4 5 6
Do you have a more elegant solution?
Moreover, there's one caveat with this solution: if there are zeros in a row of the source matrix, it generates duplicates (because -0 = 0). I remove them with mgcv::uniquecombs. Do you have a solution that doesn't generate duplicates when there are zeros, without resorting to a "unique" function?
EDIT: Benchmark of solutions
Let's compare the performance of three given solutions.
# @Aurèle
changesOfSign1 <- function(M){
  signs <- as.matrix(expand.grid(rep(list(c(1, -1)), ncol(M))))
  out <- matrix(c(apply(M, 1, `*`, c(t(signs)))), ncol = ncol(M), byrow = TRUE)
  out[!duplicated(out), ]
}
# @989
changesOfSign2 <- function(M){
  signs <- as.matrix(expand.grid(rep(list(c(1, -1)), ncol(M))))
  # signs for each row in the resultant matrix
  m1 <- signs[rep(1:nrow(signs), times = nrow(M)), ]
  # values for each row in the resultant matrix
  m2 <- M[rep(1:nrow(M), each = nrow(signs)), ]
  res <- m1 * m2
  res[!duplicated(res), ]
}
# @DS_UNI
changesOfSign3 <- function(M){
  as.matrix(do.call(rbind, apply(M, 1, function(row){
    expand.grid(lapply(row, function(x) if(x == 0) 0 else c(-x, x)))
  })))
}
# benchmark ####
library(microbenchmark)
benchmark <- function(nrows, ncols){
  M0 <- matrix(rpois(nrows*ncols, 3), nrow = nrows, ncol = ncols)
  microbenchmark(
    changesOfSign1 = changesOfSign1(M0),
    changesOfSign2 = changesOfSign2(M0),
    changesOfSign3 = changesOfSign3(M0),
    times = 1000
  )
}
benchmark(nrows = 20, ncols = 3)
# Unit: microseconds
# expr min lq mean median uq max neval cld
# changesOfSign1 493.990 542.4075 639.2895 577.884 642.589 7912.316 1000 a
# changesOfSign2 475.248 522.7730 618.2550 554.232 608.005 7346.927 1000 a
# changesOfSign3 3506.123 3757.8030 4380.9164 3928.491 4464.204 22603.045 1000 b
benchmark(nrows = 20, ncols = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# changesOfSign1 30.09545 35.95840 46.39465 41.37086 49.56855 344.2176 1000 b
# changesOfSign2 41.20642 47.99532 58.59760 52.83705 60.85200 349.4958 1000 c
# changesOfSign3 13.56397 15.21439 21.34205 18.21113 22.34445 319.3990 1000 a
@Aurèle and @989 win when there are 3 columns; @DS_UNI wins when there are 10 columns.
We can improve @DS_UNI's solution with data.table:
# @DS_UNI with data.table
library(data.table)
changesOfSign4 <- function(M){
  as.matrix(rbindlist(apply(M, 1, function(row){
    do.call(function(...) CJ(..., sorted = FALSE),
            lapply(row, function(x) if(x == 0) 0 else c(-x, x)))
  })))
}
No claims about elegance :)
matrix(c(apply(M0, 1, `*`, c(t(signs)))), ncol = ncol(M0), byrow = TRUE)
I would suspect this to be fast (no loop):
signs <- as.matrix(expand.grid(c(-1,1),c(-1,1),c(-1,1)))
# signs for each row in the resultant matrix
m1 <- signs[ rep( 1:nrow(signs), times = nrow(M0) ), ]
# values for each row in the resultant matrix
m2 <- M0[ rep( 1:nrow(M0), each = nrow(signs) ), ]
res <- m1*m2
# Var1 Var2 Var3
# [1,] -1 -2 -3
# [2,] 1 -2 -3
# [3,] -1 2 -3
# [4,] 1 2 -3
# [5,] -1 -2 3
# [6,] 1 -2 3
# [7,] -1 2 3
# [8,] 1 2 3
# [9,] -4 -5 -6
# [10,] 4 -5 -6
# [11,] -4 5 -6
# [12,] 4 5 -6
# [13,] -4 -5 6
# [14,] 4 -5 6
# [15,] -4 5 6
# [16,] 4 5 6
To deal with duplicated rows caused by zeros:
res[ !duplicated(res), ]
How about this:
M0 <- rbind(
c(1, 2, 3),
c(4, 5, 6)
)
signs <- expand.grid(rep(list(c(1, -1)), ncol(M0)))
do.call(rbind, apply(M0, FUN = `*`, signs, MARGIN = 1))
EDIT:
OK, I gave up on elegance, and I do prefer the one-liner solution by @Aurèle; however, I'm editing the answer to at least get the desired output, and on the plus side it works with zeros :P
my_fun <- function(row){
  expand.grid(
    lapply(row, function(x) {
      if(x != -x)
        return(c(x, -x))
      else
        return(x)
    })
  )
}
do.call(rbind, apply(M0, FUN = my_fun, MARGIN = 1))
Hi, is this elegant enough? :)
x <- 1:3
signs <- as.matrix(expand.grid(x = c(-1,1), y = c(-1, 1), z = c(-1, 1)))
t(x * t(signs))
x y z
[1,] -1 -2 -3
[2,] 1 -2 -3
[3,] -1 2 -3
[4,] 1 2 -3
[5,] -1 -2 3
[6,] 1 -2 3
[7,] -1 2 3
[8,] 1 2 3
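The same trick extends to every row of M0, as a sketch (it assumes M0 and signs as defined above; rows containing zeros would still need a duplicated() pass):
# stack the sign-flipped copies of each row of M0
do.call(rbind, lapply(seq_len(nrow(M0)), function(i) t(M0[i, ] * t(signs))))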

Extract every two elements of a matrix row in sequence to calculate Euclidean distance

How can I extract every two elements in sequence from a matrix and return the result as a matrix, so that I can feed the answer into a formula for calculation?
For example, I have a one row matrix with 6 columns:
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 2 1 5 5 10 1
I want to extract columns 1 and 2 in the first iteration, 3 and 4 in the second iteration, and so on. The result has to be in the form of a matrix:
[1,] 2 1
[2,] 5 5
[3,] 10 1
My original code:
data <- matrix(c(1,1,1,2,2,1,2,2,5,5,5,6,10,1,10,2,11,1,11,2), ncol = 2)
Center Matrix:
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 2 1 5 5 10 1
[2,] 1 1 2 1 10 1
[3,] 5 5 5 6 11 2
[4,] 2 2 5 5 10 1
[5,] 2 1 5 6 5 5
[6,] 2 2 5 5 11 1
[7,] 2 1 5 5 10 1
[8,] 1 1 5 6 11 1
[9,] 2 1 5 5 10 1
[10,] 5 6 11 1 10 2
objCentroidDist <- function(data, centers) {
  resultMatrix <- matrix(NA, nrow = dim(data)[1], ncol = dim(centers)[1])
  for(i in 1:nrow(centers)) {
    resultMatrix[, i] <- sqrt(rowSums(t(t(data) - centers[i, ])^2))
  }
  resultMatrix
}
objCentroidDist(data, centers)
I want the result matrix to be as per below (one row per data row, one column per pair of center columns, to be filled with the distances):
      [,1] [,2] [,3]
 [1,]
 [2,]
 [3,]
 [4,]
 [5,]
 [6,]
 [7,]
 [8,]
 [9,]
[10,]
My concern is how to calculate the data-to-centers distance when the data matrix has two columns and the centers matrix has six, i.e. the distance from each data row to every pair of columns in the centers matrix. Each row of the centers matrix holds three centers.
Something like this maybe?
m <- matrix(c(2,1,5,5,10,1), ncol = 6)
list.seq.pairs <- lapply(seq(1, ncol(m), 2), function(x) {
  m[, c(x, x + 1)]
})
> list.seq.pairs
[[1]]
[1] 2 1
[[2]]
[1] 5 5
[[3]]
[1] 10 1
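Since the question asks for the result as a matrix rather than a list, the pairs can be bound back together, e.g. (a small sketch based on the list.seq.pairs object above):
do.call(rbind, list.seq.pairs)
#      [,1] [,2]
# [1,]    2    1
# [2,]    5    5
# [3,]   10    1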
And, in case you're wanting to iterate over multiple rows in a matrix,
you can expand on the above like this:
mm <- matrix(1:18, ncol = 6, byrow = TRUE)
apply(mm, 1, function(x) {
  lapply(seq(1, length(x), 2), function(y) {
    x[c(y, y + 1)]
  })
})
EDIT:
I'm really not sure what you're after exactly. I think, if you want each row transformed into a 3 x 2 matrix:
mm <- matrix(1:18, ncol = 6, byrow = TRUE)
list.mats <- lapply(1:nrow(mm), function(x){
  matrix(mm[x, ], ncol = 2, byrow = TRUE)
})
> list.mats
[[1]]
[,1] [,2]
[1,] 1 2
[2,] 3 4
[3,] 5 6
[[2]]
[,1] [,2]
[1,] 7 8
[2,] 9 10
[3,] 11 12
[[3]]
[,1] [,2]
[1,] 13 14
[2,] 15 16
[3,] 17 18
If, however, you want to get to your results matrix, I think it's probably easiest to do whatever calculations you need while you're dealing with each row:
results <- t(apply(mm, 1, function(x) {
  sapply(seq(1, length(x), 2), function(y) {
    val1 <- x[y]      # get item one
    val2 <- x[y + 1]  # get item two
    val1 / val2       # do your calculation here
  })
}))
> results
[,1] [,2] [,3]
[1,] 0.5000000 0.7500 0.8333333
[2,] 0.8750000 0.9000 0.9166667
[3,] 0.9285714 0.9375 0.9444444
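If the per-pair calculation is meant to be a Euclidean distance to some reference point, the placeholder can be swapped within the same framework; a sketch, where p is a hypothetical length-2 reference point and not something from the question's data:
p <- c(0, 0)  # hypothetical reference point (assumption, not from the question)
dist.results <- t(apply(mm, 1, function(x) {
  sapply(seq(1, length(x), 2), function(y) {
    sqrt((x[y] - p[1])^2 + (x[y + 1] - p[2])^2)  # Euclidean distance from p to the pair
  })
}))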
That said, I don't understand what you're trying to do so this may miss the mark. You may have more luck if you ask a new question where you show example input and the actual expected output that you're after, with the actual values you expect.

Nested for-loop skips loops

This is my problem:
I have a grid, and I need to get and store in a list the coordinates of each vertex of each block (cell). The order of blocks that I need is '1-1', ..., '4-1', '1-2', ..., '4-2'. To keep it simple, I'm just working with the indexes for now.
Based on two vectors with the common East and North coordinates, I've written a little function which partially produces the output I need: it skips the cells '1-2' and '2-2' (see output below). I can't see where exactly the error is, but I suspect the issue is in my nested for loop. (There are many questions on for loops, but none helped me with my problem.)
Any help will be appreciated, and apologies if this is too basic to be asked here.
Vectors:
x.breaks <- c(191789.1, 291789.1, 391789.1)
y.breaks <- c(5172287, 5272287, 5372287, 5472287, 5572287)
Function:
getting_vertices <- function(x.breaks, y.breaks){
  xs <- list()
  ys <- list()
  polys <- list()
  for(i in 1:(length(x.breaks) - 1)){
    xs[[i]] <- c(i, i + 1, i + 1, i, i)
  }
  for(j in 1:(length(y.breaks) - 1)){
    ys[[j]] <- c(j, j, j + 1, j + 1, j)
  }
  for(v in 1:length(sapply(ys, length))){
    for(k in 1:length(sapply(xs, length))){
      polys[[v*k]] <- cbind(xs[[k]], ys[[v]])
    }
  }
  return(polys)
}
getting_vertices(x.breaks, y.breaks)
Output (this is partially correct):
[[1]]
[,1] [,2]
[1,] 1 1
[2,] 2 1
[3,] 2 2
[4,] 1 2
[5,] 1 1
[[2]]
[,1] [,2]
[1,] 1 2
[2,] 2 2
[3,] 2 3
[4,] 1 3
[5,] 1 2
[[3]]
[,1] [,2]
[1,] 1 3
[2,] 2 3
[3,] 2 4
[4,] 1 4
[5,] 1 3
[[4]]
[,1] [,2]
[1,] 1 4
[2,] 2 4
[3,] 2 5
[4,] 1 5
[5,] 1 4
[[5]]
NULL
[[6]]
[,1] [,2]
[1,] 2 3
[2,] 3 3
[3,] 3 4
[4,] 2 4
[5,] 2 3
[[7]]
NULL
[[8]]
[,1] [,2]
[1,] 2 4
[2,] 3 4
[3,] 3 5
[4,] 2 5
[5,] 2 4
The logic behind the line polys[[v*k]] <- ... is incorrect: for example, v=2, k=1 will overwrite v=1, k=2, and there are no combinations of v and k whose product is 5 or 7, hence those entries are empty.
I expect that you meant to write something like:
polys[[v+(k-1)*(length(ys))]] <- ...
or
polys[[k+(v-1)*(length(xs))]] <- ...
depending on the order that you want your results in.
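As a sketch (not part of the original answer), the whole corrected function with the first indexing scheme would look like this; getting_vertices2 is just a hypothetical name, and only the polys index changes:
getting_vertices2 <- function(x.breaks, y.breaks){
  xs <- list(); ys <- list(); polys <- list()
  for(i in 1:(length(x.breaks) - 1)) xs[[i]] <- c(i, i + 1, i + 1, i, i)
  for(j in 1:(length(y.breaks) - 1)) ys[[j]] <- c(j, j, j + 1, j + 1, j)
  for(v in 1:length(ys)){
    for(k in 1:length(xs)){
      # every (v, k) pair now gets its own slot, so nothing is overwritten
      # and no NULL entries remain; v varies fastest in the final list
      polys[[v + (k - 1) * length(ys)]] <- cbind(xs[[k]], ys[[v]])
    }
  }
  polys
}
getting_vertices2(x.breaks, y.breaks)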

Optimization: conditional test on several matrices & extraction in R

I have a matrix:
mat_a <- matrix(data = c(c(rep(1,3), rep(2,3), rep(3,3)),
                         rep(seq(1,300,100), 3),
                         runif(9, 0, 1)),
                ncol = 3)
[,1] [,2] [,3]
[1,] 1 1 0.8393401
[2,] 1 101 0.5486805
[3,] 1 201 0.4449259
[4,] 2 1 0.3949137
[5,] 2 101 0.4002575
[6,] 2 201 0.3288861
[7,] 3 1 0.7865035
[8,] 3 101 0.2581155
[9,] 3 201 0.8987769
that I compare to another matrix with larger dimensions:
mat_b <- matrix(data = c(c(rep(1,3), rep(2,3), rep(3,3), rep(4,3)),
                         rep(seq(1,300,100), 4),
                         rep(3:5, 4)),
                ncol = 3)
[,1] [,2] [,3]
[1,] 1 1 3
[2,] 1 101 4
[3,] 1 201 5
[4,] 2 1 3
[5,] 2 101 4
[6,] 2 201 5
[7,] 3 1 3
[8,] 3 101 4
[9,] 3 201 5
[10,] 4 1 3
[11,] 4 101 4
[12,] 4 201 5
I need to extract the rows of mat_a where column 2 of both matrices matches. For those matches, column 1 of both matrices also has to match. Also, column 3 of mat_b must be greater than or equal to 4.
I cannot find a solution based on vectorization; I have only come up with a loop-based solution:
output <- NULL
for (i in 1:nrow(mat_a)) {
  if (mat_a[i, 2] %in% mat_b[, 2][mat_b[, 3] >= 4]) {
    rows <- which(mat_b[, 2] %in% mat_a[i, 2])
    row  <- which(mat_b[, 1][rows] == mat_a[i, 1])
    if (mat_b[, 3][rows[row]] >= 4) {
      output <- rbind(output, mat_a[i, ])
    }
  }
}
This works but is extremely slow: it took just under an hour to run. mat_a has 9 columns and 40,000 rows (it could go higher); mat_b has 5 columns and around 1.2 million rows.
Any ideas?
It is better to work with data frames when comparing tables as you are; that uses R's structures to their strengths instead of working against them. We use a simple merge to match the correct values, then subset b with the necessary condition, b$V3 >= 4. At the end, [-4] drops the extra column so the output more closely matches your desired output:
a <- as.data.frame(mat_a)
b <- as.data.frame(mat_b)
merge(a,b[b$V3 >= 4,], by=c("V1","V2"))[-4]
# V1 V2 V3.x
# 1 1 101 0.1118960
# 2 1 201 0.1543351
# 3 2 101 0.3950491
# 4 2 201 0.5688684
# 5 3 201 0.4749941
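Given the sizes mentioned (40,000 x 9 against roughly 1.2 million x 5), a data.table join is another option worth trying; a sketch under the same three-column layout as the example:
library(data.table)
A <- as.data.table(mat_a)   # columns V1, V2, V3
B <- as.data.table(mat_b)
# keep the rows of A whose (V1, V2) pair appears in B with B's V3 >= 4
A[B[V3 >= 4], on = .(V1, V2), nomatch = 0][, .(V1, V2, V3)]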
