x <- list(c(1,2), c(1,4), c(1,1))
I want to arrange the vectors in the list according to the sum of squares of the elements of each vector.
The sums of squares of the three vectors are:
1^2 + 2^2 = 5,
1^2 + 4^2 = 17,
1^2 + 1^2 = 2.
Since 2 < 5 < 17, the desired output is:
vectors squaresum
c(1,1) 2
c(1,2) 5
c(1,4) 17
I was thinking of building a function for the sum of squares and then using that function to sort the vectors, but I could not do it properly. Any help will be appreciated.
You can iterate over your list to calculate the sum of squares of each vector and use order() to get the indices of the values in ascending order. You can then use those to sort your initial list x:
x[order(sapply(x, function(v) sum(v ** 2)))]
the result is:
[[1]]
[1] 1 1
[[2]]
[1] 1 2
[[3]]
[1] 1 4
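If you also want the two-column summary shown in the question, here is a small sketch building it from the same square sums:
sq <- sapply(x, function(v) sum(v^2))
ord <- order(sq)
data.frame(vectors = sapply(x[ord], paste, collapse = ","), squaresum = sq[ord])
#   vectors squaresum
# 1     1,1         2
# 2     1,2         5
# 3     1,4        17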
Here is another approach, which can be used if the list vectors are all of the same length:
x[order(rowSums(do.call(rbind, x)^2))]
[[1]]
[1] 1 1
[[2]]
[1] 1 2
[[3]]
[1] 1 4
However, it does not appear to provide any speed benefit on bigger lists compared to @clemens' approach (I really thought it would):
x <- replicate(10000, sample(1:1000, 100, replace = TRUE), simplify = FALSE)
library(microbenchmark)
microbenchmark(clemens = x[order(sapply(x, function(v) sum(v ** 2)))],
missuse = x[order(rowSums(do.call(rbind, x) ^ 2))])
#output
Unit: milliseconds
expr min lq mean median uq max neval cld
clemens 32.03712 34.65821 59.16911 43.51531 57.19269 822.7295 100 a
missuse 32.84621 35.33422 47.53151 42.69733 56.09183 107.2334 100 a
I have two data frames, df1 with reference data and df2 with new data. For each row in df2, I need to find the best (and the second best) matching row in df1 in terms of hamming distance.
I used the e1071 package to compute the hamming distance. The hamming distance between two vectors x and y can be computed, for example, as:
x <- c(356739, 324074, 904133, 1025460, 433677, 110525, 576942, 526518, 299386,
92497, 977385, 27563, 429551, 307757, 267970, 181157, 3796, 679012, 711274,
24197, 610187, 402471, 157122, 866381, 582868, 878)
y <- c(356739, 324042, 904133, 959893, 433677, 110269, 576942, 2230, 267130,
92496, 960747, 28587, 429551, 438825, 267970, 181157, 36564, 677220,
711274, 24485, 610187, 404519, 157122, 866413, 718036, 876)
xm <- sapply(x, intToBits)
ym <- sapply(y, intToBits)
distance <- sum(sapply(1:ncol(xm), function(i) hamming.distance(xm[,i], ym[,i])))
and the resulting distance is 25. However, I need to do this for all rows of df1 and df2. A trivial method would be a doubly nested loop, which looks terribly slow.
Any ideas how to do this more efficiently? In the end I need to append to df2:
a column with the row id from df1 that gives the lowest distance;
a column with the lowest distance;
a column with the row id from df1 that gives the 2nd lowest distance;
a column with the second lowest distance.
Thanks.
Fast computation of hamming distance between two integer vectors of equal length
As I said in my comment, we can do:
hmd0 <- function(x,y) sum(as.logical(xor(intToBits(x),intToBits(y))))
to compute the hamming distance between two integer vectors of equal length, x and y. This only uses base R, yet is more efficient than e1071::hamming.distance, because it is vectorized!
For the example x and y in your post, this gives 25. (My other answer will show what we should do, if we want pairwise hamming distance.)
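A quick check with the x and y from your post:
hmd0(x, y)
# [1] 25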
Fast hamming distance between a matrix and a vector
If we want to compute the hamming distance between a single y and multiple x vectors, i.e., the hamming distance between a vector and a matrix, we can use the following function.
hmd <- function(x, y) {
    rawx <- intToBits(x)
    rawy <- intToBits(y)
    nx <- length(rawx)
    ny <- length(rawy)
    if (nx == ny) {
        ## quick return
        return (sum(as.logical(xor(rawx, rawy))))
    } else if (nx < ny) {
        ## pivoting
        tmp <- rawx; rawx <- rawy; rawy <- tmp
        tmp <- nx; nx <- ny; ny <- tmp
    }
    if (nx %% ny) stop("unconformable length!") else {
        nc <- nx / ny  ## number of cycles
        return(unname(tapply(as.logical(xor(rawx, rawy)), rep(1:nc, each = ny), sum)))
    }
}
Note that:
hmd performs the computation column-wise. It is designed to be CPU cache friendly. For this reason, if we want to do some row-wise computation, we should transpose the matrix first;
there is no obvious loop here; instead, we use tapply().
Fast hamming distance computation between two matrices/data frames
This is what you want. The following function foo takes two data frames or matrices, df1 and df2, and computes the distance between df1 and each row of df2. The argument p is an integer specifying how many results you want to retain; p = 3 will keep the smallest 3 distances together with their row ids in df1.
foo <- function(df1, df2, p) {
    ## check p
    if (p > nrow(df2)) p <- nrow(df2)
    ## transpose for CPU cache friendly code
    xt <- t(as.matrix(df1))
    yt <- t(as.matrix(df2))
    ## after transpose, we compute hamming distance column by column
    ## a for loop is decent; no performance gain from apply family
    n <- ncol(yt)
    id <- integer(n * p)
    d <- numeric(n * p)
    k <- 1:p
    for (i in 1:n) {
        distance <- hmd(xt, yt[, i])
        minp <- order(distance)[1:p]
        id[k] <- minp
        d[k] <- distance[minp]
        k <- k + p
    }
    ## recode "id" and "d" into data frames and return
    id <- as.data.frame(matrix(id, ncol = p, byrow = TRUE))
    colnames(id) <- paste0("min.", 1:p)
    d <- as.data.frame(matrix(d, ncol = p, byrow = TRUE))
    colnames(d) <- paste0("mindist.", 1:p)
    list(id = id, d = d)
}
Note that:
transposition is done at the beginning, for the reasons given above;
a for loop is used here. But this is actually efficient because there is considerable computation done in each iteration. It is also more elegant than using the *apply family, since we ask for multiple outputs (the row ids id and the distances d).
Experiment
This part uses a small dataset to test/demonstrate our functions.
Some toy data:
set.seed(0)
df1 <- as.data.frame(matrix(sample(1:10), ncol = 2)) ## 5 rows 2 cols
df2 <- as.data.frame(matrix(sample(1:6), ncol = 2)) ## 3 rows 2 cols
Test hmd first (needs transposition):
hmd(t(as.matrix(df1)), df2[1, ]) ## df1 & first row of df2
# [1] 2 4 6 2 4
Test foo:
foo(df1, df2, p = 2)
# $id
# min.1 min.2
# 1 1 4
# 2 2 3
# 3 5 2
# $d
# mindist.1 mindist.2
# 1 2 2
# 2 1 3
# 3 1 3
If you want to append some columns to df2, you know what to do, right?
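For completeness, here is a hedged sketch of that last step (the new column names min.id.1 etc. are my own choice, not something fixed by foo()):
res <- foo(df1, df2, p = 2)
df2$min.id.1   <- res$id$min.1      # row id in df1 giving the lowest distance
df2$min.dist.1 <- res$d$mindist.1   # the lowest distance
df2$min.id.2   <- res$id$min.2      # row id in df1 giving the 2nd lowest distance
df2$min.dist.2 <- res$d$mindist.2   # the 2nd lowest distance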
Please don't be surprised that I take another section. This part gives something relevant. It is not what the OP asked for, but it may help other readers.
General hamming distance computation
In the previous answer, I start from a function hmd0 that computes hamming distance between two integer vectors of the same length. This means if we have 2 integer vectors:
set.seed(0)
x <- sample(1:100, 6)
y <- sample(1:100, 6)
we will end up with a scalar:
hmd0(x,y)
# 13
What if we want to compute pairwise hamming distance of two vectors?
In fact, a simple modification to our function hmd will do:
hamming.distance <- function(x, y, pairwise = TRUE) {
    nx <- length(x)
    ny <- length(y)
    rawx <- intToBits(x)
    rawy <- intToBits(y)
    if (nx == 1 && ny == 1) return(sum(as.logical(xor(intToBits(x), intToBits(y)))))
    if (nx < ny) {
        ## pivoting
        tmp <- rawx; rawx <- rawy; rawy <- tmp
        tmp <- nx; nx <- ny; ny <- tmp
    }
    if (nx %% ny) stop("unconformable length!") else {
        bits <- length(intToBits(0))  ## 32-bit or 64 bit?
        result <- unname(tapply(as.logical(xor(rawx, rawy)), rep(1:ny, each = bits), sum))
    }
    if (pairwise) result else sum(result)
}
Now
hamming.distance(x, y, pairwise = TRUE)
# [1] 0 3 3 2 5 0
hamming.distance(x, y, pairwise = FALSE)
# [1] 13
Hamming distance matrix
If we want to compute the hamming distance matrix, for example,
set.seed(1)
x <- sample(1:100, 5)
y <- sample(1:100, 7)
The distance matrix between x and y is:
outer(x, y, hamming.distance) ## pairwise argument has no effect here
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] 2 3 4 3 4 4 2
# [2,] 7 6 3 4 3 3 3
# [3,] 4 5 4 3 6 4 2
# [4,] 2 3 2 5 6 4 2
# [5,] 4 3 4 3 2 0 2
We can also do:
outer(x, x, hamming.distance)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0 5 2 2 4
# [2,] 5 0 3 5 3
# [3,] 2 3 0 2 4
# [4,] 2 5 2 0 4
# [5,] 4 3 4 4 0
In the latter situation, we end up with a symmetric matrix with 0 on the diagonal. Using outer is inefficient here, but it is still more efficient than writing R loops. Since our hamming.distance is written in R code, I would stay with using outer. In my answer to this question, I demonstrate the idea of using compiled code. This of course requires writing a C version of hamming.distance, but I will not show it here.
Here's an alternative solution that uses only base R, and should be very fast, especially when your df1 and df2 have many rows. The main reason for this is that it does not use any R-level looping for calculating the Hamming distances, such as for-loops, while-loops, or *apply functions. Instead, it uses matrix multiplication for computing the Hamming distance. In R, this is much faster than any approach using R-level looping. Also note that using an *apply function will not necessarily make your code any faster than using a for loop. Two other efficiency-related features of this approach are: (1) It uses partial sorting for finding the best two matches for each row in df2, and (2) It stores the entire bitwise representation of df1 in one matrix (same for df2), and does so in one single step, without using any R-level loops.
The function that does all the work:
# INPUT:
# X corresponds to your entire df1, but is a matrix
# Y corresponds to your entire df2, but is a matrix
# OUTPUT:
# Matrix with four columns corresponding to the values
# that you specified in your question
fun <- function(X, Y) {
    # Store the dimensions before X and Y are overwritten below
    nrX <- nrow(X); ncX <- ncol(X)
    nrY <- nrow(Y); ncY <- ncol(Y)
    # Convert integers to bits (32 bits per integer)
    X <- intToBits(t(X))
    # Reshape into a matrix: one column per row of the original X
    dim(X) <- c(ncX * 32, nrX)
    # Convert integers to bits
    Y <- intToBits(t(Y))
    # Reshape into a matrix: one column per row of the original Y
    dim(Y) <- c(ncY * 32, nrY)
    # Calculate pairwise hamming distances using matrix
    # multiplication.
    # Columns of H index into Y; rows index into X.
    # The code for the hamming() function was retrieved
    # from this page:
    # https://johanndejong.wordpress.com/2015/10/02/faster-hamming-distance-in-r-2/
    H <- hamming(X, Y)
    # Now, for each row in Y, find the two best matches
    # in X. In other words: for each column in H, find
    # the two smallest values and their row indices.
    t(apply(H, 2, function(h) {
        mindists <- sort(h, partial = 1:2)
        c(
            ind1 = which(h == mindists[1])[1],
            val1 = mindists[1],
            ind2 = which(h == mindists[2])[1],
            val2 = mindists[2]
        )
    }))
}
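The hamming() helper itself is not reproduced in this answer; it comes from the linked blog post. As a rough sketch only (my own reconstruction, not necessarily the blog's exact code), a matrix-multiplication version for two bit matrices whose columns are observations could look like this:
hamming <- function(X, Y) {
    # Coerce the raw bit matrices to 0/1 numeric matrices
    Xn <- matrix(as.numeric(as.logical(X)), nrow = nrow(X))
    Yn <- matrix(as.numeric(as.logical(Y)), nrow = nrow(Y))
    # A bit position differs when exactly one of the two bits is 1;
    # the two matrix products count both cases at once
    t(Xn) %*% (1 - Yn) + t(1 - Xn) %*% Yn
}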
To call the function on some random data:
# Generate some random test data with no. of columns
# corresponding to your data
nrows <- 1000
ncols <- 26
# X corresponds to your df1
X <- matrix(
sample(1e6, nrows * ncols, replace = TRUE),
nrow = nrows,
ncol = ncols
)
# Y corresponds to your df2
Y <- matrix(
sample(1e6, nrows * ncols, replace = TRUE),
nrow = nrows,
ncol = ncols
)
res <- fun(X, Y)
The above example with 1000 rows in both X (df1) and Y (df2) took about 1.1 - 1.2 seconds to run on my laptop.
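For reference, that timing can be reproduced with a simple call (the numbers will of course differ on other machines and data sizes):
system.time(res <- fun(X, Y))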
I have a matrix with exactly 2 rows and n columns, for example:
c(0,0,0,0,1,0,2,0,1,0,1,1,1,0,2)->a1
c(0,2,0,0,0,0,2,1,1,0,0,0,0,2,0)->a2
rbind(a1,a2)->matr
For a specific column (in this example column 9, which has a 1 in both rows) I need to find, to the left and to the right, the first column containing 2/0 or 0/2 (in this example, column 2 to the left and column 14 to the right).
The elements of every row can only be 0, 1, or 2, nothing else. Is there a way to do that operation on large matrices (with 2 rows) quickly? I need to do it 600k times, so speed might be a consideration.
library(compiler)
myfun <- cmpfun(function(m, cl) {
    li <- ri <- cl
    nc <- ncol(m)
    ## walk left from the target column until a 2/0 or 0/2 column
    ## (or the left edge) is reached
    repeat {
        li <- li - 1
        if (li == 0 || ((m[1, li] != 1) && (m[1, li] + m[2, li] == 2))) {
            l <- li
            break
        }
    }
    ## walk right in the same way
    repeat {
        ri <- ri + 1
        if (ri == nc || ((m[1, ri] != 1) && (m[1, ri] + m[2, ri] == 2))) {
            r <- ri
            break
        }
    }
    c(l, r)
})
and, after taking into account @Martin Morgan's observations,
set.seed(1)
N <- 1000000
test <- rbind(sample(0:2, N, replace = TRUE),
sample(0:2, N, replace = TRUE))
library(microbenchmark)
microbenchmark(myfun(test, N / 2), fun(test, N / 2), foo(test, N / 2),
AWebb(test, N / 2), RHertel(test, N / 2))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# myfun(test, N/2) 4.658 20.033 2.237153e+01 22.536 26.022 85.567 100 a
# fun(test, N/2) 36685.750 47842.185 9.762663e+04 65571.546 120321.921 365958.316 100 b
# foo(test, N/2) 2622845.039 3009735.216 3.244457e+06 3185893.218 3369894.754 5170015.109 100 d
# AWebb(test, N/2) 121504.084 142926.590 1.990204e+05 193864.670 209918.770 489765.471 100 c
# RHertel(test, N/2) 65998.733 76805.465 1.187384e+05 86089.980 144793.416 385880.056 100 b
set.seed(123)
test <- rbind(sample(0:2, N, replace = TRUE, prob = c(5, 90, 5)),
sample(0:2, N, replace = TRUE, prob = c(5, 90, 5)))
microbenchmark(myfun(test, N / 2), fun(test, N / 2), foo(test, N / 2),
AWebb(test, N / 2), RHertel(test, N / 2))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# myfun(test, N/2) 81.805 103.732 121.9619 106.459 122.36 307.736 100 a
# fun(test, N/2) 26362.845 34553.968 83582.9801 42325.755 106303.84 403212.369 100 b
# foo(test, N/2) 2598806.742 2952221.561 3244907.3385 3188498.072 3505774.31 4382981.304 100 d
# AWebb(test, N/2) 109446.866 125243.095 199204.1013 176207.024 242577.02 653299.857 100 c
# RHertel(test, N/2) 56045.309 67566.762 125066.9207 79042.886 143996.71 632227.710 100 b
I was slower than @Laterow, but anyhow, this is a similar approach:
foo <- function(mtr, targetcol) {
    matr1 <- colSums(mtr)
    matr2 <- apply(mtr, 2, function(x) x[1] * x[2])
    cols <- which(matr1 == 2 & matr2 == 0) - targetcol
    left <- cols[cols < 0]
    right <- cols[cols > 0]
    c(ifelse(length(left) == 0, NA, targetcol + max(left)),
      ifelse(length(right) == 0, NA, targetcol + min(right)))
}
foo(matr,9) #2 14
Combine the information by squaring the rows and adding them; the columns you are looking for are exactly those where the result is 4. Then simply find the closest such column below column 9 (rev(which())[1]) and the closest one above column 9 (which()[1]).
fun <- function(matr, col) {
    valid <- which((matr[1, ]^2 + matr[2, ]^2) == 4)
    if (length(valid) == 0) return(c(NA, NA))
    left <- valid[rev(which(valid < col))[1]]
    right <- valid[which(valid > col)[1]]
    c(left, right)
}
fun(matr,9)
# [1] 2 14
fun(matr,1)
# [1] NA 2
fun(matrix(0,nrow=2,ncol=100),9)
# [1] NA NA
Benchmark
set.seed(1)
test <- rbind(sample(0:2,1000000,replace=T),
sample(0:2,1000000,replace=T))
microbenchmark::microbenchmark(fun(test,9))
# Unit: milliseconds
# expr min lq mean median uq max neval
# fun(test, 9) 22.7297 27.21038 30.91314 27.55106 28.08437 51.92393 100
Edit: Thanks to @MatthewLundberg for pointing out a lot of mistakes.
If you are doing this many times, precompute all the locations
loc <- which((a1==2 & a2==0) | (a1==0 & a2==2))
You can then find the first to the left and right with findInterval
i <- findInterval(9, loc); loc[c(i, i + 1)]
# [1] 2 14
Note that findInterval is vectorized should you care to specify multiple target columns.
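For example, a hedged sketch of that vectorized use with a few hypothetical target columns (assuming each target has a matching column on both sides):
targets <- c(5, 9, 12)
i <- findInterval(targets, loc)
cbind(left = loc[i], right = loc[i + 1])
# each row gives the nearest 2/0 or 0/2 column on either side of a target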
That is an interesting question. Here's how I would address it.
First a vector is defined which contains the product of each column:
a3 <- matr[1,]*matr[2,]
Then we can find the columns with pairs of (0/2) or (2/0) rather easily, since we know that the matrix can only contain the values 0, 1, and 2:
the02s <- which(colSums(matr)==2 & a3==0)
Next we want to find the pairs of (0/2) or (2/0) that are closest to a given column number, on the left and on the right of that column. The column number could be 9, for instance:
thecol <- 9
Now we have basically all we need to find the index (the column number in the matrix) of a combination of (0/2) or (2/0) that is closest to the column thecol. We just need to use the output of findInterval():
pos <- findInterval(thecol,the02s)
pos <- c(pos, pos+1)
pos[pos==0] <- NA # output NA if no column was found on the left
And the result is:
the02s[pos]
# 2 14
So the indices of the closest columns on either side of thecol fulfilling the required condition would be 2 and 14 in this case, and we can confirm that these column numbers both contain one of the relevant combinations:
matr[,14]
#a1 a2
# 0 2
matr[,2]
#a1 a2
# 0 2
Edit: I changed the answer such that NA is returned in the case where no column exists on the left and/or on the right of thecol in the matrix that fulfills the required condition.
I have 2 vectors, such as these:
A <- c(1,2,NA,NA,NA,NA,7)
B <- c(NA,NA,3,4,NA,NA,7)
I would like to combine them so that the resulting vector is
1,2,3,4,NA,NA,-1
That is
when only 1 value (say X) in either vector at position i exists (the other being NA) the new vector should take the value X at position i.
when both values are NA at position i, the new vector should take the value NA at position i
when both vectors have a value at position i, the new vector should take the value -1 at position i.
I can easily do this with a loop, but it is very slow on a large dataset, so can anyone provide a fast way to do this?
These commands create the vector:
X <- A
X[is.na(A)] <- B[is.na(A)]
X[is.na(B)] <- A[is.na(B)]
X[!is.na(A & B)] <- -1
#[1] 1 2 3 4 NA NA -1
A <- c(1,2,NA,NA,NA,NA,7)
B <- c(NA,NA,3,4,NA,NA,7)
C <- rowMeans(cbind(A,B),na.rm=TRUE)
C[which(!is.na(A*B))]<- -1
#[1] 1 2 3 4 NaN NaN -1
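If plain NA rather than NaN is wanted (as in the desired output), one extra line converts them:
C[is.nan(C)] <- NA
C
#[1] 1 2 3 4 NA NA -1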
Benchmarks:
Unit: microseconds
expr min lq median uq max
1 Roland(A, B) 17.863 19.095 19.710 20.019 68.985
2 Sven(A, B) 11.703 13.243 14.167 14.783 100.398
A bit late to the party, but here is another option defining a function that works by applying rules to the two vectors cbind-ed together.
# get the data
A <- c(1,2,NA,NA,NA,NA,7)
B <- c(NA,NA,3,4,NA,NA,7)
# define the function
process <- function(A, B) {
    x <- cbind(A, B)
    apply(x, 1, function(x) {
        if (sum(is.na(x)) == 1) {na.omit(x)} else
        if (all(is.na(x))) {NA} else
        if (!any(is.na(x))) {-1}
    })
}
# call the function
process(A,B)
#[1] 1 2 3 4 NA NA -1
The main benefit of using a function is that it is easier to update the rules or the inputs to apply the code to new data.
I would like to aggregate the rows of a matrix by adding the values in rows that have the same rowname. My current approach is as follows:
> M
a b c d
1 1 1 2 0
1 2 3 4 2
2 3 0 1 2
3 4 2 5 2
> index <- as.numeric(rownames(M))
> M <- cbind(M,index)
> Dfmat <- data.frame(M)
> Dfmat <- aggregate(. ~ index, data = Dfmat, sum)
> M <- as.matrix(Dfmat)
> rownames(M) <- M[,"index"]
> M <- subset(M, select= -index)
> M
a b c d
1 3 4 6 2
2 3 0 1 2
3 4 2 5 2
The problem with this approach is that I need to apply it to a number of very large matrices (up to 1,000 rows and 30,000 columns). In these cases the computation time is very high (same problem when using ddply). Is there a more efficient way to come up with the solution? Does it help that the original input matrices are DocumentTermMatrix objects from the tm package? As far as I know, they are stored in a sparse matrix format.
Here's a solution using by and colSums, but requires some fiddling due to the default output of by.
M <- matrix(1:9,3)
rownames(M) <- c(1,1,2)
t(sapply(by(M,rownames(M),colSums),identity))
V1 V2 V3
1 3 9 15
2 3 6 9
There is now an aggregate function in Matrix.utils. This can accomplish what you want with a single line of code and is about 10x faster than the combineByRow solution and 100x faster than the by solution:
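For the matrix M from the question, the call is a one-liner (a sketch, assuming the Matrix.utils package is installed):
library(Matrix.utils)
aggregate.Matrix(M, row.names(M))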
N <- 10000
m <- matrix( runif(N*100), nrow=N)
rownames(m) <- sample(1:(N/2),N,replace=T)
> microbenchmark(a<-t(sapply(by(m,rownames(m),colSums),identity)),b<-combineByRow(m),c<-aggregate.Matrix(m,row.names(m)),times = 10)
Unit: milliseconds
expr min lq mean median uq max neval
a <- t(sapply(by(m, rownames(m), colSums), identity)) 6000.26552 6173.70391 6660.19820 6419.07778 7093.25002 7723.61642 10
b <- combineByRow(m) 634.96542 689.54724 759.87833 732.37424 866.22673 923.15491 10
c <- aggregate.Matrix(m, row.names(m)) 42.26674 44.60195 53.62292 48.59943 67.40071 70.40842 10
> identical(as.vector(a),as.vector(c))
[1] TRUE
EDIT: Frank is right, rowsum is somewhat faster than any of these solutions. You would want to consider using another one of these other functions only if you were using a Matrix, especially a sparse one, or if you were performing an aggregation besides sum.
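For comparison, here is what the rowsum() call looks like on the matrix M from the question; it should reproduce the desired output directly:
rowsum(M, group = rownames(M))
#   a b c d
# 1 3 4 6 2
# 2 3 0 1 2
# 3 4 2 5 2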
The answer by James works as expected, but is quite slow for large matrices. Here is a version that avoids creating new objects:
combineByRow <- function(m) {
    m <- m[order(rownames(m)), ]
    ## keep track of previous row name
    prev <- rownames(m)[1]
    i.start <- 1
    i.end <- 1
    ## cache the rownames -- profiling shows that it takes
    ## forever to look at them
    m.rownames <- rownames(m)
    stopifnot(all(!is.na(m.rownames)))
    ## go through the matrix in a loop, as we need to combine some unknown
    ## set of rows
    for (i in 2:(1 + nrow(m))) {
        curr <- m.rownames[i]
        ## if we found a new row name (or are at the end of the matrix),
        ## combine all rows and mark invalid rows
        if (prev != curr || is.na(curr)) {
            if (i.start < i.end) {
                ## combine the duplicated rows (here with max; replace max
                ## with sum to add the rows, as the question asks)
                m[i.start, ] <- apply(m[i.start:i.end, ], 2, max)
                m.rownames[(1 + i.start):i.end] <- NA
            }
            prev <- curr
            i.start <- i
        } else {
            i.end <- i
        }
    }
    m[which(!is.na(m.rownames)), ]
}
Testing shows that it is about 10x faster than the answer using by (2 vs. 20 seconds in this example):
N <- 10000
m <- matrix( runif(N*100), nrow=N)
rownames(m) <- sample(1:(N/2),N,replace=T)
start <- proc.time()
m1 <- combineByRow(m)
print(proc.time()-start)
start <- proc.time()
m2 <- t(sapply(by(m,rownames(m),function(x) apply(x, 2, max)),identity))
print(proc.time()-start)
all(m1 == m2)