Using apply in R with an additional vector argument

I have a matrix of size 10000 x 100 and a vector of length 100. I'd like to apply a custom function, percentile, which takes in a vector argument and a scalar argument, to each column of the matrix such that on iteration j, the arguments used with percentile are column j of the matrix and entry j of the vector. Is there a way to use one of the apply functions to do this?
Here's my code. It runs, but doesn't return the correct result.
percentile <- function(x, v) {
  length(x[x <= v]) / length(x)
}
X <- matrix(runif(10000 * 100), nrow = 10000, ncol = 100)
y <- runif(100)
result <- apply(X, 2, percentile, v = y)
The workaround that I've been using has been to just append y to X, and re-write the percentile function, as shown below.
X <- rbind(X, y)
percentile2 <- function(x) {
  v <- x[length(x)]
  x <- x[-length(x)]
  length(x[x <= v]) / length(x)
}
result <- apply(X, 2, percentile2)
This code does return the correct result, but I would prefer something a bit more elegant.

If you understand that R is vectorised and know the right functions, you can avoid loops entirely and do the whole thing in one relatively simple line...
colSums( t( t( X ) <= y ) ) / nrow( X )
Through vectorisation R recycles the elements of y element-by-element down the columns of a matrix. We want y[j] paired with column j of X, so we use the transpose function t to turn the columns of X into rows, apply the logical comparison <=, and then transpose back again.
Since TRUE and FALSE evaluate to 1 and 0 respectively, we can use colSums to get the number of rows in each column that met the condition, and then divide by the total number of rows (remember the recycling rule!). It is the exact same result:
X2 <- rbind(X, y)
res1 <- apply(X2, 2, percentile2)
res2 <- colSums( t( t( X ) <= y ) ) / nrow( X )
identical( res1 , res2 )
[1] TRUE
Obviously as this doesn't use any R loops it's a lot quicker (~10 times on this small matrix).
Even better would be to use rowMeans like this (thanks to @flodel):
rowMeans( t(X) <= y )
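To see how the recycling lines up, here is a quick sketch on a tiny toy matrix (Xs and ys are made up here, not from the question): t(Xs) puts column j of Xs into row j, so the comparison pairs ys[j] with the values that came from column j.
Xs <- matrix(1:12, nrow = 4, ncol = 3)  # toy data: columns 1:4, 5:8, 9:12
ys <- c(2, 7, 100)
rowMeans(t(Xs) <= ys)
# [1] 0.50 0.75 1.00   # proportion of each column of Xs that is <= ys[j]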

I think the easiest and clearest way is to use a for loop:
result2 <- numeric(ncol(X))
for (i in seq_len(ncol(X))) {
  result2[i] <- sum(X[,i] <= y[i])
}
result2 <- result2 / nrow(X)
The fastest and shortest solution I can think of is:
result1 <- rowSums(t(X) <= y) / nrow(X)
SimonO101's answer has an explanation of how this works. As I said, it is fast. However, the disadvantage is that it is less clear what exactly is calculated here, although you could solve this by placing this piece of code in a well-named function.
flodel also suggested a solution using mapply, which is a version of apply that can work over multiple vectors in parallel. However, for that to work you first need to put each of the columns of your matrix into a list or data.frame:
result3 <- mapply(percentile, as.data.frame(X), y)
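This works because a data.frame is a list of columns, so mapply() pairs column j with y[j]. A small sketch on made-up data (m and the anonymous function here are just for illustration):
m <- matrix(1:6, nrow = 2)   # columns (1,2), (3,4), (5,6)
str(as.data.frame(m))        # three list elements, one per column
mapply(function(col, v) sum(col <= v), as.data.frame(m), c(1, 4, 5))
# V1 V2 V3
#  1  2  1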
Speed-wise (see below for some benchmarking), the for loop doesn't do that badly and is faster than using apply (in this case at least). The trick with rowSums and vector recycling is faster still, over 10 times as fast as the solution using apply.
> X <- matrix(rnorm(10000 * 100), nrow = 10000, ncol = 100)
> y <- runif(100)
>
> system.time({result1 <- rowSums(t(X) <= y) / nrow(X)})
user system elapsed
0.020 0.000 0.018
>
> system.time({
+ X2 <- rbind(X, y)
+ percentile2 <- function(x){
+ v <- x[length(x)]
+ x <- x[-length(x)]
+ length(x[x <= v]) / length(x)
+ }
+ result <- apply(X2, 2, percentile2)
+ })
user system elapsed
0.252 0.000 0.249
>
>
> system.time({
+ result2 <- numeric(ncol(X))
+ for (i in seq_len(ncol(X))) {
+ result2[i] <- sum(X[,i] <= y[i])
+ }
+ result2 <- result2 / nrow(X)
+ })
user system elapsed
0.024 0.000 0.024
>
> system.time({
+ result3 <- mapply(percentile, as.data.frame(X), y)
+ })
user system elapsed
0.076 0.000 0.073
>
> all(result2 == result1)
[1] TRUE
> all(result2 == result)
[1] TRUE
> all(result3 == result)
[1] TRUE
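Another option, not benchmarked above (just a sketch using the objects already defined): iterate over column indices with vapply, which keeps percentile() unchanged and avoids modifying X:
> result4 <- vapply(seq_len(ncol(X)), function(j) percentile(X[, j], y[j]), numeric(1))
> all(result4 == result2)
[1] TRUE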

Related

Large number digit sum

I am trying to create a function that computes the sum of digits of large numbers, of the order of 100^100. The approach described in this question does not work, as shown below. I tried to come up with a function that does the job, but have not been able to get very far.
The inputs would be of the form a^b, where 1 < a, b < 100 and a and b are integers. So, in that sense, I am open to making digitSumLarge a function that accepts two arguments.
digitSumLarge <- function(x) {
  pow <- floor(log10(x)) + 1L
  rem <- x
  i <- 1L
  num <- integer(length = pow)
  # Individually isolate each digit, starting from the largest, and store it in num
  while (rem > 0) {
    num[i] <- rem %/% (10^(pow - i))
    rem <- rem %% (10^(pow - i))
    i <- i + 1L
  }
  return(num)
}
# Function in the highest voted answer of the linked question.
digitsum <- function(x) sum(floor(x / 10^(0:(nchar(x) - 1))) %% 10)
Consider the following tests:
x <- c(1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9)
as.numeric(paste(x, collapse = ''))
# [1] 1.234568e+17
sum(x)
# 90
digitSumLarge(as.numeric(paste(x, collapse = '')))
# 85
digitsum(as.numeric(paste(x, collapse = '')))
# 81, with warning message about loss of accuracy
Is there any way I can write such a function in R?
You need arbitrary-precision numbers. With R's numerics (double-precision floats), a^b can only be represented with limited precision, and not exactly for sufficiently large inputs.
library(gmp)
a <- as.bigz(13)
b <- as.bigz(67)
sum(as.numeric(strsplit(as.character(a^b), split = "")[[1]]))
#[1] 328
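Putting it together as the two-argument function the question asks about, here is a minimal sketch using gmp (reusing the name digitSumLarge purely for illustration):
library(gmp)
digitSumLarge <- function(a, b) {
  n <- as.bigz(a)^as.bigz(b)   # exact big-integer power
  sum(as.integer(strsplit(as.character(n), split = "")[[1]]))
}
digitSumLarge(13, 67)
#[1] 328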

Optimize performance of a formula spanning three consecutive indices, with wraparound

I want to optimize the implementation of this formula.
Here is the formula (for each index i): re[i] = x[i]^2 - x[i-1] * x[i+1], computed with wraparound at the ends.
x is an array of values. i goes from 1 to N, where N > 2400000.
For i = 1, x[i-1] is taken to be the last element, and for i = N, x[i+1] is taken to be the first element. Here is the code which I have written:
x <- 1:2400000
re <- array(data = NA, dim = NROW(x))
lastIndex <- NROW(x)
for (i in 1:lastIndex) {
  if (i == 1) {
    re[i] <- x[i]*x[i] - x[lastIndex]*x[i+1]
  } else if (i == lastIndex) {
    re[i] <- x[i]*x[i] - x[i-1]*x[1]
  } else {
    re[i] <- x[i]*x[i] - x[i-1]*x[i+1]
  }
}
Can it be done by apply in R?
We can use direct vectorization for this
# Make fake data
x <- 1:10
n <- length(x)
# create vectors for the plus/minus indices
xminus1 <- c(x[n], x[-n])
xplus1 <- c(x[-1], x[1])
# Use direct vectorization to get re
re <- x^2 - xminus1*xplus1
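A quick sanity check (just a sketch; re_loop is a made-up name) against the original wraparound loop on the same fake data:
re_loop <- numeric(n)
for (i in 1:n) {
  prev <- if (i == 1) x[n] else x[i - 1]
  nxt  <- if (i == n) x[1] else x[i + 1]
  re_loop[i] <- x[i]^2 - prev * nxt
}
all.equal(re, re_loop)
# [1] TRUE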
If each x[i] really is equal to i, then you can do a little math:
x[i]^2 - (x[i] - 1) * (x[i] + 1) = 1
so all elements of the result are 1 (only the first and the last are not).
The result is:
c(1-2*N, rep(1, N-2), N*N-(N-1))
In the general case (arbitrary values in x) you can do (as in the answer from Dason):
x*x - c(x[N], x[-N])*c(x[-1], x[1])
Here is a solution with rollapply() from zoo:
library("zoo")
rollapply(c(x[length(x)],x, x[1]), width=3, function(x) x[2]^2 - x[1]*x[3]) # or:
rollapply(c(tail(x,1), x, x[1]), width=3, function(x) x[2]^2 - x[1]*x[3])
Here is the benchmark:
library("microbenchmark")
library("zoo")
N <- 10000
x <- 1:N
microbenchmark(
math=c(1-2*N, rep(1, N-2), N*N-(N-1)), # for the data from the question
vect.i=x*x - c(x[N], x[-N])*c(x[-1], x[1]), # general data
roll.i=rollapply(c(x[length(x)],x, x[1]), width=3, function(x) x[2]^2 - x[1]*x[3]), # or:
roll.tail=rollapply(c(tail(x,1), x, x[1]), width=3, function(x) x[2]^2 - x[1]*x[3])
)
# Unit: microseconds
# expr min lq mean median uq max neval cld
# math 33.613 34.4950 76.18809 36.9130 38.0355 2002.152 100 a
# vect.i 188.928 192.5315 732.50725 197.1955 198.5245 51649.652 100 a
# roll.i 56748.920 62217.2550 67666.66315 68195.5085 71214.9785 109195.049 100 b
# roll.tail 57661.835 63855.7060 68815.91001 67315.5425 71339.6045 119428.718 100 b
An lapply implementation of your formula would look like this:
x <- c(1:2400000)
last <- length(x)
re <- lapply(x, function(i) {
  if (i == 1) {
    x[i]*x[i] - x[last]*x[i+1]
  } else if (i == last) {
    x[i]*x[i] - x[i-1]*x[1]
  } else {
    x[i]*x[i] - x[i-1]*x[i+1]
  }
})
re <- unlist(re)
lapply returns a list, so the conversion to a vector is done with unlist(). Note that passing x itself to lapply only works here because x is 1:2400000, so every value is also its own index; with arbitrary data you would iterate over seq_along(x) instead, as in the sketch below.
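A sketch of the same computation with vapply, which iterates over indices rather than values and simplifies to a numeric vector directly (no unlist() needed):
re <- vapply(seq_along(x), function(i) {
  if (i == 1) {
    x[i]*x[i] - x[last]*x[i+1]
  } else if (i == last) {
    x[i]*x[i] - x[i-1]*x[1]
  } else {
    x[i]*x[i] - x[i-1]*x[i+1]
  }
}, numeric(1))
# As in the lapply version, x[i]*x[i] uses integer arithmetic here (x is 1:2400000),
# so it can overflow to NA for large i; convert with as.numeric(x) first if that matters.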
1) You can avoid all the special-casing in the computation by padding the start and end of x with copies of the last and first elements; something like this:
N <- NROW(x)
x <- c(x[N], x, x[1]) # pad start and end to give wraparound
re <- lapply(2:(N+1), function(i) { x[i]*x[i] - x[i-1]*x[i+1] })
# re <- unlist(re) as andbov wrote
# and remember not to use all of x elsewhere, just x[2:(N+1)]
2) Directly vectorize, as in @Dason's answer:
# Do the padding trick on x, then
x[2:(N+1)]^2 - x[1:N] * x[3:(N+2)]
3) If performance matters, I suspect data.table or a plain for loop over i will be faster, since the formula references three consecutive elements.
4) For more performance, use byte-compiling.
5) If you need even more speed, use the Rcpp extension (C++ under the hood); see How to use Rcpp to speed up a for loop? A small sketch follows below.
See those questions I cited for good examples of using lineprof and microbenchmarking to figure out where your bottleneck is.
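For point 5, a minimal Rcpp sketch (this assumes the Rcpp package is installed; the function name wrapDiffC is made up for illustration, not from the linked question):
library(Rcpp)
cppFunction('
NumericVector wrapDiffC(NumericVector x) {
  int n = x.size();
  NumericVector re(n);
  for (int i = 0; i < n; i++) {
    double prev = (i == 0) ? x[n - 1] : x[i - 1];
    double nxt  = (i == n - 1) ? x[0] : x[i + 1];
    re[i] = x[i] * x[i] - prev * nxt;
  }
  return re;
}')
re <- wrapDiffC(as.numeric(1:2400000))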

Simple function counting values from a list within certain range

I want to create a function that takes 3 arguments: a list of values and two cutoff values (a high and a low). Then I want it to count how many of the values in the list fall within the range of the two cutoff values.
So far I have tried:
count <- function(y, x1, x2){
  tmp1 <- length(y)
  tmp2 <- length(y>x1)
  tmp3 <- length(tmp2<=x2)
  return(tmp3)
}
and
count <- function(y, x1, x2){
  results <- list()
  for (i in y) {
    if (y > x1 & y <= x2) {
      results <- results+1
    }
  }
  return(results)
}
Neither of them works. Can someone help me correct my code?
Simplify it down: take the sum of a vectorized logical operation.
f <- function(x, y, z) sum(x > y & x < z)
f(1:10, 3, 7)
# [1] 3
But the data.table authors are one step ahead of you: they've written a between() function. I believe there is one in the dplyr package as well.
library(data.table)
between
# function (x, lower, upper, incbounds = TRUE)
# {
# if (incbounds)
# x >= lower & x <= upper
# else x > lower & x < upper
# }
# <bytecode: 0x44fc790>
# <environment: namespace:data.table>
So for the same result as above you can simply do
sum(between(1:10, 3, 7, FALSE))
# [1] 3
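For completeness, the questioner's own count() can be fixed the same way; a minimal sketch keeping the original "greater than x1, at most x2" bounds:
count <- function(y, x1, x2) sum(y > x1 & y <= x2)
count(1:10, 3, 7)
# [1] 4   (counts 4, 5, 6, 7)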

Efficient apply or mapply for multiple matrix arguments by row

I have two matrices that I want to apply a function to, by rows:
matrixA
GSM83009 GSM83037 GSM83002 GSM83029 GSM83041
100001_at 5.873321 5.416164 3.512227 6.064150 3.713696
100005_at 5.807870 6.810829 6.105804 6.644000 6.142413
100006_at 2.757023 4.144046 1.622930 1.831877 3.694880
matrixB
GSM82939 GSM82940 GSM82974 GSM82975
100001_at 3.673556 2.372952 3.228049 3.555816
100005_at 6.916954 6.909533 6.928252 7.003377
100006_at 4.277985 4.856986 3.670161 4.075533
I've found several similar questions, but not a whole lot of answers: mapply for matrices, Multi matrix row-wise mapply?. The code I have now splits the matrices by row into lists, but having to split it makes it rather slow and not much faster than a for loop, considering I have almost 9000 rows in each matrix:
scores <- mapply(t.test.stat, split(matrixA, row(matrixA)), split(matrixB, row(matrixB)))
The function itself is very simple, just finding the t-value:
t.test.stat <- function(x, y)
{
  return( (mean(x) - mean(y)) / sqrt(var(x)/length(x) + var(y)/length(y)) )
}
Splitting the matrices isn't the biggest contributor to evaluation time.
set.seed(21)
matrixA <- matrix(rnorm(5 * 9000), nrow = 9000)
matrixB <- matrix(rnorm(4 * 9000), nrow = 9000)
system.time( scores <- mapply(t.test.stat,
split(matrixA, row(matrixA)), split(matrixB, row(matrixB))) )
# user system elapsed
# 1.57 0.00 1.58
smA <- split(matrixA, row(matrixA))
smB <- split(matrixB, row(matrixB))
system.time( scores <- mapply(t.test.stat, smA, smB) )
# user system elapsed
# 1.14 0.00 1.14
Look at the output from Rprof to see that most of the time is--not surprisingly--spent evaluating t.test.stat (mean, var, etc.). Basically, there's quite a bit of overhead from function calls.
Rprof()
scores <- mapply(t.test.stat, smA, smB)
Rprof(NULL)
summaryRprof()
You may be able to find faster generalized solutions, but none will approach the speed of the vectorized solution below.
Since your function is simple, you can take advantage of the vectorized rowMeans function to do this almost instantaneously (though it's a bit messy):
system.time({
ncA <- NCOL(matrixA)
ncB <- NCOL(matrixB)
ans <- (rowMeans(matrixA)-rowMeans(matrixB)) /
sqrt( rowMeans((matrixA-rowMeans(matrixA))^2)*(ncA/(ncA-1))/ncA +
rowMeans((matrixB-rowMeans(matrixB))^2)*(ncB/(ncB-1))/ncB )
})
# user system elapsed
# 0 0 0
head(ans)
# [1] 0.8272511 -1.0965269 0.9862844 -0.6026452 -0.2477661 1.1896181
UPDATE
Here's a "cleaner" version using a rowVars function:
rowVars <- function(x, na.rm=FALSE, dims=1L) {
  rowMeans((x-rowMeans(x, na.rm, dims))^2, na.rm, dims)*(NCOL(x)/(NCOL(x)-1))
}
ans <- (rowMeans(matrixA)-rowMeans(matrixB)) /
sqrt( rowVars(matrixA)/NCOL(matrixA) + rowVars(matrixB)/NCOL(matrixB) )
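As a quick sanity check (sketch), the vectorised result should agree with the mapply() output up to names and floating-point noise:
all.equal(unname(scores), ans)
# [1] TRUE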
This solution avoids splitting and lists, so maybe it will be faster than your version:
## original data:
tmp1 <- matrix(sample(1:100, 20), nrow = 5)
tmp2 <- matrix(sample(1:100, 20), nrow = 5)
## combine them together
tmp3 <- cbind(tmp1, tmp2)
## calculate t.stats:
t.stats <- apply(tmp3, 1, function(x) t.test(x[1:ncol(tmp1)],
x[(1 + ncol(tmp1)):ncol(tmp3)])$statistic)
Edit: Just tested it on two matrices of 9000 rows and 5 columns each, and it completed in less than 6 seconds:
tmp1 <- matrix(rnorm(5 * 9000), nrow = 9000)
tmp2 <- matrix(rnorm(5 * 9000), nrow = 9000)
tmp3 <- cbind(tmp1, tmp2)
system.time(t.st <- apply(tmp3, 1, function(x) t.test(x[1:5], x[6:10])$statistic))
user system elapsed
5.640 0.012 5.705

Avoid two for loops in R

I have R code that can do convolution of two functions...
convolveSlow <- function(x, y) {
  nx <- length(x); ny <- length(y)
  xy <- numeric(nx + ny - 1)
  for (i in seq(length = nx)) {
    xi <- x[[i]]
    for (j in seq(length = ny)) {
      ij <- i + j - 1
      xy[[ij]] <- xy[[ij]] + xi * y[[j]]
    }
  }
  xy
}
Is there a way to remove the two for loops and make the code run faster?
Thank you
San
Since R is very fast at computing vector operations, the most important thing to keep in mind when programming for performance is to vectorise as many of your operations as possible.
This means thinking hard about replacing loops with vector operations. Here is my solution for fast convolution (50 times faster with input vectors of length 1000 each):
convolveFast <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  xy <- nx + ny - 1
  xy <- rep(0, xy)
  for (i in 1:nx) {
    j <- 1:ny
    ij <- i + j - 1
    xy[i+(1:ny)-1] <- xy[ij] + x[i] * y
  }
  xy
}
You will notice that the inner loop (for j in ...) has disappeared. Instead, I replaced it with a vector operation. j is now defined as a vector (j <- 1:ny). Notice also that I refer to the entire vector y, rather than subsetting it (i.e. y instead of y[j]).
j <- 1:ny
ij <- i + j - 1
xy[i+(1:ny)-1] <- xy[ij] + x[i] * y
I wrote a small function to measure performance:
measure.time <- function(fun1, fun2, ...){
  ptm <- proc.time()
  x1 <- fun1(...)
  time1 <- proc.time() - ptm
  ptm <- proc.time()
  x2 <- fun2(...)
  time2 <- proc.time() - ptm
  ident <- all(x1 == x2)
  cat("Function 1\n")
  cat(time1)
  cat("\n\nFunction 2\n")
  cat(time2)
  if (ident) cat("\n\nFunctions return identical results")
}
For two vectors of length 1000 each, I get a 98% performance improvement:
x <- runif(1000)
y <- runif(1000)
measure.time(convolveSlow, convolveFast, x, y)
Function 1
7.07 0 7.59 NA NA
Function 2
0.14 0 0.16 NA NA
Functions return identical results
For vectors, you index with [], not [[]], so use xy[ij], etc.
Convolution doesn't vectorise easily, but one common trick is to switch to compiled code. The Writing R Extensions manual uses convolution as a running example and shows several alternatives; we also use it a lot in the Rcpp documentation.
As Dirk says, compiled code can be a lot faster. I had to do this for one of my projects and was surprised at the speedup: ~40x faster than Andrie's solution.
> a <- runif(10000)
> b <- runif(10000)
> system.time(convolveFast(a, b))
user system elapsed
7.814 0.001 7.818
> system.time(convolveC(a, b))
user system elapsed
0.188 0.000 0.188
I made several attempts to speed this up in R before I decided that using C code couldn't be that bad (note: it really wasn't). All of mine were slower than Andrie's, and were variants on adding up the cross-product appropriately. A rudimentary version can be done in just three lines.
convolveNotAsSlow <- function(x, y) {
  xyt <- x %*% t(y)
  ds <- row(xyt) + col(xyt) - 1
  tapply(xyt, ds, sum)
}
This version only helps a little.
> a <- runif(1000)
> b <- runif(1000)
> system.time(convolveSlow(a, b))
user system elapsed
6.167 0.000 6.170
> system.time(convolveNotAsSlow(a, b))
user system elapsed
5.800 0.018 5.820
My best version was this:
convolveFaster <- function(x,y) {
  foo <- if (length(x) < length(y)) { y %*% t(x) } else { x %*% t(y) }
  foo.d <- dim(foo)
  bar <- matrix(0, sum(foo.d) - 1, foo.d[2])
  bar.rc <- row(bar) - col(bar)
  bar[bar.rc >= 0 & bar.rc < foo.d[1]] <- foo
  rowSums(bar)
}
This was quite a bit better, but still not nearly as fast as Andrie's:
> system.time(convolveFaster(a, b))
user system elapsed
0.280 0.038 0.319
The convolveFast function can be optimized a little by carefully using integer math only and replacing (1:ny)-1L with seq.int(0L, ny-1L):
convolveFaster <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  xy <- nx + ny - 1L
  xy <- rep(0L, xy)
  for (i in seq_len(nx)) {
    j <- seq_len(ny)
    ij <- i + j - 1L
    xy[i + seq.int(0L, ny - 1L)] <- xy[ij] + x[i] * y
  }
  xy
}
How about convolve(x, rev(y), type = "open") in stats?
> x <- runif(1000)
> y <- runif(1000)
> system.time(a <- convolve(x, rev(y), type = "o"))
user system elapsed
0.032 0.000 0.032
> system.time(b <- convolveSlow(x, y))
user system elapsed
11.417 0.060 11.443
> identical(a,b)
[1] FALSE
> all.equal(a,b)
[1] TRUE
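(identical() is FALSE only because convolve() computes the result via the FFT, so the two answers differ by tiny floating-point rounding errors; that is why all.equal() still reports TRUE.)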
Some say the apply() and sapply() functions are faster than for() loops in R. You could convert the convolution to a function and call it from within apply().
However, there is evidence to the contrary
http://yusung.blogspot.com/2008/04/speed-issue-in-r-computing-apply-vs.html
