Manipulating a large-dimensional matrix in R

I would like to evaluate the following function for each row of a matrix M of dimension 3e+07 x 4.
func <- function(x) {
  dmultinom(c(x[c(1,2)], 50 - sum(x[c(1,2)])), size = NULL, rep(1/3, 3), log = FALSE) / (x[3]^2 + x[4]^3)
}
I am using the following code:
as.numeric(unlist(apply(M, 1, function(v) func(v))))
Unfortunately, this takes a very long time. How can I speed it up?

Luckily, lgamma is a primitive function, so one option is to vectorize dmultinom yourself. Here is an option combined with data.table for faster speed:
library(data.table)
set.seed(0L)
nr <- 3e7
size <- 50L
DT <- data.table(X1 = sample(1:20, nr, TRUE), X2 = sample(1:20, nr, TRUE), X3 = 3, X4 = 4)
system.time({
  DT[, paste0("lgX", 1L:3L) := c(lapply(1 + .SD, lgamma), .(lgamma(1 + size - X1 - X2))), .SDcols = X1:X2][,
    dmn := exp(lgamma(size + 1L) + log(1/3) * size - (lgX1 + lgX2 + lgX3)) / (X3^2 + X4^3)]  # X4^3, matching the original func
  DT$dmn
})
#    user  system elapsed
#    7.44    0.17    7.64
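As a sanity check (not part of the original answer), you can compare a handful of rows against the row-wise func from the question; idx is a hypothetical spot-check sample:
idx <- sample(nr, 5L)
manual <- apply(as.matrix(DT[idx, .(X1, X2, X3, X4)]), 1, func)
all.equal(manual, DT[idx, dmn])  # should be TRUE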

Something like this, though I haven't verified correctness. Your:
func <- function(x) {
  dmultinom(c(x[c(1,2)], 50 - sum(x[c(1,2)])), size = NULL, rep(1/3, 3), log = FALSE) / (x[3]^2 + x[4]^3)
}
can be written as:
func <- function(x) {
  a <- x[c(1,2)]
  b <- 50 - (a[1] + a[2])
  d <- c(a, b)
  e <- dmultinom(d, size = NULL, rep(1/3, 3), log = FALSE)
  f <- x[3]^2 + x[4]^3
  e / f
}
The d part you can vectorize via matrix calculations as:
A <- M[, 1:2]
B <- 50 - (A[,1] + A[,2])
D <- cbind(A, B)
Without diving into dmultinom(), the e part can be calculated via apply() as:
prob <- rep(1/3, times = 3L)
E <- apply(D, MARGIN = 1L, function(d) {
  dmultinom(d, size = NULL, prob = prob, log = FALSE)
})
The f part you can vectorize via matrix calculations as:
F <- M[,3]^2 + M[,4]^3
which gives that:
Y <- apply(M, 1, function(v) func(v))
can be written as:
Y <- E / F
Disclaimer: I haven't verified this, but you should get the idea of how to vectorize and avoid duplicated work.
PS. If you look at dmultinom(), I think you can vectorize that as well in a similar fashion. It's not unlikely that you can get rid of the remaining apply() call.
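For what it's worth, here is a minimal sketch of that last step, assuming every row of D sums to 50 and prob = rep(1/3, 3); it computes the multinomial density in log space with lgamma, which is what dmultinom() does internally:
size <- 50
logE <- lgamma(size + 1) - rowSums(lgamma(D + 1)) + size * log(1/3)
E <- exp(logE)  # replaces the apply() call above
Y <- E / F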

Related

Can this plot_normal_distribution function be optimized?

I've tried to optimize a function which I wrote a few weeks ago.
It got better, but it is still slow. So I used Rprof() and found that split() takes most of the time, which makes me think this function can be a lot better.
Can it be done?!
normDist_V2 <- function(size = 1e5, precision = 1, ...)
{
  data <- rnorm(size)
  roundedData <- round(data, precision)
  framedData <- data.frame(cbind(data, roundedData))
  factoredData <- split(framedData$data, framedData$roundedData)
  actualsize <- size / 10^precision
  X <- names(factoredData)
  Probability <- sapply(factoredData, length) / actualsize
  plot(X, Probability, ...)
}
Current speed:
system.time(normDist_V2(size=1e7, precision = 2)) #11.14 sec
normDist_V2 <- function(size = 1e5, precision = 1, ...) {
  require(data.table)
  data <- rnorm(size)
  roundedData <- round(data, precision)
  framedData <- data.table(data, roundedData)
  actualsize <- size / 10^precision
  dt <- framedData[, .N, keyby = roundedData]
  X <- dt$roundedData
  Probability <- dt$N / actualsize
  plot(X, Probability, ...)
}
system.time(normDist_V2(size=1e7, precision = 2)) # 1.26 sec
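A dependency-free variant is also possible; the sketch below (hypothetical normDist_V3, not benchmarked here) uses table() on the rounded values instead of split() or data.table:
normDist_V3 <- function(size = 1e5, precision = 1, ...) {
  roundedData <- round(rnorm(size), precision)
  counts <- table(roundedData)  # one count per rounded bin
  X <- as.numeric(names(counts))
  Probability <- as.numeric(counts) / (size / 10^precision)
  plot(X, Probability, ...)
}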

R reverse function to solve for parameter when output is a fixed constant

If I have a function
estimator <- function(A, B) {
  A * (B + 23)
}
How can I invert this function to find the value of A, for B as a sequence between 0 and 120 (B = 1, 2, 3, 4, ..., 120), that gives a fixed result, say C = 20?
I would use it to map the values of A which satisfy the equation A*(B+23) = C = 20, with B being a list b.list between 0 and 120, for each of the different C in c.list:
b.list <- seq(0,120,by=1)
c.list <- tibble(seq(10,32,by=2))
In the end, I would like to plot the lines of curves of the function for different C using purrr or similar.
I.e.: given that the height of a tree in metres at age 100 follows the function C = A*(B+23), solve for the A that gives the result C = 10 when B (age) is a list of years between 0 and 120.
Many thanks!
For the inverse, it is a quick algebraic inversion:
A = C/(B+23)
One answer could be :
B <- seq(0, 120)
C <- seq(10, 32, 2)
A <- matrix(0,
            nrow = length(B),
            ncol = length(C))
for (i in 1:ncol(A)) {  # ncol(A), not ncol(M): there is no M here
  A[, i] <- C[i] / (B + 23)
}
matplot(B, A, type ="l", col = "black")
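As a side note (not in the original answer), the loop can be replaced by a single outer() call, which builds the same length(B) x length(C) matrix:
A <- outer(1 / (B + 23), C)
matplot(B, A, type = "l", col = "black")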
In the case of a more complex function, you do indeed need an automatic solver. One way is to treat it as an optimisation problem where you minimise the squared distance from C:
B <- seq(1, 120)
C <- seq(10, 32, 2)
A <- matrix(0,
            nrow = length(B),
            ncol = length(C))
fct <- function(A, B, C) {
  paramasi <- 25
  parambeta <- 7395.6
  paramb2 <- -1.7829
  refB <- 100
  d <- parambeta * (paramasi^paramb2)
  r <- (((A - d)^2) + (4 * parambeta * A * (B^paramb2)))^0.5
  si_est <- (A + d + r) / (2 + (4 * parambeta * (refB^paramb2)) / (A - d + r))
  return((si_est - C)^2)  # squared distance from the target C
}
for (c in 1:length(C)) {
  for (b in 1:length(B)) {
    # fixed parameters + optimisation
    res <- optim(par = 1, fn = fct, B = B[b], C = C[c])
    A[b, c] <- res$par
  }
}
matplot(B, A, type = "l", col = "black")
You need to be careful, because in your case I think you could find an analytical formula for the inverse, which would be better.
Good luck!
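Since each solve here is one-dimensional, root-finding can be more robust than optimisation. Below is a sketch (not from the original answer) that applies uniroot() to the signed difference; it assumes si_est crosses C somewhere inside the bracket:
fct_root <- function(A, B, C) {
  paramasi <- 25; parambeta <- 7395.6; paramb2 <- -1.7829; refB <- 100
  d <- parambeta * paramasi^paramb2
  r <- sqrt((A - d)^2 + 4 * parambeta * A * B^paramb2)
  si_est <- (A + d + r) / (2 + 4 * parambeta * refB^paramb2 / (A - d + r))
  si_est - C  # zero at the solution
}
# assumes a sign change in (1e-6, 1e4); widen the bracket if uniroot() complains
res <- uniroot(fct_root, interval = c(1e-6, 1e4), B = 50, C = 20)
res$root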

Vectorizing this function in R

Hi, so I have the following function:
kde.cv = function(X, s) {
  l = length(X)
  log.fhat.vector = c()
  for (i in 1:l) {
    current.log.fhat = log(kde(X[i], X[-i], s))
    log.fhat.vector[i] = current.log.fhat
  }
  CV.score = sum(log.fhat.vector)
  return(CV.score)
}
I'd like to vectorize this without using any for loops or apply statements, but I can't seem to find a way around them. Help would be appreciated. Thanks.
EDIT: Given the responses, here are my answers to the questions posed.
Given the requests for clarification, I will elaborate on the function inputs and on the user-defined function called inside. X here is a dataset in the form of a vector; specifically, a vector of length 7 in the dataset I used as input to this function. The X I used is c(-1.1653, -0.7538, -1.3218, -2.3394, -1.9766, -1.8718, -1.5041). s is a single scalar set to 0.2 for this use of the function. kde is a user-defined function that I wrote. Here is the implementation:
kde = function(x, X, s) {
  l = length(x)
  b = matrix(X, l, length(X), byrow = TRUE)
  c = x - b
  phi.matrix = dnorm(c, 0, s)
  d = rowMeans(phi.matrix)
  return(d)
}
In this function, X is the same vector of data points used in kde.cv, and s is the same scalar value of 0.2. x is a vector of evaluation points; I used seq(-2.5, -0.5, by = 0.1).
Here is an option using sapply
kde.cv = function(X, s)
  sum(sapply(1:length(X), function(i) log(kde(X[i], X[-i], s))))
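Equivalently, a small sketch with vapply() (not in the original answer), which checks that each element is a single numeric and avoids sapply()'s simplification guesswork:
kde.cv = function(X, s)
  sum(vapply(seq_along(X), function(i) log(kde(X[i], X[-i], s)), numeric(1)))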
For convenience, please provide a more complete example, such as the kde() function. Is that a custom function?
As an alternative to sapply, you can try Vectorize(). There are some examples on Stack Overflow, for instance:
Vectorize() vs apply()
Here is an example
f1 <- function(x,y) return(x+y)
f2 <- Vectorize(f1)
f1(1:3, 2:4)
[1] 3 5 7
f2(1:3, 2:4)
[1] 3 5 7
And the second example:
f1 <- function(x)
{
  new.vector <- c()
  for (i in 1:length(x))
  {
    new.vector[i] <- sum(x[i] + x[-i])
  }
  return(sum(new.vector))
}
f2 <- function(x)
{
  f3 <- function(y, i)
  {
    u <- sum(y[i] + y[-i])
    return(u)
  }
  f3.v <- Vectorize(function(i) f3(y = x, i = i))
  new.value <- f3.v(1:length(x))
  return(sum(new.value))
}
f1(1:3)
[1] 24
f2(1:3)
[1] 24
Note: Vectorize is a wrapper for mapply
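To illustrate that note, the two-argument f1 from the first example can be called through mapply() directly:
f1 <- function(x, y) return(x + y)
mapply(f1, 1:3, 2:4)
[1] 3 5 7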
EDIT 1
Following the response, I edited your kde.cv function.
kde.cv = function(X, s) {
  l = length(X)
  log.fhat.vector = c()
  for (i in 1:l) {
    current.log.fhat = log(kde(X[i], X[-i], s))
    log.fhat.vector[i] = current.log.fhat
  }
  CV.score = sum(log.fhat.vector)
  return(CV.score)
}
kde = function(x, X, s) {
  l = length(x)
  b = matrix(X, l, length(X), byrow = TRUE)
  c = x - b
  phi.matrix = dnorm(c, 0, s)
  d = rowMeans(phi.matrix)
  return(d)
}
##### Vectorize kde.cv ######
kde.cv.v = function(X, s)
{
  kde.v <- Vectorize(function(i) kde(X[i], X[-i], s))
  CV.score <- sum(log(kde.v(1:length(X))))
  return(CV.score)
}
X<-c(-1.1653, -0.7538, -1.3218, -2.3394, -1.9766, -1.8718, -1.5041)
s<-0.2
x<-seq(-2.5, -0.5, by = 0.1)
kde.cv(X, s)
[1] -10.18278
kde.cv.v(X, s)
[1] -10.18278
EDIT 2
Well, I think the following function may match your requirement. BTW, since the little x is not used in your kde.cv, I edited both functions:
kde.cv.2 <- function(X, s)
{
  log.fhat.vector <- log(kde.2(X, s))
  CV.score <- sum(log.fhat.vector)
  return(CV.score)
}
kde.2 <- function(X, s)
{
  l <- length(X)
  b <- matrix(rep(X, l), l, l, byrow = TRUE)
  c <- X - b
  diag(c) <- NA
  phi.matrix <- dnorm(c, 0, s)
  d <- rowMeans(phi.matrix, na.rm = TRUE)
  return(d)
}
X<-c(-1.1653, -0.7538, -1.3218, -2.3394, -1.9766, -1.8718, -1.5041)
s<-0.2
kde.cv(X,s)
[1] -10.18278
kde.cv.2(X, s)
[1] -10.18278
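If you want to compare all three versions, here is a quick benchmark sketch (it assumes the microbenchmark package is installed; timings will vary by machine):
library(microbenchmark)
microbenchmark(loop = kde.cv(X, s),
               Vectorize = kde.cv.v(X, s),
               matrix = kde.cv.2(X, s))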

Wrapping very long functions in RExcel VBA?

When you want to use R functions in VBA via RExcel, you have to use
RInterface.RRun "..."
Then, if you'd like to define your own R function, you can simply
RInterface.RRun "y <- function(x) { ... }"
If y is made up of more than one command line, you can separate the lines with ;, as you're used to doing in the R environment.
But... what if your y function is very, very long?
A 20-30 line R function is very difficult to write this way in VBA, and there's a limit to the length of VBA statements.
So: how can I wrap it?
Here's an example of a quite long R function: can you show me how to put it in VBA using RExcel?
bestIV <- function(dT, IVTS.t, Spot, r) {
  b <- r
  xout <- seq(0, max(T), dT)
  sfm <- matrix(NA, nrow = length(K), ncol = length(xout))
  for (i in 1:length(K)) {
    sfm[i, ] <- approx(x = T, y = IVTS.t[i, ], xout = xout, rule = 2)$y
  }
  sfm[, 1] <- sfm[, 1] + sfm[, 2] - sfm[, 3]
  rownames(sfm) <- K
  colnames(sfm) <- xout
  Option <- matrix(NA, nrow = length(K), ncol = length(xout))
  for (i in 1:length(K)) {
    for (j in 1:length(xout)) {
      TypeFlag <- ifelse(K[i] < Spot, 'p', 'c')
      Option[i, j] <- GBSOption(TypeFlag = TypeFlag, S = Spot, X = K[i],
                                Time = xout[j] / 365, r = r, b = b,
                                sigma = sfm[i, j] / 100)@price  # @price slot of the fOptions result
    }
  }
  rownames(Option) <- K
  colnames(Option) <- xout
  dP <- (cbind(0, -t(apply(X = Option, MARGIN = 1, FUN = diff))) / Option)[, -(1:2)]
  dV <- dP / dT
  min.V <- which(dV == min(dV), arr.ind = TRUE, useNames = TRUE)
  Strike <- as.numeric(dimnames(min.V)[1])
  Maturity <- as.numeric(unlist(dimnames(dV)[2]))[min.V[2]]
  Days <- dT
  Mat <- c(dV[which(dV == min(dV))], Strike, Maturity, Days)
  names(Mat) <- c('Value', 'Strike', 'Maturity', 'Days')
  return(Mat)
}
Thanks,
Put your R code in your spreadsheet (in a range of cells) and use this function instead:
RInterface.RunRCodeFromRange range
This executes the commands in range on a worksheet (it allows commands prepared for interactive execution with R to be run in macro code).
You are passing a string as an argument to a VBA function. Thus your question reduces to "how can I concatenate strings in VBA".
The answer is to use the concatenation operator &, like this:
"a" & "b"
Say you have an R function:
y <- function(x, a, b){
  return(x)
}
Then you can do this in VBA (note the & _ line continuations, which VBA requires when a statement spans several lines):
RInterface.RRun "y <- function(x, a, b) {" & _
                "return(x)" & _
                "}"

How to make a loop run faster in R?

I want to use arms() to get one sample at a time, inside a loop like the following one in my function. It runs very slowly. How could I make it run faster? Thanks.
library(HI)
dmat <- matrix(0, nrow = 100, ncol = 30)
system.time(
  for (d in 1:100) {
    for (j in 1:30) {
      y <- rep(0, 101)
      for (i in 2:100) {
        y[i] <- arms(0.3, function(x) (3.5 + 0.000001*d*j*y[i-1]) * log(x) - x,
                     function(x) (x > 1e-4) * (x < 20), 1)
      }
      dmat[d, j] <- sum(y)
    }
  }
)
)
This is a version based on Tommy's answer but avoiding all loops:
library(multicore) # or library(parallel) in 2.14.x
set.seed(42)
m = 100
n = 30
system.time({
  arms.C <- getNativeSymbolInfo("arms")$address
  bounds <- 0.3 + convex.bounds(0.3, dir = 1, function(x) (x > 1e-4) * (x < 20))
  if (diff(bounds) < 1e-07) stop("pointless!")
  # create the vector of z values
  zval <- 0.00001 * rep(seq.int(n), m) * rep(seq.int(m), each = n)
  # apply the inner function to each grid point and return the matrix
  dmat <- matrix(unlist(mclapply(zval, function(z)
    sum(unlist(lapply(seq.int(100), function(i)
      .Call(arms.C, bounds, function(x) (3.5 + z * i) * log(x) - x,
            0.3, 1L, parent.frame())
    )))
  )), m, byrow = TRUE)
})
On a multicore machine this will be really fast since it spreads the loads across cores. On a single-core machine (or for poor Windows users) you can replace mclapply above with lapply and get only a slight speedup compared to Tommy's answer. But note that the result will be different for parallel versions since it will use different RNG sequences.
Note that any C code that needs to evaluate R functions will be inherently slow (because interpreted code is slow). I have added the arms.C just to remove all R->C overhead to make moli happy ;), but it doesn't make any difference.
You could squeeze out a few more milliseconds by using column-major processing (the question code was row-major which requires re-copying as R matrices are always column-major).
Edit: I noticed that moli changed the question slightly since Tommy answered. The y[i] are now dependent, so instead of the sum(...) part you have to use a loop, and the function(z) would look like:
function(z) {
  y <- 0
  for (i in seq.int(99))
    y <- y + .Call(arms.C, bounds, function(x) (3.5 + z * y) * log(x) - x,
                   0.3, 1L, parent.frame())
  y
}
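So the full mclapply() call with the dependent-y version would look something like this (a sketch combining the two snippets above; zval, bounds, arms.C, and m are defined as before):
dmat <- matrix(unlist(mclapply(zval, function(z) {
  y <- 0
  for (i in seq.int(99))
    y <- y + .Call(arms.C, bounds, function(x) (3.5 + z * y) * log(x) - x,
                   0.3, 1L, parent.frame())
  y
})), m, byrow = TRUE)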
Well, one effective way is to get rid of the overhead inside arms. It does some checks and calls the indFunc every time even though the result is always the same in your case.
Some other evaluations can also be done outside the loop. These optimizations bring the time down from 54 secs to around 6.3 secs on my machine... and the answer is identical.
set.seed(42)
# dmat2 <- ## RUN ORIGINAL CODE HERE ##
# Now try this:
set.seed(42)
dmat <- matrix(0, nrow = 100, ncol = 30)
system.time({
  e <- new.env()
  bounds <- 0.3 + convex.bounds(0.3, dir = 1, function(x) (x > 1e-4) * (x < 20))
  f <- function(x) (3.5 + z*i) * log(x) - x
  if (diff(bounds) < 1e-07) stop("pointless!")
  for (d in seq_len(nrow(dmat))) {
    for (j in seq_len(ncol(dmat))) {
      y <- 0
      z <- 0.00001 * d * j
      for (i in 1:100) {
        y <- y + .Call("arms", bounds, f, 0.3, 1L, e)
      }
      dmat[d, j] <- y
    }
  }
})
all.equal(dmat, dmat2) # TRUE
Why not like this?
dat <- expand.grid(d = 1:10, j = 1:3, i = 1:10)
arms.func <- function(vec) {
  require(HI)
  dji <- vec[1] * vec[2] * vec[3]
  arms.out <- arms(0.3,
                   function(x, params) (3.5 + 0.00001*params) * log(x) - x,
                   function(x, params) (x > 1e-4) * (x < 20),
                   n.sample = 1,
                   params = dji)
  return(arms.out)
}
dat$arms <- apply(dat, 1, arms.func)
library(plyr)
out <- ddply(dat, .(d, j), summarise, arms = sum(arms))
matrix(out$arms, nrow = length(unique(out$d)), ncol = length(unique(out$j)))
However, it's still single-core and time-consuming. But that isn't R being slow, it's the arms function.
