How to speed up this function (for n parameters) in R? - r

I have this function:
col <- 0
rres <- data.frame(matrix(nrow=nrow(ind),ncol=length(lt)))
gig <- NULL
> lt
[1] 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
> delta.1
[1] 5 7 9 10 12 15 17 20 22 26 29 34 39 46 54 68 96 138 138
> f.bio
function(x,y,a,b,l,k,m)
{
for (t in 1:nrow(y)){
for (i in 1:length(lt)){
for(j in 1:delta.1[i]){
ifelse (t+j-1>nrow(x),gig[j]<- NA,
gig[j] <- x[t+j-1,i]*
(a*(l-(((l-(lt[i]+1))/(exp(-k*((j-1)/12))))))
^b)*exp(m[(1+j),i]*(j-1)))
}
rres[t,i] <- sum(gig, na.rm = TRUE)
}
result <- apply(rres,1,function(x) sum(x)/1000000)
}
return(result)
}
which is applied to some biological data. The call is:
f.bio(ind,eff,a_all,b_all,Linf,K_coef,mort)
where the arguments are:
> dim(ind)
[1] 1356 19
> dim(eff)
[1] 1356 1
a_all = 0.004
b_all= 3
Linf= 19.4
K_coef = 0.57
> dim(mort)
[1] 110 19
ind, eff, and mort are data.frame.
Now, my question is: is it possible to apply this function to n parameter values without excessive computing time?
By n parameters I mean a distribution of values for each parameter, for example:
set.seed(1)
a_all_v <- round(sort(rnorm(40,a_all,0.00034)),5) #40 values!!
and so on for the 4 parameters: a_all, b_all, K_coef, Linf.
I wrote this code with a loop (in this loop I combine a_all with b_all, and Linf with K_coef):
col <- 0
for (m1 in 1:length(a_all_v)){
a_all <- a_all_v[m1]
b_all <- b_all_v[m1]
for(m2 in 1:length(Linf_v)){
Linf <- Linf_v[m2]
K_coef <- k_coef_v[m2]
col <- col+1
res.temp <-f.bio(ind,eff,a_all,b_all,Linf,K_coef,mort)
res.2[,col] <-res.temp
}
}
where res.2 is:
res.2 <- data.frame(matrix(nrow=1356,ncol=1600)) #1600=40*40 (number of values for each parameters distribution)
This loop takes a very long time to run (many days on my PC). For this reason, is there some package or function (like Monte Carlo or bootstrap) that could change my code structure and run the function over a good number of parameter combinations in little time (if that is possible)?

If you keep your current setup with for loops, you need to start preallocating your output objects. For example, you start with an empty gig (NULL) and fill it iteratively. The way you do it right now, gig has to be rebuilt every time it grows, and reallocating memory is a very expensive operation. Making gig as large as it needs to be up front and then assigning into it will speed up your code considerably.
Even better is to solve your problem not via for loops (which are slow in R, even with preallocation) but to use either:
Vectorisation and matrix calculations. These can be orders of magnitude faster.
dplyr or data.table. If used smartly, these will also be much faster, but vectorisation is probably faster still.
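To make the difference concrete, here is a minimal sketch on a toy calculation (squaring the numbers 1 to n), not on the f.bio function itself. The names n, grown, prealloc and vectorised are made up for illustration. All three versions give the same result; only the way the output is built differs.

```r
n <- 1e4

# 1. Growing the result: the vector is extended on every iteration
grown <- NULL
for (j in seq_len(n)) grown[j] <- j^2

# 2. Preallocating: the vector has its final length before the loop starts
prealloc <- numeric(n)
for (j in seq_len(n)) prealloc[j] <- j^2

# 3. Vectorised: no explicit loop at all
vectorised <- seq_len(n)^2

# All three approaches produce the same values
all(grown == prealloc) && all(prealloc == vectorised)
```

Wrapping each variant in system.time() on your own machine is a simple way to see how much each step buys you.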

Related

R expand.grid with row restrictions

I have a numeric vector x of length N and would like to create a vector of the within-set sums of all of the following sets: any possible combination of the x elements with at most M elements in each combination. I put together a slow iterative approach; what I am looking for here is a way without using any loops.
Consider the approach I have been taking, in the following example with N=5 and M=4
M <- 4
x <- 11:15
y <- as.matrix(expand.grid(rep(list(0:1), length(x))))
result <- y[rowSums(y) <= M, ] %*% x
However, as N gets large (above 22 for me), the expand.grid output becomes too big and gives an error (replace x above with x <- 11:55 to observe this). Ideally there would be an expand.grid function that permits restrictions on the rows before constructing the full matrix, which (at least for what I want) would keep the matrix size within memory limits.
Is there a way to achieve this without causing problems for large N?
Your problem has to do with the sheer number of combinations.
What you appear to be doing is listing all the different combinations of 0's and 1's in a sequence with the length of x.
In your example x has length 5 and you have 2^5 = 32 combinations.
When x has length 22 you have 2^22 = 4194304 combinations.
Couldn't you use a binary encoding instead?
In your case that would mean
0 stands for 00000
1 stands for 00001
2 stands for 00010
3 stands for 00011
...
It will not solve your problem completely, but you should be able to get a bit further than now.
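A minimal sketch of the binary-encoding idea, assuming N stays below about 30 so that every code fits in a 32-bit integer. The helper name subset_sum is made up for illustration. Each integer code is decoded into its bits on the fly, so the full 0/1 matrix is never held in memory. It still visits all 2^N codes, so it saves memory rather than time.

```r
x <- 11:15
M <- 4
N <- length(x)

# Decode one integer code into N bits and return the sum of the selected
# elements of x, or NA if more than M elements are selected
subset_sum <- function(code) {
  bits <- as.integer(intToBits(as.integer(code)))[seq_len(N)]
  if (sum(bits) <= M) as.numeric(sum(bits * x)) else NA_real_
}

codes <- 0:(2^N - 1)
sums <- vapply(codes, subset_sum, numeric(1))
sums <- sums[!is.na(sums)]   # keep only combinations with at most M elements
```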
Try this:
c(0, unlist(lapply(1:M, function(k) colSums(combn(x, k)))))
It generates the same result as with your expand.grid approach, shown below for the test data.
M <- 4
x <- 11:15
# expand.grid approach
y <- as.matrix(expand.grid(rep(list(0:1), length(x))))
result <- y[rowSums(y) <= M, ] %*% x
# combn approach
result1 <- c(0, unlist(lapply(1:M, function(k) colSums(combn(x, k)))))
all(sort(result[,1]) == sort(result1))
# [1] TRUE
This should be fast (it takes 0.227577 secs on my machine, with N=22, M=4):
x <- 1:22 # N = 22
M <- 4
c(0, unlist(lapply(1:M, function(k) colSums(combn(x, k)))))
# [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 3 4 5 6 7
You may want to keep only the unique values of the sums with
unique(c(0, unlist(lapply(1:M, function(k) colSums(combn(x, k))))))
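To see why the combn approach scales so much better, it helps to count how many sums each approach has to produce. The sketch below only uses choose(); the variable names n_restricted and n_full are made up for illustration.

```r
N <- 22
M <- 4

# Number of combinations with at most M elements (including the empty set)
n_restricted <- sum(choose(N, 0:M))
n_restricted
# [1] 9109

# Number of rows expand.grid would have to build first
n_full <- 2^N
n_full
# [1] 4194304
```

So for N = 22 and M = 4 the restricted enumeration handles roughly 460 times fewer combinations than the full grid.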

Create master list adding iterations of values in another list with known interval in R

I have a solution that works, but would appreciate ideas to improve the code to avoid using loops if possible.
I have a list of values, this is read in from a csv file, but takes the form
startingvalues = c(1, 7, 20, 32, 47)
I want to create a new list, that reads in each of these starting values and adds the next 2 (or 7 or 15 etc.) numbers, then goes to the next. For the above example this would be
newlist = c(1,2,3,7,8,9,20,21,22,32,33,34,47,48,49)
I have code that works, but I suspect there is a more elegant way to do this. I am not particularly worried about speed but would like to avoid the loop if there is a better way to do this.
newlist = c() # initialise an empty vector
for (i in seq_along(startingvalues)) {
list1 = seq(startingvalues[i], startingvalues[i] + 2, by = 1)
newlist = c(newlist, list1)
}
Any suggestions to improve my coding would be appreciated. This may be the best way to do this, however I suspect it isn't.
How about something like this
extend <- function(x,y) unlist(lapply(x, seq.int, length.out=y+1))
extend(startingvalues, 2)
# [1] 1 2 3 7 8 9 20 21 22 32 33 34 47 48 49
The first parameter is the vector of numbers and the second is how far you want to extend each number. We just use lapply for the iteration and unlist the result at the end. This is better than appending at each iteration, which is not very efficient.
Here's another alternative
extend <- function(x,y) c(outer(0:y, x, `+`))
The outer() will build a matrix but we coerce back to a vector with c().
We can use rep with + to get the expected output
unique(sort(startingvalues + rep(0:2, each = length(startingvalues))))
#[1] 1 2 3 7 8 9 20 21 22 32 33 34 47 48 49
Or, as @thelatemail mentioned, replicating 'startingvalues' and making use of recycling would be better, as the sort can be avoided:
s1 <- 0:2
rep(startingvalues, each=length(s1)) + s1
#[1] 1 2 3 7 8 9 20 21 22 32 33 34 47 48 49
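As a quick sanity check, the three loop-free approaches above can be compared side by side. This is only a sketch; the names k, via_lapply, via_outer and via_rep are made up for illustration.

```r
startingvalues <- c(1, 7, 20, 32, 47)
k <- 2   # how many extra values to add after each starting value

via_lapply <- unlist(lapply(startingvalues, seq.int, length.out = k + 1))
via_outer  <- c(outer(0:k, startingvalues, `+`))
via_rep    <- rep(startingvalues, each = k + 1) + 0:k

# All three give the same sequence
all(via_lapply == via_outer) && all(via_outer == via_rep)
# [1] TRUE
```

Changing k to 7 or 15 extends each starting value by that many numbers without touching the rest of the code.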

Using R as a game simulator

I am trying to simulate a simple game where you spin a spinner, labeled 1-5, and then progress on until you pass the finish line (spot 50). I am a bit new to R and have been working on this for a while searching for answers. When I run the code below, it doesn't add the numbers in sequence; it returns a list of my 50 random spins and their values. How do I get this to add the spins on top of each other, then stop once the total is >= 50?
SpacesOnSpinner<-(seq(1,5,by=1))
N<-50
L1<-integer(N)
for (i in 1:N){
takeaspin<-sample(SpacesOnSpinner,1,replace=TRUE)
L1[i]<-L1[i]+takeaspin
}
This is a good use-case for replicate. I'm not sure if you have to use a for loop, but you could do this instead (replicate is a loop too):
SpacesOnSpinner<-(seq(1,5,by=1))
N<-10
cumsum( replicate( N , sample(SpacesOnSpinner,1,replace=TRUE) ) )
#[1] 5 10 14 19 22 25 27 29 30 33
However, since you have a condition which you want to break on, perhaps the other answer with a while condition is exactly what you need in this case (people will tell you they are bad in R, but they have their uses). Using this method, you can see how many spins it took you to get past 50 by a simple subset afterwards (but you will not know in advance how many spins it will take, but at most it will be 50!):
N<-50
x <- cumsum( replicate( N , sample(5,1) ) )
# Value of accumulator at each round while it is still below 50
x[ x < 50 ]
#[1] 5 6 7 8 12 16 21 24 25 29 33 34 36 38 39 41 42 44 45 49
# Number of spins before the total reaches 50
length(x[x < 50])
#[1] 20
Here is another interesting way to simulate your game, using a recursive function.
spin <- function(outcomes = 1:5, start = 0L, end = 50L)
if (start <= end)
c(got <- sample(outcomes, 1), Recall(outcomes, start + got, end))
spin()
# [1] 5 4 4 5 1 5 3 2 3 4 4 1 5 4 3
Although elegant, it won't be as fast as an improved version of @Simon's solution that makes a single call to sample, as suggested by @Viktor:
spin <- function(outcomes = 1:5, end = 50L) {
max.spins <- ceiling(end / min(outcomes))
x <- sample(outcomes, max.spins, replace = TRUE)
head(x, match(TRUE, cumsum(x) >= end))
}
spin()
# [1] 3 5 2 3 5 2 2 5 1 2 1 5 5 5 2 4
For your ultimate goal (find the probability of one person being in the lead for the entire game), it is debatable whether while will be more efficient or not: a while loop is certainly slower, but you may benefit from the possibility of exiting early as the lead switches from one player to the other. Both approaches are worth testing.
You can use a while statement and a variable total for keeping track of the sum:
total <- 0
while(total <= 50){
takeaspin<-sample(SpacesOnSpinner,1,replace=TRUE)
total <- takeaspin + total
}
print (total)
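If you also want the position after every spin and the number of spins, the while loop can record them as it goes. This is a sketch under a few assumptions: the game stops once the total is >= 50, the seed is arbitrary and only there for reproducibility, and the names finish, positions and n_spins are made up for illustration.

```r
set.seed(1)                    # arbitrary seed, for reproducibility only
finish <- 50L
positions <- integer(finish)   # preallocated: each spin adds at least 1, so 50 spins is the maximum
n_spins <- 0L
total <- 0L

while (total < finish) {
  total <- total + sample(5L, 1)   # one spin of a spinner labeled 1-5
  n_spins <- n_spins + 1L
  positions[n_spins] <- total      # running position after this spin
}

positions <- positions[seq_len(n_spins)]   # drop the unused slots
```

After the loop, n_spins holds the number of spins taken and positions holds the running total after each one.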

Using merge.zoo to dynamically create variables in R

I'm trying to create a function that automatically creates polynomials of a zoo object. Coming from Python, the typical way to do it is to create a list outside a for loop, and then append to the list inside the loop. Following this, I wrote the below code in R:
library("zoo")
example<-zoo(2:8)
polynomial<-function(data, name, poly) {
##creating the catcher object that the polynomials will be attached to
returner<-data
##running the loop
for (i in 2:poly) {
#creating the polynomial
poly<-data^i
##print(paste(name, i), poly) ##done to confirm that paste worked correctly##
##appending the returner object
merge.zoo(returner, assign(paste(name, i), poly))
}
return(returner)
}
#run the function
output<-polynomial(example, "example", 4)
However, when I run the function, R throws no exceptions, but the output object does not have any additional data beyond what I originally created in the example zoo object. I suspect I'm misunderstanding merge.zoo, or perhaps I'm not allowed to dynamically reassign the names of the polynomials inside the loop.
Thoughts?
As for the error in your code: you never assign the result of merge.zoo back to returner, so the merged object is discarded on every iteration.
However, I think there is a better way to achieve what you want.
example <- zoo(2:8)
polynomial <- function(data, name, poly) {
res <- zoo(sapply(1:poly, function(i) data^i))
names(res) <- paste(name, 1:poly)
return(res)
}
polynomial(example, "example", 4)
## example 1 example 2 example 3 example 4
## 1 2 4 8 16
## 2 3 9 27 81
## 3 4 16 64 256
## 4 5 25 125 625
## 5 6 36 216 1296
## 6 7 49 343 2401
## 7 8 64 512 4096

How to efficiently sum over levels defined in another variable?

I am new to R. Now I have a function as follows:
funItemAverRating = function()
{
itemRatingNum = array(0, itemNum);
print("begin");
apply(input, 1, function(x)
{
itemId = x[2]+1;
itemAverRating[itemId] <<- itemAverRating[itemId] + x[3];
itemRatingNum[itemId] <<- itemRatingNum[itemId] + 1;
}
);
}
In this function input is a n*3 data frame, n is ~6*(10e+7), itemRatingNum is a vector of size ~3*(10e+5).
My question is why the apply function is so slow (it would take nearly an hour to finish)? Also, as the function runs, it uses more and more memory. But as you can see, the variables are all defined outside the apply function. Can anybody help me?
cheng
It's slow because you call high-level R functions many times.
You have to vectorize your function, meaning that most operations (like <- or +1) should be computed over all data vectors.
For example it looks to me that itemRatingNum holds frequencies of input[[2]] (second column of input data.frame) which could be replaced by:
tb <- table(input[[2]]+1)
itemRatingNum[as.integer(names(tb))] <- tb
Don't do that. You're following a logic that is completely not R-like. If I understand it right, you want to add to a certain itemAverRating vector a value from a third column in some input dataframe.
What itemRatingNum is doing, is rather obscure. It does not end up in the global environment, and it just becomes a vector filled with frequencies at the end of the loop. As you define itemRatingNum within the function, the <<- assignment will also assign it within the local environment of the function, and it will get destroyed when the function ends.
Next, you should give your function input, and get some output. Never assign to the global environment if it's not necessary. Your function is equivalent to the following, much faster, function, which takes input and gives output:
funItemAverRating = function(x,input){
sums <- rowsum(input[,3],input[,2])
sumid <- as.numeric(rownames(sums))+1
x[sumid]+c(sums)
}
FUNCTION EDITED PER MAREKS COMMENT
Which works like :
# make data
itemNum <- 10
set.seed(12)
input <- data.frame(
a1 = rep(1:10,itemNum),
a2 = sample(9:0,itemNum*10,TRUE),
a3 = rep(10:1,itemNum)
)
itemAverRating <- array(0, itemNum)
itemAverRating <- funItemAverRating(itemAverRating,input)
itemAverRating
0 1 2 3 4 5 6 7 8 9
39 65 57 36 62 33 98 62 60 38
If I try your code, I get :
> funItemAverRating()
[1] "begin"
...
> itemAverRating
[1] 39 65 57 36 62 33 98 62 60 38
Which is the same. If you want itemRatingNum, then just do :
> itemRatingNum <- table(input[,2])
0 1 2 3 4 5 6 7 8 9
6 11 11 8 10 6 18 9 13 8
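Putting the two pieces together, here is a sketch that computes the rating sum, the rating count and the average per item without any explicit loop. The column names user, item and rating are made up for illustration (the original data frame only has positional columns), and item ids are assumed to run from 0 to itemNum - 1 as in the question.

```r
set.seed(12)
itemNum <- 10
input <- data.frame(
  user   = rep(1:10, itemNum),
  item   = sample(0:9, itemNum * 10, replace = TRUE),
  rating = rep(10:1, itemNum)
)

ids <- input$item + 1                      # shift 0-based item ids to 1-based positions

# Sum of ratings per item; items that never occur keep a sum of 0
sums <- numeric(itemNum)
rs <- rowsum(input$rating, ids)
sums[as.integer(rownames(rs))] <- rs[, 1]

# Number of ratings per item
counts <- tabulate(ids, nbins = itemNum)

# Average rating per item (NaN for items without any rating)
averages <- sums / counts
```

rowsum and tabulate each make a single pass over the data in compiled code, which is where the speed-up over the row-wise apply comes from.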
