How to efficiently sum over levels defined in another variable? - r

I am new to R. I have a function as follows:
funItemAverRating = function()
{
  itemRatingNum = array(0, itemNum);
  print("begin");
  apply(input, 1, function(x)
  {
    itemId = x[2] + 1;
    itemAverRating[itemId] <<- itemAverRating[itemId] + x[3];
    itemRatingNum[itemId] <<- itemRatingNum[itemId] + 1;
  });
}
In this function, input is an n*3 data frame with n ~6*(10e+7), and itemRatingNum is a vector of size ~3*(10e+5).
My question is: why is the apply function so slow (it takes nearly an hour to finish)? Also, the function uses more and more memory as it runs. But as you can see, the variables are all defined outside the apply function. Can anybody help me?
cheng

It's slow because you call high-level R functions many times.
You have to vectorize your function, meaning that most operations (like <- or +1) should be computed over whole data vectors at once.
For example, it looks to me like itemRatingNum holds the frequencies of input[[2]] (the second column of the input data.frame), which could be replaced by:
tb <- table(input[[2]]+1)
itemRatingNum[as.integer(names(tb))] <- tb
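The same vectorized idea extends to the averages the question actually asks for. A minimal sketch on toy data (the column names here are assumptions, not from the question):

```r
# Toy stand-in for the question's n*3 input: user, 0-based item id, rating
input <- data.frame(user   = c(1, 2, 3, 4),
                    item   = c(0, 1, 0, 1),
                    rating = c(4, 5, 2, 3))
sums   <- rowsum(input[[3]], input[[2]] + 1)  # per-item rating sums
counts <- table(input[[2]] + 1)               # per-item rating counts
avg    <- c(sums) / as.integer(counts)        # per-item average rating
avg                                           # item 1: (4+2)/2 = 3, item 2: (5+3)/2 = 4
```

rowsum() and table() both group by the same (shifted) item id, so the two results line up element by element.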

Don't do that. You're following a logic that is not R-like at all. If I understand it correctly, you want to add values from the third column of some input data frame into a certain itemAverRating vector.
What itemRatingNum is doing is rather obscure. It does not end up in the global environment; it just becomes a vector filled with frequencies at the end of the loop. As you define itemRatingNum within the function, the <<- assignment will also assign it within the local environment of the function, and it will get destroyed when the function ends.
Next, you should give your function input and get some output back. Never assign to the global environment if it's not necessary. Your function is equivalent to the following function - which is a whole lot faster - that takes input and gives output:
funItemAverRating = function(x, input){
  sums <- rowsum(input[,3], input[,2])
  sumid <- as.numeric(rownames(sums)) + 1
  x[sumid] + c(sums)
}
(Function edited per Marek's comment.)
Which works like:
# make data
itemNum <- 10
set.seed(12)
input <- data.frame(
  a1 = rep(1:10, itemNum),
  a2 = sample(9:0, itemNum*10, TRUE),
  a3 = rep(10:1, itemNum)
)
itemAverRating <- array(0, itemNum)
itemAverRating <- funItemAverRating(itemAverRating,input)
itemAverRating
0 1 2 3 4 5 6 7 8 9
39 65 57 36 62 33 98 62 60 38
If I try your code, I get:
> funItemAverRating()
[1] "begin"
...
> itemAverRating
[1] 39 65 57 36 62 33 98 62 60 38
Which is the same. If you want itemRatingNum, then just do:
> itemRatingNum <- table(input[,2])
0 1 2 3 4 5 6 7 8 9
6 11 11 8 10 6 18 9 13 8


How to Iterate over a column in a data frame to change its values

I have a data frame that looks like:
ID A B
0 8 25
1 16 123
2 4 120
... ...
What I want to do now is to iterate over column 'A', for example, call a function with the value of each cell, and write the result back to the same location.
For example, a function like (x^2)-1:
int calculation(int val){
    return val*val - 1;
}
...code...
while(i < A.length){
    A[i] = calculation(A[i]);
    i++;
}
So the result should look like this:
ID A B
0 63 25
1 255 123
2 15 120
... ...
I am new to R; if you know some good basic guidelines or books for scientific plotting, let me know. :-)
Thanks for your help.
This is a very straightforward task in R.
a <- c(8, 16, 4)
b <- c(25, 123, 120)
df <- data.frame(a, b)
calculation <- function(a){
  a^2 - 1
}
# Method 1
df$a <- (df$a^2) - 1
# Method 2
df$a <- calculation(df$a)
Here is a very simple example using your data frame (say df).
# Define your function as
calculation <- function(x){
  x*x - 1
}
# Suppose you want to call it on column A. Use the sapply function as
df$A <- sapply(df$A, calculation)
result:
ID A
1 0 63
2 1 255
3 2 15
The beauty of R is its simplicity.
If you would like to apply some operation to column A:
df[["A"]] = ((df[["A"]]*df[["A"]]) - 1)
Also, I would strongly recommend googling "vectorization in R".
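As a footnote on the vectorization advice above, here is a minimal sketch of the loop-versus-vectorized contrast (toy data, not from the question):

```r
a <- runif(1e5)  # a toy numeric vector

# Loop version: one R-level function call per element
f_loop <- function(v) {
  out <- numeric(length(v))            # preallocate the result
  for (i in seq_along(v)) out[i] <- v[i] * v[i] - 1
  out
}

# Vectorized version: a single operation over the whole vector
f_vec <- function(v) v * v - 1

identical(f_loop(a), f_vec(a))  # same result; f_vec is far faster
```

Both produce exactly the same numbers; the vectorized form simply moves the per-element loop from interpreted R into compiled code.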

How to speed up this function (for n parameters) in R?

I have this function:
col <- 0
rres <- data.frame(matrix(nrow=nrow(ind),ncol=length(lt)))
gig <- NULL
> lt
[1] 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
> delta.1
[1] 5 7 9 10 12 15 17 20 22 26 29 34 39 46 54 68 96 138 138
> f.bio
function(x, y, a, b, l, k, m)
{
  for (t in 1:nrow(y)){
    for (i in 1:length(lt)){
      for (j in 1:delta.1[i]){
        ifelse(t+j-1 > nrow(x), gig[j] <- NA,
               gig[j] <- x[t+j-1,i]*
                 (a*(l-(((l-(lt[i]+1))/(exp(-k*((j-1)/12))))))^b)*
                 exp(m[(1+j),i]*(j-1)))
      }
      rres[t,i] <- sum(gig, na.rm = TRUE)
    }
    result <- apply(rres, 1, function(x) sum(x)/1000000)
  }
  return(result)
}
which is applied to some biological data; the call is:
f.bio(ind,eff,a_all,b_all,Linf,K_coef,mort)
where the arguments are:
> dim(ind)
[1] 1356 19
> dim(eff)
[1] 1356 1
a_all = 0.004
b_all= 3
Linf= 19.4
K_coef = 0.57
> dim(mort)
[1] 110 19
ind, eff, and mort are data frames.
Now, my question is: is it possible to apply this function over n parameter values without excessive machine time?
By n parameters I mean a distribution for certain parameters, for example:
set.seed(1)
a_all_v <- round(sort(rnorm(40,a_all,0.00034)),5) #40 values!!
and so on for the 4 parameters: a_all, b_all, K_coef, Linf.
I wrote this code with a loop (in this loop I can combine a_all with b_all, and Linf with K_coef):
col <- 0
for (m1 in 1:length(a_all_v)){
  a_all <- a_all_v[m1]
  b_all <- b_all_v[m1]
  for (m2 in 1:length(Linf_v)){
    Linf <- Linf_v[m2]
    K_coef <- k_coef_v[m2]
    col <- col + 1
    res.temp <- f.bio(ind, eff, a_all, b_all, Linf, K_coef, mort)
    res.2[,col] <- res.temp
  }
}
where res.2 is:
res.2 <- data.frame(matrix(nrow=1356, ncol=1600)) # 1600 = 40*40 (number of values in each parameter distribution)
This loop takes a huge amount of machine time (many days on my PC). For this reason, is there some package or function (like Monte Carlo or bootstrap methods) that can change my code structure and run the function with a good number of parameter combinations in little time, if that's possible?
If you keep your current setup with for loops, you need to start preallocating your output objects. For example, you start with an empty gig (NULL) and fill it iteratively. The way you do it right now, gig needs to be rebuilt in every iteration as the analysis progresses, and reallocation of memory is a very expensive operation. Simply making gig as large as it needs to be up front and then assigning into it will speed up your code tremendously.
Even better is to solve your problem not via for loops (which are notoriously slow, even with preallocation) but to use either:
- Vectorisation and matrix calculations; these will be orders of magnitude faster.
- dplyr or data.table; if smartly used, these will also be much faster, but vectorisation is probably faster still.
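The preallocation point can be shown with a small self-contained sketch (toy computation, not the question's f.bio):

```r
# Growing an object inside the loop forces R to copy it on every iteration
grow <- function(n) {
  gig <- NULL
  for (j in 1:n) gig <- c(gig, j^2)   # reallocates gig each time
  gig
}

# Preallocating to the final size lets each assignment happen in place
prealloc <- function(n) {
  gig <- numeric(n)                   # allocate once, up front
  for (j in 1:n) gig[j] <- j^2
  gig
}

identical(grow(1000), prealloc(1000))  # same result; prealloc scales far better
```

Timing both versions with system.time() for growing n makes the quadratic cost of the growing version obvious.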

Create master list adding iterations of values in another list with known interval in R

I have a solution that works, but would appreciate ideas to improve the code to avoid using loops if possible.
I have a list of values, this is read in from a csv file, but takes the form
startingvalues = c(1, 7, 20, 32, 47)
I want to create a new list, that reads in each of these starting values and adds the next 2 (or 7 or 15 etc.) numbers, then goes to the next. For the above example this would be
newlist = c(1,2,3,7,8,9,20,21,22,32,33,34,47,48,49)
I have code that works, but I suspect there is a more elegant way to do this. I am not particularly worried about speed but would like to avoid the loop if there is a better way to do this.
newlist = c() # initialise an empty vector
for (i in 1:length(startingvalues)){
  list1 = seq(startingvalues[i], startingvalues[i] + 2, by = 1)
  newlist = c(newlist, list1)
}
Any suggestions to improve my coding would be appreciated. This may be the best way to do this, however I suspect it isn't.
How about something like this
extend <- function(x,y) unlist(lapply(x, seq.int, length.out=y+1))
extend(startingvalues, 2)
# [1] 1 2 3 7 8 9 20 21 22 32 33 34 47 48 49
The first parameter is the vector of numbers and the second is how far you want to extend each number. We just use an lapply for the iteration and unlist the result at the end. This is better than appending at each iteration, which is not very efficient.
Here's another alternative
extend <- function(x,y) c(outer(0:y, x, `+`))
The outer() call builds a matrix, but we coerce it back to a vector with c().
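For completeness, a quick check that the outer() variant reproduces the expected result:

```r
startingvalues <- c(1, 7, 20, 32, 47)

# outer(0:y, x, `+`) builds a (y+1) x length(x) matrix whose columns are
# x[j], x[j]+1, ..., x[j]+y; c() flattens it column by column
extend <- function(x, y) c(outer(0:y, x, `+`))

extend(startingvalues, 2)
# [1]  1  2  3  7  8  9 20 21 22 32 33 34 47 48 49
```

Because c() reads the matrix column-major, each starting value's run comes out contiguous, so no sort is needed.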
We can use rep with + to get the expected output
unique(sort(startingvalues + rep(0:2, each = length(startingvalues))))
#[1] 1 2 3 7 8 9 20 21 22 32 33 34 47 48 49
Or, as @thelatemail mentioned, replicating 'startingvalues' and making use of recycling would be better, as the sort can be avoided:
s1 <- 0:2
rep(startingvalues, each=length(s1)) + s1
#[1] 1 2 3 7 8 9 20 21 22 32 33 34 47 48 49

Using merge.zoo to dynamically create variables in R

I'm trying to create a function that automatically creates polynomials of a zoo object. Coming from Python, the typical way to do this is to create a list outside a for loop and then append to the list inside the loop. Following this, I wrote the code below in R:
library("zoo")
example<-zoo(2:8)
polynomial <- function(data, name, poly) {
  ## creating the catcher object that the polynomials will be attached to
  returner <- data
  ## running the loop
  for (i in 2:poly) {
    ## creating the polynomial
    poly <- data^i
    ## print(paste(name, i), poly)  ## done to confirm that paste worked correctly
    ## appending the returner object
    merge.zoo(returner, assign(paste(name, i), poly))
  }
  return(returner)
}
#run the function
output<-polynomial(example, "example", 4)
However, when I run the function, R throws no exceptions, but the output object does not contain any additional data beyond what I originally created in the example zoo object. I suspect I'm misunderstanding merge.zoo, or perhaps I'm not allowed to dynamically reassign the names of the polynomials inside the loop.
Thoughts?
As for the error in your code: you are missing the assignment of the result from merge.zoo back to returner.
However, I think there is better way to achieve what you want.
example <- zoo(2:8)
polynomial <- function(data, name, poly) {
  res <- zoo(sapply(1:poly, function(i) data^i))
  names(res) <- paste(name, 1:poly)
  return(res)
}
polynomial(example, "example", 4)
## example 1 example 2 example 3 example 4
## 1 2 4 8 16
## 2 3 9 27 81
## 3 4 16 64 256
## 4 5 25 125 625
## 5 6 36 216 1296
## 6 7 49 343 2401
## 7 8 64 512 4096
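To make the missing-assignment point concrete without depending on the zoo package, here is a sketch of the same pattern with cbind() standing in for merge.zoo(): merge-like functions return a new object rather than modifying their arguments, so the result must be assigned back.

```r
# Loop version of the question's function, with the fix applied:
# the combined object is assigned back to returner each iteration
polynomial_loop <- function(data, poly) {
  returner <- data
  for (i in 2:poly) {
    returner <- cbind(returner, data^i)  # assign the merged result back
  }
  returner
}

out <- polynomial_loop(2:8, 4)
dim(out)  # 7 observations by 4 columns: data, data^2, data^3, data^4
```

In the question's code, dropping the assignment means each merge.zoo() result is computed and immediately discarded, which is exactly why the output never grows.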

bestfit nonlinear to a list of dataframes

I have a list of data frames, and I would like to calculate a nonlinear best fit for each data frame in the list and get back a list of the best-fit objects. I am trying to use lapply but I am having problems with the parameters.
# define a function for D
fncTtr <- function(n,d) (d/n)*((sqrt(1+2*(n/d))-1))
# define a function for best fit
bestFitD <- function(dat,fun) {
  res <- nls(dat$ttr ~ fun(n,d), data = dat, start = list(d = 25), trace = T)
  return(res)
}
resL <- lapply(dData2,bestFitD,dat=dData2,fun=fncTtr)
When I execute this code I am getting the following error:
Error in FUN(X[[1L]], ...) : unused argument(s) (X[[1]])
I want lapply to go through each data frame in dData2 and execute the bestFitD function. How should I define the parameters for the function in lapply? The dData2 object is a list; I am using it as a parameter for bestFitD, but that function expects a single data frame as a parameter. How can I define this parameter? When I execute the function bestFitD by itself with one data frame, it runs correctly. Here is an example of a dData2 list with components that are data frames:
$`1`
n ttr d id
1 35 0.6951 27.739 1
2 36 0.6925 28.072 1
3 37 0.6905 28.507 1
4 38 0.6887 28.946 1
5 39 0.6790 28.003 1
6 40 0.6703 27.247 1
7 41 0.6566 25.735 1
8 42 0.6605 26.981 1
9 43 0.6567 27.016 1
10 44 0.6466 26.026 1
11 45 0.6531 27.667 1
12 46 0.6461 27.128 1
13 47 0.6336 25.751 1
14 48 0.6225 24.636 1
15 49 0.6214 24.992 1
16 50 0.6248 26.011 1
$`2`
n ttr d id
17 35 0.6951 27.739 2
18 36 0.6925 28.072 2
19 37 0.6905 28.507 2
20 42 0.6605 26.981 2
The following code seems to be ok:
res <- bestFitD(dData2[[1]],fncTtr)
but when I execute the following:
res <- bestFitD(dData2[[2]],fncTtr)
I get the following error:
Error in model.frame.default(formula = ~dat + ttr + n, data = dat) :
invalid type (list) for variable 'dat'
Why? Both are data frames!
But there seems to be something strange with the second component.
Just get rid of the dat$ in your nls function call; I believe it's looking for dat$dat$ttr, which obviously will break. That is, your bestFitD function should be:
bestFitD <- function(dat,fun) {
  res <- nls(ttr ~ fun(n,d), data = dat, start = list(d = 25), trace = T)
  return(res)
}
Now, call using lapply as:
resL <- lapply(dData2, bestFitD, fun = fncTtr)
This should work:
resL <- lapply(dData2, function(x, fun){
  bestFitD(x, fun)
}, fun = 'fncTtr')
where I rewrite bestFitD using do.call:
bestFitD <- function(dat, fun){
  res <- nls(ttr ~ do.call(fun, list(n, d)), data = dat,
             start = list(d = 25), trace = T)
  res
}
The trace output looks like:
0.003237457 : 25
0.0009393089 : 26.77943
0.0009362902 : 26.84895
0.0009362902 : 26.84898
0.001282807 : 25
4.771935e-05 : 27.64267
4.389588e-05 : 27.80729
4.389584e-05 : 27.80781
EDIT: my solution can be simplified to (similar, but not identical, to Anthony's solution):
lapply(dData2, bestFitD, fun = 'fncTtr')
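Since the question's second component fails to fit, one practical pattern when fitting over a list is to wrap the fit in tryCatch so that a single bad data frame does not abort the whole lapply. This is only a sketch: it uses a toy model linear in one parameter rather than the question's fncTtr, and the toy data are invented for illustration.

```r
# Two frames that fit fine, plus an empty one that makes nls() error
set.seed(1)
good  <- function() data.frame(n = 1:10, ttr = 2 * (1:10) + rnorm(10, sd = 0.1))
dlist <- list(good(), good(), data.frame(n = numeric(0), ttr = numeric(0)))

fits <- lapply(dlist, function(dat) {
  tryCatch(nls(ttr ~ a * n, data = dat, start = list(a = 1)),
           error = function(e) NULL)  # record failed fits as NULL
})

sum(vapply(fits, is.null, logical(1)))  # number of components that failed to fit
```

Afterwards you can inspect which components failed (the NULL entries) and debug those data frames individually, instead of losing all the successful fits.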
