I set myself a little challenge on my way to learning R. The question was: given a sample of 500 numbers from a normal distribution with mean 20, how many numbers under 20 would I get for standard deviations from 6 to 10? Just to learn a bit more, I decided to draw 4 samples for each sd. So by the end I should have:
sd6samp1:...
sd6samp2:...
....
sd10samp4:...
My first approach, which worked, was:
ddss <- c(6:10)  # sd's
sam <- c(1:4)    # 4 samples for each
k <- 0           # counter starts at 0
for (i in ddss) {   # for each sd
  for (j in sam) {  # for each sample
    nam <- paste("sam", i, ".", j, sep="")  # building a name
    n <- assign(nam, rnorm(500, 20, i))     # the great assign function
    k <- k + sum(n <= 0)                    # count the negatives in this sample
  }
  print(assign(paste("ds", i, sep=""), k))  # ohh assign you're great
  k <- 0  # reset counter
}
While looking for how to create variable names with the looping 'i', I found that 'assign' does the job, but the help also said:
Note though that if you are planning some simulations,
many guRus would say that you should use a list.
So I thought it would be good to learn lists...
In the meanwhile I also discovered another great option...
ddss <- c(6:10)
for (i in ddss) {
print(paste('prob. x<=0, with sd=',i))
print(pnorm(0,mean=20,sd=i)*500)
}
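(As an aside, and not in the original post: pnorm() is vectorized over sd, so the same expected counts come out of a one-liner.)
pnorm(0, mean=20, sd=6:10) * 500  # expected number of values <= 0 for each sd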
This worked to answer the question, but the lists were still to be done... and a lot of R had yet to be learned. The main idea wasn't to find the exact probability or number of negatives... but to learn R and specifically some looping.
So, I've been trying to go with the aforementioned lists. My closest approach has been:
ddss <- c(6:10)  # sd's to be calculated
sam <- c(1:4)    # 4 samples for each sd
liss <- list()   # initializing the list
for (i in ddss) {   # for each sd
  liss[[i]] <- list()
  for (j in sam) {  # for each sample
    liss[[i]][[j]] <- rnorm(500, 20, i)
    print(paste('ds', i, 'samp', j, '=', sum(liss[[i]][[j]] < 0)))
  }
}
With this one I get the information, but I'm wondering about two issues (1 & 2) and some other questions (3 & 4):
1) I get a list of 10 elements: the first 5 empty and then 5 holding the sublists. I can't seem to find out how to work with just the filled elements (positions 6:10) through the sd values themselves.
2) Even though I tried, I couldn't manage to name the list elements from within the 'for' loops. Any insight on these issues would be great.
3) In this context of simulations, what do you think is better: nested lists (lists with sublists) or simple (longer) lists?
4) I wondered whether the 'apply' functions would be of any help here. I tried to do something like:
vbv<-matrix(c(6,6,6,6,7,7,7,7,8,8,8,8,9,9,9,9))
lsl<-apply(vbv, 2, function(x) rnorm(500,20,x))
But it looks like I'm not even getting close...
Thanks for your time if you've read this far!
You may as well take some more to reply ;-).
The problem is in your indexes: you are running the indexer i over ddss, which goes from 6 to 10. So in the first tour of duty of your outer loop, your first statement really says liss[[6]] <- list(), implying that the first 5 elements are NULL.
So if you insist on working with loops, this is what you should do (check ?seq_along):
ddss <- c(6:10)  # sd's to be calculated
sam <- c(1:4)    # 4 samples for each sd
liss <- list()   # initializing the list
for (i in seq_along(ddss)) {  # now, i runs from 1 to 5
  liss[[i]] <- list()
  for (j in sam) {            # for each sample
    liss[[i]][[j]] <- rnorm(500, 20, ddss[i])  # note: the sd is ddss[i], not i
    print(paste('ds', ddss[i], 'samp', j, '=', sum(liss[[i]][[j]] < 0)))
  }
  names(liss[[i]]) <- as.character(sam)  # this should solve your naming issue (1/2)
}
names(liss) <- as.character(ddss)  # this should solve your naming issue (2/2)
Note that, as always, it is a good idea to name your variables something more useful than i or j: if you'd named it curds, maybe you wouldn't have used it immediately as an indexer in a list?
Now, if you are really aiming for improvement (but want to stick to lists), you indeed want to go with the apply style functions:
liss <- lapply(ddss, function(curds){   # apply the inline function to each sd and store the results in a list
  return(lapply(sam, function(cursam){  # apply the inline function to each sample and store the results in a list
    rv <- rnorm(500, 20, curds)
    cat('ds', curds, 'samp', cursam, '=', sum(rv < 0), "\n")  # maybe better for your purposes
    return(rv)
  }))
})
Finally, for your case there is not a lot of reason to actually use lists (nor do you even need to keep the sampled data for each sd/sample): you could store everything in a three-dimensional array. But since you present this as a learning exercise (hey, maybe the array can be your next exercise :-)), I'll leave it at that.
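For the curious, a minimal sketch of that array idea (my own addition, not part of the original answer): one rnorm() call fills a 500 x 4 x 5 array, and apply() then counts the negatives per sample and per sd in one go.
ddss <- 6:10
sims <- array(
  rnorm(500 * 4 * length(ddss), mean = 20, sd = rep(ddss, each = 500 * 4)),
  dim = c(500, 4, length(ddss)),
  dimnames = list(NULL, paste0("samp", 1:4), paste0("sd", ddss))
)
apply(sims < 0, c(2, 3), sum)  # a 4 x 5 matrix: negatives per sample (rows) and sd (columns)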
lapply() is helpful here, where we can just apply over the set of values for the SD. It helps to write a custom wrapper around the rnorm() function so we can pass in different values for the various arguments of rnorm(), and handle the k replicates (k = 4 in your example) in a nice fashion also. That wrapper is foo() below:
foo <- function(sd, n, mean, reps = 1) {
  rands <- rnorm(n * reps, mean = mean, sd = sd)  # draw all replicates in one go
  if(reps > 1)
    rands <- matrix(rands, ncol = reps)           # one column per replicate
  rands
}
We use it in an lapply() call like so:
sims <- lapply(6:10, FUN = foo, mean = 20, n = 500, reps = 4)
Which gives:
R> str(sims)
List of 5
$ : num [1:500, 1:4] 30.3 22 15.6 20 19.4 ...
$ : num [1:500, 1:4] 20.9 21.7 17.7 35 30 ...
$ : num [1:500, 1:4] 17.88 26.48 5.19 19.25 15.59 ...
$ : num [1:500, 1:4] 27.41 12.72 9.38 35.09 11.08 ...
$ : num [1:500, 1:4] 16.2 11.6 20.5 35.4 27.3 ...
We can then compute the number of observations < 20 per SD:
names(sims) <- paste("SD", 6:10, sep = "")
out <- lapply(sims, function(x) colSums(x < 20))
Which gives:
R> out
$SD6
[1] 218 251 253 227
$SD7
[1] 250 242 233 232
$SD8
[1] 258 241 246 274
$SD9
[1] 252 245 249 258
$SD10
[1] 253 259 241 242
@Joris suggests I show how to access elements of the list. For example, if you want the results of the simulations for SD = 10, we could do out[[5]], because 10 was the 5th value in the vector of SDs we applied over; or, because I named the elements of the output list, we can ask for the results of that simulation using out[["SD10"]].
To answer some of the specific points about your loops etc.:
To add names to a list, use names(), e.g. names(mylist) <- c("foo","bar"). You'd be better off in your loop calling names() once per iteration to set up the names in a single shot; you probably wouldn't want to fill the names in one at a time as you go along, as that would be inefficient.
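For instance, setting all the names in one shot might look like this (an illustrative sketch, not from the original answer):
liss <- lapply(6:10, function(s) rnorm(500, 20, s))  # one element per sd
names(liss) <- paste("SD", 6:10, sep = "")           # all names set in a single shot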
I don't think it makes too much difference whether you use a nested list or a list containing a matrix as per my example. To alter foo() to return a list so the output of lapply() is a list of lists, we could do:
bar <- function(sd, n, mean, reps = 1) {
  rands <- rnorm(n * reps, mean = mean, sd = sd)
  if(reps > 1)
    rands <- split(rands, rep(seq_len(reps), each = n))  # one list element per replicate
  rands
}
sims2 <- lapply(6:10, FUN = bar, mean = 20, n = 500, reps = 4)
names(sims2) <- paste("SD", 6:10, sep = "")
out2 <- lapply(sims2, function(x) sapply(x, function(y) sum(y < 20)))
which gives the same output as before.
I am going to throw in another solution using the plyr package, which I think is tailor-made for such exercises.
library(plyr)
# generate a data frame of parameters, repeating some as required
parameters = data.frame(mean = 20, sd = rep(6:10, each = 4))
# generate sample data for each combination of parameters
sample_data = mdply(parameters, rnorm, n = 500)
# generate the answer by counting the number of observations less than 20
answer = data.frame(
  parameters,
  obs_less_20 = rowSums(sample_data[, -c(1, 2)] < 20)
)
head(answer)
head(answer)
mean sd obs_less_20
1 20 6 247
2 20 6 250
3 20 6 242
4 20 6 259
5 20 7 240
6 20 7 237
I am trying to write a program in R to sum n random numbers. However, when I try it for some numbers it won't work.
For example,
## rm(list=ls())
random.sum <- function(n) {
  x[1:n] <- ceiling(10*runif(n))
  cat("x:", x[1:n], "\n")
  return(sum(x))
}
x <- rep(100, 10)
show(random.sum(10))
show(random.sum(5))
When I try to sum 10 random numbers it gives me the correct sum, which is
show(random.sum(10))
x: 1 3 10 1 3 2 8 6 7 9
[1] 50
However, when I try it for the next one, which is 5, it won't work:
show(random.sum(5))
x: 7 5 6 2 9
[1] 529
I am not sure what I am doing wrong.
The easiest way would be something like this (updated as per @Axeman's comment):
sum(sample(1:10, 10, replace = TRUE))
where the second argument, 10, is your n, and the vector 1:10 defines the range of values to sample from.
Also keep x local to the function:
random.sum <- function(n) {
  x <- sample(1:10, n, replace = TRUE)  # draw exactly n values
  cat("x:", x, "\n")
  return(sum(x))
}
The reason for your error is R's variable scoping rules. Your global variable x is copied when you modify it inside the function, but the copy keeps the length of the global declaration, so the elements beyond position n still hold their old values of 100. If you sum over only the first n elements with sum(x[1:n]), you will get the correct answer.
Now, that begs the question: are you trying to modify the global object x inside the function? If that is your intent, the superassignment operator <<- can be used. See the R intro, section 10.5 "Assignments within functions", for details.
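For illustration (a hypothetical example of my own, not from the question), this variant deliberately modifies the global x:
x <- rep(100, 10)
overwrite.head <- function(n) {
  x[1:n] <<- 0  # <<- reaches out and modifies the global x, not a local copy
  invisible(x)
}
overwrite.head(3)
x  # 0 0 0 100 100 100 100 100 100 100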
As part of my program in R, I have to compare a huge number of pairs of sentences with some functions (the one I'm showing here checks, for two sentences with the same number of words, whether exactly one word differs between them).
To make things faster, I have already converted all words into integers, so I am dealing with integer vectors, and the example function is a very simple one:
is_sub_num <- function(a,b){sum(!(a==b))==1}
where a and b are integer vectors such as
a = c(1,2,3); b=c(1,4,3)
is_sub_num(a,b)
# [1] TRUE
My data will be stored in a data.table:
Classes ‘data.table’ and 'data.frame': 100 obs. of 2 variables:
$ ID: int 1 2 3 4 5 6 7 8 9 10 ...
$ V2:List of 100
..$ : int 4 4 3 4
..$ : int 1 2 3 1
the length of each entry may be different (in the example below, the entries are all of size 4)
I have a table with candidate pair IDs, and the corresponding entries in DT are tested with the function above as follows:
is_pair_ok <- function(pair){
  is_sub_num(DT[ID==pair[1],V2][[1]], DT[ID==pair[2],V2][[1]])
}
Here is a simplification of what I'm trying to do:
set.seed(234)
z <- lapply(1:100, function(x) sample(1:4, size=4, replace=TRUE))
is_sub_num <- function(a,b){ sum(!(a==b))==1 }
is_pair_ok <- function(pair){
  is_sub_num(DT[ID==pair[1],V2][[1]], DT[ID==pair[2],V2][[1]])
}
pair_list <- as.data.table(cbind(sample(1:100,10000,replace=TRUE),
                                 sample(1:100,10000,replace=TRUE)))
DT <- as.data.table(1:100)
DT$V2 <- z
colnames(DT) <- c("ID","V2")
print(system.time(tmp <-apply(pair_list,1,is_pair_ok)))
This takes around 22 seconds on my laptop, although it's only 10,000 entries and the functions are very basic.
Do you have any advice on how to speed up the code?
I have delved further into this issue myself, and here is my answer. I think it's an important one.
The code to the answer is below. I have put some new parameters to make the problem a bit more general.
The key point is to use the unlist function.
Whenever we use apply on a list object, we get very bad performance in R. It's a bit of a pain to flatten the objects and do manual indexing into a vector, but the speedup is phenomenal.
set.seed(234)
N <- 100
nobs <- 10000
z <- lapply(1:N, function(x) sample(1:4, size=sample(3:5,1), replace=TRUE))
is_sub_num <- function(a,b){ sum(!(a==b))==1 }
is_pair_ok <- function(pair){
  is_sub_num(DT[ID==pair[1],V2][[1]], DT[ID==pair[2],V2][[1]])
}
is_pair_ok1 <- function(pair){
  is_sub_num(zzz[pos_table[pair[1]]:(pos_table[pair[1]]+length_table[pair[1]]-1)],
             zzz[pos_table[pair[2]]:(pos_table[pair[2]]+length_table[pair[2]]-1)])
}
pair_list <- as.data.table(cbind(sample(1:N,nobs,replace=TRUE),
                                 sample(1:N,nobs,replace=TRUE)))
DT <- as.data.table(1:N)
DT$V2 <- z
setnames(DT, c("ID","V2"))
setkey(DT, ID)
length_table <- sapply(z, length)
pos_table <- c(0, cumsum(length_table)) + 1  # start position of each vector inside zzz
zzz <- unlist(z)
print(system.time(tmp_ref <- apply(pair_list,1,is_pair_ok)))
print(system.time(tmp <- apply(pair_list,1,is_pair_ok1)))
identical(tmp,tmp_ref)
Here is the output:
   user  system elapsed
  20.96    0.00   20.96
   user  system elapsed
   0.70    0.00    0.71
There were 50 or more warnings (use warnings() to see the first 50)
[1] TRUE
EDIT
It would be a bit too long to post here, but I tried to draw conclusions from the above and speed up the source code of my actual program using unlist and manual indexing.
The new implementation is actually slower, which came as a surprise to me, and I fail to understand why...
This version saves about 60% of the time:
library(data.table)
set.seed(234)
is_sub_num <- function(a,b) sum(!(a==b))==1
is_pair_ok2 <- function(p1, p2) is_sub_num(DT[p1,V2][[1]],DT[p2,V2][[1]])
DT <- as.data.table(1:100)
DT$V2 <- lapply(1:100, function(x) sample(1:4,size=4,replace=TRUE))
setnames(DT, c("ID","V2"))
setkey(DT, ID)
pair_list <- as.data.table(cbind(sample(1:100,10000,replace=TRUE),sample(1:100,10000,replace=TRUE)))
print(system.time(tmp <- mapply(FUN=is_pair_ok2, pair_list$V1, pair_list$V2)))
The biggest effect came from setting the key on DT and using fast indexing in is_pair_ok2().
A little bit more can be gained by dropping the helper is_sub_num():
is_pair_ok3 <- function(p1, p2) sum(DT[p1,V2][[1]]!=DT[p2,V2][[1]])==1
print(system.time(tmp <- mapply(FUN=is_pair_ok3, pair_list$V1, pair_list$V2)))
I can't imagine it should be that difficult, but probably, coming from Python, my mindset is biased.
I know I'm going to carry out 50 calculations, and the result of each calculation, together with the two parameters characterizing it, should build up a data frame.
So my approach is to instantiate the data frame and then add the results as they become available. Please see the indicated row below:
# Number of simulations
nsim = 50
# The data frame which should carry the calculation (parameters and solutions).
sol <- data.frame(col.names=c("ni", "Xbar", "n"))
# Fifty values for n.
n <- seq.int(5, 5000, length.out=nsim)
for(ni in n)
{
  # A random sample containing possible duplicates.
  X <- sample(seq(-ni, ni, length=ni+1), replace=T)
  Xbar <- round(mean(X), 3)
  sol <- rbind(sol, c(ni, Xbar, n)) # <<-- How to do this correctly??
}
This doesn't work.
There are two ways to do this correctly. One is to pre-define your data.frame (its size) and then populate it iteratively in a for-loop:
nsim <- 10 # reduce to 10 to simplify output
n <- seq.int(5, 5000, length.out=nsim)
sol <- setNames(data.frame(matrix(nrow=nsim, ncol=3)), c("ni", "Xbar", "n"))
set.seed(1) # for reproducibility
for(ni in seq_along(n)) {
  Xbar <- round(mean(sample(seq(-n[ni], n[ni], length=n[ni]+1), replace=T)), 3)
  sol[ni,] <- c(ni, Xbar, n[ni])
}
Alternatively, you can use sapply on your n vector to create the vector of results and then bind everything together into a data frame:
set.seed(1) # for reproducibility
sol <- data.frame(
  ni = seq_along(n),
  Xbar = sapply(n, function(ni) {
    round(mean(sample(seq(-ni, ni, length=ni+1), replace=T)), 3)
  }),
  n = n
)
Either way, you'll end up with a nice data frame:
> str(sol)
'data.frame': 10 obs. of 3 variables:
$ ni : num 1 2 3 4 5 6 7 8 9 10
$ Xbar: num 0.667 -0.232 -14.599 -26.026 36.51 ...
$ n : num 5 560 1115 1670 2225 ...
1) Check what your initial sol contains.
> sol <- data.frame(col.names=c("ni", "Xbar", "n"))
> sol
col.names
1 ni
2 Xbar
3 n
Not what you want. See this question.
2) Make sure seq.int does what you expect - check the documentation (or just the output) of seq.int. E.g., look at what n contains:
> n
[1] 5.0000 106.9388 208.8776 310.8163 412.7551 514.6939 616.6327
[8] 718.5714 820.5102 922.4490 1024.3878 1126.3265 1228.2653 1330.2041
[15] 1432.1429 1534.0816 1636.0204 1737.9592 1839.8980 1941.8367 2043.7755
[22] 2145.7143 2247.6531 2349.5918 2451.5306 2553.4694 2655.4082 2757.3469
[29] 2859.2857 2961.2245 3063.1633 3165.1020 3267.0408 3368.9796 3470.9184
[36] 3572.8571 3674.7959 3776.7347 3878.6735 3980.6122 4082.5510 4184.4898
[43] 4286.4286 4388.3673 4490.3061 4592.2449 4694.1837 4796.1224 4898.0612
[50] 5000.0000
Is that what you meant?
3) Given (1), the problems are not surprising, but in any case just run the first pass through the loop one line at a time and see what happens:
nsim = 50
sol <- data.frame(col.names=c("ni", "Xbar", "n"))
ni=5
X <- sample(seq(-ni, ni, length=ni+1), replace=T)
Xbar <- round(mean(X), 3)
sol <- rbind(sol, c(ni, Xbar, n))
print(sol)
Gives:
Warning message:
In `[<-.factor`(`*tmp*`, ri, value = 5) :
invalid factor level, NA generated
> print(sol)
col.names
1 ni
2 Xbar
3 n
4 <NA>
Now the behavior is unsurprising; we can't add three columns to something with one column.
4) You don't want to do it this way anyway. It's better to initialize sol to be its final size and then fill it in.
See, for example, this answer
However, the more common R idiom would be to avoid loops where possible; there are a number of functions that will let you create the whole thing at once.
First of all, can you clarify the expected output format?
On modifying the code to generate a data frame, the following output is produced (let me know if this is what you expect; it is then not difficult to generate):
ni Xbar n
10.000 2.182 12.000
If this is what you expect, then one way to do this would be as follows:
Step 1: Create Vectors
Step 2: Create Data frame from the above vectors
Step 3: Run your operations in a loop & fill in row by row.
nsim <- 50
n <- seq.int(5, 5000, length.out=nsim)
ni <- vector(mode='numeric', length=nsim)
Xbar <- vector(mode='numeric', length=nsim)
out <- data.frame(ni=ni, Xbar=Xbar, n=n)
for (i in seq_along(n)){
  X <- sample(seq(-n[i], n[i], length=n[i]+1), replace=T)
  out[i,'Xbar'] <- round(mean(X), 3)
  out[i,'ni'] <- n[i]
}
R textbooks continue to promote the use of lapply instead of loops. This is easy even for functions with arguments like
lapply(somelist, f, a=1, b=2)
but what if the arguments change depending on the list element?
Assume my somelist consists of:
somelist$USA
somelist$Europe
somelist$Switzerland
plus there is anotherlist with the same regions, and I want to use lapply with these changing arguments. This could be useful when f is, for example, a ratio calculation:
lapply(somelist, f, a= somelist$USA, b=anotherlist$USA)
Is there a way, except for a loop, to run through these regions efficiently?
EDIT:
My problem seems to be that I tried to use a previously written function without indices...
ratio <- function(a,b){
  z <- (b-a)/a
  return(z)
}
which led to
lapply(data,ratio,names(data))
which does not work. Maybe others can also learn from this mistake.
Apply over list names rather than list elements. E.g.:
somelist <- list('USA'=rnorm(10), 'Europe'=rnorm(10), 'Switzerland'=rnorm(10))
anotherlist <- list('USA'=5, 'Europe'=10, 'Switzerland'=4)
lapply(names(somelist), function(i) somelist[[i]] / anotherlist[[i]])
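An equivalent option (my addition, not part of the original answer) is Map(), which walks over both lists in parallel and saves the explicit indexing:
Map(`/`, somelist, anotherlist)  # divides matching elements of the two lists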
EDIT:
You also ask if there is a way "except for a loop" to do this "efficiently". Note that the apply will not necessarily be more efficient: efficiency will mostly be determined by how quick your inner function is. If you want to operate on each element of a list, you will need a loop, whether it is hidden in an apply() call or not. Check this question: Is R's apply family more than syntactic sugar?
The example I gave above can be re-written as a for loop, and you can make some naive benchmarks:
fun1 <- function(){
  lapply(names(somelist), function(i) somelist[[i]] / anotherlist[[i]])
}
fun2 <- function(){
  for (i in names(somelist)){
    somelist[[i]] <- somelist[[i]] / anotherlist[[i]]
  }
  return(somelist)
}
library(rbenchmark)
benchmark(fun1(), fun2(),
columns=c("test", "replications",
"elapsed", "relative"),
order="relative", replications=10000)
The output of the benchmark on my machine was this:
test replications elapsed relative
1 fun1() 10000 0.145 1.000000
2 fun2() 10000 0.148 1.020690
Although this is not a real-world application and the functions are not realistic tasks, you can see that the difference in computation time is negligible.
You just need to work out what to lapply() over. Here the names() of the lists suffice, after we rewrite f() to take different arguments:
somelist <- list(USA = 1:10, Europe = 21:30,
Switzerland = seq(1, 5, length = 10))
anotherlist <- list(USA = list(a = 1, b = 2), Europe = list(a = 2, b = 4),
Switzerland = list(a = 0.5, b = 1))
f <- function(x, some, other) {
  (some[[x]] + other[[x]][["a"]]) * other[[x]][["b"]]
}
lapply(names(somelist), f, some = somelist, other = anotherlist)
Giving:
R> lapply(names(somelist), f, some = somelist, other = anotherlist)
[[1]]
[1] 4 6 8 10 12 14 16 18 20 22
[[2]]
[1] 92 96 100 104 108 112 116 120 124 128
[[3]]
[1] 1.500000 1.944444 2.388889 2.833333 3.277778 3.722222 4.166667 4.611111
[9] 5.055556 5.500000
I am new to using R, so I am not sure about how to use apply. I would like to speed up my function by using apply:
for(i in 1:ncol(exp)){
  for (j in 1:length(fe)){
    tmp = TRUE
    id = strsplit(colnames(exp)[i], "\\.")
    if(id == fe[j]){
      tmp = FALSE
    }
    if(tmp == TRUE){
      only = cbind(only, c(names(exp)[i], exp[,i]))
    }
  }
}
How can I use the apply function to do this above?
EDIT:
Thank you so much for the very good explanation, and sorry for my bad description. You guessed everything right, but I want to delete the matches in fe.
Exp <- data.frame(A.x=1:10,B.y=10:1,C.z=11:20,A.z=20:11)
fe<-LETTERS[1:2]
then the result should contain only the column whose name starts with 'C'. Everything else should be deleted.
1 C.z
2 11
3 12
4 13
5 14
6 15
7 16
8 17
9 18
10 19
11 20
EDIT: If you only want to delete the columns whose names appear in fe, you can simply do:
Exp <- data.frame(A.x=1:10,B.y=10:1,C.z=11:20,A.z=20:11)
fe<-LETTERS[1:2]
id <- sapply(strsplit(names(Exp), "\\."),
             function(i) !i[1] %in% fe)
Exp[id]
This code does exactly what your (updated) for-loop does, only a lot more efficiently. You don't have to loop through fe; the %in% function is vectorized.
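For instance (a tiny illustration of my own, reusing the fe defined above):
c("A", "B", "C") %in% fe  # TRUE TRUE FALSE: each value is checked against all of fe at once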
In case the name can appear anywhere between the dots, then
id <- sapply(strsplit(names(Exp), "\\."),
             function(i) sum(i %in% fe) == 0)
Your code does some very funny things, and I have no clue what exactly you're trying to do. For one, strsplit gives a list, so id == fe[j] will always return FALSE unless fe[j] is a list itself. And I doubt it is... So I'd correct your code as
id = strsplit(colnames(Exp)[i],"\\.")[[1]][1]
in case you want to compare with everything that is before the dot, or to
id = unlist(strsplit(colnames(Exp)[i],"\\."))
if you want to compare with everything in the string. In that case, you should use %in% instead of == as well.
Second, what you get is a character matrix, which essentially multiplies rows. If all elements in fe[j] are unique, you could as well do:
mat <- rbind(names(exp), exp)
only <- do.call(cbind, lapply(mat, function(x)
  matrix(rep(x, ncol(exp)-1), nrow=nrow(exp)+1)
))
Assuming that the logic in your code does make sense (as you didn't supply any sample data, this is impossible to know), the optimized version runs:
mat <- rbind(names(Exp), Exp)
do.call(cbind,
  lapply(mat, function(x){
    n <- sum(!fe %in% strsplit(x[1], "\\.")[[1]][1])
    matrix(rep(x, n), nrow=nrow(mat))
  }))
Note that, in case you are interested in whether fe[j] appears anywhere in the name, you can change the code to:
do.call(cbind,
  lapply(mat, function(x){
    n <- sum(!fe %in% unlist(strsplit(x[1], "\\.")))
    matrix(rep(x, n), nrow=nrow(mat))
  }))
If this doesn't return what you want, then your code doesn't do that either. I checked with the following sample data, and all versions give the same result:
Exp <- data.frame(A.x=1:10,B.y=10:1,C.z=11:20,A.z=20:11)
fe <- LETTERS[1:4]
The apply() family of functions are convenience functions. They will not necessarily be faster than a well-written for loop or vectorized functions. For example:
set.seed(21)
x <- matrix(rnorm(1e6),5e5,2)
system.time({
  yLoop <- x[,1]*0 # preallocate result
  for(i in 1:NROW(yLoop)) yLoop[i] <- mean(x[i,])
})
# user system elapsed
# 13.39 0.00 13.39
system.time(yApply <- apply(x, 1, mean))
# user system elapsed
# 16.19 0.28 16.51
system.time(yRowMean <- rowMeans(x))
# user system elapsed
# 0.02 0.00 0.02
identical(yLoop, yApply)   # TRUE
all.equal(yLoop, yRowMean) # TRUE
The reason your code is so slow is that, as Gavin pointed out, you're growing your array on every loop iteration. Preallocate the entire array before the loop and you will see a significant speedup.
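To make the preallocation point concrete, here is a minimal sketch (my own illustration, not from the original answer):
n <- 1e4
grown <- numeric(0)
for(i in 1:n) grown <- c(grown, i)  # copies the whole vector on every iteration
prealloc <- numeric(n)              # allocated once up front
for(i in 1:n) prealloc[i] <- i      # each iteration just fills a slot
identical(grown, prealloc)          # TRUE, but the preallocated loop is far faster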