I'm running the function parLapply inside a loop and observing strange behaviour: the time per iteration was increasing significantly, and such an increase didn't make much sense.
So I started timing the functions within the loop to see which one was taking the most time, and I found that parLapply was taking >95% of it. So I timed inside the parLapply function as well, to see whether the times inside and outside the function match. They did not, by quite a large margin. This margin increases over time, and the difference can reach seconds, which makes quite an impact on the time it takes for the algorithm to complete.
while (condition) {
  start.time_1 <- Sys.time()
  predictions <- parLapply(cl, array, function(i) {
    start.time_par <- Sys.time()
    # code
    end.time <- Sys.time()
    time.taken_par <- end.time - start.time_par
    print(time.taken_par)
    return(value)
  })
  end.time <- Sys.time()
  time.taken <- end.time - start.time_1
  print(time.taken)
}
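(A side note on measuring this way, which I should flag as my own caveat rather than something from the question: with a standard PSOCK cluster, anything the workers print() is normally discarded rather than shown on the master console, unless the cluster was created with an outfile, e.g.
cl <- parallel::makeCluster(4, outfile = "") # echo worker output on the master
so the per-element timings may be better collected by returning them as part of each result.)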
I would expect time.taken to be similar to the sum of all the time.taken_par values, but it is not. The sum of all time.taken_par is usually around 0.026 seconds, while time.taken starts out at about 4 times that value, which is fine, but then grows to much more (>5 seconds).
Can anyone explain what is going on, and/or tell me whether what I think should happen is wrong? Is it a memory issue?
Thanks for the help!
Edit:
The output of parLapply is the following. However, in my tests there are 10 lists instead of just 3 as in this example. The size of each individual list returned by parLapply is always the same, in this case 25.
[1] 11
[[1]]
1 2 3 4 5 6 7 8 9 10 11 12 13 14
-0.01878590 -0.03462315 -0.03412670 -0.06016549 -0.02527741 -0.06271799 -0.05429947 -0.02521108 -0.04291305 -0.03145491 -0.08571382 -0.07025075 -0.07704650 0.25301839
15 16 17 18 19 20 21 22 23 24 25
-0.02332236 -0.02521089 -0.01170326 0.41469539 -0.15855689 -0.02548952 -0.02545446 -0.10971302 -0.02521836 -0.09762386 0.02044592
[[2]]
1 2 3 4 5 6 7 8 9 10 11 12 13 14
-0.01878590 -0.03462315 -0.03412670 -0.06016549 -0.02527741 -0.06271799 -0.05429947 -0.02521108 -0.04291305 -0.03145491 -0.08571382 -0.07025075 -0.07704650 0.25301839
15 16 17 18 19 20 21 22 23 24 25
-0.02332236 -0.02521089 -0.01170326 0.41469539 -0.15855689 -0.02548952 -0.02545446 -0.10971302 -0.02521836 -0.09762386 0.02044592
[[3]]
1 2 3 4 5 6 7 8 9 10 11 12 13 14
-0.01878590 -0.03462315 -0.03412670 -0.06016549 -0.02527741 -0.06271799 -0.05429947 -0.02521108 -0.04291305 -0.03145491 -0.08571382 -0.07025075 -0.07704650 0.25301839
15 16 17 18 19 20 21 22 23 24 25
-0.02332236 -0.02521089 -0.01170326 0.41469539 -0.15855689 -0.02548952 -0.02545446 -0.10971302 -0.02521836 -0.09762386 0.02044592
Edit2:
OK, I have found out what the problem was. I have an array that I initialize using vector("list", 10000), and in each iteration of the loop I add a list of lists to this array. This list of lists is 6656 bytes in size, so over the 10000 iterations it doesn't even add up to 0.1 GB. However, as this array starts filling up, the performance of the parallelization starts to degrade. I have no idea why this is happening, as I'm running the script on a machine with 64 GB of RAM. Is this a known problem?
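A hedged guess at the mechanism, since the edit doesn't pin it down: if this loop runs inside a function, the anonymous function passed to parLapply carries its enclosing environment with it, and that environment - which contains the ever-growing result list - gets serialized and shipped to every worker on every call, so each iteration sends more data than the last. A sketch of a workaround, assuming the worker code needs nothing from those local variables:
run <- function(cl, arr) {
  results <- vector("list", 10000)   # grows in this function's environment
  worker <- function(i) {
    # per-element computation goes here
    i^2
  }
  environment(worker) <- globalenv() # don't drag the heavy closure environment along
  for (k in seq_len(10000)) {
    results[[k]] <- parallel::parLapply(cl, arr, worker)
  }
  results
}
The global environment itself is serialized only as a reference, so pointing the worker there keeps the per-call payload constant.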
Related
I'm running R on a Mac (OS X). I have a rather large data frame (imported from a CSV file) that I'm working with:
dim(mydf)
[1] 75848 9
I'm trying to analyse it and find ways of breaking it up into smaller parts, so I need to print at least parts of it out from time to time to get an overview.
However, once I have printed it, R (version 3.1.2) starts working extremely slowly, to the point where I just have to give up and restart it. Then R works normally until I print something large to the console again.
I have tried gc() and rm(list = ls()), but neither improves the speed - and I guess they wouldn't, as it seems to be the printing to the console, not the size of the data frame, that causes the slowness (clogging up memory?).
Is there anything I can do to prevent R from becoming so slow, or do I just have to choose between restarting frequently or giving up printing my data to the console?
Thanks!
Same as you, I wanted to get an overview of my data, but just a little more than the head function would give me. So I wrote a small function that gives the head, middle, and tail of a data set.
hmt <- function(x) { # head, middle, and tail of a data set
  if (is.data.frame(x)) {
    mid <- round(nrow(x) * 0.5)
    middle <- x[(mid - 3):(mid + 3), ]
    return(rbind(head(x), middle, tail(x)))
  }
  x # anything other than a data frame is returned unchanged
}
hmt(cars)
And the result:
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
22 14 60
23 14 80
24 15 20
25 15 26
26 15 54
27 16 32
28 16 40
45 23 54
46 24 70
47 24 92
48 24 93
49 24 120
50 25 85
Hope this is of some help to you.
I don't want to save the huge intermediate results for some of my calculations, and hence want to run some tests without saving these memory-expensive vectors.
Say, during the computation I have a vector of arbitrary length l.
But I don't know what l is, and I can't save the vector in the memory.
Is there a way I can refer the length of the vector, something like
vec[100:END] or vec[100:-1] or vec[100:last]
Please note that vec here is not a variable; it only refers to an intermediate expression that will output a vector.
I know the length, head, and tail functions, and that vec[-(1:99)] is an equivalent expression.
But I actually want to know if there is some way to index from a specified position to the 'end' of the vector.
Thanks!!
I'm probably not understanding your question. If this isn't useful let me know and I'll delete it.
I gather you want to extract the elements of a vector of arbitrary length, from element N to the end, without explicitly storing the vector (which would be required in order to use, e.g., length(vec)). Here are two ways:
N <- 5 # grab element 5 to the end.
set.seed(12)
(1:sample(N:100,1))[-(1:(N-1))]
# [1] 5 6 7 8 9 10 11
set.seed(12)
tail(1:sample(N:100,1),-(N-1))
# [1] 5 6 7 8 9 10 11
Both of these create (temporarily) a sequence of integers of random length (>=5), and extract the elements from 5 to the end without self-referencing.
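A related pattern, which I'll add as a suggestion of my own rather than something from the question: bind the intermediate result to a function argument, so it only ever exists inside the call and can be garbage-collected as soon as the call returns.
from <- function(x, n) x[n:length(x)] # x is evaluated once, indexed, then dropped
from(rnorm(1e6), 100)                 # elements 100 through the end, no named copy kept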
You mentioned memory a couple of times. If you're concerned about memory and assigning large objects, you should take a look at the Memory-limits documentation and the related links. First, there are ways to operate on the language in R. Here I only assign one object, the function f, and use it without making any other assignments.
> f <- function(x, y) x:y ## actually, g <- ":" is only 96 bytes
> object.size(f)
# 1560 bytes
> f(5, 20)[3:7]
# [1] 7 8 9 10 11
> object.size(f)
# 1560 bytes
> f(5, 20)[3:length(f(5, 20))]
# [1] 7 8 9 10 11 12 13 14 15 16 17 18 19 20
> object.size(f)
# 1560 bytes
You can also use an expression to hold an unevaluated function call.
> e <- expression(f(5, 20)) ## but again, g <- ":" is better
> eval(e)
# [1] 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
> eval(e)[6:9]
# [1] 10 11 12 13
> eval(e)[6:length(eval(e))]
# [1] 10 11 12 13 14 15 16 17 18 19 20
> rev(eval(e))
# [1] 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5
Depending on the type of data you're working with, there are ways to avoid using large amounts of memory during a session. Here are a few related to your question.
memory.profile()
gc()
# used (Mb) gc trigger (Mb) max used (Mb)
# Ncells 274711 14.7 531268 28.4 531268 28.4
# Vcells 502886 3.9 1031040 7.9 881084 6.8
?gc is good knowledge to have; I can't really explain it briefly, so it's best to read about it. Also, I just learned about memCompress() and memDecompress() for in-memory compression/storage; here's a look. And if you're working with integer values, telling R so can help save memory - that's what the L suffix is for on the numbers in the rep.int() call.
x <- rep.int(5L, 1e4L)
y <- as.raw(x)
z1 <- memCompress(y)
z2 <- memCompress(y, "b")
z3 <- memCompress(y, "x")
mapply(function(a) object.size(get(a)), c('x','y','z1','z2','z3'))
# x y z1 z2 z3
# 40040 10040 88 88 168
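And to get the data back, a round-trip sketch I'm adding for completeness (memCompress defaults to gzip):
y2 <- memDecompress(z1, "gzip") # recover the original raw vector
identical(y, y2)
# [1] TRUE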
And there is also
delayedAssign("p", rep.int(5L, 1e5L))
which creates a promise object that takes up no memory until it is first evaluated.
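A quick illustration of the laziness (my addition):
delayedAssign("p", rep.int(5L, 1e5L))
# nothing has been computed or allocated yet; the first use forces the promise
length(p)
# [1] 100000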
I'm trying to resample from a list using for loops in R, generating a data frame that records the output of each trial.
The for loops run without error, but I am sure I am making a mistake somewhere, as the result I get for the jth entry should not be among the possible outcomes.
Here's how I am generating my list:
set1 <- rep(0, 237) # repeat 0 237 times
set2 <- rep(1, 33)  # repeat 1 33 times
aa <- c(set1, set2) # put the two vectors together
table(aa)           # a quick count to make sure it is set up right
Now I want to take a random sample of size j out of aa and record how many 0s and 1s I get each time I perform this task (say, over n trials).
Here's how I have set it up:
n <- 1000
j <- 27
output <- matrix(0, nrow = 2, ncol = n)
for (i in 1:n) {
  trial <- sample(aa, j, replace = FALSE)
  counts <- table(trial)
  output[, i] <- counts
}
Checking the output,
table(output[1,])
# 17 18 19 20 21 22 23 24 25 26 27
1 1 9 17 46 135 214 237 205 111 24
table(output[2,])
# 1 2 3 4 5 6 7 8 9 10 27
111 205 237 214 135 46 17 9 1 1 24
I do not think I am getting the right answer from the distribution at the jth value (in this case 27) for either the number of 0s or the number of 1s: its count should be close to 0, as opposed to the high number returned.
Any suggestions as to where I am going wrong would be greatly appreciated.
If you have only 0s in trial, then length(counts)==1 and the value gets recycled when you assign it to output. Try this:
for (i in 1:n) {
  trial <- sample(aa, j, replace = FALSE)
  trial <- factor(trial, levels = 0:1) # keep both levels even when one is absent
  counts <- table(trial)
  output[, i] <- counts
}
Of course, you could more efficiently use rhyper:
table(rhyper(1000, table(aa)[1], table(aa)[2], 27))
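To spell out the mapping, which I'm adding as a gloss: rhyper(nn, m, n, k) draws nn values of the number of 'white balls' obtained when k balls are drawn without replacement from an urn holding m white and n black. So the call simulates the count of 0s in each of 1000 samples of size 27 directly:
zeros <- rhyper(1000, 237, 33, 27) # same distribution as output[1, ]
table(zeros)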
I have just begun using R and have gone through multiple books and sources; they get more and more complex, yet I still have been unable to find a solution to what I think should be quite a basic process.
I have data with 3 columns as shown below. (I am really simplifying everything, to try to get a really clear answer that can be applied to multiple situations.)
min max value
1 5 23
8 15 9
33 35 30
I would like to plot this data on a graph.
By this I mean that every value between 1 and 5, for example, on the x axis should equal 23 on the y axis.
I have tried several things, including assigning the columns to vectors a, b, and c respectively, and generating the correct number of y values with:
y <- rep(c, (b - a + 1))
which works as expected.
The problem then occurs with getting the appropriate x values. I tried:
x <- a:b
but because of the way the : operator works in R, it only uses the first element of each vector.
Now I can make this work by manually typing everything in, like:
x <- c(1:5, 8:15, 33:35)
but I really need an automated way to do this, because I am working with huge datasets of this structure.
I have seen other people with similar issues, but the underlying principle always seems to be buried in vast datasets and entire scripts, so I have been unable to find a good solution to this problem.
If anyone with a little more experience could clear up this issue, I would be hugely grateful!
dat <- read.table(text=
"min max value
1 5 23
8 15 9
33 35 30",
header=TRUE)
I'm still not quite sure what you mean, but maybe:
newdat <- with(dat, data.frame(x = c(min, max), y = rep(value, 2)))
newdat <- plyr::arrange(newdat, x)
plot(y ~ x, type = "s", data = newdat)
It's not clear what you want to do between 5 and 8, or between 15 and 33; another possibility is to plot each row as a separate horizontal segment:
plot(value ~ min, data = dat, xlim = range(c(dat$min, dat$max)), type = "n")
apply(dat, 1, function(x) segments(x[1], x[3], x[2], x[3]))
How about this:
# your data.frame
df <- data.frame(min = c(1, 8, 33), max = c(5, 15, 35), value = c(23, 9, 30))
x <- unlist(apply(df, 1, function(x) x[1]:x[2]))
y <- unlist(apply(df, 1, function(x) rep(x[3], x[2] - x[1] + 1)))
plotdata <- data.frame(x = x, y = y)
plotdata
x y
1 1 23
2 2 23
3 3 23
4 4 23
5 5 23
6 8 9
7 9 9
8 10 9
9 11 9
10 12 9
11 13 9
12 14 9
13 15 9
14 33 30
15 34 30
16 35 30
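A slightly more compact way to build the same x and y (my variant of the idea above, reusing the df just defined):
x <- unlist(Map(seq, df$min, df$max))   # 1:5, 8:15, 33:35 glued together
y <- rep(df$value, df$max - df$min + 1) # each value repeated to match its range
plot(x, y)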
Something like this?
a <- c(1:5, 8:15, 33:35)
b <- c(rep(23, 5), rep(9, 8), rep(30, 3))
plot(a, b, type = "l")
I have a data frame with around 25000 records and 10 columns. I am using code to compute each value from the previous value in the same column (NewVal), based on another column (y) that already holds a percent change.
x <- 1:25000
y <- rpois(25000, 2)
z <- data.frame(x, y)
z[1, "NewVal"] <- z[1, "x"]
So I ran this:
for (i in 2:nrow(z)) { z$NewVal[i] <- z$NewVal[i - 1] + (z$NewVal[i - 1] * (z$y[i] / 100)) }
This takes considerably longer than I expected it to. Granted, I may be an impatient person - as a scathing letter drafted to me once said - but I am trying to escape the world of Excel (after I read http://www.burns-stat.com/pages/Tutor/spreadsheet_addiction.html, which is causing me more problems, as I have begun to mistrust data - that letter also mentioned my trust issues).
I would like to do this without using functions from packages, as I would like to know the formula by which the values are created - or, if you will, I am a demanding control freak, according to that friendly missive.
I would also like to know how to get a moving average, just like rollmean in caTools. Either that, or how do I figure out what its formula is? I tried entering rollmean, and I think it refers to another function (I am new to R). This should probably be another question - but, as that letter said, I never make the right decisions in my life.
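On the moving-average side question, a minimal base-R sketch (my addition; it reproduces the plain k-point running mean that rollmean computes for interior points):
k <- 5                                             # window width
ma <- stats::filter(z$y, rep(1 / k, k), sides = 2) # centred mean, NA at the edges
head(ma, 10)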
The secret in R is to vectorise. In your example you can use cumprod to do the heavy lifting:
z$NewVal2 <- x[1] * cumprod(with(z, 1 + c(0, y[-1] / 100)))
all.equal(z$NewVal, z$NewVal2)
[1] TRUE
head(z, 10)
x y NewVal NewVal2
1 25 4 25.00000 25.00000
2 24 3 25.75000 25.75000
3 23 0 25.75000 25.75000
4 22 1 26.00750 26.00750
5 21 3 26.78773 26.78773
6 20 2 27.32348 27.32348
7 19 2 27.86995 27.86995
8 18 3 28.70605 28.70605
9 17 4 29.85429 29.85429
10 16 2 30.45138 30.45138
On my machine, the loop takes just less than 3 minutes to run, while the cumprod statement is virtually instantaneous.
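Why this works (my gloss): the recurrence NewVal[i] = NewVal[i-1] * (1 + y[i]/100) unrolls to NewVal[i] = NewVal[1] * prod(1 + y[2:i]/100), and cumprod produces all of those running products in one vectorised pass.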
I got about an 800-fold improvement with Reduce:
system.time(z[, "NewVal"] <- Reduce("*", c(1, 1 + z$y[-1] / 100), accumulate = TRUE))
user system elapsed
0.139 0.008 0.148
> head(z)
x y NewVal
1 1 1 1.000
2 2 1 1.010
3 3 1 1.020
4 4 5 1.071
5 5 1 1.082
6 6 2 1.103
7 7 2 1.126
8 8 3 1.159
9 9 0 1.159
10 10 1 1.171
> system.time(for(i in 2:nrow(z)){z$NewVal[i]=z$NewVal[i-1]+
(z$NewVal[i-1]*(z$y[i]/100))})
user system elapsed
37.29 106.38 143.16