How do I time my script in R? [duplicate] - r

This question already has answers here:
Measuring function execution time in R
(15 answers)
Closed 8 years ago.
I guess I have a simple and straightforward question.
I am running a script and want to time the runtime of each function it calls. I assume there is a function for timing a function call. Can anybody help me here?
I have been searching, but I keep finding functions for time series and time intervals, which is not what I am looking for.

As others mentioned in the comments, the simplest way is with system.time. Here's some example code from the system.time manual page:
require(stats)
system.time(for(i in 1:100) mad(runif(1000)))
## Not run:
exT <- function(n = 10000) {
  # Purpose: Test if system.time works ok; n: loop size
  system.time(for(i in 1:n) x <- mean(rt(1000, df = 4)))
}
#-- Try to interrupt one of the following (using Ctrl-C / Escape):
exT()              #- about 4 secs on a 2.5GHz Xeon
system.time(exT()) #~ +/- same
On my machine, once the function exT() is called, this is my output:
user system elapsed
2.916 0.004 2.925
And for the function system.time(exT()) I get the following output:
user system elapsed
3.004 0.016 3.026
This means that for the first case the elapsed time is 2.925 seconds and 3.026 for the second.
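Under the hood, system.time() essentially records proc.time() before and after evaluating the expression and reports the difference, so if you want to time an arbitrary stretch of a script rather than a single expression, you can do the same bookkeeping by hand (a minimal sketch):
t0 <- proc.time()
for (i in 1:100) mad(runif(1000))   # ...the code you want to time...
proc.time() - t0                    # prints the same user/system/elapsed triplet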
However, if you want to perform benchmark tests, you should use the rbenchmark package. As its documentation puts it:
The library consists of just one function, benchmark, which is a simple wrapper around system.time.
The package documentation has more examples of how to use it; there are four there, and they are pretty good.
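As a quick illustration (a sketch only; the two expressions and the replication count are arbitrary choices for demonstration), benchmark() times several expressions side by side and reports their relative timings:
library(rbenchmark)
benchmark(for (i in 1:100) mad(runif(1000)),
          replicate(100, mad(runif(1000))),
          replications = 10)
# returns a data frame with one row per expression, including elapsed and relative timings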

Related

Does R store the values of recursive functions that it has obtained?

I have used Mathematica for many years and have just started using R for programming.
In both languages we can define recursive functions, and in Mathematica there is a way to save the values a function has already computed. I am not sure whether this is the default behaviour in R. Take the Fibonacci numbers, for example. In Mathematica:
fibonacci[0]=1;
fibonacci[1]=1;
fibonacci[n_]:=fibonacci[n-1]+fibonacci[n-2];
Say we want to find fibonacci[10]. The program first needs to compute fibonacci[i] for i = 2, ..., 9.
After that, if we want to find fibonacci[11], the program has to go through the whole process again to find all fibonacci[i] from i = 2, ..., 10, because it does not store the values it has obtained. A modification in Mathematica is
fibonacci[0]=1;
fibonacci[1]=1;
fibonacci[n_]:=fibonacci[n]=fibonacci[n-1]+fibonacci[n-2];
In this way, once we have computed the value fibonacci[10], it is stored, and we do not need to compute it again to find fibonacci[11]. This can save a lot of time when computing, say, fibonacci[10^9].
The Fibonacci function in R can be defined similarly (note that R uses parentheses, not brackets, for function calls):
fibonacci = function(n) {
  if (n == 0 | n == 1) { n }
  else { fibonacci(n-1) + fibonacci(n-2) }
}
Does R store the value of fibonacci(10) after we compute it? Does R compute fibonacci(10) again when we want to find fibonacci(11) next? Similar questions have been asked about other programming languages.
Edit: as John suggested, I computed fibonacci(30) (which is 832040) and then fibonacci(31) (which is 1346269). It took longer to get fibonacci(31). So it appears that the R function defined above does not store the values. How can I change the program so that it stores the intermediate values of a recursive function?
R does not do this by default and, as you saw, neither does Mathematica. You can implement memoisation yourself (see the sketch after the timings below) or use the memoise package.
fib <- function(i){
  if(i == 0 || i == 1)
    i
  else
    fib(i - 1) + fib(i - 2)
}
system.time(fib(30))
# user system elapsed
# 0.92 0.00 1.84
library(memoise)
fib <- memoise(fib) # <== Memoise magic
system.time(fib(100))
# user system elapsed
# 0.02 0.00 0.03
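If you prefer not to depend on a package, the same idea can be hand-rolled by keeping a cache in the function's enclosing environment. This is only a minimal sketch of the technique, not what memoise does internally:
fib2 <- local({
  cache <- c(0, 1)                  # cache[n + 1] holds fib(n); seeded with fib(0), fib(1)
  function(n) {
    if (n + 1 <= length(cache) && !is.na(cache[n + 1]))
      return(cache[n + 1])          # already computed: return the stored value
    val <- fib2(n - 1) + fib2(n - 2)
    cache[n + 1] <<- val            # store the result for later calls
    val
  }
})
fib2(30)    # fast the first time, and essentially free on repeated calls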

Speeding up sqldf in R

I have a program in R that I have been running for about a day now, and it has only reached about 10 percent completion. The main source of slowness is having to make thousands of sqldf(SELECT ...) calls against a data set of length ~1 million using the R package sqldf. My select statements currently take the following form:
sqldf("SELECT V1, V2 FROM mytable WHERE cast(start as real) <= sometime and cast(realized as real) > sometime")
sometime is just an integer representing a unix timestamp, and start and realized are columns of mytable that are also filled with unix timestamps. What I additionally know, however, is that |realized - start| < 172800 always, which is quite a small window given that the dataset spans over a year. My thought is that I should be able to exploit this fact and tell R to only check the rows with start within sometime ± 172800 in each of these calls.
Is the package sqldf inappropriate to use here? Should I be using a traditional [,] traversal of the data.frame? Is there an easy way to incorporate this fact to speed up the program? My gut feeling is to break up the data frame, sort the vectors, and then build custom functions that traverse and select the appropriate entries themselves, but I'm looking for some confirmation that this is the best way.
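(For reference, the windowing idea described in the question amounts to a simple pre-filter before selecting; a base-R sketch using the column and variable names above, assuming start and realized are stored as numeric unix timestamps:)
window <- 172800
# any matching row has realized > sometime and |realized - start| < window,
# so its start must fall in (sometime - window, sometime]; restrict to that slice first
candidates <- mytable[mytable$start > sometime - window &
                      mytable$start <= sometime, ]
result <- candidates[candidates$realized > sometime, c("V1", "V2")]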
First, the slow part is probably cast(...), so rather than doing that twice for each record, in each query, why don't you leave start and realized as timestamps, and change the query to accommodate that.
Second, the data.table option is still about 100 times faster (but see the bit at the end about indexing with sqldf).
library(sqldf)
library(data.table)
N <- 1e6
# sqldf option
set.seed(1)
df <- data.frame(start=as.character(as.POSIXct("2000-01-01")+sample(0:1e6,N,replace=T)),
realized=as.character(as.POSIXct("2000-01-01")+sample(0:1e6,N,replace=T)),
V1=rnorm(N), V2=rpois(N,4))
sometime <- "2000-01-05 00:00:00"
query <- "SELECT V1, V2 FROM df WHERE start <= datetime('%s') and realized > datetime('%s')"
query <- sprintf(query,sometime,sometime)
system.time(result.sqldf <- sqldf(query))
# user system elapsed
# 12.17 0.03 12.23
# data.table option
set.seed(1)
DT <- data.table(start=as.POSIXct("2000-01-01")+sample(0:1e6,N,replace=T),
realized=as.POSIXct("2000-01-01")+sample(0:1e6,N,replace=T),
V1=rnorm(N), V2=rpois(N,4))
setkey(DT,start,realized)
system.time(result.dt <- DT[start<=as.POSIXct(sometime) & realized > as.POSIXct(sometime),list(V1,V2)])
# user system elapsed
# 0.15 0.00 0.15
Note that the two result-sets will be sorted differently.
EDIT Based on comments below from @G.Grothendieck (author of the sqldf package).
This is turning into a really good comparison of the packages...
# code from G. Grothendieck comment
sqldf() # opens connection
sqldf("create index ix on df(start, realized)")
query <- fn$identity("SELECT V1, V2 FROM main.df WHERE start <= '$sometime' and realized > '$sometime'")
system.time(result.sqldf <- sqldf(query))
sqldf() # closes connection
# user system elapsed
# 1.28 0.00 1.28
So creating an index speeds up sqldf by about a factor of 10 in this case. Index creation is slow, but you only have to do it once. "Key" creation in data.table (which physically sorts the table) is extremely fast, but does not improve performance all that much in this case (only about a factor of 2).
Benchmarking using system.time() is a bit risky (1 data point), so it's better to use microbenchmark(...). Note that for this to work, we have to run the code above and leave the connection open (i.e., remove the last call to sqldf(), which closes it).
f.dt <- function() result.dt <- DT[start<=as.POSIXct(sometime) & realized > as.POSIXct(sometime),list(V1,V2)]
f.sqldf <- function() result.sqldf <- sqldf(query)
library(microbenchmark)
microbenchmark(f.dt(),f.sqldf())
# Unit: milliseconds
# expr min lq median uq max neval
# f.dt() 110.9715 184.0889 200.0634 265.648 833.4041 100
# f.sqldf() 916.8246 1232.6155 1271.6862 1318.049 1951.5074 100
So we can see that, in this case, data.table using keys is about 6 times faster than sqldf using indexes. The actual times will depend on the size of the result-set, so you might want to compare the two options.

Efficiently split a large audio file in R

Previously I asked this question on SO about splitting an audio file. The answer I got from @Jean V. Adams worked relatively well for small sound objects (downside: the input was stereo and the output was mono, not stereo):
library(seewave)
# your audio file (using example file from seewave package)
data(tico)
audio <- tico # this is an S4 class object
# the frequency of your audio file
freq <- 22050
# the length and duration of your audio file
totlen <- length(audio)
totsec <- totlen/freq
# the duration that you want to chop the file into
seglen <- 0.5
# defining the break points
breaks <- unique(c(seq(0, totsec, seglen), totsec))
index <- 1:(length(breaks)-1)
# a list of all the segments
subsamps <- lapply(index, function(i) cutw(audio, f=freq, from=breaks[i], to=breaks[i+1]))
I applied this solution to one of the roughly 300 files I'm preparing for analysis (~150 MB), and my computer worked on it for more than 5 hours before I ended up closing the session.
Does anyone have any thoughts or solutions to efficiently perform this task of splitting up a large audio file (specifically, an S4 class Wave object) into smaller pieces using R? I'm hoping to cut down drastically on the time it takes to make smaller files out of these larger files, and I'm hoping to use R. However, if I cannot get R to do the task efficiently, I would appreciate suggestions of other tools for the job. The example data above is mono, but my data is in stereo. The example data can be made stereo using:
tico@stereo <- TRUE
tico@right <- tico@left
UPDATE
I identified another solution that builds on work from the first solution:
lapply(index, function(i) audio[(breaks[i]*freq):(breaks[i+1]*freq)])
Comparing the performance of three solutions:
# Solution suggested by @Jean V. Adams
system.time(replicate(100,lapply(index, function(i) cutw(audio, f=freq, from=breaks[i], to=breaks[i+1], output="Wave"))))
user system elapsed
1.19 0.00 1.19
# my modification of the previous solution
system.time(replicate(100,lapply(index, function(i) audio[(breaks[i]*freq):(breaks[i+1]*freq)])))
user system elapsed
0.86 0.00 0.85
# solution suggested by @CarlWitthoft
audiomod <- audio[(freq*breaks[1]):(freq*breaks[length(breaks)-1])] # remove unequal part at end
system.time(replicate(100,matrix(audiomod@left,ncol=length(breaks))))+
system.time(replicate(100,matrix(audiomod@right,ncol=length(breaks))))
user system elapsed
0.25 0.00 0.26
The method using indexing (i.e. [) seems to be faster (3-4x). @CarlWitthoft's solution is even faster; the downside is that it puts the data into a matrix rather than multiple Wave objects, which I will be saving using writeWave. Presumably, converting from the matrix format to separate Wave objects will be relatively trivial once I properly understand how to create this type of S4 object. Any further room for improvement?
The approach I ended up using builds on the solutions offered by @CarlWitthoft and @Jean V. Adams. It is quite fast compared to the other techniques I was using, and it has allowed me to split a large number of my files in a matter of hours, rather than days.
Here is the whole process using a small Wave object as an example (my current audio files range up to 150 MB in size, but in the future I may receive much larger files, i.e. sound files covering 12-24 hours of recording, where memory management will become more important):
library(seewave)
library(tuneR)
data(tico)
# force to stereo
tico@stereo <- TRUE
tico@right <- tico@left
audio <- tico # this is an S4 class object
# the frequency of your audio file
freq <- 22050
# the length and duration of your audio file
totlen <- length(audio)
totsec <- totlen/freq
# the duration that you want to chop the file into (in seconds)
seglen <- 0.5
# defining the break points
breaks <- unique(c(seq(0, totsec, seglen), totsec))
index <- 1:(length(breaks)-1)
# the split
leftmat  <- matrix(audio@left,  ncol=(length(breaks)-2), nrow=seglen*freq)
rightmat <- matrix(audio@right, ncol=(length(breaks)-2), nrow=seglen*freq)
# the warnings are nothing to worry about here...
# convert to a list of Wave objects
subsamps0409_180629 <- lapply(1:ncol(leftmat), function(x)
  Wave(left=leftmat[,x], right=rightmat[,x],
       samp.rate=audio@samp.rate, bit=audio@bit))
# get the last part of the audio file, the part that is < seglen
lastbitleft  <- audio@left[(breaks[length(breaks)-1]*freq):length(audio)]
lastbitright <- audio@right[(breaks[length(breaks)-1]*freq):length(audio)]
# convert and add the last bit to the list of Wave objects
subsamps0409_180629[[length(subsamps0409_180629)+1]] <-
  Wave(left=lastbitleft, right=lastbitright, samp.rate=audio@samp.rate, bit=audio@bit)
This wasn't part of my original question, but my ultimate goal was to save these new, smaller Wave objects.
# finally, save the Wave objects
setwd("C:/Users/Whatever/Wave_object_folder")
# I had some memory management issues on my computer when doing this
# process with large (~ 130-150 MB) audio files so I used rm() and gc(),
# which seemed to resolve the problems I had with allocating memory.
# build the file names before removing breaks
filenames <- paste("audio","_split",1:(length(breaks)-1),".wav",sep="")
rm("breaks","audio","freq","index","lastbitleft","lastbitright","leftmat",
   "rightmat","seglen","totlen","totsec")
gc()
# Save the files
sapply(1:length(subsamps0409_180629),
function(x)writeWave(subsamps0409_180629[[x]],
filename=filenames[x]))
The only real downside here is that the output files are pretty big. For example, I put in a 130 MB file and split it into 18 files, each approximately 50 MB. I think this is because my input file is .mp3 and the output is .wav. I posted this answer to my own question to wrap up the problem with the full solution I used to solve it, but other answers are appreciated, and I will take the time to look at each solution and evaluate what it offers. I am sure there are better ways to accomplish this task, and methods that will work better with very large audio files. In solving this problem, I barely scratched the surface of memory management.
Per Frank's request, here's one possible approach.
Extract the vectors of sound data from the audio@left and audio@right slots, then break each one up into equal-length sections in one step, something like:
leftsong <- audio@left
leftmat  <- matrix(leftsong, ncol=seglen*freq, byrow=TRUE)  # byrow so each row is one contiguous segment
where I've assumed seglen is the distance between breaks[i] and breaks[i+1].
New wave objects can then be created and processed from the matching rows in leftmat and rightmat.
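A minimal sketch of that last step, using tuneR's Wave() constructor and the stereo version of the example object from earlier in this thread (this assumes the vector length is an exact multiple of seglen*freq; the ragged tail would need to be trimmed or handled separately, as in the accepted answer above):
library(tuneR)
rightsong <- audio@right
rightmat  <- matrix(rightsong, ncol=seglen*freq, byrow=TRUE)
segments  <- lapply(seq_len(nrow(leftmat)), function(i)
  Wave(left = leftmat[i, ], right = rightmat[i, ],
       samp.rate = audio@samp.rate, bit = audio@bit))
# each element of 'segments' is a seglen-second stereo Wave object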

R probability simulation that won't terminate?

I'm teaching a statistics class where I'm having students explore questions in probability and statistics through simulation using R. Recently there was some confusion about the probability of getting exactly two 6's when rolling 5 dice. The answer is choose(5,2)*5^3/6^5, but some students were convinced that "order shouldn't matter", i.e. that the answer should be choose(5,2)*choose(25,3)/choose(30,5).
I thought it would be fun to have them simulate rolling 5 dice thousands of times, keeping track of the empirical probability for each experiment, and then repeat the experiment many times. The problem is that the two numbers above are sufficiently close that it's quite hard to get a simulation to tease out the difference in a statistically significant fashion (of course I could just be doing it wrong). I tried rolling 5 dice 100000 times, then repeating the experiment 10000 times. This took an hour or so to run on my i7 Linux machine and still allowed for a 25% chance that the correct answer is choose(5,2)*choose(25,3)/choose(30,5). So I increased the number of dice rolls per experiment to 10^6. Now the code has been running for over 2 days and shows no sign of finishing. I'm confused by this, as I only increased the number of operations by an order of magnitude, implying that the run time should be closer to 10 hours.
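(For reference, a quick check with the two formulas above shows just how close the candidate answers are, which is why the simulation needs so many rolls to separate them:)
ans1 <- choose(5,2) * 5^3 / 6^5                    # independent rolls: 0.160751
ans2 <- choose(5,2) * choose(25,3) / choose(30,5)  # "order doesn't matter": 0.1613967
ans2 - ans1                                        # a difference of only about 0.00065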
Second question: Is there a better way to do this? See code posted below:
probdist = rep(0,10000)
for (j in 1:length(probdist))
{
  outcome = rep(0,1000000)
  for (k in 1:1000000)
  {
    rolls = sample(1:6, 5, replace=T)
    if (length(rolls[rolls == 6]) == 2) outcome[k] = 1
  }
  probdist[j] = sum(outcome)/length(outcome)
}
A good rule of thumb is to never, ever write a for loop in R. Here's an alternative solution:
doSample <- function()
{
sum(sample(1:6,size=5,replace=TRUE)==6)==2
}
> system.time(samples <- replicate(n=10000,expr=doSample()))
user system elapsed
0.06 0.00 0.06
> mean(samples)
[1] 0.1588
> choose(5,2)*5^3/6^5
[1] 0.160751
Doesn't seem to be too accurate with 10,000 samples. Better with 100,000:
> system.time(samples <- replicate(n=100000,expr=doSample()))
user system elapsed
0.61 0.02 0.61
> mean(samples)
[1] 0.16135
I had originally awarded a correct-answer check to M. Berk for the suggestion to use R's replicate() function. Further investigation has forced me to rescind my previous endorsement. It turns out that replicate() is just a wrapper for sapply(), which doesn't actually afford any performance benefit over a for loop (this seems to be a common misconception). In any case, I prepared 3 versions of the simulation, 2 using a for loop and one using replicate() as suggested, and ran them one after the other, starting from a fresh R session each time, in order to compare the execution times:
# dice26dist1.r: for() loop version with unnecessary array allocation
probdist = rep(0,100)
for (j in 1:length(probdist))
{
  outcome = rep(0,1000000)
  for (k in 1:1000000)
  {
    rolls = sample(1:6, 5, replace=T)
    if (length(rolls[rolls == 6]) == 2) outcome[k] = 1
  }
  probdist[j] = sum(outcome)/length(outcome)
}
system.time(source('dice26dist1.r'))
user system elapsed
596.365 0.240 598.614
# dice26dist2.r: for() loop version
probdist = rep(0,100)
for (j in 1:length(probdist))
{
  outcomes = 0
  for (k in 1:1000000)
  {
    rolls = sample(1:6, 5, replace=T)
    if (length(rolls[rolls == 6]) == 2) outcomes = outcomes + 1
  }
  probdist[j] = outcomes/1000000
}
system.time(source('dice26dist2.r'))
user system elapsed
506.331 0.076 508.104
# dice26dist3.r: replicate() version
doSample <- function()
{
  sum(sample(1:6,size=5,replace=TRUE)==6)==2
}
probdist = rep(0,100)
for (j in 1:length(probdist))
{
  samples = replicate(n=1000000,expr=doSample())
  probdist[j] = mean(samples)
}
system.time(source('dice26dist3.r'))
user system elapsed
804.042 0.472 807.250
From this you can see that the replicate() version is considerably slower than either of the for loop versions by any system.time metric. I had originally thought that my problem was mostly due to cache misses from allocating the million-element outcome[] array, but comparing the times of dice26dist1.r and dice26dist2.r indicates that this has only a nominal impact on overall performance (although the impact on system time is considerable: a >300% difference).
One might argue that I'm still using for loops in all three simulations, but as far as I can tell this is completely unavoidable when simulating a random process; I have to simulate actually going through the random process (in this case, rolling 5 dice) every time. I would love to know about any technique that would allow me to avoid using a for loop (in a way that improves performance, of course). I understand that this problem would lend itself very effectively to parallelization, but I'm talking about using a single R session -- is there a way to make this faster?
Vectorization is almost always preferred to any for loop. In this case, you should see substantial speedup by generating all your dice throws first, then checking how many in each group of five equal 6.
set.seed(5)
N <- 1e6
foo <- matrix(sample(1:6, 5*N, replace=TRUE), ncol=5)
p <- mean(rowSums(foo==6)==2)
se <- sqrt(p*(1-p)/N)
p
## [1] 0.160382
Here's a 95% confidence interval:
p + se*qnorm(0.975)*c(-1,1)
## [1] 0.1596628 0.1611012
We can see that the true answer (ans1) is in the interval, but the false answer (ans2) is not, or we could perform significance tests; the p-value when testing the true answer is 0.31 but for the false answer is 0.0057.
(ans1 <- choose(5,2)*5^3/6^5)
## [1] 0.160751
pnorm(abs((ans1-p)/se), lower=FALSE)*2
## [1] 0.3145898
ans2 <- choose(5,2)*choose(25,3)/choose(30,5)
## [1] 0.1613967
pnorm(abs((ans2-p)/se), lower=FALSE)*2
## [1] 0.005689008
Note that I'm generating all the dice throws at once; if memory is an issue, you could split this up into pieces and combine, as you did in your original post. This is possibly what caused your unexpected blow-up in time: if it was necessary to use swap memory, that would slow things down substantially. If so, it is better to increase the number of times you run the loop, not the number of rolls within the loop.
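A sketch of that chunked variant (same vectorized approach, just accumulating counts over blocks so the full matrix of throws never has to fit in memory at once; the block size is an arbitrary choice):
N.total <- 1e7    # total number of 5-dice experiments
block   <- 1e5    # experiments simulated per block
hits <- 0
for (b in seq_len(N.total / block)) {
  throws <- matrix(sample(1:6, 5*block, replace=TRUE), ncol=5)
  hits <- hits + sum(rowSums(throws == 6) == 2)
}
p <- hits / N.total   # estimate of the probability of exactly two 6's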

Snowfall's sfApply and sfClusterApplyLB is slower than normal loop or sapply [duplicate]

This question already has answers here:
Why is the parallel package slower than just using apply?
(3 answers)
Closed 9 years ago.
When I run this code in R, the loop and sapply are faster than snowfall's functions. What am I doing wrong? (Using Windows 8.)
library(snowfall)
library(rbenchmark)   # provides benchmark()
a <- 2
sfInit(parallel = TRUE, cpus = 4)
wrapper <- function(x){((x*a)^2)/3}
sfExport('a')
values <- seq(0, 100, 1)
benchmark(for(i in 1:length(values)){wrapper(i)},
          sapply(values, wrapper),
          sfLapply(values, wrapper),
          sfClusterApplyLB(values, wrapper))
sfStop()
Elapsed times after 100 replications:
loop              0.05
sapply            0.07
sfClusterApplyLB  2.94
sfLapply          0.26
If the function that is sent to each of the worker nodes takes only a small amount of time, the overhead of parallelization causes the overall task to take longer than running the job serially. Parallelization only really shows improved performance when the jobs sent to the worker nodes take a significant amount of time (at least several seconds).
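To see the crossover, compare a trivially cheap function with one that does about a second of work per call (a sketch using the same snowfall setup as in the question; the timings in the comments are rough expectations and will vary by machine):
library(snowfall)
sfInit(parallel = TRUE, cpus = 4)
cheap <- function(x) (x*2)^2/3
slow  <- function(x) { Sys.sleep(1); (x*2)^2/3 }   # pretend each call does ~1 s of work
system.time(lapply(1:8, cheap))      # effectively instant: serial wins
system.time(sfLapply(1:8, cheap))    # slower: communication overhead dominates
system.time(lapply(1:8, slow))       # ~8 s serially
system.time(sfLapply(1:8, slow))     # ~2 s on 4 workers
sfStop()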
See also:
Why is the parallel package slower than just using apply?
Searching for [r] parallel will yield at least 20 questions like yours, including more details as to what you can do to solve the problem.
