How to let a countdown run in R [duplicate]

How do you pause an R script for a specified number of seconds or milliseconds? In many languages, there is a sleep function, but ?sleep references a data set, and ?pause and ?wait don't exist.
The intended purpose is for self-timed animations. The desired solution works without asking for user input.

See help(Sys.sleep).
For example, from ?Sys.sleep
testit <- function(x)
{
    p1 <- proc.time()
    Sys.sleep(x)
    proc.time() - p1 # The cpu usage should be negligible
}
testit(3.7)
Yielding
> testit(3.7)
user system elapsed
0.000 0.000 3.704
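Applied to the countdown mentioned in the question title, a minimal self-timed example (a small sketch: it counts down from 5 to 1, one second apart, without any user input):
for (i in 5:1) {
    cat(i, "\n")
    Sys.sleep(1)   # pause one second between prints
}
cat("Go!\n")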

Sys.sleep() may not work reliably when CPU usage is very high, i.e. when other critical, high-priority processes are running in parallel.
This code worked for me. Here I am printing 1 to 1000 at a 2.5-second interval.
for (i in 1:1000)
{
    print(i)
    date_time <- Sys.time()
    while ((as.numeric(Sys.time()) - as.numeric(date_time)) < 2.5) {} # dummy while loop
}
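If you want this busy-wait approach as a reusable drop-in replacement for Sys.sleep, it can be wrapped in a small helper (a sketch; busy_wait is just an illustrative name, and note that unlike Sys.sleep it keeps one CPU core fully occupied while waiting):
busy_wait <- function(secs) {
    start <- Sys.time()
    while (as.numeric(Sys.time()) - as.numeric(start) < secs) {
        # spin: intentionally do nothing
    }
}
busy_wait(2.5)  # waits roughly 2.5 seconds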

TL;DR: sys_sleep, a new stable and precise sleep function
We already know that Sys.sleep may not work as expected, e.g. when CPU usage is very high.
That is why I decided to prepare a high-quality function powered by microbenchmark::get_nanotime() and a while/repeat loop.
#' Alternative to the Sys.sleep function
#' Expected to be more stable
#' @param val `numeric(1)` value to sleep.
#' @param unit `character(1)` the available units are nanoseconds ("ns"), microseconds ("us"), milliseconds ("ms"), seconds ("s").
#' @note Depends on the `microbenchmark` package to reuse `microbenchmark::get_nanotime()`.
#' @examples
#' # sleep 1 second in different units
#' sys_sleep(1, "s")
#' sys_sleep(1000, "ms")
#' sys_sleep(10**6, "us")
#' sys_sleep(10**9, "ns")
#'
#' sys_sleep(4.5)
#'
sys_sleep <- function(val, unit = c("s", "ms", "us", "ns")) {
  start_time <- microbenchmark::get_nanotime()
  stopifnot(is.numeric(val))
  unit <- match.arg(unit, c("s", "ms", "us", "ns"))
  val_ns <- switch(unit,
    "s" = val * 10**9,
    "ms" = val * 10**6,
    "us" = val * 10**3,
    "ns" = val
  )
  repeat {
    current_time <- microbenchmark::get_nanotime()
    diff_time <- current_time - start_time
    if (diff_time > val_ns) break
  }
}
system.time(sys_sleep(1, "s"))
#> user system elapsed
#> 1.015 0.014 1.030
system.time(sys_sleep(1000, "ms"))
#> user system elapsed
#> 0.995 0.002 1.000
system.time(sys_sleep(10**6, "us"))
#> user system elapsed
#> 0.994 0.004 1.000
system.time(sys_sleep(10**9, "ns"))
#> user system elapsed
#> 0.992 0.006 1.000
system.time(sys_sleep(4.5))
#> user system elapsed
#> 4.490 0.008 4.500
Created on 2022-11-21 with reprex v2.0.2

Related

Need to inspect memoised package function for memoisation to work

Let's say I have a package with the following function:
foo <- function() {
Sys.sleep(1) # really expensive operation
return(1)
}
The value of the function is always the same per run, so I would like to use memoisation.
I thought I could simply do
foo <- memoise::memoise(function() {
Sys.sleep(1) # really expensive operation
return(1)
})
However, this doesn't work.
To be clear, when it is defined as a function in the global environment, it works:
foo <- memoise::memoise(function() {
Sys.sleep(1)
return(1)
})
system.time(foo())
#> user system elapsed
#> 0 0 1
system.time(foo())
#> user system elapsed
#> 0.01 0.00 0.01
Created on 2019-12-23 by the reprex package (v0.3.0)
However, when it's in a package, I get really weird behavior: memoisation doesn't kick in and I keep paying the full cost on every call. Yet if I print the function definition, it starts working!
system.time(bar::foo())
#> user system elapsed
#> 0.47 0.08 2.55
system.time(bar::foo())
#> user system elapsed
#> 0 0 2
system.time(bar::foo())
#> user system elapsed
#> 0.02 0.00 2.02
system.time(bar::foo())
#> user system elapsed
#> 0.01 0.00 2.02
bar::foo
#> Memoised Function:
#> function() {
#> Sys.sleep(2)
#> return (1)
#> }
#> <environment: namespace:bar>
system.time(bar::foo())
#> user system elapsed
#> 0 0 2
system.time(bar::foo())
#> user system elapsed
#> 0 0 0
system.time(bar::foo())
#> user system elapsed
#> 0 0 0
system.time(bar::foo())
#> user system elapsed
#> 0 0 0
For the record, here are the relevant parts of the NAMESPACE and DESCRIPTION files:
# NAMESPACE
export(foo)
importFrom(memoise,memoise)
# DESCRIPTION [...]
Imports:
memoise
What's going on here, and what should I do to make memoisation work from the start in my package?
This looks like a bug in the memoise package. When you are working on your own package, R may add debug information (called srcrefs) to functions. Something about those causes the hash to come out differently every time you call the function, so memoise never recognizes that you are calling with the same arguments.
A simple workaround is to remove the install option "--with-keep.source" when you install your own package. (If you're using RStudio, this is added automatically in Project Options | Build Tools | Install and Restart... .) This will stop R from adding the srcref, and the bug in memoise won't be triggered. Unfortunately, this cripples the debugger in RStudio and other front-ends, so it's not ideal.
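For example, from the command line you could install without source references (a sketch; the package path is a placeholder to adjust):
R CMD INSTALL --without-keep.source /path/to/yourpackage
or, roughly equivalently from R with devtools, whose args argument forwards extra flags to R CMD INSTALL:
devtools::install("/path/to/yourpackage", args = "--without-keep.source")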
Another workaround that doesn't mess with the debugger (except for that one function) is to use removeSource on the target that is being memoised. For example,
foo <- memoise::memoise(removeSource(function() {
Sys.sleep(1) # really expensive operation
return(1)
}))

Timing R code with Sys.time()

I can run a piece of code for 5 or 10 seconds using the following code:
period <- 10 ## minimum time (in seconds) that the loop should run for
tm <- Sys.time() ## starting data & time
while(Sys.time() - tm < period) print(Sys.time())
The code runs just fine for 5 or 10 seconds. But when I set period to 60 so that it runs for a minute, the code never stops. What is wrong?
As soon as elapsed time exceeds 1 minute, the default unit changes from seconds to minutes. So you want to control the unit:
while (difftime(Sys.time(), tm, units = "secs")[[1]] < period)
From ?difftime
If ‘units = "auto"’, a suitable set of units is chosen, the
largest possible (excluding ‘"weeks"’) in which all the absolute
differences are greater than one.
Subtraction of date-time objects gives an object of this class, by
calling ‘difftime’ with ‘units = "auto"’.
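Putting it together, the loop from the question with the comparison pinned to seconds (a minimal sketch):
period <- 60  ## run for one minute
tm <- Sys.time()
while (difftime(Sys.time(), tm, units = "secs")[[1]] < period) {
    ## do the repeated work here
}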
Alternatively, use proc.time, which measures the various times ("user", "system", "elapsed"), in seconds, since you started your R session. We want "elapsed" time, i.e., the wall-clock time, so we retrieve the 3rd value of proc.time().
period <- 10
tm <- proc.time()[[3]]
while (proc.time()[[3]] - tm < period) print(proc.time())
If you are confused by the use of [[1]] and [[3]], please consult:
How do I extract just the number from a named number (without the name)?
How to get a matrix element without the column name in R?
Let me add some user-friendly reproducible examples. Your original code with print inside a loop is quite annoying as it prints thousands of lines onto the screen. I would use Sys.sleep.
test.Sys.time <- function(sleep_time_in_secs) {
t1 <- Sys.time()
Sys.sleep(sleep_time_in_secs)
t2 <- Sys.time()
## units = "auto"
print(t2 - t1)
## units = "secs"
print(difftime(t2, t1, units = "secs"))
## use '[[1]]' for clean output
print(difftime(t2, t1, units = "secs")[[1]])
}
test.Sys.time(5)
#Time difference of 5.005247 secs
#Time difference of 5.005247 secs
#[1] 5.005247
test.Sys.time(65)
#Time difference of 1.084357 mins
#Time difference of 65.06141 secs
#[1] 65.06141
The "auto" units is very clever. If sleep_time_in_secs = 3605 (more than an hour), the default unit will change to "hours".
Be careful with time units when using Sys.time, or you may be fooled when benchmarking. Here is a perfect example: Unexpected results in benchmark of read.csv / fread. I had answered it with a now-removed comment:
You got a problem with time units. I see that fread is more than 20 times faster. If fread takes 4 seconds to read a file, read.csv takes 80 seconds = 1.33 minutes. Ignoring the units, read.csv is "faster".
Now let's test proc.time.
test.proc.time <- function(sleep_time_in_secs) {
t1 <- proc.time()
Sys.sleep(sleep_time_in_secs)
t2 <- proc.time()
## print user, system, elapsed time
print(t2 - t1)
## use '[[3]]' for clean output of elapsed time
print((t2 - t1)[[3]])
}
test.proc.time(5)
# user system elapsed
# 0.000 0.000 5.005
#[1] 5.005
test.proc.time(65)
# user system elapsed
# 0.000 0.000 65.057
#[1] 65.057
"user" time and "system" time are 0, because both CPU and the system kernel are idle.

gtrendsR recently became slow?

We have a production system calling gtrendsR::gtrends() which typically returns in less than a second. Now we are seeing response times of roughly one minute. Example:
library(gtrendsR) ## Version 1.4.0
system.time(ret_gt <- gtrendsR::gtrends("skinny jeans", geo = "US",
time = "today+5-y", category = 997))
# user system elapsed
# 0.404 0.016 56.897
Am I missing something?

unserialize error in sparkR

I am very new to SparkR (and to parallelization in general). I am running SparkR locally (I know that is not the intended use of Spark, but I am just getting started) and I have tried to rewrite part of my code with SparkR. However, collect gives me the following error as I increase the number of samples (there is no error for a small number of samples):
Error in unserialize(obj) :
ReadItem: unknown type 0, perhaps written by later version of R
Calls: assetForecast ... convertJListToRList -> lapply -> lapply -> FUN -> unserialize
Execution halted
The other error, which is probably due to low memory, is:
a Java heap memory error (increasing the JVM memory and driver memory did not help)
I would appreciate any help regarding the FIRST error (I posted the second error since I thought the two may be related, even though I get them with different values of numSlices in parallelize). I think the first one may be caused by a version incompatibility between Spark, SparkR, and R that leads to this serialization issue. I tried installing different versions, but pretty soon got stuck resolving dependencies.
Here is a sample script which simulates what I am doing in SparkR (the errors are generated for input.len > 950):
library(SparkR) # load sparkR library
sc <- sparkR.init() ## initialize the sparkR
input.len <- 8000 # size of the input
num.slice <- 2 # number of slices for parallelize function
## Define a few functions to simulate actual calculations
latemail <- function(N, st="2012/01/01", et="2015/12/31") {
## create random date of length N
st <- as.POSIXct(as.Date(st))
et <- as.POSIXct(as.Date(et))
dt <- as.numeric(difftime(et,st,unit="sec"))
ev <- sort(runif(N, 0, dt))
rt <- st + ev
}
encode <- function(ele1, ele2) {
    ## concatenate ele1 and ele2, separated by %
    return(paste(toString(ele1), toString(ele2), sep = "%"))
}
decode <- function(coded) {
## separate input string by %
idx <- regexpr("%", coded)[1]
ele1 <- as.numeric(substr(coded, 1, idx-1))
ele2 <- substr(coded, idx + 1, nchar(coded))
return (list(ele1, ele2))
}
fakeFun <- function(asset.age, asset.year) {
## fake function to simulate my actual function
return (as.list(rep(asset.age, 10)))
}
wrapperFun <- function(x) {
    asset.age <- decode(x)[[1]]
    asset.y <- decode(x)[[2]]
    df <- fakeFun(asset.age, asset.y)
    return(df)
}
## Start of calculations with SparkR
calc.ts <- latemail(input.len) ## create fake years
asset.ages <- runif(input.len) * 10 ## create fake ages
paired <- list()
for (i in 1:length(asset.ages)) {
## keep information of both years and ages in one vector
## using encode function
paired[[length(paired) + 1]] <- encode(asset.ages[[i]], calc.ts[[i]])
}
rdd.paired <- parallelize(sc, paired, numSlices = num.slice)
rdd.df <- lapply(rdd.paired, wrapperFun)
rdd.list <- collect(rdd.df)
print(rdd.list)
sparkR.stop()
Here is the full error report:
For numSlices = 5 in the parallelize function:
> rdd.list <- collect(rdd.df)
15/07/22 17:20:40 INFO RRDD: Times: boot = 0.434 s, init = 0.015 s, broadcast = 0.000 s, read-input = 0.003 s, compute = 0.200 s, write-output = 0.004 s, total = 0.656 s
15/07/22 17:20:41 INFO RRDD: Times: boot = 0.010 s, init = 0.017 s, broadcast = 0.000 s, read-input = 0.003 s, compute = 0.193 s, write-output = 0.004 s, total = 0.227 s
15/07/22 17:20:41 INFO RRDD: Times: boot = 0.010 s, init = 0.013 s, broadcast = 0.001 s, read-input = 0.002 s, compute = 0.191 s, write-output = 0.003 s, total = 0.220 s
15/07/22 17:20:41 INFO RRDD: Times: boot = 0.010 s, init = 0.011 s, broadcast = 0.000 s, read-input = 0.002 s, compute = 0.191 s, write-output = 0.004 s, total = 0.218 s
15/07/22 17:20:41 INFO RRDD: Times: boot = 0.014 s, init = 0.015 s, broadcast = 0.000 s, read-input = 0.003 s, compute = 0.213 s, write-output = 0.004 s, total = 0.249 s
Error in unserialize(obj) :
ReadItem: unknown type 0, perhaps written by later version of R
Calls: collect ... convertJListToRList -> lapply -> lapply -> FUN -> unserialize
Execution halted
For numSlices = 6 in the parallelize function:
15/07/22 17:18:52 WARN TaskSetManager: Lost task 2.0 in stage 0.0 (TID 2, localhost): java.lang.OutOfMemoryError: Java heap space
edu.berkeley.cs.amplab.sparkr.RRDD.readData(RRDD.scala:258)
edu.berkeley.cs.amplab.sparkr.RRDD.readData(RRDD.scala:243)
edu.berkeley.cs.amplab.sparkr.BaseRRDD.read(RRDD.scala:200)
edu.berkeley.cs.amplab.sparkr.BaseRRDD$$anon$1.next(RRDD.scala:70)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
edu.berkeley.cs.amplab.sparkr.BaseRRDD$$anon$1.foreach(RRDD.scala:66)
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
edu.berkeley.cs.amplab.sparkr.BaseRRDD$$anon$1.to(RRDD.scala:66)
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
edu.berkeley.cs.amplab.sparkr.BaseRRDD$$anon$1.toBuffer(RRDD.scala:66)
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
edu.berkeley.cs.amplab.sparkr.BaseRRDD$$anon$1.toArray(RRDD.scala:66)
org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:774)
org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:774)
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
15/07/22 17:18:52 ERROR TaskSetManager: Task 2 in stage 0.0 failed 1 times; aborting job
Error in readTypedObject(con, type) :
Unsupported type for deserialization
Calls: collect ... callJMethod -> invokeJava -> readObject -> readTypedObject
Execution halted
Is there really a problem with my SparkR installation? If so, why does it run for a small number of samples?
Thanks a lot
The following is how it works (or should work) in Spark 1.4.0. First, initialize a sqlContext as well:
sqlContext <- sparkRSQL.init(sc)
Then change your code, starting from
paired <- list()
into:
# Create a vector instead of a list
paired <- c()
for (i in 1:length(asset.ages)) {
## keep information of both years and ages in one vector
## using encode function
paired[length(paired) + 1] <- encode(asset.ages[[i]], calc.ts[[i]])
}
# What you actually need is a data.frame or SparkR DataFrame
paired.data.frame <- data.frame(paired=paired)
paired.DataFrame <- createDataFrame(sqlContext, paired.data.frame)
# Map function returns an RDD which you can not collect yet
# Therefor convert it to a DataFrame again
paired.df <- createDataFrame(sqlContext, map(paired.DataFrame,wrapperFun))
# This DataFrame you can collect
paired.result <- collect(paired.df)
Why did I say "should work" in my first sentence? It works when I run it on my laptop, but I altered the SparkR source code to make map available.
I do not know how to fix this in SparkR 1.2, but I would suggest switching to Spark 1.4.0 anyway, since SparkR has been integrated into Spark since that release.

R code slowing with increased iterations

I've been trying to increase the speed of some code. I've removed all loops, am using vectors, and have streamlined just about everything. I've timed each iteration of my code and it appears to be slowing down as the iterations increase.
### The beginning iterations
user system elapsed
0.03 0.00 0.03
user system elapsed
0.03 0.00 0.04
user system elapsed
0.03 0.00 0.03
user system elapsed
0.04 0.00 0.05
### The ending iterations
user system elapsed
3.06 0.08 3.14
user system elapsed
3.10 0.05 3.15
user system elapsed
3.08 0.06 3.15
user system elapsed
3.30 0.06 3.37
I have 598 iterations and right now it takes about 10 minutes. I'd like to speed things up. Here's how my code looks. You'll need the RColorBrewer and fields packages. Here's my data. Yes, I know it's big; make sure you download the zip file.
library(fields)        ## for image.plot
library(RColorBrewer)  ## for brewer.pal

StreamFlux <- function(data, NoR, NTS) {
  ### Read in data to display points ###
  WLX <- c(8, 19, 29, 20, 13, 20, 21)
  WLY <- c(25, 28, 25, 21, 17, 14, 12)
  WLY <- 34 - WLY
  WLX <- WLX / 44
  WLY <- WLY / 33
  timedata <- NULL
  mf <- function(i) {
    b <- (NoR + 8) * (i - 1) + 8
    ### I read in data one section at a time to avoid headers
    mydata <- read.table(data, skip = b, nrows = NoR, header = FALSE)
    rows <- 34 - mydata[, 2]
    cols <- 45 - mydata[, 3]
    flows <- mydata[, 7]
    rows <- as.numeric(rows)
    cols <- as.numeric(cols)
    rm(mydata)
    ### Create flux matrix ###
    flow_mat <- matrix(0, 44, 33)
    ### Populate matrix ###
    flow_mat[(rows - 1) * 44 + (45 - cols)] <- flows + flow_mat[(rows - 1) * 44 + (45 - cols)]
    flow_mat[flow_mat == 0] <- NA
    rm(flows)
    rm(rows)
    rm(cols)
    timestep <- i
    ### Specify jpeg info ###
    jpeg(paste("Streamflow", timestep, ".jpg", sep = ""),
         width = 640, height = 441, quality = 75, bg = "grey")
    image.plot(flow_mat, zlim = c(-1, 1),
               col = brewer.pal(11, "RdBu"), yaxt = "n",
               xaxt = "n", main = paste("Stress Period ", timestep, sep = ""))
    points(WLX, WLY)
    dev.off()
    rm(flow_mat)
  }
  ST <- function(x) {
    functiontime <- system.time(mf(x))
    print(functiontime)
  }
  lapply(1:NTS, ST)
}
This is how to run the function
###To run all timesteps###
StreamFlux("stream_out.txt",687,598)
###To run the first 100 timesteps###
StreamFlux("stream_out.txt",687,100)
###The first 200 timesteps###
StreamFlux("stream_out.txt",687,200)
To test, remove print(functiontime) to stop it printing at every timestep, then:
> system.time(StreamFlux("stream_out.txt",687,100))
user system elapsed
28.22 1.06 32.67
> system.time(StreamFlux("stream_out.txt",687,200))
user system elapsed
102.61 2.98 106.20
What I'm looking for is any way to speed up this code, and possibly an explanation of why it is slowing down. Should I just run it in parts? That seems like a clumsy solution. I've read about dlply from the plyr package; it seems to have worked here, but would that help in my case? How about parallel processing? I think I can figure that out, but is it worth the trouble in this case?
I will follow @PaulHiemstra's suggestion and post my comment as an answer. Who can resist Internet points? ;)
From a quick glance at your code, I agree with @joran's second point in his comment: your loop/function is probably slowing down due to repeatedly reading in your data. More specifically, this part of the code probably needs to be fixed:
read.table(data, skip=b, nrows=NoR, header=FALSE).
In particular, I think the skip=b argument is the culprit. You should read in all the data at the beginning, if possible, and then retrieve the necessary parts from memory for the calculations.
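A minimal sketch of that idea, keeping the same block offsets used in mf() (readLines plus read.table(text = ...) avoids re-scanning the file on disk at every iteration):
## read the whole file into memory once, outside mf()
all_lines <- readLines(data)
mf <- function(i) {
  b <- (NoR + 8) * (i - 1) + 8
  ## parse only this timestep's block from the in-memory lines
  mydata <- read.table(text = all_lines[(b + 1):(b + NoR)], header = FALSE)
  ## ... rest of mf() unchanged ...
}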
