Strange addTaskCallback work in RStudio - r

This is the next in my series of "strange" questions.
I found some differences in how code executes in the R console and in RStudio, and I can't understand the reason. It is also connected with the "track" package working incorrectly in RStudio and R.NET, as I wrote earlier in "Incorrect work of track package in R.NET".
So, let's look at the example from https://search.r-project.org/library/base/html/taskCallback.html
(I modified it slightly so that the result of sum is printed in RStudio.)
times <- function(total = 3, str = "Task a") {
  ctr <- 0
  function(expr, value, ok, visible) {
    ctr <<- ctr + 1
    cat(str, ctr, "\n")
    if (ctr == total) {
      cat("handler removing itself\n")
    }
    return(ctr < total)
  }
}
# add the callback that will work for
# 4 top-level tasks and then remove itself.
n <- addTaskCallback(times(4))
# now remove it, assuming it is still first in the list.
removeTaskCallback(n)
## Not run:
# There is no point in running this
# as
addTaskCallback(times(4))
print(sum(1:10))
print(sum(1:10))
print(sum(1:10))
print(sum(1:10))
print(sum(1:10))
## End(Not run)
Output in the R console:
>
> # add the callback that will work for
> # 4 top-level tasks and then remove itself.
> n <- addTaskCallback(times(4))
Task a 1
>
> # now remove it, assuming it is still first in the list.
> removeTaskCallback(n)
[1] TRUE
>
> ## Not run:
> # There is no point in running this
> # as
> addTaskCallback(times(4))
1
1
Task a 1
>
> print(sum(1:10))
[1] 55
Task a 2
> print(sum(1:10))
[1] 55
Task a 3
> print(sum(1:10))
[1] 55
Task a 4
handler removing itself
> print(sum(1:10))
[1] 55
> print(sum(1:10))
[1] 55
>
> ## End(Not run)
>
Okay, let's run this in RStudio.
Output:
> source('~/callbackTst.R')
[1] 55
[1] 55
[1] 55
[1] 55
[1] 55
Task a 1
>
A second run gives this:
> source('~/callbackTst.R')
[1] 55
[1] 55
[1] 55
[1] 55
[1] 55
Task a 2
Task a 1
>
Third:
> source('~/callbackTst.R')
[1] 55
[1] 55
[1] 55
[1] 55
[1] 55
Task a 3
Task a 2
Task a 1
>
and so on.
There is a strange difference between RStudio and the R console, and I don't know why. Could anyone help me? Is it a bug, or is this normal and I am doing something wrong?
Thank you.
P.S. This post is connected with getting the "track" package to work correctly, because the "track.start" function contains this code:
assign(".trackingSummaryChanged", FALSE, envir = trackingEnv)
assign(".trackingPid", Sys.getpid(), envir = trackingEnv)
if (!is.element("track.auto.monitor", getTaskCallbackNames()))
addTaskCallback(track.auto.monitor, name = "track.auto.monitor")
return(invisible(NULL))
which, I think, doesn't work correctly in RStudio and R.NET.
P.P.S. I use R 3.2.2 x64, RStudio 0.99.489 and Windows 10 Pro x64. The problem also exists with RRO under R.NET and RStudio.

addTaskCallback() will add a callback that's executed when R execution returns to the top level. When you're executing code line-by-line, each statement executed will return control to the top level, and callbacks will execute.
When executed within source(), control isn't returned until the call to source() returns, and so the callback is only run once.
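To make the mechanism concrete, here is a minimal sketch that calls the handler from the question by hand instead of registering it. The manual calls stand in for R invoking the callback after each top-level task; the argument values passed in are placeholders chosen for illustration.

```r
# Same handler factory as in the question: the closure counts its calls and
# returns FALSE once it should be dropped from the callback list.
times <- function(total = 3, str = "Task a") {
  ctr <- 0
  function(expr, value, ok, visible) {
    ctr <<- ctr + 1
    cat(str, ctr, "\n")
    ctr < total
  }
}

handler <- times(3)

# Simulate three completed top-level tasks by calling the handler directly.
# The arguments (expression, value, ok, visible) are illustrative placeholders.
keep <- vapply(
  1:3,
  function(i) handler(quote(sum(1:10)), 55L, TRUE, TRUE),
  logical(1)
)
keep
```

The handler only advances when it is called, and it is called once per completed top-level task. A single source() call is one such task, so each registered handler runs once when source() returns. That would also be consistent with the RStudio output in the question: every run of the script registers one more handler that is never removed, and all of them fire once at the end.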

Related

Running different R functions from different R Script files

I would like to know how to run different functions from different R script files.
For example, in Main.R:
source("Database.R")
msci_data <- getIndex() #function from Database.R
source("Positions.R")
current_positions <- getPositions() #function from Positions.R
I noticed that after running getPositions(), my msci_data data frame gets deleted. Is there any way I can call multiple functions from two different source files?
Thanks very much
Here is a short demonstration that, in general, sourcing multiple R scripts will not remove anything from your global environment.
I have in a file foo.R:
foo <- function(x) x^2
I then have in a file bar.R:
bar <- function(x) x^3
Then from main.R, I do the following:
x <- 1:10
ls()
# [1] "x"
source("foo.R")
foo(x)
# [1] 1 4 9 16 25 36 49 64 81 100
ls()
# [1] "foo" "x"
source("bar.R")
bar(x)
# [1] 1 8 27 64 125 216 343 512 729 1000
ls()
# [1] "bar" "foo" "x"
You can see the functions all work as expected, and nothing is ever removed from the global environment. It must be that something in your Positions.R file is what causes this behavior, so no one can help you solve your problem without seeing your code.
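Since the contents of Positions.R are not shown, the following is only a sketch under an assumption: that the script clears its workspace with something like rm(list = ls()). The temporary script, the `workspace` environment standing in for the global environment, and the `sandbox` environment are all hypothetical names introduced for illustration.

```r
# Hypothetical stand-in for Positions.R: it clears the workspace, then defines a function.
script <- tempfile(fileext = ".R")
writeLines(c(
  "rm(list = ls())",
  "getPositions <- function() c(a = 1, b = 2)"
), script)

# Stand-in for the global environment, so the example leaves your real workspace alone.
workspace <- new.env()
assign("msci_data", data.frame(id = 1:3), envir = workspace)

# Sourcing directly into the workspace lets the script's rm() delete msci_data.
source(script, local = workspace)
lost <- !exists("msci_data", envir = workspace, inherits = FALSE)

# Sourcing into a child environment confines the script's side effects.
assign("msci_data", data.frame(id = 1:3), envir = workspace)
sandbox <- new.env(parent = workspace)
source(script, local = sandbox)
kept <- exists("msci_data", envir = workspace, inherits = FALSE)
current_positions <- sandbox$getPositions()
```

If the real cause is something else, comparing ls() before and after source("Positions.R") should still show exactly which objects disappear.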

Extraction of POSIXlt component runs fine in R 3.4.4, but errors in R 3.5.0. Why?

1) R version 3.4.4 (2018-03-15)
my.timedate <- as.POSIXlt('2016-01-01 16:00:00')
# print(attributes(my.timedate))
print(my.timedate[['hour']])
[1] 16
2) R version 3.5.0 (2018-04-23)
my.timedate <- as.POSIXlt('2016-01-01 16:00:00')
# print(attributes(my.timedate))
print(my.timedate[['hour']])
Error in FUN(X[[i]], ...) : subscript out of bounds
I think that is a known change in R 3.5.0 where the list elements of a POSIXlt need to be unpacked explicitly. Using R 3.5.0:
edd@rob:~$ docker run --rm -ti r-base:3.5.0 \
R -q -e 'print(unclass(as.POSIXlt("2016-01-01 16:00:00"))[["hour"]])'
> print(unclass(as.POSIXlt("2016-01-01 16:00:00"))[["hour"]])
[1] 16
>
>
edd@rob:~$
whereas with R 3.4.* one does not need the unclass() as you showed:
edd@rob:~$ docker run --rm -ti r-base:3.4.3 \
R -q -e 'print(as.POSIXlt("2016-01-01 16:00:00")[["hour"]])'
> print(as.POSIXlt("2016-01-01 16:00:00")[["hour"]])
[1] 16
>
>
edd@rob:~$
I don't find a corresponding NEWS file entry though so not entirely sure if it is on purpose...
Edit: As others have noted, the corresponding NEWS entry is the somewhat opaque
* Single components of "POSIXlt" objects can now be extracted and
replaced via [ indexing with 2 indices.
From ?POSIXlt:
As from R 3.5.0, one can extract and replace single components via [ indexing with two indices (see the examples).
The example is a little opaque, but shows the idea:
leapS[1 : 5, "year"]
If you look at the source, though, you can see what's happening:
`[.POSIXlt`
#> function (x, i, j, drop = TRUE)
#> {
#> if (missing(j)) {
#> .POSIXlt(lapply(X = unclass(x), FUN = "[", i, drop = drop),
#> attr(x, "tzone"), oldClass(x))
#> }
#> else {
#> unclass(x)[[j]][i]
#> }
#> }
#> <bytecode: 0x7fbdb4d24f60>
#> <environment: namespace:base>
It is using i to subset unclass(x), where x is the POSIXlt object. So with R 3.5.0, you use [ and preface the part of the datetime you want with the index of the datetime in the vector:
my.timedate <- as.POSIXlt('2016-01-01 16:00:00')
my.timedate[1, 'hour']
#> [1] 16
as.POSIXlt(seq(my.timedate, by = 'hour', length.out = 10))[2:5, 'hour']
#> [1] 17 18 19 20
Note that $ subsetting still works as usual:
my.timedate$hour
#> [1] 16
See ?DateTimeClasses (same as ?as.POSIXlt):
As from R 3.5.0, one can extract and replace single components via [ indexing with two indices
See also similar description in R NEWS CHANGES IN R 3.5.0.
Thus:
my.timedate[1, "hour"]
# [1] 16
# or leave the i index empty to select a component
# from all date-times in a vector
as.POSIXlt(c('2016-01-01 16:00:00', '2016-01-01 17:00:00'))[ , "hour"]
# [1] 16 17
See also Examples in the help text.
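For code that has to run on both sides of the 3.5.0 change, a sketch like the following avoids [[ on the POSIXlt object altogether. The explicit tz = "UTC" is an addition for reproducibility and is not part of the original question.

```r
# Fixed time zone so the result does not depend on the machine's settings.
my.timedate <- as.POSIXlt("2016-01-01 16:00:00", tz = "UTC")

# Three ways to get the hour that do not rely on [[ applied to a POSIXlt object.
hour_dollar  <- my.timedate$hour                        # component access via $
hour_unclass <- unclass(my.timedate)[["hour"]]          # drop the class, then index the list
hour_format  <- as.integer(format(my.timedate, "%H"))   # go through formatting
```

All three should agree; the $ form is the shortest and matches the note above that $ subsetting still works as usual.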

R rhyper() fails to give correct hypergeometric random number

I am trying to generate some random numbers from the hypergeometric distribution using R. However, rhyper() behaves very strangely when I have a very small number of white balls and a large number of black balls. Here is what I got on my computer:
> sum(rhyper(100,1000,1e9-1000,1e6))
[1] 91
> sum(rhyper(100,2000,1e9-2000,1e6))
[1] 204
> sum(rhyper(100,10000,1e9-10000,1e6))
[1] 1016
> sum(rhyper(100,20000,1e9-20000,1e6))
[1] 1909
> sum(rhyper(100,50000,1e9-50000,1e6))
[1] 4968
> sum(rhyper(100,5000,1e9-5000,1e6))
[1] 60
> sum(rhyper(100,6000,1e9-6000,1e6))
[1] 164
> sum(rhyper(100,8000,1e9-8000,1e6))
[1] 0
> sum(rhyper(100,9000,1e9-9000,1e6))
[1] 45
The first five work fine, but for the sixth I expected a number around 500, not something like 60; the seventh, eighth and ninth are similarly off.
Is something wrong with the rhyper() function or with my computer?
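The expectation of "around 500" can be checked from the hypergeometric mean: each draw has mean k * m / (m + n), so the sum of nn draws has mean nn * k * m / (m + n). The sketch below only does that arithmetic and puts it next to the sums reported above; the variable names are chosen here for illustration.

```r
nn    <- 100    # number of random draws per call
k     <- 1e6    # balls drawn from the urn in each draw
total <- 1e9    # white + black balls in the urn

# Numbers of white balls, in the order the calls appear in the question.
white <- c(1000, 2000, 10000, 20000, 50000, 5000, 6000, 8000, 9000)

# Expected value of sum(rhyper(nn, m, total - m, k)).
expected_sum <- nn * k * white / total

# Sums reported in the question, copied from the console output.
observed_sum <- c(91, 204, 1016, 1909, 4968, 60, 164, 0, 45)

data.frame(white, expected_sum, observed_sum)
```

The first five observed sums sit close to their expected values, while the last four fall far below them, which supports the suspicion that something is off for those parameter combinations.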

Turn on all CPUs for all nodes on a cluster: snow/snowfall package

I am working on a cluster and am using the snowfall package to establish a socket cluster on 5 nodes with 40 CPUs each with the following command:
> sfInit(parallel=TRUE, cpus = 200, type="SOCK", socketHosts=c("host1", "host2", "host3", "host4", "host5"));
R Version: R version 3.1.0 (2014-04-10)
snowfall 1.84-6 initialized (using snow 0.3-13): parallel execution on 5 CPUs.
I am seeing a much lower load on the slaves than expected when I check the cluster report and was disconcerted by the fact that it says "parallel execution on 5 CPUs" instead of "parallel execution on 200 CPUs". Is this merely an ambiguous reference to CPUs or are the hosts only running one CPU each?
EDIT: Here is an example of why this concerns me, if I only use the local machine and specify the max number of cores, I have:
> sfInit(parallel=TRUE, type="SOCK", cpus = 40);
snowfall 1.84-6 initialized (using snow 0.3-13): parallel execution on 40 CPUs.
I ran an identical job on the single node, 40 CPU cluster and it took 1.4 minutes while the 5 node, apparently 5 CPU cluster took 5.22 minutes. To me this confirms my suspicions that I am running with parallelism on 5 nodes but am only turning on 1 of the CPUs on each node.
My question is then: how do you turn on all CPUs for use across all available nodes?
EDIT: @SimonG I used the underlying snow package's initialization, and we can clearly see that only 5 nodes are being turned on:
> cl <- makeSOCKcluster(names = c("host1", "host2", "host3", "host4", "host5"), count = 200)
> clusterCall(cl, runif, 3)
[[1]]
[1] 0.9854311 0.5737885 0.8495582
[[2]]
[1] 0.7272693 0.3157248 0.6341732
[[3]]
[1] 0.26411931 0.36189866 0.05373248
[[4]]
[1] 0.3400387 0.7014877 0.6894910
[[5]]
[1] 0.2922941 0.6772769 0.7429913
> stopCluster(cl)
> cl <- makeSOCKcluster(names = rep("localhost", 40), count = 40)
> clusterCall(cl, runif, 3)
[[1]]
[1] 0.6914666 0.7273244 0.8925275
[[2]]
[1] 0.3844729 0.7743824 0.5392220
[[3]]
[1] 0.2989990 0.7256851 0.6390770
[[4]]
[1] 0.07114831 0.74290601 0.57995908
[[5]]
[1] 0.4813375 0.2626619 0.5164171
.
.
.
[[39]]
[1] 0.7912749 0.8831164 0.1374560
[[40]]
[1] 0.2738782 0.4100779 0.0310864
I think this shows it pretty clearly. I tried this in desperation:
> cl <- makeSOCKcluster(names = rep(c("host1", "host2", "host3", "host4", "host5"), each = 40), count = 200)
and predictably got:
Error in socketConnection(port = port, server = TRUE, blocking = TRUE, :
all connections are in use
After thoroughly reading the snow documentation, I have come up with a (partial) solution.
I read that only 128 connections may be opened at once with the distributed R version, and have found it to be true. I can open 25 CPUs on each node, but the cluster will not start if I try to start 26 on each. Here is the proper structure of the host list that needs to be passed to makeCluster:
> library(snow);
> unixHost13 <- list(host = "host1");
> unixHost14 <- list(host = "host2");
> unixHost19 <- list(host = "host3");
> unixHost29 <- list(host = "host4");
> unixHost30 <- list(host = "host5");
> kCPUs <- 25;
> hostList <- c(rep(list(unixHost13), kCPUs), rep(list(unixHost14), kCPUs), rep(list(unixHost19), kCPUs), rep(list(unixHost29), kCPUs), rep(list(unixHost30), kCPUs));
> cl <- makeCluster(hostList, type = "SOCK")
> clusterCall(cl, runif, 3)
[[1]]
[1] 0.08430941 0.64479036 0.90402362
[[2]]
[1] 0.1821656 0.7689981 0.2001639
[[3]]
[1] 0.5917363 0.4461787 0.8000013
.
.
.
[[123]]
[1] 0.6495153 0.6533647 0.2636664
[[124]]
[1] 0.75175580 0.09854553 0.66568129
[[125]]
[1] 0.79336203 0.61924813 0.09473841
I found a reference saying that, in order to raise the connection limit, R needs to be rebuilt with NCONNECTIONS set higher (see here).
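The host list above can be built more compactly. This is a sketch of the same structure (one list(host = ...) entry per worker, 25 per host); `hosts`, `workers_per_host` and `host_list` are names introduced here, and the cluster call is left commented out because it needs the real machines.

```r
hosts <- c("host1", "host2", "host3", "host4", "host5")
workers_per_host <- 25L   # 26 per host failed to start, as described above

# One list(host = ...) entry per worker, grouped by host.
host_list <- rep(
  lapply(hosts, function(h) list(host = h)),
  each = workers_per_host
)
length(host_list)

# With snow installed and the hosts reachable, the cluster would then be started with:
# cl <- snow::makeCluster(host_list, type = "SOCK")
```

If the 128-connection limit is indeed what blocks 26 workers per host, the arithmetic would fit: 125 worker connections plus R's three standard connections make 128.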

Why do I get "Error in rbind.zoo(...) : indexes overlap" when merging two zoo objects?

I have two seemingly identical zoo objects created by the same commands from CSV files for different time periods. I try to combine them into one long zoo but it fails with an "indexes overlap" error ('merge', 'c' and 'rbind' all produce variants of the same error text). As far as I can see there are no duplicates and the time periods do not overlap. What am I doing wrong? I am using R version 3.0.1 on Windows 7 64-bit, if that makes a difference.
> colnames(z2)
[1] "Amb" "HWS" "Diff"
> colnames(t.tmp)
[1] "Amb" "HWS" "Diff"
> max(index(z2))
[1] "2012-12-06 02:17:45 GMT"
> min(index(t.tmp))
[1] "2012-12-06 03:43:45 GMT"
> anyDuplicated(c(index(z2),index(t.tmp)))
[1] 0
> c(z2,t.tmp)
Error in rbind.zoo(...) : indexes overlap
>
UPDATE: In trying to make a reproducible case I've concluded this is an implementation error due to the large number of rows I'm dealing with: it fails if the final result is more than 311434 rows long.
> nrow(c(z2,head(t.tmp,n=101958)))
Error in rbind.zoo(...) : indexes overlap
> nrow(c(z2,head(t.tmp,n=101957)))
[1] 311434
# but row 101958 inserts fine on its own, so it's not a data problem.
> nrow(c(z2,tail(head(t.tmp,n=101958),n=2)))
[1] 209479
I'm sorry, but I don't have the R scripting skills to produce a zoo of the critical length; hopefully someone might be able to help me out.
UPDATE 2 - Responding to Jason's suggestion: the problem is in MATCH, but my R skills aren't sufficient to know how to interpret it. Does it mean MATCH finds a duplicate value in x.t whereas anyDuplicated does not?
> x.t <- c(index(z2),index(t.tmp));
> length(x.t)
[1] 520713
> ix <- ORDER (x.t)
> length(ix)
[1] 520713
> x.t <- x.t[ix]
> length(ix)
[1] 520713
> length(x.t)
[1] 520713
> tx <- table(MATCH(x.t,x.t))
> max(tx)
[1] 2
> tx[which(tx==2)]
311371 311373 311378 311383 311384 311386 311389 311392 311400 311401
2 2 2 2 2 2 2 2 2 2
> anyDuplicated(x.t)
[1] 0
After all the testing and head scratching it seems that the problem I'm having is timezone related. Setting the environment to the same time zone as the original data makes it work just fine.
Sys.setenv(TZ="GMT")
> z3<-rbind(z2,t.tmp)
> nrow(z3)
[1] 520713
Thanks to "how to guard against accidental time zone conversion" for the inspiration to look in that direction.
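The thread does not say which time zone the machine was in or exactly how the index values were compared, so the following is only one possible mechanism, shown with made-up timestamps: two distinct instants can share the same wall-clock label in a zone with daylight saving time, while remaining distinct in GMT. The zone "Europe/London" and the dates are assumptions chosen for illustration.

```r
# Two distinct instants, one hour apart, straddling the end of British Summer Time.
x <- as.POSIXct(c("2012-10-28 00:30:00", "2012-10-28 01:30:00"), tz = "UTC")

# Compared as numbers, the instants are distinct.
dup_instants <- anyDuplicated(x)

# Rendered as wall-clock labels, they collide in the DST zone but not in GMT.
local_labels <- format(x, "%Y-%m-%d %H:%M:%S", tz = "Europe/London")
gmt_labels   <- format(x, "%Y-%m-%d %H:%M:%S", tz = "GMT")

dup_local <- anyDuplicated(local_labels)
dup_gmt   <- anyDuplicated(gmt_labels)
```

Any comparison that goes through local wall-clock time could therefore see duplicates that anyDuplicated() on the raw index does not, which would be consistent with the fix of setting TZ to the zone of the original data.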
