I'm using snow's parApply() to distribute processing tasks to a number of workers on the local machine. The problem is that if I change the code in one of the functions, the workers will not be aware of the changes.
How can I 're-source' the source code files in the workers?
EDIT
I can't simply call source() on the cluster to re-evaluate all my functions:
> library(snow)
> cl = makeSOCKcluster(rep("localhost", 5))
> clusterCall(cl, getwd)
[[1]]
[1] "/home/user"
[[2]]
[1] "/home/user"
[[3]]
[1] "/home/user"
[[4]]
[1] "/home/user"
[[5]]
[1] "/home/user"
> clusterCall(cl, source, 'ets.load.R')
Error in checkForRemoteErrors(lapply(cl, recvResult)) :
5 nodes produced errors; first error: cannot open the connection
Update it in each worker using parallel::clusterCall().
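A minimal sketch of that approach, assuming the functions live in ets.load.R and using an absolute path so every worker can open the file (the path itself is a placeholder; with a cluster made by parallel::makeCluster the calls are the same):
```
## cl is the cluster from the question
clusterCall(cl, source, "/full/path/to/ets.load.R")   # re-evaluate the definitions on every worker
## equivalently: clusterEvalQ(cl, source("/full/path/to/ets.load.R"))
```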
Related
I wrote a function to run in parallel in R, but it doesn't seem to work. The code is:
```
library(parallel)   # assumed; needed for makeCluster, clusterExport and parLapply
rm(list=ls())
square<-function(x){
library(Iso)
y=ufit(x,lmode<-2,x<-c(1:length(x)),type="b")[[2]]
return(y)
}
num<-c(1,2,1,4)
cl <- makeCluster(getOption("cl.cores",2))
clusterExport(cl,"square")
results<-parLapply(cl,num,square)
stopCluster(cl)
```
and the error is:
Error in checkForRemoteErrors(val) :
2 nodes produced errors; first error: cannot open the connection
I think a possible reason is that I used the Iso package inside the function, but I don't know how to solve it.
You have to export your functions to, and load the required packages on, each cluster worker if you want to run this in parallel:
library(doSNOW)
## the rest is the same
rm(list=ls())
square<-function(x){
y=ufit(x,lmode<-2,x<-c(1:length(x)),type="b")[[2]]
return(y)
}
num<-c(1,2,1,4)
cl <- makeCluster(getOption("cl.cores",2))
clusterExport(cl,"square")
clusterEvalQ(cl,library(Iso))
## here you should see something like this, where each worker prints its attached packages
[[1]]
[1] "Iso" "snow" "stats" "graphics" "grDevices" "utils" "datasets" "methods" "base"
[[2]]
[1] "Iso" "snow" "stats" "graphics" "grDevices" "utils" "datasets" "methods" "base"
## then just call it the same way as with the parallel package
results<-parLapply(cl,num,square)
stopCluster(cl)
## an alternative is to call Iso::ufit directly (see the sketch below)
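That alternative might look like the following sketch (the ufit() call is kept exactly as in the question, only qualified with Iso::; the package still has to be installed on every worker):
```
square <- function(x){
  ## Iso:: loads the namespace on demand on each worker, so no clusterEvalQ is needed
  y <- Iso::ufit(x, lmode<-2, x<-c(1:length(x)), type="b")[[2]]
  return(y)
}
results <- parLapply(cl, num, square)
```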
I stored a list of files in a list using this code:
filesList <- list.files(path="/Users/myPath/data/", pattern="*.csv")
I then wanted to output it without the indexes (the [1] that usually appears at the start of each line), so I tried this:
sapply(filesList[order(filesList)], print)
The result is given below, copied exactly from RStudio. Why is my list of files output twice? I can work with this; I am just curious.
[1] "IMDB_Bottom250movies.csv"
[1] "IMDB_Bottom250movies2_OMDB_Detailed.csv"
[1] "IMDB_Bottom250movies2.csv"
[1] "IMDB_ErrorLogIDs1_OMDB_Detailed.csv"
[1] "IMDB_ErrorLogIDs1.csv"
[1] "IMDB_ErrorLogIDs2_OMDB_Detailed.csv"
[1] "IMDB_ErrorLogIDs2.csv"
[1] "IMDB_OMDB_Kaggle_TestSet_OMDB_Detailed.csv"
[1] "IMDB_OMDB_Kaggle_TestSet.csv"
[1] "IMDB_Top250Engmovies.csv"
[1] "IMDB_Top250Engmovies2_OMDB_Detailed.csv"
[1] "IMDB_Top250Engmovies2.csv"
[1] "IMDB_Top250Indianmovies.csv"
[1] "IMDB_Top250Indianmovies2_OMDB_Detailed.csv"
[1] "IMDB_Top250Indianmovies2.csv"
[1] "IMDB_Top250movies.csv"
[1] "IMDB_Top250movies2_OMDB_Detailed.csv"
[1] "IMDB_Top250movies2.csv"
[1] "TestDoc2_KaggleData_OMDB_Detailed.csv"
[1] "TestDoc2_KaggleData.csv"
[1] "TestDoc2_KaggleData68_OMDB_Detailed.csv"
[1] "TestDoc2_KaggleData68.csv"
[1] "TestDoc2_KaggleDataHUGE_OMDB_Detailed.csv"
[1] "TestDoc2_KaggleDataHUGE.csv"
IMDB_Bottom250movies.csv IMDB_Bottom250movies2_OMDB_Detailed.csv
"IMDB_Bottom250movies.csv" "IMDB_Bottom250movies2_OMDB_Detailed.csv"
IMDB_Bottom250movies2.csv IMDB_ErrorLogIDs1_OMDB_Detailed.csv
"IMDB_Bottom250movies2.csv" "IMDB_ErrorLogIDs1_OMDB_Detailed.csv"
IMDB_ErrorLogIDs1.csv IMDB_ErrorLogIDs2_OMDB_Detailed.csv
"IMDB_ErrorLogIDs1.csv" "IMDB_ErrorLogIDs2_OMDB_Detailed.csv"
IMDB_ErrorLogIDs2.csv IMDB_OMDB_Kaggle_TestSet_OMDB_Detailed.csv
"IMDB_ErrorLogIDs2.csv" "IMDB_OMDB_Kaggle_TestSet_OMDB_Detailed.csv"
IMDB_OMDB_Kaggle_TestSet.csv IMDB_Top250Engmovies.csv
"IMDB_OMDB_Kaggle_TestSet.csv" "IMDB_Top250Engmovies.csv"
IMDB_Top250Engmovies2_OMDB_Detailed.csv IMDB_Top250Engmovies2.csv
"IMDB_Top250Engmovies2_OMDB_Detailed.csv" "IMDB_Top250Engmovies2.csv"
IMDB_Top250Indianmovies.csv IMDB_Top250Indianmovies2_OMDB_Detailed.csv
"IMDB_Top250Indianmovies.csv" "IMDB_Top250Indianmovies2_OMDB_Detailed.csv"
IMDB_Top250Indianmovies2.csv IMDB_Top250movies.csv
"IMDB_Top250Indianmovies2.csv" "IMDB_Top250movies.csv"
IMDB_Top250movies2_OMDB_Detailed.csv IMDB_Top250movies2.csv
"IMDB_Top250movies2_OMDB_Detailed.csv" "IMDB_Top250movies2.csv"
TestDoc2_KaggleData_OMDB_Detailed.csv TestDoc2_KaggleData.csv
"TestDoc2_KaggleData_OMDB_Detailed.csv" "TestDoc2_KaggleData.csv"
TestDoc2_KaggleData68_OMDB_Detailed.csv TestDoc2_KaggleData68.csv
"TestDoc2_KaggleData68_OMDB_Detailed.csv" "TestDoc2_KaggleData68.csv"
TestDoc2_KaggleDataHUGE_OMDB_Detailed.csv TestDoc2_KaggleDataHUGE.csv
"TestDoc2_KaggleDataHUGE_OMDB_Detailed.csv" "TestDoc2_KaggleDataHUGE.csv"
The second copy (without the indexes) is close enough to copy, paste, and use; I'm just wondering why this happened.
What is happening here is that sapply() calls print() on each element of filesList[order(filesList)], which writes each file name to the console (the lines with [1]). Then RStudio auto-prints the result of the sapply() call itself, which is the named character vector of values returned by print(). You can use cat() to print the values without the [1] indexes, or wrap the sapply() call in invisible() to suppress its return value. https://stackoverflow.com/a/12985020/6490232
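For example (a small sketch using the filesList object from the question):
```
## print the sorted names without the [1] indexes
cat(filesList[order(filesList)], sep = "\n")
## or keep print() but suppress the vector that sapply() returns
invisible(sapply(filesList[order(filesList)], print))
```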
Follow up to this
I want to source scripts inside a given environment, like in sys.source, but "exporting" only some functions and keeping the others private.
I created this function:
source2=function(script){
  ## a hidden environment ("<script>_") and a public environment ("<script>"),
  ## with the public one a child of the hidden one
  ps=paste0(script, "_")
  assign(ps, new.env(parent=baseenv()))
  assign(script, new.env(parent=get(ps)))
  ## private() moves an object from the public environment into the hidden parent
  private=function(f){
    fn=deparse(substitute(f))
    assign(fn, f, parent.env(parent.frame()))
    rm(list=fn, envir=parent.frame())
  }
  assign("private", private, get(script))
  ## evaluate the script in the public environment, drop the helper, attach the result
  sys.source(paste0(script, ".R"), envir=get(script))
  rm(private, envir=get(script))
  attach(get(script), name=script)
}
For the most part, this function works as expected.
Consider the script:
## foo.R
f=function() g()
g=function() print('hello')
private(g)
Note the private() function, which will hide g().
If I then, so to speak, import the module foo:
source2("foo")
I have a new environment in the search path:
search()
## [1] ".GlobalEnv" "foo" "package:stats"
## [4] "package:graphics" "package:grDevices" "package:utils"
## [7] "package:datasets" "package:methods" "Autoloads"
## [10] "package:base"
The current environment, .GlobalEnv, shows only:
ls()
## [1] "source2"
But if I list items in foo environment:
ls("foo")
## [1] "f"
Therefore I can run:
f()
## [1] "hello"
The problem is that g() is now completely hidden.
getAnywhere(g)
## no object named 'g' was found
Too well hidden, in fact: if I want to debug f():
debug(f)
f()
debugging in: f()
## Error in f() : could not find function "g"
The question is:
Where is g()? Can I still retrieve it?
g() still exists: it lives in the hidden "foo_" environment that source2() created as the parent of environment(f), so lexical scoping from f's environment can reach it. Use:
get("g",env=environment(f))
## function ()
## print("hello")
## <environment: 0x0000000018780c30>
ls(parent.env(environment(f)))
## [1] "g"
Credit goes to Alexander Griffith for the solution.
> mybrowser$navigate("http://bitcointicker.co/transactions/")
> a <- mybrowser$findElement(using = 'css selector',"#transactionscontainer")
> a
[1] "remoteDriver fields"
$remoteServerAddr
[1] "localhost"
$port
[1] 4444
$browserName
[1] "firefox"
$version
[1] ""
$platform
[1] "ANY"
$javascript
[1] TRUE
$autoClose
[1] FALSE
$nativeEvents
[1] TRUE
$extraCapabilities
list()
[1] "webElement fields"
$elementId
[1] "0"
I am trying to web-scrape live data using RSelenium and rvest. I am planning to create a control loop with a timer to run every minute, but I am struggling with dynamically exporting the data to a folder on my computer. Ideally I would create one output file and have R append new rows to it automatically, although I am not sure whether this is possible in R.
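A minimal sketch of the "one output file, append a row every minute" idea, assuming the text of the #transactionscontainer element is what should be recorded and that transactions.csv is a placeholder output path:
```
## mybrowser is the remoteDriver already created above
outfile <- "transactions.csv"
repeat {
  a   <- mybrowser$findElement(using = "css selector", "#transactionscontainer")
  row <- data.frame(time = Sys.time(), text = a$getElementText()[[1]])
  if (!file.exists(outfile)) {
    write.csv(row, outfile, row.names = FALSE)            # first run: create the file with a header
  } else {
    write.table(row, outfile, sep = ",", append = TRUE,
                row.names = FALSE, col.names = FALSE)     # later runs: append one row
  }
  Sys.sleep(60)                                           # wait a minute before scraping again
}
```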
I am working on a cluster and am using the snowfall package to establish a socket cluster on 5 nodes with 40 CPUs each with the following command:
> sfInit(parallel=TRUE, cpus = 200, type="SOCK", socketHosts=c("host1", "host2", "host3", "host4", "host5"));
R Version: R version 3.1.0 (2014-04-10)
snowfall 1.84-6 initialized (using snow 0.3-13): parallel execution on 5 CPUs.
I am seeing a much lower load on the slaves than expected when I check the cluster report, and I am disconcerted by the fact that it says "parallel execution on 5 CPUs" instead of "parallel execution on 200 CPUs". Is this merely an ambiguous reference to CPUs, or are the hosts only running one CPU each?
EDIT: Here is an example of why this concerns me. If I only use the local machine and specify the maximum number of cores, I get:
> sfInit(parallel=TRUE, type="SOCK", cpus = 40);
snowfall 1.84-6 initialized (using snow 0.3-13): parallel execution on 40 CPUs.
I ran an identical job on the single-node, 40-CPU cluster and it took 1.4 minutes, while the 5-node, apparently 5-CPU cluster took 5.22 minutes. To me this confirms my suspicion that I am parallelizing across 5 nodes but only using 1 CPU on each node.
My question is then: how do you turn on all CPUs for use across all available nodes?
EDIT: @SimonG I used the underlying snow package's initialization, and we can clearly see that only 5 workers are being started:
> cl <- makeSOCKcluster(names = c("host1", "host2", "host3", "host4", "host5"), count = 200)
> clusterCall(cl, runif, 3)
[[1]]
[1] 0.9854311 0.5737885 0.8495582
[[2]]
[1] 0.7272693 0.3157248 0.6341732
[[3]]
[1] 0.26411931 0.36189866 0.05373248
[[4]]
[1] 0.3400387 0.7014877 0.6894910
[[5]]
[1] 0.2922941 0.6772769 0.7429913
> stopCluster(cl)
> cl <- makeSOCKcluster(names = rep("localhost", 40), count = 40)
> clusterCall(cl, runif, 3)
[[1]]
[1] 0.6914666 0.7273244 0.8925275
[[2]]
[1] 0.3844729 0.7743824 0.5392220
[[3]]
[1] 0.2989990 0.7256851 0.6390770
[[4]]
[1] 0.07114831 0.74290601 0.57995908
[[5]]
[1] 0.4813375 0.2626619 0.5164171
.
.
.
[[39]]
[1] 0.7912749 0.8831164 0.1374560
[[40]]
[1] 0.2738782 0.4100779 0.0310864
I think this shows it pretty clearly. I tried this in desperation:
> cl <- makeSOCKcluster(names = rep(c("host1", "host2", "host3", "host4", "host5"), each = 40), count = 200)
and predictably got:
Error in socketConnection(port = port, server = TRUE, blocking = TRUE, :
all connections are in use
After thoroughly reading the snow documentation, I have come up with a (partial) solution.
I read that only 128 connections may be open at once in R as distributed, and have found it to be true: I can start 25 workers on each node, but the cluster will not start if I try to start 26 on each (3 of the 128 connections are always taken by stdin, stdout, and stderr, leaving 125, i.e. 25 per node across 5 nodes). Here is the proper structure of the host list that needs to be passed to makeCluster:
> library(snow);
> unixHost13 <- list(host = "host1");
> unixHost14 <- list(host = "host2");
> unixHost19 <- list(host = "host3");
> unixHost29 <- list(host = "host4");
> unixHost30 <- list(host = "host5");
> kCPUs <- 25;
> hostList <- c(rep(list(unixHost13), kCPUs), rep(list(unixHost14), kCPUs), rep(list(unixHost19), kCPUs), rep(list(unixHost29), kCPUs), rep(list(unixHost30), kCPUs));
> cl <- makeCluster(hostList, type = "SOCK")
> clusterCall(cl, runif, 3)
[[1]]
[1] 0.08430941 0.64479036 0.90402362
[[2]]
[1] 0.1821656 0.7689981 0.2001639
[[3]]
[1] 0.5917363 0.4461787 0.8000013
.
.
.
[[123]]
[1] 0.6495153 0.6533647 0.2636664
[[124]]
[1] 0.75175580 0.09854553 0.66568129
[[125]]
[1] 0.79336203 0.61924813 0.09473841
I found a reference saying that in order to raise the connection limit, R needs to be rebuilt with NCONNECTIONS set higher (see here).
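For completeness, translating the 25-workers-per-host layout back into the original snowfall call would presumably look like this (an untested sketch; host names as in the question):
```
library(snowfall)
hosts <- rep(c("host1", "host2", "host3", "host4", "host5"), each = 25)
sfInit(parallel = TRUE, cpus = length(hosts), type = "SOCK", socketHosts = hosts)
```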