How to change working directory in asynchronous futures in R

I am trying to change the working directory in a future worker process, carry out some operations, and exit. The problem is that I am not able to set the working directory.
The following toy example works fine
library(future)
dirNames <- as.character(c(1:4))
sapply(dirNames, function(x) if (!dir.exists(x)) dir.create(x))
plan(multiprocess, workers = 2)

b <- list()
for (i in seq_along(dirNames)) {
  sleeptime <- 10
  if (i > 3) sleeptime <- 50
  a <- future({
    # setwd(dirNames[i])
    Sys.sleep(sleeptime)
    return(2)
  })
  print(i)
  b[[dirNames[i]]] <- a
}
lapply(b, resolved)
lapply(b[1:2], value)
lapply(b, value)
but if I uncomment the setwd(dirNames[i]) line, I get the following error when running the code:
Error in setwd(dirNames[i]) : cannot change working directory
How can I change working directory successfully?

I figured out a solution while playing around with the script.
library(future)
dirNames <- as.character(c(1:4))
sapply(dirNames, function(x) if (!dir.exists(x)) dir.create(x))
plan(multiprocess, workers = 2)

b <- list()
for (i in seq_along(dirNames)) {
  sleeptime <- 10
  if (i > 3) sleeptime <- 50
  a <- future({
    currDir <- getwd()
    on.exit(setwd(currDir))
    setwd(dirNames[i])
    Sys.sleep(sleeptime)
    return(2)
  })
  print(i)
  b[[dirNames[i]]] <- a
}
lapply(b, resolved)
lapply(b[1:2], value)
lapply(b, value)
I believe that once a worker's working directory is changed in the first few iterations, it stays set to the new directory for the remaining iterations, so later relative paths (which refer to the old directory) no longer resolve.
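An alternative that sidesteps the problem is to not change the worker's working directory at all and build absolute paths instead. Here is a minimal, untested sketch along those lines (it writes a placeholder file, result.txt, into each directory; the file name is just an example):
library(future)
plan(multiprocess, workers = 2)

dirPaths <- normalizePath(dirNames)  # absolute paths, resolved once in the main session
b <- list()
for (i in seq_along(dirNames)) {
  a <- future({
    # write output via an absolute path instead of calling setwd()
    writeLines("done", file.path(dirPaths[i], "result.txt"))
    2
  })
  b[[dirNames[i]]] <- a
}
lapply(b, value)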

Related

Read many files in parallel and extract data

I have 1000 JSON files and I would like to read them in parallel. I have 4 CPU cores.
I have a character vector with the names of all the files, as follows:
cik_files <- list.files("./data/", pattern = ".json")
Using this vector, I load each file, extract the data, and add it to the following list:
data <- list()
Below is the code for extracting the data:
library(jsonlite)

for (i in 1:1000) {
  data1 <- fromJSON(paste0("./data/", cik_files[i]), flatten = TRUE)
  if ("NetIncomeLoss" %in% names(data1$facts$`us-gaap`)) {
    data1 <- data1$facts$`us-gaap`$NetIncomeLoss$units$USD
    data1 <- data1[grep("CY20[0-9]{2}$", data1$frame), c(3, 9)]
    try({
      if (nrow(data1) > 0) {
        data1$cik <- strtrim(cik_files[i], 13)
        data[[length(data) + 1]] <- data1
      }
    }, silent = TRUE)
  }
}
This, however, takes quite a lot of time. So I was wondering how I can run the code within the for loop in parallel.
Thanks in advance.
Here is an attempt to solve the problem in the question. Untested, since there is no data.
Step 1
First of all, rewrite the loop in the question as a function.
f <- function(i, path = "./data", cik_files){
  filename <- file.path(path, cik_files[i])
  data1 <- fromJSON(filename, flatten = TRUE)
  if ("NetIncomeLoss" %in% names(data1$facts$`us-gaap`)) {
    data1 <- data1$facts$`us-gaap`$NetIncomeLoss$units$USD
    found <- grep("CY20[0-9]{2}$", data1$frame)
    if (length(found) > 0) {
      tryCatch({
        out <- data1[found, c(3, 9)]
        out$cik <- strtrim(cik_files[i], 13)
        out
      },
      error = function(e) e,
      warning = function(w) w)
    } else NULL
  } else NULL
}
Step 2
Now load the package parallel and run one of the following, depending on OS.
library(parallel)

# Not on Windows
library(jsonlite)
json_list <- mclapply(seq_along(cik_files), f, cik_files = cik_files)

# Windows
ncores <- detectCores()
cl <- makeCluster(ncores - 1L)
clusterExport(cl, "cik_files")
clusterEvalQ(cl, library(jsonlite))  # load jsonlite on each worker
json_list <- parLapply(cl, seq_along(cik_files), f, cik_files = cik_files)
stopCluster(cl)
Step 3
Extract the data from the returned list json_list.
err <- sapply(json_list, inherits, "error")
warn <- sapply(json_list, inherits, "warning")
ok <- !(err | warn)
json_list[ok] # correctly read in
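If you then want a single data frame, a possible (untested) follow-up, assuming every successful element is a data frame with the same columns, is:
ok_list <- json_list[ok]
ok_list <- ok_list[!sapply(ok_list, is.null)]  # drop files where nothing matched
cik_data <- do.call(rbind, ok_list)            # one data frame with all the rows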

How to use tryCatch() to ignore the error in while loop in R

I have code that reads each line of my data frame's first column, visits the website, and then downloads the photo of each deputy. But it doesn't work properly, because some deputies don't have a photo yet.
That's why my code breaks and stops working. I tried to use "next" and if clauses, but it still didn't work. So a friend recommended that I use tryCatch(). I couldn't find enough information online, and the code still doesn't work.
The file is here:
https://gist.github.com/gabrielacaesar/940f3ef14eaf29d18c3780a66053bbee
deputados <- fread("dep-legislatura56-14jan2019.csv")
i <- 1
while (i <= 514) {
  this.could.go.wrong <- tryCatch(
    attemptsomething(),
    error = function(e) next
  )
  url <- deputados$uri[i]
  api_content <- rawToChar(GET(url)$content)
  pessoa_info <- jsonlite::fromJSON(api_content)
  pessoa_foto <- pessoa_info$dados$ultimoStatus$urlFoto
  download.file(pessoa_foto, basename(pessoa_foto), mode = "wb")
  Sys.sleep(0.5)
  i <- i + 1
}
Here is a solution using purrr:
library(purrr)

download_picture <- function(url){
  api_content <- rawToChar(httr::GET(url)$content)
  pessoa_info <- jsonlite::fromJSON(api_content)
  pessoa_foto <- pessoa_info$dados$ultimoStatus$urlFoto
  download.file(pessoa_foto, basename(pessoa_foto), mode = "wb")
}

walk(deputados$uri, possibly(download_picture, NULL))
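A possible (untested) variation, if you also want to know which URLs failed: purrr::safely() keeps any error alongside each result instead of discarding it.
safe_download <- purrr::safely(download_picture)
results <- purrr::map(deputados$uri, safe_download)

# URLs whose download raised an error
failed <- deputados$uri[purrr::map_lgl(results, function(r) !is.null(r$error))]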
Simply wrap tryCatch around the lines that can potentially raise errors and have it return NULL or NA in the error block:
i <- 1
while (i <= 514) {
  tryCatch({
    url <- deputados$uri[i]
    api_content <- rawToChar(GET(url)$content)
    pessoa_info <- jsonlite::fromJSON(api_content)
    pessoa_foto <- pessoa_info$dados$ultimoStatus$urlFoto
    download.file(pessoa_foto, basename(pessoa_foto), mode = "wb")
    Sys.sleep(0.5)
  }, error = function(e) return(NULL))
  i <- i + 1
}

Parallelized `Find` loop in R

There are several packages in R to simplify running code in parallel, like foreach and future. Most of these have constructs which are like lapply or a for loop: they carry on until all the tasks have finished.
Is there a simple parallel version of Find? That is, I would like to run several tasks in parallel. I don't need all of them to finish; I just need the first one that finishes (maybe with a particular result). After that the other tasks can be killed, or left to finish on their own.
Conceptual code:
hunt_needle <- function (x, y) x %in% (y-1000):y
x <- sample.int(1000000, 1)
result <- parallel_find(seq(1000, 1000000, 1000), hunt_needle)
# should return the first value for which hunt_needle is true
You can use shared memory so that processes can communicate with one another.
For that, you can use package bigstatsr (disclaimer: I'm the author).
Choose a block size and do:
# devtools::install_github("privefl/bigstatsr")
library(bigstatsr)

# Data example
cond <- logical(1e6)
cond[sample(length(cond), size = 1)] <- TRUE

ind.block <- bigstatsr:::CutBySize(length(cond), block.size = 1000)

cl <- parallel::makeCluster(nb_cores())
doParallel::registerDoParallel(cl)

# This value (in an on-disk matrix) is shared by processes
found_it <- FBM(1, 1, type = "integer", init = 0L)

library(foreach)
res <- foreach(ic = sample(rows_along(ind.block)), .combine = 'c') %dopar% {
  if (found_it[1]) return(NULL)
  ind <- bigstatsr:::seq2(ind.block[ic, ])
  find <- which(cond[ind])
  if (length(find)) {
    found_it[1] <- 1L
    return(ind[find[1]])
  } else {
    return(NULL)
  }
}
parallel::stopCluster(cl)

# Verification
all.equal(res, which(cond))
Basically, once a solution is found, there is no need to keep computing, and the other processes know this because you write a 1 into found_it, which is shared between all the processes.
As your question is not reproducible and I don't understand everything you need, you may have to adapt this solution a little bit.

How to make 'for loop' skip some iterations in between which shows error?

I am running a for loop over (1:1700) in R, loading different data in each iteration. But I am getting errors in some iterations (maybe because the corresponding data is missing).
I want to know if there is any way to skip the particular iterations that produce an error, so that the for loop completes all 1700 iterations, skipping the ones that raise errors.
I have to run a for loop, there is no other option.
You can use tryCatch within your loop. Here is an example where I loop from 1 to 5, and for some counter values I get an error (I create it here using stop()); I catch it and then continue for the other values of the counter.
for (i in 1:5) {   ## replace 5 by 1700
  tryCatch({
    if (i %in% c(2, 5)) stop("error")   ## simulate an error for some counter values
    print(i)   ## imagine you read a file here, or any more complicated process
  },
  error = function(e) print(paste(i, 'is error')))
}
[1] 1
[1] "2 is error"
[1] 3
[1] 4
[1] "5 is error"
I use try for such issues. It allows your loop to continue through the cycle of values without stopping at the error message.
Example
make data
set.seed(1)
dat <- vector(mode="list", 1800)
dat
tmp <- sample(1800, 900) # only some elements are filled with data
for (i in seq(tmp)) {
  dat[[tmp[i]]] <- rnorm(10)
}
dat
loop without try
# stops with an error
res <- vector(mode="list", length(dat))
for (i in seq(dat)) {
  res[[i]] <- log(dat[[i]]) # error raised when trying to take the log of a NULL element
}
loop with try
# cycles through despite the errors
res <- vector(mode="list", length(dat))
for (i in seq(dat)) {
  res[[i]] <- try(log(dat[[i]]), silent = TRUE)
}
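A possible (untested) follow-up, if you then want to keep only the elements where log() succeeded:
ok <- !sapply(res, inherits, "try-error")
res_ok <- res[ok]  # drops the elements that failed inside try()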

Parallelize and speed up R code to read in many files

I have code that works perfectly for my purpose (it reads files matching a specific pattern, reads the matrix within each file, and computes something for each file pair; the final output is a matrix whose dimensions equal the number of files), and it looks like this:
m <- 100
output <- matrix(0, m, m)
lista <- list.files(pattern = "q")
listan <- as.matrix(lista)
n <- nrow(listan)

for (i in 1:n) {
  AA <- read.table(listan[i, ], header = FALSE)
  A <- as.matrix(AA)
  dVarX <- sqrt(mean(A * A))
  for (j in i:n) {
    BB <- read.table(listan[j, ], header = FALSE)
    B <- as.matrix(BB)
    V <- sqrt(dVarX * (sqrt(mean(B * B))))
    output[i, j] <- (sqrt(mean(A * B))) / V
  }
}
My problem is that it takes a lot of time (I have about 5000 matrices, which means 5000 x 5000 loops).
I would like to parallelize, but I need some help!
Waiting for your kind suggestions!
Thank you in advance!
Gab
The bottleneck is likely reading from disk. Running code in parallel isn't guaranteed to make things faster. In this case, multiple processes attempting to read from the same disk at the same time is likely to be even slower than a single process.
Since your matrices are being written by another R process, you really should save them in R's binary format. You're reading every matrix once and only once, so the only way to make your program faster is to make reading from disk faster.
Here's an example that shows you how much faster it could be:
# make some random data and write it to disk
set.seed(21)
for (i in 0:9) {
  m <- matrix(runif(700*700), 700, 700)
  f <- paste0("f", i)
  write(m, f, 700)              # text format
  saveRDS(m, paste0(f, ".rds")) # binary format
}

# initialize two output objects
m <- 10
o1 <- o2 <- matrix(NA, m, m)

# get list of file names
files <- list.files(pattern="^f[[:digit:]]+$")
n <- length(files)
First, let's run your code using scan, which is already a lot faster than your current solution with read.table.
system.time({
  for (i in 1:n) {
    A <- scan(files[i], quiet = TRUE)
    for (j in i:n) {
      B <- scan(files[j], quiet = TRUE)
      o1[i,j] <- sqrt(mean(A*B)) / sqrt(sqrt(mean(A*A)) * sqrt(mean(B*B)))
    }
  }
})
#    user  system elapsed
#   31.37    0.78   32.58
Now, let's re-run that code using the files saved in R's binary format:
system.time({
  for (i in 1:n) {
    fA <- paste0(files[i], ".rds")
    A <- readRDS(fA)
    for (j in i:n) {
      fB <- paste0(files[j], ".rds")
      B <- readRDS(fB)
      o2[i,j] <- sqrt(mean(A*B)) / sqrt(sqrt(mean(A*A)) * sqrt(mean(B*B)))
    }
  }
})
#    user  system elapsed
#    2.42    0.39    2.92
So the binary format is ~10x faster! And the output is the same:
all.equal(o1,o2)
# [1] TRUE
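A further idea, untested and only applicable if all the matrices fit in memory: read each .rds file once into a list and reuse the in-memory copies, so no file is read more than once.
mats <- lapply(files, function(f) readRDS(paste0(f, ".rds")))
o3 <- matrix(NA, n, n)
for (i in 1:n) {
  A <- mats[[i]]
  for (j in i:n) {
    B <- mats[[j]]
    o3[i,j] <- sqrt(mean(A*B)) / sqrt(sqrt(mean(A*A)) * sqrt(mean(B*B)))
  }
}
all.equal(o1, o3)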
