Allowing (or circumventing) parallel write access to a file

I'm calling a Windows executable from multiple parallel R processes (via a system call within parSapply). This .exe (let's call it my.exe) is passed a filename as an argument and processes that file (the details are probably irrelevant). Unfortunately, my.exe creates a log file (in the same directory as my.exe) that it writes to while it runs, and, since the log file's name is fixed, subsequent calls to my.exe from other R processes result in my.exe throwing the error:
Cannot create result file "log.res".
Do you have write access in the current directory?
I've managed to work around this by creating multiple copies of my.exe (as many as the number of cores in my cluster, i.e. 7). I can then ensure that each copy is in use by only a single R process at any one time, by passing to the cores a vector of 7 paths to .bat files, each of which repeatedly calls a given copy of my.exe.
Is there a more elegant way to deal with this issue, perhaps by having the processes create virtual instances of my.exe automagically? I don't require the log files.
Since this is an error thrown by the program and not by R, I suspect there might be no way to permit concurrent write access to the log file from the R side of things.
Ideally, I want to be doing something like this:
ff <- c('a', 'long', 'vector', 'of', 'file', 'paths') # abbreviated
parSapply(cl, ff, function(f) system(sprintf("my.exe %s", f)))
but instead I've resorted to doing (more or less) this (after copying my.exe to c:/1/, c:/2/, c:/3/, through c:/7/):
cat(paste('CALL C:/1/my.exe', ff[1:10], '/RUN=YES'), file='run1.bat', sep='\n')
cat(paste('CALL C:/2/my.exe', ff[11:20], '/RUN=YES'), file='run2.bat', sep='\n')
cat(paste('CALL C:/3/my.exe', ff[21:30], '/RUN=YES'), file='run3.bat', sep='\n')
cat(paste('CALL C:/4/my.exe', ff[31:40], '/RUN=YES'), file='run4.bat', sep='\n')
cat(paste('CALL C:/5/my.exe', ff[41:50], '/RUN=YES'), file='run5.bat', sep='\n')
cat(paste('CALL C:/6/my.exe', ff[51:60], '/RUN=YES'), file='run6.bat', sep='\n')
cat(paste('CALL C:/7/my.exe', ff[61:70], '/RUN=YES'), file='run7.bat', sep='\n')
parSapply(cl, c('run1.bat', 'run2.bat', 'run3.bat', 'run4.bat',
'run5.bat', 'run6.bat', 'run7.bat'), system)
(Above, instead of letting parSapply assign the 70 elements of ff to the various processes, I manually split them when creating the batch files, and then run the batch files in parallel.)

It sounds like your basic strategy is the only known solution to the problem, but I think it can be done more elegantly. For instance, you could avoid creating .BAT files by having each worker execute a different command line based on a worker ID. The worker ID could be assigned using:
# Assign worker ID's to each of the cluster workers
setid <- function(id) assign(".Worker.id", id, pos=globalenv())
clusterApply(cl, seq_along(cl), setid)
Also, you may want to automate the creation of the directories that contain "my.exe". I also prefer to use a symlink rather than a copy of the executable:
# Create directories containing a symlink to the real executable
exepath <- "C:/bin/my.exe" # Path to the real executable
pdir <- getwd() # Parent of the new executable directories
myexe <- file.path(pdir, sprintf("work_%d", seq_along(cl)), "my.exe")
for (x in myexe) {
  dir.create(dirname(x), showWarnings=FALSE)
  if (file.exists(x))
    unlink(x)
  file.symlink(exepath, x)
}
If symlinks don't fool "my.exe" into creating the log file in the desired directory, you could try using "file.copy" instead of "file.symlink".
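That fallback would be a small change to the loop above; here is a minimal sketch (reusing exepath and myexe from above):
for (x in myexe) {
  dir.create(dirname(x), showWarnings=FALSE)
  file.copy(exepath, x, overwrite=TRUE)  # real copies instead of symlinks
}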
Now you can run your parallel job using:
# Each worker executes a different symlink to the real executable
worker.fun <- function(f, myexe) {
  system(sprintf("%s %s /RUN=YES", myexe[.Worker.id], f))
}
ff <- c('a', 'long', 'vector', 'of', 'file', 'paths')
parSapply(cl, ff, worker.fun, myexe)
You could also delete the directories that were created, but they don't use much space since symlinks are used, so it might be better to keep them, especially during debugging/testing.
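If you do decide to clean up, a one-liner sketch (again reusing the myexe paths from above):
# Remove the per-worker directories (and the symlinks inside them) once the job is done
unlink(dirname(myexe), recursive=TRUE)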

Related

How do I download images from a server and then upload them to a website using R?

Okay, so I have approximately 2 GB worth of files (images and whatnot) stored on a server (I'm using Cygwin right now since I'm on Windows), and I was wondering whether I can get all of this data into R and then eventually transfer it to a website where people can view/download those images?
I currently have installed the ssh package and have logged into my server using:
ssh::ssh_connect("name_and_server_ip_here")
I've been able to successfully connect; however, I am not particularly sure how to locate the files on the server through R. I assume I would use something like scp_download to download the files from the server, but as mentioned before, I am not particularly sure how to locate the files on the server, so I wouldn't be able to download them anyway (yet)!
Any sort of feedback and help would be appreciated! Thanks :)
You can use ssh::ssh_exec_internal and a shell find command to locate the files.
sess <- ssh::ssh_connect("r2@myth", passwd="...")
out <- ssh::ssh_exec_internal(sess, command = "find /home/r2/* -maxdepth 3 -type f -iname '*.log'")
str(out)
# List of 3
# $ status: int 0
# $ stdout: raw [1:70] 2f 68 6f 6d ...
# $ stderr: raw(0)
The stdout/stderr components are raw (it's feasible that the remote command did not produce ascii data), so we can use rawToChar to convert. (This may not be console-safe if you have non-ascii data, but here the output is plain ascii, so I'll go with it.)
rawToChar(out$stdout)
# [1] "/home/r2/logs/dns.log\n/home/r2/logs/ping.log\n/home/r2/logs/status.log\n"
remote_files <- strsplit(rawToChar(out$stdout), "\n")[[1]]
remote_files
# [1] "/home/r2/logs/dns.log" "/home/r2/logs/ping.log" "/home/r2/logs/status.log"
For downloading, scp_download is not vectorized, so we can only download one file at a time.
for (rf in remote_files) ssh::scp_download(sess, files = rf, to = ".")
# 4339331 C:\Users\r2\.../dns.log
# 36741490 C:\Users\r2\.../ping.log
# 17619010 C:\Users\r2\.../status.log
For uploading, scp_upload is vectorized, so we can send all in one shot. I'll create a new directory (just for this example, and to not completely clutter my remote server :-), and then upload them.
ssh::ssh_exec_wait(sess, "mkdir '/home/r2/newlogs'")
# [1] 0
ssh::scp_upload(sess, files = basename(remote_files), to = "/home/r2/newlogs/")
# [100%] C:\Users\r2\...\dns.log
# [100%] C:\Users\r2\...\ping.log
# [100%] C:\Users\r2\...\status.log
# [1] "/home/r2/newlogs/"
(I find it odd that scp_upload is vectorized while scp_download is not. If this were on a shell/terminal, then each call to scp would need to connect, authenticate, copy, then disconnect, a bit inefficient; since we're using a saved session, I believe (unverified) that there is little efficiency lost due to not vectorizing the R function ... though it is still really easy to vectorize it.)
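If you do want a vectorized download, a thin wrapper is enough; a minimal sketch (the function name is my own):
scp_download_all <- function(sess, files, to = ".") {
  # Loop over the remote paths, reusing the already-authenticated session each time
  invisible(lapply(files, function(f) ssh::scp_download(sess, files = f, to = to)))
}
scp_download_all(sess, remote_files, to = ".")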

R system() error when using brace expansion to match folders in linux

I have a series of sequential directories to gather files from on a linux server that I am logging into remotely and processing from an R terminal.
/r18_060, /r18_061, ... /r18_118, /r18_119
Each directory is for the day of year the data was logged, and it contains a series of files with a standard prefix, such as "fl.060.gz".
I have to supply a function that contains multiple system() commands with a Linux glob for the day. I want to divide the year into 60-day intervals to make the QA/QC more manageable. Since I'm crossing from 099 to 100 in the glob, I have to use brace expansion to match the correct sequence of days.
ls -d /root_directory/r18_{0[6-9]?,1[0-1]?}
ls -d /root_directory/r18_{060..119}
Both of these work fine when I manually enter them in my bash shell, but I get an error when the system() function issues a similar command from R.
day_glob <- "{060..119}"
system(paste("zcat /root_directory/r18_", day_glob, "/fl.???.gz > tmpfile", sep = ""))
>gzip: cannot access '/root_directory/r18_{060..119}': No such file or directory
I know that this could be an error in the shell that the system() function operates in, but when I query that shell it reports the expected environment and user name:
system("env | grep ^SHELL=")
>SHELL=/bin/bash
system("echo $USER")
>tgw
Does anyone know why this fails when it is passed through R's system() command? What can I do to get around this problem without removing the system call altogether? There are many scripts that rely on these functions, and re-writing the entire family of R scripts would be time prohibitive.
Previously I had been using 50-day intervals which avoids this problem, but I thought this should be something easy to change, and make one less iteration of my QA/QC scripts per year. I'm new to the linux OS so I figured I might just be missing something obvious.
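One likely explanation is that system() hands the command to /bin/sh rather than to your login shell, and on many systems /bin/sh does not perform brace expansion even when $SHELL reports bash. A sketch of a workaround that avoids shell expansion entirely by building the file list in R (paths follow the example above):
days  <- sprintf("%03d", 60:119)                                    # "060" .. "119"
files <- Sys.glob(paste0("/root_directory/r18_", days, "/fl.???.gz"))  # expand the glob in R
system(paste("zcat", paste(files, collapse = " "), "> tmpfile"))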

How to prevent R from overwriting files?

I'm looking for a way to prevent R from overwriting files during the session. The more general the solution, the better.
Currently I have a bunch of functions called, e.g., safe.save, safe.png, and safe.write.table, which are implemented more or less as
safe.smth <- function(..., file) {
  if (file.exists(file))
    stop("File exists!")
  else
    smth(..., file=file)
}
It works, but only if I have control over the execution. If some function that isn't mine creates a file, I can't stop it from overwriting.
Another way is to set the read-only flag on files, which also stops R from overwriting existing files. But this has drawbacks as well (e.g., you don't know in advance which files need to be protected).
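For example, something like this (a sketch using Sys.chmod; the filename comes from the test case below):
Sys.chmod("test_001.csv", mode = "0444")  # drop the write bits so subsequent writes fail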
Or write a one-liner:
protect <- function(p) if (file.exists(p)) stop("File exists!") else p
and use it every time I provide a filename.
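For example (x here is just some object to write; the file names match the test case below):
write.table(x, protect("test_001.csv"))
png(protect("test_003.png"))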
Is there a way to force this behaviour session-wide? Some kind of global setting for connections? Maybe only for a subset of functions (graphics devices, file-creating connections, etc.)? Maybe some system-specific solution?
The following could be used as test case:
test <- function(i) {
  try(write.table(i, "test_001.csv"))
  try(writeLines(as.character(i), "test_002.txt"))
  try({png("test_003.png"); plot(i); dev.off()})
  try({pdf("test_004.pdf"); plot(i); dev.off()})
  try(save(i, file="test_005.RData"))
  try({f <- file("test_006.txt", "w"); cat(as.character(i), file=f); close(f)})
}
test(1)
magic_incantations() # or magic_incantations(test(2)), etc.
test(2) # should fail on all writes (to test, set the files from the first call to read-only)
The conventional way to avoid clobbering data files isn't to look for OS hacks, but to use filenames and directories that are specific to your session.
session.dir <- tempdir()
...
write.table(i, file.path(session.dir,"test_001.csv"))
writeLines(as.character(i), file.path(session.dir,"test_002.txt"))
...
or
session.pid <- Sys.getpid()
...
write.table(i, paste0("test_001.",session.pid,".csv"))
writeLines(as.character(i), paste0("test_002.",session.pid,".txt"))
...

How to create a new output file in R if a file with that name already exists?

I am trying to use the Windows Task Scheduler to run an R script every two hours. The script gathers some tweets through the Twitter API and runs a sentiment analysis that produces two graphs and saves them in a directory. The problem is that when the script runs again, it replaces the existing files of the same name in the directory.
As an example, when I used the pdf("file") function, it ran fine the first time because no file with that name already existed in the directory. The problem is that I want the script to keep running every other hour, so I need a solution that creates a new file in the directory instead of replacing the existing one, just like what happens when a file is downloaded multiple times in Google Chrome.
I'd just time-stamp the file name (now() below comes from the lubridate package).
> filename = paste("output-",now(),sep="")
> filename
[1] "output-2014-08-21 16:02:45"
Use any of the standard date formatting functions to customise to taste - maybe you don't want spaces and colons in your file names:
> filename = paste("output-",format(Sys.time(), "%a-%b-%d-%H-%M-%S-%Y"),sep="")
> filename
[1] "output-Thu-Aug-21-16-03-30-2014"
If you want the behaviour of adding a number to the file name, then something like this:
serialNext = function(prefix){
  if(!file.exists(prefix)) { return(prefix) }
  i = 1
  repeat {
    f = paste(prefix, i, sep=".")
    if(!file.exists(f)) { return(f) }
    i = i + 1
  }
}
Usage. First, "foo" doesn't exist, so it returns "foo":
> serialNext("foo")
[1] "foo"
Write a file called "foo":
> cat("fnord",file="foo")
Now it returns "foo.1":
> serialNext("foo")
[1] "foo.1"
Create that, then it returns "foo.2" and so on...
> cat("fnord",file="foo.1")
> serialNext("foo")
[1] "foo.2"
This kind of thing can break if more than one process might be writing a new file though - if both processes check at the same time there's a window of opportunity where both processes don't see "foo.2" and think they can both create it. The same thing will happen with timestamps if you have two processes trying to write new files at the same time.
Both these issues can be resolved by generating a random UUID and pasting that onto the filename; otherwise you need something that's atomic at the operating system level.
But for a twice-hourly job I reckon a timestamp down to minutes is probably enough.
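If you do go the UUID route, a minimal sketch (this assumes the uuid package is installed):
> filename = paste("output-", uuid::UUIDgenerate(), sep="")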
See ?files for file manipulation functions. You can check whether a file exists with file.exists, and then either rename the existing file or choose a different name for the new one.
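A minimal sketch of the rename-before-write idea ("output.pdf" is just an illustrative name):
if (file.exists("output.pdf")) {
  # Move the old file out of the way, tagging it with a timestamp
  file.rename("output.pdf", paste0("output-", format(Sys.time(), "%Y%m%d-%H%M%S"), ".pdf"))
}
pdf("output.pdf")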

R - Connect Scripts via Pipes

I have a number of R scripts that I would like to chain together using a UNIX-style pipeline. Each script would take as input a data frame and provide a data frame as output. For example, I am imagining something like this that would run in R's batch mode.
cat raw-input.Rds | step1.R | step2.R | step3.R | step4.R > result.Rds
Any thoughts on how this could be done?
Writing executable scripts is not the hard part; what is tricky is making the scripts read from files and/or pipes. I wrote a somewhat general function here: https://stackoverflow.com/a/15785789/1201032
Here is an example where the I/O takes the form of csv files:
Your step?.R files should look like this:
#!/usr/bin/Rscript
OpenRead <- function(arg) {
  if (arg %in% c("-", "/dev/stdin")) {
    file("stdin", open = "r")
  } else if (grepl("^/dev/fd/", arg)) {
    fifo(arg, open = "r")
  } else {
    file(arg, open = "r")
  }
}
args <- commandArgs(TRUE)
file <- args[1]
fh.in <- OpenRead(file)
df.in <- read.csv(fh.in)
close(fh.in)
# do something
df.out <- df.in
# print output
write.csv(df.out, file = stdout(), row.names = FALSE, quote = FALSE)
and your csv input file should look like:
col1,col2
a,1
b,2
Now this should work:
cat in.csv | ./step1.R - | ./step2.R -
The - arguments are annoying but necessary. Also make sure to run something like chmod +x ./step?.R to make your scripts executable. Finally, you could store them (without the extension) inside a directory that you add to your PATH, so you will be able to run them like this:
cat in.csv | step1 - | step2 -
Why on earth you would want to cram your workflow into pipes when you have the whole R environment available is beyond me.
Make a main.r containing the following:
source("step1.r")
source("step2.r")
source("step3.r")
source("step4.r")
That's it. You don't have to convert the output of each step into a serialised format; instead you can just leave all your R objects (datasets, fitted models, predicted values, lattice/ggplot graphics, etc) as they are, ready for the next step to process. If memory is a problem, you can rm any unneeded objects at the end of each step; alternatively, each step can work with an environment which it deletes when done, first exporting any required objects to the global environment.
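For instance, a minimal sketch of that per-step-environment idea (the exported object name, result, is purely illustrative):
step_env <- new.env()
sys.source("step1.r", envir = step_env)                        # run step 1 inside its own environment
assign("step1_result", step_env$result, envir = globalenv())   # export only what later steps need
rm(step_env)                                                   # everything else is discarded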
If modular code is desired, you can recast your workflow as follows. Encapsulate the work done by each file into one or more functions. Then call these functions in your main.r with the appropriate arguments.
source("step1.r") # defines step1_read_input, step1_f2
source("step2.r") # defines step2_f2
source("step3.r") # defines step3_f1, step3_f2, step3_f3
source("step4.r") # defines step4_write_output
step1_read_input(...)
step1_f2(...)
....
step4_write_output(...)
You'll need to add a line at the top of each script to read in from stdin, e.g.:
in_data <- readLines(file("stdin"),1)
You'll also need to write the output of each script to stdout().
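Putting those two requirements together, a minimal sketch of one step (CSV is used here just as a convenient text format for piping; the transformation is a placeholder):
#!/usr/bin/Rscript
df.in  <- read.csv(file("stdin"))                 # read the previous step's output from the pipe
df.out <- df.in                                   # do something with df.in here
write.csv(df.out, stdout(), row.names = FALSE)    # hand the result to the next step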
