R: test if a file exists and is not a directory

I have an R script that takes a file as input, and I want a general way to know whether the input is a file that exists, and is not a directory.
In Python you would do it as described in "How do I check whether a file exists using Python?", but I struggled to find anything similar in R.
What I'd like is something like below, assuming that the file.txt actually exists:
input.good = "~/directory/file.txt"
input.bad = "~/directory/"
is.file(input.good) # should return TRUE
is.file(input.bad) #should return FALSE
R has something called file.exists(), but this doesn't distinguish files from directories.

There is a dir.exists function in all recent versions of R.
file.exists(f) && !dir.exists(f)
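A minimal sketch of the is.file() the question asks for, built from the expression above. The name is_regular_file is made up for illustration (it is not a base R function), and it uses & rather than && so that it also works on a vector of paths.

```r
# Hypothetical helper (not part of base R): TRUE for paths that exist
# and are not directories. The vectorised & lets several paths be
# checked in one call.
is_regular_file <- function(path) {
  file.exists(path) & !dir.exists(path)
}

f <- tempfile(fileext = ".txt")   # a path that does not exist yet
file.create(f)                    # now it is an existing, empty file
d <- tempdir()                    # the session's temporary directory

# One existing file, one directory, one path that does not exist.
result <- is_regular_file(c(f, d, file.path(d, "no-such-file")))
print(result)
```

Only the first of the three paths is an existing non-directory, so only the first element should come back TRUE.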

The solution is to use file_test().
It provides shell-style file tests and can distinguish files from folders.
E.g.
input.good = "~/directory/file.txt"
input.bad = "~/directory/"
file_test("-f", input.good) # returns TRUE
file_test("-f", input.bad) #returns FALSE
From the manual:
Usage
file_test(op, x, y)
Arguments
op: a character string specifying the test to be performed. Unary tests (only x is used) are "-f" (existence and not being a directory), "-d" (existence and directory) and "-x" (executable as a file or searchable as a directory). Binary tests are "-nt" (strictly newer than, using the modification dates) and "-ot" (strictly older than): in both cases the test is false unless both files exist.
x, y: character vectors giving file paths.
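The other operators in that manual excerpt can be tried the same way. A small sketch using temporary files, so nothing here depends on the paths in the question; the one-hour offset is an arbitrary choice to make the two modification times differ.

```r
old <- tempfile()
new <- tempfile()
file.create(old, new)

# Push the modification time of `old` one hour into the past so the
# binary tests have something to compare.
Sys.setFileTime(old, Sys.time() - 3600)

is_dir    <- utils::file_test("-d", tempdir())   # directory test
dir_as_f  <- utils::file_test("-f", tempdir())   # a directory is not a plain file
new_newer <- utils::file_test("-nt", new, old)   # is `new` strictly newer than `old`?
new_older <- utils::file_test("-ot", new, old)   # is `new` strictly older than `old`?

print(c(is_dir, dir_as_f, new_newer, new_older))
```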

You can also use is_file(path) from the fs package.

Related

How to pass bash variable into R script

I have a couple of R scripts that process data in a particular input folder. I have a few folders I need to run these scripts on, so I started writing a bash script to loop through these folders and run those R scripts.
I'm not familiar with R at all (the script was written by a previous worker and it's basically a black box for me), and I'm inexperienced with passing variables through scripts, especially involving multiple languages. There's also an issue present when I call source("$SWS_output/Step_1_Setup.R") here - R isn't reading my $SWS_output as a variable, but rather a string.
Here's my bash script:
#!/bin/bash
# Inputs
workspace="`pwd`"
preprocessed="$workspace/6_preprocessed"
# Output
SWS_output="$workspace/7_SKSattempt4_results/"
# create output directory
mkdir -p $SWS_output
# Copy data from preprocessed to SWS_output
cp -a $preprocessed/* $SWS_output
# Loop through folders in the output and run the R code on each folder
for qdir in $SWS_output/*/; do
qdir_name=`basename $qdir`
echo -e 'source("$SWS_output/Step_1_Setup.R") \n source("$SWS_output/(Step_2_data.R") \n q()' | R --no-save
done
I need to pass the variable "qdir" into the second R script (Step_2_data.R) to tell it which folder to process.
Thanks!
My previous answer was incomplete. Here is a better effort to explain command line parsing.
It is pretty easy to use R's commandArgs function to process command line arguments. I wrote a small tutorial https://gitlab.crmda.ku.edu/crmda/hpcexample/tree/master/Ex51-R-ManySerialJobs. In cluster computing this works very well for us. The whole hpcexample repo is open source/free.
The basic idea is that in the command line you can run R with command line arguments, as in:
R --vanilla -f r-clargs-3.R --args runI=13 parmsC="params.csv" xN=33.45
In this case, my R program is a file r-clargs-3.R and the arguments that the file will import are three space separated elements, runI, parmsC, xN. You can add as many of these space separated parameters as you like. It is completely at your discretion what these are called, but it is required they are separated by spaces and there is NO SPACE around the equal signs. Character string variables should be quoted.
My habit is to name the arguments with suffix "I" to hint that it is an integer, "C" is for character, and "N" is for floating point numbers.
In the file r-clargs-3.R, include some code to read the arguments and sort through them. For example, from my tutorial:
cli <- commandArgs(trailingOnly = TRUE)
args <- strsplit(cli, "=", fixed = TRUE)
The rest of the work is sorting through the args, and this is my most evolved stanza to sort through arguments (because it looks for suffixes "I", "N", "C", and "L" (for logical)), and then it coerces the inputs to the correct variable types (all input variables are characters, unless we coerce with as.integer(), etc):
for (e in args) {
    argname <- e[1]
    if (! is.na(e[2])) {
        argval <- e[2]
        ## regular expression to delete initial \" and trailing \"
        argval <- gsub("(^\\\"|\\\"$)", "", argval)
    } else {
        # If arg specified without value, assume it is bool type and TRUE
        argval <- TRUE
    }
    # Infer type from last character of argname, cast val
    type <- substring(argname, nchar(argname), nchar(argname))
    if (type == "I") {
        argval <- as.integer(argval)
    }
    if (type == "N") {
        argval <- as.numeric(argval)
    }
    if (type == "L") {
        argval <- as.logical(argval)
    }
    assign(argname, argval)
    cat("Assigned", argname, "=", argval, "\n")
}
That will create variables in the R session named runI, parmsC, and xN.
The convenience of this approach is that the same base R code can be run with 100s or 1000s of command parameter variations. Good for Monte Carlo simulation, etc.
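For the original question, where only a single folder has to reach the R script, a plain positional argument may be enough. A minimal sketch, assuming the bash loop starts the script as Rscript Step_2_data.R "$qdir"; the helper name get_qdir and its "." default are made up for illustration.

```r
# Hypothetical helper: return the first trailing command-line argument,
# or `default` when the script was started without one.
get_qdir <- function(args = commandArgs(trailingOnly = TRUE), default = ".") {
  if (length(args) >= 1 && nzchar(args[1])) args[1] else default
}

qdir <- get_qdir()
cat("Processing folder:", qdir, "\n")
```

Taking the arguments as a function parameter keeps the lookup easy to check interactively, for example with get_qdir("some/folder/").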
Thanks for all the answers; they were very helpful. I was able to get a solution that works. Here's my completed script.
#!/bin/bash
# Inputs
workspace="`pwd`"
preprocessed="$workspace/6_preprocessed"
# Output
SWS_output="$workspace/7_SKSattempt4_results"
# create output directory
mkdir -p $SWS_output
# Copy data from preprocessed to SWS_output
cp -a $preprocessed/* $SWS_output
cd $SWS_output
# Loop through folders in the output and run the R code on each folder
for qdir in $SWS_output/*/; do
qdir_name=`basename $qdir`
echo $qdir_name
export VARIABLENAME=$qdir
echo -e 'source("Step_1_Setup.R") \n source("Step_2_Data.R") \n q()' | R --no-save --slave
done
And then the R script looks like this:
qdir<-Sys.getenv("VARIABLENAME")
pathname<-qdir[1]
As a couple of comments have pointed out, this isn't best practice, but this worked exactly as I wanted it to. Thanks!
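If the environment-variable route is kept, it may be worth failing early when the variable is missing, since Sys.getenv() otherwise returns an empty string silently. A minimal sketch; read_qdir_env is a made-up helper name, and the Sys.setenv() call only stands in for the export line of the bash script.

```r
# Hypothetical helper: read the folder exported by the bash loop.
# `unset = NA` makes a missing variable distinguishable from a set one.
read_qdir_env <- function(name = "VARIABLENAME") {
  value <- Sys.getenv(name, unset = NA)
  if (is.na(value) || !nzchar(value)) {
    stop("environment variable ", name, " is not set")
  }
  value
}

# Stand-in for `export VARIABLENAME=$qdir` in the bash script.
Sys.setenv(VARIABLENAME = "/tmp/example_dir/")
pathname <- read_qdir_env()
print(pathname)
```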

How does julia know what path separator and root directory to use?

Functions like joinpath use the appropriate OS-dependent separator when joining two paths (i.e. / on Linux, \\ on Windows, etc.). How do these functions know what separator to use?
Similarly, the root directory on Linux is /, but on Windows is probably C:\\. Is there a way to retrieve the OS-dependent root directory in Julia?
Note, I've had a look at the joinpath source on github, and it appears to use an undocumented function pathsep(a,b) and a global variable path_separator_re, but I can't see how either of these work.
It uses the Sys.isunix and Sys.iswindows functions in order to conditionally define the correct path_separator_re variables, etc.
https://github.com/JuliaLang/julia/blob/5c3f58039525972b24930f356821af8299f70a26/base/path.jl#L19-L41
if Sys.isunix()
    # ...
    const path_separator_re = r"/+"
    # ...
    splitdrive(path::String) = ("",path)
elseif Sys.iswindows()
    # ...
    const path_separator_re = r"[/\\]+"
    # ...
    function splitdrive(path::String)
        m = match(r"^([^\\]+:|\\\\[^\\]+\\[^\\]+|\\\\\?\\UNC\\[^\\]+\\[^\\]+|\\\\\?\\[^\\]+:|)(.*)$", path)
        String(m.captures[1]), String(m.captures[2])
    end
else
    error("path primitives for this OS need to be defined")
end
For a starting directory, check out the homedir function, which returns the current user's home directory (not the filesystem root) and uses libuv to determine it.
https://github.com/JuliaLang/julia/blob/5c3f58039525972b24930f356821af8299f70a26/base/path.jl#L52-L77
help?> homedir
search: homedir
homedir() -> AbstractString
Return the current user's home directory.
| Note
|
| homedir determines the home directory via libuv's uv_os_homedir. For details (for example on how to specify the home
| directory via environment variables), see the uv_os_homedir documentation.

How to create a new output file in R if a file with that name already exists?

I am trying to run an R-script file using windows task scheduler that runs it every two hours. What I am trying to do is gather some tweets through Twitter API and run a sentiment analysis that produces two graphs and saves it in a directory. The problem is, when the script is run again it replaces the already existing files with that name in the directory.
As an example, when I used the pdf("file") function, it ran fine the first time because no file with that name existed in the directory yet. The problem is that I want the R script to run every other hour, so I need a solution that creates a new file in the directory instead of replacing the existing one, just like what happens when a file is downloaded multiple times from Google Chrome.
I'd just time-stamp the file name (now() below comes from the lubridate package; Sys.time() is the base R equivalent).
> filename = paste("output-",now(),sep="")
> filename
[1] "output-2014-08-21 16:02:45"
Use any of the standard date formatting functions to customise to taste - maybe you don't want spaces and colons in your file names:
> filename = paste("output-",format(Sys.time(), "%a-%b-%d-%H-%M-%S-%Y"),sep="")
> filename
[1] "output-Thu-Aug-21-16-03-30-2014"
If you want the behaviour of adding a number to the file name, then something like this:
serialNext = function(prefix){
    if(!file.exists(prefix)){return(prefix)}
    i=1
    repeat {
        f = paste(prefix,i,sep=".")
        if(!file.exists(f)){return(f)}
        i=i+1
    }
}
Usage. First, "foo" doesn't exist, so it returns "foo":
> serialNext("foo")
[1] "foo"
Write a file called "foo":
> cat("fnord",file="foo")
Now it returns "foo.1":
> serialNext("foo")
[1] "foo.1"
Create that, then it returns "foo.2" and so on...
> cat("fnord",file="foo.1")
> serialNext("foo")
[1] "foo.2"
This kind of thing can break if more than one process might be writing a new file though - if both processes check at the same time there's a window of opportunity where both processes don't see "foo.2" and think they can both create it. The same thing will happen with timestamps if you have two processes trying to write new files at the same time.
Both these issues can be resolved by generating a random UUID and pasting that on the filename, otherwise you need something that's atomic at the operating system level.
But for a job that runs once every two hours I reckon a timestamp down to minutes is probably enough.
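Both naming ideas fit in a few lines of base R. A minimal sketch: timestamped_name is a made-up helper, the ".pdf" extension is assumed from the pdf() call in the question, and tempfile() stands in for the random-suffix idea (base R has no UUID generator, so this is a substitute, not a real UUID).

```r
# Hypothetical helper: build "<prefix>-YYYYmmdd-HHMMSS<ext>" with no
# spaces or colons in the name.
timestamped_name <- function(prefix, ext = ".pdf", time = Sys.time()) {
  paste0(prefix, "-", format(time, "%Y%m%d-%H%M%S"), ext)
}

print(timestamped_name("output"))

# tempfile() appends a random suffix and returns a name that is not
# currently in use. That lowers, but does not remove, the risk of two
# processes picking the same name.
random_name <- tempfile(pattern = "output-", tmpdir = tempdir(), fileext = ".pdf")
print(basename(random_name))
```

In the scheduled script this could be used as pdf(timestamped_name("output")), so each run writes to its own file.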
See ?files for file manipulation functions. You can check if file exists with file.exists, and then either rename the existing file, or create a different name for the new one.

R, passing variables to a system command

Using R, I am looking to create a QR code and embed it into an Excel spreadsheet (hundreds of codes and spreadsheets). The obvious way seems to be to create a QR code using the command line, and use the "system" command in R. Does anyone know how to pass R variables through the "system" command? Google is not too helpful as "system" is a bit generic, ?system does not contain any examples of this.
Note - I am actually using data matrices rather than QR codes, but using the term "data matrix" in an R question will lead to havoc, so let's talk QR codes instead. :-)
system("dmtxwrite my_r_variable -o image.png")
fails, as do the variants I have tried with "paste". Any suggestions gratefully received.
Let's say we have the variable x that we want to pass on to dmtxwrite, you can pass it on like:
x = 10
system(sprintf("dmtxwrite %s -o image.png", x))
or alternatively using paste:
system(paste("dmtxwrite", x, "-o image.png"))
but I prefer sprintf in this case.
base::system2 may also be worth considering, because it takes the command's arguments separately through its args argument. In your example:
my_r_variable <- "a"
system2(
    'echo',
    args = c(my_r_variable, '-o image.png')
)
would return:
a -o image.png
which is equivalent to running echo in the terminal. You may also want to redirect output to text files:
system2(
    'echo',
    args = c(my_r_variable, '-o image.png'),
    stdout = 'stdout.txt',
    stderr = 'stderr.txt'
)
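If the value may contain spaces or shell metacharacters, quoting it with shQuote() before it reaches the shell is a reasonable precaution. A minimal sketch that only builds and prints the command string; build_cmd is a made-up helper, and the dmtxwrite invocation is copied from the examples above rather than checked against that tool's documentation.

```r
# Hypothetical helper: assemble the command line with each variable
# part quoted for a POSIX shell (type = "sh" is set explicitly so the
# result does not depend on the platform default).
build_cmd <- function(value, outfile = "image.png") {
  sprintf("dmtxwrite %s -o %s",
          shQuote(value, type = "sh"),
          shQuote(outfile, type = "sh"))
}

cmd <- build_cmd("my text")
print(cmd)
# system(cmd) would then run it, as in the examples above.
```

The same quoting works with system2 by passing each piece as its own element, for example args = c(shQuote(value), "-o", shQuote(outfile)).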

Validate a character as a file path?

What's the best way to determine if a character is a valid file path? So CheckFilePath( "my*file.csv") would return FALSE (on windows * is invalid character), whereas CheckFilePath( "c:\\users\\blabla\\desktop\\myfile.csv" ) would return TRUE.
Note that a file path can be valid but not exist on disk.
This is the code that save is using to perform that function:
....
else file(file, "wb")
on.exit(close(con))
}
else if (inherits(file, "connection"))
con <- file
else stop("bad file argument")
......
Perhaps file.exists() is what you're after? From the help page:
file.exists returns a logical vector indicating whether the files named by its argument exist.
(Here ‘exists’ is in the sense of the system's stat call: a file will be reported as existing only
if you have the permissions needed by stat. Existence can also be checked by file.access, which
might use different permissions and so obtain a different result.)
Several other functions for working with the computer's file system are available as well, and are also referenced on the help page.
No, there's no way to do this (reliably). I don't see an operating system interface in either Windows or Linux to test this. You would normally try to create the file and get a failure message, or try to read the file and get a 'does not exist' kind of message.
So you should rely on the operating system to let you know if you can do what you want to do to the file (which will usually be read and/or write).
I can't think of a reason other than a quiz ("Enter a valid fully-qualified Windows file path:") to want to know this.
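The "try it and let the operating system answer" approach can be sketched as follows. This is an illustration under assumptions, not a full validator: can_create_file is a made-up name, it opens the path in append mode so an existing file is not truncated, and it removes the file again only if the probe itself created it.

```r
# Hypothetical helper: ask the OS whether `path` can be opened for
# writing by actually trying, and report the outcome as TRUE/FALSE.
can_create_file <- function(path) {
  existed <- file.exists(path)
  ok <- tryCatch({
    con <- file(path, open = "ab")   # append mode: never truncates
    close(con)
    TRUE
  },
  warning = function(w) FALSE,       # file() warns before it errors
  error   = function(e) FALSE)
  if (ok && !existed) unlink(path)   # clean up the probe file
  ok
}

good <- tempfile(fileext = ".csv")
bad  <- file.path(tempfile(), "missing-dir", "myfile.csv")

print(can_create_file(good))
print(can_create_file(bad))
```

As the answer above says, the result only reflects what the operating system allows at the moment of the call; it can change before the real write happens.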
I would suggest trying the checkPathForOutput function offered by the checkmate package. As stated in the linked documentation, the function:
Check[s] if a file path can safely be used to create a file and write to it.
Example
checkmate::checkPathForOutput(x = tempfile(pattern = "sample_test_file", fileext = ".tmp"))
# [1] TRUE
checkmate::checkPathForOutput(x = "c:\\users\\blabla\\desktop\\myfile.csv")
# [1] TRUE
Invalid path
The \0 character should not be used in Linux [1] file names:
checkmate::check_path_for_output("my\0file.csv")
# Error: nul character not allowed (line 1)
[1] Not tested on Windows, but the code of checkmate::check_path_for_output suggests that the function should work correctly on MS Windows systems as well.

Resources