R: Parse a line like a shell command line

I have a key=value file as follows:
a=foo
b="foo bar"
c="foo \"bar\""
d="foo \'bar\'"
Values appear to be quoted/escaped when there are special characters with a logic similar to the shell.
To parse the file into a named vector, the straightforward way is:
txt <- readLines("dat")
spl <- strsplit(txt, "=")
vec <- setNames(sapply(spl, `[`, 2),
sapply(spl, `[`, 1))
which gives:
a b c d
"foo" "\"foo bar\"" "\"foo \\\"bar\\\"\"" "\"foo \\'bar\\'\""
Now, just as the quotes in the shell command mkdir "new folder" are not literal, the outer quotes and backslashes above are not part of the values.
Therefore, I reparse vec with:
prs <- function(el) if(substr(el, 1, 1) == "\"") eval(parse(text= el)) else el
sapply(vec, prs)
That is:
a b c d
"foo" "foo bar" "foo \"bar\"" "foo 'bar'"
I wonder if, rather than my naive prs function, there is an established one, perhaps from a library, to parse CLI-like lines. What I have found so far assumes R scripts, where commandArgs() has already done the tokenising.
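One hedged alternative to eval(parse()), which would execute whatever code the file happens to contain, is to strip the outer quotes and the backslash escapes by hand. This is a minimal sketch, not an established library function: the name unquote is hypothetical, and it assumes every backslash simply escapes the character after it, so unlike R's parser it turns \n into n rather than a newline.

```r
# Hypothetical helper: remove surrounding double quotes and backslash escapes.
# Assumption: a backslash always means "take the next character literally".
unquote <- function(el) {
  quoted <- grepl('^".*"$', el)                       # values wrapped in "..."
  inner <- substr(el[quoted], 2, nchar(el[quoted]) - 1)
  el[quoted] <- gsub("\\\\(.)", "\\1", inner)         # drop the escaping backslash
  el
}

vec <- c(a = "foo",
         b = "\"foo bar\"",
         c = "\"foo \\\"bar\\\"\"",
         d = "\"foo \\'bar\\'\"")
res <- unquote(vec)
print(res)
```

Because the function works on the whole vector at once, no sapply() call is needed, and the names of vec are kept.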

Related

print vector as with "and" in \Sexpr{}

Consider this simple vector:
x <- c(1,2,3,4,5)
\Sexpr{x} will print 1,2,3,4,5 in LaTeX, but I often need to report vectors in text the way a human would, including "and" before the last number.
I tried to do this automatically with this function:
x <- c(1,2,3,4,5)
nicevector <- function(x){
a <- head(x,length(x)-1)
b <- tail(x,1)
cat(a,sep=", ");cat(" and ");cat(b)}
nicevector(x)
That seems to work in the console, but \Sexpr{nicevector(x)} fails miserably in the .Rnw file (while \Sexpr{x} works). Any ideas?
You can use knitr::combine_words(x).
cat() is used only for its side effect: printing in the console. It returns NULL invisibly rather than a character string, so \Sexpr{} has nothing to insert in the output. By comparison, knitr::combine_words() returns a character string.
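To illustrate the difference, here is a minimal base R sketch of the question's nicevector() rewritten to return a string instead of printing one. It is only an illustration of the return-versus-print point, not a replacement for knitr::combine_words(); the handling of vectors shorter than two elements is an assumption.

```r
# Build the sentence fragment with paste() so the function returns a string.
nicevector <- function(x) {
  if (length(x) < 2) return(paste(x))            # assumption: nothing to join
  paste(paste(head(x, -1), collapse = ", "), "and", tail(x, 1))
}

x <- c(1, 2, 3, 4, 5)
out <- nicevector(x)
print(out)
```

Since the result is an ordinary character string, \Sexpr{nicevector(x)} has a value to insert.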
There is also a function for this in the glue package:
glue::glue_collapse(1:4, ", ", last = " and ")
#> 1, 2, 3 and 4
See the function's help page (?glue::glue_collapse).

Vectorizing A Custom Function

I wrote this function to return the string before a certain character, which goes like so:
strBefore <- function(find, x, last = FALSE, occurrence) {
# Checking.
if (class(x)[1] != "character") { stop("The strBefore function only supports objects of character class.") }
# Getting the place of the find, and handling both cases of last.
fullPlace <- gregexpr(find, x)[[1]] # Gets the locations of the occurrences of find in the first element of x.
# Handling the case where last is TRUE.
if (last == TRUE) { place <- max(fullPlace) # Grabbing the latest character index if last is TRUE.
} else { place <- min(fullPlace) } # Otherwise, getting the first space.
# Handles the occurrence argument if given.
if (!missing(occurrence)) { place <- fullPlace[occurrence] }
# Subsetting the string.
xlen <- nchar(x) # Getting the total number of characters in the string.
x <- substr(x, 1, place - 1) # Minus 1 because we don't want to include the first hit for find.
return(x)
}
Where find is the character you want the string before, x is the character string, last asks whether to cut before the last occurrence of find, and occurrence designates which occurrence of find to cut before (overrides last if given).
If I use it on a single character object, it works fine like so:
> test <- "Hello World"
> test2 <- strBefore(" ", test)
> test2
[1] "Hello"
However, if I use it on a character vector, it cuts each item in the vector at the same place as the first item:
> test <- c("Hello World", "Hi There", "Why Hello")
> test2 <- strBefore(" ", test)
> test2
[1] "Hello" "Hi Th" "Why H"
Now, this link here does provide me with a method for doing what I want:
Using gsub to extract character string before white space in R
However, I do like having the functionality of the "occurrence" argument, which returns the string before the 2nd, 3rd, etc... occurrence of the find argument.
Just as a note, I can vectorize my function with sapply like so:
> test <- c("Hello World", "Hi There", "Why Hello")
> test2 <- sapply(test, function(x) strBefore(" ", x))
> test2
Hello World Hi There Why Hello
"Hello" "Hi" "Why"
Which somewhat solves my problem...but is there a way to do this more cleanly without having to use an apply function? I'm not looking for a solution to what strBefore does, but more a solution to how to vectorize custom functions. Thanks for your time.
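As a sketch of the general pattern being asked about: keep a scalar worker and wrap it once, either with vapply() or with Vectorize(). Both still loop internally (Vectorize() is a wrapper around mapply()), so this tidies the call site rather than removing the apply step. The names str_before_one, str_before and str_before_v are hypothetical, and returning NA when the occurrence is not found is an assumption, not the original function's behaviour.

```r
# Scalar worker: text before the n-th occurrence of `find` in one string.
str_before_one <- function(x, find, occurrence = 1L) {
  hits <- gregexpr(find, x, fixed = TRUE)[[1]]
  # Assumption: return NA when `find` does not occur often enough.
  if (hits[1] == -1 || occurrence > length(hits)) return(NA_character_)
  substr(x, 1, hits[occurrence] - 1)
}

# Option 1: hide the loop inside a wrapper with a type-checked result.
str_before <- function(x, find, occurrence = 1L) {
  vapply(x, str_before_one, character(1),
         find = find, occurrence = occurrence, USE.NAMES = FALSE)
}

# Option 2: let Vectorize() build the wrapper over the `x` argument.
str_before_v <- Vectorize(str_before_one, vectorize.args = "x",
                          USE.NAMES = FALSE)

test <- c("Hello World", "Hi There", "Why Hello")
res1 <- str_before(test, " ")
res2 <- str_before_v(test, " ")
print(res1)
```

Either wrapper can then be called on a whole character vector without an explicit sapply() at each call site.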

Integers/expressions as names for elements in lists

I am trying to understand names, lists and lists of lists in R. It would be convenient to have a way to dynamically label them like this:
> ll <- list("1" = 2)
> ll
$`1`
[1] 2
But this is not working:
> ll <- list(as.character(1) = 2)
Error: unexpected '=' in "ll <- list(as.character(1) ="
Neither is this:
> ll <- list(paste(1) = 2)
Error: unexpected '=' in "ll <- list(paste(1) ="
Why is that? Both paste() and as.character() are returning "1".
The reason is that paste(1) is a function call that evaluates to a string, not a string itself.
The R Language Definition says this:
Each argument can be tagged (tag=expr), or just be a simple expression.
It can also be empty or it can be one of the special tokens ‘...’, ‘..2’, etc.
A tag can be an identifier or a text string.
Thus, tags can't be expressions.
However, if you want to set names (which are just an attribute), you can do so with structure, eg
> structure(1:5, names=LETTERS[1:5])
A B C D E
1 2 3 4 5
Here, LETTERS[1:5] is most definitely an expression.
If your goal is simply to use integers as names (as in the question title), you can type them in with backticks or single- or double-quotes (as the OP already knows). They are converted to characters, since all names are characters in R.
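Since names are just an attribute, a computed name can also be attached after the list is built. A minimal sketch with base R's setNames(); the variable name key is only illustrative.

```r
# The name is computed by an expression, then attached as an attribute.
key <- as.character(1)
ll <- setNames(list(2), key)
print(ll)

# Same result as writing the tag literally.
same <- identical(ll, list("1" = 2))
```

This sidesteps the restriction on tags because setNames() receives the name as an ordinary argument, which is evaluated like any other expression.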
I can't offer a deep technical explanation for why your later code fails beyond "the left-hand side of = is not evaluated in that context (of enumerating items in a list)". Here's one workaround:
mylist <- list()
mylist[[paste("a")]] <- 2
mylist[[paste("b")]] <- 3
mylist[[paste("c")]] <- matrix(1:4,ncol=2)
mylist[[paste("d")]] <- mean
And here's another:
library(data.table)
tmp <- rbindlist(list(
list(paste("a"), list(2)),
list(paste("b"), list(3)),
list(paste("c"), list(matrix(1:4,ncol=2))),
list(paste("d"), list(mean))
))
res <- setNames(tmp$V2,tmp$V1)
identical(mylist,res) # TRUE
The drawbacks of each approach are pretty serious, I think. On the other hand, I've never found myself in need of richer naming syntax.

Passing multiple arguments via command line in R

I am trying to pass multiple file path arguments via command line to an Rscript which can then be processed using an arguments parser. Ultimately I would want something like this
Rscript test.R --inputfiles fileA.txt fileB.txt fileC.txt --printvar yes --size 10 --anotheroption helloworld -- etc...
passed through the command line and have the result as an array in R when parsed
args$inputfiles = "fileA.txt", "fileB.txt", "fileC.txt"
I have tried several parsers, including optparse and getopt, but neither of them seems to support this functionality. I know argparse does, but it is currently not available for R version 2.15.2.
Any ideas?
Thanks
Although it wasn't released on CRAN when this question was asked, a beta version of the argparse package is up there now and can do this. It is basically a wrapper around the popular Python module of the same name, so you need to have a recent version of Python installed to use it. See the install notes for more info. The basic example included sums an arbitrarily long list of numbers, which should not be hard to modify so that you can grab an arbitrarily long list of input files.
> install.packages("argparse")
> library("argparse")
> example("ArgumentParser")
The way you describe command line options is different from the way that most people would expect them to be used. Normally, a command line option would take a single parameter, and parameters without a preceding option are passed as arguments. If an argument would take multiple items (like a list of files), I would suggest parsing the string using strsplit().
Here's an example using optparse:
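library(optparse)
# Declare a single option that takes a comma-separated list of files.
option_list <- list(
  make_option(c("-f", "--filelist"), default = "blah.txt",
              help = "comma separated list of files (default %default)")
)
parser <- OptionParser(option_list = option_list)
# positional_arguments = TRUE returns the options and the leftover arguments.
arguments <- parse_args(parser, positional_arguments = TRUE)
opt <- arguments$options
args <- arguments$args
# Split the option value into individual file names.
myfilelist <- strsplit(opt$filelist, ",")
print(myfilelist)
print(args)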
Here are several example runs:
$ Rscript blah.r -h
Usage: blah.r [options]
Options:
-f FILELIST, --filelist=FILELIST
comma separated list of files (default blah.txt)
-h, --help
Show this help message and exit
$ Rscript blah.r -f hello.txt
[[1]]
[1] "hello.txt"
character(0)
$ Rscript blah.r -f hello.txt world.txt
[[1]]
[1] "hello.txt"
[1] "world.txt"
$ Rscript blah.r -f hello.txt,world.txt another_argument and_another
[[1]]
[1] "hello.txt" "world.txt"
[1] "another_argument" "and_another"
$ Rscript blah.r an_argument -f hello.txt,world.txt,blah another_argument and_another
[[1]]
[1] "hello.txt" "world.txt" "blah"
[1] "an_argument" "another_argument" "and_another"
Note that for the strsplit, you can use a regular expression to determine the delimiter. I would suggest something like the following, which would let you use commas or colons to separate your list:
myfilelist <- strsplit (opt$filelist,"[,:]")
At the top of your script test.R, put this:
args <- commandArgs(trailingOnly = TRUE)
hh <- paste(unlist(args),collapse=' ')
listoptions <- unlist(strsplit(hh,'--'))[-1]
options.args <- sapply(listoptions,function(x){
unlist(strsplit(x, ' '))[-1]
})
options.names <- sapply(listoptions,function(x){
option <- unlist(strsplit(x, ' '))[1]
})
names(options.args) <- unlist(options.names)
print(options.args)
to get:
$inputfiles
[1] "fileA.txt" "fileB.txt" "fileC.txt"
$printvar
[1] "yes"
$size
[1] "10"
$anotheroption
[1] "helloworld"
After searching around, and wanting to avoid writing a new package from scratch, I figured the best way to input multiple arguments using the optparse package is to separate the input files by a character that is most likely illegal in a file name (for example, a colon):
Rscript test.R --inputfiles fileA.txt:fileB.txt:fileC.txt etc...
File names can also have spaces in them as long as the spaces are escaped (optparse will take care of this)
Rscript test.R --inputfiles file\ A.txt:file\ B.txt:fileC.txt etc...
Ultimately, it would be nice to have a package (possibly a modified version of optparse) that supports multiple arguments as described in the question and shown below:
Rscript test.R --inputfiles fileA.txt fileB.txt fileC.txt
One would think such trivial features would be implemented into a widely used package such as optparse
Cheers
@agstudy's solution does not work properly if the options all take the same number of arguments. By default, sapply simplifies results of equal length into a matrix rather than a list. The fix is simple enough: explicitly set simplify = FALSE in the sapply call that parses the arguments.
args <- commandArgs(trailingOnly = TRUE)
hh <- paste(unlist(args),collapse=' ')
listoptions <- unlist(strsplit(hh,'--'))[-1]
options.args <- sapply(listoptions,function(x){
unlist(strsplit(x, ' '))[-1]
}, simplify=FALSE)
options.names <- sapply(listoptions,function(x){
option <- unlist(strsplit(x, ' '))[1]
})
names(options.args) <- unlist(options.names)
print(options.args)
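The difference can be seen in a small self-contained sketch. The two option strings below are made-up stand-ins for what listoptions would hold after splitting on '--', and get_values is a hypothetical name for the anonymous function used above.

```r
# Made-up stand-ins for two options that each take two values.
listoptions <- c("inputfiles fileA.txt fileB.txt",
                 "outputfiles outA.txt outB.txt")
get_values <- function(x) unlist(strsplit(x, " "))[-1]

# Default behaviour: equal-length results are simplified into a matrix.
simplified <- sapply(listoptions, get_values)
# With simplify = FALSE the result stays a list, one element per option.
kept <- sapply(listoptions, get_values, simplify = FALSE)

print(is.matrix(simplified))
print(is.list(kept))
```

With the matrix result, the names() assignment that follows no longer labels one element per option, which is why the list form is needed.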
I had this same issue, and the workaround that I developed is to adjust the input command line arguments before they are fed to the optparse parser, by concatenating whitespace-delimited input file names together using an alternative delimiter such as a "pipe" character, which is unlikely to be used as part of a file name.
The adjustment is then reversed at the end again, by removing the delimiter using str_split().
Here is some example code:
#!/usr/bin/env Rscript
library(optparse)
library(stringr)
# ---- Part 1: Helper Functions ----
# Function to collapse multiple input arguments into a single string
# delimited by the "pipe" character
insert_delimiter <- function(rawarg) {
# Identify index locations of arguments with "-" as the very first
# character. These are presumed to be flags. Prepend with a "dummy"
# index of 0, which we'll use in the index step calculation below.
flagloc <- c(0, which(str_detect(rawarg, '^-')))
# Additionally, append a second dummy index at the end of the real ones.
n <- length(flagloc)
flagloc[n+1] <- length(rawarg) + 1
concatarg <- c()
# Counter over the output command line arguments, with multiple input
# command line arguments concatenated together into a single string as
# necessary
ii <- 1
# Counter over the flag index locations
for(ij in seq(1,length(flagloc)-1)) {
# Calculate the index step size between consecutive pairs of flags
step <- flagloc[ij+1]-flagloc[ij]
# Case 1: empty flag with no arguments
if (step == 1) {
# Ignore dummy index at beginning
if (ij != 1) {
concatarg[ii] <- rawarg[flagloc[ij]]
ii <- ii + 1
}
}
# Case 2: standard flag with one argument
else if (step == 2) {
concatarg[ii] <- rawarg[flagloc[ij]]
concatarg[ii+1] <- rawarg[flagloc[ij]+1]
ii <- ii + 2
}
# Case 3: flag with multiple whitespace delimited arguments (not
# currently handled correctly by optparse)
else if (step > 2) {
concatarg[ii] <- rawarg[flagloc[ij]]
# Concatenate multiple arguments using the "pipe" character as a delimiter
concatarg[ii+1] <- paste0(rawarg[(flagloc[ij]+1):(flagloc[ij+1]-1)],
collapse='|')
ii <- ii + 2
}
}
return(concatarg)
}
# Function to remove "pipe" character and re-expand parsed options into an
# output list again
remove_delimiter <- function(rawopt) {
outopt <- list()
for(nm in names(rawopt)) {
if (typeof(rawopt[[nm]]) == "character") {
outopt[[nm]] <- unlist(str_split(rawopt[[nm]], '\\|'))
} else {
outopt[[nm]] <- rawopt[[nm]]
}
}
return(outopt)
}
# ---- Part 2: Example Usage ----
# Prepare list of allowed options for parser, in standard fashion
option_list <- list(
make_option(c('-i', '--inputfiles'), type='character', dest='fnames',
help='Space separated list of file names', metavar='INPUTFILES'),
make_option(c('-p', '--printvar'), type='character', dest='pvar',
help='Valid options are "yes" or "no"',
metavar='PRINTVAR'),
make_option(c('-s', '--size'), type='integer', dest='sz',
help='Integer size value',
metavar='SIZE')
)
# This is the customary pattern that optparse would use to parse command line
# arguments, however it chokes when there are multiple whitespace-delimited
# options included after the "-i" or "--inputfiles" flag.
#opt <- parse_args(OptionParser(option_list=option_list),
# args=commandArgs(trailingOnly = TRUE))
# This works correctly
opt <- remove_delimiter(parse_args(OptionParser(option_list=option_list),
args=insert_delimiter(commandArgs(trailingOnly = TRUE))))
print(opt)
Assuming the above file were named fix_optparse.R, here is the output result:
> chmod +x fix_optparse.R
> ./fix_optparse.R --help
Usage: ./fix_optparse.R [options]
Options:
-i INPUTFILES, --inputfiles=INPUTFILES
Space separated list of file names
-p PRINTVAR, --printvar=PRINTVAR
Valid options are "yes" or "no"
-s SIZE, --size=SIZE
Integer size value
-h, --help
Show this help message and exit
> ./fix_optparse.R --inputfiles fileA.txt fileB.txt fileC.txt --printvar yes --size 10
$fnames
[1] "fileA.txt" "fileB.txt" "fileC.txt"
$pvar
[1] "yes"
$sz
[1] 10
$help
[1] FALSE
>
A minor limitation with this approach is that if any of the other arguments have the potential to accept a "pipe" character as a valid input, then those arguments will not be treated correctly. However I think you could probably develop a slightly more sophisticated version of this solution to handle that case correctly as well. This simple version works most of the time, and illustrates the general idea.

A better way to extract functions from an R script?

Say I have a file "myfuncs.R" with a few functions in it:
A <- function(x) x
B <- function(y) y
C <- function(z) z
I want to place all the functions contained within "myfuncs.R" into their own files, named appropriately. I have a simple Bash-shell script to extract functions and place them in separate files:
split -p "function\(" myfuncs.R tmpfunc
grep "function(" tmpfunc* | awk '{
# strip first-instances of function assignment
sub("<-", " ")
sub("=", " ")
sub(":", " ") # and colon introduced by grep
mv=$1
mvto=sprintf("func_%s.R",$2)
print "mv", mv, mvto
}' | sh
leaving me with:
func_A.R
func_B.R
func_C.R
But, this script has obvious limitations. For example, it will misbehave when function 'A' has a nested function:
A <- function(x){
Aa <- function(x){x}
return(Aa)
}
and outright fails if the whole function is on a single line.
Does anyone know of a more robust, and less error-prone method to do this?
Source your functions and then type package.skeleton()
Separate files will be made for each function.
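If a full package skeleton is more than is needed, a similar effect can be sketched in base R by sourcing the file into its own environment and calling dump() once per function. This is a rough sketch under stated assumptions: the temporary input file stands in for myfuncs.R, the func_<name>.R naming follows the question, and comments in the original source are not guaranteed to survive the round trip.

```r
# Stand-in for myfuncs.R, written to a temporary file for the example.
src <- tempfile(fileext = ".R")
writeLines(c("A <- function(x) x",
             "B <- function(y) y",
             "C <- function(z) z"), src)

# Source into a separate environment so only the file's objects are seen.
env <- new.env()
sys.source(src, envir = env)

outdir <- tempfile("funcs_")
dir.create(outdir)
for (nm in ls(env)) {
  if (is.function(env[[nm]])) {
    # One file per top-level function; nested functions stay inside it.
    dump(nm, file = file.path(outdir, sprintf("func_%s.R", nm)), envir = env)
  }
}
files <- list.files(outdir)
print(files)
```

Because the file is parsed by R itself, nested functions and one-line definitions are handled the same way as any other top-level function.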
