Grep in R using conditions on the matching - r

I am using the tm in R and would like to change the stemCompletion function a little.
Currently when I have a string
x <- c('everi','new')
and a pattern
dictionary <- c ('every','everyone','new')
When I run the build in code in the stemCompletion, the function that is running is the following
possibleCompletions <- lapply(x, function(w) grep(sprintf("^%s", w),
dictionary,
value = TRUE))
structure(sapply(possibleCompletions, "[", 1), names = x)
and the result is
everi new
NA "new"
I want to change the function so that if the grep does not find anything for a particular value of x
then it tries by taking out the last value of the string. In my case 'ever' instead of 'everi'
I tried this code but it does not work.
substrLeft <- function(x, n) { substr(x, 1, nchar(x)-n) }
possibleCompletions <- lapply(x, function(w)
if (grepl(sprintf("^%s", w),dictionary,fixed = FALSE) = FALSE) {
grep(sprintf("^%s", substrLeft(w,1)),dictionary,value = TRUE,fixed = FALSE)
} else {
grep(sprintf("^%s", w),dictionary,value = TRUE,fixed = FALSE, invert = TRUE)
})
structure(sapply(possibleCompletions, "[", 1), names = x)
Thanks all.

Basically you have a few syntax problems.
First, = is assignment, == compares.
Also, if conditions in ℝ are, according to the manual "a single logical value". So you could get around this by wrapping a comparison in parentheses ((grepl(sprintf("^%s", w),dictionary,fixed = FALSE) == FALSE)) but you'll get warnings since grepl produces a list of length 2.
Finally, your one-line function has expanded to several lines, so you'll need to wrap it in {}:
possibleCompletions <- lapply(x, function (w) {
+ if ((grepl(sprintf("^%s", w),dictionary,fixed = FALSE) == FALSE) ) {
+ grep(sprintf("^%s", substrLeft(w,1)),dictionary,value = TRUE,fixed = FALSE)
+ } else {
+ grep(sprintf("^%s", w),dictionary,value = TRUE,fixed = FALSE, invert = TRUE)
+ }
+ })
Warning messages:
1: In if ((grepl(sprintf("^%s", w), dictionary, fixed = FALSE) == FALSE)) { :
the condition has length > 1 and only the first element will be used
2: In if ((grepl(sprintf("^%s", w), dictionary, fixed = FALSE) == FALSE)) { :
the condition has length > 1 and only the first element will be used
> str(possibleCompletions)
List of 2
$ : chr [1:2] "every" "everyone"
$ : chr "new"

Related

How to mask subsequences of string with a pattern string

I have a main string that looks like this:
my_main <- "ABCDEFGHIJ"
What I want to do is to sequentially mask at every position with another pattern string:
my_pattern <- "x*x" # the length could be varied from 1 up to length of my_main
Every character that overlap with * will be kept, other will be replaced with x.
The final result is a vector of strings that contain these:
xBxDEFGHIJ
AxCxEFGHIJ
ABxDxFGHIJ
ABCxExGHIJ
ABCDxFxHIJ
ABCDExGxIJ
ABCDEFxHxJ
ABCDEFGxIx
Next if the pattern is
my_pattern <- "xx**x"
The result would be:
xxCDxFGHIJ
AxxDExGHIJ
ABxxEFxHIJ
ABCxxFGxIJ
ABCDxxGHxJ
ABCDExxHIx
How can I achieve that?
This might be a little over-complicated, but it's a start:
I'm going to reuse Reduce_frame from https://stackoverflow.com/a/70945868/3358272.
Reduce_frame <- function(data, expr, init) {
expr <- substitute(expr)
out <- rep(init[1][NA], nrow(data))
for (rn in seq_len(nrow(data))) {
out[rn] <- init <- eval(expr, envir = data[rn,])
}
out
}
From here, let's split the pattern into a frame (for ease of access, if nothing else):
repl <- subset(
data.frame(p = strsplit(my_pattern, "")[[1]], i = seq_len(nchar(my_pattern))),
p != "*")
repl
# p i
# 1 x 1
# 3 x 3
From here, we can do it once with:
tail(Reduce_frame(repl, `substring<-`(init, i, i, p), init = my_main), 1)
# [1] "xBxDEFGHIJ"
Which means we can iterate fairly easily:
sapply(c(0, seq_len(nchar(my_main) - nchar(my_pattern))), function(offset) {
tail(Reduce_frame(transform(repl, i = i + offset),
`substring<-`(init, i, i, p), init = my_main), 1)
})
# [1] "xBxDEFGHIJ" "AxCxEFGHIJ" "ABxDxFGHIJ" "ABCxExGHIJ" "ABCDxFxHIJ" "ABCDExGxIJ" "ABCDEFxHxJ" "ABCDEFGxIx"
To use your second pattern,
my_pattern <- "xx**x"
repl <- transform(...) # from above
## the rest of this code is unchanged from above
sapply(c(0, seq_len(nchar(my_main) - nchar(my_pattern))), function(offset) {
tail(Reduce_frame(transform(repl, i = i + offset),
`substring<-`(init, i, i, p), init = my_main), 1)
})
# [1] "xxCDxFGHIJ" "AxxDExGHIJ" "ABxxEFxHIJ" "ABCxxFGxIJ" "ABCDxxGHxJ" "ABCDExxHIx"
So this can be easily functionized:
Reduce_frame <- ... # defined above
func <- function(S, pattern) {
stopifnot(nchar(S) >= nchar(pattern))
repl <- subset(
data.frame(p = strsplit(pattern, "")[[1]], i = seq_len(nchar(pattern))),
p != "*")
sapply(c(0, seq_len(nchar(S) - nchar(pattern))), function(offset) {
tail(Reduce_frame(transform(repl, i = i + offset),
`substring<-`(init, i, i, p), init = S), 1)
})
}
func("ABCDEFGHIJ", "x*x")
# [1] "xBxDEFGHIJ" "AxCxEFGHIJ" "ABxDxFGHIJ" "ABCxExGHIJ" "ABCDxFxHIJ" "ABCDExGxIJ" "ABCDEFxHxJ" "ABCDEFGxIx"
func("ABCDEFGHIJ", "xx**x")
# [1] "xxCDxFGHIJ" "AxxDExGHIJ" "ABxxEFxHIJ" "ABCxxFGxIJ" "ABCDxxGHxJ" "ABCDExxHIx"
Here's one way using strsplit, grepl, and paste.
f <- \(mm, mp) {
m <- el(strsplit(mm, ''))
p <- el(strsplit(mp, ''))
i <- which(!grepl(p, pattern='\\*'))
vapply(c(0L, seq_len(length(m) - max(i))), \(j) {
m[i + j] <- p[i]
paste(m, collapse='')
}, vector('character', 1L))
}
f('ABCDEFGHIJ', 'x*x')
# [1] "xBxDEFGHIJ" "AxCxEFGHIJ" "ABxDxFGHIJ" "ABCxExGHIJ" "ABCDxFxHIJ"
# [6] "ABCDExGxIJ" "ABCDEFxHxJ" "ABCDEFGxIx"
f('ABCDEFGHIJ', 'x**x')
# [1] "xBCxEFGHIJ" "AxCDxFGHIJ" "ABxDExGHIJ" "ABCxEFxHIJ" "ABCDxFGxIJ"
# [6] "ABCDExGHxJ" "ABCDEFxHIx"
f('ABCDEFGHIJ', 'xx**x')
# [1] "xxCDxFGHIJ" "AxxDExGHIJ" "ABxxEFxHIJ" "ABCxxFGxIJ" "ABCDxxGHxJ"
# [6] "ABCDExxHIx"
f('ABCDEFGHIJ', 'kk**krr')
# [1] "kkCDkrrHIJ" "AkkDEkrrIJ" "ABkkEFkrrJ" "ABCkkFGkrr"
f('ABCDEFGHIJ', 'kk**kr*r')
# [1] "kkCDkrGrIJ" "AkkDEkrHrJ" "ABkkEFkrIr"
Here is an approach along the same lines as r2evans' answer but relying on some stringr functions which should be more efficient than the base equivalents:
library(stringr)
f <- function(main, r_pattern) {
shift <- nchar(main) - nchar(r_pattern) + 1
idx <- as.data.frame(str_locate_all(r_pattern, "[^*]+")[[1]])
x_pattern <- str_split(r_pattern, "\\*+")[[1]]
Reduce(
function(x, y)
`str_sub<-`(
x,
seq(idx$start[y], length.out = shift),
seq(idx$end[y], length.out = shift),
omit_na = FALSE,
x_pattern[y]
),
seq(nrow(idx)),
init = main
)
}
f("ABCDEFGHIJ", "x*x")
[1] "xBxDEFGHIJ" "AxCxEFGHIJ" "ABxDxFGHIJ" "ABCxExGHIJ" "ABCDxFxHIJ" "ABCDExGxIJ" "ABCDEFxHxJ" "ABCDEFGxIx"
f("ABCDEFGHIJ", "xx**x")
[1] "xxCDxFGHIJ" "AxxDExGHIJ" "ABxxEFxHIJ" "ABCxxFGxIJ" "ABCDxxGHxJ" "ABCDExxHIx"
# Edit after OP comment:
f(my_main, "KK**KRR")
[1] "KKCDKRRHIJ" "AKKDEKRRIJ" "ABKKEFKRRJ" "ABCKKFGKRR"
Here is Ruby code that produces the desired result. I am presenting it in the event that a reader wishes to convert it to R, possibly with modification, of course.
You should be able to read the code even if you don't know Ruby, as long as you understand that:
'abc'.size returns 3;
0..8 is a range of integers between 0 and 8, inclusive;
'abc' << 'd' returns 'abcd';
7.modulo(3) returns 1;
'abcd'[2] returns 'c', 2 being an index; and
s == 'x' ? 'x' : my_main[j] reads, "if the string s (which will be 'x' or '*') equals 'x' return 'x', else return the character of my_main at index j.
The Ruby code (somewhat simplified from what would normally be written) is as follows.
def doit(my_main, my_pattern)
msz = my_main.size
psz = my_pattern.size
(0..msz-psz).map do |i|
s = ''
(0..msz-1).each do |j|
s << (my_pattern[(j-i).modulo(msz)] == 'x' ? 'x' : my_main[j])
end
s
end
end

Overriding a method of data.table - failing to perfectly forward arguments

I'm able to successfully modify the behaviour of [.data.frame, but fail to do so for [.data.table.
For data.frame:
# Exact same signature as "[.data.frame" :
"[.my.data.frame" <- function (x, i, j,
drop = if (missing(i)) TRUE
else length(cols) == 1) {
if(!missing(j) && j==8 ) {
cat("Oy vey\n")
}
NextMethod()
}
df <- data.frame(a=1,b=2)
class(df) <- c("my.data.frame", class(df))
# Works as expected:
df[1,2] # 2
df[1,8] # Oy Vey NULL
df[1,] # 1 2
However, for (the considerably more complicated) data.table:
# Exact same signature as "[.data.table" :
"[.my.data.table" <- function (x, i, j, by, keyby, with = TRUE, nomatch = getOption("datatable.nomatch"),
mult = "all", roll = FALSE,
rollends = if (roll == "nearest") c(TRUE, TRUE)
else if (roll >= 0) c(FALSE, TRUE) else c(TRUE, FALSE),
which = FALSE, .SDcols, verbose = getOption("datatable.verbose"),
allow.cartesian = getOption("datatable.allow.cartesian"),
drop = NULL, on = NULL) {
if(!missing(j) && j==8 ) {
cat("Oy vey\n")
}
NextMethod()
}
dt <- data.table(a=1,b=2)
class(dt) <- c("my.data.table", class(dt))
dt[1,2] # ERROR: i is not found in calling scope and it is not a column of type logical. When the first argument inside DT[...] is a single symbol, data.table looks for it in calling scope.
I know better than to pass arguments to NextMethod. It looks like I must call [.data.table explicitly, capture and pass the arguments as unevaluated promises - but all my attempts with quote, substitute or match.call have so far failed. Any insight would be appreciated.
I've found a partial solution, posting here in hope someone might improve on it.
"[.my.data.table" <- function (x, ...) {
# Modifications and tests galore - which can be tricky with this signature
class(x) <- class(x)[-1]
ret <- x[...]
class(x) <- c("my.data.table", class(x))
ret
}
I still consider this partial, because actually doing something in the function probably involves at least something like arglist <- list(...), and this fails when [ is called like this -
dt[1,]
Other directions are still very welcome.

How to avoid argparser type error when omitting argument?

Using argparser in R, I'm getting an error when specifying the type of an argument in the call to add_argument but not passing an argument to the script at the MacOSX command line. For example, given this R script:
library(argparser)
p <- arg_parser(description = "A test parser")
p <- add_argument(p, "--alpha", type = "double", help = "alpha for p-value")
p <- add_argument(p, "--sig-digits", type = "integer", help="number of significant digits")
args <- parse_args(p)
print(str(args))
and invoking it at the command line:
Rscript argparser-test.R --alpha 0.1
I am returned the error:
Error in (function (object, class, nargs) :
Invalid argument value: expecting integer but got: (NA).
Calls: parse_args -> mapply -> <Anonymous>
Interestingly, there is no error if you let --alpha take it's default:
Rscript argparser-test.R
Returns:
List of 5
$ : logi FALSE
$ help : logi FALSE
$ opts : logi NA
$ alpha : logi NA
$ sig_digits: logi NA
NULL
Notice the NA value here for sig_digits is type logical, not integer, as defined in the add_argument function.
Am I doing something wrong here? In the mean time, I suppose I will get around this by making the default --sig-digits = -1, then handling that as an exception, but I'd prefer not to.
Update: Actually, -1 throws the same error, which is very frustrating because I want to use a number for the exception that non-sensical. 9999 works, and is unlikely to be input by the user, but actually it's valid.
I experienced this error a month back or so. This is a problem with how optional arguments are parsed by the argparser package. Basically it does respect the order of optional arguments as it should in every situation, and sometimes it thus expects the wrong argument type.
I've opened an issue on the package bitbucket page. I highly suggest upvoting this and adding a comment to help adding to the attention of the issue.
In my issue I provided a possible solution to the problem which amounts to changing parse_args to the following definition (one could pull and recreate the package with this function at which point it would [should] work as expected)
parse_args <- function (parser, argv = commandArgs(trailingOnly = TRUE))
{
stopifnot(is(parser, "arg.parser"))
values <- list()
argv <- preprocess_argv(argv, parser)
arg.flags <- parser$args[parser$is.flag]
x <- as.logical(parser$defaults[parser$is.flag])
x[is.na(x)] <- FALSE
names(x) <- sub("^-+", "", arg.flags)
flag.idx <- match(arg.flags, argv)
flag.idx <- flag.idx[!is.na(flag.idx)]
if (length(flag.idx) > 0) {
x[match(argv[flag.idx], arg.flags)] <- TRUE
argv <- argv[-flag.idx]
}
values <- c(values, x)
if (values$help) {
print(parser)
quit()
}
x <- parser$defaults[parser$is.opt.arg]
arg.opt <- parser$args[parser$is.opt.arg]
names(x) <- sub("^-+", "", arg.opt)
i <- match("--opts", argv)
if (!is.na(i)) {
opts <- readRDS(argv[i + 1])
opts <- opts[!names(opts) %in% c("opts", "help")]
idx <- match(sanitize_arg_names(names(opts)), sanitize_arg_names(names(x)))
if (any(is.na(idx))) {
stop("Extra arguments supplied in OPTS file: (",
paste(setdiff(names(opts), names(x)), collapse = ", "),
").")
}
x[idx] <- opts
}
arg.idx <- match(arg.opt, argv)
arg.idx <- arg.idx[!is.na(arg.idx)]
arg.opt.types <- parser$types[parser$is.opt.arg]
arg.opt.nargs <- parser$nargs[parser$is.opt.arg]
### ###
## Altered section ##
### ###
if (length(arg.idx) > 0) {
# extract values following the optional argument label
x[ind <- match(argv[arg.idx], arg.opt)] <- argv[arg.idx+1];
# convert type of extraced values; x is now a list
x[ind] <- mapply(convert_type,
object = x[ind],
class = arg.opt.types[ind],
nargs = arg.opt.nargs[ind],
SIMPLIFY = FALSE);
# remove extracted arguments
to.remove <- c(arg.idx, arg.idx+1);
argv <- argv[-to.remove];
}
### ###
## Altered section ##
### ###
values <- c(values, x)
x <- argv
args.req <- parser$args[parser$is.req.arg]
args.req.types <- parser$types[parser$is.req.arg]
args.req.nargs <- parser$nargs[parser$is.req.arg]
if (length(x) < length(args.req)) {
print(parser)
stop(sprintf("Missing required arguments: expecting %d values but got %d values: (%s).",
length(args.req), length(x), paste(x, collapse = ", ")))
}
else if (length(x) > length(args.req)) {
print(parser)
stop(sprintf("Extra arguments supplied: expecting %d values but got %d values: (%s).",
length(args.req), length(x), paste(x, collapse = ", ")))
}
else if (length(args.req) > 0) {
names(x) <- args.req
x <- mapply(convert_type, object = x, class = args.req.types,
nargs = args.req.nargs, SIMPLIFY = FALSE)
}
values <- c(values, x)
names(values) <- sanitize_arg_names(names(values))
values
}

paste doesn't recognize object

I'm trying to run this function, but when I try to compile it, it says:
Error in paste("http://uk.advfn.com/p.php?pid=financials&symbol=", Symbol, :
object 'Symbol' not found
fund.data <- function
(
Symbol, # ticker
n=10, # number of periods
mode=c('quarterly','annual'), # periodicity
max.attempts=5 # maximum number of attempts to download before exiting
)
dirname(sys.frame(1)$ofile)
{
all.data = c()
option.value = -1
start_date = c('istart_date,start_date')
names(start_date) = c('quarterly,annual')
repeat {
# download Quarterly Financial Report data
if(option.value >= 0) {
url = paste('http://uk.advfn.com/p.php?pid=financials&symbol=', Symbol, '&btn=', mode[1], '_reports&', start_date[mode[1]], '=', option.value, sep = '')
} else {
url = paste('http://uk.advfn.com/p.php?pid=financials&symbol=', Symbol, '&btn=', mode[1], '_reports', sep = '')
}
cat('Downloading', url, '\n')
#txt = join(readLines(url))
for(iattempt in 1:max.attempts) {
flag = T
tryCatch({
txt = join(readLines(url))
}, interrupt = function(ex) {
flag <<- F
Sys.sleep(0.1)
}, error = function(ex) {
flag <<- F
Sys.sleep(0.1)
}, finally = {
if(flag) break
})
}
if( length(grep('INDICATORS', txt, ignore.case = T)) == 0 ) {
cat('No Data Found for', Symbol, '\n')
return(all.data)
}
# get title
pos = regexpr(pattern = '<title>(.*?)</title>', txt, ignore.case = TRUE, perl = TRUE)
if(length(pos) == 1)
title = substr(txt, attr(pos, 'capture.start'), attr(pos, 'capture.start') + attr(pos, 'capture.length') - 1)
# extract table from this page
data = extract.table.from.webpage(txt, 'INDICATORS', has.header = T)
colnames(data) = data[1,]
rownames(data) = data[,1]
data = data[,-1,drop=F]
# only add not already present data
add.index = which( is.na(match( colnames(data), colnames(all.data) )) )
all.data = cbind(data[,add.index,drop=F], all.data)
# check if it is time to stop
if(ncol(all.data) >= n) break
if(option.value == 0) break
# extract option value to go to the next page
temp = gsub(pattern = '<option', replacement = '<tr>', txt, perl = TRUE)
temp = gsub(pattern = '</option>', replacement = '</tr>', temp, perl = TRUE)
temp = extract.table.from.webpage(temp, 'All amounts', has.header = T)
temp = apply(temp,1,join)
index.selected = grep('selected', temp)
option.value = 0
if( length(index.selected) )
option.value = as.double( gsub('.*value=\'([0-9]*).*', '\\1', temp[index.selected]) )
if(option.value > 0) {
# can only get 5 time periods at a time
option.value = option.value - 5
option.value = max(0, option.value)
} else {
break
}
}
# remove empty columns
all.data = all.data[, colSums(nchar(trim(all.data))) > 0, drop=F]
all.data = rbind(all.data, title)
rownames(all.data)[nrow(all.data)] = 'HTMLTITLEtext'
if( ncol(all.data) > n ) {
return(all.data[,(ncol(all.data)-n+1):ncol(all.data), drop=F])
} else {
return(all.data)
}
}
The way you've written your code, your dirname() call comprises the entirety of the body of your function. The braced block that follows is executed immediately and is not part of the function.
After running all your code (and getting the error you quoted), this is fund.data():
fund.data;
## function
## (
## Symbol, # ticker
## n=10, # number of periods
## mode=c('quarterly','annual'), # periodicity
## max.attempts=5 # maximum number of attempts to download before exiting
## )
## dirname(sys.frame(1)$ofile)
As you can see, the braced block was not taken as part of the function definition. It was executed by itself immediately after fund.data() was defined. A function definition takes only the immediately following expression as the body, although that expression may comprise a braced block, which allows any number of statements to be subsumed within it. And as #RichardScriven pointed out in his comment, there is no actual call to your function anywhere in your code.
So, the reason why you're getting the exact error "object 'Symbol' not found" is because the function parameter Symbol does not exist in your braced block, because it is not part of the body of your function and was executed by itself.
To solve your problem, you need to surround the entire function body with a braced block:
fund.data <- function
(
Symbol, # ticker
n=10, # number of periods
mode=c('quarterly','annual'), # periodicity
max.attempts=5 # maximum number of attempts to download before exiting
) {
dirname(sys.frame(1)$ofile)
all.data = c()
option.value = -1
...
}
Although it's not clear what the purpose of the dirname() call is, since its return value is not used.

Conditionally remove data frames from environment

How can I drop data frames with less than 3 variables? I tried this:
`1001.AFG.1.A`<-data.frame(x = 1, y = 1:10)
apply(ls(), function(x) {if (dim(x)[2]<3) rm(x)})
The error message is:
Error in match.fun(FUN) : argument "FUN" is missing, with no default
1) The first line produces a named logical vector, to.rm with a component for each object which is TRUE if that object should be removed and FALSE otherwise. Thus names(to.rm)[to.rm] are the objects to be removed so feed that into rm. By splitting it into two steps, this lets one review to.rm before actually performing the rm.
to.rm <- unlist(eapply(.GlobalEnv, function(x) is.data.frame(x) && ncol(x) < 3))
rm(list = names(to.rm)[to.rm], envir = .GlobalEnv)
If this is entered directly into the global environment (i.e. not placed in a fucntion) then envir = .GlobalEnv in the last line is the default and can be omitted.
2) Another way is to iterate through the object names of env as shown. We have provided a verbose argument to show what it is doing and a dryrun argument to show what it would remove without actually removing anything.
rm2 <- function(env = .GlobalEnv, verbose = FALSE, dryrun = FALSE, all.names = FALSE) {
for(nm in ls(env, all.names = all.names)) {
obj <- get(nm, env)
if (is.data.frame(obj) && ncol(obj) < 3) {
if (verbose || dryrun) cat("removing", nm, "\n")
if (!dryrun) rm(list = nm, envir = env)
}
}
}
rm2(dryrun = TRUE)
rm2(verbose = TRUE)
Update Added envir argument to rm in (1). It was already in (2).
Update 2 Minor imrovements to (2).
You may want to try :
sapply(ls(), function(x) {
if (is.data.frame(get(x)) && dim(get(x))[2]<3) rm(list=x,envir=.GlobalEnv)
})
I you want to suppress the printings, you can do :
invisible(sapply(ls(), function(x) {
if (is.data.frame(get(x)) && dim(get(x))[2]<3) rm(list=x,envir=.GlobalEnv)
}))

Resources