R progressively search string of a hierarchical lookup table for matches - r

I am working with OPS codes, which encode the type of procedure performed in a hospital. The OPS code list has a hierarchical structure of the form X-XXX.XX, with each X being a digit: the first X- is a broad group, the XXX denotes a type of procedure within that group, and the final .XX denotes a subspecialization of the XXX.
So a code might be X-XXX, X-XXX. , X-XXX.X, or X-XXX.XX
My problem is that a program we use collapses the structure of the code to XXXX, XXXXX, or XXXXXXX, and I would like to match the collapsed codes against the uncollapsed lookup table of definitions.
So I would like a routine that checks each digit in turn and then proceeds to the next when performing the matching. grepl would not do, because 5381 would match 65381 (uncollapsed, these are 5-381 and 6-538.1), which are totally different procedures. I need something that matches character by character (first digit, second digit, etc.) and respects the character positions.
When an exact match cannot be found, it should return the first match that matches the same character positions.
More examples in pseudocode
which("5381" %in% c("65381","53811", "5382")) should return 2, since the second item matches all available characters provided
which("5381" %in% c("538110","538111", "538221")) should return 1 (because it's the first match; the lookup table within c() is sorted)
which("5381." %in% c("5381","538111", "538121")) should return 1 (because it's the first match; the lookup table within c() is sorted). Note that the period is ignored in the match
which("5381.1" %in% c("5381","538111", "538112")) should return 2 (because it's the first match that matches all five available characters; the first item does not have a fifth)
I know this is not the best example of a question in SO but I am open to improve the question.

This is probably too complicated but it works.
First define a generic to transform the input string to the OPS format. Then have a matching function check if x has y as a substring.
Note that the matching function does not check if x is a substring of y, it's the other way around.
as.ops <- function(x, ...) UseMethod("as.ops")
as.ops.default <- function(x, ...){
  warning("The default method coerces its argument to character and calls the character method")
  as.ops.character(as.character(x))
}
as.ops.character <- function(x, ...){
  # keep the digits only, then split them into the X, XXX and trailing parts
  x <- gsub("[^[:digit:]]", "", x)
  ops1 <- substr(x, 1, 1)
  ops2 <- substr(x, 2, 4)
  ops3 <- substring(x, 5)
  y <- character(length(x))
  # interval index of the digit count: 1 = empty, 2 = 1-3 digits, 3 = 4-6 digits, 4 = 7 or more
  n <- findInterval(nchar(x), c(0, 1, 4, 7))
  y[n == 1] <- x[n == 1]
  y[n != 1] <- paste(ops1[n != 1], ops2[n != 1], sep = "-")
  o3 <- nchar(ops3) > 0
  y[n == 3 & o3] <- paste(y[n == 3 & o3], ops3[n == 3 & o3], sep = ".")
  y
}
ops_match <- function(x, y){
  xo <- as.ops(x)
  yo <- as.ops(y)
  # fixed = TRUE so that the "." in an OPS code is matched literally, not as a regex wildcard
  i <- (xo %in% yo) | grepl(yo, xo, fixed = TRUE)
  which(i)
}
x1 <- c("65381","53811", "5382")
x2 <- c("538110","538111", "538221")
x3 <- c("5381","538111", "538121")
x4 <- c("5381","538111", "538112")
y1 <- y2 <- "5381"
y3 <- "5381."
y4 <- "5381.1"
ops_match(x1, y1)
ops_match(x2, y2)
ops_match(x3, y3)
ops_match(x4, y4)
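If only the first position-respecting match is needed, a shorter base R sketch is possible. It assumes that stripping every non-digit from both the code and the lookup table is acceptable, and that a "match" means the table entry starts with the digits of the code. The name ops_first_match is a hypothetical helper, not part of the answer above.

```r
# Hypothetical helper: index of the first table entry whose digits
# start with the digits of `code`, or NA if there is none.
ops_first_match <- function(code, table) {
  key <- gsub("[^[:digit:]]", "", code)    # drop "-", "." and anything else
  tab <- gsub("[^[:digit:]]", "", table)
  hits <- which(startsWith(tab, key))      # a prefix match respects positions
  if (length(hits) > 0) hits[1] else NA_integer_
}

ops_first_match("5381",   c("65381", "53811", "5382"))      # 2
ops_first_match("5381",   c("538110", "538111", "538221"))  # 1
ops_first_match("5381.",  c("5381", "538111", "538121"))    # 1
ops_first_match("5381.1", c("5381", "538111", "538112"))    # 2
```

Because startsWith anchors the comparison at the first character, 5381 can no longer match 65381, which was the problem with a plain grepl.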

Related

Replace expressions in a source file from another source file in R

Hello I have the following problem
Say I have a file base.R
x <- 1
# comment
y <- Y ~ X1 +
X2
# comment 2
z <- function(x) {
x + 1
}
t <- z(x)
and another file override.R
x <- 2
y <- Y ~ X1 + X3
My goal is to create another file, new.R, which is essentially base.R overridden by override.R
x <- 2
# comment
y <- Y ~ X1 + X3
# comment 2
z <- function(x) {
x + 1
}
t <- z(x)
Obviously, if all expressions in base.R were one-liners I would be able to use sed, but unfortunately that's not the case.
Note that I only need it to work for assignments of the form lhs <- rhs, though ideally lhs = rhs would work as well.
EDIT: the above is a minimization of my actual problem
Sometimes a difficult problem is best made easier by redefining the problem itself. Below we suggest a number of approaches, some of which have particularly simple implementations. In (7) we provide code that does what the question asks, but you may prefer to change the problem slightly and use the simpler code in one of the other solutions.
1) omit first few lines in base.R & concatenate files We will assume that override.R should override everything in base.R up to the last statement to be overridden in base.R. Comments in base.R after the overridden statements will be kept, as will any comments in override.R. Thus in the example "# comment 2" will be kept and "# comment" will be overridden, but it could be replicated in override.R if desired, which seems reasonable since you can't assume that a comment that applies to an assignment in base.R also applies in override.R.
Determine the number of statements n in override.R. Then parse base.R and find ix, the last line number before the first statement that is not to be overridden. Then, among the lines up to ix, find the last non-comment line number, mx. Now write out override.R followed by all but the first mx lines of base.R. In the code below replace stdout() with the desired name of the output file, e.g. "outfile.R".
library(utils)
n <- length(parse("override.R"))
g <- getParseData(parse("base.R"))
ix <- g$line1[grep("^0", g$parent)][n + 1] - 1
baseLines <- readLines("base.R")
is_comment <- grepl("^\\s*#", head(baseLines, ix))
mx <- max(which(!is_comment))
overrideLines <- readLines("override.R")
writeLines(c(overrideLines, tail(baseLines, -mx)), stdout())
giving:
x <- 2
y <- Y ~ X1 + X3
# comment 2
z <- function(x) {
x + 1
}
t <- z(x)
2) comment out rather than omit One alternative would be to comment out the overridden lines rather than omitting them. We can readily do that by replacing the writeLines statement with the statement below. This will allow one to see both the comments in base.R, if any, and the comments in override.R.
writeLines(c(overrideLines, sub("^", "# ", head(baseLines, mx)),
tail(baseLines, -mx)), stdout())
giving:
x <- 2
y <- Y ~ X1 + X3
# x <- 1
# # comment
# y <- Y ~ X1 +
# X2
# comment 2
z <- function(x) {
x + 1
}
t <- z(x)
3) separator If you control base.R then a simpler approach is to mark the end of the portion to be overridden. Suppose we put #--- on a line by itself in base.R between the portion to override and the rest. Then we have the following, which is simpler:
overrideLines <- readLines("override.R")
baseLines <- readLines("base.R")
ix <- grep("#---", baseLines)[1]
writeLines(c(overrideLines, tail(baseLines, -ix)), stdout())
4) exists or possibly, in base.R, check if x has already been defined and only define it if not. Ditto for y. Then it is just a matter of concatenating the two files or sourcing one after the other.
if (!exists("x")) x <- ...whatever...
if (!exists("y")) y <- ...whatever...
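A minimal sketch of how the exists approach could behave, using two temporary files as stand-ins for base.R and override.R. The file contents, the environment env and the use of inherits = FALSE (so that only env itself is checked) are assumptions made for illustration.

```r
# Stand-ins for base.R and override.R, written to temporary files
base_file     <- tempfile(fileext = ".R")
override_file <- tempfile(fileext = ".R")
writeLines(c(
  'if (!exists("x", inherits = FALSE)) x <- 1',  # only define x if not yet set
  'z <- function(x) x + 1',
  't <- z(x)'
), base_file)
writeLines('x <- 2', override_file)

# Source the override first, then the base, into the same environment
env <- new.env()
source(override_file, local = env)
source(base_file, local = env)

env$x  # 2, the value from the override file
env$t  # 3, computed by the base file from the overridden x
```

Sourcing into a dedicated environment keeps the run from picking up an x that happens to exist in the global workspace.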
5) function Yet another possibility is to define a function whose defaults are the current values of x and y in base.R. Then we can call it as f() to get the defaults or specify them.
f <- function(x = ..., y = ...) { ...base.R code except x and y ...}
6) Omit definitions from base.R Perhaps the simplest alternative is just to omit the definitions from base.R and for each run have a override.R that is sourced first or concatenated.
7) Keep base.R comments This one does what the question asked but it is a bit complex and you may prefer one of the other solutions.
library(codetools)
library(utils)
baseLines <- readLines("base.R")
overrideLines <- readLines("override.R")
p_o <- parse("override.R")
g_o <- getParseData(p_o)
locals_o <- findLocalsList(p_o)
ipos <- sapply(locals_o, function(x) which(g_o$text == x)[1]-1)
DFo <- cbind(g_o[ipos, ], var = names(ipos))
p_b <- parse("base.R")
g_b <- getParseData(p_b)
ipos <- sapply(locals_o, function(x) which(g_b$text == x)[1]-1)
DFb <- data.frame(g_b[ipos, ], var = names(ipos), row.names = NULL)
o <- order(-DFb$line1)
DFb <- DFb[o, ]
newLines <- baseLines
for(i in 1:nrow(DFb)) {
j <- match(DFb$var[i], DFo$var)
newLines <- append(newLines,
overrideLines[DFo$line1[j]:DFo$line2[j]], DFb$line2[i])
newLines <- newLines[-(DFb$line1[i]:DFb$line2[i])]
}
writeLines(newLines, stdout())
giving:
x <- 2
# comment
y <- Y ~ X1 + X3
# comment 2
z <- function(x) {
x + 1
}
t <- z(x)
If you can accept comments being stripped, then this might suffice for you:
Starting with base.R:
x <- 1
# comment
y <- Y ~ X1 +
X2
# comment 2
z <- function(x) {
x + 1
}
t <- z(x)
and override.R:
x <- 2
y <- Y ~ X1 + X3
We can run:
base <- parse("base.R")
override <- parse("override.R")
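base_assignment <-
  sapply(base, function(z) as.character(z[[1]]) %in% c("<-", "="))
# name on the left-hand side of each assignment; NA for anything that is not an assignment
base_lhs <- mapply(function(assigned, z) if (assigned) as.character(z[[2]]) else NA_character_,
                   base_assignment, base)
override_assignment <-
  sapply(override, function(z) as.character(z[[1]]) %in% c("<-", "="))
override_lhs <- mapply(function(assigned, z) if (assigned) as.character(z[[2]]) else NA_character_,
                       override_assignment, override)
# incomparables = NA stops a non-assignment in base.R from matching one in override.R
matches <- match(base_lhs, override_lhs, incomparables = NA)
base[which(!is.na(matches))] <- override[na.omit(matches)]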
writeLines(paste(do.call(c, lapply(base, deparse)), collapse = "\n"), "new.R")
and now we have new.R with
x <- 2
y <- Y ~ X1 + X3
z <- function(x) {
x + 1
}
t <- z(x)
For conversation, in order to retain comments we'd likely need to use getParseData:
1. iterate over $parent and $id so that our $line1 references can be combined, and store this reduced line1 in a new variable (since we'll need to remove the originals from getParseData(base));
2. find all references to $token == "SYMBOL" where there exists $token == "LEFT_ASSIGN" later in each expression. This starts to hobble it a little when we have "EQ_ASSIGN" or, more of a challenge, "RIGHT_ASSIGN" (since the presumed order of symbols changes);
3. step 2 helps us find the object names to which assignments occur, which we use to compare between base/override processing;
4. replace the subset of each version's parsed frame;
5. find a way to recombine the resulting parsed frame into a source file.
I ran out of time trying to get this to work elegantly/robustly, so I offer it as an example of effort-required in order to retain comments.
I suggest that if your intent is to allow a single source file of overriding expressions, it makes sense to keep the base.R untouched (as in your question) and create a temporary new.R that is used and sourced and discarded, in which case its comments are tangential.
This would be very challenging with sed, but you could try using awk; this works with the example data:
awk 'BEGIN{FS="<-|="} NR==FNR{a[$1]=$0; next}; {if($1 in a){c=1} else if (/[#<=]/){c++}; if(c == 1){print a[$1]} else {print $0}}' override.R base.R
x <- 2
# comment
y <- Y ~ X1 + X3
# comment 2
z <- function(x) {
x + 1
}
t <- z(x)
Basically, if the LHS from override.R is found in base.R the 'counter' is set to 1, then if any of the "#<=" characters are encountered in the following lines the counter is incremented. Then all lines in base.R with counter == 1 are replaced with the corresponding line from override.R. I can't think of cases where this would fail, but I'd be interested to see if it holds up on more complicated examples.
Formatted with further explanation:
awk 'BEGIN{FS="<-|="} # set the field separator to either "<-" or "="
NR==FNR{a[$1]=$0; next} # load override.R into an array, key = 1st field (i.e. the LHS)
{
if($1 in a){c=1} # if the LHS from override.R is found, set "c" (counter) to 1
else if(/[#<=]/){c++} # if the line contains "#", "<" or "=", increment counter
if(c==1){print a[$1]} # if c equals 1 (i.e. the LHS is present) print the override.R line
else {print $0} # else print the base.R line
}' override.R base.R

Count "changes in direction" in a vector in R

I need to count how many times a variable inverts its growth pattern - from increasing values to decreasing values (as well as from decreasing values to increasing values). In the following example, I should be able to find 4 such inversions. How can I create a new dummy variable that shows such inversions?
x <- c(1:20,19:5,6:15,12:9,10:11)
plot(x)
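You're effectively asking "when is the second derivative of x not equal to zero?", so you could just do a double diff. Note that this flags every change in slope, not only reversals, so in this example the drop from 15 to 12 followed by steps of -1 is flagged as well: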
x <- c(1:20,19:5,6:15,12:9,10:11)
plot(seq_along(x), x)
changes <- c(0, diff(diff(x)), 0) != 0
To show it picks the right points, colour them red.
points(seq_along(x)[changes], x[changes], col = "red")
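Counting sign flips of the first difference, rather than non-zero second differences, is one way to ignore changes in slope that keep the same direction. The sketch below assumes that x has at least two elements and no ties between neighbouring values; direction_change is a hypothetical helper name.

```r
# Hypothetical helper; assumes length(x) >= 2 and no equal neighbours.
# Returns a logical dummy that is TRUE where the direction reverses.
direction_change <- function(x) {
  s <- sign(diff(x))               # +1 while rising, -1 while falling
  c(FALSE, diff(s) != 0, FALSE)    # pad so the result lines up with x
}

x <- c(1:20, 19:5, 6:15, 12:9, 10:11)
turn <- direction_change(x)
sum(turn)     # 4 inversions
which(turn)   # 20 35 45 49
```

The dummy has the same length as x, so it can be stored as a new column next to the original variable.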
This function will return the indices at which the direction changed:
get_change_indices <- function(x){
# return NULL if x contains one (or none, if NULL) unique elements
if(length(unique(x)) <= 1) return(NULL)
# make x named, so we can recapture its indices later
x <- setNames(x, paste0("a", seq_along(x)))
# calculate diff between successive elements
diff_x <- diff(x)
# remove points that are equal to zero
diff_x <- diff_x[!diff_x==0]
# identify indices of changepoints
diff_x <- c(diff_x[1], diff_x)
change_ind <- NULL
for(i in 2:length(diff_x)){
if(sign(diff_x[i]) != sign(diff_x[i-1])){
change_ind_curr <- as.numeric(gsub("a", "", names(diff_x[i]))) - 1
change_ind <- c(change_ind, change_ind_curr)
}
}
change_ind
}
The length of its output is the number of changes.
Note that it also works when the change in x is non-linear, e.g. if x <- c(1, 4, 9, 1).

Lookup list of formulas in other list

I am comparing two lists of formulas to see if some previously computed models can be reused. Right now I'm doing this like this:
set.seed(123)
# create some random formulas
l1 <- l2 <- list()
for (i in 1:10) {
l1[[i]] <- as.formula(paste("z ~", paste(sample(letters, 3), collapse = " + ")))
l2[[i]] <- as.formula(paste("z ~", paste(sample(letters, 3), collapse = " + ")))
}
# at least one appears in the other list
l1[[5]] <- l2[[7]]
# helper function to convert formulas to character strings
as.formulaCharacter <- function(x) paste(deparse(x))
# convert both lists to strings
s1 <- sapply(l1, as.formulaCharacter)
s2 <- sapply(l2, as.formulaCharacter)
# look up elements of one vector in the other
idx <- match(s1, s2, nomatch = 0L) # 7
s1[idx > 0] # found matching elements
However, I noticed that some formulas are not retrieved although they are practically equivalent.
f1 <- z ~ b + c + b:c
f2 <- z ~ c + b + c:b
match(as.formulaCharacter(f1), as.formulaCharacter(f2)) # no match
I get why this result is different, the strings just aren't the same, but I'm struggling with how to extend this approach to also work for formulas with reordered elements. I could use strsplit to first sort all formula components independently, but that sounds horribly inefficient to me.
Any ideas?
If the formulas are restricted to a sum of terms which contain colon separated variables then we can create a standardized string by extracting the term labels, exploding those with colons, sorting them, pasting the exploded terms back together, sorting this and turning that into a formula string.
stdize <- function(fo) {
s <- strsplit(attr(terms(fo), "term.labels"), ":")
terms <- sort(sapply(lapply(s, sort), paste, collapse = ":"))
format(reformulate(terms, all.vars(fo)[1]))
}
stdize(f1) == stdize(f2)
## [1] TRUE
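As a sketch of how the standardized strings could replace the deparsed strings in the lookup from the question: stdize is restated here, operating on its argument fo, so that the block is self-contained, and the two short lists are made-up examples rather than the random formulas above.

```r
# Standardize a formula: sort variables within each term, then sort the terms
stdize <- function(fo) {
  s <- strsplit(attr(terms(fo), "term.labels"), ":")
  trms <- sort(sapply(lapply(s, sort), paste, collapse = ":"))
  format(reformulate(trms, all.vars(fo)[1]))
}

# Made-up example lists; l1[[1]] and l2[[3]] differ only in ordering
l1 <- list(z ~ b + c + b:c, z ~ a + d)
l2 <- list(z ~ d + a, z ~ x + y, z ~ c + b + c:b)

s1 <- vapply(l1, stdize, character(1))
s2 <- vapply(l2, stdize, character(1))
idx <- match(s1, s2, nomatch = 0L)
idx   # 3 1
```

Like the original, this only covers formulas that are sums of terms whose variables are joined by colons.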

Match a sequence of elements to a longer vector with some similar sequences

How can I count the number of times the specific sequence of elements x occurs in the longer vector y?
x <- c(1,2,3,4,5,6)
y <- c(1,2,3,4,5,6,3,2,0,1,2,3,4,5,6,1,2,3,4,5,6,9,2,1,2,3,4,5,6,1,2,3,4)
Sorry, but I could not come up with any way to do it, as match will match individual elements but not a string or sequence of elements.
To add to the current answer and David's comment,
length(gregexpr(paste(x,collapse = ""), paste(y, collapse=""))[[1L]])
fails in the general case. This is because if there are no matches, the resultant index is -1, giving a length of 1, rather than correct answer 0.
x = c(1,2,3)
y = c(4,3,2,4,3,4,3,2,2,3)
length(gregexpr(paste(x,collapse = ""), paste(y, collapse=""))[[1L]])
# [1] 1
whereas stringi::stri_count_fixed returns the expected 0.
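In base R, one way around the -1 result is to count only the positive start positions. This is a sketch under the same assumption as the one-liner above, namely that pasting the elements together is unambiguous (for example single-digit values); count_disjoint is a hypothetical name.

```r
# Hypothetical helper: number of disjoint occurrences of x in y,
# found by pasting both vectors into strings.
count_disjoint <- function(x, y) {
  m <- gregexpr(paste(x, collapse = ""), paste(y, collapse = ""),
                fixed = TRUE)[[1L]]
  sum(m > 0L)   # gregexpr returns -1 when there is no match
}

count_disjoint(c(1, 2, 3), c(4, 3, 2, 4, 3, 4, 3, 2, 2, 3))   # 0

y <- c(1,2,3,4,5,6,3,2,0,1,2,3,4,5,6,1,2,3,4,5,6,9,2,1,2,3,4,5,6,1,2,3,4)
count_disjoint(1:6, y)                                        # 4
```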
Here is a different answer, which does not rely on pasting to strings, should you prefer it. I suspect that it is slower, however:
library(zoo)
sum(rollapply(y, width = length(x), function(v) all(x == v)))
cy <- paste(y, collapse = "")
cx <- paste(x, collapse = "")
(nchar(cy) - nchar(gsub(cx, "", cy))) / nchar(cx)
[1] 4
This is also robust to the case mentioned by @jamieRowen when the expected result is 0.
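Pasting numbers into one string can be ambiguous once values have more than one digit: c(1, 12) and c(11, 2) both collapse to "112". A base R sketch that compares windows element by element avoids this and needs no extra package. count_subseq is a hypothetical name, and unlike the gregexpr approach it also counts overlapping occurrences.

```r
# Hypothetical helper: number of positions in y where x occurs as a
# contiguous run, compared element by element.
count_subseq <- function(x, y) {
  n <- length(x)
  m <- length(y)
  if (n == 0L || n > m) return(0L)
  starts <- seq_len(m - n + 1L)   # every possible window start
  sum(vapply(starts, function(i) all(y[i:(i + n - 1L)] == x), logical(1)))
}

x <- c(1, 2, 3, 4, 5, 6)
y <- c(1,2,3,4,5,6,3,2,0,1,2,3,4,5,6,1,2,3,4,5,6,9,2,1,2,3,4,5,6,1,2,3,4)
count_subseq(x, y)                  # 4
count_subseq(c(11, 2), c(1, 12))    # 0, although both paste to "112"
```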

R - Repetitions of an array in other array

From a dataframe I get a new array, sliced from a dataframe.
I want to get the amount of times a certain repetition appears on it.
For example
main <- c(A,B,C,A,B,V,A,B,C,D,E)
p <- c(A,B,C)
q <- c(A,B)
someFunction(main,p)
2
someFunction(main,q)
3
I've been messing around with rle, but it also counts every sub-repetition, which is undesirable.
Is there a quick solution I'm missing?
You can use one of the regular expression tools in R since this is really a pattern matching exercise, specifically gregexpr for this question. The p and q vectors represent the search pattern and main is where we want to search for those patterns. From the help page for gregexpr:
gregexpr returns a list of the same length as text each element of which is of
the same form as the return value for regexpr, except that the starting positions
of every (disjoint) match are given.
So we can count the starting positions in the first list element returned by gregexpr (it holds -1 when there is no match). We'll first collapse the vectors and then do the searching:
someFunction <- function(haystack, needle) {
  haystack <- paste(haystack, collapse = "")
  needle <- paste(needle, collapse = "")
  # fixed = TRUE treats the needle literally rather than as a regular expression
  out <- gregexpr(needle, haystack, fixed = TRUE)
  # gregexpr returns -1 when nothing matches, so count positive positions only
  out.length <- sum(out[[1]] > 0)
  return(out.length)
}
> someFunction(main, p)
[1] 2
> someFunction(main, q)
[1] 3
Note - you also need to put quotes around the elements of your main, p, and q vectors unless you have variables A, B, C, et al. defined.
main <- c("A","B","C","A","B","V","A","B","C","D","E")
p <- c("A","B","C")
q <- c("A","B")
I'm not sure if this is the best way, but you can simply do that work by:
f <- function(a,b)
if (length(a) > length(b)) 0
else all(head(b, length(a)) == a) + Recall(a, tail(b, -1))
Someone may or may not find a built-in function.
Using sapply:
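find_x_in_y <- function(x, y){
  sum(sapply(
    # windows start at 1, ..., length(y) - length(x) + 1 (assumes y is at least as long as x)
    seq_len(length(y) - length(x) + 1),
    function(i) as.numeric(all(y[i:(i + length(x) - 1)] == x))
  ))
}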
find_x_in_y(c("A", "B", "C"), main)
[1] 2
find_x_in_y(c("A", "B"), main)
[1] 3
Here's a way to do it using embed(v,n), which returns a matrix of all n-length sub-sequences of vector v:
find_x_in_y <- function(x, y)
sum( apply( embed( y, length(x)), 1,
identical, rev(x)))
> find_x_in_y(p, main)
[1] 2
> find_x_in_y(q, main)
[1] 3
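The reason for rev(x) is easier to see on a small input: each row of the matrix returned by embed holds one window with its elements in reverse order. The short vector below is a made-up example.

```r
# embed() is in the stats package, which is attached by default
m <- embed(c("A", "B", "C", "D"), 2)
m[1, ]   # "B" "A": the first window, A then B, stored back to front
nrow(m)  # 3 windows of length 2

# A window equals x exactly when its row equals rev(x)
x <- c("A", "B")
sum(apply(m, 1, identical, rev(x)))   # 1
```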
