Replace expressions in a source file from another source file in R

Replace expressions in a source file from another source file in R - r

Hello I have the following problem
Say I have a file base.R
x <- 1
# comment
y <- Y ~ X1 +
X2
# comment 2
z <- function(x) {
x + 1
}
t <- z(x)
and another file override.R
x <- 2
y <- Y ~ X1 + X3
my goal would be to create another file new.R which is essentially base.R overriden by override.R
x <- 2
# comment
y <- Y ~ X1 + X3
# comment 2
z <- function(x) {
x + 1
}
t <- z(x)
Obviously if all expressions in base.R were 1 liners I would be able to use sed but unfortunately it's not the case.
Note that I only need it to work for assignations lhs <- rhs either if ideally lhs = rhs would work as well.
EDIT: the above is a minimization of my actual problem

Sometime a difficult problem is best made easier by redefining the problem itself. In the following we suggest a number of approaches some of which have particularly simple implementations. In (7) we provide code that does what the question asks but you may prefer to change the problem slightly and use simpler code in one of the other solutions we provide.
1) omit first few lines in base.R & concatenate files We will assume that override.R should override everything in base.R up to the last statement to be overriden in base.R. Comments in base.R after the overridden statements will be kept as well any comments in override.R . Thus in the example comments 2 will be kept and comments will be overridden but could be replicated in override.R if desired which seems reasonable since you can't assume that a comment that applies to the assignment in base.R also applies in override.R .
Determine the number of statements n in override.R. Then parse base.R and find the last line number prior to the first line not to be overridden, ix. Then in the lines ending in that line number find the last non-comment line number, mx. Now write out override.R followed by all but the first mx lines of base.R . In the code below replace stdout() with the desired name of the output file, e.g. "outfile.R" .
library(utils)
n <- length(parse("override.R"))
g <- getParseData(parse("base.R"))
ix <- g$line1[grep("^0", g$parent)][n + 1] - 1
baseLines <- readLines("base.R")
is_comment <- grepl("^\\s*#", head(baseLines, ix))
mx <- max(which(!is_comment))
overrideLines <- readLines("override.R")
writeLines(c(overrideLines, tail(baseLines, -mx)), stdout())
giving:
x <- 2
y <- Y ~ X1 + X3
# comment 2
z <- function(x) {
x + 1
}
t <- z(x)
2) comment out rather thqan omit One alternative would be to comment out the overridden lines rather than omitting them. We can readily do that by replacing the writeLines statement with the statement below. This will allow one to see both the comments in base.R, if any, and the comments in override.R .
writeLines(c(overrideLines, sub("^", "# ", head(baseLines, mx)),
tail(baseLines, -mx)), stdout())
giving:
x <- 2
y <- Y ~ X1 + X3
# x <- 1
# # comment
# y <- Y ~ X1 +
# X2
# comment 2
z <- function(x) {
x + 1
}
t <- z(x)
3) separator If you control base.R then a simpler approach is to mark the end of the portion to be overriden. Suppose we put #--- on a line by itself in base.R between the portion to override and the rest. Then we have the following which is simpler:
overrideLines <- readLines("override.R")
baseLines <- readLines("base.R")
ix <- grep("#---", baseLines)[1]
writeLines(c(overrideLines, tail(baseLines, -ix)), stdout())
4) exists or possibly, in base.R, check if x has already been defined and only define it if not. Ditto for y. Then it is just a matter of concatenating the two files or sourcing one after the other.
if (!exists("x")) x <- ...whatever...
if (!exists("y")) y <- ...whatever...
5) function Yet another possibility is to define a function whose defaults are the current values of x and y in base.R. Then we can call it as f() to get the defaults or specify them.
f <- function(x = ..., y = ...) { ...base.R code except x and y ...}
6) Omit definitions from base.R Perhaps the simplest alternative is just to omit the definitions from base.R and for each run have a override.R that is sourced first or concatenated.
7) Keep base.R comments This one does what the question asked but it is a bit complex and you may prefer one of the other solutions.
library(codetools)
library(utils)
baseLines <- readLines("base.R")
overrideLines <- readLines("override.R")
p_o <- parse("override.R")
g_o <- getParseData(p_o)
locals_o <- findLocalsList(p_o)
ipos <- sapply(locals_o, function(x) which(g_o$text == x)[1]-1)
DFo <- cbind(g_o[ipos, ], var = names(ipos))
p_b <- parse("base.R")
g_b <- getParseData(p_b)
ipos <- sapply(locals_o, function(x) which(g_b$text == x)[1]-1)
DFb <- data.frame(g_b[ipos, ], var = names(ipos), row.names = NULL)
o <- order(-DFb$line1)
DFb <- DFb[o, ]
newLines <- baseLines
for(i in 1:nrow(DFb)) {
j <- match(DFb$var[i], DFo$var)
newLines <- append(newLines,
overrideLines[DFo$line1[j]:DFo$line2[j]], DFb$line2[i])
newLines <- newLines[-(DFb$line1[i]:DFb$line2[i])]
}
writeLines(newLines, stdout())
giving:
x <- 2
# comment
y <- Y ~ X1 + X3
# comment 2
z <- function(x) {
x + 1
}
t <- z(x)

If you can accept comments being stripped, then this might suffice for you:
Starting with base.R:
x <- 1
# comment
y <- Y ~ X1 +
X2
# comment 2
z <- function(x) {
x + 1
}
t <- z(x)
and override.R:
x <- 2
y <- Y ~ X1 + X3
We can run:
base <- parse("base.R")
override <- parse("override.R")
base_assignment <-
sapply(base, function(z) as.character(z[[1]]) %in% c("<-", "="))
base_lhs <- mapply(function(assigned, z) as.character(z[[2]]),
base_assignment, base)
override_assignment <-
sapply(override, function(z) as.character(z[[1]]) %in% c("<-", "="))
override_lhs <- mapply(function(assigned, z) as.character(z[[2]]),
override_assignment, override)
matches <- match(base_lhs, override_lhs)
base[which(!is.na(matches))] <- override[na.omit(matches)]
writeLines(paste(do.call(c, lapply(base, deparse)), collapse = "\n"), "new.R")
and now we have new.R with
x <- 2
y <- Y ~ X1 + X3
z <- function(x) {
x + 1
}
t <- z(x)
For conversation, in order to retain comments we'd likely need to use getParseData:
iterate over $parent and $id so that our $line1 references can be combined, store this reduced line1 into a new variable (since we'll need to remove the originals from getParseData(base);
find all references to $token == "SYMBOL" where there exists $token == "LEFT_ASSIGN" later in each expression. This starts to hobble it a little in the instance we have "EQ_ASSIGN" or, more of a challege, "RIGHT_ASSIGN" (since the presumed order of symbols changes);
step 2 helps us find object names to which assignments occur, which we use to compare between base/override processing;
replace the subset of each versions' parsed frame;
find a way to recombine the resulting parsed frame into a source file.
I ran out of time trying to get this to work elegantly/robustly, so I offer it as an example of effort-required in order to retain comments.
I suggest that if your intent is to allow a single source file of overriding expressions, it makes sense to keep the base.R untouched (as in your question) and create a temporary new.R that is used and sourced and discarded, in which case its comments are tangential.

This would be very challenging with sed, but you could try using awk; this works with the example data:
awk 'BEGIN{FS="<-|="} NR==FNR{a[$1]=$0; next}; {if($1 in a){c=1} else if (/[#<=]/){c++}; if(c == 1){print a[$1]} else {print $0}}' override.R base.R
x <- 2
# comment
y <- Y ~ X1 + X3
# comment 2
z <- function(x) {
x + 1
}
t <- z(x)
Basically, if the LHS from override.R is found in base.R the 'counter' is set to 1, then if any of the "#<=" characters are encountered in the following lines the counter is incremented. Then all lines in base.R with counter == 1 are replaced with the corresponding line from override.R. I can't think of cases where this would fail, but I'd be interested to see if it holds up on more complicated examples.
Formatted with further explanation:
awk 'BEGIN{FS="<-|="} # set the field separator to either "<-" or "="
NR==FNR{a[$1]=$0; next} # load override.R into an array, key = 1st field (i.e. the LHS)
{
if($1 in a){c=1} # if the LHS from override.R is found, set "c" (counter) to 1
else if(/[#<=]/){c++} # if the line contains "#", "<" or "=", increment counter
if(c==1){print a[$1]} # if c equals 1 (i.e. the LHS is present) print the override.R line
else {print $0} # else print the base.R line
}' override.R base.R

Related

R progressively search string of a hierarchical lookup table for matches

I am working with OPS codes which code the type of procedure performed in a hospital. The OPS coding list has a hierarchical structure of the form X-XXX.XX with X being numbers. The coding structure is hierarchical, that means, the first X- is a big set, then the XXX denote a subset type of procedure within the first X-, the last .XX denote a subspecialization of the XXX
so the code might be X-XXX, X-XXX. , X-XXX.X, X-XXX.XX
My problem is that a program we uses collapses the structure of the code to XXXX, XXXXX, or XXXXXXX and i would like to match the collapsed with the uncollapsed llokup table of definitions.
So I would like to have a routine that checks for each digit and then procedes to the next when performing the matching. grepl would not to because 5381 would match 65381 (the uncollapsed would be 5-381 and 6-538.1) which are totally different procedures. I would need something that would match character to character (first number second number etc) and respects the character positions.
When an exact match cannot be found, it should return the first match that matches the same character positions.
More examples in pseudocode
which("5381" %in% c("65381","53811", "5382")) should return 2 since the second item matches all available characters provided
which("5381" %in% c("538110","538111", "538221")) should return 1 (because its the first match, the lookup table within c() is sorted.
which("5381." %in% c("5381","538111", "538121")) should return 1 (because its the first match, the lookup table within c() is sorted. Note that the period is ignored in the match
which("5381.1" %in% c("5381","538111", "538112")) should return 2 (because its the first match that matches all available five characters and we don't have a fifth.
I know this is not the best example of a question in SO but I am open to improve the question.

This is probably too complicated but it works.
First define a generic to transform the input string to the OPS format. Then have a matching function check if x has y as a substring.
Note that the matching function does not check if x is a substring of y, it's the other way around.
as.ops <- function(x, ...) UseMethod("as.ops")
as.ops.default <- function(x, ...){
warning("The default method coerces its argument to character and calls the character method")
as.ops.character(as.character(x))
}
as.ops.character <- function(x, ...){
x <- gsub("[^[:digit:]]", "", x)
ops1 <- substr(x, 1, 1)
ops2 <- substr(x, 2, 4)
ops3 <- substring(x, 5)
y <- character(length(x))
n <- findInterval(nchar(x), c(0, 1, 4, 7))
y[n == 1] <- x[n == 1]
y[n != 1] <- paste(ops1[n != 1], ops2[n != 1], sep = "-")
o3 <- nchar(ops3) > 0
y[n == 3 & o3] <- paste(y[n == 3 & o3], ops3[n == 3 & o3], sep = ".")
y
}
ops_match <- function(x, y){
xo <- as.ops(x)
yo <- as.ops(y)
i <- (xo %in% yo) | grepl(yo, xo)
which(i)
}
x1 <- c("65381","53811", "5382")
x2 <- c("538110","538111", "538221")
x3 <- c("5381","538111", "538121")
x4 <- c("5381","538111", "538112")
y1 <- y2 <- "5381"
y3 <- "5381."
y4 <- "5381.1"
ops_match(x1, y1)
ops_match(x2, y2)
ops_match(x3, y3)
ops_match(x4, y4)

How to use grep function in for loop

I have troubles using the grep function within a for loop.
In my data set, I have several columns where only the last 5-6 letters change. With the loop I want to use the same functions for all 16 situations.
Here is my code:
situations <- c("KKKTS", "KKKNL", "KKDTS", "KKDNL", "NkKKTS", "NkKKNL", "NkKDTS", "NkKDNL", "KTKTS", "KTKNL", "KTDTS", "KTDNL", "NkTKTS", "NkTKNL", "NkTDTS", "NkTDNL")
View(situations)
for (i in situations[1:16]) {
## Trust Skala
a <- vector("numeric", length = 1L)
b <- vector("numeric", length = 1L)
a <- grep("Tru_1_[i]", colnames(cleandata))
b <- grep("Tru_5_[i]", colnames(cleandata))
cleandata[, c(a:b)] <- 8-cleandata[, c(a:b)]
attach(cleandata)
cleandata$scale_tru_[i] <- (Tru_1_[i] + Tru_2_[i] + Tru_3_[i] + Tru_4_[i] + Tru_5_[i])/5
detach(cleandata)
}
With the grep function I first want to finde the column number of e.g. Tru_1_KKKTS and Tru_5_KKKTS. Then I want to reverse code the items of the specific column numbers. The last part worked without the loop when I manually used grep for every single situation.
Here ist the manual version:
# KKKTS
grep("Tru_1_KKKTS", colnames(cleandata)) #29 -> find the index of respective column
grep("Tru_5_KKKTS", colnames(cleandata)) #33
cleandata[,c(29:33)] <- 8-cleandata[c(29:33)] # trust scale ranges from 1 to 7 [8-1/2/3/4/5/6/7 = 7/6/5/4/3/2/1]
attach(cleandata)
cleandata$scale_tru_KKKTS <- (Tru_1_KKKTS + Tru_2_KKKTS + Tru_3_KKKTS + Tru_4_KKKTS + Tru_5_KKKTS)/5
detach(cleandata)

You can do:
Mean5 <- function(sit) {
cnames <- paste0("Tru_", 1:5, "_", sit)
rowMeans(cleandata[cnames])
}
cleandata[, paste0("scale_tru_", situations)] <- sapply(situations, FUN=Mean5)

how about something like this. It's a bit more compact and you don't have to use attach..
situations <- c("KKKTS", "KKKNL", "KKDTS", "KKDNL", "NkKKTS", "NkKKNL", "NkKDTS", "NkKDNL", "KTKTS", "KTKNL", "KTDTS", "KTDNL", "NkTKTS", "NkTKNL", "NkTDTS", "NkTDNL")
for (i in situations[1:16]) {
cols <- paste("Tru", 1:5, i, sep = "_")
result <- paste("scale_tru" , i, sep = "_")
cleandata[cols] <- 8 - cleandata[cols]
cleandata[result] <- rowMeans(cleandata[cols])
}
I took for granted that when you write a:b you mean all the columns between those, which I assumed were named from 2 to 4

situations <- c("KKKTS", "KKKNL", "KKDTS", "KKDNL", "NkKKTS", "NkKKNL", "NkKDTS", "NkKDNL", "KTKTS", "KTKNL", "KTDTS", "KTDNL", "NkTKTS", "NkTKNL", "NkTDTS", "NkTDNL")
# constructor for column names
get_col_names <- function(part) paste("Tru", 1:5, part, sep="_")
for (situation in situtations) {
# revert the values in the columns in situ
cleandata[, get_col_names(situation)] <- 8 - cleandata[, get_col_names(situtation)]
# and calculate the average
subdf <- cleandata[, get_col_names(situation)]
cleandata[, paste0("scale_tru_", situation)] <- rowSums(subdf)/ncol(subdf)
}
By the way, you call it "scale" but your code shows an average/mean calculation.
(Scale without centering).

Lookup list of formulas in other list

I am comparing two lists of formulas to see if some previously computed models can be reused. Right now I'm doing this like this:
set.seed(123)
# create some random formulas
l1 <- l2 <- list()
for (i in 1:10) {
l1[[i]] <- as.formula(paste("z ~", paste(sample(letters, 3), collapse = " + ")))
l2[[i]] <- as.formula(paste("z ~", paste(sample(letters, 3), collapse = " + ")))
}
# at least one appears in the other list
l1[[5]] <- l2[[7]]
# helper function to convert formulas to character strings
as.formulaCharacter <- function(x) paste(deparse(x))
# convert both lists to strings
s1 <- sapply(l1, as.formulaCharacter)
s2 <- sapply(l2, as.formulaCharacter)
# look up elements of one vector in the other
idx <- match(s1, s2, nomatch = 0L) # 7
s1[idx] # found matching elements
However, I noticed that some formulas are not retrieved although they are practically equivalent.
f1 <- z ~ b + c + b:c
f2 <- z ~ c + b + c:b
match(as.formulaCharacter(f1), as.formulaCharacter(f2)) # no match
I get why this result is different, the strings just aren't the same, but I'm struggling with how to extend this approach method to also work for formulas with reordered elements. I could use strsplit to first sort all formula components independently, but that sounds horribly inefficient to me.
Any ideas?

If the formulas are restricted to a sum of terms which contain colon separated variables then we can create a standardized string by extracting the term labels, exploding those with colons, sorting them, pasting the exploded terms back together, sorting this and turning that into a formula string.
stdize <- function(fo) {
s <- strsplit(attr(terms(f2), "term.labels"), ":")
terms <- sort(sapply(lapply(s, sort), paste, collapse = ":"))
format(reformulate(terms, all.vars(fo)[1]))
}
stdize(f1) == stdize(f2)
## [1] TRUE

In R distance between two sentences: Word-level comparison by minimum edit distance

While trying to learn R, I want to implement the algorithm below in R. Consider the two lists below:
List 1: "crashed", "red", "car"
List 2: "crashed", "blue", "bus"
I want to find out how many actions it would take to transform 'list1' into 'list2'.
As you can see I need only two actions:
1. Replace "red" with "blue".
2. Replace "car" with "bus".
But, how we can find the number of actions like this automatically.
We can have several actions to transform the sentences: ADD, REMOVE, or REPLACE the words in the list.
Now, I will try my best to explain how the algorithm should work:
At the first step: I will create a table like this:
rows: i= 0,1,2,3,
columns: j = 0,1,2,3
(example: value[0,0] = 0 , value[0, 1] = 1 ...)
crashed red car
0 1 2 3
crashed 1
blue 2
bus 3
Now, I will try to fill the table. Please, note that each cell in the table shows the number of actions we need to do to reformat the sentence (ADD, remove, or replace).
Consider the interaction between "crashed" and "crashed" (value[1,1]), obviously we don't need to change it so the value will be '0'. Since they are the same words. Basically, we got the diagonal value = value[0,0]
crashed red car
0 1 2 3
crashed 1 0
blue 2
bus 3
Now, consider "crashed" and the second part of the sentence which is "red". Since they are not the same word we can use calculate the number of changes like this :
min{value[0,1] , value[0,2] and value[1,1]} + 1
min{ 1, 2, 0} + 1 = 1
Therefore, we need to just remove "red".
So, the table will look like this:
crashed red car
0 1 2 3
crashed 1 0 1
blue 2
bus 3
And we will continue like this :
"crashed" and "car" will be :
min{value[0,3], value[0,2] and value[1,2]} + 1
min{3, 2, 1} +1 = 2
and the table will be:
crashed red car
0 1 2 3
crashed 1 0 1 2
blue 2
bus 3
And we will continue to do so. the final result will be :
crashed red car
0 1 2 3
crashed 1 0 1 2
blue 2 1 1 2
bus 3 2 2 2
As you can see the last number in the table shows the distance between two sentences: value[3,3] = 2
Basically, the algorithm should look like this:
if (characters_in_header_of_matrix[i]==characters_in_column_of_matrix [j] &
value[i,j] == value[i+1][j-1] )
then {get the 'DIAGONAL VALUE' #diagonal value= value[i, j-1]}
else{
value[i,j] = min(value[i-1, j], value[i-1, j-1], value[i, j-1]) + 1
}
endif
for finding the difference between the elements of two lists that you can see in the header and the column of the matrix, I have used the strcmp() function which will give us a boolean value(TRUE or FALSE) while comparing the words. But, I fail at implementing this.
I'd appreciate your help on this one, thanks.

The question
After some clarification in a previous post, and after the update of the post, my understanding is that Zero is asking: 'how one can iteratively count the number of word differences in two strings'.
I am unaware of any implementation in R, although i would be surprised if i doesn't already exists. I took a bit of time out to create a simple implementation, altering the algorithm slightly for simplicity (For anyone not interested scroll down for 2 implementations, 1 in pure R, one using the smallest amount of Rcpp). The general idea of the implementation:
Initialize with string_1 and string_2 of length n_1 and n_2
Calculate the cumulative difference between the first min(n_1, n_2) elements,
Use this cumulative difference as the diagonal in the matrix
Set the first off-diagonal element to the very first element + 1
Calculate the remaining off diagonal elements as: diag(i) - diag(i-1) + full_matrix(i-1,j)
In the previous step i iterates over diagonals, j iterates over rows/columns (either one works), and we start in the third diagonal, as the first 2x2 matrix is filled in step 1 to 4
Calculate the remaining abs(n_1 - n_2) elements as full_matrix[,min(n_1 - n_2)] + 1:abs(n_1 - n_2), applying the latter over each value in the prior, and bind them appropriately to the full_matrix.
The output is a matrix with dimensions row and column names of the corresponding strings, which has been formatted for some easier reading.
Implementation in R
Dist_between_strings <- function(x, y,
split = " ",
split_x = split, split_y = split,
case_sensitive = TRUE){
#Safety checks
if(!is.character(x) || !is.character(y) ||
nchar(x) == 0 || nchar(y) == 0)
stop("x, y needs to be none empty character strings.")
if(length(x) != 1 || length(y) != 1)
stop("Currency the function is not vectorized, please provide the strings individually or use lapply.")
if(!is.logical(case_sensitive))
stop("case_sensitivity needs to be logical")
#Extract variable names of our variables
# used for the dimension names later on
x_name <- deparse(substitute(x))
y_name <- deparse(substitute(y))
#Expression which when evaluated will name our output
dimname_expression <-
parse(text = paste0("dimnames(output) <- list(",make.names(x_name, unique = TRUE)," = x_names,",
make.names(y_name, unique = TRUE)," = y_names)"))
#split the strings into words
x_names <- str_split(x, split_x, simplify = TRUE)
y_names <- str_split(y, split_y, simplify = TRUE)
#are we case_sensitive?
if(isTRUE(case_sensitive)){
x_split <- str_split(tolower(x), split_x, simplify = TRUE)
y_split <- str_split(tolower(y), split_y, simplify = TRUE)
}else{
x_split <- x_names
y_split <- y_names
}
#Create an index in case the two are of different length
idx <- seq(1, (n_min <- min((nx <- length(x_split)),
(ny <- length(y_split)))))
n_max <- max(nx, ny)
#If we have one string that has length 1, the output is simplified
if(n_min == 1){
distances <- seq(1, n_max) - (x_split[idx] == y_split[idx])
output <- matrix(distances, nrow = nx)
eval(dimname_expression)
return(output)
}
#If not we will have to do a bit of work
output <- diag(cumsum(ifelse(x_split[idx] == y_split[idx], 0, 1)))
#The loop will fill in the off_diagonal
output[2, 1] <- output[1, 2] <- output[1, 1] + 1
if(n_max > 2)
for(i in 3:n_min){
for(j in 1:(i - 1)){
output[i,j] <- output[j,i] <- output[i,i] - output[i - 1, i - 1] + #are the words different?
output[i - 1, j] #How many words were different before?
}
}
#comparison if the list is not of the same size
if(nx != ny){
#Add the remaining words to the side that does not contain this
additional_words <- seq(1, n_max - n_min)
additional_words <- sapply(additional_words, function(x) x + output[,n_min])
#merge the additional words
if(nx > ny)
output <- rbind(output, t(additional_words))
else
output <- cbind(output, additional_words)
}
#set the dimension names,
# I would like the original variable names to be displayed, as such i create an expression and evaluate it
eval(dimname_expression)
output
}
Note that the implementation is not vectorized, and as such can only take single string inputs!
Testing the implementation
To test the implementation, one could use the strings given. As they were said to be contained in lists, we will have to convert them to strings. Note that the function lets one split each string differently, however it assumes space separated strings. So first I'll show how one could achieve a conversion to the correct format:
list_1 <- list("crashed","red","car")
list_2 <- list("crashed","blue","bus")
string_1 <- paste(list_1,collapse = " ")
string_2 <- paste(list_2,collapse = " ")
Dist_between_strings(string_1, string_2)
output
#Strings in the given example
string_2
string_1 crashed blue bus
crashed 0 1 2
red 1 1 2
car 2 2 2
This is not exactly the output, but it yields the same information, as the words are ordered as they were given in the string.
More examples
Now i stated it worked for other strings as well and this is indeed the fact, so lets try some random user-made strings:
#More complicated strings
string_3 <- "I am not a blue whale"
string_4 <- "I am a cat"
string_5 <- "I am a beautiful flower power girl with monster wings"
string_6 <- "Hello"
Dist_between_strings(string_3, string_4, case_sensitive = TRUE)
Dist_between_strings(string_3, string_5, case_sensitive = TRUE)
Dist_between_strings(string_4, string_5, case_sensitive = TRUE)
Dist_between_strings(string_6, string_5)
Running these show that these do yield the correct answers. Note that if either string is of size 1, the comparison is a lot faster.
Benchmarking the implementation
Now as the implementation is accepted, as correct, we would like to know how well it performs (For the uninterested reader, one can scroll past this section, to where a faster implementation is given). For this purpose, i will use much larger strings. For a complete benchmark i should test various string sizes, but for the purposes i will only use 2 rather large strings of size 1000 and 2500. For this purpose i use the microbenchmark package in R, which contains a microbenchmark function, which claims to be accurate down to nanoseconds. The function itself executes the code 100 (or a user defined) number of times, returning the mean and quartiles of the run times. Due to other parts of R such as the Garbage Cleaner, the median is mostly considered a good estimate of the actual average run-time of the function.
The execution and results are shown below:
#Benchmarks for larger strings
set.seed(1)
string_7 <- paste(sample(LETTERS,1000,replace = TRUE), collapse = " ")
string_8 <- paste(sample(LETTERS,2500,replace = TRUE), collapse = " ")
microbenchmark::microbenchmark(String_Comparison = Dist_between_strings(string_7, string_8, case_sensitive = FALSE))
# Unit: milliseconds
# expr min lq mean median uq max neval
# String_Comparison 716.5703 729.4458 816.1161 763.5452 888.1231 1106.959 100
Profiling
Now i find the run-times very slow. One use case for the implementation could be an initial check of student hand-ins to check for plagiarism, in which case a low difference count very likely shows plagiarism. These can be very long and there may be hundreds of handins, an as such i would like the run to be very fast.
To figure out how to improve my implementation i used the profvis package with the corrosponding profvis function. To profile the function i exported it in another R script, that i sourced, running the code 1 once prior to profiling to compile the code and avoid profiling noise (important). The code to run the profiling can be seen below, and the most important part of the output is visualized in an image below it.
library(profvis)
profvis(Dist_between_strings(string_7, string_8, case_sensitive = FALSE))
Now, despite the colour, here i can see a clear problem. The loop filling the off-diagonal by far is responsible for most of the runtime. R (like python and other not compiled languages) loops are notoriously slow.
Using Rcpp to improve performance
To improve the implementation, we could implement the loop in c++ using the Rcpp package. This is rather simple. The code is not unlike the one we would use in R, if we avoid iterators. A c++ script can be made in file -> new file -> c++ File. The following c++ code would be pasted into the corrosponding file and sourced using the source button.
//Rcpp Code
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix Cpp_String_difference_outer_diag(NumericMatrix output){
long nrow = output.nrow();
for(long i = 2; i < nrow; i++){ // note the
for(long j = 0; j < i; j++){
output(i, j) = output(i, i) - output(i - 1, i - 1) + //are the words different?
output(i - 1, j);
output(j, i) = output(i, j);
}
}
return output;
}
The corresponding R function needs to be altered to use this function instead of looping. The code is similar to the first function, only switching the loop for a call to the c++ function.
Dist_between_strings_cpp <- function(x, y,
split = " ",
split_x = split, split_y = split,
case_sensitive = TRUE){
#Safety checks
if(!is.character(x) || !is.character(y) ||
nchar(x) == 0 || nchar(y) == 0)
stop("x, y needs to be none empty character strings.")
if(length(x) != 1 || length(y) != 1)
stop("Currency the function is not vectorized, please provide the strings individually or use lapply.")
if(!is.logical(case_sensitive))
stop("case_sensitivity needs to be logical")
#Extract variable names of our variables
# used for the dimension names later on
x_name <- deparse(substitute(x))
y_name <- deparse(substitute(y))
#Expression which when evaluated will name our output
dimname_expression <-
parse(text = paste0("dimnames(output) <- list(", make.names(x_name, unique = TRUE)," = x_names,",
make.names(y_name, unique = TRUE)," = y_names)"))
#split the strings into words
x_names <- str_split(x, split_x, simplify = TRUE)
y_names <- str_split(y, split_y, simplify = TRUE)
#are we case_sensitive?
if(isTRUE(case_sensitive)){
x_split <- str_split(tolower(x), split_x, simplify = TRUE)
y_split <- str_split(tolower(y), split_y, simplify = TRUE)
}else{
x_split <- x_names
y_split <- y_names
}
#Create an index in case the two are of different length
idx <- seq(1, (n_min <- min((nx <- length(x_split)),
(ny <- length(y_split)))))
n_max <- max(nx, ny)
#If we have one string that has length 1, the output is simplified
if(n_min == 1){
distances <- seq(1, n_max) - (x_split[idx] == y_split[idx])
output <- matrix(distances, nrow = nx)
eval(dimname_expression)
return(output)
}
#If not we will have to do a bit of work
output <- diag(cumsum(ifelse(x_split[idx] == y_split[idx], 0, 1)))
#The loop will fill in the off_diagonal
output[2, 1] <- output[1, 2] <- output[1, 1] + 1
if(n_max > 2)
output <- Cpp_String_difference_outer_diag(output) #Execute the c++ code
#comparison if the list is not of the same size
if(nx != ny){
#Add the remaining words to the side that does not contain this
additional_words <- seq(1, n_max - n_min)
additional_words <- sapply(additional_words, function(x) x + output[,n_min])
#merge the additional words
if(nx > ny)
output <- rbind(output, t(additional_words))
else
output <- cbind(output, additional_words)
}
#set the dimension names,
# I would like the original variable names to be displayed, as such i create an expression and evaluate it
eval(dimname_expression)
output
}
Testing the c++ implementation
To be sure the implementation is correct we check if the same output is obtained with the c++ implementation.
#Test the cpp implementation
identical(Dist_between_strings(string_3, string_4, case_sensitive = TRUE),
Dist_between_strings_cpp(string_3, string_4, case_sensitive = TRUE))
#TRUE
Final benchmarks
Now is this actually faster? To see this we could run another benchmark using the microbenchmark package. The code and results are shown below:
#Final microbenchmarking
microbenchmark::microbenchmark(R = Dist_between_strings(string_7, string_8, case_sensitive = FALSE),
Rcpp = Dist_between_strings_cpp(string_7, string_8, case_sensitive = FALSE))
# Unit: milliseconds
# expr min lq mean median uq max neval
# R 721.71899 753.6992 850.21045 787.26555 907.06919 1756.7574 100
# Rcpp 23.90164 32.9145 54.37215 37.28216 47.88256 243.6572 100
From the microbenchmark median improvement factor of roughly 21 ( = 787 / 37), which is a massive improvement from just implementing a single loop!

There is already an edit-distance function in R we can take advantage of: adist().
As it works on the character level, we'll have to assign a character to each unique word in our sentences, and stitch them together to form pseudo-words we can calculate the distance between.
s1 <- c("crashed", "red", "car")
s2 <- c("crashed", "blue", "bus")
ll <- list(s1, s2)
alnum <- c(letters, LETTERS, 0:9)
ll2 <- relist(alnum[factor(unlist(ll))], ll)
ll2 <- sapply(ll2, paste, collapse="")
adist(ll2)
# [,1] [,2]
# [1,] 0 2
# [2,] 2 0
Main limitation here, as far as I can tell, is the number of unique characters available, which in this case is 62, but can be extended quite easily, depending on your locale. E.g: intToUtf8(c(32:126, 161:300), TRUE).

removing offset terms from a formula

R has a handy tool for manipulating formulas, update.formula(). This works nicely when you want to get something like "formula containing all terms in previous formula except x", e.g.
f1 <- z ~ a + b + c
(f2 <- update.formula(f1, . ~ . - c))
## z ~ a + b
However, this doesn't seem to work with offset terms:
f3 <- z ~ a + offset(b)
update(f3, . ~ . - offset(b))
## z ~ a + offset(b)
I've dug down as far as terms.formula, which ?update.formula references:
[after substituting, ...] The result is then simplified via ‘terms.formula(simplify = TRUE)’.
terms.formula(z ~ a + offset(b) - offset(b), simplify=TRUE)
## z ~ a + offset(b)
(i.e., this doesn't seem to remove offset(b) ...)
I know I can hack up a solution either by using deparse() and text-processing, or by processing the formula recursively to remove the term I don't want, but these solutions are ugly and/or annoying to implement. Either enlightenment as to why this doesn't work, or a reasonably compact solution, would be great ...

1) Recursion Recursively descend through the formula replacing offset(...) with offset and then remove offset using update. No string manipulation is done and although it does require a number of lines of code it's still fairly short and does remove single and multiple offset terms.
If there are multiple offsets one can preserve some of them by setting preserve so, for example, if preserve = 2 then the second offset is preserved and any others are removed. The default is to preserve none, i.e. remove them all.
no.offset <- function(x, preserve = NULL) {
k <- 0
proc <- function(x) {
if (length(x) == 1) return(x)
if (x[[1]] == as.name("offset") && !((k<<-k+1) %in% preserve)) return(x[[1]])
replace(x, -1, lapply(x[-1], proc))
}
update(proc(x), . ~ . - offset)
}
# tests
no.offset(z ~ a + offset(b))
## z ~ a
no.offset(z ~ a + offset(b) + offset(c))
## z ~ a
Note that if you don't need the preserve argument then the line
initializing k can be omitted and the if simplified to:
if (x[[1]] == as.name("offset")) return(x[[1]])
2) terms this neither uses string manipulation directly nor recursion. First get the terms object, zap its offset attribute and fix it using fixFormulaObject which we extract out of the guts of terms.formula. This could be made a bit less brittle by copying the source code of fixFormulaObject into your source and removing the eval line below. preserve acts as in (1).
no.offset2 <- function(x, preserve = NULL) {
tt <- terms(x)
attr(tt, "offset") <- if (length(preserve)) attr(tt, "offset")[preserve]
eval(body(terms.formula)[[2]]) # extract fixFormulaObject
f <- fixFormulaObject(tt)
environment(f) <- environment(x)
f
}
# tests
no.offset2(z ~ a + offset(b))
## z ~ a
no.offset2(z ~ a + offset(b) + offset(c))
## z ~ a
Note that if you don't need the preserve argument then the line that
zaps the offset attribute can be simplified to:
attr(tt, "offset") <- NULL

This seems to be by design. But a simple workaround is
offset2 = offset
f3 <- z ~ a + offset2(b)
update(f3, . ~ . - offset2(b))
# z ~ a
If you need the flexibility to accept formulae that do include offset(), for example if the formula is provided by a package user who may be unaware of the need to use offset2 in place of offset, then we should also add a line to change any instances of offset() in the incoming formula:
f3 <- z ~ a + offset(b)
f4 <- as.formula(gsub("offset\\(", "offset2(", deparse(f3)))
f4 <- update(f4, . ~ . - offset2(b))
# finally, just in case there are any references to offset2 remaining, we should revert them back to offset
f4 <- as.formula(gsub("offset2\\(", "offset(", deparse(f4)))
# z ~ a

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Replace expressions in a source file from another source file in R - r

Related

R progressively search string of a hierarchical lookup table for matches

How to use grep function in for loop

Lookup list of formulas in other list

In R distance between two sentences: Word-level comparison by minimum edit distance

removing offset terms from a formula

Categories

Resources