Compare two strings in R and see additions, deletions


I want to compare two character values in R and see which characters were added and deleted, to display it later similar to git diff --color-words=. (see screenshot below)
For example:
a <- "hello world"
b <- "helo world!"
diff <- FUN(a, b)
where diff would somehow show that an l was dropped and a ! was added.
The ultimate goal is to construct an html string like this hel<span class="deleted">l</span>o world<span class="added">!</span>.
I am aware of diffobj but so far I cannot get it to return the character differences, only the differences between elements.
[Screenshot: output of git diff --color-words=.]

Base R has a function adist that computes the generalized Levenshtein distance. With counts = TRUE and the default partial = FALSE, the attribute "trafos" of the result is set to the sequence of matches, insertions and deletions needed to go from one string to the other. From the documentation, section Value:
If counts is TRUE, the transformation counts are returned as the "counts" attribute of this matrix, as a 3-dimensional array with dimensions corresponding to the elements of x, the elements of y, and the type of transformation (insertions, deletions and substitutions), respectively. Additionally, if partial = FALSE, the transformation sequences are returned as the "trafos" attribute of the return value, as character strings with elements 'M', 'I', 'D' and 'S' indicating a match, insertion, deletion and substitution, respectively. If partial = TRUE, the offsets (positions of the first and last element) of the matched substrings are returned as the "offsets" attribute of the return value (with both offsets -1 in case of no match).
a <- "hello world"
b <- "helo world!"
attr(adist(a, b, counts = TRUE), "trafos")
#> [,1]
#> [1,] "MMDMMMMMMMMI"
Created on 2022-05-31 by the reprex package (v2.0.1)
There is a deletion in the 3rd character and an insertion at the end of string a.
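To get from the "trafos" string to the HTML the question asks for, one can walk both strings in parallel. The following is a minimal sketch, not part of the original answer: trafos_to_html is a hypothetical helper name, the class names deleted and added are taken from the question, and the input is assumed to need no HTML escaping.

```r
# Hypothetical helper: walk the "trafos" string from adist() and wrap
# deleted / inserted characters in <span> tags.
trafos_to_html <- function(a, b) {
  trafos <- attr(adist(a, b, counts = TRUE), "trafos")[1, 1]
  ops <- strsplit(trafos, "")[[1]]
  aa <- strsplit(a, "")[[1]]
  bb <- strsplit(b, "")[[1]]
  del <- function(ch) paste0("<span class=\"deleted\">", ch, "</span>")
  add <- function(ch) paste0("<span class=\"added\">", ch, "</span>")
  i <- 0L  # current position in a
  j <- 0L  # current position in b
  out <- character(length(ops))
  for (k in seq_along(ops)) {
    op <- ops[k]
    if (op == "M") {          # match: copy the character
      i <- i + 1L; j <- j + 1L
      out[k] <- aa[i]
    } else if (op == "D") {   # deletion: character exists only in a
      i <- i + 1L
      out[k] <- del(aa[i])
    } else if (op == "I") {   # insertion: character exists only in b
      j <- j + 1L
      out[k] <- add(bb[j])
    } else {                  # "S", substitution: old character out, new one in
      i <- i + 1L; j <- j + 1L
      out[k] <- paste0(del(aa[i]), add(bb[j]))
    }
  }
  paste(out, collapse = "")
}

html <- trafos_to_html("hello world", "helo world!")
```

This sketch wraps every changed character in its own span; runs of adjacent deletions or insertions are not merged into one span.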

Found a solution using diffobj::ses_dat() and splitting the data into its characters before.
get_html_diff <- function(a, b) {
  aa <- strsplit(a, "")[[1]]
  bb <- strsplit(b, "")[[1]]
  s <- diffobj::ses_dat(aa, bb)
  m <- cumsum(as.integer(s$op) != c(Inf, s$op[1:(length(s$op) - 1)]))
  res <- paste(
    sapply(split(seq_along(s$op), m), function(i) {
      val <- paste(s$val[i], collapse = "")
      if (s$op[i[[1]]] == "Insert")
        val <- paste0("<span class=\"add\">", val, "</span>")
      if (s$op[i[[1]]] == "Delete")
        val <- paste0("<span class=\"del\">", val, "</span>")
      val
    }),
    collapse = "")
  res
}
get_html_diff("hello world", "helo World!")
#> [1] "hel<span class=\"del\">l</span>o <span class=\"del\">w</span><span class=\"add\">W</span>orld<span class=\"add\">!</span>"
Created on 2022-05-31 by the reprex package (v2.0.1)

We use diffobj to compare configuration files (in more or less production environment), and it works just right. In your case, wouldn't diffobj::diffChr be what you want?
diffobj::diffChr("hello world", "helo world!", color.mode = 'rgb')

Related

How to remove/replace specific parentheses from a string containing multiple parentheses in R

Given the following string of parentheses, I am trying to remove one specific pair of parentheses,
where the position of one of its brackets is marked with 1.
((((((((((((((((((********))))))))))))))))))
00000000000000000000000000000000010000000000
So for the above example, the solution I am looking for is
((((((((((-(((((((********)))))))-))))))))))
00000000000000000000000000000000010000000000
I tried using strsplit to split the string and get the index of the bracket marked with 1. But I am not sure how I can get the index of its matching bracket.
Could anyone give some input on this?
What I did..
a = "((((((((((-(((((((********)))))))-))))))))))"
b = "00000000000000000000000000000000010000000000"
which(unlist(strsplit(b,"")) == 1)
#[1] 34
a_mod = unlist(strsplit(a,""))[-34]
Here I removed one bracket of the pair I wanted to remove, but I do not know how to remove its corresponding opening bracket, which is at the 11th position in this example.

Locate the 1 in b, giving pos2, and calculate the length of a, giving n. Then replace positions pos2 and pos1 = n-pos2+1 with minus characters. See ?regexpr, ?nchar and ?substr for more info. No packages are used.
pos2 <- regexpr(1, b)
n <- nchar(a)
pos1 <- n - pos2 + 1
substr(a, pos1, pos1) <- substr(a, pos2, pos2) <- "-"
a
## [1] "((((((((((-(((((((********)))))))-))))))))))"

Since the parentheses are paired, the index of the open parenthesis is the length of the string minus the index of the close parenthesis, plus one (they're equidistant from the string ends).
library(stringr)
string <- "((((((((((((((((((********))))))))))))))))))"
b <- "00000000000000000000000000000000010000000000"
location <- str_locate(b, "1")[1]
len <- str_length(string)
substr(string, location, location) <- "-"
substr(string, len - location + 1, len - location + 1) <- "-"
string
#> [1] "((((((((((-(((((((********)))))))-))))))))))"

You should show what you have tried. One very simple way that would work for your example would be to do something like:
gsub("\\*){8}", "\\*)))))))-", "((((((((((((((((((********))))))))))))))))))")
#> [1] "((((((((((((((((((********)))))))-))))))))))"
Edit:
In response to your question: It depends what you mean by other similar examples.
If you go purely by position in the string, you already have an excellent answer from G. Grothendieck. If you want a solution where you want to replace the nth closing bracket, for example, you could do:
s <- "((((((((((((((((((********))))))))))))))))))"
replace_par <- function(n, string) {
  # replace the first run of n closing brackets with n - 1 brackets plus "-"
  sub(paste0("\\){", n, "}"),
      paste0(paste(rep(")", (n-1)), collapse=""), "-"),
      string)
}
replace_par(8, s)
#> [1] "((((((((((((((((((********)))))))-))))))))))"
Created on 2020-05-21 by the reprex package (v0.3.0)

You could write a function that does the replacement the way you want:
strreplace <- function(x,y,val = "-")
{
regmatches(x,regexpr(1,y)) <- val
sub(".([(](?:[^()]|(?1))*+[)])(?=-)", paste0(val, "\\1"), x, perl = TRUE)
}
a <- "((((((((((((((((((********))))))))))))))))))"
b <- "00000000000000000000000000000000010000000000"
strreplace(a, b)
[1] "((((((((((-(((((((********)))))))-))))))))))"
# Nested parentheses
a = "((((****))))((((((((((((((((((********))))))))))))))))))"
b = "00000000000000000000000000000000000000000000010000000000"
strreplace(a,b)
[1] "((((****))))((((((((((-(((((((********)))))))-))))))))))"
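When the brackets are not symmetric around the middle of the string, the mirrored-position trick no longer applies, and the matching bracket can instead be found by counting nesting depth. The following is a minimal base R sketch of that idea, not taken from any of the answers: match_open is a hypothetical helper name, and the short string x and its mask are made-up test data.

```r
# Hypothetical helper: find the opening bracket that matches the closing
# bracket at position close_pos by scanning left and tracking nesting depth.
match_open <- function(x, close_pos) {
  ch <- strsplit(x, "")[[1]]
  depth <- 0L
  for (i in seq(close_pos, 1L)) {
    if (ch[i] == ")") {
      depth <- depth + 1L
    } else if (ch[i] == "(") {
      depth <- depth - 1L
      if (depth == 0L) return(i)
    }
  }
  NA_integer_  # no matching opening bracket found
}

x <- "(()(()))"      # made-up example, not symmetric around its midpoint
mask <- "00000010"   # marks the closing bracket at position 7
pos2 <- regexpr("1", mask, fixed = TRUE)[1]
pos1 <- match_open(x, pos2)
substr(x, pos1, pos1) <- "-"
substr(x, pos2, pos2) <- "-"
x
#> [1] "(()-()-)"
```

The scan is linear in the length of the string and treats every character other than ( and ) as ordinary content.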

How can I subindex a multiway array with vector that preserves blanks

As a minimal example, let's consider the following multiway array (a):
a = as.table(array(c(1:8), dim=c(2,2,2)))
For this array, manual subindexing is easy, e.g.
a[1,,] (a 2 x 2 matrix corresponding to dimension one being in state 1 (A))
My question is: how can I do the same thing with a vector that preserves blanks, e.g. something like c(1,,)?
Note that I need to define which dimensions are left blank dynamically, based on the observed variables in an instance. My initial thought was a generic character vector b=c("","",""), where I could replace variable 1 with 1 if it was observed in state 1, e.g. b[1]="1". But first of all, I do not know how to use this vector for indexing (something like a["b"]), or whether there is a better way of doing this.
I need this dynamic indexing because I want to update parts of the table as I receive evidence (information == counts).
Thank you very much in advance!
Best,
Sebastian

Here's how I would do it:
while (evidence) {
idx <- lapply(dim(a), function (dimsize) 1:dimsize)
## update `idx` according to `evidence`, e.g.,
## if you want to do `a[1,,2]`
idx[[1]] <- 1
idx[[3]] <- 2
do.call(`[`, c(list(a), idx))
## if you want to do `a[1,,2] <- c(20, 30)`
a <- do.call(`[<-`, c(list(a), idx, list(value=c(20, 30))))
}
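The same do.call idea can be wrapped in a small function that takes one entry per dimension. This is only a sketch under assumptions: index_by_state is a hypothetical helper name, and NA is used here (instead of the empty strings from the question) to mean "leave this dimension blank". A blank index is represented by TRUE, which selects everything along that dimension.

```r
a <- as.table(array(1:8, dim = c(2, 2, 2)))

# Hypothetical helper: state has one entry per dimension; NA means
# "leave this dimension blank".
index_by_state <- function(arr, state) {
  idx <- lapply(state, function(s) if (is.na(s)) TRUE else s)
  do.call(`[`, c(list(arr), idx))
}

state <- c(NA, NA, NA)    # nothing observed yet
state[1] <- 1             # variable 1 observed in state 1
index_by_state(a, state)  # selects the same cells as a[1, , ]
```

Updating works the same way with `[<-` in place of `[`, as in the loop above.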

Here is a dirty way of solving it:
data:
a = as.table(array(c(1:8), dim=c(2,2,2)))
Your dynamic indices should be text (how you get your condition into a string like index or index2 is a separate question):
index = "1,,"
index2 = ",2,"
function:
crazyIndexing <- function(obj, index) {
stringExpr = paste0(obj, "[",index,"]")
return(eval(parse(text=stringExpr)))
}
call your function: (see how it does the same!)
a[1,,]
crazyIndexing("a",index)
a[,2,]
crazyIndexing("a",index2)
please note:
b=c("","",""); b[1]="1"
index = paste0(b, collapse = ",")
#[1] "1,,"
You can of course change your function accordingly:
crazyIndexing2 <- function(obj, obj2, index) {
stringExpr = paste0(obj ,"[",index,"]", "<-", obj, "[",index,"]", "+", obj2)
eval(parse(text=stringExpr))
return( get(obj) )
}
a = as.table(array(c(1:8), dim=c(2,2,2)))
aa = a[,2,]
aopt = crazyIndexing2("a","aa","1,,")
Now you have all the tools.

R: For loop works on list, not individual element

I'm trying to learn by writing a function. It should convert the UOM (unit of measure) into a fraction of the standard UOM. In this case, 1/10 or 0.1
I'm trying to loop through a list generated from strsplit, but I only get the whole list, not each element in the list. I can't figure out what I'm doing wrong. Is strsplit the wrong function? I don't think the problem is in strsplit, but I can't figure out what I'm doing wrong in the For loop:
qty<-0
convf<-0
uom <- "EA"
std <- "CA"
pack <-"1EA/10CA"
if(uom!=std){
s<-strsplit(pack,split = '/')
for (i in s){
print(i)
if(grep(uom,i)){
qty<- regmatches(i,regexpr('[0-9]+',i))
}
if(grep(std,i)){
convf<-regmatches(i, regexpr('[0-9]+',i))
}
} #end for
qty<-as.numeric(qty)
convf<-as.numeric(convf)
}
return(qty/convf)

Maybe it is a problem with the indexing of the list. Have you tried using [[1]] after the strsplit call?
Example:
string <- "Hello/World"
mylist <- strsplit(string, "/")
## [[1]]
## [1] "Hello" "World"
But if we explicitly say that we want the first element of the list with [[1]], we get the character vector of split strings itself.
Example:
string <- "Hello/World"
mylist <- strsplit(string, "/")[[1]]
## [1] "Hello" "World"
Hope this can help you in your problem.

There are a few issues here. The main problem you are having is that s is a list of length 1. Within that list, the first (only) element is a vector of length 2. Consequently, you would need to set i in s[[1]].
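A minimal sketch of the loop with that fix applied, assuming the goal is the ratio of the two quantities in pack. The function name convert_uom is hypothetical. It also swaps grep for grepl, because grep returns integer(0) when nothing matches, which makes if() fail.

```r
# Hypothetical wrapper around the loop from the question.
convert_uom <- function(pack, uom, std) {
  if (uom == std) return(1)
  qty <- NA_real_
  convf <- NA_real_
  # strsplit() returns a list; [[1]] extracts the character vector of parts
  for (part in strsplit(pack, "/", fixed = TRUE)[[1]]) {
    num <- as.numeric(regmatches(part, regexpr("[0-9]+", part)))
    if (grepl(uom, part, fixed = TRUE)) qty <- num
    if (grepl(std, part, fixed = TRUE)) convf <- num
  }
  qty / convf
}

convert_uom("1EA/10CA", uom = "EA", std = "CA")
#> [1] 0.1
```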
However, we can go one step further. Try the following code:
library(stringr)
lapply(strsplit(pack,split = '/'), # works within the list, can handle larger vectors for `pack`
function(x, uom, std) {
reg_expr <- paste(uom,std, sep = "|") # call this on its own, it's just searching for the text saved in uom or std
qty <- as.numeric(str_remove(x, reg_expr)) # removes that text and converts the string to a number
names(qty) <- str_extract(x, reg_expr) # extracts the text and uses it to name elements in qty
qty[uom] / qty[std] # your desired result.
},
uom = uom, # since these are part of the function call, we need to specify what they are. This is where you should change them.
std = std)

I don't know if this is what you're trying to practice, but I'd avoid loops while extracting the digits from a string like "1EA/10CA". If it helps, the column lst is actually a list inside of a dataset.
library(magrittr)
ds <- data.frame(pack = c("1EA/10CA", "1EA/4CA", "2EA/2CA"))
pattern <- "^(\\d+)EA/(\\d+)CA$"
ds %>%
dplyr::mutate(
qty = as.numeric(sub(pattern, "\\1", pack)),
convf = as.numeric(sub(pattern, "\\2", pack)),
ratio = qty / convf,
lst = purrr::map2(qty, convf, ~list(qty=.x[[1]], convf=.y[[1]]))
)
Result:
      pack qty convf ratio   lst
1 1EA/10CA   1    10  0.10 1, 10
2  1EA/4CA   1     4  0.25  1, 4
3  2EA/2CA   2     2  1.00  2, 2

R: Define ranges from text using regex

I need a way to call defined variables depending on a string within text.
Let's say I have five variables (r010, r020, r030, r040, r050).
If there is a given text of the form "r010-050", I want the sum of the values of all five variables.
The whole text would look like "{r010-050} == {r060}"
The first part of that equation needs to be replaced by the sum of the five variables and since r060 is also a variable the result (via parsing the text) should be a logical value.
I think regex will help here again.
Can anyone help?
Thanks.

Define the inputs: the variables r010 etc. which we assume are scalars and the string s.
Then define a pattern pat which matches the {...} part and a function Sum which accepts the 3 capture groups in pat (i.e. the strings matched to the parts of pat within parentheses) and performs the desired sum.
Use gsubfn to match the pattern, passing the capture groups to Sum and replacing the match with the output of Sum. Then evaluate it.
In the example the only variables in the global environment whose names are between r010 and r050 inclusive are r010 and r020 (it would have used more had they existed) and since they sum to r060 it returned TRUE.
library(gsubfn)
# inputs
r010 <- 1; r020 <- 2; r060 <- 3
s <- "{r010-050} == {r060}"
pat <- "[{](\\w+)(-(\\w+))?[}]"
Sum <- function(x1, x2, x3, env = .GlobalEnv) {
x3 <- if(x3 == "") x1 else paste0(gsub("\\d", "", x1), x3)
lst <- ls(env)
sum(unlist(mget(lst[lst >= x1 & lst <= x3], envir = env)))
}
eval(parse(text = gsubfn(pat, Sum, s)))
## [1] TRUE
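The same replacement can be sketched in base R without gsubfn. This is an illustrative variant, not the answer's method: expand_ranges is a hypothetical name, and the values are assumed to live in a named list vals rather than in the global environment, which avoids picking up unrelated variables.

```r
# Assumed inputs: values in a named list instead of global variables
vals <- list(r010 = 1, r020 = 2, r060 = 3)

# Hypothetical helper: replace every {...} token by the sum of the
# values whose names fall inside the range.
expand_ranges <- function(s, vals) {
  m <- gregexpr("[{][^}]+[}]", s)
  tokens <- regmatches(s, m)[[1]]
  sums <- vapply(tokens, function(tok) {
    parts <- strsplit(gsub("[{}]", "", tok), "-", fixed = TRUE)[[1]]
    lo <- parts[1]
    # "r010-050": reuse the non-digit prefix of the lower bound for the upper bound
    hi <- if (length(parts) == 2) paste0(gsub("\\d", "", lo), parts[2]) else lo
    nm <- names(vals)
    sum(unlist(vals[nm >= lo & nm <= hi]))
  }, numeric(1))
  regmatches(s, m) <- list(as.character(sums))
  s
}

expr <- expand_ranges("{r010-050} == {r060}", vals)
expr
#> [1] "3 == 3"
eval(parse(text = expr))
#> [1] TRUE
```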

String splitting in R Programming

Currently the script below splits a combined item code into specific item codes.
rule2 <- c("MR")
df_1 <- test[grep(paste("^",rule2,sep="",collapse = "|"),test$Name.y),]
SpaceName_1 <- function(s){
num <- str_extract(s,"[0-9]+")
if(nchar(num) >3){
former <- substring(s, 1, 4)
latter <- strsplit(substring(s,5,nchar(s)),"")
latter <- unlist(latter)
return(paste(former,latter,sep = "",collapse = ","))
}
else{
return (s)
}
}
df_1$Name.y <- sapply(df_1$Name.y, SpaceName_1)
Example,
Combined item code Room 324-326 is split into MR324 MR325 MR326.
However, the combined item code Room 309-311 is split into MR309 MR300 MR301.
How should I amend the script to give me MR309 MR310 MR311?

You can try something along these lines:
range <- "324-326"
x <- as.numeric(unlist(strsplit(range, split="-")))
paste0("MR", seq(x[1], x[2]))
[1] "MR324" "MR325" "MR326"
I assume that you can obtain the numerical room sequence by some means, and then use the snippet I gave you above.
If your combined item codes always have the form Room xxx-yyy, then you can extract the range using gsub:
range <- gsub("Room ", "", "Room 324-326")
If your item codes were in a vector called codes, then you could obtain a vector of ranges using:
ranges <- sapply(codes, function(x) gsub("Room ", "", x))
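Putting these pieces together, a minimal sketch of one function that goes straight from a combined code to the individual item codes. The name expand_rooms is hypothetical, and the code is assumed to contain exactly two numbers: the start and end of the range.

```r
# Hypothetical helper: pull the two numbers out of the code and build
# the full sequence between them.
expand_rooms <- function(code, prefix = "MR") {
  nums <- as.integer(regmatches(code, gregexpr("[0-9]+", code))[[1]])
  paste0(prefix, seq(nums[1], nums[2]))
}

expand_rooms("Room 309-311")
#> [1] "MR309" "MR310" "MR311"

# For a vector of codes, apply it element by element
codes <- c("Room 324-326", "Room 309-311")
lapply(codes, expand_rooms)
```

Because the sequence is built from the numeric endpoints, a range that crosses a tens boundary, such as 309-311, is handled correctly.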

We can also evaluate the string after replacing the - with : and then paste the prefix "MR".
paste0("MR", eval(parse(text=sub("\\S+\\s+(\\d+)-(\\d+)", "\\1:\\2", range))))
#[1] "MR324" "MR325" "MR326"
Wrap it as a function for convenience
fChange <- function(prefixStr, RangeStr){
paste0(prefixStr, eval(parse(text=sub("\\S+\\s+(\\d+)-(\\d+)",
"\\1:\\2", RangeStr))))
}
fChange("MR", range)
fChange("MR", range1)
#[1] "MR309" "MR310" "MR311"
For multiple elements, just loop over and apply the function
sapply(c(range, range1), fChange, prefixStr = "MR")
data
range <- "Room 324-326"
range1 <- "Room 309-311"
