I need some pointers on this. Actually, I don't necessarily need a fully-fledged solution here - some pointers to functions and/or packages would be great.
The problem: I want to find specific sequences in a character vector. The sequences can be somewhat "underspecified". That means that some of the elements should be fixed, but for some elements it does not matter how long they are or what they are exactly.
An example: Suppose I want to find the following pattern in a character vector:
1. The sequence should begin with "Out of" or "out of".
2. The sequence should end with "reasons".
3. In between, there can be other elements. It does not matter how many elements there are (zero would also be OK) or what exactly they are.
4. In between 1. and 2., there shouldn't be a ".", "!" or "?".
5. There should be a parameter that controls how long the sequence in 3. can maximally be to still produce a result.
6. The return value of the function should be the intervening elements and/or their indices in the vector.
So, the function should "behave" like this:
c("Out", "of", "specific", "reasons", ".") Return "specific"
c("Out", "of", "very", "specific", "reasons", ".") Return c("very", "specific")
c("out", "of", "curiosity", ".", "He", "had", "his", "reasons") Return "" or NA or NULL, which one doesn't matter - just a signal that there is no result.
As I said: I don't need a full solution. Any pointers to packages that already implement such functionality are appreciated!
Optimally, I don't want to rely on a solution that first pastes the text and then uses regex for matching.
Thanks a lot!
I would be really curious to learn of a package that serves your needs. My inclination would be to collapse the strings and use regular expressions or find a programmer or use perl. But here's one extensible solution in R with a few more cases to experiment on. Not very elegant, but see if this has some utility.
# Recreate data as a list with a few more edge cases
txt1 <- c(
"Out of specific reasons.",
"Out of very specific reasons.",
"Out of curiosity. He had his reasons.",
"Out of reasons.",
"Out of one's mind.",
"For no particular reason.",
"Reasons are out of the ordinary.",
"Out of time and money and for many good reasons, it seems.",
"Out of a box, a car, and for random reasons.",
"Floop foo bar.")
txt2 <- strsplit(txt1, "[[:space:]]+") # split each line on whitespace
txt3 <- lapply(txt2, strsplit, "(?=[[:punct:]])", perl = TRUE) # split punctuation off each word
txt <- lapply(txt3, unlist) # create list of tokens from each line
# Define characters to exclude: [. ! and ?] but not [,]
exclude <- "[.!?]"
# Assign acceptable limit to separation
lim <- 5 # try 7 and 12 to experiment
# Create indices identifying each of the enumerated conditions
fun1 <- function(x, pat) grep(pat, x, ignore.case = TRUE)
index1 <- lapply(txt, fun1, "out")
index2 <- lapply(txt, fun1, "of")
index3 <- lapply(txt, fun1, "reasons")
index4 <- lapply(txt, fun1, exclude)
# Create logical vectors from indices satisfying the conditions
fun2 <- function(set, val) val[1] %in% set
cond1 <- sapply(index1, fun2, val = 1) & sapply(index2, fun2, val = 2)
cond2 <- sapply(index3, "[", 1) < lim + 2 + 2 # 'reasons' at most lim tokens after 'of' (position 2)
cond3 <- sapply(index3, max, -Inf) < sapply(index4, min, Inf)
# Combine logical vectors to a single logical vector
valid <- cond1 & cond2 & cond3
valid <- ifelse(is.na(valid), FALSE, valid)
# Examine selected original lines
print(txt1[valid])
# Helper function to build the indices strictly between "of" and "reasons"
fun3 <- function(index2, index3, valid) {
  found <- rep(list(NULL), length(index2))
  found[valid] <- Map(seq, index2[valid], index3[valid])
  found <- lapply(found, tail, -1) # drop the index of "of"
  lapply(found, head, -1) # drop the index of "reasons"
}
# Extract the intervening indices from valid list members
idx <- fun3(index2, index3, valid)
# Return the results or "" for no intervening text or NULL for no match
ans <- Map(function(x, i) {
if (is.null(i)) NULL # no match found
else if (length(i) == 0) "" # no intervening elements
else x[i]}, # all intervening elements <= lim
txt, idx)
# Show found (non-NULL) values
ans[!sapply(ans, is.null)]
So let's assume your example
x <- c("Out", "of", "very", "specific", "reasons", ".")
We first need to get the beginning of the indicator
i_Beginning <- as.numeric(grep("Out|out", x))
and the ending
i_end <- as.numeric(grep("reasons", x))
Need to also check that Out is followed by of
Is_Of <- grepl("Of|of", x[i_Beginning +1])
And if this is true we extract the other elements
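extraction <- NULL # stays NULL if "Out" is not followed by "of"
if(Is_Of){
  # take every element between "of" and "reasons", not only the two endpoints
  extraction <- x[i_Beginning + 1 + seq_len(i_end - i_Beginning - 2)]
}
print(extraction)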
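Putting the requirements from the question together, a minimal sketch that works directly on the token vector (no pasting, no regex) might look like this. The function name find_between and all of its arguments are made up for illustration. It assumes that the vector starts with the start tokens and that the first "reasons" after them is the one of interest.

```r
# Hypothetical helper, not from any package: returns the tokens (and their
# indices) between the start tokens and the end token, or NULL if no match.
find_between <- function(x, start = c("out", "of"), end = "reasons",
                         stop_tokens = c(".", "!", "?"), max_len = 5) {
  n_start <- length(start)
  # The vector must begin with the start tokens (compared case-insensitively)
  if (length(x) <= n_start ||
      !identical(tolower(x[seq_len(n_start)]), tolower(start))) {
    return(NULL)
  }
  # Position of the first end token after the start tokens
  end_pos <- which(x == end)
  end_pos <- end_pos[end_pos > n_start][1]
  if (is.na(end_pos)) return(NULL)
  # Indices strictly between the start tokens and the end token
  idx <- n_start + seq_len(end_pos - n_start - 1)
  between <- x[idx]
  # Reject sequences that are too long or that cross a sentence boundary
  if (length(between) > max_len || any(between %in% stop_tokens)) return(NULL)
  list(tokens = between, indices = idx)
}

find_between(c("Out", "of", "very", "specific", "reasons", "."))
find_between(c("out", "of", "curiosity", ".", "He", "had", "his", "reasons"))
```

A match with zero intervening elements returns empty tokens and indices, which keeps it distinguishable from the NULL returned when there is no match.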
I want to compare two character values in R and see which characters were added and deleted, so that I can display the result later in a way similar to git diff --color-words=. (see screenshot below)
For example:
a <- "hello world"
b <- "helo world!"
diff <- FUN(a, b)
where diff would somehow show that an l was dropped and a ! was added.
The ultimate goal is to construct an html string like this hel<span class="deleted">l</span>o world<span class="added">!</span>.
I am aware of diffobj but so far I cannot get it to return the character differences, only the differences between elements.
[Screenshot: output of git diff --color-words=.]
Base R has a function adist that computes the generalized Levenshtein distance. With counts = TRUE and partial = FALSE (the default), the attribute "trafos" is set to the sequence of matches, insertions, deletions and substitutions needed to go from one string to the other. From the documentation, section Value:
If counts is TRUE, the transformation counts are returned as the "counts" attribute of this matrix, as a 3-dimensional array with dimensions corresponding to the elements of x, the elements of y, and the type of transformation (insertions, deletions and substitutions), respectively. Additionally, if partial = FALSE, the transformation sequences are returned as the "trafos" attribute of the return value, as character strings with elements ‘M’, ‘I’, ‘D’ and ‘S’ indicating a match, insertion, deletion and substitution, respectively. If partial = TRUE, the offsets (positions of the first and last element) of the matched substrings are returned as the "offsets" attribute of the return value (with both offsets -1 in case of no match).
a <- "hello world"
b <- "helo world!"
attr(adist(a, b, counts = TRUE), "trafos")
#> [,1]
#> [1,] "MMDMMMMMMMMI"
Created on 2022-05-31 by the reprex package (v2.0.1)
There is a deletion in the 3rd character and an insertion at the end of string a.
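To get from the "trafos" string to the HTML the question asks for, one can walk through the operations while keeping one cursor per input string. The following is only a sketch: html_diff_adist is a made-up name, the class names are taken from the question, and special HTML characters are not escaped.

```r
# Hypothetical helper: turn the adist() transformation sequence into HTML.
html_diff_adist <- function(a, b) {
  ops <- strsplit(attr(adist(a, b, counts = TRUE), "trafos")[1, 1], "")[[1]]
  aa <- strsplit(a, "")[[1]]
  bb <- strsplit(b, "")[[1]]
  i <- 0 # cursor into the characters of a
  j <- 0 # cursor into the characters of b
  out <- character(length(ops))
  for (k in seq_along(ops)) {
    op <- ops[k]
    if (op == "M") {          # match: consume one character from each string
      i <- i + 1; j <- j + 1
      out[k] <- aa[i]
    } else if (op == "D") {   # deletion: character only present in a
      i <- i + 1
      out[k] <- sprintf("<span class=\"deleted\">%s</span>", aa[i])
    } else if (op == "I") {   # insertion: character only present in b
      j <- j + 1
      out[k] <- sprintf("<span class=\"added\">%s</span>", bb[j])
    } else {                  # substitution: show as deletion plus insertion
      i <- i + 1; j <- j + 1
      out[k] <- sprintf(
        "<span class=\"deleted\">%s</span><span class=\"added\">%s</span>",
        aa[i], bb[j])
    }
  }
  paste(out, collapse = "")
}

html_diff_adist("hello world", "helo world!")
```

Adjacent operations of the same type are not merged into a single span here; that could be added with rle() on ops if shorter markup matters.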
Found a solution using diffobj::ses_dat() and splitting the data into its characters before.
get_html_diff <- function(a, b) {
aa <- strsplit(a, "")[[1]]
bb <- strsplit(b, "")[[1]]
s <- diffobj::ses_dat(aa, bb)
m <- cumsum(as.integer(s$op) != c(Inf, s$op[1:(length(s$op) - 1)]))
res <- paste(
sapply(split(seq_along(s$op), m), function(i) {
val <- paste(s$val[i], collapse = "")
if (s$op[i[[1]]] == "Insert")
val <- paste0("<span class=\"add\">", val, "</span>")
if (s$op[i[[1]]] == "Delete")
val <- paste0("<span class=\"del\">", val, "</span>")
val
}),
collapse = "")
res
}
get_html_diff("hello world", "helo World!")
#> [1] "hel<span class=\"del\">l</span>o <span class=\"del\">w</span><span class=\"add\">W</span>orld<span class=\"add\">!</span>"
Created on 2022-05-31 by the reprex package (v2.0.1)
We use diffobj to compare configuration files (in more or less production environment), and it works just right. In your case, wouldn't diffobj::diffChr be what you want?
diffobj::diffChr("hello world", "helo world!", color.mode = 'rgb')
Given the following string of parentheses, I am trying to remove one specific pair of parentheses,
where the position of one of its brackets is marked with 1.
((((((((((((((((((********))))))))))))))))))
00000000000000000000000000000000010000000000
So for the above example, the solution I am looking for is
((((((((((-(((((((********)))))))-))))))))))
00000000000000000000000000000000010000000000
I tried using strsplit to split the strings and get the index of the bracket marked with 1, but I am not sure how to get the index of its matching bracket.
Could anyone give some input on this..
What I did..
a = "((((((((((((((((((********))))))))))))))))))"
b = "00000000000000000000000000000000010000000000"
which(unlist(strsplit(b,"")) == 1)
#[1] 34
a_mod = unlist(strsplit(a,""))[-34]
Here, I removed the one bracket I wanted to remove, but I do not know how to remove its corresponding opening bracket, which is at the 11th position in this example.
Locate the 1 in b, giving pos2, and compute the length of a (the same as that of b), giving n. Then replace positions pos2 and pos1 = n - pos2 + 1 with minus characters. This relies on the brackets being symmetric about the middle of the string. See ?regexpr, ?nchar and ?substr for more info. No packages are used.
pos2 <- regexpr(1, b)
n <- nchar(a)
pos1 <- n - pos2 + 1
substr(a, pos1, pos1) <- substr(a, pos2, pos2) <- "-"
a
## [1] "((((((((((-(((((((********)))))))-))))))))))"
Since the parentheses are paired and symmetric about the middle of the string, the index of the matching parenthesis is the length of the string minus the index of the marked one, plus 1 (they are equidistant from the ends of the string).
library(stringr)
string <- "((((((((((((((((((********))))))))))))))))))"
b <- "00000000000000000000000000000000010000000000"
location <- str_locate(b, "1")[1]
len <- str_length(string)
substr(string, location, location) <- "-"
substr(string, len - location + 1, len - location + 1) <- "-"
string
"((((((((((-(((((((********)))))))-))))))))))"
You should show what you have tried. One very simple way that would work for your example would be to do something like:
gsub("\\*\\){8}", "*)))))))-", "((((((((((((((((((********))))))))))))))))))")
#> [1] "((((((((((((((((((********)))))))-))))))))))"
Edit:
In response to your question: It depends what you mean by other similar examples.
If you go purely by position in the string, you already have an excellent answer from G. Grothendieck. If you want a solution where you want to replace the nth closing bracket, for example, you could do:
s <- "((((((((((((((((((********))))))))))))))))))"
replace_par <- function(n, string) {
  # the lookbehind (?<!\\)) anchors the match at the first closing bracket
  sub(paste0("(?<!\\))(\\)){", n, "}"),
      paste0(paste(rep(")", (n-1)), collapse=""), "-"),
      string, perl = TRUE)}
replace_par(8, s)
#> [1] "((((((((((((((((((********)))))))-))))))))))"
Created on 2020-05-21 by the reprex package (v0.3.0)
You could write a function that does the replacement the way you want:
strreplace <- function(x,y,val = "-")
{
regmatches(x,regexpr(1,y)) <- val
sub(".([(](?:[^()]|(?1))*+[)])(?=-)", paste0(val, "\\1"), x, perl = TRUE)
}
a <- "((((((((((((((((((********))))))))))))))))))"
b <- "00000000000000000000000000000000010000000000"
strreplace(a, b)
[1] "((((((((((-(((((((********)))))))-))))))))))"
# Nested parentheses
a = "((((****))))((((((((((((((((((********))))))))))))))))))"
b = "00000000000000000000000000000000000000000000010000000000"
strreplace(a,b)
[1] "((((****))))((((((((((-(((((((********)))))))-))))))))))"
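If the string is not symmetric and a recursive regex feels too opaque, the matching bracket can also be found by scanning the string once with a stack of open positions. This is a sketch under the assumption that the parentheses are balanced; match_paren is a made-up name.

```r
# Hypothetical helper: index of the bracket that pairs with the one at `pos`.
match_paren <- function(s, pos) {
  ch <- strsplit(s, "")[[1]]
  stack <- integer(0) # positions of opening brackets not yet closed
  for (i in seq_along(ch)) {
    if (ch[i] == "(") {
      stack <- c(stack, i)
    } else if (ch[i] == ")") {
      if (length(stack) == 0) stop("unbalanced parentheses")
      open <- stack[length(stack)]   # most recent unclosed opening bracket
      stack <- stack[-length(stack)]
      if (open == pos) return(i)     # `pos` was an opening bracket
      if (i == pos) return(open)     # `pos` was a closing bracket
    }
  }
  NA_integer_ # `pos` does not point at a paired bracket
}

a <- paste0(strrep("(", 18), strrep("*", 8), strrep(")", 18))
b <- "00000000000000000000000000000000010000000000"
pos2 <- regexpr("1", b, fixed = TRUE)[1]
pos1 <- match_paren(a, pos2)
substr(a, pos1, pos1) <- "-"
substr(a, pos2, pos2) <- "-"
a
```

Because the pairing is computed from the nesting itself, the same call works when another balanced group precedes the one of interest, as in the nested example above.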
I am trying to find and replace some text based on fuzzy matching as follows.
Aim
I want to do this for a list of find-and-replace pairs. I don't know how to extend the current function to allow this.
Input
Input text
df <- data.frame(textcol=c("In this substring would like to find the radiofrequency ablation of this HALO",
"I like to do endoscopic submuocsal resection and also radifrequency ablation",
"No match here","No mention of this radifreq7uency ablati0on thing"))
The attempt
##### Lower case the text ##########
df$textcol<-tolower(df$textcol)
#Need to define the pattern to match and what to replace it with
matchPattern <- "radiofrequency ablation"
findAndReplace<-function(matchPattern,rawText,replace)
{
positions <- aregexec(matchPattern, rawText, max.distance = 0.1)
res <- regmatches(rawText, positions) # use the function argument, not the global df
res[lengths(res)==0] <- "XXXX" # deal with 0 length matches somehow
#################### Term mapping ####################
out <- Vectorize(gsub)(unlist(res), replace, rawText)
out
}
matchPatternRFA <- c("radiofrequency ablation")
repRF<-findAndReplace(matchPatternRFA,df$textcol,"RFA")
repRF
The problem
The above works fine for the replacement of one term, but what if I want to also replace endoscopic 'submucosal resection' with 'EMR' and 'HALO' with 'catheter'?
Ideally I'd like to create a list of terms to match but then how do I also specify how to replace them?
Define asub to replace approximate matches with a replacement string and define a matching list L that for each name defines its replacement. Then run Reduce to perform the replacements.
asub <- function(pattern, replacement, x, fixed = FALSE, ...) {
m <- aregexec(pattern, x, fixed = fixed, ...) # pass e.g. max.distance through
r <- regmatches(x, m)
lens <- lengths(r)
if (all(lens == 0)) return(x) else
replace(x, lens > 0, mapply(sub, r[lens > 0], replacement, x[lens > 0]))
}
L <- list("radiofrequency ablation" = "RFA",
"endoscopic submucosal resection" = "EMR",
"HALO" = "catheter")
Reduce(function(x, nm) asub(nm, L[[nm]], x), init = df$textcol, names(L))
giving:
[1] "In this substring would like to find the RFA of this catheter"
[2] "I like to do EMR and also RFA"
[3] "No match here"
[4] "No mention of this RFA thing"
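Before replacing anything, it can help to check which terms are fuzzily found in which rows. The sketch below uses agrepl with the same max.distance of 0.1 as in the question; the variable names texts, terms and hits are made up for illustration.

```r
texts <- c(
  "In this substring would like to find the radiofrequency ablation of this HALO",
  "I like to do endoscopic submuocsal resection and also radifrequency ablation",
  "No match here",
  "No mention of this radifreq7uency ablati0on thing")
terms <- c("radiofrequency ablation", "endoscopic submucosal resection", "HALO")

# One column per term, one row per text; TRUE means an approximate match exists
hits <- sapply(terms, function(p) agrepl(p, texts, max.distance = 0.1))
rownames(hits) <- paste0("row", seq_along(texts))
hits
```

Short terms such as "HALO" tolerate only a single edit at this distance, so a term-specific max.distance may be worth considering.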
You can create a lookup table with patterns and the corresponding replacements. Note that gsub matches the patterns exactly, so misspelled variants are left unchanged:
library(data.table)

dt <-
data.table(
textcol = c(
"In this substring would like to find the radiofrequency ablation of this HALO",
"I like to do endoscopic submuocsal resection and also radifrequency ablation",
"No match here",
"No mention of this radifreq7uency ablati0on thing"
)
)
dt_gsub <- data.table(
textcol = c("submucosal resection",
"HALO",
"radiofrequency ablation"),
textcol2 = c("EMR", "catheter", "RFA")
)
# gsub is vectorized over the column, so only the lookup table needs a loop
for (k in seq_len(nrow(dt_gsub)))
  dt[, textcol := gsub(dt_gsub$textcol[k], dt_gsub$textcol2[k], textcol)]
Currently the script below splits a combined item code into specific item codes.
rule2 <- c("MR")
df_1 <- test[grep(paste("^",rule2,sep="",collapse = "|"),test$Name.y),]
SpaceName_1 <- function(s){
num <- str_extract(s,"[0-9]+")
if(nchar(num) >3){
former <- substring(s, 1, 4)
latter <- strsplit(substring(s,5,nchar(s)),"")
latter <- unlist(latter)
return(paste(former,latter,sep = "",collapse = ","))
}
else{
return (s)
}
}
df_1$Name.y <- sapply(df_1$Name.y, SpaceName_1)
Example,
Combined item code: Room 324-326 is splitting into MR324 MR325 MR326.
However for this particular Combined item code: Room 309-311 is splitting into MR309 MR300 MR301.
How should I amend the script to give me MR309 MR310 MR311?
You can try something along these lines:
range <- "324-326"
x <- as.numeric(unlist(strsplit(range, split="-")))
paste0("MR", seq(x[1], x[2]))
[1] "MR324" "MR325" "MR326"
I assume that you can obtain the numerical room sequence by some means, and then use the snippet I gave you above.
If your combined item codes always have the form Room xxx-yyy, then you can extract the range using gsub:
range <- gsub("Room ", "", "Room 324-326")
If your item codes were in a vector called codes, then you could obtain a vector of ranges using:
ranges <- gsub("Room ", "", codes) # gsub is already vectorized
We can also evaluate the string after replacing the - with : and then paste the prefix "MR".
paste0("MR", eval(parse(text=sub("\\S+\\s+(\\d+)-(\\d+)", "\\1:\\2", range))))
#[1] "MR324" "MR325" "MR326"
Wrap it as a function for convenience
fChange <- function(prefixStr, RangeStr){
paste0(prefixStr, eval(parse(text=sub("\\S+\\s+(\\d+)-(\\d+)",
"\\1:\\2", RangeStr))))
}
fChange("MR", range)
fChange("MR", range1)
#[1] "MR309" "MR310" "MR311"
For multiple elements, just loop over and apply the function
sapply(c(range, range1), fChange, prefixStr = "MR")
data
range <- "Room 324-326"
range1 <- "Room 309-311"
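If evaluating parsed text feels too risky, the same expansion can be sketched by extracting the two numbers and calling seq directly. The function name expand_rooms is made up for illustration, and it assumes that each code contains exactly two numbers, the start and the end of the range.

```r
# Hypothetical helper: expand "Room 309-311" into "MR309" "MR310" "MR311".
expand_rooms <- function(code, prefix = "MR") {
  # Pull out the two numbers, e.g. "Room 309-311" -> 309 and 311
  nums <- as.integer(regmatches(code, gregexpr("[0-9]+", code))[[1]])
  paste0(prefix, seq(nums[1], nums[2]))
}

codes <- c("Room 324-326", "Room 309-311")
lapply(codes, expand_rooms)
```

Because the numbers are converted to integers before the sequence is built, 309-311 rolls over to 310 and 311 correctly.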
I have this vector (it's big in size) myvec. I need to split them matching at / and create another result vector resvector. How can I get this done in R?
myvec<-c("IID:WE:G12D/V/A","GH:SQ:p.R172W/G", "HH:WG:p.S122F/H")
resvector
IID:WE:G12D, IID:WE:G12V,IID:WE:G12A,GH:SQ:p.R172W,GH:SQ:p.R172G,HH:WG:p.S122F,HH:WG:p.S122H
You can try this, using strsplit as mentioned by @Tensibai:
sp_vec <- strsplit(myvec, "/") # split the element of the vector by "/" : you will get a list where each element is the decomposition (vector) of one element of your vector, according to "/"
ts_vec <- lapply(sp_vec, # for each element of the previous list, do
function(x){
base <- sub("\\w$", "", x[1]) # get the common beginning of the column names (so first item of vector without the last letter)
x[-1] <- paste0(base, x[-1]) # paste this common beginning to the rest of the vector items (so the other letters)
x}) # return the vector
resvector <- unlist(ts_vec) # finally, unlist to get the needed vector
resvector
# [1] "IID:WE:G12D" "IID:WE:G12V" "IID:WE:G12A" "GH:SQ:p.R172W" "GH:SQ:p.R172G" "HH:WG:p.S122F" "HH:WG:p.S122H"
Here is a concise answer with regex and some functional programming:
x = gsub('[A-Z]/.+','',myvec)
y = strsplit(gsub('[^/]+(?=[A-Z]/.+)','',myvec, perl=TRUE),'/')
unlist(Map(paste0, x, y), use.names = FALSE)
# "IID:WE:G12D" "IID:WE:G12V" "IID:WE:G12A" "GH:SQ:p.R172W" "GH:SQ:p.R172G" "HH:WG:p.S122F" "HH:WG:p.S122H"
myvec<-c("IID:WE:G12D/V/A","GH:SQ:p.R172W/G", "HH:WG:p.S122F/H")
custmSplit <- function(str){
splitbysep <- strsplit(str, '/')[[1]]
# drop the last character of the first piece before pasting on the alternatives
splitbysep[-1] <- paste0(substr(splitbysep[1], 1, nchar(splitbysep[1]) - 1), splitbysep[-1])
return(splitbysep)
}
do.call('c', lapply(myvec, custmSplit))
# [1] "IID:WE:G12D" "IID:WE:G12V" "IID:WE:G12A" "GH:SQ:p.R172W" "GH:SQ:p.R172G" "HH:WG:p.S122F" "HH:WG:p.S122H"