Extract substring match from agrep - r

My goal is to identify whether a given text contains a target string, but I want to allow for typos / small deviations and extract the substring that "caused" the match (to use it for further text analysis).
Example:
target <- "target string"
text <- "the target strlng: Butter. this text i dont want to extract."
Desired output:
I would like to have target strlng as the output, since it is very close to the target (Levenshtein distance of 1). Next, I want to use target strlng to extract the word Butter (this part I have covered; I only mention it to give a detailed spec).
What I tried:
Using adist did not work, since it compares two whole strings, not substrings.
Next I took a look at agrep, which seems very close. It tells me that my target was found, but not the substring that "caused" the match.
I tried value = TRUE, but it works at the level of vector elements: it returns the whole element that matched, not the matching part. I don't think I can switch to a vector of words, because I cannot split by spaces (my target string might contain spaces, ...).
agrep(
  pattern = target,
  x = text,
  value = TRUE
)

Use aregexec; together with regmatches it works the way regexpr/regmatches (or gregexpr) does for extracting exact matches, except that the match is approximate.
m <- aregexec('string', 'text strlng wrong')
regmatches('text strlng wrong', m)
#[[1]]
#[1] "strlng"
This can be wrapped in a function that accepts the arguments of both aregexec and regmatches. Note that the wrapper's invert argument (passed on to regmatches) comes after the dots argument ..., so it must be passed by name.
aregextract <- function(pattern, text, ..., invert = FALSE){
  m <- aregexec(pattern, text, ...)
  regmatches(text, m, invert = invert)
}
aregextract(target, text)
#[[1]]
#[1] "target strlng"
aregextract(target, text, invert = TRUE)
#[[1]]
#[1] "the "
#[2] ": Butter. this text i dont want to extract."
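Since the question also mentions pulling out the word that follows the match, here is a minimal sketch of that step built on the same calls. It assumes the text contains one approximate match and that the wanted word is the first run of word characters after it; the variable names pieces, rest and following are made up for this illustration.

```r
target <- "target string"
text <- "the target strlng: Butter. this text i dont want to extract."

# locate the approximate match (default max.distance of aregexec)
m <- aregexec(target, text)

# invert = TRUE returns the pieces before and after the match
pieces <- regmatches(text, m, invert = TRUE)[[1]]
rest <- pieces[2]

# keep the first run of word characters that follows the match
following <- sub("^\\W*(\\w+).*$", "\\1", rest)
following
#[1] "Butter"
```

If the text may contain no match at all, pieces has only one element, so a check on length(pieces) would be needed before using rest.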

Related

R: replace underscore repeated non-consecutively more than twice

I received a dataset with phrases connected by underscores like so:
text <- "hi, how_are_you? that's_great. yes_i'm_als0_#k"
As in this example, the data contain numbers, symbols, punctuation, and spaces. I want to replace the underscores with single spaces whenever they appear 3 or more times within one whitespace-delimited phrase (as in yes_i'm_als0_#k). The desired output is:
"hi, how_are_you? that's_great. yes i'm als0 #k"
Another way to put it, I received a dataset with hard-coded ngrams and I want to keep unigrams, bigrams and trigrams.
gsubfn is like gsub, but instead of replacing each match of the regular expression in the first argument with a fixed string, it passes the match to the function given in the second argument and replaces the match with that function's output. The function can be written in formula notation: the body is on the right-hand side, and the arguments (here just s) are determined from the free variables in the body.
library(gsubfn)
gsubfn("\\S+",
       ~ if (length(unlist(gregexpr("_", s))) >= 3) gsub("_", " ", s) else s,
       text)
giving:
[1] "hi, how_are_you? that's_great. yes i'm als0 #k"
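For comparison, the same per-token logic can be sketched in base R without gsubfn, assuming (as above) that phrases are separated by whitespace. The names m, tokens and n_us are illustrative only.

```r
text <- "hi, how_are_you? that's_great. yes_i'm_als0_#k"

# positions of all whitespace-delimited tokens
m <- gregexpr("\\S+", text)
tokens <- regmatches(text, m)[[1]]

# number of underscores in each token
n_us <- lengths(regmatches(tokens, gregexpr("_", tokens, fixed = TRUE)))

# replace underscores only in tokens that have 3 or more of them
tokens[n_us >= 3] <- gsub("_", " ", tokens[n_us >= 3], fixed = TRUE)

# write the (possibly changed) tokens back into the original string
regmatches(text, m) <- list(tokens)
text
## [1] "hi, how_are_you? that's_great. yes i'm als0 #k"
```

Writing back through regmatches<- keeps the original whitespace between tokens untouched, which a strsplit/paste round trip would not guarantee.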

Match & Replace String, utilising the original string in the replacement, in R

I am trying to get to grips with the world of regular expressions in R.
I was wondering whether there was any simple way of combining the functionality of "grep" and "gsub"?
Specifically, I want to append some additional information to anything which matches a specific pattern.
For a generic example, let's say I have a character vector:
char_vec <- c("A","A B","123?")
Then let's say I want to append the following to every letter within any element of char_vec:
append <- "_APPEND"
Such that the result would be:
[1] "A_APPEND" "A_APPEND B_APPEND" "123?"
Clearly gsub can replace the letters with append, but this does not keep the original letters (while grep would find the letters but not append anything!).
Thanks in advance for any / all help!
It seems you are not familiar with backreferences that you may use in the replacement patterns in (g)sub. Once you wrap a part of the pattern with a capturing group, you can later put this value back into the result of the replacement.
So, a mere gsub solution is possible:
char_vec <- c("A","A B","123?")
append <- "_APPEND"
gsub("([[:alpha:]])", paste0("\\1", append), char_vec)
## => [1] "A_APPEND" "A_APPEND B_APPEND" "123?"
Here, ([[:alpha:]]) matches any letter and captures it into Group 1, and \1 in the replacement (written "\\1" in an R string literal) reinserts this value into the result.
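As a further illustration of backreferences, here is a small sketch with other patterns; the patterns and the variable names bracketed and swapped are chosen only for this example.

```r
char_vec <- c("A", "A B", "123?")

# wrap each run of digits in brackets; \\1 is the text captured by group 1
bracketed <- gsub("([0-9]+)", "[\\1]", char_vec)
bracketed
## [1] "A"      "A B"    "[123]?"

# two groups can be reinserted in a different order
swapped <- sub("(\\w+) (\\w+)", "\\2 \\1", "hello world")
swapped
## [1] "world hello"
```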
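Definitely not as slick as @Wiktor Stribiżew's answer, but here is another method I developed.
char_vars <- c('a', 'b', 'a b', '123')
# which elements contain a letter, and where the letters are
grep('[A-Za-z]', char_vars)
gregexpr('[A-Za-z]', char_vars)
matches <- regmatches(char_vars, gregexpr('[A-Za-z]', char_vars))
# caveat: sub() replaces the first occurrence in the whole element, so this
# only works while no letter repeats or also occurs in the appended text
for (i in seq_along(matches)) {
  for (found in matches[[i]]) {
    char_vars[i] <- sub(pattern = found,
                        replacement = paste(found, "_append", sep = ""),
                        x = char_vars[i])
  }
}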

How do i insert a certain string in another string at a particular location in r?

I am new to R. It may be a very simple thing to do, but I am not able to figure it out.
Say, I have a string as follows:
This is an example string.
Now I want to make it as follows:
This is an (example/sample) string.
I know the location at which the change is to be made (the 12th character in the given string).
I have a lot of strings on which I need to perform a similar operation.
I'm not sure I understand the problem, but if I do, you could use gsub here:
x <- "This is an example string."
gsub("example", "(example/sample)", x)
## [1] "This is an (example/sample) string."
Here's one solution with regular expressions:
# the string
s <- "This is an example string."
# the position of the target's first character
pos <- 12
# create a regular expression
reg <- paste0("^(.{", pos - 1, "})(.+?\\b)(.*)")
# [1] "^(.{11})(.+?\\b)(.*)"
# modify string
sub(reg, "\\1\\(\\2/sample\\)\\3", s)
# [1] "This is an (example/sample) string."
Here's another regex flavoured solution using a lookbehind:
s <- "This is an example string."
pos <- 12
replacement <- '(example/sample)'
sub(sprintf('(?<=^.{%s})\\S*\\b', pos-1), replacement, s, perl=TRUE)
## [1] "This is an (example/sample) string."
Lookbehind (?<=x) is useful because the regex within it is part of the pattern but doesn't become part of the match (so we don't have to capture it and put it back later). The pattern above says: "the beginning of the string, followed by 11 characters, preceding zero or more non-whitespace characters, followed by a word boundary". Only the non-whitespace characters are replaced by replacement.
Update
An alternative is to use strsplit to create a vector of words, and then identify the position in the vector of the character of interest (e.g. the 12th character), subsequently replacing that element with your new word. This is a bit slower than the regex approach, but makes it straightforward to request multiple replacements (at multiple character positions). For example:
f <- function(string, pos, new) {
  s <- strsplit(string, '\\s')[[1]]
  i <- findInterval(pos, c(gregexpr('(?<=\\b)\\w', string, perl=TRUE)[[1]],
                           nchar(string)))
  s[i] <- mapply(sub, s[i], patt='\\b[[:alnum:]-]+\\b', repl=new, perl=TRUE)
  paste0(s, collapse=' ')
}
f('This is an example string.', c(12, 20), c('excellent', 'function'))
## [1] "This is an excellent function."
Note that hyphenated words are replaced in full (i.e. not just the part up to the hyphen), and all other punctuation (outside the boundaries of hyphenated words) is retained.
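For the case of many strings, each with its own position, a plain substr-based sketch may be easier to follow than a generated regex. The helper wrap_word_at below is hypothetical (it is not from any of the answers above) and assumes that pos points at the first character of an alphanumeric word.

```r
# hypothetical helper: wrap the word starting at character `pos` of `x`
# as "(word/alt)"; returns `x` unchanged if no word starts there
wrap_word_at <- function(x, pos, alt) {
  rest <- substring(x, pos)
  word <- regmatches(rest, regexpr("^[[:alnum:]]+", rest))
  if (length(word) == 0) return(x)
  paste0(substr(x, 1, pos - 1), "(", word, "/", alt, ")",
         substring(x, pos + nchar(word)))
}

strings <- c("This is an example string.", "A short test.")
positions <- c(12, 3)

# one call per string/position pair
out <- mapply(wrap_word_at, strings, positions,
              MoreArgs = list(alt = "sample"), USE.NAMES = FALSE)
out
## [1] "This is an (example/sample) string." "A (short/sample) test."
```

mapply is used because the helper handles one string at a time; regmatches drops non-matching elements, so the helper body is not safely vectorized as written.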

unexpected behavior in pmatch while matching '+' in R

I am trying to match the '+' symbol inside my string using the pmatch function.
Target = "18+"
pmatch("+",Target)
[1] NA
I observe similar behavior if I use match or grepl also.
If I try and use gsub, I get the following output.
gsub("+","~",Target)
[1] "~1~8~+~"
Can someone please explain to me the reason for this behavior and suggest a viable solution to my problem?
It's a forward-looking match: pmatch tries to match "+" against the beginning of each element of table (the second argument of pmatch). This fails ("+" != "1"), so NA is returned. You must also be careful with the return value. The quote and examples below come from the help for the closely related charmatch, which explains the exact/partial distinction succinctly; note that pmatch itself returns nomatch (NA by default) rather than 0 when a partial match is ambiguous.
Exact matches are preferred to partial matches (those where the value to be matched has an exact match to the initial part of the target, but the target is longer).
If there is a single exact match or no exact match and a unique partial match then the index of the matching value is returned; if multiple exact or multiple partial matches are found then 0 is returned and if no match is found then nomatch is returned.
### Examples based on ?charmatch ###
# Multiple partial matches found - returns 0
charmatch("m", c("mean", "median", "mode")) # returns 0
# No exact match, one unique partial match - returns index of match in table
charmatch("med", c("mean", "median", "mode")) # returns 2
# One exact match found and preferred over partial match - index of exact match returned
charmatch("med", c("med", "median", "mode")) # returns 1
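A short sketch contrasting pmatch with the closely related charmatch on the same inputs may help here; the variable names tab and res_* are illustrative only.

```r
tab <- c("mean", "median", "mode")

# ambiguous partial match: pmatch gives NA, charmatch gives 0
res_p_ambiguous <- pmatch("m", tab)
res_c_ambiguous <- charmatch("m", tab)

# unique partial match: both return the index of "median"
res_p_unique <- pmatch("med", tab)
res_c_unique <- charmatch("med", tab)

# both only look at the beginning of each table element
res_start <- pmatch("18", "18+")   # 1
res_middle <- pmatch("+", "18+")   # NA
```

So neither function is suited to finding a character in the middle of a string; both are meant for matching abbreviations against a table of full names.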
To get a vector of matches to "+" in your string I'd use grepl...
Target <- c( "+" , "+18" , "18+" , "23+26" , "1234" )
grepl( "\\+" , Target )
# [1] TRUE TRUE TRUE TRUE FALSE
Try this:
gsub("+", "~", Target, fixed = TRUE)
From ?gsub:
fixed - logical. If TRUE, pattern is a string to be matched as is. Overrides all conflicting arguments.
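The fixed = TRUE argument is also accepted by grepl and regexpr, so the literal plus sign can be detected and located without escaping. A small sketch, with illustrative variable names has_plus, plus_pos and replaced:

```r
Target <- c("+", "+18", "18+", "23+26", "1234")

# does the element contain a literal "+"?
has_plus <- grepl("+", Target, fixed = TRUE)
has_plus
## [1]  TRUE  TRUE  TRUE  TRUE FALSE

# position of the first "+" in each element (-1 means not found)
plus_pos <- as.integer(regexpr("+", Target, fixed = TRUE))
plus_pos
## [1]  1  1  3  3 -1

# replace every literal "+"
replaced <- gsub("+", "~", Target, fixed = TRUE)
replaced
## [1] "~"     "~18"   "18~"   "23~26" "1234"
```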
The function pmatch() tries to match the beginning of each element, not the middle. So the issue has nothing to do with the plus symbol, +. For example, of the calls below, the first two give NA, and the next three give 1 (indicating a match with the beginning of the first element).
Target <- "18+"
pmatch("8", Target)
pmatch("+", Target)
pmatch("1", Target)
pmatch("18", Target)
pmatch("18+", Target)
The function gsub() can be used to match and replace portions of elements using regular expressions. The plus sign has special meaning in regular expressions, so you need to use escape characters to indicate that you are interested in the plus sign as a single character. For example, the following three lines of code give "1~+", "18~", and "~" as the results, respectively.
gsub("8", "~", Target)
gsub("\\+", "~", Target)
gsub("18\\+", "~", Target)
