unexpected behavior in pmatch while matching '+' in R - r

I am trying to match the '+' symbol inside my string using the pmatch function.
Target = "18+"
pmatch("+",Target)
[1] NA
I observe similar behavior if I use match or grepl also.
If I try and use gsub, I get the following output.
gsub("+","~",Target)
[1] "~1~8~+~"
Can someone please explain the reason for this behavior and suggest a viable solution for my problem?

It's a forward-looking match: pmatch tries to match "+" against the start of each element in table (the second argument of pmatch). This fails ("+" != "1"), so NA is returned. You must also be careful with the return value. I'm going to quote from the help for the closely related charmatch because it explains the rules succinctly and better than I ever could (pmatch follows the same preference for exact matches, but returns nomatch, NA by default, rather than 0 when the match is ambiguous)...
Exact matches are preferred to partial matches (those where the value to be matched has an exact match to the initial part of the target, but the target is longer).
If there is a single exact match or no exact match and a unique
partial match then the index of the matching value is returned; if
multiple exact or multiple partial matches are found then 0 is
returned and if no match is found then nomatch is returned.
###Examples from ?charmatch###
# Multiple partial matches found - returns 0
charmatch("m", c("mean", "median", "mode")) # returns 0
# One unique partial match found - return index of match in table
charmatch("med", c("mean", "median", "mode")) # returns 2
# One exact match found and preferred over partial match - index of exact match returned
charmatch("med", c("med", "median", "mode")) # returns 1
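The rules quoted above are the ones documented for charmatch. pmatch behaves the same way for unique matches, but reports an ambiguous match as nomatch (NA by default) instead of 0. A small sketch comparing the two; the variable names are only for illustration:

```r
tbl <- c("mean", "median", "mode")

ambiguous_c <- charmatch("m", tbl)   # several partial matches: charmatch gives 0
ambiguous_p <- pmatch("m", tbl)      # same input: pmatch gives NA
unique_p    <- pmatch("med", tbl)    # unique partial match: index 2
middle_p    <- pmatch("+", "18+")    # "+" is not at the start of "18+": NA

print(c(ambiguous_c, ambiguous_p, unique_p, middle_p))
```

So an NA from pmatch can mean either "no match" or "ambiguous match"; charmatch is the one to use if you need to tell those apart.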
To get a vector of matches to "+" in your string I'd use grepl...
Target <- c( "+" , "+18" , "18+" , "23+26" , "1234" )
grepl( "\\+" , Target )
# [1] TRUE TRUE TRUE TRUE FALSE

Try this:
gsub("+", "~", Target, fixed = TRUE)
?gsub
fixed - logical. If TRUE, pattern is a string to be matched as is.
Overrides all conflicting arguments.
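The same fixed = TRUE argument is accepted by grepl and regexpr, so it can also be used to test for the plus sign or find its position without any escaping. A minimal sketch, reusing a Target vector like the one in the grepl answer above (has_plus and plus_pos are illustrative names):

```r
Target <- c("+", "+18", "18+", "23+26", "1234")

# Does each element contain a literal "+" anywhere?
has_plus <- grepl("+", Target, fixed = TRUE)

# Position of the first "+" in each element (-1 means not found)
plus_pos <- as.integer(regexpr("+", Target, fixed = TRUE))

print(has_plus)
print(plus_pos)
```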

The function pmatch() attempts to match the beginning of each element, not the middle. So the issue there has nothing to do with the plus symbol, +. For example, the first two calls to pmatch() below give NA, and the next three give 1 (indicating a match at the beginning of the first element).
Target <- "18+"
pmatch("8", Target)
pmatch("+", Target)
pmatch("1", Target)
pmatch("18", Target)
pmatch("18+", Target)
The function gsub() can be used to match and replace portions of elements using regular expressions. The plus sign has special meaning in regular expressions, so you need to use escape characters to indicate that you are interested in the plus sign as a single character. For example, the following three lines of code give "1~+", "18~", and "~" as the results, respectively.
gsub("8", "~", Target)
gsub("\\+", "~", Target)
gsub("18\\+", "~", Target)
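For reference, there are a few interchangeable ways to tell gsub() that the plus sign is a literal character. The sketch below (variable names are illustrative) shows three of them giving the same result:

```r
Target <- "18+"

escaped <- gsub("\\+", "~", Target)               # backslash escape
bracket <- gsub("[+]", "~", Target)               # character class
literal <- gsub("+", "~", Target, fixed = TRUE)   # no regex at all

print(c(escaped, bracket, literal))
```

The fixed = TRUE form is usually the easiest to read when the pattern contains no regular-expression features at all.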

Related

Replace Exact String in R using Variables [duplicate]

I'm trying to extract certain records from a dataframe with grepl.
This is based on a comparison between two columns, Results and Names. The values are built like "WordNumber", but the same word occurs with many different numbers (more than 30), so when I use grepl to get, for instance, Word1, I also get results that I would like to avoid, like Word12.
Any ideas on how to fix this?
# Names needs a `name` column for Names$name below to work
Names <- data.frame(name = c("Word1"))
Results <- c("Word1", "Word11", "Word12", "Word15")
Records <- c("ThisIsTheResultIWant", "notThis", "notThis", "notThis")
Relationships <- data.frame(Results, Records)
Relationships <- subset(Relationships, grepl(paste(Names$name, collapse = "|"), Relationships$Results))
This doesn't work. If I use fixed = TRUE, then it doesn't return any result at all (which is weird). I have also tried concatenating the name part with other numbers like this, but with no success:
Relationships <- subset(Relationships, grepl(paste(paste(Names$name, '3', sep = ""), collapse = "|"), Relationships$Results))
Since I'm concatenating I'm not really sure of how to use the \b to enforce a full match.
Any suggestions?
In addition to @Richard's solution, there are multiple ways to enforce a full match.
\b
"\b" is a word-boundary anchor: it matches the position between a word character and a non-word character (or the start/end of the string)
> grepl("\\bWord1\\b",c("Word1","Word2","Word12"))
[1] TRUE FALSE FALSE
\< & \>
"\<" matches the beginning of a word, and "\>" matches the end of a word
> grepl("\\<Word1\\>",c("Word1","Word2","Word12"))
[1] TRUE FALSE FALSE
Use ^ to match the start of the string and $ to match the end of the string
Names <-c('^Word1$')
Or, to apply to the entire names vector
Names <-paste0('^',Names,'$')
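When there are several names, the anchored patterns can be joined with "|" so that one grepl call tests all of them. A minimal sketch, assuming a second name Word15 purely for illustration:

```r
Names <- c("Word1", "Word15")
Results <- c("Word1", "Word11", "Word12", "Word15")

# Anchor every name, then join the alternatives with "|"
pattern <- paste0("^", Names, "$", collapse = "|")
print(pattern)

matches <- grepl(pattern, Results)
print(Results[matches])
```

This assumes the names contain no regex metacharacters; if they might, they would need escaping first.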
I think this is just:
Relationships[Relationships$Results==Names,]
If you end up doing ^Word1$ you're just doing a straight subset.
If you have multiple names, then instead use:
Relationships[Relationships$Results %in% Names,]

Extract substring match from agrep

My goal is to identify whether a given text has a target string in it, but I want to allow for typos / small deviations and extract the substring that "caused" the match (to use it for further text analysis).
Example:
target <- "target string"
text <- "the target strlng: Butter. this text i dont want to extract."
Desired Output:
I would like to have target strlng as the output, since it's very close to the target (Levenshtein distance of 1). Next, I want to use target strlng to extract the word Butter (this part I have covered; I just add it to have a detailed spec).
What i tried:
Using adist did not work, since it compares two whole strings, not substrings.
Next I took a look at agrep, which seems very close. I can find out that my target was found, but not the substring that "caused" the match.
I tried value = TRUE, but it seems to work at the level of whole vector elements. I don't think I can switch to a vector of words, because I cannot split on spaces (my target string might itself contain spaces, ...).
agrep(
pattern = target,
x = text,
value = TRUE
)
Use aregexec, it's similar to the use of regexpr/regmatches (or gregexpr) for exact matches extraction.
m <- aregexec('string', 'text strlng wrong')
regmatches('text strlng wrong', m)
#[[1]]
#[1] "strlng"
This can be wrapped in a function that uses the arguments of both aregexec and regmatches. Note that in the latter case, the function argument invert comes after the dots argument ... so it must be a named argument.
aregextract <- function(pattern, text, ..., invert = FALSE){
m <- aregexec(pattern, text, ...)
regmatches(text, m, invert = invert)
}
aregextract(target, text)
#[[1]]
#[1] "target strlng"
aregextract(target, text, invert = TRUE)
#[[1]]
#[1] "the "
#[2] ": Butter. this text i dont want to extract."
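Because aregexec is vectorised over the text, the same idea also tells you which elements matched at all: regmatches returns an empty character vector for elements without a match. A rough sketch, assuming max.distance = 2 as the tolerance and a second, non-matching sentence made up for illustration:

```r
target <- "target string"
text <- c("the target strlng: Butter. this text i dont want to extract.",
          "nothing relevant here")

# Allow at most 2 edits (an assumed tolerance; tune it for your data)
m <- aregexec(target, text, max.distance = 2)
hits <- regmatches(text, m)

# TRUE where an approximate match was found
found <- lengths(hits) > 0
print(found)
print(hits[found])
```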

Partial Match word from sentence in R

I am looking to partially match a string using the %in% operator in R. When I run the code below, I get FALSE:
'I just want to partial match string' %in% 'partial'
FALSE
Expected Output is TRUE in above case (because it is matched partially)
Since you want to match partially from a sentence you should try using %like% from data.table, check below
library(data.table)
'I just want to partial match string' %like% 'partial'
TRUE
The output is TRUE
`%in_str%` <- function(pattern,s){
grepl(pattern, s)
}
Usage:
> 'a' %in_str% 'abc'
[1] TRUE
You need to strsplit the string so each word in it is its own element in a vector:
"partial" %in% unlist(strsplit('I just want to partial match string'," "))
[1] TRUE
strsplit takes a string and breaks it into a vector of shorter strings. In this case, it breaks on the space (that's the " " at the end), so that you get a vector of individual words. Unfortunately, strsplit returns its results as a list, which is why I wrapped it in an unlist - so we get a single vector.
Then we do the %in%, which works in the opposite direction from the one you used: you're trying to find out if string partial is %in% the sentence, not the other way around.
Of course, this is an annoying way of doing it, so it's probably better to go with a grep-based solution if you want to stay within base-R, or Priyanka's data.table solution above -- both of which will also be better at stuff like matching multiple-word strings.
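For completeness, a grep-based version in base R might look like the sketch below. It shows the difference between matching any substring and matching a whole word; the variable names are only illustrative:

```r
sentence <- "I just want to partial match string"

# Substring match: "part" is found inside "partial"
substring_hit <- grepl("part", sentence, fixed = TRUE)

# Whole-word match: \\b requires a word boundary on both sides
word_hit_part    <- grepl("\\bpart\\b", sentence)
word_hit_partial <- grepl("\\bpartial\\b", sentence)

print(c(substring_hit, word_hit_part, word_hit_partial))
```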

R Regex to identify and replace characters between multiple dots

I have the following codes
"ABC.A.SVN.10.10.390.10.UDGGL"
"XYZ.Z.SVN.11.12.111.99.ASDDL"
and I need to replace the characters that exist between the 2nd and the 3rd dot. In this case it is SVN but it may well be any combination of between A and ZZZ, so really the only way to make this work is by using the dots.
The required outcome would be:
"ABC.A..10.10.390.10.UDGGL"
"XYZ.Z..11.12.111.99.ASDDL"
I tried variants of grep("^.+(\\.\\).$", "ABC.A.SVN.10.10.390.10.UDGGL") but I get an error.
Some examples of what I have tried with no success :
Link 1
Link 2
EDIT
I tried @Onyambu's first method and I ran into a variant which I had not accounted for: "ABC.A.AB11.1.12.112.1123.UDGGL". The part to be replaced can also contain numeric values. The desired outcome is "ABC.A..1.12.112.1123.UDGGL" and I get it using sub("\\.\\w+.\\B.",".",x) per the second part of his answer!
See code in use here
x <- c("ABC.A.SVN.10.10.390.10.UDGGL", "XYZ.Z.SVN.11.12.111.99.ASDDL")
sub("^(?:[^.]*\\.){2}\\K[^.]*", "", x, perl = TRUE)
^ Assert position at the start of the line
(?:[^.]*\.){2} Match the following exactly twice
[^.]*\. Match any character except . any number of times, followed by .
\K Resets the starting point of the pattern. Any previously consumed characters are no longer included in the final match
[^.]* Match any character except . any number of times
Results in [1] "ABC.A..10.10.390.10.UDGGL" "XYZ.Z..11.12.111.99.ASDDL"
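Since this pattern only relies on the dots, it should also cope with the variant from the question's edit, where the third field contains digits. A small sketch checking that, together with an equivalent form that uses a capture group instead of \K (with_k and with_group are illustrative names):

```r
x <- c("ABC.A.SVN.10.10.390.10.UDGGL", "ABC.A.AB11.1.12.112.1123.UDGGL")

# \K version: drop whatever sits between the 2nd and 3rd dot
with_k <- sub("^(?:[^.]*\\.){2}\\K[^.]*", "", x, perl = TRUE)

# Capture-group version: keep the first two fields, drop the third
with_group <- sub("^((?:[^.]*\\.){2})[^.]*", "\\1", x, perl = TRUE)

print(with_k)
print(identical(with_k, with_group))
```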
x <- c("ABC.A.SVN.10.10.390.10.UDGGL", "XYZ.Z.SVN.11.12.111.99.ASDDL")
sub("([A-Z]+)(\\.\\d+)","\\2",x)
[1] "ABC.A..10.10.390.10.UDGGL" "XYZ.Z..11.12.111.99.ASDDL"
([A-Z]+) captures a run of one or more uppercase letters (group 1).
(\\.\\d+) requires that run to be followed immediately by a dot (\\.) and one or more digits (\\d+), and captures them as group 2.
In "ABC.A.SVN.10.10.390.10.UDGGL" the first place where both conditions hold is SVN.10, captured as SVN and .10. The backreference \\2 then replaces the whole of SVN.10 with just the second group, .10.
Another logic that will work:
sub("\\.\\w+.\\B.",".",x)
[1] "ABC.A..10.10.390.10.UDGGL" "XYZ.Z..11.12.111.99.ASDDL"
Not exactly regex but here is one more approach
#DATA
S = c("ABC.A.SVN.10.10.390.10.UDGGL", "XYZ.Z.SVN.11.12.111.99.ASDDL")
sapply(X = S,
       FUN = function(str){
           # positions of the 2nd and 3rd dots
           ind = unlist(gregexpr("\\.", str))[2:3]
           paste(c(substring(str, 1, ind[1]),
                   "SUBSTITUTION",
                   substring(str, ind[2])), collapse = "")
       },
       USE.NAMES = FALSE)
#[1] "ABC.A.SUBSTITUTION.10.10.390.10.UDGGL" "XYZ.Z.SUBSTITUTION.11.12.111.99.ASDDL"
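Another regex-free option is to split on the dots, overwrite the third field, and paste the pieces back together. The helper below, replace_field, is a hypothetical name for this sketch, and it assumes every string has at least n fields and no trailing empty field (strsplit drops those):

```r
S <- c("ABC.A.SVN.10.10.390.10.UDGGL", "XYZ.Z.SVN.11.12.111.99.ASDDL")

# Replace the n-th dot-separated field of each string
replace_field <- function(x, n, replacement = "") {
  parts <- strsplit(x, ".", fixed = TRUE)
  vapply(parts, function(p) {
    p[n] <- replacement
    paste(p, collapse = ".")
  }, character(1))
}

emptied <- replace_field(S, 3)
print(emptied)
```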

get character before second underscore [duplicate]

What regular expression can retrieve (e.g. with sub()) the characters before the second period? Given a character vector like:
v <- c("m_s.E1.m_x.R1PE1", "m_xs.P1.m_s.R2E12")
I would like to have returned this:
[1] "m_s.E1" "m_xs.P1"
> sub( "(^[^.]+[.][^.]+)(.+$)", "\\1", v)
[1] "m_s.E1" "m_xs.P1"
Now to explain it: The symbols inside the first and third paired "[ ]" match any character except a period ("character classes"), and the "+"'s that follow them let that be an arbitrary number of such characters. The [.] therefore matches only the first period, and the second period terminates the match. Parentheses-pairs allow you to specify partial sections of the matched characters, and there are two such sections. The second section is any character (the period symbol) repeated an arbitrary number of times until the end of the string, $. The "\\1" specifies only the first partial match as the returned value.
The ^ operator means different things inside and outside the square-brackets. Outside it refers to the length-zero beginning of the string. Inside at the beginning of a character class specification, it is the negation operation.
This is a good use case for "character classes" which are described in the help page found by typing:
?regex
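The two meanings of ^ are easy to see side by side. The following sketch (variable names are only illustrative) uses ^ once as an anchor and once as a negation inside a character class, on the same vector v:

```r
v <- c("m_s.E1.m_x.R1PE1", "m_xs.P1.m_s.R2E12")

# Outside brackets: ^ anchors the match to the start of the string
starts_with_m <- grepl("^m", v)

# Inside brackets: [^.] means "any character except a period",
# so removing those characters leaves only the periods
only_periods <- gsub("[^.]", "", v)

# Both uses combined, as in the answer above
first_two <- sub("^([^.]+[.][^.]+).*$", "\\1", v)

print(starts_with_m)
print(only_periods)
print(first_two)
```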
Not regex but the qdap package has the beg2char (beginning of string 2 n character) to handle this:
library(qdap)
beg2char(v, ".", 2)
## [1] "m_s.E1" "m_xs.P1"
