str_extract - How to disable default regex - r

library(stringr)
namesfun <- sapply(mxnames, function(x) str_extract(x, sapply(jockeys, function(y) y))) %>% as.data.frame(stringsAsFactors = FALSE)
I am trying to use str_extract inside sapply over two vectors. The "jockeys" vector, which I pass as the pattern argument to str_extract, has elements with special characters like "-" or "/" that interfere with the regex.
Since I want an exact, literal ("human") match rather than a regex-based one, how can I stop str_extract from treating the pattern as a regex?
I hope I got my point across!
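One way to get a literal match is to wrap the pattern in stringr's fixed(), which compares the pattern as plain text instead of compiling it as a regex. The sketch below assumes mxnames and jockeys are character vectors; the sample values are made up for illustration, since the real data is not shown in the question.

```r
library(stringr)

# Hypothetical sample data; the real mxnames and jockeys are not shown in the question
mxnames <- c("Race 1: J. O'Neil (IRE)", "Race 2: A-B Smith/Jones")
jockeys <- c("J. O'Neil (IRE)", "A-B Smith/Jones")

# fixed() makes str_extract treat each pattern as literal text
literal <- sapply(mxnames, function(x) str_extract(x, fixed(jockeys)))

# Without fixed(), "(IRE)" is read as a capture group, so the first name is not found
as_regex <- sapply(mxnames, function(x) str_extract(x, jockeys))
```

Each column of the result corresponds to one element of mxnames and each row to one jockey; entries with no match are NA.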

Related

How do I add a space between two characters using regex in R?

I want to add a space between two punctuation characters (+ and -).
I have this code:
s <- "-+"
str_replace(s, "([:punct:])([:punct:])", "\\1\\s\\2")
It does not work.
May I have some help?
There are several issues here:
In the ICU regex flavor used by stringr, [:punct:] does not match math symbols (\p{S}); it only matches punctuation proper (\p{P}). To match both, combine the two classes: [\p{P}\p{S}].
The "\\1\\s\\2" replacement contains the regex escape \s, which is not supported in replacement patterns. Use a literal space instead.
str_replace only replaces the first occurrence. Use str_replace_all to handle all matches.
Even with all of the above, it still won't work for strings like -+?/, because each match consumes both characters, so no space is inserted between + and ?. Make the second part of the regex a zero-width assertion (a positive lookahead) so that the second punctuation character is not consumed.
So, you can use
library(stringr)
s <- "-+?="
str_replace_all(s, "([\\p{P}\\p{S}])(?=[\\p{P}\\p{S}])", "\\1 ")
str_replace_all(s, "(?<=[\\p{P}\\p{S}])(?=[\\p{P}\\p{S}])", " ")
gsub("(?<=[[:punct:]])(?=[[:punct:]])", " ", s, perl=TRUE)
See the R demo online; all three lines yield [1] "- + ? =".
Note that in the PCRE regex flavor (used by gsub with perl=TRUE), a POSIX character class must be placed inside a bracket expression, hence the double brackets in [[:punct:]].
Also, (?<=[[:punct:]]) is a positive lookbehind that checks for the presence of its pattern immediately to the left. Since it is non-consuming, there is no need for a backreference in the replacement.
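The two pitfalls described above can be checked directly. This is a small sketch; the variable names are only illustrative.

```r
library(stringr)

s <- "-+?="

# Both groups consume a character, so the pair "+?" is never examined
consuming <- str_replace_all(s, "([\\p{P}\\p{S}])([\\p{P}\\p{S}])", "\\1 \\2")
# "- +? ="

# ICU (stringr): [[:punct:]] matches "-" but not the math symbol "+"
icu_punct <- str_detect(c("-", "+"), "[[:punct:]]")

# PCRE (base R with perl = TRUE): [[:punct:]] matches both
pcre_punct <- grepl("[[:punct:]]", c("-", "+"), perl = TRUE)
```

The missing space between + and ? in the first result is what the lookahead version fixes.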

Get string in between many other strings [R]

Here I want to extract the string part "wanted1part". I could do it like this:
string <- "foo_bar_doo_xwanted1part_more_junk"
gsub("\\_.*", "", gsub(".*?_x", "", string))
#> [1] "wanted1part"
But I was hoping that someone could suggest a one-line solution?
If you want to stick with using gsub, you can use a capture group that is backreferenced in the replacement:
gsub('^.+_x(\\w+?)_.+$', '\\1', string, perl = TRUE)
The key here is to have the pattern match the whole string, with a capture group (specified using parentheses) matching the part of the string you would like to keep. This group, here "(\\w+?)", then replaces the entire string when we reference it in the replacement.
I've found that using str_extract from stringr can make this kind of thing a little easier, as it lets me avoid capture groups.
library(stringr)
str_extract(string, '(?<=_x)\\w+?(?=_)')
Here, I use a lookahead and lookbehind instead to identify the part of the string we want to extract.
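If lookarounds feel heavy, a capture group with str_match is another option. This is a sketch that swaps \\w+? for [^_]+, on the assumption that the wanted part never contains an underscore.

```r
library(stringr)

string <- "foo_bar_doo_xwanted1part_more_junk"

# str_match returns a matrix: column 1 is the full match, column 2 the first group
wanted <- str_match(string, "_x([^_]+)_")[, 2]
```

Indexing column 2 keeps only the captured text, so no replacement string is needed.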

Pattern to match only characters within parentheses

I have looked at lots of posts here on SO with suggestions on REGEX patterns to grab texts from parentheses. However, from what I have looked into I cannot find a solution that works.
For example, I have had a look at the following:
R - Regular Expression to Extract Text Between Parentheses That Contain Keyword, Extract text in parentheses in R, regex to pickout some text between parenthesis [duplicate]
In that order, here are the solutions from the top answers (with some amendments):
pattern1= '\\([^()]*[^()]*\\)'
pattern2= '(?<=\\()[^()]*(?=\\))'
pattern3= '.*\\((.*)\\).*'
all_patterns = c(pattern1, pattern2, pattern3)
I have used the following:
sapply(all_patterns , function(x)stringr::str_extract('I(data^2)', x))
\\([^()]*[^()]*\\) (?<=\\()[^()]*(?=\\)) .*\\((.*)\\).*
"(data^2)" "data^2" "I(data^2)"
None of these seem to only grab the characters within the brackets, so how can I just grab the characters inside brackets?
Expected output:
data
str_extract returns every character matched by the pattern, which is why those patterns return more than you want. Instead, use a lookbehind: match one or more characters that are neither a ^ nor a closing bracket, [^\\^\\)]+, and that follow an opening bracket, (?<=\\(). The brackets are escaped with \\ because they are metacharacters.
library(stringr)
str_extract('I(data^2)', '(?<=\\()[^\\^\\)]+')
# [1] "data"
Here is a combination of str_extract and str_remove:
library(stringr)
str_extract(str_remove('I(data^2)', '.\\('), '\\w*')
[1] "data"
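The lookbehind pattern is vectorised, so it can be applied to several terms at once. A small sketch follows; the extra inputs are hypothetical and only there to show what happens when nothing matches.

```r
library(stringr)

# Hypothetical inputs, including one with no parentheses
terms <- c("I(data^2)", "log(price)", "plain")

# Elements without a match come back as NA
inside <- str_extract(terms, "(?<=\\()[^\\^\\)]+")
```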

How to remove a certain portion of the column name in a dataframe?

I have column names in the following format:
col= c('UserLanguage','Q48','Q21...20','Q22...21',"Q22_4_TEXT...202")
I would like to get the column names without the ... and everything after it:
[1] "UserLanguage" "Q48" "Q21" "Q22" "Q22_4_TEXT"
I am not sure how to code it. I found this post here but I am not sure how to specify the pattern in my case.
You can use gsub.
gsub("\\...*","",col)
#[1] "UserLanguage" "Q48" "Q21" "Q22" "Q22_4_TEXT"
Or you can use stringr
library(stringr)
str_remove(col, "\\...*")
Since . matches any character in a regular expression, we need to escape it with a backslash to match a literal period: \.. However, the backslash is itself an escape character in R strings, so it has to be doubled, giving \\.. Note that in \\...* only the first period is escaped: the second . matches any single character, and .* matches zero or more further characters. The pattern therefore removes everything from the first literal period onward, provided at least one character follows it. To require three literal periods, use \\.{3}.* instead.
You could also use sub and capture the first word in each column name:
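The difference between an escaped and an unescaped period shows up as soon as a name contains a single dot. A minimal sketch, where the second name in col2 is a made-up example:

```r
# Hypothetical names: the second one contains a single period that should be kept
col2 <- c("Q21...20", "v1.2_score")

# Only the first period is literal here, so "v1.2_score" is cut at its single dot
loose <- sub("\\...*", "", col2)

# Requiring exactly three literal periods leaves "v1.2_score" untouched
strict <- sub("\\.{3}.*", "", col2)
```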
col <- c("UserLanguage", "Q48", "Q21...20", "Q22...21", "Q22_4_TEXT...202")
sub("^(\\w+).*$", "\\1", col)
[1] "UserLanguage" "Q48" "Q21" "Q22" "Q22_4_TEXT"
The regex pattern used here says to match:
^ from the start of the input
(\w+) match AND capture the first word
.* then consume the rest
$ end of the input
Then, using sub we replace with \1 to retain just the first word.

grepl not searching correctly in R

I want to search for ".com" in a vector, but grepl isn't working for me. Does anyone know why? I am doing the following:
vector <- c("fdsfds.com","fdsfcom")
grepl(".com",vector)
This returns
[1] TRUE TRUE
I want it to match only "fdsfds.com".
As @user20650 said in the comments above, use grepl("\\.com",vector). The dot (.) is a special character in regular expressions that matches any character, so it matches the second "f" in "fdsfcom". The "\\" before the . escapes the dot so that it is treated literally. Alternatively, you could use grepl(".com",vector, fixed = TRUE), which searches for the literal string without using regular expressions.
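The three variants can be compared side by side. A short sketch using the vector from the question; the result names are only illustrative.

```r
vector <- c("fdsfds.com", "fdsfcom")

# Unescaped dot: matches any character before "com"
as_regex <- grepl(".com", vector)

# Escaped dot: matches a literal period only
escaped <- grepl("\\.com", vector)

# fixed = TRUE: the whole pattern is treated as literal text
literal <- grepl(".com", vector, fixed = TRUE)
```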

Resources