grepl not searching correctly in R - r

I want to search ".com" in a vector, but grepl isn't working out for me. Anyone know why? I am doing the following
vector <- c("fdsfds.com","fdsfcom")
grepl(".com",vector)
This returns
[1] TRUE TRUE
I want it to strictly refer to "fdsfds.com"

As @user20650 said in the comments above, use grepl("\\.com",vector). The dot (.) is a special character in regular expressions that matches any character, so it's matching the second "f" in "fdsfcom". The "\\" before the . "escapes" the dot so it's treated literally. Alternatively, you could use grepl(".com",vector, fixed = TRUE), which searches for the literal string without using regular expressions.
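A minimal sketch comparing the three calls on the vector from the question (the result variable names are only for illustration):

```r
vector <- c("fdsfds.com", "fdsfcom")

# Unescaped: "." matches any character, so "fcom" in the second element matches too
regex_unescaped <- grepl(".com", vector)

# Escaped: "\\." matches only a literal dot
regex_escaped <- grepl("\\.com", vector)

# fixed = TRUE: the pattern is treated as a plain string, not a regex
plain_string <- grepl(".com", vector, fixed = TRUE)

print(regex_unescaped)  # TRUE TRUE
print(regex_escaped)    # TRUE FALSE
print(plain_string)     # TRUE FALSE
```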

Related

How to remove a certain portion of the column name in a dataframe?

I have column names in the following format:
col= c('UserLanguage','Q48','Q21...20','Q22...21',"Q22_4_TEXT...202")
I would like to get the column names without everything that is after ...
[1] "UserLanguage" "Q48" "Q21" "Q22" "Q22_4_TEXT"
I am not sure how to code it. I found this post here but I am not sure how to specify the pattern in my case.
You can use gsub.
gsub("\\...*","",col)
#[1] "UserLanguage" "Q48" "Q21" "Q22" "Q22_4_TEXT"
Or you can use stringr
library(stringr)
str_remove(col, "\\...*")
Since . matches any character in a regular expression, we need to escape it with a backslash to match a literal period: \.. However, the backslash is itself the escape character in R string literals, so it has to be doubled in the code, giving \\.. Note that in \\...* only the first period is escaped: the pattern matches one literal period, then any one character, then any further characters (the * applies only to the . immediately before it and means "zero or more times"). That is enough for these column names. To require three literal periods, you could write \\.{3}.* instead.
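A quick check of what the pattern actually matches, next to a stricter variant. The stricter pattern and the name "v1.5_score" are assumptions added for illustration; they are not part of the original answer.

```r
col <- c("UserLanguage", "Q48", "Q21...20", "Q22...21", "Q22_4_TEXT...202")

# Original pattern: one literal dot, then any character, then anything else
loose <- gsub("\\...*", "", col)

# Stricter variant (assumed intent): exactly three literal dots, then anything else
strict <- gsub("\\.{3}.*", "", col)

# Both give the same result on the question's data
print(identical(loose, strict))  # TRUE

# They differ on a name that contains a single dot (hypothetical example)
loose_single <- gsub("\\...*", "", "v1.5_score")     # "v1"
strict_single <- gsub("\\.{3}.*", "", "v1.5_score")  # "v1.5_score"
```

If column names can legitimately contain a single period, the stricter pattern is the safer choice.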
You could sub and capture the first word in each column:
col <- c("UserLanguage", "Q48", "Q21...20", "Q22...21", "Q22_4_TEXT...202")
sub("^(\\w+).*$", "\\1", col)
[1] "UserLanguage" "Q48" "Q21" "Q22" "Q22_4_TEXT"
The regex pattern used here says to match:
^ from the start of the input
(\w+) match AND capture the first word
.* then consume the rest
$ end of the input
Then, using sub we replace with \1 to retain just the first word.

How to search for strings with parentheses in R

Using R, I have a long list of keywords that I'm searching for in a dataset. One of the keywords needs to have parentheses around it in order to be included.
I've been attempting to replace the parenthesis in the keywords list with \\ then the parentheses, but have not been successful. If there is a way to modify the grepl() function to recognize them, that would also be helpful. Here is an example of what I'm trying to accomplish:
patterns<-c("dog","cat","(fish)")
data<-c("brown dog","black bear","salmon (fish)","red fish")
patterns2<- paste(patterns,collapse="|")
grepl(patterns2,data)
[1] TRUE FALSE TRUE TRUE
I would like salmon (fish) to give TRUE, and red fish to give FALSE.
Thank you!
As noted by @joran in the comments, the pattern should look like so:
patterns <- c("dog","cat","\\(fish\\)")
The double backslashes (\\) tell R to read the parentheses literally when searching for the pattern.
Easiest way to achieve this if you don't want to make the change manually:
patterns <- gsub("([()])","\\\\\\1", patterns)
Which will result in:
[1] "dog" "cat" "\\(fish\\)"
If you're not very familiar with regular expressions, here is what happens: the pattern looks for any one character listed within the square brackets, i.e. ( or ). The round brackets around that tell it to capture whatever it matched. In the replacement, the first four backslashes in the source code are two backslash characters in the actual string, which the regex engine reads as a single literal backslash to insert. The final \\1 (the characters \1) inserts whatever was captured, i.e. either ( or ). Each parenthesis therefore gets one backslash in front of it, which R prints as \\( and \\).
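Putting the escaping step together with the search from the question, as a small sketch (the names escaped, combined, and result are only for illustration):

```r
patterns <- c("dog", "cat", "(fish)")
data <- c("brown dog", "black bear", "salmon (fish)", "red fish")

# Put a backslash in front of every parenthesis
escaped <- gsub("([()])", "\\\\\\1", patterns)

# cat() shows the real characters: only ONE backslash per parenthesis
combined <- paste(escaped, collapse = "|")
cat(combined, "\n")  # dog|cat|\(fish\)

result <- grepl(combined, data)
print(result)  # TRUE FALSE TRUE FALSE
```

This only escapes parentheses. Keywords containing other regex metacharacters (for example . or +) would need a wider character class or fixed matching.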
Another option is to forget regex and use grepl with fixed = TRUE on the original, unescaped patterns:
rowSums(sapply(patterns, grepl, data, fixed = TRUE)) > 0
# [1] TRUE FALSE TRUE FALSE

Double Colon in R Regular Expression

The goal is to remove all non-capital letters in a string, and I managed to find a regular expression solution without fully understanding it.
> gsub("[^::A-Z::]","", "PendingApproved")
[1] "PA"
I tried to read the documentation of regex in R but the double colon isn't really covered there.
[] encloses the characters to match in a regex, A-Z means upper case and ^ means not. Can someone help me understand what the double colons are doing there?
As far as I know, you don't need those double colons:
gsub("[^A-Z]", "", "PendingApproved")
[1] "PA"
Your current pattern says to remove any character which is not A-Z or colon :. The fact that you repeat the colons twice, on each side of the character range, does not add any extra logic.
Perhaps the author of the code you are using confused the double colons with R's own regex syntax for named character classes. For example, we could have written the above as:
gsub("[^[:upper:]]","", "PendingApproved")
where [:upper:] means all upper case letters.
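To see that the colons are just ordinary members of the negated set, here is a small sketch using an input that contains a colon (the test string is an assumption added for illustration):

```r
x <- "Pending:Approved"

# Colons are part of the "keep" set here, so the colon survives
with_colons <- gsub("[^::A-Z::]", "", x)

# Without the colons in the set, the colon is removed as well
plain_range <- gsub("[^A-Z]", "", x)

# Named character class: note the doubled brackets, [[:upper:]] negated as [^[:upper:]]
named_class <- gsub("[^[:upper:]]", "", x)

print(with_colons)  # "P:A"
print(plain_range)  # "PA"
print(named_class)  # "PA"
```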
To remove all small letters use following:
gsub("[a-z]","", "PendingApproved")
Outside a character class, ^ anchors the match to the start of the string, so
gsub("^[a-z]","", "PendingApproved")
will not remove any letters from your test string, because the string does not start with a lowercase letter.
EDIT: As per Tim's comment, ^ works as negation when it is the first character inside a character class. So, say we want to remove all digits from a value made up of letters and digits; then the following may help.
gsub("[^[:alpha:]]","", "PendingApproved1213133")
Here ^ acts as negation because it is the first character inside the character class: the pattern matches every character that is not a letter, so gsub removes the digits and leaves the letters untouched.
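The two meanings of ^ can be compared side by side. This is a minimal sketch; the variable names are only for illustration.

```r
x <- "PendingApproved1213133"

# ^ outside brackets: anchor to the start of the string
anchored_lower <- gsub("^[a-z]", "", x)  # no change, the string starts with "P"
anchored_upper <- gsub("^[A-Z]", "", x)  # removes only the leading "P"

# ^ as the first character inside brackets: negation of the set
not_alpha <- gsub("[^[:alpha:]]", "", x)  # removes everything that is not a letter
not_lower <- gsub("[^a-z]", "", x)        # removes everything that is not a lowercase letter

print(anchored_lower)  # "PendingApproved1213133"
print(anchored_upper)  # "endingApproved1213133"
print(not_alpha)       # "PendingApproved"
print(not_lower)       # "endingpproved"
```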
We can use str_remove_all from stringr
library(stringr)
str_remove_all("PendingApproved", "[a-z]+")
#[1] "PA"

str_extract - How to disable default regex

library(stringr)
namesfun<-(sapply(mxnames, function (x)(str_extract(x,sapply(jockeys, function (y)y)))))%>%as.data.frame(stringsAsFactors = F)
So I am trying to use str_extract using sapply through two vectors, and the "jockeys" vector that I use as the pattern argument in str_extract, has elements with special characters like "-" or "/" that interfere with regex.
Since I want an exact, literal ("human") match rather than a regex-based match, how can I stop str_extract from treating the pattern as a regex by default?
I hope I got my point across!
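One possible approach, sketched under assumptions: stringr provides a fixed() modifier that makes a pattern match as a literal string. The contents of mxnames and jockeys below are hypothetical stand-ins, since the question does not show the real data.

```r
library(stringr)

# Hypothetical stand-ins for the question's vectors
mxnames <- c("J. Smith (AUS) / Red Rum", "A.B. Lee-Jones / Arkle")
jockeys <- c("J. Smith (AUS)", "A.B. Lee-Jones")

# Default behaviour: the pattern is a regex, so "(AUS)" is read as a group
# and the literal parentheses in the text are not matched
as_regex <- str_extract(mxnames[1], jockeys[1])

# fixed(): every character of the pattern is matched literally
m <- sapply(mxnames, function(x) str_extract(x, fixed(jockeys)))

print(as_regex)  # NA
print(m)         # one column per element of mxnames, one row per jockey
```

Base R offers the same idea through the fixed = TRUE argument of grepl, sub, and regexpr.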

Replace the last occurence of a string (and only it) using regular expression

I have a string, let's say MyString = "aabbccawww". I would like to use a gsub expression to replace the last "a" in MyString with "A", and only that one. That is, "aabbccAwww". I have found similar questions on the website, but they all asked to replace the last occurrence and everything coming after it.
I have tried gsub("a[^a]*$", "A", MyString), but it gives "aabbccA". I know that I can use stringi functions for that purpose but I need the solution to be implemented in a part of a code where using such functions would be complicated, so I would like to use a regular expression.
Any suggestion?
You can use stringi library which makes dealing with strings very easy, i.e.
library(stringi)
x <- "aabbccawww"
stri_replace_last_fixed(x, 'a', 'A')
#[1] "aabbccAwww"
We can use sub to match 'a' followed by zero or more characters that are not an 'a' ([^a]*), capture it as group ((...)) until the end of the string ($) and replace it with "A" followed by the backreference of the captured group (\\1)
sub("a([^a]*)$", "A\\1", MyString)
#[1] "aabbccAwww"
While akrun's answer should solve the problem (not sure, haven't worked with \1 etc. yet), you can also use a lookahead:
a(?!(.|\n)*a)
This is basically saying: Find an a that is NOT followed by any number of characters and an a. The (?!x) is a so-called negative lookahead, which means that the expression inside it is checked but is not included in the match.
You need (.|\n) since . refers to all characters, except for line breaks.
For reference about lookaheads or other regex, I can recommend http://regexr.com/.
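In R, the lookahead pattern can be tried as below. The perl = TRUE argument is required because R's default regex engine does not support lookaheads. The second, shorter pattern is an assumption that only holds when the string contains no line breaks.

```r
MyString <- "aabbccawww"

# Negative lookahead: an "a" that is not followed (anywhere later) by another "a"
with_newlines <- sub("a(?!(.|\n)*a)", "A", MyString, perl = TRUE)

# Shorter variant, sufficient when the string has no line breaks
single_line <- sub("a(?!.*a)", "A", MyString, perl = TRUE)

print(with_newlines)  # "aabbccAwww"
print(single_line)    # "aabbccAwww"
```

Since only one "a" can satisfy the lookahead, gsub gives the same result as sub here.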
