Understanding grep with fixed = T in R

I checked different posts on this, but still couldn't figure out why this is not working:
c=c("HI","NO","YESS")
grep("YES",c,fixed=T)
[1] 3
If I am using fixed = T, why am I still getting a result when there is no exact match for "YES"? I want only exact matches, like when I use grep -w in bash.

fixed = TRUE only means that the pattern is treated as a literal string rather than a regular expression; that string can still match as a substring of a longer element. If you want exact matches only, how about
> x=c("HI","NO","YESS") #better not to name variables after common functions
> grep("^YES$",x,fixed=F)
integer(0)
Edit per @nicola: This works because ^ means beginning of string and $ means end of string, so ^xxxx$ forces the entire string to match xxxx.
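As a rough sketch of the options discussed above, the snippet below compares substring, whole-element, and whole-word matching. The vector contents and variable names are made up for illustration; the \\b version is only an approximation of what grep -w does in the shell.

```r
# Hypothetical sample data: the original vector plus two extra elements
x <- c("HI", "NO", "YESS", "YES", "SAY YES")

# fixed = TRUE: literal text, but still a substring search
substring_hits <- grep("YES", x, fixed = TRUE)

# Anchored regex: the whole element must be exactly "YES"
whole_hits <- grep("^YES$", x)

# No regex at all: plain equality gives the same whole-element result
equal_hits <- which(x == "YES")

# Word boundaries: "YES" as a whole word anywhere in the element
word_hits <- grep("\\bYES\\b", x)
```

If whole-element equality is all that is needed, which(x == "YES") avoids regular expressions entirely.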

Related

I need help figuring out why my regex does not match with what I am looking for

I am working on an R script that checks whether a data.frame is built correctly and contains the right information in the right place.
I need to make sure a row contains the right information, so I want to use a regular expression to check each cell of that row.
At first I thought it failed because I was comparing the regex to a value taken directly from the table.
I used regex101.com to make sure my regular expression was correct, and it matched when the test string was put between quotes.
Then I wrapped the value in as.character(), but the result was still FALSE.
To sum up, the regex works on regex101.com, but never in my R script:
test = c("b40", "b40")
".[ab][0-8]{2}." == test[1]
FALSE
I expect the output to be TRUE, but it is always FALSE.
The == operator compares two strings for exact equality; it does not treat either side as a regular expression. For pattern matching, use grepl:
grepl("^[ab][0-8]{2}", test[1])
#[1] TRUE
Here, ^ anchors the match at the start of the string, [ab] matches either 'a' or 'b', and [0-8]{2} matches two digits, each from 0 to 8. To require that the string also ends there, add $ at the end of the pattern.
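A small sketch of the difference between == and grepl, using the vector from the question. The fully anchored pattern and the variable names are assumptions for illustration. The last two lines suggest why the original pattern matched on regex101.com only when the test string was quoted: its leading and trailing dots each need a character to consume.

```r
test <- c("b40", "b40")
pattern <- "^[ab][0-8]{2}$"   # anchored at both ends (an assumption about the intent)

# == compares the two strings character by character, so this is FALSE
literal_cmp <- pattern == test[1]

# grepl interprets the first argument as a regular expression
regex_cmp <- grepl(pattern, test)

# 9 is outside [0-8], so this value is rejected
out_of_range <- grepl(pattern, "b49")

# The original pattern needs one extra character on each side
bare_value   <- grepl(".[ab][0-8]{2}.", "b40")
quoted_value <- grepl(".[ab][0-8]{2}.", "\"b40\"")
```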

grepping special characters in R

I have a variable named full.path.
I am checking whether the string it contains has certain special characters.
In the code below, I try to grep for some special characters. None of them are in the string, yet the output is TRUE.
Could someone explain and help? Thanks in advance.
full.path <- "/home/xyz"
#This returns TRUE :(
grepl("[?.,;:'-_+=()!@#$%^&*|~`{}]", full.path)
By plugging this regex into https://regexr.com/ I was able to spot the issue: a - between two characters inside a character class creates a range. The range from ' to _ covers digits, uppercase letters, and several punctuation characters, including /, so the / in the path produces a spurious match.
To avoid this, put - first in the character class, where it is taken literally instead of forming a range:
> grepl("[-?.,;:'_+=()!@#$%^&*|~`{}]", full.path)
[1] FALSE
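To see the range effect in isolation, the sketch below reduces the character class to the three characters involved. The shortened patterns and variable names are illustrative assumptions, not part of the original question.

```r
full.path <- "/home/xyz"

# Code points of the range endpoints and of the slash in the path
codes <- utf8ToInt("'/_")

# The slash lies strictly between the two endpoints
slash_in_range <- codes[2] > codes[1] && codes[2] < codes[3]

# "'-_" inside the class is a range, so "/" matches
range_hit <- grepl("['-_]", full.path)

# With "-" first, the class is just the three literal characters - ' _
literal_hit <- grepl("[-'_]", full.path)
```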

Using grep to filter rows with two or more patterns in the string in R

I need to index all the rows that have a string beginning with either "B-" or "B^" in one of the columns. I tried a bunch of combinations, but I suspect they fail because "-" and "^" are also special characters in grep patterns.
dataset[grep('^(B-|B^)[^B-|B^]*$', dataset$Col1),]
With the above script, rows beginning with "B^" are not being extracted. Please suggest a smart way to handle this.
You can escape the special characters with \\ in the pattern:
dataset[grep('^(B\\-|B\\^)[^B\\-|B\\^]*$', dataset$Col1),]
To explain: outside a character class, ^ is an anchor for the beginning of the string, so it has to be escaped to match a literal ^ in the middle of a pattern. The [] form a character class, so [^B-|B^]* does not mean "not B- or B^": it is a negated class, and inside it B-| is even read as a range from B to |. That part is unnecessary here.
The simplified regex is:
dataset[grep('^(B-|B\\^)', dataset$Col1),]
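A minimal sketch of the simplified pattern on made-up data (the data frame contents and the value column are hypothetical). It also shows two equivalent forms: a character class with - placed first, and startsWith(), which avoids regular expressions altogether.

```r
# Hypothetical data for illustration
dataset <- data.frame(
  Col1  = c("B-12", "B^7", "AB-1", "C^2", "B12"),
  value = 1:5,
  stringsAsFactors = FALSE
)

# Alternation with an escaped ^, as in the simplified regex
alt_rows <- grep("^(B-|B\\^)", dataset$Col1)

# Same idea with a character class: "-" first and "^" not first are literal
class_rows <- grep("^B[-^]", dataset$Col1)

# No regex: test the two literal prefixes directly
prefix_rows <- which(startsWith(dataset$Col1, "B-") |
                     startsWith(dataset$Col1, "B^"))

matching <- dataset[alt_rows, ]
```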

Match everything up until first instance of a colon

Trying to code up a Regex in R to match everything before the first occurrence of a colon.
Let's say I have:
time = "12:05:41"
I'm trying to extract just the 12. My strategy was to do something like this:
grep(".+?(?=:)", time, value = TRUE)
But I'm getting the error that it's an invalid Regex. Thoughts?
Your regex itself is fine, but the lookahead (?=:) is Perl syntax, so you get the error because perl = TRUE is missing. Even with that fixed, I don't think grep is the right tool here.
I would recommend using:
stringr::str_extract( time, "\\d+?(?=:)")
grep works differently from how it is being used here: it tells you which elements of a vector match a pattern (or returns those elements with value = TRUE), but it cannot pull a piece out of a string.
If you want to stay in base R, you can also use sub:
sub("^(\\d+?)(?=:)(.*)$","\\1",time, perl=TRUE)
Also, you may split the string using strsplit and filter out the first string like below:
strsplit(time, ":")[[1]][1]
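For comparison, here is a sketch of a few other base R ways to get the text before the first colon. The variable names are illustrative; regmatches() with regexpr() is used as a base R counterpart to stringr::str_extract().

```r
time <- "12:05:41"

# The original pattern works once perl = TRUE enables lookahead
with_lookahead <- regmatches(time, regexpr(".+?(?=:)", time, perl = TRUE))

# Without lookahead: take the leading run of non-colon characters
without_lookahead <- regmatches(time, regexpr("^[^:]+", time))

# Or delete everything from the first colon onwards
by_deletion <- sub(":.*", "", time)
```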

how back reference works in unix

I have a file "file7.txt"
The contents of file7.txt,
I want to know how back references work with the grep command.
When I type the commands, I get the following results,
So now I want to know how this works.
I will explain the first grep that you tried.
grep '\([a-z]\)\1'
This matches any line that contains a lowercase letter immediately followed by the same letter, as in this sample string:
f i l l i n g
grep looks for a character in the range 'a' to 'z'.
Starting at the beginning of the line, it tries each character in turn.
The group \(...\) remembers the matched character, and \1 requires the next character to be that same character. In the sample this succeeds at "ll".
More generally, you need a way of remembering what you found, and of checking whether the same text occurs again.
You can mark part of a pattern using "\(" and "\)".
You can recall the remembered text with "\" followed by a single digit.
You can have up to 9 different remembered groups.
This keeps the pattern short, since you do not have to spell out every possible doubled letter (aa, bb, cc, and so on).
This is how all back references work.
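The same idea can be tried out in R, which the rest of this page uses. This is only a sketch: R's default regex syntax writes groups with plain parentheses rather than the \( \) of the grep command above, and the word list is made up for illustration.

```r
# Hypothetical sample words
words <- c("filling", "file", "book")

# ([a-z]) captures one lowercase letter; \\1 requires the same letter again
has_double <- grepl("([a-z])\\1", words)

# In the replacement, \\1 refers to the captured letter, so the doubled
# letter can be marked up to show where the match happened
marked <- sub("([a-z])\\1", "<\\1\\1>", words)
```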
