Is it possible to have a set of options for a regex substring in R?

I have a data frame in which some cells contain error messages as strings. The strings come in the following forms:
ERROR-100_Data not found for ID "xxx"
ERROR-100_Data not found for id "xxx"
ERROR-101_Data not found for SUBID "yyy"
Data not found for ID "xxx"
Data not found for id "xxx"
I need to extract the number of the error (if it has one) and the GENERAL description, leaving out the specific ID or SUBID. I have a function where I use the following regular expression:
sub(".*?ERROR-(.*?)for ID.*","\\1",df[,col1],sep="-")
This works only for the first case. Is there a way to obtain the following results using only one expression?
100_Data not found
100_Data not found
101_Data not found
Data not found
Data not found

We can use:
tsxt <- 'ERROR-100_Data not found for ID "xxx"'
gsub("\\sfor.*|ERROR-","",tsxt, perl=TRUE)
[1] "100_Data not found"
Or, as suggested by @Jan, anchor ERROR to make it more general:
gsub("\\sfor.*|^ERROR-","",tsxt, perl=TRUE)
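As a quick check, the anchored pattern can be run over all five message shapes from the question. This is only a minimal sketch; the vector names msgs and cleaned are illustrative choices, not part of the original answer.

```r
# The five message shapes listed in the question.
msgs <- c(
  'ERROR-100_Data not found for ID "xxx"',
  'ERROR-100_Data not found for id "xxx"',
  'ERROR-101_Data not found for SUBID "yyy"',
  'Data not found for ID "xxx"',
  'Data not found for id "xxx"'
)

# Delete a leading "ERROR-" and everything from the first " for" onwards.
cleaned <- gsub("\\sfor.*|^ERROR-", "", msgs, perl = TRUE)
print(cleaned)
```

Because gsub is vectorised, the same call should work directly on a data frame column.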

You could use
^ERROR-|\sfor.+
which needs to be replaced by an empty string, see a demo on regex101.com.

Use this regex:
.*?(?:ERROR-)?(.*?)\s+for\s+(?:[A-Z]*)?ID
This makes the ERROR- part optional, then captures everything before for ...ID is encountered (matched case-insensitively, so the ignore-case option has to be on). The only capturing group contains the desired text, which can then be used directly without needing any substitution.
The first and the third groups in this regex are non-capturing groups, i.e., they match their content but do not store it for later use, which leaves only one capture group (the middle one). This is done because the OP isn't interested in the text they match. Making them capturing groups would have meant three results, and the post-processing would have had to hard-code the use of the second group only (the middle one), never touching the other two.
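One possible way to read the capture group from base R is regmatches with regexec. This is a sketch under assumptions: the answer does not say which R function to use, perl = TRUE is chosen here for the lazy quantifiers and \s, ignore.case = TRUE supplies the case-insensitivity, and the names msgs, pattern and captured are illustrative.

```r
msgs <- c(
  'ERROR-100_Data not found for ID "xxx"',
  'ERROR-100_Data not found for id "xxx"',
  'ERROR-101_Data not found for SUBID "yyy"',
  'Data not found for ID "xxx"',
  'Data not found for id "xxx"'
)

# The regex from the answer, with backslashes doubled for an R string.
pattern <- ".*?(?:ERROR-)?(.*?)\\s+for\\s+(?:[A-Z]*)?ID"

# regexec returns the full match plus each capture group for every element.
m <- regmatches(msgs, regexec(pattern, msgs, perl = TRUE, ignore.case = TRUE))

# Element 1 is the whole match, element 2 is the single capture group.
captured <- vapply(m, function(g) g[2], character(1))
print(captured)
```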

Related

Replace multiple spaces in string, but leave singles spaces be

I am reading a PDF file using R. I would like to transform the given text in such a way, that whenever multiple spaces are detected, I want to replace them by some value (for example "_"). I've come across questions where all spaces of 1 or more can be replaced using "\\s+" (Merge Multiple spaces to single space; remove trailing/leading spaces) but this will not work for me. I have a string that looks something like this;
"[1]This is the first address    This is the second one
[2]This is the third one    
[3]This is the fourth one    This is the fifth"
When I apply the answers I found; replacing all spaces of 1 or more with a single space, I will not be able to recognise separate addresses anymore, because it would look like this;
gsub("\\s+", " ", str_trim(PDF))
"[1]This is the first address This is the second one
[2]This is the third one
[3]This is the fourth one This is the fifth"
So what I am looking for is something like this
"[1]This is the first address_This is the second one
[2]This is the third one_
[3]This is the fourth one_This is the fifth"
However if I rewrite the code used in the example, I get the following
gsub("\\s+", "_", str_trim(PDF))
"[1]This_is_the_first_address_This_is_the_second_one
[2]This_is_the_third_one_
[3]This_is_the_fourth_one_This_is_the_fifth"
Would anyone know a workaround for this? Any help will be greatly appreciated.

Whenever I come across string and regex problems I like to refer to the stringr cheat sheet: https://raw.githubusercontent.com/rstudio/cheatsheets/master/strings.pdf
On the second page you can see a section titled "Quantifiers", which tells us how to solve this:
library(tidyverse)
s <- "This is the first address    This is the second one"
str_replace_all(s, "\\s{2,}", "_")
(I am loading the complete tidyverse instead of just stringr here due to force of habit).
Every run of 2 or more whitespace characters will now be replaced with _ (str_replace on its own would only replace the first run).
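For the multi-line text in the question, note that \s also matches line breaks, so trailing spaces followed by a newline would be merged into a single "_". A base R sketch that only targets literal spaces is shown below; the exact spacing in pdf_text is an assumption made for illustration, and the variable names are hypothetical.

```r
# Sample text with runs of spaces between addresses (spacing is assumed).
pdf_text <- paste(
  "[1]This is the first address    This is the second one",
  "[2]This is the third one   ",
  "[3]This is the fourth one  This is the fifth",
  sep = "\n"
)

# " {2,}" matches two or more literal spaces only, so line breaks survive
# and single spaces between words are left alone.
marked <- gsub(" {2,}", "_", pdf_text)
cat(marked, sep = "\n")
```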

Removing part of strings within a column

I have a column within a data frame holding a series of identifiers, each a letter followed by 8 digits, e.g. B15006788.
Is there a way to remove all instances of B15.... to make them empty cells (there’s thousands of variations of numbers within each category) but keep B16.... etc?
I know if there was just one thing I wanted to remove, like the B15, I could do;
sub("B15", "", df$col)
But I’m not sure how to remove a set number of characters/digits (or even all subsequent characters after B15).
Thanks in advance :)

Welcome to SO! This is a case for regex. You can use base R as I show here, or look into the stringr package for handy tools that are easier to understand. You can also look up regex rules to help define what you want to match. For what you ask, the following code example should help:
testStrings <- c("KEEPB15", "KEEPB15A", "KEEPB15ABCDE")
gsub("B15.{2}", "", testStrings)
gsub is the base R function to replace a pattern with something else in one or a series of inputs. To test our regex I created the testStrings vector for different examples.
Breaking down the regex code, "B15" is the pattern you're specifically looking for. The "." means any character and the "{2}" says how many of those characters to grab after "B15" (here exactly two). You can change it as you need. If you want to remove everything after "B15", replace the pattern with "B15.*"; the "*" means zero or more of the preceding item, i.e. everything till the end.
edit: If you want to specify that "B15" must be at the start of the string, you can add "^" to the start of the pattern like so: "^B15.{2}"
https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf has info on other regex constructs you can use to be more specific.
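Applied to the original question (blank out every identifier that starts with B15 but keep B16 and the rest), the anchored pattern could look like the sketch below. The ids vector is made-up sample data, not taken from the question.

```r
# Hypothetical identifiers: one letter followed by 8 digits.
ids <- c("B15006788", "B16001234", "B15999999", "A15000001")

# "^B15" anchors the match at the start of the string and ".*" consumes
# the rest, so matching cells become empty strings; others are untouched.
blanked <- sub("^B15.*", "", ids)
print(blanked)
```

On a data frame this would be something like df$col <- sub("^B15.*", "", df$col), assuming df$col is a character column.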

Match everything up until first instance of a colon

Trying to code up a Regex in R to match everything before the first occurrence of a colon.
Let's say I have:
time = "12:05:41"
I'm trying to extract just the 12. My strategy was to do something like this:
grep(".+?(?=:)", time, value = TRUE)
But I'm getting the error that it's an invalid Regex. Thoughts?

Your regex seems fine in my opinion. You are getting the error because you are missing perl=TRUE, which the lookahead needs. Even so, I don't think you should use grep here.
I would recommend using:
stringr::str_extract( time, "\\d+?(?=:)")
grep works differently from how it is being used here: it is good for checking the separate values of a vector and keeping those that match a pattern, but you can't pluck out a value from within a string using grep.
If you want to use Base R you can also go for sub:
sub("^(\\d+?)(?=:)(.*)$","\\1",time, perl=TRUE)
Also, you may split the string using strsplit and filter out the first string like below:
strsplit(time, ":")[[1]][1]
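To make the difference concrete, here is a small base R sketch. It contrasts grep, which returns whole matching elements, with regmatches plus regexpr, which returns only the matched text. The simpler pattern "^[^:]+" is a swapped-in alternative to the lookahead, not the asker's regex.

```r
time <- "12:05:41"

# With perl = TRUE the lookahead is valid, but grep still returns the
# whole element that matched, not the matched part.
whole <- grep(".+?(?=:)", time, value = TRUE, perl = TRUE)

# regexpr locates the first match and regmatches extracts it.
# "^[^:]+" means: from the start, one or more characters that are not ":".
first_field <- regmatches(time, regexpr("^[^:]+", time))

print(whole)
print(first_field)
```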

How to write a regex OR statement within strapply in R

I have been using strapplyc in R to select different portions of a string that match one particular set of criteria. These have worked successfully until I found a portion of the string where the required portion could be defined one of two ways.
Here is an example of the string which is liberally sprinkled with \t:
\t\t\tsome words here\t\t\tDefect: some more words here Action: more words
I can write the strapply statement to capture the text between Defect: and the start of Action:
strapplyc(record[i], "Defect:(.*?)Action")
This works and selects the chosen text between Defect: and Action. In some cases there is no action section to the string and I've used the following code to capture these cases.
strapplyc(record[i], "Defect:(.*?)$")
What I have been trying to do is capture the text that either ends with Action, or with the end of the string (using $).
This is the bit that keeps failing. It returns nothing for either option. Here is my failing code:
strapplyc(record[i], "Defect:(.*?)Action|$")
Any idea where I'm going wrong, or a better solution would be much appreciated.

If you are up for a more efficient solution, you could drop the .*? matching and unroll your pattern like:
Defect:((?:[^A]+|A(?!ction))*)
This matches Defect: followed by any number of characters that are either not an A, or an A that is not followed by ction. This avoids the step-by-step expansion that lazy dot matching requires. It works in both cases, as it stops matching when it hits Action or the end of your string.
As suggested by Wiktor, you can also use
Defect:([^A]*(?:A(?!ction)[^A]*)*)
Which is a little bit faster when there are many As in the string.
You might want to consider using A(?!ction:) or A(?!ction\s*:) instead, to avoid false early matches.
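A sketch of the unrolled pattern in base R follows. The question uses strapplyc, so regmatches with regexec and perl = TRUE is a swapped-in way to read the capture group, and the helper extract_defect and the two sample records are assumptions for illustration.

```r
# Two hypothetical records: one with an Action section, one without.
records <- c(
  "\t\t\tsome words here\t\t\tDefect: some more words here Action: more words",
  "\t\tDefect: Annex A is missing"
)

unrolled <- "Defect:([^A]*(?:A(?!ction)[^A]*)*)"

# Return capture group 1 for each record, trimmed of surrounding whitespace.
extract_defect <- function(x, pattern) {
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))
  trimws(vapply(m, function(g) g[2], character(1)))
}

defects <- extract_defect(records, unrolled)
print(defects)
```

The second record shows that a capital A on its own does not end the match; only an A followed by ction does.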

The alternation operator | is the regex operator with the lowest precedence. That means the regex Defect:(.*?)Action|$ is actually parsed as two alternatives, Defect:(.*?)Action and $ - since an empty string is a valid match for $, your regex returns the empty string.
To solve that, you should combine the regexes Defect:(.*?)Action and Defect:(.*?)$ with an OR:
Defect:(.*?)Action|Defect:(.*?)$
Or you can enclose Action|$ in a group as Sebastian Proske said in the comments:
Defect:(.*?)(?:Action|$)
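The grouped version can be tried out in base R as sketched below. As an assumption for illustration, regmatches with regexec stands in for strapplyc, and the sample records and the name defects are made up.

```r
# Hypothetical records: with and without an Action section.
records <- c(
  "\t\t\tsome words here\t\t\tDefect: some more words here Action: more words",
  "\t\tDefect: no action section here"
)

# The non-capturing group limits the alternation to "Action or end of string".
grouped <- "Defect:(.*?)(?:Action|$)"

m <- regmatches(records, regexec(grouped, records, perl = TRUE))

# Element 2 of each result is the text captured after "Defect:".
defects <- trimws(vapply(m, function(g) g[2], character(1)))
print(defects)
```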

Matching emails format using R

I was taking an intro class at datacamp.com and ran into a problem.
Goal: find the right emails using grep. "Right emails" are defined as having an "@" and ending with ".edu".
Emails vector:
emails <- c("john.doe@ivyleague.edu", "education@world.gov", "dalai.lama@peace.org",
"invalid.edu", "quant@bigdatacollege.edu", "cookie.monster@sesame.tv")
I was thinking of
grep("@*\\.edu$",emails)
and it gave me
[1] 1 4 5
because I thought "*" matches "multiple characters". Later I found that it doesn't work like that.
It turned out the right code is
grep("@.*\\.edu$",emails)
I googled some documentation and only have a vague sense of how to get the correct answer. Can someone explain how exactly R matches the right emails? Thanks a bunch!!

You've already been advised that the asterisk quantifier wasn't giving you the specificity you needed, so use the "+" quantifier, which forces at least one such match. I decided to make the problem more complex by adding some entries with duplicated at-signs:
emails <- c("john.doe@@ivyleague.edu", "education@@world.gov", "dalai.lama@peace.org",
"invalid.edu", "quant@bigdatacollege.edu", "cookie.monster@sesame.tv")
grep( "^[^@]+@[^@]+\\.edu$", emails)
#[1] 5
That uses the regex character-class structure, where items inside flanking square brackets are taken as literals except when there is an initial caret ("^"), in which case it is the negation of the character class, i.e. in this case any character except "@". This will also exclude situations where the at-sign is the first character. Thanks to KonradRudolph, who pointed out that adding "^" as the first character in the pattern (which signifies the point just before the first character of the string) prevents items with an initial "@@" from being matched.
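A small sketch comparing the three patterns on the vector from the question, with the at-sign written as a literal @ in both the addresses and the patterns. The result variable names are illustrative only.

```r
emails <- c("john.doe@ivyleague.edu", "education@world.gov", "dalai.lama@peace.org",
            "invalid.edu", "quant@bigdatacollege.edu", "cookie.monster@sesame.tv")

# "@*" means zero or more at-signs, so anything ending in ".edu" qualifies,
# including "invalid.edu".
zero_or_more <- grep("@*\\.edu$", emails)

# "@.*" requires one at-sign, then any characters, then ".edu" at the end.
at_then_any <- grep("@.*\\.edu$", emails)

# Stricter: exactly one at-sign with at least one other character on each side.
strict <- grep("^[^@]+@[^@]+\\.edu$", emails)

print(zero_or_more)
print(at_then_any)
print(strict)
```

The quantifier always applies to the item immediately before it, which is why "@*" and "@.*" behave so differently.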