Subsetting a value based on partial pattern - r

I'm trying to use a regular expression to subset out the URL happy_to-learn.com.
As I'm really new to regex, could someone help with my code as to why it does not work?
x <- c("happy_to-learn.com", "His_is-omitted.net")
str_subset(x, "^[a-zA-Z](\\_|\\-)*\\.com$")
I understand that the portion ^[a-zA-Z](\\_|\\-)* means "start at a letter from a to z or A to Z, then match either _ or -, zero or more times".
However, is it possible to continue from this code by adding the back part of the value I wish to subset? I.e. \\.com$ refers to all values that end with .com.
Is there something like "^[a-zA-Z](\\_|\\-)*...\\.com$" in regex?

We need to allow one or more letters with +, because the _ or - does not come directly after the first letter.
str_subset(x, "^[a-zA-Z]+(\\_|\\-).*\\.com$")
#[1] "happy_to-learn.com"
Also, .* matches zero or more characters of any kind (. matches any character), up to the literal . and 'com' at the end ($) of the string.
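A quick way to see the difference is to test both patterns with base R's grepl. This is only a minimal sketch: the names original and fixed are illustrative, and the _ and - are written unescaped because they need no escaping outside a character class.

```r
x <- c("happy_to-learn.com", "His_is-omitted.net")

# Original attempt: exactly one letter, then only "_" or "-" characters, then ".com"
original <- "^[a-zA-Z](_|-)*\\.com$"
# Revised pattern: one or more letters, a "_" or "-", then anything, then ".com"
fixed <- "^[a-zA-Z]+(_|-).*\\.com$"

grepl(original, x)    # neither element matches
grepl(fixed, x)       # only the first element matches
x[grepl(fixed, x)]    # base R equivalent of str_subset(x, fixed)
```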

Why use an external package? grep can do it too.
grep("^[[:alpha:]_-]+.*\\.com$", x, value = TRUE)
#[1] "happy_to-learn.com"
Explanation.
"^" marks the beginning of the string.
"[:alpha:]" matches any alphabetic character, upper or lower case, in a portable way.
"^[[:alpha:]_-]+" between [] are the alternative characters to match, repeated one or more times: alphabetic characters, the underscore _, or the minus sign -.
"^[[:alpha:]_-]+.*" The above followed by any character repeated zero or more times.
"^[[:alpha:]_-]+.*\\.com$" ending with the string ".com", where the dot is meant literally; since . is a metacharacter, it must be escaped.

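To illustrate why the dot before com has to be escaped in the grep answer above, here is a small sketch. The vector y is made up for the illustration and is not from the question.

```r
# Hypothetical inputs: only the first one really ends in ".com"
y <- c("happy_to-learn.com", "happy_to-learn_com", "learncom")

unescaped <- grepl(".com$", y)     # "." matches any character before "com"
escaped   <- grepl("\\.com$", y)   # "\\." matches only a literal dot
bracketed <- grepl("[.]com$", y)   # a character class is another way to get a literal dot

unescaped   # all three elements match
escaped     # only the first element matches
bracketed   # same result as the escaped version
```
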
Related

Is there a way to do a negative match using regex sub?

Say I have a vector of strings,
g<-c("bunchofstuff>query=true/fun/weird>bunchofstuff", "bunchofstuff>query=animals/octopus/weird>bunchofstuff", "bunchofstuff>query=flowers/sunshine/fun>bunchofstuff", "bunchofstuff>query=fun/true/sunshine>bunchofstuff")
and I want to essentially use sub to erase anything after query=, until the end of the string, IF query= is not followed by true (ideally in any position). As far as I can tell, there isn't a useful substitution for ! in sub (seems to be some workarounds in grepl).
What I want is
newvariable<-c("bunchofstuff>query=true/fun/weird>bunchofstuff", "bunchofstuff>query=", "bunchofstuff>query=", "bunchofstuff>query=fun/true/sunshine>bunchofstuff")
You can do that:
sub('query=\\K(?:(?!true).)+$', '', g, perl=TRUE)
This technique uses a negative lookahead assertion (?!true) that checks, before each character matched by ., that "true" does not follow. The whole thing sits in a non-capturing group that is repeated until the end of the string $.
\\K resets the start of the reported match, so the query= substring before it is preserved. (Note that it's only a convenient way to avoid a capture group or rewriting query= in the replacement string.)
You can be more specific using word-boundaries to be sure that "true" isn't a part of another word:
sub('query=\\K(?:(?!\\btrue\\b).)+$', '', g, perl=TRUE)
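As a check, both patterns can be run on the vector from the question. This is a sketch: the extra string h, where "true" only appears inside "untrue", is invented here to show what the word boundaries change.

```r
g <- c("bunchofstuff>query=true/fun/weird>bunchofstuff",
       "bunchofstuff>query=animals/octopus/weird>bunchofstuff",
       "bunchofstuff>query=flowers/sunshine/fun>bunchofstuff",
       "bunchofstuff>query=fun/true/sunshine>bunchofstuff")

# Elements containing "true" after "query=" are left alone; the others are trimmed
res <- sub("query=\\K(?:(?!true).)+$", "", g, perl = TRUE)
res

# Hypothetical input: "true" occurs only as part of "untrue"
h <- "bunchofstuff>query=untrue/fun>bunchofstuff"
plain   <- sub("query=\\K(?:(?!true).)+$", "", h, perl = TRUE)        # unchanged
bounded <- sub("query=\\K(?:(?!\\btrue\\b).)+$", "", h, perl = TRUE)  # trimmed
plain
bounded
```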

Pattern match with R

I am trying to match a pattern using the grep() function as below -
grep("XYZ31__Sheqwqet1__CSV.csv", "^(XYZ)+[0-9]{2}[a-zA-Z_]+(csv)+$")
Unfortunately, the above expression results in no match. Any pointers in the right direction would be very helpful.
Thanks for your time
Before the csv there is also a . and a digit, neither of which [a-zA-Z_] matches. In addition, the order of arguments is pattern first, followed by the input x (if we pass the arguments by name, the order doesn't matter).
grep( "^(XYZ)+[0-9]{2}[[:alnum:]_.]+(csv)$", "XYZ31__Sheqwqet1__CSV.csv")
#[1] 1
The pattern breaks down as
^ - start of the string
(XYZ)+ - one or more occurrences of the sequence XYZ
[0-9]{2} - two digits
[[:alnum:]_.]+ - one or more alphanumeric characters, plus the two additional characters _ and .
(csv)$ - csv at the end of the string
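The point about argument order can be checked directly. In this sketch the names f and p are just placeholders for the file name and the corrected pattern.

```r
f <- "XYZ31__Sheqwqet1__CSV.csv"
p <- "^(XYZ)+[0-9]{2}[[:alnum:]_.]+(csv)$"

by_position <- grep(p, f)                 # pattern first, then x
by_name     <- grep(x = f, pattern = p)   # named arguments work in either order
as_logical  <- grepl(p, f)                # TRUE/FALSE instead of an index

# The pattern from the question, with the arguments in the right order,
# still fails because [a-zA-Z_] cannot match the "1" or the "."
old_pattern <- grepl("^(XYZ)+[0-9]{2}[a-zA-Z_]+(csv)+$", f)

by_position
by_name
as_logical
old_pattern
```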

Remove characters before first and after second underscore extracting string between first and second underscore

I am using
substr(gsub(".*_","",ldf[[j]]),1,nchar(gsub(".*_","",ldf[[j]]))-4)
to create a path and filename to write to. It works fine for names in ldf that have only one underscore. For a filename with another underscore further back, it cuts off everything in front of the second underscore.
I have for example:
Arof_07122016_2.csv and I want 07122016, but I get 2. I don't get why this is happening. How can I use this line to cut off only the characters in front of the first underscore and keep the part before the second one?
It seems you want
sub("^[^_]*_([^_]*).*", "\\1", ldf[[j]])
The pattern matches
^ - start of string
[^_]* - 0+ chars other than _
_ - an underscore
([^_]*) - Capturing group #1: any 0+ chars other than _
.* - the rest of the string.
The \1 in the replacement pattern only keeps the captured value in the result.
R demo:
v <- c("Arof_07122016_2.csv", "Another_99999_ccccc_2.csv")
sub("^[^_]*_([^_]*).*", "\\1", v)
# => [1] "07122016" "99999"
Regular expression repetition is greedy by default, as explained in ?regex:
By default repetition is greedy, so the maximal possible number of
repeats is used. This can be changed to ‘minimal’ by appending ? to
the quantifier. (There are further quantifiers that allow approximate
matching: see the TRE documentation.)
So you should use the pattern ".*?_". However, gsub will make multiple matches, so you end up with the same result. To remedy this, use sub, which makes only one match, or anchor the match at the start of the string with ^ in the regex.
sub(".*?_","","Arof_07122016_2.csv")
#[1] "07122016_2.csv"
gsub("^.*?_","","Arof_07122016_2.csv")
#[1] "07122016_2.csv"
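To see the greedy and lazy behaviour side by side, here is a small sketch using only base R. The strsplit line at the end is an alternative that is not part of the answers above; it assumes the wanted value is always the second underscore-separated field.

```r
s <- "Arof_07122016_2.csv"

greedy   <- sub(".*_", "", s)     # greedy: removes everything up to the LAST underscore
lazy     <- sub(".*?_", "", s)    # lazy: removes everything up to the FIRST underscore
lazy_all <- gsub(".*?_", "", s)   # lazy, but gsub repeats the match, so both prefixes go

greedy
lazy
lazy_all

# Alternative (assumption: the value is the second "_"-separated field)
middle <- strsplit(s, "_", fixed = TRUE)[[1]][2]
middle
```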

Regex to maintain matched parts

I would like to achieve this result : "raster(B04) + raster(B02) - raster(A10mB03)"
Therefore, I created this regex: B[0-1][0-9]|A[1,2,6]0m/B[0-1][0-9]"
I am now trying to replace all matches of the string "B04 + B02 - A10mB03" with gsub("B[0-1][0-9]]|[A[1,2,6]0mB[0-1][0-9]", "raster()", string)
How could I include the original values B01, B02, A10mB03?
PS: I also tried gsub("B[0-1][0-9]]|[A[1,2,6]0mB[0-1][0-9]", "raster(\\1)", string) but it did not work.
Basically, you need to match some text and re-use it inside a replacement pattern. In base R regex methods, there is no way to do that without a capturing group, i.e. a pair of unescaped parentheses, enclosing the whole regex pattern in this case, and using a \\1 backreference in the replacement pattern.
However, your regex contains some issues: [A[1,2,6] gets parsed as a single character class that matches A, [, 1, ,, 2 or 6 because you placed a [ before A. Also, note that , inside a character class matches a literal comma, which is not what you expected. Another, similar issue is with [0-9]]: it matches any ASCII digit with [0-9] and then a literal ] (outside a character class, the ] char does not have to be escaped).
So, a potential fix for your expression can look like
gsub("(B[0-1][0-9]|A[126]0mB[0-1][0-9])", "raster(\\1)", string)
Or even just matching 1 or more word chars (considering the sample string you supplied)
gsub("(\\w+)", "raster(\\1)", string)
might do.
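For illustration, here is the difference between a replacement with and without the backreference, run on the sample string from the question. This is a sketch, and the variable names are arbitrary.

```r
string  <- "B04 + B02 - A10mB03"
pattern <- "(B[0-1][0-9]|A[126]0mB[0-1][0-9])"

lost <- gsub(pattern, "raster()", string)      # no backreference: the matched text is dropped
kept <- gsub(pattern, "raster(\\1)", string)   # \\1 re-inserts what group 1 captured
word <- gsub("(\\w+)", "raster(\\1)", string)  # looser variant: wrap every run of word chars

lost
kept
word
```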

How do I write a regular expression that will match if the 6th character of a string is one of two different letters?

I'm trying to write a validator for an ASP.NET TextBox.
How can I validate so the regular expression will only match if the 6th character is a "C" or a "P"?
^.{5}[CP] will match strings starting with any five characters and then a C or P.
Depending on exactly what you want, you are looking for something like:
^.{5}[CP]
The ^ says to start from the beginning of the string, the . matches any character, the {5} says that the . must match 5 times, then the [CP] says the next character must be part of the character class CP, i.e. either a C or a P.
^.{5}[CP] -- the trick is the {}, they match a certain number of characters.
^.{5}[CP] has a few important pieces:
^ = from the beginning
. = match anything
{5} = make the previous match the number of times in braces
[CP] = match any one of the specific items in brackets
so the regex spoken would be something like "from the beginning of the string, match anything five times, then match a 'C' or 'P'"
[a-zA-Z0-9]{5}[CP] will match any five letters or digits and then a C or P. Add a leading ^ to anchor it to the start of the string.
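The question is about an ASP.NET validator, but the pattern itself can be tried out in R, the language used elsewhere on this page. In this sketch the test values in ids are made up, and the substr line is an alternative that avoids regex altogether.

```r
# Hypothetical inputs; the last one has a "C" in 8th position, not 6th
ids <- c("12345C789", "ABCDEPxyz", "12345X789", "C2345", "0012345C")

anchored   <- grepl("^.{5}[CP]", ids)             # 6th character must be C or P
unanchored <- grepl(".{5}[CP]", ids)              # without ^, a later C or P also matches
by_substr  <- substr(ids, 6, 6) %in% c("C", "P")  # same check without a regex

anchored
unanchored
by_substr
```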

Resources