Regex match alphabetic/numeric that is not x - r

Assume I want to extract all strings starting in either ftp or ftpk (example made up).
I currently have a solution:
Get all the strings starting with ftp but not those starting in
ftpx or ftpc.
I wonder how I can make it more general (because right now I'm listing the exceptions which can become tedious), something like:
Get all the strings starting with ftp but not those starting in
ftpX where X is any alphabetic/numeric that is not k.
# Data:
vec <- c("ftp:ladpmxqgvt", "ftpx:xfiwyoloqu", "ftpk:yol.qdsrehn",
"ftpc:krjqdzsuhb", "ftpk:yolo.taxukj", "ftp:qvxarpkjid",
"ebutlngqkr", "yolx.vhznja")
# Current solution (desired output)
grep("^ftp[^xc]", vec, value = TRUE)
"ftp:ladpmxqgvt" "ftpk:yol.qdsrehn" "ftpk:yolo.taxukj" "ftp:qvxarpkjid"

Code
See regex in use here
^ftpk?:
If you don't know if : will follow ftp you can use the following, which simply ensures ftp or ftpk is followed by a non-word character:
^ftpk?\b
Results
Input
ftp:ladpmxqgvt
ftpx:xfiwyoloqu
ftpk:yol.qdsrehn
ftpc:krjqdzsuhb
ftpk:yolo.taxukj
ftp:qvxarpkjid
ebutlngqkr
yolx.vhznja
Output
Below lists only matches
ftp:ladpmxqgvt
ftpk:yol.qdsrehn
ftpk:yolo.taxukj
ftp:qvxarpkjid
Explanation
^ Assert position at the start of the line
ftp Match this literally
k? Match k literally zero or once
: Match this literally

I think this solution most closely mimics the sentence:
Get all the strings starting with ftp but not those starting in ftpX where X is any alphabetic/numeric that is not k.
grep("ftp(?!k)[[:alnum:]](*SKIP)(*FAIL)|ftp", vec, value = TRUE, perl = TRUE)
or
grep("ftp(?!(?!k)[[:alnum:]])", vec, value = TRUE, perl = TRUE)
Result:
[1] "ftp:ladpmxqgvt" "ftpk:yol.qdsrehn" "ftpk:yolo.taxukj" "ftp:qvxarpkjid"
Note:
The first solution uses the (*SKIP)(*FAIL) trick to avoid matching particular patterns. In this case, I am using it to avoid matching ftp followed by an alphanumeric character except k, and matching any ftp that was not avoided.
The second solution is similar, but uses negative lookahead. (?!k)[[:alnum:]] matches all alphanumerics except k, while ftp(?!(?!k)[[:alnum:]]) matches ftp not immediately followed by any alphanumerics except k.
The advantage of these two solutions is that one can add to the things to avoid. Just add them to (?!k)[[:alnum:]] or (?!(?!k)[[:alnum:]]).

Related

Pattern match with R

I am trying to match a pattern using rgep() function as below -
grep("XYZ31__Sheqwqet1__CSV.csv", "^(XYZ)+[0-9]{2}[a-zA-Z_]+(csv)+$")
However unfortunately above expression results in no match. Any pointer towards the right direction will be very helpful.
Thanks for your time
Before the csv there is also a . and some digits. In addition, the order of arguments is pattern, followed by the input x. (if we pass arguments via name, the order wouldn't matter though)
grep( "^(XYZ)+[0-9]{2}[[:alnum:]_.]+(csv)$", "XYZ31__Sheqwqet1__CSV.csv")
#[1] 1
Pattern match is
^- start of the string
(XYZ)+ - one or more occurence of those letters
[0-9]{2} - two digits
[[:alnum:]_.]+ - one or more alpha numeric characters including the additional two
(csv)$- csv at the end of the string

extracting main URL address

I have a list of URLs and I want to extract the main URL to see how many times each URL has been used. as you can imagine, there are so many URLs with different notations. I tried and wrote the following code to extract the main URL:
library(stringr)
library(rebus)
# Step 2: creating a pattern for URL extraction
pat<- "//" %R% capture(one_or_more(char_class(WRD,DOT)))
#step 3: Creating a new variable from URL column of df
#(it should be atomic vector)
URL_var<-df[["URLs"]]
#step 4: using rebus to extract main URL
URL_extract<-str_match(URL_var,pattern = pat)
#step 5: changing large vector to dataframe and changing column name:
URL_data<-data.frame(URL_extract[,2])
names(URL_data)[names(URL_data) == "URL_extract...2."] <- "Main_URL"
The result of this code is acceptable for most cases. For example for //www.google.com, it returns www.google.com and for a website like http://image.google.com/steve it returns image.google.com; however, there are so many cases that this code can't recognize the pattern and will fail to find the URL. For example for URL such as http://my-listing.ca/CommercialDrive.html the code will return my which is definitely not acceptable. for another example, for a website like http://www.real-data.ca/clients/ur/ it only returns www.real. It seems that handling - for my code is difficult
Do you have any suggestions on how to improve this code? or do we have any packages to help me extract URLs faster and better?
Thanks
I think you can simply use
library(stringr)
URL_var<-df[["URLs"]]
URL_data<-data.frame(str_extract(URL_var, "(?<=//)[^\\s/:]+"))
names(URL_data)[names(URL_data) == "URL_extract...2."] <- "Main_URL"
Here, stringr::str_extract method searches for the first match in the input, and fetches the substring found. Unlike stringr::str_match, it cannot return submatches, so a lookbehind is used in the regex pattern, (?<=...):
(?<=//)[^\s/:]+
It means:
(?<=//) - match a location in the string that is immediately preceded with // string
[^\\s/:]+ - one or more (+) occurrences of any char but whitespace, / and :. The colon is to make sure port number is not included in the match. / makes sure the match stops before the first / and \s (whitespace) makes sure the match stops before the first whitespace.

Extract mm/dd/yyyy and m/dd/yyyy dates from string in R [duplicate]

My regex pattern looks something like
<xxxx location="file path/level1/level2" xxxx some="xxx">
I am only interested in the part in quotes assigned to location. Shouldn't it be as easy as below without the greedy switch?
/.*location="(.*)".*/
Does not seem to work.
You need to make your regular expression lazy/non-greedy, because by default, "(.*)" will match all of "file path/level1/level2" xxx some="xxx".
Instead you can make your dot-star non-greedy, which will make it match as few characters as possible:
/location="(.*?)"/
Adding a ? on a quantifier (?, * or +) makes it non-greedy.
Note: this is only available in regex engines which implement the Perl 5 extensions (Java, Ruby, Python, etc) but not in "traditional" regex engines (including Awk, sed, grep without -P, etc.).
location="(.*)" will match from the " after location= until the " after some="xxx unless you make it non-greedy.
So you either need .*? (i.e. make it non-greedy by adding ?) or better replace .* with [^"]*.
[^"] Matches any character except for a " <quotation-mark>
More generic: [^abc] - Matches any character except for an a, b or c
How about
.*location="([^"]*)".*
This avoids the unlimited search with .* and will match exactly to the first quote.
Use non-greedy matching, if your engine supports it. Add the ? inside the capture.
/location="(.*?)"/
Use of Lazy quantifiers ? with no global flag is the answer.
Eg,
If you had global flag /g then, it would have matched all the lowest length matches as below.
Here's another way.
Here's the one you want. This is lazy [\s\S]*?
The first item:
[\s\S]*?(?:location="[^"]*")[\s\S]* Replace with: $1
Explaination: https://regex101.com/r/ZcqcUm/2
For completeness, this gets the last one. This is greedy [\s\S]*
The last item:[\s\S]*(?:location="([^"]*)")[\s\S]*
Replace with: $1
Explaination: https://regex101.com/r/LXSPDp/3
There's only 1 difference between these two regular expressions and that is the ?
The other answers here fail to spell out a full solution for regex versions which don't support non-greedy matching. The greedy quantifiers (.*?, .+? etc) are a Perl 5 extension which isn't supported in traditional regular expressions.
If your stopping condition is a single character, the solution is easy; instead of
a(.*?)b
you can match
a[^ab]*b
i.e specify a character class which excludes the starting and ending delimiiters.
In the more general case, you can painstakingly construct an expression like
start(|[^e]|e(|[^n]|n(|[^d])))end
to capture a match between start and the first occurrence of end. Notice how the subexpression with nested parentheses spells out a number of alternatives which between them allow e only if it isn't followed by nd and so forth, and also take care to cover the empty string as one alternative which doesn't match whatever is disallowed at that particular point.
Of course, the correct approach in most cases is to use a proper parser for the format you are trying to parse, but sometimes, maybe one isn't available, or maybe the specialized tool you are using is insisting on a regular expression and nothing else.
Because you are using quantified subpattern and as descried in Perl Doc,
By default, a quantified subpattern is "greedy", that is, it will
match as many times as possible (given a particular starting location)
while still allowing the rest of the pattern to match. If you want it
to match the minimum number of times possible, follow the quantifier
with a "?" . Note that the meanings don't change, just the
"greediness":
*? //Match 0 or more times, not greedily (minimum matches)
+? //Match 1 or more times, not greedily
Thus, to allow your quantified pattern to make minimum match, follow it by ? :
/location="(.*?)"/
import regex
text = 'ask her to call Mary back when she comes back'
p = r'(?i)(?s)call(.*?)back'
for match in regex.finditer(p, str(text)):
print (match.group(1))
Output:
Mary

I need help figuring out why my regex does not match with what I am looking for

I am working on a R script aiming to check if a data.frame is correctly made and contains the right information at the right place.
I need to make sure a row contains the right information, so I want to use a regular expression to compare with each case of said row.
I thought maybe it did not work because I compared the regex to the value by calling the value directly from the table, but it did not work.
I used regex101.com to make sure my regular expression was correct, and it matched when the test string was put between quotes.
Then I added as.character() to the value, but it came out FALSE.
To sum up, the regex works on regex101.com, but never did on my R script
test = c("b40", "b40")
".[ab][0-8]{2}." == test[1]
FALSE
I expect the output to be TRUE, but it is always FALSE
The == is for fixed full string match and not used for substring match. For that, we can use grep
grepl("^[ab][0-8]{2}", test[1])
#[1] TRUE
Here, we match either 'a' or 'b' at the start (^) of the string followed by two digits ranging from 0 to 8 (if it should be at the end - use $)

How to remove characters before matching pattern and after matching pattern in R in one line?

I have this vector Target <- c( "tes_1123_SS1G_340T01", "tes_23_SS2G_340T021". I want to remove anything before SS and anything after T0 (including T0).
Result I want in one line of code:
SS1G_340 SS2G_340
Code I have tried:
gsub("^.*?SS|\\T0", "", Target)
We can use str_extract
library(stringr)
str_extract(Target, "SS[^T]*")
#[1] "SS1G_340" "SS2G_340"
Try this:
gsub(".*(SS.*)T0.*","\\1",Target)
[1] "SS1G_340" "SS2G_340"
Why it works:
With regex, we can choose to keep a pattern and remove everything outside of that pattern with a two-step process. Step 1 is to put the pattern we'd like to keep in parentheses. Step 2 is to reference the number of the parentheses-bound pattern we'd like to keep, as sometimes we might have multiple parentheses-bound elements. See the example below for example:
gsub(".*(SS.*)+(T0.*)","\\1",Target)
[1] "SS1G_340" "SS2G_340"
Note that I've put the T0.* in parentheses this time, but we still get the correct answer because I've told gsub to return the first of the two parentheses-bound patterns. But now see what happens if I use \\2 instead:
gsub(".*(SS.*)+(T0.*)","\\2",Target)
[1] "T01" "T021"
The .* are wild cards by the way. If you'd like to learn more about using regex in R, here's a reference that can get you started.

Resources