Ignore One Part of URL and Match Rest - google-analytics

I'm new to RegEx and am sure this is an easy one. I looked at similar questions, but being new to RegEx, how it all fits together is still fuzzy.
I want my RegEx to:
ignore a single parameter in my URL and match anything that pops up in the first two parameters (/purple/cat/)
match the specific word (/prices) in the last part of the URL
BUT ignore the date in the middle (and any other date) rather than match on it
URL string:
/purple/cat/2017/prices
RegEx:
\/.*\/.*(?<!(20[0-17])\/prices$

How about this - it matches anything with /purple/cat/ + any 4-digit number + /prices:
\/purple\/cat\/[0-9][0-9][0-9][0-9]\/prices
P.S. https://regexr.com/ is useful for playing with regexes.
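If the first two segments should be allowed to vary, as the question asks, one option is to match each segment with [^/]+ instead of hard-coding /purple/cat/. Here is a minimal sketch in Python, assuming the date is always a four-digit year and the path ends in /prices. The names PRICES_PATH and is_prices_page, and the second and third test paths, are made up for illustration.

```python
import re

# Assumption: two arbitrary path segments, then a four-digit year, then /prices.
PRICES_PATH = re.compile(r"^/[^/]+/[^/]+/\d{4}/prices$")

def is_prices_page(path: str) -> bool:
    """Return True when the path has the shape /<any>/<any>/<year>/prices."""
    return PRICES_PATH.match(path) is not None

print(is_prices_page("/purple/cat/2017/prices"))   # True
print(is_prices_page("/green/dog/2015/prices"))    # True
print(is_prices_page("/purple/cat/2017/reviews"))  # False
```

Because the year is matched by \d{4} rather than a specific value, any other four-digit date passes as well.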

Related

extracting main URL address

I have a list of URLs and I want to extract the main URL from each one to see how many times each has been used. As you can imagine, the URLs come in many different notations. I wrote the following code to extract the main URL:
library(stringr)
library(rebus)
# Step 2: creating a pattern for URL extraction
pat<- "//" %R% capture(one_or_more(char_class(WRD,DOT)))
#step 3: Creating a new variable from URL column of df
#(it should be atomic vector)
URL_var<-df[["URLs"]]
#step 4: using rebus to extract main URL
URL_extract<-str_match(URL_var,pattern = pat)
#step 5: changing large vector to dataframe and changing column name:
URL_data<-data.frame(URL_extract[,2])
names(URL_data)[names(URL_data) == "URL_extract...2."] <- "Main_URL"
The result of this code is acceptable in most cases. For example, for //www.google.com it returns www.google.com, and for a website like http://image.google.com/steve it returns image.google.com. However, there are many cases where the pattern fails to find the URL. For example, for http://my-listing.ca/CommercialDrive.html the code returns my, which is definitely not acceptable, and for http://www.real-data.ca/clients/ur/ it returns only www.real. It seems that my code has trouble handling -.
Do you have any suggestions on how to improve this code? Or is there a package that can help me extract URLs faster and more reliably?
Thanks
I think you can simply use
library(stringr)
URL_var<-df[["URLs"]]
# Name the column directly; the old "URL_extract...2." column no longer exists here
URL_data <- data.frame(Main_URL = str_extract(URL_var, "(?<=//)[^\\s/:]+"))
Here, the stringr::str_extract function searches for the first match in the input and returns the matched substring. Unlike stringr::str_match, it cannot return submatches, so a lookbehind, (?<=...), is used in the regex pattern:
(?<=//)[^\s/:]+
It means:
(?<=//) - match a location in the string that is immediately preceded by the string //
[^\s/:]+ - one or more (+) occurrences of any char other than whitespace, / and : (the backslash is doubled, \\s, inside the R string literal). The colon makes sure a port number is not included in the match, / makes sure the match stops before the first /, and \s (whitespace) makes sure the match stops before the first whitespace.
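The same pattern can be checked outside R. Here is a minimal Python sketch, assuming the goal is to count how often each host appears. The helper name main_url and the last URL in the list are made up for illustration; the other URLs come from the question.

```python
import re
from collections import Counter

# Same pattern as in the answer: text right after "//", up to whitespace, "/" or ":".
HOST = re.compile(r"(?<=//)[^\s/:]+")

def main_url(url: str):
    """Return the host part of a URL, or None when no '//' is present."""
    m = HOST.search(url)
    return m.group(0) if m else None

urls = [
    "//www.google.com",
    "http://image.google.com/steve",
    "http://my-listing.ca/CommercialDrive.html",
    "http://www.real-data.ca/clients/ur/",
    "http://www.real-data.ca/other",  # hypothetical extra entry, to show counting
]

counts = Counter(main_url(u) for u in urls)
print(main_url("http://my-listing.ca/CommercialDrive.html"))  # my-listing.ca
print(counts["www.real-data.ca"])                             # 2
```

Because the character class only excludes whitespace, / and :, hyphens in host names are kept.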

Extract a certain element from URL using regular expressions

I need to extract the first element ("adidas-originals") after "designers" in the following URL using regular expressions.
xxx/en-ca/men/designers/adidas-originals/shorts
This needs to be done in the Google BigQuery API (standard SQL). I have tried several ways to get the desired value without any success. Below is the best solution I have found so far, which is obviously not the right one, as it returns "/adidas-originals/shorts".
REGEXP_EXTRACT(hits.page.pagePath, r'designers([^\n]*)')
Thanks!
The [^\n]* matches 0 or more chars other than a newline, LF, so no wonder it matches too much.
You need a pattern to match up to the next /, so you may use
designers/([^/]+)
Or a more precise:
(?:^|/)designers/([^/]+)
Details
(?:^|/) - either start of a string or / (you may just use / if designers is always preceded with /)
designers/ - a designers/ substring
([^/]+) - Capturing group 1 (just what will be returned with the REGEXP_EXTRACT function): one or more chars other than /.
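The capture behaviour can be checked quickly outside BigQuery. Here is a minimal sketch in Python (not BigQuery SQL), using the path from the question. The helper name designer_from_path is made up for illustration.

```python
import re

# Group 1 captures the path segment that follows "designers/".
DESIGNER = re.compile(r"(?:^|/)designers/([^/]+)")

def designer_from_path(page_path: str):
    """Return the segment after 'designers/', or None when it is absent."""
    m = DESIGNER.search(page_path)
    return m.group(1) if m else None

print(designer_from_path("xxx/en-ca/men/designers/adidas-originals/shorts"))
# adidas-originals
```

Since [^/]+ cannot cross a slash, the match stops before /shorts, which is the part the [^\n]* version was picking up.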

Creating RegEx That Reads Entire String

My current regex is only picking up part of my string. It creates a match as soon as one is found, even though I need the longer version of that match to hit. For example, I am creating matches for both:
SSS111
and
SSS111-L
The first SSS111 matches fine with my current regex, but the SSS111-L is only getting matched to the SSS111, leaving the -L out.
How can I create a greedy regex to read the whole line before matching? I am currently using
[-A-Z0-9]{3,12}
to capture the numbers and letters, but have not had any luck outside of this.
Regex quantifiers are greedy by default, and that is usually the source of the problem.
Here I think you only have to escape the '-':
[\-A-Z0-9]{3,12}
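As a quick sanity check of the pattern from the question, here is a small Python sketch. A hyphen placed first inside a character class is treated literally, so [-A-Z0-9]{3,12} already consumes the -L suffix when it is applied to the full input. If a shorter match still shows up in practice, the cause may lie in how the pattern is applied rather than in the pattern itself, though that is an assumption, since the question does not show the surrounding code.

```python
import re

# Pattern exactly as given in the question.
CODE = re.compile(r"[-A-Z0-9]{3,12}")

# search() returns the longest run the greedy quantifier can take at that position.
print(CODE.search("SSS111-L").group(0))        # SSS111-L
print(CODE.search("SSS111").group(0))          # SSS111

# fullmatch() succeeds only when the whole string is consumed.
print(CODE.fullmatch("SSS111-L") is not None)  # True
```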

regex to match page[0-9] and nothing before or after

I have a regex, but it's not quite working the way I want:
page[0-9]*
/pages/search.aspx?pageno=3&pg=232323&hdhdhd/page73733/xyz
In the above example, the only thing I want to match is page73733. But my regex matches the page in /pages, and it matches the page in pageno=3.
I also tried page[0-9].*, which matches page73733 but also everything that comes after it, so it actually matches page73733/xyz.
page[0-9].*[^a-zA-Z&?/=]
That seems to do what I want, but it also seems like an ugly way to do it. Plus, if I had something like /page123/xyz/page456, it would match that whole string.
So is there a better way to do this? I want to match ONLY the string page when it is followed by any number of digits, and the match should stop right after the digits.
* means 0 or more occurrences. + means 1 or more occurrences.
page[0-9]+ should work.
page[0-9]*
Will match page followed by zero or more numbers. What you want is:
page[0-9]+
Which will match page followed by one or more numbers.
You almost got it. Just use + instead of *, as that forces the match to have at least one digit after page.
Another way to write that expression would be
/page[0-9]+
Note the /. It is helpful because without it you might get a match inside something like "notApage123".
The regex page[0-9]* will match [0-9] 0 or more times. + would match it 1 or more times, and ? would match it 0 or 1 times. An equivalent method to ?+* is as follows:
?={0,1}
*={0,}
+={1,}
This may be helpful if you want to match a date: \d{4}(-\d{1,2}){2} would match 2013-5-31.
That said, the resulting regex for your particular problem would be any of:
page\d+
page\d{1,}
page[0-9]+
page[0-9]{1,}
In your example "/page123/xyz/page456" you may want to match all occurrences, so don't forget the g or global modifier.
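The difference between the quantifiers is easy to see on the URL from the question. Here is a minimal Python sketch, where re.findall plays the role of the global modifier:

```python
import re

url = "/pages/search.aspx?pageno=3&pg=232323&hdhdhd/page73733/xyz"

# * allows zero digits, so a bare "page" also matches.
print(re.findall(r"page[0-9]*", url))     # ['page', 'page', 'page73733']

# + and {1,} both require at least one digit.
print(re.findall(r"page[0-9]+", url))     # ['page73733']
print(re.findall(r"page[0-9]{1,}", url))  # ['page73733']

# findall returns every occurrence, like the g modifier in other flavours.
print(re.findall(r"page\d+", "/page123/xyz/page456"))  # ['page123', 'page456']

# The date pattern from the answer, applied to the whole string.
date_ok = re.fullmatch(r"\d{4}(-\d{1,2}){2}", "2013-5-31") is not None
print(date_ok)  # True
```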
If I understand your problem correctly, you only need to add $ to your original regex to specify that after page you want the string to end. So the regex would be
page[0-9]*$
Note that this also matches strings that end in page with no digits after it. If you want only strings that end in page followed by at least one digit, use this regex:
page[0-9]+$
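A short Python sketch of what the $ anchor changes. The helper name last_page is made up for illustration. With the anchor, only a page number at the very end of the string matches, so the URL from the question, which continues with /xyz, produces no match.

```python
import re

def last_page(path: str):
    """Return 'page<digits>' only when it sits at the very end of the string."""
    m = re.search(r"page[0-9]+$", path)
    return m.group(0) if m else None

print(last_page("/page123/xyz/page456"))  # page456
print(last_page("/pages/search.aspx?pageno=3&pg=232323&hdhdhd/page73733/xyz"))  # None
```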

Regex for ASP.NET url rewrite

Sample text =
legacycard.ashx?save=false&iNo=3&No=555
Sample pattern =
^legacycard.ashx(.*)No=(\d+)
Want to grab group #2 value of "555" (the value of "No=" in the sample text)
In Expresso, this works, but in ASP.NET UrlRewrite, it is not catching.
Am I missing something?
Thanks!
I would do something along these lines:
^legacycard.ashx\?(?:.+&)*No=(\d+)
The \? escapes the question mark that separates the URL from the parameters. The (?:.+&)* part then consumes every parameter key/value pair (anything that ends in &) before the parameter you actually care about. Using ?: makes the set of brackets non-capturing (I'm assuming you won't need any of that data, and it has the potential to slightly speed up your regex), which leaves just 555 captured. The added benefit of this approach is that it works regardless of parameter order.
Just use this regex:
^legacycard\.ashx\?save=(false|true)&iNo=(?<ino>\d+)&No=(?<no>\d+)
Then Regex Replace with
${no}
Looks fine to me, your regex should match the entire string
legacycard.ashx?save=false&iNo=3&No=555
I'm not sure why you have groups, but the groups should return
?save=false&iNo=3&
and
555
For good measure, you should know that the . in legacycard.ashx is also interpreted by the regex engine, and you would normally escape it. In this case it doesn't matter, because an unescaped dot matches any single character, including a literal dot. :)
Try this
^legacycard.ashx(\?No=|.*?&No=)(\d+)
this should work.
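The patterns above can be compared in a small Python sketch. It only exercises the regexes, not ASP.NET UrlRewrite itself, so it says nothing about why the rewrite rule does not fire. The variable names and the second input (a query string without a No parameter) are made up for illustration, and the dot is escaped in the stricter pattern.

```python
import re

sample = "legacycard.ashx?save=false&iNo=3&No=555"

# Pattern from the question, and the stricter variant with the dot escaped.
original = re.compile(r"^legacycard.ashx(.*)No=(\d+)")
strict = re.compile(r"^legacycard\.ashx\?(?:.+&)*No=(\d+)")

print(original.search(sample).groups())  # ('?save=false&iNo=3&', '555')
print(strict.search(sample).group(1))    # 555

# Hypothetical input with no No= parameter: the loose pattern picks up iNo instead.
no_param = "legacycard.ashx?save=false&iNo=3"
print(original.search(no_param).group(2))  # 3
print(strict.search(no_param))             # None
```

The stricter pattern requires No= to start right after ? or &, which is why it does not mistake iNo=3 for the wanted parameter.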
