Problems pattern matching using R regular expressions - r

I'm trying to extract a string using str_extract. Here is a small example of the type of string:
library(stringr)
gs<-"{\"type\":\"Polygon\",\"coordinates\":[[[1,2],[3,4],[5,6],[7,8]]]}"
s='\\{\\\"type\\\"*\\}'
str_extract(gs,s)
I'd like to get a print-out of the entire string (the real string will have more characters of this type and should only return the piece I specified here). Instead I get NA. I'd be grateful for any ideas as to what I'm doing wrong. Thank you!

Does this do what you want?
gs<-"{\"type\":\"Polygon\",\"coordinates\":[[[1,2],[3,4],[5,6],[7,8]]]} I DO NOT WANT THIS {\"type\":\"Not a Polygon\",\"coordinates\":[[[1,2],[3,4],[5,6],[7,8]]]}"
s="\\{\"type\"(.*?)\\}"
result = str_match_all(gs,s)[[1]][,1]
To test, I added the string 'I DO NO WANT THIS', which should not be returned, and added a second object which type is 'not a polygon'
It returns:
"{\"type\":\"Polygon\",\"coordinates\":[[[1,2],[3,4],[5,6],[7,8]]]}"
"{\"type\":\"Not a
Polygon\",\"coordinates\":[[[1,2],[3,4],[5,6],[7,8]]]}"
So only the elements requested. Hope this helps!

Related

String Matching in R - Problem with pattern

I have a small Problem. I want to extract a special pattern like this:
v-97bcer
or b-chyfvg or ghd6db
I tried this:
identifier_1 <- "([:alnum:]{6})" # for things like this ghd6db
identifier_2 <- "([:lower:]{1})[- ][:alnum:]{6})" # for things like this v-97bcer or b-chyfvg
The problem is that the first "identifier" works well ok, but extracts for example names as well. In GHD6D8 this example the numbers have no fixed place and can occur everywhere. I do just now that the length is 6.
And the second problem is that for example V-97bcer can occur like v97bcer but I need this format v-97bcer. Here too the numbers are randomly.
If somebody could help or give me a good source for better understanding how to do this. I have not much exp in string matching. Thank you
this should work:
x <- c("v-97bcer", "b-chyfvg", "ghd6db", "v97bcer")
grep("^([a-z].)?[a-z0-9]{6}$", x)
Note that in order to fix the length of the string I provide ^ and $ to the string.
This pattern matches v-97bcer and b-chyfvg and ghd6db but not v97bcer.

How to extract a substring from main string starting from valid uuid using lua

I have a main string as below
"/tmp/xjtscpdownload/7eb17cc6-b3c9-4ebd-945b-c0e0656a33f0/output/9999.317528060546245771146821638997525068657/"
From the main string i need to extract a substring starting from the uuid part
"/7eb17cc6-b3c9-4ebd-945b-c0e0656a33f0/output/9999.317528060546245771146821638997525068657/"
I tried
string.match("/tmp/xjtscpdownload/7eb17cc6-b3c9-4ebd-945b-c0e0656a33f0/output/9999.317528060546245771146821638997525068657/", "/[a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{12}/(.)/(.)/$"
But noluck.
if you want to obtain
"/7eb17cc6-b3c9-4ebd-945b-c0e0656a33f0/output/9999.317528060546245771146821638997525068657/"
from
"/tmp/xjtscpdownload/7eb17cc6-b3c9-4ebd-945b-c0e0656a33f0/output/9999.317528060546245771146821638997525068657/"
or let's say 7eb17cc6-b3c9-4ebd-945b-c0e0656a33f0, output and 9999.317528060546245771146821638997525068657 as this is what your pattern attempt suggests. Otherwise leave out the parenthesis in the following solution.
You can use a pattern like this:
local text = "/tmp/xjtscpdownload/7eb17cc6-b3c9-4ebd-945b-c0e0656a33f0/output/9999.317528060546245771146821638997525068657/"
print(text:match("/([%x%-]+)/([^/]+)/([^/]+)"))
"/([^/]+)/" captures at least one non-slash-character between two slashs.
On your attempt:
You cannot give counts like {4} in a string pattern.
You have to escape - with % as it is a magic character.
(.) would only capture a single character.
Please read the Lua manual to find out what you did wrong and how to use string patterns properly.
Try also the code
s="/tmp/xjtscpdownload/7eb17cc6-b3c9-4ebd-945b-c0e0656a33f0/output/9999.317528060546245771146821638997525068657/"
print(s:match("/.-/.-(/.+)$"))
It skips the first two "fields" by using a non-greedy match.

Use substr with start and stop words, instead of integers

I want to extract information from downloaded html-Code. The html-Code is given as a string. The required information is stored inbetween specific html-expressions. For example, if I want to have every headline in the string, I have to search for "H1>" and "/H1>" and the text between these html expressions.
So far, I used substr(), but I had to calculate the position of "H1>" and "/H1>" first.
htmlcode = " some html code <H1>headline</H1> some other code <H1>headline2</H1> "
startposition = c(21,55) # calculated with gregexpr
stopposition = c(28, 63) # calculated with gregexpr
substr(htmlcode, startposition[1], stopposition[1])
substr(htmlcode, startposition[2], stopposition[2])
The output is correct, but to calculate every single start and stopposition is a lot of work. Instead I search for a similar function like substr (), where you can use start and stop words instead of the position. For example like this:
function(htmlcode, startword = "H1>", stopword = "/H1>")
I'd agree that using a package built for html processing is probably the best way to handle the example you give. However, one potential way to sub-string a string based on character values would be to do the following.
Step 1: Define a simple function to return to position of a character in a string, in this example I am only using fixed character strings.
strpos_fixed=function(string,char){
a<-gregexpr(char,string,fixed=T)
b<-a[[1]][1:length(a[[1]])]
return(b)
}
Step 2: Define your new sub-string function using the strpos_fixed() function you just defined
char_substr<-function(string,start,stop){
x<-strpos_fixed(string,start)+nchar(start)
y<-strpos_fixed(string,stop)-1
z<-cbind(x,y)
apply(z,1,function(x){substr(string,x[1],x[2])})
}
Step 3: Test
htmlcode = " some html code <H1>headline</H1> some other code <H1>headline2</H1> "
htmlcode2 = " some html code <H1>baa dee ya</H1> some other code <H1>say do you remember?</H1>"
htmlcode3<- "<x>baa dee ya</x> skdjalhgfjafha <x>dancing in september</x>"
char_substr(htmlcode,"<H1>","</H1>")
char_substr(htmlcode2,"<H1>","</H1>")
char_substr(htmlcode3,"<x>","</x>")
You have two options here. First, use a package that has been developed explicitly for the parsing of HTML structures, e.g., rvest. There are a number of tutorials online.
Second, for edge cases where you may need to extract from strings that are not necessarily well-formatted HTML you should use regular expressions. One of the simpler implementations for this comes from stringr::str_match:
# 1. the parenthesis define regex groups
# 2. ".*?" means any character, non-greedy
# 3. so together we are matching the expression <H1>some text or characters of any length</H1>
str_match(htmlcode, "(<H1>)(.*?)(</H1>)")
This will yield a matrix where the columns are (in order) the fully matched string followed by each independent regex group we specified. You would just want to pull the second group in this case if you want whatever text is between the <H1> tags (3rd column).

GA Search & Replace Filter

I wonder whether someone can help me please.
I have the following URI in GA: /invite/accept-invitation/accepted/B
Which I'd like to change to: /invite/accept-invitation/accepted
I've tried a 'Search and Replace filter as follows:
Search String - /invite/accept-invitation/accepted/*
Replace String - /invite/accept-invitation/accepted
But the result I get is:
/inviteaccept-invitation/accepted/B
Could someone tell me where I've gone wrong with this please?
Many thanks and kind regards
Chris
Google Analytics "Search and replace" filter uses regular expressions. More precisely:
Replace string is either a regular string or it can refer to group
patterns in the search expression using backslash-escaped single
digits like (\0 to \9).
More details are available on the filter settings UI, which also refers to this link.
So in your case, the search string would be something like this.
\/invite\/accept-invitation\/accepted\/\w+
In this expression \ is escaped. Your last string part is captured with \w+, which
matches any word character (equal to [a-zA-Z0-9_]), between one and unlimited times, as many times as possible.
The Replace string doesn't have to be a regular expression. So in your case, your original version could be used:
/invite/accept-invitation/accepted/
Putting this together would result something like this, which gives the desired output in my test view:

Match everything up until first instance of a colon

Trying to code up a Regex in R to match everything before the first occurrence of a colon.
Let's say I have:
time = "12:05:41"
I'm trying to extract just the 12. My strategy was to do something like this:
grep(".+?(?=:)", time, value = TRUE)
But I'm getting the error that it's an invalid Regex. Thoughts?
Your regex seems fine in my opinion, I don't think you should use grep, also you are missing perl=TRUE that is why you are getting the error.
I would recommend using :
stringr::str_extract( time, "\\d+?(?=:)")
grep is little different than it is being used here, its good for matching separate values and filtering out those which has similar pattern, but you can't pluck out values within a string using grep.
If you want to use Base R you can also go for sub:
sub("^(\\d+?)(?=:)(.*)$","\\1",time, perl=TRUE)
Also, you may split the string using strsplit and filter out the first string like below:
strsplit(time, ":")[[1]][1]

Resources