capture a specific JSON-LD attribute with a regular expression in R - r

Using R, I want to capture a specific attribute (@type in this case) out of a JSON-LD payload inside a <script> tag. Here's a sample fragment:
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "WebSite",
...
This is sample R code that performs the attribute extraction:
x <- "<script type=\"application/ld+json\">{\"@context\":\"https://schema.org\",\"@type\":\"WebSite\",\"url\":\"https://www.foo.com/\""
regmatches(x, regexpr("<script [^>]*type *= *['\"] *application/ld.json *['\"][^>]*>[^}]+ ['\"] *@type *['\"] *: *['\"]([^'\"]+)['\"]", x, ignore.case = TRUE))
The output from this code is the following:
[1] "<script type=\"application/ld+json\">{ \"@context\": \"https://schema.org\" \"@type\":\"WebSite\""
The output I expect is this one:
[1] "WebSite"
I don't have much experience with R, and even less with regular expressions, but what puzzles me is that I have already tested this regex on the regex101 website (you can check the test here) and it works there.
Can you give me a hint on how to return the correct attribute instead of the full test string?

You may use a \K-based PCRE pattern to extract one or more chars other than ' and " after a specific pattern:
x <- "<script type=\"application/ld+json\">{\"@context\":\"https://schema.org\",\"@type\":\"WebSite\",\"url\":\"https://www.foo.com/\""
p <- "<script\\s[^>]*type *= *['\"] *application/ld.json *['\"][^>]*>[^}]+['\"] *@type *['\"] *: *['\"]\\K[^'\"]+"
regmatches(x, regexpr(p, x, ignore.case = TRUE, perl=TRUE))
## => "WebSite"
See the R demo online
The pattern has the form <SOME_LEFTHAND_CONTEXT_PATTERN>\K<WHAT_YOU_NEED>. The \K operator discards all text matched so far, so only <WHAT_YOU_NEED> ends up in the result. See this pattern demo. Do not forget the perl=TRUE argument, which enables the PCRE regex engine here.
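For comparison, Python's built-in re module has no \K operator, so the same "left-hand context, then the part you need" idea is usually written with a capturing group. The following is a minimal sketch under stated assumptions: the helper name extract_json_ld_value, the sample string, and the choice of the url and name keys are made up for illustration and are not part of the original answer.

```python
import re

def extract_json_ld_value(html: str, key: str):
    """Return the first string value stored under `key` in a JSON-LD script, or None."""
    pattern = (
        # Left-hand context: the opening <script type="application/ld+json"> tag ...
        r"<script\s[^>]*type\s*=\s*['\"]\s*application/ld\+json\s*['\"][^>]*>"
        # ... then anything up to the quoted key, the colon, and the opening quote.
        r"[^}]*['\"]\s*" + re.escape(key) + r"\s*['\"]\s*:\s*['\"]"
        # The part we actually need, captured as group 1 (instead of using \K).
        r"([^'\"]+)"
    )
    match = re.search(pattern, html, flags=re.IGNORECASE)
    return match.group(1) if match else None

# Hypothetical sample payload, for illustration only.
sample = '<script type="application/ld+json">{"url":"https://www.foo.com/","name":"Foo"}'
print(extract_json_ld_value(sample, "url"))
print(extract_json_ld_value(sample, "name"))
```

Reading group 1 plays the role that \K plays in the PCRE pattern: the context is still matched, but only the captured part is returned. For anything beyond a quick extraction, parsing the script body with a JSON parser is likely more robust than a regex.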

Related

Regular expression meaning in R : "( \n|\n )"

I am new to R programming and was trying out the gsub function for text replacement on a pandas DataFrame series (i.e. new_text).
It is a vast series, so I cannot print all of it here.
It is just a series of strings containing postal addresses.
I came across this gsub code: gsub(pattern = "( \n|\n )", replacement = " ", x = new_text) -> new_text
Can you please explain the meaning of this regular expression, and also give a Python alternative that uses a regular expression?
Your pattern, slightly rewritten, is [ ]\n|\n[ ], which says to match:
[ ]\n a space followed by a newline
| OR
\n[ ] a newline followed by a space
Note that you might be able to use [ ]?\n[ ]? to the same effect, depending on the actual text you are using with gsub.
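As for the Python alternative the question asks about, a minimal sketch with the standard re module could look like the following. The function name collapse_newline_space and the sample address are assumptions made up for illustration.

```python
import re

def collapse_newline_space(text: str) -> str:
    # Replace "space + newline" or "newline + space" with a single space,
    # mirroring gsub(pattern = "( \n|\n )", replacement = " ", x = text).
    return re.sub(r"( \n|\n )", " ", text)

# Hypothetical postal address, for illustration only.
sample = "12 Main St \nSpringfield\n IL"
cleaned = collapse_newline_space(sample)
print(cleaned)

# For a pandas Series of strings, the equivalent call would presumably be:
#   new_text = new_text.str.replace(r"( \n|\n )", " ", regex=True)
```

Note that a newline with no space on either side is left untouched by this pattern, exactly as in the R version.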

filter paths that have only one "/" in R using regular expression

I have a vector of different paths such as
levs<-c( "20200507-30g_25d" , "20200507-30g_25d/ggg" , "20200507-30g_25d/grn", "20200507-30g_25d/ylw", "ggg" , "grn", "tre_livelli", "tre_livelli/20200507-30g_25d", "tre_livelli/20200507-30g_25d/ggg", "tre_livelli/20200507-30g_25d/grn", "tre_livelli/20200507-30g_25d/ylw" , "ylw" )
which is actually the output of a list.dirs with recursive set to TRUE.
I want to identify only the paths which have just one subfolder (that is "20200507-30g_25d/ggg" , "20200507-30g_25d/grn", "20200507-30g_25d/ylw").
I thought of filtering the vector to find only those paths that have only one "/" and then comparing these with the ones that have more than one "/" to get rid of the partial paths.
I tried with a regular expression such as:
grep(levs,pattern='/{1}', value=T)
but I get this:
"20200507-30g_25d/ggg" "20200507-30g_25d/grn" "20200507-30g_25d/ylw" "tre_livelli/20200507-30g_25d" "tre_livelli/20200507-30g_25d/ggg" "tre_livelli/20200507-30g_25d/grn" "tre_livelli/20200507-30g_25d/ylw"
Any idea on how to proceed?
/{1} is a regex that is equal to / and just matches a / anywhere in a string, and there can be more than one / inside it. Please have a look at the regex tag page:
Using {1} as a single-repetition quantifier is harmless but never useful. It is basically an indication of inexperience and/or confusion.
h{1}t{1}t{1}p{1} matches the same string as the simpler expression http (or ht{2}p for that matter) but as you can see, the redundant {1} repetitions only make it harder to read.
You can use
grep(levs, pattern="^[^/]+/[^/]+$", value=TRUE)
# => [1] "20200507-30g_25d/ggg" "20200507-30g_25d/grn" "20200507-30g_25d/ylw" "tre_livelli/20200507-30g_25d"
See the regex demo:
^ - matches the start of string
[^/]+ - one or more chars other than /
/ - a / char
[^/]+ - one or more chars other than /
$ - end of string.
NOTE: if the parts before or after the only / in the string can be empty, replace + with *: ^[^/]*/[^/]*$.
An option with str_count to count the number of instances of /
library(stringr)
levs[str_count(levs, "/") == 1 ]
-output
[1] "20200507-30g_25d/ggg" "20200507-30g_25d/grn"
[3] "20200507-30g_25d/ylw" "tre_livelli/20200507-30g_25d"
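The two approaches above (an anchored regex and counting the separators) can be compared side by side. Below is a minimal Python sketch of the same idea, reusing the levs values from the question; the variable names by_regex and by_count are assumptions for illustration.

```python
import re

levs = [
    "20200507-30g_25d", "20200507-30g_25d/ggg", "20200507-30g_25d/grn",
    "20200507-30g_25d/ylw", "ggg", "grn", "tre_livelli",
    "tre_livelli/20200507-30g_25d", "tre_livelli/20200507-30g_25d/ggg",
    "tre_livelli/20200507-30g_25d/grn", "tre_livelli/20200507-30g_25d/ylw",
    "ylw",
]

# Approach 1: anchored regex, exactly one "/" with non-empty parts on both sides.
one_slash = re.compile(r"^[^/]+/[^/]+$")
by_regex = [path for path in levs if one_slash.match(path)]

# Approach 2: count the separators directly, no regex needed.
by_count = [path for path in levs if path.count("/") == 1]

print(by_regex)
print(by_count)
```

The two lists agree for this input. They would differ only for strings such as "/ggg" or "ggg/", where one side of the slash is empty: counting accepts them, while the + quantifiers in the regex reject them.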

Regular Expression to get the SRC of images in asp.net

I have a string which contains HTML code and an image. I need to get the value of the src attribute from that string. I try to use this code but it's not working
foreach (Match match in Regex.Matches(wordHTML, "<img.*?src=[\"'](.+?)[\"'].*?>", RegexOptions.IgnoreCase))
{
wordHTML = Regex.Replace(wordHTML, match.Groups[1].Value, "Temp/"+ match.Groups[1].Value);
}
my image path
<img width="165" height="138" src="636697542198949135.files/image002.jpg" v:shapes="Рисунок_x0020_7">
Julio's answer is a good one, but the next regex uses a backreference so that the src value may be wrapped in either single or double quotes, and it also handles empty src values:
<img[^>]*?\ssrc=(["'])([^\1]*?)\1
The full src of the img (without quotes) will be group number 2 in the regular expression
I tried this expression and it works.
src=(?:\"|\')?(?<imgSrc>[^>]*[^/].(?:jpg|bmp|gif|png))(?:\"|\')?
Try with this:
<img\s+[^>]*\bsrc=["']([^"']+)["']
Demo
<img # literal '<img'
\s+ # one or more whitespace characters
[^>]* # zero or more characters other than '>'
\b # word boundary
src= # literal 'src='
["'] # " or '
([^"']+) # capture: one or more characters other than ' and "
["'] # " or '
Try to specify pattern like this:
string pattern = @"<img\s+[^>]*\bsrc=[""']([^""']+)[""']";
foreach (Match match in Regex.Matches(sentence, pattern))
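To make the backreference idea from the answers above concrete, here is a minimal Python sketch that both extracts the src values and rewrites them with a prefix in a single pass, which is what the question's loop was attempting. The names IMG_SRC, img_sources and prefix_sources are assumptions for illustration; the sample tag is the one from the question.

```python
import re

# Group 1: everything up to and including 'src='; group 2: the opening quote;
# group 3: the src value; \2 requires the same quote character to close it.
IMG_SRC = re.compile(r"""(<img\b[^>]*?\ssrc=)(["'])(.*?)\2""", re.IGNORECASE | re.DOTALL)

def img_sources(html: str) -> list:
    """Return the src value of every <img> tag found in html."""
    return [m.group(3) for m in IMG_SRC.finditer(html)]

def prefix_sources(html: str, prefix: str) -> str:
    """Return html with prefix prepended to every <img> src value."""
    return IMG_SRC.sub(
        lambda m: f"{m.group(1)}{m.group(2)}{prefix}{m.group(3)}{m.group(2)}",
        html,
    )

word_html = '<img width="165" height="138" src="636697542198949135.files/image002.jpg" v:shapes="Рисунок_x0020_7">'
print(img_sources(word_html))
print(prefix_sources(word_html, "Temp/"))
```

Doing the rewrite inside the substitution callback avoids one pitfall of the original loop: passing the captured path to a second regex replace treats characters such as . as metacharacters and may also touch other occurrences of the same text.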

Test for exact string in testthat

I'd like to test that one of my functions gives a particular message (or warning, or error).
good <- function() message("Hello")
bad <- function() message("Hello!!!!!")
I'd like the first expectation to succeed and the second to fail.
library(testthat)
expect_message(good(), "Hello", fixed=TRUE)
expect_message(bad(), "Hello", fixed=TRUE)
Unfortunately, both of them pass at the moment.
For clarification: this is meant to be a minimal example, rather than the exact messages I'm testing against. If possible I'd like to avoid adding complexity (and probably errors) to my test scripts by needing to come up with an appropriate regex for every new message I want to test.
You can use the ^ and $ anchors to indicate that the string must begin and end with your pattern.
expect_message(good(), "^Hello\\n$")
expect_message(bad(), "^Hello\\n$")
#Error: bad() does not match '^Hello\n$'. Actual value: "Hello!!!!!\n"
The \\n is needed to match the new line that message adds.
For warnings it's a little simpler, since there's no newline:
expect_warning(warning("Hello"), "^Hello$")
For errors it's a little harder:
good_stop <- function() stop("Hello")
expect_error(good_stop(), "^Error in good_stop\\(\\) : Hello\n$")
Note that any regex metacharacters, i.e. . \ | ( ) [ { ^ $ * + ?, will need to be escaped.
Alternatively, borrowing from Mr. Flick's answer here, you could convert the message into a string and then use expect_true, expect_identical, etc.
messageToText <- function(expr) {
con <- textConnection("messages", "w")
sink(con, type="message")
eval(expr)
sink(NULL, type="message")
close(con)
messages
}
expect_identical(messageToText(good()), "Hello")
expect_identical(messageToText(bad()), "Hello")
#Error: messageToText(bad()) is not identical to "Hello". Differences: 1 string mismatch
Your regex matches "Hello" in both cases, so neither expectation fails. You'll need to set up word boundaries (\\b) on both sides. That would suffice if there were no punctuation or spaces involved. To rule those out too, you'll need to add [^\\s ^\\w]
library(testthat)
expect_message(good(), "\\b^Hello[^\\s ^\\w]\\b")
expect_message(bad(), "\\b^Hello[^\\s ^\\w]\\b")
## Error: bad() does not match '\b^Hello[^\s ^\w]\b'. Actual value: "Hello!!!!!\n"

Find word (not containing substrings) in comma separated string

I'm using a LINQ query where I do something like this:
viewModel.REGISTRATIONGRPS = (From a In db.TABLEA
Select New SubViewModel With {
.SOMEVALUE1 = a.SOMEVALUE1,
...
...
.SOMEVALUE2 = If(commaseparatedstring.Contains(a.SOMEVALUE1), True, False)
}).ToList()
Now my problem is that this doesn't search for words but for substrings, so for example:
commaseparatedstring = "EWM,KI,KP"
SOMEVALUE1 = "EW"
It returns true because it's contained in EWM?
What i would need is to find words (not containing substrings) in the comma separated string!
Option 1: Regular Expressions
Regex.IsMatch(commaseparatedstring, @"\b" + Regex.Escape(a.SOMEVALUE1) + @"\b")
The \b parts are called "word boundaries" and tell the regex engine that you are looking for a "full word". The Regex.Escape(...) ensures that the regex engine will not try to interpret "special characters" in the text you are trying to match. For example, if you are trying to match "one+two", the Regex.Escape method will return "one\+two".
Also, be sure to import the System.Text.RegularExpressions namespace at the top of your code file.
See Regex.IsMatch Method (String, String) on MSDN for more information.
Option 2: Split the String
You could also try splitting the string which would be a bit simpler, though probably less efficient.
commaseparatedstring.Split(new Char[] { ',' }).Contains( a.SOMEVALUE1 )
what about:
- separating the commaseparatedstring by comma
- calling equals() on each substring instead of contains() on whole thing?
.SOMEVALUE2 = If(commaseparatedstring.Split(',').Contains(a.SOMEVALUE1), True, False)
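The difference between the substring check and the two options above can be seen in a small Python sketch of the same idea. The function names contains_token and contains_word are assumptions for illustration; the sample values are the ones from the question.

```python
import re

def contains_token(comma_separated: str, value: str) -> bool:
    """Split on commas and compare whole items (the 'split' option)."""
    return value in [item.strip() for item in comma_separated.split(",")]

def contains_word(comma_separated: str, value: str) -> bool:
    """Search for value surrounded by word boundaries (the regex option)."""
    pattern = r"\b" + re.escape(value) + r"\b"
    return re.search(pattern, comma_separated) is not None

groups = "EWM,KI,KP"
print("EW" in groups)                # substring check: the behaviour from the question
print(contains_token(groups, "EW"))  # whole-item check
print(contains_word(groups, "EW"))   # word-boundary check
print(contains_token(groups, "KI"))
```

The split version compares complete items, so it is the closer match to "find this exact entry in the list". The word-boundary version can behave differently when an item itself contains non-word characters such as spaces or hyphens.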

Resources