Regular expression meaning in R: "( \n|\n )"

I am new to R programming and was trying out the gsub function for text replacement in a pandas dataframe series (i.e. new_text).
It is a vast series, so I will not be able to print all of it here.
It is just a series of strings containing postal addresses.
I came across this gsub code: gsub(pattern = "( \n|\n )", replacement = " ", x = new_text) -> new_text
Can you please let me know the meaning of this regular expression, as well as the Python alternative using a regular expression?

Your pattern, slightly rewritten, is [ ]\n|\n[ ], which says to match:
[ ]\n a space followed by a newline
| OR
\n[ ] a newline followed by a space
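Note that you might be able to use [ ]?\n[ ]? to the same effect, depending on the actual text you are using with gsub. That version also matches a newline with no adjacent space, or with a space on both sides, so the two patterns only give the same result when those cases don't occur (or don't matter) in your data.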
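The question also asks for a Python alternative. Here is a minimal sketch using the standard re module, assuming a few made-up address strings in place of the real new_text series. With pandas, new_text.str.replace(pattern, " ", regex=True) should behave the same way on a Series of strings.

```python
import re

# Hypothetical sample data standing in for the real new_text series.
new_text = [
    "12 Main St \nSpringfield",   # space followed by newline
    "PO Box 7\n Shelbyville",     # newline followed by space
    "No newline here",            # nothing to replace
]

# Same pattern as the R call: " \n" OR "\n ".
pattern = re.compile(r"( \n|\n )")

# Replace every match with a single space, like gsub(..., replacement = " ").
cleaned = [pattern.sub(" ", s) for s in new_text]
print(cleaned)
```

A newline with no space on either side is left untouched by this pattern, exactly as in the R version.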

Related

Only select last part of a string after the last point

I have a dataframe with one column that represents the requests made by my users. A few examples look like this:
GET /enviro/html/tris/tris_overview.html
GET /./enviro/gif/emcilogo.gif
GET /docs/exposure/meta_exp.txt.html
GET /hrmd/
GET /icons/circle_logo_small.gif
I only want to select the last part of the string after the last "." in such a way that I return the pagetype of the string. The output of these lines should therefore be:
.html
.gif
.html
.gif
I tried doing this with sub but I only manage to select everything after the first "." example:
tring <- c("GET /enviro/html/tris/tris_overview.html", "GET /./enviro/gif/emcilogo.gif", "GET /docs/exposure/meta_exp.txt.html", "GET /hrmd/", "GET /icons/circle_logo_small.gif")
sub("^[^.]*", "", sapply(strsplit(tring, "\\s+"), `[`, 2))
this returns:
".html"
"./enviro/gif/emcilogo.gif"
".txt.html"
""
".gif"
I created the following gsub code that works for string containing two points:
gsub(pattern = ".*\\.", replacement = "", "GET /./enviro/gif/finds.gif", "\\s+")
this returns:
"gif"
However, I can't seem to come up with one gsub/sub call that works for all possible input. It should read the string from right to left, stop when it sees the first ".", and return everything that was found after that ".".
I am new to R and I can't come up with something that is doing this. Any help would be highly appreciated!
You can't change the string parsing direction with R regex. Instead, you may match everything up to the last . and remove it, or match the . that has no other . characters to the right of it up to the end of the string.
string <- c('GET /enviro/html/tris/tris_overview.html','GET /./enviro/gif/emcilogo.gif','GET /docs/exposure/meta_exp.txt.html','GET /hrmd/','GET /icons/circle_logo_small.gif')
res <- regmatches(string, regexec("\\.[^.]*$", string))
res[lengths(res)==0] <- ""
unlist(res)
Or
sub("^(.*(?=\\.)|.*)", "", string, perl=TRUE)
Both return
[1] ".html" ".gif" ".html" "" ".gif"
Here, \.[^.]*$ matches a . and then any zero or more characters other than . up to the end of the string. The sub call uses the ^(.*(?=\\.)|.*) pattern, which matches the start of the string and then either as many characters as possible up to the last . without consuming that dot or, if there is no dot at all, the whole string. The match is replaced with an empty string.
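For comparison, the first approach (match a dot that has no further dots after it) can be sketched in Python with the standard re module. The names requests, LAST_EXTENSION and page_type are made up for this illustration, and the data is copied from the question.

```python
import re

requests = [
    "GET /enviro/html/tris/tris_overview.html",
    "GET /./enviro/gif/emcilogo.gif",
    "GET /docs/exposure/meta_exp.txt.html",
    "GET /hrmd/",
    "GET /icons/circle_logo_small.gif",
]

# A dot followed only by non-dot characters up to the end of the string.
LAST_EXTENSION = re.compile(r"\.[^.]*$")

def page_type(request: str) -> str:
    """Return the text from the last '.' onward, or '' if there is no dot."""
    match = LAST_EXTENSION.search(request)
    return match.group(0) if match else ""

page_types = [page_type(r) for r in requests]
print(page_types)
```

As in the R answer, a request without any dot maps to an empty string rather than being dropped.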
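Here is a regex-free solution, where a is the input character vector (string in the answer above). Note that it returns the extension without the leading dot:
sapply(
  seq_along(a),
  function(i) {
    # take the piece after the last "." if there is one, otherwise return ""
    if (grepl("\\.", a[i])) tail(strsplit(a[i], "\\.")[[1]], 1) else ""
  }
)
# [1] "html" "gif" "html" "" "gif"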

capture a specific JSON-LD attribute with a regular expression in R

Using R, I want to capture a specific attribute (@type in this case) out of a JSON-LD payload inside a <script> tag. Here's a sample fragment:
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "WebSite",
...
This is sample R code to perform the attribute extraction:
x <- "<script type=\"application/ld+json\">{\"@context\":\"https://schema.org\",\"@type\":\"WebSite\",\"url\":\"https://www.foo.com/\""
regmatches(x, regexpr("<script [^>]*type *= *['\"] *application/ld.json *['\"][^>]*>[^}]+ ['\"] *@type *['\"] *: *['\"]([^'\"]+)['\"]", x, ignore.case = TRUE))
The output from this code is the following:
[1] "<script type=\"application/ld+json\">{ \"@context\": \"https://schema.org\" \"@type\":\"WebSite\""
The output I expect is this one:
[1] "WebSite"
I don't have much experience with R, and even less with regular expressions, but what bugs me is that I've already tried this regex on the regex101 website and it works there.
Can you give me a hint on how to return the correct attribute instead of the full test string?
You may use a \K-based PCRE pattern to extract any one or more characters other than ' and " after a specific pattern:
x <- "<script type=\"application/ld+json\">{\"@context\":\"https://schema.org\",\"@type\":\"WebSite\",\"url\":\"https://www.foo.com/\""
p <- "<script\\s[^>]*type *= *['\"] *application/ld.json *['\"][^>]*>[^}]+['\"] *@type *['\"] *: *['\"]\\K[^'\"]+"
regmatches(x, regexpr(p, x, ignore.case = TRUE, perl=TRUE))
## => "WebSite"
The pattern has the form <SOME_LEFTHAND_CONTEXT_PATTERN>\K<WHAT_YOU_NEED>. The \K operator omits all text matched so far from the result, so you only get <WHAT_YOU_NEED>. Do not forget the perl=TRUE argument, which enables the PCRE regex engine here.
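Python's built-in re module does not support \K, so a rough Python counterpart would use a capturing group for the wanted part instead. This is only a sketch: it assumes the payload uses the standard JSON-LD key @type, and the names pattern and type_value are made up for the illustration.

```python
import re

x = ('<script type="application/ld+json">{"@context":"https://schema.org",'
     '"@type":"WebSite","url":"https://www.foo.com/"')

# Left-hand context first, then a capturing group around the value we want.
pattern = re.compile(
    r"""<script\s[^>]*type *= *['"] *application/ld.json *['"][^>]*>"""
    r"""[^}]+['"] *@type *['"] *: *['"]([^'"]+)""",
    re.IGNORECASE,
)

match = pattern.search(x)
# group(1) holds only the captured value, not the context before it.
type_value = match.group(1) if match else None
print(type_value)
```

For anything beyond a quick extraction, parsing the script body with a JSON parser is likely more robust than a regex.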

combining strings to one string in r

I'm trying to combine some strings into one. In the end this string should be generated:
//*[@id="coll276"]
So the inner part of my string is a vector: tag <- 'coll276'
I already used the paste() function like this:
paste('//*[@id="',tag,'"]', sep = "")
But my result looks like the following: //*[@id=\"coll276\"]
I don't know why R is putting some \ into my string. How can I fix this problem?
Thanks a lot!
tldr: Don't worry about them, they're not really there. It's just something added by print
Those \ are escape characters that tell R to ignore the special properties of the characters that follow them. Look at the output of your paste function:
paste('//*[@id="',tag,'"]', sep = "")
[1] "//*[@id=\"coll276\"]"
You'll see that the output, since it is a string, is enclosed in double quotes "". Normally, the double quotes inside your string would break the string up into two strings with bare code in the middle:
"//*[@id=" coll276 "]"
To prevent this, R "escapes" the quotes in your string so they don't do this. This is just a visual effect. If you write your string to a file, you'll see that those escaping \ aren't actually there:
write(paste('//*[@id="',tag,'"]', sep = ""), 'out.txt')
This is what is in the file:
//*[@id="coll276"]
You can use cat to print the exact value of the string to the console (thanks @LukeC):
cat(paste('//*[@id="',tag,'"]', sep = ""))
//*[@id="coll276"]
Or use single quotes (if possible):
paste('//*[@id=\'',tag,'\']', sep = "")
[1] "//*[@id='coll276']"
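The same distinction between a string's value and its quoted, escaped display form can be illustrated in Python. This is a loose analogy rather than a description of R's internals: json.dumps is used here only because it produces a double-quoted form with \" escapes similar to what R's print shows.

```python
import json

tag = "coll276"

# The actual string value: it contains double quotes but no backslashes.
xpath = f'//*[@id="{tag}"]'

# A quoted, escaped display form of the same value.
displayed = json.dumps(xpath)

print(xpath)       # the raw value, comparable to cat() in R
print(displayed)   # the escaped form, comparable to print() in R
```

The backslashes exist only in the display form; the value itself is unchanged.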

Split a string by a plus sign (+) character

I have a string in a data frame as: "(1)+(2)"
I want to split with the delimiter "+" such that I get one element as (1) and the other as (2), hence preserving the parentheses. I used strsplit but it does not preserve the parentheses.
Use
strsplit("(1)+(2)", "\\+")
or
strsplit("(1)+(2)", "+", fixed = TRUE)
Using strsplit("(1)+(2)", "+") doesn't work because, unless specified otherwise, the split argument is a regular expression, and the + character is special in regex. Other characters that also need extra care are:
?
*
.
^
$
\
|
{ }
[ ]
( )
The following worked for me in Python:
import re
re.split('\\+', 'ABC+CDE')
Output:
['ABC', 'CDE']
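Python offers the same two options as the R answer above: escape the delimiter for the regex engine, or avoid regex altogether with a literal split. A small sketch using the string from the question:

```python
import re

text = "(1)+(2)"

# Literal split, the counterpart of fixed = TRUE in R.
parts_literal = text.split("+")

# Regex split with the delimiter escaped by re.escape, so "+" is not
# treated as a quantifier.
parts_regex = re.split(re.escape("+"), text)

print(parts_literal, parts_regex)
```

re.escape is handy when the delimiter comes from a variable and may contain any of the special characters listed above.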

Find word (not containing substrings) in comma separated string

I'm using a LINQ query where I do something like this:
viewModel.REGISTRATIONGRPS = (From a In db.TABLEA
Select New SubViewModel With {
.SOMEVALUE1 = a.SOMEVALUE1,
...
...
.SOMEVALUE2 = If(commaseparatedstring.Contains(a.SOMEVALUE1), True, False)
}).ToList()
Now my problem is that this doesn't search for words but for substrings, so for example:
commaseparatedstring = "EWM,KI,KP"
SOMEVALUE1 = "EW"
It returns True because "EW" is contained in "EWM".
What I need is to find whole words (not substrings) in the comma-separated string.
Option 1: Regular Expressions
Regex.IsMatch(commaseparatedstring, @"\b" + Regex.Escape(a.SOMEVALUE1) + @"\b")
The \b parts are called "word boundaries" and tell the regex engine that you are looking for a "full word". The Regex.Escape(...) ensures that the regex engine will not try to interpret "special characters" in the text you are trying to match. For example, if you are trying to match "one+two", the Regex.Escape method will return "one\+two".
Also, be sure to import the System.Text.RegularExpressions namespace at the top of your code file.
See Regex.IsMatch Method (String, String) on MSDN for more information.
Option 2: Split the String
You could also try splitting the string which would be a bit simpler, though probably less efficient.
commaseparatedstring.Split(new Char[] { ',' }).Contains( a.SOMEVALUE1 )
What about:
- splitting commaseparatedstring on commas
- comparing each resulting item for equality, instead of calling Contains() on the whole string?
.SOMEVALUE2 = If(commaseparatedstring.Split(","c).Contains(a.SOMEVALUE1), True, False)
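Both options can be sketched in Python to show how they differ from a plain substring check. The function names below are made up for this illustration; the sample values come from the question.

```python
import re

commaseparatedstring = "EWM,KI,KP"

def contains_word_regex(haystack: str, word: str) -> bool:
    """Option 1: whole-word match using word boundaries around the escaped word."""
    return re.search(r"\b" + re.escape(word) + r"\b", haystack) is not None

def contains_word_split(haystack: str, word: str) -> bool:
    """Option 2: split on commas and compare each item for equality."""
    return word in haystack.split(",")

# A plain substring test wrongly reports "EW" as present.
substring_hit = "EW" in commaseparatedstring

print(substring_hit,
      contains_word_regex(commaseparatedstring, "EW"),
      contains_word_split(commaseparatedstring, "EW"))
```

The two options agree here; they can diverge when values contain non-word characters, since \b is defined in terms of word characters while the split compares whole items.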
