Filter paths that have only one "/" in R using a regular expression

I have a vector of different paths such as
levs<-c( "20200507-30g_25d" , "20200507-30g_25d/ggg" , "20200507-30g_25d/grn", "20200507-30g_25d/ylw", "ggg" , "grn", "tre_livelli", "tre_livelli/20200507-30g_25d", "tre_livelli/20200507-30g_25d/ggg", "tre_livelli/20200507-30g_25d/grn", "tre_livelli/20200507-30g_25d/ylw" , "ylw" )
which is actually the output of a list.dirs with recursive set to TRUE.
I want to identify only the paths which have just one subfolder (that is "20200507-30g_25d/ggg" , "20200507-30g_25d/grn", "20200507-30g_25d/ylw").
My idea was to filter the vector for paths that contain exactly one "/" and then compare these with the paths that contain more than one "/" to get rid of the partial paths.
I tried with regular expression such as:
grep(levs, pattern='/{1}', value=T)
but I get this:
"20200507-30g_25d/ggg" "20200507-30g_25d/grn" "20200507-30g_25d/ylw" "tre_livelli/20200507-30g_25d" "tre_livelli/20200507-30g_25d/ggg" "tre_livelli/20200507-30g_25d/grn" "tre_livelli/20200507-30g_25d/ylw"
Any idea on how to proceed?

/{1} is a regex that is equal to / and just matches a / anywhere in a string, and there can be more than one / inside it. Please have a look at the regex tag page:
Using {1} as a single-repetition quantifier is harmless but never useful. It is basically an indication of inexperience and/or confusion.
h{1}t{1}t{1}p{1} matches the same string as the simpler expression http (or ht{2}p for that matter) but as you can see, the redundant {1} repetitions only make it harder to read.
You can use
grep(levs, pattern="^[^/]+/[^/]+$", value=TRUE)
# => [1] "20200507-30g_25d/ggg" "20200507-30g_25d/grn" "20200507-30g_25d/ylw" "tre_livelli/20200507-30g_25d"
See the regex demo. Details:
^ - matches the start of string
[^/]+ - one or more chars other than /
/ - a / char
[^/]+ - one or more chars other than /
$ - end of string.
NOTE: if the parts before or after the only / in the string can be empty, replace + with *: ^[^/]*/[^/]*$.
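For comparison, the same anchored pattern can be tried outside R. The following is a minimal Python sketch, assuming the paths are held in a plain list; the names ONE_SLASH and one_slash are made up for illustration.

```python
import re

levs = [
    "20200507-30g_25d", "20200507-30g_25d/ggg", "20200507-30g_25d/grn",
    "20200507-30g_25d/ylw", "ggg", "grn", "tre_livelli",
    "tre_livelli/20200507-30g_25d", "tre_livelli/20200507-30g_25d/ggg",
    "tre_livelli/20200507-30g_25d/grn", "tre_livelli/20200507-30g_25d/ylw",
    "ylw",
]

# One or more non-slash chars, exactly one slash, one or more non-slash chars.
ONE_SLASH = re.compile(r"[^/]+/[^/]+")

# fullmatch() anchors at both ends, so ^ and $ are not needed in the pattern.
one_slash = [p for p in levs if ONE_SLASH.fullmatch(p)]
print(one_slash)
```

As in the R version, "tre_livelli/20200507-30g_25d" is still kept, because it also contains exactly one "/".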

An option with str_count to count the number of instances of /
library(stringr)
levs[str_count(levs, "/") == 1 ]
-output
[1] "20200507-30g_25d/ggg" "20200507-30g_25d/grn"
[3] "20200507-30g_25d/ylw" "tre_livelli/20200507-30g_25d"
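Counting slashes still keeps "tre_livelli/20200507-30g_25d", which the question wants to drop because deeper paths extend it. Here is a sketch of that second step in Python, assuming a path counts as partial when some other path in the vector starts with it followed by "/". The names is_partial and wanted are hypothetical.

```python
levs = [
    "20200507-30g_25d", "20200507-30g_25d/ggg", "20200507-30g_25d/grn",
    "20200507-30g_25d/ylw", "ggg", "grn", "tre_livelli",
    "tre_livelli/20200507-30g_25d", "tre_livelli/20200507-30g_25d/ggg",
    "tre_livelli/20200507-30g_25d/grn", "tre_livelli/20200507-30g_25d/ylw",
    "ylw",
]


def is_partial(path, all_paths):
    """Return True if some other path continues below this one."""
    prefix = path + "/"
    return any(other.startswith(prefix) for other in all_paths)


# Keep paths with exactly one "/" that are not a prefix of a deeper path.
wanted = [p for p in levs if p.count("/") == 1 and not is_partial(p, levs)]
print(wanted)
```

Under that assumption only the three "20200507-30g_25d/..." paths remain.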

Related

Only select last part of a string after the last point

I have a dataframe with one column that represents the requests made by my users. A few examples look like this:
GET /enviro/html/tris/tris_overview.html
GET /./enviro/gif/emcilogo.gif
GET /docs/exposure/meta_exp.txt.html
GET /hrmd/
GET /icons/circle_logo_small.gif
I only want to select the last part of the string after the last "." in such a way that I return the pagetype of the string. The output of these lines should therefore be:
.html
.gif
.html
.gif
I tried doing this with sub but I only manage to select everything after the first "." example:
tring <- c("GET /enviro/html/tris/tris_overview.html", "GET /./enviro/gif/emcilogo.gif", "GET /docs/exposure/meta_exp.txt.html", "GET /hrmd/", "GET /icons/circle_logo_small.gif")
sub("^[^.]*", "", sapply(strsplit(tring, "\\s+"), `[`, 2))
this returns:
".html"
"./enviro/gif/emcilogo.gif"
".txt.html"
""
".gif"
I created the following gsub code that works for string containing two points:
gsub(pattern = ".*\\.", replacement = "", "GET /./enviro/gif/finds.gif", "\\s+")
this returns:
"gif"
However, I can't seem to come up with a single gsub/sub call that works for all possible input. It should read the string from right to left, stop at the first "." it sees, and return everything found after that ".".
I am new to R and I can't come up with something that is doing this. Any help would be highly appreciated!
You can't change the string parsing direction with R regex. Instead, you may match all up to . and remove it, or match the . that has no . chars to the right of it till the end of string.
string <- c('GET /enviro/html/tris/tris_overview.html','GET /./enviro/gif/emcilogo.gif','GET /docs/exposure/meta_exp.txt.html','GET /hrmd/','GET /icons/circle_logo_small.gif')
res <- regmatches(string, regexec("\\.[^.]*$", string))
res[lengths(res)==0] <- ""
unlist(res)
Or
sub("^(.*(?=\\.)|.*)", "", string, perl=TRUE)
See the R online demo. Both return
[1] ".html" ".gif" ".html" "" ".gif"
Here, \.[^.]*$ matches a . and then any 0+ chars other than . till the end of string. The sub code used ^(.*(?=\\.)|.*) pattern that matches the start of string, then either any 0+ chars as many as possible till . without consuming the dot, or just matches any 0+ chars as many as possible, and replaces the match with an empty string.
See Regex 1 and Regex 2 demos.
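The first pattern carries over to other regex engines unchanged. Below is a minimal Python sketch of the "last dot with no dots to its right" idea, assuming the requests are stored in a list; LAST_DOT, page_type and extensions are illustrative names, not part of the original answer.

```python
import re

lines = [
    "GET /enviro/html/tris/tris_overview.html",
    "GET /./enviro/gif/emcilogo.gif",
    "GET /docs/exposure/meta_exp.txt.html",
    "GET /hrmd/",
    "GET /icons/circle_logo_small.gif",
]

# A dot followed only by non-dot characters up to the end of the string.
LAST_DOT = re.compile(r"\.[^.]*$")


def page_type(request):
    """Return the text from the last dot onward, or "" if there is no dot."""
    match = LAST_DOT.search(request)
    return match.group(0) if match else ""


extensions = [page_type(line) for line in lines]
print(extensions)
```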
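Here is a solution that needs no regex pattern: it splits on a literal dot (fixed = TRUE) and keeps the last piece. Here a is the character vector of requests (called string above), and the result does not include the leading dot:
sapply(
seq_along(a),
function(i) {
# fixed = TRUE treats "." as a plain character rather than a regex
if (grepl(".", a[i], fixed = TRUE)) tail(strsplit(a[i], ".", fixed = TRUE)[[1]], 1) else ""
}
)
# [1] "html" "gif" "html" "" "gif"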
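The same split-on-the-last-dot idea can be sketched in Python with str.rpartition, which searches from the right and so matches the "read from right to left" wording of the question. The function name extension is hypothetical, and the sample list simply repeats the requests from the question.

```python
lines = [
    "GET /enviro/html/tris/tris_overview.html",
    "GET /./enviro/gif/emcilogo.gif",
    "GET /docs/exposure/meta_exp.txt.html",
    "GET /hrmd/",
    "GET /icons/circle_logo_small.gif",
]


def extension(request):
    """Return the text after the last dot, or "" when there is no dot."""
    # rpartition splits at the LAST occurrence; dot is "" if none was found.
    _head, dot, tail = request.rpartition(".")
    return tail if dot else ""


print([extension(line) for line in lines])
```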

Regex for "Characters Numbers"

I need a Regex that matches these Strings:
Test 1
Test 123
Test 1.1 (not required but would be neat)
Test
Test a
But not the following:
Test 1a
I don't know what the pattern should look like so that it allows text or whitespace at the end, but not when a number comes directly before it.
I tried this one
^.*([0-9])$ (matches only Test 1, but not for example Test or Test a)
and this one
^.*[0-9].$ (matches only Test 1a, but not for example Test or Test 1)
but they don't match what I need.
This is working for all cases you provided
^\w+(\s(\d+(\.\d+)?|[a-z]))?$
Regex Demo
Regex Breakdown
^ #Start of string
\w+ #Match one or more word characters (letters, digits, underscore)
(\s #Match a single whitespace character
(
\d+ #Match one or more digits
(\.\d+)? #Dot followed by digits, i.e. the decimal part (optional)
| #Alternation(OR)
[a-z] #Match a single lowercase letter
)
)? #Make the whole group optional
$ #End of string
If you also want to include capital letters, then you can use
^\w+(\s(\d+(\.\d+)?|[A-Za-z]))?$
Try with
^\w+\s+((\d+\.\d+)|(\d+)|([^\d^\s]\w+))?\s*$
Another pattern for you to try:
^(Test(?:$|\s(?:\d$|[a-z]$|\d{3}|\d\.\d$)))
LIVE DEMO.
As per your strings in your question (and your comments):
^\w+(\s[a-z]|\s\d+(\.\d+)?)?$
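A quick way to compare candidate patterns is to run them against the sample strings from the question. The sketch below does this in Python for the variant that also allows capital letters; the variable names are made up for illustration.

```python
import re

# Word chars, optionally followed by one whitespace char and then either
# a number (with optional decimal part) or a single letter.
PATTERN = re.compile(r"^\w+(\s(\d+(\.\d+)?|[A-Za-z]))?$")

should_match = ["Test 1", "Test 123", "Test 1.1", "Test", "Test a"]
should_fail = ["Test 1a"]

# Map every sample string to True (matched) or False (rejected).
results = {s: bool(PATTERN.match(s)) for s in should_match + should_fail}
for text, matched in results.items():
    print(f"{text!r}: {matched}")
```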

Split a string by a plus sign (+) character

I have a string in a data frame as: "(1)+(2)"
I want to split with delimiter "+" such that I get one element as (1) and other as (2), hence preserving the parentheses. I used strsplit but it does not preserve the parenthesis.
Use
strsplit("(1)+(2)", "\\+")
or
strsplit("(1)+(2)", "+", fixed = TRUE)
Using strsplit("(1)+(2)", "+") doesn't work because, unless specified otherwise, the split argument is treated as a regular expression, and + is a special character in regex. Other characters that also need extra care are
?
*
.
^
$
\
|
{ }
[ ]
( )
The following worked for me in Python:
import re
re.split('\\+', 'ABC+CDE')
Output:
['ABC', 'CDE']
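Continuing the Python example, here is a short sketch of the two ways to treat "+" literally: escaping it for the regex engine with re.escape, or avoiding regex altogether with str.split. It also shows that a bare "+" is rejected as a pattern. The variable names are illustrative.

```python
import re

text = "(1)+(2)"

# re.escape() backslash-escapes regex metacharacters such as +.
by_regex = re.split(re.escape("+"), text)

# str.split() takes a literal separator, so no escaping is needed.
by_literal = text.split("+")

# A bare "+" is a quantifier with nothing to repeat, so it is not a valid pattern.
try:
    re.split("+", text)
    bare_plus_ok = True
except re.error:
    bare_plus_ok = False

print(by_regex, by_literal, bare_plus_ok)
```

Both splits keep the parentheses, since only the "+" is consumed as the delimiter.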

Find word (not containing substrings) in comma separated string

I'm using a LINQ query where I do something like this:
viewModel.REGISTRATIONGRPS = (From a In db.TABLEA
Select New SubViewModel With {
.SOMEVALUE1 = a.SOMEVALUE1,
...
...
.SOMEVALUE2 = If(commaseparatedstring.Contains(a.SOMEVALUE1), True, False)
}).ToList()
Now my problem is that this doesn't search for words but for substrings. For example:
commaseparatedstring = "EWM,KI,KP"
SOMEVALUE1 = "EW"
It returns true because "EW" is contained in "EWM".
What I need is to find whole words (not substrings) in the comma-separated string.
Option 1: Regular Expressions
Regex.IsMatch(commaseparatedstring, @"\b" + Regex.Escape(a.SOMEVALUE1) + @"\b")
The \b parts are called "word boundaries" and tell the regex engine that you are looking for a "full word". The Regex.Escape(...) ensures that the regex engine will not try to interpret "special characters" in the text you are trying to match. For example, if you are trying to match "one+two", the Regex.Escape method will return "one\+two".
Also, be sure to import the System.Text.RegularExpressions namespace at the top of your code file.
See Regex.IsMatch Method (String, String) on MSDN for more information.
Option 2: Split the String
You could also try splitting the string which would be a bit simpler, though probably less efficient.
commaseparatedstring.Split(new Char[] { ',' }).Contains( a.SOMEVALUE1 )
What about:
- splitting commaseparatedstring on the comma
- checking each resulting item for equality, instead of calling Contains() on the whole string?
.SOMEVALUE2 = commaseparatedstring.Split(","c).Contains(a.SOMEVALUE1)
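The difference between a substring test and the two whole-word approaches can be checked in a few lines. This is a Python sketch rather than VB.NET, assuming there are no spaces around the commas; contains_word and contains_word_regex are hypothetical helper names.

```python
import re

csv = "EWM,KI,KP"


def contains_word(csv_text, word):
    """Split on commas and test for an exact item match."""
    return word in csv_text.split(",")


def contains_word_regex(csv_text, word):
    """Search for the escaped word surrounded by word boundaries."""
    return re.search(r"\b" + re.escape(word) + r"\b", csv_text) is not None


print("EW" in csv)                     # plain substring test
print(contains_word(csv, "EW"))        # whole-item test
print(contains_word_regex(csv, "KI"))  # word-boundary test
```

The split version compares whole items, while the regex version relies on \b, which also treats characters such as "-" or a space as a boundary. For values that contain such characters the two can disagree.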

xQuery substring problem

I now have a full path for a file as a string like:
"/db/Liebherr/Content_Repository/Techpubs/Topics/HyraulicPowerDistribution/Released/TRN_282C_HYD_MOD_1_Drive_Shaft_Rev000.xml"
However, I now need only the folder path, that is, the above string without whatever follows the last slash:
"/db/Liebherr/Content_Repository/Techpubs/Topics/HyraulicPowerDistribution/Released/"
But it seems that the substring() function in XQuery only comes as substring(string,start,len) or substring(string,start). I am trying to figure out a way to specify the last occurrence of the slash, but have had no luck.
Could experts help? Thanks!
Try out the tokenize() function (for splitting a string into its component parts) and then re-assembling it, using everything but the last part.
let $full-path := "/db/Liebherr/Content_Repository/Techpubs/Topics/HyraulicPowerDistribution/Released/TRN_282C_HYD_MOD_1_Drive_Shaft_Rev000.xml",
$segments := tokenize($full-path,"/")[position() ne last()]
return
concat(string-join($segments,'/'),'/')
For more details on these functions, check out their reference pages:
fn:tokenize()
fn:string-join()
fn:replace can do the job with a regular expression:
replace("/db/Liebherr/Content_Repository/Techpubs/Topics/HyraulicPowerDistribution/Released/TRN_282C_HYD_MOD_1_Drive_Shaft_Rev000.xml",
"[^/]+$",
"")
This can be done even with a single XPath 2.0 (subset of XQuery) expression:
substring($fullPath,
1,
string-length($fullPath) - string-length(tokenize($fullPath, '/')[last()])
)
where $fullPath should be substituted with the actual string, such as:
"/db/Liebherr/Content_Repository/Techpubs/Topics/HyraulicPowerDistribution/Released/TRN_282C_HYD_MOD_1_Drive_Shaft_Rev000.xml"
The following code tokenizes, removes the last token, replaces it with an empty string, and joins back.
string-join(
(
tokenize(
"/db/Liebherr/Content_Repository/Techpubs/Topics/HyraulicPowerDistribution/Released/TRN_282C_HYD_MOD_1_Drive_Shaft_Rev000.xml",
"/"
)[position() ne last()],
""
),
"/"
)
It seems to return the desired result on try.zorba-xquery.com. Does this help?
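To cross-check the expected result outside XQuery, the regex-replace and split-at-the-last-slash approaches can be sketched in Python. This is only an illustration with made-up variable names, not an XQuery solution.

```python
import re

full_path = (
    "/db/Liebherr/Content_Repository/Techpubs/Topics/"
    "HyraulicPowerDistribution/Released/"
    "TRN_282C_HYD_MOD_1_Drive_Shaft_Rev000.xml"
)

# Same idea as replace($path, "[^/]+$", ""): drop the trailing non-slash run.
by_regex = re.sub(r"[^/]+$", "", full_path)

# Same idea as tokenize + string-join: split at the last "/" and keep the head.
head, sep, _last = full_path.rpartition("/")
by_partition = head + sep

print(by_regex)
print(by_partition)
```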

Resources