Only select last part of a string after the last point - r

I have a dataframe with one column that represents the requests made by my users. A few examples look like this:
GET /enviro/html/tris/tris_overview.html
GET /./enviro/gif/emcilogo.gif
GET /docs/exposure/meta_exp.txt.html
GET /hrmd/
GET /icons/circle_logo_small.gif
I only want to select the last part of the string after the last "." in such a way that I return the pagetype of the string. The output of these lines should therefore be:
.html
.gif
.html
.gif
I tried doing this with sub, but I only manage to select everything after the first ".". Example:
tring <- c("GET /enviro/html/tris/tris_overview.html", "GET /./enviro/gif/emcilogo.gif", "GET /docs/exposure/meta_exp.txt.html", "GET /hrmd/", "GET /icons/circle_logo_small.gif")
sub("^[^.]*", "", sapply(strsplit(tring, "\\s+"), `[`, 2))
this returns:
".html"
"./enviro/gif/emcilogo.gif"
".txt.html"
""
".gif"
I created the following gsub code that works for string containing two points:
gsub(pattern = ".*\\.", replacement = "", "GET /./enviro/gif/finds.gif", "\\s+")
this returns:
"gif"
However, I can't seem to come up with one gsub/sub call that works for all possible input. It should read the string from right to left, stop when it sees the first ".", and return everything that was found after that ".".
I am new to R and I can't come up with something that is doing this. Any help would be highly appreciated!

You can't change the string parsing direction with R regex. Instead, you may match all up to . and remove it, or match the . that has no . chars to the right of it till the end of string.
string <- c('GET /enviro/html/tris/tris_overview.html','GET /./enviro/gif/emcilogo.gif','GET /docs/exposure/meta_exp.txt.html','GET /hrmd/','GET /icons/circle_logo_small.gif')
res <- regmatches(string, regexec("\\.[^.]*$", string))
res[lengths(res)==0] <- ""
unlist(res)
Or
sub("^(.*(?=\\.)|.*)", "", string, perl=TRUE)
See the R online demo. Both return
[1] ".html" ".gif" ".html" "" ".gif"
Here, \.[^.]*$ matches a . and then any 0+ chars other than . till the end of string. The sub code used ^(.*(?=\\.)|.*) pattern that matches the start of string, then either any 0+ chars as many as possible till . without consuming the dot, or just matches any 0+ chars as many as possible, and replaces the match with an empty string.
See Regex 1 and Regex 2 demos.
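Since the question mentions a data frame column, the first pattern could be wrapped in a small helper and applied to that column. This is only a sketch: the function name `extract_ext`, the data frame `df`, and its `request` column are invented for illustration and do not come from the question.

```r
# Hypothetical helper: return the text from the last "." to the end of each
# string, or "" when the string contains no "." at all.
extract_ext <- function(x) {
  m <- regexpr("\\.[^.]*$", x)      # start of the last "." match, or -1
  out <- rep("", length(x))         # default for strings without a "."
  out[m != -1] <- regmatches(x, m)  # regmatches() returns the matches only
  out
}

# Assumed example data; the column name `request` is made up for this sketch.
df <- data.frame(
  request = c("GET /enviro/html/tris/tris_overview.html",
              "GET /./enviro/gif/emcilogo.gif",
              "GET /docs/exposure/meta_exp.txt.html",
              "GET /hrmd/",
              "GET /icons/circle_logo_small.gif"),
  stringsAsFactors = FALSE
)

df$pagetype <- extract_ext(df$request)
df$pagetype
# values: ".html" ".gif" ".html" "" ".gif"
```

Pre-filling the result with "" keeps it the same length as the input, which matters when the result is assigned back to a column.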

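Here is a solution that avoids regular expressions by splitting on a literal "." (note that it returns the extension without the leading dot). It uses the string vector defined above:
sapply(
  seq_along(string),
  function(i) {
    # fixed = TRUE treats "." as a literal character, not a regex
    if (grepl(".", string[i], fixed = TRUE)) tail(strsplit(string[i], ".", fixed = TRUE)[[1]], 1) else ""
  }
)
# [1] "html" "gif" "html" "" "gif"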
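For strings like the ones in the question, `file_ext()` from the `tools` package that ships with R may also be enough. It returns the extension without the leading dot, only recognises alphanumeric extensions, and gives "" when there is none, so the sketch below is an option that fits the example data rather than a general answer. The variable names `ext` and `with_dot` are made up here.

```r
string <- c("GET /enviro/html/tris/tris_overview.html",
            "GET /./enviro/gif/emcilogo.gif",
            "GET /docs/exposure/meta_exp.txt.html",
            "GET /hrmd/",
            "GET /icons/circle_logo_small.gif")

# Extension after the last ".", without the dot; "" when nothing qualifies.
ext <- tools::file_ext(string)
ext
# values: "html" "gif" "html" "" "gif"

# Put the dot back only where an extension was found.
with_dot <- ifelse(nzchar(ext), paste0(".", ext), "")
with_dot
# values: ".html" ".gif" ".html" "" ".gif"
```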

Related

Concatenate string with escape characters to bookend string with "\"... \"" in R

I have a character vector of file paths that look like this:
xx <- c("data/lsa_two_isl_prosp_u.csv",
"data/lsa_two_isl_prosp_d.csv" ,
"data/lsa_two_isl_propsuit_u.csv")
However, I need these file paths to have "\"" concatenated onto the beginning of the string and "\"" concatenated onto the end, so that my string looks like this:
xx <- c("\"data/lsa_two_isl_prosp_u.csv\"",
"\"data/lsa_two_isl_prosp_d.csv\"" ,
"\"data/lsa_two_isl_propsuit_u.csv\"")
Normally I would use paste, but the "\"... \"" are escape characters that need each other to 'bookend' a string.
In hindsight this was an obviously doomed idea, but I am sharing it in case anyone else tries the same thing: if I use paste('"\"', xx, '\""'), I get "\"\" data/lsa_two_isl_prosp_d.csv \"\"", which is obviously wrong, and I cannot remove the excess portions of the string without throwing out all of it.
Any suggestions?
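Found the answer after a lot of trial and error (note paste0 rather than paste, because paste would insert a space between the quote and the path):
xx <- paste0("\"", xx, "\"")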
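The backslashes only show up when R prints the string; they are not stored in it. A minimal sketch of a few equivalent ways to wrap each path in double quotes follows. The variable names `quoted`, `quoted_sprintf`, and `quoted_dquote` are made up, and the `q = FALSE` argument of `dQuote()` assumes R 3.6.0 or later.

```r
xx <- c("data/lsa_two_isl_prosp_u.csv",
        "data/lsa_two_isl_prosp_d.csv",
        "data/lsa_two_isl_propsuit_u.csv")

# paste0() joins its arguments with no separator
# (paste() would need sep = "" to avoid the default space).
quoted <- paste0("\"", xx, "\"")

# Same result with a format string.
quoted_sprintf <- sprintf("\"%s\"", xx)

# Same result with dQuote(); q = FALSE asks for plain ASCII quotes.
quoted_dquote <- dQuote(xx, q = FALSE)

print(quoted[1])       # escaped display: "\"data/lsa_two_isl_prosp_u.csv\""
writeLines(quoted[1])  # actual characters: "data/lsa_two_isl_prosp_u.csv"
nchar(quoted[1]) - nchar(xx[1])  # 2: only the two quote characters were added
```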

filter paths that have only one "/" in R using regular expression

I have a vector of different paths such as
levs<-c( "20200507-30g_25d" , "20200507-30g_25d/ggg" , "20200507-30g_25d/grn", "20200507-30g_25d/ylw", "ggg" , "grn", "tre_livelli", "tre_livelli/20200507-30g_25d", "tre_livelli/20200507-30g_25d/ggg", "tre_livelli/20200507-30g_25d/grn", "tre_livelli/20200507-30g_25d/ylw" , "ylw" )
which is actually the output of a list.dirs with recursive set to TRUE.
I want to identify only the paths which have just one subfolder (that is "20200507-30g_25d/ggg" , "20200507-30g_25d/grn", "20200507-30g_25d/ylw").
I thought to filter the vector to find only those paths that have only one "/" and then compare these with the ones that have more than one "/" to get rid of the partial paths.
I tried with regular expression such as:
grep(levs,pattern='/{1}', value=T)
but I get this:
"20200507-30g_25d/ggg" "20200507-30g_25d/grn" "20200507-30g_25d/ylw" "tre_livelli/20200507-30g_25d" "tre_livelli/20200507-30g_25d/ggg" "tre_livelli/20200507-30g_25d/grn" "tre_livelli/20200507-30g_25d/ylw"
Any idea on how to proceed?
/{1} is a regex that is equal to / and just matches a / anywhere in a string, and there can be more than one / inside it. Please have a look at the regex tag page:
Using {1} as a single-repetition quantifier is harmless but never useful. It is basically an indication of inexperience and/or confusion.
h{1}t{1}t{1}p{1} matches the same string as the simpler expression http (or ht{2}p for that matter) but as you can see, the redundant {1} repetitions only make it harder to read.
You can use
grep(levs, pattern="^[^/]+/[^/]+$", value=TRUE)
# => [1] "20200507-30g_25d/ggg" "20200507-30g_25d/grn" "20200507-30g_25d/ylw" "tre_livelli/20200507-30g_25d"
See the regex demo:
^ - matches the start of string
[^/]+ - one or more chars other than /
/ - a / char
[^/]+ - one or more chars other than /
$ - end of string.
NOTE: if the parts before or after the only / in the string can be empty, replace + with *: ^[^/]*/[^/]*$.
An option with str_count to count the number of instances of /
library(stringr)
levs[str_count(levs, "/") == 1 ]
-output
[1] "20200507-30g_25d/ggg" "20200507-30g_25d/grn"
[3] "20200507-30g_25d/ylw" "tre_livelli/20200507-30g_25d"
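Both approaches still return "tre_livelli/20200507-30g_25d", which the question wanted to exclude because it is only a partial path of deeper directories. One possible follow-up step, sketched here under the assumption that "partial" means "some other entry of levs continues below it", is to drop the one-slash paths that have children. The names `one_slash`, `has_children`, and `leaf_paths` are made up for this example.

```r
levs <- c("20200507-30g_25d", "20200507-30g_25d/ggg", "20200507-30g_25d/grn",
          "20200507-30g_25d/ylw", "ggg", "grn", "tre_livelli",
          "tre_livelli/20200507-30g_25d", "tre_livelli/20200507-30g_25d/ggg",
          "tre_livelli/20200507-30g_25d/grn", "tre_livelli/20200507-30g_25d/ylw",
          "ylw")

# Paths with exactly one "/", as in the answer above.
one_slash <- grep("^[^/]+/[^/]+$", levs, value = TRUE)

# TRUE when some entry of levs lies below p, i.e. starts with "p/".
has_children <- vapply(
  one_slash,
  function(p) any(startsWith(levs, paste0(p, "/"))),
  logical(1)
)

leaf_paths <- one_slash[!has_children]
leaf_paths
# values: "20200507-30g_25d/ggg" "20200507-30g_25d/grn" "20200507-30g_25d/ylw"
```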

Removing a specific first item in a string in R

I have strings such as:
'THE HOUSE'
'IN THE HOUSE'
'THE THE HOUSE'
And I would like to remove 'THE' only if it occurs at the first position in the string.
I know how to remove 'THE' with:
gsub("\\<THE\\>", "", string)
And I know how to grab the first word with:
"([A-Za-z]+)" or "([[:alpha:]]+)"or "(\\w+)"
But no idea how to combine the two to end up having:
'HOUSE'
'IN THE HOUSE'
'THE HOUSE'
Cheers!
You may use
string <- c("THE HOUSE", "IN THE HOUSE", "THE THE HOUSE")
sub("^THE\\b\\s*", "", string)
## => [1] "HOUSE" "IN THE HOUSE" "THE HOUSE"
See the regex demo and an online R demo.
Details
^ - start of string
THE - a literal substring
\\b - a word boundary (you may keep \\> trailing word boundary if you wish)
\\s* - 0+ whitespace chars.
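To see what the `^` anchor and the word boundary each contribute, here is a small comparison. It is a sketch: the extra test string "THEATER" and the variable names are invented, and `perl = TRUE` is used here only so that `\\b` is handled by the PCRE engine in both calls.

```r
string <- c("THE HOUSE", "IN THE HOUSE", "THE THE HOUSE", "THEATER")

# Anchored: only a leading whole word THE (plus following whitespace) is removed.
anchored <- sub("^THE\\b\\s*", "", string, perl = TRUE)
anchored
# values: "HOUSE" "IN THE HOUSE" "THE HOUSE" "THEATER"

# Unanchored with gsub(): every whole word THE is removed, wherever it occurs.
everywhere <- gsub("\\bTHE\\b\\s*", "", string, perl = TRUE)
everywhere
# values: "HOUSE" "IN HOUSE" "HOUSE" "THEATER"
```

In both cases "THEATER" is left alone, because there is no word boundary between "THE" and "A".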

REGEX: Remove middle of string after certain number of "/"

How do I remove the middle of a string using regex. I have the following url:
https://www.sec.gov/Archives/edgar/data/1347185/000134718517000016/0001347185-17-000016-index.htm/exh1025730032017.xml
but I want it to look like this:
https://www.sec.gov/Archives/edgar/data/1347185/000134718517000016/exh1025730032017.xml
I can get rid of everything after "data/../../"
That last long string of numbers isn't needed.
I tried this
sub(sprintf("^((?:[^/]*;){8}).*"),"", URLxml)
But it doesn't do anything! Help please!
To remove the last but one subpart of the path, you may use
x <- "https://www.sec.gov/Archives/edgar/data/1347185/000134718517000016/0001347185-17-000016-index.htm/exh1025730032017.xml"
sub("^(.*/).*/(.*)", "\\1\\2", x)
## [1] "https://www.sec.gov/Archives/edgar/data/1347185/000134718517000016/exh1025730032017.xml"
See the online R demo and here is a regex demo.
Details:
^ - start of a string
(.*/) - Group 1 (referred to with \1 from the replacement string) any 0+ chars up to the last but one /
.*/ - any 0+ chars up to the last /
(.*) - Group 2 (referred to with \2 backreference from the replacement string) any 0+ chars up to the end.
a<-'https://www.sec.gov/Archives/edgar/data/1347185/000134718517000016/0001347185-17-000016-index.htm/exh1025730032017.xml'
gsub('data/(.+?)/(.+?)/(.+?)/','data/\\1/\\2/',a)
So in the URL, the third path segment after data/ is the one that gets removed:
data/.../.../..(this is removed)../ ....
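If a regex feels too opaque here, the same result can be reached by splitting the URL on "/" and dropping the second-to-last piece. This is a sketch, not either answerer's method, and the function name `drop_second_to_last` is hypothetical.

```r
# Hypothetical helper: remove the second-to-last "/"-separated segment.
drop_second_to_last <- function(url) {
  parts <- strsplit(url, "/", fixed = TRUE)[[1]]
  n <- length(parts)
  if (n < 2) return(url)                  # nothing to drop
  paste(parts[-(n - 1)], collapse = "/")  # rejoin without segment n - 1
}

x <- "https://www.sec.gov/Archives/edgar/data/1347185/000134718517000016/0001347185-17-000016-index.htm/exh1025730032017.xml"

shortened <- drop_second_to_last(x)
shortened
# "https://www.sec.gov/Archives/edgar/data/1347185/000134718517000016/exh1025730032017.xml"
```

The empty piece produced by the "//" after "https:" is kept by strsplit(), so rejoining with "/" restores the scheme unchanged.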

Digits Regular Expression Validator

I need a regular expression validation for
numeric digits grouped as X-XXXXX-XXX-X.
Can anyone help?
Regex reg = new Regex(@"\b[0-9]-[0-9]{5}-[0-9]{3}-[0-9]\b");
Here is what I use for checking social security numbers that users input:
Public Shared Function CheckSSNFormat(ByVal text As String) As Boolean
Dim digits As String = Regex.Replace(text, "[^0-9]", "")
Return digits.Length = 9
End Function
It doesn't check that they are input in a specific format, but that might be better depending on what you really need, so I just thought I'd give you another option, just in case.
The above just removes everything except digits, and returns true if there are 9 digits (a valid SS#). It does mean some goofy user could enter something like: hello123456789 and it would accept it as valid, but that is fine for me, and I'd rather do that than not accept 123456789 just because I was looking for 123-45-6789 only.
Later I use this to save to my database:
Public Shared Function FormatSSNForSaving(ByVal text As String) As String
If text = "" Then text = "000-00-0000"
Return Regex.Replace(text, "[^0-9]", "")
End Function
and this anytime I want to display the value (actually I use this one for phone numbers, turns out I never display the SS# so don't have a function for it):
Public Shared Function FormatPhoneForDisplay(ByVal text As String) As String
If text.Length <> 10 Then Return text
Return "(" & text.Substring(0, 3) & ") " & text.Substring(3, 3) & "-" & text.Substring(6, 4)
End Function
(^\d{1}-\d{5}-\d{3}-\d{1}$), this should do.
[0-9]-[0-9]{5}-[0-9]{3}-[0-9]
you could also use round brackets to extract the numbers if you want:
([0-9])-([0-9]{5})-([0-9]{3})-([0-9])
and get the values with $1 $2 etc. in the Regex.Replace() function
Regex pattern = new Regex(@"\b\d-\d{5}-\d{3}-\d\b");
\b - word boundary
\d - digit
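The anchored pattern (^...$) and the word-boundary pattern (\b...\b) suggested above do not accept the same inputs: the anchored one requires the whole string to be the number, while the word-boundary one also accepts the number embedded in longer text. The sketch below shows the difference using R's grepl() with perl = TRUE rather than the .NET Regex class, only to stay consistent with the other examples on this page; the candidate strings are invented.

```r
candidates <- c("1-23456-789-0",     # exactly X-XXXXX-XXX-X
                "id 1-23456-789-0",  # valid number inside longer text
                "1-23456-789-01",    # one digit too many at the end
                "12-3456-789-0")     # wrong grouping

# Whole string must match: suitable for validating a single input field.
anchored <- grepl("^\\d-\\d{5}-\\d{3}-\\d$", candidates, perl = TRUE)
anchored
# TRUE FALSE FALSE FALSE

# Match anywhere, delimited by word boundaries: suitable for searching text.
bounded <- grepl("\\b\\d-\\d{5}-\\d{3}-\\d\\b", candidates, perl = TRUE)
bounded
# TRUE TRUE FALSE FALSE
```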

Resources