How do I remove the middle of a string using regex? I have the following URL:
https://www.sec.gov/Archives/edgar/data/1347185/000134718517000016/0001347185-17-000016-index.htm/exh1025730032017.xml
but I want it to look like this:
https://www.sec.gov/Archives/edgar/data/1347185/000134718517000016/exh1025730032017.xml
The part right after "data/../../" can go; that last long string of numbers isn't needed.
I tried this:
sub(sprintf("^((?:[^/]*;){8}).*"),"", URLxml)
But it doesn't do anything! Help please!
To remove the second-to-last segment of the path, you may use
x <- "https://www.sec.gov/Archives/edgar/data/1347185/000134718517000016/0001347185-17-000016-index.htm/exh1025730032017.xml"
sub("^(.*/).*/(.*)", "\\1\\2", x)
## [1] "https://www.sec.gov/Archives/edgar/data/1347185/000134718517000016/exh1025730032017.xml"
Details:
^ - start of a string
(.*/) - Group 1 (referred to with \1 from the replacement string) any 0+ chars up to the last but one /
.*/ - any 0+ chars up to the last /
(.*) - Group 2 (referred to with \2 backreference from the replacement string) any 0+ chars up to the end.
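As a minimal sketch of how the pattern above behaves on other inputs, the call can be wrapped in a small helper. The name drop_second_to_last and the short test paths are illustrative assumptions, not part of the original answer.

```r
# Hypothetical helper: drop the second-to-last "/"-separated segment.
drop_second_to_last <- function(paths) {
  # Group 1 ends at the second-to-last "/", the unparenthesised ".*/"
  # consumes the segment to drop, and group 2 keeps what follows the last "/".
  sub("^(.*/).*/(.*)", "\\1\\2", paths)
}

urls <- c(
  "https://www.sec.gov/Archives/edgar/data/1347185/000134718517000016/0001347185-17-000016-index.htm/exh1025730032017.xml",
  "a/b/c",  # two slashes: the middle segment "b" is dropped
  "a/b"     # only one slash: the pattern cannot match, input is returned unchanged
)

result <- drop_second_to_last(urls)
print(result)
```

Inputs with fewer than two slashes are left as they are, because sub() returns the original string when the pattern does not match.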
a<-'https://www.sec.gov/Archives/edgar/data/1347185/000134718517000016/0001347185-17-000016-index.htm/exh1025730032017.xml'
gsub('data/(.+?)/(.+?)/(.+?)/','data/\\1/\\2/',a)
So, in the URL, the third segment after data/ is the one that gets removed:
data/.../.../..(this is removed)../ ....
I have a vector of different paths such as
levs<-c( "20200507-30g_25d" , "20200507-30g_25d/ggg" , "20200507-30g_25d/grn", "20200507-30g_25d/ylw", "ggg" , "grn", "tre_livelli", "tre_livelli/20200507-30g_25d", "tre_livelli/20200507-30g_25d/ggg", "tre_livelli/20200507-30g_25d/grn", "tre_livelli/20200507-30g_25d/ylw" , "ylw" )
which is actually the output of a list.dirs with recursive set to TRUE.
I want to identify only the paths which have just one subfolder (that is "20200507-30g_25d/ggg" , "20200507-30g_25d/grn", "20200507-30g_25d/ylw").
My idea was to filter the vector for paths that contain exactly one "/" and then compare these with the paths that contain more than one "/" to get rid of the partial paths.
I tried a regular expression such as:
grep(levs, pattern='/{1}', value=TRUE)
but I get this:
"20200507-30g_25d/ggg" "20200507-30g_25d/grn" "20200507-30g_25d/ylw" "tre_livelli/20200507-30g_25d" "tre_livelli/20200507-30g_25d/ggg" "tre_livelli/20200507-30g_25d/grn" "tre_livelli/20200507-30g_25d/ylw"
Any idea on how to proceed?
/{1} is equivalent to / and simply matches a / anywhere in the string, so strings that contain more than one / still match. Please have a look at the regex tag page:
Using {1} as a single-repetition quantifier is harmless but never useful. It is basically an indication of inexperience and/or confusion.
h{1}t{1}t{1}p{1} matches the same string as the simpler expression http (or ht{2}p for that matter) but as you can see, the redundant {1} repetitions only make it harder to read.
You can use
grep(levs, pattern="^[^/]+/[^/]+$", value=TRUE)
# => [1] "20200507-30g_25d/ggg" "20200507-30g_25d/grn" "20200507-30g_25d/ylw" "tre_livelli/20200507-30g_25d"
Pattern details:
^ - matches the start of string
[^/]+ - one or more chars other than /
/ - a / char
[^/]+ - one or more chars other than /
$ - end of string.
NOTE: if the parts before or after the only / in the string can be empty, replace + with *: ^[^/]*/[^/]*$.
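The points above can be checked with a short sketch. The shortened levs vector and the extra test strings are assumptions chosen for illustration.

```r
levs <- c("20200507-30g_25d", "20200507-30g_25d/ggg",
          "tre_livelli/20200507-30g_25d/ggg", "ggg")

# "/{1}" and "/" select exactly the same elements: every string containing a slash.
same <- identical(grep("/{1}", levs), grep("/", levs))

# Anchoring both ends and excluding "/" on each side allows exactly one slash.
one_slash <- grep("^[^/]+/[^/]+$", levs, value = TRUE)

# With "*" instead of "+", an empty part before or after the slash is accepted too.
empty_ok <- grepl("^[^/]*/[^/]*$", c("ggg/", "/ggg", "a/b/c"))

print(same)
print(one_slash)
print(empty_ok)
```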
An option with str_count to count the number of instances of /
library(stringr)
levs[str_count(levs, "/") == 1 ]
-output
[1] "20200507-30g_25d/ggg" "20200507-30g_25d/grn"
[3] "20200507-30g_25d/ylw" "tre_livelli/20200507-30g_25d"
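Both answers still return "tre_livelli/20200507-30g_25d", which the question treats as a partial path. A possible follow-up, sketched here in base R under the assumption that a path counts as partial when another element of levs continues it with a further "/", is to drop such prefixes after counting the slashes.

```r
levs <- c("20200507-30g_25d", "20200507-30g_25d/ggg", "20200507-30g_25d/grn",
          "20200507-30g_25d/ylw", "ggg", "grn", "tre_livelli",
          "tre_livelli/20200507-30g_25d", "tre_livelli/20200507-30g_25d/ggg",
          "tre_livelli/20200507-30g_25d/grn", "tre_livelli/20200507-30g_25d/ylw",
          "ylw")

# Count "/" per element without extra packages: length difference after removing them.
n_slash <- nchar(levs) - nchar(gsub("/", "", levs, fixed = TRUE))
one_slash <- levs[n_slash == 1]

# A one-slash path is treated as partial if some other path starts with it plus "/".
has_children <- vapply(
  one_slash,
  function(p) any(startsWith(levs, paste0(p, "/"))),
  logical(1),
  USE.NAMES = FALSE
)

leaf <- one_slash[!has_children]
print(leaf)
```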
I have below input strings:
string1:
xyx;;;;str1=P1:P2|str2=1/3|str3=s1:s2
string2:
mzn;;;;str1 = P3:P4 | str2 = 2/5
result expected:
for string1:
str1_val=P1:P2
str2_val=1/3
for string2:
str1_val=P3:P4
str2_val=2/5
I tried with
str1_val= REGEXP_SUBSTR('xyx;;;;str1=P1:P2|strt2=1/3|str3=s1:s2', '(?<=str1=)(.?)(?=|)') - working fine
str2_val=REGEXP_SUBSTR('xyx;;;;str1=P1:P2|str2=1/3|str3=s1:s2', '(?<=str2=)(.?)(?=|)') - working fine
This works for string1 but not for string2.
Please suggest one approach that works for both cases.
You need to allow optional spaces, but a lookbehind only accepts fixed-length patterns. \K is a similar tool: it resets the start of the match, i.e. it discards everything matched so far:
REGEXP_SUBSTR(s,'str1\s*=\s*\K([^|]+)')
\s* = optional whitespace
\K = reset start of match
([^|]+) = one or more chars other than |
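The same \K idea can be tried outside the database. The sketch below uses R with perl = TRUE (PCRE) instead of REGEXP_SUBSTR; the helper name extract_val is hypothetical, and the key is assumed to contain no regex metacharacters. Because [^|]+ also captures a space that precedes the next |, the result is trimmed.

```r
inputs <- c(
  "xyx;;;;str1=P1:P2|str2=1/3|str3=s1:s2",
  "mzn;;;;str1 = P3:P4 | str2 = 2/5"
)

# Hypothetical helper: return the value after `key`, tolerating spaces around "=".
extract_val <- function(x, key) {
  pattern <- paste0(key, "\\s*=\\s*\\K[^|]+")
  m <- regexpr(pattern, x, perl = TRUE)   # \K requires the PCRE engine
  out <- rep(NA_character_, length(x))    # NA where the key is absent
  out[m > 0] <- trimws(regmatches(x, m))  # strip the space before the next "|"
  out
}

str1_val <- extract_val(inputs, "str1")
str2_val <- extract_val(inputs, "str2")
print(str1_val)
print(str2_val)
```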
I have a dataframe with one column that represents the requests made by my users. A few examples look like this:
GET /enviro/html/tris/tris_overview.html
GET /./enviro/gif/emcilogo.gif
GET /docs/exposure/meta_exp.txt.html
GET /hrmd/
GET /icons/circle_logo_small.gif
I only want to select the last part of the string after the last "." in such a way that I return the pagetype of the string. The output of these lines should therefore be:
.html
.gif
.html
.gif
I tried doing this with sub, but I only manage to select everything after the first ".". For example:
tring <- c("GET /enviro/html/tris/tris_overview.html", "GET /./enviro/gif/emcilogo.gif", "GET /docs/exposure/meta_exp.txt.html", "GET /hrmd/", "GET /icons/circle_logo_small.gif")
sub("^[^.]*", "", sapply(strsplit(tring, "\\s+"), `[`, 2))
this returns:
".html"
"./enviro/gif/emcilogo.gif"
".txt.html"
""
".gif"
I created the following gsub code that works for strings containing two dots:
gsub(pattern = ".*\\.", replacement = "", "GET /./enviro/gif/finds.gif", "\\s+")
this returns:
"gif"
However, I can't seem to come up with one gsub/sub call that works for all possible input. It should read the string from right to left, stop when it sees the first ".", and return everything that was found after that ".".
I am new to R and I can't come up with something that is doing this. Any help would be highly appreciated!
You can't change the string parsing direction with R regex. Instead, you may match all up to . and remove it, or match the . that has no . chars to the right of it till the end of string.
string <- c('GET /enviro/html/tris/tris_overview.html','GET /./enviro/gif/emcilogo.gif','GET /docs/exposure/meta_exp.txt.html','GET /hrmd/','GET /icons/circle_logo_small.gif')
res <- regmatches(string, regexec("\\.[^.]*$", string))
res[lengths(res)==0] <- ""
unlist(res)
Or
sub("^(.*(?=\\.)|.*)", "", string, perl=TRUE)
Both return
[1] ".html" ".gif" ".html" "" ".gif"
Here, \.[^.]*$ matches a . and then any 0+ chars other than . till the end of string. The sub code used ^(.*(?=\\.)|.*) pattern that matches the start of string, then either any 0+ chars as many as possible till . without consuming the dot, or just matches any 0+ chars as many as possible, and replaces the match with an empty string.
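Here is a solution without regex patterns that splits on a literal dot (note that it returns the extension without the leading "."):
sapply(
  seq_along(string),
  function(i) {
    # fixed = TRUE treats "." as a literal character rather than a regex
    if (grepl(".", string[i], fixed = TRUE)) tail(strsplit(string[i], ".", fixed = TRUE)[[1]], 1) else ""
  }
)
# [1] "html" "gif" "html" "" "gif"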
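Another option, offered here as a sketch and not as part of the original answers, is tools::file_ext() from the tools package that ships with R. It returns the extension without the dot and only recognises alphanumeric extensions, so the dot is added back for non-empty results.

```r
string <- c("GET /enviro/html/tris/tris_overview.html",
            "GET /./enviro/gif/emcilogo.gif",
            "GET /docs/exposure/meta_exp.txt.html",
            "GET /hrmd/",
            "GET /icons/circle_logo_small.gif")

# file_ext() gives "" when the string does not end in ".<alphanumerics>".
ext <- tools::file_ext(string)

# Prepend the dot only where an extension was found.
res <- ifelse(nzchar(ext), paste0(".", ext), "")
print(res)
```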
I have strings such as:
'THE HOUSE'
'IN THE HOUSE'
'THE THE HOUSE'
And I would like to remove 'THE' only if it occurs at the first position in the string.
I know how to remove 'THE' with:
gsub("\\<THE\\>", "", string)
And I know how to grab the first word with:
"([A-Za-z]+)" or "([[:alpha:]]+)"or "(\\w+)"
But no idea how to combine the two to end up having:
'HOUSE'
'IN THE HOUSE'
'THE HOUSE'
Cheers!
You may use
string <- c("THE HOUSE", "IN THE HOUSE", "THE THE HOUSE")
sub("^THE\\b\\s*", "", string)
## => [1] "HOUSE" "IN THE HOUSE" "THE HOUSE"
Details
^ - start of string
THE - a literal substring
\\b - a word boundary (you may keep \\> trailing word boundary if you wish)
\\s* - 0+ whitespace chars.
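A small variation, sketched under the assumption that lowercase input and leading spaces should also be handled (neither is required by the question), adds \\s* after ^ and switches on ignore.case. The extra test strings are illustrative.

```r
string <- c("THE HOUSE", "IN THE HOUSE", "THE THE HOUSE",
            "the house", "  THE HOUSE", "THEATRE")

# sub() replaces only the first match, and "^" pins that match to the start.
# "\\b" keeps words that merely begin with "THE" (e.g. "THEATRE") intact.
res <- sub("^\\s*THE\\b\\s*", "", string, ignore.case = TRUE, perl = TRUE)
print(res)
```

Using sub() instead of gsub() does not matter much here, since the ^ anchor already limits the pattern to a single possible match per string.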
I need a Regex that matches these Strings:
Test 1
Test 123
Test 1.1 (not required but would be neat)
Test
Test a
But not the following:
Test 1a
I don't know what the pattern should look like so that it allows text or whitespace at the end, but not when a number comes directly before that text.
I tried this one
^.*([0-9])$ (matches only Test 1, but not for example Test or Test a)
and this one
^.*[0-9].$ (matches only Test 1a, but not for example Test or Test 1)
but they don't match what I need.
This works for all the cases you provided:
^\w+(\s(\d+(\.\d+)?|[a-z]))?$
Regex Breakdown
^                 # Start of string
\w+               # One or more word characters (letters, digits, underscore)
(\s               # A whitespace character
  (
    \d+           # One or more digits
    (\.\d+)?      # Optional decimal part
    |             # Alternation (OR)
    [a-z]         # A single lowercase letter
  )
)?                # The whole group is optional
$                 # End of string
If you also want to include capital letters, then you can use
^\w+(\s(\d+(\.\d+)?|[A-Za-z]))?$
Try with
^\w+\s+((\d+\.\d+)|(\d+)|([^\d^\s]\w+))?\s*$
Another pattern for you to try:
^(Test(?:$|\s(?:\d$|[a-z]$|\d{3}|\d\.\d$)))
As per your strings in your question (and your comments):
^\w+(\s[a-z]|\s\d+(\.\d+)?)?$
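Since several patterns were suggested, a quick check against the sample strings from the question may help. This is a sketch in R with perl = TRUE (PCRE); the labels first to fourth are only illustrative names for the four patterns in the order they appear above.

```r
tests <- c("Test 1", "Test 123", "Test 1.1", "Test", "Test a", "Test 1a")

patterns <- c(
  first  = "^\\w+(\\s(\\d+(\\.\\d+)?|[a-z]))?$",
  second = "^\\w+\\s+((\\d+\\.\\d+)|(\\d+)|([^\\d^\\s]\\w+))?\\s*$",
  third  = "^(Test(?:$|\\s(?:\\d$|[a-z]$|\\d{3}|\\d\\.\\d$)))",
  fourth = "^\\w+(\\s[a-z]|\\s\\d+(\\.\\d+)?)?$"
)

# One column per pattern, one row per test string; TRUE means the string matches.
results <- sapply(patterns, function(p) grepl(p, tests, perl = TRUE))
rownames(results) <- tests
print(results)
```

In this check, the first, third, and fourth patterns accept every string except "Test 1a". The second pattern requires whitespace after the first word and at least two characters for a trailing word, so it rejects "Test" and "Test a" as well.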