Regular expression to extract specific part of a URL

I have a vector of URLs and need to extract a certain part of it. I've tried using a regex tester to see if my attempts worked, but they were no good.
The URLs I have are in this format: https://www.baseball-reference.com/teams/MIL/1976.shtml
I need to extract the three letters after "teams/" (so for the example above, I need "MIL").
Does anyone have any idea how to get the correct regular expression to get this working? Thanks.

1) basename/dirname Try this:
u <- "https://www.baseball-reference.com/teams/MIL/1976.shtml" # input data
basename(dirname(u))
## [1] "MIL"
2) sub Or use sub with a regular expression:
sub(".*teams/(.*?)/.*", "\\1", u)
## [1] "MIL"
3) strsplit Split the string on / and take the second last component.
s <- strsplit(u, "/")[[1]]
s[length(s) - 1]
## [1] "MIL"
4) gsub Since the required substring is the only run of upper-case letters in the input, gsub can simply remove every character that is not an upper-case letter:
gsub("[^A-Z]", "", u)
## [1] "MIL"
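Since the question mentions a vector of URLs, here is a minimal sketch of applying the capture-group idea from (2) to several URLs at once. Only the first URL comes from the question; the other two are made up for illustration, and extract_team is a hypothetical helper name. Because sub returns its input unchanged when the pattern does not match, the sketch checks for a match first and returns NA otherwise.

```r
# Hypothetical input: only the first URL comes from the question
urls <- c("https://www.baseball-reference.com/teams/MIL/1976.shtml",
          "https://www.baseball-reference.com/teams/NYY/1998.shtml",
          "https://www.baseball-reference.com/leagues/")

extract_team <- function(u) {
  pattern <- ".*/teams/([^/]+)/.*"
  # sub() leaves non-matching strings untouched, so guard with grepl()
  ifelse(grepl(pattern, u), sub(pattern, "\\1", u), NA_character_)
}

extract_team(urls)
```

The NA for the third URL makes a failed extraction visible instead of silently passing the full URL through.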

Many different ways to achieve this using regexp's. Here's one:
url <- "https://www.baseball-reference.com/teams/MIL/1976.shtml"
gsub(".+teams/(\\w{3}).+$", "\\1", url);
#[1] "MIL"

Or
x <- c('https://www.baseball-reference.com/teams/MIL/1976.shtml')
pattern <- "/teams/([^/]+)"
m <- regexec(pattern, x)
res = regmatches(x, m)[[1]]
res[2]
which yields
[1] "MIL"

Consider using the stringr package to simplify your code when handling strings.
Use a regular expression with a positive lookbehind to capture the upper-case letters that follow the string "teams/":
stringr::str_extract(url, "(?<=teams\\/)[A-Z]*")
In your case, if the URLs literally all begin with the same string https://www.baseball-reference.com/teams/ then you can avoid regex entirely and use a simple substring to get the three-letter code which follows:
stringr::str_sub(url, 42, 44)
Here are the results:
> url <- "https://www.baseball-reference.com/teams/MIL/1976.shtml"
>
> stringr::str_extract(url, "(?<=teams\\/)[A-Z]*")
[1] "MIL"
>
> stringr::str_sub(url, 42, 44)
[1] "MIL"
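If adding a package is not an option, the same lookbehind idea can be sketched in base R with regexpr and regmatches. This is only an illustration, not part of the stringr answer above; it assumes the team code consists of upper-case letters.

```r
url <- "https://www.baseball-reference.com/teams/MIL/1976.shtml"

# perl = TRUE is required for the lookbehind (?<=...)
m <- regexpr("(?<=teams/)[A-Z]+", url, perl = TRUE)
regmatches(url, m)
```

Note that regmatches drops elements that have no match, so on a vector input the result can be shorter than the input.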

Related

Extract last digit [duplicate]

How can I get the last n characters from a string in R?
Is there a function like SQL's RIGHT?

I'm not aware of anything in base R, but it's straight-forward to make a function to do this using substr and nchar:
x <- "some text in a string"
substrRight <- function(x, n){
substr(x, nchar(x)-n+1, nchar(x))
}
substrRight(x, 6)
[1] "string"
substrRight(x, 8)
[1] "a string"
This is vectorised, as @mdsumner points out. Consider:
x <- c("some text in a string", "I really need to learn how to count")
substrRight(x, 6)
[1] "string" " count"
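A small sketch of the edge cases, assuming the same substrRight definition as above: when n is larger than the string, the start index falls below 1 and substr simply returns the whole string, and an empty string stays empty.

```r
substrRight <- function(x, n) {
  substr(x, nchar(x) - n + 1, nchar(x))
}

substrRight("abc", 5)   # start index is below 1, so the whole string comes back
substrRight("", 2)      # nothing to extract from an empty string
```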

If you don't mind using the stringr package, str_sub is handy because you can use negatives to count backward:
x <- "some text in a string"
str_sub(x,-6,-1)
[1] "string"
Or, as Max points out in a comment to this answer,
str_sub(x, start= -6)
[1] "string"

Use stri_sub function from stringi package.
To get substring from the end, use negative numbers.
Look below for the examples:
stri_sub("abcde",1,3)
[1] "abc"
stri_sub("abcde",1,1)
[1] "a"
stri_sub("abcde",-3,-1)
[1] "cde"
You can install this package from github: https://github.com/Rexamine/stringi
It is available on CRAN now, simply type
install.packages("stringi")
to install this package.

str = 'This is an example'
n = 7
result = substr(str,(nchar(str)+1)-n,nchar(str))
print(result)
[1] "example"

Another reasonably straightforward way is to use regular expressions and sub:
sub('.*(?=.$)', '', string, perl=T)
So, "get rid of everything followed by one character". To grab more characters off the end, add however many dots in the lookahead assertion:
sub('.*(?=.{2}$)', '', string, perl=T)
where .{2} means .., or "any two characters", so meaning "get rid of everything followed by two characters".
sub('.*(?=.{3}$)', '', string, perl=T)
for three characters, etc. You can set the number of characters to grab with a variable, but you'll have to paste the variable value into the regular expression string:
n = 3
sub(paste('.+(?=.{', n, '})', sep=''), '', string, perl=T)
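As a sketch of the same idea wrapped in a function, the pattern can also be built with sprintf, which may be easier to read than paste. The name last_n is hypothetical, and the $ anchor from the earlier examples is kept so the lookahead is tied to the end of the string.

```r
last_n <- function(string, n) {
  # sprintf() fills in the count; $ anchors the lookahead at the end of the string
  pattern <- sprintf(".*(?=.{%d}$)", n)
  sub(pattern, "", string, perl = TRUE)
}

last_n("some text in a string", 6)
last_n(c("abcde", "xy"), 3)
```

When a string has fewer than n characters the pattern does not match, so sub returns that string unchanged.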

UPDATE: as noted by mdsumner, the original code is already vectorised because substr is. Should have been more careful.
And if you want a vectorised version (based on Andrie's code)
substrRight <- function(x, n){
sapply(x, function(xx)
substr(xx, (nchar(xx)-n+1), nchar(xx))
)
}
> substrRight(c("12345","ABCDE"),2)
12345 ABCDE
"45" "DE"
Note that I have changed (nchar(x)-n) to (nchar(x)-n+1) to get n characters.

A simple base R solution using the substring() function (who knew this function even existed?):
RIGHT = function(x,n){
substring(x,nchar(x)-n+1)
}
This takes advantage of basically being substr() underneath but has a default end value of 1,000,000.
Examples:
> RIGHT('Hello World!',2)
[1] "d!"
> RIGHT('Hello World!',8)
[1] "o World!"

Try this (note that the start is nchar(x)-n, so n <- 5 returns the last n + 1 = 6 characters):
x <- "some text in a string"
n <- 5
substr(x, nchar(x)-n, nchar(x))
It should give:
[1] "string"

An alternative to substr is to split the string into a list of single characters and process that:
N <- 2
sapply(strsplit(x, ""), function(x, n) paste(tail(x, n), collapse = ""), N)

I take a different approach, using strsplit and tail instead of substr. I want to extract the last 6 characters of "Give me your food." Here are the steps:
(1) Split the characters
splits <- strsplit("Give me your food.", split = "")
(2) Extract the last 6 characters
tail(splits[[1]], n=6)
Output:
[1] " " "f" "o" "o" "d" "."
Each character can be accessed by index once the result is saved, e.g. last6 <- tail(splits[[1]], n=6) and then last6[x], where x is 1 to 6.

Someone above uses a similar solution to mine, but I find it easier to think of it as below:
> text<-"some text in a string" # we want to have only the last word "string" with 6 letters
> n<-5 #as the last character will be counted with nchar(), here we discount 1
> substr(x=text,start=nchar(text)-n,stop=nchar(text))
This returns the last n + 1 characters, here "string".

For those coming from Microsoft Excel or Google Sheets, you would have seen functions like LEFT(), RIGHT(), and MID(). I have created a package known as forstringr and its development version is currently on Github.
if(!require("devtools")){
install.packages("devtools")
}
devtools::install_github("gbganalyst/forstringr")
library(forstringr)
str_left(): counts from the left and extracts n characters
str_right(): counts from the right and extracts n characters
str_mid(): extracts characters from the middle
Examples:
x <- "some text in a string"
str_left(x, 4)
[1] "some"
str_right(x, 6)
[1] "string"
str_mid(x, 6, 4)
[1] "text"

I used the following code to get the last character of a string.
substr(stringOfInterest, nchar(stringOfInterest), nchar(stringOfInterest))
You can play with the start value, e.g. nchar(stringOfInterest)-2, to get the last few characters.

A small modification of @Andrie's solution also gives the complement:
substrR <- function(x, n) {
if(n > 0) substr(x, (nchar(x)-n+1), nchar(x)) else substr(x, 1, (nchar(x)+n))
}
x <- "moSvmC20F.5.rda"
substrR(x,-4)
[1] "moSvmC20F.5"
That was what I was looking for. The same idea works from the left side:
substrL <- function(x, n){
if(n > 0) substr(x, 1, n) else substr(x, -n+1, nchar(x))
}
substrL(substrR(x,-4),-2)
[1] "SvmC20F.5"

In case a range of characters needs to be picked:
# For example, to get the date part from the string
substrRightRange <- function(x, m, n){substr(x, nchar(x)-m+1, nchar(x)-m+n)}
value <- "REGNDATE:20170526RN"
substrRightRange(value, 10, 8)
[1] "20170526"

Use Regular expressions extract specific characters

text <- c('d__Viruses|f__Closteroviridae|g__Closterovirus|s__Citrus_tristeza_virus',
'd__Viruses|o__Tymovirales|f__Alphaflexiviridae|g__Mandarivirus|s__Citrus_yellow_vein_clearing_virus',
'd__Viruses|o__Ortervirales|f__Retroviridae|s__Columba_palumbus_retrovirus')
I have tried but failed:
str_extract(text, pattern = 'f.*\\|')
How can I get
f__Closteroviridae
f__Alphaflexiviridae
f__Retroviridae
Any help will be highly appreciated!

Make the regex non-greedy and since you don't want "|" in final output use positive lookahead.
stringr::str_extract(text, 'f.*?(?=\\|)')
#[1] "f__Closteroviridae" "f__Alphaflexiviridae" "f__Retroviridae"
In base R, we can use sub :
sub('.*(f_.*?)\\|.*', '\\1', text)
#[1] "f__Closteroviridae" "f__Alphaflexiviridae" "f__Retroviridae"

For a base R solution, I would use regmatches along with gregexpr:
m <- gregexpr("\\bf__[^|]+", text)
as.character(regmatches(text, m))
[1] "f__Closteroviridae" "f__Alphaflexiviridae" "f__Retroviridae"
The advantage of using gregexpr as above is that should an input contain more than one f__ matching term, we could also capture it. For example:
x <- 'd__Viruses|f__Closteroviridae|g__Closterovirus|f__some_virus'
m <- gregexpr("\\bf__[^|]+", x)
regmatches(x, m)[[1]]
[1] "f__Closteroviridae" "f__some_virus"
Data:
text <- c('d__Viruses|f__Closteroviridae|g__Closterovirus|s__Citrus_tristeza_virus',
'd__Viruses|o__Tymovirales|f__Alphaflexiviridae|g__Mandarivirus|s__Citrus_yellow_vein_clearing_virus',
'd__Viruses|o__Ortervirales|f__Retroviridae|s__Columba_palumbus_retrovirus')

removes part of string in r

I'm trying to remove the ES at the end of a string
> data <- c("phrases", "phases", "princesses","class","pass")
> data1 <- gsub("(\\w+)(s)+?es\\b", "\\1\\2", data, perl=TRUE)
> gsub("(\\w+)s\\b", "\\1", data1, perl=TRUE)
[1] "phra" "pha" "princes" "clas" "pas"
I get this result
[1] "phra" "pha" "princes" "clas" "pas"
but in reality what I need to obtain is:
[1] "phras" "phas" "princess" "clas" "pas"

You can use a word boundary (\\b) if it is guaranteed that each word is followed by punctuation or is at the end of the string:
data <- c("phrases, phases, princesses, bases")
gsub('es\\b', '', data)
# [1] "phras, phas, princess, bas"
With your method, just wrap everything till the second + with one set of parentheses:
gsub("(\\w+s+)es\\b", "\\1", data)
# [1] "phras, phas, princess, bas"
There is also no need to make + lazy with ?, since you are trying to match as many consecutive s's as possible.
Edit:
OP changed the data and the desired output. Below is a simple solution that removes either es or s at the end of each string:
data <- c("phrases", "phases", "princesses","class","pass")
gsub('(es|s)\\b', '', data)
# [1] "phras" "phas" "princess" "clas" "pas"
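Because each element of the edited data holds a single word, a variant anchored with $ instead of \\b is also possible. This is only a sketch of an alternative, not the answer's own code: it uses sub, so at most one ending is removed per element.

```r
data <- c("phrases", "phases", "princesses", "class", "pass")

# $ matches only at the end of each element; sub() replaces at most one match
sub("(es|s)$", "", data)
```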

Maybe you are looking for a lookbehind assertion (a zero-length match); it needs perl=TRUE:
"(?<=s)es\\b"
Or, because a lookbehind can't have variable length, use the Perl \K construct, which keeps everything to its left out of the match:
"\\ws\\Kes\\b"

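A minimal sketch of how the two patterns from the lookbehind answer could be run, assuming the five-word vector from the question. Both need perl = TRUE. Note that they only strip "es" after an "s", so "class" and "pass" are left untouched, which differs from the output the question finally asked for.

```r
data <- c("phrases", "phases", "princesses", "class", "pass")

# Lookbehind: remove "es" only when it is preceded by "s"
gsub("(?<=s)es\\b", "", data, perl = TRUE)

# \K: match a word character and "s", then drop them from the reported match
gsub("\\ws\\Kes\\b", "", data, perl = TRUE)
```
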
Use substr until condition is met

I have a vector from which I just need the first word. The words have different lengths. Words are separated by a symbol (. and _) How can I use the substr() function to get a new vector with just the first word?
I was thinking of something like this
x <- c("wooombel.ab","mugran.cd","friendly_ef.ab","hungry_kd.xy")
y <- substr(x,0, ???)

I think sub with some regular expressions would be the easiest solution:
sub(pattern = "[._].*", replacement = "", x = x)
# [1] "wooombel" "mugran" "friendly" "hungry"

Try:
sapply(strsplit(x,'[._]'), function(x) x[1])
[1] "wooombel" "mugran" "friendly" "hungry"

You could also use package stringr. It has some really handy functions for string manipulation.
One that comes to mind for this problem is word. It has a sep argument that allows the use of a regular expression.
> x <- c("wooombel.ab","mugran.cd","friendly_ef.ab","hungry_kd.xy")
> library(stringr)
> word(x, sep = "[._]")
# [1] "wooombel" "mugran" "friendly" "hungry"
Another option that allows you to continue to use substr is str_locate. It returns a matrix of start and end positions, so subtracting 1 from the start column gives the end of the first word.
> substr(x, 1, str_locate(x, "[._]")[, "start"] - 1)
# [1] "wooombel" "mugran" "friendly" "hungry"
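The same substr idea can be sketched in base R with regexpr, which returns the position of the first separator (or -1 when there is none). The helper name first_word is hypothetical; the fallback to the whole string when no separator is found is an assumption about the desired behaviour.

```r
x <- c("wooombel.ab", "mugran.cd", "friendly_ef.ab", "hungry_kd.xy")

first_word <- function(x) {
  pos <- as.integer(regexpr("[._]", x))   # position of first separator, -1 if none
  # Keep the whole string when there is no separator
  ifelse(pos > 0, substr(x, 1, pos - 1), x)
}

first_word(x)
first_word("nosep")
```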

An extraction approach with stringi:
library(stringi)
stri_extract_first_regex(x, "[a-z]+(?=[._])")
## [1] "wooombel" "mugran" "friendly" "hungry"
Though "^[a-z]+(?=[._])" may be more explicit.
Regex explanation:
^        the beginning of the string
[a-z]+   any character of: 'a' to 'z' (1 or more times)
(?=      look ahead to see if there is:
[._]     any character of: '.', '_'
)        end of look-ahead

Extract first X Numbers from Text Field using Regex

I have strings that looks like this.
x <- c("P2134.asfsafasfs","P0983.safdasfhdskjaf","8723.safhakjlfds")
I need to end up with:
"2134", "0983", and "8723"
Essentially, I need to extract the first four characters that are numbers from each element. Some begin with a letter (disallowing me from using a simple substring() function).
I guess technically, I could do something like:
x <- gsub("^P","",x)
x <- substr(x,1,4)
But I want to know how I would do this with regex!

You could use str_match from the stringr package:
library(stringr)
print(c(str_match(x, "\\d\\d\\d\\d")))
# [1] "2134" "0983" "8723"

You can do this with sub too.
> sub('.?([0-9]{4}).*', '\\1', x)
[1] "2134" "0983" "8723"
>
I used sub instead of gsub to ensure I only got the first match. .? matches any single character and makes it optional (a plain . would not match the case without the leading P). The () define a group that I reference in the replacement with '\\1'. If there were multiple sets of () I could reference them too, e.g. with '\\2'. Inside the group (and you had this syntax correct) I want only digits and exactly 4 of them. The final .* matches zero or more trailing characters of any type.
Your syntax was working, but you were replacing something with itself so you wind up with the same output.
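As an alternative sketch that extracts the match directly instead of replacing everything around it, regexpr and regmatches from base R can be combined. This is an illustration only and assumes the four digits are consecutive, as in the sample data.

```r
x <- c("P2134.asfsafasfs", "P0983.safdasfhdskjaf", "8723.safhakjlfds")

# regexpr() locates the first four consecutive digits in each element
m <- regexpr("[0-9]{4}", x)
regmatches(x, m)
```

Unlike the sub approach, this does not depend on how many characters precede the digits, but regmatches drops elements that contain no match.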

This will get you the first four digits of a string, regardless of where in the string they appear.
mapply(function(x, m) paste0(x[m], collapse=""),
strsplit(x, ""),
lapply(gregexpr("\\d", x), "[", 1:4))
Breaking the above call down into pieces:
# positions of every digit in each element of x
matches <- gregexpr("\\d", x)
# keep only the positions of the first four digits
matches <- lapply(matches, "[", 1:4)
# individual characters of x
splits <- strsplit(x, "")
# get the appropriate string
mapply(function(x, m) paste0(x[m], collapse=""), splits, matches)

Another group capturing approach that doesn't assume 4 numbers.
x <- c("P2134.asfsafasfs","P0983.safdasfhdskjaf","8723.safhakjlfds")
gsub("(^[^0-9]*)(\\d+)([^0-9].*)", "\\2", x)
## [1] "2134" "0983" "8723"
