Find pattern in URL with stringr and regex - r

I have a dataframe df with some urls. There are subcategories within the slashes in the URLs I want to extract with stringr and str_extract
My data looks like
Text URL
Hello www.facebook.com/group1/bla/exy/1234
Test www.facebook.com/group2/fssas/eda/1234
Text www.facebook.com/group-sdja/sdsds/adeds/23234
Texter www.facebook.com/blablabla/sdksds/sdsad
I now want to extract everything after .com/ and the next /
I tried suburlpattern <- "^.com//{1,20}//$"
and df$categories <- str_extract(df$URL, suburlpattern)
But I only end up with NA in df$categories
Any idea what I am doing wrong here? Is it my regex code?
Any help is highly appreciated! Many thanks beforehand.

If you want to use str_extract, you need a regex that will get the value you need into the whole match, and you will need a (?<=[.]com/) lookbehind:
(?<=[.]com/)[^/]+
See the regex demo.
Details:
(?<=[.]com/) - the current location must be preceded with .com/ substring
[^/]+ - matches 1 or more characters other than /.
R demo:
> URL = c("www.facebook.com/group1/bla/exy/1234", "www.facebook.com/group2/fssas/eda/1234","www.facebook.com/group-sdja/sdsds/adeds/23234", "www.facebook.com/blablabla/sdksds/sdsad")
> df <- data.frame(URL)
> library(stringr)
> res <- str_extract(df$URL, "(?<=[.]com/)[^/]+")
> res
[1] "group1" "group2" "group-sdja" "blablabla"

this will return everything between the first set of forward slashes
library(stringr)
str_match("www.facebook.com/blablabla/sdksds/sdsad", "^[^/]+/(.+?)/")[2]
[1] "blablabla"

This works
library(stringr)
data <- c("www.facebook.com/group1/bla/exy/1234",
"www.facebook.com/group2/fssas/eda/1234",
"www.facebook.com/group-sdja/sdsds/adeds/23234",
"www.facebook.com/blablabla/sdksds/sdsad")
suburlpattern <- "/(.*?)/"
categories <- str_extract(data, suburlpattern)
str_sub(categories, start = 2, end = -2)
Results:
[1] "group1" "group2" "group-sdja" "blablabla"
Will only get you what's between the first and second slashes... but that seems to be what you want.

Related

Add a character to a specific part of a string?

I have a list of file names as such:
"A/B/file.jpeg"
"A/C/file2.jpeg"
"B/C/file3.jpeg"
and a couple of variations of such.
My question is how would I be able to add a "new" or any characters into each of these file names after the second "/" such that the length of the string/name doesn't matter just that it is placed after the second "/"
Results would ideally be:
"A/B/newfile.jpeg"
"A/B/newfile2.jpeg" etc.
Thanks!
Another possible solution, based on stringr::str_replace:
library(stringr)
l <- c("A/B/file.jpeg", "A/B/file2.jpeg", "A/B/file3.jpeg")
str_replace(l, "\\/(?=file)", "\\/new")
#> [1] "A/B/newfile.jpeg" "A/B/newfile2.jpeg" "A/B/newfile3.jpeg"
Using gsub.
gsub('(file)', 'new\\1', x)
# [1] "A/B/newfile.jpeg" "A/C/newfile2.jpeg" "B/C/newfile3.jpeg"
Data:
x <- c("A/B/file.jpeg", "A/C/file2.jpeg", "B/C/file3.jpeg")

Extracting a certain substring (email address)

I'm attempting to pull some a certain from a variable that looks like this:
v1 <- c("Persons Name <personsemail#email.com>","person 2 <person2#email.com>")
(this variable has hundreds of observations)
I want to eventually make a second variable that pulls their email to give this output:
v2 <- c("personsemail#email.com", "person2#email.com")
How would I do this? Is there a certain package I can use? Or do I need to make a function incorporating grep and substr?
Those look like what R might call a "person". There is an as.person() function that can split out the email address. For example
v1 <- c("Persons Name <personsemail#email.com>","person 2 <person2#email.com>")
unlist(as.person(v1)$email)
# [1] "personsemail#email.com" "person2#email.com"
For more information, see the ?person help page.
One option with str_extract from stringr
library(stringr)
str_extract(v1, "(?<=\\<)[^>]+")
#[1] "personsemail#email.com" "person2#email.com"
You can look for the pattern "anything**, then <, then (anything), then >, then anything" and replace that pattern with the part between the parentheses, indicated by \1 (and an extra \ to escape).
sub('.*<(.*)>.*', '\\1', v1)
# [1] "personsemail#email.com" "person2#email.com"
** "anything" actually means anything but line breaks
You can look for a pattern that looks like email using regexpr. If a match is found, extract the relevant part using substring. The starting position and match length is provided by the regexpr
inds = regexpr(pattern = "<(.*#.*\\..*)>", v1)
ifelse(inds > 1,
substring(v1, inds + 1, inds + attr(inds, "match.length") - 2),
NA)
#[1] "personsemail#email.com" "person2#email.com"

Regular expression to extract specific part of a URL

I have a vector of URLs and need to extract a certain part of it. I've tried using a regex tester to see if my attempts worked, but they were no good.
The URLs I have are in this format: https://www.baseball-reference.com/teams/MIL/1976.shtml
I ned to extract the three letters after "teams/" (so for the example above, I need "MIL")
Does anyone have any idea how to get the correct regular expression to get this working? Thanks.
1) basename/dirname Try this:
u <- "https://www.baseball-reference.com/teams/MIL/1976.shtml" # input data
basename(dirname(u))
## [1] "MIL"
2) sub or with a regular expression:
sub(".*teams/(.*?)/.*", "\\1", u)
## [1] "MIL"
3) strsplit Split the string on / and take the second last component.
s <- strsplit(u, "/")[[1]]
s[length(s) - 1]
## [1] "MIL"
4) gsub Since the required substring is all upper case and no other characters in the input are this gsub which removes all characters that are not upper case letters would work:
gsub("[^A-Z]", "", u)
## [1] "MIL"
Many different ways to achieve this using regexp's. Here's one:
url <- "https://www.baseball-reference.com/teams/MIL/1976.shtml"
gsub(".+teams/(\\w{3}).+$", "\\1", url);
#[1] "MIL"
Or
x <- c('https://www.baseball-reference.com/teams/MIL/1976.shtml')
pattern <- "/teams/([^/]+)"
m <- regexec(pattern, x)
res = regmatches(x, m)[[1]]
res[2]
which yields
[1] "MIL"
Consider using the stringr package to simplify your code when handling strings.
Use a regular expression with positive lookbehind to catch alphanumeric codes following the string "teams\":
stringr::str_extract(url, "(?<=teams\\/)[A-Z]*")
In your case, if the URLs literally all begin with the same string https://www.baseball-reference.com/teams/ then you can avoid regex entirely and use a simple substring to get the three-letter code which follows:
stringr::str_sub(url, 42, 44)
Here are the results:
> url <- "https://www.baseball-reference.com/teams/MIL/1976.shtml"
>
> stringr::str_extract(url, "(?<=teams\\/)[A-Z]*")
[1] "MIL"
>
> stringr::str_sub(url, 42, 44)
[1] "MIL"

Find first matching substring in a long string in R

I'm trying to find the first matching string from a vector in a long string. I have for example a example_string <- 'LionabcdBear1231DogextKittyisananimalTurtleisslow' and a matching_vector<- c('Turtle',Dog') Now I want that it returns 'Dog' as this is the first substring in the matching_vector that we see in the example string: LionabcdBear1231DogextKittyisananimalTurtleisslow
I already tried pmatch(example_string,matching_vector) but it doesn't work. Obviously as it doesn't work with substrings...
Thanks!
Tim
Is the following solution working for you?
example_string <- 'LionabcdBear1231DogextKittyisananimalTurtleisslow'
matching_vector<- c('Turtle','Dog')
match_ids <- sapply(matching_vector, function(x) regexpr(x ,example_string))
result <- names(match_ids)[which.min(match_ids)]
> result
[1] "Dog"
We can use stri_match_first from stringi
library(stringi)
stri_match_first(example_string, regex = paste(matching_vector, collapse="|"))

stringr regex for first full *.zip filename

I have the following code:
test_zip_col <- "daily_44201_2015.zip259,151 Rows2,958 KBAs of 2015-11-27"
test_zip_col2 <- str_extract(test_zip_col, '^*\\.zip$')
test_zip_col
test_zip_col2
I want to extract the first occurence of the *.zip filename. In this example, I wish to extract:
"daily_44201_2015.zip"
Could anyone please explain how to amend my str_extract code so that it does not produce an NA value?
library(stringr)
test_zip_col <- "daily_44201_2015.zip259,151 Rows2,958 KBAs of 2015-11-27"
loc<-str_locate(test_zip_col,".zip") ## Locate the ".zip"
str_sub(test_zip_col,start=1, end=loc[,2]) # Substring
[1] "daily_44201_2015.zip"
We could use sub
sub('(.*\\.zip).*', '\\1', test_zip_col)
#[1] "daily_44201_2015.zip"

Resources